This repository contains code for academic researchers to link SafeGraph Brand data to commonly used public company unique identifiers. This notebook is designed to be a starting point, and is subject to several limitations (Limitations).
- SafeGraph Brand data — Available via Dewey Data
- Compustat Fundamentals Annual — Available via WRDS
- This notebook requires WRDS API access, but can be easily modified to read in local files
- Using Python on WRDS Platform
- SEC EDGAR Index Files — Scraped from SEC Website
- This notebook uses typical Python libraries, including pandas, numpy, and fuzzywuzzy.
- Required code modifications from the user:
- path to the user's local directory
- the user's name and email address (for SEC EDGAR scraping)
- Optional code modifications from the user:
- `min_score`: the minimum fuzzy match score to accept. The default score in the function `find_best_match` is 85.
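To illustrate how a `min_score` cutoff behaves, here is a minimal sketch of a best-match helper. It is not the notebook's actual `find_best_match`: the signature and return value are assumptions, and the scorer uses the standard library's `difflib.SequenceMatcher` as a stand-in for a fuzzywuzzy scorer so the example runs without extra dependencies. The company names are made up for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> int:
    """Return a 0-100 similarity score (stand-in for a fuzzywuzzy scorer)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def find_best_match(name, candidates, min_score=85):
    """Hypothetical signature: return (best_candidate, score).

    best_candidate is None when the highest score is below min_score.
    """
    best_name, best_score = None, -1
    for candidate in candidates:
        score = similarity(name, candidate)
        if score > best_score:
            best_name, best_score = candidate, score
    if best_score < min_score:
        return None, best_score
    return best_name, best_score


# Illustrative SEC-style company names (not real notebook output).
sec_names = ["STARBUCKS CORP", "TARGET CORP", "WALMART INC."]

print(find_best_match("Starbucks Corp", sec_names))  # ('STARBUCKS CORP', 100)
print(find_best_match("Joe's Diner", sec_names))     # no candidate clears 85
```

Raising `min_score` rejects more borderline names; lowering it accepts more candidates at the cost of more false matches.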
- Separate the SafeGraph Brand data into two datasets:
  - `sgTickerData`: Contains all SafeGraph Brand data that has a `STOCK_SYMBOL` (ticker)
  - `sgNoTickerData`: Contains all SafeGraph Brand data that does not have a `STOCK_SYMBOL` (ticker)
- Match SafeGraph Brand data to Compustat data using the `STOCK_SYMBOL` (ticker)
- If there is no ticker match, or if there is no CIK (SEC unique identifier), we use SEC EDGAR data to fuzzy match the SafeGraph Brand to the SEC Company Name.
- For SafeGraph Brands that do not have a `STOCK_SYMBOL` (ticker), we use SEC EDGAR data to fuzzy match the SafeGraph Brand to the SEC Company Name.
- Check if there are any SafeGraph Brands without a match that have a `PARENT_SAFEGRAPH_BRAND_ID` with a public company match.
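The steps above can be sketched in pandas roughly as follows. This is an illustration under stated assumptions, not the notebook's code: the toy rows are invented, missing tickers are assumed to be stored as null values, and the fuzzy matching step is skipped so that only the split, the ticker merge, and the parent lookup are shown. The `parent_cik` column name is hypothetical.

```python
import pandas as pd

# Toy stand-ins for the real inputs; every value here is made up.
sg = pd.DataFrame({
    "SAFEGRAPH_BRAND_ID": ["SG_1", "SG_2", "SG_3"],
    "PARENT_SAFEGRAPH_BRAND_ID": [None, "SG_1", None],
    "BRAND_NAME": ["Brand A", "Brand B", "Brand C"],
    "STOCK_SYMBOL": ["AAA", None, "ZZZ"],
})
comp = pd.DataFrame({
    "gvkey": ["001"],
    "tic": ["AAA"],
    "conm": ["BRAND A INC"],
    "cik": ["0000000001"],
})

# Split brands by whether a ticker is present.
has_ticker = sg["STOCK_SYMBOL"].notna()
sgTickerData = sg[has_ticker].copy()
sgNoTickerData = sg[~has_ticker].copy()

# Match on ticker. A left merge keeps brands whose ticker is not in Compustat.
matched = sgTickerData.merge(
    comp, how="left", left_on="STOCK_SYMBOL", right_on="tic"
)

# Brands with no ticker match or no CIK, plus brands with no ticker at all,
# would go on to the name-based fuzzy match (omitted in this sketch).
needs_fuzzy = pd.concat(
    [matched.loc[matched["cik"].isna(), sg.columns], sgNoTickerData],
    ignore_index=True,
)

# Parent check: look up whether an unmatched brand's parent has a CIK.
cik_by_brand = matched.dropna(subset=["cik"]).set_index("SAFEGRAPH_BRAND_ID")["cik"]
needs_fuzzy["parent_cik"] = needs_fuzzy["PARENT_SAFEGRAPH_BRAND_ID"].map(cik_by_brand)

print(needs_fuzzy[["SAFEGRAPH_BRAND_ID", "parent_cik"]])
```

In this toy data, `SG_3` has a ticker that Compustat does not recognize and `SG_2` has no ticker, so both remain unmatched; `SG_2` then inherits a CIK through its parent `SG_1`.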
- Notebook Setup (import packages, set directory, define functions)
- Identify unique `gvkey`, `tic`, `conm`, `cik` observations in Compustat
- Identify `SAFEGRAPH_BRAND_ID` observations that have a `STOCK_SYMBOL` (ticker)
- Match SafeGraph Data with Compustat Data on Ticker (`STOCK_SYMBOL` and `tic`)
- Fuzzy Match with SEC data (Because some Compustat companies do not have a CIK (SEC unique identifier), and because some SafeGraph brands may not have a Compustat ticker, we next use SEC EDGAR data to match on company name.)
- Pull EDGAR data from SEC for 2019 - 2024, identify reporting companies
- Fuzzy match `BRAND_NAME` to `Company Name`: Default minimum score is 85 (out of 100)
- Fuzzy Match SafeGraph Brands without a `STOCK_SYMBOL` to SEC Company Name
- Merge matches with all data on `PARENT_SAFEGRAPH_BRAND_ID`
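For the EDGAR step, the sketch below shows one way to enumerate the quarterly index files for 2019 through 2024 and attach the user's name and email as a `User-Agent` header, which the SEC asks automated clients to declare. Which index file the notebook actually reads is not stated here, so `company.idx` is an assumption, as are the helper names and the placeholder contact details. No request is sent in this example.

```python
from urllib.request import Request

# Placeholders: replace with your own details before contacting the SEC.
USER_NAME = "Jane Researcher"
USER_EMAIL = "jane@example.edu"

BASE_URL = "https://www.sec.gov/Archives/edgar/full-index"


def edgar_index_urls(start_year=2019, end_year=2024, index_file="company.idx"):
    """List one quarterly index URL per year and quarter (hypothetical helper)."""
    return [
        f"{BASE_URL}/{year}/QTR{quarter}/{index_file}"
        for year in range(start_year, end_year + 1)
        for quarter in range(1, 5)
    ]


def build_request(url):
    """Build (but do not send) a request that identifies the user to the SEC."""
    return Request(url, headers={"User-Agent": f"{USER_NAME} {USER_EMAIL}"})


urls = edgar_index_urls()
print(len(urls))  # 24 (6 years x 4 quarters)
print(urls[0])    # https://www.sec.gov/Archives/edgar/full-index/2019/QTR1/company.idx
```

Each request could then be opened with `urllib.request.urlopen` or an equivalent HTTP client, with a pause between downloads to stay within the SEC's rate limits.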
- This linking process relies on SafeGraph and Compustat identification of stock tickers.
  - Tickers change over time as a result of events such as mergers, acquisitions, and company rebrands.
  - Neither SafeGraph nor Compustat provides historical ticker information (i.e., the ticker identified by each data source is the ticker as of a particular point in time).
    - The Compustat ticker is the most current ticker for the company.
    - The exact date of SafeGraph's ticker identification is not known.
  - If a researcher is interested in more complete and accurate tracking of ticker changes over time, event data may be available from other sources (e.g., NYSE).
- How SafeGraph identifies parent companies is not known. Given the vast number of brands and sometimes complex holding company structures, some brands that are in fact owned by a public company may not be identified with one. Researchers should account for this possibility when planning their research design.
  - For example, if a researcher is studying the foot traffic of public companies and some brand relationships are unidentified, the foot traffic of the affected public companies may be understated.
- SafeGraph does not take into account the time series of parent company changes.
  - For example, Tiffany & Co. (a publicly traded company) was acquired by LVMH in January 2021. SafeGraph reflects the current structure, listing LVMH as the parent company of the Tiffany & Co. brand.
- This notebook utilizes fuzzy matching on company names. The threshold in the `find_best_match` function is set to 85 by default. The exact matching success rate has not been audited in detail for `match_type == 'company name'`. The researcher may modify this threshold considering their tolerance for matching errors.
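One way to choose a threshold is to check how many matches survive at several cutoffs and then review the borderline ones by hand. The sketch below does this on invented names, again using `difflib.SequenceMatcher` from the standard library as a stand-in for the notebook's fuzzywuzzy scorer, so the scores will not be identical to the notebook's. The helper names are hypothetical.

```python
from difflib import SequenceMatcher


def score(a: str, b: str) -> int:
    """Return a 0-100 similarity score (stand-in for a fuzzywuzzy scorer)."""
    return round(100 * SequenceMatcher(None, a.upper(), b.upper()).ratio())


# Invented brand and company names, for illustration only.
brands = ["Acme Stores", "Acme Store", "Globex", "Initech Coffee"]
sec_names = ["ACME STORES", "GLOBEX CORP", "UMBRELLA HOLDINGS INC"]

# Best-scoring SEC name for each brand, stored as (score, name).
best = {b: max((score(b, n), n) for n in sec_names) for b in brands}


def accepted(min_score):
    """Return the brand-to-company matches that clear the given threshold."""
    return {b: n for b, (s, n) in best.items() if s >= min_score}


for threshold in (75, 85, 95):
    kept = accepted(threshold)
    print(threshold, len(kept), sorted(kept))
```

Matches that appear at a lower threshold but drop out at a higher one are the natural candidates for manual review.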