How to add 3 million real companies to your empty Cloudflare D1 database?

by Onni Hakala

Getting started with an empty database for a new product

I have recently been building a new tool to help people find new workplaces. To make it easier to add current workplaces and to show related links for the jobs I have found, I needed a dataset that contains each company's name, main domain, and other interesting facts like social media links.

I found out that CrunchBase is perfect for this, and they offer a basic dataset for free, which is amazing. By importing this dataset into my application I can later build features like auto-complete (see the sketch below) and make it easier to match open jobs to the actual companies.

During this process I also learned that Cloudflare D1 doesn't support importing large (>500 MB) database dumps, but this can be circumvented by splitting the dataset into smaller batches.
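
For example, once the data is in place, the auto-complete lookup can start out as a simple prefix query against D1. The command below is only a rough sketch: it assumes the crunchbase_companies table created in step 6 and the your-database name used in step 8.

    wrangler d1 execute your-database --remote --command "SELECT name, domain FROM crunchbase_companies WHERE name LIKE 'Clou%' ORDER BY name LIMIT 10"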

The steps I used with my MacBook

You will probably need to change a few lines if you use something other than macOS.

  1. Create a CrunchBase user and obtain an API key under Integrations -> CrunchBase API. The basic API is free.

  2. You should now have a CrunchBase API key, and you can export it on the command line:

    # Replace XXXXXX... with your key
    export CB_API_KEY=XXXXXX
  3. Download the organizations.csv & people.csv files from the CrunchBase CSV export:

    mkdir company-data
    cd company-data
    curl -O "https://api.crunchbase.com/odm/v4/odm.tar.gz?user_key=$CB_API_KEY"
    tar -xvf odm.tar.gz
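    # Optional sanity check: the archive should contain organizations.csv and people.csv
    ls -lh organizations.csv people.csv
    # Rough row count of the organizations (around 3 million at the time of writing)
    wc -l organizations.csv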
  4. Install duckdb & jq and query organizations.csv for an example:

    brew install duckdb jq
    duckdb --json -c "SELECT * FROM 'organizations.csv' WHERE domain = 'meta.com'" | jq
    [
      {
        "uuid": "df662812-7f97-0b43-9d3e-12f64f504fbb",
        "name": "Meta",
        "type": "organization",
        "primary_role": "company",
        "cb_url": "https://www.crunchbase.com/organization/facebook?utm_source=crunchbase&utm_medium=export&utm_campaign=odm_csv",
        "domain": "meta.com",
        "homepage_url": "https://meta.com",
        "logo_url": "https://images.crunchbase.com/image/upload/t_cb-default-original/whm4ed1rrc8skbdi3biv",
        "facebook_url": "https://www.facebook.com/Meta",
        "twitter_url": "https://www.twitter.com/Meta",
        "linkedin_url": "https://www.linkedin.com/company/meta",
        "combined_stock_symbols": "nasdaq:META",
        "city": "Menlo Park",
        "region": "California",
        "country_code": "USA",
        "short_description": "Meta is a social technology company that enables people to connect, find communities, and grow businesses."
      }
    ]
  5. Optional: Remove extra details that are not needed and convert the large CSV into a ZSTD-compressed Parquet file

    # Removes extra data to make this dataset smaller. You don't need to do this and can use `organizations.csv` directly as the source in step 6.
    duckdb -c "
        COPY (
            SELECT
                name,
                domain,
                short_description,
                combined_stock_symbols,
                REPLACE(SPLIT_PART(cb_url, '?', 1),'https://www.crunchbase.com/organization/','') as crunchbase_id,
                REPLACE(logo_url,'https://images.crunchbase.com/image/upload/t_cb-default-original/','') as crunchbase_logo_key,
                RTRIM(REGEXP_REPLACE(facebook_url,'https?://www.facebook.com/',''),'/') as facebook_page_name,
                SPLIT_PART(REGEXP_REPLACE(linkedin_url,'https?://www.linkedin.com/company/',''),'/',1) as linkedin_company_page,
                REGEXP_REPLACE(twitter_url,'https?://(www\.)?twitter.com/','') as twitter_username,
                city,
                region,
                country_code
            FROM 'organizations.csv'
            WHERE linkedin_url IS NOT NULL
            AND primary_role = 'company'
        ) TO 'companies-min.zstd.parquet'
        (FORMAT PARQUET, CODEC 'zstd');
    "
  6. Convert the Parquet file into an SQLite database and name the table crunchbase_companies

    # Convert dataset into sqlite database
    duckdb -c "
        INSTALL sqlite;
        LOAD sqlite;
        ATTACH 'companies.db' (TYPE SQLITE);
        USE companies;
        CREATE TABLE crunchbase_companies AS FROM 'companies-min.zstd.parquet';
    "
    # Dump the SQLite file into plain SQL (the transactions and PRAGMA statements it contains are stripped out in the next step)
    sqlite3 companies.db .dump > companies.sql
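    # Optional sanity check: the dump should contain one CREATE TABLE statement
    # and roughly one INSERT statement per company
    grep -c '^CREATE TABLE' companies.sql
    grep -c '^INSERT INTO' companies.sql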
  7. Install GNU coreutils and split the file into multiple smaller files. This also strips out the SQL transactions and PRAGMA statements, because Cloudflare D1 doesn't support them out of the box. I had a few problems with Cloudflare D1 when I tried to import the 1.1 GB dataset directly, and splitting it into separate files helped.

    brew install coreutils
    # Extract all CREATE TABLE statements into a separate file
    grep -E '^CREATE TABLE' companies.sql > create-tables.sql
    # Split the INSERT statements into separate files of 100k lines each
    grep -E '^INSERT INTO' companies.sql | gsplit -l 100000 --additional-suffix=.sql -d - insert-into-
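    # Optional sanity check: see how many chunk files were produced
    ls insert-into-*.sql | wc -l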
  8. Import the SQL that creates the tables first

    wrangler d1 create your-database
    wrangler d1 execute your-database --yes --remote --file=create-tables.sql
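    # Optional sanity check: list the tables D1 now knows about,
    # crunchbase_companies should be one of them
    wrangler d1 execute your-database --remote --command "SELECT name FROM sqlite_master WHERE type='table'"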
  9. Import all of the data into the database in parallel processes using xargs

    find . -name 'insert-into-*.sql' | xargs -P $(sysctl -n hw.ncpu) -I{} wrangler d1 execute your-database --yes --remote --file={}
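    # Optional: once all of the imports have finished, verify that the row count
    # roughly matches the number of INSERT statements counted in step 6
    wrangler d1 execute your-database --remote --command "SELECT COUNT(*) AS companies FROM crunchbase_companies"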

Conclusion

I hope this is useful for others who want to import company data or who run into the import size limitations of Cloudflare D1 💪. If you know a better way to upload large datasets, or other great public datasets about companies, let me know.