How to add 3 million real companies to your empty Cloudflare D1 database?

by Onni Hakala

Getting started with an empty database for a new product

I have recently been building a new tool to help people find new workplaces. To make it easier to add current workplaces and to show related links for the jobs I have found, I needed a dataset that contains each company's name, main domain, and other interesting facts like social media links.

I found out that CrunchBase is perfect for this, and they offer a basic dataset for free, which is amazing. By importing this dataset into my application I can later build features like auto-complete (see the sketch below) and make it easier to match open jobs to the actual companies.

During this process I also learned that Cloudflare D1 doesn't support importing large (>500 MB) database dumps, but this can be circumvented by splitting the dataset into smaller batches.
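
For example, once the data is in place, the auto-complete lookup can start out as a simple prefix query against D1. The command below is only a rough sketch: it assumes the crunchbase_companies table created in step 6 and the your-database name used in step 8.

    wrangler d1 execute your-database --remote --command "SELECT name, domain FROM crunchbase_companies WHERE name LIKE 'Clou%' ORDER BY name LIMIT 10"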

The steps I used with my MacBook

You will probably need to change a few lines if you use something other than macOS.

  1. Create a CrunchBase user and obtain an API key under Integrations -> CrunchBase API. The basic API is free.

  2. You should now have a CrunchBase API key, and you can export it on the command line:

    # Replace XXXXXX... with your key
    export CB_API_KEY=XXXXXX
  3. Download the organizations.csv & people.csv files from the CrunchBase CSV export:

    mkdir company-data
    cd company-data
    curl -O "https://api.crunchbase.com/odm/v4/odm.tar.gz?user_key=$CB_API_KEY"
    tar -xvf odm.tar.gz
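    # Optional sanity check: the archive should contain organizations.csv and people.csv
    ls -lh organizations.csv people.csv
    # Rough row count of the organizations (around 3 million at the time of writing)
    wc -l organizations.csv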
  4. Install duckdb & jq and query organizations.csv for an example:

    brew install duckdb jq
    duckdb --json -c "SELECT * FROM 'organizations.csv' WHERE domain = 'meta.com'" | jq
    [
      {
        "uuid": "df662812-7f97-0b43-9d3e-12f64f504fbb",
        "name": "Meta",
        "type": "organization",
        "primary_role": "company",
        "cb_url": "https://www.crunchbase.com/organization/facebook?utm_source=crunchbase&utm_medium=export&utm_campaign=odm_csv",
        "domain": "meta.com",
        "homepage_url": "https://meta.com",
        "logo_url": "https://images.crunchbase.com/image/upload/t_cb-default-original/whm4ed1rrc8skbdi3biv",
        "facebook_url": "https://www.facebook.com/Meta",
        "twitter_url": "https://www.twitter.com/Meta",
        "linkedin_url": "https://www.linkedin.com/company/meta",
        "combined_stock_symbols": "nasdaq:META",
        "city": "Menlo Park",
        "region": "California",
        "country_code": "USA",
        "short_description": "Meta is a social technology company that enables people to connect, find communities, and grow businesses."
      }
    ]
  5. Optional: Remove extra details that are not needed and convert the large CSV into a ZSTD-compressed Parquet file

    # Removes extra data to make this dataset smaller. You don't need to do this and can use `organizations.csv` directly as the source in step 6.
    duckdb -c "
        COPY (
            SELECT
                name,
                domain,
                short_description,
                combined_stock_symbols,
                REPLACE(SPLIT_PART(cb_url, '?', 1),'https://www.crunchbase.com/organization/','') as crunchbase_id,
                REPLACE(logo_url,'https://images.crunchbase.com/image/upload/t_cb-default-original/','') as crunchbase_logo_key,
                RTRIM(REGEXP_REPLACE(facebook_url,'https?://www.facebook.com/',''),'/') as facebook_page_name,
                SPLIT_PART(REGEXP_REPLACE(linkedin_url,'https?://www.linkedin.com/company/',''),'/',1) as linkedin_company_page,
                REGEXP_REPLACE(twitter_url,'https?://(www\.)?twitter.com/','') as twitter_username,
                city,
                region,
                country_code
            FROM 'organizations.csv'
            WHERE linkedin_url IS NOT NULL
            AND primary_role = 'company'
        ) TO 'companies-min.zstd.parquet'
        (FORMAT PARQUET, CODEC 'zstd');
    "
  6. Convert the Parquet file into an SQLite database and name the table crunchbase_companies

    # Convert dataset into sqlite database
    duckdb -c "
        INSTALL sqlite;
        LOAD sqlite;
        ATTACH 'companies.db' (TYPE SQLITE);
        USE companies;
        CREATE TABLE crunchbase_companies AS FROM 'companies-min.zstd.parquet';
    "
    # Dump the SQLite file into plain SQL (the transactions and PRAGMA statements it contains are stripped out in the next step)
    sqlite3 companies.db .dump > companies.sql
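    # Optional sanity check: the dump should contain one CREATE TABLE statement
    # and roughly one INSERT statement per company
    grep -c '^CREATE TABLE' companies.sql
    grep -c '^INSERT INTO' companies.sql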
  7. Install GNU coreutils and split the file into multiple smaller files. This also strips out the SQL transactions and PRAGMA statements, because Cloudflare D1 doesn't support them out of the box. I had a few problems with Cloudflare D1 when I tried to import the 1.1 GB dataset directly, and splitting it into separate files helped.

    brew install coreutils
    # Extract all CREATE TABLE statements into a separate file
    grep -E '^CREATE TABLE' companies.sql > create-tables.sql
    # Split the INSERT statements into separate files of 100k lines each
    grep -E '^INSERT INTO' companies.sql | gsplit -l 100000 --additional-suffix=.sql -d - insert-into-
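    # Optional sanity check: see how many chunk files were produced
    ls insert-into-*.sql | wc -l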
  8. Import the SQL that creates the tables first

    wrangler d1 create your-database
    wrangler d1 execute your-database --yes --remote --file=create-tables.sql
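    # Optional sanity check: list the tables D1 now knows about,
    # crunchbase_companies should be one of them
    wrangler d1 execute your-database --remote --command "SELECT name FROM sqlite_master WHERE type='table'"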
  9. Import all of the data into the database in parallel processes using xargs

    find . -name 'insert-into-*.sql' | xargs -P $(sysctl -n hw.ncpu) -I{} wrangler d1 execute your-database --yes --remote --file={}
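    # Optional: once all of the imports have finished, verify that the row count
    # roughly matches the number of INSERT statements counted in step 6
    wrangler d1 execute your-database --remote --command "SELECT COUNT(*) AS companies FROM crunchbase_companies"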

Conclusion

I hope this is useful for others who want to import company data or who run into the import size limitations of Cloudflare D1 💪. If you know a better way to upload large datasets, or other great public datasets about companies, let me know.