Benchmarking datasets

When you provision a Lakehouse node, it comes preconfigured to point to a public S3 bucket in AWS us-east-1 that contains sample benchmarking datasets.

Note

Accessing these datasets can incur data transfer costs, and queries may be subject to cross-region latency.

You can query tables in these datasets by referencing them with their schema name.
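As a brief illustration of schema-qualified references, the queries below count rows in a few of the sample datasets. This is only a sketch: it assumes the TPC-H schemas expose the standard TPC-H table names (such as lineitem), which this page does not list; clickbench.hits is the table used in the ClickBench examples on this page.

```sql
-- Assumption: the TPC-H schemas use the standard TPC-H table names.
-- Count rows in lineitem at scale factor 1.
select count(*) from tpch_sf_1.lineitem;

-- The same table at scale factor 10: only the schema name changes.
select count(*) from tpch_sf_10.lineitem;

-- The ClickBench dataset, via the hits table.
select count(*) from clickbench.hits;
```

Starting with the smallest scale factor is a cheap way to check a query before running it against the larger datasets.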

Schema name      Dataset
tpcds_sf_1       TPC-DS, Scale Factor 1
tpcds_sf_10      TPC-DS, Scale Factor 10
tpcds_sf_100     TPC-DS, Scale Factor 100
tpcds_sf_1000    TPC-DS, Scale Factor 1000
tpch_sf_1        TPC-H, Scale Factor 1
tpch_sf_10       TPC-H, Scale Factor 10
tpch_sf_100      TPC-H, Scale Factor 100
tpch_sf_1000     TPC-H, Scale Factor 1000
clickbench       ClickBench, 100 million rows
brc_1b           Billion row challenge

Notes about ClickBench data

Date columns (EventDate) are integers, not dates.
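If you need a real date, one possible conversion is sketched below. It rests on an assumption this page does not confirm: that the integer in "EventDate" is a count of days since 1970-01-01, as in the upstream ClickBench Parquet data. Check a few values against your own expectations before relying on it.

```sql
-- Assumption: "EventDate" holds the number of days since 1970-01-01.
-- In Postgres, adding an integer to a date yields a date that many days later.
select
    "EventDate" as raw_value,
    date '1970-01-01' + "EventDate"::int as event_date
from clickbench.hits
limit 5;
```

The cast to int is there because Postgres defines date + integer but not date + bigint; it does nothing if the column is already a 4-byte integer.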

You must quote ClickBench column names because they contain uppercase letters, and Postgres folds unquoted identifiers to lowercase. For example:

select "Title" from clickbench.hits;

🚫 select Title from clickbench.hits;

