TPCH Dataset Example
The TPC-H benchmark is widely used to evaluate computer and database systems. Its dataset is chosen as an example here because it is sample data with real-world relevance that supports complex queries answering critical business questions.
Setting up the dataset
First, create the TPC-H dataset as described here: https://github.com/sunblaze-ucb/elastic-sensitivity-experiments/tree/master/data
Load the dataset into a SQL database, e.g. a local Postgres instance. Make a note of the database, schema, user, and password settings; you will need them later for the connection string.
The schema, i.e. the CREATE script, could look like this:
CREATE TABLE REGION (
  R_REGIONKEY integer NOT NULL,
  R_NAME text NOT NULL,
  R_COMMENT text,
  PRIMARY KEY (R_REGIONKEY)
);
CREATE TABLE NATION (
  N_NATIONKEY integer NOT NULL,
  N_NAME text NOT NULL,
  N_REGIONKEY integer NOT NULL,
  N_COMMENT text,
  PRIMARY KEY (N_NATIONKEY),
  FOREIGN KEY (N_REGIONKEY) REFERENCES REGION (R_REGIONKEY)
);
CREATE TABLE CUSTOMER (
  C_CUSTKEY integer NOT NULL,
  C_NAME text NOT NULL,
  C_ADDRESS text NOT NULL,
  C_NATIONKEY integer NOT NULL,
  C_PHONE text NOT NULL,
  C_ACCTBAL integer NOT NULL,
  C_MKTSEGMENT text NOT NULL,
  C_COMMENT text NOT NULL,
  PRIMARY KEY (C_CUSTKEY),
  FOREIGN KEY (C_NATIONKEY) REFERENCES NATION (N_NATIONKEY)
);
CREATE TABLE ORDERS (
  O_ORDERKEY integer NOT NULL,
  O_CUSTKEY integer NOT NULL,
  O_ORDERSTATUS text NOT NULL,
  O_TOTALPRICE integer NOT NULL,
  O_ORDERDATE date NOT NULL,
  O_ORDERPRIORITY text NOT NULL,
  O_CLERK text NOT NULL,
  O_SHIPPRIORITY integer NOT NULL,
  O_COMMENT text NOT NULL,
  PRIMARY KEY (O_ORDERKEY),
  FOREIGN KEY (O_CUSTKEY) REFERENCES CUSTOMER (C_CUSTKEY)
);
CREATE TABLE PART (
  P_PARTKEY integer NOT NULL,
  P_NAME text NOT NULL,
  P_MFGR text NOT NULL,
  P_BRAND text NOT NULL,
  P_TYPE text NOT NULL,
  P_SIZE integer NOT NULL,
  P_CONTAINER text NOT NULL,
  P_RETAILPRICE integer NOT NULL,
  P_COMMENT text NOT NULL,
  PRIMARY KEY (P_PARTKEY)
);
CREATE TABLE SUPPLIER (
  S_SUPPKEY integer NOT NULL,
  S_NAME text NOT NULL,
  S_ADDRESS text NOT NULL,
  S_NATIONKEY integer NOT NULL,
  S_PHONE text NOT NULL,
  S_ACCTBAL integer NOT NULL,
  S_COMMENT text NOT NULL,
  PRIMARY KEY (S_SUPPKEY),
  FOREIGN KEY (S_NATIONKEY) REFERENCES NATION (N_NATIONKEY)
);
CREATE TABLE PARTSUPP (
  PS_PARTKEY integer NOT NULL,
  PS_SUPPKEY integer NOT NULL,
  PS_AVAILQTY integer NOT NULL,
  PS_SUPPLYCOST integer NOT NULL,
  PS_COMMENT text NOT NULL,
  PRIMARY KEY (PS_PARTKEY, PS_SUPPKEY),
  FOREIGN KEY (PS_SUPPKEY) REFERENCES SUPPLIER (S_SUPPKEY),
  FOREIGN KEY (PS_PARTKEY) REFERENCES PART (P_PARTKEY)
);
CREATE TABLE LINEITEM (
  L_ORDERKEY integer NOT NULL,
  L_PARTKEY integer NOT NULL,
  L_SUPPKEY integer NOT NULL,
  L_LINENUMBER integer NOT NULL,
  L_QUANTITY integer NOT NULL,
  L_EXTENDEDPRICE integer NOT NULL,
  L_DISCOUNT integer NOT NULL,
  L_TAX integer NOT NULL,
  L_RETURNFLAG text NOT NULL,
  L_LINESTATUS text NOT NULL,
  L_SHIPDATE date NOT NULL,
  L_COMMITDATE date NOT NULL,
  L_RECEIPTDATE date NOT NULL,
  L_SHIPINSTRUCT text NOT NULL,
  L_SHIPMODE text NOT NULL,
  L_COMMENT text NOT NULL,
  PRIMARY KEY (L_ORDERKEY, L_LINENUMBER),
  FOREIGN KEY (L_ORDERKEY) REFERENCES ORDERS (O_ORDERKEY),
  FOREIGN KEY (L_PARTKEY, L_SUPPKEY) REFERENCES PARTSUPP (PS_PARTKEY, PS_SUPPKEY)
);
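The tables can then be populated from the files produced by TPC-H's dbgen tool. dbgen writes pipe-delimited .tbl files with a trailing delimiter on every row, which must be stripped before Postgres can load them. A load might look like this (the file paths, the create_tables.sql filename, and the tpch database name are assumptions for illustration; the loop order respects the foreign keys above):

```shell
# Assumed: the schema above is saved as create_tables.sql and the
# dbgen-generated .tbl files are in the current directory.
createdb tpch
psql tpch -f create_tables.sql

# dbgen leaves a trailing '|' on every row; strip it, then load the
# tables in foreign-key order.
for t in region nation customer orders part supplier partsupp lineitem; do
  sed 's/|$//' "$t.tbl" > "$t.csv"
  psql tpch -c "\copy $t FROM '$t.csv' WITH (FORMAT csv, DELIMITER '|')"
done
```

The repository linked above may also ship its own import scripts; this sketch is only one way to do it.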
Log in to DQ0 as a "Data Owner" user to set up this dataset for usage on the DQ0 platform. Then use the DQ0 dataset editor to register the dataset and define its properties.
Specify the source of the dataset as a database connection string:
Then define the properties according to the column descriptions above.
A valid metadata definition for the TPC-H dataset, which can be inspected or edited in the "Summary" tab of the dataset editor, looks like this:
name: dq0_test_tpch
description: some description
type: postgresql
tags: ""
privacy_column: l_orderkey
metadata_is_public: true
public:
  connection: user:password@localhost:5432/tpch
  privacy_level: 2
  lineitem:
    row_privacy: false
    rows: 1325478
    max_ids: 5
    sample_max_ids: true
    censor_dims: true
    clamp_counts: true
    clamp_columns: true
    budget_epsilon: 1
    budget_delta: 1
    l_orderkey:
      private_id: true
      type: int
    l_partkey:
      type: int
    l_suppkey:
      type: int
    l_linenumber:
      type: int
    l_quantity:
      type: float
    l_extendedprice:
      type: float
      bounded: true
      use_auto_bounds: true
      auto_bounds_prob: 0.9
      lower: 1
      upper: 2
    l_discount:
      type: float
    l_tax:
      type: float
    l_returnflag:
      type: string
      cardinality: 100
    l_linestatus:
      type: string
      cardinality: 100
    l_shipdate:
      type: string
      cardinality: 100
    l_commitdate:
      type: string
      cardinality: 100
    l_receiptdate:
      type: string
      cardinality: 100
    l_shipinstruct:
      type: string
      cardinality: 100
    l_shipmode:
      type: string
      cardinality: 100
    l_comment:
      type: string
      cardinality: 100
  nation:
    row_privacy: false
    rows: 1000
    max_ids: 1
    sample_max_ids: true
    censor_dims: true
    clamp_counts: false
    clamp_columns: true
    n_nationkey:
      private_id: true
      type: int
    n_name:
      type: string
      cardinality: 100
    n_regionkey:
      type: int
    n_comment:
      type: string
      cardinality: 100
  orders:
    row_privacy: false
    rows: 1000
    max_ids: 1
    sample_max_ids: true
    censor_dims: true
    clamp_counts: false
    clamp_columns: true
    o_orderkey:
      private_id: true
      type: int
    o_custkey:
      type: int
    o_orderstatus:
      type: string
      cardinality: 100
    o_totalprice:
      type: float
    o_orderdate:
      type: string
      cardinality: 100
    o_orderpriority:
      type: string
      cardinality: 100
    o_clerk:
      type: string
      cardinality: 100
    o_shippriority:
      type: int
    o_comment:
      type: string
      cardinality: 100
  region:
    row_privacy: false
    rows: 1000
    max_ids: 1
    sample_max_ids: true
    censor_dims: true
    clamp_counts: false
    clamp_columns: true
    r_regionkey:
      private_id: true
      type: int
    r_name:
      type: string
      cardinality: 100
    r_comment:
      type: string
      cardinality: 100
  customer:
    row_privacy: false
    rows: 1000
    max_ids: 1
    sample_max_ids: true
    censor_dims: true
    clamp_counts: false
    clamp_columns: true
    c_custkey:
      private_id: true
      type: int
    c_name:
      type: string
      cardinality: 100
    c_address:
      type: string
      cardinality: 100
    c_nationkey:
      type: int
      cardinality: 100
    c_phone:
      type: string
      cardinality: 100
    c_acctbal:
      type: float
    c_mktsegment:
      type: string
      cardinality: 100
    c_comment:
      type: string
      cardinality: 100
  partsupp:
    row_privacy: false
    rows: 1000
    max_ids: 1
    sample_max_ids: true
    censor_dims: true
    clamp_counts: false
    clamp_columns: true
    ps_partkey:
      private_id: true
      type: int
    ps_suppkey:
      type: int
    ps_availqty:
      type: int
    ps_supplycost:
      type: float
    ps_comment:
      type: string
      cardinality: 100
  supplier:
    row_privacy: false
    rows: 1000
    max_ids: 1
    sample_max_ids: true
    censor_dims: true
    clamp_counts: false
    clamp_columns: true
    s_suppkey:
      private_id: true
      type: int
    s_name:
      type: string
      cardinality: 100
    s_address:
      type: string
      cardinality: 100
    s_nationkey:
      type: int
    s_phone:
      type: string
      cardinality: 100
    s_acctbal:
      type: float
    s_comment:
      type: string
      cardinality: 100
  part:
    row_privacy: false
    rows: 1000
    max_ids: 1
    sample_max_ids: true
    censor_dims: true
    clamp_counts: false
    clamp_columns: true
    p_partkey:
      private_id: true
      type: int
    p_name:
      type: string
      cardinality: 100
    p_mfgr:
      type: string
      cardinality: 100
    p_brand:
      type: string
      cardinality: 100
    p_type:
      type: string
      cardinality: 100
    p_size:
      type: int
    p_container:
      type: string
      cardinality: 100
    p_retailprice:
      type: float
    p_comment:
      type: string
      cardinality: 100
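Indentation carries the structure in this metadata file: tables nest under the schema key (here "public"), and column definitions nest under their table. Assuming PyYAML is available, a quick sanity check of the nesting (a hypothetical helper, not part of DQ0) can look like this:

```python
import yaml  # PyYAML; assumed to be installed, not part of DQ0

# A short fragment of the metadata above, with the expected nesting.
fragment = """
name: dq0_test_tpch
type: postgresql
privacy_column: l_orderkey
public:
  connection: user:password@localhost:5432/tpch
  lineitem:
    max_ids: 5
    l_orderkey:
      private_id: true
      type: int
"""

meta = yaml.safe_load(fragment)
# Tables live under the schema key, columns under their table.
assert meta["public"]["lineitem"]["l_orderkey"]["private_id"] is True
print(meta["privacy_column"])  # -> l_orderkey
```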
Learn more about editing datasets here: Datasets
Execute a SQL query
To work with this dataset, now log in to DQ0 as a "Data Science" user.
Start by selecting a workspace:
Create a project folder with the MLProject template wizard by clicking the "Add Project From Template" button in the workspace screen. The template chooser looks as follows:
Select the "Query" template.
Next, go to "Experiments" to create a default experiment to run queries in.
Finally, navigate to the "Query" page and select the dataset on the right:
In the query editor insert the following SQL query:
SELECT l_returnflag, l_linestatus, count(*) AS count_order
FROM lineitem
WHERE l_shipdate <= date('1998-12-01')
GROUP BY l_returnflag, l_linestatus
ORDER BY l_returnflag, l_linestatus
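To see what this aggregation computes before any differential-privacy noise is added, you can run the same statement against a toy lineitem table. The following is a plain-SQL sketch using SQLite and made-up rows, not the DQ0 execution path:

```python
import sqlite3

# In-memory toy table with only the columns the query touches.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE lineitem (l_returnflag TEXT, l_linestatus TEXT, l_shipdate TEXT)"
)
rows = [
    ("A", "F", "1998-01-01"),
    ("A", "F", "1998-02-01"),
    ("N", "O", "1998-11-30"),
    ("R", "F", "1999-01-01"),  # filtered out by the date predicate
]
conn.executemany("INSERT INTO lineitem VALUES (?, ?, ?)", rows)

result = conn.execute("""
    SELECT l_returnflag, l_linestatus, count(*) AS count_order
    FROM lineitem
    WHERE l_shipdate <= date('1998-12-01')
    GROUP BY l_returnflag, l_linestatus
    ORDER BY l_returnflag, l_linestatus
""").fetchall()
print(result)  # [('A', 'F', 2), ('N', 'O', 1)]
```

On DQ0 the released counts will additionally be perturbed according to the privacy parameters below.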
For the "Private Column", insert l_orderkey.
Set the "Epsilon" and "Tau" values according to your needs: the higher these values, the more accurate your results will be, but the more privacy budget will be consumed once the results are released.
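As a rough intuition for the epsilon trade-off (a sketch of the standard Laplace mechanism for counts, not necessarily DQ0's exact implementation): a count query has sensitivity 1, so it can be released with Laplace noise of scale 1/epsilon. A larger epsilon therefore means less noise and a more accurate answer, at the cost of more budget.

```python
import math
import random

def laplace_noise(scale: float) -> float:
    # Inverse-CDF sampling from a Laplace(0, scale) distribution.
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def noisy_count(true_count: int, epsilon: float) -> float:
    # A count has sensitivity 1, so the noise scale is 1 / epsilon.
    return true_count + laplace_noise(1.0 / epsilon)

random.seed(0)
print(noisy_count(1000, 0.1))   # noise scale 10
print(noisy_count(1000, 10.0))  # noise scale 0.1
```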
Hit "Run" to execute the query.
Inspect run status and query results
Click on the query run, or navigate to it through the "Runs" or "Experiment" screens, to inspect the results. Go to the "Files" tab to browse the query statement and result. Remember that only approved results can be inspected.
Monitor the activity & release results
On the "Data Owner" side you can inspect the latest activity and manage approvals whenever a "Data Science" user requests the release of a query run result.