TPCH Dataset Example
The TPC-H benchmark is widely used to evaluate computer and database systems. Its dataset is chosen as an example here because it is sample data with real-world relevance that supports complex queries answering critical business questions.
Setting up the dataset
First, create the TPC-H dataset as described here: https://github.com/sunblaze-ucb/elastic-sensitivity-experiments/tree/master/data
Load the dataset into a SQL database, e.g. a local Postgres instance. Make a note of the database, schema, user, and password settings; you will need them later for the connection string.
The schema, i.e. the CREATE script, could look like this:
CREATE TABLE REGION (
  R_REGIONKEY integer NOT NULL,
  R_NAME text NOT NULL,
  R_COMMENT text,
  PRIMARY KEY (R_REGIONKEY)
);
CREATE TABLE NATION (
  N_NATIONKEY integer NOT NULL,
  N_NAME text NOT NULL,
  N_REGIONKEY integer NOT NULL,
  N_COMMENT text,
  PRIMARY KEY (N_NATIONKEY),
  FOREIGN KEY (N_REGIONKEY) REFERENCES REGION (R_REGIONKEY)
);
CREATE TABLE CUSTOMER (
  C_CUSTKEY integer NOT NULL,
  C_NAME text NOT NULL,
  C_ADDRESS text NOT NULL,
  C_NATIONKEY integer NOT NULL,
  C_PHONE text NOT NULL,
  C_ACCTBAL integer NOT NULL,
  C_MKTSEGMENT text NOT NULL,
  C_COMMENT text NOT NULL,
  PRIMARY KEY (C_CUSTKEY),
  FOREIGN KEY (C_NATIONKEY) REFERENCES NATION (N_NATIONKEY)
);
CREATE TABLE ORDERS (
  O_ORDERKEY integer NOT NULL,
  O_CUSTKEY integer NOT NULL,
  O_ORDERSTATUS text NOT NULL,
  O_TOTALPRICE integer NOT NULL,
  O_ORDERDATE date NOT NULL,
  O_ORDERPRIORITY text NOT NULL,
  O_CLERK text NOT NULL,
  O_SHIPPRIORITY integer NOT NULL,
  O_COMMENT text NOT NULL,
  PRIMARY KEY (O_ORDERKEY),
  FOREIGN KEY (O_CUSTKEY) REFERENCES CUSTOMER (C_CUSTKEY)
);
CREATE TABLE PART (
  P_PARTKEY integer NOT NULL,
  P_NAME text NOT NULL,
  P_MFGR text NOT NULL,
  P_BRAND text NOT NULL,
  P_TYPE text NOT NULL,
  P_SIZE integer NOT NULL,
  P_CONTAINER text NOT NULL,
  P_RETAILPRICE integer NOT NULL,
  P_COMMENT text NOT NULL,
  PRIMARY KEY (P_PARTKEY)
);
CREATE TABLE SUPPLIER (
  S_SUPPKEY integer NOT NULL,
  S_NAME text NOT NULL,
  S_ADDRESS text NOT NULL,
  S_NATIONKEY integer NOT NULL,
  S_PHONE text NOT NULL,
  S_ACCTBAL integer NOT NULL,
  S_COMMENT text NOT NULL,
  PRIMARY KEY (S_SUPPKEY),
  FOREIGN KEY (S_NATIONKEY) REFERENCES NATION (N_NATIONKEY)
);
CREATE TABLE PARTSUPP (
  PS_PARTKEY integer NOT NULL,
  PS_SUPPKEY integer NOT NULL,
  PS_AVAILQTY integer NOT NULL,
  PS_SUPPLYCOST integer NOT NULL,
  PS_COMMENT text NOT NULL,
  PRIMARY KEY (PS_PARTKEY, PS_SUPPKEY),
  FOREIGN KEY (PS_SUPPKEY) REFERENCES SUPPLIER (S_SUPPKEY),
  FOREIGN KEY (PS_PARTKEY) REFERENCES PART (P_PARTKEY)
);
CREATE TABLE LINEITEM (
  L_ORDERKEY integer NOT NULL,
  L_PARTKEY integer NOT NULL,
  L_SUPPKEY integer NOT NULL,
  L_LINENUMBER integer NOT NULL,
  L_QUANTITY integer NOT NULL,
  L_EXTENDEDPRICE integer NOT NULL,
  L_DISCOUNT integer NOT NULL,
  L_TAX integer NOT NULL,
  L_RETURNFLAG text NOT NULL,
  L_LINESTATUS text NOT NULL,
  L_SHIPDATE date NOT NULL,
  L_COMMITDATE date NOT NULL,
  L_RECEIPTDATE date NOT NULL,
  L_SHIPINSTRUCT text NOT NULL,
  L_SHIPMODE text NOT NULL,
  L_COMMENT text NOT NULL,
  PRIMARY KEY (L_ORDERKEY, L_LINENUMBER),
  FOREIGN KEY (L_ORDERKEY) REFERENCES ORDERS (O_ORDERKEY),
  FOREIGN KEY (L_PARTKEY, L_SUPPKEY) REFERENCES PARTSUPP (PS_PARTKEY, PS_SUPPKEY)
);
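The tables can then be populated from the files produced by TPC-H's dbgen tool. dbgen writes pipe-delimited .tbl files with a trailing delimiter on every row, which must be stripped before Postgres can load them. A load might look like this (the file paths, the create_tables.sql filename, and the tpch database name are assumptions for illustration; the loop order respects the foreign keys above):

```shell
# Assumed: the schema above is saved as create_tables.sql and the
# dbgen-generated .tbl files are in the current directory.
createdb tpch
psql tpch -f create_tables.sql

# dbgen leaves a trailing '|' on every row; strip it, then load the
# tables in foreign-key order.
for t in region nation customer orders part supplier partsupp lineitem; do
  sed 's/|$//' "$t.tbl" > "$t.csv"
  psql tpch -c "\copy $t FROM '$t.csv' WITH (FORMAT csv, DELIMITER '|')"
done
```

The repository linked above may also ship its own import scripts; this sketch is only one way to do it.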
Log in to DQ0 as a "Data Owner" user to set up this dataset for usage on the DQ0 platform. Then use the DQ0 dataset editor to register the dataset and define its properties.
Specify the source of the dataset as a database connection string:
Then define the properties according to the column descriptions above.
A valid metadata definition for the TPC-H dataset, which can be inspected or edited in the "Summary" tab of the dataset editor, looks like this:
name: dq0_test_tpch
description: some description
type: postgresql
tags: ""
privacy_column: l_orderkey
metadata_is_public: true
public:
  connection: user:password@localhost:5432/tpch
  privacy_level: 2
  lineitem:
    row_privacy: false
    rows: 1325478
    max_ids: 5
    sample_max_ids: true
    censor_dims: true
    clamp_counts: true
    clamp_columns: true
    budget_epsilon: 1
    budget_delta: 1
    l_orderkey:
      private_id: true
      type: int
    l_partkey:
      type: int
    l_suppkey:
      type: int
    l_linenumber:
      type: int
    l_quantity:
      type: float
    l_extendedprice:
      type: float
      bounded: true
      use_auto_bounds: true
      auto_bounds_prob: 0.9
      lower: 1
      upper: 2
    l_discount:
      type: float
    l_tax:
      type: float
    l_returnflag:
      type: string
      cardinality: 100
    l_linestatus:
      type: string
      cardinality: 100
    l_shipdate:
      type: string
      cardinality: 100
    l_commitdate:
      type: string
      cardinality: 100
    l_receiptdate:
      type: string
      cardinality: 100
    l_shipinstruct:
      type: string
      cardinality: 100
    l_shipmode:
      type: string
      cardinality: 100
    l_comment:
      type: string
      cardinality: 100
  nation:
    row_privacy: false
    rows: 1000
    max_ids: 1
    sample_max_ids: true
    censor_dims: true
    clamp_counts: false
    clamp_columns: true
    n_nationkey:
      private_id: true
      type: int
    n_name:
      type: string
      cardinality: 100
    n_regionkey:
      type: int
    n_comment:
      type: string
      cardinality: 100
  orders:
    row_privacy: false
    rows: 1000
    max_ids: 1
    sample_max_ids: true
    censor_dims: true
    clamp_counts: false
    clamp_columns: true
    o_orderkey:
      private_id: true
      type: int
    o_custkey:
      type: int
    o_orderstatus:
      type: string
      cardinality: 100
    o_totalprice:
      type: float
    o_orderdate:
      type: string
      cardinality: 100
    o_orderpriority:
      type: string
      cardinality: 100
    o_clerk:
      type: string
      cardinality: 100
    o_shippriority:
      type: int
    o_comment:
      type: string
      cardinality: 100
  region:
    row_privacy: false
    rows: 1000
    max_ids: 1
    sample_max_ids: true
    censor_dims: true
    clamp_counts: false
    clamp_columns: true
    r_regionkey:
      private_id: true
      type: int
    r_name:
      type: string
      cardinality: 100
    r_comment:
      type: string
      cardinality: 100
  customer:
    row_privacy: false
    rows: 1000
    max_ids: 1
    sample_max_ids: true
    censor_dims: true
    clamp_counts: false
    clamp_columns: true
    c_custkey:
      private_id: true
      type: int
    c_name:
      type: string
      cardinality: 100
    c_address:
      type: string
      cardinality: 100
    c_nationkey:
      type: int
      cardinality: 100
    c_phone:
      type: string
      cardinality: 100
    c_acctbal:
      type: float
    c_mktsegment:
      type: string
      cardinality: 100
    c_comment:
      type: string
      cardinality: 100
  partsupp:
    row_privacy: false
    rows: 1000
    max_ids: 1
    sample_max_ids: true
    censor_dims: true
    clamp_counts: false
    clamp_columns: true
    ps_partkey:
      private_id: true
      type: int
    ps_suppkey:
      type: int
    ps_availqty:
      type: int
    ps_supplycost:
      type: float
    ps_comment:
      type: string
      cardinality: 100
  supplier:
    row_privacy: false
    rows: 1000
    max_ids: 1
    sample_max_ids: true
    censor_dims: true
    clamp_counts: false
    clamp_columns: true
    s_suppkey:
      private_id: true
      type: int
    s_name:
      type: string
      cardinality: 100
    s_address:
      type: string
      cardinality: 100
    s_nationkey:
      type: int
    s_phone:
      type: string
      cardinality: 100
    s_acctbal:
      type: float
    s_comment:
      type: string
      cardinality: 100
  part:
    row_privacy: false
    rows: 1000
    max_ids: 1
    sample_max_ids: true
    censor_dims: true
    clamp_counts: false
    clamp_columns: true
    p_partkey:
      private_id: true
      type: int
    p_name:
      type: string
      cardinality: 100
    p_mfgr:
      type: string
      cardinality: 100
    p_brand:
      type: string
      cardinality: 100
    p_type:
      type: string
      cardinality: 100
    p_size:
      type: int
    p_container:
      type: string
      cardinality: 100
    p_retailprice:
      type: float
    p_comment:
      type: string
      cardinality: 100
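Indentation carries the structure in this metadata file: tables nest under the schema key (here "public"), and column definitions nest under their table. Assuming PyYAML is available, a quick sanity check of the nesting (a hypothetical helper, not part of DQ0) can look like this:

```python
import yaml  # PyYAML; assumed to be installed, not part of DQ0

# A short fragment of the metadata above, with the expected nesting.
fragment = """
name: dq0_test_tpch
type: postgresql
privacy_column: l_orderkey
public:
  connection: user:password@localhost:5432/tpch
  lineitem:
    max_ids: 5
    l_orderkey:
      private_id: true
      type: int
"""

meta = yaml.safe_load(fragment)
# Tables live under the schema key, columns under their table.
assert meta["public"]["lineitem"]["l_orderkey"]["private_id"] is True
print(meta["privacy_column"])  # -> l_orderkey
```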
Learn more about editing datasets here: Datasets
Execute a SQL query
To work with this dataset, now log in to DQ0 as a "Data Science" user.
Start by selecting a workspace:
Create a project folder with the MLProject template wizard by clicking the "Add Project From Template" button in the workspace screen. The template chooser looks as follows:
Select the "Query" template.
Next, go to "Experiments" to create a default experiment to run queries in.
Finally, navigate to the "Query" page and select the dataset on the right:
In the query editor insert the following SQL query:
SELECT l_returnflag, l_linestatus, count(*) AS count_order
FROM lineitem
WHERE l_shipdate <= date('1998-12-01')
GROUP BY l_returnflag, l_linestatus
ORDER BY l_returnflag, l_linestatus
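To see what this aggregation computes before any differential-privacy noise is added, you can run the same statement against a toy lineitem table. The following is a plain-SQL sketch using SQLite and made-up rows, not the DQ0 execution path:

```python
import sqlite3

# In-memory toy table with only the columns the query touches.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE lineitem (l_returnflag TEXT, l_linestatus TEXT, l_shipdate TEXT)"
)
rows = [
    ("A", "F", "1998-01-01"),
    ("A", "F", "1998-02-01"),
    ("N", "O", "1998-11-30"),
    ("R", "F", "1999-01-01"),  # filtered out by the date predicate
]
conn.executemany("INSERT INTO lineitem VALUES (?, ?, ?)", rows)

result = conn.execute("""
    SELECT l_returnflag, l_linestatus, count(*) AS count_order
    FROM lineitem
    WHERE l_shipdate <= date('1998-12-01')
    GROUP BY l_returnflag, l_linestatus
    ORDER BY l_returnflag, l_linestatus
""").fetchall()
print(result)  # [('A', 'F', 2), ('N', 'O', 1)]
```

On DQ0 the released counts will additionally be perturbed according to the privacy parameters below.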
For the "Private Column", insert l_orderkey.
Set the "Epsilon" and "Tau" values according to your needs: the higher these values, the more accurate your results will be, but the more privacy budget will be consumed once the results are released.
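As a rough intuition for the epsilon trade-off (a sketch of the standard Laplace mechanism for counts, not necessarily DQ0's exact implementation): a count query has sensitivity 1, so it can be released with Laplace noise of scale 1/epsilon. A larger epsilon therefore means less noise and a more accurate answer, at the cost of more budget.

```python
import math
import random

def laplace_noise(scale: float) -> float:
    # Inverse-CDF sampling from a Laplace(0, scale) distribution.
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def noisy_count(true_count: int, epsilon: float) -> float:
    # A count has sensitivity 1, so the noise scale is 1 / epsilon.
    return true_count + laplace_noise(1.0 / epsilon)

random.seed(0)
print(noisy_count(1000, 0.1))   # noise scale 10
print(noisy_count(1000, 10.0))  # noise scale 0.1
```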
Hit "Run" to execute the query.
Inspect run status and query results
Click on the query run, or navigate to it through the "Runs" or "Experiment" screens, to inspect the results. Go to the "Files" tab to browse the query statement and result. Remember that only approved results can be inspected.
Monitor the activity & release results
On the "Data Owner" side you can inspect the latest activity and manage approvals whenever a "Data Science" user requests the release of a query run result.