Datasets
Usage and Selection
DQ0 is a machine learning and analytics environment, so everything you do on DQ0, you do with and on data. A dataset in DQ0 can be any collection of data that can be logically grouped into a describable set. Each DQ0 dataset comes with its own metadata description, which is explained in more detail below. Datasets can be database tables, CSV or Parquet files, images, documents, or anything else you want to work with. DQ0 comes with connectors for big data sources such as S3, BigQuery, and Snowflake, as well as numerous other data connection interfaces.
Usually, datasets in DQ0 are defined by data owners, who are allowed to see the data. Data scientists, on the other hand, can inspect information about the available datasets and select datasets for their analytics or machine learning tasks.
When starting a new machine learning run, you select the dataset(s) to use (for model training, for example) through a dropdown box:
Likewise for SQL Query runs:
You can select one or more datasets in these dropdown selection boxes.
Privacy Levels & Budget
With DQ0, data scientists can work with data that they are not allowed to see. DQ0 therefore comes with strong built-in data protection mechanisms. Because different datasets have different protection needs, each dataset in DQ0 has one of three so-called "privacy levels", depending on its degree of secrecy.
Highest | High | Public |
---|---|---|
"Highest" provides the best data protection. Runs on "highest" protected datasets will never reveal anything about the data unless you, as a data owner, explicitly approve certain results. | "High" applies the same privacy guarantees and DQ0's Differential Privacy based data protection mechanisms, but automatically provides secure logging and secure metrics for runs. | "Public" should only be used for non-sensitive datasets. When data scientists run jobs on public datasets, DQ0 acts as a general compute or machine learning development platform. |
In addition, under DQ0's Differential Privacy implementation, non-public datasets have a so-called privacy budget. Each published query, analytics, or machine learning run reduces the available privacy budget. Analysts should carefully consider how much budget they want to consume for a given analytics result, and data owners must ensure that the privacy budget bounds are chosen reasonably for each dataset. DQ0 comes with a release and approval system that also manages the datasets' privacy budgets. Budget is actually consumed only when an analytic result (e.g. the result of an SQL query or a final trained machine learning model) is released through this system.
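The release-based budget accounting described above can be illustrated with a minimal sketch. The `PrivacyBudget` class, its method names, and the run identifiers are hypothetical and not part of any DQ0 API; the sketch also assumes simple additive composition, where the epsilon costs of released results add up.

```python
from dataclasses import dataclass, field


@dataclass
class PrivacyBudget:
    """Hypothetical per-dataset budget tracker (illustration only, not a DQ0 API)."""
    epsilon_total: float
    epsilon_spent: float = 0.0
    pending: dict = field(default_factory=dict)  # run id -> epsilon cost

    @property
    def remaining(self) -> float:
        return self.epsilon_total - self.epsilon_spent

    def register_run(self, run_id: str, epsilon_cost: float) -> None:
        """Record the cost of a finished run; no budget is consumed yet."""
        if epsilon_cost <= 0:
            raise ValueError("epsilon_cost must be positive")
        self.pending[run_id] = epsilon_cost

    def release(self, run_id: str) -> float:
        """Release a result: only now is the budget actually reduced."""
        cost = self.pending[run_id]
        if cost > self.remaining:
            raise RuntimeError("not enough privacy budget left for release")
        del self.pending[run_id]
        self.epsilon_spent += cost
        return self.remaining


budget = PrivacyBudget(epsilon_total=10.0)
budget.register_run("query-1", 2.5)
before_release = budget.remaining          # the run alone costs nothing
after_release = budget.release("query-1")  # the release consumes 2.5
```

In this sketch a run that is never released leaves the budget untouched, which mirrors the point that consumption is tied to the release step rather than to the run itself.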
Edit and Add Datasets
Data Owner users can add and edit datasets.
To create a new dataset, navigate to the "Datasets" page and click "Add Dataset" in the upper right:
To edit an existing dataset, first select the dataset in the list and then click the edit (pen) button in the upper right of the dataset's detail view:
Defining a dataset comprises five steps:
- Set a name and description for the new data source
- Define the data source connection
- Control the privacy level and settings
- Add details like tables and columns for tabular data
- Check everything and submit
Name and Description
Provide a name and a meaningful description for the dataset. Additionally, you can add tags (to later group or reference datasets) and toggle the metadata visibility. If "Is this Metadata public" is turned on, anybody can see the dataset's metadata. When unsure, leave this option turned off.
Source
Select a type for this dataset from the given dropdown list. Depending on the type, fill in the schema name and the connection information needed to access this dataset from within your system.
You can add additional schemas by clicking the "Add Schema" button. This might be useful for joint datasets that span multiple databases or sources.
Use the "Create this dataset as a public sample from an existing dataset?" toggle to define this dataset as a public sample dataset for a given existing dataset. This is useful for providing limited, non-sensitive sample data for exploration. If this option is selected, use the dropdown box below to select the dataset that this public copy references.
Privacy
Set the privacy settings for each schema and table of your dataset here. As described above, there are three privacy levels to reflect the desired protection level of your dataset. If unsure, always stick with the default "Highest" setting.
On the right side you can allow or disallow synthetic data generation for this dataset.
Also, for each table you must set the epsilon and delta privacy budget provided for that table. See Differential Privacy for more information.
Details
For tabular data, use the details tab to define the table and column properties of your dataset. Use the "Add Column" and "Add Table" buttons at the bottom to add more columns or tables, respectively.
For a more detailed discussion of the individual schema, table, and column properties see the Metadata section below.
Summary
The summary tab provides an overview of the most important information about the dataset. Click the "Submit" button to save your changes to the DQ0 platform.
The summary tab also allows you to delete the dataset or edit the dataset's metadata directly by clicking the "Delete" and "Edit Yaml" buttons, respectively. In metadata edit mode, you can update the metadata information directly in the YAML description format. Use the "Update from Yaml" button to update the dataset description from your changes. Note that these changes will be stored only after you press the "Submit" button.
Metadata
Datasets are completely described by metadata. In fact, the YAML metadata mentioned above is the exact definition of your dataset.
The metadata properties include:
- name
- description
- type
- schemas
- table level information, with
  - privacy budget
  - column names
  - other misc information for loading
- row level information, if applicable, with
  - row level privacy
- columns, if applicable, with
  - type
  - lower and upper bound (DP)
  - privacy constraints
  - masking: regex masks away groups, shows anything else
  - DP privacy ID info
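As a rough illustration of the masking rule listed above ("regex masks away groups, shows anything else"), the following sketch replaces every character captured by a regex group with `*` and keeps the rest. The function `mask_groups`, the fill character, and the handling of non-matching values are assumptions made for this example; DQ0's actual masking implementation may differ. The pattern is the `email` mask from the sample metadata in this section.

```python
import re


def mask_groups(value: str, pattern: str, fill: str = "*") -> str:
    """Replace every character captured by a regex group with `fill`.

    Characters matched outside of any group, or not matched at all, are kept.
    """
    match = re.search(pattern, value)
    if match is None:
        return value  # assumption: values that do not match stay unchanged
    chars = list(value)
    for group_index in range(1, match.re.groups + 1):
        start, end = match.span(group_index)
        if start == -1:
            continue  # this group did not take part in the match
        for i in range(start, end):
            chars[i] = fill
    return "".join(chars)


masked = mask_groups("john@example.com", r"(.*)@(.*).{3}$")
print(masked)  # ****@********com
```

Here the two groups capture `john` and `example.`, so only the `@` and the last three characters remain visible.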
Metadata is stored in DQ0's central database. It can be defined in the web application or via YAML files. The inner definition in "Database" shall be compatible with SmartNoise metadata. A YAML metadata definition can look like this:
name: 'sample data 1'
description: 'some description'
type: 'CSV'
DatabaseSchema1:
  connection: 'user@db'
  privacy_level: 2
  Table1:
    synth_allowed: true
    budget_epsilon: 1000
    budget_delta: 0.0001
    sep: ';'
    decimal: ','
    na_values:
      capital-gain: 99999
      capital-loss: 99999
      hours-per-week: 99
      native-country: '?'
      occupation: '?'
      workclass: '?'
    row_privacy: true
    rows: 1000
    max_ids: 1
    sample_max_ids: true
    use_dpsu: false
    clamp_counts: true
    clamp_columns: true
    censor_dims: false
    tau: 100
    user_id:
      private_id: true
      type: int
    weight:
      type: float
      bounded: true
      lower: 0.0
      upper: 100.0
      selectable: true
    height:
      type: float
      bounded: true
      use_auto_bounds: true
      auto_bounds_prob: 0.9
      auto_lower: 1.0
      auto_upper: 98.0
    name:
      type: string
    email:
      type: string
      mask: '(.*)@(.*).{3}$'
    occupation:
      type: string
      cardinality: 3
      allowed_values: 's1,s2,s3'
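To make the column properties in the sample metadata above more concrete, here is a minimal sketch of how bounds and allowed values might be applied to individual values. The `columns` dict copies two column definitions from the sample; the helper functions `clamp` and `is_allowed` are hypothetical, and how DQ0 itself enforces these settings is an assumption here.

```python
# Column settings copied from the sample metadata, as a plain dict.
columns = {
    "weight": {"type": "float", "bounded": True, "lower": 0.0, "upper": 100.0},
    "occupation": {"type": "string", "cardinality": 3, "allowed_values": "s1,s2,s3"},
}


def clamp(value: float, column: dict) -> float:
    """Clamp a numeric value to the column's [lower, upper] range, if bounded."""
    if not column.get("bounded", False):
        return value
    return min(max(value, column["lower"]), column["upper"])


def is_allowed(value: str, column: dict) -> bool:
    """Check a categorical value against the comma-separated allowed_values."""
    allowed = column.get("allowed_values")
    if allowed is None:
        return True  # assumption: no restriction when the key is absent
    return value in allowed.split(",")


clamped_high = clamp(250.0, columns["weight"])     # above the upper bound
clamped_low = clamp(-5.0, columns["weight"])       # below the lower bound
known_value = is_allowed("s2", columns["occupation"])
unknown_value = is_allowed("s9", columns["occupation"])
```

Clamping values to known bounds is a common way to limit how much a single record can influence a differentially private result, which is why bounded columns carry `lower` and `upper` settings.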