Datasets

Usage and Selection

DQ0 is a machine learning and analytics environment, so everything you do on DQ0, you do with and on data. A dataset in DQ0 can be any collection of data that can be logically grouped into a describable set. Each DQ0 dataset comes with its own metadata description, which is explained in more detail below. Datasets can be database tables, CSV or Parquet files, images, documents, or anything else you want to work with. DQ0 comes with connectors for big data sources such as S3, BigQuery, and Snowflake, and for numerous other data connection interfaces.

Usually, datasets in DQ0 are defined by data owners, who are allowed to see the data. Data scientists, on the other hand, can inspect information about the available datasets and select datasets for their analytics or machine learning tasks.

When starting a new machine learning run, you can select the dataset or datasets to use (for example, for model training) through a dropdown box:

Select a dataset

Likewise for SQL Query runs:

Select a dataset

You can select one or more datasets in these dropdown selection boxes.

Privacy Levels & Budget

With DQ0, data scientists can work with data that they are not allowed to see. DQ0 therefore comes with strong built-in data protection mechanisms. Different datasets have different protection needs, however, so each dataset in DQ0 has one of three so-called "privacy levels", depending on its degree of secrecy.

Highest: "Highest" provides the best data protection. Runs on "highest" protected datasets will never reveal anything about the data unless you as a data owner explicitly approve certain results.

High: "High" applies the same privacy guarantees and DQ0's Differential Privacy based data protection mechanisms, but automatically provides secure logging and secure metrics for runs.

Public: "Public" should only be used for non-sensitive datasets. When data scientists run jobs on public datasets, DQ0 acts as a general compute or machine learning development platform.

In addition, under DQ0's Differential Privacy implementation, every non-public dataset has a so-called privacy budget. Each published query and each published analytics or machine learning run reduces the available privacy budget. Analysts should carefully consider how much budget they want to consume for a given analytics result, and data owners must ensure that the privacy budget bounds are reasonably chosen for each dataset. DQ0 comes with a sophisticated release and approval system that also manages the datasets' privacy budgets. Budget is only consumed when an analytic result (e.g. the result of an SQL query or a final trained machine learning model) is released through this system.
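The release-time accounting described above can be illustrated with a small sketch. This is not DQ0's API: the class `PrivacyBudget` and its methods are hypothetical, and the sketch assumes the simplest possible accounting, in which the epsilon and delta costs of released results are added up. DQ0's actual accounting may differ.

```python
from dataclasses import dataclass


@dataclass
class PrivacyBudget:
    """Hypothetical budget tracker for one dataset (illustration only)."""

    epsilon: float            # total epsilon budget set by the data owner
    delta: float              # total delta budget set by the data owner
    spent_epsilon: float = 0.0
    spent_delta: float = 0.0

    def can_release(self, eps: float, delta: float) -> bool:
        """Check whether releasing a result with this cost fits the budget."""
        return (self.spent_epsilon + eps <= self.epsilon
                and self.spent_delta + delta <= self.delta)

    def release(self, eps: float, delta: float) -> None:
        """Consume budget. Called only when a result is actually released."""
        if not self.can_release(eps, delta):
            raise ValueError("insufficient privacy budget")
        # Assumption: costs of released results simply add up.
        self.spent_epsilon += eps
        self.spent_delta += delta

    @property
    def remaining(self) -> tuple:
        """Return the remaining (epsilon, delta) budget."""
        return (self.epsilon - self.spent_epsilon,
                self.delta - self.spent_delta)


budget = PrivacyBudget(epsilon=10.0, delta=1e-4)
# Running a job does not touch the budget; only releasing its result does.
budget.release(eps=2.5, delta=0.0)
```

In this sketch a run that is never released costs nothing, which mirrors the statement that budget is consumed only on release.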

Edit and Add Datasets

Data Owner users can add and edit datasets.

To create a new dataset, navigate to the "Datasets" page and click "Add Dataset" on the upper right:

Add Dataset

To edit an existing dataset, first select the dataset in the list and then click the edit (pen) button on the upper right of the dataset's detail view:

Edit Dataset

Defining a dataset comprises five steps:

Set a name and description for the new data source
Define the data source connection
Control the privacy level and settings
Add details like tables and columns for tabular data
Check everything and submit

Name and Description

Dataset description

Provide a name and a meaningful description for the dataset. Additionally, you can add tags (to group or reference datasets later) and toggle the metadata visibility. If "Is this Metadata public" is turned on, anybody can see the dataset's metadata. When unsure, leave this option turned off.

Source

Dataset source

Select a type for this dataset from the given dropdown list. Depending on the type, fill in the schema name and the connection information needed to access this dataset from within your system.

You can add additional schemas by clicking the "Add Schema" button. This might be useful for joint datasets that span multiple databases or sources.

Use the "Create this dataset as a public sample from an existing dataset?" toggle to define this dataset as a public sample dataset for a given existing dataset. This is useful for providing limited, non-sensitive sample data for exploration. If this option is selected, use the dropdown box below it to select the dataset that this public copy references.

Privacy

Dataset privacy

Set the privacy settings for each schema and table of your dataset here. As described above, there are three privacy levels that reflect the desired protection level of your dataset. If unsure, always stick with the default "Highest" setting.

On the right side you can allow or disallow synthetic data generation for this dataset.

Also, for each table, you must set the epsilon and delta privacy budget provided for that table. See Differential Privacy for more information.

Details

Dataset details

For tabular data, use the details tab to define the table and column properties of your dataset. Use the "Add Table" and "Add Column" buttons at the bottom to add more tables or columns respectively.

For a more detailed discussion of the individual schema, table, and column properties see the Metadata section below.

Summary

Dataset summary

The summary tab provides an overview of the most important information about the dataset. Click the "Submit" button to save your changes to the DQ0 platform.

The summary tab also allows you to delete the dataset or edit the dataset's metadata directly by clicking the "Delete" and "Edit Yaml" buttons respectively. In metadata edit mode you can update the metadata information directly in the YAML description format. Use the "Update from Yaml" button to update the dataset description from your changes. Note that these changes will be stored only after you press the "Submit" button.

Metadata

Datasets are completely described by metadata. In fact, the YAML metadata mentioned above is the exact definition of your dataset.

The metadata properties include:

name
description
type
schemas
table level information, with
- privacy budget
- column names
- other misc information for loading
row level information, if applicable, with
- row level privacy
columns, if applicable, with
- type
- lower and upper bound (DP)
- privacy constraints
- masking: regex masks away groups, shows anything else
- DP privacy ID info
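To make the masking rule in the list above concrete ("regex masks away groups, shows anything else"), here is a minimal sketch. It is not DQ0's implementation: the function `mask_groups`, the `*` mask character, and the choice to mask the whole value when the pattern does not match are all assumptions made for illustration. The pattern in the usage line is the `mask` value from the sample metadata in this section.

```python
import re


def mask_groups(value: str, pattern: str, mask_char: str = "*") -> str:
    """Mask every capture group of `pattern` in `value`, keep the rest.

    Hypothetical helper for illustration; not part of DQ0.
    """
    match = re.search(pattern, value)
    if match is None:
        # Assumption: be conservative and hide values that do not match.
        return mask_char * len(value)

    chars = list(value)
    for group_index in range(1, match.re.groups + 1):
        start, end = match.span(group_index)
        if start == -1:
            continue  # this group did not take part in the match
        for position in range(start, end):
            chars[position] = mask_char
    return "".join(chars)


masked = mask_groups("alice@example.com", r"(.*)@(.*).{3}$")
print(masked)
```

With this pattern, the two groups cover the text before the `@` and the text between the `@` and the last three characters, so only the `@` and the final three characters remain visible.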

Metadata is stored in DQ0's central database. It can be defined in the web application or via YAML files. The inner definition in "Database" shall be compatible with SmartNoise metadata. A YAML metadata definition can look like this:

name: 'sample data 1'
description: 'some description'
type: 'CSV'
DatabaseSchema1:
    connection: 'user@db'
    privacy_level: 2
    Table1:
        synth_allowed: true
        budget_epsilon: 1000
        budget_delta: 0.0001
        sep: ';'
        decimal: ','
        na_values:
            capital-gain: 99999
            capital-loss: 99999
            hours-per-week: 99
            native-country: '?'
            occupation: '?'
            workclass: '?'
        row_privacy: true
        rows: 1000
        max_ids: 1
        sample_max_ids: true
        use_dpsu: false
        clamp_counts: true
        clamp_columns: true
        censor_dims: false
        tau: 100
        user_id:
            private_id: true
            type: int
        weight:
            type: float
            bounded: true
            lower: 0.0
            upper: 100.0
            selectable: true
        height:
            type: float
            bounded: true
            use_auto_bounds: true
            auto_bounds_prob: 0.9
            auto_lower: 1.0
            auto_upper: 98.0
        name:
            type: string
        email:
            type: string
            mask: '(.*)@(.*).{3}$'
        occupation:
            type: string
            cardinality: 3
            allowed_values: 's1,s2,s3'
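
As an illustration of how the column bounds in the sample above might be used, the following sketch clamps values to a column's `lower` and `upper` bounds. It is a sketch under assumptions, not DQ0's implementation: the function `clamp_to_bounds` is hypothetical, the column metadata is written as a plain Python dict that mirrors the `weight` entry above, and the reading of the bounds as clamping limits is an assumption suggested by the `clamp_columns` setting.

```python
def clamp_to_bounds(values, column_meta):
    """Clamp numeric values to a column's declared bounds.

    Hypothetical helper for illustration; not part of DQ0.
    `column_meta` mirrors one column entry of the YAML metadata.
    """
    if not column_meta.get("bounded", False):
        # Columns without declared bounds are returned unchanged.
        return list(values)

    lower = column_meta["lower"]
    upper = column_meta["upper"]
    # Values below `lower` become `lower`, values above `upper` become `upper`.
    return [min(max(value, lower), upper) for value in values]


# Mirrors the `weight` column of the sample metadata.
weight_meta = {"type": "float", "bounded": True, "lower": 0.0, "upper": 100.0}

clamped = clamp_to_bounds([-5.0, 42.5, 250.0], weight_meta)
print(clamped)
```

Bounding each value in this way limits how much a single row can influence an aggregate such as a sum or mean, which is why Differential Privacy mechanisms generally need column bounds.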