CLI
The Command Line Interface (CLI) is the program used to communicate with the DQ0 instance. Please refer to the CLI Installation Manual to find out how to install the CLI.
Introduction
The DQ0 CLI program is executed in a terminal session (e.g. bash) as "dq0" (assuming that the path to the CLI installation has been added to your PATH). All commands of the DQ0 CLI follow the form:
dq0 [context] [command] [arguments]
Examples:
dq0 user login
to log in to the DQ0 instance
dq0 data list
to list available datasets
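The command form above can be sketched as a small Python helper for scripts that drive the CLI. This is a minimal sketch, not part of DQ0: the function name dq0_command and the convention of turning keyword arguments into --flag=value pairs are assumptions made for illustration.

```python
import shlex


def dq0_command(context, command, *arguments, **options):
    """Build the argument vector for `dq0 [context] [command] [arguments]`.

    Hypothetical helper: keyword options such as data_uuid="..." are
    rendered as --data-uuid=... flags.
    """
    argv = ["dq0", context, command, *arguments]
    for name, value in options.items():
        flag = "--" + name.replace("_", "-")
        argv.append(f"{flag}={value}")
    return argv


# Render the two examples from the text as shell command lines.
print(shlex.join(dq0_command("user", "login")))
print(shlex.join(dq0_command("data", "list")))
```

The returned list can be passed to subprocess.run(argv, check=True) on a machine where the dq0 executable is on the PATH.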
First Steps
When the CLI is used for the first time, the DQ0 data quarantine instance to be used must first be registered. This is done with the following command:
dq0 proxy add --scheme https --hostname [URL] --port [PORT]
Example for https://dq0.io:8000:
dq0 proxy add --scheme https --hostname dq0.io --port 8000
Ask your DQ0 administrator about the URL and port of your instance.
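If your administrator gives you the instance address as a single URL, the three values for the proxy registration can be split out with Python's standard library. This is a sketch under that assumption; the helper name proxy_add_args is hypothetical, and only the flags shown above (--scheme, --hostname, --port) are used.

```python
from urllib.parse import urlsplit


def proxy_add_args(url):
    """Split an instance URL into the arguments of `dq0 proxy add`.

    Hypothetical helper: requires the URL to carry an explicit scheme,
    host name, and port, e.g. "https://dq0.io:8000".
    """
    parts = urlsplit(url)
    if not parts.scheme or parts.hostname is None or parts.port is None:
        raise ValueError(f"expected scheme://host:port, got {url!r}")
    return [
        "dq0", "proxy", "add",
        "--scheme", parts.scheme,
        "--hostname", parts.hostname,
        "--port", str(parts.port),
    ]


print(" ".join(proxy_add_args("https://dq0.io:8000")))
```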
Help
You can always run the DQ0 CLI with the -h or --help argument to find out about the individual commands.
Examples:
dq0 -h
dq0 data -h
Login
In order to communicate with the DQ0 instance, you have to log in, i.e. authorize and authenticate.
If you are not yet registered, you can do this directly via the CLI using the following command:
dq0 user register
You will then be asked for a user name (email address) and password.
In the course of this registration, the CLI also creates an SSH key pair (private and public key), which is used to encrypt the communication with DQ0.
Note: All communication with the DQ0 instance is encrypted end-to-end. You can therefore only communicate with the instance from the computer on which you registered.
The registration request must first be confirmed by your DQ0 administrator. Only then can you log in with your chosen credentials using the following command:
dq0 user login
After successful login, the session is valid for 30 days.
Projects
Everything you do with DQ0 is organized by projects. To create a new project call
dq0 project create [PROJECT-NAME]
Example:
dq0 project create My-Project
This will create a new folder in your local directory called "My-Project". This new folder contains a meta file for project management and some templates to help you get started with DQ0 development.
To get a list of your available projects use
dq0 project list
Example response:
+--------------------------------------+--------------+-------------+---------+------+----------------+----------------+---------------------------+----------------+
| PROJECTUUID | PROJECTNAME | EXPERIMENTS | COMMITS | RUNS | DATASETS | MODELS | UPDATEDAT | LOCALAVAILABLE |
| | | | | | (USED/CREATED) | (USED/CREATED) | | |
+--------------------------------------+--------------+-------------+---------+------+----------------+----------------+---------------------------+----------------+
| afff0f3f-6299-450c-9ac8-69ebcf49d23d | DemoXYZ | 1 | 1 | 1 | 1 / 1 | 0 / 1 | 2021-01-08T11:02:19+01:00 | true |
+--------------------------------------+--------------+-------------+---------+------+----------------+----------------+---------------------------+----------------+
# Total Items: 1, page: 1, pageSize: 100
Info about one project
dq0 project info --project-path=[PATH-TO-PROJECT-FOLDER]
You can omit the project-path argument if you change to your project directory. This is true for all commands where you need a project-uuid or project-path argument.
As projects are created locally on your machine and the project's code is managed by your external version control system (e.g. your company's git repositories), you need to sync the project's content with DQ0 before you can start runs (e.g. training jobs) on the DQ0 instance.
To sync a project with the DQ0 instance use
dq0 project deploy --project-path=[PATH-TO-PROJECT-FOLDER]
or inside the project's directory:
dq0 project deploy
Data
This section describes how you can manage the data sets available on the DQ0 instance.
List data sets
Use the following command to display a list of all available data sets:
dq0 data list
A response can look like this:
+----+--------------------------------------+--------------+------------+----------------------+-------------+---------------------------+
| ID | DATAUUID | DATANAME | TYPE | DESCRIPTION | PERMISSIONS | UPDATEDAT |
+----+--------------------------------------+--------------+------------+----------------------+-------------+---------------------------+
| 1 | 81e497ef-c37d-41f3-8381-dcd6c268a7fd | Test dataset | PostgreSQL | Some test data | | 2021-01-08T10:59:43+01:00 |
| 2 | 802bf101-9087-4856-8399-506d7728ab70 | Census | CSV | Description | | 2021-01-07T14:41:40+01:00 |
+----+--------------------------------------+--------------+------------+----------------------+-------------+---------------------------+
The data set with the name "Census" is of type "CSV" (comma separated values file); it has the ID "2" and the UUID (universally unique identifier) "802bf101-9087-4856-8399-506d7728ab70".
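For scripting, the table printed by the list command can be turned into a list of dictionaries. The following is a minimal sketch, assuming the single-line header and single-line cells shown in the data list example; it is not a DQ0 feature, the function name parse_table is hypothetical, and it does not handle tables with multi-line headers or wrapped cells.

```python
SAMPLE = """\
+----+--------------------------------------+--------------+------------+----------------+-------------+---------------------------+
| ID | DATAUUID                             | DATANAME     | TYPE       | DESCRIPTION    | PERMISSIONS | UPDATEDAT                 |
+----+--------------------------------------+--------------+------------+----------------+-------------+---------------------------+
| 1  | 81e497ef-c37d-41f3-8381-dcd6c268a7fd | Test dataset | PostgreSQL | Some test data |             | 2021-01-08T10:59:43+01:00 |
| 2  | 802bf101-9087-4856-8399-506d7728ab70 | Census       | CSV        | Description    |             | 2021-01-07T14:41:40+01:00 |
+----+--------------------------------------+--------------+------------+----------------+-------------+---------------------------+
"""


def parse_table(text):
    """Parse a simple ASCII table into a list of dicts keyed by header."""
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        # Skip border lines ("+---+") and footers ("# Total Items: ...").
        if not (line.startswith("|") and line.endswith("|")):
            continue
        cells = [cell.strip() for cell in line[1:-1].split("|")]
        if header is None:
            header = cells  # the first data-like line is the header
        else:
            rows.append(dict(zip(header, cells)))
    return rows


for row in parse_table(SAMPLE):
    print(row["ID"], row["DATANAME"], row["TYPE"], row["DATAUUID"])
```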
Data set info
You can use the following command to display detailed information, including access statistics, about a data set:
dq0 data info --data-uuid UUID
or
dq0 data info --data-id ID
Example:
dq0 data info --data-uuid 802bf101-9087-4856-8399-506d7728ab70
Example response (in JSON format):
{
"commit_uuid": "602d2329-c7ab-44b6-a6ff-6bea996ce41b",
"data_uuid": "802bf101-9087-4856-8399-506d7728ab70",
"data_name": "Census",
"data_type": "CSV",
"data_description": "Description",
"privacy_budget": {
"initial": 100,
"current": 79.69
},
"data_usage": 89,
"data_size": 1000,
"data_meta": "base64encoded-metadata",
"created_at": 1610026900,
"updated_at": 1610026900
}
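A response like the one above can be read with Python's json module, for example to see how much of the privacy budget has been used. This is a sketch only: it assumes that "current" is the remaining budget and "initial" the starting budget, which the field names suggest but the text does not state, and the function name budget_summary is hypothetical.

```python
import json

# Shortened copy of the example response, as valid JSON.
DATA_INFO = """
{
  "data_uuid": "802bf101-9087-4856-8399-506d7728ab70",
  "data_name": "Census",
  "privacy_budget": {"initial": 100, "current": 79.69}
}
"""


def budget_summary(info):
    """Return spent budget and remaining fraction (assumed semantics)."""
    budget = info["privacy_budget"]
    spent = budget["initial"] - budget["current"]
    return {
        "spent": round(spent, 2),
        "remaining_fraction": budget["current"] / budget["initial"],
    }


summary = budget_summary(json.loads(DATA_INFO))
print(f"spent: {summary['spent']}, remaining: {summary['remaining_fraction']:.1%}")
```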
Attach Data Sets to Projects
If you want to train a model on a sensitive data set inside the DQ0 quarantine there are two important prerequisites:
- Your model code needs to use the DQ0 SDK methods to read the selected data sets at runtime (i.e. use the dq0.sdk.data data source classes and the read() function).
- The DQ0 platform needs to know which data source shall be connected to the runtime. Therefore, an available data set needs to be attached to your project.
To attach a data set, copy the data set's UUID (from the data list or data info command) and use it in the following command:
dq0 project attach --project-path=[PATH-TO-PROJECT-FOLDER] --data-uuid=[DATA-UUID]
or inside the project directory:
dq0 project attach --data-uuid=[DATA-UUID]
Use the detach command to remove a data set from a project:
dq0 project detach --data-uuid=[DATA-UUID]
Experiments & Commits
Experiments are there to organize your attempts to create good models. Create a new experiment whenever you want to go a different route. You can create as many experiments as you like. Experiments belong to one project, can have different parameters and entry points and contain many runs, i.e. parametrized experiment executions.
Create a new experiment with:
dq0 experiment create [NAME] [--project-path=[PATH-TO-PROJECT-FOLDER]]
Delete an existing experiment with:
dq0 experiment delete --experiment-uuid=[UUID]
Rename an experiment:
dq0 experiment update --experiment-uuid=[UUID] --experiment-name=[NEW-NAME]
Get all available experiments of the project:
dq0 experiment list [--project-path=[PATH-TO-PROJECT-FOLDER]]
Example response:
+--------------------------------------+---------+----------+-------+---------------------------+
| UUID | NAME | #COMMITS | #RUNS | UPDATEDAT |
+--------------------------------------+---------+----------+-------+---------------------------+
| d7a3d540-15cc-48f7-92da-caca9dfe20aa | Default | 1 | 1 | 2021-01-07T14:42:08+01:00 |
+--------------------------------------+---------+----------+-------+---------------------------+
Info for one specific experiment:
dq0 experiment info --experiment-uuid=[UUID]
Running a training job
Before running a training job, make sure your code is in sync with the DQ0 platform instance. To sync your code run:
dq0 project deploy [--project-path=[PATH-TO-PROJECT-FOLDER]]
This command will return a commit ID that you can use to start the run. Example project deploy response:
{
"message": "project successfully deployed with new commit uuid: c99eb85d-f39d-4362-8640-9f981ede687d"
}
The latest commit is stored in your local project metadata automatically.
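If a script needs the commit UUID from the deploy response, it can be extracted from the message text. This is a minimal sketch that assumes the response has the shape of the example above, with the UUID embedded in the "message" string; the function name is hypothetical.

```python
import json
import re

# Matches a lowercase hexadecimal UUID such as the one in the example.
UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
)


def commit_uuid_from_deploy(response_text):
    """Return the first UUID found in the deploy message, or None."""
    message = json.loads(response_text)["message"]
    match = UUID_RE.search(message)
    return match.group(0) if match else None


RESPONSE = (
    '{"message": "project successfully deployed with new commit uuid: '
    'c99eb85d-f39d-4362-8640-9f981ede687d"}'
)
print(commit_uuid_from_deploy(RESPONSE))
```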
To start the training job use:
dq0 commit run [--project-path=[PATH-TO-PROJECT-FOLDER]]
With arguments:
dq0 commit run [ARG1]=[VAL1] [ARG2]=[VAL2] --mlproject-entry-point=[ENTRY_POINT]
Track your runs with
dq0 run list [--project-path=[PATH-TO-PROJECT-FOLDER]]
and
dq0 job info --job-uuid=[JOB-UUID]
To inspect the job's results, use the artifact commands:
dq0 artifact tree-structure --run-uuid=[RUN-UUID] --level=5
dq0 artifact download --run-uuid=[RUN-UUID] --path=[ARTIFACT-PATH] --download-path=[LOCAL-DOWNLOAD-PATH]
Example:
dq0 artifact download --run-uuid=ae38b2aa-4976-4155-a51b-897bbbb93a1c --path=path/to/artifact --download-path=/path/to/your/local/file.txt
Queries
To send a query, specify the query string, the datasets to use, and any additional parameters:
dq0 query create --datasets=[DATASET1-NAME] --query='[QUERY-STRING]' [--project-path=[PATH-TO-PROJECT-FOLDER]]
Example:
dq0 query create --datasets=dataset1 --query='SELECT COUNT(*) FROM db;'
You can also point to a YAML file containing the query string. Example:
dq0 query create --datasets=dataset1,dataset2 --query-path=/path/to/query.yaml
Get information about a running query job with
dq0 query info --query-uuid=[JOB-UUID]
Example output:
{
"user_id": 2,
"user_name": "12@gradient0.com",
"job_uuid": "4298634d-727d-48dd-96a9-29f8ce2f563b",
"job_name": "Query Run",
"job_type": "query.run",
"job_logs": "2021-01-11T14:58:42Z | dq0.sql.runner | INFO | [__KEYWORD_STARTED__] Started with args: ...",
"job_progress": 1,
"job_state": "finished",
"created_at": 1610377119,
"updated_at": 1610377125
}
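A script that waits for a query can poll the job information until the job_state field reads "finished". The sketch below assumes the caller supplies a function that returns the parsed JSON (for example by running dq0 query info and decoding its output). Only the "finished" state appears in the example above; the "running" state in the demo is a hypothetical placeholder, and the function name wait_until_finished is not part of DQ0.

```python
import time


def wait_until_finished(fetch_info, interval_seconds=5.0, max_polls=60,
                        sleep=time.sleep):
    """Poll fetch_info() until job_state is "finished".

    fetch_info: callable returning the parsed job info as a dict.
    Raises TimeoutError if the job is not finished after max_polls polls.
    """
    for _ in range(max_polls):
        info = fetch_info()
        if info.get("job_state") == "finished":
            return info
        sleep(interval_seconds)
    raise TimeoutError("job did not reach state 'finished' in time")


# Demo with canned responses instead of a real DQ0 instance.
responses = iter([
    {"job_state": "running", "job_progress": 0.5},   # hypothetical state
    {"job_state": "finished", "job_progress": 1},
])
result = wait_until_finished(lambda: next(responses), sleep=lambda s: None)
print(result["job_state"])
```

Because the other possible job states are not documented here, this loop treats every state except "finished" as "keep waiting"; a real script would probably also stop early on a failure state.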
Get the query results (once released) with:
dq0 query result --query-uuid=[JOB-UUID]
Edit Data (DATA OWNER ROLE)
Data Metadata
The data_meta field contains a (base64-encoded) string of the data set's metadata definition. Data metadata is defined in YAML format and looks like this:
name: 'Census'
description: 'Description'
type: 'CSV'
connection: '/path/to/data/census.csv'
privacy_budget: 100
privacy_budget_interval_days: 30
synth_allowed: true
privacy_level: 2
Census:
  table:
    censor_dims: true
    clamp_columns: false
    clamp_counts: false
    max_ids: 10
    row_privacy: false
    rows: 150
    sample_max_ids: true
    tau: 0
    age:
      type: int
      bounded: true
      lower: 0
      upper: 100
      use_auto_bounds: false
      auto_bounds_prob: 0.9
    id:
      private_id: true
      type: int
    workclass:
      cardinality: 9
      allowed_values: 'Private,Self-emp-not-inc,...'
      type: string
      selectable: false
    email:
      type: string
      mask: '(.*)@(.*).{3}$'
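Since the data_meta field holds the YAML text in base64-encoded form, it can be decoded for inspection with Python's standard library. This is a sketch that assumes standard base64 and UTF-8 text, which the document does not specify; the helper names are hypothetical.

```python
import base64


def decode_meta(data_meta):
    """Decode a base64 data_meta string into YAML text (UTF-8 assumed)."""
    return base64.b64decode(data_meta).decode("utf-8")


def encode_meta(yaml_text):
    """Encode YAML text into a base64 string (UTF-8 assumed)."""
    return base64.b64encode(yaml_text.encode("utf-8")).decode("ascii")


yaml_text = "name: 'Census'\ntype: 'CSV'\nprivacy_budget: 100\n"
encoded = encode_meta(yaml_text)
print(encoded)
print(decode_meta(encoded))
```

The decoded text can then be handed to any YAML parser.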
The metadata definition borrows some of the privacy properties from OpenDP (or more precisely, the metadata is a superset of OpenDP's definition; see the SmartNoise metadata documentation).
Top level properties:
- name: The name of the data set.
- description: Data set description.
- connection: Connection URI: a file path for CSV files, a DB connection string for SQL data sources.
- type: Data set type.
- privacy_budget: Privacy budget property. The privacy budget limits the maximum amount of information that may be published about this data set.
- privacy_budget_interval_days: Reset the privacy budget after this number of days. Default is 0 (no reset).
- synth_allowed: Set to true to allow synthesized data for exploration. The DQ0 data synthesizer can be a powerful tool to learn more about data sets without consuming (more) privacy budget.
- privacy_level: 0, 1, or 2 in ascending order of privacy protection. Use 0 for public data sets, 1 for non-private data sets, and 2 for private data sets.
- schema (Census): Name of the database.
Table level properties:
- row_privacy: Tells the system to treat each row as being a single individual. This is common with social science datasets. Default is false.
- rows: Number of rows.
- max_ids: Specifies how many rows each unique user can appear in. If any user appears in more rows than specified, the system will randomly sample to enforce this limit (see sample_max_ids). Default is 1.
- sample_max_ids: If the data curator can be certain that each user appears at most max_ids times in the table, this setting can be disabled to skip the reservoir sampling step. Default is true.
- censor_dims: Drops GROUP BY output rows that might reveal the presence of individuals in the database. For example, a query doing GROUP BY over last names would reveal the existence of an individual with a rare last name. Data owners may override this setting if the dimensions are public or non-sensitive. Default is true.
- clamp_counts: Differentially private counts can sometimes be negative. Setting this option to true will clamp negative counts to 0. Does not affect privacy, but may impact utility. Default is false.
- clamp_columns: By default, the system clamps all input data to ensure that it falls within the lower and upper bounds specified for that column. If the data curator can be certain that the data never falls outside the specified ranges, this step can be disabled. Default is true.
- use_dpsu: Tells the system to use Differentially Private Set Union for censoring of rare dimensions. Does not impact privacy. Default is false.
- tau: Privacy thresholding value. Groups smaller than this value are considered private and are not returned. Default is 0 (disabled).
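To make the max_ids and sample_max_ids properties more concrete, the following sketch caps the number of rows per user with reservoir sampling (Algorithm R). It only illustrates the general technique and is not DQ0's actual implementation; the function name, the row representation as dictionaries, and the id_column parameter are assumptions made for this example.

```python
import random
from collections import defaultdict


def sample_rows_per_user(rows, id_column, max_ids, rng=None):
    """Keep at most max_ids rows per user, chosen uniformly at random.

    Uses one reservoir per user (Algorithm R), so the rows are read once.
    """
    rng = rng or random.Random()
    reservoirs = defaultdict(list)   # user id -> kept rows
    seen = defaultdict(int)          # user id -> rows seen so far
    for row in rows:
        key = row[id_column]
        seen[key] += 1
        reservoir = reservoirs[key]
        if len(reservoir) < max_ids:
            reservoir.append(row)
        else:
            # Replace a kept row with probability max_ids / seen[key].
            slot = rng.randrange(seen[key])
            if slot < max_ids:
                reservoir[slot] = row
    return [row for kept in reservoirs.values() for row in kept]


rows = [{"id": 1, "value": v} for v in range(5)] + [{"id": 2, "value": 9}]
sampled = sample_rows_per_user(rows, "id", max_ids=2, rng=random.Random(0))
print(len(sampled))
```

If every user is already known to appear at most max_ids times, this pass changes nothing, which is why the step can be skipped in that case.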
Column level properties:
type
: This type attribute indicates the simple type for all values in the column. Type may be one of "int", "float", "string", "boolean", or "date". The "date" type includes date or time types. This property is required.private_key
: Indicates that this column is the private identifier (e.g. "UserID", "Household"). Only columns which have private_id set to ‘true’ are treated as individuals subject to privacy protection. Default is false.selectable
: Set to true to allow this column to be selectable outside private aggregations. Default is false.lower
: Valid on numeric columns. Specifies the lower bound for values in this column.upper
: Valid on numeric columns. Specifies the upper bound for values in this column.use_auto_bounds
: DQ0 provides a mechanism to calculate reasonable bounds automaticaly. Set this to true to use the calculated values (stored in the additional propertiesauto_lower
andauto_upper
) instead of the manual ones. Default is false.auto_bounds_prob
: For auto bound calculation: the probability of not selecting false positives.cardinality
: This is an optional hint, valid on columns intended to be used as categories or keys in a GROUP BY. Specifies the approximate number of distinct keys in this column.allowed_values
: An optional propertiy for string type columns. List of strings (comma-seperated) indicating the allowed values this column can have.mask
: Valid on string columns. Can be used to mask returned values, e.g. to hide parts of e-mail addresses etc.
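The effect of the lower and upper bounds (with clamp_columns enabled) and of clamp_counts can be illustrated with two small functions. This is only a sketch of what clamping means, not DQ0's implementation; the function names are hypothetical, and the bounds 0 and 100 are taken from the age column in the example metadata.

```python
def clamp_value(value, lower, upper):
    """Force a column value into the range [lower, upper]."""
    return max(lower, min(upper, value))


def clamp_count(noisy_count):
    """Replace a negative differentially private count with 0."""
    return max(0, noisy_count)


ages = [-3, 42, 130]
print([clamp_value(age, 0, 100) for age in ages])
print(clamp_count(-2), clamp_count(7))
```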
Add or Update a data set
To add a new data set to DQ0 use:
dq0 data add --meta-path=/path/to/my_config.yaml
where my_config.yaml contains a data set definition in the above metadata format.
Update an existing data set with:
dq0 data update --data-uuid=[UUID] --meta-path=/path/to/my_config.yaml
Remove a data set
To remove an existing data set from DQ0 use:
dq0 data remove --data-uuid=[UUID]
Audits (DATA OWNER ROLE)
One of the more important aspects of privacy and data protection is keeping track of what is going on. DQ0 offers an extensive auditing system that Data Owners and Administrators can use to inspect what happened on the platform.
Use the following command to get a list of all recent audited events:
dq0 audit list
A response to this command can look like this:
+---------------------------+--------------------+------------------+----------------------------------------+
| TIMESTAMP | ACTOR | ACTION | DESCRIPTION |
+---------------------------+--------------------+------------------+----------------------------------------+
| 2021-01-11T15:58:39+01:00 | user@gradient0.com | query.added | query with uuid |
| | | | '4298634d-727d-48dd-96a9-29f8ce2f563b' |
| | | | was added by user 'jb@gradient0.com' |
| 2021-01-11T12:08:28+01:00 | user@gradient0.com | query.added | query with uuid |
| | | | 'e62cd857-ded7-4f8d-83e6-ec8820301278' |
| | | | was added by user 'jb@gradient0.com' |
| 2021-01-08T16:31:51+01:00 | user@gradient0.com | user.loggedIn | user with email |
| | | | 'jb@gradient0.com' and device |
| | | | '597dce26-5219-43e2-add3-a9322ac40210' |
| | | | logged in successfully |
+---------------------------+--------------------+------------------+----------------------------------------+
# Total Items: 3, page: 1, pageSize: 100
You can control the output with the optional flags --json (output in JSON format), --page, and --page-size. Example:
dq0 audit list --json --page=1 --page-size=10
{
"total": 36,
"page": 1,
"page_size": 10,
"items": [
{
"timestamp": "1610363349",
"actor": "jb@gradient0.com",
"action": "query.added",
"description": "query with uuid '80918f4f-5530-42b4-91cb-7625c6687ad2' was added by user 'jb@gradient0.com'"
},
{
"timestamp": "1610363308",
"actor": "jb@gradient0.com",
"action": "query.added",
"description": "query with uuid 'e62cd857-ded7-4f8d-83e6-ec8820301278' was added by user 'jb@gradient0.com'"
},
{
"timestamp": "1610119911",
"actor": "jb@gradient0.com",
"action": "user.loggedIn",
"description": "user with email 'jb@gradient0.com' and device '597dce26-5219-43e2-add3-a9322ac40210' logged in successfully"
}
]
}
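The JSON output is convenient for further processing, for example to list all events of one action type with readable timestamps. This sketch assumes the "timestamp" strings are Unix epoch seconds, which is consistent with the times shown in the table output above; the function name events_by_action is hypothetical.

```python
import json
from datetime import datetime, timezone

# Shortened copy of the example output.
AUDIT = """
{
  "total": 36, "page": 1, "page_size": 10,
  "items": [
    {"timestamp": "1610363349", "actor": "jb@gradient0.com",
     "action": "query.added", "description": "query added"},
    {"timestamp": "1610119911", "actor": "jb@gradient0.com",
     "action": "user.loggedIn", "description": "logged in successfully"}
  ]
}
"""


def events_by_action(audit, action):
    """Return (time, actor, description) tuples for one action type."""
    events = []
    for item in audit["items"]:
        if item["action"] == action:
            # Assumption: timestamps are Unix epoch seconds.
            when = datetime.fromtimestamp(int(item["timestamp"]), tz=timezone.utc)
            events.append((when, item["actor"], item["description"]))
    return events


for when, actor, description in events_by_action(json.loads(AUDIT), "query.added"):
    print(when.isoformat(), actor, description)
```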
Policies (DATA OWNER ROLE)
TODO