DQ0 Software Architecture Overview
DQ0 is the machine learning and analytics interface to secured data. It provides users with a robust, secure and privacy-preserving way to perform data analytics and train machine learning models on highly sensitive data. Users of DQ0 never touch the data itself but are able to develop and perform a wide range of exploratory and machine learning analytics.
This document describes the overall software architecture and the roles of the different software components of DQ0.
The DQ0 architecture can be subdivided into five sections: “Client”, “Proxy”, “Platform”, “Compute” and “Quarantine”.
Client users of DQ0 can be logically divided into three groups:
- IT Administrators
- Data Owners
- Data Scientists
IT Admins communicate with the system to control general settings, manage compute cluster configurations, and so on. Data Owners control the sensitive datasets and use DQ0 to manage access and privacy mechanisms. Data Scientists, in this context, are all users who want to perform data analytics tasks with the DQ0 platform.
All clients communicate via the CLI, the command line interface. This is a program, written in Go, that communicates with the DQ0 platform via the DQ0 Proxy (see below).
Usually, Data Science users will manage their code with external code versioning systems. Therefore, local projects are synchronized with the DQ0 instance via certain CLI commands (most of this is done automatically when working with the DQ0 Web Application).
To create machine learning models and other analytic processes that can be executed on the DQ0 platform, users can (but do not need to) use the open source DQ0 SDK.
The most convenient way to work with the DQ0 instance is to start the local web application with a CLI command. The DQ0 Data Science and Data Owner Studi0 web apps offer a clear interface to everything that can be done with DQ0.
Communication between the DQ0 CLI and the DQ0 Platform is end-to-end encrypted (see section below). For this reason, client users must always use the CLI (directly or indirectly with the web application) for all requests.
The DQ0 Proxy acts as a message server and stores new user requests in a queue, which in turn is picked up by the DQ0 Platform. If the Proxy already knows the answer to a query, the answer will be returned directly. In order to avoid collisions, each message has an identifier, which enables an assignment between request and response. The Proxy has no information about the DQ0 instance (the platform side) and only serves as a proxy / cache server for DQ0.
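The matching of requests and responses by identifier can be sketched as follows. This is a minimal in-memory illustration, not the actual Proxy implementation: the class `MessageProxy` and its method names are hypothetical, and the payloads stand in for encrypted messages that the Proxy cannot read.

```python
import uuid
from collections import deque


class MessageProxy:
    """Hypothetical in-memory stand-in for the DQ0 Proxy message store."""

    def __init__(self):
        self._pending = deque()   # requests waiting to be pulled by the Platform
        self._responses = {}      # message id -> response payload

    def submit(self, payload: bytes) -> str:
        """Client side: enqueue a request and return its identifier."""
        message_id = str(uuid.uuid4())
        self._pending.append((message_id, payload))
        return message_id

    def pull(self):
        """Platform side: fetch the oldest pending request, or None."""
        return self._pending.popleft() if self._pending else None

    def respond(self, message_id: str, payload: bytes) -> None:
        """Platform side: store the response under the request's identifier."""
        self._responses[message_id] = payload

    def fetch_response(self, message_id: str):
        """Client side: return the stored response, or None if not ready yet."""
        return self._responses.get(message_id)


proxy = MessageProxy()
first = proxy.submit(b"encrypted request A")
second = proxy.submit(b"encrypted request B")

message_id, payload = proxy.pull()            # the Platform pulls request A
proxy.respond(message_id, b"encrypted response A")
```

Because every response is stored under the identifier of its request, a client polling for `first` never receives the answer intended for `second`, even when several requests are in flight.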
The DQ0 instance is the secure DQ0 runtime environment inside the customer’s highly protected sub-network. It is the place where the actual analytics and model code (setup, train and predict) is executed on the sensitive datasets. The platform manages all processes and information inside the secure enclave (quarantine). It comprises several components (all instance components except Plugins and SDK are written in Go; most plugins are written in Python):
The DQ0 Platform pulls new messages from and pushes new messages to DQ0 Proxy instances. It is impossible to connect to this service from the outside. DQ0 Platform runs as a native OS service (restart on exit). It acts as the "central point of truth". Information is stored in a local SQLite database, which keeps instance migration and backup straightforward.
DQ0 Platform manages users, roles, keys, projects, experiments, models, queries etc. and handles all permission and privacy (pre- and post-)checks. It also prepares analytics jobs and sends them to DQ0 Service for execution. Job states are also stored by the Platform.
After the Platform has prepared an analytics job, the DQ0 Service module takes over and creates suitable runtime environments for job execution. Available runtimes can be configured by the IT Administrator. Environments can be local runtimes or distributed or containerized versions.
DQ0 Service manages all available DQ0 Plugins (see below) in a Plugin Catalog. Depending on the requests of the analytics job, the matching plugin is selected, initialized and executed as an independent batch job in the selected (existing or created) runtime environment.
A DQ0 Plugin consists of a YAML plugin definition file and an executable program that can run independently as a batch job on *nix systems. Most plugins are written in Python, but this is not a requirement. A plugin has a set of defined input arguments and a set of defined output metrics and data that are sent at Plugin runtime to the DQ0 Tracking Server (see below).
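To illustrate how a definition with declared inputs can be used to check a job before it is started, here is a minimal sketch. The actual schema of the plugin definition file is not shown in this document, so the `PluginDefinition` fields, the `validate_arguments` helper and the example values are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class PluginDefinition:
    """Hypothetical in-memory form of a plugin definition file."""
    name: str
    command: list                                  # executable and fixed arguments
    inputs: dict = field(default_factory=dict)     # argument name -> expected type
    outputs: list = field(default_factory=list)    # metric names reported at runtime


def validate_arguments(definition: PluginDefinition, arguments: dict) -> dict:
    """Require exactly the declared inputs and coerce each value to its type."""
    unknown = set(arguments) - set(definition.inputs)
    missing = set(definition.inputs) - set(arguments)
    if unknown or missing:
        raise ValueError(f"unknown={sorted(unknown)} missing={sorted(missing)}")
    return {name: definition.inputs[name](value) for name, value in arguments.items()}


# Illustrative definition; names and values are placeholders.
plugin_1 = PluginDefinition(
    name="plugin_1",
    command=["python", "plugin_1.py"],
    inputs={"arg1": int, "arg2": str},
    outputs=["status"],
)
checked = validate_arguments(plugin_1, {"arg1": "123", "arg2": "other"})
```

Rejecting undeclared or missing arguments before launch keeps the batch job itself simple: it can assume its inputs are complete and well typed.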
Important existing DQ0 Plugins include:
- make-dp: performs a machine learning model training job. Ensures that the provided model definition is changed on the fly to use differential privacy compliant training.
- sql: runs queries against sensitive data sets while preserving privacy of individual records
- privacy-checker: checks results and models for privacy guarantees.
- synth: generates privacy-preserving synthetic data.
You can find out more about some of the plugins in the methods section of this documentation.
DQ0 Tracking Server
The tracking server is an important component of the DQ0 instance. It receives status updates and results from all Plugin runs started from the DQ0 Platform (via the DQ0 Service Catalog). The current implementation builds on the open source MLflow tracking server. The communication to and with this tracking server is detailed in the "Message Flow" section below.
The DQ0 SDK is a software development kit / library written in Python that is used by certain DQ0 Plugins to execute model training and evaluation and to access data sources.
It also serves as a communication interface, implementing the DQ0 CLI API to communicate with the DQ0 Instance from a local Python environment (e.g. Jupyter Notebook).
The DQ0 SDK also contains templates, blueprints, and examples to formulate models in a way that is compatible with the DQ0 differential privacy machine learning modules.
Data connectors are used by the Platform and Service and manage access to the data. A data connector is a data service with read-only access to whatever data sources the customer provides. Data is protected by DQ0 to ensure that:
- the customer’s data is accessible by the DQ0 Instance (DQ0 Plugins called by Platform / Service) only
- the customer’s data is read-only
- the customer’s data never leaves the quarantine
A data connector instance is a concrete data access implementation - there are data connectors for different kinds of data sources like CSV, Postgres, MSSQL, BigQuery, Presto, Images, etc.
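The idea of one read-only interface with several concrete implementations can be sketched as follows. The class names `DataConnector` and `CSVConnector` and the `read` method are hypothetical and do not reflect the actual DQ0 connector API; the sketch only shows the shape of such a design.

```python
import csv
import io
from abc import ABC, abstractmethod


class DataConnector(ABC):
    """Hypothetical read-only interface; it deliberately offers no write method."""

    @abstractmethod
    def read(self):
        """Yield records from the underlying data source."""


class CSVConnector(DataConnector):
    """Concrete connector for CSV text streams."""

    def __init__(self, text_stream):
        self._stream = text_stream

    def read(self):
        # Each row is returned as a mapping of column name to string value.
        yield from csv.DictReader(self._stream)


source = io.StringIO("id,age\n1,34\n2,51\n")
rows = list(CSVConnector(source).read())
```

A connector for another source type (for example a SQL database) would implement the same `read` method, so plugins would not need to know where the data comes from.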
The DQ0 Proxy acts as the only communication gateway connecting:
- DQ0 CLI (including web application and SDK) via REST / WebSockets
- DQ0 Platform via REST (unidirectional, not possible to connect to dq0-main)
The proxy itself only offers the possibility of exchanging messages. Since all messages are encrypted end-to-end, the proxy cannot read them; it can only make them accessible to the client and server endpoints. Because no readable information is available to the proxy, the dq0 proxy does not need to be installed in a private customer network.
To ensure high security and data protection, the client and the server exchange messages using a hybrid encryption system: the contents of a message are encrypted symmetrically, and the symmetric key itself is encrypted asymmetrically using public keys. dq0 uses AES 256 for symmetric encryption and RSA 2048 for asymmetric encryption.
After the exchange of public keys, the server's public key is pushed from the server to the proxy. The client's public key is also sent when the client is registered. The registration process is not different from other processes. Communication follows this schema:
- The client creates a new message (M) and generates a new symmetric key (S) encrypted with the public asymmetric key of the server. The client then encrypts its message (M) with the symmetric key (S) and signs the message with its own private asymmetric key. Both are sent to the proxy as one message.
- The proxy receives the client's message and stores it in a database.
- The DQ0 instance fetches this new encrypted message from the proxy via a pull request.
- The DQ0 instance verifies the message signature with the client's public asymmetric key. This determines whether the message was manipulated in transit.
- The DQ0 instance decrypts the symmetric key (S) with its own private asymmetric key.
- The DQ0 instance now uses the symmetric key (S) to decrypt the message (M).
As soon as the DQ0 instance has processed the message (M), it generates a new message (N) in response and sends it to the proxy, following the same process as the client in the first step above. The client then carries out the verification and decryption steps analogously to the DQ0 instance.
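The schema above can be sketched with the third-party Python `cryptography` package. This is an illustration under stated assumptions, not the DQ0 implementation: the document specifies only AES 256 and RSA 2048, so the AES-GCM mode, OAEP key wrapping, PSS signatures, the choice to sign the encrypted envelope, and the names `seal` and `unseal` are all assumptions.

```python
import os

from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding, rsa
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

# Assumed padding schemes; the document does not name them.
OAEP = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)
PSS = padding.PSS(mgf=padding.MGF1(hashes.SHA256()),
                  salt_length=padding.PSS.MAX_LENGTH)


def seal(message, sender_private_key, recipient_public_key):
    """Encrypt message with a fresh symmetric key S, wrap S, and sign the result."""
    sym_key = AESGCM.generate_key(bit_length=256)            # symmetric key (S)
    nonce = os.urandom(12)
    ciphertext = AESGCM(sym_key).encrypt(nonce, message, None)
    wrapped_key = recipient_public_key.encrypt(sym_key, OAEP)
    signature = sender_private_key.sign(wrapped_key + nonce + ciphertext,
                                        PSS, hashes.SHA256())
    return {"key": wrapped_key, "nonce": nonce,
            "ciphertext": ciphertext, "signature": signature}


def unseal(envelope, recipient_private_key, sender_public_key):
    """Verify the signature, unwrap S, then decrypt the message."""
    signed = envelope["key"] + envelope["nonce"] + envelope["ciphertext"]
    # Raises cryptography.exceptions.InvalidSignature if the envelope was altered.
    sender_public_key.verify(envelope["signature"], signed, PSS, hashes.SHA256())
    sym_key = recipient_private_key.decrypt(envelope["key"], OAEP)
    return AESGCM(sym_key).decrypt(envelope["nonce"], envelope["ciphertext"], None)


client_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
server_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)

# Client -> proxy -> DQ0 instance; the proxy only ever sees the envelope.
envelope = seal(b"message M", client_key, server_key.public_key())
plaintext = unseal(envelope, server_key, client_key.public_key())
```

The response (N) would travel the other way with the roles swapped: `seal(response, server_key, client_key.public_key())` on the instance and `unseal(..., client_key, server_key.public_key())` on the client.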
The diagram below gives an overview of the software components and their roles in DQ0 data processing:
On the left-hand side, some data sources are shown as examples, which serve as the basis for the analysis models to be developed with the platform. As a control instance, the “platform” module monitors the entire analysis and development flow and stores all important data on actions, users, data and analyses with the aid of the “Auditing / Logging / Tracking” module. The “Service Manager” serves as an abstraction layer over all available services within the platform. These services include meta-data management (i.e. the description of the data sources), job management (monitoring and traceability of actions within the platform), governance (linking data to analyses, models and other data) and components for different types of analysis. Access controls are provided as vertical blocks between these services, which decide which user groups are allowed to carry out which actions in which way. The results of analyses are also checked before they are released by the services.
DQ0 comes with a strict permissioning system that is used to control access to data, results and actions per roles (via groups and users). Privacy permissions are checked before, during and after analytics tasks:
A more detailed look at the runtime - the DQ0 Service starting a Plugin as requested by the Platform - is sketched below:
- The Plugin is requested for execution with the provided arguments. The payload includes, among other things: Plugin Identifier: "plugin_1", Executor Identifier: "local" (or "remote"), arguments: "arg1=123" and "arg2=other" etc.
- Depending on the configuration, the plugin batch job is started in either LocalExecutor or RemoteExecutor mode.
- In local mode, a new plugin_1 instance is started as a detached process. Dataset access credentials are provided by the Platform (via the Service). plugin_1 saves results using the tracking service. Monitoring is done by the LocalExecutor. Once plugin_1 finishes, the Service (and the Platform) are notified.
- In the remote execution environment the plugin job is usually created as a containerized version of "plugin_1". Monitoring is managed by the remote environment but tracking is routed to the Tracking Server.
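The dispatch between the two executor modes can be sketched as follows. The payload shape and the classes are hypothetical and only mirror the example values from the list above. As a simplification, the local executor here waits for the process instead of detaching it, and a one-line Python program stands in for the plugin executable.

```python
import subprocess
import sys


class LocalExecutor:
    """Starts the plugin as a separate process on the local machine."""

    def run(self, command, arguments):
        argv = command + [f"{name}={value}" for name, value in arguments.items()]
        process = subprocess.Popen(argv, stdout=subprocess.PIPE, text=True)
        output, _ = process.communicate()     # monitor until the plugin finishes
        return process.returncode, output


class RemoteExecutor:
    """Placeholder: a real remote executor would submit a containerized job."""

    def run(self, command, arguments):
        raise NotImplementedError("remote execution is outside this sketch")


EXECUTORS = {"local": LocalExecutor(), "remote": RemoteExecutor()}

# Hypothetical payload shape using the example values from the text.
payload = {
    "plugin": "plugin_1",
    "executor": "local",
    "arguments": {"arg1": "123", "arg2": "other"},
}

# Stand-in for the plugin executable: echoes the arguments it receives.
plugin_command = [sys.executable, "-c", "import sys; print(' '.join(sys.argv[1:]))"]

executor = EXECUTORS[payload["executor"]]
exit_code, output = executor.run(plugin_command, payload["arguments"])
```

Keeping both executors behind the same `run` method means the code that prepares a job does not change when an administrator switches a runtime from local to remote.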
A User can register via a dq0-proxy using the dq0-cli.
Public Key Exchange
The dq0 platform uses a public key infrastructure (PKI) to exchange information with all participants. The public key of a dq0 platform is not a secret and can be provided together with further information about the dq0 platform. The dq0 platform pushes publicly available information to all connected dq0 proxy instances.
Registration (software PKI)
Before a registration is carried out, the client generates a new key pair specific to the dq0 platform it registers with. The client uses the public key of the dq0 platform to encrypt the registration information and transmits it to the currently connected dq0 proxy instance. A client should provide at least the following information when registering:
- email address
- device token (hidden, for request / response assignment)
- client public key
This registration request is transmitted to the connected proxy instance and saved. The dq0 platform component then picks up the request and processes it.
The registration request is first legitimized by the user confirming the email address. The confirmation token is processed by the proxy instance. After confirmation of the email address, the new user is manually confirmed by an authorized user. An email is then sent to the new user. dq0 now converts the registration request into a new user and authorizes the user to communicate with dq0. Rights and roles can be assigned by an authorized user.
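The registration steps described above form a small state machine, which can be sketched as follows. The state names and the `RegistrationRequest` class are hypothetical; only the three fields and the order of the steps are taken from the text.

```python
from dataclasses import dataclass
from enum import Enum


class RegistrationState(Enum):
    PENDING = "pending"                    # saved by the proxy, picked up by the platform
    EMAIL_CONFIRMED = "email_confirmed"    # the user confirmed the email address
    APPROVED = "approved"                  # an authorized user approved the request


# Allowed transitions, in the order described in the text.
NEXT_STATE = {
    RegistrationState.PENDING: RegistrationState.EMAIL_CONFIRMED,
    RegistrationState.EMAIL_CONFIRMED: RegistrationState.APPROVED,
}


@dataclass
class RegistrationRequest:
    email: str
    device_token: str
    client_public_key: str
    state: RegistrationState = RegistrationState.PENDING

    def advance(self) -> RegistrationState:
        """Move to the next state; an approved request cannot advance further."""
        if self.state not in NEXT_STATE:
            raise ValueError("registration is already approved")
        self.state = NEXT_STATE[self.state]
        return self.state


request = RegistrationRequest("user@example.com", "device-token", "PEM-encoded public key")
request.advance()   # email address confirmed
request.advance()   # manually confirmed by an authorized user
```

Modelling the steps as explicit transitions makes it impossible to approve a request whose email address was never confirmed.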
Authentication (software PKI)
dq0 uses a token-based authentication mechanism. A token is generated by dq0 after the user has logged in. The client should store this token locally and send it with all further requests to dq0 in the HTTP authentication header. A token expires after 30 days. To get a new token, the user has to log in again.
A login request is encrypted with the public key of the dq0 platform and should contain at least the following information:
- email address
- device token (hidden, for request / response assignment)
With this information, dq0 can provide a new token, encrypted with the user's public key, to all connected proxy instances. The user can then retrieve this encrypted token from the connected proxy instance and decrypt it with their own private key.
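The token lifetime and its use in request headers can be sketched as follows. The helper names are hypothetical, and the `Bearer` scheme is an assumption: the text only states that the token is sent in the HTTP authentication header and expires after 30 days.

```python
import secrets
from datetime import datetime, timedelta, timezone

TOKEN_LIFETIME = timedelta(days=30)   # from the text: a token expires after 30 days


def issue_token(now):
    """Return a random token and its expiry time."""
    return secrets.token_urlsafe(32), now + TOKEN_LIFETIME


def is_valid(expires_at, now):
    """A token is valid strictly before its expiry time."""
    return now < expires_at


def auth_header(token):
    # Assumed scheme; the document does not name the exact header format.
    return {"Authorization": f"Bearer {token}"}


issued_at = datetime(2024, 1, 1, tzinfo=timezone.utc)
token, expires_at = issue_token(issued_at)
```

A client would check `is_valid` before each request and trigger a new login once the stored token has expired.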
dq0-api provides two login mechanisms for obtaining a valid token:
- public key infrastructure (PKI) managed by dq0-client
- U2F / FIDO2 hardware token
dq0-client ↔︎ dq0-proxy
- HTTP 1.1 via WebSockets
dq0-proxy <- dq0-platform
- HTTP 1.1 (one direction only)