Data Science SDK Introduction
The DQ0 SDK is available as a public GitHub repository: https://github.com/gradientzero/dq0-sdk
This introduction explains how to work with the DQ0 SDK by walking through a simple machine learning development example.
For a quickstart, please visit the getting started notebook at: https://github.com/gradientzero/dq0-sdk/blob/master/notebooks/DQ0SDK-Quickstart.ipynb
The DQ0 Interface
To implement your own Gaussian naive Bayesian classifier or your own neural network, define the UserModel class by deriving it from one of the following parent classes:
- class NaiveBayesianModel, defined in module dq0.sdk.models.bayes.naive_bayesian_model;
- class NeuralNetworkClassification, defined in module dq0.sdk.models.tf.neural_network_classification;
- class NeuralNetworkRegression, defined in module dq0.sdk.models.tf.neural_network_regression.
Methods of UserModel instance
Any UserModel instance must implement the following two methods:
- setup_model(), defining the machine learning model (for example, the topology of the neural network);
- setup_data(), defining train and test data by initializing the X_train, y_train, X_test, y_test instance attributes (please see below).
It might also be useful (please see the SDK examples) to define an auxiliary preprocess method, responsible for preparing the data (for example, converting string labels to numeric values) so it can be processed by the machine learning algorithm instantiated and assigned to the model attribute (please see below).
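As a rough orientation, a minimal skeleton of such a class could look like the sketch below. The class and method names follow the conventions described in this section; the data handling itself is placeholder code only.

from sklearn.model_selection import train_test_split

from dq0.sdk.models.tf import NeuralNetworkClassification


class UserModel(NeuralNetworkClassification):
    """Minimal sketch of a user model (placeholder data handling)."""

    def setup_data(self):
        # read the attached data source and create the train/test split
        dataset_df = self.preprocess()
        X_train, X_test, y_train, y_test = train_test_split(
            dataset_df.iloc[:, :-1], dataset_df.iloc[:, -1], test_size=0.33)
        self.X_train, self.X_test = X_train, X_test
        self.y_train, self.y_test = y_train, y_test

    def preprocess(self):
        # auxiliary step: prepare the raw data for the learning algorithm
        return self.data_source.read()

    def setup_model(self):
        # define the machine learning model and assign it to self.model
        pass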
Attributes of UserModel instance
Any UserModel instance must include the following instance attributes:
- X_train, a matrix with each row containing a training example;
- y_train, a column vector or one-dimensional Python array with the learning signal;
- model, a TensorFlow.Keras or Scikit-learn model instance. Parameters can be passed to the instance, as specified in the documentation of the corresponding TensorFlow.Keras or Scikit-learn class;
- metrics, a list of metrics, selected from TensorFlow.Keras.metrics, to evaluate the model performance. Each metric is specified by providing its name (string) or by instantiating the corresponding class. In the latter case, metric parameters may be provided, as specified in the documentation of the corresponding TensorFlow.Keras.metrics class. Basically, the metrics are specified in the same way as for the TensorFlow.Keras library.
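For instance, inside setup_model() the metrics attribute might be set in either of the following ways (a small sketch; the metric choices are only illustrative):

import tensorflow as tf

# inside setup_model(self) of your UserModel:
# by name (string) ...
self.metrics = ['accuracy']

# ... or as class instances, which allows passing parameters
self.metrics = [tf.keras.metrics.Precision(name='precision'),
                tf.keras.metrics.Recall(name='recall')]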
If the predict and evaluate methods of the UserModel instance are called, the following additional attributes must be provided:
- X_test, a matrix with each row containing a test example;
- y_test, a column vector or one-dimensional Python array with the learning signal.
Consistency among the data types of X_train, y_train and X_test, y_test (if X_test, y_test are defined) is required. That is, if X_train is a Pandas DataFrame, y_train is expected to be a Pandas Series. Analogously, if X_train is a Numpy ndarray, y_train is expected to be a Numpy ndarray as well. The same data-type relationship constraint holds for X_test and y_test. Furthermore, if X_train is a Pandas DataFrame, X_test is expected to be a Pandas DataFrame as well. Finally, if y_test is defined, it is expected to have the same number of dimensions as y_train. Even though DQ0 is robust to violations of the above data-type relationship and dimensionality constraints, it is recommended that the UserModel instance satisfy them. To check whether these type-relationship constraints are satisfied, one can include the following piece of code in the UserModel method that defines self.X_train, self.X_test, self.y_train, and self.y_test:
from dq0.sdk.data.utils import util
util.check_data_structure_type_consistency(self.X_train, self.X_test,
                                            self.y_train, self.y_test)
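For illustration, consistent pairings of these attributes could look like the following sketch (made-up values):

import pandas as pd

# pandas-based setup: DataFrames for the features, Series for the labels
self.X_train = pd.DataFrame({'age': [25, 40, 31], 'hours-per-week': [40, 50, 38]})
self.y_train = pd.Series([0, 1, 0])
self.X_test = pd.DataFrame({'age': [29, 52], 'hours-per-week': [45, 60]})
self.y_test = pd.Series([1, 0])

# an equivalent NumPy-based setup would use numpy.ndarray objects throughout,
# e.g. self.X_train = numpy.array([[25, 40], [40, 50], [31, 38]])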
In the case of a classification task, if the original data labels are encoded (e.g., as integers), we suggest assigning the encoder object to the label_encoder attribute of the UserModel class. This choice enables the use of the original rather than the encoded labels when annotating the confusion matrix during the evaluation of the user model. The label encoding is usually done in the setup_data or preprocess method of the UserModel class.
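A minimal sketch of this pattern, typically placed in setup_data or preprocess (the string labels below are only an example):

import pandas as pd
import sklearn.preprocessing

y_raw = pd.Series(['<=50K', '>50K', '<=50K'])  # example string labels

# keep the fitted encoder on the model so the original labels can be
# used when the confusion matrix is annotated during evaluation
self.label_encoder = sklearn.preprocessing.LabelEncoder()
self.y_train = pd.Series(self.label_encoder.fit_transform(y_raw),
                         index=y_raw.index)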
Additional requirements for neural network models
UserModel instances implementing neural network models (based on TensorFlow.Keras) require the following additional instance attributes, defining the learning process:
- optimizer, defined either by providing the name (string) or by instantiating a TensorFlow.Keras.optimizers class. The latter approach makes it possible to pass parameters to the optimizer, as specified in the documentation of the corresponding TensorFlow.Keras.optimizers class. Currently, DQ0 supports the following three optimization algorithms from TensorFlow.Keras.optimizers:
  - Adagrad, class tensorflow.keras.optimizers.Adagrad
  - Adam, class tensorflow.keras.optimizers.Adam
  - SGD, class tensorflow.keras.optimizers.SGD
- loss, defining the loss function driving the learning process. The loss function is specified either by providing its name (string) or by instantiating the suitable class from TensorFlow.Keras.losses. The latter approach makes it possible to pass parameters to the loss function, as specified in the documentation of the corresponding TensorFlow.Keras.losses class. Basically, the loss function is specified in the same way as for TensorFlow.Keras.
- epochs, defining the number of epochs of the learning process;
- batch_size, defining the minibatch size adopted for the learning process.
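Taken together, the learning-process attributes of a neural-network UserModel might be set inside setup_model() roughly as follows (a sketch; all values are illustrative only):

import tensorflow as tf

# inside setup_model(self) of your UserModel:
# model topology (illustrative)
self.model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(10,)),
    tf.keras.layers.Dense(16, activation='tanh'),
    tf.keras.layers.Dense(2, activation='softmax')])

# learning process
self.optimizer = tf.keras.optimizers.Adam(learning_rate=0.015)  # or 'Adam'
self.loss = tf.keras.losses.SparseCategoricalCrossentropy()     # or a string name
self.epochs = 10
self.batch_size = 250
self.metrics = ['accuracy']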
For classification tasks, UserModel is assumed to be a probabilistic classifier. That is, for a given input observation, UserModel is expected to return a probability distribution over the set of classes. This assumption does not limit the usage of DQ0. Indeed, it is consistent with the common implementation of neural networks for classification tasks, which usually return class membership probabilities by applying, e.g., a "softmax" activation function in the last layer of the network. Furthermore, if necessary, the raw (un-normalized) predictions of the neural network (also known as "logits") can straightforwardly be extracted from the Keras object underlying the DQ0 UserModel instance.
Additional requirements for naïve Bayesian models
For UserModel instances inheriting from the Gaussian naïve Bayesian model, calibration of model posterior probabilities is enabled by default. Probability calibration is obtained by fitting a sklearn.calibration.CalibratedClassifierCV to the training data (X_train, y_train). Five-fold cross-validation is used. The calibration process may optionally be controlled and tuned by the following UserModel attributes:
- calibrate_posterior_probabilities, a Boolean attribute (default value True). Set it to False to disable probability calibration;
- calibration_method, a string attribute defining the calibration method adopted. Possible values are:
  - sigmoid, to learn a probability calibrator by logistic regression;
  - isotonic, to apply isotonic regression for calibration.
The default value for calibration_method is sigmoid if X_train contains fewer than 1000 training examples, and isotonic otherwise.
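For a Gaussian naïve Bayesian user model, these attributes might be set as in the sketch below (it assumes the NaiveBayesianModel parent class listed at the beginning of this introduction; the chosen values are only an example):

from dq0.sdk.models.bayes.naive_bayesian_model import NaiveBayesianModel


class UserModel(NaiveBayesianModel):

    def setup_model(self):
        # keep posterior-probability calibration enabled (the default) and
        # explicitly select logistic-regression ("sigmoid") calibration
        self.calibrate_posterior_probabilities = True
        self.calibration_method = 'sigmoid'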
The Census Example
The census adult dataset is a popular test bed for simple machine learning models. The goal is to predict whether income exceeds $50K/year based on census data. The data set and a description thereof are available here: https://archive.ics.uci.edu/ml/datasets/adult
The example discussed in this guide is located here: https://github.com/gradientzero/dq0-sdk/tree/master/dq0/sdk/examples/census/raw
Running the example
Let's have a look at the example's run script:
run_demo.py
# -*- coding: utf-8 -*-
"""Adult dataset example.
Run script to test the execution locally.
Copyright 2020, Gradient Zero
All rights reserved
"""
import os
import dq0.sdk
from dq0.sdk.data.utils import util
from dq0.sdk.examples.census.raw.model.user_model import UserModel
if __name__ == '__main__':
    print('\nRunning demo for the "Census" dataset\n')

    # set seed of random number generator to ensure reproducibility of results
    util.initialize_rnd_numbers_generators_state()

    # path to input
    path = '../_data/adult_with_rand_names.csv'
    filepath = os.path.join(os.path.dirname(
        os.path.abspath(__file__)), path)

    # init input data source
    data_source = dq0.sdk.data.text.CSV(filepath)

    # create model
    model = UserModel()

    # attach data source
    model.attach_data_source(data_source)

    # prepare data
    model.setup_data()

    # setup model
    model.setup_model()

    # fit the model
    model.fit()

    # evaluate the model
    model.evaluate()
    model.evaluate(test_data=False)

    print('\nDemo run successfully!\n')
The first interesting line is this one: data_source = dq0.sdk.data.text.CSV(filepath). Here, a new data source object is created from a CSV file. The dq0.sdk.data package contains many pre-defined data sources that you can use for different types of data sets.
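For local experiments you can create such a data source yourself. The sketch below assumes a local CSV file and that read() forwards keyword arguments to the underlying pandas reader, as in the preprocess() example later in this guide:

import dq0.sdk

# hypothetical local file
data_source = dq0.sdk.data.text.CSV('path/to/local_data.csv')

# read the file into a pandas DataFrame for a quick inspection
df = data_source.read(sep=',', skipinitialspace=True)
print(df.head())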
In the next line, model = UserModel(), an instance of our model is created. We will have a closer look at this code in the next section.
model.attach_data_source(data_source) shows an important function of the SDK: attach_data_source "registers" the selected data set as the target data set the model shall be trained on. Since from the client side you cannot access the data sets inside the DQ0 platform instance (the data quarantine), you need to "attach" the source to the model so that the platform can load the correct data set at runtime.
The next two lines are the two most important SDK model functions: model.setup_data() needs to be implemented for the data loading and preparation step. model.setup_model() is called by the DQ0 platform right before the model training and is responsible for the complete model definition.
The remainder of the run script is there for demo purposes only. The fit function is called at runtime by the platform, but in a secure and non-obvious way, to ensure the correct application of DQ0's privacy-preserving mechanisms.
User Model
The user model class is defined as follows:
model/user_model.py
# -*- coding: utf-8 -*-
"""Adult dataset example.
Neural network model definition
Copyright 2020, Gradient Zero
All rights reserved
"""
import logging
from dq0.sdk.models.tf import NeuralNetworkClassification
logger = logging.getLogger()
class UserModel(NeuralNetworkClassification):
"""Derived from dq0.sdk.models.tf.NeuralNetwork class
Model classes provide a setup method for data and model
definitions.
"""
def __init__(self):
super().__init__()
def setup_data(self):
"""Setup data function
This function can be used to prepare data or perform
other tasks for the training run.
At runtime the selected datset is attached to this model. It
is available as the `data_source` attribute.
For local testing call `model.attach_data_source(some_data_source)`
manually before calling `setup_data()`.
Use `self.data_source.read()` to read the attached data.
"""
from sklearn.model_selection import train_test_split
# columns
self.column_names_list = [
'lastname',
'firstname',
'age',
'workclass',
'fnlwgt',
'education',
'education-num',
'marital-status',
'occupation',
'relationship',
'race',
'sex',
'capital-gain',
'capital-loss',
'hours-per-week',
'native-country',
'income'
]
self.columns_types_list = [
{
'name': 'age',
'type': 'int'
}
# CUT FOR BETTER READABILITY
]
# read and preprocess the data
dataset_df = self.preprocess()
# do the train test split
X_train_df, X_test_df, y_train_ts, y_test_ts =\
train_test_split(dataset_df.iloc[:, :-1],
dataset_df.iloc[:, -1],
test_size=0.33
)
self.input_dim = X_train_df.shape[1]
# set data attributes
self.X_train = X_train_df
self.X_test = X_test_df
self.y_train = y_train_ts
self.y_test = y_test_ts
    def preprocess(self):
        """Preprocess the data

        Preprocess the data set. The input data is read from the attached source.

        At runtime the selected dataset is attached to this model. It
        is available as the `data_source` attribute.

        For local testing call `model.attach_data_source(some_data_source)`
        manually before calling `setup_data()`.

        Use `self.data_source.read()` to read the attached data.

        Returns:
            preprocessed data
        """
        from dq0.sdk.data.preprocessing import preprocessing
        import sklearn.preprocessing
        import pandas as pd

        column_names_list = self.column_names_list
        columns_types_list = self.columns_types_list

        # get the input dataset
        if self.data_source is None:
            logger.error('No data source found')
            return

        # read the data via the attached input data source
        dataset = self.data_source.read(
            names=column_names_list,
            sep=',',
            skiprows=1,
            index_col=None,
            skipinitialspace=True,
            na_values={
                'capital-gain': 99999,
                'capital-loss': 99999,
                'hours-per-week': 99,
                'workclass': '?',
                'native-country': '?',
                'occupation': '?'}
        )

        # drop unused columns
        dataset.drop(['lastname', 'firstname'], axis=1, inplace=True)
        column_names_list.remove('lastname')
        column_names_list.remove('firstname')

        # define target feature
        target_feature = 'income'

        # get categorical features
        categorical_features_list = [
            col['name'] for col in columns_types_list
            if col['type'] == 'string']

        # get quantitative features
        quantitative_features_list = [
            col['name'] for col in columns_types_list
            if col['type'] == 'int' or col['type'] == 'float']

        # get arguments
        approach_for_missing_feature = 'imputation'
        imputation_method_for_cat_feats = 'unknown'
        imputation_method_for_quant_feats = 'median'
        features_to_drop_list = None

        # handle missing data
        dataset = preprocessing.handle_missing_data(
            dataset,
            mode=approach_for_missing_feature,
            imputation_method_for_cat_feats=imputation_method_for_cat_feats,
            imputation_method_for_quant_feats=imputation_method_for_quant_feats,  # noqa: E501
            categorical_features_list=categorical_features_list,
            quantitative_features_list=quantitative_features_list)

        if features_to_drop_list is not None:
            dataset.drop(features_to_drop_list, axis=1, inplace=True)

        # get dummy columns
        dataset = pd.get_dummies(dataset, columns=categorical_features_list, dummy_na=False)

        # unzip categorical features with dummies
        categorical_features_list_with_dummies = []
        for col in columns_types_list:
            if col['type'] == 'string':
                for value in col['values']:
                    categorical_features_list_with_dummies.append('{}_{}'.format(col['name'], value))

        # add missing columns
        missing_columns = set(categorical_features_list_with_dummies) - set(dataset.columns)
        for col in missing_columns:
            dataset[col] = 0

        # and sort the columns
        dataset = dataset.reindex(sorted(dataset.columns), axis=1)

        # scale values to the range from 0 to 1 to be processed by the neural network
        dataset[quantitative_features_list] = sklearn.preprocessing.minmax_scale(dataset[quantitative_features_list])

        # label target
        y_ts = dataset[target_feature]
        self.label_encoder = sklearn.preprocessing.LabelEncoder()
        y_bin_nb = self.label_encoder.fit_transform(y_ts)
        y_bin = pd.Series(index=y_ts.index, data=y_bin_nb)
        dataset.drop([target_feature], axis=1, inplace=True)
        dataset[target_feature] = y_bin

        return dataset
    def setup_model(self):
        """Setup model function

        Define the model here.
        """
        import tensorflow.compat.v1 as tf

        self.model = tf.keras.Sequential([
            tf.keras.layers.Input(self.input_dim),
            tf.keras.layers.Dense(10, activation='tanh'),
            tf.keras.layers.Dense(10, activation='tanh'),
            tf.keras.layers.Dense(2, activation='softmax')])

        self.optimizer = 'Adam'
        # To set optimizer params, self.optimizer = optimizer instance
        # rather than string, with params values passed as input to the class
        # constructor. E.g.:
        #
        #   import tensorflow
        #   self.optimizer = tensorflow.keras.optimizers.Adam(
        #       learning_rate=0.015)
        #
        self.epochs = 10
        self.batch_size = 250
        self.metrics = ['accuracy']
        self.loss = tf.keras.losses.SparseCategoricalCrossentropy()
        # As an alternative, define the loss function with a string
The UserModel class is derived from the NeuralNetworkClassification class of the DQ0 SDK. There are a couple of model parent classes for different model types in the SDK's dq0.sdk.models package.
As mentioned above, the UserModel needs to define two functions that are called by the DQ0 runtime (the train plugin) at runtime:
- setup_data: read the sensitive data set and prepare it for model training.
- setup_model: define the model.
The setup_data function above starts with a (quite lengthy) definition of the census columns. This is required for the feature preprocessing in the following steps. In more realistic (non-self-contained) scenarios this information should be loaded from the data set's metadata.
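Purely as an illustration, such column information could be read from a small metadata file instead of being hard-coded; the YAML layout and file name below are hypothetical and not part of the SDK:

# inside setup_data(self); requires PyYAML
import yaml

with open('adult_metadata.yaml') as f:  # hypothetical metadata file
    meta = yaml.safe_load(f)

# expected (hypothetical) layout:
# columns:
#   - {name: age, type: int}
#   - {name: workclass, type: string, values: [Private, ...]}
self.columns_types_list = meta['columns']
self.column_names_list = [col['name'] for col in meta['columns']]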
After the column definitions the data is loaded into a pandas DataFrame by the helper function preprocess. The line dataset = self.data_source.read(...) actually loads the attached data set. The data_source attribute of the UserModel object is populated with the correct data set at runtime by the DQ0 system.
The rest of the preprocess function prepares the features from the data for later use in the model training.
The lines below # do the train test split in setup_data are important:
# do the train test split
X_train_df, X_test_df, y_train_ts, y_test_ts =\
    train_test_split(dataset_df.iloc[:, :-1],
                     dataset_df.iloc[:, -1],
                     test_size=0.33
                     )
self.input_dim = X_train_df.shape[1]

# set data attributes
self.X_train = X_train_df
self.X_test = X_test_df
self.y_train = y_train_ts
self.y_test = y_test_ts
The attributes X_train, X_test, y_train, and y_test are conventions that are expected by the DQ0 instance. Use these exact names to assign the data splits so that the DQ0 runtime is able to pick them up correctly.
Finally, setup_model defines the Keras model that is fitted during the DQ0 training. All attributes that are used in this function are important at runtime:
- model: holds the actual model reference
- optimizer: string or instance reference of the optimizer to be used
- epochs: the number of training epochs
- batch_size: desired batch size for training
- metrics: metrics to log during or after training. Note that all compatible values are allowed but not all values are returned from DQ0 due to privacy checks.
- loss: reference of the loss function to be used.