Data Science SDK Introduction
The DQ0 SDK is available as a public GitHub repository: https://github.com/gradientzero/dq0-sdk
This introduction explains how to work with the DQ0 SDK by walking through a simple machine learning development example.
For a quickstart, please visit the getting started notebook at: https://github.com/gradientzero/dq0-sdk/blob/master/notebooks/DQ0SDK-Quickstart.ipynb
The DQ0 Interface
To implement your own Gaussian naive Bayesian classifier or your own neural network, define the UserModel class by deriving it from one of the following parent classes:
- class NaiveBayesianModel, defined in module dq0.sdk.models.bayes.naive_bayesian_model;
- class NeuralNetworkClassification, defined in module dq0.sdk.models.tf.neural_network_classification;
- class NeuralNetworkRegression, defined in module dq0.sdk.models.tf.neural_network_regression.
Methods of UserModel instance
Any UserModel instance must implement the following two methods:
- setup_model(), defining the machine learning model (for example, the topology of the neural network);
- setup_data(), defining train and test data by initializing the X_train, y_train, X_test, y_test instance attributes (please see below).
It might also be useful (please see the SDK examples) to define an auxiliary preprocess method, responsible for preparing the data (for example, converting string labels to numeric values) so it can be processed by the machine learning algorithm instantiated and assigned to the model attribute (please see below).
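As a rough orientation, a minimal skeleton of such a class could look like the sketch below. The class and method names follow the conventions described in this section; the data handling itself is placeholder code only.

from sklearn.model_selection import train_test_split

from dq0.sdk.models.tf import NeuralNetworkClassification


class UserModel(NeuralNetworkClassification):
    """Minimal sketch of a user model (placeholder data handling)."""

    def setup_data(self):
        # read the attached data source and create the train/test split
        dataset_df = self.preprocess()
        X_train, X_test, y_train, y_test = train_test_split(
            dataset_df.iloc[:, :-1], dataset_df.iloc[:, -1], test_size=0.33)
        self.X_train, self.X_test = X_train, X_test
        self.y_train, self.y_test = y_train, y_test

    def preprocess(self):
        # auxiliary step: prepare the raw data for the learning algorithm
        return self.data_source.read()

    def setup_model(self):
        # define the machine learning model and assign it to self.model
        pass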
Attributes of UserModel instance
Any UserModel instance must include the following instance attributes:
- X_train, a matrix with each row containing a training example;
- y_train, a column vector or one-dimensional Python array with the learning signal;
- model, a TensorFlow.Keras or Scikit-learn model instance. Parameters can be passed to the instance, as specified in the documentation of the corresponding TensorFlow.Keras or Scikit-learn class;
- metrics, a list of metrics, selected from TensorFlow.Keras.metrics, to evaluate the model performance. Each metric is specified by providing its name (string) or by instantiating the corresponding class. In the latter case, metric parameters may be provided, as specified in the documentation of the corresponding TensorFlow.Keras.metrics class. Basically, the metrics are specified in the same way as for the TensorFlow.Keras library.
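For instance, inside setup_model() the metrics attribute might be set in either of the following ways (a small sketch; the metric choices are only illustrative):

import tensorflow as tf

# inside setup_model(self) of your UserModel:
# by name (string) ...
self.metrics = ['accuracy']

# ... or as class instances, which allows passing parameters
self.metrics = [tf.keras.metrics.Precision(name='precision'),
                tf.keras.metrics.Recall(name='recall')]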
If the predict and evaluate methods of the UserModel instance are called, the following additional attributes must be provided:
- X_test, a matrix with each row containing a test example;
- y_test, a column vector or one-dimensional Python array with the learning signal.
Consistency among the data types of X_train, y_train and X_test, y_test (if X_test, y_test are defined) is required. That is, if X_train is a Pandas DataFrame, y_train is expected to be a Pandas Series. Analogously, if X_train is a Numpy ndarray, y_train is expected to be a Numpy ndarray as well. The same data-type relationship constraint holds for X_test and y_test. Furthermore, if X_train is a Pandas DataFrame, X_test is expected to be a Pandas DataFrame as well. Finally, if y_test is defined, it is expected to have the same number of dimensions as y_train. Even though DQ0 is robust to violations of the above data-type relationship and dimensionality constraints, it is recommended that the UserModel instance satisfy them. To check whether these type-relationship constraints are satisfied, one can include the following piece of code in the UserModel method that defines self.X_train, self.X_test, self.y_train, and self.y_test:
from dq0.sdk.data.utils import util
util.check_data_structure_type_consistency(self.X_train, self.X_test,
                                            self.y_train, self.y_test)
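For illustration, consistent pairings of these attributes could look like the following sketch (made-up values):

import pandas as pd

# pandas-based setup: DataFrames for the features, Series for the labels
self.X_train = pd.DataFrame({'age': [25, 40, 31], 'hours-per-week': [40, 50, 38]})
self.y_train = pd.Series([0, 1, 0])
self.X_test = pd.DataFrame({'age': [29, 52], 'hours-per-week': [45, 60]})
self.y_test = pd.Series([1, 0])

# an equivalent NumPy-based setup would use numpy.ndarray objects throughout,
# e.g. self.X_train = numpy.array([[25, 40], [40, 50], [31, 38]])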
In the case of a classification task, if the original data labels are encoded (e.g., as integers), we suggest assigning the encoder object to the label_encoder attribute of the UserModel class. This choice enables the use of the original rather than the encoded labels when annotating the confusion matrix during the evaluation of the user model. The label encoding is usually done in the setup_data or preprocess method of the UserModel class.
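A minimal sketch of this pattern, typically placed in setup_data or preprocess (the string labels below are only an example):

import pandas as pd
import sklearn.preprocessing

y_raw = pd.Series(['<=50K', '>50K', '<=50K'])  # example string labels

# keep the fitted encoder on the model so the original labels can be
# used when the confusion matrix is annotated during evaluation
self.label_encoder = sklearn.preprocessing.LabelEncoder()
self.y_train = pd.Series(self.label_encoder.fit_transform(y_raw),
                         index=y_raw.index)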
Additional requirements for neural network models
UserModel instances implementing neural network models (based on TensorFlow.Keras) require the following additional instance attributes, defining the learning process:
- optimizer, defined either by providing the name (string) or by instantiating a TensorFlow.Keras.optimizers class. The latter approach makes it possible to pass parameters to the optimizer, as specified in the documentation of the corresponding TensorFlow.Keras.optimizers class. Currently, DQ0 supports the following three optimization algorithms from TensorFlow.Keras.optimizers:
  - Adagrad, class tensorflow.keras.optimizers.Adagrad
  - Adam, class tensorflow.keras.optimizers.Adam
  - SGD, class tensorflow.keras.optimizers.SGD
- loss, defining the loss function driving the learning process. The loss function is specified either by providing its name (string) or by instantiating the suitable class from TensorFlow.Keras.losses. The latter approach makes it possible to pass parameters to the loss function, as specified in the documentation of the corresponding TensorFlow.Keras.losses class. Basically, the loss function is specified in the same way as for TensorFlow.Keras.
- epochs, defining the number of epochs of the learning process;
- batch_size, defining the minibatch size adopted for the learning process.
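Taken together, the learning-process attributes of a neural-network UserModel might be set inside setup_model() roughly as follows (a sketch; all values are illustrative only):

import tensorflow as tf

# inside setup_model(self) of your UserModel:
# model topology (illustrative)
self.model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(10,)),
    tf.keras.layers.Dense(16, activation='tanh'),
    tf.keras.layers.Dense(2, activation='softmax')])

# learning process
self.optimizer = tf.keras.optimizers.Adam(learning_rate=0.015)  # or 'Adam'
self.loss = tf.keras.losses.SparseCategoricalCrossentropy()     # or a string name
self.epochs = 10
self.batch_size = 250
self.metrics = ['accuracy']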
For classification tasks, UserModel is assumed to be a probabilistic classifier. That is, for a given input observation, UserModel is expected to return a probability distribution over the set of classes. This assumption does not limit the usage of DQ0. Indeed, it is consistent with the common implementation of neural networks for classification tasks, which usually return class membership probabilities by applying, e.g., a "softmax" activation function in the last layer of the network. Furthermore, if necessary, the raw (un-normalized) predictions of the neural network (also known as "logits") can straightforwardly be extracted from the Keras object underlying the DQ0 UserModel instance.
Additional requirements for naïve Bayesian models
For UserModel instances inheriting from the Gaussian naïve Bayesian model, calibration of model posterior probabilities is enabled by default. Probability calibration is obtained by fitting a sklearn.calibration.CalibratedClassifierCV to the training data (X_train, y_train). Five-fold cross-validation is used. The calibration process may optionally be controlled and tuned by the following UserModel attributes:
- calibrate_posterior_probabilities, a Boolean attribute (default value True). Set it to False to disable probability calibration;
- calibration_method, a string attribute defining the calibration method adopted. Possible values are:
  - sigmoid, to learn a probability calibrator by logistic regression;
  - isotonic, to apply isotonic regression for calibration.
The default value for calibration_method is sigmoid if X_train contains fewer than 1000 training examples, and isotonic otherwise.
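For a Gaussian naïve Bayesian user model, these attributes might be set as in the sketch below (it assumes the NaiveBayesianModel parent class listed at the beginning of this introduction; the chosen values are only an example):

from dq0.sdk.models.bayes.naive_bayesian_model import NaiveBayesianModel


class UserModel(NaiveBayesianModel):

    def setup_model(self):
        # keep posterior-probability calibration enabled (the default) and
        # explicitly select logistic-regression ("sigmoid") calibration
        self.calibrate_posterior_probabilities = True
        self.calibration_method = 'sigmoid'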
The Census Example
The census adult dataset is a popular test bed for simple machine learning models. The goal is to predict whether income exceeds $50K/year based on census data. The data set and a description thereof are available here: https://archive.ics.uci.edu/ml/datasets/adult
The example discussed in this guide is located here: https://github.com/gradientzero/dq0-sdk/tree/master/dq0/sdk/examples/census/raw
Running the example
Let's have a look at the example's run script:
run_demo.py
# -*- coding: utf-8 -*-
"""Adult dataset example.
Run script to test the execution locally.
Copyright 2020, Gradient Zero
All rights reserved
"""
import os
import dq0.sdk
from dq0.sdk.data.utils import util
from dq0.sdk.examples.census.raw.model.user_model import UserModel
if __name__ == '__main__':
    print('\nRunning demo for the "Census" dataset\n')

    # set seed of random number generator to ensure reproducibility of results
    util.initialize_rnd_numbers_generators_state()

    # path to input
    path = '../_data/adult_with_rand_names.csv'
    filepath = os.path.join(os.path.dirname(
        os.path.abspath(__file__)), path)

    # init input data source
    data_source = dq0.sdk.data.text.CSV(filepath)

    # create model
    model = UserModel()

    # attach data source
    model.attach_data_source(data_source)

    # prepare data
    model.setup_data()

    # setup model
    model.setup_model()

    # fit the model
    model.fit()

    # evaluate the model
    model.evaluate()
    model.evaluate(test_data=False)

    print('\nDemo run successfully!\n')
The first interesting line is this one: data_source = dq0.sdk.data.text.CSV(filepath). Here, a new data source object is created from a CSV file. The dq0.sdk.data package contains many pre-defined data sources that you can use for different types of data sets.
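For local experiments you can create such a data source yourself. The sketch below assumes a local CSV file and that read() forwards keyword arguments to the underlying pandas reader, as in the preprocess() example later in this guide:

import dq0.sdk

# hypothetical local file
data_source = dq0.sdk.data.text.CSV('path/to/local_data.csv')

# read the file into a pandas DataFrame for a quick inspection
df = data_source.read(sep=',', skipinitialspace=True)
print(df.head())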
In the next line, model = UserModel(), an instance of our model is created. We will have a closer look at this code in the next section.
model.attach_data_source(data_source) shows an important function of the SDK: attach_data_source "registers" the selected data set as the target data set the model shall be trained on. Since from the client side you cannot access the data sets inside the DQ0 platform instance (the data quarantine), you need to "attach" the source to the model so that the platform can load the correct data set at runtime.
The next two lines are the two most important SDK model functions: model.setup_data() needs to be implemented for the data loading and preparation step. model.setup_model() is called by the DQ0 platform right before the model training and is responsible for the complete model definition.
The remainder of the run script is there for demo purposes only. The fit function is called at runtime by the platform, but in a secure and non-obvious way, to ensure the correct application of DQ0's privacy-preserving mechanisms.
User Model
The user model class is defined as follows:
model/user_model.py
# -*- coding: utf-8 -*-
"""Adult dataset example.
Neural network model definition
Copyright 2020, Gradient Zero
All rights reserved
"""
import logging
from dq0.sdk.models.tf import NeuralNetworkClassification
logger = logging.getLogger()
class UserModel(NeuralNetworkClassification):
"""Derived from dq0.sdk.models.tf.NeuralNetwork class
Model classes provide a setup method for data and model
definitions.
"""
def __init__(self):
super().__init__()
def setup_data(self):
"""Setup data function
This function can be used to prepare data or perform
other tasks for the training run.
At runtime the selected datset is attached to this model. It
is available as the `data_source` attribute.
For local testing call `model.attach_data_source(some_data_source)`
manually before calling `setup_data()`.
Use `self.data_source.read()` to read the attached data.
"""
from sklearn.model_selection import train_test_split
# columns
self.column_names_list = [
'lastname',
'firstname',
'age',
'workclass',
'fnlwgt',
'education',
'education-num',
'marital-status',
'occupation',
'relationship',
'race',
'sex',
'capital-gain',
'capital-loss',
'hours-per-week',
'native-country',
'income'
]
self.columns_types_list = [
{
'name': 'age',
'type': 'int'
}
# CUT FOR BETTER READABILITY
]
# read and preprocess the data
dataset_df = self.preprocess()
# do the train test split
X_train_df, X_test_df, y_train_ts, y_test_ts =\
train_test_split(dataset_df.iloc[:, :-1],
dataset_df.iloc[:, -1],
test_size=0.33
)
self.input_dim = X_train_df.shape[1]
# set data attributes
self.X_train = X_train_df
self.X_test = X_test_df
self.y_train = y_train_ts
self.y_test = y_test_ts
    def preprocess(self):
        """Preprocess the data

        Preprocess the data set. The input data is read from the attached source.

        At runtime the selected dataset is attached to this model. It
        is available as the `data_source` attribute.

        For local testing call `model.attach_data_source(some_data_source)`
        manually before calling `setup_data()`.

        Use `self.data_source.read()` to read the attached data.

        Returns:
            preprocessed data
        """
        from dq0.sdk.data.preprocessing import preprocessing
        import sklearn.preprocessing
        import pandas as pd

        column_names_list = self.column_names_list
        columns_types_list = self.columns_types_list

        # get the input dataset
        if self.data_source is None:
            logger.error('No data source found')
            return

        # read the data via the attached input data source
        dataset = self.data_source.read(
            names=column_names_list,
            sep=',',
            skiprows=1,
            index_col=None,
            skipinitialspace=True,
            na_values={
                'capital-gain': 99999,
                'capital-loss': 99999,
                'hours-per-week': 99,
                'workclass': '?',
                'native-country': '?',
                'occupation': '?'}
        )

        # drop unused columns
        dataset.drop(['lastname', 'firstname'], axis=1, inplace=True)
        column_names_list.remove('lastname')
        column_names_list.remove('firstname')

        # define target feature
        target_feature = 'income'

        # get categorical features
        categorical_features_list = [
            col['name'] for col in columns_types_list
            if col['type'] == 'string']

        # get quantitative features
        quantitative_features_list = [
            col['name'] for col in columns_types_list
            if col['type'] == 'int' or col['type'] == 'float']

        # get arguments
        approach_for_missing_feature = 'imputation'
        imputation_method_for_cat_feats = 'unknown'
        imputation_method_for_quant_feats = 'median'
        features_to_drop_list = None

        # handle missing data
        dataset = preprocessing.handle_missing_data(
            dataset,
            mode=approach_for_missing_feature,
            imputation_method_for_cat_feats=imputation_method_for_cat_feats,
            imputation_method_for_quant_feats=imputation_method_for_quant_feats,  # noqa: E501
            categorical_features_list=categorical_features_list,
            quantitative_features_list=quantitative_features_list)

        if features_to_drop_list is not None:
            dataset.drop(features_to_drop_list, axis=1, inplace=True)

        # get dummy columns
        dataset = pd.get_dummies(dataset, columns=categorical_features_list, dummy_na=False)

        # unzip categorical features with dummies
        categorical_features_list_with_dummies = []
        for col in columns_types_list:
            if col['type'] == 'string':
                for value in col['values']:
                    categorical_features_list_with_dummies.append('{}_{}'.format(col['name'], value))

        # add missing columns
        missing_columns = set(categorical_features_list_with_dummies) - set(dataset.columns)
        for col in missing_columns:
            dataset[col] = 0

        # and sort the columns
        dataset = dataset.reindex(sorted(dataset.columns), axis=1)

        # scale values to the range from 0 to 1 to be processed by the neural network
        dataset[quantitative_features_list] = sklearn.preprocessing.minmax_scale(dataset[quantitative_features_list])

        # label target
        y_ts = dataset[target_feature]
        self.label_encoder = sklearn.preprocessing.LabelEncoder()
        y_bin_nb = self.label_encoder.fit_transform(y_ts)
        y_bin = pd.Series(index=y_ts.index, data=y_bin_nb)
        dataset.drop([target_feature], axis=1, inplace=True)
        dataset[target_feature] = y_bin

        return dataset
    def setup_model(self):
        """Setup model function

        Define the model here.
        """
        import tensorflow.compat.v1 as tf

        self.model = tf.keras.Sequential([
            tf.keras.layers.Input(self.input_dim),
            tf.keras.layers.Dense(10, activation='tanh'),
            tf.keras.layers.Dense(10, activation='tanh'),
            tf.keras.layers.Dense(2, activation='softmax')])

        self.optimizer = 'Adam'
        # To set optimizer params, self.optimizer = optimizer instance
        # rather than string, with params values passed as input to the class
        # constructor. E.g.:
        #
        #   import tensorflow
        #   self.optimizer = tensorflow.keras.optimizers.Adam(
        #       learning_rate=0.015)
        #
        self.epochs = 10
        self.batch_size = 250
        self.metrics = ['accuracy']
        self.loss = tf.keras.losses.SparseCategoricalCrossentropy()
        # As an alternative, define the loss function with a string
The UserModel class is derived from the NeuralNetworkClassification class of the DQ0 SDK. There are a couple of model parent classes for different model types in the SDK's dq0.sdk.models package.
As mentioned above, the UserModel needs to define two functions that are called by the DQ0 runtime (the train plugin) at runtime:
- setup_data: read the sensitive data set and prepare it for model training.
- setup_model: define the model.
The setup_data function above starts with a (quite lengthy) definition of the census columns. This is required for the feature preprocessing in the following steps. In more realistic (non-self-contained) scenarios this information should be loaded from the data set's metadata.
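Purely as an illustration, such column information could be read from a small metadata file instead of being hard-coded; the YAML layout and file name below are hypothetical and not part of the SDK:

# inside setup_data(self); requires PyYAML
import yaml

with open('adult_metadata.yaml') as f:  # hypothetical metadata file
    meta = yaml.safe_load(f)

# expected (hypothetical) layout:
# columns:
#   - {name: age, type: int}
#   - {name: workclass, type: string, values: [Private, ...]}
self.columns_types_list = meta['columns']
self.column_names_list = [col['name'] for col in meta['columns']]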
After the column definitions the data is loaded into a pandas DataFrame by the helper function preprocess. The line dataset = self.data_source.read(...) actually loads the attached data set. The data_source attribute of the UserModel object is populated with the correct data set at runtime by the DQ0 system.
The rest of the preprocess function prepares the features from the data for later use in the model training.
The lines below # do the train test split in setup_data are important:
# do the train test split
X_train_df, X_test_df, y_train_ts, y_test_ts =\
    train_test_split(dataset_df.iloc[:, :-1],
                     dataset_df.iloc[:, -1],
                     test_size=0.33
                     )
self.input_dim = X_train_df.shape[1]

# set data attributes
self.X_train = X_train_df
self.X_test = X_test_df
self.y_train = y_train_ts
self.y_test = y_test_ts
The attributes X_train, X_test, y_train, and y_test are conventions that are expected by the DQ0 instance. Use these exact names to assign the data splits so that the DQ0 runtime is able to pick them up correctly.
Finally, setup_model defines the Keras model that is fitted during the DQ0 training. All attributes that are used in this function are important at runtime:
- model: holds the actual model reference
- optimizer: string or instance reference of the optimizer to be used
- epochs: the number of training epochs
- batch_size: desired batch size for training
- metrics: metrics to log during or after training. Note that all compatible values are allowed but not all values are returned from DQ0 due to privacy checks.
- loss: reference of the loss function to be used.