MLT Modules

These are the main modules of MLT. The usage and details of each module are explained below.

MLT.datasets

Module for preparing individual datasets.

CICIDS2017

Load the CICIDS2017 dataset from the pickle and filter features

MLT.datasets.CIC._load_cic(columns=None, transformed=False)[source]

Load and return the feature dataset as a tuple.

Parameters:
  • columns (list[int] or list[string], optional) – List of columns to keep from the full dataset
  • transformed (bool, optional) – Whether to use a PowerTransformed version of the dataset
Returns:

A tuple containing train- and test-data and -labels

Return type:

data (tuple)

MLT.datasets.CIC.get_CIC_Top20()[source]

Get the randomized Top 20 class subset identified by mutual_info_classif.

To generate these fields, call cic_feature_selection.py

MLT.datasets.pickleCIC.pickleCIC_randomized()[source]

Pulls a randomized test partition and pickles it to disk.

MLT.datasets.pickleCIC.prepare_dataset()[source]

Base function for dataset loading and preparation.

Returns:data – A tuple consisting of (cic_data (Pandas.DataFrame), cic_labels (Pandas.DataFrame), group_list (List))
Return type:tuple

NSL_KDD

Load the NSL_KDD dataset from the pickle and filter for attributes

MLT.datasets.NSL._load_nsl(column_names)[source]

Loads the dataset and filters for given column names

Parameters:column_names (list(str)) – List of column names that you want in your dataset.
Returns:data – A tuple containing the filtered train- and test-data and -labels
Return type:tuple
MLT.datasets.NSL.get_NSL_16class()[source]

Load the dataset, choose 16 features based on Iglesias & Zseby (2015) and binarize labels

MLT.datasets.NSL.get_NSL_6class()[source]

Load the dataset, choose 6 features and binarize the labels
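Binarizing labels here means collapsing the individual attack classes into a single positive class. A minimal sketch of the idea, assuming that benign traffic carries the label "normal" and is mapped to 0; the helper name binarize_labels is hypothetical and not part of MLT:

```python
def binarize_labels(labels, benign_label="normal"):
    """Map the benign class to 0 and every other (attack) class to 1.

    Hypothetical helper: the 0/1 assignment is an assumption for illustration.
    """
    return [0 if label == benign_label else 1 for label in labels]


# Example attack names are illustrative only.
binary = binarize_labels(["normal", "neptune", "normal", "smurf"])
```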

MLT.implementations

This module contains the specific implementations to benchmark

Autoencoder

AutoEncoder pyod implementation based on Aggarwal, C.C. (2015)

MLT.implementations.Autoencoder.train_model(training_data, training_labels, test_data, test_labels, full_filename, hidden_neurons=None, hidden_activation='relu', output_activation='sigmoid', optimizer='adam', epochs=100, batch_size=32, dropout_rate=0.2, l2_regularizer=0.1, validation_size=0.1, preprocessing=True, verbose=2, random_state=42, contamination=0.1, learning_rate=0.001)[source]

Creates and trains an Autoencoder instance with given params. See https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.auto_encoder

The call is slightly extended compared to the PyOD version: if no hidden_neurons list is provided, a custom one is generated with a bottleneck of half the feature count. Also, if a custom learning rate is provided, an Adam optimizer with that learning rate is used instead of the defaults.

Returns:Named tuple with training results
Return type:PredictionEntry
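The generated default layer list could look roughly like the following. This is a minimal sketch rather than the MLT code: the helper name default_hidden_neurons and the symmetric three-layer shape are assumptions; only the bottleneck of half the feature count is taken from the description above.

```python
def default_hidden_neurons(n_features):
    """Return a symmetric layer list whose middle layer is half the feature count.

    Hypothetical helper: the encoder/decoder shape is an assumption.
    """
    if n_features < 2:
        raise ValueError("need at least two features to form a bottleneck")
    bottleneck = n_features // 2  # integer half of the feature count
    return [n_features, bottleneck, n_features]


layers = default_hidden_neurons(20)
```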

HBOS

HBOS pyod implementation based on Goldstein and Dengel (2012)

MLT.implementations.HBOS.train_model(n_bins, alpha, tol, contamination, training_data, training_labels, test_data, test_labels, full_filename)[source]

Creates and trains an HBOS instance with given params

Parameters:
  • n_bins (int, optional (default=10)) – The number of bins
  • alpha (float in (0, 1), optional (default=0.1)) – The regularizer for preventing overflow
  • tol (float in (0, 1), optional (default=0.1)) – The parameter that controls the flexibility when dealing with samples that fall outside the bins.
  • training_data (numpy.ndarray or Pandas.DataFrame) – Data to train on
  • training_labels (list) – List of labels corresponding to the training data - can be left empty for unsupervised learning
  • test_data (numpy.ndarray or Pandas.DataFrame) – Data to test on
  • test_labels (list) – List of labels corresponding to the test data
Returns:

Named tuple with training results

Return type:

PredictionEntry

IsolationForest

iForest implementation by pyod based on scikit-learn

MLT.implementations.IsolationForest.train_model(training_data, training_labels, test_data, test_labels, full_filename, n_estimators=100, contamination=0.1, max_features=1.0, bootstrap=False)[source]

Creates and trains an Isolation Forest instance with given params

Parameters:
  • n_estimators (int, optional (default=100)) – The number of base estimators in the ensemble.
  • contamination (float in (0., 0.5), optional (default=0.1)) – The amount of contamination of the data set
  • max_features (int or float, optional (default=1.0)) – The number of features to draw from X to train each base estimator.
  • bootstrap (boolean, optional (default=False)) – If True, individual trees are fit on random subsets of the training data sampled with replacement. If False, sampling without replacement is performed.
Returns:

Named tuple with training results

Return type:

PredictionEntry

LSTM_2_Multiclass

Keras-based custom LSTM that classifies into 2 categories

MLT.implementations.LSTM_2_Multiclass.train_model(batch_size, epochs, learning_rate, training_data, training_labels, test_data, test_labels, logdir, model_savename)[source]

Creates and trains an instance with given params.

Parameters:
  • batch_size (int) – Batch size for use in training
  • epochs (int) – How many epochs does the training take
  • learning_rate (float) – Learning rate used for training
  • training_data (numpy.ndarray) – Data to train on
  • training_labels (list) – List of labels corresponding to the training data
  • test_data (numpy.ndarray) – Data to test on
  • test_labels (list) – List of labels corresponding to the test data
  • logdir (string) – In this path all Tensorboard logs will be stored
  • model_savename (string) – This filename will be used for persisting the trained model
Returns:

Named tuple with training results

Return type:

PredictionEntry

RandomForest

Basic scikit implementation of a Random Forest Classifier

MLT.implementations.RandomForest.train_model(n_estimators, max_depth, training_data, training_labels, test_data, test_labels, model_savename)[source]

Creates and trains a scikit-learn Random Forest Classifier instance with given params.

Parameters:
  • n_estimators (int) – Number of estimators to use
  • max_depth (int) – Maximum tree depth for individual trees
  • training_data (numpy.ndarray) – Data to train on
  • training_labels (list) – List of labels corresponding to the training data
  • test_data (numpy.ndarray) – Data to test on
  • test_labels (list) – List of labels corresponding to the test data
  • model_savename (string) – This filename will be used for persisting the trained model
Returns:

Named tuple with training results

Return type:

PredictionEntry

XGBoost

XGBoost scikit implementation based on https://xgboost.readthedocs.io/en/latest/

MLT.implementations.XGBoost.train_model(n_estimators, max_depth, learning_rate, training_data, training_labels, test_data, test_labels, full_filename)[source]

Creates and trains an XGBoost sklearn instance with given params

Parameters:
  • n_estimators (int) – Number of estimators to use
  • max_depth (int) – Maximum tree depth for base learners
  • learning_rate (float) – Boosting learning rate (XGB’s “eta”)
  • training_data (numpy.ndarray) – Data to train on
  • training_labels (list) – List of labels corresponding to the training data
  • test_data (numpy.ndarray) – Data to test on
  • test_labels (list) – List of labels corresponding to the test data
  • full_filename (string) – This filename will be used for persisting the trained model
Returns:

Named tuple with training results

Return type:

PredictionEntry

MLT.metrics

Generates advanced metrics for results and datasets.

Base Metrics

Utility module for various basic metric functions.

These functions all take stats_data and transform it into a list of target metrics for all folds.

MLT.metrics.metrics_base.calc_acc(prediction_data)[source]

Calculate basic accuracy for a given list of prediction entries

MLT.metrics.metrics_base.calc_fbeta_binary(prediction_data, beta)[source]

Calculate the Fβ score for a given list of prediction entries

MLT.metrics.metrics_base.calc_mean_training_time(stats_data)[source]

Calculate the mean training time over all folds

MLT.metrics.metrics_base.calc_precision(prediction_data)[source]

Calculate the precision for a given list of predictions.

MLT.metrics.metrics_base.calc_recall(prediction_data)[source]

Calculate the recall for a given list of predictions.
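For reference, precision, recall and the Fβ score of a binary prediction can be computed from the true and predicted labels as sketched below. This is a plain-Python illustration under the assumption that 1 is the positive class; the function names are hypothetical, and how MLT computes these values internally is not stated here.

```python
def confusion_counts(true_labels, predicted_labels):
    """Count true positives, false positives, false negatives and true negatives."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if pred == 1 and truth == 1:
            tp += 1
        elif pred == 1 and truth == 0:
            fp += 1
        elif pred == 0 and truth == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall_fbeta(true_labels, predicted_labels, beta=1.0):
    """Return (precision, recall, F-beta); undefined ratios are reported as 0.0."""
    tp, fp, fn, _ = confusion_counts(true_labels, predicted_labels)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    beta_sq = beta ** 2
    denominator = beta_sq * precision + recall
    fbeta = (1 + beta_sq) * precision * recall / denominator if denominator else 0.0
    return precision, recall, fbeta


precision, recall, f1 = precision_recall_fbeta([1, 0, 1, 1, 0], [1, 1, 1, 0, 0])
```

A beta above 1 weights recall more heavily than precision, a beta below 1 does the opposite.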

MLT.metrics.metrics_base.sum_training_times(stats_data)[source]

Sum all training times of all folds
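The two training-time helpers reduce the per-fold timings to a single number. A minimal sketch, assuming each fold result exposes a training_time field; FoldResult is a hypothetical stand-in for the real result entries and the timings are made up:

```python
from collections import namedtuple
from statistics import mean

# Hypothetical stand-in for a per-fold result entry.
FoldResult = namedtuple("FoldResult", ["training_time"])

folds = [FoldResult(12.0), FoldResult(10.5), FoldResult(13.5)]

total_time = sum(fold.training_time for fold in folds)   # sum over all folds
mean_time = mean(fold.training_time for fold in folds)   # average per fold
```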

Metrics related to Confusion Matrices

Utility module for generating various confusion matrix flavours.

MLT.metrics.metrics_cm.calc_cm(prediction_data)[source]

Calculate the Confusion Matrices for given prediction entries

MLT.metrics.metrics_cm.generate_all_cm_to_disk(cm_array, modelname, filepath)[source]

Generate normalized and absolute matrices as images at the given filepath

MLT.metrics.metrics_cm.generate_confusion_matrix_to_disk(cmatrix, classes, modelname, filepath, normalize=False)[source]

Plot and save the confusion matrix as a picture.

Classes are fixed and given, as well as the save path and the modelname. The latter also gets incorporated in the plot title. Normalization can be applied by setting normalize=True.

MLT.metrics.metrics_cm.normalize_cm(cmatrix)[source]

Translate absolute values of a CM to relative distributions
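A common way to do this is to divide every row by its sum, so that each row shows how the samples of one true class are distributed over the predicted classes. The sketch below assumes this row-wise convention; whether MLT normalizes by row is an assumption, and normalize_rows is a hypothetical name.

```python
def normalize_rows(cmatrix):
    """Divide each row of a confusion matrix by its row sum (empty rows stay 0.0)."""
    normalized = []
    for row in cmatrix:
        total = sum(row)
        normalized.append([value / total if total else 0.0 for value in row])
    return normalized


relative = normalize_rows([[8, 2], [1, 9]])
```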

MLT.metrics.metrics_cm.save_cm_arr_to_disk(cm_array, modelname, result_path)[source]

Save the confusion array for a given model to disk as a json

Feature Distribution Metrics

Generate distribution-related graphs as PNGs to disk.

MLT.metrics.metrics_distrib.generate_boxplot_to_disk(data_pdframe, title, resultpath)[source]

Generate a boxplot for given dataframe to the path

MLT.metrics.metrics_distrib.generate_feature_distribution_to_disk(datas, title, resultpath, column_names=None)[source]

Generate distribution graphs for a given dataset into the resultfolder

MLT.metrics.metrics_distrib.generate_hist_to_disk(data_pdframe, title, resultpath)[source]

Generate a histogram for given dataframe to the path

ROC and AUC Metrics

Utility functions for generating ROC and AUC statistics

MLT.metrics.metrics_roc.append_roc_model_selection(result_json, modelname, line_format)[source]

Appends the CV-mean ROC to an existing plot.

MLT.metrics.metrics_roc.calc_auc(prediction_data)[source]

Calculate the area under the curve (AUC) on the given DF
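The area under a ROC curve can be approximated with the trapezoidal rule over the (false positive rate, true positive rate) points. This is a generic sketch of that rule, not necessarily how calc_auc is implemented; trapezoid_auc is a hypothetical name and the points are illustrative.

```python
def trapezoid_auc(fpr, tpr):
    """Integrate tpr over fpr with the trapezoidal rule.

    Both lists must have the same length and fpr must be sorted ascending.
    """
    if len(fpr) != len(tpr):
        raise ValueError("fpr and tpr must have the same length")
    area = 0.0
    for i in range(1, len(fpr)):
        width = fpr[i] - fpr[i - 1]
        area += width * (tpr[i] + tpr[i - 1]) / 2.0
    return area


auc = trapezoid_auc([0.0, 0.5, 1.0], [0.0, 1.0, 1.0])
```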

MLT.metrics.metrics_roc.generate_avg_roc_to_disk(prediction_data, modelname, filepath)[source]

Generates an average of all given ROCs and plots all ROCs and Avg to a single figure

MLT.metrics.metrics_roc.generate_cv_roc_model_selection(modelname, result_path, parameter_name, model_id_list=None, format_list=None)[source]

Generates an average of all CV-results in a given folder.

Point this function to a folder that contains multiple result-subfolders with crossvalidated results. It will generate the average ROC for every result and add them all to a single figure.

Parameters:
  • modelname (string) – The name of the model to draw. Will be used to determine filename and title of the plot.
  • result_path (string or list) – Path to the result base folder that contains multiple test runs. Can be a list of single runs. All runs will be combined into a single figure.
  • parameter_name (string) – Parameter under test - will be in the title and appended to the filename.
  • model_id_list (list) – A list of Strings. This is used for the legend.
  • format_list (list) – A list of pyplot format Strings to be used for the single plots.

MLT.testrunners

These runners are responsible for the benchmark execution and additional features like crossvalidation.

Benchmark

This runner implements the main benchmark for qualitative analysis based on the full training and test sets.

MLT.testrunners.single_benchmark.run_benchmark(train_data, train_labels, test_data, test_labels, result_path, model_savepath, args)[source]

Run the full benchmark.

As this is the full benchmark, it needs a train and a test partition. Besides that, it is mostly similar to the kfold_runner.

Parameters:
  • train_data (numpy.ndarray) – Training partition
  • train_labels (numpy.ndarray) – Corresponding labels for supervised learning
  • test_data (numpy.ndarray) – Test partition
  • test_labels (numpy.ndarray) – Corresponding labels for the test partition
  • result_path (str) – Where to save the results
  • model_savepath (str) – Where to store the trained models
  • args (argparse.Namespace) – Parsed CMD arguments that contain all the switches and settings
Returns:

The path where to find the final results

Return type:

result_path (str)

K-Fold Crossvalidation

This runner implements the benchmark with a configurable number of k-Folds for crossvalidation

MLT.testrunners.kfold_runner.run_benchmark(candidate_data, candidate_labels, result_path, model_savepath, args)[source]

Run the k-fold benchmark itself.

Note the absence of train- and test-partitions. As this is a crossvalidation run, the test partition is not to be touched!

Parameters:
  • candidate_data – Training data with 6 features
  • candidate_labels – Corresponding labels for supervised learning
  • result_path – Where to save the results
  • args – Parsed CMD arguments that contain all the switches and settings
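In a k-fold run, the candidate data is split into k parts, and each part serves once as the held-out fold while the rest is used for training. The following sketch computes contiguous start and end indices for every fold. It only illustrates the splitting idea: fold_bounds is a hypothetical helper, and the runner itself may shuffle or stratify the data differently.

```python
def fold_bounds(n_samples, k):
    """Return k (start, end) index pairs, end exclusive, covering range(n_samples)."""
    if not 1 <= k <= n_samples:
        raise ValueError("k must be between 1 and the number of samples")
    base, extra = divmod(n_samples, k)
    bounds = []
    start = 0
    for fold in range(k):
        # The first `extra` folds take one additional sample each.
        end = start + base + (1 if fold < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds


bounds = fold_bounds(10, 3)
```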

MLT.tools

A collection of misc tools that support the main modules.

PredictionEntry

This is a global definition of a named tuple that stores training results.

class MLT.tools.prediction_entry.PredictionEntry(test_labels, predicted_labels, predicted_probabilities, training_time)

A single prediction entry that holds all information of a test run.

Parameters:
  • test_labels – The unmodified, original labels of the test set
  • predicted_labels – These are the binary classes that have been predicted (i.e.: 0 or 1)
  • predicted_probabilities – A list of probabilities. Each entry represents a value between 0 and 1
  • training_time – The time it took for the training to finish
predicted_labels

Alias for field number 1

predicted_probabilities

Alias for field number 2

test_labels

Alias for field number 0

training_time

Alias for field number 3
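The field layout above corresponds to a standard Python named tuple. The sketch below rebuilds an equivalent type with collections.namedtuple to show how the fields can be accessed by name or by index; it mirrors the documented field order but is not the MLT source, and the values are made up.

```python
from collections import namedtuple

# Same field order as documented: indices 0 to 3.
PredictionEntry = namedtuple(
    "PredictionEntry",
    ["test_labels", "predicted_labels", "predicted_probabilities", "training_time"],
)

entry = PredictionEntry(
    test_labels=[0, 1, 1],
    predicted_labels=[0, 1, 0],
    predicted_probabilities=[0.1, 0.9, 0.4],
    training_time=1.5,
)

# Access by name and by position refer to the same data.
same_field = entry.predicted_labels is entry[1]
```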

Dataset Tools

Miscellaneous dataset tools and helper functions

MLT.tools.dataset_tools.abs_scaler(train_data, test_data)[source]

Scale given data with a MaxAbsScaler trained on the train data.

Parameters:
  • train_data (Pandas.DataFrame or Numpy.ndarray) – Training data to scale
  • test_data (Pandas.DataFrame or Numpy.ndarray) – Test data to scale
Returns:

The transformed data sets

Return type:

train_data, test_data (Numpy.ndarray)

MLT.tools.dataset_tools.load_df(filename, folderpath)[source]

Helper function to load Dataframes from a given folder

MLT.tools.dataset_tools.min_max_scale(train_data, test_data)[source]

Scale given data with a MinMaxScaler trained on the train data.

Parameters:
  • train_data (Pandas.DataFrame or Numpy.ndarray) – Training data to scale
  • test_data (Pandas.DataFrame or Numpy.ndarray) – Test data to scale
Returns:

The transformed data sets

Return type:

train_data, test_data (Numpy.ndarray)
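Fitting the scaler on the training data only means that the minimum and maximum of each feature come from the training partition and are then reused for the test partition. A plain-Python sketch of that principle follows; the function names are hypothetical, and the actual function uses a MinMaxScaler instead.

```python
def min_max_fit(train_rows):
    """Return the (min, max) of every column of the training rows."""
    return [(min(column), max(column)) for column in zip(*train_rows)]


def min_max_apply(rows, bounds):
    """Scale rows with previously fitted bounds; constant columns map to 0.0."""
    return [
        [(value - low) / (high - low) if high != low else 0.0
         for value, (low, high) in zip(row, bounds)]
        for row in rows
    ]


train = [[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]]
test = [[2.5, 40.0]]

bounds = min_max_fit(train)          # statistics come from the training data only
train_scaled = min_max_apply(train, bounds)
test_scaled = min_max_apply(test, bounds)
```

Because the test data is scaled with the training statistics, test values outside the training range end up outside [0, 1], as the second test feature does here.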

MLT.tools.dataset_tools.powertransform_yeoJohnson(train_data, test_data=None)[source]

Transforms given datasets with a Yeo Johnson Powertransform.

This transformer will train on the training set and then scale both sets, training and test. See I.K. Yeo and R.A. Johnson, “A new family of power transformations to improve normality or symmetry.” Biometrika, 87(4), pp.954-959, (2000).

Parameters:
  • train_data (Pandas.DataFrame or Numpy.ndarray) – Training data to transform
  • test_data (Pandas.DataFrame or Numpy.ndarray) – Test data to transform
Returns:

The transformed data sets

Return type:

train_data, test_data (Numpy.ndarray)

MLT.tools.dataset_tools.standard_scale(train_data, test_data)[source]

Scale given data with a StandardScaler trained on the train data.

Parameters:
  • train_data (Pandas.DataFrame or Numpy.ndarray) – Training data to scale
  • test_data (Pandas.DataFrame or Numpy.ndarray) – Test data to scale
Returns:

The transformed data sets

Return type:

train_data, test_data (Numpy.ndarray)

Keras Helper

Utility functions for Keras-related implementations

MLT.tools.helper_keras.keras_load_model(full_path)[source]

Load a single model from given path

MLT.tools.helper_keras.keras_load_modellist(model_filenames, model_path)[source]

Load a list of models from a path

MLT.tools.helper_keras.keras_persist_model(model, model_savename)[source]

Save the full model to disk.

MLT.tools.helper_keras.keras_train_model(model, epochs, batch_size, training_data, training_labels, test_data, test_labels, logdir, model_savename)[source]

Train the given model with data and predict the run.

MLT.tools.helper_keras.keras_train_model_adaptive(model, epochs, batch_size, training_data, training_labels, test_data, test_labels, logdir, model_savename)[source]

Train the given model with data and predict the run.

This training reduces the learning rate on a fixed base every 30 epochs to 10% of the original value.
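One possible reading of this schedule is a step decay in which the learning rate is multiplied by 0.1 after every block of 30 epochs. The sketch below implements that interpretation; the function name and the exact decay rule are assumptions and may differ from what the helper actually does.

```python
def step_decay(initial_lr, epoch, drop_factor=0.1, epochs_per_drop=30):
    """Return the learning rate for a zero-based epoch under a step decay schedule."""
    completed_drops = epoch // epochs_per_drop
    return initial_lr * drop_factor ** completed_drops


# Learning rates at the start, just before and just after the first drop.
schedule = [step_decay(0.001, epoch) for epoch in (0, 29, 30, 60)]
```

A function of this shape could be passed to a learning rate scheduler callback during training.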

MLT.tools.helper_keras.predict_keras(single_model, test_data, test_labels)[source]

Only predict a model without training it.

Pyod Helper

Utility functions for pyod-related implementations

MLT.tools.helper_pyod.predict_pyod(single_model, test_data, test_labels)[source]

Only predict a model without fitting it

MLT.tools.helper_pyod.pyod_load_model(dirpath, modelname)[source]

Load a scikit model from disk

MLT.tools.helper_pyod.pyod_load_modellist(model_filenames, model_path)[source]

Load a list of scikit models from disk from given path

MLT.tools.helper_pyod.pyod_persist_model(model, model_savename)[source]

Save a scikit model to disk

MLT.tools.helper_pyod.pyod_train_model(model, training_data, training_labels, test_data, test_labels, model_savename)[source]

Train the given model with data and predict the run

Scikit Helper

Utility functions for scikit-learn-related implementations

MLT.tools.helper_sklearn.predict_scikit(single_model, test_data, test_labels)[source]

Only predict a model without fitting it

MLT.tools.helper_sklearn.sklearn_load_model(dirpath, modelname)[source]

Load a scikit model from disk

MLT.tools.helper_sklearn.sklearn_load_modellist(model_filenames, model_path)[source]

Load a list of scikit models from disk from given path

MLT.tools.helper_sklearn.sklearn_persist_model(model, model_savename)[source]

Save a scikit model to disk

MLT.tools.helper_sklearn.sklearn_train_model(model, training_data, training_labels, test_data, test_labels, model_savename)[source]

Train the given model with data and predict the run

Email Tools

Load results, compile them into a mail and send it.

The details (where to send the mail, the sender address, server credentials) can be found in result_mail_credentials.py.dist - to set this up, copy the file, remove the .dist and fill it with real info.

MLT.tools.result_mail.compose_and_send(message_content)[source]

Take the content and send it to a defined recipient.

MLT.tools.result_mail.prepare_and_send_results(resultpath, args)[source]

Conditionally load results, then send them via mail

Result Helper

Additional tools to simplify and speed up the result evaluation

MLT.tools.result_helper.gen_ltx(modelname, top_resultpath)[source]

Generate a LaTeX table from metrics.json in every subfolder with the call_params, if existing.

Parameters:
  • modelname (str) – Name of the model to evaluate. Used to derive filenames.
  • top_resultpath (str) – Path to the parent folder with all subresults
MLT.tools.result_helper.list_scores(modelname, top_resultpath)[source]

Lists the metrics.json in every subfolder with the call_params, if existing.

Parameters:
  • modelname (str) – Name of the model to evaluate. Used to derive filenames.
  • top_resultpath (str) – Path to the parent folder with all subresults
MLT.tools.result_helper.list_single_score(modelname, resultpath)[source]

List score of a single result in given folder.

Parameters:
  • modelname (str) – Name of the model to evaluate. Used to derive filenames.
  • resultpath (str) – Path to the folder with a test run result

Uncategorized Tools

Collection of misc tools that don’t fit in a standalone module

MLT.tools.toolbelt.create_dir(dirpath)[source]

Create the specified path if it does not exist.

MLT.tools.toolbelt.list_files(dirpath, fname_start)[source]

List all files in a folder that start with the given string.
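Listing files by prefix can be done with the standard library alone. The sketch below shows one way to do it; list_files_with_prefix is a hypothetical name, the file names are made up, and sorting the result is an assumption added for deterministic output.

```python
import os
import tempfile


def list_files_with_prefix(dirpath, fname_start):
    """Return the sorted names of regular files in dirpath that start with fname_start."""
    return sorted(
        name
        for name in os.listdir(dirpath)
        if name.startswith(fname_start)
        and os.path.isfile(os.path.join(dirpath, name))
    )


# Demonstrate on a temporary folder with illustrative file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("XGBoost_metrics.json", "XGBoost_cm.json", "HBOS_metrics.json"):
        with open(os.path.join(tmp, name), "w") as handle:
            handle.write("{}")
    matches = list_files_with_prefix(tmp, "XGBoost")
```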

MLT.tools.toolbelt.list_folders(dirpath)[source]

List all subfolders in a given path

MLT.tools.toolbelt.load_fold_indices(path)[source]

Load the start and end indices of the test set for every fold.

MLT.tools.toolbelt.load_result(path, modelname)[source]

Load the metrics for a given model in the given path.

MLT.tools.toolbelt.load_results_from_disk(path, modelname)[source]

Load the full result json for the given model from the path.

MLT.tools.toolbelt.prepare_folders(runner_name)[source]

Creates all the folders needed for a test run

Parameters:runner_name (string) – Name of the calling runner. Will be the base name for results
Returns:result_path – The full path where results can be stored
Return type:string
MLT.tools.toolbelt.read_from_json(full_path_with_name)[source]

Read from an arbitrary JSON and return the structure

MLT.tools.toolbelt.read_from_pickle(full_path_with_name)[source]

Read from pickle at given location

MLT.tools.toolbelt.save_metrics_to_disk(metrics_array, modelname, result_path)[source]

Save a given metric array as json to disk

MLT.tools.toolbelt.save_np_to_disk(stats_dataframe, filename, result_path)[source]

Save a given dataframe as binary numpy pickle to disk

MLT.tools.toolbelt.save_results_to_disk(stats_data, filename, result_path)[source]

Save the full results for a given model as json to disk

MLT.tools.toolbelt.write_call_params(args, result_path)[source]

Write the parameters with which MLT has been called to a txt file in the result path

MLT.tools.toolbelt.write_to_json(full_path_with_name, data)[source]

JSON dump the given file to disk at the given path

MLT.tools.toolbelt.write_to_pickle(full_path_with_name, data)[source]

Pickle the given file to disk at the given path
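The JSON and pickle helpers in this module are thin wrappers around serialization to a full file path. A round trip with the standard library could look like the following sketch; the function names and file names here are hypothetical and chosen to avoid implying the exact MLT implementation.

```python
import json
import os
import pickle
import tempfile


def dump_json(full_path_with_name, data):
    """Write data as JSON text to the given path."""
    with open(full_path_with_name, "w") as handle:
        json.dump(data, handle)


def load_json(full_path_with_name):
    """Read a JSON file and return the parsed structure."""
    with open(full_path_with_name) as handle:
        return json.load(handle)


def dump_pickle(full_path_with_name, data):
    """Pickle data in binary mode to the given path."""
    with open(full_path_with_name, "wb") as handle:
        pickle.dump(data, handle)


def load_pickle(full_path_with_name):
    """Load a pickled object from the given path."""
    with open(full_path_with_name, "rb") as handle:
        return pickle.load(handle)


metrics = {"accuracy": 0.95, "folds": [1, 2, 3]}  # made-up example data

with tempfile.TemporaryDirectory() as tmp:
    json_path = os.path.join(tmp, "metrics.json")
    pickle_path = os.path.join(tmp, "metrics.pkl")
    dump_json(json_path, metrics)
    dump_pickle(pickle_path, metrics)
    from_json = load_json(json_path)
    from_pickle = load_pickle(pickle_path)
```

Only unpickle files from trusted sources, since loading a pickle can execute arbitrary code.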