MLT Modules¶
These are the main modules of MLT. The usage and details of each module are explained below.
MLT.datasets¶
Module for preparing individual datasets.
CICIDS2017¶
Load the CICIDS2017 dataset from the pickle and filter features
-
MLT.datasets.CIC.
_load_cic
(columns=None, transformed=False)[source]¶ Load and return the feature dataset as a tuple.
Parameters: - columns (list[int] or list[string], optional) – List of columns to keep from the full dataset
- transformed (bool, optional) – Whether to use a PowerTransformed version of the dataset
Returns: A tuple containing train- and test-data and -labels
Return type: data (tuple)
-
MLT.datasets.CIC.
get_CIC_Top20
()[source]¶ Get the randomized Top 20 class subset identified by mutual_info_classif.
To generate these fields, call cic_feature_selection.py
NSL_KDD¶
Load the NSL_KDD dataset from the pickle and filter for attributes
-
MLT.datasets.NSL.
_load_nsl
(column_names)[source]¶ Loads the dataset and filters for given column names
Parameters: column_names (list(str)) – List of column names that you want in your dataset.
Returns: data – A tuple containing the filtered train- and test-data and -labels
Return type: tuple
MLT.implementations¶
This module contains the specific implementations to benchmark
Autoencoder¶
AutoEncoder pyod implementation based on Aggarwal, C.C. (2015)
-
MLT.implementations.Autoencoder.
train_model
(training_data, training_labels, test_data, test_labels, full_filename, hidden_neurons=None, hidden_activation='relu', output_activation='sigmoid', optimizer='adam', epochs=100, batch_size=32, dropout_rate=0.2, l2_regularizer=0.1, validation_size=0.1, preprocessing=True, verbose=2, random_state=42, contamination=0.1, learning_rate=0.001)[source]¶ Creates and trains an Autoencoder instance with the given params. See https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.auto_encoder
The call is slightly extended compared to the PyOD version: if no hidden_neurons list is provided, a custom one is generated with a bottleneck of half the feature count. Also, if a custom learning rate is provided, an Adam optimizer with that learning rate is used instead of the defaults.
Returns: Named tuple with training results
Return type: PredictionEntry
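The default layer sizes mentioned above could be derived roughly as in the following sketch. Only the bottleneck size (half the feature count) comes from the description; the helper name default_hidden_neurons and the symmetric three-layer layout are assumptions made for illustration and may differ from what MLT actually generates.

```python
def default_hidden_neurons(n_features):
    """Hypothetical helper: layer sizes with a bottleneck of half the feature count.

    The symmetric [input, bottleneck, input] layout is an assumption,
    not taken from the MLT source.
    """
    bottleneck = max(1, n_features // 2)  # never shrink below one neuron
    return [n_features, bottleneck, n_features]


# Example: a dataset with 20 features gets a bottleneck of 10 neurons.
layers = default_hidden_neurons(20)
```

A list like this would then be handed over as hidden_neurons when no explicit list is given.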
HBOS¶
HBOS pyod implementation based on Goldstein and Dengel (2012)
-
MLT.implementations.HBOS.
train_model
(n_bins, alpha, tol, contamination, training_data, training_labels, test_data, test_labels, full_filename)[source]¶ Creates and trains an HBOS instance with the given params
Parameters: - n_bins (int, optional (default=10)) – The number of bins
- alpha (float in (0, 1), optional (default=0.1)) – The regularizer for preventing overflow
- tol (float in (0, 1), optional (default=0.1)) – The parameter that decides the flexibility when dealing with samples that fall outside the bins.
- training_data (numpy.ndarray or Pandas.DataFrame) – Data to train on
- training_labels (list) – List of labels corresponding to the training data - can be left empty for unsupervised learning
- test_data (numpy.ndarray or Pandas.DataFrame) – Data to test on
- test_labels (list) – List of labels corresponding to the test data
Returns: Named tuple with training results
Return type:
IsolationForest¶
iForest implementation by pyod based on scikit-learn
-
MLT.implementations.IsolationForest.
train_model
(training_data, training_labels, test_data, test_labels, full_filename, n_estimators=100, contamination=0.1, max_features=1.0, bootstrap=False)[source]¶ Creates and trains an Isolation Forest instance with the given params
Parameters: - n_estimators (int, optional (default=100)) – The number of base estimators in the ensemble.
- contamination (float in (0., 0.5), optional (default=0.1)) – The amount of contamination of the data set
- max_features (int or float, optional (default=1.0)) – The number of features to draw from X to train each base estimator.
- bootstrap (boolean, optional (default=False)) – If True, individual trees are fit on random subsets of the training data sampled with replacement. If False, sampling without replacement is performed.
Returns: Named tuple with training results
Return type:
LSTM_2_Multiclass¶
Keras-based custom LSTM that classifies into 2 categories
-
MLT.implementations.LSTM_2_Multiclass.
train_model
(batch_size, epochs, learning_rate, training_data, training_labels, test_data, test_labels, logdir, model_savename)[source]¶ Creates and trains an instance with given params.
Parameters: - batch_size (int) – Batch size for use in training
- epochs (int) – How many epochs does the training take
- learning_rate (float) – Learning rate used for training
- training_data (numpy.ndarray) – Data to train on
- training_labels (list) – List of labels corresponding to the training data
- test_data (numpy.ndarray) – Data to test on
- test_labels (list) – List of labels corresponding to the test data
- logdir (string) – In this path all Tensorboard logs will be stored
- model_savename (string) – This filename will be used for persisting the trained model
Returns: Named tuple with training results
Return type:
RandomForest¶
Basic scikit implementation of a Random Forest Classifier
-
MLT.implementations.RandomForest.
train_model
(n_estimators, max_depth, training_data, training_labels, test_data, test_labels, model_savename)[source]¶ Creates and trains a Random Forest classifier instance with the given params.
Parameters: - n_estimators (int) – Number of estimators to use
- max_depth (int) – Maximum tree depth for individual trees
- training_data (numpy.ndarray) – Data to train on
- training_labels (list) – List of labels corresponding to the training data
- test_data (numpy.ndarray) – Data to test on
- test_labels (list) – List of labels corresponding to the test data
- model_savename (string) – This filename will be used for persisting the trained model
Returns: Named tuple with training results
Return type:
XGBoost¶
XGBoost scikit implementation based on https://xgboost.readthedocs.io/en/latest/
-
MLT.implementations.XGBoost.
train_model
(n_estimators, max_depth, learning_rate, training_data, training_labels, test_data, test_labels, full_filename)[source]¶ Creates and trains an XGBoost sklearn instance with the given params
Parameters: - n_estimators (int) – Number of estimators to use
- max_depth (int) – Maximum tree depth for base learners
- learning_rate (float) – Boosting learning rate (XGB’s “eta”)
- training_data (numpy.ndarray) – Data to train on
- training_labels (list) – List of labels corresponding to the training data
- test_data (numpy.ndarray) – Data to test on
- test_labels (list) – List of labels corresponding to the test data
- full_filename (string) – This filename will be used for persisting the trained model
Returns: Named tuple with training results
Return type:
MLT.metrics¶
Generates advanced metrics for results and datasets.
Base Metrics¶
Utility module for various basic metric functions.
These functions all take stats_data and transform it into a list of target metrics for all folds.
-
MLT.metrics.metrics_base.
calc_acc
(prediction_data)[source]¶ Calculate basic accuracy for a given list of prediction entries
-
MLT.metrics.metrics_base.
calc_fbeta_binary
(prediction_data, beta)[source]¶ Calculate the Fβ score for a given list of prediction entries
-
MLT.metrics.metrics_base.
calc_mean_training_time
(stats_data)[source]¶ Calculate the mean training time over all folds
-
MLT.metrics.metrics_base.
calc_precision
(prediction_data)[source]¶ Calculate the precision for a given list of predictions.
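As a rough illustration of what "one metric value per fold" means for the functions in this module, the sketch below computes accuracy, precision, Fβ and the mean training time in plain Python. The function names, the stand-in PredictionEntry tuple and the binary 0/1 label encoding are assumptions for this example; the MLT functions themselves may rely on library routines and handle edge cases differently.

```python
from collections import namedtuple

# Stand-in for MLT.tools.prediction_entry.PredictionEntry (same field names).
PredictionEntry = namedtuple(
    "PredictionEntry",
    ["test_labels", "predicted_labels", "predicted_probabilities", "training_time"],
)


def _counts(entry):
    """Return (tp, fp, fn, tn) for one fold, assuming binary 0/1 labels."""
    pairs = list(zip(entry.test_labels, entry.predicted_labels))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy_per_fold(entries):
    """Fraction of correctly predicted labels, one value per fold."""
    result = []
    for entry in entries:
        tp, _, _, tn = _counts(entry)
        result.append((tp + tn) / len(entry.test_labels))
    return result


def precision_per_fold(entries):
    """tp / (tp + fp) per fold; 0.0 if nothing was predicted positive."""
    result = []
    for entry in entries:
        tp, fp, _, _ = _counts(entry)
        result.append(tp / (tp + fp) if (tp + fp) else 0.0)
    return result


def fbeta_per_fold(entries, beta):
    """F-beta = (1 + b^2) * tp / ((1 + b^2) * tp + b^2 * fn + fp) per fold."""
    b2 = beta * beta
    result = []
    for entry in entries:
        tp, fp, fn, _ = _counts(entry)
        denominator = (1 + b2) * tp + b2 * fn + fp
        result.append((1 + b2) * tp / denominator if denominator else 0.0)
    return result


def mean_training_time(entries):
    """Average of the recorded training times over all folds."""
    return sum(entry.training_time for entry in entries) / len(entries)
```

Each function takes the list of per-fold entries and returns either one value per fold or, for the training time, a single mean.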
Metrics related to Confusion Matrices¶
Utility module for generating various confusion matrix flavours.
-
MLT.metrics.metrics_cm.
calc_cm
(prediction_data)[source]¶ Calculate the Confusion Matrices for given prediction entries
-
MLT.metrics.metrics_cm.
generate_all_cm_to_disk
(cm_array, modelname, filepath)[source]¶ Generate normalized and absolute matrices as images at the given filepath
-
MLT.metrics.metrics_cm.
generate_confusion_matrix_to_disk
(cmatrix, classes, modelname, filepath, normalize=False)[source]¶ Plot and save the confusion matrix as a picture.
Classes are fixed and given, as well as the save path and the modelname. The latter also gets incorporated in the plot title. Normalization can be applied by setting normalize=True.
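The difference between the absolute and the normalized matrix can be sketched without any plotting code. The helper names below are hypothetical, labels are assumed to be integer class indices, and row-wise normalization (per true class) is an assumption about what normalize=True does.

```python
def confusion_matrix(true_labels, predicted_labels, n_classes=2):
    """Count matrix with rows = true class and columns = predicted class."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for true, predicted in zip(true_labels, predicted_labels):
        matrix[true][predicted] += 1
    return matrix


def normalize_rows(matrix):
    """Divide every row by its sum so that each true class sums to 1."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([count / total if total else 0.0 for count in row])
    return normalized


absolute = confusion_matrix([0, 0, 0, 1], [0, 0, 1, 1])
relative = normalize_rows(absolute)
```

Here absolute holds raw counts per (true, predicted) pair, while relative expresses the same counts as fractions of each true class.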
Feature Distribution Metrics¶
Generate distribution-related graphs as PNGs to disk.
-
MLT.metrics.metrics_distrib.
generate_boxplot_to_disk
(data_pdframe, title, resultpath)[source]¶ Generate a boxplot for the given dataframe and save it to the given path
ROC and AUC Metrics¶
Utility functions for generating ROC and AUC statistics
-
MLT.metrics.metrics_roc.
append_roc_model_selection
(result_json, modelname, line_format)[source]¶ Appends the CV-mean ROC to an existing plot.
-
MLT.metrics.metrics_roc.
calc_auc
(prediction_data)[source]¶ Calculate the area under the curve for the given prediction data
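For readers unfamiliar with the metric, the area under the ROC curve can be computed directly from labels and predicted probabilities. The sketch below uses the rank-based (Mann-Whitney) formulation in plain Python; it is an illustration under the assumption of binary 0/1 labels, not the routine that calc_auc itself calls.

```python
def auc_from_scores(labels, scores):
    """AUC as the probability that a positive sample outranks a negative one.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


example_auc = auc_from_scores([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

In this example three of the four positive/negative pairs are ranked correctly, so example_auc is 0.75.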
-
MLT.metrics.metrics_roc.
generate_avg_roc_to_disk
(prediction_data, modelname, filepath)[source]¶ Generates an average of all given ROCs and plots all ROCs and Avg to a single figure
-
MLT.metrics.metrics_roc.
generate_cv_roc_model_selection
(modelname, result_path, parameter_name, model_id_list=None, format_list=None)[source]¶ Generates an average of all CV-results in a given folder.
Point this function to a folder that contains multiple result-subfolders with crossvalidated results. It will generate the average ROC for every result and add them all to a single figure.
Parameters: - modelname (string) – The name of the model to draw. Will be used to determine filename and title of the plot.
- result_path (string or list) – Path to the result base folder that contains multiple test runs. Can be a list of single runs. All runs will be combined into a single figure.
- parameter_name (string) – Parameter under test - will be in the title and appended to the filename.
- model_id_list (list) – A list of Strings. This is used for the legend.
- format_list (list) – A list of pyplot format Strings to be used for the single plots.
MLT.testrunners¶
These runners are responsible for the benchmark execution and additional features like crossvalidation.
Benchmark¶
This runner implements the main benchmark for qualitative analysis based on the full training and test sets.
-
MLT.testrunners.single_benchmark.
run_benchmark
(train_data, train_labels, test_data, test_labels, result_path, model_savepath, args)[source]¶ Run the full benchmark.
As this is the full benchmark, it needs a train and a test partition. Besides that, it is mostly similar to the kfold_runner.
Parameters: - train_data (numpy.ndarray) – Training partition
- train_labels (numpy.ndarray) – Corresponding labels for supervised learning
- test_data (numpy.ndarray) – Test partition
- test_labels (numpy.ndarray) – Corresponding labels for the test partition
- result_path (str) – Where to save the results
- model_savepath (str) – Where to store the trained models
- args (argparse.Namespace) – Parsed CMD arguments that contain all the switches and settings
Returns: The path where to find the final results
Return type: result_path (str)
K-Fold Crossvalidation¶
This runner implements the benchmark with a configurable number of k-Folds for crossvalidation
-
MLT.testrunners.kfold_runner.
run_benchmark
(candidate_data, candidate_labels, result_path, model_savepath, args)[source]¶ Run the k-fold benchmark itself.
Note the absence of train- and test-partitions. As this is a crossvalidation run, the test partition is not to be touched!
Parameters: - candidate_data – Training data with 6 features
- candidate_labels – Corresponding labels for supervised learning
- result_path – Where to save the results
- args – Parsed CMD arguments that contain all the switches and settings
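How a candidate set is carved into folds can be sketched as follows. Contiguous, unshuffled folds described by start and end indices are an assumption for this example, as are the helper names; the actual runner may shuffle or stratify the data.

```python
def fold_boundaries(n_samples, n_folds):
    """Return (start, end) pairs, end exclusive, of the validation slice per fold."""
    base, extra = divmod(n_samples, n_folds)
    boundaries = []
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)  # spread the remainder over the first folds
        boundaries.append((start, start + size))
        start += size
    return boundaries


def split_fold(samples, start, end):
    """Split a list into (training part, validation part) for one fold."""
    return samples[:start] + samples[end:], samples[start:end]


candidates = list(range(10))
folds = [split_fold(candidates, start, end) for start, end in fold_boundaries(10, 3)]
```

Every sample ends up in exactly one validation slice, and the held-back test partition is never involved.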
MLT.tools¶
A collection of misc tools that support the main modules.
PredictionEntry¶
This is a global definition of a named tuple that stores training results.
-
class
MLT.tools.prediction_entry.
PredictionEntry
(test_labels, predicted_labels, predicted_probabilities, training_time)¶ A single prediction entry that holds all information of a test run.
Parameters: - test_labels – The unmodified, original labels of the test set
- predicted_labels – These are the binary classes that have been predicted (i.e.: 0 or 1)
- predicted_probabilities – A list of probabilities. Each entry represents a value between 0 and 1
- training_time – The time it took for the training to finish
-
predicted_labels
¶ Alias for field number 1
-
predicted_probabilities
¶ Alias for field number 2
-
test_labels
¶ Alias for field number 0
-
training_time
¶ Alias for field number 3
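The field aliases listed above are the standard behaviour of a named tuple. The following sketch rebuilds an equivalent tuple with collections.namedtuple to show that every field can be read by name or by position; the example values are made up.

```python
from collections import namedtuple

# Equivalent stand-in with the field order documented above.
PredictionEntry = namedtuple(
    "PredictionEntry",
    ["test_labels", "predicted_labels", "predicted_probabilities", "training_time"],
)

entry = PredictionEntry(
    test_labels=[0, 1, 1],
    predicted_labels=[0, 1, 0],
    predicted_probabilities=[0.1, 0.9, 0.4],
    training_time=12.5,  # made-up value for illustration
)

# Access by name and by index is interchangeable.
same_labels = entry.predicted_labels is entry[1]
```

Because the tuple is immutable, a result entry cannot be altered by accident after a run has finished.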
Dataset Tools¶
Miscellaneous dataset tools and helper functions
-
MLT.tools.dataset_tools.
abs_scaler
(train_data, test_data)[source]¶ Scale given data with a MaxAbsScaler trained on the train data.
Parameters: - train_data (Pandas.DataFrame or Numpy.ndarray) – Training data to scale
- test_data (Pandas.DataFrame or Numpy.ndarray) – Test data to scale
Returns: The transformed data sets
Return type: train_data, test_data (Numpy.ndarray)
-
MLT.tools.dataset_tools.
load_df
(filename, folderpath)[source]¶ Helper function to load Dataframes from a given folder
-
MLT.tools.dataset_tools.
min_max_scale
(train_data, test_data)[source]¶ Scale given data with a MinMaxScaler trained on the train data.
Parameters: - train_data (Pandas.DataFrame or Numpy.ndarray) – Training data to scale
- test_data (Pandas.DataFrame or Numpy.ndarray) – Test data to scale
Returns: The transformed data sets
Return type: train_data, test_data (Numpy.ndarray)
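The important detail shared by the scalers in this module is that the scaling parameters are learned from the training data only and then applied to both sets. A plain-Python sketch of that idea for min-max scaling is shown below; the function name and the list-of-rows input are assumptions for illustration, whereas the real function works on DataFrames or arrays with a MinMaxScaler.

```python
def min_max_scale_sketch(train_rows, test_rows):
    """Scale both sets with per-column minimum and range taken from the training rows."""
    columns = list(zip(*train_rows))
    minimums = [min(column) for column in columns]
    ranges = [max(column) - min(column) for column in columns]

    def scale(rows):
        return [
            [(value - low) / span if span else 0.0
             for value, low, span in zip(row, minimums, ranges)]
            for row in rows
        ]

    return scale(train_rows), scale(test_rows)


train = [[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]]
test = [[2.5, 40.0]]
scaled_train, scaled_test = min_max_scale_sketch(train, test)
```

Note that test values can fall outside the range [0, 1], as the second test feature does here, because the test set has no influence on the learned minimum and maximum.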
-
MLT.tools.dataset_tools.
powertransform_yeoJohnson
(train_data, test_data=None)[source]¶ Transforms given datasets with a Yeo Johnson Powertransform.
This transformer will train on the training set and then scale both sets, training and test. See I.K. Yeo and R.A. Johnson, “A new family of power transformations to improve normality or symmetry.” Biometrika, 87(4), pp.954-959, (2000).
Parameters: - train_data (Pandas.DataFrame or Numpy.ndarray) – Training data to transform
- test_data (Pandas.DataFrame or Numpy.ndarray) – Test data to transform
Returns: The transformed data sets
Return type: train_data, test_data (Numpy.ndarray)
-
MLT.tools.dataset_tools.
standard_scale
(train_data, test_data)[source]¶ Scale given data with a StandardScaler trained on the train data.
Parameters: - train_data (Pandas.DataFrame or Numpy.ndarray) – Training data to scale
- test_data (Pandas.DataFrame or Numpy.ndarray) – Test data to scale
Returns: The transformed data sets
Return type: train_data, test_data (Numpy.ndarray)
Keras Helper¶
Utility functions for Keras-related implementations
-
MLT.tools.helper_keras.
keras_load_modellist
(model_filenames, model_path)[source]¶ Load a list of models from a path
-
MLT.tools.helper_keras.
keras_persist_model
(model, model_savename)[source]¶ Save the full model to disk.
-
MLT.tools.helper_keras.
keras_train_model
(model, epochs, batch_size, training_data, training_labels, test_data, test_labels, logdir, model_savename)[source]¶ Train the given model with data and predict the run.
-
MLT.tools.helper_keras.
keras_train_model_adaptive
(model, epochs, batch_size, training_data, training_labels, test_data, test_labels, logdir, model_savename)[source]¶ Train the given model with data and predict the run.
This training reduces the learning rate on a fixed schedule, every 30 epochs, to 10% of the original value.
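A fixed step schedule of this kind can be written as a small function of the epoch number. The sketch below assumes that the rate is multiplied by 0.1 after every completed block of 30 epochs; the function name and this reading of the description are assumptions, and the actual helper may compute the rate differently.

```python
def step_decay(initial_lr, epoch, drop_factor=0.1, epochs_per_drop=30):
    """Learning rate for a given epoch under a fixed step schedule (assumed form)."""
    completed_drops = epoch // epochs_per_drop  # 0 for epochs 0-29, 1 for 30-59, ...
    return initial_lr * drop_factor ** completed_drops


# Rates at the start of training and after the first two drops.
schedule = [step_decay(0.001, epoch) for epoch in (0, 30, 60)]
```

In Keras, a function like this is usually attached to training through a learning rate scheduler callback.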
Pyod Helper¶
Utility functions for pyod-related implementations
-
MLT.tools.helper_pyod.
predict_pyod
(single_model, test_data, test_labels)[source]¶ Only predict a model without fitting it
-
MLT.tools.helper_pyod.
pyod_load_modellist
(model_filenames, model_path)[source]¶ Load a list of pyod models from disk from the given path
Scikit Helper¶
Utility functions for scikit-learn-related implementations
-
MLT.tools.helper_sklearn.
predict_scikit
(single_model, test_data, test_labels)[source]¶ Only predict a model without fitting it
-
MLT.tools.helper_sklearn.
sklearn_load_model
(dirpath, modelname)[source]¶ Load a scikit model from disk
-
MLT.tools.helper_sklearn.
sklearn_load_modellist
(model_filenames, model_path)[source]¶ Load a list of scikit models from disk from given path
Email Tools¶
Load results, compile them into a mail and send it.
The details (where to send the mail, the sender address, server credentials) can be found in result_mail_credentials.py.dist - to set this up, copy the file, remove the .dist and fill it with real info.
Result Helper¶
Additional tools to simplify and speed up the result evaluation
-
MLT.tools.result_helper.
gen_ltx
(modelname, top_resultpath)[source]¶ Generate a LaTeX table from metrics.json in every subfolder with the call_params, if existing.
Parameters: - modelname (str) – Name of the model to evaluate. Used to derive filenames.
- top_resultpath (str) – Path to the parent folder with all subresults
-
MLT.tools.result_helper.
list_scores
(modelname, top_resultpath)[source]¶ Lists the metrics.json in every subfolder with the call_params, if existing.
Parameters: - modelname (str) – Name of the model to evaluate. Used to derive filenames.
- top_resultpath (str) – Path to the parent folder with all subresults
Uncategorized Tools¶
Collection of misc tools that don’t fit in a standalone module
-
MLT.tools.toolbelt.
list_files
(dirpath, fname_start)[source]¶ List all files in a folder that start with the given string.
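A prefix filter over a directory listing is enough to reproduce this behaviour. The sketch below uses only the standard library; the function names, the sorted output and the example filenames are assumptions for illustration.

```python
import os
import tempfile


def list_files_sketch(dirpath, fname_start):
    """Return the sorted names of all regular files in dirpath starting with fname_start."""
    return sorted(
        name
        for name in os.listdir(dirpath)
        if name.startswith(fname_start) and os.path.isfile(os.path.join(dirpath, name))
    )


def demo_list_files():
    """Create a few empty files in a temporary folder and filter them by prefix."""
    with tempfile.TemporaryDirectory() as dirpath:
        for name in ["XGBoost_a.json", "XGBoost_b.json", "HBOS_a.json"]:
            open(os.path.join(dirpath, name), "w").close()
        return list_files_sketch(dirpath, "XGBoost")
```

Calling demo_list_files() returns only the two files whose names begin with "XGBoost".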
-
MLT.tools.toolbelt.
load_fold_indices
(path)[source]¶ Load the start and end indices of the test set for every fold.
-
MLT.tools.toolbelt.
load_result
(path, modelname)[source]¶ Load the metrics for a given model in the given path.
-
MLT.tools.toolbelt.
load_results_from_disk
(path, modelname)[source]¶ Load the full result json for the given model from the path.
-
MLT.tools.toolbelt.
prepare_folders
(runner_name)[source]¶ Creates all the folders needed for a test run
Parameters: runner_name (string) – Name of the calling runner. Will be the base name for results
Returns: The full path where results can be stored
Return type: result_path (string)
-
MLT.tools.toolbelt.
read_from_json
(full_path_with_name)[source]¶ Read from an arbitrary JSON and return the structure
-
MLT.tools.toolbelt.
read_from_pickle
(full_path_with_name)[source]¶ Read from a pickle at the given location
-
MLT.tools.toolbelt.
save_metrics_to_disk
(metrics_array, modelname, result_path)[source]¶ Save a given metric array as json to disk
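Saving metrics as JSON and reading them back is a simple round trip with the standard json module. In the sketch below the function names and the "<modelname>_metrics.json" filename pattern are hypothetical; the filename that MLT actually derives from modelname is not documented here.

```python
import json
import os
import tempfile


def save_metrics_sketch(metrics, modelname, result_path):
    """Write metrics as JSON into result_path and return the full file path."""
    # Hypothetical filename pattern, chosen for this example only.
    full_path = os.path.join(result_path, modelname + "_metrics.json")
    with open(full_path, "w") as handle:
        json.dump(metrics, handle, indent=2)
    return full_path


def read_json_sketch(full_path_with_name):
    """Read an arbitrary JSON file and return the parsed structure."""
    with open(full_path_with_name) as handle:
        return json.load(handle)


def demo_round_trip():
    """Save a small metrics dict to a temporary folder and load it again."""
    with tempfile.TemporaryDirectory() as result_path:
        path = save_metrics_sketch({"acc": [0.9, 0.8]}, "HBOS", result_path)
        return read_json_sketch(path)
```

The structure returned by demo_round_trip() equals the dictionary that was saved, which makes JSON a convenient format for inspecting results by hand.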
-
MLT.tools.toolbelt.
save_np_to_disk
(stats_dataframe, filename, result_path)[source]¶ Save a given dataframe as binary numpy pickle to disk
-
MLT.tools.toolbelt.
save_results_to_disk
(stats_data, filename, result_path)[source]¶ Save the full results for a given model as json to disk
-
MLT.tools.toolbelt.
write_call_params
(args, result_path)[source]¶ Write the parameters with which MLT has been called to a txt file in the result path