Deep Fast Machine Learning Utils

Welcome to the deepfastmlu documentation!


Introduction

Official repository: https://github.com/fabprezja/deep-fast-machine-learning-utils

Feature Selection Modules

class AdaptiveVarianceThreshold(percentile=1.5, verbose=False)

Bases: object

AdaptiveVarianceThreshold is a feature selector that dynamically determines a variance threshold based on the provided percentile of the feature variances. Features with a variance below this threshold are dropped. Traditional variance-based feature selection uses a fixed threshold, which is not optimal for all datasets.

Attributes:

percentile (float): The percentile of the feature variances used to determine the threshold. A higher percentile will result in a higher threshold and potentially more features being dropped.

variances (np.ndarray, optional): Holds the variances of each feature in the dataset. Calculated during the fit method.

threshold (float, optional): The calculated variance threshold. Features with a variance below this value will be dropped during the transform method.

indices_to_drop (np.ndarray, optional): After fitting and transforming, provides the indices of features that were dropped due to their variance.

selector (VarianceThreshold, optional): An instance of Scikit-learn’s VarianceThreshold class, used to perform the feature selection based on the calculated threshold.

verbose (bool): If set to True, prints additional information during the processing, such as the calculated variance threshold.

Args:

percentile (float, optional): The desired percentile of feature variances to use for determining the threshold. Defaults to 1.5.

verbose (bool, optional): Whether to print additional information during processing. Defaults to False.

fit(features)

Calculate the variances and threshold, and fit the selector.

Args:

features (np.array): The feature set.

fit_transform(features, y=None)

Call the fit and transform methods successively.

Args:

features (np.array): The feature set.

y (np.array, optional): Target values. Included for compatibility with scikit-learn’s transformer API; not used in this method.

Returns:

np.array: The transformed feature set.

transform(features)

Apply the selector to the features and store the indices of the dropped features.

Args:

features (np.array): The feature set.

Returns:

np.array: The transformed feature set.
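
Example:

A minimal usage sketch. The deepfastmlu.feature_select import path is an assumption and may differ from the actual package layout:

    import numpy as np

    from deepfastmlu.feature_select import AdaptiveVarianceThreshold  # assumed import path

    X = np.random.rand(100, 50)  # 100 samples, 50 features

    # Drop features whose variance falls below the 5th percentile of all feature variances
    avt = AdaptiveVarianceThreshold(percentile=5.0, verbose=True)
    X_reduced = avt.fit_transform(X)

    print(X_reduced.shape)      # fewer columns if low-variance features were found
    print(avt.threshold)        # the variance cut-off derived from the percentile
    print(avt.indices_to_drop)  # indices of the features that were dropped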

class ChainedFS(methods: List[Any] | None = None)

Bases: object

ChainedFS is a feature selector that sequentially applies a list of feature selection methods. This class allows for the chaining of multiple feature selection methods, where the output of one method becomes the input for the next. This can be particularly useful when one wants to combine the strengths of different feature selection techniques or when a sequence of operations is required to refine the feature set.

For instance, one might first want to use a variance threshold to remove low-variance features and then apply a more computationally intensive method on the reduced set.

Attributes:

methods (List[Any]): A list of feature selection methods to be applied in sequence. Each method should have fit and transform methods. The order of the list determines the order of application of the methods.

indices_ (np.ndarray): After fitting, provides the indices of features retained after all methods are applied.

Args:

methods (List[Any], optional): Feature selection methods to apply in sequence. Defaults to an empty list.

fit(X: ndarray, y: ndarray | None = None)

Fit the ChainedFS to the data.

Args:

X (np.ndarray): Training data.

y (np.ndarray, optional): Target values. Defaults to None.

fit_transform(X: ndarray, y: ndarray | None = None) → ndarray

Fit to data, then transform it.

Args:

X (np.ndarray): Training data.

y (np.ndarray, optional): Target values. Defaults to None.

Returns:

np.ndarray: Transformed data.

transform(X: ndarray) → ndarray

Transform data based on the sequence of methods applied.

Args:

X (np.ndarray): Data to transform.

Returns:

np.ndarray: Transformed data.
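
Example:

A sketch that chains a variance-based filter with a univariate selector. The deepfastmlu.feature_select import path is an assumption, and the example presumes ChainedFS forwards y to selectors that need it:

    import numpy as np
    from sklearn.feature_selection import SelectKBest, f_classif

    from deepfastmlu.feature_select import AdaptiveVarianceThreshold, ChainedFS  # assumed import path

    X = np.random.rand(200, 40)
    y = np.random.randint(0, 2, size=200)

    chain = ChainedFS(methods=[
        AdaptiveVarianceThreshold(percentile=5.0),  # drop low-variance features first
        SelectKBest(score_func=f_classif, k=10),    # then keep the 10 best of the remainder
    ])
    X_selected = chain.fit_transform(X, y)
    print(X_selected.shape)  # (200, 10)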

class RankAggregatedFS(methods: List[Any] | None = None, k: int = 3, weights: List[float] | None = None)

Bases: object

RankAggregatedFS is a feature selector that aggregates the rankings of features from multiple feature selection methods. It combines the scores or rankings of features from different methods to provide a unified ranking of features. This approach can be useful when there’s uncertainty about which feature selection method to use, as it combines the strengths of multiple methods.

Attributes:

ranking_ (np.ndarray): After fitting, provides the indices of features ranked based on their aggregated scores.

aggregated_scores_ (np.ndarray): After fitting, provides the aggregated scores of features.

Args:

methods (List[Any], optional): A list of feature selection methods to aggregate. Each method should have a fit method and either provide a scores_ attribute after fitting or be compatible with the transform method. Defaults to an empty list.

k (int): The number of top-ranked features to select after aggregation. Features are ranked based on their aggregated scores, and the top k features are selected.

weights (Optional[List[float]], optional): If provided, assigns weights to the feature selection methods. This can be used to give more importance to certain methods over others. If None, all methods are equally weighted.

fit(X: ndarray, y: ndarray | None = None) → RankAggregatedFS

Fit the RankAggregatedFS to the data.

Args:

X (np.ndarray): Training data.

y (np.ndarray, optional): Target values. Defaults to None.

fit_transform(X: ndarray, y: ndarray | None = None) → ndarray

Fit to data, then transform it.

Args:

X (np.ndarray): Training data.

y (np.ndarray, optional): Target values. Defaults to None.

Returns:

np.ndarray: Transformed data.

get_feature_rankings() → ndarray

Return features ranked by their aggregated scores.

Returns:

np.ndarray: Features ranked by their aggregated scores.

transform(X: ndarray) → ndarray

Transform data to retain only the selected features.

Args:

X (np.ndarray): Data to transform.

Returns:

np.ndarray: Transformed data.
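
Example:

A sketch that aggregates the rankings of two univariate selectors. The deepfastmlu.feature_select import path is an assumption; both methods expose a scores_ attribute after fitting, as required:

    import numpy as np
    from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif

    from deepfastmlu.feature_select import RankAggregatedFS  # assumed import path

    X = np.random.rand(200, 40)
    y = np.random.randint(0, 2, size=200)

    ra = RankAggregatedFS(
        methods=[
            SelectKBest(score_func=f_classif, k='all'),
            SelectKBest(score_func=mutual_info_classif, k='all'),
        ],
        k=10,                # keep the 10 top-ranked features after aggregation
        weights=[0.5, 0.5],  # weight both methods equally
    )
    X_selected = ra.fit_transform(X, y)
    print(X_selected.shape)                # (200, 10)
    print(ra.get_feature_rankings()[:10])  # indices of the top-ranked features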

Model Search Modules

class PCCDNAS

Bases: object

Principal Component Cascade Dense Neural Architecture Search (PCCDNAS).

PCCDNAS provides an automated method for designing neural networks. Using PCA (Principal Component Analysis), it systematically sets the number of neurons in each layer of the network. After PCA is applied to the initial data, the number of principal components (PCs) needed to reach a given explained-variance threshold determines the neuron count for the first layer. The cascade mechanism then applies PCA to the activations of each trained layer, and the same explained-variance criterion determines the neuron count for each subsequent layer.

build()

Builds and trains the complete neural network model based on the PCA specification and the specified hyperparameters.

Returns:

tuple: A tuple containing the trained Keras model and a list of the number of neurons for each layer.

data_init(X_train, y_train, validation=None, normalize=True, unit=False)

Initialize data for searching the model.

Args:

X_train (ndarray): Training data.

y_train (ndarray): Training labels/targets.

validation (float, tuple, optional): Validation data. Can be a percentage (float) or a tuple (X_val, y_val).

normalize (bool, optional): Whether to normalize the data or not. Default is True.

unit (bool, optional): If True, standard deviation is used for normalization. Default is False.

Initialize the model’s hyperparameters based on provided keyword arguments.

Note:

All parameters are passed as keyword arguments.

Args:

epochs (int, optional): Number of training epochs. Default is 15.

layers (int, optional): Number of layers in the model. Default is 2.

activation (str, optional): Activation function for the layers. Default is 'elu'.

pca_variance (float or list of float, optional): Desired explained variance for PCA. Default is 0.95.

loss (str, optional): Loss function for the model. Default is 'categorical_crossentropy'.

optimizer (str, optional): Optimizer for the model. Default is 'adam'.

metrics (list, optional): List of metrics to be evaluated. Default is ['accuracy'].

output_neurons (int, optional): Number of neurons in the output layer. Default is 1.

out_activation (str, optional): Activation function for the output layer. Default is 'sigmoid'.

stop_criteria (str, optional): Criterion monitored for early stopping. Default is 'val_loss'.

es_mode (str, optional): Mode for early stopping. Default is 'max'.

dropout (float, optional): Dropout rate; when set, dropout layers are included. Default is None.

regularize (tuple (str, float), optional): Regularization type and value. Default is None.

batch_size (int, optional): Batch size for training. Default is 32.

kernel_initializer (str, optional): Kernel initializer for the dense layers. Default is 'he_normal'.

batch_norm (bool, optional): Whether to include batch normalization layers. Default is True.

es_patience (int, optional): Number of epochs with no improvement for early stopping. Default is 5.

verbose (int, optional): Verbosity mode. 0 = silent, 1 = progress bar, 2 = one line per epoch. Default is 1.

custom_callback (callback, optional): Custom callback for training. Default is None.

learn_rate (float, optional): Learning rate for the optimizer. Default is 0.0001.
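
Example:

A sketch of a full search on synthetic binary-classification data. The deepfastmlu.model_search import path is an assumption, and search_init is a placeholder name for the keyword-argument hyperparameter initializer documented above:

    import numpy as np

    from deepfastmlu.model_search import PCCDNAS  # assumed import path

    X_train = np.random.rand(500, 64).astype("float32")
    y_train = np.random.randint(0, 2, size=500)

    nas = PCCDNAS()
    nas.data_init(X_train, y_train, validation=0.2, normalize=True)
    nas.search_init(              # placeholder name; see the hyperparameter list above
        epochs=15,
        layers=2,
        pca_variance=0.95,        # explained variance used to pick each layer's neuron count
        output_neurons=1,
        out_activation="sigmoid",
        loss="binary_crossentropy",
    )
    model, neurons_per_layer = nas.build()
    print(neurons_per_layer)      # PCA-derived neuron counts for each dense layer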

Extra Modules

class DataSubSampler(data_dir, destination_dir, fraction, seed=None)

Bases: object

DataSubSampler class for creating a smaller dataset by randomly sampling a fraction of files from the original dataset.

Attributes:

data_dir (str): Directory containing the original dataset.

destination_dir (str): Directory where the sampled dataset will be saved.

fraction (float): Fraction of files to sample from the original dataset.

seed (int): Seed for random number generator.

create_miniature_dataset()

Creates a copy of all folders and subfolders from the original dataset path, but copies only a randomly sampled fraction of the files from each.
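
Example:

A sketch that copies a random 10% of the files from each folder into a new location. The deepfastmlu.extra import path and the directory names are placeholders:

    from deepfastmlu.extra import DataSubSampler  # assumed import path

    sampler = DataSubSampler(
        data_dir="data/full_dataset",         # original dataset root
        destination_dir="data/mini_dataset",  # where the subsample is written
        fraction=0.1,                         # keep 10% of the files per folder
        seed=42,                              # reproducible sampling
    )
    sampler.create_miniature_dataset()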

class DatasetSplitter(data_dir, destination_dir, train_ratio=0.7, val_ratio=0.15, test_ratio=0.15, test_ratio_2=None, seed=None)

Bases: object

DatasetSplitter class for splitting a dataset into train, validation, and test sets.

Attributes:

data_dir (str): Directory containing the dataset.

destination_dir (str): Directory where the train, validation, and test sets will be saved.

train_ratio (float): Ratio of train set.

val_ratio (float): Ratio of validation set.

test_ratio (float): Ratio of test set.

test_ratio_2 (float): Ratio of an additional test set, if needed.

seed (int): Seed for random number generator.

run()

Main method to execute dataset splitting and copying files.
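
Example:

A sketch that splits a folder-per-class dataset into train, validation, and test sets. The deepfastmlu.extra import path and the directory names are placeholders:

    from deepfastmlu.extra import DatasetSplitter  # assumed import path

    splitter = DatasetSplitter(
        data_dir="data/full_dataset",
        destination_dir="data/splits",
        train_ratio=0.7,
        val_ratio=0.15,
        test_ratio=0.15,
        seed=42,
    )
    splitter.run()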

plot_confusion_matrix(model, generator, generator_name, target_type)

Plot the confusion matrix of the model predictions on a given dataset.

Args:

model (Model): A trained Keras model.

generator (ImageDataGenerator): A Keras ImageDataGenerator.

generator_name (str): Name of the generator, used in the plot title.

target_type (str): Type of target labels ('binary' or 'categorical').
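
Example:

A sketch of plotting a confusion matrix for a test generator. The deepfastmlu.extra import path and the data directory are placeholders, and the untrained stand-in model should be replaced with your trained model:

    from tensorflow import keras
    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    from deepfastmlu.extra import plot_confusion_matrix  # assumed import path

    test_gen = ImageDataGenerator(rescale=1.0 / 255).flow_from_directory(
        "data/splits/test",        # placeholder path to the test split
        target_size=(64, 64),
        class_mode="binary",
        shuffle=False,             # keep order so predictions align with labels
    )

    # Untrained stand-in model; use your trained model here.
    model = keras.Sequential([
        keras.layers.Flatten(input_shape=(64, 64, 3)),
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

    plot_confusion_matrix(model, test_gen, generator_name="Test", target_type="binary")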

plot_history_curves(history, show_min_max_plot, user_metric)

Plot the training and validation loss and user-specified metric curves.

Args:

history (History): Keras training history object.

show_min_max_plot (bool): Whether to plot the maximum/minimum value lines.

user_metric (str): User-specified metric to be plotted.
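
Example:

A sketch that trains a small model briefly and plots its curves. The deepfastmlu.extra import path is an assumption:

    import numpy as np
    from tensorflow import keras

    from deepfastmlu.extra import plot_history_curves  # assumed import path

    X = np.random.rand(200, 16).astype("float32")
    y = np.random.randint(0, 2, size=200)

    model = keras.Sequential([
        keras.layers.Dense(8, activation="relu", input_shape=(16,)),
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    history = model.fit(X, y, epochs=5, validation_split=0.2, verbose=0)

    # Plot loss and accuracy curves, with min/max marker lines enabled
    plot_history_curves(history, show_min_max_plot=True, user_metric="accuracy")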