Documentation

The full documentation is available here. Use the sidebar to select the most applicable sub-module.

All of these classes and functions can be imported directly from the sku package, for example:

>>> import sku
>>> sku.PipelineDD(...)

Wrapping Utilities

Here is the documentation for the wrapping functionality.

class sku.flatten_wrapper.FlattenWrapper(estimator: Any | None = None, start_dim: int = 1, end_dim: int = -1, flatten_arg: int | str = 0, unflatten_transform: bool = True, **kwargs)

Bases: BaseEstimator, TransformerMixin

Parameters:
  • estimator (Any) –

  • start_dim (int) –

  • end_dim (int) –

  • flatten_arg (Union[int, str]) –

  • unflatten_transform (bool) –

__init__(estimator: Any | None = None, start_dim: int = 1, end_dim: int = -1, flatten_arg: int | str = 0, unflatten_transform: bool = True, **kwargs) None

This class allows you to wrap any transformer or model with a flattening operation. By default, the flattening will produce shape (in_shape[0], -1). Please see the flattening operations in sku.preprocessing.Flatten to understand the arguments start_dim and end_dim.

Note: Any attribute of the underlying estimator is accessible as normal as an attribute of this class. If you require a flattened call of any method of the underlying estimator, then you may use a prefix of flatten__ to the method name. See the examples.

Examples

>>> flatten_scaler = FlattenWrapper(
...     StandardScaler,
...     unflatten_transform=True,
...     )
>>> flatten_scaler.fit(x)
>>> flatten_scaler.transform(x)

Also, since any attribute of the underlying estimator is accessible through this wrapper, you may do the following:

>>> flatten_scaler = FlattenWrapper(
...     StandardScaler,
...     unflatten_transform=True,
...     )
>>> flatten_scaler.fit(x)
>>> flatten_scaler.transform(x)
>>> flatten_scaler.mean_
[0.5, 0.2, 100, 20]

If you require a method call on the underlying estimator, but still want to flatten the argument first, use the following structure:

>>> flatten_lr = FlattenWrapper(
...     LogisticRegression,
...     )
>>> flatten_lr.fit(x, y)
>>> flatten_lr.flatten__predict_log_proba(x)
[0, 1, 0, 1, 1, 0]
Parameters:
  • estimator (-) – The transformer or model that requires a flattening of the array before fit and transform or predict. Defaults to None.

  • start_dim (-) – The first dim to flatten. Defaults to 1.

  • end_dim (-) – The last dim to flatten. Defaults to -1.

  • flatten_arg (-) – The argument of the call on fit and transform/predict that contains the array that requires flattening. This can either be the keyword, which then needs to be given in the method calls, or the index of the argument. Please keep in mind that an integer here will only work if the fit and predict methods are given positional arguments. To be safe, it is best to pass a string here and use keyword arguments in the method calls. Defaults to 0.

  • unflatten_transform (-) – Return an unflattened version of the transformed array. If the output is a numpy array, then it will be unflattened directly. If the output is a list or tuple, then the flatten_arg of the output will be unflattened. Defaults to True.

  • **kwargs (-) –

    The keywords that will be passed to the estimator init function. Defaults to {}.

Return type:

None
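For illustration, here is a minimal sketch of using a keyword flatten_arg, as recommended above. The data and shapes are illustrative, and the keyword-based call style follows the flatten_arg description:

>>> import numpy as np
>>> from sklearn.preprocessing import StandardScaler
>>> from sku.flatten_wrapper import FlattenWrapper
>>> x = np.random.rand(8, 3, 4)
>>> flatten_scaler = FlattenWrapper(
...     StandardScaler,
...     flatten_arg='X',
...     unflatten_transform=True,
...     )
>>> flatten_scaler.fit(X=x)  # 'X' matches flatten_arg, so this array is flattened to (8, 12)
>>> flatten_scaler.transform(X=x).shape  # unflattened back to the input shape
(8, 3, 4)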

property estimator

This is the estimator, which can be accessed as an attribute of this class.

fit(*args, **kwargs) FlattenWrapper

Fit the wrapped estimator on the flattened array.

Parameters:
  • args (-) – Arguments passed to the estimator's fit method. The flatten_arg argument will be flattened before being passed on.

  • kwargs (-) – Keyword arguments passed to the estimator's fit method.

Returns:

- self – The fitted wrapper.

Return type:

FlattenWrapper:

fit_transform(*args, **kwargs)

Fit and then transform with the estimator and flattened array.

Parameters:
  • args (-) – Arguments passed to the estimator when transforming.

  • kwargs (-) – Keyword arguments passed to the estimator when transforming.

Returns:

- args_out – The transformed input.

Return type:

Any:

get_params(deep=True) dict

Overrides sklearn function.

Parameters:

deep (-) – Ignored. Defaults to True.

Returns:

- out – Dictionary of parameters.

Return type:

dict:

set_params(**params)

From sklearn documentation: Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params (-) –

Estimator parameters.

Returns:

- self – Estimator instance.

Return type:

estimator instance

Model Wrapper

Here is the documentation for the model wrapping functionality.

class sku.model_wrapper.SKModelWrapperDD(model: Any, fit_on: Union[Union[List[str], List[List[str]]], Dict[str]] = [['X', 'y']], predict_on: Union[Union[List[str], List[List[str]]], Dict[str]] = [['X']], **kwargs)

Bases: BaseEstimator, ClassifierMixin

__init__(model: Any, fit_on: Union[Union[List[str], List[List[str]]], Dict[str]] = [['X', 'y']], predict_on: Union[Union[List[str], List[List[str]]], Dict[str]] = [['X']], **kwargs) None

A wrapper for any scikit-learn model that accepts a dictionary containing the data. This is useful to use when combining semi-supervised methods with sklearn.pipeline.Pipeline. The model should not be initiated yet, and all arguments passed as positional or keyword arguments after the model is given.

Note: Any attribute or method of the underlying model is accessible as normal as an attribute of this class. The returned value will be wrapped in a list if multiple fit_on arguments are used, otherwise a single value is returned.

Parameters:
  • model (-) – The model to wrap. This model must have both .fit(X, y) and .predict(X). An example would be an abstracted pytorch model.

  • fit_on (-) –

    This allows the user to define the keys in the data dictionary that will be passed to the fit function.

    • If List of List or List: The outer list will be iterated over, and the inner list’s keys will be used to get the data from the data dictionary, which will be passed in that order as positional arguments to the .fit() function. Multiple inner lists will cause the .fit() function to be called multiple times, with each fitted version being saved as a unique object in this class. If a list of strings is given then they will be wrapped in an outer list, meaning that one .fit() is called, with arguments corresponding to the keys given as strings.

    • If Dict or List of Dict: The outer list will be iterated over, and the inner dicts’ values will be used to get the data from the data dictionary, which will be passed in as keyword arguments using the dict’s keys to the .fit() function. Multiple dicts will cause the .fit() function to be called multiple times, with each fitted version being saved as a unique object in this class. If just a dict is given then it will be wrapped in an outer list, meaning that one .fit() is called, with arguments corresponding to the keys and values as before.

    The multiple fit models will be saved in a list, accessible through the fitted_models attribute. Defaults to [['X', 'y']].

  • predict_on (-) – This allows the user to define the keys in the data dictionary that will be passed to the predict function. The outer list will be iterated over, and the inner list’s keys will be used to get the data from the data dictionary, which will be passed in that order as positional arguments to the .predict() function. Multiple inner lists will cause the .predict() function to be called multiple times, with each predict corresponding to the fitted object in each of the fit calls. If there are more .predict() calls than .fit() calls, then the predict will be called on the beginning of the fit object list again (i.e. the predict calls roll over the fit calls). If a list of strings is given then they will be wrapped in an outer list, meaning that one .predict() is called, with arguments corresponding to the keys given as strings. Defaults to [['X']].

  • **kwargs (-) –

    Keyword arguments given to the model init.

Return type:

None
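To make the fit_on and predict_on mechanics concrete, here is a minimal sketch. The model and data are illustrative; the wrapper call style follows the examples elsewhere in these docs:

import numpy as np
import sku
from sklearn.linear_model import LogisticRegression

data = {
    'X': np.random.rand(20, 4),
    'y': np.random.randint(0, 2, 20),
    }
model = sku.SKModelWrapperDD(
    LogisticRegression,
    fit_on=['X', 'y'],   # data['X'], data['y'] are passed positionally to .fit()
    predict_on=['X'],    # data['X'] is passed to .predict()
    )
model.fit(data)
predictions = model.predict(data)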

fit(X: Dict[str, ndarray] | ndarray, y: None | ndarray = None, sample_weight: None | ndarray = None) SKModelWrapperDD

This will fit the model being wrapped.

Parameters:
  • X (-) – A dictionary containing the data. If X is a numpy.ndarray, then the fit_on arguments will be ignored and the model will be passed .fit(X,y). In this case, consider using sklearn. For example: X = {'X': X_DATA, 'y': Y_DATA, **kwargs}.

  • y (-) – Ignored unless X is a numpy.ndarray. If using a data dictionary, please pass labels in the dictionary to X. If X is a numpy.ndarray then this should also be a numpy.ndarray of targets. Defaults to None.

  • sample_weight (-) – Array of weights that are assigned to individual samples. If not provided, then each sample is given unit weight. Ignored unless X is a numpy.ndarray. If using a data dictionary, please pass sample weights in the dictionary to X and use the fit_on argument. If X is a numpy.ndarray then this should also be a numpy.ndarray of sample weights. This will only be passed to the underlying model’s fit function if it includes this argument in its signature. Defaults to None.

Returns:

- self – This object.

Return type:

SKModelWrapperDD:

get_params(deep=True) dict

Overrides sklearn function.

Parameters:

deep (-) – Ignored. Defaults to True.

Returns:

- out – Dictionary of parameters.

Return type:

dict:

property model

This is the model, which can be accessed as an attribute of this class. If multiple fit_on arguments were given, and the class has been fitted, then this will be a list of the fitted models. Otherwise, it will be a single instance.

predict(X: Dict[str, ndarray], return_data_dict: bool = False) ndarray | Dict[str, ndarray]

This will predict using the model being wrapped.

Parameters:
  • X (-) – A dictionary containing the data. If X is a numpy.ndarray, then the predict_on arguments will be ignored and the model will be passed .predict(X). In this case, consider using sklearn. In addition, this will be performed on the first fitted model if many are fitted. For example: X = {'X': X_DATA, 'y': Y_DATA, **kwargs}.

  • return_data_dict (-) – Whether to return the ground truth with the output. This is useful if this model was part of a pipeline in which the labels are altered. This is ignored if X is a numpy.ndarray. Defaults to False.

Returns:

  • - predictions (numpy.ndarray:) – The predictions, as a numpy array. If multiple inner lists are given as predict_on, then a list of predictions will be returned.

  • - data_dict (typing.Dict[str, np.ndarray]:) – The labels, as a numpy array. Only returned if return_data_dict=True.

Return type:

ndarray | Dict[str, ndarray]

predict_proba(X: Dict[str, ndarray], return_data_dict: bool = False) ndarray | Dict[str, ndarray]

This will predict the probabilities using the model being wrapped.

Parameters:
  • X (-) – A dictionary containing the data. If X is a numpy.ndarray, then the predict_on arguments will be ignored and the model will be passed .predict_proba(X). In this case, consider using sklearn. In addition, this will be performed on the first fitted model if many are fitted. For example: X = {'X': X_DATA, 'y': Y_DATA, **kwargs}.

  • return_data_dict (-) – Whether to return the ground truth with the output. This is useful if this model was part of a pipeline in which the labels are altered. This is ignored if X is a numpy.ndarray. Defaults to False.

Returns:

  • - predictions (numpy.ndarray:) – The predictions, as a numpy array. If multiple inner lists are given as predict_on, then a list of predictions will be returned.

  • - data_dict (typing.Dict[str, np.ndarray]:) – The labels, as a numpy array. Only returned if return_data_dict=True.

Return type:

ndarray | Dict[str, ndarray]

set_params(**params)

From sklearn documentation: Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params (-) –

Estimator parameters.

Returns:

- self – Estimator instance.

Return type:

estimator instance

Transformer Wrapper

Here is the documentation for the transformer wrapping functionality.

class sku.transformer_wrapper.SKTransformerWrapperDD(transformer: Any, fit_on: Union[Union[List[str], List[List[str]]], Dict[str]] = [['X', 'y']], transform_on: Union[Union[List[str], List[List[str]]], Dict[str]] = [['X']], all_key_transform=False, **kwargs)

Bases: BaseEstimator, TransformerMixin

__init__(transformer: Any, fit_on: Union[Union[List[str], List[List[str]]], Dict[str]] = [['X', 'y']], transform_on: Union[Union[List[str], List[List[str]]], Dict[str]] = [['X']], all_key_transform=False, **kwargs) None

A wrapper for any scikit-learn transformer that accepts a dictionary containing the data. This is useful to use when combining semi-supervised methods with sklearn.pipeline.Pipeline.

Note: Any attribute or method of the underlying transformer is accessible as normal as an attribute of this class. The returned value will be wrapped in a list if multiple fit_on arguments are used, otherwise a single value is returned.

Parameters:
  • transformer (-) – The transformer to wrap. This transformer must have both .fit(X, y) and .transform(X). An example would be a custom transformer that uses data dictionaries.

  • fit_on (-) –

    This allows the user to define the keys in the data dictionary that will be passed to the fit function.

    • If List of List or List: The outer list will be iterated over, and the inner list’s keys will be used to get the data from the data dictionary, which will be passed in that order as positional arguments to the .fit() function. Multiple inner lists will cause the .fit() function to be called multiple times, with each fitted version being saved as a unique object in this class. If a list of strings is given then they will be wrapped in an outer list, meaning that one .fit() is called, with arguments corresponding to the keys given as strings.

    • If Dict or List of Dict: The outer list will be iterated over, and the inner dicts’ values will be used to get the data from the data dictionary, which will be passed in as keyword arguments using the dict’s keys to the .fit() function. Multiple dicts will cause the .fit() function to be called multiple times, with each fitted version being saved as a unique object in this class. If just a dict is given then it will be wrapped in an outer list, meaning that one .fit() is called, with arguments corresponding to the keys and values as before.

    The multiple fit transforms will be saved in a list, accessible through the fitted_transforms attribute. Defaults to [['X', 'y']].

  • transform_on (-) – This allows the user to define the keys in the data dictionary that will be passed to the transform function. The outer list will be iterated over, and the inner list’s keys will be used to get the data from the data dictionary, which will be passed in that order as positional arguments to the .transform() function. Multiple inner lists will cause the .transform() function to be called multiple times, with each transform corresponding to the fitted object in each of the fit calls. If there are more .transform() calls than .fit() calls, then the transform will be called on the beginning of the fit object list again (i.e. the transform calls roll over the fit calls). The first key in each inner list will be overwritten with the result from .transform(), unless all_key_transform=True. If a list of strings is given then they will be wrapped in an outer list, meaning that one .transform() is called, with arguments corresponding to the keys given as strings. Defaults to [['X']].

  • all_key_transform (-) – This dictates whether the transformer being wrapped will output a result for all of the arguments given to it. In this case, each of the values corresponding to the keys being transformed on will be replaced by the corresponding output of the wrapped transform. If False, only the first key will be transformed, i.e. x, y, z = self.wrapped_transform(x, y, z) rather than x = self.wrapped_transform(x, y, z). Defaults to False.

  • **kwargs (-) –

    Keyword arguments given to the transformer.

Return type:

None
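As a minimal sketch of the documented usage (the data here is illustrative):

import numpy as np
import sku
from sklearn.preprocessing import StandardScaler

data = {
    'X': np.random.rand(10, 3),
    'y': np.random.randint(0, 2, 10),
    }
scaler = sku.SKTransformerWrapperDD(
    StandardScaler,
    fit_on=['X'],
    transform_on=['X'],
    )
scaler.fit(data)
data_out = scaler.transform(data)  # data_out['X'] is scaled; other keys pass through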

fit(X: Dict[str, ndarray] | ndarray, y: None | ndarray = None, sample_weight: None | ndarray = None) SKTransformerWrapperDD

This will fit the transformer being wrapped.

Parameters:
  • X (-) – A dictionary containing the data. If X is a numpy.ndarray, then the fit_on arguments will be ignored and the transformer will be passed .fit(X,y). In this case, consider using sklearn. For example: X = {'X': X_DATA, 'y': Y_DATA, **kwargs}.

  • y (-) – Ignored unless X is a numpy.ndarray. If using a data dictionary, please pass labels in the dictionary to X. If X is a numpy.ndarray then this should also be a numpy.ndarray of targets. Defaults to None.

  • sample_weight (-) – Array of weights that are assigned to individual samples. If not provided, then each sample is given unit weight. Ignored unless X is a numpy.ndarray. If using a data dictionary, please pass sample weights in the dictionary to X and use the fit_on argument. If X is a numpy.ndarray then this should also be a numpy.ndarray of sample weights. This will only pass this argument to the underlying transform’s fit function if it includes this argument in the signature. Defaults to None.

Returns:

- self – This object.

Return type:

SKTransformerWrapperDD:

fit_transform(X: Dict[str, ndarray] | ndarray, y: None | ndarray = None, **fit_params)

This will fit and transform the transformer being wrapped and the data given.

Parameters:
  • X (-) – A dictionary containing the data. If X is a numpy.ndarray, then the fit_on arguments will be ignored and the transformer will be passed .fit_transform(X,y). In this case, consider using sklearn. For example: X = {'X': X_DATA, 'y': Y_DATA, **kwargs}.

  • y (-) – Ignored unless X is a numpy.ndarray. If using a data dictionary, please pass labels in the dictionary to X. Defaults to None.

  • **fit_params (-) –

    These are passed to the fit function.

Returns:

- X_out – The data dictionary, with structure the same as X, with transformed X data. If X is a numpy.ndarray, then a numpy.ndarray will be returned.

Return type:

Dict[str, Union[np.ndarray, Dict[str, np.ndarray]]]:

get_params(deep=True) dict

Overrides sklearn function.

Parameters:

deep (-) – Ignored. Defaults to True.

Returns:

- out – Dictionary of parameters.

Return type:

dict:

set_params(**params)

From sklearn documentation: Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params (-) –

Estimator parameters.

Returns:

- self – Estimator instance.

Return type:

estimator instance

transform(X: Dict[str, ndarray]) Dict[str, ndarray]

This will transform the data using the transformer being wrapped.

Parameters:

X (-) – A dictionary containing the data. If X is a numpy.ndarray, then the transform_on arguments will be ignored and the transform will be passed .transform(X). In this case, consider using sklearn. In addition, this will be performed on the first fitted transform if many are fitted. For example: X = {'X': X_DATA, 'y': Y_DATA, **kwargs}.

Returns:

- X_out – The data dictionary, with structure the same as X, with transformed X data. If X is a numpy.ndarray, then a numpy.ndarray will be returned.

Return type:

Dict[str, Union[np.ndarray, Dict[str, np.ndarray]]]:

property transformer

This is the transformer, which can be accessed as an attribute of this class. If multiple fit_on arguments were given, and the class has been fitted, then this will be a list of the fitted transformers. Otherwise, it will be a single instance.

Transformer

Here is the documentation for the transformer functionality.

class sku.transformer.DropNaNRowsDD(*, transformer: typing.Any = <class 'sku.transformer._DropNaNRowsDD'>, fit_on: typing.Union[typing.Union[typing.List[str], typing.List[typing.List[str]]], typing.Dict[str]] = [['X', 'y']], transform_on: typing.Union[typing.Union[typing.List[str], typing.List[typing.List[str]]], typing.Dict[str]] = [['X']], all_key_transform=True, **kwargs)

Bases: SKTransformerWrapperDD

__init__(*, transformer: typing.Any = <class 'sku.transformer._DropNaNRowsDD'>, fit_on: typing.Union[typing.Union[typing.List[str], typing.List[typing.List[str]]], typing.Dict[str]] = [['X', 'y']], transform_on: typing.Union[typing.Union[typing.List[str], typing.List[typing.List[str]]], typing.Dict[str]] = [['X']], all_key_transform=True, **kwargs) None

A wrapper for any scikit-learn transformer that accepts a dictionary containing the data. This is useful to use when combining semi-supervised methods with sklearn.pipeline.Pipeline.

Note: Any attribute or method of the underlying transformer is accessible as normal as an attribute of this class. The returned value will be wrapped in a list if multiple fit_on arguments are used, otherwise a single value is returned.

Parameters:
  • transformer (-) – The transformer to wrap. This transformer must have both .fit(X, y) and .transform(X). An example would be a custom transformer that uses data dictionaries.

  • fit_on (-) –

    This allows the user to define the keys in the data dictionary that will be passed to the fit function.

    • If List of List or List: The outer list will be iterated over, and the inner list’s keys will be used to get the data from the data dictionary, which will be passed in that order as positional arguments to the .fit() function. Multiple inner lists will cause the .fit() function to be called multiple times, with each fitted version being saved as a unique object in this class. If a list of strings is given then they will be wrapped in an outer list, meaning that one .fit() is called, with arguments corresponding to the keys given as strings.

    • If Dict or List of Dict: The outer list will be iterated over, and the inner dicts’ values will be used to get the data from the data dictionary, which will be passed in as keyword arguments using the dict’s keys to the .fit() function. Multiple dicts will cause the .fit() function to be called multiple times, with each fitted version being saved as a unique object in this class. If just a dict is given then it will be wrapped in an outer list, meaning that one .fit() is called, with arguments corresponding to the keys and values as before.

    The multiple fit transforms will be saved in a list, accessible through the fitted_transforms attribute. Defaults to [['X', 'y']].

  • transform_on (-) – This allows the user to define the keys in the data dictionary that will be passed to the transform function. The outer list will be iterated over, and the inner list’s keys will be used to get the data from the data dictionary, which will be passed in that order as positional arguments to the .transform() function. Multiple inner lists will cause the .transform() function to be called multiple times, with each transform corresponding to the fitted object in each of the fit calls. If there are more .transform() calls than .fit() calls, then the transform will be called on the beginning of the fit object list again (i.e. the transform calls roll over the fit calls). The first key in each inner list will be overwritten with the result from .transform(), unless all_key_transform=True. If a list of strings is given then they will be wrapped in an outer list, meaning that one .transform() is called, with arguments corresponding to the keys given as strings. Defaults to [['X']].

  • all_key_transform (-) – This dictates whether the transformer being wrapped will output a result for all of the arguments given to it. In this case, each of the values corresponding to the keys being transformed on will be replaced by the corresponding output of the wrapped transform. If False, only the first key will be transformed, i.e. x, y, z = self.wrapped_transform(x, y, z) rather than x = self.wrapped_transform(x, y, z). Defaults to True.

  • **kwargs (-) –

    Keyword arguments given to the transformer.

Return type:

None
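A hedged sketch of one possible usage, dropping rows containing NaNs. The transform_on=[['X', 'y']] setting is an assumption here, chosen so that 'X' and 'y' stay aligned (all_key_transform=True by default for this class):

import numpy as np
import sku

data = {
    'X': np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, 5.0]]),
    'y': np.array([0, 1, 0]),
    }
dropper = sku.DropNaNRowsDD(transform_on=[['X', 'y']])  # assumed setting
dropper.fit(data)
data_out = dropper.transform(data)  # the row containing NaN is dropped from 'X' and 'y'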

class sku.transformer.NumpyToDict

Bases: BaseEstimator, TransformerMixin

__init__()

This class transforms two numpy arrays, containing the inputs and targets, into a dictionary containing the same arrays.

fit(X: None | ndarray = None, y: None | ndarray = None) NumpyToDict

This will save the numpy arrays to the class, ready to return them as a dictionary in the .transform() method.

Parameters:
  • X (-) – The inputs. Defaults to None.

  • y (-) – The targets. Defaults to None.

Return type:

NumpyToDict

transform(X: Any = None) Dict[str, ndarray]

This returns the saved arrays as a data dictionary.

Parameters:

X (-) – Ignored. Defaults to None.

Returns:

- out – A dictionary of the form: {'X': self.X, 'y': self.y}

Return type:

Dict[str, np.ndarray]:
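A short usage sketch (the arrays are illustrative):

>>> import numpy as np
>>> from sku.transformer import NumpyToDict
>>> n2d = NumpyToDict()
>>> n2d.fit(np.array([[1], [2]]), np.array([0, 1]))
>>> out = n2d.transform()
>>> sorted(out.keys())
['X', 'y']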

class sku.transformer.StandardGroupScalerDD(*, transformer: typing.Any = <class 'sku.preprocessing.StandardGroupScaler'>, fit_on: typing.Union[typing.Union[typing.List[str], typing.List[typing.List[str]]], typing.Dict[str]] = [['X', 'y']], transform_on: typing.Union[typing.Union[typing.List[str], typing.List[typing.List[str]]], typing.Dict[str]] = [['X']], all_key_transform=False, **kwargs)

Bases: SKTransformerWrapperDD

__init__(*, transformer: typing.Any = <class 'sku.preprocessing.StandardGroupScaler'>, fit_on: typing.Union[typing.Union[typing.List[str], typing.List[typing.List[str]]], typing.Dict[str]] = [['X', 'y']], transform_on: typing.Union[typing.Union[typing.List[str], typing.List[typing.List[str]]], typing.Dict[str]] = [['X']], all_key_transform=False, **kwargs) None

A wrapper for any scikit-learn transformer that accepts a dictionary containing the data. This is useful to use when combining semi-supervised methods with sklearn.pipeline.Pipeline.

Note: Any attribute or method of the underlying transformer is accessible as normal as an attribute of this class. The returned value will be wrapped in a list if multiple fit_on arguments are used, otherwise a single value is returned.

Parameters:
  • transformer (-) – The transformer to wrap. This transformer must have both .fit(X, y) and .transform(X). An example would be a custom transformer that uses data dictionaries.

  • fit_on (-) –

    This allows the user to define the keys in the data dictionary that will be passed to the fit function.

    • If List of List or List: The outer list will be iterated over, and the inner list’s keys will be used to get the data from the data dictionary, which will be passed in that order as positional arguments to the .fit() function. Multiple inner lists will cause the .fit() function to be called multiple times, with each fitted version being saved as a unique object in this class. If a list of strings is given then they will be wrapped in an outer list, meaning that one .fit() is called, with arguments corresponding to the keys given as strings.

    • If Dict or List of Dict: The outer list will be iterated over, and the inner dicts’ values will be used to get the data from the data dictionary, which will be passed in as keyword arguments using the dict’s keys to the .fit() function. Multiple dicts will cause the .fit() function to be called multiple times, with each fitted version being saved as a unique object in this class. If just a dict is given then it will be wrapped in an outer list, meaning that one .fit() is called, with arguments corresponding to the keys and values as before.

    The multiple fit transforms will be saved in a list, accessible through the fitted_transforms attribute. Defaults to [['X', 'y']].

  • transform_on (-) – This allows the user to define the keys in the data dictionary that will be passed to the transform function. The outer list will be iterated over, and the inner list’s keys will be used to get the data from the data dictionary, which will be passed in that order as positional arguments to the .transform() function. Multiple inner lists will cause the .transform() function to be called multiple times, with each transform corresponding to the fitted object in each of the fit calls. If there are more .transform() calls than .fit() calls, then the transform will be called on the beginning of the fit object list again (i.e. the transform calls roll over the fit calls). The first key in each inner list will be overwritten with the result from .transform(), unless all_key_transform=True. If a list of strings is given then they will be wrapped in an outer list, meaning that one .transform() is called, with arguments corresponding to the keys given as strings. Defaults to [['X']].

  • all_key_transform (-) – This dictates whether the transformer being wrapped will output a result for all of the arguments given to it. In this case, each of the values corresponding to the keys being transformed on will be replaced by the corresponding output of the wrapped transform. If False, only the first key will be transformed, i.e. x, y, z = self.wrapped_transform(x, y, z) rather than x = self.wrapped_transform(x, y, z). Defaults to False.

  • **kwargs (-) –

    Keyword arguments given to the transformer.

Return type:

None

Model Selection

Here is the documentation for the model selection functionality.

class sku.model_selection.DataPreSplit(data: List[Tuple[Dict[str, ndarray], Dict[str, ndarray]]] | Tuple[Dict[str, ndarray], Dict[str, ndarray]], split_fit_on: List[str] = ['X', 'y'])

Bases: object

__init__(data: List[Tuple[Dict[str, ndarray], Dict[str, ndarray]]] | Tuple[Dict[str, ndarray], Dict[str, ndarray]], split_fit_on: List[str] = ['X', 'y'])

This function allows you to wrap pre-split data into a class that behaves like an sklearn splitter. This is useful in pipeline searches.

Examples

>>> splitter = sku.DataPreSplit(
...     data=[
...         ({'X': np.arange(10)}, {'X': np.arange(5)}),
...         ({'X': np.arange(5)}, {'X': np.arange(2)}),
...         ({'X': np.arange(2)}, {'X': np.arange(3)}),
...         ],
...     split_fit_on=['X'],
...     )
>>> X = splitter.reformat_X()
>>> for train_idx, val_idx in splitter.split(X['X']):
...     train_data, val_data = X['X'][train_idx], X['X'][val_idx]
...     do_things(train_data, val_data)
Parameters:
  • data (-) – The pre-split data. Please ensure all splits have the same keys.

  • split_fit_on (-) – The labels in the data dictionaries to split the data on. Defaults to ['X', 'y'].

Raises:

TypeError – If data is not a list or tuple.

get_n_splits(groups: Any | None = None) int

Returns the number of splits.

Parameters:

groups (-) – Ignored. Defaults to None.

Returns:

- out – The number of splits.

Return type:

int:

reformat_X() Dict[str, ndarray]

This reformats the X so that it can be split by the indices returned in splits. It essentially concatenates all of the data dictionaries.

Returns:

- out – Dictionary containing concatenated arrays from the pre-split data.

Return type:

Dict[str, np.ndarray]:

split(X: Dict[str, ndarray], y: Any | None = None, groups: Any | None = None)

This returns the training and testing indices.

Parameters:
  • X (-) – A data dictionary that is only used to ensure that an array of the right shape is being used for the splitting operation.

  • y (-) – Ignored. Defaults to None.

  • groups (-) – Ignored. Defaults to None.

Returns:

The train and test indices, wrapped in a generator. See the Examples for an understanding of the output.

Return type:

  • out

class sku.model_selection.RepeatedStratifiedGroupKFold(*, n_splits=5, n_repeats=10, random_state=None)

Bases: _RepeatedSplits

Repeated Stratified Group K-Fold cross validator. Repeats Stratified Group K-Fold n times with different randomization in each repetition. Read more in the User Guide.

Parameters:
  • n_splits (int) – Number of folds. Must be at least 2. Defaults to 5.

  • n_repeats (int) – Number of times the cross-validator needs to be repeated. Defaults to 10.

  • random_state (int) – Controls the generation of the random states for each repetition. Pass an int for reproducible output across multiple function calls. See Glossary. Defaults to None.

Examples

>>> import numpy as np
>>> from sklearn.model_selection import RepeatedStratifiedKFold
>>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
>>> y = np.array([0, 0, 1, 1])
>>> rskf = RepeatedStratifiedKFold(n_splits=2, n_repeats=2,
...     random_state=36851234)
>>> rskf.get_n_splits(X, y)
4
>>> print(rskf)
RepeatedStratifiedKFold(n_repeats=2, n_splits=2, random_state=36851234)
>>> for i, (train_index, test_index) in enumerate(rskf.split(X, y)):
...     print(f"Fold {i}:")
...     print(f"  Train: index={train_index}")
...     print(f"  Test:  index={test_index}")
...
Fold 0:
  Train: index=[1 2]
  Test:  index=[0 3]
Fold 1:
  Train: index=[0 3]
  Test:  index=[1 2]
Fold 2:
  Train: index=[1 3]
  Test:  index=[0 2]
Fold 3:
  Train: index=[0 2]
  Test:  index=[1 3]

Notes

Randomized CV splitters may return different results for each call of split. You can make the results identical by setting random_state to an integer.

See also

RepeatedKFold

Repeats K-Fold n times.

__init__(*, n_splits=5, n_repeats=10, random_state=None)

sku.model_selection.train_test_group_split(*arrays, y, group, test_size: float | None = None, train_size: float | None = None, random_state: None | int = None, shuffle: bool = True)

This function returns the train and test data given the split and the data. A single group will not be in both the training and testing set. You should use either test_size or train_size but not both.

Example

>>> (X_train, X_test,
...     y_train, y_test,
...     ids_train, ids_test) = train_test_group_split(X, y=y, group=group, test_size=0.33)

Parameters:
  • arrays (-) – The data to split into training and testing sets. The labels and the group should be passed to y and group respectively.

  • y (-) – Label data with shape (n_samples), where n_samples is the number of samples. These are the labels that are used to group the data into either the training or testing set.

  • group (-) – Event data with shape (n_samples), where n_samples is the number of samples. These are the group ids that are used to group the data into either the training or testing set.

  • test_size (-) – This dictates the size of the outputted test set. This should be used if train_size=None. If neither test_size nor train_size is given, then test_size will default to 0.25. Defaults to None.

  • train_size (-) – This dictates the size of the outputted train set. This should be used if test_size=None. Defaults to None.

  • shuffle (-) – Dictates whether the data should be shuffled before the split is made. Defaults to True.

  • random_state (-) – This dictates the random seed that is used in the random operations for this function.

Returns:

- split arrays – This is a list of the input data, split into the training and testing sets. See the Example for an understanding of the order of the outputted arrays.

Return type:

list:

Pipeline

Here is the documentation for the pipeline functionality.

class sku.pipeline.Pipeline(steps, *, memory=None, verbose=False)

Bases: Pipeline

Parameters:

steps (List[Any]) –

predict(X: ndarray, **predict_params)

Differs from the sklearn implementation in that it allows you to pass parameters to each level of the pipeline, similarly to fit_params.

Parameters:
  • X (iterable) – Data to predict on. Must fulfill input requirements of first step of the pipeline.

  • **predict_params (dict of string -> object) – Parameters passed to the predict method at each step.

Returns:

y_pred – Result of calling predict on the final estimator.

Return type:

ndarray of shape (n_samples, n_classes)

predict_log_proba(X: ndarray, **predict_log_proba_params)

Differs from the sklearn implementation in that it allows you to pass parameters to each level of the pipeline, similarly to fit_params.

Parameters:
  • X (iterable) – Data to predict on. Must fulfill input requirements of first step of the pipeline.

  • **predict_log_proba_params (dict of string -> object) – Parameters passed to the predict_log_proba method at each step.

Returns:

y_log_proba – Result of calling predict_log_proba on the final estimator.

Return type:

ndarray of shape (n_samples, n_classes)

predict_proba(X: ndarray, **predict_proba_params)

Differs from the sklearn implementation in that it allows you to pass parameters to each level of the pipeline, similarly to fit_params.

Parameters:
  • X (iterable) – Data to predict on. Must fulfill input requirements of first step of the pipeline.

  • **predict_proba_params (dict of string -> object) – Parameters passed to the predict_proba method at each step.

Returns:

y_proba – Result of calling predict_proba on the final estimator.

Return type:

ndarray of shape (n_samples, n_classes)

steps: List[Any]

transform(X, **transform_params)

Differs from the sklearn implementation in that it allows you to pass parameters to each level of the pipeline, similarly to fit_params.

Parameters:
  • X (iterable) – Data to transform. Must fulfill input requirements of first step of the pipeline.

  • **transform_params (dict of string -> object) – Parameters passed to the transform method at each step.

Returns:

Xt – Transformed data.

Return type:

ndarray of shape (n_samples, n_transformed_features)

class sku.pipeline.PipelineDD(steps, *, memory=None, verbose=False)

Bases: Pipeline

Parameters:

steps (List[Any]) –
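For example, a minimal sketch of constructing and using a data-dictionary pipeline (the steps and data are illustrative, using the wrappers documented above):

import numpy as np
import sku
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipeline = sku.PipelineDD([
    ('standard_scaler', sku.SKTransformerWrapperDD(
        StandardScaler, fit_on=['X'], transform_on=['X'],
        )),
    ('lr', sku.SKModelWrapperDD(
        LogisticRegression, fit_on=['X', 'y'], predict_on=['X'],
        )),
    ])

X = {'X': np.random.rand(20, 4), 'y': np.random.randint(0, 2, 20)}
pipeline.fit(X)
predictions = pipeline.predict(X)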

fit(X: Dict[str, ndarray], y: None = None, *args, **kwargs) PipelineDD

This will fit the pipeline.

Parameters:
  • X (-) – A dictionary containing the data. When supplying X, please be mindful of the objects in the pipeline and their data requirements. If you are using model wrappers and transformer wrappers from this package, then using a dictionary will be more powerful. However, you may supply a numpy.ndarray, but all fit_on, and transform_on and predict_on arguments will be ignored and sklearn defaults will be used. An example X: X = {'X': X_DATA, 'y': Y_DATA, **kwargs}.

  • y (-) – Ignored unless X is a numpy.ndarray. If using a data dictionary, please pass labels in the dictionary to X. Defaults to None.

Returns:

- self – This object.

Return type:

PipelineDD:

predict(X: Dict[str, ndarray], *args, **kwargs) ndarray | Dict[str, ndarray]

This will predict using the fitted pipeline.

Parameters:

X (-) – A dictionary containing the data. If X is a numpy.ndarray, then the predict_on arguments will be ignored and the model will be passed .predict(X). In this case, consider using sklearn. In addition, this will be performed on the first fitted model if many are fitted. For example: X = {'X': X_DATA, 'y': Y_DATA, **kwargs}.

Returns:

- predictions – The predictions, as a numpy array. If multiple inner lists are given as predict_on, then a list of predictions will be returned.

Return type:

numpy.ndarray:

score(X: Dict[str, ndarray], y: str | ndarray, sample_weight=None)

Transform the data, and apply score with the final estimator. Call transform of each transformer in the pipeline. The transformed data are finally passed to the final estimator that calls score method. Only valid if the final estimator implements score.

Parameters:
  • X (-) – A dictionary containing the data. For example: X = {'X': X_DATA, 'y': Y_DATA, **kwargs}.

  • y (-) – Please either pass a string, which corresponds to the key in X which contains the labels, or pass the labels themselves.

  • sample_weight (-) – If not None, this argument is passed as :code:sample_weight keyword argument to the :code:score method of the final estimator. Defaults to None.

Returns:

- score – Result of calling score on the final estimator.

Return type:

float

steps: List[Any]

transform(X: Dict[str, ndarray]) Dict[str, ndarray]

Transform the data, and apply transform with the final estimator.

Call transform of each transformer in the pipeline. The transformed data are finally passed to the final estimator that calls transform method. Only valid if the final estimator implements transform.

This also works where final estimator is None in which case all prior transformations are applied.

Parameters:

X (iterable) – Data to transform. Must fulfill input requirements of first step of the pipeline.

Returns:

Xt – Transformed data.

Return type:

ndarray of shape (n_samples, n_transformed_features)

sku.pipeline.pipeline_constructor(name: str, name_to_object: Dict[str, BaseEstimator], **pipeline_kwargs) PipelineDD

A function that constructs a pipeline from a string of pipeline keys and a dictionary mapping the keys to pipeline objects.

This function will copy each object before adding it to the pipeline. This means that repeating objects in the name will produce separate objects for each.

Example

>>> name_to_object = {
...     'standard_scaler': sku.SKTransformerWrapperDD(
...         StandardScaler,
...         fit_on=['X'],
...         transform_on=['X'],
...         ),
...     'ae': sku.SKTransformerWrapperDD(
...         transformer=AEModel,
...         n_input=19,
...         n_embedding=19,
...         n_epochs=5,
...         n_layers=2,
...         dropout=0.2,
...         optimizer={'adam': {'lr': 0.01}},
...         criterion=nn.MSELoss(),
...         fit_on=['X_unlabelled'],
...         transform_on=[['X'], ['X_unlabelled']],
...         ),
...     'mlp': sku.SKModelWrapperDD(
...         MLPModel,
...         n_input=19,
...         n_output=2,
...         hidden_layer_sizes=(100, 50),
...         dropout=0.2,
...         n_epochs=10,
...         optimizer={'adam': {'lr': 0.01}},
...         criterion=nn.CrossEntropyLoss(),
...         fit_on=['X', 'y'],
...         predict_on=['X'],
...         ),
...     }
>>> name = 'standard_scaler--ae--standard_scaler--mlp'
>>> pipeline = pipeline_constructor(name, name_to_object)

Here, the pipeline steps will be:

[
    ['standard_scaler1', sku.SKTransformerWrapperDD],
    ['ae', sku.SKTransformerWrapperDD],
    ['standard_scaler2', sku.SKTransformerWrapperDD],
    ['mlp', sku.SKModelWrapperDD],
]

Note the change in names of the standard scalers to ensure that object names are unique.

Parameters:
  • name (-) – A string in which the pipeline object names are split by '--'. For example: 'standard_scaler--ae--standard_scaler--mlp'.

  • name_to_object (-) – A dictionary mapping the strings in the name argument to objects that will be placed in the pipeline. These objects must have a .fit() and .predict() method and use data dictionaries. Please see sku.SKTransformerWrapperDD and sku.SKModelWrapperDD for further explanation of structure.

  • pipeline_kwargs (-) – These are any extra keyword arguments that will be passed to sku.PipelineDD.

Returns:

- out – A data dictionary pipeline.

Return type:

sku.pipeline.PipelineDD:

Pipeline Searching

Here is the documentation for the pipeline searching functionality.

class sku.pipeline_searcher.PipelineBasicSearchCV(pipeline_names: List[str], name_to_object: Dict[str, BaseEstimator], metrics: Dict[str, Callable], metrics_probability: Dict[str, Callable] = {}, cv=None, repeat: int = 1, split_fit_on: List[str] = ['X', 'y'], split_transform_on: List[str] = ['X', 'y'], verbose: bool = False, n_jobs: int = 1, combine_splits: bool = False, combine_runs: bool = False, opt_metric: str | None = None, minimise: bool = True, bootstrap: bool = False)

Bases: BaseEstimator

__init__(pipeline_names: List[str], name_to_object: Dict[str, BaseEstimator], metrics: Dict[str, Callable], metrics_probability: Dict[str, Callable] = {}, cv=None, repeat: int = 1, split_fit_on: List[str] = ['X', 'y'], split_transform_on: List[str] = ['X', 'y'], verbose: bool = False, n_jobs: int = 1, combine_splits: bool = False, combine_runs: bool = False, opt_metric: str | None = None, minimise: bool = True, bootstrap: bool = False)

This class allows you to test multiple pipelines on a supervised task, reporting on the metrics given in a table of results. Given a splitting function, it will perform cross validation on these pipelines. You may also pass your own splits.

Be careful not to set a random_state in the objects passed in the name_to_object dictionary: each model is cloned each time it is used in a pipeline, so the random_state would be the same in every run and every split. Future code will allow the passing of a random state to this object directly.

Example

name_to_object = {
    'gbt': sku.SKModelWrapperDD(
        HistGradientBoostingClassifier,
        fit_on=['X', 'y'],
        predict_on=['X'],
        ),
    'standard_scaler': sku.SKTransformerWrapperDD(
        StandardScaler,
        fit_on=['X'],
        transform_on=['X'],
        ),
    }
pipeline_names = [
    'standard_scaler--gbt',
    'gbt'
    ]
metrics = {
    'accuracy': accuracy_score,
    'recall': recall_score,
    'precision': precision_score,
    'f1': f1_score,
    }
splitter = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=1024)

pscv = PipelineBasicSearchCV(
    pipeline_names=pipeline_names,
    name_to_object=name_to_object,
    metrics=metrics,
    cv=splitter,
    split_fit_on=['X', 'y'],
    split_transform_on=['X', 'y', 'id'],
    verbose=True,
    )
X_data = {
    'X': X_labelled, 'y': y_labelled, 'id': id_labelled,
    'X_unlabelled': X_unlabelled, 'id_unlabelled': id_unlabelled,
    }
pscv.fit(X_data)
results = pscv.cv_results_
Parameters:
  • pipeline_names (-) – This is a list of strings that describe the pipelines. An example would be 'standard_scaler--ae--mlp'. The object names, separated by '--', should be keys in name_to_object.

  • name_to_object (-) – A dictionary mapping the keys in pipeline_names to the objects that will be used as transformers and models in the pipeline.

  • metrics (-) – A dictionary mapping the metric names to their callable functions. These functions should take the form: func(labels, predictions).

  • metrics_probability (-) – A dictionary mapping the metric names to their callable functions. These functions should take the form: func(labels, prediction_probabilities). Defaults to {}.

  • cv (-) – This is the class that is used to produce the cross validation data. It should have the method .split(X, y, event), which returns the indices of the training and testing set, and the method get_n_splits(), which should return the number of splits that this splitter was intended to make. Alternatively, if you pass None, you may pass the training and validation data dictionaries themselves, in a tuple or list of tuples in the structure (data_train, data_val) to the fit method in place of the argument X. Defaults to None.

  • repeat (-) – The number of times to repeat the experiment. Defaults to 1.

  • split_fit_on (-) – The keys corresponding to the values in the data dictionary passed in .fit() that the cv will take as positional arguments to the split() function. If cv=None then this is ignored. Defaults to ['X', 'y'].

  • split_transform_on (-) – The keys corresponding to the values in the data dictionary passed in .fit() that the cv will split into training and testing data. This allows you to split data that isn’t used in finding the splitting indices. If cv=None then this is ignored. Defaults to ['X', 'y'].

  • verbose (-) – Whether to print progress as the models are being tested. Remember that you might also need to change the verbose options in each of the objects given in name_to_object. Defaults to False.

  • n_jobs (-) – The number of parallel jobs. -1 will run the searches on all cores, but will incur significant memory and cpu cost. Defaults to 1.

  • combine_splits (-) – Whether to combine the predictions over the splits before calculating the metrics. This can help reduce the variance in results when using Leave-One-Out. Defaults to False.

  • combine_runs (-) – Whether to combine the predictions over the runs before calculating the metrics. If True, combine_splits must also be True. Defaults to False.

  • opt_metric (-) – The metric values to use when determining the optimal parameters. If None, the first metric given in metrics.keys() will be used. If a str, this should be a key in metrics. Defaults to None.

  • minimise (-) – Whether to minimise the metric given in opt_metric. If False, the metric will be maximised. Defaults to True.

  • bootstrap (-) – Whether to build a bootstrap training sample data before fitting the model. This will be done on every run of every split. Defaults to False.

property best_params_
fit(X: Dict[str, ndarray] | Tuple[Dict[str, ndarray], Dict[str, ndarray]] | List[Tuple[Dict[str, ndarray], Dict[str, ndarray]]], y: str) DataFrame

This function fits and predicts the pipelines, with the optional parameters and splitting arguments, and produces a table of results given the metrics.

Parameters:
  • X (-) – The data dictionary that will be used to run the experiments. This may also be a tuple of data dictionaries used for training and validation splits if cv=None. In this case, you may also pass a list of tuples of data dictionaries in the form [(data_train, data_val), (data_train, data_val), ...], where data_train and data_val are data dictionaries.

  • y (-) – Please pass a string, which corresponds to the key in X which contains the labels.

Return type:

DataFrame

class sku.pipeline_searcher.PipelineBayesSearchCV(*args, param_grid: List[Dict[str, List[Any]]] | None = None, max_iter: int = 10, **kwargs)

Bases: PipelineBasicSearchCV

__init__(*args, param_grid: List[Dict[str, List[Any]]] | None = None, max_iter: int = 10, **kwargs)

This class allows you to test multiple pipelines on a supervised task, reporting on the metrics given in a table of results. Given a splitting function, it will perform cross validation on these pipelines. A parameter grid can also be passed, allowing the user to test multiple configurations of each pipeline. This class performs a Bayesian parameter search over the parameters given in param_grid. Note: at the moment this only supports searches over real-valued parameters.

Example

pscv = PipelineBayesSearchCV(
    pipeline_names=pipeline_names,
    name_to_object=name_to_object,
    metrics=metrics,
    cv=splitter,
    split_fit_on=['X', 'y'],
    split_transform_on=['X', 'y', 'id'],
    verbose=True,
    param_grid= [
        {'flatten_gbt__learning_rate': [5, 20]},
        {'flatten_standard_scaler__with_mean': [True, False]},
        {'flatten_mlp__dropout': [0.2, 0.9]},
        ],
    )
X_data = {
    'X': X_labelled, 'y': y_labelled, 'id': id_labelled,
    'X_unlabelled': X_unlabelled, 'id_unlabelled': id_unlabelled,
    }
pscv.fit(X_data)
results = pscv.cv_results_
Parameters:
  • param_grid (-) – A dictionary or list of dictionaries that are used as a parameter grid for testing performance of pipelines with multiple hyper-parameters. This should be of the usual format given to sklearn.model_selection.ParameterGrid when used with sklearn.pipeline.Pipeline. If None, then the pipeline is tested with the parameters given to the objects in name_to_object. All pipelines will be tested with the given parameters in addition to the parameter grid passed. Only those parameters relevant to a pipeline will be used. Defaults to None.

  • max_iter (-) – The number of calls to make on each pipeline when finding the optimum params. Defaults to 10.

class sku.pipeline_searcher.PipelineSearchCV(*args, param_grid: Dict[str, Any] | List[Dict[str, Any]] | None = None, **kwargs)

Bases: PipelineBasicSearchCV

Parameters:

param_grid (Dict[str, Any] | List[Dict[str, Any]]) –

__init__(*args, param_grid: Dict[str, Any] | List[Dict[str, Any]] | None = None, **kwargs)

This is the same as PipelineBasicSearchCV except a parameter grid can also be passed, allowing the user to test multiple configurations of each pipeline.

Example

pscv = PipelineSearchCV(
    pipeline_names=pipeline_names,
    name_to_object=name_to_object,
    metrics=metrics,
    cv=splitter,
    split_fit_on=['X', 'y'],
    split_transform_on=['X', 'y', 'id'],
    verbose=True,
    param_grid={
    'gbt__learning_rate':[0.1, 0.01],
    'gbt__max_depth':[3, 10],
    },
    )
X_data = {
    'X': X_labelled, 'y': y_labelled, 'id': id_labelled,
    'X_unlabelled': X_unlabelled, 'id_unlabelled': id_unlabelled,
    }
pscv.fit(X_data)
results = pscv.cv_results_
Parameters:
  • args (-) – All the same args as PipelineBasicSearchCV.

  • param_grid (-) – A dictionary or list of dictionaries that are used as a parameter grid for testing performance of pipelines with multiple hyper-parameters. This should be of the usual format given to sklearn.model_selection.ParameterGrid when used with sklearn.pipeline.Pipeline. If None, then the pipeline is tested with the parameters given to the objects in name_to_object. All pipelines will be tested with the given parameters in addition to the parameter grid passed. Only those parameters relevant to a pipeline will be used. Defaults to None.

  • kwargs (-) – All the same kwargs as PipelineBasicSearchCV.

Metrics

Here is the documentation for the metric functionality.

sku.metric.positive_split(labels: ndarray, predictions: None = None)

Calculates the proportion of positive labels among all of the labels.

Parameters:
  • labels (-) – Labels should be from [0,1] or [False, True].

  • predictions (-) – Ignored. Defaults to None.

Returns:

- out – A proportion. This is calculated using the function np.sum(labels)/labels.shape[0].

Return type:

float:
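A quick sketch based on the formula above:

>>> import numpy as np
>>> from sku.metric import positive_split
>>> positive_split(np.array([0, 1, 1, 0]))
0.5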

Pre-Processing

Here is the documentation for the pre-processing functionality.

class sku.preprocessing.Flatten(start_dim: int = 0, end_dim: int = -1, copy=False)

Bases: BaseEstimator, TransformerMixin

Parameters:
  • start_dim (int) –

  • end_dim (int) –

__init__(start_dim: int = 0, end_dim: int = -1, copy=False) None

This class allows you to flatten an array inside a pipeline. This class was implemented to mirror the behaviour in https://pytorch.org/docs/stable/generated/torch.flatten.html.

Examples

>>> flat = Flatten(start_dim=1, end_dim=-1)
>>> flat.fit(None, None) # ignored
>>> flat.transform(
...     [[[1, 2],
...       [3, 4]],
...      [[5, 6],
...       [7, 8]]]
...     )
[[1, 2, 3, 4],
 [5, 6, 7, 8]]
Parameters:
  • start_dim (-) – The first dim to flatten. Defaults to 0.

  • end_dim (-) – The last dim to flatten. Defaults to -1.

  • copy (-) – Whether to return a copied version of the array during the transform method. Defaults to False.

Return type:

None

fit(X: ndarray | None = None, y: None = None) Flatten

This function is required for the pipelines to work, but is ignored.

Parameters:
  • X (-) – Ignored. Defaults to None.

  • y (-) – Ignored. Defaults to None.

Returns:

- self – The fitted transformer.

Return type:

Flatten:

transform(X: ndarray) ndarray

This will transform the array by returning a flattened version.

Parameters:

X (-) – The array to be flattened.

Returns:

- out – The flattened array.

Return type:

np.ndarray:

sku.preprocessing.FlattenStandardScaler

alias of FlattenWrapper

class sku.preprocessing.StandardGroupScaler(warn: bool = True)

Bases: BaseEstimator, TransformerMixin

Parameters:

warn (bool) –

__init__(warn: bool = True)

This class allows you to scale the data based on a group.

When calling transform, if the group has not been seen in the fitting method, then the global statistics will be used to scale the data (global = across all groups).

Where the mean or standard deviation are equal to NaN, in any axis on any group, that particular value will be replaced with the global mean or standard deviation for that axis (global = across all groups). If the standard deviation is returned as 0.0 then the global standard deviation and mean is used.

Parameters:

warn (-) – Whether to warn the user if they are using the grouped version and no groups are passed. Defaults to True.
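A minimal sketch of group-wise scaling (the data and groups here are illustrative):

import numpy as np
from sku.preprocessing import StandardGroupScaler

X = np.random.rand(6, 2)
groups = np.array(['a', 'a', 'a', 'b', 'b', 'b'])

scaler = StandardGroupScaler()
X_scaled = scaler.fit_transform(X, groups=groups)  # mean/std computed per group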

fit(X: ndarray, y: None | ndarray = None, groups: None | ndarray = None) StandardGroupScaler

Compute the mean and std to be used for later scaling.

Parameters:
  • X (-) – The data used to compute the mean and standard deviation used for later scaling along the features axis. This should be of shape (n_samples, n_features).

  • y (-) – Ignored. Defaults to None.

  • groups (-) – The groups to split the scaling by. This should be of shape (n_samples,). Defaults to None.

Returns:

- self – The fitted scaler.

Return type:

sku.StandardGroupScaler:

fit_transform(X: ndarray, y: ndarray | None = None, groups: ndarray | None = None) ndarray

Fit to data, then transform it. Fits transformer to X using the groups and returns a transformed version of X.

Parameters:
  • X (-) – The data used to compute the mean and standard deviation used for later scaling along the features axis. This should be of shape (n_samples, n_features).

  • groups (-) – The groups to split the scaling by. This should be of shape (n_samples,). Defaults to None.

  • y (-) – Ignored. Defaults to None.

Returns:

- X_norm – The transformed version of X.

Return type:

np.ndarray:

transform(X: ndarray, y: ndarray | None = None, groups: ndarray | None = None) ndarray

Perform standardization by centering and scaling by group.

Parameters:
  • X (-) – The data used to scale along the features axis. This should be of shape (n_samples, n_features).

  • y (-) – Ignored. Defaults to None.

  • groups (-) – The groups to split the scaling by. This should be of shape (n_samples,). Defaults to None.

Returns:

- X_norm – The transformed version of X.

Return type:

np.ndarray:

Progress Utils

Here is the documentation for the progress util functionality.

Utils

Here is the documentation for the util functionality.

sku.utils.get_default_args(func: Callable)

https://stackoverflow.com/a/12627202

Allows you to collect the default arguments of a function. This is useful for setting params in sklearn wrappers.

Parameters:

func (-) – Function to get parameter defaults for.

Returns:

- out – A dictionary mapping the function's keyword-argument names to their default values.

Return type:

dict:
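A short sketch of the expected behaviour, following the linked Stack Overflow pattern (the function here is illustrative):

>>> from sku.utils import get_default_args
>>> def f(a, b=1, c='x'):
...     pass
>>> get_default_args(f)
{'b': 1, 'c': 'x'}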

sku.utils.hasarg(func: Callable, name: str) bool

Checks whether the function can take name as an argument.

Parameters:
  • func (-) – Function

  • name (-) – Name to check in the args

Returns:

- out – True or False

Return type:

bool:
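A short usage sketch (the function here is illustrative):

>>> from sku.utils import hasarg
>>> def f(a, b=1):
...     pass
>>> hasarg(f, 'b')
True
>>> hasarg(f, 'c')
False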

sku.utils.partialclass(cls, *args, **kwds)

sku.utils.partialclass_pickleable(name, cls, *args, **kwds)