Documentation
The full documentation is available here. Use the side bar to select the sub-module that is most applicable.
All of these functions can be loaded directly from the sku package, for example:
>>> import sku
>>> sku.PipelineDD(...)
Wrapping Utilities
Here is the documentation for the useful wrapping functionality.
- class sku.flatten_wrapper.FlattenWrapper(estimator: Any | None = None, start_dim: int = 1, end_dim: int = -1, flatten_arg: int | str = 0, unflatten_transform: bool = True, **kwargs)
Bases: BaseEstimator, TransformerMixin
- __init__(estimator: Any | None = None, start_dim: int = 1, end_dim: int = -1, flatten_arg: int | str = 0, unflatten_transform: bool = True, **kwargs) None
This class allows you to wrap any transformer or model with a flattening operation. By default, the input is flattened to shape (in_shape[0], -1). Please see the flattening operations in aml.flatten to understand the arguments start_dim and end_dim.
Note: Any attribute of the underlying estimator is accessible as normal as an attribute of this class. If you require a flattened call of any method of the underlying estimator, prefix the method name with flatten__. See the examples.
Examples
>>> flatten_scaler = FlattenWrapper(
...     StandardScaler,
...     unflatten_transform=True,
... )
>>> flatten_scaler.fit(x)
>>> flatten_scaler.transform(x)
Also, since any attribute of the underlying estimator is accessible through this wrapper, you may do the following:
>>> flatten_scaler = FlattenWrapper(
...     StandardScaler,
...     unflatten_transform=True,
... )
>>> flatten_scaler.fit(x)
>>> flatten_scaler.transform(x)
>>> flatten_scaler.mean_
[0.5, 0.2, 100, 20]
If you need to call a method of the underlying estimator but still want the argument flattened first, use the following structure:
>>> flatten_lr = FlattenWrapper(
...     LogisticRegression,
... )
>>> flatten_lr.fit(x, y)
>>> flatten_lr.flatten__predict_log_proba(x)
[0, 1, 0, 1, 1, 0]
- Parameters:
estimator (-) – The transformer or model whose input requires flattening before fit and transform or predict. Defaults to None.
start_dim (-) – The first dim to flatten. Defaults to 1.
end_dim (-) – The last dim to flatten. Defaults to -1.
flatten_arg (-) – The argument of the call on fit and transform/predict that contains the array that requires flattening. This can either be the keyword, which then needs to be given in the method calls, or the index of the argument. Please keep in mind that an integer here will only work if the fit and predict methods are given positional arguments. To be safe, it is best to pass a string here and use keyword arguments in the methods. Defaults to 0.
unflatten_transform (-) – Return an unflattened version of the transformed array. If the output is a numpy array, then it will be unflattened directly. If the output is a list or tuple, then the flatten_arg of the output will be unflattened. Defaults to True.
**kwargs (-) – The keywords that will be passed to the estimator init function. Defaults to {}.
- Return type:
None
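The effect of start_dim and end_dim on an array's shape can be sketched as follows. This is a minimal illustration, assuming both arguments are inclusive dimension indices (negative values counting from the end); flattened_shape is a hypothetical helper and not part of sku.

```python
from math import prod


def flattened_shape(shape, start_dim=1, end_dim=-1):
    """Shape obtained by merging dims start_dim..end_dim (inclusive).

    Hypothetical helper for illustration only; not part of sku.
    """
    ndim = len(shape)
    for dim in (start_dim, end_dim):
        if not -ndim <= dim < ndim:
            raise ValueError(f"dim {dim} is out of range for {ndim} dims")
    # Convert negative indices to their positive equivalents.
    start = start_dim % ndim
    end = end_dim % ndim
    if start > end:
        raise ValueError("start_dim must not come after end_dim")
    merged = prod(shape[start:end + 1])
    return tuple(shape[:start]) + (merged,) + tuple(shape[end + 1:])


# The defaults keep the first axis and merge the rest: (in_shape[0], -1).
print(flattened_shape((10, 3, 4)))           # (10, 12)
# Merging only the two middle axes.
print(flattened_shape((10, 3, 4, 5), 1, 2))  # (10, 12, 5)
```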
- property estimator
This is the estimator, which can be accessed as an attribute of this class.
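One way such attribute forwarding and the flatten__ prefix could work is sketched below. This is an assumption about the mechanism rather than the sku implementation; PrefixDelegator, Summer and flatten_rows are hypothetical names used only for illustration.

```python
class PrefixDelegator:
    """Forward unknown attributes to a wrapped object.

    Hypothetical stand-in for illustration; not the sku implementation.
    """

    PREFIX = "flatten__"

    def __init__(self, wrapped, preprocess):
        self._wrapped = wrapped
        self._preprocess = preprocess

    def __getattr__(self, name):
        # __getattr__ only runs when normal attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)
        if name.startswith(self.PREFIX):
            method = getattr(self._wrapped, name[len(self.PREFIX):])

            def call(first, *args, **kwargs):
                # Preprocess the first argument, then call the real method.
                return method(self._preprocess(first), *args, **kwargs)

            return call
        return getattr(self._wrapped, name)


class Summer:
    offset_ = 10

    def total(self, values):
        return sum(values) + self.offset_


def flatten_rows(rows):
    return [value for row in rows for value in row]


delegator = PrefixDelegator(Summer(), flatten_rows)
print(delegator.offset_)                          # 10
print(delegator.total([1, 2, 3]))                 # 16
print(delegator.flatten__total([[1, 2], [3]]))    # 16
```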
- fit(*args, **kwargs) FlattenWrapper
Compute the mean and std to be used for later scaling.
- Parameters:
X (-) – The data used to compute the mean and standard deviation used for later scaling along the features axis. This should be of shape (n_samples, n_features).
y (-) – Ignored. Defaults to None.
- Returns:
- self – The fitted scaler.
- Return type:
FlattenWrapper
- fit_transform(*args, **kwargs)
Fit and then transform with the estimator and flattened array.
- Parameters:
args (-) – Arguments passed to the estimator when transforming.
kwargs (-) – Keyword arguments passed to the estimator when transforming.
- Returns:
- args_out – The transformed input.
- Return type:
FlattenWrapper:
- get_params(deep=True) dict
Overrides sklearn function.
- Parameters:
deep (-) – Ignored. Defaults to True.
- Returns:
- out – Dictionary of parameters.
- Return type:
dict
- set_params(**params)
From sklearn documentation: Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters:
**params (-) – Estimator parameters.
- Returns:
- self – Estimator instance.
- Return type:
estimator instance
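The <component>__<parameter> naming convention can be illustrated with a small sketch of how such a key separates into the component name and the remaining parameter path. split_param_key and the example keys are hypothetical and only demonstrate the convention.

```python
def split_param_key(key):
    """Split 'component__parameter' at the first double underscore.

    Hypothetical helper for illustration only.
    """
    name, delimiter, sub_key = key.partition("__")
    # Without a delimiter the key addresses the estimator itself.
    return (name, sub_key) if delimiter else (name, None)


print(split_param_key("scaler__with_mean"))   # ('scaler', 'with_mean')
print(split_param_key("clf__estimator__C"))   # ('clf', 'estimator__C')
print(split_param_key("n_jobs"))              # ('n_jobs', None)
```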
Model Wrapper
Here is the documentation for the model wrapping functionality.
- class sku.model_wrapper.SKModelWrapperDD(model: Any, fit_on: Union[Union[List[str], List[List[str]]], Dict[str]] = [['X', 'y']], predict_on: Union[Union[List[str], List[List[str]]], Dict[str]] = [['X']], **kwargs)
Bases: BaseEstimator, ClassifierMixin
- __init__(model: Any, fit_on: Union[Union[List[str], List[List[str]]], Dict[str]] = [['X', 'y']], predict_on: Union[Union[List[str], List[List[str]]], Dict[str]] = [['X']], **kwargs) None
A wrapper for any scikit-learn model that accepts a dictionary containing the data. This is useful when combining semi-supervised methods with sklearn.pipeline.Pipeline. The model should not be instantiated yet, and all of its arguments are passed as positional or keyword arguments after the model is given.
Note: Any attribute or method of the underlying model is accessible as normal as an attribute of this class. The returned value will be wrapped in a list if multiple fit_on arguments are used, otherwise a single value is returned.
- Parameters:
model (-) – The model to wrap. This model must have both .fit(X, y) and .predict(X). An example would be an abstracted pytorch model.
fit_on (-) – This allows the user to define the keys in the data dictionary that will be passed to the fit function.
If List of List or List: The outer list will be iterated over, and the inner list’s keys will be used to get the data from the data dictionary, which will be passed in that order as positional arguments to the .fit() function. Multiple inner lists will cause the .fit() function to be called multiple times, with each fitted version being saved as a unique object in this class. If a list of strings is given then it will be wrapped in an outer list, meaning that .fit() is called once, with arguments corresponding to the keys given as strings.
If Dict or List of Dict: The outer list will be iterated over, and the inner dicts’ values will be used to get the data from the data dictionary, which will be passed in as keyword arguments, named by the dicts’ keys, to the .fit() function. Multiple inner dicts will cause the .fit() function to be called multiple times, with each fitted version being saved as a unique object in this class. If just a dict is given then it will be wrapped in an outer list, meaning that .fit() is called once, with arguments corresponding to the keys and values as before.
The multiple fitted models will be saved in a list, accessible through the fitted_models attribute. Defaults to [['X', 'y']].
predict_on (-) – This allows the user to define the keys in the data dictionary that will be passed to the predict function. The outer list will be iterated over, and the inner list’s keys will be used to get the data from the data dictionary, which will be passed in that order as positional arguments to the .predict() function. Multiple inner lists will cause the .predict() function to be called multiple times, with each predict corresponding to the fitted object in each of the fit calls. If there are more .predict() calls than .fit() calls, then the predict will be called on the beginning of the fit object list again (i.e. the predict calls roll over the fit calls). If a list of strings is given then it will be wrapped in an outer list, meaning that .predict() is called once, with arguments corresponding to the keys given as strings. Defaults to [['X']].
**kwargs (-) – Keyword arguments given to the model init.
- Return type:
None
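The key-selection rules described for fit_on and predict_on can be sketched in plain Python. This is a minimal illustration of the documented behaviour under stated assumptions, not the sku implementation; as_list_of_specs, build_call and model_index_for_predict are hypothetical helpers, and the data keys are made up.

```python
def as_list_of_specs(spec):
    """Wrap a single list of keys (or a single dict) in an outer list."""
    if isinstance(spec, dict):
        return [spec]
    if spec and all(isinstance(key, str) for key in spec):
        return [list(spec)]
    return list(spec)


def build_call(spec, data):
    """Return the (args, kwargs) that one spec selects from the data dict."""
    if isinstance(spec, dict):
        # Dict keys name the keyword arguments; dict values are data keys.
        return (), {name: data[key] for name, key in spec.items()}
    # A list of keys becomes positional arguments, in the given order.
    return tuple(data[key] for key in spec), {}


def model_index_for_predict(i_predict, n_fitted):
    """Predict calls roll over the fitted models when there are more of them."""
    return i_predict % n_fitted


data = {"X": [[0.0], [1.0]], "y": [0, 1], "X_unlabelled": [[0.5]]}

print(as_list_of_specs(["X", "y"]))            # [['X', 'y']]
print(build_call(["X", "y"], data))            # (([[0.0], [1.0]], [0, 1]), {})
print(build_call({"X": "X_unlabelled"}, data)) # ((), {'X': [[0.5]]})
# Three predict calls over two fitted models use models 0, 1, 0.
print([model_index_for_predict(i, 2) for i in range(3)])  # [0, 1, 0]
```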
- fit(X: Dict[str, ndarray] | ndarray, y: None | ndarray = None, sample_weight: None | ndarray = None) SKModelWrapperDD
This will fit the model being wrapped.
- Parameters:
X (-) – A dictionary containing the data, for example: X = {'X': X_DATA, 'y': Y_DATA, **kwargs}. If X is a numpy.ndarray, then the fit_on arguments will be ignored and the model will be called as .fit(X, y). In this case, consider using sklearn directly.
y (-) – Ignored unless X is a numpy.ndarray. If using a data dictionary, please pass labels in the dictionary to X. If X is a numpy.ndarray then this should also be a numpy.ndarray of targets. Defaults to None.
sample_weight (-) – Array of weights that are assigned to individual samples. If not provided, then each sample is given unit weight. Ignored unless X is a numpy.ndarray. If using a data dictionary, please pass sample weights in the dictionary to X and use the fit_on argument. If X is a numpy.ndarray then this should also be a numpy.ndarray of sample weights. This argument is only passed to the underlying model’s fit function if that function includes it in its signature. Defaults to None.
- Returns:
- self – This object.
- Return type:
SKModelWrapperDD
- get_params(deep=True) dict
Overrides sklearn function.
- Parameters:
deep (-) – Ignored. Defaults to True.
- Returns:
- out – Dictionary of parameters.
- Return type:
dict
- property model
This is the model, which can be accessed as an attribute of this class. If multiple fit_on arguments were given, and the class has been fitted, then this will be a list of the fitted models. Otherwise, it will be a single instance.
- predict(X: Dict[str, ndarray], return_data_dict: bool = False) ndarray | Dict[str, ndarray]
This will predict using the model being wrapped.
- Parameters:
X (-) – A dictionary containing the data, for example: X = {'X': X_DATA, 'y': Y_DATA, **kwargs}. If X is a numpy.ndarray, then the predict_on arguments will be ignored and the model will be called as .predict(X). In this case, consider using sklearn directly. In addition, this will be performed on the first fitted model if many are fitted.
return_data_dict (bool) – Whether to return the ground truth with the output. This is useful if this model was part of a pipeline in which the labels are altered. This is ignored if X is a numpy.ndarray. Defaults to False.
- Returns:
- predictions (numpy.ndarray) – The predictions, as a numpy array. If multiple inner lists are given as predict_on, then a list of predictions will be returned.
- data_dict (typing.Dict[str, np.ndarray]) – The labels, as a numpy array. Only returned if return_data_dict=True.
- predict_proba(X: Dict[str, ndarray], return_data_dict: bool = False) ndarray | Dict[str, ndarray]
This will predict the probabilities using the model being wrapped.
- Parameters:
X (-) – A dictionary containing the data, for example: X = {'X': X_DATA, 'y': Y_DATA, **kwargs}. If X is a numpy.ndarray, then the predict_on arguments will be ignored and the model will be called as .predict_proba(X). In this case, consider using sklearn directly. In addition, this will be performed on the first fitted model if many are fitted.
return_data_dict (-) – Whether to return the ground truth with the output. This is useful if this model was part of a pipeline in which the labels are altered. This is ignored if X is a numpy.ndarray. Defaults to False.
- Returns:
- predictions (numpy.ndarray) – The predictions, as a numpy array. If multiple inner lists are given as predict_on, then a list of predictions will be returned.
- data_dict (typing.Dict[str, np.ndarray]) – The labels, as a numpy array. Only returned if return_data_dict=True.
- set_params(**params)
From sklearn documentation: Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters:
**params (-) – Estimator parameters.
- Returns:
- self – Estimator instance.
- Return type:
estimator instance
Transformer Wrapper
Here is the documentation for the transformer wrapping functionality.
- class sku.transformer_wrapper.SKTransformerWrapperDD(transformer: Any, fit_on: Union[Union[List[str], List[List[str]]], Dict[str]] = [['X', 'y']], transform_on: Union[Union[List[str], List[List[str]]], Dict[str]] = [['X']], all_key_transform=False, **kwargs)
Bases: BaseEstimator, TransformerMixin
- __init__(transformer: Any, fit_on: Union[Union[List[str], List[List[str]]], Dict[str]] = [['X', 'y']], transform_on: Union[Union[List[str], List[List[str]]], Dict[str]] = [['X']], all_key_transform=False, **kwargs) None
A wrapper for any scikit-learn transformer that accepts a dictionary containing the data. This is useful when combining semi-supervised methods with sklearn.pipeline.Pipeline.
Note: Any attribute or method of the underlying transformer is accessible as normal as an attribute of this class. The returned value will be wrapped in a list if multiple fit_on arguments are used, otherwise a single value is returned.
- Parameters:
transformer (-) – The transformer to wrap. This transformer must have both .fit(X, y) and .transform(X). An example would be a custom transformer that uses data dictionaries.
fit_on (-) – This allows the user to define the keys in the data dictionary that will be passed to the fit function.
If List of List or List: The outer list will be iterated over, and the inner list’s keys will be used to get the data from the data dictionary, which will be passed in that order as positional arguments to the .fit() function. Multiple inner lists will cause the .fit() function to be called multiple times, with each fitted version being saved as a unique object in this class. If a list of strings is given then it will be wrapped in an outer list, meaning that .fit() is called once, with arguments corresponding to the keys given as strings.
If Dict or List of Dict: The outer list will be iterated over, and the inner dicts’ values will be used to get the data from the data dictionary, which will be passed in as keyword arguments, named by the dicts’ keys, to the .fit() function. Multiple inner dicts will cause the .fit() function to be called multiple times, with each fitted version being saved as a unique object in this class. If just a dict is given then it will be wrapped in an outer list, meaning that .fit() is called once, with arguments corresponding to the keys and values as before.
The multiple fitted transforms will be saved in a list, accessible through the fitted_transforms attribute. Defaults to [['X', 'y']].
transform_on (-) – This allows the user to define the keys in the data dictionary that will be passed to the transform function. The outer list will be iterated over, and the inner list’s keys will be used to get the data from the data dictionary, which will be passed in that order as positional arguments to the .transform() function. Multiple inner lists will cause the .transform() function to be called multiple times, with each transform corresponding to the fitted object in each of the fit calls. If there are more .transform() calls than .fit() calls, then the transform will be called on the beginning of the fit object list again (i.e. the transform calls roll over the fit calls). The first key in each inner list will be overwritten with the result from .transform(), unless all_key_transform=True. If a list of strings is given then it will be wrapped in an outer list, meaning that .transform() is called once, with arguments corresponding to the keys given as strings. Defaults to [['X']].
all_key_transform (-) – This dictates whether the transformer being wrapped will output a result for all of the arguments given to it. If True, each of the values corresponding to the keys being transformed on will be replaced by the corresponding output of the wrapped transform, i.e. x, y, z = self.wrapped_transform(x, y, z) rather than x = self.wrapped_transform(x, y, z). If False, only the first key will be transformed. Defaults to False.
**kwargs (-) – Keyword arguments given to the transformer.
- Return type:
None
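The difference between all_key_transform=False and all_key_transform=True can be sketched with ordinary dictionaries. This is an illustration of the documented behaviour, not the sku implementation; apply_transform, double and drop_missing are hypothetical.

```python
def apply_transform(data, keys, transform, all_key_transform=False):
    """Apply transform to data[keys...] and write the result back.

    Hypothetical helper for illustration only.
    """
    out = dict(data)  # leave the caller's dictionary untouched
    result = transform(*(data[key] for key in keys))
    if all_key_transform:
        # The transform returns one output per input key.
        for key, value in zip(keys, result):
            out[key] = value
    else:
        # Only the first key is overwritten with the single output.
        out[keys[0]] = result
    return out


def double(x):
    return [2 * value for value in x]


def drop_missing(x, y):
    kept = [(a, b) for a, b in zip(x, y) if a is not None]
    return [a for a, _ in kept], [b for _, b in kept]


print(apply_transform({"X": [1, 2], "y": [0, 1]}, ["X"], double))
# {'X': [2, 4], 'y': [0, 1]}
print(apply_transform({"X": [1, None, 3], "y": [0, 1, 1]},
                      ["X", "y"], drop_missing, all_key_transform=True))
# {'X': [1, 3], 'y': [0, 1]}
```

A row-dropping transform is a natural case for all_key_transform=True, since X and y must stay aligned after rows are removed.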
- fit(X: Dict[str, ndarray] | ndarray, y: None | ndarray = None, sample_weight: None | ndarray = None) SKTransformerWrapperDD
This will fit the transformer being wrapped.
- Parameters:
X (-) – A dictionary containing the data, for example: X = {'X': X_DATA, 'y': Y_DATA, **kwargs}. If X is a numpy.ndarray, then the fit_on arguments will be ignored and the transformer will be called as .fit(X, y). In this case, consider using sklearn directly.
y (-) – Ignored unless X is a numpy.ndarray. If using a data dictionary, please pass labels in the dictionary to X. If X is a numpy.ndarray then this should also be a numpy.ndarray of targets. Defaults to None.
sample_weight (-) – Array of weights that are assigned to individual samples. If not provided, then each sample is given unit weight. Ignored unless X is a numpy.ndarray. If using a data dictionary, please pass sample weights in the dictionary to X and use the fit_on argument. If X is a numpy.ndarray then this should also be a numpy.ndarray of sample weights. This argument is only passed to the underlying transform’s fit function if that function includes it in its signature. Defaults to None.
- Returns:
- self – This object.
- Return type:
SKTransformerWrapperDD
- fit_transform(X: Dict[str, ndarray] | ndarray, y: None | ndarray = None, **fit_params)
This will fit and transform the transformer being wrapped and the data given.
- Parameters:
X (-) – A dictionary containing the data, for example: X = {'X': X_DATA, 'y': Y_DATA, **kwargs}. If X is a numpy.ndarray, then the fit_on arguments will be ignored and the transformer will be called as .fit_transform(X, y). In this case, consider using sklearn directly.
y (-) – Ignored unless X is a numpy.ndarray. If using a data dictionary, please pass labels in the dictionary to X. Defaults to None.
**fit_params (-) – These are passed to the fit function.
- Returns:
- X_out – The data dictionary, with structure the same as X, with transformed X data. If X is a numpy.ndarray, then a numpy.ndarray will be returned.
- get_params(deep=True) dict
Overrides sklearn function.
- Parameters:
deep (-) – Ignored. Defaults to True.
- Returns:
- out – Dictionary of parameters.
- Return type:
dict
- set_params(**params)
From sklearn documentation: Set the parameters of this estimator. The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters:
**params (-) – Estimator parameters.
- Returns:
- self – Estimator instance.
- Return type:
estimator instance
- transform(X: Dict[str, ndarray]) Dict[str, ndarray]
This will transform the data using the transformer being wrapped.
- Parameters:
X (-) – A dictionary containing the data, for example: X = {'X': X_DATA, 'y': Y_DATA, **kwargs}. If X is a numpy.ndarray, then the transform_on arguments will be ignored and the transform will be called as .transform(X). In this case, consider using sklearn directly. In addition, this will be performed on the first fitted transform if many are fitted.
- Returns:
- X_out – The data dictionary, with structure the same as X, with transformed X data. If X is a numpy.ndarray, then a numpy.ndarray will be returned.
- property transformer
This is the transformer, which can be accessed as an attribute of this class. If multiple fit_on arguments were given, and the class has been fitted, then this will be a list of the fitted transformers. Otherwise, it will be a single instance.
Transformer
Here is the documentation for the transformer functionality.
- class sku.transformer.DropNaNRowsDD(*, transformer: typing.Any = <class 'sku.transformer._DropNaNRowsDD'>, fit_on: typing.Union[typing.Union[typing.List[str], typing.List[typing.List[str]]], typing.Dict[str]] = [['X', 'y']], transform_on: typing.Union[typing.Union[typing.List[str], typing.List[typing.List[str]]], typing.Dict[str]] = [['X']], all_key_transform=True, **kwargs)
Bases: SKTransformerWrapperDD
- __init__(*, transformer: typing.Any = <class 'sku.transformer._DropNaNRowsDD'>, fit_on: typing.Union[typing.Union[typing.List[str], typing.List[typing.List[str]]], typing.Dict[str]] = [['X', 'y']], transform_on: typing.Union[typing.Union[typing.List[str], typing.List[typing.List[str]]], typing.Dict[str]] = [['X']], all_key_transform=True, **kwargs) None
The constructor documentation is inherited unchanged from SKTransformerWrapperDD: see the Transformer Wrapper section above for the description of transformer, fit_on, transform_on, all_key_transform and **kwargs. As the signature shows, for this class transformer defaults to sku.transformer._DropNaNRowsDD and all_key_transform defaults to True.
- Return type:
None
- class sku.transformer.NumpyToDict
Bases: BaseEstimator, TransformerMixin
- __init__()
This class transforms two numpy arrays containing the inputs and targets into a dictionary containing the same arrays.
- fit(X: None | ndarray = None, y: None | ndarray = None) NumpyToDict
This will save the numpy arrays to the class, ready to return them as a dictionary in the .transform() method.
- Parameters:
X (-) – The inputs. Defaults to None.
y (-) – The targets. Defaults to None.
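The store-then-return pattern described here can be sketched as below. This is not the sku source: ArraysToDict is a hypothetical class, and both the dictionary keys 'X' and 'y' and the transform signature are assumptions made for illustration.

```python
class ArraysToDict:
    """Store inputs and targets in fit, return them as a dict in transform.

    Hypothetical sketch; the keys 'X' and 'y' are assumed.
    """

    def fit(self, X=None, y=None):
        # Keep references to the arrays so transform can return them.
        self._X = X
        self._y = y
        return self

    def transform(self, X=None):
        # The argument is ignored here; the stored arrays are returned.
        return {"X": self._X, "y": self._y}


converter = ArraysToDict().fit([[1.0], [2.0]], [0, 1])
print(converter.transform())  # {'X': [[1.0], [2.0]], 'y': [0, 1]}
```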
- class sku.transformer.StandardGroupScalerDD(*, transformer: typing.Any = <class 'sku.preprocessing.StandardGroupScaler'>, fit_on: typing.Union[typing.Union[typing.List[str], typing.List[typing.List[str]]], typing.Dict[str]] = [['X', 'y']], transform_on: typing.Union[typing.Union[typing.List[str], typing.List[typing.List[str]]], typing.Dict[str]] = [['X']], all_key_transform=False, **kwargs)
Bases: SKTransformerWrapperDD
- __init__(*, transformer: typing.Any = <class 'sku.preprocessing.StandardGroupScaler'>, fit_on: typing.Union[typing.Union[typing.List[str], typing.List[typing.List[str]]], typing.Dict[str]] = [['X', 'y']], transform_on: typing.Union[typing.Union[typing.List[str], typing.List[typing.List[str]]], typing.Dict[str]] = [['X']], all_key_transform=False, **kwargs) None
The constructor documentation is inherited unchanged from SKTransformerWrapperDD: see the Transformer Wrapper section above for the description of transformer, fit_on, transform_on, all_key_transform and **kwargs. As the signature shows, for this class transformer defaults to sku.preprocessing.StandardGroupScaler and all_key_transform defaults to False.
- Return type:
None
Model Selection
Here is the documentation for the model selection functionality.
- class sku.model_selection.DataPreSplit(data: List[Tuple[Dict[str, ndarray], Dict[str, ndarray]]] | Tuple[Dict[str, ndarray], Dict[str, ndarray]], split_fit_on: List[str] = ['X', 'y'])
Bases: object
- __init__(data: List[Tuple[Dict[str, ndarray], Dict[str, ndarray]]] | Tuple[Dict[str, ndarray], Dict[str, ndarray]], split_fit_on: List[str] = ['X', 'y'])
This function allows you to wrap pre-split data into a class that behaves like an sklearn splitter. This is useful in pipeline searches.
Examples
>>> splitter = sku.DataPreSplit(
...     data=[
...         (
...             {'X': np.arange(10)},
...             {'X': np.arange(5)},
...         ),
...         (
...             {'X': np.arange(5)},
...             {'X': np.arange(2)},
...         ),
...         (
...             {'X': np.arange(2)},
...             {'X': np.arange(3)},
...         ),
...     ],
...     split_fit_on=['X'],
... )
>>> X = splitter.reformat_X()
>>> for train_idx, val_idx in splitter.split(X['X']):
...     train_data, val_data = X['X'][train_idx], X['X'][val_idx]
...     do_things(train_data, val_data)
- Parameters:
data (-) – The pre-split data. Please ensure all splits have the same keys.
split_fit_on (-) – The labels in the data dictionaries to split the data on. Defaults to ['X', 'y'].
- Raises:
TypeError – If data is not a list or tuple.
- get_n_splits(groups: Any | None = None) int
Returns the number of splits.
- Parameters:
groups (-) – Ignored. Defaults to None.
- Returns:
- out – The number of splits.
- Return type:
int
- reformat_X() Dict[str, ndarray]
This reformats the X so that it can be split by the indices returned by split. It essentially concatenates all of the data dictionaries.
- split(X: Dict[str, ndarray], y: Any | None = None, groups: Any | None = None)
This returns the training and testing indices.
- Parameters:
X (-) – A data dictionary that is only used to ensure that an array of the right shape is being used for the splitting operation.
y (-) – Ignored. Defaults to None.
groups (-) – Ignored. Defaults to None.
- Returns:
- out – The train and test indices, wrapped in a generator. See the Examples for an understanding of the output.
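The relationship between concatenated pre-split data and the indices that recover each split can be sketched as follows. The layout used here (each split's training block followed by its validation block) is an assumption made for illustration and is not confirmed by the documentation; concatenate_splits and iter_split_indices are hypothetical helpers.

```python
def concatenate_splits(splits, key):
    """Concatenate every train and validation block for one key."""
    merged = []
    for train, val in splits:
        merged.extend(train[key])
        merged.extend(val[key])
    return merged


def iter_split_indices(splits, key):
    """Yield (train_idx, val_idx) positions into the concatenated data."""
    offset = 0
    for train, val in splits:
        n_train, n_val = len(train[key]), len(val[key])
        train_idx = list(range(offset, offset + n_train))
        val_idx = list(range(offset + n_train, offset + n_train + n_val))
        yield train_idx, val_idx
        offset += n_train + n_val


splits = [
    ({"X": ["a", "b", "c"]}, {"X": ["d", "e"]}),
    ({"X": ["f", "g"]}, {"X": ["h"]}),
]
merged = concatenate_splits(splits, "X")
for train_idx, val_idx in iter_split_indices(splits, "X"):
    # Indexing the concatenated data recovers the original blocks.
    print([merged[i] for i in train_idx], [merged[i] for i in val_idx])
```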
- class sku.model_selection.RepeatedStratifiedGroupKFold(*, n_splits=5, n_repeats=10, random_state=None)
Bases: _RepeatedSplits
Repeated Stratified Group K-Fold cross validator. Repeats Stratified Group K-Fold n times with different randomization in each repetition. Read more in the User Guide.
- Parameters:
n_splits (int, default=5) – Number of folds. Must be at least 2.
n_repeats (int, default=10) – Number of times cross-validator needs to be repeated.
random_state – Controls the generation of the random states for each repetition. Pass an int for reproducible output across multiple function calls. See Glossary.
Examples
>>> import numpy as np
>>> from sklearn.model_selection import RepeatedStratifiedKFold
>>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
>>> y = np.array([0, 0, 1, 1])
>>> rskf = RepeatedStratifiedKFold(n_splits=2, n_repeats=2,
...     random_state=36851234)
>>> rskf.get_n_splits(X, y)
4
>>> print(rskf)
RepeatedStratifiedKFold(n_repeats=2, n_splits=2, random_state=36851234)
>>> for i, (train_index, test_index) in enumerate(rskf.split(X, y)):
...     print(f"Fold {i}:")
...     print(f"  Train: index={train_index}")
...     print(f"  Test:  index={test_index}")
...
Fold 0:
  Train: index=[1 2]
  Test:  index=[0 3]
Fold 1:
  Train: index=[0 3]
  Test:  index=[1 2]
Fold 2:
  Train: index=[1 3]
  Test:  index=[0 2]
Fold 3:
  Train: index=[0 2]
  Test:  index=[1 3]
Notes
Randomized CV splitters may return different results for each call of split. You can make the results identical by setting random_state to an integer.
See also
RepeatedKFold
Repeats K-Fold n times.
- __init__(*, n_splits=5, n_repeats=10, random_state=None)
- sku.model_selection.train_test_group_split(*arrays, y, group, test_size: float | None = None, train_size: float | None = None, random_state: None | int = None, shuffle: bool = True)
This function returns the train and test data given the split and the data. A single group will not be in both the training and testing set. You should use either test_size or train_size but not both.
Example
>>> (X_train, X_test,
...  y_train, y_test,
...  ids_train, ids_test) = train_test_group_split(X, y=y, group=group, test_size=0.33)
- Parameters:
arrays (-) – The data to split into training and testing sets. The labels and the group should be passed to y and group respectively.
y (-) – Label data with shape (n_samples), where n_samples is the number of samples. These are the labels that are used to group the data into either the training or testing set.
group (-) – Event data with shape (n_samples), where n_samples is the number of samples. These are the group ids that are used to group the data into either the training or testing set.
test_size (-) – This dictates the size of the outputted test set. This should be used if train_size=None. If neither test_size nor train_size is given, then test_size will default to 0.25. Defaults to None.
train_size (-) – This dictates the size of the outputted train set. This should be used if test_size=None. Defaults to None.
shuffle (-) – Dictates whether the data should be shuffled before the split is made.
random_state (-) – This dictates the random seed that is used in the random operations for this function.
- Returns:
- split arrays – This is a list of the input data, split into the training and testing sets. See the Example for an understanding of the order of the outputted arrays.
- Return type:
list:
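To make the group constraint concrete, here is a minimal sketch of a group-aware train/test split built on scikit-learn's GroupShuffleSplit. It is an illustration of the idea only and is not the implementation of train_test_group_split; the toy arrays are made up for the example.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

X = np.arange(24).reshape(12, 2)
y = np.array([0, 1] * 6)
group = np.repeat(np.arange(6), 2)  # six groups with two samples each

# One shuffled split; whole groups are assigned to either train or test.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.33, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=group))

X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]
ids_train, ids_test = group[train_idx], group[test_idx]

# The two sets of group ids never overlap.
print(set(ids_train) & set(ids_test))
```

Note that test_size here is a proportion of groups rather than of samples, which is how GroupShuffleSplit interprets it; whether sku uses the same convention is not stated in this documentation.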
Pipeline
Here is the documentation for the pipeline functionality.
- class sku.pipeline.Pipeline(steps, *, memory=None, verbose=False)
Bases:
Pipeline
- predict(X: ndarray, **predict_params)
Differs from the sklearn implementation in that it allows you to pass parameters to each step of the pipeline, similarly to fit_params.
- Parameters:
X (iterable) – Data to predict on. Must fulfill input requirements of first step of the pipeline.
**predict_params (dict of string -> object) – Parameters passed to predict at each step.
- Returns:
y_pred – Result of calling predict on the final estimator.
- Return type:
ndarray of shape (n_samples, n_classes)
- predict_log_proba(X: ndarray, **predict_log_proba_params)
Differs from the sklearn implementation in that it allows you to pass parameters to each step of the pipeline, similarly to fit_params.
- Parameters:
X (iterable) – Data to predict on. Must fulfill input requirements of first step of the pipeline.
**predict_log_proba_params (dict of string -> object) – Parameters passed to predict_log_proba at each step.
- Returns:
y_log_proba – Result of calling predict_log_proba on the final estimator.
- Return type:
ndarray of shape (n_samples, n_classes)
- predict_proba(X: ndarray, **predict_proba_params)
Differs from the sklearn implementation in that it allows you to pass parameters to each step of the pipeline, similarly to fit_params.
- Parameters:
X (iterable) – Data to predict on. Must fulfill input requirements of first step of the pipeline.
**predict_proba_params (dict of string -> object) – Parameters passed to predict_proba at each step.
- Returns:
y_proba – Result of calling predict_proba on the final estimator.
- Return type:
ndarray of shape (n_samples, n_classes)
- transform(X, **transform_params)
Differs from the sklearn implementation in that it allows you to pass parameters to each step of the pipeline, similarly to fit_params.
- Parameters:
X (iterable) – Data to transform. Must fulfill input requirements of first step of the pipeline.
**transform_params (dict of string -> object) – Parameters passed to transform at each step.
- Returns:
Xt – Transformed data.
- Return type:
ndarray of shape (n_samples, n_transformed_features)
- class sku.pipeline.PipelineDD(steps, *, memory=None, verbose=False)
Bases:
Pipeline
- fit(X: Dict[str, ndarray], y: None = None, *args, **kwargs) PipelineDD
This will fit the pipeline.
- Parameters:
X (-) – A dictionary containing the data. When supplying X, please be mindful of the objects in the pipeline and their data requirements. If you are using model wrappers and transformer wrappers from this package, then using a dictionary will be more powerful. However, you may supply a numpy.ndarray, but all fit_on, transform_on and predict_on arguments will be ignored and sklearn defaults will be used. An example X: X = {'X': X_DATA, 'y': Y_DATA, **kwargs}.
y (-) – Ignored unless X is a numpy.ndarray. If using a data dictionary, please pass labels in the dictionary to X. Defaults to None.
- Returns:
- self – This object.
- Return type:
PipelineDD:
- predict(X: Dict[str, ndarray], *args, **kwargs) ndarray | Dict[str, ndarray]
This will predict using the fitted pipeline.
- Parameters:
X (-) – A dictionary containing the data. For example: X = {'X': X_DATA, 'y': Y_DATA, **kwargs}. If X is a numpy.ndarray, then the predict_on arguments will be ignored and the model's .predict(X) will be called. In this case, consider using sklearn. In addition, this will be performed on the first fitted model if many are fitted.
- Returns:
- predictions – The predictions, as a numpy array. If multiple inner lists are given as predict_on, then a list of predictions will be returned.
- Return type:
numpy.ndarray:
- score(X: Dict[str, ndarray], y: str | ndarray, sample_weight=None)
Transform the data, and apply score with the final estimator. Call transform of each transformer in the pipeline. The transformed data are finally passed to the final estimator that calls the score method. Only valid if the final estimator implements score.
- Parameters:
X (-) – A dictionary containing the data. For example: X = {'X': X_DATA, 'y': Y_DATA, **kwargs}.
y (-) – Please either pass a string, which corresponds to the key in X which contains the labels, or pass the labels themselves.
sample_weight (-) – If not None, this argument is passed as the sample_weight keyword argument to the score method of the final estimator. Defaults to None.
- Returns:
- score – Result of calling score on the final estimator.
- Return type:
float
- transform(X: Dict[str, ndarray]) Dict[str, ndarray]
Transform the data, and apply transform with the final estimator.
Call transform of each transformer in the pipeline. The transformed data are finally passed to the final estimator that calls transform method. Only valid if the final estimator implements transform.
This also works where final estimator is None in which case all prior transformations are applied.
- Parameters:
X (iterable) – Data to transform. Must fulfill input requirements of first step of the pipeline.
- Returns:
Xt – Transformed data.
- Return type:
ndarray of shape (n_samples, n_transformed_features)
- sku.pipeline.pipeline_constructor(name: str, name_to_object: Dict[str, BaseEstimator], **pipeline_kwargs) PipelineDD
A function that constructs a pipeline from a string of pipeline keys and a dictionary mapping the keys to pipeline objects.
This function will copy each object before adding it to the pipeline. This means that repeating objects in the name will produce separate objects for each.
Example
>>> name_to_object = {
...     'standard_scaler': sku.SKTransformerWrapperDD(
...         StandardScaler,
...         fit_on=['X'],
...         transform_on=['X'],
...     ),
...     'ae': sku.SKTransformerWrapperDD(
...         transformer=AEModel,
...         n_input=19, n_embedding=19,
...         n_epochs=5, n_layers=2,
...         dropout=0.2,
...         optimizer={'adam': {'lr': 0.01}},
...         criterion=nn.MSELoss(),
...         fit_on=['X_unlabelled'],
...         transform_on=[['X'], ['X_unlabelled']],
...     ),
...     'mlp': sku.SKModelWrapperDD(
...         MLPModel,
...         n_input=19, n_output=2,
...         hidden_layer_sizes=(100, 50),
...         dropout=0.2, n_epochs=10,
...         optimizer={'adam': {'lr': 0.01}},
...         criterion=nn.CrossEntropyLoss(),
...         fit_on=['X', 'y'],
...         predict_on=['X'],
...     ),
... }
>>> name = 'standard_scaler--ae--standard_scaler--mlp'
>>> pipeline = pipeline_constructor(name, name_to_object)
Here, the pipeline will be returned as:
[
    ['standard_scaler1', sku.SKTransformerWrapperDD],
    ['ae', sku.SKTransformerWrapperDD],
    ['standard_scaler2', sku.SKTransformerWrapperDD],
    ['mlp', sku.SKModelWrapperDD],
]
Note the change in names of the standard scalers to ensure that object names are unique.
- Parameters:
name (-) – A string in which the pipeline object names are split by '--'. For example: 'standard_scaler--ae--standard_scaler--mlp'.
name_to_object (-) – A dictionary mapping the strings in the name argument to objects that will be placed in the pipeline. These objects must have a .fit() and .predict() method and use data dictionaries. Please see sku.SKTransformerWrapperDD and sku.SKModelWrapperDD for further explanation of structure.
pipeline_kwargs (-) – These are any extra keyword arguments that will be passed to sku.PipelineDD.
- Returns:
- out – A data dictionary pipeline.
- Return type:
sku.pipeline.PipelineDD:
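The renaming of repeated steps described for pipeline_constructor can be sketched in a few lines of plain Python. The helper below, unique_step_names, is hypothetical and only reproduces the naming shown in the example (a numeric suffix on keys that occur more than once); it does not copy or build any estimator, and the library may generate names differently in cases the example does not cover.

```python
from collections import Counter
from typing import List


def unique_step_names(name: str, sep: str = "--") -> List[str]:
    """Split a pipeline string and number any key that appears more than once."""
    keys = name.split(sep)
    totals = Counter(keys)  # how often each key occurs overall
    seen = Counter()        # how often each key has occurred so far
    step_names = []
    for key in keys:
        if totals[key] > 1:
            seen[key] += 1
            step_names.append(f"{key}{seen[key]}")
        else:
            step_names.append(key)
    return step_names


print(unique_step_names("standard_scaler--ae--standard_scaler--mlp"))
# ['standard_scaler1', 'ae', 'standard_scaler2', 'mlp']
```

Unique step names matter because a scikit-learn style pipeline addresses its steps by name, for instance when setting parameters with the step__parameter convention.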
Pipeline Searching
Here is the documentation for the pipeline searching functionality.
- class sku.pipeline_searcher.PipelineBasicSearchCV(pipeline_names: List[str], name_to_object: Dict[str, BaseEstimator], metrics: Dict[str, Callable], metrics_probability: Dict[str, Callable] = {}, cv=None, repeat: int = 1, split_fit_on: List[str] = ['X', 'y'], split_transform_on: List[str] = ['X', 'y'], verbose: bool = False, n_jobs: int = 1, combine_splits: bool = False, combine_runs: bool = False, opt_metric: str | None = None, minimise: bool = True, bootstrap: bool = False)
Bases:
BaseEstimator
- Parameters:
- __init__(pipeline_names: List[str], name_to_object: Dict[str, BaseEstimator], metrics: Dict[str, Callable], metrics_probability: Dict[str, Callable] = {}, cv=None, repeat: int = 1, split_fit_on: List[str] = ['X', 'y'], split_transform_on: List[str] = ['X', 'y'], verbose: bool = False, n_jobs: int = 1, combine_splits: bool = False, combine_runs: bool = False, opt_metric: str | None = None, minimise: bool = True, bootstrap: bool = False)
This class allows you to test multiple pipelines on a supervised task, reporting on the metrics given in a table of results. Given a splitting function, it will perform cross validation on these pipelines. You may also pass your own splits.
Be careful not to set a random_state in the objects passed in the name_to_object dictionary, since each model gets cloned each time it is used in a pipeline, and the random_state will be the same in every run and every split. Future code will allow the passing of a random state to this object directly.
Example
name_to_object = {
    'gbt': sku.SKModelWrapperDD(
        HistGradientBoostingClassifier,
        fit_on=['X', 'y'],
        predict_on=['X'],
    ),
    'standard_scaler': sku.SKTransformerWrapperDD(
        StandardScaler,
        fit_on=['X'],
        transform_on=['X'],
    ),
}
pipeline_names = [
    'standard_scaler--gbt',
    'gbt',
]
metrics = {
    'accuracy': accuracy_score,
    'recall': recall_score,
    'precision': precision_score,
    'f1': f1_score,
}
splitter = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=1024)
pscv = PipelineBasicSearchCV(
    pipeline_names=pipeline_names,
    name_to_object=name_to_object,
    metrics=metrics,
    cv=splitter,
    split_fit_on=['X', 'y'],
    split_transform_on=['X', 'y', 'id'],
    verbose=True,
)
X_data = {
    'X': X_labelled, 'y': y_labelled, 'id': id_labelled,
    'X_unlabelled': X_unlabelled, 'id_unlabelled': id_unlabelled,
}
# fit takes the key of the labels in the data dictionary as y
pscv.fit(X_data, y='y')
results = pscv.cv_results_
- Parameters:
pipeline_names (-) – This is a list of strings that describe the pipelines. An example would be standard_scaler--ae--mlp. The objects, separated by '--', should be keys in name_to_object.
name_to_object (-) – A dictionary mapping the keys in pipeline_names to the objects that will be used as transformers and models in the pipeline.
metrics (-) – A dictionary mapping the metric names to their callable functions. These functions should take the form: func(labels, predictions).
metrics_probability (-) – A dictionary mapping the metric names to their callable functions. These functions should take the form: func(labels, prediction_probabilities). Defaults to {}.
cv (-) – This is the class that is used to produce the cross validation data. It should have the method .split(X, y, event), which returns the indices of the training and testing set, and the method get_n_splits(), which should return the number of splits that this splitter was intended to make. Alternatively, if you pass None, you may pass the training and validation data dictionaries themselves, in a tuple or list of tuples in the structure (data_train, data_val), to the fit method in place of the argument X. Defaults to None.
repeat (-) – The number of times to repeat the experiment. Defaults to 1.
split_fit_on (-) – The keys corresponding to the values in the data dictionary passed in .fit() that the cv will take as positional arguments to the split() function. If cv=None then this is ignored. Defaults to ['X', 'y'].
split_transform_on (-) – The keys corresponding to the values in the data dictionary passed in .fit() that the cv will split into training and testing data. This allows you to split data that isn't used in finding the splitting indices. If cv=None then this is ignored. Defaults to ['X', 'y'].
verbose (-) – Whether to print progress as the models are being tested. Remember that you might also need to change the verbose options in each of the objects given in name_to_object. Defaults to False.
n_jobs (-) – The number of parallel jobs. -1 will run the searches on all cores, but will incur significant memory and CPU cost. Defaults to 1.
combine_splits (-) – Whether to combine the predictions over the splits before calculating the metrics. This can help reduce the variance in results when using Leave-One-Out. Defaults to False.
combine_runs (-) – Whether to combine the predictions over the runs before calculating the metrics. If True, combine_splits must also be True. Defaults to False.
opt_metric (-) – The metric values to use when determining the optimal parameters. If None, the first metric given in metrics.keys() will be used. If a str, this should be a key in metrics. Defaults to None.
minimise (-) – Whether to minimise the metric given in opt_metric. If False, the metric will be maximised. Defaults to True.
bootstrap (-) – Whether to build a bootstrap sample of the training data before fitting the model. This will be done on every run of every split. Defaults to False.
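The roles of split_fit_on and split_transform_on can be pictured with a small sketch: the first selects the values handed to cv.split, the second selects the entries that are actually indexed into training and validation parts. The function split_data_dict below is hypothetical, and passing other keys through unsplit is an assumption made for the illustration, not documented behaviour of PipelineBasicSearchCV.

```python
from typing import Dict, Iterator, List, Tuple

import numpy as np
from sklearn.model_selection import KFold

DataDict = Dict[str, np.ndarray]


def split_data_dict(
    data: DataDict,
    cv,
    split_fit_on: List[str],
    split_transform_on: List[str],
) -> Iterator[Tuple[DataDict, DataDict]]:
    """Yield (data_train, data_val) dictionaries for each fold of cv."""
    # The values under split_fit_on become the positional arguments of cv.split.
    split_args = [data[key] for key in split_fit_on]
    for train_idx, val_idx in cv.split(*split_args):
        # Assumption: keys outside split_transform_on are passed through unsplit.
        data_train, data_val = dict(data), dict(data)
        for key in split_transform_on:
            data_train[key] = data[key][train_idx]
            data_val[key] = data[key][val_idx]
        yield data_train, data_val


data = {
    "X": np.arange(12).reshape(6, 2),
    "y": np.array([0, 1, 0, 1, 0, 1]),
    "id": np.array([10, 11, 12, 13, 14, 15]),
    "X_unlabelled": np.zeros((3, 2)),
}
folds = list(split_data_dict(data, KFold(n_splits=3), ["X", "y"], ["X", "y", "id"]))
print(len(folds), folds[0][1]["id"])
```

Here 'id' is split alongside 'X' and 'y' even though it plays no part in computing the indices, which is the situation the split_transform_on description refers to.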
- property best_params_
- fit(X: Dict[str, ndarray] | Tuple[Dict[str, ndarray], Dict[str, ndarray]] | List[Tuple[Dict[str, ndarray], Dict[str, ndarray]]], y: str) DataFrame
This function fits and predicts the pipelines, with the optional parameters and splitting arguments, and produces a table of results given the metrics.
- Parameters:
X (-) – The data dictionary that will be used to run the experiments. This may also be a tuple of data dictionaries used for training and validation splits if cv=None. In this case, you may also pass a list of tuples of data dictionaries in the form [(data_train, data_val), (data_train, data_val), ...], where data_train and data_val are data dictionaries.
y (-) – Please pass a string, which corresponds to the key in X which contains the labels.
- Return type:
DataFrame
- class sku.pipeline_searcher.PipelineBayesSearchCV(*args, param_grid: List[Dict[str, List[Any]]] | None = None, max_iter: int = 10, **kwargs)
Bases:
PipelineBasicSearchCV
- __init__(*args, param_grid: List[Dict[str, List[Any]]] | None = None, max_iter: int = 10, **kwargs)
This class allows you to test multiple pipelines on a supervised task, reporting on the metrics given in a table of results. Given a splitting function, it will perform cross validation on these pipelines. A parameter grid can also be passed, allowing the user to test multiple configurations of each pipeline. This class allows you to perform a Bayesian parameter search over the parameters given in param_grid. Note: At the moment this only supports real value param searches.
Example
pscv = PipelineBayesSearchCV(
    pipeline_names=pipeline_names,
    name_to_object=name_to_object,
    metrics=metrics,
    cv=splitter,
    split_fit_on=['X', 'y'],
    split_transform_on=['X', 'y', 'id'],
    verbose=True,
    param_grid=[
        {'flatten_gbt__learning_rate': [5, 20]},
        {'flatten_standard_scaler__with_mean': [True, False]},
        {'flatten_mlp__dropout': [0.2, 0.9]},
    ],
)
X_data = {
    'X': X_labelled, 'y': y_labelled, 'id': id_labelled,
    'X_unlabelled': X_unlabelled, 'id_unlabelled': id_unlabelled,
}
# fit takes the key of the labels in the data dictionary as y
pscv.fit(X_data, y='y')
results = pscv.cv_results_
- Parameters:
param_grid (-) – A dictionary or list of dictionaries that are used as a parameter grid for testing performance of pipelines with multiple hyper-parameters. This should be of the usual format given to sklearn.model_selection.ParameterGrid when used with sklearn.pipeline.Pipeline. If None, then the pipeline is tested with the parameters given to the objects in name_to_object. All pipelines will be tested with the given parameters in addition to the parameter grid passed. Only those parameters relevant to a pipeline will be used. Defaults to None.
max_iter (-) – The number of calls to make on each pipeline when finding the optimum params. Defaults to 10.
- class sku.pipeline_searcher.PipelineSearchCV(*args, param_grid: Dict[str, Any] | List[Dict[str, Any]] | None = None, **kwargs)
Bases:
PipelineBasicSearchCV
- __init__(*args, param_grid: Dict[str, Any] | List[Dict[str, Any]] | None = None, **kwargs)
This is the same as PipelineBasicSearchCV except a parameter grid can also be passed, allowing the user to test multiple configurations of each pipeline.
Example
pscv = PipelineSearchCV(
    pipeline_names=pipeline_names,
    name_to_object=name_to_object,
    metrics=metrics,
    cv=splitter,
    split_fit_on=['X', 'y'],
    split_transform_on=['X', 'y', 'id'],
    verbose=True,
    param_grid={
        'gbt__learning_rate': [0.1, 0.01],
        'gbt__max_depth': [3, 10],
    },
)
X_data = {
    'X': X_labelled, 'y': y_labelled, 'id': id_labelled,
    'X_unlabelled': X_unlabelled, 'id_unlabelled': id_unlabelled,
}
# fit takes the key of the labels in the data dictionary as y
pscv.fit(X_data, y='y')
results = pscv.cv_results_
- Parameters:
args (-) – All the same args as PipelineBasicSearchCV.
param_grid (-) – A dictionary or list of dictionaries that are used as a parameter grid for testing performance of pipelines with multiple hyper-parameters. This should be of the usual format given to sklearn.model_selection.ParameterGrid when used with sklearn.pipeline.Pipeline. If None, then the pipeline is tested with the parameters given to the objects in name_to_object. All pipelines will be tested with the given parameters in addition to the parameter grid passed. Only those parameters relevant to a pipeline will be used. Defaults to None.
kwargs (-) – All the same kwargs as PipelineBasicSearchCV.
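For readers unfamiliar with the ParameterGrid format mentioned for param_grid, the short sketch below expands the grid from the example above using scikit-learn directly. It shows only how the grid is enumerated; how PipelineSearchCV consumes each setting is not shown here.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "gbt__learning_rate": [0.1, 0.01],
    "gbt__max_depth": [3, 10],
}

# Each element is one full parameter setting; the grid is the Cartesian product.
candidates = list(ParameterGrid(param_grid))

for params in candidates:
    print(params)
```

Two values for each of two parameters give four candidate settings. The step__parameter keys follow the scikit-learn pipeline convention, where the part before the double underscore names the pipeline step.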
Metrics
Here is the documentation for the metric functionality.
- sku.metric.positive_split(labels: ndarray, predictions: None = None)
Calculates the proportion of positive labels in all of the labels.
- Parameters:
labels (-) – Labels should be from [0,1] or [False, True].
predictions (-) – Ignored. Defaults to None.
- Returns:
- out – A proportion. This is calculated using the function np.sum(labels)/labels.shape[0].
- Return type:
float:
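The formula quoted for positive_split is simple enough to reproduce directly. The function below, positive_proportion, is a hypothetical stand-in written for illustration; it follows the documented expression np.sum(labels)/labels.shape[0] and ignores predictions, as the documentation describes.

```python
import numpy as np


def positive_proportion(labels, predictions=None):
    """Proportion of positive labels; predictions is accepted but ignored."""
    labels = np.asarray(labels)
    return float(np.sum(labels) / labels.shape[0])


print(positive_proportion([0, 1, 1, 0]))                 # 0.5
print(positive_proportion([True, False, False, False]))  # 0.25
```

Accepting an unused predictions argument keeps the signature compatible with the func(labels, predictions) form expected of metrics.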
Pre-Processing
Here is the documentation for the pre-processing functionality.
- class sku.preprocessing.Flatten(start_dim: int = 0, end_dim: int = -1, copy=False)
Bases:
BaseEstimator
,TransformerMixin
- __init__(start_dim: int = 0, end_dim: int = -1, copy=False) None
This class allows you to flatten an array inside a pipeline. This class was implemented to mirror the behaviour in https://pytorch.org/docs/stable/generated/torch.flatten.html.
Examples
>>> flat = Flatten(start_dim=1, end_dim=-1)
>>> flat.fit(None, None) # ignored
>>> flat.transform(
...     [[[1, 2], [3, 4]],
...      [[5, 6], [7, 8]]]
... )
[[1,2,3,4],
 [5,6,7,8]]
- Parameters:
start_dim (-) – The first dim to flatten. Defaults to 0.
end_dim (-) – The last dim to flatten. Defaults to -1.
copy (-) – Whether to return a copied version of the array during the transform method. Defaults to False.
- Return type:
None
- fit(X: ndarray | None = None, y: None = None) Flatten
This function is required for the pipelines to work, but is ignored.
- Parameters:
X (-) – Ignored. Defaults to None.
y (-) – Ignored. Defaults to None.
- Returns:
This object.
- Return type:
self
- transform(X: ndarray) ndarray
This will transform the array by returning a flattened version.
- Parameters:
X (-) – The array to be flattened.
- Returns:
- out – The flattened array.
- Return type:
np.ndarray:
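The start_dim and end_dim semantics borrowed from torch.flatten can be imitated with a NumPy reshape. The function below is a minimal sketch written for this page and is not the code of sku.preprocessing.Flatten; it assumes the input has at least one dimension.

```python
import numpy as np


def flatten(x, start_dim=0, end_dim=-1):
    """Collapse dimensions start_dim..end_dim (inclusive) into a single axis."""
    x = np.asarray(x)
    # Convert negative dims to positive positions, as torch.flatten does.
    start = start_dim % x.ndim
    end = end_dim % x.ndim
    new_shape = x.shape[:start] + (-1,) + x.shape[end + 1:]
    return x.reshape(new_shape)


x = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
print(flatten(x, start_dim=1, end_dim=-1).tolist())  # [[1, 2, 3, 4], [5, 6, 7, 8]]
print(flatten(x).shape)                              # (8,)
```

With start_dim=1 the first axis (samples) is preserved and everything after it is merged, which is the shape most scikit-learn estimators expect.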
- sku.preprocessing.FlattenStandardScaler
alias of FlattenWrapper
- class sku.preprocessing.StandardGroupScaler(warn: bool = True)
Bases:
BaseEstimator
,TransformerMixin
- Parameters:
warn (bool) –
- __init__(warn: bool = True)
This class allows you to scale the data based on a group.
When calling transform, if the group has not been seen in the fitting method, then the global statistics will be used to scale the data (global = across all groups).
Where the mean or standard deviation are equal to NaN, in any axis on any group, that particular value will be replaced with the global mean or standard deviation for that axis (global = across all groups). If the standard deviation is returned as 0.0 then the global standard deviation and mean is used.
- Parameters:
warn (-) – Whether to warn the user if they are using the grouped version and no groups are passed. Defaults to True.
- fit(X: ndarray, y: None | ndarray = None, groups: None | ndarray = None) StandardGroupScaler
Compute the mean and std to be used for later scaling.
- Parameters:
X (-) – The data used to compute the mean and standard deviation used for later scaling along the features axis. This should be of shape (n_samples, n_features).
y (-) – Ignored. Defaults to None.
groups (-) – The groups to split the scaling by. This should be of shape (n_samples,). Defaults to None.
- Returns:
- self – The fitted scaler.
- Return type:
sku.StandardGroupScaler:
- fit_transform(X: ndarray, y: ndarray | None = None, groups: ndarray | None = None) ndarray
Fit to data, then transform it. Fits transformer to X using the groups and returns a transformed version of X.
- Parameters:
X (-) – The data used to compute the mean and standard deviation used for later scaling along the features axis. This should be of shape (n_samples, n_features).
groups (-) – The groups to split the scaling by. This should be of shape (n_samples,). Defaults to None.
y (-) – Ignored. Defaults to None.
- Returns:
The transformed version of X.
- Return type:
ndarray
- transform(X: ndarray, y: ndarray | None = None, groups: ndarray | None = None) ndarray
Perform standardization by centering and scaling by group.
- Parameters:
X (-) – The data used to scale along the features axis. This should be of shape (n_samples, n_features).
y (-) – Ignored. Defaults to None.
groups (-) – The groups to split the scaling by. This should be of shape (n_samples,). Defaults to None.
- Returns:
- X_norm – The transformed version of X.
- Return type:
np.ndarray:
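The fallback rules described for StandardGroupScaler can be illustrated with a compact NumPy sketch. The function group_standardize below is hypothetical: it fits and transforms the same data in one pass and simplifies the rules by replacing both the mean and the standard deviation with the global values whenever a group's statistics for a feature are NaN or its standard deviation is 0.0. It also assumes the global standard deviation is non-zero.

```python
import numpy as np


def group_standardize(X, groups):
    """Standardize each feature within each group, falling back to global stats."""
    X = np.asarray(X, dtype=float)
    groups = np.asarray(groups)
    global_mean = X.mean(axis=0)
    global_std = X.std(axis=0)
    X_norm = np.empty_like(X)
    for g in np.unique(groups):
        mask = groups == g
        mean = X[mask].mean(axis=0)
        std = X[mask].std(axis=0)
        # Simplification: where a group's statistics are unusable for a feature,
        # use the global mean and std for that feature instead.
        bad = ~np.isfinite(mean) | ~np.isfinite(std) | (std == 0.0)
        mean = np.where(bad, global_mean, mean)
        std = np.where(bad, global_std, std)
        X_norm[mask] = (X[mask] - mean) / std
    return X_norm


X = np.array([[1, 10], [3, 10], [5, 20], [7, 40]])
groups = np.array([0, 0, 1, 1])
X_norm = group_standardize(X, groups)
print(X_norm)
```

In this toy data the second feature is constant within group 0, so that column is scaled with the global mean and standard deviation, while every other column uses its own group's statistics.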
Progress Utils
Here is the documentation for the progress util functionality.
Utils
Here is the documentation for the util functionality.
- sku.utils.get_default_args(func: Callable)
https://stackoverflow.com/a/12627202
Allows you to collect the default arguments of a function. This is useful for setting params in sklearn wrappers.
- Parameters:
func (-) – Function to get parameter defaults for.
- Returns:
- out – The default arguments of func.
- Return type:
_type_:
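Collecting a function's default arguments is commonly done with the standard library's inspect module. The sketch below shows that approach under the assumption that a mapping from parameter name to default value is wanted; the name default_args is hypothetical and the return type of sku.utils.get_default_args is not stated in this documentation.

```python
import inspect
from typing import Any, Callable, Dict


def default_args(func: Callable) -> Dict[str, Any]:
    """Map each parameter that has a default to that default value."""
    signature = inspect.signature(func)
    return {
        name: param.default
        for name, param in signature.parameters.items()
        if param.default is not inspect.Parameter.empty
    }


def example(a, b=2, *, c="x"):
    return a


print(default_args(example))  # {'b': 2, 'c': 'x'}
```

Parameters without a default, such as a above, are skipped because their default is the sentinel inspect.Parameter.empty.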
- sku.utils.hasarg(func: Callable, name: str) bool
Checks if the function can take the name as an argument
- Parameters:
func (-) – Function
name (-) – Name to check in the args
- Returns:
- out – True or False
- Return type:
bool:
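A check like hasarg can also be expressed with inspect.signature. The helper accepts_arg below is a hypothetical sketch of the idea, not the library's code; in particular, whether sku treats a **kwargs catch-all as accepting any name is not documented, and this sketch only matches explicitly named parameters.

```python
import inspect
from typing import Callable


def accepts_arg(func: Callable, name: str) -> bool:
    """Return True if name is one of the function's declared parameters."""
    return name in inspect.signature(func).parameters


def example(a, b=2, **kwargs):
    return a


print(accepts_arg(example, "b"))  # True
print(accepts_arg(example, "c"))  # False
```

A check of this kind is useful before forwarding an optional keyword argument to an estimator whose signature is not known in advance.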
- sku.utils.partialclass(cls, *args, **kwds)
- sku.utils.partialclass_pickleable(name, cls, *args, **kwds)