OrdinalCategoricalBucketer

The OrdinalCategoricalBucketer replaces categories by ordinal numbers.

When sort_by_target is false, the buckets are assigned in order of frequency. When sort_by_target is true, the buckets are ordered by the mean of the target per category.

For example, if for a variable colour the means of the target for blue, red and grey are 0.5, 0.8 and 0.1 respectively, grey will be the first bucket (0), blue the second (1) and red the third (2). If new data contains unknown labels (e.g. yellow), they are replaced by the 'Other' bucket (-2), and if new data contains missing values, they are replaced by the 'Missing' bucket (-1).
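The target-ordered mapping described above can be sketched in plain Python. This is an illustrative re-statement of the documented behaviour using the hypothetical colour example, not skorecard's actual internals:

```python
# Mean of the target per category, as in the example above (assumed values).
means = {"blue": 0.5, "red": 0.8, "grey": 0.1}

# Order categories by mean target: the lowest mean gets bucket 0.
ordering = sorted(means, key=means.get)
mapping = {cat: i for i, cat in enumerate(ordering)}
# mapping == {"grey": 0, "blue": 1, "red": 2}

def encode(value):
    """Sketch of transform-time behaviour for a single value."""
    if value is None:
        return -1                   # missing -> 'Missing' bucket (-1)
    return mapping.get(value, -2)   # unseen label -> 'Other' bucket (-2)
```

Calling `encode("yellow")` returns -2 and `encode(None)` returns -1, matching the bucket numbers described above.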

Examples:

from skorecard import datasets
from skorecard.bucketers import OrdinalCategoricalBucketer

X, y = datasets.load_uci_credit_card(return_X_y=True)
bucketer = OrdinalCategoricalBucketer(variables=['EDUCATION'])
bucketer.fit_transform(X, y)
bucketer = OrdinalCategoricalBucketer(max_n_categories=2, variables=['EDUCATION'])
bucketer.fit_transform(X, y)

Credits: Code & ideas adapted from:

  • feature_engine.categorical_encoders.OrdinalCategoricalEncoder
  • feature_engine.categorical_encoders.RareLabelCategoricalEncoder

variables_type property readonly

Signals variables type supported by this bucketer.

__init__(self, tol=0.05, max_n_categories=None, variables=[], specials={}, encoding_method='frequency', missing_treatment='separate', remainder='passthrough') special

Init the class.

Parameters:

Name Type Description Default
tol float

the minimum frequency a label should have to be considered frequent. Categories with frequencies lower than tol will be grouped together (in the 'other' bucket).

0.05
max_n_categories int

the maximum number of categories that should be considered frequent. If None, all categories with frequency above the tolerance (tol) will be considered.

None
variables list

The features to bucket. Uses all features if not defined.

[]
specials dict

(nested) dictionary of special values that require their own binning. The dictionary has the following format: {"<feature name>": {"<name of special bucket>": <list with 1 or more values>}}. For every feature that needs a special value, a dictionary must be passed as value. This dictionary contains the name of a bucket (key) and an array of unique values that should be put in that bucket. When special values are defined, they are not considered in the fitting procedure.

{}
encoding_method string

encoding method.

  • "frequency" (default): orders the buckets by the frequency of observations in the bucket. The lower the bucket number, the more frequent the observations in that bucket.
  • "ordered": orders the buckets by the average class 1 rate in the bucket. The lower the bucket number, the lower the fraction of class 1 in that bucket.

'frequency'
missing_treatment str or dict

Defines how we treat the missing values present in the data. If a string, it must be one of the following options:

  • separate: missing values are put in a separate 'Missing' bucket: -1.
  • most_risky: missing values are put into the bucket containing the largest percentage of Class 1.
  • least_risky: missing values are put into the bucket containing the largest percentage of Class 0.
  • most_frequent: missing values are put into the most common bucket.
  • neutral: missing values are put into the bucket with WoE closest to 0.
  • similar: missing values are put into the bucket with WoE closest to the bucket with only missing values.
  • passthrough: leaves missing values untouched.

If a dict, it must be of the following format: {"<column name>": <bucket number>}. This bucket number is where the missing values will be put.

'separate'
remainder str

How we want the non-specified columns to be transformed. It must be one of ["passthrough", "drop"].

  • passthrough (default): all columns that were not specified in "variables" are passed through.
  • drop: all remaining columns that were not specified in "variables" are dropped.

'passthrough'
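The interaction between tol and max_n_categories can be sketched as follows. This is illustrative logic only (assumed toy data, not skorecard's actual implementation):

```python
from collections import Counter

# Toy categorical column: relative frequencies a=0.50, b=0.30, c=0.15, d=0.05.
values = ["a"] * 50 + ["b"] * 30 + ["c"] * 15 + ["d"] * 5
tol = 0.05
max_n_categories = 2

counts = Counter(values)
n = len(values)

# Keep categories whose relative frequency is at least tol ...
frequent = [cat for cat, c in counts.most_common() if c / n >= tol]

# ... and cap at the max_n_categories most frequent, if set.
if max_n_categories is not None:
    frequent = frequent[:max_n_categories]

# Everything else would fall into the 'other' bucket.
```

Here all four categories pass the tol threshold, but max_n_categories=2 keeps only "a" and "b" as frequent labels.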

bucket_table(self, column) inherited

Generates the statistics for the buckets of a particular column.

The pre-buckets are matched to the post-buckets, so that the user has a much clearer understanding of how the BucketingProcess ends up with the final buckets. An example:

bucket label Count Count (%) Non-event Event % Event % Non-event Event Rate WoE IV
0 (-inf, 25.0) 61.0 1.36 57.0 4.0 0.41 1.62 0.066 1.380 0.017
1 [25.0, 45.0) 2024.0 44.98 1536.0 488.0 49.64 43.67 0.241 -0.128 0.008
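The WoE and IV columns in a table like the one above can be reproduced from the event/non-event counts. A minimal sketch, assuming grand totals of roughly 3517 non-events and 983 events across all buckets (the totals are not given in this excerpt):

```python
import math

# Per-bucket counts from the example table above.
buckets = [
    {"label": "(-inf, 25.0)", "non_event": 57.0, "event": 4.0},
    {"label": "[25.0, 45.0)", "non_event": 1536.0, "event": 488.0},
]
total_non_event = 3517.0  # assumed grand total of non-events
total_event = 983.0       # assumed grand total of events

for b in buckets:
    pct_event = b["event"] / total_event          # share of all events
    pct_non_event = b["non_event"] / total_non_event  # share of all non-events
    b["woe"] = math.log(pct_non_event / pct_event)
    b["iv"] = (pct_non_event - pct_event) * b["woe"]
# With these totals, bucket 0 gives WoE ~ 1.38 and IV ~ 0.017,
# bucket 1 gives WoE ~ -0.128 and IV ~ 0.008, matching the table.
```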

Parameters:

Name Type Description Default
column

The column we wish to analyse

required

Returns:

Type Description
df (pd.DataFrame)

A pandas dataframe of the format above

fit(self, X, y=None) inherited

Fit X, y.

fit_interactive(self, X, y=None, mode='external', **server_kwargs) inherited

Fit a bucketer and then interactive edit the fit using a dash app.

Note that this uses a JupyterDash app, which supports 3 different modes:

  • 'external' (default): Start dash server and print URL
  • 'inline': Start dash app inside an Iframe in the jupyter notebook
  • 'jupyterlab': Start dash app as a new tab inside jupyterlab

fit_transform(self, X, y=None, **fit_params) inherited

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters

X : array-like of shape (n_samples, n_features) Input samples.

y : array-like of shape (n_samples,) or (n_samples, n_outputs), default=None Target values (None for unsupervised transformations).

**fit_params : dict Additional fit parameters.

Returns

X_new : ndarray array of shape (n_samples, n_features_new) Transformed array.

get_params(self, deep=True) inherited

Get parameters for this estimator.

Parameters

deep : bool, default=True If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params : dict Parameter names mapped to their values.

plot_bucket(self, column, line='event_rate', format=None, scale=None, width=None, height=None) inherited

Plot the buckets.

Parameters:

Name Type Description Default
column

The column we want to visualise

required
line

The line to plot on the secondary axis. Default is Event Rate.

'event_rate'
format

The format of the image, such as 'png'. The default None returns a plotly image.

None
scale

If format is specified, the scale of the image

None
width

If format is specified, the width of the image

None
height

If format is specified, the height of the image

None

Returns:

Type Description
plot

plotly fig

predict(self, X) inherited

Applies the transform method. To be used for grid searches.

Parameters:

Name Type Description Default
X pd.DataFrame

The numerical data which will be transformed into the corresponding buckets

required

Returns:

Type Description
y (np.array)

Transformed X, such that the values of X are replaced by the corresponding bucket numbers

predict_proba(self, X) inherited

Applies the transform method. To be used for grid searches.

Parameters:

Name Type Description Default
X pd.DataFrame

The numerical data which will be transformed into the corresponding buckets

required

Returns:

Type Description
yhat (np.array)

Transformed X, such that the values of X are replaced by the corresponding bucket numbers

save_yml(self, fout) inherited

Save the features bucket to a yaml file.

Parameters:

Name Type Description Default
fout ~PathLike

file output

required

set_params(self, **params) inherited

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as sklearn.pipeline.Pipeline). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.

Parameters

**params : dict Estimator parameters.

Returns

self : estimator instance Estimator instance.
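The get_params/set_params convention above can be sketched with a minimal re-implementation. This is not the actual scikit-learn base class, just an illustration of the contract:

```python
# Minimal sketch of the scikit-learn estimator parameter contract.
class MiniEstimator:
    def __init__(self, tol=0.05, max_n_categories=None):
        self.tol = tol
        self.max_n_categories = max_n_categories

    def get_params(self, deep=True):
        # Return all constructor parameters mapped to their current values.
        return {"tol": self.tol, "max_n_categories": self.max_n_categories}

    def set_params(self, **params):
        # Update parameters in place by attribute name.
        for key, value in params.items():
            setattr(self, key, value)
        return self  # returning self allows call chaining

est = MiniEstimator().set_params(tol=0.1)
```

Returning self from set_params is what makes patterns like `MiniEstimator().set_params(tol=0.1).get_params()` possible, and is the same convention grid-search tooling relies on.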

summary(self) inherited

Display a summary table for columns passed to .fit().

The format is the following:

column num_prebuckets num_buckets dtype
LIMIT_BAL 15 10 float64
BILL_AMT1 15 6 float64

transform(self, X, y=None) inherited

Transforms an array into the corresponding buckets fitted by the Transformer.

Parameters:

Name Type Description Default
X pd.DataFrame

dataframe which will be transformed into the corresponding buckets

required
y array

target

None

Returns:

Type Description
df (pd.DataFrame)

dataset with transformed features
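The transform semantics for a fitted categorical bucketer can be sketched with pandas. This mirrors the documented behaviour (fitted labels map to bucket numbers, unseen labels to -2, missing values to -1); the bucket mapping below is assumed for illustration, and this is not the library's internal code:

```python
import pandas as pd

# Hypothetical mapping learned during fit: label -> bucket number.
fitted_buckets = {"graduate": 0, "university": 1, "high school": 2}

X_new = pd.DataFrame({"EDUCATION": ["university", "unknown", None]})
encoded = (
    X_new["EDUCATION"]
    .map(fitted_buckets)                    # unseen labels become NaN here
    .where(X_new["EDUCATION"].notna(), -1)  # missing -> 'Missing' bucket (-1)
    .fillna(-2)                             # remaining unseen -> 'Other' (-2)
    .astype(int)
)
# encoded is [1, -2, -1]: a known label, an unseen label, a missing value.
```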


Last update: 2021-11-24