
OptimalBucketer

The OptimalBucketer transformer uses the optbinning package to find optimal buckets.

This bucketer wraps optbinning.OptimalBinning to be consistent with skorecard, and requires a feature to be pre-bucketed into at most 100 buckets. Optbinning uses a constrained-programming solver to merge buckets, taking into account the following constraints: 1) monotonicity in bad rate, 2) at least 5% of records per bin.
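As a rough sketch of how these two constraints interact (plain Python, not skorecard or optbinning code; the 4500-record figure is illustrative):

```python
# With min_bin_size = 0.05, each bucket must hold at least 5% of records,
# so no more than 1 / 0.05 = 20 buckets can ever be feasible, regardless
# of what max_n_bins is set to.
min_bin_size = 0.05
max_feasible_buckets = int(1 / min_bin_size)
print(max_feasible_buckets)  # 20

# For a dataset of, say, 4500 records, that translates into a
# per-bucket floor of:
n_records = 4500
min_records_per_bucket = int(n_records * min_bin_size)
print(min_records_per_bucket)  # 225
```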

Examples:

from skorecard import datasets
from skorecard.bucketers import OptimalBucketer

X, y = datasets.load_uci_credit_card(return_X_y=True)
bucketer = OptimalBucketer(variables = ['LIMIT_BAL'])
bucketer.fit_transform(X, y)

__init__(self, variables=[], specials={}, variables_type='numerical', max_n_bins=10, missing_treatment='separate', min_bin_size=0.05, cat_cutoff=None, time_limit=25, remainder='passthrough', solver='cp', monotonic_trend='auto_asc_desc', gamma=0, ob_kwargs={}) special

Initialize Optimal Bucketer.

Parameters:

Name Type Description Default
variables

List of variables to bucket.

[]
specials

(nested) dictionary of special values that require their own binning. The dictionary has the following format: {"&lt;column name&gt;" : {"name of special bucket" : &lt;array with 1 or more values&gt;}}. For every feature that needs a special value, a dictionary must be passed as value. This dictionary contains the name of a bucket (key) and an array of unique values that should be put in that bucket. When special values are passed, they are not considered in the fitting procedure.

{}
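A hypothetical example of the `specials` format (the column name `EDUCATION` and its values are illustrative, not taken from this page):

```python
# Outer key: column name. Inner dict: bucket label -> raw values that
# belong in that special bucket (and are kept out of the fitting procedure).
specials = {
    "EDUCATION": {
        "unknown": [0, 5, 6],
    }
}

# Quick structural check of the nested format:
for column, buckets in specials.items():
    assert isinstance(column, str)
    for label, values in buckets.items():
        assert isinstance(label, str) and len(values) >= 1
```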
variables_type

Passed to optbinning.OptimalBinning: Type of the variables. Must be either 'categorical' or 'numerical'.

'numerical'
missing_treatment

Defines how we treat the missing values present in the data. If a string, it must be one of the following options:
separate: Missing values get put in a separate 'Other' bucket: -1
most_risky: Missing values are put into the bucket containing the largest percentage of Class 1.
least_risky: Missing values are put into the bucket containing the largest percentage of Class 0.
most_frequent: Missing values are put into the most common bucket.
neutral: Missing values are put into the bucket with WoE closest to 0.
similar: Missing values are put into the bucket with WoE closest to the bucket with only missing values.
passthrough: Leaves missing values untouched.
If a dict, it must be of the following format: {"&lt;column name&gt;": &lt;bucket number&gt;}. This bucket number is where we will put the missing values.

'separate'
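A sketch of the dict form (the column name and bucket number are hypothetical, and the helper below is an illustration of the idea, not skorecard internals):

```python
import math

# Dict form: column name -> bucket number for missing values.
missing_treatment = {"LIMIT_BAL": 2}

def assign_missing(value, column):
    """Route NaN to the configured bucket; None means 'not missing',
    i.e. handled by the ordinary bucketing logic."""
    if isinstance(value, float) and math.isnan(value):
        return missing_treatment[column]
    return None

print(assign_missing(float("nan"), "LIMIT_BAL"))  # 2
print(assign_missing(15000.0, "LIMIT_BAL"))       # None
```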
min_bin_size

Passed to optbinning.OptimalBinning: Minimum fraction of observations in a bucket.

0.05
max_n_bins

Passed to optbinning.OptimalBinning: Maximum numbers of bins to return.

10
cat_cutoff

Passed to optbinning.OptimalBinning: Threshold ratio (None, or >0 and <=1) below which categories are grouped together in a bucket 'other'.

None
time_limit float

Passed to optbinning.OptimalBinning: Time limit in seconds to find an optimal solution.

25
remainder

How we want the non-specified columns to be transformed. It must be in ["passthrough", "drop"]. passthrough (Default): all columns that were not specified in "variables" will be passed through. drop: all remaining columns that were not specified in "variables" will be dropped.

'passthrough'
solver str

Passed to optbinning.OptimalBinning: The optimizer to solve the optimal binning problem. Supported solvers are “mip” to choose a mixed-integer programming solver, “cp” (default) to choose a constrained programming solver or “ls” to choose LocalSolver.

'cp'
monotonic_trend str

Passed to optbinning.OptimalBinning: The event rate monotonic trend. Supported trends are “auto”, “auto_heuristic” and “auto_asc_desc” to automatically determine the trend maximizing IV using a machine learning classifier, “ascending”, “descending”, “concave”, “convex”, “peak” and “peak_heuristic” to allow a peak change point, and “valley” and “valley_heuristic” to allow a valley change point. Trends “auto_heuristic”, “peak_heuristic” and “valley_heuristic” use a heuristic to determine the change point, and are significantly faster for large size instances (max_n_prebins > 20). Trend “auto_asc_desc” is used to automatically select the best monotonic trend between “ascending” and “descending”. If None, then the monotonic constraint is disabled.

'auto_asc_desc'
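To make the monotonicity constraint concrete, here is a plain-Python sketch (not optbinning code) of the property the solver enforces; under 'auto_asc_desc' it selects whichever of the two trends fits best:

```python
def is_monotonic(event_rates, trend):
    """Check whether per-bucket event rates follow the given trend."""
    pairs = list(zip(event_rates, event_rates[1:]))
    if trend == "ascending":
        return all(a <= b for a, b in pairs)
    if trend == "descending":
        return all(a >= b for a, b in pairs)
    raise ValueError(trend)

rates = [0.02, 0.05, 0.11, 0.24]  # made-up per-bucket event rates
print(is_monotonic(rates, "ascending"))   # True
print(is_monotonic(rates, "descending"))  # False
```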
gamma float

Passed to optbinning.OptimalBinning: Regularization strength to reduce the number of dominating bins. Larger values specify stronger regularization. Default is 0. Option supported by solvers “cp” and “mip”.

0
ob_kwargs dict

Other parameters passed to optbinning.OptimalBinning.

{}

bucket_table(self, column) inherited

Generates the statistics for the buckets of a particular column.

The pre-buckets are matched to the post-buckets, so that the user has a much clearer understanding of how the BucketingProcess ends up with the final buckets. An example:

bucket label Count Count (%) Non-event Event % Event % Non-event Event Rate WoE IV
0 (-inf, 25.0) 61.0 1.36 57.0 4.0 0.41 1.62 0.066 1.380 0.017
1 [25.0, 45.0) 2024.0 44.98 1536.0 488.0 49.64 43.67 0.241 -0.128 0.008
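The WoE and IV columns follow the standard definitions, WoE = ln(% Non-event / % Event) and IV = (% Non-event - % Event) · WoE. A sketch reproducing the numbers above (the event/non-event totals are inferred from the percentages shown and are approximate):

```python
import math

total_nonevent, total_event = 3517, 983  # inferred from the table's percentages

def woe_iv(nonevent, event):
    """Weight of Evidence and Information Value contribution of one bucket."""
    pct_nonevent = nonevent / total_nonevent
    pct_event = event / total_event
    woe = math.log(pct_nonevent / pct_event)
    iv = (pct_nonevent - pct_event) * woe
    return woe, iv

woe, iv = woe_iv(57, 4)  # bucket 0 in the example table
print(round(woe, 3), round(iv, 3))  # 1.382 0.017
```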

Parameters:

Name Type Description Default
column

The column we wish to analyse

required

Returns:

Type Description
df (pd.DataFrame)

A pandas dataframe of the format above

fit(self, X, y=None) inherited

Fit X, y.

fit_interactive(self, X, y=None, mode='external', **server_kwargs) inherited

Fit a bucketer and then interactively edit the fit using a dash app.

Note we are using a JupyterDash app, which supports 3 different modes:

  • 'external' (default): Start dash server and print URL
  • 'inline': Start dash app inside an Iframe in the jupyter notebook
  • 'jupyterlab': Start dash app as a new tab inside jupyterlab

fit_transform(self, X, y=None, **fit_params) inherited

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters

X : array-like of shape (n_samples, n_features) Input samples.

y : array-like of shape (n_samples,) or (n_samples, n_outputs), default=None Target values (None for unsupervised transformations).

**fit_params : dict Additional fit parameters.

Returns

X_new : ndarray array of shape (n_samples, n_features_new) Transformed array.

get_params(self, deep=True) inherited

Get parameters for this estimator.

Parameters

deep : bool, default=True If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params : dict Parameter names mapped to their values.

plot_bucket(self, column, line='event_rate', format=None, scale=None, width=None, height=None) inherited

Plot the buckets.

Parameters:

Name Type Description Default
column

The column we want to visualise

required
line

The line to plot on the secondary axis. Default is Event Rate.

'event_rate'
format

The format of the image, such as 'png'. The default None returns a plotly image.

None
scale

If format is specified, the scale of the image

None
width

If format is specified, the width of the image

None
height

If format is specified, the height of the image

None

Returns:

Type Description
plot

plotly fig

predict(self, X) inherited

Applies the transform method. To be used in grid searches.

Parameters:

Name Type Description Default
X pd.DataFrame

The numerical data which will be transformed into the corresponding buckets

required

Returns:

Type Description
y (np.array)

Transformed X, such that the values of X are replaced by the corresponding bucket numbers

predict_proba(self, X) inherited

Applies the transform method. To be used in grid searches.

Parameters:

Name Type Description Default
X pd.DataFrame

The numerical data which will be transformed into the corresponding buckets

required

Returns:

Type Description
yhat (np.array)

Transformed X, such that the values of X are replaced by the corresponding bucket numbers

save_yml(self, fout) inherited

Save the features bucket to a yaml file.

Parameters:

Name Type Description Default
fout PathLike

file output

required

set_params(self, **params) inherited

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as sklearn.pipeline.Pipeline). The latter have parameters of the form &lt;component&gt;__&lt;parameter&gt; so that it's possible to update each component of a nested object.

Parameters

**params : dict Estimator parameters.

Returns

self : estimator instance Estimator instance.

summary(self) inherited

Display a summary table for columns passed to .fit().

The format is the following:

column num_prebuckets num_buckets dtype
LIMIT_BAL 15 10 float64
BILL_AMT1 15 6 float64

transform(self, X, y=None) inherited

Transforms an array into the corresponding buckets fitted by the Transformer.

Parameters:

Name Type Description Default
X pd.DataFrame

dataframe which will be transformed into the corresponding buckets

required
y array

target

None

Returns:

Type Description
df (pd.DataFrame)

dataset with transformed features


Last update: 2021-11-24