OptimalBucketer

The `OptimalBucketer` transformer uses the `optbinning` package to find optimal buckets.
This bucketer wraps `optbinning.OptimalBinning` to be consistent with the skorecard API. It requires a feature to be pre-bucketed into at most 100 buckets. Optbinning uses a constrained-programming solver to merge buckets, subject to the following constraints: 1) monotonicity in bad rate, and 2) at least 5% of records per bin.
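The two constraints can be illustrated with a quick check on a candidate bucketing. The numbers below are hypothetical, purely for illustration; they are not skorecard output:

```python
# Hypothetical candidate bucketing: (fraction of records, bad rate) per bucket.
buckets = [(0.10, 0.02), (0.25, 0.05), (0.40, 0.11), (0.25, 0.19)]

# Constraint 1: bad rate must be monotonic (ascending or descending) across buckets.
bad_rates = [rate for _, rate in buckets]
is_monotonic = (
    all(a <= b for a, b in zip(bad_rates, bad_rates[1:]))
    or all(a >= b for a, b in zip(bad_rates, bad_rates[1:]))
)

# Constraint 2: every bucket must hold at least 5% of records.
meets_min_size = all(frac >= 0.05 for frac, _ in buckets)

print(is_monotonic, meets_min_size)  # True True
```

The solver searches over merges of the pre-buckets so that the final solution satisfies both checks.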
Examples:

```python
from skorecard import datasets
from skorecard.bucketers import OptimalBucketer

X, y = datasets.load_uci_credit_card(return_X_y=True)
bucketer = OptimalBucketer(variables=['LIMIT_BAL'])
bucketer.fit_transform(X, y)
```
`__init__(self, variables=[], specials={}, variables_type='numerical', max_n_bins=10, missing_treatment='separate', min_bin_size=0.05, cat_cutoff=None, time_limit=25, remainder='passthrough', solver='cp', monotonic_trend='auto_asc_desc', gamma=0, ob_kwargs={})`

Initialize the Optimal Bucketer.
Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| variables | list | List of variables to bucket. | `[]` |
| specials | dict | (Nested) dictionary of special values that require their own binning. | `{}` |
| variables_type | str | Passed to `optbinning.OptimalBinning`: type of the variables. Must be either 'categorical' or 'numerical'. | `'numerical'` |
| missing_treatment | str | Defines how we treat the missing values present in the data. If a string, it must be one of the supported options, e.g. 'separate': missing values get put in a separate 'Other' bucket. | `'separate'` |
| min_bin_size | float | Passed to `optbinning.OptimalBinning`: minimum fraction of observations in a bucket. | `0.05` |
| max_n_bins | int | Passed to `optbinning.OptimalBinning`: maximum number of bins to return. | `10` |
| cat_cutoff | float | Passed to `optbinning.OptimalBinning`: threshold ratio (None, or >0 and <=1) below which categories are grouped together in an 'other' bucket. | `None` |
| time_limit | float | Passed to `optbinning.OptimalBinning`: time limit in seconds to find an optimal solution. | `25` |
| remainder | str | How the non-specified columns should be transformed. Must be in ["passthrough", "drop"]: 'passthrough' (default) passes through all columns that were not specified in `variables`; 'drop' drops all remaining columns that were not specified in `variables`. | `'passthrough'` |
| solver | str | Passed to `optbinning.OptimalBinning`: the optimizer to solve the optimal binning problem. Supported solvers are 'mip' (a mixed-integer programming solver), 'cp' (a constrained programming solver, the default) and 'ls' (LocalSolver). | `'cp'` |
| monotonic_trend | str | Passed to `optbinning.OptimalBinning`: the event-rate monotonic trend. Supported trends are 'auto', 'auto_heuristic' and 'auto_asc_desc' to automatically determine the trend maximizing IV using a machine-learning classifier; 'ascending', 'descending', 'concave', 'convex'; 'peak' and 'peak_heuristic' to allow a peak change point; and 'valley' and 'valley_heuristic' to allow a valley change point. Trends 'auto_heuristic', 'peak_heuristic' and 'valley_heuristic' use a heuristic to determine the change point, and are significantly faster for large instances (`max_n_prebins > 20`). Trend 'auto_asc_desc' automatically selects the best monotonic trend between 'ascending' and 'descending'. If None, the monotonic constraint is disabled. | `'auto_asc_desc'` |
| gamma | float | Passed to `optbinning.OptimalBinning`: regularization strength to reduce the number of dominating bins. Larger values specify stronger regularization. Supported by solvers 'cp' and 'mip'. | `0` |
| ob_kwargs | dict | Other parameters passed to `optbinning.OptimalBinning`. | `{}` |
`bucket_table(self, column)` (inherited)

Generates the statistics for the buckets of a particular column.
The pre-buckets are matched to the post-buckets, so that the user has a much clearer understanding of how the `BucketingProcess` ends up with the final buckets. An example:
| bucket | label | Count | Count (%) | Non-event | Event | % Event | % Non-event | Event Rate | WoE | IV |
|--------|-------|-------|-----------|-----------|-------|---------|-------------|------------|-----|-----|
| 0 | (-inf, 25.0) | 61.0 | 1.36 | 57.0 | 4.0 | 0.41 | 1.62 | 0.066 | 1.380 | 0.017 |
| 1 | [25.0, 45.0) | 2024.0 | 44.98 | 1536.0 | 488.0 | 49.64 | 43.67 | 0.241 | -0.128 | 0.008 |
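Assuming optbinning's usual convention, WoE per bucket is `ln(% non-event / % event)` and the IV contribution is `(% non-event − % event) · WoE` (with percentages as fractions). A quick check reproduces the example rows above up to rounding; note that a bucket whose % event exceeds its % non-event gets a negative WoE under this convention:

```python
import math

def woe_iv(pct_event, pct_nonevent):
    """WoE and IV contribution for one bucket; percentages given in %."""
    e, ne = pct_event / 100, pct_nonevent / 100
    woe = math.log(ne / e)
    return woe, (ne - e) * woe

# (% Event, % Non-event) from the two example rows:
print(woe_iv(0.41, 1.62))    # ≈ (1.374, 0.017)
print(woe_iv(49.64, 43.67))  # ≈ (-0.128, 0.008)
```

The small differences come from the rounded percentages in the table.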
Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| column | | The column we wish to analyse. | required |

Returns:

| Type | Description |
|------|-------------|
| df (`pd.DataFrame`) | A pandas DataFrame in the format above. |
`fit(self, X, y=None)` (inherited)

Fit X, y.
`fit_interactive(self, X, y=None, mode='external', **server_kwargs)` (inherited)

Fit a bucketer and then interactively edit the fit using a dash app.
Note that this uses a jupyter-dash app, which supports 3 different modes:

- 'external' (default): start the dash server and print the URL
- 'inline': start the dash app inside an IFrame in the jupyter notebook
- 'jupyterlab': start the dash app as a new tab inside jupyterlab
`fit_transform(self, X, y=None, **fit_params)` (inherited)

Fit to data, then transform it.
Fits the transformer to X and y with optional parameters `fit_params`, and returns a transformed version of X.

Parameters:

X : array-like of shape (n_samples, n_features)
    Input samples.
y : array-like of shape (n_samples,) or (n_samples, n_outputs), default=None
    Target values (None for unsupervised transformations).
**fit_params : dict
    Additional fit parameters.

Returns:

X_new : ndarray array of shape (n_samples, n_features_new)
    Transformed array.
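The contract here is the standard scikit-learn one: `fit_transform(X, y)` is equivalent to `fit(X, y)` followed by `transform(X)`. A minimal sketch with a toy transformer (illustrative only, not skorecard code) makes this concrete:

```python
class ToyBucketer:
    """Toy transformer illustrating the fit/transform contract."""

    def fit(self, X, y=None):
        # "Learn" a single bucket edge from the data: the median.
        xs = sorted(X)
        self.edge_ = (xs[len(xs) // 2 - 1] + xs[len(xs) // 2]) / 2
        return self

    def transform(self, X):
        # Bucket 0 below the learned edge, bucket 1 at or above it.
        return [int(x >= self.edge_) for x in X]

    def fit_transform(self, X, y=None):
        # Equivalent to fit followed by transform.
        return self.fit(X, y).transform(X)

X = [1.0, 2.0, 3.0, 10.0]
print(ToyBucketer().fit_transform(X))  # [0, 0, 1, 1]
```

Because the bucketers follow this contract, they drop into `sklearn.pipeline.Pipeline` without adapters.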
`get_params(self, deep=True)` (inherited)

Get the parameters of this estimator.
`plot_bucket(self, column, line='event_rate', format=None, scale=None, width=None, height=None)` (inherited)

Plot the buckets.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| column | | The column we want to visualise. | required |
| line | | The line to plot on the secondary axis. Default is the event rate. | `'event_rate'` |
| format | | The format of the image, such as 'png'. The default None returns a plotly image. | `None` |
| scale | | If format is specified, the scale of the image. | `None` |
| width | | If format is specified, the width of the image. | `None` |
| height | | If format is specified, the height of the image. | `None` |

Returns:

| Type | Description |
|------|-------------|
| plot | plotly fig |
`predict(self, X)` (inherited)

Applies the transform method. To be used for grid searches.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| X | `pd.DataFrame` | The numerical data which will be transformed into the corresponding buckets. | required |

Returns:

| Type | Description |
|------|-------------|
| y (`np.array`) | Transformed X, such that the values of X are replaced by the corresponding bucket numbers. |
`predict_proba(self, X)` (inherited)

Applies the transform method. To be used for grid searches.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| X | `pd.DataFrame` | The numerical data which will be transformed into the corresponding buckets. | required |

Returns:

| Type | Description |
|------|-------------|
| yhat (`np.array`) | Transformed X, such that the values of X are replaced by the corresponding bucket numbers. |
`save_yml(self, fout)` (inherited)

Save the features bucket to a yaml file.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| fout | PathLike | File output. | required |
`set_params(self, **params)` (inherited)

Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as `sklearn.pipeline.Pipeline`). The latter have parameters of the form `<component>__<parameter>`, so that it's possible to update each component of a nested object.

Parameters:

**params : dict
    Estimator parameters.

Returns:

self : estimator instance
    Estimator instance.
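The `<component>__<parameter>` routing can be sketched with toy objects. This is only an illustration of the naming convention, not sklearn's actual implementation:

```python
class Toy:
    """Minimal estimator-like object with an optional nested component."""

    def __init__(self, alpha=1.0, inner=None):
        self.alpha = alpha
        self.inner = inner

    def set_params(self, **params):
        for key, value in params.items():
            if "__" in key:
                # Route 'component__parameter' to the nested object.
                component, _, sub_key = key.partition("__")
                getattr(self, component).set_params(**{sub_key: value})
            else:
                setattr(self, key, value)
        return self

outer = Toy(inner=Toy(alpha=0.5))
outer.set_params(alpha=2.0, inner__alpha=0.1)
print(outer.alpha, outer.inner.alpha)  # 2.0 0.1
```

This is what makes it possible to tune, e.g., a bucketer's parameters inside a pipeline via a single grid-search parameter dictionary.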
`summary(self)` (inherited)

Display a summary table for columns passed to `.fit()`.
The format is the following:

| column | num_prebuckets | num_buckets | dtype |
|--------|----------------|-------------|-------|
| LIMIT_BAL | 15 | 10 | float64 |
| BILL_AMT1 | 15 | 6 | float64 |
`transform(self, X, y=None)` (inherited)

Transforms an array into the corresponding buckets fitted by the transformer.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| X | `pd.DataFrame` | DataFrame which will be transformed into the corresponding buckets. | required |
| y | array | Target. | None |

Returns:

| Type | Description |
|------|-------------|
| df (`pd.DataFrame`) | Dataset with transformed features. |