BucketingProcess

A two-step bucketing pipeline allowing for pre-bucketing before bucketing.

Often you want to pre-bucket features (e.g. into 100 buckets) before bucketing them down to a smaller set. This introduces additional challenges around propagating specials and defining a bucketer that can go from raw data to the final buckets. This class facilitates the process and also provides all the regular methods and attributes:

  • .summary(): See which columns are bucketed
  • .plot_bucket(): Plot buckets of a column
  • .bucket_table(): Table with buckets of a column
  • .save_yml(): Save information necessary for bucketing to a YAML file
  • .features_bucket_mapping_: Access bucketing information

Examples:

from skorecard import datasets
from skorecard.bucketers import DecisionTreeBucketer, OptimalBucketer, AsIsCategoricalBucketer
from skorecard.pipeline import BucketingProcess
from sklearn.pipeline import make_pipeline

df = datasets.load_uci_credit_card(as_frame=True)
y = df["default"]
X = df.drop(columns=["default"])

num_cols = ["LIMIT_BAL", "BILL_AMT1"]
cat_cols = ["EDUCATION", "MARRIAGE"]

bucketing_process = BucketingProcess(
    specials={"LIMIT_BAL": {"=400000.0": [400000.0]}},
    prebucketing_pipeline=make_pipeline(
        DecisionTreeBucketer(variables=num_cols, max_n_bins=100, min_bin_size=0.05),
        AsIsCategoricalBucketer(variables=cat_cols),
    ),
    bucketing_pipeline=make_pipeline(
        OptimalBucketer(variables=num_cols, max_n_bins=10, min_bin_size=0.05),
        OptimalBucketer(variables=cat_cols, variables_type='categorical', max_n_bins=10, min_bin_size=0.05),
    )
)

bucketing_process.fit(X, y)

# Details
bucketing_process.summary() # all variables and their number of (pre)buckets
bucketing_process.bucket_table("LIMIT_BAL")
bucketing_process.plot_bucket("LIMIT_BAL")
bucketing_process.prebucket_table("LIMIT_BAL")
bucketing_process.plot_prebucket("LIMIT_BAL")

features_bucket_mapping_ property readonly

Returns a FeaturesBucketMapping instance.

In normal bucketers, you can access .features_bucket_mapping_ to retrieve a FeaturesBucketMapping instance. This contains all the info you need to transform values into their buckets.

In this class, we have a two-step bucketing process: first prebucketing, and then bucketing the prebuckets.

To still be able to use BucketingProcess as if it were a normal bucketer, the two mappings need to be merged into one.
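For example, after fitting the bucketing_process from the example above, the merged mapping can be accessed directly (a minimal sketch):

# Assumes the fitted `bucketing_process` from the example above
mapping = bucketing_process.features_bucket_mapping_

# The merged mapping goes from raw values straight to final buckets,
# with one entry per bucketed feature
print(mapping)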

name property readonly

To be able to identify the BucketingProcess in a pipeline.

__init__(self, prebucketing_pipeline=Pipeline(steps=[('decisiontreebucketer', DecisionTreeBucketer(max_n_bins=50, min_bin_size=0.02))]), bucketing_pipeline=Pipeline(steps=[('optimalbucketer', OptimalBucketer(max_n_bins=6))]), variables=[], specials={}, random_state=None, remainder='passthrough') special

Define a BucketingProcess to first prebucket and then bucket multiple columns in one go.

Parameters:

prebucketing_pipeline (Pipeline)
    The scikit-learn pipeline that does pre-bucketing. Defaults to an all-numeric DecisionTreeBucketer pipeline.
    Default: Pipeline(steps=[('decisiontreebucketer', DecisionTreeBucketer(max_n_bins=50, min_bin_size=0.02))])

bucketing_pipeline (Pipeline)
    The scikit-learn pipeline that does bucketing. Defaults to an all-numeric OptimalBucketer pipeline. Must transform the same features as the prebucketing pipeline.
    Default: Pipeline(steps=[('optimalbucketer', OptimalBucketer(max_n_bins=6))])

variables (list)
    The features to bucket. Uses all features if not defined.
    Default: []

specials (Dict)
    (Nested) dictionary of special values that require their own binning. Will merge when specials are also defined in any bucketers in a (pre)bucketing pipeline, and overwrite in case there are shared keys. The dictionary has the following format: {"<column name>": {"name of special bucket": <array with 1 or more values>}}. For every feature that needs a special value, a dictionary must be passed as value. This dictionary contains the name of a bucket (key) and an array of unique values that should be put in that bucket. When special values are defined, they are not considered in the fitting procedure.
    Default: {}

remainder (str)
    How we want the non-specified columns to be transformed. Must be one of ["passthrough", "drop"]. passthrough (default): all columns that were not specified in "variables" will be passed through. drop: all remaining columns that were not specified in "variables" will be dropped.
    Default: 'passthrough'
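For illustration, a specials dictionary covering two features might look like this (the feature names and values are just examples):

# Hypothetical specials: each listed value gets its own bucket
# and is excluded from the fitting procedure
specials = {
    "LIMIT_BAL": {"=400000.0": [400000.0]},
    "EDUCATION": {"other/unknown": [0, 5, 6]},
}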

bucket_table(self, column) inherited

Generates the statistics for the buckets of a particular column.

The pre-buckets are matched to the post-buckets, so that the user has a much clearer understanding of how the BucketingProcess ends up with the final buckets. An example:

bucket  label         Count   Count (%)  Non-event  Event  % Event  % Non-event  Event Rate  WoE     IV
0       (-inf, 25.0)  61.0    1.36       57.0       4.0    0.41     1.62         0.066       1.380   0.017
1       [25.0, 45.0)  2024.0  44.98      1536.0     488.0  49.64    43.67        0.241       -0.128  0.008

Parameters:

column
    The column we wish to analyse.
    Required.

Returns:

df (pd.DataFrame)
    A pandas dataframe of the format above.
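Since this is a regular pandas DataFrame, it can be filtered and inspected as usual (assuming the fitted bucketing_process from the example above):

table = bucketing_process.bucket_table("LIMIT_BAL")

# e.g. inspect the information value contributed by each bucket
print(table[["label", "Count", "IV"]])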

fit(self, X, y=None)

Fit the prebucketing and bucketing pipeline with X, y.

Parameters:

X (pd.DataFrame)
    Data to fit on.
    Required.

y (np.array)
    Target. Defaults to None.

fit_interactive(self, X, y=None, mode='external', **server_kwargs)

Fit a bucketer and then interactively edit the fit using a dash app.

Note we are using a JupyterDash app, which supports three different modes:

  • 'external' (default): Start dash server and print URL
  • 'inline': Start dash app inside an Iframe in the jupyter notebook
  • 'jupyterlab': Start dash app as a new tab inside jupyterlab
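A minimal sketch, assuming a notebook environment and the fitted example above:

# Fits, then starts a dash server and prints its URL;
# the app lets you inspect and edit the resulting buckets
bucketing_process.fit_interactive(X, y, mode='external')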

fit_transform(self, X, y=None, **fit_params) inherited

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters

X : array-like of shape (n_samples, n_features)
    Input samples.

y : array-like of shape (n_samples,) or (n_samples, n_outputs), default=None
    Target values (None for unsupervised transformations).

**fit_params : dict
    Additional fit parameters.

Returns

X_new : ndarray array of shape (n_samples, n_features_new)
    Transformed array.
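This is equivalent to calling .fit(X, y) followed by .transform(X):

# One-step fit and transform; returns the bucketed version of X
X_bucketed = bucketing_process.fit_transform(X, y)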

get_params(self, deep=True) inherited

Get parameters for this estimator.

Parameters

deep : bool, default=True If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params : dict Parameter names mapped to their values.
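With deep=True, the parameters of the nested (pre)bucketing pipelines are included as well, using double-underscore names (a sketch; the exact keys depend on the pipelines used):

params = bucketing_process.get_params(deep=True)

# Nested parameters follow the <component>__<parameter> convention,
# e.g. a key like 'bucketing_pipeline__optimalbucketer__max_n_bins'
print(sorted(params.keys()))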

plot_bucket(self, column, line='event_rate', format=None, scale=None, width=None, height=None) inherited

Plot the buckets.

Parameters:

column
    The column we want to visualise.
    Required.

line
    The line to plot on the secondary axis. Default is the event rate.
    Default: 'event_rate'

format
    The format of the image, such as 'png'. The default None returns a plotly image.
    Default: None

scale
    If format is specified, the scale of the image.
    Default: None

width
    If format is specified, the width of the image.
    Default: None

height
    If format is specified, the height of the image.
    Default: None

Returns:

plot
    plotly fig
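For example, to get the interactive figure or a static export at a fixed size (assuming the fitted example above; static export relies on plotly's image-export support):

# Interactive plotly figure
fig = bucketing_process.plot_bucket("LIMIT_BAL")

# Render as a static PNG at a fixed size instead
bucketing_process.plot_bucket("LIMIT_BAL", format="png", width=800, height=400)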

plot_prebucket(self, column, line='event_rate', format=None, scale=None, width=None, height=None) inherited

Generates the prebucket table and produces a corresponding plotly plot.

Parameters:

column
    The column we want to visualise.
    Required.

line
    The line to plot on the secondary axis. Default is the event rate.
    Default: 'event_rate'

format
    The format of the image, such as 'png'. The default None returns a plotly image.
    Default: None

scale
    If format is specified, the scale of the image.
    Default: None

width
    If format is specified, the width of the image.
    Default: None

height
    If format is specified, the height of the image.
    Default: None

Returns:

plot
    plotly fig

prebucket_table(self, column)

Generates the statistics for the pre-buckets of a particular column.

An example is seen below:

pre-bucket  label        Count  Count (%)  Non-event  Event  Event Rate  WoE   IV    bucket
0           (-inf, 1.0)  479    7.98       300        179    37.37       0.73  0.05  0
1           [1.0, 2.0)   370    6.17       233        137    37.03       0.71  0.04  0

Parameters:

column (str)
    The column we wish to analyse.
    Required.

Returns:

df (pd.DataFrame)
    A pandas dataframe of the format above.
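The trailing bucket column shows which final bucket each pre-bucket was merged into, which makes it easy to audit the second bucketing step:

prebuckets = bucketing_process.prebucket_table("LIMIT_BAL")

# Count how many pre-buckets ended up in each final bucket
print(prebuckets.groupby("bucket").size())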

save_yml(self, fout)

Save the features bucket mapping to a YAML file.

Parameters:

fout (PathLike)
    Path for the output file.
    Required.
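A minimal usage sketch (the filename is just an example):

# Writes the merged bucketing information to a YAML file,
# so the exact same transformations can be reproduced later
bucketing_process.save_yml("bucketing_process.yml")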

set_params(self, **params) inherited

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as sklearn.pipeline.Pipeline). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.

Parameters

**params : dict
    Estimator parameters.

Returns

self : estimator instance
    Estimator instance.
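For instance, a nested parameter of the bucketing pipeline could be updated like this (the parameter path is illustrative, following the <component>__<parameter> convention):

# Address the OptimalBucketer inside the bucketing pipeline
# via the double-underscore path and change its max_n_bins
bucketing_process.set_params(bucketing_pipeline__optimalbucketer__max_n_bins=8)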

summary(self) inherited

Display a summary table for columns passed to .fit().

The format is the following:

column     num_prebuckets  num_buckets  dtype
LIMIT_BAL  15              10           float64
BILL_AMT1  15              6            float64

transform(self, X)

Transform X through the prebucketing and bucketing pipelines.
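Once fitted, new data is passed through both pipelines in one call:

# Raw values are first mapped to pre-buckets, then to final buckets
X_bucketed = bucketing_process.transform(X)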

