BucketingProcess
A two-step bucketing pipeline allowing for pre-bucketing before bucketing.
Often you want to pre-bucket features (e.g. to 100 buckets) before bucketing them down to a smaller set. This brings some additional challenges around propagating specials and defining a bucketer that is able to go from raw data to final bucket. This class facilitates that process and also provides all the regular methods and attributes:

- .summary(): See which columns are bucketed.
- .plot_bucket(): Plot the buckets of a column.
- .bucket_table(): Table with the buckets of a column.
- .save_yml(): Save the information necessary for bucketing to a YAML file.
- .features_bucket_mapping_: Access the bucketing information.
Examples:

```python
from skorecard import datasets
from skorecard.bucketers import DecisionTreeBucketer, OptimalBucketer, AsIsCategoricalBucketer
from skorecard.pipeline import BucketingProcess
from sklearn.pipeline import make_pipeline

df = datasets.load_uci_credit_card(as_frame=True)
y = df["default"]
X = df.drop(columns=["default"])

num_cols = ["LIMIT_BAL", "BILL_AMT1"]
cat_cols = ["EDUCATION", "MARRIAGE"]

bucketing_process = BucketingProcess(
    specials={"LIMIT_BAL": {"=400000.0": [400000.0]}},
    # Pre-bucket into a fine grid of buckets
    prebucketing_pipeline=make_pipeline(
        DecisionTreeBucketer(variables=num_cols, max_n_bins=100, min_bin_size=0.05),
        AsIsCategoricalBucketer(variables=cat_cols),
    ),
    # Reduce the pre-buckets to at most 10 final buckets per feature
    bucketing_pipeline=make_pipeline(
        OptimalBucketer(variables=num_cols, max_n_bins=10, min_bin_size=0.05),
        OptimalBucketer(variables=cat_cols, variables_type="categorical", max_n_bins=10, min_bin_size=0.05),
    ),
)
bucketing_process.fit(X, y)

# Inspect the details
bucketing_process.summary()  # all variables and their number of buckets
bucketing_process.bucket_table("LIMIT_BAL")
bucketing_process.plot_bucket("LIMIT_BAL")
bucketing_process.prebucket_table("LIMIT_BAL")
bucketing_process.plot_prebucket("LIMIT_BAL")
```
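Because BucketingProcess behaves like any other scikit-learn transformer, it can also be used as a step in a larger modelling pipeline. A minimal sketch, assuming the integer bucket indices are one-hot encoded before a plain scikit-learn logistic regression (any downstream encoder and model would do):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

# Bucket indices are ordinal integers, so one-hot encode them before a linear model
pipe = make_pipeline(
    bucketing_process,
    OneHotEncoder(handle_unknown="ignore"),
    LogisticRegression(max_iter=1000),
)
pipe.fit(X, y)
```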
features_bucket_mapping_ (property, read-only)

Returns a FeaturesBucketMapping instance.
In normal bucketers, you can access .features_bucket_mapping_ to retrieve a FeaturesBucketMapping instance, which contains all the information you need to transform values into their buckets. This class is effectively a two-step bucketing process: first prebucketing, and then bucketing of the prebuckets. In order to still be able to use BucketingProcess as if it were a normal bucketer, the two mappings are merged into one.
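For example, retrieving the merged mapping from the fitted process above:

```python
# The merged prebucketing + bucketing mapping; it maps raw values
# straight to their final buckets
mapping = bucketing_process.features_bucket_mapping_
print(mapping)
```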
name (property, read-only)

Used to identify the BucketingProcess in a pipeline.
__init__(self, prebucketing_pipeline=Pipeline(steps=[('decisiontreebucketer', DecisionTreeBucketer(max_n_bins=50, min_bin_size=0.02))]), bucketing_pipeline=Pipeline(steps=[('optimalbucketer', OptimalBucketer(max_n_bins=6))]), variables=[], specials={}, random_state=None, remainder='passthrough') (special method)

Define a BucketingProcess to first prebucket and then bucket multiple columns in one go.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| prebucketing_pipeline | Pipeline | The scikit-learn pipeline that does pre-bucketing. Defaults to an all-numeric DecisionTreeBucketer pipeline. | Pipeline(steps=[('decisiontreebucketer', DecisionTreeBucketer(max_n_bins=50, min_bin_size=0.02))]) |
| bucketing_pipeline | Pipeline | The scikit-learn pipeline that does bucketing. Defaults to an all-numeric OptimalBucketer pipeline. Must transform the same features as the prebucketing pipeline. | Pipeline(steps=[('optimalbucketer', OptimalBucketer(max_n_bins=6))]) |
| variables | list | The features to bucket. Uses all features if not defined. | [] |
| specials | Dict | (Nested) dictionary of special values that require their own binning. Merged with any specials defined in the bucketers of a (pre)bucketing pipeline; shared keys are overwritten. The dictionary has the format {"<column name>": {"<label of special bucket>": [<list of values>]}}. | {} |
| remainder | str | How the non-specified columns should be transformed. Must be one of ["passthrough", "drop"]. "passthrough" (default): all columns not specified in variables are passed through. "drop": all remaining columns not specified in variables are dropped. | 'passthrough' |
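To illustrate the specials format, the dictionary from the example above puts the single value 400000.0 of LIMIT_BAL into its own special bucket labelled '=400000.0':

```python
# {column name: {label of special bucket: list of raw values}}
specials = {"LIMIT_BAL": {"=400000.0": [400000.0]}}
```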
bucket_table(self, column) (inherited)

Generates the statistics for the buckets of a particular column.

The pre-buckets are matched to the post-buckets, so that the user has a much clearer understanding of how the BucketingProcess ends up with the final buckets. An example:
bucket | label | Count | Count (%) | Non-event | Event | % Event | % Non-event | Event Rate | WoE | IV |
---|---|---|---|---|---|---|---|---|---|---|
0 | (-inf, 25.0) | 61.0 | 1.36 | 57.0 | 4.0 | 0.41 | 1.62 | 0.066 | 1.380 | 0.017 |
1 | [25.0, 45.0) | 2024.0 | 44.98 | 1536.0 | 488.0 | 49.64 | 43.67 | 0.241 | -0.128 | 0.008 |
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| column | | The column we wish to analyse. | required |

Returns:

| Type | Description |
|---|---|
| df (pd.DataFrame) | A pandas DataFrame of the format above. |
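For instance, on the fitted process from the example above (column names taken from the example table):

```python
table = bucketing_process.bucket_table("LIMIT_BAL")
# Keep only the bucket definitions and their sizes
table[["bucket", "label", "Count", "Count (%)"]]
```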
fit(self, X, y=None)

Fit the prebucketing and bucketing pipelines with X and y.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| X | pd.DataFrame | Data to fit on. | required |
| y | np.array | The target. Defaults to None. | None |
fit_interactive(self, X, y=None, mode='external', **server_kwargs)

Fit a bucketer and then interactively edit the fit using a dash app.

Note that this uses a JupyterDash app, which supports 3 different modes:

- 'external' (default): Start the dash server and print its URL
- 'inline': Start the dash app inside an IFrame in the jupyter notebook
- 'jupyterlab': Start the dash app as a new tab inside jupyterlab
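For example, to run the app embedded in a notebook (assuming the dash dependencies are installed):

```python
bucketing_process.fit_interactive(X, y, mode="inline")
```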
fit_transform(self, X, y=None, **fit_params) (inherited)

Fit to data, then transform it.

Fits the transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters

X : array-like of shape (n_samples, n_features)
    Input samples.
y : array-like of shape (n_samples,) or (n_samples, n_outputs), default=None
    Target values (None for unsupervised transformations).
**fit_params : dict
    Additional fit parameters.

Returns

X_new : ndarray of shape (n_samples, n_features_new)
    Transformed array.
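This is equivalent to calling .fit(X, y) followed by .transform(X):

```python
X_bucketed = bucketing_process.fit_transform(X, y)
```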
get_params(self, deep=True) (inherited)

Get parameters for this estimator.
plot_bucket(self, column, line='event_rate', format=None, scale=None, width=None, height=None) (inherited)

Plot the buckets.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| column | | The column we want to visualise. | required |
| line | | The line to plot on the secondary axis. Default is the event rate. | 'event_rate' |
| format | | The format of the image, such as 'png'. The default None returns a plotly figure. | None |
| scale | | If format is specified, the scale of the image. | None |
| width | | If format is specified, the width of the image. | None |
| height | | If format is specified, the height of the image. | None |

Returns:

| Type | Description |
|---|---|
| plot | plotly fig |
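For example, to render a static PNG instead of an interactive figure (static export assumes a plotly image backend such as kaleido is installed):

```python
bucketing_process.plot_bucket("LIMIT_BAL", format="png", scale=2, width=800, height=400)
```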
plot_prebucket(self, column, line='event_rate', format=None, scale=None, width=None, height=None) (inherited)

Generates the prebucket table and produces a corresponding plotly plot.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| column | | The column we want to visualise. | required |
| line | | The line to plot on the secondary axis. Default is the event rate. | 'event_rate' |
| format | | The format of the image, such as 'png'. The default None returns a plotly figure. | None |
| scale | | If format is specified, the scale of the image. | None |
| width | | If format is specified, the width of the image. | None |
| height | | If format is specified, the height of the image. | None |

Returns:

| Type | Description |
|---|---|
| plot | plotly fig |
prebucket_table(self, column)

Generates the statistics for the pre-buckets of a particular column.

An example is seen below:
pre-bucket | label | Count | Count (%) | Non-event | Event | Event Rate | WoE | IV | bucket |
---|---|---|---|---|---|---|---|---|---|
0 | (-inf, 1.0) | 479 | 7.98 | 300 | 179 | 37.37 | 0.73 | 0.05 | 0 |
1 | [1.0, 2.0) | 370 | 6.17 | 233 | 137 | 37.03 | 0.71 | 0.04 | 0 |
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| column | str | The column we wish to analyse. | required |

Returns:

| Type | Description |
|---|---|
| df (pd.DataFrame) | A pandas DataFrame of the format above. |
save_yml(self, fout)

Save the features bucket mapping to a YAML file.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| fout | PathLike | Path for the output file. | required |
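For example:

```python
from pathlib import Path

bucketing_process.save_yml(Path("bucketing_process.yml"))
```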
set_params(self, **params) (inherited)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as sklearn.pipeline.Pipeline). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.

Parameters

**params : dict
    Estimator parameters.

Returns

self : estimator instance
    Estimator instance.
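For example, to update a parameter nested inside the bucketing pipeline (assuming the default step name 'optimalbucketer'):

```python
# Allow at most 5 final buckets in the default OptimalBucketer step
bucketing_process.set_params(bucketing_pipeline__optimalbucketer__max_n_bins=5)
```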
summary(self) (inherited)

Display a summary table for the columns passed to .fit().

The format is the following:
column | num_prebuckets | num_buckets | dtype |
---|---|---|---|
LIMIT_BAL | 15 | 10 | float64 |
BILL_AMT1 | 15 | 6 | float64 |
transform(self, X)

Transform X through the prebucketing and bucketing pipelines.
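For example, with the fitted process from the example above:

```python
# Each raw value is replaced by the index of its final bucket
X_bucketed = bucketing_process.transform(X)
```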