
Optimizing the bucketing process

import pandas as pd
from IPython.display import display

from skorecard.datasets import load_credit_card

df = load_credit_card(as_frame=True)

# Show
display(df.head(4))

num_feats = ["x1", "x15", "x16"]

X = df[num_feats]
y = df["y"]
x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 ... x15 x16 x17 x18 x19 x20 x21 x22 x23 y
0 20000.0 2.0 2.0 1.0 24.0 2.0 2.0 -1.0 -1.0 -2.0 ... 0.0 0.0 0.0 0.0 689.0 0.0 0.0 0.0 0.0 1
1 120000.0 2.0 2.0 2.0 26.0 -1.0 2.0 0.0 0.0 0.0 ... 3272.0 3455.0 3261.0 0.0 1000.0 1000.0 1000.0 0.0 2000.0 1
2 90000.0 2.0 2.0 2.0 34.0 0.0 0.0 0.0 0.0 0.0 ... 14331.0 14948.0 15549.0 1518.0 1500.0 1000.0 1000.0 1000.0 5000.0 0
3 50000.0 2.0 2.0 1.0 37.0 0.0 0.0 0.0 0.0 0.0 ... 28314.0 28959.0 29547.0 2000.0 2019.0 1200.0 1100.0 1069.0 1000.0 0

4 rows × 24 columns

Finding the best bucketing

The art of building a good scorecard model lies in finding the best bucketing strategy.
Good buckets improve the predictive power of the model and guarantee the stability of its predictions.

This is normally a very manual, labour-intensive process (and for good reason).

A good bucketing strategy follows these principles:

  • it maximises the Information Value (IV), defined as

    $$\mathrm{IV} = \sum_i \left(\%G_i - \%B_i\right)\,\log\!\left(\frac{\%G_i}{\%B_i}\right)$$

    where $\%G_i$ and $\%B_i$ are the fractions of all goods and all bads that fall in bucket $i$ (a hand computation is sketched right after this list);
  • it avoids buckets that contain a very large or very small fraction of the population, wherever business sense requires it.
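For reference, here is how the IV of a given bucketing could be computed by hand. This is a minimal sketch: information_value is a hypothetical helper (not part of skorecard) that assumes y == 1 marks a bad and that every bucket contains both goods and bads.

import numpy as np

def information_value(buckets, y):
    # Cross-tabulate bucket membership against the binary target
    tab = pd.crosstab(pd.Series(buckets), pd.Series(y))
    pct_good = tab[0] / tab[0].sum()  # fraction of all goods per bucket
    pct_bad = tab[1] / tab[1].sum()  # fraction of all bads per bucket
    # IV = sum_i (%G_i - %B_i) * log(%G_i / %B_i)
    return float(np.sum((pct_good - pct_bad) * np.log(pct_good / pct_bad)))

skorecard's IV_scorer, used in the grid search below, exposes the same metric in a form that scikit-learn's grid search can consume.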

The skorecard package provides some tooling to automate part of the process, namely:

  • Grid search the hyper-parameters of the bucketers in order to maximise the information value
  • Run the optimal bucketer within the bucketing process

Grid search the bucketers to maximise the information value

skorecard implements an IV_scorer that can be used as a custom scoring function for grid searching.
The following snippets of code show how to integrate it into the grid search.
The DecisionTreeBucketer applied to numerical features is the best use case, as it has hyper-parameters that influence the bucketing quality.

from skorecard.metrics import IV_scorer
from skorecard.bucketers import DecisionTreeBucketer
from sklearn.model_selection import GridSearchCV

The DecisionTreeBucketer has two main hyper-parameters to grid-search:

  • max_n_bins, the maximum number of bins allowed for the bucketing
  • min_bin_size, the minimum fraction of data in each bucket

gs_params = {
    "max_n_bins": [3, 4, 5, 6],
    "min_bin_size": [0.05, 0.06, 0.07, 0.08],  # , 0.12]
}

The optimization has to be done for every feature independently, so we loop over the features and collect the best parameters, scores, and CV results in dictionaries.

# Collectors for the per-feature results
best_params = dict()
max_iv = dict()
cv_results = dict()

# Add a special for demo purposes
specials = {"x1": {"special 0": ["50000.0"]}}

for feat in num_feats:
    # This snippet illustrates what to do with special values
    if feat in specials.keys():
        # This construct is needed to remap the specials, because skorecard validates that the key
        # of the dictionary is present in the variables
        special = {feat: specials[feat]}
    else:
        special = {}
    bucketer = DecisionTreeBucketer(variables=[feat], specials=special)
    gs = GridSearchCV(bucketer, gs_params, scoring=IV_scorer, cv=3, return_train_score=True)
    gs.fit(X[[feat]], y)

    best_params[feat] = gs.best_params_
    max_iv[feat] = gs.best_score_
    cv_results[feat] = gs.cv_results_

Checking the best parameters per feature

best_params
{'x1': {'max_n_bins': 3, 'min_bin_size': 0.05},
 'x15': {'max_n_bins': 3, 'min_bin_size': 0.05},
 'x16': {'max_n_bins': 3, 'min_bin_size': 0.05}}
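Once validated (mind the caveat below), the winning parameters can be plugged straight back into the bucketers. A minimal sketch, where fitted_bucketers is just an illustrative name and the specials defined above would have to be passed again for x1:

# Refit one bucketer per feature with its best hyper-parameters
fitted_bucketers = {}
for feat in num_feats:
    bucketer = DecisionTreeBucketer(variables=[feat], **best_params[feat])
    fitted_bucketers[feat] = bucketer.fit(X[[feat]], y)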

Because of its additive nature, IV is likely to be maximal for the highest max_n_bins. Therefore it is worth analysing the CV results!

cv_results["x1"]
{'mean_fit_time': array([0.14118997, 0.13273303, 0.13474902, 0.15843304, 0.17114846,
        0.1259594 , 0.12854441, 0.13791513, 0.14939396, 0.12906257,
        0.15454125, 0.11709793, 0.1234947 , 0.11326059, 0.11524073,
        0.11928709]),
 'std_fit_time': array([0.01615798, 0.00538481, 0.00918157, 0.02513249, 0.02130305,
        0.0088008 , 0.0078898 , 0.00226504, 0.01736914, 0.00537724,
        0.04489044, 0.00418452, 0.00750423, 0.00055744, 0.00241629,
        0.01126566]),
 'mean_score_time': array([0.03244432, 0.03500628, 0.03295326, 0.04452038, 0.04895496,
        0.03155041, 0.03200722, 0.03328069, 0.0405368 , 0.03386513,
        0.02966809, 0.03014151, 0.03117593, 0.02836776, 0.02895562,
        0.02856787]),
 'std_score_time': array([0.00520814, 0.00130717, 0.00347241, 0.00365442, 0.01206228,
        0.00173939, 0.00279055, 0.0009404 , 0.01539335, 0.00089022,
        0.00115331, 0.00201206, 0.00159311, 0.00142928, 0.0013334 ,
        0.00086222]),
 'param_max_n_bins': masked_array(data=[3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6],
              mask=[False, False, False, False, False, False, False, False,
                    False, False, False, False, False, False, False, False],
        fill_value='?',
             dtype=object),
 'param_min_bin_size': masked_array(data=[0.05, 0.06, 0.07, 0.08, 0.05, 0.06, 0.07, 0.08, 0.05,
                    0.06, 0.07, 0.08, 0.05, 0.06, 0.07, 0.08],
              mask=[False, False, False, False, False, False, False, False,
                    False, False, False, False, False, False, False, False],
        fill_value='?',
             dtype=object),
 'params': [{'max_n_bins': 3, 'min_bin_size': 0.05},
  {'max_n_bins': 3, 'min_bin_size': 0.06},
  {'max_n_bins': 3, 'min_bin_size': 0.07},
  {'max_n_bins': 3, 'min_bin_size': 0.08},
  {'max_n_bins': 4, 'min_bin_size': 0.05},
  {'max_n_bins': 4, 'min_bin_size': 0.06},
  {'max_n_bins': 4, 'min_bin_size': 0.07},
  {'max_n_bins': 4, 'min_bin_size': 0.08},
  {'max_n_bins': 5, 'min_bin_size': 0.05},
  {'max_n_bins': 5, 'min_bin_size': 0.06},
  {'max_n_bins': 5, 'min_bin_size': 0.07},
  {'max_n_bins': 5, 'min_bin_size': 0.08},
  {'max_n_bins': 6, 'min_bin_size': 0.05},
  {'max_n_bins': 6, 'min_bin_size': 0.06},
  {'max_n_bins': 6, 'min_bin_size': 0.07},
  {'max_n_bins': 6, 'min_bin_size': 0.08}],
 'split0_test_score': array([0.079, 0.079, 0.079, 0.079, 0.097, 0.097, 0.097, 0.097, 0.106,
        0.106, 0.106, 0.106, 0.107, 0.107, 0.107, 0.107]),
 'split1_test_score': array([4.491, 4.491, 4.491, 4.491, 4.308, 4.308, 4.308, 4.308, 4.19 ,
        4.19 , 4.19 , 4.19 , 4.043, 4.043, 4.043, 4.043]),
 'split2_test_score': array([4.442, 4.442, 4.442, 4.442, 4.305, 4.305, 4.305, 4.305, 4.07 ,
        4.07 , 4.07 , 4.07 , 3.975, 3.975, 3.975, 3.975]),
 'mean_test_score': array([3.004     , 3.004     , 3.004     , 3.004     , 2.90333333,
        2.90333333, 2.90333333, 2.90333333, 2.78866667, 2.78866667,
        2.78866667, 2.78866667, 2.70833333, 2.70833333, 2.70833333,
        2.70833333]),
 'std_test_score': array([2.06838407, 2.06838407, 2.06838407, 2.06838407, 1.98437771,
        1.98437771, 1.98437771, 1.98437771, 1.89756429, 1.89756429,
        1.89756429, 1.89756429, 1.83962991, 1.83962991, 1.83962991,
        1.83962991]),
 'rank_test_score': array([ 1,  1,  1,  1,  5,  5,  5,  5,  9,  9,  9,  9, 13, 13, 13, 13],
       dtype=int32),
 'split0_train_score': array([0.049, 0.049, 0.049, 0.049, 0.049, 0.049, 0.049, 0.049, 0.05 ,
        0.05 , 0.05 , 0.05 , 0.05 , 0.05 , 0.05 , 0.05 ]),
 'split1_train_score': array([0.102, 0.102, 0.102, 0.102, 0.112, 0.112, 0.112, 0.112, 0.116,
        0.116, 0.116, 0.116, 0.119, 0.119, 0.119, 0.119]),
 'split2_train_score': array([0.119, 0.119, 0.119, 0.119, 0.144, 0.144, 0.144, 0.144, 0.156,
        0.156, 0.156, 0.156, 0.159, 0.159, 0.159, 0.159]),
 'mean_train_score': array([0.09      , 0.09      , 0.09      , 0.09      , 0.10166667,
        0.10166667, 0.10166667, 0.10166667, 0.10733333, 0.10733333,
        0.10733333, 0.10733333, 0.10933333, 0.10933333, 0.10933333,
        0.10933333]),
 'std_train_score': array([0.02981051, 0.02981051, 0.02981051, 0.02981051, 0.03946588,
        0.03946588, 0.03946588, 0.03946588, 0.04370609, 0.04370609,
        0.04370609, 0.04370609, 0.04502098, 0.04502098, 0.04502098,
        0.04502098])}
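As suggested above, the raw cv_results_ dictionary is easier to analyse as a DataFrame. A minimal sketch, with one sensible selection of columns:

results = pd.DataFrame(cv_results["x1"])
cols = ["param_max_n_bins", "param_min_bin_size", "mean_test_score", "std_test_score", "rank_test_score"]
display(results[cols].sort_values("rank_test_score"))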

RandomizedSearchCV to maximise AUC

As skorecard is scikit-learn compatible, we can use scikit-learn tools such as RandomizedSearchCV to maximise the AUC of our model. Shown below is one such example.

from skorecard.bucketers import DecisionTreeBucketer, OptimalBucketer
from skorecard.pipeline import BucketingProcess
from skorecard.linear_model import LogisticRegression
from skorecard.preprocessing import WoeEncoder
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from scipy.stats import uniform


def get_pipeline():
    bucketing_process = BucketingProcess(
        prebucketing_pipeline=make_pipeline(
            DecisionTreeBucketer(max_n_bins=100, min_bin_size=0.05),
        ),
        bucketing_pipeline=make_pipeline(
            OptimalBucketer(max_n_bins=10, min_bin_size=0.04),
        ),
    )

    return make_pipeline(
        bucketing_process, WoeEncoder(), LogisticRegression(solver="liblinear", C=1.7, max_iter=150, random_state=0)
    )


pipe = get_pipeline()


param_grid = [
    {"logisticregression__C": uniform(loc=0, scale=4), "logisticregression__solver": ["liblinear"]},
]

search_cv = RandomizedSearchCV(
    pipe, param_distributions=param_grid, cv=5, verbose=True, scoring="roc_auc", n_jobs=-1, random_state=0, refit=True
)
search_cv.fit(X, y)
Fitting 5 folds for each of 10 candidates, totalling 50 fits

RandomizedSearchCV(cv=5,
                   estimator=Pipeline(steps=[('bucketingprocess',
                                              BucketingProcess(bucketing_pipeline=Pipeline(steps=[('optimalbucketer',
                                                                                                   OptimalBucketer(min_bin_size=0.04))]),
                                                               prebucketing_pipeline=Pipeline(steps=[('decisiontreebucketer',
                                                                                                      DecisionTreeBucketer())]))),
                                             ('woeencoder', WoeEncoder()),
                                             ('logisticregression',
                                              LogisticRegression(C=1.7,
                                                                 max_iter=150,
                                                                 random_state=0,
                                                                 solver='liblinear'))]),
                   n_jobs=-1,
                   param_distributions=[{'logisticregression__C': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7f97de25c5d0>,
                                         'logisticregression__solver': ['liblinear']}],
                   random_state=0, scoring='roc_auc', verbose=True)
search_cv.best_params_, search_cv.best_score_
({'logisticregression__C': 2.860757465489678,
  'logisticregression__solver': 'liblinear'},
 0.6187444445104318)
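Because refit=True, search_cv.best_estimator_ has already been refit on the full dataset and can be used directly. A minimal sketch; note that scoring on the training data gives an optimistic AUC, so a held-out test set would be the proper check:

from sklearn.metrics import roc_auc_score

best_model = search_cv.best_estimator_
probas = best_model.predict_proba(X)[:, 1]
print(f"AUC on the training data: {roc_auc_score(y, probas):.3f}")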