Optimizing the bucketing process¶
import pandas as pd
from IPython.display import display
from skorecard.datasets import load_credit_card
df = load_credit_card(as_frame=True)
# Show
display(df.head(4))
num_feats = ["x1", "x15", "x16"]
X = df[num_feats]
y = df["y"]
Finding the best bucketing¶
The art of building a good scorecard model lies in finding the best bucketing strategy.
Good buckets improve the predictive power of the model and guarantee stability of the predictions.
This is normally a very manual, labour-intensive process (and for a good reason).
A good bucketing strategy follows these principles:
- maximizes the Information Value, defined as IV = Σ over buckets of (%goods − %bads) × ln(%goods / %bads)
- avoids buckets that contain a very large or very small fraction of the population, wherever the business sense requires it
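To make the Information Value concrete, here is a minimal sketch computing IV from bucket counts in plain Python (the counts are invented purely for illustration):

```python
import math

# Illustrative bucket counts (invented for demo purposes):
# per bucket, the number of "good" (y=0) and "bad" (y=1) observations
buckets = [
    {"good": 400, "bad": 20},
    {"good": 350, "bad": 60},
    {"good": 250, "bad": 120},
]

total_good = sum(b["good"] for b in buckets)  # 1000
total_bad = sum(b["bad"] for b in buckets)    # 200

iv = 0.0
for b in buckets:
    pct_good = b["good"] / total_good
    pct_bad = b["bad"] / total_bad
    woe = math.log(pct_good / pct_bad)   # weight of evidence of the bucket
    iv += (pct_good - pct_bad) * woe     # IV is a sum over buckets

print(round(iv, 3))  # → 0.73
```

Note that each bucket contributes a non-negative term, which is why IV grows as buckets are added; this additivity matters when interpreting the grid-search results below.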
The skorecard package provides some tooling to automate part of the process, namely:
- Grid search the hyper-parameters of the bucketers in order to maximise the information value
- Run the optimal bucketer within the bucketing process
Grid search the bucketers to maximise the information value¶
skorecard implements an IV_scorer that can be used as a custom scoring function for grid searching.
The following snippets of code show how to integrate it in the grid search.
The DecisionTreeBucketer applied to numerical features is the best use case, as it has some hyper-parameters that influence the bucketing quality.
from skorecard.metrics import IV_scorer
from skorecard.bucketers import DecisionTreeBucketer
from sklearn.model_selection import GridSearchCV
The DecisionTreeBucketer has two main hyperparameters to grid-search:
- max_n_bins: the maximum number of bins allowed for the bucketing
- min_bin_size: the minimum fraction of data in each bucket
gs_params = {
    "max_n_bins": [3, 4, 5, 6],
    "min_bin_size": [0.05, 0.06, 0.07, 0.08],
}
The optimization has to be done for every feature independently, therefore we need a loop, and all the results are best stored in a data collector, such as a dictionary:
best_params = dict()
max_iv = dict()
cv_results = dict()

# Define a special bucket for demo purposes
specials = {"x1": {"special 0": ["50000.0"]}}

for feat in num_feats:
    # This snippet illustrates what to do with special values
    if feat in specials.keys():
        # This construct is needed to remap the specials, because skorecard validates that the key
        # of the dictionary is present in the variables
        special = {feat: specials[feat]}
    else:
        special = {}
    bucketer = DecisionTreeBucketer(variables=[feat], specials=special)
    gs = GridSearchCV(bucketer, gs_params, scoring=IV_scorer, cv=3, return_train_score=True)
    gs.fit(X[[feat]], y)
    best_params[feat] = gs.best_params_
    max_iv[feat] = gs.best_score_
    cv_results[feat] = gs.cv_results_
Checking the best parameters per feature
best_params
Because of its additive nature, IV is likely to be maximal for the highest max_n_bins.
Therefore it is worth analysing the CV results!
cv_results["x1"]
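A convenient way to analyse these results is to load them into a DataFrame and compare train and test scores per parameter setting. A sketch of that pattern follows; the dictionary below is a hand-made stand-in for a real cv_results_ entry, with invented scores:

```python
import pandas as pd

# Hand-made stand-in for gs.cv_results_ (scores invented for illustration)
fake_cv_results = {
    "param_max_n_bins": [3, 4, 5, 6],
    "param_min_bin_size": [0.05, 0.05, 0.05, 0.05],
    "mean_train_score": [0.61, 0.66, 0.70, 0.72],
    "mean_test_score": [0.60, 0.64, 0.65, 0.64],
}

results = pd.DataFrame(fake_cv_results)
# The train IV keeps growing with max_n_bins (additivity), while the
# cross-validated IV flattens out or drops - a sign of overfitted buckets
print(results.sort_values("mean_test_score", ascending=False).head())
```

In this invented example the cross-validated IV peaks at max_n_bins=5 even though the train IV is highest at 6, which is exactly the pattern worth checking for in the real cv_results.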
RandomizedSearchCV to maximise AUC¶
As skorecard is scikit-learn compatible, we can use scikit-learn methods such as RandomizedSearchCV to maximise the AUC of our model. Shown below is one such example.
from skorecard.bucketers import DecisionTreeBucketer, OptimalBucketer
from skorecard.pipeline import BucketingProcess
from skorecard.linear_model import LogisticRegression
from skorecard.preprocessing import WoeEncoder
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from scipy.stats import uniform
def get_pipeline():
    bucketing_process = BucketingProcess(
        prebucketing_pipeline=make_pipeline(
            DecisionTreeBucketer(max_n_bins=100, min_bin_size=0.05),
        ),
        bucketing_pipeline=make_pipeline(
            OptimalBucketer(max_n_bins=10, min_bin_size=0.04),
        ),
    )
    return make_pipeline(
        bucketing_process, WoeEncoder(), LogisticRegression(solver="liblinear", C=1.7, max_iter=150, random_state=0)
    )
pipe = get_pipeline()
param_grid = [
    {"logisticregression__C": uniform(loc=0, scale=4), "logisticregression__solver": ["liblinear"]},
]

search_cv = RandomizedSearchCV(
    pipe, param_distributions=param_grid, cv=5, verbose=True, scoring="roc_auc", n_jobs=-1, random_state=0, refit=True
)
search_cv.fit(X, y)
search_cv.best_params_, search_cv.best_score_
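Because refit=True, the fitted search object itself acts as the tuned model. A minimal self-contained sketch of that mechanic, using plain scikit-learn and synthetic data as a stand-in for the skorecard pipeline and the credit-card frame:

```python
from scipy.stats import uniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (the example above uses the credit-card frame)
X_demo, y_demo = make_classification(n_samples=500, n_features=5, random_state=0)

demo_search = RandomizedSearchCV(
    LogisticRegression(solver="liblinear", random_state=0),
    param_distributions={"C": uniform(loc=0, scale=4)},
    n_iter=10, cv=3, scoring="roc_auc", random_state=0, refit=True,
)
demo_search.fit(X_demo, y_demo)

# With refit=True the search exposes the best estimator directly,
# so predict_proba can be called on the search object itself
proba = demo_search.predict_proba(X_demo)[:, 1]
print(demo_search.best_params_, round(roc_auc_score(y_demo, proba), 3))
```

The same calls work on the search_cv object above: its predict_proba goes through the whole refit bucketing-plus-WoE-plus-logistic pipeline.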