Bucketing¶
The core of a good skorecard model is to bucket the features.
This section showcases how to use skorecard to achieve this.
Let's start by loading the demo data
from skorecard.datasets import load_credit_card
data = load_credit_card(as_frame=True)
data.head()
The dataset used contains 30K rows, 23 features and a binary target.
Let's start by splitting the data into train and test samples, as per common practice when building a model.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
data.drop(["y"], axis=1), data["y"], test_size=0.25, random_state=42
)
Define the numerical and categorical features¶
The features x2, x3 and x4 are of a categorical nature.
cat_cols = ["x2", "x3", "x4"]
num_cols = [col for col in X_train.columns if col not in cat_cols]
print(f"Total categorical columns: {len(cat_cols)}")
print(f"Total numerical columns: {len(num_cols)}")
Bucketing features¶
Skorecard implements several bucketers, but not all of them are applicable to every feature.
- Categorical features: in most cases there is no numerical relationship between categories, so automatic bucketing is difficult to perform. skorecard implements the OrdinalCategoricalBucketer, which orders the buckets either by the count or by the target rate. It includes a tolerance (tol) parameter: the minimum fraction of the data a category must contain to keep its own bucket.
- Numerical features: unlike categorical features, numerical features can be bucketed algorithmically. skorecard implements several bucketers for numerical features (a quick sketch of two of them follows this list):
    - DecisionTreeBucketer: fits a univariate decision tree to find the optimal splits (requires the target y)
    - EqualWidthBucketer: generates buckets with equal spacing between the bucket boundaries (i.e. histograms)
    - EqualFrequencyBucketer: generates buckets with equal counts in the buckets (i.e. quantiles)
    - AgglomerativeClusteringBucketer: generates buckets by applying AgglomerativeClustering (density-based bucketing)
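As a quick illustration of the bucketers listed above, here is a minimal sketch fitting an EqualWidthBucketer and an AgglomerativeClusteringBucketer on the numerical columns. It assumes both bucketers accept n_bins and variables (mirroring the EqualFrequencyBucketer used later in this tutorial); n_bins=5 is an arbitrary choice for the example.
# Hedged sketch: assumes n_bins/variables parameters and a bucket_table method, as for the other bucketers
from skorecard.bucketers import EqualWidthBucketer, AgglomerativeClusteringBucketer

ew_bucketer = EqualWidthBucketer(n_bins=5, variables=num_cols)
ew_bucketer.fit(X_train, y_train)
ew_bucketer.bucket_table("x1")

ac_bucketer = AgglomerativeClusteringBucketer(n_bins=5, variables=num_cols)
ac_bucketer.fit(X_train, y_train)
ac_bucketer.bucket_table("x1")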
Manual correction¶
Note: to build a high-quality scorecard, it's highly recommended to manually assess every bucket.
The algorithms implemented in skorecard are very helpful, but they have no notion of business sense.
This is especially true for categorical features, where business sense should prevail.
Default bucketers¶
Categorical features¶
Because there is (generally speaking) no relationship between categories, it's not possible to bucket the values algorithmically in the same way as for numerical features.
The only suitable bucketer for categorical features in skorecard is the OrdinalCategoricalBucketer, which groups low-frequency categories together (all categories with a frequency below the tol threshold are put in the other bucket).
Let's fix tol to 5%, as this is the recommended minimum.
# Start with categorical features
from skorecard.bucketers import OrdinalCategoricalBucketer
cat_bucketer = OrdinalCategoricalBucketer(variables=cat_cols, tol=0.05, encoding_method="ordered")
cat_bucketer.fit(X_train, y_train)
Example: Look up the feature 'x3'
The feature 'x3' has a few categories that are sparsely populated.
X_train["x3"].value_counts(normalize=True)
The OrdinalCategoricalBucketer groups the sparse categories into the other bucket.
cat_bucketer.bucket_table("x3")
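To see the effect on the data itself, here is a minimal sketch comparing the distinct values of 'x3' before and after bucketing. It assumes, as with the pipeline transform shown later in this tutorial, that transform returns a pandas DataFrame with the same column names; the sparse categories end up sharing a single bucket index.
# Compare the raw categories of 'x3' with the bucket indexes after transformation
X_train["x3"].nunique(), cat_bucketer.transform(X_train)["x3"].nunique()
cat_bucketer.transform(X_train)["x3"].value_counts(normalize=True)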
Note: different bucketing strategies can be applied to different features.
This is addressed by defining a pipeline of bucketers (see the numerical features section for details).
Numerical features¶
Numerical features allow for different bucketers (as described above).
However, the recommended approach for bucketing is to use either the DecisionTreeBucketer or the BucketingProcess.
X_train["x1"].value_counts()
from skorecard.bucketers import DecisionTreeBucketer
specials = {"x1": {"special_demo": [50000]}}
num_bucketer = DecisionTreeBucketer(
variables=num_cols,
max_n_bins=5, # max number of bins allowed
min_bin_size=0.06, # min fraction of data allowed in the bin
dt_kwargs={
"criterion": "entropy",
"min_impurity_decrease": 0.0005, # as in sklearn. Helps to decide how to split the buckets
},
specials=specials,
)
num_bucketer.fit(X_train, y_train)
num_bucketer.bucket_table("x1")
See the bucket outputs (for the first three features)¶
from IPython.display import display
for x in num_cols[:3]:
    display(num_bucketer.bucket_table(x))  # num_bucketer is already fitted above
With different bucketers for different features in one go¶
Note that a warning is given below to alert you that some numerical features have too many unique values. It is good to pay attention to these warnings, as the quantiles are approximate.
from sklearn.pipeline import make_pipeline
from skorecard.bucketers import EqualFrequencyBucketer, DecisionTreeBucketer
pipe = make_pipeline(
EqualFrequencyBucketer(n_bins=5, variables=num_cols[:5]), DecisionTreeBucketer(max_n_bins=5, variables=num_cols[5:])
)
pipe.fit(X_train, y_train)
pipe.transform(X_test)
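As a quick sanity check (a minimal sketch relying only on pandas), you can count the distinct bucket indexes per column after the transformation; each bucketed feature should now have only a handful of distinct values (at most the configured number of bins, plus any reserved buckets such as missings).
# Count the distinct bucket indexes per feature after the pipeline transformation
pipe.transform(X_test).nunique()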
Parenthesis: compare the buckets from two different algorithms¶
Comparing the DecisionTreeBucketer from the first example with the EqualFrequencyBucketer from the pipeline example gives a quick preview of how to assess two bucketing results.
The first case yields a higher IV with fewer buckets, hence it's definitely the better result!
print(f"\nDecisionTreeBucketer, with IV: {num_bucketer.bucket_table(num_cols[0])['IV'].sum()}")
display(num_bucketer.bucket_table(num_cols[0]))
print(f"\nEqualFrequencyBucketer, with IV: {pipe.steps[0][1].bucket_table(num_cols[0])['IV'].sum()}")
display(pipe.steps[0][1].bucket_table(num_cols[0]))
Make a pipeline for all the features¶
So far we have shown how to use bucketers for categorical and numerical features.
The whole process can be put together in a scikit-learn pipeline.
bucketing_pipe = make_pipeline(num_bucketer, cat_bucketer)
bucketing_pipe.fit(X_train, y_train)
Save the bucketers to file¶
Once the buckets are satisfactory, save the outputs to a YAML file.
from skorecard.pipeline import to_skorecard_pipeline
to_skorecard_pipeline(bucketing_pipe).save_yml(open("buckets.yml", "w"))
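To reuse the saved buckets later (for example at scoring time), skorecard provides a UserInputBucketer that can be built from a bucket definition. The sketch below assumes its constructor accepts the path to the saved YAML file, so double-check the signature against the skorecard documentation for your version.
# Hedged sketch: reload the saved bucket definitions and apply them to new data
from skorecard.bucketers import UserInputBucketer

loaded_bucketer = UserInputBucketer("buckets.yml")  # assumption: the constructor accepts a YAML path
loaded_bucketer.fit(X_train, y_train)  # likely a no-op, kept to satisfy the scikit-learn API
loaded_bucketer.transform(X_test).head()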
Using the bucketing process¶
The most common approach to bucketing is to perform what is known as fine-coarse classing.
In less fancy words:

- you start with very loose bucketing requirements (many buckets, on which some minimal, hopefully significant, aggregations and statistics can be computed) - this is known as fine classing
- for numerical features, adjacent buckets with a similar default rate/WoE are then merged together
- for categorical features, categories with a similar default rate/WoE should be merged together, but only when it makes sense
- the last two steps (merging buckets together) are known as coarse classing

In skorecard, this process is implemented by the BucketingProcess, as shown below:

- The bucketing process starts by defining the loose (fine) buckets (prebucketing pipeline)
- It then runs an optimization algorithm that merges the buckets together (bucketing pipeline)
from skorecard.bucketers import DecisionTreeBucketer, OrdinalCategoricalBucketer, OptimalBucketer
from skorecard.pipeline import BucketingProcess
from sklearn.pipeline import make_pipeline
specials = {"x1": {"special_demo": [50000]}}
prebucketing_pipeline = make_pipeline(
DecisionTreeBucketer(variables=num_cols, max_n_bins=40, min_bin_size=0.03), # loose requirements
OrdinalCategoricalBucketer(variables=cat_cols, tol=0.02),
)
bucketing_pipeline = make_pipeline(
OptimalBucketer(variables=num_cols, max_n_bins=6, min_bin_size=0.05),
OptimalBucketer(variables=cat_cols, variables_type="categorical", max_n_bins=10, min_bin_size=0.05),
)
bucketing_process = BucketingProcess(
prebucketing_pipeline=prebucketing_pipeline,
bucketing_pipeline=bucketing_pipeline,
specials=specials,
)
bucketing_process = bucketing_process.fit(X_train, y_train)
Let's see the output of this optimization step¶
print(f"Prebucketing step (fine classing), IV {bucketing_process.prebucket_table('x1')['IV'].sum()}")
display(bucketing_process.prebucket_table("x1"))
print(f"\nBucketing step (coarse classing), IV {bucketing_process.bucket_table('x1')['IV'].sum()}")
display(bucketing_process.bucket_table("x1"))
Notice that the first table has an additional column containing the optimized bucket index (those indexes are aggregated into the buckets of the second table).
Because of the additive nature of the IV, reducing the number of buckets will normally decrease the IV.
The goal is to reduce the number of buckets as much as possible while keeping a high IV value, and to check for monotonicity wherever needed.
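To check this trade-off across several features at once, here is a small sketch that reuses only the prebucket_table and bucket_table calls shown above to compare the IV before and after the optimization step.
# Compare the IV of the fine (prebucketing) and coarse (bucketing) steps for the first few numerical features
for col in num_cols[:3]:
    iv_fine = bucketing_process.prebucket_table(col)["IV"].sum()
    iv_coarse = bucketing_process.bucket_table(col)["IV"].sum()
    print(f"{col}: IV fine classing = {iv_fine:.3f}, IV coarse classing = {iv_coarse:.3f}")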
y_train.value_counts(normalize=True)
bucketing_process.plot_prebucket("x1", format="png", scale=2, width=1050, height=525)
bucketing_process.plot_bucket("x1", format="png", scale=2, width=1050, height=525)
bucketing_process.summary()
bucketing_process.save_yml("best_bucketing.yml")
Manual bucket refinement¶
Besides manual inspection, it's often necessary to manually refine the buckets.
skorecard implements a handy Dash web app that allows the user to redefine the bucket allocation.
bucketing_process.fit(X_train, y_train)
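If my reading of the skorecard documentation is correct, the interactive Dash app is launched with a fit_interactive call; treat the exact method name as an assumption and check the library's docs for your version.
# Hedged sketch: launch the interactive bucket-refinement app (assumed method name: fit_interactive)
# bucketing_process.fit_interactive(X_train, y_train)  # commented out because it starts a local web server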
Up next¶
How to perform feature selection in a skorecard model.