Once the buckets are defined, the next step is to perform the feature selection.¶
In building a skorecard model, there are a few recommended steps to select the features:
- Calculate the information values (IV) to identify the very predictive features
- Calculate the Population Stability Index (PSI) to identify the unstable features
- Evaluate the multicollinearity of the features that pass the previous two steps
The Information Value (IV)¶
The information value is defined by the following equation:

$$ \mathrm{IV} = \sum_{i} \left(\%n_{i}^{y=0} - \%n_{i}^{y=1}\right)\,\ln\left(\frac{\%n_{i}^{y=0}}{\%n_{i}^{y=1}}\right) $$

where \(i\) is the bucket index, \(\%n_{i}^{y=0}\) represents the fraction of the counts of target 0 in bucket \(i\), and \(\%n_{i}^{y=1}\) represents the fraction of the counts of target 1 in bucket \(i\).
The IV is a weighted sum of the Weight of Evidence (WoE) of every bucket.
The higher the value, the larger the separation between the classes: in other words, the more predictive the feature is.
As a rule of thumb:
- IV < 0.02 non-predictive feature
- 0.02 < IV < 0.1 predictive feature
- IV > 0.1 very predictive feature
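To make the formula concrete, here is a minimal sketch that computes the IV of a single bucketed feature by hand; the helper name manual_iv and the eps guard against empty buckets are illustrative and not part of skorecard (the library provides this calculation, as shown later in this tutorial).

import numpy as np
import pandas as pd

def manual_iv(bucketed: pd.Series, y: pd.Series, eps: float = 1e-6) -> float:
    # counts of target 0 and target 1 per bucket
    counts = pd.crosstab(bucketed, y)
    # fraction of each class falling in bucket i: %n_i^{y=0} and %n_i^{y=1}
    frac = (counts / counts.sum()).clip(lower=eps)  # eps avoids log(0) for empty buckets
    woe = np.log(frac[0] / frac[1])  # WoE of every bucket
    return float(((frac[0] - frac[1]) * woe).sum())  # weighted sum of the WoE values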
The Population Stability Index (PSI)¶
The PSI measures the similarity between two samples. The PSI is defined as

$$ \mathrm{PSI} = \sum_{i} \left(\%n_{i}^{X1} - \%n_{i}^{X2}\right)\,\ln\left(\frac{\%n_{i}^{X1}}{\%n_{i}^{X2}}\right) $$

where \(i\) is the bucket index, \(\%n_{i}^{X1}\) represents the fraction of the counts of the feature in sample X1 in bucket \(i\), while \(\%n_{i}^{X2}\) represents the fraction of the counts of the feature in sample X2 in bucket \(i\). It's the same definition as in the IV. However, here large values indicate a difference between the two samples, therefore for the selection we look at the lower values:
- PSI < 0.02 stable feature
- 0.02 < PSI < 0.1 unstable, but acceptable, feature
- PSI > 0.1 unstable feature
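Analogously, a minimal sketch of the PSI calculation for one bucketed feature, comparing two samples, could look as follows; the helper name manual_psi is illustrative, and in practice the psi function used later in this tutorial does the work for you.

import numpy as np
import pandas as pd

def manual_psi(bucketed_x1: pd.Series, bucketed_x2: pd.Series, eps: float = 1e-6) -> float:
    # fraction of each sample falling in bucket i: %n_i^{X1} and %n_i^{X2}
    frac_1 = bucketed_x1.value_counts(normalize=True).sort_index()
    frac_2 = bucketed_x2.value_counts(normalize=True).sort_index()
    # align on the union of buckets; eps guards against buckets missing from one sample
    frac_1, frac_2 = frac_1.align(frac_2, fill_value=eps)
    return float(((frac_1 - frac_2) * np.log(frac_1 / frac_2)).sum())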
Let's load the data as in the previous tutorials and split it into train and test.
import pandas as pd
from skorecard.datasets import load_credit_card
from sklearn.model_selection import train_test_split
data = load_credit_card(as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
data.drop(["y"], axis=1), data["y"], test_size=0.25, random_state=42
)
Load the saved buckets¶
import yaml

with open("buckets.yml") as f:
    buckets_dict = yaml.safe_load(f)
Define the bucketer using the UserInputBucketer.
from skorecard.bucketers import UserInputBucketer
uib = UserInputBucketer(buckets_dict)
X_train_bins = uib.fit_transform(X_train, y_train)
X_test_bins = uib.transform(X_test)
X_train_bins
uib.bucket_table("x1")
Calculating the Information Value¶
The information value can be calculated by the iv()
function in the reporting module.
from skorecard.reporting import iv
iv_dict = iv(X_train_bins, y_train)
iv_values = pd.Series(iv_dict).sort_values(ascending=False)  # sort them by predictive power
iv_values.head(5)
As an arbitrary threshold, we can select the features where the IV value is above 0.002.
preselected_features = iv_values[iv_values > 0.002].index.tolist()
print(f"Total selected features by IV: {len(preselected_features)}")
Calculating the PSI¶
Similarly to the IV, the PSI can be calculated with the psi
function in the reporting module.
from skorecard.reporting import psi
psi_dict = psi(X_train_bins, X_test_bins)
psi_values = pd.Series(psi_dict)
psi_values.sort_values(ascending=False).head(5)
In this particular case, all the features have a very low PSI, hence no instability is detected and no features are removed at this step.
Removing multi-collinear features¶
A skorecard model is based on a logistic regression algorithm.
By design, logistic regression is sensitive to multi-collinearity (i.e. highly correlated features).
There are multiple ways of measuring it, such as the Variance Inflation Factor (VIF) and pairwise correlations.
Here we are showing the approach with correlations.
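For reference, a VIF-based check could look like the sketch below; it assumes statsmodels is installed, and the helper name vif_table is illustrative rather than part of skorecard. Features with a VIF well above 5-10 are usually considered problematic.

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(X: pd.DataFrame) -> pd.Series:
    # VIF of each feature, regressed against all the other features
    return pd.Series(
        [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
        index=X.columns,
        name="VIF",
    ).sort_values(ascending=False)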
First, let's build a simple visualization function.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
def plot_correlations(corr):
plt.figure(figsize=(10, 8), constrained_layout=True)
cmap = plt.cm.get_cmap("RdBu")
matrix = np.triu(corr)
sns.heatmap(corr, vmin=-1, vmax=1, annot=True, mask=matrix, cmap=cmap, annot_kws={"fontsize": 6})
Adding the WoE Encoder¶
A default scorecard model is defined by the following steps:
- bucketing
- encoding (the default is a WoE encoder)
- logistic regression model
As the input of the logistic regression model is the dataset transformed to its WoE values, let's first build the pipeline with the first two steps and transform the dataset.
from skorecard.preprocessing import WoeEncoder
from sklearn.pipeline import make_pipeline
woe_pipe = make_pipeline(uib, WoeEncoder())
X_train_woe = woe_pipe.fit_transform(X_train, y_train)
preselected_features
X_train_corr = X_train_woe[preselected_features].corr()
plot_correlations(X_train_corr)
As a rule of thumb, correlations above 0.6 can be considered problematic for the logistic regression model (this threshold might depend heavily on the dataset and use case).
The following code snippet illustrates a recursive feature elimination step, where features are sorted by their IV importance and correlated features with lower IV importance are removed.
corr_limit = 0.6 # correlation threshold
drop_feats = list()
# keep_feats = list()
for ix, feature in enumerate(preselected_features):
    if feature in drop_feats:
        continue
    # consider only the features ranked below the current one (lower IV)
    remaining_features = [
        feat
        for feat in preselected_features[ix:]
        if feat not in drop_feats and feat != feature
    ]
    if len(remaining_features) == 0:
        continue  # go to the next feature if all the remaining ones have already been removed
    # find the features correlated with the current one;
    # both positive and negative correlations matter, hence the abs()
    corr_feats = X_train_corr.loc[remaining_features, feature].abs()
    drop_at_step = corr_feats[corr_feats > corr_limit].index.tolist()
    # append the newly dropped features to the list
    drop_feats += drop_at_step
# Select the features with low correlations
good_feats = [feat for feat in preselected_features if feat not in drop_feats]
print(f"Total preselected features: {len(preselected_features)}")
print(f"Total features dropped due too high correlations: {len(drop_feats)}")
print(f"Total selected features: {len(good_feats)}")
Visualizing the correlations of the good features to verify that the RFE step worked.
plot_correlations(X_train_woe[good_feats].corr())
and the final list of selected features is shown below
good_feats
Up next¶
After performing the feature selection, it's time to build the final Skorecard
model.