Measuring bucketed distribution shifts

Population stability index (PSI)

The PSI (population stability index) is a common measure of how similar two univariate distributions are.

It is given by the following formula:

\[PSI=\sum_{i=1}^{N_{bins}} \left(\%x_{i}^{actual} - \%x_{i}^{expected}\right) \log\frac{\%x_{i}^{actual}}{\%x_{i}^{expected}}\]

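where the sum runs over all the buckets of the feature x, and \(\%x_{i}^{actual}\) and \(\%x_{i}^{expected}\) are the fractions of observations that fall into bucket i in the actual and expected datasets, respectively.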
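To make the formula concrete, here is a minimal NumPy sketch that evaluates it for two arrays of bucket labels. The helper `psi_from_buckets` and its `eps` parameter are hypothetical names introduced for illustration, not part of skorecard, and clipping empty buckets to a small value is an assumption made here to keep the logarithm finite.

```python
import numpy as np

def psi_from_buckets(expected_bins, actual_bins, eps=1e-6):
    """PSI between two arrays of bucket labels (hypothetical helper, not skorecard's API)."""
    expected_bins = np.asarray(expected_bins)
    actual_bins = np.asarray(actual_bins)
    # Use the union of labels so both samples are compared on the same buckets
    labels = np.union1d(expected_bins, actual_bins)
    # Fraction of observations per bucket in each sample
    pct_expected = np.array([(expected_bins == b).mean() for b in labels])
    pct_actual = np.array([(actual_bins == b).mean() for b in labels])
    # Assumption: clip empty buckets to eps to avoid log(0) and division by zero
    pct_expected = np.clip(pct_expected, eps, None)
    pct_actual = np.clip(pct_actual, eps, None)
    return float(np.sum((pct_actual - pct_expected) * np.log(pct_actual / pct_expected)))

# Toy data: three buckets with shares 50/30/20 (expected) and 40/30/30 (actual)
expected = np.array([0] * 50 + [1] * 30 + [2] * 20)
actual = np.array([0] * 40 + [1] * 30 + [2] * 30)

psi_value = psi_from_buckets(expected, actual)
print(round(psi_value, 4))  # 0.0629
```

Only the first and last buckets contribute here, because the middle bucket holds the same share in both samples. Two identical samples give a PSI of exactly zero.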

skorecard provides a simple function to calculate the PSI between two datasets.
Since two datasets are needed, we split X and y into a train set and a test set.

from skorecard import datasets
from sklearn.model_selection import train_test_split
from skorecard.bucketers import DecisionTreeBucketer

X, y = datasets.load_uci_credit_card(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

By definition, the PSI acts on bucketed features.

Failing to bucket the features still yields a PSI value, but in that case it is computed over all the unique values. For numerical features, this returns artificially high and meaningless values.
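The effect can be illustrated with synthetic data, independently of skorecard. In the sketch below, `psi_sketch` is a hypothetical helper, quantile bucketing stands in for a proper bucketer, and empty buckets are clipped to an assumed `eps`. The exact size of the unbucketed value depends on that choice.

```python
import numpy as np

def psi_sketch(expected_labels, actual_labels, eps=1e-6):
    """PSI over whatever labels are passed in (hypothetical helper, not skorecard's API)."""
    labels = np.union1d(expected_labels, actual_labels)
    # Fraction of observations per label in each sample
    pct_e = np.bincount(np.searchsorted(labels, expected_labels), minlength=len(labels)) / len(expected_labels)
    pct_a = np.bincount(np.searchsorted(labels, actual_labels), minlength=len(labels)) / len(actual_labels)
    # Assumption: clip empty labels to eps so the logarithm stays finite
    pct_e, pct_a = np.clip(pct_e, eps, None), np.clip(pct_a, eps, None)
    return float(np.sum((pct_a - pct_e) * np.log(pct_a / pct_e)))

rng = np.random.default_rng(0)
# Two samples from the same distribution: there is no real shift to detect
sample_a = rng.normal(size=1000)
sample_b = rng.normal(size=1000)

# Unbucketed: almost every float is unique, so each value becomes its own "bucket"
psi_raw = psi_sketch(sample_a, sample_b)

# Bucketed: ten quantile buckets derived from the first sample
edges = np.quantile(sample_a, np.linspace(0, 1, 11))[1:-1]
psi_binned = psi_sketch(np.digitize(sample_a, edges), np.digitize(sample_b, edges))

print(f"raw: {psi_raw:.3f}, bucketed: {psi_binned:.3f}")
```

Although both samples come from the same distribution, the unbucketed PSI is large, because nearly every value appears in only one of the two samples. The bucketed PSI stays close to zero.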

dbt = DecisionTreeBucketer()

X_train_bins = dbt.fit_transform(X_train, y_train)
X_test_bins = dbt.transform(X_test)

Calculating the PSI

from skorecard.reporting import psi

psi_dict = psi(X_train_bins, X_test_bins)
psi_dict

{'EDUCATION': 0.0005202506508081382,
 'MARRIAGE': 0.0003497580712116056,
 'LIMIT_BAL': 0.013577676978376134,
 'BILL_AMT1': 0.017027519474734677}

Univariate predictive power

Information value (IV)

The information value (IV) is simply the PSI computed between two subsets of the same feature: the rows where the target is y=0 and the rows where it is y=1.

In other words, it can be summarized by the formula:

\[IV=\sum_{i=1}^{N_{bins}} \left(\%x_{i}^{y=0} - \%x_{i}^{y=1}\right) \log\frac{\%x_{i}^{y=0}}{\%x_{i}^{y=1}}\]
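As a rough illustration of this formula, the following NumPy sketch splits a bucketed feature by the target and sums the per-bucket terms. The helper `iv_from_buckets` and its `eps` parameter are hypothetical, and clipping empty buckets is an assumption. skorecard's own implementation may handle such cases differently.

```python
import numpy as np

def iv_from_buckets(bins, y, eps=1e-6):
    """Information value of one bucketed feature (hypothetical helper, not skorecard's API)."""
    bins = np.asarray(bins)
    y = np.asarray(y)
    labels = np.unique(bins)
    # Share of each bucket among the y=0 rows and among the y=1 rows
    pct_y0 = np.array([(bins[y == 0] == b).mean() for b in labels])
    pct_y1 = np.array([(bins[y == 1] == b).mean() for b in labels])
    # Assumption: clip empty buckets to eps to avoid log(0) and division by zero
    pct_y0 = np.clip(pct_y0, eps, None)
    pct_y1 = np.clip(pct_y1, eps, None)
    return float(np.sum((pct_y0 - pct_y1) * np.log(pct_y0 / pct_y1)))

# Toy data: bucket shares are 60/40 for y=0 and 20/80 for y=1
toy_bins = np.array([0] * 6 + [1] * 4 + [0] * 2 + [1] * 8)
toy_y = np.array([0] * 10 + [1] * 10)

iv_value = iv_from_buckets(toy_bins, toy_y)
print(round(iv_value, 4))  # 0.7167
```

A feature whose bucket shares are identical for both target classes would give an IV of zero, meaning the buckets carry no information about the target.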

dbt = DecisionTreeBucketer()
X_bins = dbt.fit_transform(X, y)

To compute the IV, skorecard provides a handy function.
It takes the (binned) feature set X and the target y.

from skorecard.reporting import iv

iv_result = iv(X_bins, y)
iv_result

{'EDUCATION': 0.036451028950383324,
 'MARRIAGE': 0.009494315565036299,
 'LIMIT_BAL': 0.17922043483265943,
 'BILL_AMT1': 0.05239237644085838}