# Measuring bucketed distribution shifts

## Population stability index (PSI)

The population stability index (PSI) is a common measure of how similar two univariate distributions are.

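It is given by the following formula:

$$
PSI = \sum_{i} \left(p_i - q_i\right) \ln \frac{p_i}{q_i}
$$

where the sum runs over all the buckets $i$ of the feature `x`, and $p_i$ and $q_i$ are the fractions of observations that fall into bucket $i$ in the first and second dataset, respectively.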
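As a rough illustration of the PSI definition, the sketch below computes it in plain Python from two sequences of bucket labels. The helper `psi_from_buckets` and its `eps` floor for empty buckets are assumptions made for this example; they are not part of `skorecard`, whose implementation may handle empty buckets differently.

```python
import math
from collections import Counter


def psi_from_buckets(expected, actual, eps=1e-6):
    """PSI between two sequences of bucket labels (hypothetical helper)."""
    expected_counts = Counter(expected)
    actual_counts = Counter(actual)

    total = 0.0
    for bucket in set(expected_counts) | set(actual_counts):
        # Fraction of observations in this bucket for each dataset.
        # eps is an assumed floor that avoids log(0) and division by zero.
        p = max(expected_counts[bucket] / len(expected), eps)
        q = max(actual_counts[bucket] / len(actual), eps)
        total += (p - q) * math.log(p / q)
    return total


train_buckets = [0, 0, 0, 1, 1, 2, 2, 2, 2, 2]
test_buckets = [0, 0, 1, 1, 1, 2, 2, 2, 2, 2]

print(psi_from_buckets(train_buckets, train_buckets))  # identical distributions give 0.0
print(psi_from_buckets(train_buckets, test_buckets))   # about 0.081
```

Every term of the sum is non-negative, so the PSI is zero only when the two bucket distributions coincide, and it is symmetric in the two datasets.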

`skorecard` implements a simple function to calculate the PSI between two datasets.

Since two datasets are needed, we split `X` and `y` into a training set and a test set.

```
from sklearn.model_selection import train_test_split

from skorecard import datasets
from skorecard.bucketers import DecisionTreeBucketer

# Load the UCI credit card dataset as a feature set X and a target y
X, y = datasets.load_uci_credit_card(return_X_y=True)

# Hold out 25% of the rows as the second dataset for the PSI
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.25,
    random_state=42,
)
```

By definition, the PSI acts on bucketed features.

If the features are not bucketed, a PSI value is still returned, but it is computed over all the unique values. For numerical features, this yields artificially high and meaningless values.

```
# Fit the buckets on the training set only, then apply the same buckets to the test set
dbt = DecisionTreeBucketer()
X_train_bins = dbt.fit_transform(X_train, y_train)
X_test_bins = dbt.transform(X_test)
```
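To see why the bucketing step matters, the sketch below compares the PSI of raw and bucketed values for two samples drawn from the same distribution, so no real shift exists. It is a standalone illustration in plain Python: the helper `psi_over_values`, its `eps` floor, and the simple floor-and-clip bucketing are assumptions for this example and do not reproduce what `DecisionTreeBucketer` or `skorecard` do internally.

```python
import math
import random
from collections import Counter


def psi_over_values(expected, actual, eps=1e-6):
    """PSI over the distinct values of two sequences (hypothetical helper)."""
    p_counts, q_counts = Counter(expected), Counter(actual)
    total = 0.0
    for value in set(p_counts) | set(q_counts):
        # eps is an assumed floor for values that are absent from one side.
        p = max(p_counts[value] / len(expected), eps)
        q = max(q_counts[value] / len(actual), eps)
        total += (p - q) * math.log(p / q)
    return total


def to_bucket(x):
    # Floor each value and clip the result to -2..1, giving 4 buckets.
    return min(max(math.floor(x), -2), 1)


rng = random.Random(42)
train = [rng.gauss(0, 1) for _ in range(1000)]
test = [rng.gauss(0, 1) for _ in range(1000)]

# Unbucketed: nearly every float is unique, so each value is missing from one side.
psi_raw = psi_over_values(train, test)

# Bucketed: both samples share the same 4 buckets.
psi_bucketed = psi_over_values(
    [to_bucket(x) for x in train],
    [to_bucket(x) for x in test],
)

print(f"raw: {psi_raw:.2f}, bucketed: {psi_bucketed:.4f}")
```

The raw PSI is dominated by values that appear in only one of the two samples, so its size mostly reflects the choice of `eps` and the sample size rather than any change in the distribution.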

The PSI can now be calculated on the bucketed training and test sets:

```
from skorecard.reporting import psi
psi_dict = psi(X_train_bins, X_test_bins)
psi_dict
```

# Univariate predictive power

## Information value (IV)

The information value is nothing other than the PSI, computed between the bucketed feature distribution of the observations with target `y=0` and that of the observations with `y=1`.

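In other words, it can be summarized by the formula

$$
IV = \sum_{i} \left(p_i^{(0)} - p_i^{(1)}\right) \ln \frac{p_i^{(0)}}{p_i^{(1)}}
$$

where $p_i^{(0)}$ and $p_i^{(1)}$ are the fractions of the observations with `y=0` and `y=1`, respectively, that fall into bucket $i$.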
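The same idea can be sketched in plain Python: split the bucket labels by target class, then apply the PSI sum to the two resulting bucket distributions. The helper `information_value` and its `eps` floor are assumptions for this illustration, not the `skorecard` implementation, and the sketch assumes a binary target in which both classes are present.

```python
import math
from collections import Counter


def information_value(buckets, target, eps=1e-6):
    """IV of one bucketed feature against a binary target (hypothetical helper)."""
    # Bucket counts within each target class.
    counts_0 = Counter(b for b, t in zip(buckets, target) if t == 0)
    counts_1 = Counter(b for b, t in zip(buckets, target) if t == 1)
    n_0 = sum(counts_0.values())
    n_1 = sum(counts_1.values())

    total = 0.0
    for bucket in set(buckets):
        # Fraction of each class that falls into this bucket.
        # eps is an assumed floor for buckets that contain only one class.
        p0 = max(counts_0[bucket] / n_0, eps)
        p1 = max(counts_1[bucket] / n_1, eps)
        total += (p0 - p1) * math.log(p0 / p1)
    return total


# A feature whose buckets partly separate the two classes.
informative = information_value([0, 0, 0, 0, 1, 1, 1, 1], [0, 0, 0, 1, 0, 1, 1, 1])

# A feature whose buckets are equally likely in both classes.
uninformative = information_value([0, 1, 0, 1], [0, 0, 1, 1])

print(informative)    # equals ln(3), about 1.0986
print(uninformative)  # 0.0
```

A feature that is distributed identically in both classes carries no information about the target and has an IV of zero. The more the two class-conditional bucket distributions differ, the larger the IV.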

```
dbt = DecisionTreeBucketer()
X_bins = dbt.fit_transform(X, y)
```

To compute the IV, `skorecard` implements a handy function.

The function consumes the (binned) feature set `X` and the target `y`:

```
from skorecard.reporting import iv
# Use a separate name for the result so the imported iv function is not overwritten
iv_dict = iv(X_bins, y)
iv_dict
```