Missing Values

skorecard bucketers offer native support for missing values and will put them in a separate bucket by default.

In the example below, you can see that the single missing value is put into a new bucket '-1'.

import numpy as np
import pandas as pd
from skorecard.bucketers import EqualFrequencyBucketer

df = pd.DataFrame({"counts": [1, 2, 2, 1, 4, 2, np.nan, 1, 3]})
EqualFrequencyBucketer(n_bins=2).fit_transform(df).value_counts()
counts
 0        6
 1        2
-1        1
dtype: int64

Specific

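Alternatively, the user can assign the missing values to a specific bucket by passing missing_treatment a dictionary that maps each column name to a bucket number.

In the example below, the missing value is put into bucket 1, which raises that bucket's count from 2 to 3.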

EqualFrequencyBucketer(n_bins=2, missing_treatment={"counts": 1}).fit_transform(df).value_counts()
counts
0         6
1         3
dtype: int64
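
To make the bookkeeping concrete, here is a minimal sketch in plain Python. It does not call skorecard and is not how the library is implemented: the bucket labels are hard-coded from the outputs on this page, and `assign_missing` is a hypothetical helper introduced only for illustration.

```python
from collections import Counter

# Bucket labels for the nine rows of `df`, hard-coded for illustration.
# None marks the row whose "counts" value is missing.
buckets = [0, 0, 0, 0, 1, 0, None, 0, 1]


def assign_missing(labels, missing_bucket=-1):
    """Return a copy of labels with every None replaced by missing_bucket."""
    return [missing_bucket if label is None else label for label in labels]


# Default treatment: the missing row gets its own bucket, -1.
default_counts = Counter(assign_missing(buckets))

# Specific treatment: the missing row is merged into bucket 1.
specific_counts = Counter(assign_missing(buckets, missing_bucket=1))

print(dict(default_counts))
print(dict(specific_counts))
```

The two counters reproduce the value counts shown for the default and the specific treatment.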

Passthrough

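If the user wishes the missing values to be left untouched, they can specify this with missing_treatment="passthrough". The missing value then stays NaN in the output, and the bucket labels are shown as floats because NaN cannot be stored in an integer column.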

EqualFrequencyBucketer(n_bins=2, missing_treatment="passthrough").fit_transform(df)
counts
0 0.0
1 0.0
2 0.0
3 0.0
4 1.0
5 0.0
6 NaN
7 0.0
8 1.0

Most frequent

It's also possible to put the missing values into the most common bucket. Below, the missing value in row 6 is put into bucket 0, which holds 6 of the 8 non-missing rows.

EqualFrequencyBucketer(n_bins=2, missing_treatment="most_frequent").fit_transform(df)
counts
0 0
1 0
2 0
3 0
4 1
5 0
6 0
7 0
8 1
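
The choice of bucket can be sketched in plain Python as follows. This is an illustration under assumptions rather than skorecard's own code: the labels are hard-coded from the passthrough output above, with None standing in for NaN.

```python
from collections import Counter

# Bucket labels from the passthrough output; None marks the missing row.
buckets = [0, 0, 0, 0, 1, 0, None, 0, 1]

# Count only the non-missing rows, then take the most common label.
observed = Counter(label for label in buckets if label is not None)
most_frequent_bucket = observed.most_common(1)[0][0]

# Replace the missing label with that bucket.
filled = [most_frequent_bucket if label is None else label for label in buckets]
print(most_frequent_bucket, filled)
```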

Using the target to bucket

It's also possible to use the target to decide which bucket to use for the missing values. In the examples below, we use y as the target.

Neutral

Here the missing values are placed into the bucket whose Weight of Evidence (WoE) is closest to 0.

X = pd.DataFrame({"counts": [1, 2, 2, 1, 4, 2, np.nan, 1, 3]})
y = pd.DataFrame({"target": [0, 0, 1, 0, 1, 0, 1, 0, 1]})
EqualFrequencyBucketer(n_bins=2, missing_treatment="neutral").fit_transform(X, y)
counts
0 0
1 0
2 0
3 0
4 1
5 0
6 0
7 0
8 1
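
To see why bucket 0 is chosen, the WoE of each bucket can be worked out by hand. The sketch below is plain Python and rests on two assumptions that may differ from skorecard's internals: WoE is taken as ln(share of class 0 / share of class 1), and a small constant `EPSILON` is added to each count so that buckets containing a single class do not produce log(0). The choice of the bucket closest to 0 does not depend on the sign convention.

```python
import math

# Bucket labels from the passthrough output; None marks the missing row.
buckets = [0, 0, 0, 0, 1, 0, None, 0, 1]
target = [0, 0, 1, 0, 1, 0, 1, 0, 1]

EPSILON = 1e-5  # assumed smoothing constant, avoids log(0) for pure buckets


def woe_per_bucket(labels, y, epsilon=EPSILON):
    """Return {bucket: WoE} with WoE = ln(share of class 0 / share of class 1)."""
    total_1 = sum(y)
    total_0 = len(y) - total_1
    result = {}
    for bucket in set(labels):
        rows = [t for label, t in zip(labels, y) if label == bucket]
        n_1 = sum(rows)
        n_0 = len(rows) - n_1
        share_0 = (n_0 + epsilon) / total_0
        share_1 = (n_1 + epsilon) / total_1
        result[bucket] = math.log(share_0 / share_1)
    return result


woe = woe_per_bucket(buckets, target)

# Only the non-missing buckets are candidates for receiving the missing rows.
candidates = {b: w for b, w in woe.items() if b is not None}
neutral_bucket = min(candidates, key=lambda b: abs(candidates[b]))
print(woe, neutral_bucket)
```

Bucket 0 holds five rows of class 0 and one of class 1, so its WoE is close to ln(4). Bucket 1 contains only class 1, so its WoE is far from 0 under this smoothing, and bucket 0 is the neutral choice.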

Similar

We can also put the missing values into the bucket whose Weight of Evidence is closest to that of a bucket containing only the missing values.

EqualFrequencyBucketer(n_bins=2, missing_treatment="similar").fit_transform(X, y)
counts
0 0
1 0
2 0
3 0
4 1
5 0
6 1
7 0
8 1
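
The comparison can be sketched in plain Python. As an assumption for illustration, WoE is computed as ln(share of class 0 / share of class 1) with a small smoothing constant; skorecard's internal formula and constant may differ. The hypothetical helper `woe` computes the value for one bucket at a time.

```python
import math

# Bucket labels from the passthrough output; None marks the missing row.
buckets = [0, 0, 0, 0, 1, 0, None, 0, 1]
target = [0, 0, 1, 0, 1, 0, 1, 0, 1]


def woe(labels, y, bucket, epsilon=1e-5):
    """WoE of one bucket: ln(share of class 0 / share of class 1), smoothed."""
    total_1 = sum(y)
    total_0 = len(y) - total_1
    rows = [t for label, t in zip(labels, y) if label == bucket]
    n_1 = sum(rows)
    n_0 = len(rows) - n_1
    return math.log(((n_0 + epsilon) / total_0) / ((n_1 + epsilon) / total_1))


# WoE of the bucket that would contain only the missing rows.
missing_woe = woe(buckets, target, None)

# Pick the regular bucket whose WoE is nearest to it.
similar_bucket = min(
    (0, 1), key=lambda b: abs(woe(buckets, target, b) - missing_woe)
)
print(similar_bucket)
```

The missing row and both rows of bucket 1 all belong to class 1, so their WoE values are close to each other and the missing value goes to bucket 1.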

Least risky

Missing values are put into the bucket containing the largest percentage of Class 0.

EqualFrequencyBucketer(n_bins=2, missing_treatment="least_risky").fit_transform(X, y)
counts
0 0
1 0
2 0
3 0
4 1
5 0
6 0
7 0
8 1

Most risky

Missing values are put into the bucket containing the largest percentage of Class 1.

EqualFrequencyBucketer(n_bins=2, missing_treatment="most_risky").fit_transform(X, y)
counts
0 0
1 0
2 0
3 0
4 1
5 0
6 1
7 0
8 1
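
Both the least risky and the most risky treatment come down to comparing the share of Class 1 in each bucket. The following plain Python sketch illustrates that comparison; it is not skorecard's implementation, the labels are hard-coded from the passthrough output, and `class_1_rate` is a hypothetical helper.

```python
# Bucket labels from the passthrough output; None marks the missing row.
buckets = [0, 0, 0, 0, 1, 0, None, 0, 1]
target = [0, 0, 1, 0, 1, 0, 1, 0, 1]


def class_1_rate(labels, y, bucket):
    """Fraction of rows in the given bucket whose target is 1."""
    rows = [t for label, t in zip(labels, y) if label == bucket]
    return sum(rows) / len(rows)


# Only the regular buckets are candidates; the missing row is excluded.
rates = {b: class_1_rate(buckets, target, b) for b in (0, 1)}

least_risky_bucket = min(rates, key=rates.get)  # largest share of Class 0
most_risky_bucket = max(rates, key=rates.get)   # largest share of Class 1
print(rates, least_risky_bucket, most_risky_bucket)
```

Bucket 0 has one Class 1 row out of six, while both rows in bucket 1 are Class 1, which matches the outputs shown for the two treatments.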

Last update: 2023-08-08