Missing Values¶

skorecard bucketers offer native support for missing values and will put them in a separate bucket by default.

In the example below, you can see that the single missing value is put into a new bucket '-1'.

import numpy as np
import pandas as pd
from skorecard.bucketers import EqualFrequencyBucketer

df = pd.DataFrame({"counts": [1, 2, 2, 1, 4, 2, np.nan, 1, 3]})
EqualFrequencyBucketer(n_bins=2).fit_transform(df).value_counts()

counts
 0        6
 1        2
-1        1
dtype: int64
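To make the default behaviour concrete, here is a minimal plain-Python sketch of the idea; it is not skorecard's implementation. The split point `BOUNDARY` and the helper `assign_bucket` are hypothetical, chosen only so that the counts match the output above.

```python
import math
from collections import Counter

# Hypothetical split point, picked only to reproduce the counts shown above;
# the boundary that skorecard actually learns may differ.
BOUNDARY = 2.0


def assign_bucket(value, boundary=BOUNDARY, missing_bucket=-1):
    """Return the bucket index for one value; NaN goes to its own bucket."""
    if value is None or math.isnan(value):
        return missing_bucket
    return 0 if value <= boundary else 1


counts = [1, 2, 2, 1, 4, 2, math.nan, 1, 3]
buckets = [assign_bucket(v) for v in counts]

# Tally how many observations land in each bucket.
bucket_sizes = Counter(buckets)
print(bucket_sizes)
```

The missing value is never compared against the boundary; it is detected first and routed to the reserved index, which is why it cannot end up in bucket 0 or 1 by accident.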

Specific¶

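Alternatively, you can assign missing values to a specific bucket by passing missing_treatment a dictionary that maps the column name to the bucket index.

In the example below, the missing value is put into bucket 1, which grows from 2 to 3 observations.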

EqualFrequencyBucketer(n_bins=2, missing_treatment={"counts": 1}).fit_transform(df).value_counts()

counts
0         6
1         3
dtype: int64

Passthrough¶

If you want missing values to be left untouched, set missing_treatment="passthrough". The missing value stays NaN in the output, so the bucket column is returned as floats rather than integers.

EqualFrequencyBucketer(n_bins=2, missing_treatment="passthrough").fit_transform(df)

	counts
0	0.0
1	0.0
2	0.0
3	0.0
4	1.0
5	0.0
6	NaN
7	0.0
8	1.0

Most frequent¶

It's also possible to put the missing values into the most common bucket, i.e. the bucket that holds the most non-missing observations. Below, the missing value is put into bucket '0'.

EqualFrequencyBucketer(n_bins=2, missing_treatment="most_frequent").fit_transform(df)

	counts
0	0
1	0
2	0
3	0
4	1
5	0
6	0
7	0
8	1
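The selection rule for this treatment can be sketched in plain Python. This is an illustration of the idea rather than skorecard's code; the `buckets` list is copied from the default example (with -1 marking the missing value), and the variable names are made up for this sketch.

```python
from collections import Counter

MISSING = -1  # index of the separate missing-value bucket in the default example

# Bucket indices of the nine rows before any missing-value treatment.
buckets = [0, 0, 0, 0, 1, 0, MISSING, 0, 1]

# Count only the rows that were actually observed.
observed = Counter(b for b in buckets if b != MISSING)
most_frequent_bucket = observed.most_common(1)[0][0]

# Move the missing rows into that bucket.
filled = [most_frequent_bucket if b == MISSING else b for b in buckets]
print(most_frequent_bucket, filled)
```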

Using the target to bucket¶

It's also possible to use the target to decide which bucket the missing values go into. In the examples below, we use y as the target and pass it to fit_transform alongside X.

Neutral¶

Here the missing values are placed into the bucket whose Weight of Evidence (WoE) is closest to 0.

X = pd.DataFrame({"counts": [1, 2, 2, 1, 4, 2, np.nan, 1, 3]})
y = pd.DataFrame({"target": [0, 0, 1, 0, 1, 0, 1, 0, 1]})
EqualFrequencyBucketer(n_bins=2, missing_treatment="neutral").fit_transform(X, y)

	counts
0	0
1	0
2	0
3	0
4	1
5	0
6	0
7	0
8	1

Similar¶

We can also put the missing values into the bucket whose Weight of Evidence is closest to that of the bucket containing only the missing values.

EqualFrequencyBucketer(n_bins=2, missing_treatment="similar").fit_transform(X, y)

	counts
0	0
1	0
2	0
3	0
4	1
5	0
6	1
7	0
8	1
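Both the neutral and the similar treatment compare Weight of Evidence values, so it can help to compute them by hand for this small dataset. The sketch below is plain Python and makes several assumptions that are not taken from skorecard: WoE is defined as ln(share of Class 0 / share of Class 1), a small `epsilon` is added to avoid division by zero, and the helper `woe_per_bucket` is hypothetical. The sign convention does not affect either selection rule, since both only look at distances.

```python
import math

MISSING = -1
buckets = [0, 0, 0, 0, 1, 0, MISSING, 0, 1]  # default bucketing of X
target = [0, 0, 1, 0, 1, 0, 1, 0, 1]         # y from the example above


def woe_per_bucket(buckets, target, epsilon=1e-4):
    """WoE per bucket, assuming ln(share of class 0 / share of class 1)."""
    total_0 = target.count(0)
    total_1 = target.count(1)
    woe = {}
    for bucket in sorted(set(buckets)):
        labels = [t for b, t in zip(buckets, target) if b == bucket]
        share_0 = labels.count(0) / total_0
        share_1 = labels.count(1) / total_1
        # epsilon keeps the logarithm finite when a class is absent
        woe[bucket] = math.log((share_0 + epsilon) / (share_1 + epsilon))
    return woe


woe = woe_per_bucket(buckets, target)
candidates = {b: w for b, w in woe.items() if b != MISSING}

# "neutral": the regular bucket whose WoE is closest to 0
neutral_bucket = min(candidates, key=lambda b: abs(candidates[b]))
# "similar": the regular bucket whose WoE is closest to the missing bucket's WoE
similar_bucket = min(candidates, key=lambda b: abs(candidates[b] - woe[MISSING]))
print(woe, neutral_bucket, similar_bucket)
```

Under these assumptions, bucket 0 has a moderate positive WoE, while bucket 1 and the missing bucket contain only Class 1 rows and therefore both have large negative values. That is consistent with the outputs above: neutral picks bucket 0 and similar picks bucket 1.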

Least risky¶

Missing values are put into the bucket containing the largest percentage of Class 0.

EqualFrequencyBucketer(n_bins=2, missing_treatment="least_risky").fit_transform(X, y)

	counts
0	0
1	0
2	0
3	0
4	1
5	0
6	0
7	0
8	1

Most risky¶

Missing values are put into the bucket containing the largest percentage of Class 1.

EqualFrequencyBucketer(n_bins=2, missing_treatment="most_risky").fit_transform(X, y)

	counts
0	0
1	0
2	0
3	0
4	1
5	0
6	1
7	0
8	1
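The least risky and most risky treatments both rank the regular buckets by their class mix. The following plain-Python sketch illustrates that ranking on the example data; it is not skorecard's implementation, and the dictionaries `class_0_rate` and `class_1_rate` are names introduced only for this illustration.

```python
MISSING = -1
buckets = [0, 0, 0, 0, 1, 0, MISSING, 0, 1]  # default bucketing of X
target = [0, 0, 1, 0, 1, 0, 1, 0, 1]         # y from the example above

class_0_rate = {}
class_1_rate = {}
for bucket in sorted(set(buckets) - {MISSING}):
    labels = [t for b, t in zip(buckets, target) if b == bucket]
    class_0_rate[bucket] = labels.count(0) / len(labels)
    class_1_rate[bucket] = labels.count(1) / len(labels)

# "least_risky": the bucket with the largest percentage of Class 0
least_risky_bucket = max(class_0_rate, key=class_0_rate.get)
# "most_risky": the bucket with the largest percentage of Class 1
most_risky_bucket = max(class_1_rate, key=class_1_rate.get)
print(class_0_rate, least_risky_bucket, most_risky_bucket)
```

Bucket 0 holds five Class 0 rows out of six, while bucket 1 holds only Class 1 rows, so the missing value goes to bucket 0 under least_risky and to bucket 1 under most_risky, matching the outputs above.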