
WoeEncoder

Bases: BaseEstimator, TransformerMixin

Transformer that encodes unique values in features to their Weight of Evidence estimation.

This class has been deprecated in favor of category_encoders.woe.WOEEncoder

Only works for binary classification (target y has 0 and 1 values).

The weight of evidence is given by: np.log( p(1) / p(0) )
The target probability ratio is given by: p(1) / p(0)

For example, in the variable colour, if for the category blue the proportion of target = 1 is 0.8 and the proportion of target = 0 is 0.2, blue will be replaced by np.log(0.8/0.2) = 1.386 if log_ratio is selected. Alternatively, blue will be replaced by 0.8 / 0.2 = 4 if ratio is selected.

More formally:

  • for each unique value 𝑥, consider the corresponding rows in the training set
  • compute what percentage of positives is in these rows, compared to the whole set
  • compute what percentage of negatives is in these rows, compared to the whole set
  • take the ratio of these percentages
  • take the natural logarithm of that ratio to get the weight of evidence corresponding to 𝑥, so that 𝑊𝑂𝐸(𝑥) is either positive or negative according to whether 𝑥 is more representative of positives or negatives
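
A minimal sketch of these steps in plain pandas/numpy (this mirrors the description above; it is not the library's internal `woe_1d` helper, and the colour/target data are made up):

```python
import numpy as np
import pandas as pd

def woe_per_value(x: pd.Series, y: pd.Series, epsilon: float = 0.0001) -> pd.Series:
    # share of all positives and share of all negatives falling in each unique value of x
    pos_share = y.groupby(x).sum() / y.sum()
    neg_share = (1 - y).groupby(x).sum() / (1 - y).sum()
    # epsilon avoids division by zero, as with the encoder's `epsilon` argument
    return np.log((pos_share + epsilon) / (neg_share + epsilon))

colour = pd.Series(["blue", "blue", "red", "red", "red"])
target = pd.Series([1, 1, 1, 0, 0])
woe_per_value(colour, target)
# blue gets a positive WoE (over-represents positives), red a negative one
```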

More details:

  • [blogpost on weight of evidence](https://multithreaded.stitchfix.com/blog/2015/08/13/weight-of-evidence/)

Example:

from skorecard import datasets
from skorecard.preprocessing import WoeEncoder

X, y = datasets.load_uci_credit_card(return_X_y=True)
we = WoeEncoder(variables=['EDUCATION'])
we.fit_transform(X, y)
we.fit_transform(X, y)['EDUCATION'].value_counts()

Credits: Some inspiration taken from [feature_engine.categorical_encoders](https://feature-engine.readthedocs.io/en/latest/encoding/index.html).
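
Because WoeEncoder follows the scikit-learn estimator API, it also fits into a regular Pipeline. A minimal sketch, assuming a downstream LogisticRegression (not part of the example above):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

from skorecard import datasets
from skorecard.preprocessing import WoeEncoder

X, y = datasets.load_uci_credit_card(return_X_y=True)

pipe = make_pipeline(
    WoeEncoder(variables=["EDUCATION"]),  # WoE-encode the categorical column
    LogisticRegression(max_iter=1000),    # any downstream classifier works here
)
pipe.fit(X, y)
```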

Source code in skorecard/preprocessing/_WoEEncoder.py
class WoeEncoder(BaseEstimator, TransformerMixin):
    """
    Transformer that encodes unique values in features to their Weight of Evidence estimation.

    **This class has been deprecated in favor of category_encoders.woe.WOEEncoder**

    Only works for binary classification (target y has 0 and 1 values).

    The weight of evidence is given by: `np.log( p(1) / p(0) )`
    The target probability ratio is given by: `p(1) / p(0)`

    For example in the variable colour, if the mean of the target = 1 for blue is 0.8 and
    the mean of the target = 0 is 0.2, blue will be replaced by: np.log(0.8/0.2) = 1.386
    if log_ratio is selected. Alternatively, blue will be replaced by 0.8 / 0.2 = 4 if ratio is selected.

    More formally:

    - for each unique value 𝑥,  consider the corresponding rows in the training set
    - compute what percentage of positives is in these rows, compared to the whole set
    - compute what percentage of negatives is in these rows, compared to the whole set
    - take the ratio of these percentages
    - take the natural logarithm of that ratio to get the weight of evidence corresponding to  𝑥,  so that  𝑊𝑂𝐸(𝑥)  is either positive or negative according to whether  𝑥  is more representative of positives or negatives

    More details:

    - [blogpost on weight of evidence](https://multithreaded.stitchfix.com/blog/2015/08/13/weight-of-evidence/)

    Example:

    ```python
    from skorecard import datasets
    from skorecard.preprocessing import WoeEncoder

    X, y = datasets.load_uci_credit_card(return_X_y=True)
    we = WoeEncoder(variables=['EDUCATION'])
    we.fit_transform(X, y)
    we.fit_transform(X, y)['EDUCATION'].value_counts()
    ```

    Credits: Some inspiration taken from [feature_engine.categorical_encoders](https://feature-engine.readthedocs.io/en/latest/encoding/index.html).
    """  # noqa

    def __init__(self, epsilon=0.0001, variables=[], handle_unknown="value"):
        """
        Constructor for WoEEncoder.

        Args:
            epsilon (float): Amount to be added to relative counts in order to avoid division by zero in the WOE
                calculation.
            variables (list): The features to bucket. Uses all features if not defined.
            handle_unknown (str): How to handle any new values encountered in X on transform().
                options are 'return_nan', 'error' and 'value', defaults to 'value', which will assume WOE=0.
        """
        self.epsilon = epsilon
        self.variables = variables
        self.handle_unknown = handle_unknown

        warnings.warn(
            "This encoder will be deprecated. Please use category_encoders.woe.WOEEncoder instead.", DeprecationWarning
        )

    def fit(self, X, y):
        """Calculate the WOE for every column.

        Args:
            X (np.array): (binned) features
            y (np.array): target
        """
        assert self.epsilon >= 0
        # Check data
        X = ensure_dataframe(X)
        assert y is not None, "WoEBucketer needs a target y"
        y = BaseBucketer._check_y(y)

        y = y.astype(float)
        if len(np.unique(y)) > 2:
            raise AssertionError("WoEBucketer is only suited for binary classification")
        self.variables_ = BaseBucketer._check_variables(X, self.variables)

        # WoE currently does not support NAs
        # This is also flagged in self._more_tags()
        # We could treat missing values as a separate bin (-1) and thus handle seamlessly.
        BaseBucketer._check_contains_na(X, self.variables_)

        # scikit-learn requires checking that X has same shape on transform
        # this is because scikit-learn is still positional based (no column names used)
        self.n_train_features_ = X.shape[1]

        self.woe_mapping_ = {}
        for var in self.variables_:
            t = woe_1d(X[var], y, epsilon=self.epsilon)

            woe_dict = t["woe"].to_dict()
            # If new categories encountered, returns WoE = 0
            if self.handle_unknown == "value":
                woe_dict = defaultdict(int, woe_dict)

            self.woe_mapping_[var] = woe_dict

        return self

    def transform(self, X):
        """Transform X to weight of evidence encoding.

        Args:
            X (pd.DataFrame): dataset
        """
        assert self.handle_unknown in ["value", "error", "return_nan"]
        check_is_fitted(self)
        X = ensure_dataframe(X)

        if X.shape[1] != self.n_train_features_:
            msg = f"Number of features in X ({X.shape[1]}) is different "
            msg += f"from the number of features in X during fit ({self.n_train_features_})"
            raise ValueError(msg)

        for feature in self.variables_:
            woe_dict = self.woe_mapping_.get(feature)
            if self.handle_unknown == "error":
                new_cats = [x for x in list(X[feature].unique()) if x not in list(woe_dict.keys())]
                if len(new_cats) > 0:
                    msg = "WoEEncoder encountered unknown new categories "
                    msg += f"in column {feature} on .transform(): {new_cats}"
                    raise AssertionError(msg)

            X[feature] = X[feature].map(woe_dict)

        return X

    def _more_tags(self):
        """
        Estimator tags are annotations of estimators that allow programmatic inspection of their capabilities.

        See https://scikit-learn.org/stable/developers/develop.html#estimator-tags
        """  # noqa
        return {"binary_only": True, "allow_nan": False}

__init__(epsilon=0.0001, variables=[], handle_unknown='value')

Constructor for WoEEncoder.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `epsilon` | `float` | Amount to be added to relative counts in order to avoid division by zero in the WOE calculation. | `0.0001` |
| `variables` | `list` | The features to bucket. Uses all features if not defined. | `[]` |
| `handle_unknown` | `str` | How to handle any new values encountered in X on transform(). Options are 'return_nan', 'error' and 'value'; defaults to 'value', which assumes WOE=0. | `'value'` |
Source code in skorecard/preprocessing/_WoEEncoder.py
def __init__(self, epsilon=0.0001, variables=[], handle_unknown="value"):
    """
    Constructor for WoEEncoder.

    Args:
        epsilon (float): Amount to be added to relative counts in order to avoid division by zero in the WOE
            calculation.
        variables (list): The features to bucket. Uses all features if not defined.
        handle_unknown (str): How to handle any new values encountered in X on transform().
            options are 'return_nan', 'error' and 'value', defaults to 'value', which will assume WOE=0.
    """
    self.epsilon = epsilon
    self.variables = variables
    self.handle_unknown = handle_unknown

    warnings.warn(
        "This encoder will be deprecated. Please use category_encoders.woe.WOEEncoder instead.", DeprecationWarning
    )
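
A short sketch of how the handle_unknown options behave on transform(), reusing the dataset from the class example and an artificial category value (99) that is assumed not to occur during fit:

```python
from skorecard import datasets
from skorecard.preprocessing import WoeEncoder

X, y = datasets.load_uci_credit_card(return_X_y=True)

# pretend one row carries a category that was never seen during fit
X_new = X.copy()
X_new.loc[X_new.index[0], "EDUCATION"] = 99  # hypothetical unseen value

# 'value' (default): the unseen category is encoded as WOE = 0
WoeEncoder(variables=["EDUCATION"], handle_unknown="value").fit(X, y).transform(X_new)

# 'return_nan': the unseen category is mapped to NaN
WoeEncoder(variables=["EDUCATION"], handle_unknown="return_nan").fit(X, y).transform(X_new)

# 'error': transform() raises an AssertionError when it meets an unseen category
# WoeEncoder(variables=["EDUCATION"], handle_unknown="error").fit(X, y).transform(X_new)
```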

fit(X, y)

Calculate the WOE for every column.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `X` | `np.array` | (binned) features | required |
| `y` | `np.array` | target | required |
Source code in skorecard/preprocessing/_WoEEncoder.py
def fit(self, X, y):
    """Calculate the WOE for every column.

    Args:
        X (np.array): (binned) features
        y (np.array): target
    """
    assert self.epsilon >= 0
    # Check data
    X = ensure_dataframe(X)
    assert y is not None, "WoEBucketer needs a target y"
    y = BaseBucketer._check_y(y)

    y = y.astype(float)
    if len(np.unique(y)) > 2:
        raise AssertionError("WoEBucketer is only suited for binary classification")
    self.variables_ = BaseBucketer._check_variables(X, self.variables)

    # WoE currently does not support NAs
    # This is also flagged in self._more_tags()
    # We could treat missing values as a separate bin (-1) and thus handle seamlessly.
    BaseBucketer._check_contains_na(X, self.variables_)

    # scikit-learn requires checking that X has same shape on transform
    # this is because scikit-learn is still positional based (no column names used)
    self.n_train_features_ = X.shape[1]

    self.woe_mapping_ = {}
    for var in self.variables_:
        t = woe_1d(X[var], y, epsilon=self.epsilon)

        woe_dict = t["woe"].to_dict()
        # If new categories encountered, returns WoE = 0
        if self.handle_unknown == "value":
            woe_dict = defaultdict(int, woe_dict)

        self.woe_mapping_[var] = woe_dict

    return self
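
After fitting, the learned encodings are stored in the `woe_mapping_` attribute as a `{column: {value: WoE}}` dict, which is convenient for inspection. A minimal sketch, reusing the class example:

```python
from skorecard import datasets
from skorecard.preprocessing import WoeEncoder

X, y = datasets.load_uci_credit_card(return_X_y=True)

we = WoeEncoder(variables=["EDUCATION"]).fit(X, y)
we.woe_mapping_["EDUCATION"]  # maps each EDUCATION value to its learned weight of evidence
```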

transform(X)

Transform X to weight of evidence encoding.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `X` | `pd.DataFrame` | dataset | required |
Source code in skorecard/preprocessing/_WoEEncoder.py
def transform(self, X):
    """Transform X to weight of evidence encoding.

    Args:
        X (pd.DataFrame): dataset
    """
    assert self.handle_unknown in ["value", "error", "return_nan"]
    check_is_fitted(self)
    X = ensure_dataframe(X)

    if X.shape[1] != self.n_train_features_:
        msg = f"Number of features in X ({X.shape[1]}) is different "
        msg += f"from the number of features in X during fit ({self.n_train_features_})"
        raise ValueError(msg)

    for feature in self.variables_:
        woe_dict = self.woe_mapping_.get(feature)
        if self.handle_unknown == "error":
            new_cats = [x for x in list(X[feature].unique()) if x not in list(woe_dict.keys())]
            if len(new_cats) > 0:
                msg = "WoEEncoder encountered unknown new categories "
                msg += f"in column {feature} on .transform(): {new_cats}"
                raise AssertionError(msg)

        X[feature] = X[feature].map(woe_dict)

    return X
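
Note that transform() expects the same number of columns as seen during fit(); passing a different set of columns raises a ValueError. A small sketch of that check, under the same dataset as above:

```python
from skorecard import datasets
from skorecard.preprocessing import WoeEncoder

X, y = datasets.load_uci_credit_card(return_X_y=True)
we = WoeEncoder(variables=["EDUCATION"]).fit(X, y)

we.transform(X)                   # OK: same columns as during fit
# we.transform(X[["EDUCATION"]])  # raises ValueError: number of features differs from fit
```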

Last update: 2023-08-08