Bases: BaseEstimator, TransformerMixin
Transformer that encodes unique values in features to their Weight of Evidence estimation.
This class has been deprecated in favor of category_encoders.woe.WOEEncoder
Only works for binary classification (target y has 0 and 1 values).
The weight of evidence is given by: np.log( p(1) / p(0) )
The target probability ratio is given by: p(1) / p(0)
For example, in the variable colour, if the mean of the target for the value blue is 0.8
(so p(1) = 0.8 and p(0) = 0.2), blue will be replaced by np.log(0.8 / 0.2) = 1.386
if log_ratio is selected. Alternatively, blue will be replaced by 0.8 / 0.2 = 4 if ratio is selected.
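The arithmetic in this example can be checked with a few lines of plain Python. This is only a sketch of the two numbers quoted above; the variable names are illustrative and are not part of skorecard.

```python
import math

# Share of target = 1 and target = 0 among the rows where colour == "blue",
# taken from the example in the text.
p1 = 0.8
p0 = 0.2

ratio = p1 / p0              # target probability ratio
log_ratio = math.log(ratio)  # weight of evidence (natural logarithm)

print(f"ratio = {ratio:.1f}, log ratio = {log_ratio:.3f}")
```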
More formally:
- for each unique value 𝑥, consider the corresponding rows in the training set
- compute what percentage of positives is in these rows, compared to the whole set
- compute what percentage of negatives is in these rows, compared to the whole set
- take the ratio of these percentages
- take the natural logarithm of that ratio to get the weight of evidence corresponding to 𝑥, so that 𝑊𝑂𝐸(𝑥) is either positive or negative according to whether 𝑥 is more representative of positives or negatives
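The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the library's own `woe_1d` implementation: the helper name `woe_by_category` and the toy data are hypothetical, the target is assumed to contain only 0 and 1 with both classes present, and `epsilon` is added to each share in the spirit of the constructor's `epsilon` parameter.

```python
import math
from collections import Counter


def woe_by_category(values, targets, epsilon=0.0001):
    """Hypothetical helper: weight of evidence per unique value of a feature."""
    total_pos = sum(1 for t in targets if t == 1)
    total_neg = len(targets) - total_pos  # assumes a 0/1 target

    # Count positives and negatives among the rows of each unique value.
    pos = Counter(v for v, t in zip(values, targets) if t == 1)
    neg = Counter(v for v, t in zip(values, targets) if t == 0)

    woe = {}
    for v in set(values):
        # Share of all positives (negatives) that fall in the rows of v.
        pct_pos = pos[v] / total_pos + epsilon
        pct_neg = neg[v] / total_neg + epsilon
        woe[v] = math.log(pct_pos / pct_neg)
    return woe


# Toy data: "blue" holds 4 of the 5 positives, "red" holds 4 of the 5 negatives.
colour = ["blue"] * 5 + ["red"] * 5
y = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]
woe = woe_by_category(colour, y)
```

With this toy data, `woe["blue"]` is positive and `woe["red"]` is negative, matching the sign interpretation in the last step.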
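More details:
- [blogpost on weight of evidence](https://multithreaded.stitchfix.com/blog/2015/08/13/weight-of-evidence/)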
Example:
from skorecard import datasets
from skorecard.preprocessing import WoeEncoder
X, y = datasets.load_uci_credit_card(return_X_y=True)
we = WoeEncoder(variables=['EDUCATION'])
we.fit_transform(X, y)
we.fit_transform(X, y)['EDUCATION'].value_counts()
Credits: Some inspiration taken from feature_engine.categorical_encoders.
Source code in skorecard/preprocessing/_WoEEncoder.py
class WoeEncoder(BaseEstimator, TransformerMixin):
    """
    Transformer that encodes unique values in features to their Weight of Evidence estimation.

    **This class has been deprecated in favor of category_encoders.woe.WOEEncoder**

    Only works for binary classification (target y has 0 and 1 values).

    The weight of evidence is given by: `np.log( p(1) / p(0) )`
    The target probability ratio is given by: `p(1) / p(0)`

    For example in the variable colour, if the mean of the target = 1 for blue is 0.8 and
    the mean of the target = 0 is 0.2, blue will be replaced by: np.log(0.8/0.2) = 1.386
    if log_ratio is selected. Alternatively, blue will be replaced by 0.8 / 0.2 = 4 if ratio is selected.

    More formally:

    - for each unique value 𝑥, consider the corresponding rows in the training set
    - compute what percentage of positives is in these rows, compared to the whole set
    - compute what percentage of negatives is in these rows, compared to the whole set
    - take the ratio of these percentages
    - take the natural logarithm of that ratio to get the weight of evidence corresponding to 𝑥, so that 𝑊𝑂𝐸(𝑥) is either positive or negative according to whether 𝑥 is more representative of positives or negatives

    More details:

    - [blogpost on weight of evidence](https://multithreaded.stitchfix.com/blog/2015/08/13/weight-of-evidence/)

    Example:

    ```python
    from skorecard import datasets
    from skorecard.preprocessing import WoeEncoder

    X, y = datasets.load_uci_credit_card(return_X_y=True)
    we = WoeEncoder(variables=['EDUCATION'])
    we.fit_transform(X, y)
    we.fit_transform(X, y)['EDUCATION'].value_counts()
    ```

    Credits: Some inspiration taken from [feature_engine.categorical_encoders](https://feature-engine.readthedocs.io/en/latest/encoding/index.html).
    """  # noqa

    def __init__(self, epsilon=0.0001, variables=[], handle_unknown="value"):
        """
        Constructor for WoEEncoder.

        Args:
            epsilon (float): Amount to be added to relative counts in order to avoid division by zero in the WOE
                calculation.
            variables (list): The features to bucket. Uses all features if not defined.
            handle_unknown (str): How to handle any new values encountered in X on transform().
                options are 'return_nan', 'error' and 'value', defaults to 'value', which will assume WOE=0.
        """
        self.epsilon = epsilon
        self.variables = variables
        self.handle_unknown = handle_unknown

        warnings.warn(
            "This encoder will be deprecated. Please use category_encoders.woe.WOEEncoder instead.", DeprecationWarning
        )

    def fit(self, X, y):
        """Calculate the WOE for every column.

        Args:
            X (np.array): (binned) features
            y (np.array): target
        """
        assert self.epsilon >= 0

        # Check data
        X = ensure_dataframe(X)
        assert y is not None, "WoEBucketer needs a target y"
        y = BaseBucketer._check_y(y)
        y = y.astype(float)
        if len(np.unique(y)) > 2:
            raise AssertionError("WoEBucketer is only suited for binary classification")
        self.variables_ = BaseBucketer._check_variables(X, self.variables)

        # WoE currently does not support NAs
        # This is also flagged in self._more_tags()
        # We could treat missing values as a separate bin (-1) and thus handle seamlessly.
        BaseBucketer._check_contains_na(X, self.variables_)

        # scikit-learn requires checking that X has same shape on transform
        # this is because scikit-learn is still positional based (no column names used)
        self.n_train_features_ = X.shape[1]

        self.woe_mapping_ = {}
        for var in self.variables_:
            t = woe_1d(X[var], y, epsilon=self.epsilon)
            woe_dict = t["woe"].to_dict()
            # If new categories encountered, returns WoE = 0
            if self.handle_unknown == "value":
                woe_dict = defaultdict(int, woe_dict)
            self.woe_mapping_[var] = woe_dict

        return self

    def transform(self, X):
        """Transform X to weight of evidence encoding.

        Args:
            X (pd.DataFrame): dataset
        """
        assert self.handle_unknown in ["value", "error", "return_nan"]
        check_is_fitted(self)
        X = ensure_dataframe(X)

        if X.shape[1] != self.n_train_features_:
            msg = f"Number of features in X ({X.shape[1]}) is different "
            msg += f"from the number of features in X during fit ({self.n_train_features_})"
            raise ValueError(msg)

        for feature in self.variables_:
            woe_dict = self.woe_mapping_.get(feature)
            if self.handle_unknown == "error":
                new_cats = [x for x in list(X[feature].unique()) if x not in list(woe_dict.keys())]
                if len(new_cats) > 0:
                    msg = "WoEEncoder encountered unknown new categories "
                    msg += f"in column {feature} on .transform(): {new_cats}"
                    raise AssertionError(msg)

            X[feature] = X[feature].map(woe_dict)

        return X

    def _more_tags(self):
        """
        Estimator tags are annotations of estimators that allow programmatic inspection of their capabilities.

        See https://scikit-learn.org/stable/developers/develop.html#estimator-tags
        """  # noqa
        return {"binary_only": True, "allow_nan": False}
__init__(epsilon=0.0001, variables=[], handle_unknown='value')
Constructor for WoEEncoder.
Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| epsilon | float | Amount to be added to relative counts in order to avoid division by zero in the WOE calculation. | 0.0001 |
| variables | list | The features to bucket. Uses all features if not defined. | [] |
| handle_unknown | str | How to handle any new values encountered in X on transform(). Options are 'return_nan', 'error' and 'value'; defaults to 'value', which will assume WOE=0. | 'value' |
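The three `handle_unknown` options can be illustrated with a small stand-alone sketch. It is not the class's own code path: the helper `encode`, the mapping values and the use of plain lists are assumptions made for illustration. The class itself applies the mapping with pandas `Series.map` and raises an `AssertionError` for the 'error' option, whereas this sketch raises a `ValueError`.

```python
import math
from collections import defaultdict

# Hypothetical fitted mapping for one feature (values are illustrative only).
fitted_woe = {"blue": 1.386, "red": -0.405}


def encode(values, mapping, handle_unknown="value"):
    """Hypothetical helper mirroring the three documented options."""
    if handle_unknown == "error":
        unknown = [v for v in values if v not in mapping]
        if unknown:
            raise ValueError(f"unknown categories: {unknown}")
        return [mapping[v] for v in values]
    if handle_unknown == "value":
        # defaultdict(int) returns 0 for any key that was not seen during fit.
        lookup = defaultdict(int, mapping)
        return [lookup[v] for v in values]
    if handle_unknown == "return_nan":
        # Unseen keys become NaN.
        return [mapping.get(v, math.nan) for v in values]
    raise ValueError(f"unsupported handle_unknown: {handle_unknown!r}")


new_values = ["blue", "green"]  # "green" was not seen during fit
as_value = encode(new_values, fitted_woe, "value")
as_nan = encode(new_values, fitted_woe, "return_nan")
```

Here `as_value` holds 0 for "green" and `as_nan` holds NaN for it, while calling `encode(new_values, fitted_woe, "error")` raises.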
Source code in skorecard/preprocessing/_WoEEncoder.py
def __init__(self, epsilon=0.0001, variables=[], handle_unknown="value"):
    """
    Constructor for WoEEncoder.

    Args:
        epsilon (float): Amount to be added to relative counts in order to avoid division by zero in the WOE
            calculation.
        variables (list): The features to bucket. Uses all features if not defined.
        handle_unknown (str): How to handle any new values encountered in X on transform().
            options are 'return_nan', 'error' and 'value', defaults to 'value', which will assume WOE=0.
    """
    self.epsilon = epsilon
    self.variables = variables
    self.handle_unknown = handle_unknown

    warnings.warn(
        "This encoder will be deprecated. Please use category_encoders.woe.WOEEncoder instead.", DeprecationWarning
    )
fit(X, y)
Calculate the WOE for every column.
Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| X | array | (binned) features | required |
| y | array | target | required |
Source code in skorecard/preprocessing/_WoEEncoder.py
def fit(self, X, y):
    """Calculate the WOE for every column.

    Args:
        X (np.array): (binned) features
        y (np.array): target
    """
    assert self.epsilon >= 0

    # Check data
    X = ensure_dataframe(X)
    assert y is not None, "WoEBucketer needs a target y"
    y = BaseBucketer._check_y(y)
    y = y.astype(float)
    if len(np.unique(y)) > 2:
        raise AssertionError("WoEBucketer is only suited for binary classification")
    self.variables_ = BaseBucketer._check_variables(X, self.variables)

    # WoE currently does not support NAs
    # This is also flagged in self._more_tags()
    # We could treat missing values as a separate bin (-1) and thus handle seamlessly.
    BaseBucketer._check_contains_na(X, self.variables_)

    # scikit-learn requires checking that X has same shape on transform
    # this is because scikit-learn is still positional based (no column names used)
    self.n_train_features_ = X.shape[1]

    self.woe_mapping_ = {}
    for var in self.variables_:
        t = woe_1d(X[var], y, epsilon=self.epsilon)
        woe_dict = t["woe"].to_dict()
        # If new categories encountered, returns WoE = 0
        if self.handle_unknown == "value":
            woe_dict = defaultdict(int, woe_dict)
        self.woe_mapping_[var] = woe_dict

    return self
transform(X)
Transform X to weight of evidence encoding.
Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| X | DataFrame | dataset | required |
Source code in skorecard/preprocessing/_WoEEncoder.py
def transform(self, X):
    """Transform X to weight of evidence encoding.

    Args:
        X (pd.DataFrame): dataset
    """
    assert self.handle_unknown in ["value", "error", "return_nan"]
    check_is_fitted(self)
    X = ensure_dataframe(X)

    if X.shape[1] != self.n_train_features_:
        msg = f"Number of features in X ({X.shape[1]}) is different "
        msg += f"from the number of features in X during fit ({self.n_train_features_})"
        raise ValueError(msg)

    for feature in self.variables_:
        woe_dict = self.woe_mapping_.get(feature)
        if self.handle_unknown == "error":
            new_cats = [x for x in list(X[feature].unique()) if x not in list(woe_dict.keys())]
            if len(new_cats) > 0:
                msg = "WoEEncoder encountered unknown new categories "
                msg += f"in column {feature} on .transform(): {new_cats}"
                raise AssertionError(msg)

        X[feature] = X[feature].map(woe_dict)

    return X