BucketingProcess
Bases: BaseEstimator, TransformerMixin, BucketTableMethod, PlotBucketMethod, PlotPreBucketMethod, SummaryMethod
A two-step bucketing pipeline allowing for pre-bucketing before bucketing.
Often you want to pre-bucket features (e.g. to 100 buckets) before bucketing them down to a smaller set. This brings some additional challenges around propagating specials and defining a bucketer that is able to go from raw data to the final buckets. This class facilitates the process and also provides all regular methods and attributes:

- .summary(): See which columns are bucketed.
- .plot_bucket(): Plot the buckets of a column.
- .bucket_table(): Table with the buckets of a column.
- .save_yml(): Save the information necessary for bucketing to a YAML file.
- .features_bucket_mapping_: Access the bucketing information.
Example:
from skorecard import datasets
from skorecard.bucketers import DecisionTreeBucketer, OptimalBucketer, AsIsCategoricalBucketer
from skorecard.pipeline import BucketingProcess
from sklearn.pipeline import make_pipeline
df = datasets.load_uci_credit_card(as_frame=True)
y = df["default"]
X = df.drop(columns=["default"])
num_cols = ["LIMIT_BAL", "BILL_AMT1"]
cat_cols = ["EDUCATION", "MARRIAGE"]
bucketing_process = BucketingProcess(
specials={'LIMIT_BAL': {'=400000.0' : [400000.0]}},
prebucketing_pipeline=make_pipeline(
DecisionTreeBucketer(variables=num_cols, max_n_bins=100, min_bin_size=0.05),
AsIsCategoricalBucketer(variables=cat_cols),
),
bucketing_pipeline=make_pipeline(
OptimalBucketer(variables=num_cols, max_n_bins=10, min_bin_size=0.05),
OptimalBucketer(variables=cat_cols, variables_type='categorical', max_n_bins=10, min_bin_size=0.05),
)
)
bucketing_process.fit(X, y)
# Details
bucketing_process.summary() # all vars, and # buckets
bucketing_process.bucket_table("LIMIT_BAL")
bucketing_process.plot_bucket("LIMIT_BAL")
bucketing_process.prebucket_table("LIMIT_BAL")
bucketing_process.plot_prebucket("LIMIT_BAL")
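Since BucketingProcess is a scikit-learn transformer, it can also be dropped into a larger modelling pipeline. Below is a minimal sketch, assuming the bucket indices are one-hot encoded before a plain scikit-learn LogisticRegression (neither step is part of the example above):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Bucket indices are treated as categorical codes: one-hot encode them
# and fit a logistic regression on top of the bucketed features.
model = make_pipeline(
    bucketing_process,
    OneHotEncoder(handle_unknown="ignore"),
    LogisticRegression(max_iter=1000),
)
model.fit(X, y)
print(model.score(X, y))  # in-sample accuracy, for illustration only
```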
features_bucket_mapping_
property

Returns a FeaturesBucketMapping instance.

In normal bucketers, you can access .features_bucket_mapping_ to retrieve a FeaturesBucketMapping instance. This contains all the info you need to transform values into their buckets.

In this class we have a two-step bucketing process: first pre-bucketing, and then bucketing of the pre-buckets. In order to still be able to use BucketingProcess as if it were a normal bucketer, both mappings are merged into one.
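As a small illustration (continuing the example above), the merged mapping can be inspected after fitting; the exact printed representation depends on the skorecard version:

```python
# The property merges the pre-bucketing and bucketing steps into a single
# FeaturesBucketMapping, so it can be used like the mapping of a normal bucketer.
mapping = bucketing_process.features_bucket_mapping_
print(mapping)
```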
name
property

To be able to identify the BucketingProcess in a pipeline.
__init__(prebucketing_pipeline=make_pipeline(DecisionTreeBucketer(max_n_bins=50, min_bin_size=0.02)), bucketing_pipeline=make_pipeline(OptimalBucketer(max_n_bins=6, min_bin_size=0.05)), variables=[], specials={}, random_state=None, remainder='passthrough')
Define a BucketingProcess to first prebucket and then bucket multiple columns in one go.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
prebucketing_pipeline | Pipeline | The scikit-learn pipeline that does pre-bucketing. Defaults to an all-numeric DecisionTreeBucketer pipeline. | make_pipeline(DecisionTreeBucketer(max_n_bins=50, min_bin_size=0.02)) |
bucketing_pipeline | Pipeline | The scikit-learn pipeline that does bucketing. Defaults to an all-numeric OptimalBucketer pipeline. Must transform the same features as the prebucketing pipeline. | make_pipeline(OptimalBucketer(max_n_bins=6, min_bin_size=0.05)) |
variables | list | The features to bucket. Uses all features if not defined. | [] |
specials | Dict | (Nested) dictionary of special values that require their own bucket, in the format {"<column name>": {"<bucket label>": [<value(s)>]}}. Merged with any specials defined in the bucketers of a (pre)bucketing pipeline, overwriting shared keys. | {} |
remainder | str | How the non-specified columns should be treated. Must be in ["passthrough", "drop"]. "passthrough" (default): all columns not listed in variables are passed through. "drop": all remaining columns not listed in variables are dropped. | 'passthrough' |
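For illustration, a sketch of constructing a BucketingProcess that relies on the default pipelines; the column names and the special value are taken from the example above and are otherwise arbitrary:

```python
from skorecard.pipeline import BucketingProcess

# Default pre-bucketing (DecisionTreeBucketer, max 50 bins) and bucketing
# (OptimalBucketer, max 6 bins) pipelines are used when none are passed.
bucketing_process = BucketingProcess(
    variables=["LIMIT_BAL", "BILL_AMT1"],               # bucket only these columns
    specials={"LIMIT_BAL": {"=400000.0": [400000.0]}},  # own bucket for this value
    remainder="passthrough",                            # keep the other columns as-is
)
```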
fit(X, y=None)

Fit the prebucketing and bucketing pipelines with X, y.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
X | DataFrame | Data to fit on. | required |
y | array | Target. Defaults to None. | None |
fit_interactive(X, y=None, mode='external', **server_kwargs)

Fit a bucketer and then interactively edit the fit using a dash app.

Note we are using a JupyterDash app, which supports 3 different modes:

- 'external' (default): Start the dash server and print the URL
- 'inline': Start the dash app inside an iframe in the jupyter notebook
- 'jupyterlab': Start the dash app as a new tab inside jupyterlab
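A minimal usage sketch, assuming you are working inside a Jupyter notebook where the 'inline' mode renders the app in the output cell:

```python
# Fit the (pre)bucketing pipelines, then open the dash app to manually
# inspect and adjust the bucket boundaries.
bucketing_process.fit_interactive(X, y, mode="inline")
```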
prebucket_table(column)

Generates the statistics for the pre-buckets of a particular column.

An example is seen below:
pre-bucket | label | Count | Count (%) | Non-event | Event | Event Rate | WoE | IV | bucket |
---|---|---|---|---|---|---|---|---|---|
0 | (-inf, 1.0) | 479 | 7.98 | 300 | 179 | 37.37 | 0.73 | 0.05 | 0 |
1 | [1.0, 2.0) | 370 | 6.17 | 233 | 137 | 37.03 | 0.71 | 0.04 | 0 |
Parameters:
Name | Type | Description | Default |
---|---|---|---|
column | str | The column we wish to analyse | required |
Returns:
Name | Type | Description |
---|---|---|
df | DataFrame | A pandas dataframe of the format above |
save_yml(fout)

Save the features bucket mapping to a YAML file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
fout | PathLike | Path for the output file | required |
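A short sketch; the file name is illustrative:

```python
# Persist the fitted bucketing information so it can be versioned or reloaded later.
bucketing_process.save_yml("bucketing_process.yml")
```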
transform(X)

Transform X through the prebucketing and bucketing pipelines.
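Continuing the example above, a minimal sketch; the output is assumed to be a pandas DataFrame of bucket indices, as with other skorecard bucketers:

```python
# Each value passes through the pre-bucketing and bucketing pipelines,
# so the transformed columns contain final bucket indices, not raw values.
X_bucketed = bucketing_process.transform(X)
print(X_bucketed.head())
```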