OptimalBucketer

Bases: BaseBucketer

The OptimalBucketer transformer uses the optbinning package to find optimal buckets.

Support: numerical ✓ · categorical ✓ · supervised ✓

This bucketer wraps optbinning.OptimalBinning to be consistent with skorecard. It requires a feature to be pre-bucketed to at most 100 unique values. Optbinning uses a constrained-programming solver to merge buckets, subject to the following constraints: 1) monotonicity in bad rate, and 2) at least 5% of records per bin.

Example:

```python
from skorecard import datasets
from skorecard.bucketers import OptimalBucketer

X, y = datasets.load_uci_credit_card(return_X_y=True)
bucketer = OptimalBucketer(variables=['LIMIT_BAL'])
bucketer.fit_transform(X, y)
```
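Because OptimalBucketer requires numerical features to be pre-bucketed to at most 100 unique values, a continuous column usually needs a prebinning step first (skorecard's own DecisionTreeBucketer is the recommended tool). The idea can be sketched with plain numpy quantile binning; the bin count and synthetic column below are illustrative, not part of the skorecard API:

```python
import numpy as np

def quantile_prebucket(values, n_bins=50):
    """Reduce a continuous feature to at most n_bins unique values
    by mapping each observation to a quantile-bin index."""
    edges = np.unique(np.quantile(values, np.linspace(0, 1, n_bins + 1)))
    # digitize against the interior edges assigns a bin index per value;
    # clip keeps the maximum observation inside the last bin
    return np.clip(np.digitize(values, edges[1:-1]), 0, len(edges) - 2)

rng = np.random.default_rng(0)
raw = rng.lognormal(mean=10, sigma=1, size=10_000)  # a LIMIT_BAL-like column
prebucketed = quantile_prebucket(raw, n_bins=50)

# Now the feature satisfies OptimalBucketer's <= 100 unique values requirement
assert len(np.unique(prebucketed)) <= 100
```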
Source code in skorecard/bucketers/bucketers.py
class OptimalBucketer(BaseBucketer):
    """
    The `OptimalBucketer` transformer uses the [optbinning](http://gnpalencia.org/optbinning) package to find optimal buckets.

    Support: ![badge](https://img.shields.io/badge/numerical-true-green) ![badge](https://img.shields.io/badge/categorical-true-green) ![badge](https://img.shields.io/badge/supervised-true-green)

    This bucketer basically wraps optbinning.OptimalBinning to be consistent with skorecard.
    Requires a feature to be pre-bucketed to max 100 buckets.
    Optbinning uses a constrained programming solver to merge buckets,
    taking into account the following constraints 1) monotonicity in bad rate, 2) at least 5% of records per bin.

    Example:

    ```python
    from skorecard import datasets
    from skorecard.bucketers import OptimalBucketer

    X, y = datasets.load_uci_credit_card(return_X_y=True)
    bucketer = OptimalBucketer(variables = ['LIMIT_BAL'])
    bucketer.fit_transform(X, y)
    ```
    """  # noqa

    def __init__(
        self,
        variables=[],
        specials={},
        variables_type="numerical",
        max_n_bins=10,
        missing_treatment="separate",
        min_bin_size=0.05,
        cat_cutoff=None,
        time_limit=25,
        remainder="passthrough",
        get_statistics=True,
        solver="cp",
        monotonic_trend="auto_asc_desc",
        gamma=0,
        ob_kwargs={},
    ) -> None:
        """Initialize Optimal Bucketer.

        Args:
            variables: List of variables to bucket.
            specials: (nested) dictionary of special values that require their own binning.
                The dictionary has the following format:
                 {"<column name>" : {"name of special bucket" : <list with 1 or more values>}}
                For every feature that needs a special value, a dictionary must be passed as value.
                This dictionary contains a name of a bucket (key) and an array of unique values that should be put
                in that bucket.
                When special values are passed, they are not considered in the fitting procedure.
            variables_type: Passed to [optbinning.OptimalBinning](http://gnpalencia.org/optbinning/binning_binary.html):
                Type of the variables. Must be either 'categorical' or 'numerical'.
            missing_treatment: Defines how we treat the missing values present in the data.
                If a string, it must be one of the following options:
                    separate: Missing values get put in a separate 'Other' bucket: `-1`
                    most_risky: Missing values are put into the bucket containing the largest percentage of Class 1.
                    least_risky: Missing values are put into the bucket containing the largest percentage of Class 0.
                    most_frequent: Missing values are put into the most common bucket.
                    neutral: Missing values are put into the bucket with WoE closest to 0.
                    similar: Missing values are put into the bucket with WoE closest to the bucket with only missing values.
                    passthrough: Leaves missing values untouched.
                If a dict, it must be of the following format:
                    {"<column name>": <bucket_number>}
                    This bucket number is where we will put the missing values.
            min_bin_size: Passed to [optbinning.OptimalBinning](http://gnpalencia.org/optbinning/binning_binary.html):
                Minimum fraction of observations in a bucket.
            max_n_bins: Passed to [optbinning.OptimalBinning](http://gnpalencia.org/optbinning/binning_binary.html):
                Maximum numbers of bins to return.
            cat_cutoff: Passed to [optbinning.OptimalBinning](http://gnpalencia.org/optbinning/binning_binary.html):
                Threshold ratio (None, or >0 and <=1) below which categories are grouped
                together in a bucket 'other'.
            time_limit (float): Passed to [optbinning.OptimalBinning](http://gnpalencia.org/optbinning/binning_binary.html):
                Time limit in seconds to find an optimal solution.
            remainder: How we want the non-specified columns to be transformed. It must be in ["passthrough", "drop"].
                passthrough (Default): all columns that were not specified in "variables" will be passed through.
                drop: all remaining columns that were not specified in "variables" will be dropped.
            solver (str): Passed to [optbinning.OptimalBinning](http://gnpalencia.org/optbinning/binning_binary.html): The optimizer to solve the optimal binning problem.
                Supported solvers are “mip” to choose a mixed-integer programming solver, “cp” (default) to choose a constrained programming solver or “ls” to choose LocalSolver.
            monotonic_trend (str): Passed to [optbinning.OptimalBinning](http://gnpalencia.org/optbinning/binning_binary.html):
                The event rate monotonic trend. Supported trends are “auto”, “auto_heuristic” and “auto_asc_desc”
                to automatically determine the trend maximizing IV using a machine learning classifier,
                “ascending”, “descending”, “concave”, “convex”, “peak” and “peak_heuristic” to allow a peak change point,
                and “valley” and “valley_heuristic” to allow a valley change point.
                Trends “auto_heuristic”, “peak_heuristic” and “valley_heuristic” use a heuristic to determine the change point,
                and are significantly faster for large size instances (max_n_prebins > 20).
                Trend “auto_asc_desc” is used to automatically select the best monotonic trend
                between “ascending” and “descending”. If None, then the monotonic constraint is disabled.
            gamma (float): Passed to [optbinning.OptimalBinning](http://gnpalencia.org/optbinning/binning_binary.html):
                Regularization strength to reduce the number of dominating bins.
                Larger values specify stronger regularization. Default is 0.
                Option supported by solvers “cp” and “mip”.
            ob_kwargs (dict): Other parameters passed to [optbinning.OptimalBinning](http://gnpalencia.org/optbinning/binning_binary.html).
        """  # noqa
        self.variables = variables
        self.specials = specials
        self.variables_type = variables_type
        self.max_n_bins = max_n_bins
        self.missing_treatment = missing_treatment
        self.min_bin_size = min_bin_size
        self.cat_cutoff = cat_cutoff
        self.time_limit = time_limit
        self.remainder = remainder
        self.get_statistics = get_statistics
        self.solver = solver
        self.monotonic_trend = monotonic_trend
        self.gamma = gamma
        self.ob_kwargs = ob_kwargs

        check_args(ob_kwargs, OptimalBinning)

    def _get_feature_splits(self, feature, X, y, X_unfiltered=None):
        """
        Finds the splits for a single feature.

        X and y have already been preprocessed, and have specials removed.

        Args:
            feature (str): Name of the feature.
            X (pd.Series): df with single column of feature to bucket
            y (np.ndarray): array with target
            X_unfiltered (pd.Series): df with single column of feature to bucket before any filtering was applied

        Returns:
            splits, right (tuple): The splits (dict or array), and whether right=True or False.
        """
        # Normally Optbinning uses a DecisionTreeBucketer to do automatic prebinning
        # We require the user to pre-bucket explicitly before using this.
        if self.variables_type == "numerical":
            uniq_values = np.sort(np.unique(X.values))
            if len(uniq_values) > 100:
                raise NotPreBucketedError(
                    f"""
                    OptimalBucketer requires numerical feature '{feature}' to be pre-bucketed
                    to max 100 unique values (for performance reasons).
                    Currently there are {len(uniq_values)} unique values present.

                    Apply pre-binning, f.e. with skorecard.bucketers.DecisionTreeBucketer.
                    """
                )
            user_splits = uniq_values
        else:
            user_splits = None

        # Fit estimator
        binner = OptimalBinning(
            name=str(feature),
            dtype=self.variables_type,
            solver=self.solver,
            monotonic_trend=self.monotonic_trend,
            gamma=self.gamma,
            # On user_splits:
            # We want skorecard users to explicitly define pre-binning for numerical features
            # Setting the user_splits prevents OptimalBinning from doing pre-binning again.
            user_splits=user_splits,
            min_bin_size=self.min_bin_size,
            max_n_bins=self.max_n_bins,
            cat_cutoff=self.cat_cutoff,
            time_limit=self.time_limit,
            **self.ob_kwargs,
        )
        binner.fit(X.values, y)

        # Extract fitted boundaries
        if self.variables_type == "categorical":
            splits = {}
            for bucket_nr, values in enumerate(binner.splits):
                for value in values:
                    splits[value] = bucket_nr
        else:
            splits = binner.splits

        # Note that optbinning transform uses right=False
        # https://github.com/guillermo-navas-palencia/optbinning/blob/396b9bed97581094167c9eb4744c2fd1fb5c7408/optbinning/binning/transformations.py#L126-L132
        return (splits, False)
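For categorical features, `_get_feature_splits` flattens optbinning's splits (a sequence of value groups, one group per bucket) into a value-to-bucket mapping. A standalone sketch of that loop, using made-up category values:

```python
# optbinning returns categorical splits as one group of raw values
# per bucket, e.g. two buckets covering four categories:
binner_splits = [["A", "B"], ["C", "D"]]

# The same enumerate loop as in _get_feature_splits:
splits = {}
for bucket_nr, values in enumerate(binner_splits):
    for value in values:
        splits[value] = bucket_nr

assert splits == {"A": 0, "B": 0, "C": 1, "D": 1}
```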

__init__(variables=[], specials={}, variables_type='numerical', max_n_bins=10, missing_treatment='separate', min_bin_size=0.05, cat_cutoff=None, time_limit=25, remainder='passthrough', get_statistics=True, solver='cp', monotonic_trend='auto_asc_desc', gamma=0, ob_kwargs={})

Initialize Optimal Bucketer.

Parameters:

- variables (default: `[]`): List of variables to bucket.
- specials (default: `{}`): (nested) dictionary of special values that require their own binning. The dictionary has the following format: `{"<column name>": {"name of special bucket": <list with 1 or more values>}}`. For every feature that needs a special value, a dictionary must be passed as value. This dictionary contains a name of a bucket (key) and an array of unique values that should be put in that bucket. When special values are passed, they are not considered in the fitting procedure.
- variables_type (default: `'numerical'`): Passed to optbinning.OptimalBinning: Type of the variables. Must be either 'categorical' or 'numerical'.
- missing_treatment (default: `'separate'`): Defines how we treat the missing values present in the data. If a string, it must be one of the following options: separate (missing values get put in a separate 'Other' bucket: `-1`), most_risky (put into the bucket containing the largest percentage of Class 1), least_risky (put into the bucket containing the largest percentage of Class 0), most_frequent (put into the most common bucket), neutral (put into the bucket with WoE closest to 0), similar (put into the bucket with WoE closest to the bucket with only missing values), passthrough (leaves missing values untouched). If a dict, it must be of the format `{"<column name>": <bucket_number>}`; this bucket number is where we will put the missing values.
- min_bin_size (default: `0.05`): Passed to optbinning.OptimalBinning: Minimum fraction of observations in a bucket.
- max_n_bins (default: `10`): Passed to optbinning.OptimalBinning: Maximum number of bins to return.
- cat_cutoff (default: `None`): Passed to optbinning.OptimalBinning: Threshold ratio (None, or >0 and <=1) below which categories are grouped together in a bucket 'other'.
- time_limit (float, default: `25`): Passed to optbinning.OptimalBinning: Time limit in seconds to find an optimal solution.
- remainder (default: `'passthrough'`): How we want the non-specified columns to be transformed. It must be in ["passthrough", "drop"]: passthrough (default) passes through all columns that were not specified in "variables"; drop drops all remaining columns that were not specified in "variables".
- solver (str, default: `'cp'`): Passed to optbinning.OptimalBinning: The optimizer to solve the optimal binning problem. Supported solvers are "mip" for a mixed-integer programming solver, "cp" (default) for a constrained programming solver, or "ls" for LocalSolver.
- monotonic_trend (str, default: `'auto_asc_desc'`): Passed to optbinning.OptimalBinning: The event rate monotonic trend. Supported trends are "auto", "auto_heuristic" and "auto_asc_desc" to automatically determine the trend maximizing IV using a machine learning classifier; "ascending", "descending", "concave", "convex"; "peak" and "peak_heuristic" to allow a peak change point; and "valley" and "valley_heuristic" to allow a valley change point. Trends "auto_heuristic", "peak_heuristic" and "valley_heuristic" use a heuristic to determine the change point, and are significantly faster for large size instances (max_n_prebins > 20). Trend "auto_asc_desc" automatically selects the best monotonic trend between "ascending" and "descending". If None, the monotonic constraint is disabled.
- gamma (float, default: `0`): Passed to optbinning.OptimalBinning: Regularization strength to reduce the number of dominating bins. Larger values specify stronger regularization. Option supported by solvers "cp" and "mip".
- ob_kwargs (dict, default: `{}`): Other parameters passed to optbinning.OptimalBinning.
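The nested shapes of `specials` and the dict form of `missing_treatment` are easy to get wrong. A small illustration of the documented structures; the column name and values are made up for the example:

```python
# specials: per column, a named special bucket mapped to a list of raw values
specials = {
    "LIMIT_BAL": {
        "sentinel values": [-99999, -88888],
    }
}

# missing_treatment as a dict: column name -> bucket number for missing values
missing_treatment = {"LIMIT_BAL": 2}

# Both follow the documented shapes:
assert all(isinstance(buckets, dict) for buckets in specials.values())
assert all(
    isinstance(vals, list)
    for buckets in specials.values()
    for vals in buckets.values()
)
assert all(isinstance(b, int) for b in missing_treatment.values())
```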
Source code in skorecard/bucketers/bucketers.py
def __init__(
    self,
    variables=[],
    specials={},
    variables_type="numerical",
    max_n_bins=10,
    missing_treatment="separate",
    min_bin_size=0.05,
    cat_cutoff=None,
    time_limit=25,
    remainder="passthrough",
    get_statistics=True,
    solver="cp",
    monotonic_trend="auto_asc_desc",
    gamma=0,
    ob_kwargs={},
) -> None:
    """Initialize Optimal Bucketer.

    Args:
        variables: List of variables to bucket.
        specials: (nested) dictionary of special values that require their own binning.
            The dictionary has the following format:
             {"<column name>" : {"name of special bucket" : <list with 1 or more values>}}
            For every feature that needs a special value, a dictionary must be passed as value.
            This dictionary contains a name of a bucket (key) and an array of unique values that should be put
            in that bucket.
            When special values are passed, they are not considered in the fitting procedure.
        variables_type: Passed to [optbinning.OptimalBinning](http://gnpalencia.org/optbinning/binning_binary.html):
            Type of the variables. Must be either 'categorical' or 'numerical'.
        missing_treatment: Defines how we treat the missing values present in the data.
            If a string, it must be one of the following options:
                separate: Missing values get put in a separate 'Other' bucket: `-1`
                most_risky: Missing values are put into the bucket containing the largest percentage of Class 1.
                least_risky: Missing values are put into the bucket containing the largest percentage of Class 0.
                most_frequent: Missing values are put into the most common bucket.
                neutral: Missing values are put into the bucket with WoE closest to 0.
                similar: Missing values are put into the bucket with WoE closest to the bucket with only missing values.
                passthrough: Leaves missing values untouched.
            If a dict, it must be of the following format:
                {"<column name>": <bucket_number>}
                This bucket number is where we will put the missing values.
        min_bin_size: Passed to [optbinning.OptimalBinning](http://gnpalencia.org/optbinning/binning_binary.html):
            Minimum fraction of observations in a bucket.
        max_n_bins: Passed to [optbinning.OptimalBinning](http://gnpalencia.org/optbinning/binning_binary.html):
            Maximum numbers of bins to return.
        cat_cutoff: Passed to [optbinning.OptimalBinning](http://gnpalencia.org/optbinning/binning_binary.html):
            Threshold ratio (None, or >0 and <=1) below which categories are grouped
            together in a bucket 'other'.
        time_limit (float): Passed to [optbinning.OptimalBinning](http://gnpalencia.org/optbinning/binning_binary.html):
            Time limit in seconds to find an optimal solution.
        remainder: How we want the non-specified columns to be transformed. It must be in ["passthrough", "drop"].
            passthrough (Default): all columns that were not specified in "variables" will be passed through.
            drop: all remaining columns that were not specified in "variables" will be dropped.
        solver (str): Passed to [optbinning.OptimalBinning](http://gnpalencia.org/optbinning/binning_binary.html): The optimizer to solve the optimal binning problem.
            Supported solvers are “mip” to choose a mixed-integer programming solver, “cp” (default) to choose a constrained programming solver or “ls” to choose LocalSolver.
        monotonic_trend (str): Passed to [optbinning.OptimalBinning](http://gnpalencia.org/optbinning/binning_binary.html):
            The event rate monotonic trend. Supported trends are “auto”, “auto_heuristic” and “auto_asc_desc”
            to automatically determine the trend maximizing IV using a machine learning classifier,
            “ascending”, “descending”, “concave”, “convex”, “peak” and “peak_heuristic” to allow a peak change point,
            and “valley” and “valley_heuristic” to allow a valley change point.
            Trends “auto_heuristic”, “peak_heuristic” and “valley_heuristic” use a heuristic to determine the change point,
            and are significantly faster for large size instances (max_n_prebins > 20).
            Trend “auto_asc_desc” is used to automatically select the best monotonic trend
            between “ascending” and “descending”. If None, then the monotonic constraint is disabled.
        gamma (float): Passed to [optbinning.OptimalBinning](http://gnpalencia.org/optbinning/binning_binary.html):
            Regularization strength to reduce the number of dominating bins.
            Larger values specify stronger regularization. Default is 0.
            Option supported by solvers “cp” and “mip”.
        ob_kwargs (dict): Other parameters passed to [optbinning.OptimalBinning](http://gnpalencia.org/optbinning/binning_binary.html).
    """  # noqa
    self.variables = variables
    self.specials = specials
    self.variables_type = variables_type
    self.max_n_bins = max_n_bins
    self.missing_treatment = missing_treatment
    self.min_bin_size = min_bin_size
    self.cat_cutoff = cat_cutoff
    self.time_limit = time_limit
    self.remainder = remainder
    self.get_statistics = get_statistics
    self.solver = solver
    self.monotonic_trend = monotonic_trend
    self.gamma = gamma
    self.ob_kwargs = ob_kwargs

    check_args(ob_kwargs, OptimalBinning)