OptimalBucketer

The `OptimalBucketer` transformer uses the `optbinning` package to find optimal buckets.
This bucketer wraps `optbinning.OptimalBinning` to be consistent with the skorecard API. It requires a feature to be pre-bucketed into at most 100 buckets. Optbinning uses a constrained-programming solver to merge buckets, subject to the following constraints: 1) monotonicity in bad rate, and 2) at least 5% of records per bin.
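The two constraints can be illustrated with a quick check on a candidate bucketing. The numbers below are hypothetical, purely for illustration; they are not skorecard output:

```python
# Hypothetical candidate bucketing: (fraction of records, bad rate) per bucket.
buckets = [(0.10, 0.02), (0.25, 0.05), (0.40, 0.11), (0.25, 0.19)]

# Constraint 1: bad rate must be monotonic (ascending or descending) across buckets.
bad_rates = [rate for _, rate in buckets]
is_monotonic = (
    all(a <= b for a, b in zip(bad_rates, bad_rates[1:]))
    or all(a >= b for a, b in zip(bad_rates, bad_rates[1:]))
)

# Constraint 2: every bucket must hold at least 5% of records.
meets_min_size = all(frac >= 0.05 for frac, _ in buckets)

print(is_monotonic, meets_min_size)  # True True
```

The solver searches over merges of the pre-buckets so that the final solution satisfies both checks.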
Examples:

```python
from skorecard import datasets
from skorecard.bucketers import OptimalBucketer

X, y = datasets.load_uci_credit_card(return_X_y=True)
bucketer = OptimalBucketer(variables=['LIMIT_BAL'])
bucketer.fit_transform(X, y)
```
`__init__(self, variables=[], specials={}, variables_type='numerical', max_n_bins=10, missing_treatment='separate', min_bin_size=0.05, cat_cutoff=None, time_limit=25, remainder='passthrough', solver='cp', monotonic_trend='auto_asc_desc', gamma=0, ob_kwargs={})`

Initialize the Optimal Bucketer.
Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| variables | list | List of variables to bucket. | `[]` |
| specials | dict | (Nested) dictionary of special values that require their own binning. | `{}` |
| variables_type | str | Passed to `optbinning.OptimalBinning`: type of the variables. Must be either 'categorical' or 'numerical'. | `'numerical'` |
| missing_treatment | str | Defines how we treat the missing values present in the data. If a string, it must be one of the supported options, e.g. 'separate': missing values get put in a separate 'Other' bucket. | `'separate'` |
| min_bin_size | float | Passed to `optbinning.OptimalBinning`: minimum fraction of observations in a bucket. | `0.05` |
| max_n_bins | int | Passed to `optbinning.OptimalBinning`: maximum number of bins to return. | `10` |
| cat_cutoff | float | Passed to `optbinning.OptimalBinning`: threshold ratio (None, or >0 and <=1) below which categories are grouped together in an 'other' bucket. | `None` |
| time_limit | float | Passed to `optbinning.OptimalBinning`: time limit in seconds to find an optimal solution. | `25` |
| remainder | str | How the non-specified columns should be transformed. Must be in ["passthrough", "drop"]: 'passthrough' (default) passes through all columns that were not specified in `variables`; 'drop' drops all remaining columns that were not specified in `variables`. | `'passthrough'` |
| solver | str | Passed to `optbinning.OptimalBinning`: the optimizer to solve the optimal binning problem. Supported solvers are 'mip' (a mixed-integer programming solver), 'cp' (a constrained programming solver, the default) and 'ls' (LocalSolver). | `'cp'` |
| monotonic_trend | str | Passed to `optbinning.OptimalBinning`: the event-rate monotonic trend. Supported trends are 'auto', 'auto_heuristic' and 'auto_asc_desc' to automatically determine the trend maximizing IV using a machine-learning classifier; 'ascending', 'descending', 'concave', 'convex'; 'peak' and 'peak_heuristic' to allow a peak change point; and 'valley' and 'valley_heuristic' to allow a valley change point. Trends 'auto_heuristic', 'peak_heuristic' and 'valley_heuristic' use a heuristic to determine the change point, and are significantly faster for large instances (`max_n_prebins > 20`). Trend 'auto_asc_desc' automatically selects the best monotonic trend between 'ascending' and 'descending'. If None, the monotonic constraint is disabled. | `'auto_asc_desc'` |
| gamma | float | Passed to `optbinning.OptimalBinning`: regularization strength to reduce the number of dominating bins. Larger values specify stronger regularization. Supported by solvers 'cp' and 'mip'. | `0` |
| ob_kwargs | dict | Other parameters passed to `optbinning.OptimalBinning`. | `{}` |
`bucket_table(self, column)` (inherited)

Generates the statistics for the buckets of a particular column.
The pre-buckets are matched to the post-buckets, so that the user has a much clearer understanding of how the `BucketingProcess` ends up with the final buckets. An example:
| bucket | label | Count | Count (%) | Non-event | Event | % Event | % Non-event | Event Rate | WoE | IV |
|--------|-------|-------|-----------|-----------|-------|---------|-------------|------------|-----|-----|
| 0 | (-inf, 25.0) | 61.0 | 1.36 | 57.0 | 4.0 | 0.41 | 1.62 | 0.066 | 1.380 | 0.017 |
| 1 | [25.0, 45.0) | 2024.0 | 44.98 | 1536.0 | 488.0 | 49.64 | 43.67 | 0.241 | -0.128 | 0.008 |
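Assuming optbinning's usual convention, WoE per bucket is `ln(% non-event / % event)` and the IV contribution is `(% non-event − % event) · WoE` (with percentages as fractions). A quick check reproduces the example rows above up to rounding; note that a bucket whose % event exceeds its % non-event gets a negative WoE under this convention:

```python
import math

def woe_iv(pct_event, pct_nonevent):
    """WoE and IV contribution for one bucket; percentages given in %."""
    e, ne = pct_event / 100, pct_nonevent / 100
    woe = math.log(ne / e)
    return woe, (ne - e) * woe

# (% Event, % Non-event) from the two example rows:
print(woe_iv(0.41, 1.62))    # ≈ (1.374, 0.017)
print(woe_iv(49.64, 43.67))  # ≈ (-0.128, 0.008)
```

The small differences come from the rounded percentages in the table.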
Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| column | | The column we wish to analyse. | required |

Returns:

| Type | Description |
|------|-------------|
| df (`pd.DataFrame`) | A pandas DataFrame in the format above. |
`fit(self, X, y=None)` (inherited)

Fit X, y.
`fit_interactive(self, X, y=None, mode='external', **server_kwargs)` (inherited)

Fit a bucketer and then interactively edit the fit using a dash app.
Note that this uses a jupyter-dash app, which supports 3 different modes:

- 'external' (default): start the dash server and print the URL
- 'inline': start the dash app inside an IFrame in the jupyter notebook
- 'jupyterlab': start the dash app as a new tab inside jupyterlab
`fit_transform(self, X, y=None, **fit_params)` (inherited)

Fit to data, then transform it.
Fits the transformer to X and y with optional parameters `fit_params`, and returns a transformed version of X.

Parameters:

X : array-like of shape (n_samples, n_features)
    Input samples.
y : array-like of shape (n_samples,) or (n_samples, n_outputs), default=None
    Target values (None for unsupervised transformations).
**fit_params : dict
    Additional fit parameters.

Returns:

X_new : ndarray array of shape (n_samples, n_features_new)
    Transformed array.
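The contract here is the standard scikit-learn one: `fit_transform(X, y)` is equivalent to `fit(X, y)` followed by `transform(X)`. A minimal sketch with a toy transformer (illustrative only, not skorecard code) makes this concrete:

```python
class ToyBucketer:
    """Toy transformer illustrating the fit/transform contract."""

    def fit(self, X, y=None):
        # "Learn" a single bucket edge from the data: the median.
        xs = sorted(X)
        self.edge_ = (xs[len(xs) // 2 - 1] + xs[len(xs) // 2]) / 2
        return self

    def transform(self, X):
        # Bucket 0 below the learned edge, bucket 1 at or above it.
        return [int(x >= self.edge_) for x in X]

    def fit_transform(self, X, y=None):
        # Equivalent to fit followed by transform.
        return self.fit(X, y).transform(X)

X = [1.0, 2.0, 3.0, 10.0]
print(ToyBucketer().fit_transform(X))  # [0, 0, 1, 1]
```

Because the bucketers follow this contract, they drop into `sklearn.pipeline.Pipeline` without adapters.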
`get_params(self, deep=True)` (inherited)

Get the parameters of this estimator.
`plot_bucket(self, column, line='event_rate', format=None, scale=None, width=None, height=None)` (inherited)

Plot the buckets.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| column | | The column we want to visualise. | required |
| line | | The line to plot on the secondary axis. Default is the event rate. | `'event_rate'` |
| format | | The format of the image, such as 'png'. The default None returns a plotly image. | `None` |
| scale | | If format is specified, the scale of the image. | `None` |
| width | | If format is specified, the width of the image. | `None` |
| height | | If format is specified, the height of the image. | `None` |

Returns:

| Type | Description |
|------|-------------|
| plot | plotly fig |
`predict(self, X)` (inherited)

Applies the transform method. To be used for grid searches.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| X | `pd.DataFrame` | The numerical data which will be transformed into the corresponding buckets. | required |

Returns:

| Type | Description |
|------|-------------|
| y (`np.array`) | Transformed X, such that the values of X are replaced by the corresponding bucket numbers. |
`predict_proba(self, X)` (inherited)

Applies the transform method. To be used for grid searches.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| X | `pd.DataFrame` | The numerical data which will be transformed into the corresponding buckets. | required |

Returns:

| Type | Description |
|------|-------------|
| yhat (`np.array`) | Transformed X, such that the values of X are replaced by the corresponding bucket numbers. |
`save_yml(self, fout)` (inherited)

Save the features bucket to a yaml file.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| fout | PathLike | File output. | required |
`set_params(self, **params)` (inherited)

Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as `sklearn.pipeline.Pipeline`). The latter have parameters of the form `<component>__<parameter>`, so that it's possible to update each component of a nested object.

Parameters:

**params : dict
    Estimator parameters.

Returns:

self : estimator instance
    Estimator instance.
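The `<component>__<parameter>` routing can be sketched with toy objects. This is only an illustration of the naming convention, not sklearn's actual implementation:

```python
class Toy:
    """Minimal estimator-like object with an optional nested component."""

    def __init__(self, alpha=1.0, inner=None):
        self.alpha = alpha
        self.inner = inner

    def set_params(self, **params):
        for key, value in params.items():
            if "__" in key:
                # Route 'component__parameter' to the nested object.
                component, _, sub_key = key.partition("__")
                getattr(self, component).set_params(**{sub_key: value})
            else:
                setattr(self, key, value)
        return self

outer = Toy(inner=Toy(alpha=0.5))
outer.set_params(alpha=2.0, inner__alpha=0.1)
print(outer.alpha, outer.inner.alpha)  # 2.0 0.1
```

This is what makes it possible to tune, e.g., a bucketer's parameters inside a pipeline via a single grid-search parameter dictionary.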
`summary(self)` (inherited)

Display a summary table for columns passed to `.fit()`.
The format is the following:

| column | num_prebuckets | num_buckets | dtype |
|--------|----------------|-------------|-------|
| LIMIT_BAL | 15 | 10 | float64 |
| BILL_AMT1 | 15 | 6 | float64 |
`transform(self, X, y=None)` (inherited)

Transforms an array into the corresponding buckets fitted by the transformer.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| X | `pd.DataFrame` | DataFrame which will be transformed into the corresponding buckets. | required |
| y | array | Target. | None |

Returns:

| Type | Description |
|------|-------------|
| df (`pd.DataFrame`) | Dataset with transformed features. |