AsIsNumericalBucketer

Bases: BaseBucketer

The AsIsNumericalBucketer transformer creates buckets by treating the existing unique values as boundaries.

Support: numerical (true), categorical (false), supervised (false)

This bucketer is useful when you have data that is already sufficiently bucketed, but you would like to be able to bucket new data in the same way.

Example:

```python
from skorecard import datasets
from skorecard.bucketers import AsIsNumericalBucketer

X, y = datasets.load_uci_credit_card(return_X_y=True)
bucketer = AsIsNumericalBucketer(variables=['LIMIT_BAL'])
bucketer.fit_transform(X)
```
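
The idea of reusing existing values as boundaries can be sketched without skorecard. The snippet below is a minimal illustration, not the library's implementation: the helper names `fit_boundaries` and `assign_bucket` are hypothetical, and the bucket numbering (the index of the first boundary greater than or equal to the value) is an assumption that mirrors the `right=True` convention described for this class.

```python
from bisect import bisect_left


def fit_boundaries(values):
    """Collect the sorted unique values of already-bucketed data (hypothetical helper)."""
    return sorted(set(values))


def assign_bucket(value, boundaries):
    """Return the index of the right-closed interval (b[i-1], b[i]] containing value."""
    return bisect_left(boundaries, value)


# Data that already takes only a handful of distinct values
pre_bucketed = [10, 20, 20, 30, 10, 30]
boundaries = fit_boundaries(pre_bucketed)

# New, unseen values are mapped onto the same boundaries
new_data = [5, 10, 15, 30, 99]
buckets = [assign_bucket(v, boundaries) for v in new_data]
print(boundaries, buckets)
```

Values that were not present during fitting, such as 15 or 99 here, still land in a well-defined interval between or beyond the stored boundaries.
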
Source code in skorecard/bucketers/bucketers.py
class AsIsNumericalBucketer(BaseBucketer):
    """
    The `AsIsNumericalBucketer` transformer creates buckets by treating the existing unique values as boundaries.

    Support: ![badge](https://img.shields.io/badge/numerical-true-green) ![badge](https://img.shields.io/badge/categorical-false-red) ![badge](https://img.shields.io/badge/supervised-false-blue)

    This bucketer is useful when you have data that is already sufficiently bucketed,
    but you would like to be able to bucket new data in the same way.

    Example:

    ```python
    from skorecard import datasets
    from skorecard.bucketers import AsIsNumericalBucketer

    X, y = datasets.load_uci_credit_card(return_X_y=True)
    bucketer = AsIsNumericalBucketer(variables=['LIMIT_BAL'])
    bucketer.fit_transform(X)
    ```
    """  # noqa

    def __init__(
        self,
        right=True,
        variables=[],
        specials={},
        missing_treatment="separate",
        remainder="passthrough",
        get_statistics=True,
    ):
        """
        Init the class.

        Args:
            right (boolean): Whether the right boundary is included in a range (default) or the range runs 'up to but not including' it.
                For example, if you have [5, 10], the ranges for right=True would be (-Inf, 5], (5, 10], (10, Inf)
                or (-Inf, 5), [5, 10), [10, Inf) for right=False.
            variables (list): The features to bucket. Uses all features if not defined.
            specials (dict): (nested) dictionary of special values that require their own binning.
                The dictionary has the following format:
                 {"<column name>" : {"name of special bucket" : <list with 1 or more values>}}
                For every feature that needs a special value, a dictionary must be passed as value.
                This dictionary contains a name of a bucket (key) and an array of unique values that should be put
                in that bucket.
                When special values are defined, they are not considered in the fitting procedure.
            missing_treatment (str or dict): Defines how we treat the missing values present in the data.
                If a string, it must be one of the following options:
                    separate: Missing values get put in a separate 'Other' bucket: `-1`
                    most_risky: Missing values are put into the bucket containing the largest percentage of Class 1.
                    least_risky: Missing values are put into the bucket containing the largest percentage of Class 0.
                    most_frequent: Missing values are put into the most common bucket.
                    neutral: Missing values are put into the bucket with WoE closest to 0.
                    similar: Missing values are put into the bucket with WoE closest to the bucket with only missing values.
                    passthrough: Leaves missing values untouched.
                If a dict, it must be of the following format:
                    {"<column name>": <bucket_number>}
                    This bucket number is where we will put the missing values.
            remainder (str): How we want the non-specified columns to be transformed. It must be in ["passthrough", "drop"].
                passthrough (Default): all columns that were not specified in "variables" will be passed through.
                drop: all remaining columns that were not specified in "variables" will be dropped.
        """  # noqa
        self.right = right
        self.variables = variables
        self.specials = specials
        self.missing_treatment = missing_treatment
        self.remainder = remainder
        self.get_statistics = get_statistics

    @property
    def variables_type(self):
        """
        Signals variables type supported by this bucketer.
        """
        return "numerical"

    def _get_feature_splits(self, feature, X, y, X_unfiltered=None):
        """
        Finds the splits for a single feature.

        X and y have already been preprocessed, and have specials removed.

        Args:
            feature (str): Name of the feature.
            X (pd.Series): df with single column of feature to bucket
            y (np.ndarray): array with target
            X_unfiltered (pd.Series): df with single column of feature to bucket before any filtering was applied

        Returns:
            splits, right (tuple): The splits (dict or array), and whether right=True or False.
        """
        boundaries = X.unique().tolist()
        boundaries.sort()

        if len(boundaries) > 100:
            msg = f"The column '{feature}' has more than 100 unique values "
            msg += "and cannot be used with the AsIsBucketer. "
            msg += "Apply a different bucketer first."
            raise NotPreBucketedError(msg)

        return (boundaries, self.right)
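
The guard at the end of `_get_feature_splits` can be exercised in isolation. This is a sketch only: `NotPreBucketedError` is redefined here as a plain `Exception` subclass so the snippet runs without skorecard, and `get_splits` is a hypothetical stand-alone version of the method that takes a plain iterable instead of a `pd.Series`.

```python
class NotPreBucketedError(Exception):
    """Stand-in for skorecard's exception of the same name (assumption)."""


def get_splits(feature, values, right=True, max_unique=100):
    """Return sorted unique values as boundaries, or raise if there are too many."""
    boundaries = sorted(set(values))
    if len(boundaries) > max_unique:
        raise NotPreBucketedError(
            f"The column '{feature}' has more than {max_unique} unique values. "
            "Apply a different bucketer first."
        )
    return boundaries, right


# A column with few distinct values passes the check
splits, right = get_splits("grade", [1, 2, 2, 3])

# A near-continuous column (101 distinct values) is rejected
try:
    get_splits("income", range(101))
    rejected = False
except NotPreBucketedError:
    rejected = True
```

The threshold of 100 comes from the source above; exposing it as a `max_unique` argument is a choice made for this sketch, not something the class offers.
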

variables_type property

Signals variables type supported by this bucketer.

__init__(right=True, variables=[], specials={}, missing_treatment='separate', remainder='passthrough', get_statistics=True)

Init the class.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| right | boolean | Whether the right boundary is included in a range (default) or the range runs 'up to but not including' it. For example, if you have [5, 10], the ranges for right=True would be (-Inf, 5], (5, 10], (10, Inf), or (-Inf, 5), [5, 10), [10, Inf) for right=False. | True |
| variables | list | The features to bucket. Uses all features if not defined. | [] |
| specials | dict | (Nested) dictionary of special values that require their own binning. The dictionary has the following format: `{"<column name>" : {"name of special bucket" : <list with 1 or more values>}}`. For every feature that needs a special value, a dictionary must be passed as value. This dictionary contains a name of a bucket (key) and an array of unique values that should be put in that bucket. When special values are defined, they are not considered in the fitting procedure. | {} |
| missing_treatment | str or dict | Defines how we treat the missing values present in the data. If a string, it must be one of the following options: `separate`: missing values get put in a separate 'Other' bucket, `-1`; `most_risky`: missing values are put into the bucket containing the largest percentage of Class 1; `least_risky`: missing values are put into the bucket containing the largest percentage of Class 0; `most_frequent`: missing values are put into the most common bucket; `neutral`: missing values are put into the bucket with WoE closest to 0; `similar`: missing values are put into the bucket with WoE closest to the bucket with only missing values; `passthrough`: leaves missing values untouched. If a dict, it must be of the following format: `{"<column name>": <bucket_number>}`. This bucket number is where we will put the missing values. | 'separate' |
| remainder | str | How we want the non-specified columns to be transformed. It must be in ["passthrough", "drop"]. `passthrough` (default): all columns that were not specified in "variables" will be passed through. `drop`: all remaining columns that were not specified in "variables" will be dropped. | 'passthrough' |
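
The effect of `right` on the example boundaries [5, 10] can be checked with the standard library. This sketch assumes the interval conventions stated for the `right` parameter above and uses `bisect` purely for illustration; the function name `bucket_index` is hypothetical, and skorecard may compute the assignment differently.

```python
from bisect import bisect_left, bisect_right

boundaries = [5, 10]


def bucket_index(value, boundaries, right=True):
    """Index of the interval containing value.

    right=True  -> intervals (-Inf, 5], (5, 10], (10, Inf)
    right=False -> intervals (-Inf, 5), [5, 10), [10, Inf)
    """
    if right:
        return bisect_left(boundaries, value)
    return bisect_right(boundaries, value)


values = [4, 5, 7, 10, 11]
closed_right = [bucket_index(v, boundaries, right=True) for v in values]
closed_left = [bucket_index(v, boundaries, right=False) for v in values]
```

Only values that fall exactly on a boundary (5 and 10 here) change bucket between the two settings.
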
Source code in skorecard/bucketers/bucketers.py
def __init__(
    self,
    right=True,
    variables=[],
    specials={},
    missing_treatment="separate",
    remainder="passthrough",
    get_statistics=True,
):
    """
    Init the class.

    Args:
        right (boolean): Whether the right boundary is included in a range (default) or the range runs 'up to but not including' it.
            For example, if you have [5, 10], the ranges for right=True would be (-Inf, 5], (5, 10], (10, Inf)
            or (-Inf, 5), [5, 10), [10, Inf) for right=False.
        variables (list): The features to bucket. Uses all features if not defined.
        specials (dict): (nested) dictionary of special values that require their own binning.
            The dictionary has the following format:
             {"<column name>" : {"name of special bucket" : <list with 1 or more values>}}
            For every feature that needs a special value, a dictionary must be passed as value.
            This dictionary contains a name of a bucket (key) and an array of unique values that should be put
            in that bucket.
            When special values are defined, they are not considered in the fitting procedure.
        missing_treatment (str or dict): Defines how we treat the missing values present in the data.
            If a string, it must be one of the following options:
                separate: Missing values get put in a separate 'Other' bucket: `-1`
                most_risky: Missing values are put into the bucket containing the largest percentage of Class 1.
                least_risky: Missing values are put into the bucket containing the largest percentage of Class 0.
                most_frequent: Missing values are put into the most common bucket.
                neutral: Missing values are put into the bucket with WoE closest to 0.
                similar: Missing values are put into the bucket with WoE closest to the bucket with only missing values.
                passthrough: Leaves missing values untouched.
            If a dict, it must be of the following format:
                {"<column name>": <bucket_number>}
                This bucket number is where we will put the missing values.
        remainder (str): How we want the non-specified columns to be transformed. It must be in ["passthrough", "drop"].
            passthrough (Default): all columns that were not specified in "variables" will be passed through.
            drop: all remaining columns that were not specified in "variables" will be dropped.
    """  # noqa
    self.right = right
    self.variables = variables
    self.specials = specials
    self.missing_treatment = missing_treatment
    self.remainder = remainder
    self.get_statistics = get_statistics
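
The `specials` format, and the statement that special values are not considered in the fitting procedure, can be illustrated as follows. This is a rough sketch in plain Python, not skorecard's code: the bucket name `"special 0"`, the sample values and the helper `boundaries_without_specials` are made up for the example.

```python
# Nested dict in the documented format:
# {"<column name>": {"name of special bucket": [values]}}
specials = {"LIMIT_BAL": {"special 0": [0]}}


def boundaries_without_specials(column, values, specials):
    """Drop the special values for a column, then collect sorted unique boundaries."""
    special_values = {
        v for bucket_values in specials.get(column, {}).values() for v in bucket_values
    }
    return sorted({v for v in values if v not in special_values})


limit_bal = [0, 10000, 20000, 0, 50000, 20000]
boundaries = boundaries_without_specials("LIMIT_BAL", limit_bal, specials)
```

Here the value 0 is reserved for its own bucket, so it does not appear among the boundaries derived from the remaining values.
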