Automatic feature selection techniques¶
Below we'll walk you through how to go from selecting the best feature_set manually to selecting it automatically using get_reduced_features_set. This function supports several strategies for picking the optimal feature set for your use case. Let's first show how a simple 'manual' probatus feature elimination works, before we show you how to automate it.
%%capture
!pip install probatus
!pip install catboost
from probatus.feature_elimination import ShapRFECV
from sklearn.datasets import make_classification
from catboost import CatBoostClassifier
Basic example of ShapRFECV¶
# Simple ShapRFECV example
X, y = make_classification(n_samples=500, n_informative=20, n_features=50)
model = CatBoostClassifier(n_estimators=100, verbose=0)
shap_elimination = ShapRFECV(model, step=0.2, min_features_to_select=5, cv=5, scoring="f1")
report = shap_elimination.fit_compute(X, y)
# Inspect the report generated by ShapRFECV
# Can you find the best iteration? Look at `val_metric_mean`.
report
| | num_features | features_set | eliminated_features | train_metric_mean | train_metric_std | val_metric_mean | val_metric_std |
|---|---|---|---|---|---|---|---|
| 1 | 50 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,... | [27, 38, 41, 48, 28, 49, 9, 36, 12, 14] | 1.000 | 0.000 | 0.792 | 0.029 |
| 2 | 40 | [0, 1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 13, 15, 16... | [37, 45, 23, 2, 17, 16, 31, 20] | 1.000 | 0.000 | 0.834 | 0.016 |
| 3 | 32 | [0, 1, 3, 4, 5, 6, 7, 8, 10, 11, 13, 15, 18, 1... | [40, 0, 18, 11, 39, 8] | 1.000 | 0.000 | 0.849 | 0.011 |
| 4 | 26 | [1, 3, 4, 5, 6, 7, 10, 13, 15, 19, 21, 22, 24,... | [33, 34, 32, 15, 46] | 1.000 | 0.000 | 0.849 | 0.026 |
| 5 | 21 | [1, 3, 4, 5, 6, 7, 10, 13, 19, 21, 22, 24, 25,... | [35, 1, 6, 19] | 1.000 | 0.000 | 0.868 | 0.026 |
| 6 | 17 | [3, 4, 5, 7, 10, 13, 21, 22, 24, 25, 26, 29, 3... | [7, 42, 5] | 0.974 | 0.001 | 0.837 | 0.015 |
| 7 | 14 | [3, 4, 10, 13, 21, 22, 24, 25, 26, 29, 30, 43,... | [43, 4] | 0.970 | 0.006 | 0.811 | 0.020 |
| 8 | 12 | [3, 10, 13, 21, 22, 24, 25, 26, 29, 30, 44, 47] | [25, 29] | 0.955 | 0.006 | 0.797 | 0.023 |
| 9 | 10 | [3, 10, 13, 21, 22, 24, 26, 30, 44, 47] | [3, 44] | 0.942 | 0.005 | 0.802 | 0.021 |
| 10 | 8 | [10, 13, 21, 22, 24, 26, 30, 47] | [26] | 0.925 | 0.008 | 0.783 | 0.018 |
| 11 | 7 | [10, 13, 21, 22, 24, 30, 47] | [13] | 0.910 | 0.005 | 0.757 | 0.026 |
| 12 | 6 | [10, 21, 22, 24, 30, 47] | [24] | 0.876 | 0.006 | 0.769 | 0.012 |
| 13 | 5 | [10, 21, 22, 30, 47] | [] | 0.840 | 0.007 | 0.741 | 0.011 |
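You can also query the report programmatically rather than reading the table by eye. Below is a minimal sketch, assuming report is the pandas DataFrame returned by fit_compute above; the variable names are just for illustration.

# Pick the iteration with the highest mean validation score
best_row = report.loc[report["val_metric_mean"].idxmax()]
print(f"Best iteration keeps {int(best_row['num_features'])} features "
      f"(val_metric_mean={best_row['val_metric_mean']:.3f})")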
# We can visually inspect the report and try to identify the best iteration
shap_elimination.plot()
# Once we have identified which iteration was best, we can get the feature names by:
shap_elimination.get_reduced_features_set(num_features=21)
array([ 1, 3, 4, 5, 6, 7, 10, 13, 19, 21, 22, 24, 25, 26, 29, 30, 35, 42, 43, 44, 47])
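The returned indices can then be used to retrain the model on the reduced data. A minimal sketch, assuming X is the NumPy array created by make_classification above (X_reduced and reduced_model are illustrative names):

# Keep only the selected columns and refit the model on them
reduced_features = shap_elimination.get_reduced_features_set(num_features=21)
X_reduced = X[:, reduced_features]
reduced_model = CatBoostClassifier(n_estimators=100, verbose=0)
reduced_model.fit(X_reduced, y)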
But is there a more automated way of selecting the "best" features?¶
Yes, probatus supports a few strategies for selecting the best num_features for your needs.
Before we jump into these automated techniques, there are a number of factors you might want to consider when selecting the best iteration:
- Model performance score: Most people care about this the most. You'll try to maximize a score such as F1, accuracy, or ROC AUC. But there are more factors to consider...
- Variance of model performance: If your model performance scores vary considerably across cross-validation folds, it might be a sign that your model is not generalizing well. Some people may also want to keep the standard deviation of model performance to a minimum.
- Number of features: You might want to keep the number of features to a minimum (a smaller model, a simpler data pipeline, an easier-to-explain model). A hand-rolled sketch of this trade-off follows below.
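To make the trade-off concrete, here is an illustrative, hand-rolled selection from the report DataFrame (not probatus' internal logic) that prefers fewer features as long as the mean validation score stays within one standard error of the best score:

# Illustrative one-standard-error style selection (not probatus' exact implementation)
best_idx = report["val_metric_mean"].idxmax()
threshold = report.loc[best_idx, "val_metric_mean"] - report.loc[best_idx, "val_metric_std"]
candidates = report[report["val_metric_mean"] >= threshold]
chosen = candidates.loc[candidates["num_features"].idxmin()]
print(f"Chosen iteration keeps {int(chosen['num_features'])} features")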
To accommodate some of the above needs, probatus supports automated strategies to select the best num_features. The supported strategies are:
- num_features="best": If what you care about is maximizing
val_metric_mean
(model performance score), this strategy is for you. - num_features="best_coherent": If you care about high model performance whilst also minimizing variation of model performance between cross validation folds, this strategy is for you.
- num_features="best_parsimonious": If you care about high model performance whilst also minimizing the number of features selected, this model is for you.
# Best
best_features = shap_elimination.get_reduced_features_set(num_features="best")
print(f"The {len(best_features)} best features are: {best_features}")
The 21 best features are: [ 1 3 4 5 6 7 10 13 19 21 22 24 25 26 29 30 35 42 43 44 47]
# Best coherent
best_features = shap_elimination.get_reduced_features_set(num_features="best_coherent")
print(f"The {len(best_features)} best coherent features are: {best_features}")
The 32 best coherent features are: [ 0 1 3 4 5 6 7 8 10 11 13 15 18 19 21 22 24 25 26 29 30 32 33 34 35 39 40 42 43 44 46 47]
# Best parsimonious
best_features = shap_elimination.get_reduced_features_set(num_features="best_parsimonious")
print(f"The {len(best_features)} best parsimonious features are: {best_features}")
The 21 best parsimonious features are: [ 1 3 4 5 6 7 10 13 19 21 22 24 25 26 29 30 35 42 43 44 47]
# Note, you can change the behavior of `best_coherent` and `best_parsimonious` by changing the `standard_error_threshold` parameter
# Best parsimonious (standard_error_threshold=0.5)
best_features = shap_elimination.get_reduced_features_set(
num_features="best_parsimonious", standard_error_threshold=0.5
)
print(f"The {len(best_features)} best coherent features are: {best_features}")
The 21 best coherent features are: [ 1 3 4 5 6 7 10 13 19 21 22 24 25 26 29 30 35 42 43 44 47]
Support & Ranking¶
In alignment with the sklearn implementation of RFECV, you can get `support` and `ranking` output from ShapRFECV.
# Get the selected feature names, the support (boolean mask over X's columns), and the ranking
best_features = shap_elimination.get_reduced_features_set(num_features="best", return_type="feature_names")
best_features_support = shap_elimination.get_reduced_features_set(num_features="best", return_type="support")
best_features_ranking = shap_elimination.get_reduced_features_set(num_features="best", return_type="ranking")
print(f"The {len(best_features)} best features are: {best_features}")
print(f"Support boolean mask: {best_features_support}")
print(f"Ranking: {best_features_ranking}")
The 21 best features are: [ 1 3 4 5 6 7 10 13 19 21 22 24 25 26 29 30 35 42 43 44 47]
Support boolean mask: [False, True, False, True, True, True, True, True, False, False, True, False, False, True, False, False, False, False, False, True, False, True, True, False, True, True, True, False, False, True, True, False, False, False, False, True, False, False, False, False, False, False, True, True, True, False, False, True, False, False]
Ranking: [23, 14, 31, 4, 9, 12, 15, 10, 27, 42, 0, 25, 44, 2, 45, 20, 33, 32, 24, 16, 35, 0, 0, 30, 1, 6, 3, 36, 40, 7, 0, 34, 19, 17, 18, 13, 43, 28, 37, 26, 22, 38, 11, 8, 5, 29, 21, 0, 39, 41]
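The support mask and ranking can be used directly, for example to filter the columns of X or to compare against other selectors. A minimal sketch, assuming the support output is a boolean list aligned with X's 50 columns:

import numpy as np

# Use the boolean support mask to keep the same columns as the "best" feature set
support_mask = np.asarray(best_features_support)
X_supported = X[:, support_mask]
print(X_supported.shape)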