How to ensure reproducibility of the results¶
This page describes how to make sure that the analysis that you perform using probatus
is fully reproducible.
There are two factors that influence the reproducibility of the results:
- the inputs of probatus modules,
- the random_state of probatus modules.
The sections below cover how to ensure reproducibility of the results by controlling these aspects.
Inputs of probatus modules¶
There are various parameters that probatus modules take as input. Below we cover the most common ones.
Static dataset¶
When using probatus, one of the most crucial aspects is the provided dataset, so the first thing to do is to ensure that the dataset you pass does not change along the way. Below is a code snippet that prepares random data; in sklearn, you can make the generation deterministic by setting the random_state parameter. You will probably use a different dataset in your projects, but always make sure that the input data is static.
%%capture
!pip install probatus
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=100, n_features=10, random_state=42)
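As an optional sanity check (not part of the original workflow), you can generate the data a second time with the same random_state and confirm that the arrays are identical:
# Hypothetical sanity check: regenerating the data with the same
# random_state should produce identical arrays.
import numpy as np
X_check, y_check = make_classification(n_samples=100, n_features=10, random_state=42)
assert np.array_equal(X, X_check) and np.array_equal(y, y_check)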
Static data splits¶
Whenever you split the data in any way, you need to make sure that the splits are always the same.
If you use the train_test_split functionality from sklearn, this can be enforced by setting the random_state parameter.
Another crucial aspect is the cv parameter, which defines the fold settings used in the experiments. If cv is set to an integer, you don't need to worry about it - the random_state of probatus will take care of it. However, if you pass a custom cv generator object, you have to set the random_state there as well.
Below are some examples of static splits:
from sklearn.model_selection import StratifiedKFold, train_test_split
# Static train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
# Static CV settings
cv1 = 5
cv2 = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
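To illustrate (an optional check, not required by probatus), a cv generator with a fixed random_state yields the same folds every time split is called:
# Hypothetical check: with shuffle=True and a fixed random_state,
# repeated calls to split produce identical fold indices.
folds_a = [test_idx.tolist() for _, test_idx in cv2.split(X, y)]
folds_b = [test_idx.tolist() for _, test_idx in cv2.split(X, y)]
assert folds_a == folds_b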
Static classifier¶
Most probatus modules work with a provided classifier. Whenever you need to provide an unfitted classifier, it is enough to set its random_state. However, if the classifier needs to be fitted beforehand, you have to make sure that the model training is reproducible as well.
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(random_state=42)
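If a module expects a pre-fitted classifier, a minimal sketch of reproducible training is to combine the seeded model with the static split from earlier (model_fitted is a hypothetical name, not part of the snippet above):
# Hypothetical example: training is reproducible because both the
# model's random_state and the train/test split are fixed.
model_fitted = RandomForestClassifier(random_state=42)
model_fitted.fit(X_train, y_train)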
Static search CV for hyperparameter tuning¶
Some of the modules, e.g. ShapRFECV, allow you to perform hyperparameter optimization of the model. Whenever you use such functionality, make sure that these search classes have the random_state set. This way, every round of optimization explores the same set of parameter permutations. If the search space itself is generated based on randomness, make sure that the random_state is set there as well.
from sklearn.model_selection import RandomizedSearchCV
param_grid = {
"n_estimators": [5, 7, 10],
"max_leaf_nodes": [3, 5, 7, 10],
}
search = RandomizedSearchCV(model, param_grid, n_iter=1, random_state=42)
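If you build the search space itself with random draws (a hypothetical scenario, not part of the snippet above), seed that generator too, for example with numpy's default_rng:
# Hypothetical example: the search space itself is sampled randomly,
# so the generator that produces it must be seeded as well.
import numpy as np
rng = np.random.default_rng(42)
random_param_grid = {
    "n_estimators": rng.integers(5, 15, size=3).tolist(),
    "max_leaf_nodes": rng.integers(3, 12, size=4).tolist(),
}
search_random = RandomizedSearchCV(model, random_param_grid, n_iter=1, random_state=42)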
Any other sources of randomness¶
Before running probatus modules, think about the inputs and consider whether any other type of randomness is involved. If there is, one option is to set the random seed at the beginning of your code.
# Optional step
import numpy as np
np.random.seed(42)
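If your code also draws from Python's built-in random module (an assumption about your setup, not something probatus requires), seed it as well:
# Hypothetical addition: also seed Python's built-in PRNG if you use it.
import random
random.seed(42)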
Reproducibility in probatus¶
Most of the modules in probatus allow you to set the random_state. This setting essentially makes sure that any code the functions operate on has a static flow. As long as it is set, and all other inputs do not cause additional fluctuations between runs, your results will be reproducible.
from probatus.feature_elimination import ShapRFECV
shap_elimination = ShapRFECV(model=search, step=0.2, cv=cv2, scoring="roc_auc", n_jobs=3, random_state=42)
report = shap_elimination.fit_compute(X, y)
report[["num_features", "eliminated_features", "val_metric_mean"]]
|   | num_features | eliminated_features | val_metric_mean |
|---|---|---|---|
| 1 | 10 | [8, 9] | 0.983 |
| 2 | 8 | [5] | 0.969 |
| 3 | 7 | [7] | 0.984 |
| 4 | 6 | [6] | 0.979 |
| 5 | 5 | [4] | 0.983 |
| 6 | 4 | [1] | 0.987 |
| 7 | 3 | [0] | 0.991 |
| 8 | 2 | [3] | 0.991 |
| 9 | 1 | [] | 0.969 |
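As a final optional check (a sketch assuming fit_compute returns a pandas DataFrame, as in the report above), re-instantiating the estimator and rerunning the elimination with the same random_state should reproduce the report:
# Hypothetical check: a fresh run with the same inputs and random_state
# should produce an identical feature-elimination report.
shap_elimination_2 = ShapRFECV(model=search, step=0.2, cv=cv2, scoring="roc_auc", n_jobs=3, random_state=42)
report_2 = shap_elimination_2.fit_compute(X, y)
assert report["val_metric_mean"].equals(report_2["val_metric_mean"])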