How to ensure reproducibility of the results¶
This page describes how to make sure that the analysis that you perform using probatus
is fully reproducible.
There are two factors that influence the reproducibility of the results:
- the inputs of probatus modules,
- the random_state of probatus modules.
The sections below cover how to ensure reproducibility of the results by controlling these aspects.
Inputs of probatus modules¶
There are various parameters that probatus modules take as input. Below we cover the most common ones.
Static dataset¶
When using probatus, one of the most crucial aspects is the provided dataset, so the first thing to do is to ensure that the dataset you pass does not change along the way. Below is a code snippet that prepares random data; in sklearn, you can make the generation deterministic by setting the random_state parameter. You will probably use a different dataset in your projects, but always make sure that the input data is static.
%%capture
!pip install probatus
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=100, n_features=10, random_state=42)
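As an optional sanity check (not part of the original workflow), you can generate the data a second time with the same random_state and confirm that the arrays are identical:
# Hypothetical sanity check: regenerating the data with the same
# random_state should produce identical arrays.
import numpy as np
X_check, y_check = make_classification(n_samples=100, n_features=10, random_state=42)
assert np.array_equal(X, X_check) and np.array_equal(y, y_check)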
Static data splits¶
Whenever you split the data in any way, you need to make sure that the splits are always the same.
If you use the train_test_split functionality from sklearn, this can be enforced by setting the random_state parameter.
Another crucial aspect is the cv parameter, which defines the fold settings used in the experiments. If cv is set to an integer, you don't need to worry about it - the random_state of probatus will take care of it. However, if you pass a custom cv generator object, you have to set the random_state there as well.
Below are some examples of static splits:
from sklearn.model_selection import StratifiedKFold, train_test_split
# Static train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
# Static CV settings
cv1 = 5
cv2 = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
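To illustrate (an optional check, not required by probatus), a cv generator with a fixed random_state yields the same folds every time split is called:
# Hypothetical check: with shuffle=True and a fixed random_state,
# repeated calls to split produce identical fold indices.
folds_a = [test_idx.tolist() for _, test_idx in cv2.split(X, y)]
folds_b = [test_idx.tolist() for _, test_idx in cv2.split(X, y)]
assert folds_a == folds_b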
Static classifier¶
Most probatus modules work with a provided classifier. Whenever you need to provide an unfitted classifier, it is enough to set its random_state. However, if the classifier needs to be fitted beforehand, you have to make sure that the model training is reproducible as well.
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(random_state=42)
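If a module expects a pre-fitted classifier, a minimal sketch of reproducible training is to combine the seeded model with the static split from earlier (model_fitted is a hypothetical name, not part of the snippet above):
# Hypothetical example: training is reproducible because both the
# model's random_state and the train/test split are fixed.
model_fitted = RandomForestClassifier(random_state=42)
model_fitted.fit(X_train, y_train)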
Static search CV for hyperparameter tuning¶
Some of the modules, e.g. ShapRFECV, allow you to perform hyperparameter optimization of the model. Whenever you use such functionality, make sure that these search classes have the random_state set. This way, every round of optimization explores the same set of parameter permutations. If the search space itself is generated based on randomness, make sure that the random_state is set there as well.
from sklearn.model_selection import RandomizedSearchCV
param_grid = {
"n_estimators": [5, 7, 10],
"max_leaf_nodes": [3, 5, 7, 10],
}
search = RandomizedSearchCV(model, param_grid, n_iter=1, random_state=42)
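If you build the search space itself with random draws (a hypothetical scenario, not part of the snippet above), seed that generator too, for example with numpy's default_rng:
# Hypothetical example: the search space itself is sampled randomly,
# so the generator that produces it must be seeded as well.
import numpy as np
rng = np.random.default_rng(42)
random_param_grid = {
    "n_estimators": rng.integers(5, 15, size=3).tolist(),
    "max_leaf_nodes": rng.integers(3, 12, size=4).tolist(),
}
search_random = RandomizedSearchCV(model, random_param_grid, n_iter=1, random_state=42)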
Any other sources of randomness¶
Before running probatus modules, think about the inputs and consider whether any other type of randomness is involved. If there is, one option is to set the random seed at the beginning of your code.
# Optional step
import numpy as np
np.random.seed(42)
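If your code also draws from Python's built-in random module (an assumption about your setup, not something probatus requires), seed it as well:
# Hypothetical addition: also seed Python's built-in PRNG if you use it.
import random
random.seed(42)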
Reproducibility in probatus¶
Most of the modules in probatus allow you to set the random_state. This setting essentially makes sure that any code the functions operate on has a static flow. As long as it is set, and all other inputs do not cause additional fluctuations between runs, your results will be reproducible.
from probatus.feature_elimination import ShapRFECV
shap_elimination = ShapRFECV(model=search, step=0.2, cv=cv2, scoring="roc_auc", n_jobs=3, random_state=42)
report = shap_elimination.fit_compute(X, y)
report[["num_features", "eliminated_features", "val_metric_mean"]]
|   | num_features | eliminated_features | val_metric_mean |
|---|---|---|---|
| 1 | 10 | [8, 9] | 0.983 |
| 2 | 8 | [5] | 0.969 |
| 3 | 7 | [7] | 0.984 |
| 4 | 6 | [6] | 0.979 |
| 5 | 5 | [4] | 0.983 |
| 6 | 4 | [1] | 0.987 |
| 7 | 3 | [0] | 0.991 |
| 8 | 2 | [3] | 0.991 |
| 9 | 1 | [] | 0.969 |
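As a final optional check (a sketch assuming fit_compute returns a pandas DataFrame, as in the report above), re-instantiating the estimator and rerunning the elimination with the same random_state should reproduce the report:
# Hypothetical check: a fresh run with the same inputs and random_state
# should produce an identical feature-elimination report.
shap_elimination_2 = ShapRFECV(model=search, step=0.2, cv=cv2, scoring="roc_auc", n_jobs=3, random_state=42)
report_2 = shap_elimination_2.fit_compute(X, y)
assert report["val_metric_mean"].equals(report_2["val_metric_mean"])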