Sample Similarity

The goal of the sample similarity module is to understand how different two samples are from a multivariate perspective.

One way to measure this is a resemblance model. Given two datasets, say X1 and X2, one can analyse how easy it is to recognize which dataset a randomly selected row comes from. The resemblance model assigns label 0 to X1 and label 1 to X2, and trains a binary classification model to predict which sample a given row comes from. If the test AUC is significantly higher than 0.5, the model can tell the samples apart, so one can conclude that they have different distributions. Furthermore, by analysing feature importance one can understand which features have the predictive power to separate the samples.
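The idea can be sketched with plain scikit-learn, without the probatus classes. This is a minimal illustration rather than the library's implementation: the synthetic data, the shift of 1.0 in the first feature, and all variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Two illustrative samples with 3 features; X2 is shifted in the first feature only.
X1 = rng.normal(size=(500, 3))
X2 = rng.normal(size=(500, 3))
X2[:, 0] += 1.0

# Label each row by its origin: 0 for X1, 1 for X2.
X = np.vstack([X1, X2])
y = np.concatenate([np.zeros(len(X1)), np.ones(len(X2))])

# Hold out a stratified test set so the AUC is measured on unseen rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

clf = RandomForestClassifier(max_depth=2, random_state=0).fit(X_train, y_train)
test_auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
print(f"Test AUC: {test_auc:.3f}")
```

A test AUC close to 0.5 would suggest the classifier cannot tell the samples apart; here the shifted feature should push it clearly above that level.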

The following features are implemented:

  • SHAPImportanceResemblance (Recommended): The class applies the SHAP library to interpret the tree-based resemblance model.
  • PermutationImportanceResemblance: The class applies permutation feature importance to understand which features the current model relies on the most. The higher the importance of a feature, the more its distribution possibly differs in X2 compared to X1. The importance indicates how much the test AUC drops if a given feature is permuted.
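To make "how much the test AUC drops if a given feature is permuted" concrete, the sketch below uses scikit-learn's `permutation_importance` on a hand-built resemblance model. It is an illustration under assumptions (synthetic data in which only feature 0 differs, 20 repeats), not the code path used by PermutationImportanceResemblance.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Illustrative samples: only feature 0 has a different distribution in X2.
X1 = rng.normal(size=(500, 3))
X2 = rng.normal(size=(500, 3))
X2[:, 0] += 1.0

X = np.vstack([X1, X2])
y = np.concatenate([np.zeros(len(X1)), np.ones(len(X2))])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
clf = RandomForestClassifier(max_depth=2, random_state=0).fit(X_train, y_train)

# Shuffle one feature at a time on the test set and record the AUC drop.
result = permutation_importance(
    clf, X_test, y_test, scoring="roc_auc", n_repeats=20, random_state=0
)

# Rank features from the largest to the smallest mean AUC drop.
ranking = np.argsort(result.importances_mean)[::-1]
for idx in ranking:
    print(f"feature {idx}: mean AUC drop = {result.importances_mean[idx]:.3f}")
```

The shifted feature should come out on top, while the two unshifted features should show drops near zero.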

BaseResemblanceModel

Bases: BaseFitComputePlotClass

This model checks for the similarity of two samples.

A possible use case is analysis of whether the train sample differs from the test sample, due to e.g. non-stationarity.

This is a base class and needs to be extended by a fit() method, which implements how the data is split, how the model is trained and evaluated. Further, inheriting classes need to implement how feature importance should be indicated.

Source code in probatus/sample_similarity/resemblance_model.py
class BaseResemblanceModel(BaseFitComputePlotClass):
    """
    This model checks for the similarity of two samples.

    A possible use case is analysis of whether the train sample differs
    from the test sample, due to e.g. non-stationarity.

    This is a base class and needs to be extended by a fit() method, which implements how the data is split,
    how the model is trained and evaluated.
    Further, inheriting classes need to implement how feature importance should be indicated.
    """

    def __init__(
        self,
        clf,
        scoring="roc_auc",
        test_prc=0.25,
        n_jobs=1,
        verbose=0,
        random_state=None,
    ):
        """
        Initializes the class.

        Args:
            clf (model object):
                Binary classification model or pipeline.

            scoring (string or probatus.utils.Scorer, optional):
                Metric for which the model performance is calculated. It can be either a metric name aligned with
                predefined
                [classification scorers names in sklearn](https://scikit-learn.org/stable/modules/model_evaluation.html).
                Another option is using probatus.utils.Scorer to define a custom metric. The recommended option for this
                class is 'roc_auc'.

            test_prc (float, optional):
                Fraction of the data (between 0 and 1) used to test the model. By default 0.25 is set.

            n_jobs (int, optional):
                Number of parallel executions. If -1 use all available cores. By default 1.

            verbose (int, optional):
                Controls verbosity of the output:

                - 0 - neither prints nor warnings are shown
                - 1 - 50 - only most important warnings
                - 51 - 100 - shows other warnings and prints
                - above 100 - presents all prints and all warnings (including SHAP warnings).

            random_state (int, optional):
                Random state used to seed NumPy and the train/test split in fit(). If it is None, the results will
                not be reproducible. For reproducible results set it to an integer.
        """  # noqa
        self.clf = clf
        self.test_prc = test_prc
        self.n_jobs = n_jobs
        self.random_state = random_state
        self.verbose = verbose
        self.scorer = get_single_scorer(scoring)

    def _init_output_variables(self):
        """
        Initializes variables that will be filled in during fit() method, and are used as output.
        """
        self.X_train = None
        self.X_test = None
        self.y_train = None
        self.y_test = None
        self.train_score = None
        self.test_score = None
        self.report = None

    def fit(self, X1, X2, column_names=None, class_names=None):
        """
        Base fit functionality that should be executed before each fit.

        Args:
            X1 (np.ndarray or pd.DataFrame):
                First sample to be compared. It needs to have the same number of columns as X2.

            X2 (np.ndarray or pd.DataFrame):
                Second sample to be compared. It needs to have the same number of columns as X1.

            column_names (list of str, optional):
                List of feature names of the provided samples. If provided it will be used to overwrite the existing
                feature names. If not provided the existing feature names are used or default feature names are
                generated.

            class_names (None, or list of str, optional):
                List of class names assigned, in this case provided samples e.g. ['sample1', 'sample2']. If none, the
                default ['First Sample', 'Second Sample'] are used.

        Returns:
            (BaseResemblanceModel):
                Fitted object
        """
        # Set seed for results reproducibility
        if self.random_state is not None:
            np.random.seed(self.random_state)

        # Set class names
        self.class_names = class_names
        if self.class_names is None:
            self.class_names = ["First Sample", "Second Sample"]

        # Ensure inputs are correct
        self.X1, self.column_names = preprocess_data(X1, X_name="X1", column_names=column_names, verbose=self.verbose)
        self.X2, _ = preprocess_data(X2, X_name="X2", column_names=column_names, verbose=self.verbose)

        # Prepare dataset for modelling
        self.X = pd.DataFrame(pd.concat([self.X1, self.X2], axis=0), columns=self.column_names).reset_index(drop=True)

        self.y = pd.Series(
            np.concatenate(
                [
                    np.zeros(self.X1.shape[0]),
                    np.ones(self.X2.shape[0]),
                ]
            )
        ).reset_index(drop=True)

        # Assure the type and number of classes for the variable
        self.X, _ = preprocess_data(self.X, X_name="X", column_names=self.column_names, verbose=self.verbose)

        self.y = preprocess_labels(self.y, y_name="y", index=self.X.index, verbose=self.verbose)

        # Reinitialize variables in case of multiple times being fit
        self._init_output_variables()

        self.X_train, self.X_test, self.y_train, self.y_test = train_test_split(
            self.X,
            self.y,
            test_size=self.test_prc,
            random_state=self.random_state,
            shuffle=True,
            stratify=self.y,
        )
        self.clf.fit(self.X_train, self.y_train)

        self.train_score = np.round(self.scorer.score(self.clf, self.X_train, self.y_train), 3)
        self.test_score = np.round(self.scorer.score(self.clf, self.X_test, self.y_test), 3)

        self.results_text = (
            f"Train {self.scorer.metric_name}: {np.round(self.train_score, 3)},\n"
            f"Test {self.scorer.metric_name}: {np.round(self.test_score, 3)}."
        )
        if self.verbose > 50:
            print(f"Finished model training: \n{self.results_text}")

        if self.verbose > 0:
            if self.train_score > self.test_score:
                warnings.warn(
                    f"Train {self.scorer.metric_name} > Test {self.scorer.metric_name}, which might indicate "
                    f"an overfit. \n Strong overfit might lead to misleading conclusions when analysing "
                    f"feature importance. Consider retraining with more regularization applied to the model."
                )
        self.fitted = True
        return self

    def get_data_splits(self):
        """
        Returns the data splits used to train the Resemblance model.

        Returns:
            (pd.DataFrame, pd.DataFrame, pd.Series, pd.Series):
                X_train, X_test, y_train, y_test.
        """
        self._check_if_fitted()
        return self.X_train, self.X_test, self.y_train, self.y_test

    def compute(self, return_scores=False):
        """
        Checks if fit() method has been run and computes the output variables.

        Args:
            return_scores (bool, optional):
                Flag indicating whether the method should return a tuple (feature importances, train score,
                test score), or feature importances. By default the second option is selected.

        Returns:
            (tuple(pd.DataFrame, float, float) or pd.DataFrame):
                Depending on the value of return_scores, either returns a tuple (feature importances, train score,
                test score), or only the feature importances.
        """
        self._check_if_fitted()

        if return_scores:
            return self.report, self.train_score, self.test_score
        else:
            return self.report

    def fit_compute(
        self,
        X1,
        X2,
        column_names=None,
        class_names=None,
        return_scores=False,
        **fit_kwargs,
    ):
        """
        Fits the resemblance model and computes the report regarding feature importance.

        Args:
            X1 (np.ndarray or pd.DataFrame):
                First sample to be compared. It needs to have the same number of columns as X2.

            X2 (np.ndarray or pd.DataFrame):
                Second sample to be compared. It needs to have the same number of columns as X1.

            column_names (list of str, optional):
                List of feature names of the provided samples. If provided it will be used to overwrite the existing
                feature names. If not provided the existing feature names are used or default feature names are
                generated.

            class_names (None, or list of str, optional):
                List of class names assigned, in this case provided samples e.g. ['sample1', 'sample2']. If none, the
                default ['First Sample', 'Second Sample'] are used.

            return_scores (bool, optional):
                Flag indicating whether the method should return a tuple (feature importances, train score,
                test score), or feature importances. By default the second option is selected.

            **fit_kwargs:
                In case any other arguments are accepted by fit() method, they can be passed as keyword arguments.

        Returns:
            (tuple of (pd.DataFrame, float, float) or pd.DataFrame):
                Depending on the value of return_scores, either returns a tuple (feature importances, train score,
                test score), or only the feature importances.
        """
        self.fit(X1, X2, column_names=column_names, class_names=class_names, **fit_kwargs)
        return self.compute(return_scores=return_scores)

    def plot(self):
        """
        Plot.
        """
        raise NotImplementedError("Plot method has not been implemented.")

__init__(clf, scoring='roc_auc', test_prc=0.25, n_jobs=1, verbose=0, random_state=None)

Initializes the class.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| clf | model object | Binary classification model or pipeline. | required |
| scoring | string or Scorer | Metric for which the model performance is calculated. It can be either a metric name aligned with predefined classification scorers names in sklearn, or a probatus.utils.Scorer that defines a custom metric. The recommended option for this class is 'roc_auc'. | 'roc_auc' |
| test_prc | float | Fraction of the data (between 0 and 1) used to test the model. | 0.25 |
| n_jobs | int | Number of parallel executions. If -1, use all available cores. | 1 |
| verbose | int | Controls verbosity of the output: 0 shows neither prints nor warnings; 1 to 50 shows only the most important warnings; 51 to 100 shows other warnings and prints; above 100 shows all prints and all warnings (including SHAP warnings). | 0 |
| random_state | int | Random state used to seed NumPy and the train/test split in fit(). If it is None, the results will not be reproducible. For reproducible results set it to an integer. | None |

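The scoring parameter accepts scikit-learn scorer names. The sketch below shows what the name 'roc_auc' resolves to in scikit-learn itself, using `sklearn.metrics.get_scorer`. How probatus wraps this internally (`get_single_scorer`) is not shown on this page, so treat the example as an illustration of the naming convention only; the data and the LogisticRegression model are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import get_scorer, roc_auc_score

# Illustrative binary classification data and model.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression().fit(X, y)

# Look up the predefined scorer by name; it is called as scorer(estimator, X, y).
scorer = get_scorer("roc_auc")
score_from_scorer = scorer(clf, X, y)

# The same metric computed by hand from the model's continuous output.
score_manual = roc_auc_score(y, clf.decision_function(X))

print(f"scorer: {score_from_scorer:.3f}, manual: {score_manual:.3f}")
```

Any other name from scikit-learn's list of predefined classification scorers can be looked up the same way.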
Source code in probatus/sample_similarity/resemblance_model.py
def __init__(
    self,
    clf,
    scoring="roc_auc",
    test_prc=0.25,
    n_jobs=1,
    verbose=0,
    random_state=None,
):
    """
    Initializes the class.

    Args:
        clf (model object):
            Binary classification model or pipeline.

        scoring (string or probatus.utils.Scorer, optional):
            Metric for which the model performance is calculated. It can be either a metric name aligned with
            predefined
            [classification scorers names in sklearn](https://scikit-learn.org/stable/modules/model_evaluation.html).
            Another option is using probatus.utils.Scorer to define a custom metric. The recommended option for this
            class is 'roc_auc'.

        test_prc (float, optional):
            Fraction of the data (between 0 and 1) used to test the model. By default 0.25 is set.

        n_jobs (int, optional):
            Number of parallel executions. If -1 use all available cores. By default 1.

        verbose (int, optional):
            Controls verbosity of the output:

            - 0 - neither prints nor warnings are shown
            - 1 - 50 - only most important warnings
            - 51 - 100 - shows other warnings and prints
            - above 100 - presents all prints and all warnings (including SHAP warnings).

        random_state (int, optional):
            Random state used to seed NumPy and the train/test split in fit(). If it is None, the results will
            not be reproducible. For reproducible results set it to an integer.
    """  # noqa
    self.clf = clf
    self.test_prc = test_prc
    self.n_jobs = n_jobs
    self.random_state = random_state
    self.verbose = verbose
    self.scorer = get_single_scorer(scoring)

compute(return_scores=False)

Checks if fit() method has been run and computes the output variables.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| return_scores | bool | If True, the method returns a tuple (feature importances, train score, test score). If False, it returns only the feature importances. | False |

Returns:

| Type | Description |
| --- | --- |
| tuple(DataFrame, float, float) or DataFrame | Depending on the value of return_scores, either a tuple (feature importances, train score, test score) or only the feature importances. |

Source code in probatus/sample_similarity/resemblance_model.py
def compute(self, return_scores=False):
    """
    Checks if fit() method has been run and computes the output variables.

    Args:
        return_scores (bool, optional):
            Flag indicating whether the method should return a tuple (feature importances, train score,
            test score), or feature importances. By default the second option is selected.

    Returns:
        (tuple(pd.DataFrame, float, float) or pd.DataFrame):
            Depending on the value of return_scores, either returns a tuple (feature importances, train score,
            test score), or only the feature importances.
    """
    self._check_if_fitted()

    if return_scores:
        return self.report, self.train_score, self.test_score
    else:
        return self.report

fit(X1, X2, column_names=None, class_names=None)

Base fit functionality that should be executed before each fit.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| X1 | ndarray or DataFrame | First sample to be compared. It needs to have the same number of columns as X2. | required |
| X2 | ndarray or DataFrame | Second sample to be compared. It needs to have the same number of columns as X1. | required |
| column_names | list of str | List of feature names of the provided samples. If provided, it overwrites the existing feature names. If not provided, the existing feature names are used or default feature names are generated. | None |
| class_names | None, or list of str | List of class names assigned to the provided samples, e.g. ['sample1', 'sample2']. If None, the default ['First Sample', 'Second Sample'] is used. | None |

Returns:

| Type | Description |
| --- | --- |
| BaseResemblanceModel | Fitted object. |
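The data preparation performed by fit() can be sketched as follows: stack the two samples, label rows by origin, and take a stratified split so that both parts keep the same ratio of X1 to X2 rows. The sketch mirrors the steps in the source listing but leaves out the probatus preprocessing helpers; the toy frames and their sizes (12 and 4 rows) are assumptions chosen to make the stratification visible.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy samples of different sizes: 12 rows in X1 and 4 rows in X2.
X1 = pd.DataFrame({"f1": np.arange(12.0), "f2": np.arange(12.0) * 0.1})
X2 = pd.DataFrame({"f1": np.arange(4.0) + 100, "f2": np.arange(4.0) * 0.5})

# Stack the samples and label rows by origin: 0 for X1, 1 for X2.
X = pd.concat([X1, X2], axis=0).reset_index(drop=True)
y = pd.Series(np.concatenate([np.zeros(len(X1)), np.ones(len(X2))]))

# Stratifying on y keeps the 3:1 ratio of X1 to X2 rows in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, shuffle=True, stratify=y
)

print(y_train.value_counts().to_dict())
print(y_test.value_counts().to_dict())
```

Without stratification, a small sample could end up almost entirely in one part of the split, which would make the test score unreliable.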

Source code in probatus/sample_similarity/resemblance_model.py
def fit(self, X1, X2, column_names=None, class_names=None):
    """
    Base fit functionality that should be executed before each fit.

    Args:
        X1 (np.ndarray or pd.DataFrame):
            First sample to be compared. It needs to have the same number of columns as X2.

        X2 (np.ndarray or pd.DataFrame):
            Second sample to be compared. It needs to have the same number of columns as X1.

        column_names (list of str, optional):
            List of feature names of the provided samples. If provided it will be used to overwrite the existing
            feature names. If not provided the existing feature names are used or default feature names are
            generated.

        class_names (None, or list of str, optional):
            List of class names assigned, in this case provided samples e.g. ['sample1', 'sample2']. If none, the
            default ['First Sample', 'Second Sample'] are used.

    Returns:
        (BaseResemblanceModel):
            Fitted object
    """
    # Set seed for results reproducibility
    if self.random_state is not None:
        np.random.seed(self.random_state)

    # Set class names
    self.class_names = class_names
    if self.class_names is None:
        self.class_names = ["First Sample", "Second Sample"]

    # Ensure inputs are correct
    self.X1, self.column_names = preprocess_data(X1, X_name="X1", column_names=column_names, verbose=self.verbose)
    self.X2, _ = preprocess_data(X2, X_name="X2", column_names=column_names, verbose=self.verbose)

    # Prepare dataset for modelling
    self.X = pd.DataFrame(pd.concat([self.X1, self.X2], axis=0), columns=self.column_names).reset_index(drop=True)

    self.y = pd.Series(
        np.concatenate(
            [
                np.zeros(self.X1.shape[0]),
                np.ones(self.X2.shape[0]),
            ]
        )
    ).reset_index(drop=True)

    # Assure the type and number of classes for the variable
    self.X, _ = preprocess_data(self.X, X_name="X", column_names=self.column_names, verbose=self.verbose)

    self.y = preprocess_labels(self.y, y_name="y", index=self.X.index, verbose=self.verbose)

    # Reinitialize variables in case of multiple times being fit
    self._init_output_variables()

    self.X_train, self.X_test, self.y_train, self.y_test = train_test_split(
        self.X,
        self.y,
        test_size=self.test_prc,
        random_state=self.random_state,
        shuffle=True,
        stratify=self.y,
    )
    self.clf.fit(self.X_train, self.y_train)

    self.train_score = np.round(self.scorer.score(self.clf, self.X_train, self.y_train), 3)
    self.test_score = np.round(self.scorer.score(self.clf, self.X_test, self.y_test), 3)

    self.results_text = (
        f"Train {self.scorer.metric_name}: {np.round(self.train_score, 3)},\n"
        f"Test {self.scorer.metric_name}: {np.round(self.test_score, 3)}."
    )
    if self.verbose > 50:
        print(f"Finished model training: \n{self.results_text}")

    if self.verbose > 0:
        if self.train_score > self.test_score:
            warnings.warn(
                f"Train {self.scorer.metric_name} > Test {self.scorer.metric_name}, which might indicate "
                f"an overfit. \n Strong overfit might lead to misleading conclusions when analysing "
                f"feature importance. Consider retraining with more regularization applied to the model."
            )
    self.fitted = True
    return self

fit_compute(X1, X2, column_names=None, class_names=None, return_scores=False, **fit_kwargs)

Fits the resemblance model and computes the report regarding feature importance.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| X1 | ndarray or DataFrame | First sample to be compared. It needs to have the same number of columns as X2. | required |
| X2 | ndarray or DataFrame | Second sample to be compared. It needs to have the same number of columns as X1. | required |
| column_names | list of str | List of feature names of the provided samples. If provided, it overwrites the existing feature names. If not provided, the existing feature names are used or default feature names are generated. | None |
| class_names | None, or list of str | List of class names assigned to the provided samples, e.g. ['sample1', 'sample2']. If None, the default ['First Sample', 'Second Sample'] is used. | None |
| return_scores | bool | If True, the method returns a tuple (feature importances, train score, test score). If False, it returns only the feature importances. | False |
| **fit_kwargs | | Any other arguments accepted by the fit() method can be passed as keyword arguments. | {} |

Returns:

| Type | Description |
| --- | --- |
| tuple of (pd.DataFrame, float, float) or pd.DataFrame | Depending on the value of return_scores, either a tuple (feature importances, train score, test score) or only the feature importances. |

Source code in probatus/sample_similarity/resemblance_model.py
def fit_compute(
    self,
    X1,
    X2,
    column_names=None,
    class_names=None,
    return_scores=False,
    **fit_kwargs,
):
    """
    Fits the resemblance model and computes the report regarding feature importance.

    Args:
        X1 (np.ndarray or pd.DataFrame):
            First sample to be compared. It needs to have the same number of columns as X2.

        X2 (np.ndarray or pd.DataFrame):
            Second sample to be compared. It needs to have the same number of columns as X1.

        column_names (list of str, optional):
            List of feature names of the provided samples. If provided it will be used to overwrite the existing
            feature names. If not provided the existing feature names are used or default feature names are
            generated.

        class_names (None, or list of str, optional):
            List of class names assigned, in this case provided samples e.g. ['sample1', 'sample2']. If none, the
            default ['First Sample', 'Second Sample'] are used.

        return_scores (bool, optional):
            Flag indicating whether the method should return a tuple (feature importances, train score,
            test score), or feature importances. By default the second option is selected.

        **fit_kwargs:
            In case any other arguments are accepted by fit() method, they can be passed as keyword arguments.

    Returns:
        (tuple of (pd.DataFrame, float, float) or pd.DataFrame):
            Depending on the value of return_scores, either returns a tuple (feature importances, train score,
            test score), or only the feature importances.
    """
    self.fit(X1, X2, column_names=column_names, class_names=class_names, **fit_kwargs)
    return self.compute(return_scores=return_scores)

get_data_splits()

Returns the data splits used to train the Resemblance model.

Returns:

| Type | Description |
| --- | --- |
| (DataFrame, DataFrame, Series, Series) | X_train, X_test, y_train, y_test. |

Source code in probatus/sample_similarity/resemblance_model.py
def get_data_splits(self):
    """
    Returns the data splits used to train the Resemblance model.

    Returns:
        (pd.DataFrame, pd.DataFrame, pd.Series, pd.Series):
            X_train, X_test, y_train, y_test.
    """
    self._check_if_fitted()
    return self.X_train, self.X_test, self.y_train, self.y_test

plot()

Plot.

Source code in probatus/sample_similarity/resemblance_model.py
def plot(self):
    """
    Plot.
    """
    raise NotImplementedError("Plot method has not been implemented.")

PermutationImportanceResemblance

Bases: BaseResemblanceModel

This model checks the similarity of two samples.

A possible use case is analysis of whether the train sample differs from the test sample, due to e.g. non-stationarity.

It assigns labels to each sample, 0 to the first sample, 1 to the second. Then, it randomly selects a portion of data to train on. The resulting model tries to distinguish which sample a given test row comes from. This provides insights on how distinguishable these samples are and which features contribute to that. The feature importance is calculated using permutation importance.

If the model achieves a test AUC significantly different than 0.5, it indicates that it is possible to distinguish between the samples, and therefore, the samples differ. Features with a high permutation importance contribute to that effect the most. Thus, their distribution might differ between two samples.
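One way to judge whether a test AUC differs "significantly" from 0.5 is a label permutation test: refit the model on shuffled sample labels and compare the real score with the scores obtained by chance. The sketch below uses scikit-learn's `permutation_test_score` for this. It is a suggested check under assumptions (synthetic data, 3-fold cross-validation, 30 permutations), not something PermutationImportanceResemblance does for you.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import permutation_test_score

rng = np.random.default_rng(0)

# Illustrative samples: X2 is shifted in the first feature.
X1 = rng.normal(size=(200, 3))
X2 = rng.normal(size=(200, 3))
X2[:, 0] += 1.0

X = np.vstack([X1, X2])
y = np.concatenate([np.zeros(len(X1)), np.ones(len(X2))])

clf = RandomForestClassifier(n_estimators=30, max_depth=2, random_state=0)

# score: cross-validated AUC on the true labels.
# perm_scores: AUCs obtained after shuffling the labels (chance level).
# p_value: share of shuffled runs that score at least as well as the true run.
score, perm_scores, p_value = permutation_test_score(
    clf, X, y, scoring="roc_auc", cv=3, n_permutations=30, random_state=0
)

print(f"AUC: {score:.3f}, chance mean: {perm_scores.mean():.3f}, p-value: {p_value:.3f}")
```

A small p-value indicates that the observed AUC is unlikely to arise when the two samples are interchangeable.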

Examples:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from probatus.sample_similarity import PermutationImportanceResemblance
X1, _ = make_classification(n_samples=100, n_features=5)
X2, _ = make_classification(n_samples=100, n_features=5, shift=0.5)
clf = RandomForestClassifier(max_depth=2)
perm = PermutationImportanceResemblance(clf)
feature_importance = perm.fit_compute(X1, X2)
perm.plot()

Source code in probatus/sample_similarity/resemblance_model.py
class PermutationImportanceResemblance(BaseResemblanceModel):
    """
    This model checks the similarity of two samples.

    A possible use case is analysis of whether the train sample differs
    from the test sample, due to e.g. non-stationarity.

    It assigns labels to each sample, 0 to the first sample, 1 to the second. Then, it randomly selects a portion of
    data to train on. The resulting model tries to distinguish which sample a given test row comes from. This
    provides insights on how distinguishable these samples are and which features contribute to that. The feature
    importance is calculated using permutation importance.

    If the model achieves a test AUC significantly different than 0.5, it indicates that it is possible to distinguish
    between the samples, and therefore, the samples differ.
    Features with a high permutation importance contribute to that effect the most.
    Thus, their distribution might differ between two samples.

    Examples:
    ```python
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from probatus.sample_similarity import PermutationImportanceResemblance
    X1, _ = make_classification(n_samples=100, n_features=5)
    X2, _ = make_classification(n_samples=100, n_features=5, shift=0.5)
    clf = RandomForestClassifier(max_depth=2)
    perm = PermutationImportanceResemblance(clf)
    feature_importance = perm.fit_compute(X1, X2)
    perm.plot()
    ```
    <img src="../img/sample_similarity_permutation_importance.png" width="500" />
    """

    def __init__(
        self,
        clf,
        iterations=100,
        scoring="roc_auc",
        test_prc=0.25,
        n_jobs=1,
        verbose=0,
        random_state=None,
    ):
        """
        Initializes the class.

        Args:
            clf (model object):
                Binary classification model or pipeline.

            iterations (int, optional):
                Number of iterations performed to calculate permutation importance. By default 100 iterations per
                feature are done.

            scoring (string or probatus.utils.Scorer, optional):
                Metric for which the model performance is calculated. It can be either a metric name aligned with
                predefined
                [classification scorers names in sklearn](https://scikit-learn.org/stable/modules/model_evaluation.html).
                Another option is using probatus.utils.Scorer to define a custom metric. Recommended option for this
                class is 'roc_auc'.

            test_prc (float, optional):
                Fraction of the data (between 0 and 1) used to test the model. By default 0.25 is set.

            n_jobs (int, optional):
                Number of parallel executions. If -1 use all available cores. By default 1.

            verbose (int, optional):
                Controls verbosity of the output:

                - 0 - neither prints nor warnings are shown
                - 1 - 50 - only most important warnings
                - 51 - 100 - shows other warnings and prints
                - above 100 - presents all prints and all warnings (including SHAP warnings).

            random_state (int, optional):
                Random state used when randomly splitting the data into train and test parts. If it is None, the
                results will not be reproducible. For reproducible results set it to an integer.
        """  # noqa
        super().__init__(
            clf=clf,
            scoring=scoring,
            test_prc=test_prc,
            n_jobs=n_jobs,
            verbose=verbose,
            random_state=random_state,
        )

        self.iterations = iterations

        self.iterations_columns = ["feature", "importance"]
        self.iterations_results = pd.DataFrame(columns=self.iterations_columns)

        self.plot_x_label = "Permutation Feature Importance"
        self.plot_y_label = "Feature Name"
        self.plot_title = "Permutation Feature Importance of Resemblance Model"

    def fit(self, X1, X2, column_names=None, class_names=None):
        """
        This function assigns labels to each sample, 0 to the first sample, 1 to the second.

        Then, it randomly selects a
            portion of data to train on. The resulting model tries to distinguish which sample a given test row
            comes from. This provides insights on how distinguishable these samples are and which features contribute to
            that. The feature importance is calculated using permutation importance.

        Args:
            X1 (np.ndarray or pd.DataFrame):
                First sample to be compared. It needs to have the same number of columns as X2.

            X2 (np.ndarray or pd.DataFrame):
                Second sample to be compared. It needs to have the same number of columns as X1.

            column_names (list of str, optional):
                List of feature names of the provided samples. If provided it will be used to overwrite the existing
                feature names. If not provided the existing feature names are used or default feature names are
                generated.

            class_names (None, or list of str, optional):
                List of class names assigned, in this case provided samples e.g. ['sample1', 'sample2']. If none, the
                default ['First Sample', 'Second Sample'] are used.

        Returns:
            (PermutationImportanceResemblance):
                Fitted object.
        """
        super().fit(X1=X1, X2=X2, column_names=column_names, class_names=class_names)

        permutation_result = permutation_importance(
            self.clf,
            self.X_test,
            self.y_test,
            scoring=self.scorer.scorer,
            n_repeats=self.iterations,
            n_jobs=self.n_jobs,
        )

        # Prepare report
        self.report_columns = ["mean_importance", "std_importance"]
        self.report = pd.DataFrame(index=self.column_names, columns=self.report_columns, dtype=float)

        for feature_index, feature_name in enumerate(self.column_names):
            # Fill in the report
            self.report.loc[feature_name, "mean_importance"] = permutation_result["importances_mean"][feature_index]
            self.report.loc[feature_name, "std_importance"] = permutation_result["importances_std"][feature_index]

            # Fill in the iterations
            current_iterations = pd.DataFrame(
                np.stack(
                    [
                        np.repeat(feature_name, self.iterations),
                        permutation_result["importances"][feature_index, :].reshape((self.iterations,)),
                    ],
                    axis=1,
                ),
                columns=self.iterations_columns,
            )

            self.iterations_results = pd.concat([self.iterations_results, current_iterations])

        self.iterations_results["importance"] = self.iterations_results["importance"].astype(float)

        # Sort features by mean importance, highest first
        self.report.sort_values(by="mean_importance", ascending=False, inplace=True)

        return self

    def plot(self, ax=None, top_n=None, show=True, **plot_kwargs):
        """
        Plots the resulting AUC of the model as well as the feature importances.

        Args:
            ax (matplotlib.axes, optional):
                Axes to which the output should be plotted. If not provided new axes are created.

            top_n (int, optional):
                Number of the most important features to be plotted. By default all features are included in the plot.

            show (bool, optional):
                If True, the plots are shown to the user, otherwise they are not shown. Not showing a plot can be useful
                when you want to edit the returned axis before showing it.

            **plot_kwargs:
                Keyword arguments passed to the matplotlib.pyplot.subplots method.

        Returns:
            (matplotlib.axes):
                Axes that include the plot.
        """

        feature_report = self.compute()
        self.iterations_results["importance"] = self.iterations_results["importance"].astype(float)

        sorted_features = feature_report["mean_importance"].sort_values(ascending=True).index.values
        if top_n is not None and top_n > 0:
            sorted_features = sorted_features[-top_n:]

        if ax is None:
            fig, ax = plt.subplots(**plot_kwargs)

        for position, feature in enumerate(sorted_features):
            ax.boxplot(
                self.iterations_results[self.iterations_results["feature"] == feature]["importance"],
                positions=[position],
                vert=False,
            )

        ax.set_yticks(range(position + 1))
        ax.set_yticklabels(sorted_features)
        ax.set_xlabel(self.plot_x_label)
        ax.set_ylabel(self.plot_y_label)
        ax.set_title(self.plot_title)

        ax.annotate(
            self.results_text,
            (0, 0),
            (0, -50),
            fontsize=12,
            xycoords="axes fraction",
            textcoords="offset points",
            va="top",
        )

        if show:
            plt.show()
        else:
            plt.close()

        return ax

__init__(clf, iterations=100, scoring='roc_auc', test_prc=0.25, n_jobs=1, verbose=0, random_state=None)

Initializes the class.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| clf | model object | Binary classification model or pipeline. | required |
| iterations | int | Number of iterations performed to calculate permutation importance. By default 100 iterations per feature are done. | 100 |
| scoring | string or Scorer | Metric for which the model performance is calculated. It can be either a metric name aligned with predefined classification scorers names in sklearn. Another option is using probatus.utils.Scorer to define a custom metric. Recommended option for this class is 'roc_auc'. | 'roc_auc' |
| test_prc | float | Percentage of data used to test the model. By default 0.25 is set. | 0.25 |
| n_jobs | int | Number of parallel executions. If -1 use all available cores. By default 1. | 1 |
| verbose | int | Controls verbosity of the output: 0 - neither prints nor warnings are shown; 1 - 50 - only most important warnings; 51 - 100 - shows other warnings and prints; above 100 - presents all prints and all warnings (including SHAP warnings). | 0 |
| random_state | int | Random state used by the resemblance model. If it is None, the results will not be reproducible. For reproducible results set it to an integer. | None |

Source code in probatus/sample_similarity/resemblance_model.py
def __init__(
    self,
    clf,
    iterations=100,
    scoring="roc_auc",
    test_prc=0.25,
    n_jobs=1,
    verbose=0,
    random_state=None,
):
    """
    Initializes the class.

    Args:
        clf (model object):
            Binary classification model or pipeline.

        iterations (int, optional):
            Number of iterations performed to calculate permutation importance. By default 100 iterations per
            feature are done.

        scoring (string or probatus.utils.Scorer, optional):
            Metric for which the model performance is calculated. It can be either a metric name aligned with
            predefined
            [classification scorers names in sklearn](https://scikit-learn.org/stable/modules/model_evaluation.html).
            Another option is using probatus.utils.Scorer to define a custom metric. Recommended option for this
            class is 'roc_auc'.

        test_prc (float, optional):
            Percentage of data used to test the model. By default 0.25 is set.

        n_jobs (int, optional):
            Number of parallel executions. If -1 use all available cores. By default 1.

        verbose (int, optional):
            Controls verbosity of the output:

            - 0 - neither prints nor warnings are shown
            - 1 - 50 - only most important warnings
            - 51 - 100 - shows other warnings and prints
            - above 100 - presents all prints and all warnings (including SHAP warnings).

        random_state (int, optional):
            Random state used by the resemblance model. If it is None, the results will not be reproducible. For
            reproducible results set it to an integer.
    """  # noqa
    super().__init__(
        clf=clf,
        scoring=scoring,
        test_prc=test_prc,
        n_jobs=n_jobs,
        verbose=verbose,
        random_state=random_state,
    )

    self.iterations = iterations

    self.iterations_columns = ["feature", "importance"]
    self.iterations_results = pd.DataFrame(columns=self.iterations_columns)

    self.plot_x_label = "Permutation Feature Importance"
    self.plot_y_label = "Feature Name"
    self.plot_title = "Permutation Feature Importance of Resemblance Model"

fit(X1, X2, column_names=None, class_names=None)

This function assigns labels to each sample, 0 to the first sample, 1 to the second.

Then, it randomly selects a portion of data to train on. The resulting model tries to distinguish which sample a given test row comes from. This provides insights on how distinguishable these samples are and which features contribute to that. The feature importance is calculated using permutation importance.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| X1 | ndarray or DataFrame | First sample to be compared. It needs to have the same number of columns as X2. | required |
| X2 | ndarray or DataFrame | Second sample to be compared. It needs to have the same number of columns as X1. | required |
| column_names | list of str | List of feature names of the provided samples. If provided it will be used to overwrite the existing feature names. If not provided the existing feature names are used or default feature names are generated. | None |
| class_names | None, or list of str | List of class names assigned, in this case provided samples e.g. ['sample1', 'sample2']. If none, the default ['First Sample', 'Second Sample'] are used. | None |

Returns:

| Type | Description |
| --- | --- |
| PermutationImportanceResemblance | Fitted object. |
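
The steps described for `fit` can be sketched with scikit-learn alone. This is a minimal illustration rather than the library's implementation: the synthetic data, the feature names `f0` to `f2`, the stratified split and the fixed seeds are all assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Two hypothetical samples with 3 features; only feature 0 is shifted in X2.
X1 = rng.normal(size=(300, 3))
X2 = rng.normal(size=(300, 3))
X2[:, 0] += 2.0

# Label rows by origin: 0 for the first sample, 1 for the second.
X = np.vstack([X1, X2])
y = np.concatenate([np.zeros(len(X1), dtype=int), np.ones(len(X2), dtype=int)])

# Hold out 25% of the rows, mirroring the default test_prc=0.25.
# Stratifying the split is an assumption of this sketch.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

clf = RandomForestClassifier(max_depth=2, random_state=0).fit(X_train, y_train)
test_auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])

# Permute each feature n_repeats times on the test set and record the AUC drop.
result = permutation_importance(
    clf, X_test, y_test, scoring="roc_auc", n_repeats=20, random_state=0
)

print(f"test AUC: {test_auc:.3f}")
for name, mean, std in zip(
    ["f0", "f1", "f2"], result["importances_mean"], result["importances_std"]
):
    print(f"{name}: {mean:.3f} +/- {std:.3f}")
```

`result["importances"]` holds one row per feature and one column per repeat, which is the matrix that `fit` unpacks into `iterations_results`, while the per-feature mean and standard deviation correspond to the `mean_importance` and `std_importance` columns of the report.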

Source code in probatus/sample_similarity/resemblance_model.py
def fit(self, X1, X2, column_names=None, class_names=None):
    """
    This function assigns labels to each sample, 0 to the first sample, 1 to the second.

    Then, it randomly selects a
        portion of data to train on. The resulting model tries to distinguish which sample a given test row
        comes from. This provides insights on how distinguishable these samples are and which features contribute to
        that. The feature importance is calculated using permutation importance.

    Args:
        X1 (np.ndarray or pd.DataFrame):
            First sample to be compared. It needs to have the same number of columns as X2.

        X2 (np.ndarray or pd.DataFrame):
            Second sample to be compared. It needs to have the same number of columns as X1.

        column_names (list of str, optional):
            List of feature names of the provided samples. If provided it will be used to overwrite the existing
            feature names. If not provided the existing feature names are used or default feature names are
            generated.

        class_names (None, or list of str, optional):
            List of class names assigned, in this case provided samples e.g. ['sample1', 'sample2']. If none, the
            default ['First Sample', 'Second Sample'] are used.

    Returns:
        (PermutationImportanceResemblance):
            Fitted object.
    """
    super().fit(X1=X1, X2=X2, column_names=column_names, class_names=class_names)

    permutation_result = permutation_importance(
        self.clf,
        self.X_test,
        self.y_test,
        scoring=self.scorer.scorer,
        n_repeats=self.iterations,
        n_jobs=self.n_jobs,
    )

    # Prepare report
    self.report_columns = ["mean_importance", "std_importance"]
    self.report = pd.DataFrame(index=self.column_names, columns=self.report_columns, dtype=float)

    for feature_index, feature_name in enumerate(self.column_names):
        # Fill in the report
        self.report.loc[feature_name, "mean_importance"] = permutation_result["importances_mean"][feature_index]
        self.report.loc[feature_name, "std_importance"] = permutation_result["importances_std"][feature_index]

        # Fill in the iterations
        current_iterations = pd.DataFrame(
            np.stack(
                [
                    np.repeat(feature_name, self.iterations),
                    permutation_result["importances"][feature_index, :].reshape((self.iterations,)),
                ],
                axis=1,
            ),
            columns=self.iterations_columns,
        )

        self.iterations_results = pd.concat([self.iterations_results, current_iterations])

    self.iterations_results["importance"] = self.iterations_results["importance"].astype(float)

    # Sort features by mean importance, highest first
    self.report.sort_values(by="mean_importance", ascending=False, inplace=True)

    return self

plot(ax=None, top_n=None, show=True, **plot_kwargs)

Plots the resulting AUC of the model as well as the feature importances.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| ax | axes | Axes to which the output should be plotted. If not provided new axes are created. | None |
| top_n | int | Number of the most important features to be plotted. By default all features are included in the plot. | None |
| show | bool | If True, the plots are shown to the user, otherwise they are not shown. Not showing a plot can be useful when you want to edit the returned axis before showing it. | True |
| `**plot_kwargs` | | Keyword arguments passed to the matplotlib.pyplot.subplots method. | {} |

Returns:

| Type | Description |
| --- | --- |
| axes | Axes that include the plot. |
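
How `top_n` selects features can be shown without any plotting. The sketch below mirrors the sort-then-slice logic of `plot`, using a hypothetical dictionary of mean importances in place of the fitted report.

```python
# Hypothetical mean importances, standing in for the "mean_importance" column.
mean_importance = {"f1": 0.02, "f2": 0.31, "f3": 0.00, "f4": 0.12}

# Sort ascending so that the most important feature gets the highest y position
# in a horizontal box plot.
sorted_features = sorted(mean_importance, key=mean_importance.get)

top_n = 2
if top_n is not None and top_n > 0:
    # The last top_n entries of an ascending list are the most important ones.
    sorted_features = sorted_features[-top_n:]

# Each remaining feature is drawn at its index in the list.
positions = list(range(len(sorted_features)))
print(list(zip(positions, sorted_features)))  # [(0, 'f4'), (1, 'f2')]
```

Because the list is sorted in ascending order, slicing from the end keeps the strongest features and places the most important one at the top of the plot.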

Source code in probatus/sample_similarity/resemblance_model.py
def plot(self, ax=None, top_n=None, show=True, **plot_kwargs):
    """
    Plots the resulting AUC of the model as well as the feature importances.

    Args:
        ax (matplotlib.axes, optional):
            Axes to which the output should be plotted. If not provided new axes are created.

        top_n (int, optional):
            Number of the most important features to be plotted. By default all features are included in the plot.

        show (bool, optional):
            If True, the plots are shown to the user, otherwise they are not shown. Not showing a plot can be useful
            when you want to edit the returned axis before showing it.

        **plot_kwargs:
            Keyword arguments passed to the matplotlib.pyplot.subplots method.

    Returns:
        (matplotlib.axes):
            Axes that include the plot.
    """

    feature_report = self.compute()
    self.iterations_results["importance"] = self.iterations_results["importance"].astype(float)

    sorted_features = feature_report["mean_importance"].sort_values(ascending=True).index.values
    if top_n is not None and top_n > 0:
        sorted_features = sorted_features[-top_n:]

    if ax is None:
        fig, ax = plt.subplots(**plot_kwargs)

    for position, feature in enumerate(sorted_features):
        ax.boxplot(
            self.iterations_results[self.iterations_results["feature"] == feature]["importance"],
            positions=[position],
            vert=False,
        )

    ax.set_yticks(range(position + 1))
    ax.set_yticklabels(sorted_features)
    ax.set_xlabel(self.plot_x_label)
    ax.set_ylabel(self.plot_y_label)
    ax.set_title(self.plot_title)

    ax.annotate(
        self.results_text,
        (0, 0),
        (0, -50),
        fontsize=12,
        xycoords="axes fraction",
        textcoords="offset points",
        va="top",
    )

    if show:
        plt.show()
    else:
        plt.close()

    return ax

SHAPImportanceResemblance

Bases: BaseResemblanceModel

This model checks for similarity of two samples.

A possible use case is analysis of whether the train sample differs from the test sample, due to e.g. non-stationarity.

It assigns labels to each sample, 0 to the first sample, 1 to the second. Then, it randomly selects a portion of data to train on. The resulting model tries to distinguish which sample a given test row comes from. This provides insights on how distinguishable these samples are and which features contribute to that. The feature importance is calculated using SHAP feature importance.

If the model achieves test AUC significantly different from 0.5, it indicates that it is possible to distinguish between the samples, and therefore, the samples differ. Features with a high SHAP importance contribute to that effect the most. Thus, their distribution might differ between two samples.

This class currently works only with tree-based models.

Examples:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from probatus.sample_similarity import SHAPImportanceResemblance
X1, _ = make_classification(n_samples=100, n_features=5)
X2, _ = make_classification(n_samples=100, n_features=5, shift=0.5)
clf = RandomForestClassifier(max_depth=2)
rm = SHAPImportanceResemblance(clf)
feature_importance = rm.fit_compute(X1, X2)
rm.plot()
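
To make the AUC criterion concrete, the sketch below computes AUC directly from its pairwise definition: the probability that a randomly chosen row of the second sample receives a higher score than a randomly chosen row of the first. The helper `auc_from_scores` and the score lists are hypothetical and only illustrate the interpretation; the library relies on its scorer instead.

```python
def auc_from_scores(scores_first, scores_second):
    """Pairwise AUC: share of (first, second) pairs ranked correctly, ties count 0.5."""
    wins = 0.0
    for s2 in scores_second:
        for s1 in scores_first:
            if s2 > s1:
                wins += 1.0
            elif s2 == s1:
                wins += 0.5
    return wins / (len(scores_first) * len(scores_second))


# Hypothetical predicted probabilities of belonging to the second sample.
overlapping = auc_from_scores([0.4, 0.5, 0.6], [0.4, 0.5, 0.6])
separated = auc_from_scores([0.1, 0.2, 0.3], [0.7, 0.8, 0.9])

print(overlapping)  # 0.5: the model cannot tell the samples apart
print(separated)    # 1.0: the samples are perfectly distinguishable
```

An AUC close to 0.5 therefore suggests that the two samples resemble each other, while values far from 0.5 suggest a difference in distribution.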

Source code in probatus/sample_similarity/resemblance_model.py
class SHAPImportanceResemblance(BaseResemblanceModel):
    """
    This model checks for similarity of two samples.

    A possible use case is analysis of whether the train sample differs
        from the test sample, due to e.g. non-stationarity.

    It assigns labels to each sample, 0 to the first sample, 1 to the second. Then, it randomly selects a portion of
        data to train on. The resulting model tries to distinguish which sample a given test row comes from. This
        provides insights on how distinguishable these samples are and which features contribute to that. The feature
        importance is calculated using SHAP feature importance.

    If the model achieves test AUC significantly different from 0.5, it indicates that it is possible to distinguish
        between the samples, and therefore, the samples differ. Features with a high SHAP importance contribute
        to that effect the most. Thus, their distribution might differ between two samples.

    This class currently works only with tree-based models.

    Examples:
    ```python
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from probatus.sample_similarity import SHAPImportanceResemblance
    X1, _ = make_classification(n_samples=100, n_features=5)
    X2, _ = make_classification(n_samples=100, n_features=5, shift=0.5)
    clf = RandomForestClassifier(max_depth=2)
    rm = SHAPImportanceResemblance(clf)
    feature_importance = rm.fit_compute(X1, X2)
    rm.plot()
    ```

    <img src="../img/sample_similarity_shap_importance.png" width="320" />
    <img src="../img/sample_similarity_shap_summary.png" width="320" />
    """

    def __init__(
        self,
        clf,
        scoring="roc_auc",
        test_prc=0.25,
        n_jobs=1,
        verbose=0,
        random_state=None,
    ):
        """
        Initializes the class.

        Args:
            clf (model object):
                Binary classification model or pipeline.

            scoring (string or probatus.utils.Scorer, optional):
                Metric for which the model performance is calculated. It can be either a metric name aligned with
                predefined
                [classification scorers names in sklearn](https://scikit-learn.org/stable/modules/model_evaluation.html).
                Another option is using probatus.utils.Scorer to define a custom metric. Recommended option for this
                class is 'roc_auc'.

            test_prc (float, optional):
                Percentage of data used to test the model. By default 0.25 is set.

            n_jobs (int, optional):
                Number of parallel executions. If -1 use all available cores. By default 1.

            verbose (int, optional):
                Controls verbosity of the output:

                - 0 - neither prints nor warnings are shown
                - 1 - 50 - only most important warnings
                - 51 - 100 - shows other warnings and prints
                - above 100 - presents all prints and all warnings (including SHAP warnings).

            random_state (int, optional):
                Random state used by the resemblance model. If it is None, the results will not be reproducible. For
                reproducible results set it to an integer.
        """  # noqa
        super().__init__(
            clf=clf,
            scoring=scoring,
            test_prc=test_prc,
            n_jobs=n_jobs,
            verbose=verbose,
            random_state=random_state,
        )

        self.plot_title = "SHAP summary plot"

    def fit(self, X1, X2, column_names=None, class_names=None, **shap_kwargs):
        """
        This function assigns labels to each sample, 0 to the first sample, 1 to the second.

        Then, it randomly selects a
            portion of data to train on. The resulting model tries to distinguish which sample a given test row
            comes from. This provides insights on how distinguishable these samples are and which features contribute to
            that. The feature importance is calculated using SHAP feature importance.

        Args:
            X1 (np.ndarray or pd.DataFrame):
                First sample to be compared. It needs to have the same number of columns as X2.

            X2 (np.ndarray or pd.DataFrame):
                Second sample to be compared. It needs to have the same number of columns as X1.

            column_names (list of str, optional):
                List of feature names of the provided samples. If provided it will be used to overwrite the existing
                feature names. If not provided the existing feature names are used or default feature names are
                generated.

            class_names (None, or list of str, optional):
                List of class names assigned, in this case provided samples e.g. ['sample1', 'sample2']. If none, the
                default ['First Sample', 'Second Sample'] are used.

            **shap_kwargs:
                keyword arguments passed to
                [shap.Explainer](https://shap.readthedocs.io/en/latest/generated/shap.Explainer.html#shap.Explainer).
                It also enables `approximate` and `check_additivity` parameters, passed while calculating SHAP values.
                The `approximate=True` causes less accurate, but faster SHAP values calculation, while
                `check_additivity=False` disables the additivity check inside SHAP.

        Returns:
            (SHAPImportanceResemblance):
                Fitted object.
        """
        super().fit(X1=X1, X2=X2, column_names=column_names, class_names=class_names)

        self.shap_values_test = shap_calc(self.clf, self.X_test, verbose=self.verbose, **shap_kwargs)
        self.report = calculate_shap_importance(self.shap_values_test, self.column_names)
        return self

    def plot(self, plot_type="bar", show=True, **summary_plot_kwargs):
        """
        Plots the resulting AUC of the model as well as the feature importances.

        Args:
            plot_type (str, optional):
                Type of plot, used to compute shap.summary_plot. By default 'bar', available ones are "dot", "bar"
                and "violin".

            show (bool, optional):
                If True, the plots are shown to the user, otherwise they are not shown. Not showing a plot can be useful
                when you want to edit the returned axis before showing it.

            **summary_plot_kwargs:
                kwargs passed to the shap.summary_plot.

        Returns:
            (matplotlib.axes):
                Axes that include the plot.
        """

        # This line serves as a double check if the object has been fitted
        self._check_if_fitted()

        summary_plot(
            self.shap_values_test,
            self.X_test,
            plot_type=plot_type,
            class_names=self.class_names,
            show=False,
            **summary_plot_kwargs,
        )
        ax = plt.gca()
        ax.set_title(self.plot_title)

        ax.annotate(
            self.results_text,
            (0, 0),
            (0, -50),
            fontsize=12,
            xycoords="axes fraction",
            textcoords="offset points",
            va="top",
        )

        if show:
            plt.show()
        else:
            plt.close()

        return ax

    def get_shap_values(self):
        """
        Gets the SHAP values generated on the test set.

        Returns:
             (np.array):
                SHAP values generated on the test set.
        """
        self._check_if_fitted()
        return self.shap_values_test

__init__(clf, scoring='roc_auc', test_prc=0.25, n_jobs=1, verbose=0, random_state=None)

Initializes the class.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| clf | model object | Binary classification model or pipeline. | required |
| scoring | string or Scorer | Metric for which the model performance is calculated. It can be either a metric name aligned with predefined classification scorers names in sklearn. Another option is using probatus.utils.Scorer to define a custom metric. Recommended option for this class is 'roc_auc'. | 'roc_auc' |
| test_prc | float | Percentage of data used to test the model. By default 0.25 is set. | 0.25 |
| n_jobs | int | Number of parallel executions. If -1 use all available cores. By default 1. | 1 |
| verbose | int | Controls verbosity of the output: 0 - neither prints nor warnings are shown; 1 - 50 - only most important warnings; 51 - 100 - shows other warnings and prints; above 100 - presents all prints and all warnings (including SHAP warnings). | 0 |
| random_state | int | Random state used by the resemblance model. If it is None, the results will not be reproducible. For reproducible results set it to an integer. | None |

Source code in probatus/sample_similarity/resemblance_model.py
def __init__(
    self,
    clf,
    scoring="roc_auc",
    test_prc=0.25,
    n_jobs=1,
    verbose=0,
    random_state=None,
):
    """
    Initializes the class.

    Args:
        clf (model object):
            Binary classification model or pipeline.

        scoring (string or probatus.utils.Scorer, optional):
            Metric for which the model performance is calculated. It can be either a metric name aligned with
            predefined
            [classification scorers names in sklearn](https://scikit-learn.org/stable/modules/model_evaluation.html).
            Another option is using probatus.utils.Scorer to define a custom metric. Recommended option for this
            class is 'roc_auc'.

        test_prc (float, optional):
            Percentage of data used to test the model. By default 0.25 is set.

        n_jobs (int, optional):
            Number of parallel executions. If -1 use all available cores. By default 1.

        verbose (int, optional):
            Controls verbosity of the output:

            - 0 - neither prints nor warnings are shown
            - 1 - 50 - only most important warnings
            - 51 - 100 - shows other warnings and prints
            - above 100 - presents all prints and all warnings (including SHAP warnings).

        random_state (int, optional):
            Random state used by the resemblance model. If it is None, the results will not be reproducible. For
            reproducible results set it to an integer.
    """  # noqa
    super().__init__(
        clf=clf,
        scoring=scoring,
        test_prc=test_prc,
        n_jobs=n_jobs,
        verbose=verbose,
        random_state=random_state,
    )

    self.plot_title = "SHAP summary plot"

fit(X1, X2, column_names=None, class_names=None, **shap_kwargs)

This function assigns labels to each sample, 0 to the first sample, 1 to the second.

Then, it randomly selects a portion of data to train on. The resulting model tries to distinguish which sample a given test row comes from. This provides insights on how distinguishable these samples are and which features contribute to that. The feature importance is calculated using SHAP feature importance.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| X1 | ndarray or DataFrame | First sample to be compared. It needs to have the same number of columns as X2. | required |
| X2 | ndarray or DataFrame | Second sample to be compared. It needs to have the same number of columns as X1. | required |
| column_names | list of str | List of feature names of the provided samples. If provided it will be used to overwrite the existing feature names. If not provided the existing feature names are used or default feature names are generated. | None |
| class_names | None, or list of str | List of class names assigned, in this case provided samples e.g. ['sample1', 'sample2']. If none, the default ['First Sample', 'Second Sample'] are used. | None |
| `**shap_kwargs` | | Keyword arguments passed to shap.Explainer. It also enables approximate and check_additivity parameters, passed while calculating SHAP values. The approximate=True causes less accurate, but faster SHAP values calculation, while check_additivity=False disables the additivity check inside SHAP. | {} |

Returns:

| Type | Description |
|---|---|
| SHAPImportanceResemblance | Fitted object. |

Source code in probatus/sample_similarity/resemblance_model.py
def fit(self, X1, X2, column_names=None, class_names=None, **shap_kwargs):
    """
    This function assigns labels to each sample, 0 to the first sample, 1 to the second.

    Then, it randomly selects a portion of data to train on. The resulting model tries to distinguish which
    sample a given test row comes from. This provides insights on how distinguishable these samples are and
    which features contribute to that. The feature importance is calculated using SHAP feature importance.

    Args:
        X1 (np.ndarray or pd.DataFrame):
            First sample to be compared. It needs to have the same number of columns as X2.

        X2 (np.ndarray or pd.DataFrame):
            Second sample to be compared. It needs to have the same number of columns as X1.

        column_names (list of str, optional):
            List of feature names of the provided samples. If provided it will be used to overwrite the existing
            feature names. If not provided the existing feature names are used or default feature names are
            generated.

        class_names (None, or list of str, optional):
            List of class names assigned to the provided samples, e.g. ['sample1', 'sample2']. If None, the
            default ['First Sample', 'Second Sample'] is used.

        **shap_kwargs:
            Keyword arguments passed to
            [shap.Explainer](https://shap.readthedocs.io/en/latest/generated/shap.Explainer.html#shap.Explainer).
            The `approximate` and `check_additivity` parameters are also accepted and are passed on when SHAP
            values are calculated. Setting `approximate=True` gives a faster but less accurate SHAP value
            calculation, while `check_additivity=False` disables the additivity check inside SHAP.

    Returns:
        (SHAPImportanceResemblance):
            Fitted object.
    """
    super().fit(X1=X1, X2=X2, column_names=column_names, class_names=class_names)

    self.shap_values_test = shap_calc(self.clf, self.X_test, verbose=self.verbose, **shap_kwargs)
    self.report = calculate_shap_importance(self.shap_values_test, self.column_names)
    return self
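
The labelling and random split described for `fit` can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the probatus implementation: `label_and_split` is a hypothetical helper, the samples are plain lists of rows, and the `test_prc=0.25` default is an assumption that only mirrors the name of the constructor argument.

```python
import random


def label_and_split(X1, X2, test_prc=0.25, random_state=None):
    """Label rows of X1 with 0 and rows of X2 with 1, then split at random.

    Hypothetical helper for illustration; probatus does this inside fit()
    and may use different splitting logic.
    """
    if not X1 or not X2:
        raise ValueError("both samples must contain at least one row")
    if len(X1[0]) != len(X2[0]):
        raise ValueError("X1 and X2 must have the same number of columns")

    # Attach the sample label to every row before shuffling.
    labelled = [(list(row), 0) for row in X1] + [(list(row), 1) for row in X2]

    # A seeded generator makes the split reproducible.
    rng = random.Random(random_state)
    rng.shuffle(labelled)

    n_test = round(len(labelled) * test_prc)
    test, train = labelled[:n_test], labelled[n_test:]

    X_train = [row for row, _ in train]
    y_train = [label for _, label in train]
    X_test = [row for row, _ in test]
    y_test = [label for _, label in test]
    return X_train, X_test, y_train, y_test


if __name__ == "__main__":
    first = [[0.0, 1.0] for _ in range(6)]
    second = [[5.0, 6.0] for _ in range(6)]
    X_train, X_test, y_train, y_test = label_and_split(first, second, random_state=0)
    print(len(X_train), "training rows,", len(X_test), "test rows")
```

A classifier trained on `X_train`/`y_train` and scored on `X_test`/`y_test` then indicates how easily the two samples can be told apart.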

get_shap_values()

Gets the SHAP values generated on the test set.

Returns:

| Type | Description |
|---|---|
| array | SHAP values generated on the test set. |

Source code in probatus/sample_similarity/resemblance_model.py
def get_shap_values(self):
    """
    Gets the SHAP values generated on the test set.

    Returns:
         (np.array):
            SHAP values generated on the test set.
    """
    self._check_if_fitted()
    return self.shap_values_test
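
A common way to turn per-row SHAP values into one importance score per feature is to average their absolute values. The sketch below assumes that convention and a two-dimensional layout (rows by features); `mean_abs_importance` is a hypothetical helper and is not the library's `calculate_shap_importance`.

```python
def mean_abs_importance(shap_values, column_names):
    """Rank features by mean absolute SHAP value, highest first.

    Hypothetical helper for illustration; shap_values is a list of rows,
    each holding one SHAP value per feature.
    """
    if not shap_values:
        raise ValueError("shap_values must contain at least one row")
    if any(len(row) != len(column_names) for row in shap_values):
        raise ValueError("each row needs one value per column name")

    n_rows = len(shap_values)
    scores = []
    for j, name in enumerate(column_names):
        # Average the magnitude of this feature's contribution over all rows.
        scores.append((name, sum(abs(row[j]) for row in shap_values) / n_rows))

    # Stable sort: ties keep the original column order.
    return sorted(scores, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    values = [[0.5, -0.25], [-1.5, 0.25]]
    print(mean_abs_importance(values, ["f1", "f2"]))
    # -> [('f1', 1.0), ('f2', 0.25)]
```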

plot(plot_type='bar', show=True, **summary_plot_kwargs)

Plots the resulting AUC of the model as well as the feature importances.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| plot_type | str | Type of plot passed to shap.summary_plot. Defaults to 'bar'; available options include "dot", "bar" and "violin". | 'bar' |
| show | bool | If True, the plot is shown to the user; otherwise it is not shown. Not showing the plot can be useful when you want to edit the returned axis before showing it. | True |
| **summary_plot_kwargs | | Keyword arguments passed to shap.summary_plot. | {} |

Returns:

| Type | Description |
|---|---|
| axes | Axes that include the plot. |

Source code in probatus/sample_similarity/resemblance_model.py
def plot(self, plot_type="bar", show=True, **summary_plot_kwargs):
    """
    Plots the resulting AUC of the model as well as the feature importances.

    Args:
        plot_type (str, optional):
            Type of plot passed to shap.summary_plot. Defaults to 'bar'; available options include "dot",
            "bar" and "violin".

        show (bool, optional):
            If True, the plot is shown to the user; otherwise it is not shown. Not showing the plot can be useful
            when you want to edit the returned axis before showing it.

        **summary_plot_kwargs:
            kwargs passed to the shap.summary_plot.

    Returns:
        (matplotlib.axes):
            Axes that include the plot.
    """

    # This line serves as a double check if the object has been fitted
    self._check_if_fitted()

    summary_plot(
        self.shap_values_test,
        self.X_test,
        plot_type=plot_type,
        class_names=self.class_names,
        show=False,
        **summary_plot_kwargs,
    )
    ax = plt.gca()
    ax.set_title(self.plot_title)

    ax.annotate(
        self.results_text,
        (0, 0),
        (0, -50),
        fontsize=12,
        xycoords="axes fraction",
        textcoords="offset points",
        va="top",
    )

    if show:
        plt.show()
    else:
        plt.close()

    return ax
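
When the scoring metric is AUC, the value reported alongside the plot can be read as the probability that a randomly chosen row from the second sample receives a higher score than a randomly chosen row from the first. The sketch below computes it by pairwise comparison in plain Python as a reading aid; `pairwise_auc` is a hypothetical helper, not the routine probatus or scikit-learn use internally.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half a win. Hypothetical helper for illustration only.
    """
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    labels = [0, 0, 1, 1]
    print(pairwise_auc(labels, [0.1, 0.2, 0.8, 0.9]))  # fully separable samples
    print(pairwise_auc(labels, [0.5, 0.5, 0.5, 0.5]))  # indistinguishable samples
```

A value close to 0.5 suggests the model cannot tell the samples apart, while a value well above 0.5 suggests their distributions differ.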