LogisticRegression
Extended Logistic Regression.
Extends sklearn.linear_model.LogisticRegression.
This class provides the following extra statistics, calculated on .fit() and accessible via .get_stats():
cov_matrix_
: covariance matrix for the estimated parameters

std_err_intercept_
: estimated uncertainty for the intercept

std_err_coef_
: estimated uncertainty for the coefficients

z_intercept_
: estimated z-statistic for the intercept

z_coef_
: estimated z-statistic for the coefficients

p_val_intercept_
: estimated p-value for the intercept

p_val_coef_
: estimated p-value for the coefficients
Examples:

```python
from skorecard.datasets import load_uci_credit_card
from skorecard.bucketers import EqualFrequencyBucketer
from skorecard.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

X, y = load_uci_credit_card(return_X_y=True)
pipeline = Pipeline([
    ('bucketer', EqualFrequencyBucketer(n_bins=10)),
    ('clf', LogisticRegression(calculate_stats=True))
])
pipeline.fit(X, y)
assert pipeline.named_steps['clf'].p_val_coef_[0][0] > 0
pipeline.named_steps['clf'].get_stats()
```
An example output of .get_stats():

| Index | Coef. | Std.Err | z | P>\|z\| |
|---|---|---|---|---|
| const | -0.537571 | 0.096108 | -5.593394 | 2.226735e-08 |
| EDUCATION | 0.010091 | 0.044874 | 0.224876 | 8.220757e-01 |
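The last two columns follow the usual Wald-test conventions, so they can be reproduced from the first two. A quick sanity check against the const row above (scipy is assumed to be available):

```python
from scipy.stats import norm

coef, std_err = -0.537571, 0.096108  # the 'const' row above
z = coef / std_err                   # Wald z-statistic: coefficient / standard error
p = 2 * norm.sf(abs(z))              # two-sided p-value under the standard normal
print(z, p)                          # roughly -5.593394 and 2.2267e-08
```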
__init__(self, penalty='l2', calculate_stats=False, dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='lbfgs', max_iter=100, multi_class='auto', verbose=0, warm_start=False, n_jobs=None, l1_ratio=None)
Extends sklearn.linear_model.LogisticRegression.__init__().
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| calculate_stats | bool | If true, calculate statistics like the standard error during fit, accessible with .get_stats() | False |
decision_function(self, X)
inherited
Predict confidence scores for samples.
The confidence score for a sample is proportional to the signed distance of that sample to the hyperplane.
Parameters
X : array-like or sparse matrix of shape (n_samples, n_features)
    Samples.
Returns
array of shape (n_samples,) if n_classes == 2 else (n_samples, n_classes)
    Confidence scores per (sample, class) combination. In the binary case, the confidence score for self.classes_[1], where > 0 means this class would be predicted.
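For the binary case, the decision score is the log-odds of self.classes_[1], so applying the logistic sigmoid to it recovers predict_proba. A minimal check, using a synthetic dataset for self-containedness:

```python
import numpy as np
from scipy.special import expit  # logistic sigmoid
from sklearn.datasets import make_classification
from skorecard.linear_model import LogisticRegression

Xs, ys = make_classification(n_samples=200, n_features=4, random_state=0)
clf = LogisticRegression().fit(Xs, ys)

# sigmoid(log-odds) equals the predicted probability of the positive class
assert np.allclose(expit(clf.decision_function(Xs)), clf.predict_proba(Xs)[:, 1])
```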
densify(self)
inherited
Convert coefficient matrix to dense array format.
Converts the coef_ member (back) to a numpy.ndarray. This is the default format of coef_ and is required for fitting, so calling this method is only required on models that have previously been sparsified; otherwise, it is a no-op.
Returns
self
    Fitted estimator.
fit(self, X, y, sample_weight=None, calculate_stats=False, **kwargs)
Fit the model.
Overrides sklearn.linear_model.LogisticRegression.fit().
In addition to the standard fit by sklearn, this function will compute the covariance of the coefficients.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| X | array-like, sparse matrix of shape (n_samples, n_features) | Training vector, where n_samples is the number of samples and n_features is the number of features. | required |
| y | array-like of shape (n_samples,) | Target vector relative to X. | required |
| sample_weight | array-like of shape (n_samples,) | Array of weights assigned to individual samples. If not provided, each sample is given unit weight. | None |
| calculate_stats | bool | If true, calculate statistics like the standard error during fit, accessible with .get_stats() | False |
Returns:

| Type | Description |
|---|---|
| self (LogisticRegression) | Fitted estimator. |
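For intuition only: the covariance of the coefficients is conventionally the inverse Fisher information evaluated at the fit, and the sketch below shows that textbook computation. skorecard's actual implementation may differ in details such as regularization handling, so treat this as an illustration, not the library's code:

```python
import numpy as np

def wald_covariance(X, p):
    """Textbook sketch: covariance of [intercept, coef] via inverse Fisher information."""
    X_design = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend an intercept column
    W = p * (1 - p)                                      # per-sample Bernoulli variance
    fisher = X_design.T @ (X_design * W[:, None])        # X' W X
    return np.linalg.inv(fisher)                         # std. errors: sqrt of the diagonal
```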
get_params(self, deep=True)
inherited
Get parameters for this estimator.
get_stats(self)
Puts the summary statistics of the fit() function into a pandas DataFrame.
Returns:

| Type | Description |
|---|---|
| data (pandas DataFrame) | The statistics dataframe, indexed by the column name |
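Since the result is a plain DataFrame, it composes with the usual pandas operations. For example, filtering to coefficients significant at the 5% level; the column names below match the example output shown earlier and are an assumption about your version:

```python
stats = pipeline.named_steps['clf'].get_stats()
significant = stats[stats['P>|z|'] < 0.05]   # keep rows significant at the 5% level
print(significant[['Coef.', 'Std.Err']])
```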
plot_weights(self)
Plots the relative importance of coefficients of the model.
Examples:

```python
from skorecard.datasets import load_uci_credit_card
from skorecard.bucketers import EqualFrequencyBucketer
from skorecard.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

X, y = load_uci_credit_card(return_X_y=True)
pipeline = Pipeline([
    ('bucketer', EqualFrequencyBucketer(n_bins=10)),
    ('clf', LogisticRegression(calculate_stats=True))
])
pipeline.fit(X, y)
assert pipeline.named_steps['clf'].p_val_coef_[0][0] > 0
stats = pipeline.named_steps['clf'].get_stats()
pipeline.named_steps['clf'].plot_weights()
```
predict(self, X)
inherited
Predict class labels for samples in X.
predict_log_proba(self, X)
inherited
Predict logarithm of probability estimates.
The returned estimates for all classes are ordered by the label of classes.
Parameters
X : array-like of shape (n_samples, n_features)
    Vector to be scored, where n_samples is the number of samples and n_features is the number of features.
Returns
T : array-like of shape (n_samples, n_classes)
    Returns the log-probability of the sample for each class in the model, where classes are ordered as they are in self.classes_.
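This is simply the element-wise log of predict_proba, which is easy to verify (reusing the fitted clf from the decision_function example above):

```python
import numpy as np

assert np.allclose(clf.predict_log_proba(Xs), np.log(clf.predict_proba(Xs)))
```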
predict_proba(self, X)
inherited
Probability estimates.
The returned estimates for all classes are ordered by the label of classes.
For a multiclass problem, if multi_class is set to "multinomial", the softmax function is used to find the predicted probability of each class. Otherwise a one-vs-rest approach is used, i.e. the probability of each class is calculated with the logistic function, assuming it to be the positive class, and these values are normalized across all classes.
Parameters
X : array-like of shape (n_samples, n_features)
    Vector to be scored, where n_samples is the number of samples and n_features is the number of features.
Returns
T : array-like of shape (n_samples, n_classes)
    Returns the probability of the sample for each class in the model, where classes are ordered as they are in self.classes_.
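Two properties worth remembering: each row is a proper probability distribution, and the column order follows self.classes_:

```python
import numpy as np

proba = clf.predict_proba(Xs)                # reusing the fitted clf from above
assert np.allclose(proba.sum(axis=1), 1.0)   # each row sums to one
print(clf.classes_)                          # the class behind each column
```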
score(self, X, y, sample_weight=None)
inherited
Return the mean accuracy on the given test data and labels.
In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.
Parameters
X : array-like of shape (n_samples, n_features)
    Test samples.
y : array-like of shape (n_samples,) or (n_samples, n_outputs)
    True labels for X.
sample_weight : array-like of shape (n_samples,), default=None
    Sample weights.
Returns
score : float
    Mean accuracy of self.predict(X) w.r.t. y.
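Equivalently, score is the fraction of correct predictions:

```python
import numpy as np

assert np.isclose(clf.score(Xs, ys), np.mean(clf.predict(Xs) == ys))
```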
set_params(self, **params)
inherited
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as sklearn.pipeline.Pipeline). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
Parameters
**params : dict
    Estimator parameters.
Returns
self : estimator instance
    Estimator instance.
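With the pipeline from the examples above, the <component>__<parameter> syntax reaches into each step:

```python
# Set the bucketer's n_bins and the classifier's C in one call, then refit.
pipeline.set_params(bucketer__n_bins=5, clf__C=0.5)
pipeline.fit(X, y)
```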
sparsify(self)
inherited
Convert coefficient matrix to sparse format.
Converts the coef_ member to a scipy.sparse matrix, which for L1-regularized models can be much more memory- and storage-efficient than the usual numpy.ndarray representation.
The intercept_ member is not converted.
Returns
self
    Fitted estimator.
Notes
For non-sparse models, i.e. when there are not many zeros in coef_, this may actually increase memory usage, so use this method with care. A rule of thumb is that the number of zero elements, which can be computed with (coef_ == 0).sum(), must be more than 50% for this to provide significant benefits.
After calling this method, further fitting with the partial_fit method (if any) will not work until you call densify.
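A sketch of the rule of thumb above, including the densify round trip:

```python
import numpy as np

# Sparsify only when more than half of the coefficients are exactly zero.
if np.mean(clf.coef_ == 0) > 0.5:
    clf.sparsify()   # coef_ becomes a scipy.sparse matrix
    clf.densify()    # back to a dense ndarray, e.g. before further fitting
```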