For simplicity, let’s assume there are three customers (c1, c2, c3) in this batch, and that the sale of one vehicle (v1) is recorded.
- P(C=c1) represents the likelihood that c1 buys any car. Assuming no prior knowledge about each customer, their likelihoods of buying any car should be the same: P(C=c1) = P(C=c2) = P(C=c3), which equals a constant (e.g. 1/3 in this situation).
- P(V=v1) is the likelihood that v1 is sold. Since it appears in this batch as a recorded sale, this is 1 (100% likelihood of being sold).
Since there is only one customer making the purchase, this probability can be extended into:
P(V=v1) = P(C=c1, V=v1) + P(C=c2, V=v1) + P(C=c3, V=v1) = 1.0
For each term, the product rule gives the following formula:
P(C=c1, V=v1) = P(C=c1|V=v1) * P(V=v1) = P(V=v1|C=c1) * P(C=c1)
Since P(V=v1) is fixed and P(C=c1), P(C=c2), and P(C=c3) are all equal, P(C=c1|V=v1) is proportional to P(V=v1|C=c1). Normalizing over the three customers gives the formula for the probability calculation:
P(C=c1|V=v1) = P(V=v1|C=c1) / (P(V=v1|C=c1) + P(V=v1|C=c2) + P(V=v1|C=c3))
and the key is to obtain each probability P(V|C). The formula can be explained verbally as: the likelihood that a vehicle was purchased by a specific customer is proportional to the likelihood of that customer buying this specific vehicle.
The above formula may look too “mathematical”, so let me put it into an intuitive context: suppose three people are in a room, one a musician, one an athlete, and one a data scientist. You are told there is a violin in this room that belongs to one of them. Now guess: who do you think owns the violin? This is pretty straightforward, right? Given that the likelihood of a musician owning a violin is high, while the likelihood of an athlete or a data scientist owning one is lower, it is much more likely that the violin belongs to the musician. The “mathematical” thinking process is illustrated below.
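To make that reasoning concrete, here is a minimal sketch of the normalization step; the likelihood values for the three occupations are made-up numbers purely for illustration:
# Hypothetical likelihoods P(violin | occupation): illustrative numbers only
likelihoods = {'musician': 0.8, 'athlete': 0.05, 'data scientist': 0.1}
total = sum(likelihoods.values())
# Posterior P(occupation | violin) under a uniform prior, following the formula above
posteriors = {person: p / total for person, p in likelihoods.items()}
print(posteriors)  # the musician dominates, matching the intuition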
For each metric discussed below, we will cover:
- What is the definition and intuition behind it,
- The non-technical explanation that you can communicate to business stakeholders,
- How to calculate or plot it,
- When you should use it.
Before we start: problem definition
- defined hyperparameter values:
MODEL_PARAMS = {'random_state': 1234,
                'learning_rate': 0.1,
                'n_estimators': 10}
- trained the model:
model = lightgbm.LGBMClassifier(**MODEL_PARAMS)
model.fit(X_train, y_train)
- predicted on test data:
y_test_pred = model.predict_proba(X_test)
- logged all the metrics for each run:
  - evaluation metrics
  - performance charts
  - metric by threshold plots
log_binary_classification_metrics(y_test, y_test_pred)
1. Confusion Matrix
from sklearn.metrics import confusion_matrix
- 11918 predictions were true negatives,
- 872 were true positives,
- 82 were false positives,
- 333 predictions were false negatives.
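A minimal sketch of how those four counts are obtained and how the matrix can be plotted (assuming y_true, y_pred_pos, and threshold are defined as in the snippets that follow; the scikitplot call mirrors the plotting helpers used later in the post):
from sklearn.metrics import confusion_matrix
from scikitplot.metrics import plot_confusion_matrix
import matplotlib.pyplot as plt
y_pred_class = y_pred_pos > threshold
# ravel() flattens the 2x2 matrix into (tn, fp, fn, tp)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
fig, ax = plt.subplots()
plot_confusion_matrix(y_true, y_pred_class, ax=ax)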
2. False Positive Rate | Type I error
from sklearn.metrics import confusion_matrix
y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
false_positive_rate = fp / (fp + tn)
When to use it:
- You would rarely use this metric alone; usually it serves as an auxiliary metric alongside another one,
- If the cost of dealing with an alert is high, you should consider increasing the threshold to get fewer alerts.
3. False Negative Rate | Type II error
from sklearn.metrics import confusion_matrix
y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
false_negative_rate = fn / (tp + fn)
When to use it:
- Usually, it is not used alone but rather with some other metric,
- If the cost of letting fraudulent transactions through is high and the value you get from the users isn’t, you can consider focusing on this number.
4. True Negative Rate | Specificity
from sklearn.metrics import confusion_matrix
y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
true_negative_rate = tn / (tn + fp)
When to use it:
- Usually, you don’t use it alone but rather as an auxiliary metric,
- When you really want to be sure that you are right when you say something is safe. A typical example would be a doctor telling a patient “you are healthy”. Making a mistake here and telling a sick person they are safe and can go home is something you may want to avoid.
5. Negative Predictive Value
from sklearn.metrics import confusion_matrix
y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
negative_predictive_value = tn / (tn + fn)
- When we care about high precision on negative predictions. For example, imagine we really don’t want to have any additional process for screening the transactions predicted as clean. In that case, we may want to make sure that our negative predictive value is high.
6. False Discovery Rate
from sklearn.metrics import confusion_matrix
y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
false_discovery_rate = fp / (tp + fp)
- Again, it usually doesn’t make sense to use it alone but rather coupled with other metrics like recall.
- When raising false alerts is costly and when you want all the positive predictions to be worth looking at you should optimize for precision.
7. True Positive Rate | Recall | Sensitivity
from sklearn.metrics import confusion_matrix, recall_score
y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
true_positive_rate = tp / (tp + fn)
# or simply
recall_score(y_true, y_pred_class)
- Usually, you will not use it alone but rather coupled with other metrics like precision.
- That being said, recall is a go-to metric, when you really care about catching all fraudulent transactions even at a cost of false alerts. Potentially it is cheap for you to process those alerts and very expensive when the transaction goes unseen.
8. Positive Predictive Value | Precision
from sklearn.metrics import confusion_matrix, precision_score
y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
positive_predictive_value = tp / (tp + fp)
# or simply
precision_score(y_true, y_pred_class)
When to use it:
- Again, it usually doesn’t make sense to use it alone but rather coupled with other metrics like recall.
- When raising false alerts is costly and you want all the positive predictions to be worth looking at, you should optimize for precision.
9. Accuracy
You shouldn’t use accuracy on imbalanced problems: it is easy to get a high accuracy score simply by classifying all observations as the majority class. For example, in our case, classifying all transactions as non-fraudulent gives an accuracy of over 0.9 (a quick baseline check below illustrates this).
from sklearn.metrics import confusion_matrix, accuracy_score
y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
accuracy = (tp + tn) / (tp + fp + fn + tn)
# or simply
accuracy_score(y_true, y_pred_class)
- When your problem is balanced, using accuracy is usually a good start. An additional benefit is that it is really easy to explain to non-technical stakeholders in your project,
- When every class is equally important to you.
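As a quick sanity check of the imbalance claim above, here is a minimal sketch of the “always non-fraudulent” baseline, assuming y_test is defined as in the training script at the end:
import numpy as np
from sklearn.metrics import accuracy_score
# Trivial baseline: predict the majority (non-fraudulent) class for every transaction
y_baseline = np.zeros_like(y_test)
accuracy_score(y_test, y_baseline)  # equals the fraction of non-fraudulent transactions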
10. F beta score
from sklearn.metrics import fbeta_score
y_pred_class = y_pred_pos > threshold
fbeta_score(y_true, y_pred_class, beta=beta)
11. F1 score (beta=1)
- Pretty much in every binary classification problem. It is my go-to metric when working on those problems. It can be easily explained to business stakeholders.
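A minimal way to compute it, assuming y_true and y_pred_class as in the earlier snippets (f1_score is equivalent to fbeta_score with beta=1):
from sklearn.metrics import f1_score
f1_score(y_true, y_pred_class)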
12. F2 score (beta=2)
- I’d consider using it when recalling positive observations (fraudulent transactions) is more important than being precise about it.
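It uses the same fbeta_score helper shown above, just with beta set to 2:
from sklearn.metrics import fbeta_score
fbeta_score(y_true, y_pred_class, beta=2)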
13. Cohen Kappa Metric
from sklearn.metrics import cohen_kappa_score
y_pred_class = y_pred_pos > threshold
cohen_kappa_score(y_true, y_pred_class)
When to use it:
- This metric is not used heavily in the context of classification. Yet it can work really well for imbalanced problems and seems like a great companion/alternative to accuracy.
14. Matthews Correlation Coefficient MCC
from sklearn.metrics import matthews_corrcoef
y_pred_class = y_pred_pos > threshold
matthews_corrcoef(y_true, y_pred_class)
- When working on imbalanced problems,
- When you want to have something easily interpretable.
15. ROC Curve
from scikitplot.metrics import plot_roc
fig, ax = plt.subplots()
plot_roc(y_true, y_pred, ax=ax)
16. ROC AUC score
from sklearn.metrics import roc_auc_score
roc_auc = roc_auc_score(y_true, y_pred_pos)
When to use it:
- You should use it when you ultimately care about ranking predictions and not necessarily about outputting well-calibrated probabilities (read this article by Jason Brownlee if you want to learn about probability calibration).
- You should not use it when your data is heavily imbalanced. It was discussed extensively in this article by Takaya Saito and Marc Rehmsmeier. The intuition is the following: false positive rate for highly imbalanced datasets is pulled down due to a large number of true negatives.
- You should use it when you care equally about positive and negative classes. It naturally extends the imbalanced data discussion from the last section. If we care about true negatives as much as we care about true positives then it totally makes sense to use ROC AUC.
17. Precision-Recall Curve
from scikitplot.metrics import plot_precision_recall
fig, ax = plt.subplots()
plot_precision_recall(y_true, y_pred, ax=ax)
18. PR AUC score | Average precision
from sklearn.metrics import average_precision_score
average_precision_score(y_true, y_pred_pos)
When to use it:
- when you want to communicate the precision/recall decision to other stakeholders,
- when you want to choose the threshold that fits the business problem.
- when your data is heavily imbalanced. As mentioned before, it was discussed extensively in this article by Takaya Saito and Marc Rehmsmeier. The intuition is the following: since PR AUC focuses mainly on the positive class (PPV and TPR) it cares less about the frequent negative class.
- when you care more about positive than negative class. If you care more about the positive class and hence PPV and TPR you should go with Precision-Recall curve and PR AUC (average precision).
19. Log loss
from sklearn.metrics import log_loss
log_loss(y_true, y_pred)
When to use it:
- Pretty much always there is a performance metric that better matches your business problem. Because of that, I would use log-loss as an objective for your model with some other metric to evaluate performance.
20. Brier score
from sklearn.metrics import brier_score_loss
brier_score_loss(y_true, y_pred_pos)
How models score in this metric:
The model from experiment BIN-101 has the best calibration: for that model, our predictions were off by 0.16 on average (√0.0263309).
When to use it:
- When you care about calibrated probabilities.
21. Cumulative gains chart
- you order your predictions from highest to lowest, and
- for every percentile, you calculate the fraction of true positive observations up to that percentile (see the sketch below).
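The same logic can be sketched by hand before reaching for a plotting library (a minimal illustration, assuming y_true and y_pred_pos are defined as before):
import numpy as np
# Sort the true labels by descending predicted score
order = np.argsort(y_pred_pos)[::-1]
y_sorted = np.asarray(y_true)[order]
# Fraction of all positives captured up to each depth of the sorted list
gains = np.cumsum(y_sorted) / y_sorted.sum()
percentiles = np.arange(1, len(y_sorted) + 1) / len(y_sorted)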
from scikitplot.metrics import plot_cumulative_gain
fig, ax = plt.subplots()
plot_cumulative_gain(y_true, y_pred, ax=ax)
When to use it:
- Whenever you want to select the most promising customers or transactions to target and you want to use your model for sorting.
- It can be a good addition to ROC AUC score which measures ranking/sorting performance of your model.
22. Lift curve | lift chart
- we order the predictions from highest to lowest
- for every percentile, we calculate the fraction of true positive observations up to that percentile for our model and for the random model,
- we calculate the ratio of those fractions and plot it (a minimal version is sketched below).
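A minimal hand-rolled version of that ratio (again assuming y_true and y_pred_pos as before; the random model captures positives at the same rate as the targeted percentile):
import numpy as np
# Order true labels by descending predicted score, as in the cumulative gains sketch
order = np.argsort(y_pred_pos)[::-1]
y_sorted = np.asarray(y_true)[order]
percentiles = np.arange(1, len(y_sorted) + 1) / len(y_sorted)
gains = np.cumsum(y_sorted) / y_sorted.sum()
# Lift: fraction of positives captured by the model vs. what a random model would capture
lift = gains / percentiles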
from scikitplot.metrics import plot_lift_curve
fig, ax = plt.subplots()
plot_lift_curve(y_true, y_pred, ax=ax)
When to use it:
- Whenever you want to select the most promising customers or transactions to target and you want to use your model for sorting.
- It can be a good addition to ROC AUC score which measures ranking/sorting performance of your model.
23. Kolmogorov-Smirnov plot
- sort your observations by the prediction score,
- for every cutoff point in [0.0, 1.0] of the sorted dataset (the depth), calculate the proportion of true positives and true negatives up to this depth,
- plot those fractions, positive(depth)/positive(all) and negative(depth)/negative(all), on the Y-axis against the dataset depth on the X-axis (a minimal version is sketched below).
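Those fractions can be sketched by hand as well (a minimal illustration, assuming y_true and y_pred_pos as before):
import numpy as np
# Sort observations by descending prediction score
order = np.argsort(y_pred_pos)[::-1]
y_sorted = np.asarray(y_true)[order]
depth = np.arange(1, len(y_sorted) + 1) / len(y_sorted)
# Fraction of all positives / all negatives captured up to each depth
pos_frac = np.cumsum(y_sorted) / y_sorted.sum()
neg_frac = np.cumsum(1 - y_sorted) / (1 - y_sorted).sum()
# Plotting pos_frac and neg_frac against depth gives the KS plot; their largest gap is the KS statistic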
from scikitplot.metrics import plot_ks_statistic
fig, ax = plt.subplots()
plot_ks_statistic(y_true, y_pred, ax=ax)
24. Kolmogorov-Smirnov statistic
from scikitplot.helpers import binary_ks_curve
res = binary_ks_curve(y_true, y_pred_pos)
ks_stat = res[3]  # the KS statistic is the fourth element of the returned tuple
How models score in this metric:
Using the KS statistic, we were able to rank BIN-101 as the best model, which is what we truly expect.
When to use it:
- when your problem is about sorting/prioritizing the most relevant observations and you care equally about positive and negative classes.
- It can be a good addition to ROC AUC score which measures ranking/sorting performance of your model.
Final Thoughts
Bonus:
- a logging helper function that calculates and logs all the binary classification metrics, performance charts, and metric-by-threshold charts,
- a metrics cheatsheet with everything I talked about digested into a few pages.
- install the package:
pip install neptune-contrib[all]
- import and run:
import neptunecontrib.monitoring.metrics as npt_metrics
npt_metrics.log_binary_classification_metrics(y_true, y_pred)
- explore everything in the app:
import lightgbm
import matplotlib.pyplot as plt
import neptune
from neptunecontrib.monitoring.utils import pickle_and_send_artifact
from neptunecontrib.monitoring.metrics import log_binary_classification_metrics
from neptunecontrib.versioning.data import log_data_version
import pandas as pd
plt.rcParams.update({'font.size': 18})
plt.rcParams.update({'figure.figsize': [16, 12]})
plt.style.use('seaborn-whitegrid')
# Define parameters
PROJECT_NAME = 'neptune-ml/binary-classification-metrics'
TRAIN_PATH = 'data/train.csv'
TEST_PATH = 'data/test.csv'
NROWS = None
MODEL_PARAMS = {'random_state': 1234,
'learning_rate': 0.1,
'n_estimators': 1500}
# Load data
train = pd.read_csv(TRAIN_PATH, nrows=NROWS)
test = pd.read_csv(TEST_PATH, nrows=NROWS)
feature_names = [col for col in train.columns if col not in ['isFraud']]
X_train, y_train = train[feature_names], train['isFraud']
X_test, y_test = test[feature_names], test['isFraud']
# Start experiment
neptune.init(PROJECT_NAME)
neptune.create_experiment(name='lightGBM training',
params=MODEL_PARAMS,
upload_source_files=['train.py', 'environment.yaml'])
log_data_version(TRAIN_PATH, prefix='train_')
log_data_version(TEST_PATH, prefix='test_')
# Train model
model = lightgbm.LGBMClassifier(**MODEL_PARAMS)
model.fit(X_train, y_train)
# Evaluate model
y_test_pred = model.predict_proba(X_test)
log_binary_classification_metrics(y_test, y_test_pred)
pickle_and_send_artifact((y_test, y_test_pred), 'test_predictions.pkl')
neptune.stop()