For simplicity, let’s assume there are three customers (c1, c2, c3) in this batch, and that one vehicle (v1) is reported as a sale.

- P(C=c1) represents the likelihood that c1 buys any car. Assuming no prior knowledge about each customer, their likelihoods of buying any car should be the same: P(C=c1) = P(C=c2) = P(C=c3), which equals a constant (e.g. 1/3 in this situation).
- P(V=v1) is the likelihood that v1 is sold. Given that it appears as a sale in this batch, this should be 1 (100% likelihood to be sold).

Since only one customer makes the purchase, this probability can be expanded into:

P(V=v1) = P(C=c1, V=v1) + P(C=c2, V=v1) + P(C=c3, V=v1) = 1.0

For each of these terms, consider the following formula:

P(C=c1, V=v1) = P(C=c1|V=v1) * P(V=v1) = P(V=v1|C=c1) * P(C=c1)

Because P(V=v1) = 1 and P(C=c) is the same constant for every customer, P(C=c1|V=v1) is proportional to P(V=v1|C=c1). Normalizing over the three customers gives the formula for the probability calculation:

P(C=c1|V=v1) = P(V=v1|C=c1) / (P(V=v1|C=c1) + P(V=v1|C=c2) + P(V=v1|C=c3))

The key is to estimate each P(V|C). Put in words: the likelihood that a vehicle was purchased by a specific customer is proportional to the likelihood that this customer would buy this specific vehicle.

The above formula may look too “mathematical”, so let me put it into an intuitive context: assume three people are in a room: a musician, an athlete, and a data scientist. You are told that a violin in this room belongs to one of them. Now guess, who do you think is the owner of the violin? This is pretty straightforward, right? Given that the likelihood of a musician owning a violin is high, and the likelihood of an athlete or a data scientist owning one is lower, it is much more likely that the violin belongs to the musician. The “mathematical” thinking process is illustrated below.
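The same reasoning can be sketched in a few lines of Python. This is only an illustration under a uniform prior over the candidates: the function name `posterior_over_customers` and the likelihood values for the violin example are hypothetical and chosen just to make the arithmetic visible.

```python
def posterior_over_customers(likelihoods):
    """Return P(C=c | V=v) for each candidate, assuming a uniform prior P(C=c).

    `likelihoods` maps a candidate id to P(V=v | C=c).
    """
    total = sum(likelihoods.values())
    if total == 0:
        raise ValueError("at least one likelihood must be positive")
    # With a uniform prior the posterior is the normalized likelihood.
    return {candidate: value / total for candidate, value in likelihoods.items()}


# Hypothetical values of P(V=violin | C=person), for illustration only.
violin_likelihoods = {"musician": 0.6, "athlete": 0.05, "data_scientist": 0.1}
posterior = posterior_over_customers(violin_likelihoods)
most_likely_owner = max(posterior, key=posterior.get)
print(posterior, most_likely_owner)
```

With these made-up numbers the musician ends up with most of the probability mass, which matches the intuition in the paragraph above.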

**understand how to choose** the model performance **metric for your problem**. Specifically, for each metric, I will talk about:

- What is the **definition** and **intuition** behind it,
- The **non-technical explanation** that you can communicate to business stakeholders,
- **How to calculate or plot it**,
- **When** should you **use it**.

## Before we start: problem definition

**43 features** and sampled **66000 observations** from the original dataset adjusting the **fraction of positive class to 0.09**.

**learning_rate** and **n_estimators** parameters because I wanted to have an intuition as to which models are “truly” better. Specifically, I suspect that the model with only 10 trees is worse than a model with 100 trees. Of course, as you use more trees and smaller learning rates it gets tricky, but I think it is a decent proxy.

**learning_rate** and **n_estimators**, I did the following:

- defined hyperparameter values:

```
MODEL_PARAMS = {'random_state': 1234,
                'learning_rate': 0.1,
                'n_estimators': 10}
```

- trained the model:

```
model = lightgbm.LGBMClassifier(**MODEL_PARAMS)
model.fit(X_train, y_train)
```

- predicted on test data:

```
y_test_pred = model.predict_proba(X_test)
```

- logged all the metrics for each run:

```
log_binary_classification_metrics(y_test, y_test_pred)
```

- evaluation metrics
- performance charts
- metric by threshold plots

## 1. Confusion Matrix

**How to compute:**

```
from sklearn.metrics import confusion_matrix

# y_pred_pos holds the predicted probability of the positive class
y_pred_class = y_pred_pos > threshold
cm = confusion_matrix(y_true, y_pred_class)
tn, fp, fn, tp = cm.ravel()
```

**How does it look:**

**11918** predictions were **true negatives**, **872** were **true positives**, **82** were **false positives**, and **333** predictions were **false negatives**.
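Every rate covered in the sections that follow is a ratio of these four counts. As a quick sanity check, the sketch below plugs the counts quoted above into the same formulas used in the per-metric snippets; the dictionary name `rates` is only an illustrative choice.

```python
# Counts quoted above (threshold=0.5)
tn, fp, fn, tp = 11918, 82, 333, 872

rates = {
    "false_positive_rate": fp / (fp + tn),
    "false_negative_rate": fn / (tp + fn),
    "true_negative_rate": tn / (tn + fp),
    "negative_predictive_value": tn / (tn + fn),
    "false_discovery_rate": fp / (tp + fp),
    "true_positive_rate": tp / (tp + fn),
    "positive_predictive_value": tp / (tp + fp),
    "accuracy": (tp + tn) / (tp + fp + fn + tn),
}

for name, value in rates.items():
    print(f"{name}: {value:.4f}")
```

Note that some pairs are complements of each other: false negative rate plus recall equals 1, and false discovery rate plus precision equals 1.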

**When to use it:**

## 2. False Positive Rate | Type I error

**fraction of false alerts** that will be raised based on your model predictions.

**How to compute:**

```
from sklearn.metrics import confusion_matrix
y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
false_positive_rate = fp / (fp + tn)
```

**How models score in this metric (threshold=0.5):**

**How does it depend on the threshold:**

**When to use it:**

- You rarely would use this metric alone. Usually as an auxiliary one with some other metric,
- If the **cost of dealing with an alert is high** you should consider increasing the threshold to get fewer alerts.

## 3. False Negative Rate | Type II error

**fraction of missed fraudulent transactions** that your model lets through.

**How to compute:**

```
from sklearn.metrics import confusion_matrix
y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
false_negative_rate = fn / (tp + fn)
```

**How models score in this metric (threshold=0.5):**

**How does it depend on the threshold:**

**When to use it:**

- Usually, it is not used alone but rather with some other metric,
- If the cost of letting fraudulent transactions through is high and the value you get from the users isn’t, you can consider focusing on this number.

## 4. True Negative Rate | Specificity

**How to compute:**

```
from sklearn.metrics import confusion_matrix
y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
true_negative_rate = tn / (tn + fp)
```

**How models score in this metric (threshold=0.5):**

**How does it depend on the threshold:**

**When to use it:**

- Usually, you don’t use it alone but rather as an auxiliary metric,
- When you really want to be sure that you are right when you say something is safe. A typical example would be a doctor telling a patient “you are healthy”. Making a mistake here and telling a sick person they are safe and can go home is something you may want to avoid.

## 5. Negative Predictive Value

**How to compute:**

```
from sklearn.metrics import confusion_matrix
y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
negative_predictive_value = tn / (tn + fn)
```

**How models score in this metric (threshold=0.5):**

**How does it depend on the threshold:**

**When to use it:**

- When we care about high precision on negative predictions. For example, imagine we really don’t want to have any additional process for screening the transactions predicted as clean. In that case, we may want to make sure that our negative predictive value is high.

## 6. False Discovery Rate

**How to compute:**

```
from sklearn.metrics import confusion_matrix
y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
false_discovery_rate = fp / (tp + fp)
```

**How models score in this metric (threshold=0.5):**

**How does it depend on the threshold:**

**When to use it:**

- Again, it usually doesn’t make sense to use it alone but rather coupled with other metrics like recall.
- When raising false alerts is costly and you want all the positive predictions to be worth looking at, you should optimize for precision, which is the same as minimizing the false discovery rate (FDR = 1 − precision).

## 7. True Positive Rate | Recall | Sensitivity

**put all guilty in prison.**

**How to compute:**

```
from sklearn.metrics import confusion_matrix, recall_score
y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
true_positive_rate = tp / (tp + fn)
# or simply
recall_score(y_true, y_pred_class)
```

**How does it depend on the threshold:**

**When to use it:**

- Usually, you will not use it alone but rather coupled with other metrics like precision.
- That being said, recall is a go-to metric when you really care about catching all fraudulent transactions, even at the cost of false alerts. Potentially it is cheap for you to process those alerts and very expensive when a fraudulent transaction goes unseen.

## 8. Positive Predictive Value | Precision

**people that you put in prison are guilty**.

**How to compute:**

```
from sklearn.metrics import confusion_matrix, precision_score
y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
positive_predictive_value = tp / (tp + fp)
# or simply
precision_score(y_true, y_pred_class)
```

**How models score in this metric (threshold=0.5):**

**How does it depend on the threshold:**

**When to use it:**

- Again, it usually doesn’t make sense to use it alone but rather coupled with other metrics like recall.
- When raising false alerts is costly and you want all the positive predictions to be worth looking at, you should optimize for precision.

## 9. Accuracy

You **shouldn’t use accuracy on imbalanced problems**. Then, it is easy to get a high accuracy score by simply classifying all observations as the majority class. For example in our case, by classifying all transactions as non-fraudulent we can get an accuracy of over 0.9.
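To make the "over 0.9" figure concrete, here is a small check that reuses the counts quoted in the confusion matrix section (11918 true negatives, 82 false positives, 333 false negatives, 872 true positives). The variable names are illustrative only.

```python
# Counts quoted in the confusion matrix section (threshold=0.5)
tn, fp, fn, tp = 11918, 82, 333, 872

n_negative = tn + fp  # actual clean transactions
n_positive = tp + fn  # actual fraudulent transactions
n_total = n_negative + n_positive

# A dummy model that predicts "non-fraudulent" for everything is right
# exactly on the negative class.
dummy_accuracy = n_negative / n_total
model_accuracy = (tp + tn) / n_total

print(f"dummy accuracy: {dummy_accuracy:.4f}")
print(f"model accuracy: {model_accuracy:.4f}")
```

The gap between the two numbers is small compared to how different the two classifiers are in practice, which is the reason accuracy is a weak signal on imbalanced data.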

**How to compute:**

```
from sklearn.metrics import confusion_matrix, accuracy_score
y_pred_class = y_pred_pos > threshold
tn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()
accuracy = (tp + tn) / (tp + fp + fn + tn)
# or simply
accuracy_score(y_true, y_pred_class)
```

**How models score in this metric (threshold=0.5):** We can see that all the models beat the dummy model (all clean transactions) by a large margin. Also, the models that we’d expect to be better are in fact at the top.

**How does it depend on the threshold:**

**When to use it:**

- When your problem is balanced, using accuracy is usually a good start. An additional benefit is that it is really easy to explain to non-technical stakeholders in your project,
- When every class is equally important to you.

## 10. F beta score

**the more you care about recall** over precision, **the higher beta** you should choose. For example, with F1 score we care equally about recall and precision, while with F2 score recall is twice as important to us.

**How to compute:**

```
from sklearn.metrics import fbeta_score
y_pred_class = y_pred_pos > threshold
fbeta_score(y_true, y_pred_class, beta=beta)
```
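For intuition about what `beta` does, the sketch below writes out the standard F-beta definition, (1 + beta²) · precision · recall / (beta² · precision + recall), and evaluates it on the counts quoted in the confusion matrix section. The helper name `f_beta` is hypothetical and not part of scikit-learn.

```python
def f_beta(precision, recall, beta):
    """Standard F-beta: weighted harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Counts quoted in the confusion matrix section (threshold=0.5)
tn, fp, fn, tp = 11918, 82, 333, 872
precision = tp / (tp + fp)
recall = tp / (tp + fn)

f1 = f_beta(precision, recall, beta=1)
f2 = f_beta(precision, recall, beta=2)
print(f"precision={precision:.4f} recall={recall:.4f} F1={f1:.4f} F2={f2:.4f}")
```

Because recall is lower than precision for these counts, putting more weight on recall (F2) gives a lower score than F1.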

## 11. F1 score (beta=1)

**How models score in this metric (threshold=0.5):**

**adjust the threshold to optimize F1 score**. Notice that for both precision and recall you could get perfect scores by increasing or decreasing the threshold. The good thing is, **you can find a sweet spot** for the F1 metric. As you can see, getting the threshold just right can actually improve your score a bit: 0.8077 -> 0.8121.
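One simple way to look for that sweet spot is to sweep a grid of thresholds and keep the one with the highest F1. The following is a minimal pure-Python sketch, not the procedure used for the numbers above: the helper names, the toy labels and scores, and the 0.05-step grid are all assumptions made for illustration.

```python
def f1_at_threshold(y_true, y_pred_pos, threshold):
    """F1 score when predicting positive for scores strictly above threshold."""
    tp = fp = fn = 0
    for label, score in zip(y_true, y_pred_pos):
        predicted_positive = score > threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive and label == 0:
            fp += 1
        elif not predicted_positive and label == 1:
            fn += 1
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def best_f1_threshold(y_true, y_pred_pos, thresholds):
    """Return the first threshold in the grid that maximizes F1."""
    return max(thresholds, key=lambda t: f1_at_threshold(y_true, y_pred_pos, t))


# Toy data, for illustration only.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
y_pred_pos = [0.05, 0.12, 0.22, 0.28, 0.37, 0.62, 0.71, 0.93]
thresholds = [i / 20 for i in range(1, 20)]

best = best_f1_threshold(y_true, y_pred_pos, thresholds)
print(best, f1_at_threshold(y_true, y_pred_pos, best))
```

In a real project the threshold should be chosen on a validation set rather than on the test set, otherwise the reported score is optimistic.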

**When to use it:**

- Pretty much in every binary classification problem. It is my go-to metric when working on those problems. It can be easily explained to business stakeholders.

## 12. F2 score (beta=2)

**2x emphasis on recall**.

**How models score in this metric (threshold=0.5):**

**How does it depend on the threshold:**

**find a sweet spot** for the threshold. The possible gain from 0.755 -> 0.803 shows how **important** threshold adjustments can be here.

**When to use it:**

- I’d consider using it when recalling positive observations (fraudulent transactions) is more important than being precise about it.

## 13. Cohen Kappa Metric

**“observed agreement” (po)** and **“expected agreement” (pe)**. Observed agreement (po) is simply how our classifier predictions agree with the ground truth, which means it is just accuracy. The expected agreement (pe) is how the predictions of the **random classifier that samples according to class frequencies** agree with the ground truth, or accuracy of the random classifier.

**How to compute:**

```
from sklearn.metrics import cohen_kappa_score
y_pred_class = y_pred_pos > threshold
cohen_kappa_score(y_true, y_pred_class)
```
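To see how po and pe combine, the sketch below computes kappa directly from confusion matrix counts using the standard definition kappa = (po − pe) / (1 − pe). The function name is hypothetical, and the counts are the ones quoted in the confusion matrix section.

```python
def cohen_kappa_from_counts(tn, fp, fn, tp):
    """Cohen's kappa for a binary confusion matrix."""
    total = tn + fp + fn + tp
    # Observed agreement: plain accuracy.
    po = (tp + tn) / total
    # Expected agreement: chance that a random classifier following the
    # predicted class frequencies matches the true class frequencies.
    pe_positive = ((tp + fp) / total) * ((tp + fn) / total)
    pe_negative = ((tn + fn) / total) * ((tn + fp) / total)
    pe = pe_positive + pe_negative
    return (po - pe) / (1 - pe)


kappa = cohen_kappa_from_counts(tn=11918, fp=82, fn=333, tp=872)
print(f"kappa: {kappa:.4f}")
```

The sketch does not guard against pe = 1, which happens only when both the labels and the predictions contain a single class.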

**How models score in this metric (threshold=0.5):**

**How does it depend on the threshold:**

**When to use it:**

- This metric is not used heavily in the context of classification. Yet it can work really well for imbalanced problems and seems like a great companion/alternative to accuracy.

## 14. Matthews Correlation Coefficient MCC

**How to compute:**

```
from sklearn.metrics import matthews_corrcoef
y_pred_class = y_pred_pos > threshold
matthews_corrcoef(y_true, y_pred_class)
```

**How does it depend on the threshold:**

**When to use it:**

- When working on imbalanced problems,
- When you want to have something easily interpretable.
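For reference, `matthews_corrcoef` in the binary case corresponds to the standard formula (tp·tn − fp·fn) / sqrt((tp+fp)(tp+fn)(tn+fp)(tn+fn)). Below is a small sketch of that formula applied to the counts quoted in the confusion matrix section; the function name and the choice to return 0.0 when the denominator is zero are assumptions of this example.

```python
import math


def mcc_from_counts(tn, fp, fn, tp):
    """Matthews correlation coefficient for a binary confusion matrix."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Assumed convention: undefined correlation is reported as 0.0.
        return 0.0
    return (tp * tn - fp * fn) / denominator


mcc = mcc_from_counts(tn=11918, fp=82, fn=333, tp=872)
print(f"MCC: {mcc:.4f}")
```

The value ranges from -1 (total disagreement) through 0 (no better than chance) to 1 (perfect prediction).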

## 15. ROC Curve

**How to compute:**

```
import matplotlib.pyplot as plt
from scikitplot.metrics import plot_roc
fig, ax = plt.subplots()
plot_roc(y_true, y_pred, ax=ax)
```

**How does it look:**

## 16. ROC AUC score

**how good at ranking predictions your model is**. It tells you what is the probability that a randomly chosen positive instance is ranked higher than a randomly chosen negative instance.

**How to compute:**

```
from sklearn.metrics import roc_auc_score
roc_auc = roc_auc_score(y_true, y_pred_pos)
```
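The ranking interpretation can be checked directly by counting positive-negative pairs. This is a minimal sketch with a hypothetical helper name and toy data; it is quadratic in the number of observations, so it is meant for intuition rather than for large datasets.

```python
def pairwise_roc_auc(y_true, y_pred_pos):
    """Fraction of (positive, negative) pairs where the positive scores higher.

    Ties count as half a win.
    """
    positives = [s for y, s in zip(y_true, y_pred_pos) if y == 1]
    negatives = [s for y, s in zip(y_true, y_pred_pos) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data, for illustration only.
toy_y_true = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = pairwise_roc_auc(toy_y_true, toy_scores)
print(toy_auc)
```

Here three of the four positive-negative pairs are ordered correctly, so the score is 0.75.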

**How models score in this metric:**

**When to use it:**

- You **should use it** when you ultimately **care about ranking predictions** and not necessarily about outputting well-calibrated probabilities (read this article by Jason Brownlee if you want to learn about probability calibration).
- You **should not use it** when your **data is heavily imbalanced**. It was discussed extensively in this article by Takaya Saito and Marc Rehmsmeier. The intuition is the following: false positive rate for highly imbalanced datasets is pulled down due to a large number of true negatives.
- You **should use it when you care equally about positive and negative classes**. It naturally extends the imbalanced data discussion from the last section. If we care about true negatives as much as we care about true positives then it totally makes sense to use ROC AUC.

## 17. Precision-Recall Curve

**at which recall your precision starts to fall fast**can help you choose the threshold and deliver a better model.

**How to compute:**

```
import matplotlib.pyplot as plt
from scikitplot.metrics import plot_precision_recall
fig, ax = plt.subplots()
plot_precision_recall(y_true, y_pred, ax=ax)
```

**How does it look:**

## 18. PR AUC score | Average precision

**How to compute:**

```
from sklearn.metrics import average_precision_score
average_precision_score(y_true, y_pred_pos)
```
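Average precision can be read as the mean of the precision values measured at each rank where a true positive is found. The sketch below illustrates that reading in pure Python; the helper name and toy data are hypothetical, and it assumes there are no tied scores (library implementations handle ties by grouping equal scores into one threshold).

```python
def average_precision(y_true, y_pred_pos):
    """Mean of precision@rank over the ranks of the positive observations."""
    n_positive = sum(y_true)
    if n_positive == 0:
        raise ValueError("at least one positive observation is required")
    ranked = sorted(zip(y_pred_pos, y_true), key=lambda pair: pair[0], reverse=True)
    true_positives = 0
    score = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_positives += 1
            # Precision at this rank, weighted by the recall step 1/n_positive.
            score += (true_positives / rank) / n_positive
    return score


# Toy data, for illustration only.
toy_ap = average_precision([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(toy_ap)
```

On this toy input the positives sit at ranks 1 and 3, giving (1/1 + 2/3) / 2.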

**How models score in this metric:**

**When to use it:**

- when you want to **communicate precision/recall decision** to other stakeholders
- when you want to **choose the threshold that fits the business problem**.
- when your data is **heavily imbalanced**. As mentioned before, it was discussed extensively in this article by Takaya Saito and Marc Rehmsmeier. The intuition is the following: since PR AUC focuses mainly on the positive class (PPV and TPR) it cares less about the frequent negative class.
- when **you care more about positive than negative class**. If you care more about the positive class and hence PPV and TPR you should go with Precision-Recall curve and PR AUC (average precision).

## 19. Log loss

**How to compute:**

```
from sklearn.metrics import log_loss
log_loss(y_true, y_pred)
```

**How models score in this metric:**

**When to use it:**

- Pretty much **always there is a** performance **metric that better matches your** business **problem.** Because of that, I would use log-loss as an objective for your model with some other metric to evaluate performance.

## 20. Brier score

**How to compute:**

```
from sklearn.metrics import brier_score_loss
brier_score_loss(y_true, y_pred_pos)
```

**How models score in this metric:** Model from the experiment BIN-101 has the best calibration and for that model, on average our predictions were off by 0.16 (√0.0263309).

**When to use it:**

- When you **care about calibrated probabilities.**
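For intuition, the Brier score is the mean squared difference between the predicted probability of the positive class and the 0/1 outcome, so its square root can be read as a typical prediction error. A minimal sketch with a hypothetical helper name and toy data:

```python
import math


def brier_score(y_true, y_pred_pos):
    """Mean squared difference between predicted probability and outcome."""
    squared_errors = [(p - y) ** 2 for y, p in zip(y_true, y_pred_pos)]
    return sum(squared_errors) / len(squared_errors)


# Toy data, for illustration only.
toy_brier = brier_score([0, 1, 1, 0], [0.1, 0.9, 0.8, 0.3])
print(toy_brier, math.sqrt(toy_brier))

# The score quoted above for BIN-101, converted to a typical error.
typical_error = math.sqrt(0.0263309)
print(round(typical_error, 2))
```

Lower is better, and a model that always predicts the outcome with full confidence and is always right scores 0.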

## 21. Cumulative gains chart

- you order your predictions from highest to lowest and
- for every percentile you calculate the fraction of true positive observations up to that percentile.
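The two steps above can be sketched for a single depth as follows. The helper name and the toy data are hypothetical, and the sketch assumes at least one positive observation.

```python
def cumulative_gain(y_true, y_pred_pos, fraction):
    """Share of all positives found in the top `fraction` of ranked observations."""
    ranked_labels = [
        label
        for _, label in sorted(
            zip(y_pred_pos, y_true), key=lambda pair: pair[0], reverse=True
        )
    ]
    n_top = int(round(fraction * len(ranked_labels)))
    return sum(ranked_labels[:n_top]) / sum(ranked_labels)


# Toy data, already listed from highest to lowest score.
toy_y_true = [1, 1, 0, 1, 0, 0, 0, 1, 0, 0]
toy_scores = [0.95, 0.9, 0.85, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]

gain_top_20 = cumulative_gain(toy_y_true, toy_scores, 0.2)
gain_top_50 = cumulative_gain(toy_y_true, toy_scores, 0.5)
print(gain_top_20, gain_top_50)
```

Reading the result: targeting the top 20% of this toy dataset would already reach half of the positive observations.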

**How to compute:**

```
import matplotlib.pyplot as plt
from scikitplot.metrics import plot_cumulative_gain
fig, ax = plt.subplots()
plot_cumulative_gain(y_true, y_pred, ax=ax)
```

**How does it look:**

**When to use it:**

- Whenever you want to select the most promising customers or transactions to target and you want to use your model for sorting.
- It can be a good addition to ROC AUC score which measures ranking/sorting performance of your model.

## 22. Lift curve | lift chart

- we order the predictions from highest to lowest
- for every percentile, we calculate the fraction of true positive observations up to that percentile for our model and for the random model,
- we calculate the ratio of those fractions and plot it.
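As a rough sketch of these steps for a single depth: a random model is expected to find positives in proportion to the share of the dataset it looks at, so lift is the model's share of positives divided by that depth. The helper name and toy data below are hypothetical.

```python
def lift_at(y_true, y_pred_pos, fraction):
    """Ratio of positives captured by the model vs. a random ordering."""
    order = sorted(range(len(y_true)), key=lambda i: y_pred_pos[i], reverse=True)
    n_top = max(1, int(round(fraction * len(order))))
    model_share = sum(y_true[i] for i in order[:n_top]) / sum(y_true)
    # A random ordering is expected to capture positives in proportion to depth.
    random_share = n_top / len(order)
    return model_share / random_share


# Toy data, for illustration only.
lift_y_true = [1, 1, 0, 1, 0, 0, 0, 1, 0, 0]
lift_scores = [0.95, 0.9, 0.85, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]

lift_top_20 = lift_at(lift_y_true, lift_scores, 0.2)
print(lift_top_20)
```

A lift of 2.5 at 20% depth means the model finds two and a half times as many positives there as random selection would; at full depth the lift is always 1.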

**How to compute:**

```
import matplotlib.pyplot as plt
from scikitplot.metrics import plot_lift_curve
fig, ax = plt.subplots()
plot_lift_curve(y_true, y_pred, ax=ax)
```

**How does it look:**

**When to use it:**

- Whenever you want to select the most promising customers or transactions to target and you want to use your model for sorting.
- It can be a good addition to ROC AUC score which measures ranking/sorting performance of your model.

## 23. Kolmogorov-Smirnov plot

- sort your observations by the prediction score,
- for every cutoff point [0.0, 1.0] of the sorted dataset (depth) calculate the proportion of true positives and true negatives in this depth,
- plot those fractions, positive(depth)/positive(all), negative(depth)/negative(all), on Y-axis and dataset depth on X-axis.
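The largest vertical gap between the two curves described above is the KS statistic. Below is a minimal pure-Python sketch of that computation; the helper name and toy data are hypothetical, it assumes no tied scores and both classes present, and it is not the scikit-plot implementation.

```python
def ks_statistic(y_true, y_pred_pos):
    """Maximum gap between cumulative positive and negative fractions."""
    order = sorted(range(len(y_true)), key=lambda i: y_pred_pos[i], reverse=True)
    n_positive = sum(y_true)
    n_negative = len(y_true) - n_positive
    seen_positive = seen_negative = 0
    largest_gap = 0.0
    for i in order:
        if y_true[i] == 1:
            seen_positive += 1
        else:
            seen_negative += 1
        gap = abs(seen_positive / n_positive - seen_negative / n_negative)
        largest_gap = max(largest_gap, gap)
    return largest_gap


# Toy data, for illustration only.
ks_y_true = [1, 1, 0, 1, 0, 0, 0, 1, 0, 0]
ks_scores = [0.95, 0.9, 0.85, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]

ks_value = ks_statistic(ks_y_true, ks_scores)
print(ks_value)
```

A value of 1 means the scores separate the classes perfectly, while a value near 0 means positives and negatives are mixed evenly along the ranking.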

**How to compute:**

```
import matplotlib.pyplot as plt
from scikitplot.metrics import plot_ks_statistic
fig, ax = plt.subplots()
plot_ks_statistic(y_true, y_pred, ax=ax)
```

**How does it look:**

## 24. Kolmogorov-Smirnov statistic

**How to compute:**

```
from scikitplot.helpers import binary_ks_curve
res = binary_ks_curve(y_true, y_pred_pos)
ks_stat = res[3]
```

**How models score in this metric:** By using the KS statistic as the metric we were able to rank BIN-101 as the best model, which we indeed expect to be the “truly” best model.

**When to use it:**

- when your problem is about sorting/prioritizing the most relevant observations and you care equally about positive and negative classes.
- It can be a good addition to ROC AUC score which measures ranking/sorting performance of your model.

## Final Thoughts

**Bonus:**

- binary classification **logging helper function** that calculates and logs all the metrics, performance charts, and metric by threshold charts
- binary classification **metrics cheatsheet** with everything I talked about digested into a few pages.

**Logging helper function**

If you want to **log all** of those **metrics** **and** performance **charts** that we covered for your machine learning project **with just one function call** and explore them in Neptune:

- install the package:

```
pip install neptune-contrib[all]
```

- import and run:

```
import neptunecontrib.monitoring.metrics as npt_metrics
npt_metrics.log_binary_classification_metrics(y_true, y_pred)
```

- explore everything in the app:


**Binary classification metrics cheatsheet**

**Example script**

```
import lightgbm
import matplotlib.pyplot as plt
import neptune
from neptunecontrib.monitoring.utils import pickle_and_send_artifact
from neptunecontrib.monitoring.metrics import log_binary_classification_metrics
from neptunecontrib.versioning.data import log_data_version
import pandas as pd
plt.rcParams.update({'font.size': 18})
plt.rcParams.update({'figure.figsize': [16, 12]})
plt.style.use('seaborn-whitegrid')
# Define parameters
PROJECT_NAME = 'neptune-ml/binary-classification-metrics'
TRAIN_PATH = 'data/train.csv'
TEST_PATH = 'data/test.csv'
NROWS = None
MODEL_PARAMS = {'random_state': 1234,
                'learning_rate': 0.1,
                'n_estimators': 1500}
# Load data
train = pd.read_csv(TRAIN_PATH, nrows=NROWS)
test = pd.read_csv(TEST_PATH, nrows=NROWS)
feature_names = [col for col in train.columns if col not in ['isFraud']]
X_train, y_train = train[feature_names], train['isFraud']
X_test, y_test = test[feature_names], test['isFraud']
# Start experiment
neptune.init(PROJECT_NAME)
neptune.create_experiment(name='lightGBM training',
                          params=MODEL_PARAMS,
                          upload_source_files=['train.py', 'environment.yaml'])
log_data_version(TRAIN_PATH, prefix='train_')
log_data_version(TEST_PATH, prefix='test_')
# Train model
model = lightgbm.LGBMClassifier(**MODEL_PARAMS)
model.fit(X_train, y_train)
# Evaluate model
y_test_pred = model.predict_proba(X_test)
log_binary_classification_metrics(y_test, y_test_pred)
pickle_and_send_artifact((y_test, y_test_pred), 'test_predictions.pkl')
neptune.stop()
```