Evaluation Metrics for Classification Models in Machine Learning (Part 2)

In machine learning, data scientists use evaluation metrics to assess how accurately a model classifies data points into their respective classes.

As a data scientist, you must select the right evaluation metrics based on the problem’s use case and the dataset’s characteristics.

These metrics may differ depending on the requirements of the problem. For example, recall may be more important than precision in a medical diagnosis scenario, as it is more important to avoid false negatives (missed diagnoses) than false positives (unnecessary treatments).

In the previous part of this series, we learned about some of the evaluation metrics used for classification models and in what scenarios we should use those metrics.

This article will review other useful evaluation metrics for classification models. Let’s get started!

F1 Score

The F1 score is one of the most popular metrics for classification models. It is the harmonic mean of the model’s precision and recall and is a number that ranges between 0 and 1.

The F1 score is calculated as follows:

F1 = 2 * (precision * recall) / (precision + recall)

The F1 score is helpful when precision and recall are essential, and the data is relatively balanced between the two classes. For example, it can be used to evaluate the performance of a fraud detection model, where both false positives and false negatives have serious consequences.

Example:

#F1 score 
from sklearn.metrics import f1_score

y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

f1 = f1_score(y_true, y_pred)
print("F1 Score:", f1)

Log Loss

Log loss (also called logarithmic loss or cross-entropy loss) measures the performance of a classification model where the prediction output is a probability value between 0 and 1. It compares the predicted probability distribution with the actual probability distribution of the test data. It’s defined as follows:

log_loss = -1/n * ∑(y * log(y_hat) + (1-y) * log(1-y_hat))

Where n is the number of samples, y is the true label, and y_hat is the predicted probability.

Log loss is helpful when you want to penalize the model for being confidently wrong. It is commonly used in multi-class classification problems, where the output is a probability distribution over multiple classes.

Example:

#Log loss evaluation metric
from sklearn.metrics import log_loss

y_true = [0, 1, 1, 0, 1]
y_pred = [[0.89, 0.11], [0.3, 0.7], [0.81, 0.19], [0.6, 0.4], [0.1, 0.9]]

logloss = log_loss(y_true, y_pred)
print("Log Loss:", logloss)

Cohen’s Kappa

Cohen’s Kappa is a statistical measure of inter-rater agreement between two raters for categorical items.

In the context of classification models, it measures the agreement between predicted and true labels and considers the possibility of the agreement by chance. It is defined as follows:

kappa = (observed agreement - expected agreement) / (1 - expected agreement)

Where observed agreement is the proportion of times the raters agreed, and expected agreement is the proportion of times they would be expected to agree by chance.

Cohen’s Kappa is useful when the classes are imbalanced, and the overall accuracy is not a good indicator of model performance. It is commonly used to evaluate NLP tasks, such as text classification.

Example:

#Cohen's Kappa evaluation metric
from sklearn.metrics import cohen_kappa_score

y_true = [0, 1, 0, 1, 1]
y_pred = [0, 0, 1, 0, 1]

kappa = cohen_kappa_score(y_true, y_pred)
print("Cohen's Kappa:", kappa)

Matthews Correlation Coefficient (MCC)

The Matthews correlation coefficient (MCC) measures the correlation between the observed and predicted binary classifications, taking into account true and false positives and negatives. It is calculated as follows:

MCC = (TP * TN - FP * FN) / sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))

The MCC is applicable when the classes are imbalanced, and the overall accuracy is not a good indicator of model performance. For example, it can be used to evaluate the performance of a cancer diagnosis model where the number of positive samples is much smaller than the number of negative samples.

Example:

#MCC evaluation metric
from sklearn.metrics import matthews_corrcoef

y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

mcc = matthews_corrcoef(y_true, y_pred)
print("MCC:", mcc)

Receiver Operating Characteristic (ROC) Curve

The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR) at different classification thresholds, providing a way to examine the trade-off between sensitivity (TPR) and specificity (1 - FPR).

The AUC (area under the curve) summarizes the model’s overall performance across all possible classification thresholds.

Example:

#ROC Curve
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

y_true = [0, 1, 0, 1, 0]
y_score = [0.2, 0.7, 0.9, 0.4, 0.6]

fpr, tpr, thresholds = roc_curve(y_true, y_score)

plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show()

auc = roc_auc_score(y_true, y_score)

print("AUC:", auc)

The ROC curve is useful when the classes are imbalanced and the cost of false positives and false negatives is not the same.

For example, it can be used to evaluate the performance of a credit risk model, where the cost of false positives (granting credit to a risky borrower) is higher than the cost of false negatives (rejecting a good borrower).

Conclusion

These are some of the additional evaluation metrics for classification models in machine learning. As a data scientist, you should choose the right evaluation metrics based on the problem’s use case and the characteristics of the given dataset.

Thanks for reading!

Pralabh Saxena, Heartbeat writer
