
As mentioned in Part 1 of Model Interpretability, flexibility is the greatest advantage of model-agnostic methods and the reason they are so popular. Data Scientists and Machine Learning Engineers can use any machine learning model they wish, as the interpretation method can be applied to it. This makes evaluating the task and comparing interpretability across models much simpler.

Part 2 of this series about Model Interpretability is about Global Model Agnostic Methods. To recap:

Global Interpretability aims to capture the entire model. It focuses on the explanation and understanding of why the model makes particular decisions, based on the dependent and independent variables.

Global Methods

Global methods are used to describe the average behavior of a machine learning model, making them of great value when the engineer of the model wants a better understanding of the general concepts of the model, its data, and how to possibly debug it.

I will be going through three different types of Global Model Agnostic Methods.

1. Partial Dependence Plot (PDP)

The Partial Dependence Plot shows the functional relationship between a small set of input features and the prediction/target response. It shows how the predictions depend on the values of the input variable of interest, averaged over the values of all other features.

It can show if the relationship between the target response and a feature is either linear, monotonic, or more complex. It helps researchers and data scientists/engineers understand and determine what happens to model predictions as various features are adjusted.
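To make the idea concrete, here is a minimal sketch of how a partial dependence curve can be computed by hand, assuming NumPy. The function `toy_predict` is a hypothetical stand-in for a fitted model's predict function, and the data is synthetic; neither comes from the cervical cancer example.

```python
import numpy as np

def partial_dependence_curve(predict, X, feature, grid):
    """Average prediction when column `feature` is forced to each grid value."""
    curve = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature] = value            # overwrite the feature for every row
        curve.append(predict(X_mod).mean())  # average over the other features
    return np.array(curve)

# Hypothetical stand-in for a fitted model's predict function.
def toy_predict(X):
    return 2.0 * X[:, 0] + X[:, 1] ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))               # synthetic data, two features
grid = np.linspace(-2.0, 2.0, 5)            # values at which feature 0 is evaluated
pd_curve = partial_dependence_curve(toy_predict, X, feature=0, grid=grid)
print(pd_curve)
```

Because `toy_predict` is linear in feature 0, the resulting curve is a straight line, which is the kind of relationship a PDP is meant to reveal.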

According to Greenwell et al’s paper: A Simple and Effective Model-Based Variable Importance Measure, a flat partial dependence plot indicates that the feature is not important and has no effect on the target response. The more the Partial Dependence Plot varies, the more the feature is important to its prediction.

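For numerical features, the importance of a feature can be defined as the deviation of the partial dependence at each unique feature value from the average curve, using this formula:

$$I(x_j) = \sqrt{\frac{1}{K-1}\sum_{k=1}^{K}\left(PD_j(x_j^{(k)}) - \frac{1}{K}\sum_{m=1}^{K} PD_j(x_j^{(m)})\right)^2}$$

Here $x_j^{(k)}$ are the K unique values of feature j, and $PD_j$ is the partial dependence function of that feature.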
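In other words, this importance is the sample standard deviation of the partial dependence values. A minimal sketch, assuming NumPy; the two curves below are made-up numbers for illustration, not results from the dataset.

```python
import numpy as np

def pdp_importance(pd_values):
    """Sample standard deviation of a partial dependence curve."""
    pd_values = np.asarray(pd_values, dtype=float)
    return float(np.std(pd_values, ddof=1))  # ddof=1 gives the 1/(K-1) factor

flat_curve = [0.3, 0.3, 0.3, 0.3]      # made-up flat PDP
varying_curve = [0.1, 0.2, 0.4, 0.7]   # made-up PDP that changes with the feature
print(pdp_importance(flat_curve), pdp_importance(varying_curve))
```

A flat curve scores (almost exactly) zero, while the varying curve gets a clearly positive score, matching the statement that a flat PDP indicates an unimportant feature.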

An example:

Let’s say we are using the cervical cancer dataset which explores and indicates the risk factors of whether a woman will get cervical cancer.

In this example, we fit a random forest to predict whether a woman might get cervical cancer based on risk factors such as the number of pregnancies, use of hormonal contraceptives, and more. We use a Partial Dependence Plot to compute and visualize the probability of getting cancer based on the different features.

Above are two visualizations that show the Partial Dependence Plots of cancer probability based on the features: age and years of hormonal contraceptive use.

For the age feature, we can see that the PDP remains low until the age of 40 is reached, after which the probability of cancer increases. The same holds for the contraceptive feature: after 10 years of using hormonal contraceptives, there is an increase in the probability of cancer.

Advantages:

Partial Dependence Plots are easy to implement and interpret. Changing the features and measuring the impact it has on the prediction is a simple form of analyzing the relationship between the feature and prediction as well as interpreting complex models or tasks.
Interpretations are clear. There are some methods where you will have to dive into understanding the explanation; with PDP, however, if the feature used to compute the PDP is not correlated with other features, the plot shows exactly how that feature influences the prediction on average. With this, you can make simple and clear interpretations.

Disadvantages:

The maximum number of features is 2. This is due to the 2-D representation that PDP is limited to. Using PDP to plot and interpret more than two features is difficult.
Lack of Data. This is an issue for a lot of processes, methods, and models, but PDP in particular may not be accurate for feature values that have little data. Interpreting regions with almost no data can be very misleading.
The assumption of Independence. Some features are not independent, and other features influence them. For example, imagine you are predicting the time it takes for someone to run 100m, taking into consideration their height and weight. The PDP of one feature, height, assumes that it is not correlated with the other feature, weight. This is not true: height and weight are correlated, and both affect the time it takes for someone to run 100m. PDP is easily interpreted if the feature or features for the computed partial dependence are not correlated with any other feature; however, this assumption is also its biggest disadvantage.

Implementing PDP in your projects

If you are using R, there are packages such as: iml, pdp, and DALEX.
If you are using Python, there are packages such as PDPBox, as well as the partial_dependence function and the PartialDependenceDisplay class in the sklearn.inspection module. For more information on sklearn.inspection, refer to the scikit-learn documentation.
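As a minimal sketch of the Python route, the snippet below calls `partial_dependence` from `sklearn.inspection`, assuming scikit-learn and NumPy are installed. The synthetic data and the random forest settings are my own illustrative choices, not the cervical cancer setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import partial_dependence

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(300, 3))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=300)  # only feature 0 matters

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# "average" holds one partial dependence curve per model output.
pd_feature_0 = partial_dependence(
    model, X, features=[0], kind="average", grid_resolution=20
)["average"][0]
pd_feature_2 = partial_dependence(
    model, X, features=[2], kind="average", grid_resolution=20
)["average"][0]

print("range of PD for feature 0:", np.ptp(pd_feature_0))
print("range of PD for feature 2:", np.ptp(pd_feature_2))
```

The curve for feature 0 should span a wide range, while the curve for the irrelevant feature 2 should stay nearly flat. For plotting instead of raw values, `PartialDependenceDisplay.from_estimator` takes a similar set of arguments and additionally requires matplotlib.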

2. Feature Interaction

So what is the solution to the disadvantage of PDP and its assumption that features are not influenced by other features? Feature Interaction. When features interact, the effect of one feature depends on the value of other features.

When two features interact with one another, the interaction is the change in the prediction that occurs by varying the features after the individual feature effects have been accounted for.

To better understand this concept, we can break down the prediction of a machine learning model that uses two features into four terms:

Constant term
Term for the first feature
Term for the second feature
Term for the interaction between the two features
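The four terms above can be sketched for the simplest case: two binary features and a 2x2 table of predictions. This is only an illustration in plain Python. The function name and the prediction values are hypothetical, and level 0 of each feature is assumed to be the reference level.

```python
def decompose_two_feature_predictions(pred):
    """Split a 2x2 table of predictions into the four terms.

    `pred[(a, b)]` is the prediction with feature 1 at level a and
    feature 2 at level b, where level 0 is the reference level.
    """
    constant = pred[(0, 0)]
    effect_1 = pred[(1, 0)] - constant
    effect_2 = pred[(0, 1)] - constant
    # Whatever the individual effects cannot explain is the interaction.
    interaction = pred[(1, 1)] - (constant + effect_1 + effect_2)
    return constant, effect_1, effect_2, interaction

# Made-up predictions (illustrative numbers only).
additive = {(0, 0): 150, (1, 0): 250, (0, 1): 200, (1, 1): 300}
interacting = {(0, 0): 150, (1, 0): 250, (0, 1): 200, (1, 1): 400}

print(decompose_two_feature_predictions(additive))     # interaction term is 0
print(decompose_two_feature_predictions(interacting))  # interaction term is 100
```

In the first table the prediction for both features "on" is exactly the sum of the individual effects, so there is no interaction. In the second table the extra 100 cannot be attributed to either feature alone.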


Friedman’s H-statistic

If two features do not interact with one another, the 2-way partial dependence function can be decomposed as follows (assuming the partial dependence functions are centered at 0):

$$PD_{jk}(x_j, x_k) = PD_j(x_j) + PD_k(x_k)$$

$PD_{jk}(x_j, x_k)$ is the 2-way partial dependence function of both features
$PD_j(x_j)$ and $PD_k(x_k)$ are the partial dependence functions of the single features

Likewise, if a feature has no interaction with any of the other features, the prediction function can be stated as:

$$\hat{f}(x) = PD_j(x_j) + PD_{-j}(x_{-j})$$

$\hat{f}(x)$ is the prediction function, expressed as a sum of partial dependence functions
$PD_j(x_j)$ is the partial dependence that depends on the feature j
$PD_{-j}(x_{-j})$ is the partial dependence that depends on all other features except the j-th feature.

The next step involves measuring the interactions between the features:

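The interaction between feature j and k:

$$H^2_{jk} = \frac{\sum_{i=1}^{n}\left[PD_{jk}(x_j^{(i)}, x_k^{(i)}) - PD_j(x_j^{(i)}) - PD_k(x_k^{(i)})\right]^2}{\sum_{i=1}^{n} PD_{jk}^2(x_j^{(i)}, x_k^{(i)})}$$

The interaction between feature j and any other features:

$$H^2_j = \frac{\sum_{i=1}^{n}\left[\hat{f}(x^{(i)}) - PD_j(x_j^{(i)}) - PD_{-j}(x_{-j}^{(i)})\right]^2}{\sum_{i=1}^{n} \hat{f}^2(x^{(i)})}$$

In both cases the sums run over the n data points.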
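As a rough sketch of the pairwise case, the H-statistic compares the 2-way partial dependence of features j and k with the sum of their single-feature partial dependences, all evaluated at the data points and centered at zero. The code below assumes NumPy. The two toy models are hypothetical stand-ins for a fitted model, and the brute-force loop is written for readability, not speed.

```python
import numpy as np

def centered_pd(predict, X, features):
    """Partial dependence on `features`, evaluated at every data point and centered."""
    n = X.shape[0]
    values = np.empty(n)
    for i in range(n):
        X_mod = X.copy()
        X_mod[:, features] = X[i, features]  # fix the chosen features at row i's values
        values[i] = predict(X_mod).mean()    # average over the remaining features
    return values - values.mean()

def h_statistic_pair(predict, X, j, k):
    """Squared pairwise H-statistic for features j and k."""
    pd_j = centered_pd(predict, X, [j])
    pd_k = centered_pd(predict, X, [k])
    pd_jk = centered_pd(predict, X, [j, k])
    return float(np.sum((pd_jk - pd_j - pd_k) ** 2) / np.sum(pd_jk ** 2))

# Hypothetical stand-ins for a fitted model's predict function.
def additive_model(X):
    return X[:, 0] + 2.0 * X[:, 1] + X[:, 2]

def interacting_model(X):
    return X[:, 0] * X[:, 1] + X[:, 2]

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
h_additive = h_statistic_pair(additive_model, X, 0, 1)
h_interacting = h_statistic_pair(interacting_model, X, 0, 1)
print(h_additive, h_interacting)
```

The additive model gives a value of essentially zero, and the model with a product term gives a value close to 1. The nested evaluation (one full pass over the data per data point) is also why the statistic is expensive to compute.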

An example:

Now let’s use the same cervical cancer dataset and apply Friedman’s H-statistic on each feature.

A random forest has been used to predict whether a woman might get cervical cancer based on risk factors. Friedman’s H-statistic has been applied to each feature, showing the relative interaction effects of all the features. Hormonal contraceptives have the highest relative interaction effect in comparison to the other features. Using this, we can further explore the 2-way interactions between a selected feature and each of the other features.

Advantages:

Unlike PDP, Friedman’s H-statistic allows you to analyze the strength of interactions, including those between 3 or more features.
Interpretation with meaning. The features are statistically explored and the interactions are defined, allowing you to further dive into understanding more about the types of interactions.

Disadvantages:

Friedman’s H-statistic is computationally expensive, taking a lot of time as it estimates the marginal distributions.
Variance. If not all data points are used, the estimates of the marginal distributions have a certain variance, causing the results to be unstable.
Visualizing the interaction: Friedman’s H-statistic shows us the strength of interaction between the features; however, unlike PDP, it does not give us a 2D visualization of what the interactions look like.
Friedman’s H-statistic cannot be meaningfully used for tasks such as image classification, as the inputs are pixels.

3. Global Surrogate

A Global Surrogate is an interpretable model that is trained to approximate the predictions of a black-box model.

Black-box models are models that are so complex that they are not interpretable by humans. Humans have little understanding of how the variables are being used or combined to make predictions. We can draw conclusions about the black-box model by interpreting a surrogate model.

A surrogate model, also known as a metamodel, response surface model, or emulator, is trained using a data-driven approach.

The steps of surrogate modeling:

Select a dataset.
You can use the same dataset that was used to train the black-box model or a completely new dataset from the same distribution.
Once you have selected your dataset, get the predictions of the black-box model.
Select your interpretable model type.
This can be a linear model, decision tree, etc.
Train the interpretable model on your selected dataset and its predictions.
There you have it. A surrogate model.
Your next step, to help you judge the interpretation, is to measure the difference between the surrogate model predictions and those of the black-box model.
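The steps above can be sketched as follows. This is a minimal illustration, assuming NumPy: `black_box_predict` is a hypothetical stand-in for a trained black-box model, the data is synthetic, and a linear model (one of the interpretable options) is fitted by least squares as the surrogate.

```python
import numpy as np

# Hypothetical black-box model: any function mapping rows of X to predictions.
def black_box_predict(X):
    return np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + X[:, 0] * X[:, 1]

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(500, 2))       # select a dataset
y_black_box = black_box_predict(X)              # get the black-box predictions

# Choose a linear model and train it on the dataset and the black-box predictions.
design = np.column_stack([np.ones(len(X)), X])  # intercept + one weight per feature
coef, *_ = np.linalg.lstsq(design, y_black_box, rcond=None)
y_surrogate = design @ coef                     # surrogate predictions

print("intercept and weights:", np.round(coef, 3))
```

The fitted weights are the interpretation: they summarize, in a form a human can read, how the black-box predictions change with each feature. Note that the surrogate is trained on the black-box predictions, not on the true outcomes.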

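The R-squared measure can be used to quantify the difference between the surrogate model and the black-box model, measuring how well the surrogate replicates the black box:

$$R^2 = 1 - \frac{SSE}{SST} = 1 - \frac{\sum_{i=1}^{n}\left(\hat{y}_*^{(i)} - \hat{y}^{(i)}\right)^2}{\sum_{i=1}^{n}\left(\hat{y}^{(i)} - \bar{\hat{y}}\right)^2}$$

$R^2$ is the percentage of variance captured by the surrogate model.
SSE is the sum of squared errors.
SST is the total sum of squares.
$\hat{y}_*^{(i)}$ is the prediction of the surrogate model for the i-th instance.
$\hat{y}^{(i)}$ is the prediction of the black-box model for the i-th instance.
$\bar{\hat{y}}$ is the mean of the black-box model predictions.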

If the $R^2$ value is close to 1, it indicates a low SSE value, which we can interpret as the interpretable model approximating the behavior of the black-box model well.

If the $R^2$ value is close to 0, it indicates a high SSE value, allowing us to infer that the interpretable model fails to explain the black-box model.
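A minimal sketch of this measure, assuming NumPy. The function name and the prediction values are made up for illustration; the point is that the black-box predictions, not the true labels, play the role of the target.

```python
import numpy as np

def surrogate_r_squared(y_surrogate, y_black_box):
    """Share of the black-box prediction variance captured by the surrogate."""
    y_surrogate = np.asarray(y_surrogate, dtype=float)
    y_black_box = np.asarray(y_black_box, dtype=float)
    sse = np.sum((y_surrogate - y_black_box) ** 2)         # sum of squared errors
    sst = np.sum((y_black_box - y_black_box.mean()) ** 2)  # total sum of squares
    return float(1.0 - sse / sst)

# Made-up predictions (illustrative numbers only).
y_black_box = [0.1, 0.4, 0.35, 0.8]
close_surrogate = [0.12, 0.38, 0.36, 0.79]  # tracks the black box closely
mean_only_surrogate = [0.4125] * 4          # always predicts the black-box mean

print(surrogate_r_squared(close_surrogate, y_black_box))
print(surrogate_r_squared(mean_only_surrogate, y_black_box))
```

The first surrogate scores close to 1, while the one that only predicts the mean scores about 0, since it captures none of the variation in the black-box predictions.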

An example:

Maintaining the same example throughout, we again use the random forest trained on the cervical cancer dataset as the black-box model. As mentioned in the steps above, you select your interpretable model type and train it on the original dataset. In this case, we’re using a decision tree, but using the predictions from the random forest as the outcomes. The counts in the nodes show the frequency of the black-box model’s classifications in the nodes.

Advantages:

The R-squared measure is a popular metric. It helps us to measure how good the surrogate model is in approximating black-box model predictions.
Surrogate modeling is easy and simple to implement. This allows for smoother interpretations and better explanations for people who have little to no knowledge in the world of Data Science and Machine Learning.
Flexibility: Being able to use any interpretable model type makes surrogate modeling flexible. You can exchange the interpretable model, as well as the underlying black-box model.
Less computationally expensive. Training and employing surrogate modeling is much cheaper than using other methods.

Disadvantages:

Choosing your interpretable model. Although the flexibility of this choice is one of the advantages, you also need to take into consideration that whichever interpretable model you choose comes with its own advantages and disadvantages.
It’s about the model, not the data. When using surrogate modeling, you need to remember that you are drawing up conclusions and interpretations about the model, not about the data. Surrogate modeling does not allow you to see the real outcome.

Conclusion

In this part of the series, we have covered what Global Methods are and how they are related to Model Agnostic methods. I have gone through three different types of Global Model Agnostic Methods, exploring the mathematics behind them, an example for your better understanding, and advantages and disadvantages to help you choose which method you should use.

In the next part, I will explain more about Local Model Agnostic Methods.

Stay tuned!
