January 13, 2023
As mentioned in Part 1 of Model Interpretability, the flexibility of model-agnostic methods is their greatest advantage and the reason they are so popular. Data scientists and machine learning engineers can use any machine learning model they wish, since the interpretation method can be applied to it regardless. This makes evaluating the task and comparing the interpretability of different models much simpler.
Part 2 of this series about Model Interpretability is about Global Model Agnostic Methods. To recap:
Global Interpretability aims to capture the entire model. It focuses on explaining and understanding why the model makes particular decisions, based on the dependent and independent variables.
Global methods describe the average behavior of a machine learning model, which makes them of great value when the model's engineer wants a better understanding of the model's general behavior, its data, and how to debug it.
I will be going through three different types of Global Model Agnostic Methods.
The Partial Dependence Plot (PDP) shows the functional relationship between a set of input features and the prediction/target response. It explores how strongly the predictions depend on particular values of the input variable of interest.
It can show whether the relationship between the target response and a feature is linear, monotonic, or more complex. It helps researchers and data scientists/engineers understand what happens to model predictions as various features are adjusted.
According to Greenwell et al.'s paper A Simple and Effective Model-Based Variable Importance Measure, a flat partial dependence plot indicates that the feature is not important and has no effect on the target response. The more the partial dependence plot varies, the more important the feature is to the prediction.
When using numerical features, the importance of a feature can be defined as the deviation of its partial dependence values from the average curve:

\[ I(x_S) = \sqrt{\frac{1}{K-1}\sum_{k=1}^{K}\left(\hat{f}_S\left(x_S^{(k)}\right) - \frac{1}{K}\sum_{k=1}^{K}\hat{f}_S\left(x_S^{(k)}\right)\right)^2} \]

where \(x_S^{(1)}, \dots, x_S^{(K)}\) are the \(K\) unique values of the feature and \(\hat{f}_S\) is its partial dependence function.
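In other words, the importance is simply the sample standard deviation of the feature's partial dependence values. A minimal numpy sketch (the function name is illustrative, not from a library):

```python
import numpy as np

def pdp_importance(pd_values):
    """Importance of a numerical feature, following Greenwell et al.:
    the sample standard deviation of its partial dependence values."""
    return np.asarray(pd_values, dtype=float).std(ddof=1)

# A flat PDP means an unimportant feature; a varying PDP, an important one.
print(pdp_importance([0.2, 0.2, 0.2, 0.2]))  # 0.0
print(pdp_importance([0.1, 0.2, 0.4, 0.7]))  # ~0.265
```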
Let's say we are using the cervical cancer dataset, which records risk factors for whether a woman will develop cervical cancer.
In this example, we fit a random forest to predict whether a woman might get cervical cancer based on risk factors such as the number of pregnancies, use of hormonal contraceptives, and more. We use a Partial Dependence Plot to compute and visualize the probability of getting cancer based on the different features.
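The cervical cancer dataset is not bundled with scikit-learn, so the sketch below substitutes synthetic data with two hypothetical features ("age" and years of contraceptive use); the partial dependence machinery itself is scikit-learn's:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import partial_dependence

rng = np.random.default_rng(0)
n = 500
age = rng.uniform(15, 70, n)
years_hc = rng.uniform(0, 25, n)  # years of hormonal contraceptive use
X = np.column_stack([age, years_hc])

# Illustrative labels only: risk rises with both features.
logit = 0.05 * (age - 40) + 0.1 * (years_hc - 10)
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Average predicted cancer probability as "age" (feature 0) is varied,
# with the other feature kept at its observed values.
pd_result = partial_dependence(model, X, features=[0], kind="average")
# scikit-learn >= 1.3 uses the key "grid_values"; older versions use "values".
grid = (pd_result["grid_values"] if "grid_values" in pd_result
        else pd_result["values"])[0]
avg_pred = pd_result["average"][0]
print(grid.shape, avg_pred.shape)
```

For the actual plots, `sklearn.inspection.PartialDependenceDisplay.from_estimator(model, X, features=[0, 1])` draws both curves in one figure.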
Above are two visualizations that show the Partial Dependence Plots of cancer probability based on the features: age and years of hormonal contraceptive use.
For the age feature, the PDP remains low until age 40 is reached, after which the probability of cancer increases. The contraceptive feature behaves similarly: after 10 years of using hormonal contraceptives, the predicted probability of cancer increases.
So what is the solution to the PDP's main disadvantage, its assumption that one feature is not influenced by another? Feature Interaction: the effect of one feature depends on the values of the other features.
When two features interact with one another, the change in the prediction cannot be attributed to each feature separately, because the effect of one feature depends on the value of the other.
To better understand this concept, consider a machine learning model that makes a prediction based on two features. The prediction can be broken down into four terms: a constant term, a term for the first feature, a term for the second feature, and an interaction term between the two.
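Written out, with \(\hat{f}\) the prediction function of a model with two features, this breakdown is:

\[ \hat{f}(x_1, x_2) = \beta_0 + f_1(x_1) + f_2(x_2) + f_{12}(x_1, x_2) \]

Here \(\beta_0\) is the constant (intercept) term, \(f_1\) and \(f_2\) are the individual effects of the two features, and \(f_{12}\) is the interaction term: the part of the prediction that cannot be attributed to either feature alone.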
If two features j and k do not interact with one another, and their partial dependence functions are centered at zero, the two-dimensional partial dependence function decomposes into a sum of the individual effects:

\[ PD_{jk}(x_j, x_k) = PD_j(x_j) + PD_k(x_k) \]
Similarly, if a feature j does not interact with any other feature, the prediction function can be stated as:

\[ \hat{f}(x) = PD_j(x_j) + PD_{-j}(x_{-j}) \]

where \(PD_{-j}(x_{-j})\) is the partial dependence function that depends on all features except the j-th.
The next step involves measuring how far the observed partial dependence deviates from this no-interaction decomposition. Friedman's H-statistic for the interaction between features j and k is the share of the variance of \(PD_{jk}\) that is not explained by the sum of the individual effects:

\[ H^2_{jk} = \frac{\sum_{i=1}^{n} \left[ PD_{jk}\left(x_j^{(i)}, x_k^{(i)}\right) - PD_j\left(x_j^{(i)}\right) - PD_k\left(x_k^{(i)}\right) \right]^2}{\sum_{i=1}^{n} PD_{jk}^2\left(x_j^{(i)}, x_k^{(i)}\right)} \]

The statistic is 0 when there is no interaction at all and approaches 1 when the entire effect comes from the interaction.
Now let’s use the same cervical cancer dataset and apply Friedman’s H-statistic on each feature.
A random forest has been used to predict whether a woman might get cervical cancer based on risk factors, and Friedman's H-statistic has been applied to each feature, showing the relative interaction effect of each feature with all the others. Hormonal contraceptives have the strongest interaction effect compared to the other features. From here, we can further explore the two-way interactions between specific pairs of features.
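A rough numpy sketch of the pairwise H-statistic, assuming only a `predict` function and a dataset (function names are my own, and the toy models are synthetic stand-ins, not the cervical cancer setup):

```python
import numpy as np

def centered_pd(predict, X, features):
    """Partial dependence evaluated at each row's own feature values:
    for row i, average the predictions over the data with `features`
    fixed to row i's values. Centered at zero, as the H-statistic assumes."""
    n = X.shape[0]
    pd_vals = np.empty(n)
    for i in range(n):
        Xi = X.copy()
        Xi[:, features] = X[i, features]
        pd_vals[i] = predict(Xi).mean()
    return pd_vals - pd_vals.mean()

def h_statistic(predict, X, j, k):
    """Friedman's pairwise H^2: the share of the joint partial dependence's
    variance not explained by the two individual partial dependence functions."""
    pd_jk = centered_pd(predict, X, [j, k])
    pd_j = centered_pd(predict, X, [j])
    pd_k = centered_pd(predict, X, [k])
    return ((pd_jk - pd_j - pd_k) ** 2).sum() / (pd_jk ** 2).sum()

# An additive model has no interaction; a multiplicative one has a strong one.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
additive = lambda X: X[:, 0] + X[:, 1]
interacting = lambda X: X[:, 0] * X[:, 1]
print(h_statistic(additive, X, 0, 1))     # ~0
print(h_statistic(interacting, X, 0, 1))  # large (strong interaction)
```

This brute-force version costs n model calls per partial dependence function, so in practice the statistic is usually estimated on a subsample of the data.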
A Global Surrogate is another type of interpretable model, one that is trained to approximate the predictions of a black-box model.
Black-box models are models so complex that they are not interpretable by humans: we have little understanding of how the variables are used or combined to make predictions. By training a surrogate model, we can draw conclusions about the black-box model's behavior.
A surrogate model, also known as a metamodel, response surface model, or emulator, is trained using a data-driven approach.
The steps of surrogate modeling:

1. Select a dataset X (the training data, or new data from the same distribution).
2. Get the predictions of the black-box model for X.
3. Select an interpretable model type, such as a linear model or a decision tree.
4. Train the interpretable model on X, using the black-box predictions as the targets.
5. Measure how well the surrogate replicates the black-box predictions.
6. Interpret the surrogate model.
The R-squared measure can be used to quantify how well the surrogate replicates the black-box model, based on the sum of squared errors (SSE) between their predictions.
If the R² value is close to 1, the SSE is low, and we can conclude that the interpretable model approximates the behavior of the black-box model well.
If the R² value is close to 0, the SSE is high, and we can infer that the interpretable model fails to explain the black-box model.
Maintaining the same example throughout, a random forest is fit to the cervical cancer dataset. Following the steps above, we select an interpretable model type, in this case a decision tree, and train it on the original dataset, but using the predictions of the random forest as the outcomes. The counts in the nodes show the frequency of the black-box model's classifications in each node.
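The whole surrogate workflow can be sketched in scikit-learn. Synthetic data stands in for the cervical cancer dataset here, and a regression tree is fit to the forest's predicted probabilities; all names are illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (X[:, 0] + X[:, 1] * X[:, 2] > 0).astype(int)

# 1) Train the black-box model on the original data.
black_box = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# 2) Get the black-box predictions (here, predicted probabilities).
bb_pred = black_box.predict_proba(X)[:, 1]

# 3-4) Train an interpretable model on X with the black-box
#      predictions as the targets.
surrogate = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, bb_pred)

# 5) Measure fidelity: R^2 close to 1 means the shallow tree
#    replicates the forest's predictions well.
fidelity = r2_score(bb_pred, surrogate.predict(X))
print(f"surrogate R^2 vs black box: {fidelity:.3f}")
```

Note that R² here is computed against the black-box predictions, not the true labels: the surrogate's job is to mimic the model, not the data.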
In this part of the series, we have covered what Global Methods are and how they relate to model-agnostic methods. I have gone through three different types of Global Model Agnostic Methods, exploring the mathematics behind them, with an example for better understanding, and advantages and disadvantages to help you choose which method to use.
In the next part, I will further explain more about Local Model Agnostic Methods.