Researchers create a mathematical framework to evaluate explanations of machine-learning models and quantify how well people understand them.
Modern machine-learning models, such as neural networks, are often referred to as “black boxes” because they are so complex that even the researchers who design them can’t fully understand how they make predictions.
To provide some insights, researchers use explanation methods that seek to describe individual model decisions. For example, they may highlight words in a movie review that influenced the model’s decision that the review was positive.
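As a rough illustration of what such a local explanation can look like, the sketch below scores each word in a review by how much the prediction drops when that word is removed. Everything in it is an assumption made for illustration: the "model" is a hypothetical hand-written word-weight lexicon rather than a neural network, the names `WEIGHTS`, `score`, and `leave_one_out` are invented, and leave-one-out attribution is just one simple explanation technique, not necessarily one the researchers studied.

```python
# Hypothetical toy "model": a fixed word-weight lexicon standing in for a real classifier.
WEIGHTS = {"great": 2.0, "wonderful": 1.5, "boring": -2.0, "bad": -1.5}


def score(words):
    """Return a positivity score: the sum of the weights of the known words."""
    return sum(WEIGHTS.get(w, 0.0) for w in words)


def leave_one_out(words):
    """Attribute to each word the drop in score caused by removing that word."""
    base = score(words)
    return [
        (w, base - score(words[:i] + words[i + 1:]))
        for i, w in enumerate(words)
    ]


review = "a great film with a wonderful cast".split()
attributions = leave_one_out(review)

# Words with a positive attribution pushed the prediction toward "positive".
highlighted = [w for w, a in attributions if a > 0]
print(highlighted)
```

Each such explanation covers a single review. On its own, it says nothing about how the model treats the same words in other inputs.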
But these explanation methods don’t do any good if humans can’t easily understand them or, worse, misunderstand them. So, MIT researchers created a mathematical framework to formally quantify and evaluate the understandability of explanations for machine-learning models. The framework can help pinpoint insights about model behavior that might be missed if a researcher evaluates only a handful of individual explanations to try to understand the entire model.
“With this framework, we can have a very clear picture of not only what we know about the model from these local explanations, but more importantly what we don’t know about it,” says Yilun Zhou, an electrical engineering and computer science graduate student in the Computer Science and Artificial Intelligence Laboratory (CSAIL) and lead author of a paper presenting this framework.