
Better Benchmarks for Safety-Critical AI Applications

Date: May 27, 2025
Topics: Machine Learning

Stanford researchers investigate why models often fail in edge-case scenarios.

Many artificial intelligence models score well on benchmarks, but those scores can hide a real weakness when it comes to reliable AI. At issue is the problem of spurious correlations: relationships within a model’s training data that don’t hold true once the model is applied to a new domain outside the controlled training environment. Say, for example, a model is trained to recognize images of camels and cows. If the cow images usually include pasture grass and the camel images show desert sand, the model could incorrectly assume that all animals on green grass are cows.
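
To make the failure mode concrete, here is a minimal, hypothetical sketch (not taken from the paper): a synthetic cow/camel task in which a "background" feature tracks the label during training but flips at deployment. A standard classifier leans on the background, so its accuracy collapses out of distribution. The features, probabilities, and numbers are all invented for illustration.

```python
# Minimal, hypothetical sketch (not from the paper): a classifier that latches
# onto a spurious "background" feature looks fine in its training environment
# but fails once that correlation flips in deployment.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_data(n, p_background_matches_label):
    """Labels: 1 = cow, 0 = camel. Feature 0 is a weak 'animal shape' cue;
    feature 1 is the background (1 = grass, 0 = sand), which agrees with the
    label with the given probability."""
    y = rng.integers(0, 2, size=n)
    shape_cue = y + rng.normal(0.0, 2.0, size=n)            # weak true signal
    background = np.where(rng.random(n) < p_background_matches_label, y, 1 - y)
    return np.column_stack([shape_cue, background]), y

# Training environment: cows almost always appear on grass, camels on sand.
X_train, y_train = make_data(5000, p_background_matches_label=0.95)
X_id,    y_id    = make_data(5000, p_background_matches_label=0.95)
# Deployment: the correlation flips (cows on sand, camels on grass).
X_ood,   y_ood   = make_data(5000, p_background_matches_label=0.05)

clf = LogisticRegression().fit(X_train, y_train)
print("in-distribution accuracy:    ", clf.score(X_id, y_id))
print("out-of-distribution accuracy:", clf.score(X_ood, y_ood))
# Typically ~0.9+ in distribution and far below 0.5 out of distribution,
# because the model relies on the background rather than the animal.
```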

“There’s a trend in the study of domain generalization – a field that studies the ability of models to perform well on unseen data from a different data distribution – to assume standard model training on very large datasets guarantees accurate performance in deployment. But we see some critical situations where models need to be more robust to the problem of spurious correlations in decision making,” says Olawale Salaudeen, a recent PhD graduate affiliated with the Stanford Trustworthy AI Research (STAIR) lab and current postdoctoral scholar at MIT.

In a new paper, “Are Domain Generalization Benchmarks with Accuracy on the Line Misspecified?,” Salaudeen worked with Sanmi Koyejo, assistant professor of computer science at Stanford, to expose the limitations of current domain generalization benchmarks and offer practical implications for improving benchmark selection, evaluation practices, and model selection. The work was partially funded by the Stanford Institute for Human-Centered AI.

Where Benchmarks Fall Short

Developers regard benchmarks as a kind of superpower for machine learning. By establishing standard measurement and evaluation protocols, benchmarks provide reference points to help developers predict how well a model they’ve run in a training environment will perform once it’s deployed in a real-world setting. Benchmarks help to contain the time and cost of training while ensuring the technology is likely to fulfill its promise when released.

This approach works well most of the time. But sometimes a model that performs according to accepted benchmarks fails when it runs outside the training environment and encounters new, unseen data that’s “out-of-distribution” in statistical speak. To illustrate the concern, Koyejo describes a 2019 study that looked at a medical dataset containing images of human lungs. Some were normal and healthy, while others showed a condition called collapsed lung, or pneumothorax. Researchers trained the model to recognize signs of this condition; however, the model picked up an unintentional cue: images of pneumothorax in the dataset often included a chest tube – a device inserted for treatment after the diagnosis has been made by a doctor.

“If the model relied on seeing a chest tube to predict whether an image reflects a healthy or compromised lung, it would be catastrophically wrong in some cases, leaving a critical health condition undetected,” Koyejo says.

The Allure of “Accuracy on the Line”

Many domain generalization benchmarks dismiss the issue of spurious correlations, citing a phenomenon called “accuracy on the line,” which holds that incorrect associations like the camel/cow image backgrounds don’t affect predictions of how most models will perform when deployed. As Salaudeen explains, “The accuracy on the line phenomenon says that better performance on ‘in-distribution’ data in a training environment generally predicts improvement on ‘out-of-distribution’ data as well. But in cases like the collapsed lung dataset, we found this assumption to be problematic, suggesting many state-of-the-art benchmarks with this property cannot be trusted to evaluate models for such safety-critical applications.”

Against this backdrop, the Stanford team decided to investigate when and why current domain generalization benchmarks fall short and how they might be improved. First, they introduced the concept of misspecified benchmarks to describe situations where models that use all correlations in the training data still generalize well to out-of-distribution data. These situations represent an ideal setting, in which machine learning excels by default because there are no incorrect associations in the training data to harm the out-of-distribution performance. However, such benchmarks do not represent non-ideal settings, such as the pneumothorax example. Essentially, they found misspecified benchmarks inflate our confidence in robustness to spurious correlations.

Next, the team outlined requirements for a benchmark to be considered foolproof – or well-specified – meaning the algorithms being evaluated are actually confronted with potential spurious correlations, so their robustness can be investigated. Working with the medical AI example, they aimed to provide benchmark guidelines that can distinguish a model that uses the physiological features of the human lung depicted in an X-ray from one that relies on spurious features, such as a chest tube or a technician’s handwriting, which the model should ignore as it learns to recognize signs of disease. This step of the research shows that well-specified benchmarks do not exhibit accuracy on the line.

“Counter to the trending theory, our results suggest benchmarks with accuracy on the line are inappropriate in a variety of settings, including medical diagnosis and bias detection. Since they do not contain spurious correlations that models should ignore, they cannot be reliable indicators of domain generalization potential in real-world settings where spurious correlations do exist,” Salaudeen explains.

Empirical Research and Analysis

Following this theoretical exercise, Salaudeen and colleagues applied their benchmark conditions test to datasets from two domain generalization benchmark suites, DomainBed and WILDS. Their analysis of several image datasets revealed that many datasets commonly used as benchmarks for domain generalization exhibit accuracy on the line and are, therefore, misspecified. The team concluded the AI community needs benchmarks without accuracy on the line to evaluate domain generalization potential. Such benchmarks are well-specified and can reliably identify algorithms that won’t be susceptible to spurious correlations.
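
As a rough sketch of how such a check might look in practice, the snippet below assumes one has in-distribution and out-of-distribution accuracies for a pool of trained models, fits a line to the pairs, and inspects the strength of the fit. Prior work on accuracy on the line often applies a probit transform to the accuracies first, so that transform is included as an option; the function name, the example numbers, and the lack of a formal decision threshold are illustrative assumptions, not the paper's exact procedure.

```python
# Hedged sketch: one way to check whether a benchmark exhibits "accuracy on
# the line" is to fit a line to (in-distribution, out-of-distribution)
# accuracy pairs collected from many models and inspect the strength of the
# fit. The paper's exact criterion may differ; this only illustrates the idea.
import numpy as np
from scipy.stats import norm, pearsonr

def accuracy_on_the_line_stats(id_acc, ood_acc, use_probit=True):
    """Correlation and linear fit between ID and OOD accuracy across models.
    Accuracies are arrays of values strictly between 0 and 1, one per model;
    prior work on this phenomenon often applies a probit transform first."""
    id_acc, ood_acc = np.asarray(id_acc, float), np.asarray(ood_acc, float)
    if use_probit:
        id_acc, ood_acc = norm.ppf(id_acc), norm.ppf(ood_acc)
    r, _ = pearsonr(id_acc, ood_acc)
    slope, intercept = np.polyfit(id_acc, ood_acc, deg=1)
    return {"r": r, "r_squared": r ** 2, "slope": slope, "intercept": intercept}

# Hypothetical accuracies for a pool of models evaluated on one benchmark.
id_acc  = [0.82, 0.85, 0.88, 0.90, 0.93, 0.95]
ood_acc = [0.70, 0.73, 0.77, 0.80, 0.84, 0.87]   # tracks ID accuracy closely

print(accuracy_on_the_line_stats(id_acc, ood_acc))
# A strong positive linear fit (r_squared near 1) is the accuracy-on-the-line
# pattern; under the paper's argument, such a benchmark cannot probe
# robustness to spurious correlations and may be misspecified.
```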

Taking the work a step further, the team examined whether spurious correlations appear in text as well as image datasets. After applying their framework to the Civil Comments dataset, a collection of 2 million public comments from independent news sites, they found that spurious correlations are also a concern in the language domain and that foundation models and large language models (LLMs) should be evaluated accordingly.

Ideas for Improvement

What should the AI community do differently, considering these findings? In their paper, the authors highlight three recommendations.

  • Benchmark selection: First, in selecting a benchmark to follow for domain generalization, researchers should prioritize benchmarks that do not exhibit the accuracy on the line phenomenon.

  • Benchmark evaluation: Second, when evaluating benchmarks, researchers should note that averaging results across datasets can obscure meaningful insights, as averages can hide critical variability (see the short sketch after this list).

  • Model selection: Third, selecting models based on evaluations that do not reflect the intended deployment may unintentionally reinforce reliance on spurious correlations. Therefore, researchers should consider alternative selection criteria.
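
As a tiny, hypothetical illustration of the second point, the sketch below compares two made-up models whose average accuracy across a benchmark's datasets is identical even though one of them collapses on a single distribution shift; the dataset names and numbers are invented.

```python
# Tiny, hypothetical illustration: two made-up models with identical average
# accuracy across a benchmark's datasets can still differ sharply where it
# matters, so per-dataset and worst-case numbers are worth reporting.
import numpy as np

datasets = ["hospital_A", "hospital_B", "hospital_C", "hospital_D"]
model_1  = np.array([0.86, 0.84, 0.85, 0.85])   # consistent across shifts
model_2  = np.array([0.97, 0.95, 0.93, 0.55])   # collapses on one shift

for name, accs in [("model_1", model_1), ("model_2", model_2)]:
    per_dataset = dict(zip(datasets, np.round(accs, 2).tolist()))
    print(f"{name}: mean={accs.mean():.2f}, worst={accs.min():.2f}, {per_dataset}")
# Both means are 0.85, but model_2's worst-case accuracy (0.55) would be
# unacceptable in a safety-critical deployment.
```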

Looking ahead, the Stanford team may extend their theoretical analysis to draw further insights about domain generalization benchmarks. They’d also like to study how evaluation metrics other than accuracy behave under distribution shifts and to better automate the detection of accuracy (and other metrics) on the line. Ultimately, they would like to help curate more reliable benchmarks for the domain generalization field.

“Robustness to spurious correlations remains a critical challenge in machine learning, if we want to ensure reliability and fairness of both predictive and generative models,” Koyejo says. “By identifying benchmarks that reliably test this pathology, our findings offer clearer guidance for evaluating and ultimately developing models that are truly robust to this problem.”

Want to learn more about this research? The scholars created a sandbox for experimenting with domain generalization and accuracy on the line: View the demo and the details on GitHub.

Contributor(s): Nikki Goth Itoi
