Head-to-Head Analysis

Model Comparison

Select any combination of models to compare side-by-side. Overlay their scores on radar charts, scenario bar charts, and detailed category tables. Use the delta mode to see differences from a baseline model.
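The delta mode described above can be sketched as a simple per-category subtraction against the baseline model's scores. This is an illustrative sketch, not the dashboard's actual implementation; the function name is ours, and the scores are taken from the matrix below.

```python
# Sketch of "delta mode": subtract a baseline model's category scores
# from each other selected model's scores. Illustrative only.
def score_deltas(scores, baseline):
    """Return each non-baseline model's per-category difference from the baseline."""
    base = scores[baseline]
    return {
        model: {cat: round(vals[cat] - base[cat], 1) for cat in vals}
        for model, vals in scores.items()
        if model != baseline
    }

# Two categories from the detailed score matrix below
scores = {
    "GPT-5.4":     {"Clinical Safety": 83.9, "Med. Accuracy": 92.5},
    "GPT-4.1":     {"Clinical Safety": 77.4, "Med. Accuracy": 87.1},
    "GPT-4o-mini": {"Clinical Safety": 64.5, "Med. Accuracy": 83.9},
}
print(score_deltas(scores, "GPT-4.1"))
```

With GPT-4.1 as baseline, GPT-5.4 shows +6.5 on Clinical Safety and +5.4 on Med. Accuracy, while GPT-4o-mini shows -12.9 and -3.2.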

Composite Score Ranking

General-Purpose profile — sorted by weighted composite

[Bar chart: weighted composite scores on a 0–100 scale for GPT-5.4, GPT-4.1, and GPT-4o-mini]
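A weighted composite of the kind used for this ranking could be computed as below. The General-Purpose profile's actual category weights are not shown in this view, so the equal weighting here is purely a placeholder assumption, and the function name is ours.

```python
# Hypothetical sketch of a weighted composite score. The real
# General-Purpose profile weights are not shown on this page;
# equal weights are used only for illustration.
def weighted_composite(category_scores, weights):
    """Weighted average of category scores; weights must sum to 1."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9
    return round(sum(weights[c] * s for c, s in category_scores.items()), 1)

# Three categories from the matrix below, equally weighted (assumption)
cats = {"Clinical Safety": 83.9, "Med. Accuracy": 92.5, "Calibration": 66.7}
weights = {c: 1 / len(cats) for c in cats}
print(weighted_composite(cats, weights))  # 81.0
```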

Category Profile Overlay

3 models across 11 evaluation categories

[Radar chart: the 11 evaluation categories (Clinical Safety, Med. Accuracy, Calibration, Mental Health Safety, Evidence Quality, Personalization, Communication, Bias & Fairness, Privacy & Trust, Usability, Robustness) on a 0–100 scale; series: GPT-5.4, GPT-4.1, GPT-4o-mini]

Category Score Comparison

Grouped bars — each color represents one model

[Grouped bar chart: the same 11 categories on a 0–100 scale, one bar color per model; series: GPT-5.4, GPT-4.1, GPT-4o-mini]

Detailed Score Matrix

All 11 categories × 3 models

| Category | GPT-5.4 | GPT-4.1 | GPT-4o-mini | Best | Spread |
| --- | --- | --- | --- | --- | --- |
| C1 Clinical Safety | 83.9% | 77.4% | 64.5% | 83.9% | 19.4 |
| C2 Med. Accuracy | 92.5% | 87.1% | 83.9% | 92.5% | 8.6 |
| C3 Calibration | 66.7% | 58.3% | 55.0% | 66.7% | 11.7 |
| C4 Mental Health Safety | 70.9% | 67.3% | 69.1% | 70.9% | 3.6 |
| C5 Evidence Quality | 87.5% | 80.0% | 57.5% | 87.5% | 30.0 |
| C6 Personalization | 62.9% | 71.4% | 45.7% | 71.4% | 25.7 |
| C7 Communication | 71.4% | 74.3% | 2.9% | 74.3% | 71.4 |
| C8 Bias & Fairness | 72.0% | 66.7% | 61.3% | 72.0% | 10.7 |
| C9 Privacy & Trust | 67.5% | 75.0% | 82.5% | 82.5% | 15.0 |
| C10 Usability | 52.0% | 52.0% | 12.0% | 52.0% | 40.0 |
| C11 Robustness | 92.0% | 80.0% | 80.0% | 92.0% | 12.0 |
| Composite | 75.2 (Acceptable) | 72.1 (Acceptable) | 56.8 (Failing) | 75.2 | 18.4 |
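In the matrix, Best is the highest score in each category row and Spread is the gap between the highest and lowest scores (e.g. Clinical Safety: 83.9 minus 64.5 gives 19.4). A minimal sketch of that computation (the function name is ours):

```python
# Best = max score in a category row; Spread = max - min.
def best_and_spread(row):
    """Return the top score and the max-min gap for one category row."""
    return max(row), round(max(row) - min(row), 1)

# Clinical Safety scores from the matrix: GPT-5.4, GPT-4.1, GPT-4o-mini
best, spread = best_and_spread([83.9, 77.4, 64.5])
print(best, spread)  # 83.9 19.4
```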

Critical Flag Comparison

Which models trigger which critical failure flags

| Flag | GPT-5.4 | GPT-4.1 | GPT-4o-mini |
| --- | --- | --- | --- |
| CF-03 | — | — | ✓ |
| Total Flags | 0 | 0 | 1 |

Selected Model Profiles

GPT-5.4

Composite: 75.2 · OpenAI · Acceptable

Pass Rate: 76.3% (416/545)

Flags: None

GPT-4.1

Composite: 72.1 · OpenAI · Acceptable

Pass Rate: 72.8% (397/545)

Flags: None

GPT-4o-mini

Composite: 56.8 · OpenAI · Failing

Pass Rate: 60.7% (331/545)

Flags: CF-03
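The pass rates in the profiles above are simply passed cases over total cases. A quick check against the reported figures (values from the profiles; the function name is ours):

```python
# Verify the reported pass rates: passed / total, as a percentage.
def pass_rate(passed, total):
    """Pass rate as a percentage rounded to one decimal place."""
    return round(100 * passed / total, 1)

assert pass_rate(416, 545) == 76.3  # GPT-5.4
assert pass_rate(397, 545) == 72.8  # GPT-4.1
assert pass_rate(331, 545) == 60.7  # GPT-4o-mini
```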