Head-to-Head Analysis
Model Comparison
Select any combination of models to compare side by side. Overlay their scores on radar charts, scenario bar charts, and detailed category tables, or use delta mode to show each model's differences from a chosen baseline.
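As a rough illustration of what delta mode computes, the sketch below subtracts a baseline model's per-category scores from every other model's scores. The function and data layout are assumptions for illustration only (plain dictionaries of category to percentage); the two sample categories and values are taken from the score matrix later in this section, and this is not the dashboard's actual implementation.

```python
# Minimal sketch of delta mode: per-category differences from a baseline model.
# Data layout (dict of model -> {category: percentage}) is an assumption for illustration.

def delta_from_baseline(scores: dict[str, dict[str, float]], baseline: str) -> dict[str, dict[str, float]]:
    """Return each model's per-category score minus the baseline model's score."""
    base = scores[baseline]
    return {
        model: {cat: round(vals[cat] - base[cat], 1) for cat in base}
        for model, vals in scores.items()
        if model != baseline
    }

scores = {
    "GPT-5.4":     {"C5 Evidence Quality": 87.5, "C9 Privacy & Trust": 67.5},
    "GPT-4.1":     {"C5 Evidence Quality": 80.0, "C9 Privacy & Trust": 75.0},
    "GPT-4o-mini": {"C5 Evidence Quality": 57.5, "C9 Privacy & Trust": 82.5},
}

print(delta_from_baseline(scores, baseline="GPT-5.4"))
# {'GPT-4.1': {'C5 Evidence Quality': -7.5, 'C9 Privacy & Trust': 7.5},
#  'GPT-4o-mini': {'C5 Evidence Quality': -30.0, 'C9 Privacy & Trust': 15.0}}
```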
Composite Score Ranking
General-Purpose profile — sorted by weighted composite
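The ranking is a weighted average of category scores under the selected profile. The profile weights are not shown in this section, so the weights in the sketch below are hypothetical placeholders; only the category scores for GPT-5.4 and GPT-4.1 come from the matrix below.

```python
# Sketch of a weighted composite ranking. The weights here are hypothetical
# placeholders, not the actual General-Purpose profile weights.

def composite(category_scores: dict[str, float], weights: dict[str, float]) -> float:
    total_weight = sum(weights.values())
    return sum(category_scores[c] * w for c, w in weights.items()) / total_weight

weights = {"C1 Clinical Safety": 2.0, "C2 Med. Accuracy": 2.0, "C7 Communication": 1.0}  # hypothetical
models = {
    "GPT-5.4": {"C1 Clinical Safety": 83.9, "C2 Med. Accuracy": 92.5, "C7 Communication": 71.4},
    "GPT-4.1": {"C1 Clinical Safety": 77.4, "C2 Med. Accuracy": 87.1, "C7 Communication": 74.3},
}

ranking = sorted(models, key=lambda m: composite(models[m], weights), reverse=True)
print(ranking)  # ['GPT-5.4', 'GPT-4.1'] under these placeholder weights
```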
Category Profile Overlay
Radar chart overlaying 3 models across 11 evaluation categories: GPT-5.4, GPT-4.1, and GPT-4o-mini.
Category Score Comparison
Grouped bar chart, one color per model: GPT-5.4, GPT-4.1, and GPT-4o-mini.
Detailed Score Matrix
All 11 categories × 3 models
| Category | GPT-5.4 | GPT-4.1 | GPT-4o-mini | Best | Spread (pts) |
|---|---|---|---|---|---|
| C1 Clinical Safety | 83.9% | 77.4% | 64.5% | 83.9% | 19.4 |
| C2 Med. Accuracy | 92.5% | 87.1% | 83.9% | 92.5% | 8.6 |
| C3 Calibration | 66.7% | 58.3% | 55.0% | 66.7% | 11.7 |
| C4 Mental Health Safety | 70.9% | 67.3% | 69.1% | 70.9% | 3.6 |
| C5 Evidence Quality | 87.5% | 80.0% | 57.5% | 87.5% | 30.0 |
| C6 Personalization | 62.9% | 71.4% | 45.7% | 71.4% | 25.7 |
| C7 Communication | 71.4% | 74.3% | 2.9% | 74.3% | 71.4 |
| C8 Bias & Fairness | 72.0% | 66.7% | 61.3% | 72.0% | 10.7 |
| C9 Privacy & Trust | 67.5% | 75.0% | 82.5% | 82.5% | 15.0 |
| C10 Usability | 52.0% | 52.0% | 12.0% | 52.0% | 40.0 |
| C11 Robustness | 92.0% | 80.0% | 80.0% | 92.0% | 12.0 |
| Composite | 75.2 (Acceptable) | 72.1 (Acceptable) | 56.8 (Failing) | 75.2 | 18.4 |
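The Best column is the highest score in each row, and Spread is the gap between the highest and lowest score in percentage points. A minimal sketch of that derivation, using the C5 Evidence Quality row from the matrix above (the helper function itself is illustrative, not part of the dashboard):

```python
# Best = highest score in the row; Spread = highest minus lowest, in percentage points.

def best_and_spread(row: dict[str, float]) -> tuple[float, float]:
    hi, lo = max(row.values()), min(row.values())
    return hi, round(hi - lo, 1)

evidence_quality = {"GPT-5.4": 87.5, "GPT-4.1": 80.0, "GPT-4o-mini": 57.5}
print(best_and_spread(evidence_quality))  # (87.5, 30.0)
```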
Critical Flag Comparison
Which models trigger which critical failure flags
| Flag | GPT-5.4 | GPT-4.1 | GPT-4o-mini |
|---|---|---|---|
| CF-03 | Not triggered | Not triggered | Triggered |
| Total Flags | 0 | 0 | 1 |
Selected Model Profiles
GPT-5.4
Composite 75.2 · OpenAI · Acceptable
Pass Rate: 76.3% (416/545)
Flags: None
GPT-4.1
Composite 72.1 · OpenAI · Acceptable
Pass Rate: 72.8% (397/545)
Flags: None
GPT-4o-mini
Composite 56.8 · OpenAI · Failing
Pass Rate: 60.7% (331/545)
Flags: CF-03