Scenario-Based Analysis
Four Evaluation Scenarios
Each scenario reflects a distinct real-world interaction mode with different risk profiles and evaluation priorities. A single composite score applied uniformly across scenarios is misleading.
S1: Consumer / Patient-Facing
Highest direct patient safety risk
Tier 3 evaluation required before deployment
In this scenario, no professional clinical judgment filters AI outputs before they reach the user. Safety, crisis detection, calibration, mental health safety, and communication quality therefore receive the highest weights.
Consumer Sub-Scores
All models ranked by S1 sub-score
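| Rank | Model | S1 Sub-Score |
|---|---|---|
| 1 | Grok-4-Fast | 95.7 |
| 2 | Grok-4.20 | 94.5 |
| 3 | Grok-3 | 87.5 |
| 4 | Claude 3.5 Sonnet | 86.8 |
| 5 | Claude Sonnet 4 | 85.3 |
| 6 | Claude Sonnet 4-6 | 84.0 |
| 7 | GPT-5.4 | 74.4 |
| 8 | Sonar Pro | 72.6 |
| 9 | Sonar (Base) | 71.9 |
| 10 | GPT-4.1 | 71.6 |
| 11 | Sonar Reasoning Pro | 68.3 |
| 12 | GPT-4o-mini | 55.6 |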
Category Profile: Top 3 Models (S1)
[Figure: 11-category radar comparison of the three highest-scoring models in this scenario: Grok-4-Fast, Grok-4.20, and Grok-3.]
Cross-Scenario Score Comparison
[Figure: sub-scores across all four scenarios (Consumer, Clinician, Benchmark, Med. Search) for all evaluated models.]
Scenario-by-Category Priority Matrix
| Category | S1 Consumer | S2 Clinician | S3 Benchmark | S4 Med. Search |
|---|---|---|---|---|
| C1 Clinical Safety | High | Medium | Medium | High |
| C2 Med. Accuracy | Medium | High | High | Medium |
| C3 Calibration | High | High | High | Medium |
| C4 Mental Health Safety | High | Medium | Low | High |
| C5 Evidence Quality | Medium | High | High | Medium |
| C6 Personalization | Medium | High | Low | High |
| C7 Communication | High | Medium | Low | High |
| C8 Bias & Fairness | Medium | Medium | Medium | High |
| C9 Privacy & Trust | Medium | Medium | Low | High |
| C10 Usability | Low | High | Low | Medium |
| C11 Robustness | Low | High | High | Medium |
High = 3 weight units
Medium = 2 weight units
Low = 1 weight unit
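
The sketch below shows one way the priority matrix and weight units could be turned into per-scenario sub-scores. It assumes the sub-score is a normalized weighted average of the 11 category scores, which the page does not state explicitly. The function names (`scenario_weights`, `scenario_sub_score`) and the category scores in the usage lines are hypothetical and for illustration only.

```python
# Weight units taken from the legend: High = 3, Medium = 2, Low = 1.
WEIGHT_UNITS = {"High": 3, "Medium": 2, "Low": 1}

# Column order of the priority matrix.
SCENARIOS = ("S1", "S2", "S3", "S4")

# Priority matrix: category -> (S1 Consumer, S2 Clinician, S3 Benchmark, S4 Med. Search).
PRIORITY_MATRIX = {
    "C1": ("High", "Medium", "Medium", "High"),    # Clinical Safety
    "C2": ("Medium", "High", "High", "Medium"),    # Med. Accuracy
    "C3": ("High", "High", "High", "Medium"),      # Calibration
    "C4": ("High", "Medium", "Low", "High"),       # Mental Health Safety
    "C5": ("Medium", "High", "High", "Medium"),    # Evidence Quality
    "C6": ("Medium", "High", "Low", "High"),       # Personalization
    "C7": ("High", "Medium", "Low", "High"),       # Communication
    "C8": ("Medium", "Medium", "Medium", "High"),  # Bias & Fairness
    "C9": ("Medium", "Medium", "Low", "High"),     # Privacy & Trust
    "C10": ("Low", "High", "Low", "Medium"),       # Usability
    "C11": ("Low", "High", "High", "Medium"),      # Robustness
}


def scenario_weights(scenario):
    """Return normalized category weights (summing to 1) for one scenario."""
    idx = SCENARIOS.index(scenario)
    units = {cat: WEIGHT_UNITS[levels[idx]] for cat, levels in PRIORITY_MATRIX.items()}
    total = sum(units.values())
    return {cat: u / total for cat, u in units.items()}


def scenario_sub_score(category_scores, scenario):
    """ASSUMPTION: sub-score = weighted average of category scores (0-100 scale)."""
    weights = scenario_weights(scenario)
    return sum(weights[cat] * category_scores[cat] for cat in weights)


# Hypothetical model: strong on Clinical Safety, weak on Robustness.
hypothetical_scores = {cat: 80.0 for cat in PRIORITY_MATRIX}
hypothetical_scores["C1"] = 95.0
hypothetical_scores["C11"] = 60.0

for s in SCENARIOS:
    print(s, round(scenario_sub_score(hypothetical_scores, s), 1))
```

Under these assumptions, the same set of category scores yields a higher sub-score in S1 (where Clinical Safety is High and Robustness is Low) than in S3 (where the priorities are reversed), which illustrates why a single uniform composite would hide scenario-specific differences.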