Scenario-Based Analysis

Four Evaluation Scenarios

Each scenario reflects a distinct real-world interaction mode with different risk profiles and evaluation priorities. A single composite score applied uniformly across scenarios is misleading.

S1: Consumer / Patient-Facing

Highest direct patient safety risk
Tier 3 evaluation required before deployment

No clinician is present to filter AI outputs with professional judgment, so safety, crisis detection, calibration, mental health safety, and communication quality receive the highest weights.

Consumer Sub-Scores

All models ranked by S1 sub-score

1. Grok-4-Fast: 95.7
2. Grok-4.20: 94.5
3. Grok-3: 87.5
4. Claude 3.5 Sonnet: 86.8
5. Claude Sonnet 4: 85.3
6. Claude Sonnet 4-6: 84.0
7. GPT-5.4: 74.4
8. Sonar Pro: 72.6
9. Sonar (Base): 71.9
10. GPT-4.1: 71.6
11. Sonar Reasoning Pro: 68.3
12. GPT-4o-mini: 55.6

Category Profile — Top 3 Models (S1)

11-category radar comparison for highest-scoring models in this scenario

[Radar chart. Axes: Clinical Safety, Med. Accuracy, Calibration, Mental Health Safety, Evidence Quality, Personalization, Communication, Bias & Fairness, Privacy & Trust, Usability, Robustness. Series: Grok-4-Fast, Grok-4.20, Grok-3.]

Cross-Scenario Score Comparison

Sub-scores across all four scenarios for all evaluated models

[Grouped bar chart. Horizontal axis: Grok-4-Fast, Grok-4.20, Grok-3, Claude 3.5 Sonnet, Claude Sonnet 4, Claude Sonnet 4-6, GPT-5.4, Sonar Pro, Sonar (Base), GPT-4.1, Sonar Reasoning Pro, GPT-4o-mini. Vertical axis: sub-score, 0 to 100. Series: Consumer, Clinician, Benchmark, Med. Search.]

Scenario-by-Category Priority Matrix

Category                    S1 Consumer   S2 Clinician   S3 Benchmark   S4 Med. Search
C1   Clinical Safety        High          Medium         Medium         High
C2   Med. Accuracy          Medium        High           High           Medium
C3   Calibration            High          High           High           Medium
C4   Mental Health Safety   High          Medium         Low            High
C5   Evidence Quality       Medium        High           High           Medium
C6   Personalization        Medium        High           Low            High
C7   Communication          High          Medium         Low            High
C8   Bias & Fairness        Medium        Medium         Medium         High
C9   Privacy & Trust        Medium        Medium         Low            High
C10  Usability              Low           High           Low            Medium
C11  Robustness             Low           High           High           Medium

High = 3 units of weight
Medium = 2 units of weight
Low = 1 unit of weight
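
To make the weighting concrete, here is a minimal Python sketch of how the priority matrix and the 3/2/1 weight units could produce a scenario sub-score. It assumes the sub-score is a weighted mean of per-category scores on a 0 to 100 scale; the page does not state the exact aggregation formula, so treat that as an assumption. The function names and the example category profile are hypothetical and are not taken from any evaluated model.

```python
# Weight units from the legend: High = 3, Medium = 2, Low = 1.
WEIGHT_UNITS = {"High": 3, "Medium": 2, "Low": 1}

SCENARIOS = ("S1", "S2", "S3", "S4")  # Consumer, Clinician, Benchmark, Med. Search

# Priority matrix transcribed from the table above, in SCENARIOS order.
PRIORITY = {
    "C1": ("High", "Medium", "Medium", "High"),    # Clinical Safety
    "C2": ("Medium", "High", "High", "Medium"),    # Med. Accuracy
    "C3": ("High", "High", "High", "Medium"),      # Calibration
    "C4": ("High", "Medium", "Low", "High"),       # Mental Health Safety
    "C5": ("Medium", "High", "High", "Medium"),    # Evidence Quality
    "C6": ("Medium", "High", "Low", "High"),       # Personalization
    "C7": ("High", "Medium", "Low", "High"),       # Communication
    "C8": ("Medium", "Medium", "Medium", "High"),  # Bias & Fairness
    "C9": ("Medium", "Medium", "Low", "High"),     # Privacy & Trust
    "C10": ("Low", "High", "Low", "Medium"),       # Usability
    "C11": ("Low", "High", "High", "Medium"),      # Robustness
}


def scenario_units(scenario):
    """Return the raw weight units per category for one scenario."""
    idx = SCENARIOS.index(scenario)
    return {cat: WEIGHT_UNITS[levels[idx]] for cat, levels in PRIORITY.items()}


def sub_score(category_scores, scenario):
    """Weighted mean of category scores (ASSUMED aggregation, not confirmed by the source)."""
    units = scenario_units(scenario)
    total = sum(units.values())
    return sum(units[cat] * category_scores[cat] for cat in units) / total


# Hypothetical profile: 80 everywhere, strong communication, weak robustness.
example_profile = {cat: 80.0 for cat in PRIORITY}
example_profile["C7"] = 100.0
example_profile["C11"] = 40.0

for scenario in SCENARIOS:
    print(scenario, round(sub_score(example_profile, scenario), 1))
```

Under this assumption the same hypothetical profile scores higher in S1, where Communication is weighted High and Robustness Low, than in S3, where those priorities are reversed. This is the effect that a single uniformly weighted composite score would hide.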