LLM Operational Governance Standards v1.1 — Real Test Results
LOGS Benchmark Dashboard
Live results from 545 test cases across 12 evaluated models spanning 4 platforms (OpenAI, xAI, Anthropic, Perplexity). Composite scores reflect deployment-context weighted averages of 4 real-world interaction scenarios.
Deployment Profile:
Models Evaluated
12
across 4 scenarios
Avg Composite Score
79.1
General-Purpose profile
Avg Bias Disparity
N/A
not collected this run
Critical Flag Alerts
8
models with CF flags
Composite Score Rankings
Sorted by overall LOGS score (0–100)
Acceptable (70)
Good (80)
Excellent (90)
Model Detail
GPT-5.4
Large Language Model (OpenAI)
65.8
AcceptableConsumer
74
Clinician
75
Benchmark
79
Med. Search
74
Model Profiles — 12 Evaluated Systems (545 Cases Each)
Grok-4-Fast
Large Language Model (xAI)
95.4
ExcellentHighest overall pass rate (95.6%). Perfect C4 Mental Health Safety (100%) and C9 Privacy & Trust (100%). No critical flags.
Clinical Safety
96.8
Mental Health Safety
100
Med. Accuracy
93.5
Bias & Fairness
97.3
Grok-4.20
Large Language Model (xAI)
93.8
ExcellentCF-05 — Deployment Suspended
Perfect scores in C1 Clinical Safety (100%), C7 Communication (100%), C8 Bias & Fairness (100%), C11 Robustness (100%). C5 Evidence Quality (65.0%) is a notable weakness. CF-05 flagged.
Clinical Safety
100
Mental Health Safety
98.2
Med. Accuracy
96.8
Bias & Fairness
100
Grok-3
Large Language Model (xAI)
87.3
GoodCF-03, CF-05 — Deployment Suspended
Strong C5 Evidence Quality (97.5%) and C8 Bias & Fairness (97.3%). CF-03 and CF-05 flagged.
Clinical Safety
80.6
Mental Health Safety
92.7
Med. Accuracy
87.1
Bias & Fairness
97.3
Claude 3.5 Sonnet
Large Language Model (Anthropic)
86.7
GoodCF-05 — Deployment Suspended
Perfect C7 Communication (100%) and C11 Robustness (100%). Strong C4 Mental Health Safety (96.4%). C3 Calibration (66.7%) and C5 Evidence Quality (67.5%) are weakest. CF-05 flagged.
Clinical Safety
83.9
Mental Health Safety
96.4
Med. Accuracy
92.5
Bias & Fairness
89.3
Claude Sonnet 4
Large Language Model (Anthropic)
84.9
GoodNear-perfect C4 (98.2%), C9 (97.5%), and perfect C11 (100%). C3 Calibration (56.7%) is a significant weakness. No critical flags.
Clinical Safety
85.5
Mental Health Safety
98.2
Med. Accuracy
91.4
Bias & Fairness
88
Claude Sonnet 4-6
Large Language Model (Anthropic)
83.4
GoodCF-03, CF-05 — Deployment Suspended
Highest C2 Medical Accuracy (95.7%) among Claude models. C3 Calibration (56.7%) is weakest. CF-03 and CF-05 flagged.
Clinical Safety
83.9
Mental Health Safety
94.5
Med. Accuracy
95.7
Bias & Fairness
89.3
GPT-5.4
Large Language Model (OpenAI)
75.2
AcceptableHighest performer. Strong C2 (92.5%) and C11 (92.0%). C10 Usability (52.0%) is a notable weakness.
Clinical Safety
83.9
Mental Health Safety
70.9
Med. Accuracy
92.5
Bias & Fairness
72
Sonar Pro
Large Language Model (Perplexity)
73.4
AcceptableCF-05 — Deployment Suspended
Perfect C11 Robustness (100%). C7 Communication (37.1%) and C5 Evidence Quality (47.5%) are critical weaknesses. CF-05 flagged.
Clinical Safety
83.9
Mental Health Safety
92.7
Med. Accuracy
82.8
Bias & Fairness
82.7
Sonar (Base)
Large Language Model (Perplexity)
72.6
AcceptableCF-05 — Deployment Suspended
Perfect C11 Robustness (100%). C7 Communication (37.1%) and C5 Evidence Quality (42.5%) are critical weaknesses. CF-05 flagged.
Clinical Safety
83.9
Mental Health Safety
92.7
Med. Accuracy
87.1
Bias & Fairness
88
GPT-4.1
Large Language Model (OpenAI)
72.1
AcceptableRe-graded with v2 rubric (original lenient-rubric score: 77.8%). Solid C2 (87.1%) and C7 (74.3%). C3 Calibration (58.3%) is weakest category.
Clinical Safety
77.4
Mental Health Safety
67.3
Med. Accuracy
87.1
Bias & Fairness
66.7
Sonar Reasoning Pro
Large Language Model (Perplexity)
67.7
Below StandardCF-05 — Deployment Suspended
Lowest overall Sonar model despite being newest. C10 Usability (28.0%) and C5 Evidence Quality (40.0%) are critical weaknesses. CF-05 flagged.
Clinical Safety
79
Mental Health Safety
94.5
Med. Accuracy
75.3
Bias & Fairness
76
GPT-4o-mini
Large Language Model (OpenAI)
56.8
FailingCF-03 — Deployment Suspended
Category field was blank in raw data; scores inferred from case_id prefix. C7 Communication Quality score of 2.9% is a critical finding requiring investigation. C10 Usability (12.0%) also very low.
Clinical Safety
64.5
Mental Health Safety
69.1
Med. Accuracy
83.9
Bias & Fairness
61.3
Rating Tier Definitions
90–100
Excellent
Suitable for deployment with standard monitoring
80–89
Good
Suitable for deployment with enhanced monitoring in flagged categories
70–79
Acceptable
Conditional deployment; improvement requirements in below-threshold categories
60–69
Below Standard
Not recommended for deployment; significant improvement required
0–59
Failing
Not suitable for deployment