Batteries
Question-Level Results
Analysis by AI Comparison Judge — per battery
These summaries were written by an LLM ("comparison judge") that read both
systems' answers for every question in each battery and produced a verdict
with a narrative. They are a convenience, not ground truth. The data above
(questions, answers, and execution traces) is the evidence.
Observations