What the seven UX dimensions actually measure, and why they're scored separately instead of folded into one number.
Adding interaction tools to evaluation provides a quantitative way to measure LLM agents’ user experience — placing UX on equal footing with task accuracy.
The Interaction-as-a-Tool (IaaT) paradigm decomposes interaction preferences into discrete dimensions, each measurable on its own.
With preference history, the Interaction Preference Alignment score rises from 3.276 to 3.882 on average (+18.5%), across the four models evaluated.
Optimizing for one preference dimension (e.g., Transparency & Auditability) can also improve related dimensions (e.g., lower Cognitive Load), which gives a guide for targeted agent optimization.
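The +18.5% figure for Interaction Preference Alignment can be checked with a few lines of arithmetic. This is a minimal sketch that assumes the percentage is the relative gain over the no-history average; the variable names are illustrative and not taken from the benchmark code.

```python
# Averages quoted above for Interaction Preference Alignment (1-5 scale).
baseline = 3.276       # without preference history
with_history = 3.882   # with preference history

# Assumption: "+18.5%" is the gain relative to the baseline average.
absolute_gain = with_history - baseline
relative_gain_pct = absolute_gain / baseline * 100

print(f"absolute gain: {absolute_gain:.3f}")      # absolute gain: 0.606
print(f"relative gain: {relative_gain_pct:.1f}%")  # relative gain: 18.5%
```

On a 1–5 scale the absolute gain is about 0.6 points, which is worth keeping in mind next to the relative figure.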
Why seven dimensions
Task accuracy alone can't tell you whether an agent was pleasant to work with. PrefIx scores interaction quality across seven complementary UX dimensions, grounded in HCI literature — not derived from task correctness.
They're distinct, not redundant: inter-dimension correlations average Spearman |ρ| = 0.530, and every pairwise correlation stays below 0.70, so the dimensions are related but no pair is interchangeable. Across the benchmark's LLM judges, the seven dimensions still show excellent internal consistency (Cronbach's α = 0.943), and multiple judges agree on the overall construct (ICC(2,k) > 0.79): distinct facets that converge on one coherent UX construct.
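For readers who want to see what these two statistics compute, here is a minimal pure-Python sketch of mean pairwise |ρ| and Cronbach's α. The scores and dimension names below are made up for illustration and are not benchmark data; the authors' actual analysis pipeline is not described on this page, and ICC(2,k) is left out for brevity.

```python
from itertools import combinations
from statistics import mean, variance


def average_ranks(values):
    """Rank values 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        tied_rank = (start + end) / 2 + 1  # mean of 1-based positions
        for i in order[start:end + 1]:
            ranks[i] = tied_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed scores."""
    return pearson(average_ranks(x), average_ranks(y))


def cronbach_alpha(columns):
    """columns[d][i] is the score of conversation i on dimension d."""
    k = len(columns)
    item_variance_sum = sum(variance(col) for col in columns)
    totals = [sum(row) for row in zip(*columns)]
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


# Hypothetical 1-5 judge scores for six conversations on three dimensions.
scores = {
    "dimension_a": [1, 2, 3, 4, 5, 3],
    "dimension_b": [2, 1, 3, 5, 4, 3],
    "dimension_c": [1, 3, 2, 4, 5, 4],
}

pairwise = [abs(spearman(a, b)) for a, b in combinations(scores.values(), 2)]
print(f"mean |rho| = {mean(pairwise):.3f}, max |rho| = {max(pairwise):.3f}")
print(f"Cronbach's alpha = {cronbach_alpha(list(scores.values())):.3f}")
```

The two numbers answer different questions: pairwise |ρ| asks whether any two dimensions are near-duplicates, while α asks whether all of them move together well enough to be read as one scale. That is why moderate correlations and a high α can coexist.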
Each is scored 1–5 by an LLM judge. The anchors below are the exact rubric text the judges see, shown here for the two endpoints only: poor (1) and ideal (5).
Whether the agent proposes actions or interruptions at opportune moments — neither prematurely (causing disruption) nor belatedly (missing the moment of need).
Poor (1): Acts too early or delays often; repeatedly interrupts flow.
Ideal (5): Consistently acts at the right time with no unnecessary pauses.
The logical consistency and connectedness of the ongoing exchange. A coherent agent recalls prior events, avoids off-topic responses, and refrains from abrupt topic shifts.
Poor (1): Frequent memory loss, contradictions, or unexplained reversals.
Ideal (5): Fully self-consistent end to end with no unnecessary repeats.
How well the agent correctly infers and remains aligned with the user's goals and preferences over time, avoiding attention decay on foundational constraints.
Poor (1): Clearly drifts from the user goal, ignores clarified intent.
Ideal (5): Tightly aligned to user intent throughout with no drift.
Whether the agent honors implied or explicit commitments and behaves in line with user expectations without contradicting or deviating without justification.
Poor (1): Promises and actions diverge badly with no explanation.
Ideal (5): All commitments met promptly or fully justified when not.
How quickly and effortlessly users achieve goals — minimizing unnecessary steps, delays, and cognitive effort while delivering value with speed and minimal friction.
Poor (1): Heavy redundancy or repeated asks; very inefficient.
Ideal (5): Minimal turns, no visible redundancy or repeats.
The mental effort required to perform a task relative to working-memory capacity. Agents with predictable behavior reduce reported cognitive load and increase user trust.
Poor (1): Cognitive load rises; user gets more confused over time.
Ideal (5): Significantly lowers load; progress is always clear.
A holistic assessment encompassing reuse intention, perceived trust, interaction smoothness, and perceived reliability. Aggregate of the seven individual UX dimensions.
Poor (1): Poor experience; would not reuse.
Ideal (5): Excellent — orderly, reliable, not annoying.
Interaction Preference Alignment isn't one of the seven UX dimensions — it measures whether the agent's behavior matched what the user actually asked for. Theoretically it sits inside the Effectiveness branch below, alongside Intent Alignment Drift and Commitment Consistency; practically, the leaderboard and demo report it as its own composite score.
How effectively the agent's autonomy, information density, decision-making logic, and communicative style match, adapt to, and remain consistent with a user's stated or implicit interaction preferences.
Poor (1): Strongly misaligned with the persona's interaction preferences; repeated behaviors that contradict stated style/trajectory.
Ideal (5): Fully aligned end-to-end with the persona's interaction preferences and trajectory.
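A rubric card like the ones above is a definition plus endpoint anchors, so it maps naturally onto a small record type. The sketch below is one possible representation, not the benchmark's code: the `RubricCard` class, its field names, and the `key` value are hypothetical, and treating judge scores as integers is an assumption. Only the definition and anchor strings are copied from the card text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RubricCard:
    key: str         # hypothetical identifier, not a name from the benchmark
    definition: str
    poor: str        # rubric anchor for a score of 1
    ideal: str       # rubric anchor for a score of 5

    def check_score(self, score: int) -> int:
        """Accept only integer scores on the 1-5 scale (integer scores are an assumption)."""
        if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
            raise ValueError(f"{self.key}: expected an integer from 1 to 5, got {score!r}")
        return score


preference_alignment = RubricCard(
    key="preference_alignment",
    definition=(
        "How effectively the agent's autonomy, information density, "
        "decision-making logic, and communicative style match, adapt to, and "
        "remain consistent with a user's stated or implicit interaction preferences."
    ),
    poor=(
        "Strongly misaligned with the persona's interaction preferences; "
        "repeated behaviors that contradict stated style/trajectory."
    ),
    ideal="Fully aligned end-to-end with the persona's interaction preferences and trajectory.",
)

print(preference_alignment.check_score(4))
```

Rejecting out-of-range values at the point where a judge's output is parsed keeps malformed scores from silently skewing the averages reported elsewhere.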
Overall UX is the holistic aggregate. Below, the ISO 9241-11 usability triad groups the other dimensions — including Preference Alignment — into three branches.
Effectiveness
Satisfaction
Efficiency
Grounded in the ISO 9241-11 usability triad (Effectiveness, Efficiency, Satisfaction) and independently cross-validated by Zhao et al.'s SPHERE survey of 39 human-AI papers (ACL Findings 2025), which converged on the same three goals. Source: the authors' ACL rebuttal on theoretical grounding, not the published paper text.
Cite this work
@misc{li2026prefixunderstandadaptuser,
  title={PrefIx: Understand and Adapt to User Preference in Human-Agent Interaction},
  author={Jialin Li and Zhenhao Chen and Hanjun Luo and Hanan Salam},
  year={2026},
  eprint={2602.06714},
  archivePrefix={arXiv},
  primaryClass={cs.HC},
  url={https://arxiv.org/abs/2602.06714},
}