About PrefIx

What the seven UX dimensions actually measure, and why they're scored separately instead of folded into one number.

Unify task and interaction evaluation

Adding interaction tools to evaluation gives a quantitative measure of the user experience an LLM agent delivers, placing UX on equal footing with task accuracy.

Decompose preferences into measurable dimensions

The Interaction-as-a-Tool (IaaT) paradigm decomposes interaction preferences into discrete dimensions, each measurable on its own.

Personalization measurably improves alignment

With preference history, the Interaction Preference Alignment score rises from 3.276 to 3.882 on average (+18.5%), across the four models evaluated.
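The percentage above follows directly from the two averages. A minimal sketch of the arithmetic; the function name `relative_gain` is chosen here for illustration and is not part of PrefIx:

```python
def relative_gain(before: float, after: float) -> float:
    """Percentage change from `before` to `after`."""
    return (after - before) / before * 100.0


# Averages quoted in the text above.
baseline = 3.276       # without preference history
with_history = 3.882   # with preference history

gain = relative_gain(baseline, with_history)
print(f"{with_history - baseline:+.3f} points, {gain:+.1f}%")  # +0.606 points, +18.5%
```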

One dimension can lift another

Optimizing for a specific preference dimension (e.g., Transparency & Auditability) can also improve related dimensions (e.g., lowering Cognitive Load), which can guide targeted agent optimization.

Why seven dimensions

Task accuracy alone can't tell you whether an agent was pleasant to work with. PrefIx scores interaction quality across seven complementary UX dimensions, grounded in HCI literature — not derived from task correctness.

They're distinct, not redundant: inter-dimension correlations average Spearman |ρ| = 0.530, and every pairwise correlation stays below 0.70, so the dimensions are related but no pair is close enough to be interchangeable. At the same time, across the benchmark's LLM judges the seven dimensions show excellent internal consistency (Cronbach's α = 0.943), and multiple judges agree on the overall construct (ICC(2,k) > 0.79). The dimensions are distinct facets that still converge on one coherent UX construct.
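For readers unfamiliar with these statistics, here is a minimal standard-library sketch of how a mean absolute pairwise Spearman correlation and Cronbach's α can be computed from per-dimension scores. The three score columns are invented toy data, the helper names are hypothetical, and this is not the benchmark's own analysis code.

```python
from itertools import combinations
from statistics import mean, variance


def cronbach_alpha(columns):
    """Cronbach's alpha for k items, each a list of scores over the same cases."""
    k = len(columns)
    item_var_sum = sum(variance(col) for col in columns)
    totals = [sum(case) for case in zip(*columns)]  # per-case sum across items
    return k / (k - 1) * (1 - item_var_sum / variance(totals))


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            result[idx] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(ranks(x), ranks(y))


def mean_abs_spearman(columns):
    """Average |rho| over every pair of columns."""
    return mean(abs(spearman(a, b)) for a, b in combinations(columns, 2))


# Toy data (assumption): three dimensions scored 1-5 on six trajectories.
timing = [4, 3, 5, 2, 4, 3]
coherence = [5, 3, 4, 2, 4, 4]
efficiency = [4, 2, 5, 3, 3, 3]
dims = [timing, coherence, efficiency]

print(f"mean |rho| = {mean_abs_spearman(dims):.3f}")
print(f"alpha      = {cronbach_alpha(dims):.3f}")
```

A moderate mean |ρ| together with a high α is the pattern the text describes: the items move together enough to form one scale, without any pair being interchangeable.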

The Seven UX Dimensions

Each is scored 1–5 by an LLM judge. The anchors below are the exact rubric text the judges see, shown first as poor (1) vs. ideal (5) and then as the full 1–5 ladder.

Initiative Timing

1–5

Whether the agent proposes actions or interruptions at opportune moments — neither prematurely (causing disruption) nor belatedly (missing the moment of need).

1 · Poor

Acts too early or delays often; repeatedly interrupts flow.

5 · Ideal

Consistently acts at the right time with no unnecessary pauses.

Full 1–5 rubric:
  1. Acts too early or delays often; repeatedly interrupts flow.
  2. Occasional premature/late actions that disrupt pace or add chatter.
  3. Generally timely with minor acceptable delays or early moves.
  4. Solid timing with only negligible waits or interruptions.
  5. Consistently acts at the right time with no unnecessary pauses.
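
As a rough illustration of how a ladder like the one above might be passed to an LLM judge and the reply checked, the sketch below stores the anchors in a dictionary, renders them as prompt text, and validates a returned score. The prompt layout and the names `INITIATIVE_TIMING_RUBRIC`, `render_rubric`, and `validate_score` are assumptions made for this example; the benchmark's actual judge prompt is not shown on this page.

```python
# Anchor text copied from the Initiative Timing rubric above.
INITIATIVE_TIMING_RUBRIC = {
    1: "Acts too early or delays often; repeatedly interrupts flow.",
    2: "Occasional premature/late actions that disrupt pace or add chatter.",
    3: "Generally timely with minor acceptable delays or early moves.",
    4: "Solid timing with only negligible waits or interruptions.",
    5: "Consistently acts at the right time with no unnecessary pauses.",
}


def render_rubric(name, rubric):
    """Render a rubric as plain text (hypothetical prompt layout)."""
    lines = [f"Dimension: {name}", "Score 1-5 using these anchors:"]
    lines += [f"{score}. {text}" for score, text in sorted(rubric.items())]
    return "\n".join(lines)


def validate_score(raw, rubric):
    """Parse a judge's reply and reject anything outside the rubric's scale."""
    score = int(raw.strip())  # raises ValueError for non-integer replies
    if score not in rubric:
        raise ValueError(f"score {score} is outside the rubric range")
    return score


print(render_rubric("Initiative Timing", INITIATIVE_TIMING_RUBRIC))
print(validate_score(" 4 ", INITIATIVE_TIMING_RUBRIC))
```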

Interaction Coherence

1–5

The logical consistency and connectedness of the ongoing exchange. A coherent agent recalls prior events, avoids off-topic responses, and refrains from abrupt topic shifts.

1 · Poor

Frequent memory loss, contradictions, or unexplained reversals.

5 · Ideal

Fully self-consistent end to end with no unnecessary repeats.

Full 1–5 rubric:
  1. Frequent memory loss, contradictions, or unexplained reversals.
  2. Repeated confirmations or logic jumps that hurt coherence.
  3. Mostly coherent with minor repeats or small contradictions.
  4. Clear, consistent, rarely repetitive or contradictory.
  5. Fully self-consistent end to end with no unnecessary repeats.

Intent Alignment Drift

1–5

How well the agent correctly infers and remains aligned with the user's goals and preferences over time, avoiding attention decay on foundational constraints.

1 · Poor

Clearly drifts from the user goal, ignores clarified intent.

5 · Ideal

Tightly aligned to user intent throughout with no drift.

Full 1–5 rubric:
  1. Clearly drifts from the user goal, ignores clarified intent.
  2. Often reuses old goals or misreads intent, needs user fixes.
  3. Mostly follows latest intent with occasional minor drift.
  4. Stays on latest intent with rare, quickly corrected slips.
  5. Tightly aligned to user intent throughout with no drift.

Commitment Consistency

1–5

Whether the agent honors implied or explicit commitments and behaves in line with user expectations without contradicting or deviating without justification.

1 · Poor

Promises and actions diverge badly with no explanation.

5 · Ideal

All commitments met promptly or fully justified when not.

Full 1–5 rubric:
  1. Promises and actions diverge badly with no explanation.
  2. Multiple broken promises or thin explanations, trust erosion.
  3. Generally delivers with occasional gaps and some explanation.
  4. Nearly all commitments met; rare delays well explained.
  5. All commitments met promptly or fully justified when not.

Interaction Efficiency

1–5

How quickly and effortlessly users achieve goals — minimizing unnecessary steps, delays, and cognitive effort while delivering value with speed and minimal friction.

1 · Poor

Heavy redundancy or repeated asks; very inefficient.

5 · Ideal

Minimal turns, no visible redundancy or repeats.

Full 1–5 rubric:
  1. Heavy redundancy or repeated asks; very inefficient.
  2. Many redundancies; path could be clearly shorter.
  3. Acceptable efficiency with some redundancy.
  4. Lean flow with only rare noncritical extras.
  5. Minimal turns, no visible redundancy or repeats.

User Cognitive Load Trajectory

1–5

The mental effort required to perform a task relative to working-memory capacity. Agents with predictable behavior reduce reported cognitive load and increase user trust.

1 · Poor

Cognitive load rises; user gets more confused over time.

5 · Ideal

Significantly lowers load; progress is always clear.

Full 1–5 rubric:
  1. Cognitive load rises; user gets more confused over time.
  2. Introduces unnecessary complexity repeatedly; load increases.
  3. Load stays mostly flat with minor swings.
  4. Reduces uncertainty over time; user gets clearer.
  5. Significantly lowers load; progress is always clear.

Overall User Experience

1–5

A holistic assessment encompassing reuse intention, perceived trust, interaction smoothness, and perceived reliability. Aggregate of the seven individual UX dimensions.

1 · Poor

Poor experience; would not reuse.

5 · Ideal

Excellent — orderly, reliable, not annoying.

Full 1–5 rubric:
  1. Poor experience; would not reuse.
  2. Subpar; trust/flow noticeably hurt.
  3. Acceptable but average experience.
  4. Good experience; would reuse.
  5. Excellent — orderly, reliable, not annoying.

A Closer Look: Preference Alignment

Interaction Preference Alignment isn't one of the seven UX dimensions. It measures whether the agent's behavior matched the user's stated or implicit interaction preferences. Theoretically it sits inside the Effectiveness branch below, alongside Intent Alignment Drift and Commitment Consistency; practically, the leaderboard and demo report it as its own composite score.

Interaction Preference Alignment

1–5

How effectively the agent's autonomy, information density, decision-making logic, and communicative style match, adapt to, and remain consistent with a user's stated or implicit interaction preferences.

1 · Poor

Strongly misaligned with the persona's interaction preferences; repeated behaviors that contradict stated style/trajectory.

5 · Ideal

Fully aligned end-to-end with the persona's interaction preferences and trajectory.

Full 1–5 rubric:
  1. Strongly misaligned with the persona's interaction preferences; repeated behaviors that contradict stated style/trajectory.
  2. Mostly misaligned; frequent clashes with preferences, only occasional alignment.
  3. Mixed adherence; some turns follow preferences, some ignore or contradict them.
  4. Mostly aligned; follows preferences with only minor, isolated deviations.
  5. Fully aligned end-to-end with the persona's interaction preferences and trajectory.

How the Dimensions Relate

Overall UX is the holistic aggregate. Below, the ISO 9241-11 usability triad groups the other dimensions — including Preference Alignment — into three branches.

Overall UX — holistic aggregate

Effectiveness

  • Pref. Alignment
  • Alignment Drift
  • Consistency

Satisfaction

  • Initiative
  • Coherence
  • Cognitive Load

Efficiency

  • Efficiency

Grounded in the ISO 9241-11 usability triad (Effectiveness, Efficiency, Satisfaction) and independently cross-validated by Zhao et al.'s SPHERE survey of 39 human-AI papers (ACL Findings 2025), which converged on the same three goals. Source: the authors' ACL rebuttal on theoretical grounding, not the published paper text.
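
The grouping above can be written down as a plain mapping, which is handy when summarizing scores by branch. This is a sketch under stated assumptions: the dictionary mirrors the diagram, but the per-branch mean computed by the hypothetical `branch_means` helper is only an illustrative summary, not a score PrefIx itself reports, and the example scores are invented.

```python
from statistics import mean

# Branch membership as shown in the diagram above.
ISO_BRANCHES = {
    "Effectiveness": [
        "Interaction Preference Alignment",
        "Intent Alignment Drift",
        "Commitment Consistency",
    ],
    "Satisfaction": [
        "Initiative Timing",
        "Interaction Coherence",
        "User Cognitive Load Trajectory",
    ],
    "Efficiency": ["Interaction Efficiency"],
}


def branch_means(scores):
    """Average the 1-5 scores within each branch (illustrative summary only)."""
    return {
        branch: mean(scores[dim] for dim in dims)
        for branch, dims in ISO_BRANCHES.items()
    }


# Invented scores for a single trajectory.
example_scores = {
    "Interaction Preference Alignment": 4,
    "Intent Alignment Drift": 5,
    "Commitment Consistency": 3,
    "Initiative Timing": 4,
    "Interaction Coherence": 4,
    "User Cognitive Load Trajectory": 3,
    "Interaction Efficiency": 4,
}

for branch, value in branch_means(example_scores).items():
    print(f"{branch}: {value:.2f}")
```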

Cite this work

@misc{li2026prefixunderstandadaptuser,
      title={PrefIx: Understand and Adapt to User Preference in Human-Agent Interaction},
      author={Jialin Li and Zhenhao Chen and Hanjun Luo and Hanan Salam},
      year={2026},
      eprint={2602.06714},
      archivePrefix={arXiv},
      primaryClass={cs.HC},
      url={https://arxiv.org/abs/2602.06714},
}