Results Analysis

Personalization doesn't help every preference category equally. These charts break the aggregate numbers down by the four preference categories to show where adaptation pays off most.

Alignment Gain by Preference Category

Transparency & Auditability sees the largest alignment gains from personalization — agents are already better at adapting to transparency-related preferences than to ones requiring holistic changes to global interaction patterns (Strategy & Initiative, Robustness & Adaptability).

Which Category Drives Which UX Gain

Robustness & Adaptability is the dominant driver of Interaction Efficiency gains; Transparency & Auditability dominates Cognitive Load and Initiative Timing gains — evidence that alignment isn't monolithic, and different preference categories pull different UX levers.

| Category | Initiative | Coherence | Alignment Drift | Consistency | Efficiency | Cognitive Load | Overall UX |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Transparency & Auditability | +0.39 | +0.16 | +0.17 | +0.24 | +0.20 | +0.48 | +0.49 |
| Interaction Pace & Flow | +0.28 | +0.09 | +0.10 | +0.15 | +0.18 | +0.27 | +0.37 |
| Strategy & Initiative | +0.26 | +0.10 | +0.10 | +0.07 | +0.14 | +0.16 | +0.29 |
| Robustness & Adaptability | +0.25 | +0.19 | +0.13 | +0.10 | +0.23 | +0.23 | +0.31 |

In the heatmap, darker cells indicate larger gains. The outlined cell in each column is the category that dominates that UX dimension.
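The "dominant category" for each UX dimension can be read off the table as a per-column maximum. The following is a minimal sketch of that lookup, not the authors' analysis code: the gain values are copied from the table above, while the names `DIMENSIONS`, `GAINS`, and `dominant_categories` are hypothetical and introduced only for illustration.

```python
# Gains copied from the table above. Each list follows the column order
# in DIMENSIONS. All names here are illustrative, not from the paper.
DIMENSIONS = [
    "Initiative", "Coherence", "Alignment Drift", "Consistency",
    "Efficiency", "Cognitive Load", "Overall UX",
]

GAINS = {
    "Transparency & Auditability": [0.39, 0.16, 0.17, 0.24, 0.20, 0.48, 0.49],
    "Interaction Pace & Flow":     [0.28, 0.09, 0.10, 0.15, 0.18, 0.27, 0.37],
    "Strategy & Initiative":       [0.26, 0.10, 0.10, 0.07, 0.14, 0.16, 0.29],
    "Robustness & Adaptability":   [0.25, 0.19, 0.13, 0.10, 0.23, 0.23, 0.31],
}


def dominant_categories(gains, dimensions):
    """Map each UX dimension to the category with the largest gain in it."""
    result = {}
    for col, dim in enumerate(dimensions):
        # max over category names, keyed by that category's value in this column
        result[dim] = max(gains, key=lambda cat: gains[cat][col])
    return result


if __name__ == "__main__":
    for dim, cat in dominant_categories(GAINS, DIMENSIONS).items():
        print(f"{dim}: {cat}")
```

On these values, Robustness & Adaptability comes out on top for Coherence and Efficiency, and Transparency & Auditability for the remaining five columns, which matches the pattern described in the text.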

Key Takeaways

Current LLMs lack interaction preference sensitivity

Even with preference history, alignment averages only 3.882/5.0 — a consistent but unsaturated +18.5% gain, not a solved problem.
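To make "unsaturated" concrete, the arithmetic can be sketched as follows. This assumes the +18.5% is a relative gain over the non-personalized baseline mean, which the text does not state explicitly. The baseline computed here is therefore an implied value under that assumption, not a reported result, and `implied_baseline` is a hypothetical helper.

```python
# Figures from the takeaway above: personalized mean alignment and its gain.
PERSONALIZED_MEAN = 3.882   # on a 5.0-point scale
RELATIVE_GAIN = 0.185       # ASSUMPTION: gain is relative to the baseline mean
SCALE_MAX = 5.0


def implied_baseline(personalized, relative_gain):
    """Back out the baseline b from personalized = b * (1 + relative_gain)."""
    return personalized / (1.0 + relative_gain)


baseline = implied_baseline(PERSONALIZED_MEAN, RELATIVE_GAIN)
headroom = SCALE_MAX - PERSONALIZED_MEAN  # distance left to a perfect score

if __name__ == "__main__":
    print(f"implied baseline (under the assumption above): {baseline:.3f}")
    print(f"remaining headroom to {SCALE_MAX}: {headroom:.3f}")
```

Under that reading, more than one full point of the 5-point scale remains unclaimed even with preference history available.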

Interaction Efficiency remains the weakest dimension

It is the lowest-scoring UX dimension in both the baseline and personalized conditions, so reducing unnecessary steps remains hard.

Preference-UX coupling enables targeted optimization

Specific preference categories disproportionately drive specific UX gains (e.g., Robustness & Adaptability → Efficiency), guiding where to focus model training.

Stronger baselines gain less

Claude Opus 4.5 and Kimi K2 show more modest UX gains (+3.6–3.8%) from personalization, suggesting stronger baseline models have limited room for further improvement.

Category aggregates are computed from the real per-task LLM-judge scores across all 31 preference settings and 4 models (not estimated). The takeaways above are drawn from the authors' own discussion of results (ACL rebuttal) and arXiv 2602.06714v1.