Benchmarking LLM agents on task accuracy and interaction-level user experience — evaluating how well agents adapt to individual interaction preferences across 14 attributes and 31 preference settings.
Last updated Feb 6, 2026
Agent receives interaction history and infers user preferences
| Model | Task | Interaction Preference | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| # | Provider · Model | Accuracy ↕0–100% | Overall ↕/5 | UX Avg ▼/5 | Initiative ↕/5 | Coherence ↕/5 | Intent Aln. ↕/5 | Consistency ↕/5 | Efficiency ↕/5 | Cogn. Load ↕/5 | Overall UX ↕/5 | |
| 1 | Gemini 3 Flash Google DeepMind | 50.7% | 4.152 | 4.190 | 4.319 | 3.947 | 4.808 | 4.746 | 3.394 | 3.971 | 4.145 | |
| 2 | Kimi K2 Moonshot AI | 45.3% | 3.461 | 3.802 | 3.916 | 3.785 | 4.514 | 4.364 | 3.122 | 3.393 | 3.519 | |
| 3 | Claude Opus 4.5 Anthropic | 69.0% | 3.983 | 3.703 | 3.871 | 3.407 | 4.317 | 4.247 | 2.838 | 3.462 | 3.777 | |
| 4 | Claude Sonnet 4.5 Anthropic | 62.5% | 3.930 | 3.546 | 3.752 | 3.229 | 4.284 | 4.044 | 2.663 | 3.249 | 3.601 | |
Source: arXiv 2602.06714v1, Tables 2–3, Figure 3. Interaction Preference scores on a Likert 1–5 scale (higher is better). Tool Accuracy: Subset-Matched Response-based Evaluation.
The leaderboard above tells you who's ahead. These three views help show why — and by how much.
Every model runs twice: once on generic instructions, once after seeing your interaction history. The grey bar is the default; the green bar is after adapting to you — the bigger the gap, the more it improved.
7-dimension average, Likert 1–5
How well behavior matches your stated preferences
Subset-matched response evaluation, %
Interaction quality breaks into six separate skills, one shape per model. A larger, more even shape is reliably good across the board; a lopsided one shows exactly where it struggles. Overall UX above each chart summarizes all six at once.
6 UX dimensions, scale 1–5
6 UX dimensions, scale 1–5
6 UX dimensions, scale 1–5
6 UX dimensions, scale 1–5
Prefer precise figures over shapes? Every score is shown before and after personalization, plus the difference. Green deltas mean a model measurably improved once it understood your preferences.
| Metric | Gemini 3 Flash | Claude Opus 4.5 | Claude Sonnet 4.5 | Kimi K2 | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Before | After | Δ | Before | After | Δ | Before | After | Δ | Before | After | Δ | |
| Tool Accuracy | 49.7% | 50.7% | +1.1pp | 66.7% | 69.0% | +2.3pp | 42.1% | 62.5% | +20.5pp | 40.6% | 45.3% | +4.7pp |
| Pref. Alignment | 3.142 | 4.152 | +1.010 | 3.429 | 3.983 | +0.554 | 3.210 | 3.930 | +0.720 | 3.324 | 3.461 | +0.137 |
| UX Average | 3.754 | 4.190 | +0.436 | 3.569 | 3.703 | +0.134 | 3.184 | 3.546 | +0.362 | 3.671 | 3.802 | +0.131 |
| Initiative | 3.745 | 4.319 | +0.574 | 3.654 | 3.871 | +0.217 | 3.298 | 3.752 | +0.454 | 3.737 | 3.916 | +0.179 |
| Coherence | 3.684 | 3.947 | +0.263 | 3.381 | 3.407 | +0.026 | 2.994 | 3.229 | +0.235 | 3.643 | 3.785 | +0.142 |
| Alignment Drift | 4.674 | 4.808 | +0.134 | 4.264 | 4.317 | +0.053 | 3.989 | 4.284 | +0.295 | 4.433 | 4.514 | +0.081 |
| Consistency | 4.593 | 4.746 | +0.153 | 4.137 | 4.247 | +0.110 | 3.701 | 4.044 | +0.343 | 4.216 | 4.364 | +0.148 |
| Efficiency | 2.956 | 3.394 | +0.438 | 2.805 | 2.838 | +0.033 | 2.422 | 2.663 | +0.241 | 2.945 | 3.122 | +0.177 |
| Cognitive Load | 3.283 | 3.971 | +0.688 | 3.269 | 3.462 | +0.193 | 2.807 | 3.249 | +0.442 | 3.304 | 3.393 | +0.089 |
| Overall UX | 3.343 | 4.145 | +0.802 | 3.470 | 3.777 | +0.307 | 3.079 | 3.601 | +0.522 | 3.416 | 3.519 | +0.103 |