First UX benchmark for interaction preference

The PrefIx Leaderboard

Benchmarking LLM agents on task accuracy and interaction-level user experience — evaluating how well agents adapt to individual interaction preferences across 14 attributes and 31 preference settings.

4 Models
7 UX Dimensions
31 Preference Settings
14 Preference Attributes
283 Samples

Rankings

Last updated Feb 6, 2026

Agent receives interaction history and infers user preferences

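Column groups: Task (Accuracy) and Interaction Preference (all /5 columns).

| # | Model | Provider | Accuracy (0–100%) | Overall /5 | UX Avg /5 | Initiative /5 | Coherence /5 | Intent Aln. /5 | Consistency /5 | Efficiency /5 | Cogn. Load /5 | Overall UX /5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Gemini 3 Flash | Google DeepMind | 50.7% | 4.152 | 4.190 | 4.319 | 3.947 | 4.808 | 4.746 | 3.394 | 3.971 | 4.145 |
| 2 | Kimi K2 | Moonshot AI | 45.3% | 3.461 | 3.802 | 3.916 | 3.785 | 4.514 | 4.364 | 3.122 | 3.393 | 3.519 |
| 3 | Claude Opus 4.5 | Anthropic | 69.0% | 3.983 | 3.703 | 3.871 | 3.407 | 4.317 | 4.247 | 2.838 | 3.462 | 3.777 |
| 4 | Claude Sonnet 4.5 | Anthropic | 62.5% | 3.930 | 3.546 | 3.752 | 3.229 | 4.284 | 4.044 | 2.663 | 3.249 | 3.601 |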

Source: arXiv 2602.06714v1, Tables 2–3, Figure 3. Interaction Preference scores on a Likert 1–5 scale (higher is better). Tool Accuracy: Subset-Matched Response-based Evaluation.
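The page does not say which column determines the ranking. As a quick check, the sketch below sorts the four rows by two candidate columns. The listed order coincides with descending UX Avg rather than with accuracy, but treating UX Avg as the sort key is an inference from the numbers, not something the source confirms.

```python
# Rows copied from the leaderboard: (model, accuracy %, UX Avg /5).
rows = [
    ("Gemini 3 Flash", 50.7, 4.190),
    ("Kimi K2", 45.3, 3.802),
    ("Claude Opus 4.5", 69.0, 3.703),
    ("Claude Sonnet 4.5", 62.5, 3.546),
]


def rank_by(rows, column):
    """Return model names sorted by the given column index, highest first."""
    return [r[0] for r in sorted(rows, key=lambda r: r[column], reverse=True)]


by_ux_avg = rank_by(rows, 2)     # candidate sort key: UX Avg (assumption)
by_accuracy = rank_by(rows, 1)   # candidate sort key: task accuracy

print("By UX Avg:  ", by_ux_avg)
print("By accuracy:", by_accuracy)
```

The two orderings differ: Claude Opus 4.5 has the highest accuracy yet sits third in the listed ranking, so accuracy and interaction scores are worth reading separately.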

Results, Visualized

The leaderboard above tells you who's ahead. These three views help show why — and by how much.

Does telling the agent your preferences actually help?

Every model runs twice: once on generic instructions, once after seeing your interaction history. The grey bar is the default; the green bar is after adapting to you — the bigger the gap, the more it improved.

UX Average: 7-dimension average, Likert 1–5

Preference Alignment: How well behavior matches your stated preferences

Tool Accuracy: Subset-matched response evaluation, %
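UX Average is described as a 7-dimension average. A minimal sketch of that arithmetic follows, assuming the seven dimensions are the six individual scores from the leaderboard plus Overall UX. That grouping is an inference: the page does not list the seven explicitly, though the reported values are reproduced under this assumption.

```python
from statistics import mean

# Leaderboard scores per model, in the order:
# Initiative, Coherence, Intent Aln., Consistency, Efficiency, Cogn. Load, Overall UX.
# Assumption: these seven columns are what "7-dimension average" refers to.
scores = {
    "Gemini 3 Flash": [4.319, 3.947, 4.808, 4.746, 3.394, 3.971, 4.145],
    "Kimi K2": [3.916, 3.785, 4.514, 4.364, 3.122, 3.393, 3.519],
    "Claude Opus 4.5": [3.871, 3.407, 4.317, 4.247, 2.838, 3.462, 3.777],
    "Claude Sonnet 4.5": [3.752, 3.229, 4.284, 4.044, 2.663, 3.249, 3.601],
}

# UX Avg values as reported on the leaderboard.
reported_ux_avg = {
    "Gemini 3 Flash": 4.190,
    "Kimi K2": 3.802,
    "Claude Opus 4.5": 3.703,
    "Claude Sonnet 4.5": 3.546,
}

# Unweighted mean of the seven scores for each model.
computed_ux_avg = {model: mean(dims) for model, dims in scores.items()}

for model, value in computed_ux_avg.items():
    print(f"{model}: computed {value:.3f}, reported {reported_ux_avg[model]:.3f}")
```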

What does “good at interaction” actually look like?

Interaction quality breaks into six separate skills, one shape per model. A larger, more even shape is reliably good across the board; a lopsided one shows exactly where it struggles. Overall UX above each chart summarizes all six at once.

[Radar charts, one per model: 6 UX dimensions, scale 1–5. Overall UX shown before and after personalization.]

Gemini 3 Flash (Google): Overall UX 3.34 before, 4.14 after
Claude Opus 4.5 (Anthropic): Overall UX 3.47 before, 3.78 after
Claude Sonnet 4.5 (Anthropic): Overall UX 3.08 before, 3.60 after
Kimi K2 (Moonshot): Overall UX 3.42 before, 3.52 after

The improvement, in exact numbers

Prefer precise figures over shapes? Every score is shown before and after personalization, plus the difference. Green deltas mean a model measurably improved once it understood your preferences.

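| Metric | Gemini 3 Flash Before | Gemini 3 Flash After | Gemini 3 Flash Δ | Claude Opus 4.5 Before | Claude Opus 4.5 After | Claude Opus 4.5 Δ | Claude Sonnet 4.5 Before | Claude Sonnet 4.5 After | Claude Sonnet 4.5 Δ | Kimi K2 Before | Kimi K2 After | Kimi K2 Δ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Tool Accuracy | 49.7% | 50.7% | +1.1pp | 66.7% | 69.0% | +2.3pp | 42.1% | 62.5% | +20.5pp | 40.6% | 45.3% | +4.7pp |
| Pref. Alignment | 3.142 | 4.152 | +1.010 | 3.429 | 3.983 | +0.554 | 3.210 | 3.930 | +0.720 | 3.324 | 3.461 | +0.137 |
| UX Average | 3.754 | 4.190 | +0.436 | 3.569 | 3.703 | +0.134 | 3.184 | 3.546 | +0.362 | 3.671 | 3.802 | +0.131 |
| Initiative | 3.745 | 4.319 | +0.574 | 3.654 | 3.871 | +0.217 | 3.298 | 3.752 | +0.454 | 3.737 | 3.916 | +0.179 |
| Coherence | 3.684 | 3.947 | +0.263 | 3.381 | 3.407 | +0.026 | 2.994 | 3.229 | +0.235 | 3.643 | 3.785 | +0.142 |
| Alignment Drift | 4.674 | 4.808 | +0.134 | 4.264 | 4.317 | +0.053 | 3.989 | 4.284 | +0.295 | 4.433 | 4.514 | +0.081 |
| Consistency | 4.593 | 4.746 | +0.153 | 4.137 | 4.247 | +0.110 | 3.701 | 4.044 | +0.343 | 4.216 | 4.364 | +0.148 |
| Efficiency | 2.956 | 3.394 | +0.438 | 2.805 | 2.838 | +0.033 | 2.422 | 2.663 | +0.241 | 2.945 | 3.122 | +0.177 |
| Cognitive Load | 3.283 | 3.971 | +0.688 | 3.269 | 3.462 | +0.193 | 2.807 | 3.249 | +0.442 | 3.304 | 3.393 | +0.089 |
| Overall UX | 3.343 | 4.145 | +0.802 | 3.470 | 3.777 | +0.307 | 3.079 | 3.601 | +0.522 | 3.416 | 3.519 | +0.103 |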
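Each Δ is the after-personalization score minus the before score. The sketch below recomputes the deltas for the Overall UX and Tool Accuracy rows from the displayed figures. Recomputing from the rounded accuracy percentages gives 1.0pp for Gemini 3 Flash and 20.4pp for Claude Sonnet 4.5, each 0.1pp below the published delta. One plausible explanation, which is an assumption here, is that the published deltas were computed from unrounded values.

```python
# (before, after) pairs copied from the Overall UX row of the table.
overall_ux = {
    "Gemini 3 Flash": (3.343, 4.145),
    "Claude Opus 4.5": (3.470, 3.777),
    "Claude Sonnet 4.5": (3.079, 3.601),
    "Kimi K2": (3.416, 3.519),
}

# (before, after) pairs copied from the Tool Accuracy row, in percent.
tool_accuracy = {
    "Gemini 3 Flash": (49.7, 50.7),
    "Claude Opus 4.5": (66.7, 69.0),
    "Claude Sonnet 4.5": (42.1, 62.5),
    "Kimi K2": (40.6, 45.3),
}


def deltas(table, digits):
    """Return after minus before for each model, rounded to `digits` places."""
    return {m: round(after - before, digits) for m, (before, after) in table.items()}


ux_delta = deltas(overall_ux, 3)       # Likert points
acc_delta = deltas(tool_accuracy, 1)   # percentage points, from rounded inputs

# List models by Overall UX improvement, largest first.
for model in sorted(ux_delta, key=ux_delta.get, reverse=True):
    print(f"{model}: Overall UX {ux_delta[model]:+.3f}, accuracy {acc_delta[model]:+.1f}pp")
```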