First UX benchmark for interaction preference

The PrefIx Leaderboard

Benchmarking LLM agents on task accuracy and interaction-level user experience — evaluating how well agents adapt to individual interaction preferences across 14 attributes and 31 preference settings.

4 Models

7 UX Dimensions

31 Preference Settings

14 Preference Attributes

283 Samples

Rankings

Last updated Feb 6, 2026

Agent receives interaction history and infers user preferences

#	Model · Provider	Task Accuracy (0–100%)	Pref. Alignment (/5)	UX Avg ▼ (/5)	Initiative (/5)	Coherence (/5)	Intent Aln. (/5)	Consistency (/5)	Efficiency (/5)	Cogn. Load (/5)	Overall UX (/5)
1	Gemini 3 Flash · Google DeepMind	50.7%	4.152	4.190	4.319	3.947	4.808	4.746	3.394	3.971	4.145
2	Kimi K2 · Moonshot AI	45.3%	3.461	3.802	3.916	3.785	4.514	4.364	3.122	3.393	3.519
3	Claude Opus 4.5 · Anthropic	69.0%	3.983	3.703	3.871	3.407	4.317	4.247	2.838	3.462	3.777
4	Claude Sonnet 4.5 · Anthropic	62.5%	3.930	3.546	3.752	3.229	4.284	4.044	2.663	3.249	3.601

Source: arXiv 2602.06714v1, Tables 2–3, Figure 3. Interaction Preference scores on a Likert 1–5 scale (higher is better). Tool Accuracy: Subset-Matched Response-based Evaluation.
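The UX Avg column can be reproduced from the rankings table itself: for every model it equals the mean of the seven per-dimension columns (Initiative through Overall UX), rounded to three decimals. The following is a minimal Python sketch of that check, with the scores copied from the rows above. The names `UX_SCORES`, `TASK_ACCURACY`, `ux_average`, and `rank_by` are illustrative choices made here, not part of the benchmark's tooling.

```python
from statistics import mean

# Per-model UX scores copied from the rankings table, in column order:
# Initiative, Coherence, Intent Aln., Consistency, Efficiency, Cogn. Load, Overall UX
UX_SCORES = {
    "Gemini 3 Flash": [4.319, 3.947, 4.808, 4.746, 3.394, 3.971, 4.145],
    "Kimi K2": [3.916, 3.785, 4.514, 4.364, 3.122, 3.393, 3.519],
    "Claude Opus 4.5": [3.871, 3.407, 4.317, 4.247, 2.838, 3.462, 3.777],
    "Claude Sonnet 4.5": [3.752, 3.229, 4.284, 4.044, 2.663, 3.249, 3.601],
}

# Task accuracy in percent, copied from the same table.
TASK_ACCURACY = {
    "Gemini 3 Flash": 50.7,
    "Kimi K2": 45.3,
    "Claude Opus 4.5": 69.0,
    "Claude Sonnet 4.5": 62.5,
}


def ux_average(scores):
    """Mean of the seven UX dimension scores, rounded like the table."""
    return round(mean(scores), 3)


def rank_by(values):
    """Model names ordered from highest to lowest value."""
    return sorted(values, key=values.get, reverse=True)


ux_avg = {model: ux_average(scores) for model, scores in UX_SCORES.items()}

print(ux_avg)
print("By UX Avg:       ", rank_by(ux_avg))
print("By task accuracy:", rank_by(TASK_ACCURACY))
```

Sorting by UX Avg reproduces the order shown in the table, with Gemini 3 Flash first. Sorting by task accuracy instead puts Claude Opus 4.5 first and Kimi K2 last, so the two criteria rank the same four models quite differently.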

Results, Visualized

The leaderboard above tells you who's ahead. These three views help show why — and by how much.

Does telling the agent your preferences actually help?

Every model runs twice: once on generic instructions, once after seeing your interaction history. The grey bar is the default; the green bar is after adapting to you — the bigger the gap, the more it improved.

UX Average: 7-dimension average, Likert 1–5

Preference Alignment: how well behavior matches your stated preferences

Tool Accuracy: subset-matched response evaluation, %

What does “good at interaction” actually look like?

Interaction quality breaks into six separate skills, drawn as one shape per model. A larger, more even shape means the model is reliably good across the board; a lopsided one shows exactly where it struggles. The Overall UX score above each chart summarizes all six at once.

Overall UX per model, before → after (each chart plots 6 UX dimensions, scale 1–5):

Gemini 3 Flash (Google): 3.34 → 4.14

Claude Opus 4.5 (Anthropic): 3.47 → 3.78

Claude Sonnet 4.5 (Anthropic): 3.08 → 3.60

Kimi K2 (Moonshot): 3.42 → 3.52
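The "larger, more even shape" reading can be approximated numerically. The sketch below is one possible way to do so, not the leaderboard's own method: it treats the mean of the six dimension scores as the size of the shape and their population standard deviation as its unevenness, and it reports the lowest-scoring dimension. It assumes the six plotted dimensions are the six non-overall columns of the rankings table, and it uses the after-personalization scores from that table. `DIMENSIONS`, `SCORES`, and `shape_summary` are hypothetical names.

```python
from statistics import mean, pstdev

# Assumed to be the six dimensions drawn in each chart (rankings table, minus Overall UX).
DIMENSIONS = ["Initiative", "Coherence", "Intent Aln.", "Consistency", "Efficiency", "Cogn. Load"]

# After-personalization scores copied from the rankings table, in DIMENSIONS order.
SCORES = {
    "Gemini 3 Flash": [4.319, 3.947, 4.808, 4.746, 3.394, 3.971],
    "Kimi K2": [3.916, 3.785, 4.514, 4.364, 3.122, 3.393],
    "Claude Opus 4.5": [3.871, 3.407, 4.317, 4.247, 2.838, 3.462],
    "Claude Sonnet 4.5": [3.752, 3.229, 4.284, 4.044, 2.663, 3.249],
}


def shape_summary(scores):
    """Summarize a six-dimension profile as size, spread, and weakest dimension."""
    weakest_score, weakest_name = min(zip(scores, DIMENSIONS))
    return {
        "size": round(mean(scores), 2),      # larger means higher scores overall
        "spread": round(pstdev(scores), 2),  # smaller means a more even shape
        "weakest": weakest_name,
    }


summaries = {model: shape_summary(scores) for model, scores in SCORES.items()}
for model, summary in summaries.items():
    print(model, summary)
```

Under these assumptions, the weakest dimension is Efficiency for all four models, which is the kind of lopsidedness the charts are meant to expose.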

The improvement, in exact numbers

Prefer precise figures over shapes? Every score is shown before and after personalization, plus the difference. Green deltas mean a model measurably improved once it understood your preferences.

Metric	Gemini 3 Flash			Claude Opus 4.5			Claude Sonnet 4.5			Kimi K2
	Before	After	Δ	Before	After	Δ	Before	After	Δ	Before	After	Δ
Tool Accuracy	49.7%	50.7%	+1.1pp	66.7%	69.0%	+2.3pp	42.1%	62.5%	+20.5pp	40.6%	45.3%	+4.7pp
Pref. Alignment	3.142	4.152	+1.010	3.429	3.983	+0.554	3.210	3.930	+0.720	3.324	3.461	+0.137
UX Average	3.754	4.190	+0.436	3.569	3.703	+0.134	3.184	3.546	+0.362	3.671	3.802	+0.131
Initiative	3.745	4.319	+0.574	3.654	3.871	+0.217	3.298	3.752	+0.454	3.737	3.916	+0.179
Coherence	3.684	3.947	+0.263	3.381	3.407	+0.026	2.994	3.229	+0.235	3.643	3.785	+0.142
Alignment Drift	4.674	4.808	+0.134	4.264	4.317	+0.053	3.989	4.284	+0.295	4.433	4.514	+0.081
Consistency	4.593	4.746	+0.153	4.137	4.247	+0.110	3.701	4.044	+0.343	4.216	4.364	+0.148
Efficiency	2.956	3.394	+0.438	2.805	2.838	+0.033	2.422	2.663	+0.241	2.945	3.122	+0.177
Cognitive Load	3.283	3.971	+0.688	3.269	3.462	+0.193	2.807	3.249	+0.442	3.304	3.393	+0.089
Overall UX	3.343	4.145	+0.802	3.470	3.777	+0.307	3.079	3.601	+0.522	3.416	3.519	+0.103
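Each Δ in the table is the after score minus the before score. The following Python sketch recomputes the deltas for three of the rows from the displayed values and finds the model with the largest gain per metric; the dictionary and function names are illustrative. One caveat: because the displayed scores are already rounded, recomputing from them gives +1.0pp and +20.4pp for the Tool Accuracy of Gemini 3 Flash and Claude Sonnet 4.5, rather than the published +1.1pp and +20.5pp. The published figures were presumably derived from unrounded scores, which is an assumption here, not something the page states.

```python
# (before, after) pairs copied from the table above.
TOOL_ACCURACY = {  # percent
    "Gemini 3 Flash": (49.7, 50.7),
    "Claude Opus 4.5": (66.7, 69.0),
    "Claude Sonnet 4.5": (42.1, 62.5),
    "Kimi K2": (40.6, 45.3),
}
PREF_ALIGNMENT = {  # Likert 1-5
    "Gemini 3 Flash": (3.142, 4.152),
    "Claude Opus 4.5": (3.429, 3.983),
    "Claude Sonnet 4.5": (3.210, 3.930),
    "Kimi K2": (3.324, 3.461),
}
OVERALL_UX = {  # Likert 1-5
    "Gemini 3 Flash": (3.343, 4.145),
    "Claude Opus 4.5": (3.470, 3.777),
    "Claude Sonnet 4.5": (3.079, 3.601),
    "Kimi K2": (3.416, 3.519),
}


def deltas(metric, digits=3):
    """After-minus-before difference per model, rounded to `digits` decimals."""
    return {model: round(after - before, digits) for model, (before, after) in metric.items()}


def largest_gain(metric):
    """Name of the model whose score improved the most on this metric."""
    gains = deltas(metric)
    return max(gains, key=gains.get)


print("Tool Accuracy (pp):", deltas(TOOL_ACCURACY, digits=1))
print("Pref. Alignment:   ", deltas(PREF_ALIGNMENT))
print("Overall UX:        ", deltas(OVERALL_UX))
print("Largest accuracy gain:", largest_gain(TOOL_ACCURACY))
print("Largest alignment gain:", largest_gain(PREF_ALIGNMENT))
```

The Likert deltas recomputed this way match the table to three decimals. Claude Sonnet 4.5 shows the largest Tool Accuracy gain, while Gemini 3 Flash shows the largest gains in Pref. Alignment and Overall UX.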