Statistics

Benchmark composition and balance, showing that the evaluation treats every preference setting and model equally.

Overview

| Statistic | Value |
| --- | --- |
| System Tool Classes | 10 |
| Interaction Tools | 13 (10 Narrative + 3 Control) |
| Preference Settings | 31 |
| Preference Attributes | 14 |
| Avg Samples / Setting | ~10 |
| Total Samples | 283 |

Interaction Tools (13 total)

| Tool Name | Type | Description |
| --- | --- | --- |
| Message_tool_invocation | Narrative (Type 1) | Announces which tool will be called and why. |
| Message_tool_invocation_logic | Narrative (Type 1) | Explains the reasoning behind tool selection. |
| Message_display_params | Narrative (Type 1) | Shows parameters being passed to a tool. |
| Message_source_report | Narrative (Type 1) | Reports the source of retrieved information. |
| Message_show_sequential_output | Narrative (Type 1) | Displays output of each step in a chain. |
| Message_show_layered_presentation | Narrative (Type 1) | Presents summary first, then detail on demand. |
| Message_tool_failure_notice | Narrative (Type 1) | Notifies user of a tool failure briefly. |
| Message_tool_failure_logic | Narrative (Type 1) | Explains the root cause when a tool fails. |
| Message_tool_switch_notice | Narrative (Type 1) | Informs user when switching to an alternative tool. |
| Message_tool_abort | Narrative (Type 1) | Halts the workflow and notifies user on failure. |
| Message_confirmation | Control (Type 2) | Requests explicit user confirmation before proceeding. |
| Message_information_seeking | Control (Type 2) | Requests missing information required to proceed. |
| Message_disambiguation | Control (Type 2) | Asks for clarification when intent is ambiguous. |
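
The 10 + 3 split between Narrative and Control tools can be checked mechanically. The following is a minimal sketch, assuming the tools are stored as a plain name-to-type mapping; the `INTERACTION_TOOLS` dictionary and the `count_by_type` helper are hypothetical names introduced for illustration and are not part of the benchmark's code.

```python
from collections import Counter

# Tool names and types are copied from the table above.
# The dictionary layout itself is an illustrative assumption.
INTERACTION_TOOLS = {
    "Message_tool_invocation": "Narrative",
    "Message_tool_invocation_logic": "Narrative",
    "Message_display_params": "Narrative",
    "Message_source_report": "Narrative",
    "Message_show_sequential_output": "Narrative",
    "Message_show_layered_presentation": "Narrative",
    "Message_tool_failure_notice": "Narrative",
    "Message_tool_failure_logic": "Narrative",
    "Message_tool_switch_notice": "Narrative",
    "Message_tool_abort": "Narrative",
    "Message_confirmation": "Control",
    "Message_information_seeking": "Control",
    "Message_disambiguation": "Control",
}


def count_by_type(tools):
    """Return how many tools fall under each type (hypothetical helper)."""
    return Counter(tools.values())


counts = count_by_type(INTERACTION_TOOLS)
print(counts["Narrative"], counts["Control"], len(INTERACTION_TOOLS))  # prints: 10 3 13
```

Keeping the type as a value rather than splitting the tools into two lists makes it easy to add a further type later without changing the counting logic.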

Preference Settings (14 attributes · 31 settings)

| Dimension | Attribute | Settings | Count |
| --- | --- | --- | --- |
| Transparency & Auditability | Tool Transparency | High, Medium, Low | 3 |
| Transparency & Auditability | Parameter Transparency | High, Medium, Low | 3 |
| Transparency & Auditability | Source Transparency | High, Low | 2 |
| Interaction Pace & Flow | Confirmation | Each, Silent, Batch | 3 |
| Interaction Pace & Flow | Presentation | Compact, Layered | 2 |
| Interaction Pace & Flow | Info Collection | Upfront, Gradual | 2 |
| Interaction Pace & Flow | Disambiguation | Upfront, Gradual | 2 |
| Interaction Pace & Flow | Chain Execution | Parallel, Sequential | 2 |
| Strategy & Initiative | Initiative | Proactive, Reactive | 2 |
| Strategy & Initiative | Tool Invocation | Single, Multiple | 2 |
| Robustness & Adaptability | Tool Abortion | Stop, Continue | 2 |
| Robustness & Adaptability | Tool Switching | High Agency, Low Agency | 2 |
| Robustness & Adaptability | Error Retry | Silent, Escalation | 2 |
| Robustness & Adaptability | Error Discovery | Brief, Detail | 2 |
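
The headline totals (14 attributes, 31 settings) follow directly from the table above. As a minimal sketch, assuming the table is encoded as a nested dictionary of dimension, attribute, and setting names, the totals can be recomputed as shown below; `PREFERENCE_SETTINGS` and `summarize` are hypothetical names used only for this illustration.

```python
# Dimension -> attribute -> list of settings, copied from the table above.
# The nested-dict encoding is an illustrative assumption, not the benchmark's format.
PREFERENCE_SETTINGS = {
    "Transparency & Auditability": {
        "Tool Transparency": ["High", "Medium", "Low"],
        "Parameter Transparency": ["High", "Medium", "Low"],
        "Source Transparency": ["High", "Low"],
    },
    "Interaction Pace & Flow": {
        "Confirmation": ["Each", "Silent", "Batch"],
        "Presentation": ["Compact", "Layered"],
        "Info Collection": ["Upfront", "Gradual"],
        "Disambiguation": ["Upfront", "Gradual"],
        "Chain Execution": ["Parallel", "Sequential"],
    },
    "Strategy & Initiative": {
        "Initiative": ["Proactive", "Reactive"],
        "Tool Invocation": ["Single", "Multiple"],
    },
    "Robustness & Adaptability": {
        "Tool Abortion": ["Stop", "Continue"],
        "Tool Switching": ["High Agency", "Low Agency"],
        "Error Retry": ["Silent", "Escalation"],
        "Error Discovery": ["Brief", "Detail"],
    },
}


def summarize(table):
    """Return (number of attributes, number of settings) across all dimensions."""
    attributes = sum(len(attrs) for attrs in table.values())
    settings = sum(
        len(options) for attrs in table.values() for options in attrs.values()
    )
    return attributes, settings


print(summarize(PREFERENCE_SETTINGS))  # prints: (14, 31)
```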