FILE / 2026.05 · RESEARCH REPORT · ONE PAGE EDITION

JAMOVI/

A THREE PHASE
USABILITY STUDY

Why a statistical workbench feels effortless to its users while quietly producing wrong conclusions, and what that means now that AI agents have started clicking the same buttons.

Subject
Jamovi 2.7.17 desktop
Domain
Open source statistical analysis
Phases
3 (User / Expert / Agent)
Participants
6 STEM users + 3 UX experts + 7 AI agents
Task
Cheese vs Nightmare ANOVA, with 2 buried traps
Output
4 error clusters, 6 design proposals
Author
Independent, 2026
Six users took an effortless walk through Jamovi. They produced 70 codable errors, and five of the six confidently reported a wrong scientific conclusion. The interface never warned them.
62.9
SUS Score
Below the 70 industry benchmark, despite a perceived workload of only 19.83 / 100.
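The 62.9 figure follows the standard SUS scoring rule; as an illustration of that arithmetic (function name and response values are hypothetical, not the study's raw data), a minimal sketch:

```python
def sus_score(responses):
    """Standard SUS scoring: 10 Likert items rated 1-5.

    Odd-numbered items (positive wording) contribute (score - 1),
    even-numbered items (negative wording) contribute (5 - score);
    the summed contributions are scaled by 2.5 onto a 0-100 range.
    """
    assert len(responses) == 10
    total = sum((r - 1) if i % 2 == 0 else (5 - r)
                for i, r in enumerate(responses))
    return total * 2.5
```

A respondent answering 5 on every positive item and 1 on every negative item scores 100; uniform 3s score 50.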
5/6
misread "p > 0.05"
As "no effect of cheese on nightmares". Matches the 51% rate found in literature.
01
Phase 1 / User Study

A Quiet Paradox

Six STEM participants, all with a college statistics course but under five hours of Jamovi exposure, played the role of sleep researchers. The dataset contained two seeded traps: an outlier of value 200, and the string "3a" hidden in a numeric column. The second trap silently flips the whole column to text and breaks every downstream analysis.

Three large language models (DeepSeek v3, Gemini 2.5 Pro, Grok 4.1) each coded the think-aloud transcripts five times. Inter-coder agreement ranged from 0.836 to 0.888, all above the 0.80 threshold.
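The report does not state which agreement coefficient was used. As an illustration of how such a statistic is computed, here is Cohen's kappa for one pair of coders, a minimal sketch with hypothetical labels:

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Chance-corrected agreement between two coders' category labels."""
    n = len(coder_a)
    # observed agreement: share of items labelled identically
    p_o = sum(x == y for x, y in zip(coder_a, coder_b)) / n
    # expected chance agreement from each coder's marginal label frequencies
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    p_e = sum(freq_a[k] * freq_b.get(k, 0) for k in freq_a) / n ** 2
    return (p_o - p_e) / (1 - p_e)
```

Values above 0.80, as reported here, are conventionally read as strong agreement.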

Subjective workload

Felt easy

19.83 / 100

NASA TLX (Performance dimension omitted; unweighted raw score).

Behavioural reality

70 errors

70 codable events · n=6

Six users felt nothing was wrong while breaking the analysis 70 times.

SUS and TLX distribution
FIG 1.1 / SUS Mean 62.9 vs TLX Mean 5.0. The threshold sits above almost everyone.
Session time allocation per participant
FIG 1.2 / Roughly half of the 66.4 total minutes was spent outside the 14 prescribed steps.

Four error clusters

K MEANS / N=70 / CODED FROM THINK ALOUD + SCREEN RECORD
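The report names k-means as the clustering method but not the feature space. A minimal Lloyd's-iteration sketch over hypothetical numeric feature vectors shows the mechanic, nothing more:

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Minimal Lloyd's k-means over numeric feature vectors (tuples)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # assignment step: each point joins its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[dists.index(min(dists))].append(p)
        # update step: move each center to its cluster mean
        new_centers = [
            tuple(sum(v) / len(cl) for v in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:  # converged
            break
        centers = new_centers
    return centers, clusters
```

On well-separated data the assignment stabilises within a few iterations; the actual study would have run this over coded error features, not raw coordinates.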

A

Missing
error feedback

most frequent

A drag onto the wrong slot triggers a brief icon flicker. All six users saw it. None decoded it.

"I knew it should go there. It took two minutes to realise the variable type was wrong."
3 to 5 repeats per user · 0 diagnoses
B

High risk
workarounds

most dangerous

Blocked from the right variable, users dragged the only one that fit: the row ID. Jamovi ran the test without warning.

"I picked ID as the dependent variable because the icon in the corner matched."
Silent success · wrong conclusion shipped
C

Mental model
collisions

cross tool

Excel and SPSS habits leaked in. One user tried to paste analysis output into the data grid to "compare side by side".

"I just want to put these next to each other like in Excel."
Reveals unclear data / output boundary
D

Skipped
data prep

researcher habit

A scan, a delete, a confident "no problem". Only one of six caught the "3a" string trap.

"One obvious problem, removed. Looks fine now."
1 / 6 found the format trap
Drag fails, the icon blinks
FIG 1.3 / The only feedback for a rejected drop is a sub-second flicker on a 16 px icon.
P value misreading
5/6

of the users who saw "p > 0.05" recorded it as "cheese has no effect on nightmares". A survey of 791 papers reports the same conflation in roughly 51% of cases. The tool inherits the field's bad habit and amplifies it.
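What the misreading skips is that "no significant difference" and "demonstrated equivalence" are separate claims; the latter needs an equivalence test. A sketch of the TOST logic under a normal approximation (illustrative only, not the study's or Jamovi's procedure):

```python
from statistics import NormalDist

def tost_equivalent(diff, se, bound, alpha=0.05):
    """Two one-sided tests (TOST), normal approximation.

    Declares equivalence only if the observed difference is significantly
    above -bound AND significantly below +bound. A non-significant
    ordinary test (p > 0.05) is NOT enough to pass.
    """
    z_crit = NormalDist().inv_cdf(1 - alpha)
    z_lower = (diff + bound) / se  # against H0: true diff <= -bound
    z_upper = (bound - diff) / se  # against H0: true diff >= +bound
    return z_lower > z_crit and z_upper > z_crit
```

A tiny difference with a precise estimate (diff 0.1, se 0.5, bound 2) passes; the same difference with se 2 fails, even though an ordinary test would also be non-significant. That gap is exactly the "absence of evidence" trap.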

02
Phase 2 / Expert Evaluation

Tracing The Bottleneck

Three UX students with statistics literacy but no Jamovi history performed a PURE cognitive walkthrough plus a Nielsen heuristic evaluation. Two complementary methods: one locates where the user will stall, the other diagnoses why.

PURE total gauge 19/42
PURE TOTAL · 19 / 42 · MOSTLY EASY
Heuristic mean 1.67
HEURISTIC MEAN · 1.67 / 4 · MINOR ISSUES

Both gauges read green. The interesting story is in their distribution. Two heuristics dominate the violations: Help & Documentation (9 / 12) and Error Recovery (8 / 12). One PURE step stands alone with a 3: "Identify wrong data type". It is the only step that is purely cognitive. No click is required. Nothing on screen says it is happening.

CROSS PHASE FINDING

The invisible step explodes downstream

Mapping the 70 phase 1 errors back onto the 14 PURE steps produces a counterintuitive distribution. The hardest cognitive step receives almost no error events. The easiest-looking step absorbs 29 of them.

PURE rated 3 / likely failure
Identify data type error

A pure inference step, signalled only by an icon. Two of six users never solved it independently.

29
errors land here, on "drag variable", a step PURE rated 1 / easy.
Trace A · repeated failure

Five drag attempts, "the ruler is blinking", six minutes lost.

Trace B · spillover

Four analysis tools tried in five minutes. Eventually surrendered.

Trace C · silent success

Drags row ID as the dependent. Reads off a meaningless F statistic. Reports a confident, wrong conclusion.

2 / 6

Reached the correct data type recognition without prompting.

PURE difficulty by subtask
FIG 2.1 / Only "T2B P3 Identify data type" crosses the difficulty boundary, yet receives no error events. Errors surface elsewhere.
03
Phase 3 / Agent Experience

When The User Is A Model

The same task was handed to seven AI agents on a NixOS-based Computer Use rig: the operation layer first, then the interpretation layer.

OPERATION LAYER · TASK COMPLETION
GPT 5.4 xHigh
Pass
Only model to complete the full pipeline. Still misses small targets repeatedly.
Claude Opus 4.6
Fail
Cannot identify Jamovi's custom collapse control.
Claude Sonnet 4.6
Fail
Coordinate misses, hits invocation cap.
Gemini 3.1 Pro
Fail
Stalls before completion.
Qwen 3.5 35B
Fail
Sees the outlier in the screenshot. Cannot land the cursor on it.
Qwen 3.5 122B int4
Fail
Clicks empty space dozens of times after one wrong move.
Page Agent.js (DOM)
Fail
The DOM tree exposes no spatial layout for grid cells.
RATE 1 / 7 · ONLY GPT 5.4 xHIGH COMPLETES
Custom folding control
FIG 3.1 / Custom collapsibles fall outside model training distribution.
Verbose DOM nodes
FIG 3.2 / Verbose DOM does not encode grid topology. The agent loses spatial sense.
Material vs windows control sizing
FIG 3.3 / Larger touch targets help models the same way they help fingers.
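Why target size matters to fingers and models alike can be framed with Fitts's law; whether the law transfers cleanly to screenshot-based agents is an assumption, but it gives the intuition a number. A sketch of the Shannon formulation:

```python
import math

def fitts_id(distance, width):
    """Shannon form of Fitts's index of difficulty, in bits.
    Farther and smaller targets yield a higher ID (harder to hit)."""
    return math.log2(distance / width + 1)
```

At the same 400 px travel distance, a 16 px icon scores about 4.70 bits against roughly 3.22 bits for a 48 px target, i.e. tripling the target width removes about a third of the difficulty.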

Interpretation flips with two lines

Task completion is a model bottleneck today. Statistical interpretation is not. Ten models read the ANOVA result page five times each. Adding two sentences of advice into the result region changed almost everything.

Without the two lines
28
/ 50 correct

Most models repeat the human mistake: "p > 0.05 means no difference".

With two lines added
42
/ 50 correct, large models near 100%

Sentences embedded next to the table: "absence of evidence is not evidence of absence" and "do not issue clinical recommendations without an equivalence test". No layout tricks. Pure semantic context.

ANOVA result with two corrective sentences
FIG 3.4 / The two added lines. Cheap to ship, indifferent to visual hierarchy, effective for both humans and LLMs.
04
Synthesis / Design Direction

Six Moves Worth Making

01 / Recommendation

Persistent error message
not a flickering icon

Replace the sub-second icon blink with a visible text message stating what failed, why, and how to fix it. Screenshot-based agents cannot perceive animation; the flicker is invisible to them.

Source · Cluster A · Agent · 6 / 6 users + 8 / 12 heuristic violations
02 / Recommendation

Embed statistical guidance
inside test output

Print a standardised note next to ANOVA results. Two sentences pushed agent accuracy to near 100% and quietly help human readers as well.

Source · Phase 3 finding · 50 model runs, +28% accuracy delta
03 / Recommendation

Active data quality checks
and conversion previews

Surface columns where 95% of values are numeric but a few are not. Show a preview before any silent type conversion. Use a non-blocking notice when entering analysis with unresolved quality issues.

Source · PURE 3 / 3 step · "3a" trap + silent NA conversion
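The proposed check is simple to state in code. A minimal sketch, in Python for illustration only (function name, return shape, and the 95% threshold are assumptions, not Jamovi's implementation):

```python
def type_check(column, threshold=0.95):
    """Flag a mostly-numeric column that contains stray non-numeric
    values (like the seeded "3a") before any silent conversion to text."""
    def numeric(value):
        try:
            float(value)
            return True
        except (TypeError, ValueError):
            return False

    offenders = [v for v in column if not numeric(v)]
    share = 1 - len(offenders) / len(column)
    return {
        # warn only when the column is clearly meant to be numeric
        "flagged": bool(offenders) and share >= threshold,
        "numeric_share": share,
        "offenders": offenders,
    }
```

A column of 19 numbers plus one "3a" gets flagged with the offender listed; an all-text column stays quiet, since it was never meant to be numeric.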
04 / Recommendation

A navigable result panel,
not an empty half screen

Layer a lightweight outline navigator over the free-form notebook. SPSS Output Viewer and Wikiwand's floating outline are good prior art. Recovers space and prevents accidental deletion of the entire output.

Source · Cluster C · 3 distinct result panel incidents
05 / Recommendation

One search bar for help,
features, and documentation

Bring the Office "Tell me what you want to do" pattern to the ribbon. Jamovi already borrows the ribbon. The search entry is the natural extension. No more help vacuum at the moment of need.

Source · Heuristic 9 / 12 · Highest heuristic violation score
06 / Recommendation

APA aligned default output
for hypothesis tests

Descriptives, effect sizes and a box plot should be on by default. Each unchecked box is one missed report for a human and one extra coordinate hit for an agent. Suggest post hoc comparisons when results are significant.

Source · Sensible defaults · Phase 1 omissions + Phase 3 click failures
Result panel reference designs
FIG 4.1 / Prior art for result panel navigation. SPSS tree, lightweight floating outline, classic checkbox set.