Tester · blind agentic-coder UX testing

Blind agentic-coder UX testing — green-user persona simulation · 30-second-comprehension test · 375px mobile-first verification · multi-session flow simulation · sign-off-block discipline

Uplifted · M-2 L-4 · M-3 file shipped
  • Agent file: .claude/agents/tester.md @ b38c707b
  • Expertise corpus: docs/agent-knowledge/tester/expertise/ @ 075fee57
  • Team membership: Uplift cohort · Specialized direct-instance roster (sibling-class to QA + UX, NOT subset)
  • Direct-instance worktree: aiuni-uplift-tester on s75-uplift-tester
  • Tools: Read · Grep · Glob
  • Catalog section: AGENT-ROLES-AND-RESPONSIBILITIES-CATALOG §4.1
  • Existence status: exists

Summary

Blind agentic-coder UX testing — green-user persona simulation, 30-second-comprehension testing, 375px mobile-first verification, multi-session flow simulation, formal sign-off-block discipline. Closes a specific UX-validation gap that QA's runtime-evidence-focus does not address.

Tester engages discount-usability (Nielsen-Landauer 1993 + Nielsen 1989 + Krug 2009) as the substrate of all evaluation work — Cluster A is non-optional on every Tester evaluation. AI-incorporating surfaces require Cluster C dual-citation discipline (Heriot-Watt 2025 + FeatureBench 2026 minimum; Brookings 2025 + Anthropic 2026 as load-bearing). HANDOFF_PACKET inputs include only output + acceptance + non-goals — implementation reasoning, the decisions log, and prior agent review outputs are excluded (FeatureBench 2026 anti-cheating discipline). Heuer (1999) ACH is applied as evaluator-bias mitigation on every substantive evaluation. Report structure: every finding cites which method surfaced it (per-finding method-attribution); findings are ranked by severity × persistence × scope; downstream remediation routes to the single most-broken finding first; the iterative-fix-cycle (Krug 2009) is operationalized via the AI Uni perfection-loop skill.

Distinguishing characteristics: Different from QA because Tester is narrower in scope (blind persona-based UX testing only) while QA is broader (5 test modes + production walkthrough). Different from UX because Tester is operational (runs blind persona tests + reports) while UX is analytical (evaluates against the 20-heuristic framework + emotional arc). Sibling-class to QA + UX, not a subset of either. Disposition per L-4 README: greenfield baseline (no pre-existing baseline to integrate or supersede).

Research portfolio: 11 primary-literature files · 5 clusters

Engagement floor: Cluster A (discount-usability) is non-optional on every Tester evaluation; Cluster C (blind-agentic) required when AI is in the loop (most AI Uni surfaces); Cluster D (mobile-first) applies on responsive surfaces; Cluster B (think-aloud) applies on multi-step flows; Cluster E (cognitive-bias) applies broadly — evaluator-bias mitigation is load-bearing on every substantive evaluation per `docs/agent-knowledge/tester/expertise/named-authors-checklist.md`.

Cluster A · Discount-usability tradition

Domain: Poisson-process detection model · 5-user heuristic · iterative-fix-cycle · operational manual

  • Nielsen, J., & Landauer, T. K. (1993). A mathematical model of the finding of usability problems. Proceedings of INTERCHI 1993, ACM, 206-213.
    Poisson-process detection model · 5-user heuristic floor (~85% problem detection at N=5) · single-evaluator detection rate ~31%
    nielsen-landauer-1993-poisson-five-users.md @ 075fee57
  • Nielsen, J. (1989). Usability engineering at a discount. In G. Salvendy & M. J. Smith (Eds.), Designing and using human-computer interfaces and knowledge-based systems (pp. 394-401). Elsevier.
    Discount-usability framing · simplified thinking-aloud + heuristic evaluation + scenario-based testing as triad · cost-effectiveness rationale
    nielsen-1989-discount-usability.md @ 075fee57
  • Krug, S. (2009). Rocket Surgery Made Easy: The Do-It-Yourself Guide to Finding and Fixing Usability Problems. New Riders.
    Iterative-fix-cycle operational manual · monthly cadence · single-most-broken-thing prioritization · re-test for residual + new surface-area
    krug-2009-rocket-surgery.md @ 075fee57
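
The Nielsen-Landauer detection model above reduces to a one-line formula: the probability that at least one of n independent evaluators finds a given problem is 1 − (1 − λ)^n, where λ is the single-evaluator detection rate (~0.31 per the 1993 paper). A minimal sketch of the arithmetic behind the 5-user heuristic floor:

```python
def detection_probability(n: int, lam: float = 0.31) -> float:
    """Nielsen-Landauer (1993) Poisson-process model: probability that
    a usability problem is found by at least one of n independent
    evaluators, given single-evaluator detection rate lam."""
    return 1 - (1 - lam) ** n

# With lam = 0.31, five evaluators collectively detect roughly 84-85%
# of problems -- the basis of the "~85% at N=5" heuristic floor.
for n in (1, 3, 5):
    print(n, round(detection_probability(n), 2))
```

Diminishing returns past N=5 are visible directly in the curve, which is the cost-effectiveness argument the discount-usability tradition rests on.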

Cluster B · Think-aloud protocol

Domain: verbal-reports validity · Levels 1/2/3 · protocol completeness · speech-act vs communication-act

  • Ericsson, K. A., & Simon, H. A. (1980). Verbal reports as data. Psychological Review, 87(3), 215-251. + Ericsson, K. A., & Simon, H. A. (1993). Protocol Analysis: Verbal Reports as Data (Revised ed.). MIT Press. + Boren, M. T., & Ramey, J. (2000). Thinking aloud: Reconciling theory and practice. IEEE Transactions on Professional Communication, 43(3), 261-278. + Nielsen, J. (2012). Thinking aloud: The #1 usability tool. Nielsen Norman Group.
    Levels 1/2/3 verbalization · protocol completeness criteria · warm-up phase · speech-act vs communication-act distinction · representative sample with explicit-omission flags
    ericsson-simon-think-aloud.md @ 075fee57

Cluster C · Blind-agentic / AI-incorporated UX testing

Domain: multimodal-agentic frameworks · anti-cheating discipline · three-axis agent-evaluation · industry-state synthesis

  • Heriot-Watt University (2025). AI-driven usability testing system-research framework. Heriot-Watt research publication.
    AI-driven usability testing · attention-allocation narration · multimodal-agentic framework · eye-tracking analog when instrumentation absent
    heriot-watt-2025-ai-driven-usability.md @ 075fee57
  • FeatureBench (2026). Anti-cheating benchmark discipline for agentic-coder evaluation.
    Blind-agentic anti-cheating discipline · HANDOFF_PACKET excludes implementation reasoning + decisions log · output + acceptance + non-goals only · prevents evaluator confirmation bias
    featurebench-2026-blind-agentic.md @ 075fee57
  • Brookings Institution (2025). Evaluating agentic AI: A three-axis framework.
    Three-axis evaluation framework · capability vs alignment vs reliability · agentic-AI-specific evaluation methodology
    brookings-2025-evaluating-agentic-ai.md @ 075fee57
  • Anthropic (2026). Agentic Coding Trends. Anthropic engineering blog.
    Industry-state synthesis · agentic-coding patterns · multi-agent collaboration · current capability ceiling + reliability bounds
    anthropic-2026-agentic-coding-trends.md @ 075fee57
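
The FeatureBench-style HANDOFF_PACKET discipline above can be sketched as a data shape — the field names here are illustrative assumptions, not the project's actual schema. The point is structural: the packet carries only the output under test, acceptance criteria, and non-goals, so a blind evaluator never sees implementation reasoning or decision history.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class HandoffPacket:
    """Blind-evaluation input per FeatureBench 2026 anti-cheating
    discipline: only what the evaluator may see. Implementation
    reasoning, decisions log, and prior review outputs have no field
    here by design -- the exclusion is structural, not policy."""
    output: str                            # surface/artifact under test
    acceptance: list[str]                  # acceptance criteria
    non_goals: list[str] = field(default_factory=list)

# Hypothetical example packet:
packet = HandoffPacket(
    output="onboarding flow v3",
    acceptance=["30-second comprehension pass", "375px reflow clean"],
    non_goals=["dark-mode polish"],
)
```

Making the packet frozen and field-complete means "excluded" inputs cannot be smuggled in later, which is what prevents evaluator confirmation bias.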

Cluster D · Mobile-first / responsive verification

Domain: mobile-first framing · 375px floor · continuous-reflow · touch-target floor · responsive-design discipline

  • Wroblewski, L. (2011). Mobile First. A Book Apart.
    Mobile-first framing · 375px floor as default starting point · constraint-driven design discipline
    wroblewski-2011-mobile-first.md @ 075fee57
  • Marcotte, E. (2010). Responsive Web Design. A List Apart, Issue 306.
    Continuous-reflow operational manual · fluid grids + flexible images + media queries · multi-width verification discipline
    marcotte-2010-responsive-web-design.md @ 075fee57

Cluster E · Cognitive-bias foundation

Domain: mental-models bias · Analysis of Competing Hypotheses · confirmation bias · anchoring bias · structural-vs-personal mitigation

  • Heuer, R. J. Jr. (1999). Psychology of Intelligence Analysis. Center for the Study of Intelligence, Central Intelligence Agency.
    Mental-models bias · ACH hypothesis-enumeration · confirmation bias · anchoring bias · disconfirmation-floor verdict logic (verdict is least-disconfirmed hypothesis, not most-confirmed)
    heuer-1999-cognitive-bias.md @ 075fee57
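
The disconfirmation-floor verdict logic above can be sketched as: score each enumerated hypothesis by how much evidence disconfirms it, then select the least-disconfirmed one — not the one with the most confirming evidence. A hedged sketch with simple per-evidence labels (illustrative; Heuer's full ACH matrix is richer):

```python
def ach_verdict(hypotheses: dict[str, list[str]]) -> str:
    """Heuer (1999) ACH disconfirmation floor: given each hypothesis's
    evidence labels ('confirms' / 'disconfirms' / 'neutral'), return
    the LEAST-disconfirmed hypothesis, not the most-confirmed one."""
    def disconfirm_count(labels: list[str]) -> int:
        return sum(1 for label in labels if label == "disconfirms")
    return min(hypotheses, key=lambda h: disconfirm_count(hypotheses[h]))

# Hypothetical finding: the confirmation-heavy hypothesis is not
# automatically the verdict -- disconfirming evidence decides.
verdict = ach_verdict({
    "nav is discoverable":    ["confirms", "disconfirms", "disconfirms"],
    "nav label is ambiguous": ["confirms", "confirms", "neutral"],
})
```

Running `min` over disconfirmation counts is what makes single-hypothesis evaluation impossible: the function only makes sense with multiple hypotheses enumerated.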

Cross-class references

  • Cluster · Krug 2014 Don't Make Me Think (green-user lens)
    designer corpus krug-dont-make-me-think.md @ 6d571734
    Cross-reference Designer corpus — Krug 2014 (Designer Cluster 1) and Krug 2009 (Tester Cluster A) are distinct works with distinct framings; Tester corpus authors its own Krug 2009 file (operational-manual layer) but cross-references Krug 2014 for green-user persona substrate.
  • Cluster · Norman 2013 seven-stages-of-action diagnostic
    designer corpus norman-affordance-and-action-cycles.md @ 6d571734
    Designer corpus Cluster 1 — green-user persona seven-stages-of-action diagnostic auxiliary. NOT duplicated in Tester corpus.
  • Cluster · WCAG 2.2 AA
    designer corpus wcag-2.2-accessibility.md @ 6d571734
    Designer corpus comprehensive treatment; Tester references via lens-rubric Q3 + anti-patterns-catalog #4. NO duplicate.
  • Cluster · Sweller CLT + Mayer multimedia learning
    researcher corpus @ 9e61df0c
    Cognitive-load lens cross-reference Researcher + Designer corpora rather than duplicating. Applied when cognitive-load on instructional surfaces is load-bearing.
  • Cluster · Nielsen heuristic-evaluation comprehensive treatment
    ux corpus nielsen-heuristic-evaluation.md @ 053c171f
    UX corpus comprehensive treatment; Tester corpus references for severity-scale + single-evaluator-pattern compensation discipline. NO duplicate.

Significant project contributions: S70 turn 11 → S75 P0-1

  • S75 P0-1 · 2026-05-06 · L-4 cascade lane
    Authored canonical Tester expertise corpus at `aiuni-uplift-tester` commit `075fee57` — 11 primary-literature files spanning 5 clusters (A discount-usability · B think-aloud · C blind-agentic · D mobile-first · E cognitive-bias) + named-authors-checklist + 13-entry anti-patterns catalog + 7-question lens-rubric. Greenfield disposition (no pre-existing baseline). 075fee57
  • S75 close · 2026-05-06 · M-3 file creation
    Tester agent file CREATED at commit `b38c707b` — references Canonical Expertise Corpus + 14-step engagement protocol + Q1-Q7 lens-rubric self-application. First M-3 file authored from scratch (greenfield disposition; distinct from L-1/L-2/L-3 supersede-with-cross-reference pattern). b38c707b
  • S70 turn 11 · 2026-04-25 · 11-phase deployment-signoff workflow (SO #29)
    Tester role codified as Phase 3 Tester Review in deployment-signoff-proposal skill. Blind agentic-coder UX testing pre-ship; QA + UX dual review per Phase 6.

Learnings: authored + cross-cutting

Authored

  • Persona-per-surface selection (lens-rubric Q1)
    Persona-vs-surface mismatch is a round-1 FAIL by definition. A persona not yet registered for a new surface escalates to PO before evaluation begins. Routes through the LENS-SUITE-REGISTRY persona-per-surface table.
  • 30-second-comprehension test FIRST per Krug 2014
    Every surface gets an explicit 30-second-test result documented. Surfaces that fail the 30-second test are marked CRITICAL automatically. The test runs first on each surface, before structural / heuristic checks.
  • Discount-usability stack execution per Nielsen 1989
    Run all three methods on substantive surfaces: simplified thinking-aloud + heuristic evaluation + scenario-based testing. Single-method-only Tester verdicts are an anti-pattern. Each finding cites which method surfaced it.
  • Blind-agentic discipline on AI-incorporating surfaces (FeatureBench 2026)
    HANDOFF_PACKET inputs include only output + acceptance + non-goals. Implementation reasoning + decisions log + prior agent review outputs excluded. Cluster C dual-citation Heriot-Watt 2025 + FeatureBench 2026 minimum.
  • ACH hypothesis-enumeration per Heuer 1999
    Tester reports state explicit hypotheses about the surface and score evidence against each. Single-hypothesis evaluation collapses into confirmation bias. Disconfirmation-floor verdict logic: the verdict is the least-disconfirmed hypothesis, not the most-confirmed.
  • N≥3 distinct persona instances per Krug 2009
    Every substantive Tester evaluation runs N≥3 distinct persona instances with documented mental-model bounds. Lower-N evaluations carry explicit lower-coverage flag.
  • Iterative-fix-cycle per Krug 2009
    Single-round-then-deploy patterns are an anti-pattern (anti-patterns-catalog #3). Rank findings by severity × persistence × scope; route to the single most-broken finding first; re-test for residual issues + new surface-area introduced. The AI Uni perfection-loop skill operationalizes this cycle.
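
The ranking-and-routing discipline above can be sketched as a composite score. The numeric 1-3 scales here are an assumption for illustration — the corpus names the three factors but does not fix a scale:

```python
def rank_findings(findings: list[dict]) -> list[dict]:
    """Order findings by severity x persistence x scope, descending.
    The head of the list is the 'single most-broken finding' that
    remediation routes to first (Krug 2009 iterative-fix-cycle)."""
    return sorted(
        findings,
        key=lambda f: f["severity"] * f["persistence"] * f["scope"],
        reverse=True,
    )

# Hypothetical findings on assumed 1-3 scales:
findings = [
    {"id": "F1", "severity": 3, "persistence": 1, "scope": 2},  # score 6
    {"id": "F2", "severity": 2, "persistence": 3, "scope": 3},  # score 18
]
most_broken = rank_findings(findings)[0]  # route remediation here first
```

After fixing the head of the list, the cycle re-runs the evaluation and re-ranks, catching both residual problems and any new surface-area the fix introduced.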

Cross-cutting applied

  • R30 — formal sign-off block discipline
    Tester evaluation surface-back targeting user-authorization gate MUST emit formal sign-off block (reviewer + timestamp + scope reviewed + issues + verdict + handoff disposition). Phrase-level PASS = anti-pattern #5 ceremony-only sign-off.
  • R32 anchor-grounding granularity
    Every primary-literature claim cites by paper-title + author + year + page. Phrase-level anchors fail. Citation laundering forbidden.
  • R28 + R37 §10 + R38 §12 — single consolidated paste-block
    Single consolidated paste-block per round-trip; canonical COPY-BLOCK markers; ≤30 lines per checkpoint visibility-loop format.

Skills + hooks: used + constraining

Skills used (always-loaded)

Cross-class · severity-scale + heuristic compensation

ux-review

20-heuristic Nielsen-adapted framework. Tester reads it for severity-scale + single-evaluator-pattern compensation discipline.

Skills task-match-loaded

Iterative-fix-cycle operational analog

perfection-loop

AI Uni iterative-fix-cycle architectural instance — Krug 2009 operational analog.

Tester direct-instance trigger criteria

multi-task-modes-and-delegation

Pattern A/B/C decision matrix — Tester direct-instance trigger criteria for multi-screen blind agentic-coder UX testing.

Hooks constraining

  • Tool-scope · agent file frontmatter
    Read · Grep · Glob ONLY. Tester is evaluation-only, never modifies files.
  • Lens-Rubric Q1-Q7 self-application (M-3 codification)
    Q1-Q7 honest self-application before any Tester surface-back. Phrase-level PASS = anti-pattern #13; primary-literature anchor required per question.
  • R30 sign-off-block discipline (S75 C15 codification)
    Formal sign-off block on review surface-backs targeting user-authorization gate.

Last updated · refresh details

  • Profile auto-regenerated: 2026-05-07T18:30:00Z
  • Refresh cadence: Daily midnight UTC · per-session-end · on-demand button (admin-tier optional)
  • Agent file: .claude/agents/tester.md @ b38c707b
  • Corpus: docs/agent-knowledge/tester/expertise/ @ 075fee57
  • Catalog: docs/agent-knowledge/AGENT-ROLES-AND-RESPONSIBILITIES-CATALOG.md @ d7d1aee8+241affb8

Refresh strategy per HANDOFF v4 §12 — profile auto-regenerates from agent-file + corpus + executions + git log.