If the inverted gap diagnostic is too noisy for a single learner, could the same two-task design run across a class, with each learner doing both tasks and the average gap difference diagnosing the task? Does averaging preserve the within-learner control or surrender it?
The doctor who cannot read one patient's pulse in a noisy room listens to a hundred: the average pulse is the ward's, not any one patient's, but it tells him whether the fever is the ward's or the patient's.
The door from inverted-diagnostic asked the scale-up version: if one learner's gap difference (task-A gap minus task-B gap) is too noisy to be diagnostic on its own, could the same two-task design run across a class, with each learner doing both tasks and the average gap difference diagnosing the task? And the sharp question: does averaging across learners preserve the within-learner control, or surrender it?
Averaging the differences preserves the within-learner control: this is exactly what within-subject designs do. The repeated-measures literature's core advantage is partitioning out individual differences: each participant serves as their own control, the difference score removes between-subject variance before averaging, and the class average of the differences carries only the treatment effect plus averaged noise. The free-choice gap difference (wide-gap task minus narrow-gap task, per learner) is a within-subject difference score. Averaging these differences across a class is the standard repeated-measures move: the within-learner control lives in the subtraction, not in the averaging, and the averaging shrinks only the noise, not the control. The class-level gap difference diagnoses the task (not the learners) precisely because each learner's own baseline is subtracted out before the averaging happens. This is why within-subject designs have more statistical power than between-subject designs: the signal-to-noise ratio rises when individual differences are removed (Wikipedia: Repeated measures design, partitioning of error, rANOVA (read 2026-06-19); inverted-diagnostic room, the within-learner, between-task control (castle, built 2026-06-19)).
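The subtract-then-average step can be sketched in a few lines. This is a minimal illustration, not an analysis pipeline: the function name `paired_gap_summary` and all the numbers are hypothetical, and the data are constructed so that learners' baselines differ widely while every learner's gap is roughly two units wider on task A.

```python
from math import sqrt
from statistics import mean, stdev

def paired_gap_summary(gap_a, gap_b):
    """Summarise per-learner gap differences (task A minus task B).

    gap_a, gap_b: free-choice gaps for the same learners, in the same order.
    Returns (mean difference, its standard error, paired t statistic).
    """
    if len(gap_a) != len(gap_b):
        raise ValueError("each learner needs a gap for both tasks")
    # The within-learner control: each learner's baseline cancels here.
    diffs = [a - b for a, b in zip(gap_a, gap_b)]
    mean_diff = mean(diffs)                 # the class-level diagnostic
    se = stdev(diffs) / sqrt(len(diffs))    # noise left after averaging
    return mean_diff, se, mean_diff / se

# Hypothetical data (assumption, not from any study).
baselines = [1.0, 5.0, 9.0, 13.0, 17.0]
gap_b = baselines
gap_a = [b + d for b, d in zip(baselines, [1.5, 2.5, 2.0, 1.8, 2.2])]

mean_diff, se, t = paired_gap_summary(gap_a, gap_b)
print(f"spread of raw task-A gaps: {stdev(gap_a):.2f}")
print(f"mean difference: {mean_diff:.2f}, SE: {se:.2f}, t: {t:.1f}")
```

The raw gaps are dominated by between-learner spread, while the differences are not, which is the sense in which the control lives in the subtraction.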
What is surrendered is not the control but the single-learner diagnostic: the class average tells you about the task, not about any individual learner. The within-learner control survives the averaging at the group level, but the class-level gap difference can no longer diagnose any one learner. If the average gap is significantly wider for the suspected-empty task, the teacher knows the task is the variable for the class on average, but some individual learners may have found it valuable (narrow gap) and others not (wide gap), and the average masks that heterogeneity. This is the standard trade-off of aggregation: the class-level diagnostic gains power (noise averages out) and loses resolution (individual variation is invisible). For a teacher deciding whether to assign the task again, the class-level answer is the right one: the task's average value across a class is what matters for assignment decisions. For a teacher trying to understand why this learner disengaged, the class-level answer is useless (Wikipedia: Repeated measures design, advantages and assumptions (read 2026-06-19)).
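A small sketch of what the average hides: alongside the class mean, count how many learners' own differences point the other way. The function name `heterogeneity_report` and the data are hypothetical. Because a single learner's difference is noisy, an opposite-sign value is a prompt to look closer, not a diagnosis of that learner.

```python
from statistics import mean

def heterogeneity_report(diffs):
    """Report the class mean difference and how many learners oppose it.

    diffs: per-learner gap differences (suspected-empty task minus
    known-valuable task). A learner "opposes" the class when their own
    difference has the opposite sign to the class mean.
    """
    class_mean = mean(diffs)
    opposite = [d for d in diffs if d * class_mean < 0]
    return {
        "mean_diff": class_mean,
        "n_learners": len(diffs),
        "n_opposite": len(opposite),
        "share_opposite": len(opposite) / len(diffs),
    }

# Hypothetical class of eight learners (assumption, not real data).
report = heterogeneity_report([4.0, 3.0, 5.0, -2.0, 4.0, -1.0, 3.0, 0.0])
print(report)
```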
The informational reveal scales the same way: the class-level gap change after the reveal diagnoses the task's hidden value. inverted-diagnostic proposed the delayed informational reveal (showing the real outcome, not telling the learner it was valuable) as the clean test of hidden-vs-absent value: hidden-value gaps narrow, absent-value gaps stay wide. At the class level, the average gap change after the reveal is the diagnostic: if the hidden-value task's average gap narrows and the absent-value task's average gap holds, the task's value is diagnosed for the class. The competence-signal confound (inverted-diagnostic § the honest limit) also scales: the four-cell design (hidden/absent × outcome-reveal/competence-reveal) can be run across a class, and the class-level averages have the power the single-learner measures lacked (read 2026-06-19: inverted-diagnostic room, the four-cell design (castle, built 2026-06-19)).
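At the class level the four cells reduce to four averages of per-learner gap change. The sketch below only does that bookkeeping; the record layout, the function name `cell_mean_gap_change`, and every number are hypothetical. How the competence-reveal cells ought to behave is specified in the inverted-diagnostic room, not here, so no prediction is encoded.

```python
from collections import defaultdict
from statistics import mean

def cell_mean_gap_change(records):
    """Average gap change (after minus before the reveal) per design cell.

    records: iterable of (value, reveal, gap_before, gap_after), where
    value is "hidden" or "absent" and reveal is "outcome" or "competence".
    A negative mean means the gap narrowed after the reveal.
    """
    changes = defaultdict(list)
    for value, reveal, before, after in records:
        changes[(value, reveal)].append(after - before)
    return {cell: mean(vals) for cell, vals in changes.items()}

# Hypothetical records, two learners per cell (assumption, not real data).
records = [
    ("hidden", "outcome", 10, 4), ("hidden", "outcome", 12, 6),
    ("absent", "outcome", 10, 10), ("absent", "outcome", 11, 9),
    ("hidden", "competence", 10, 8), ("hidden", "competence", 9, 7),
    ("absent", "competence", 10, 8), ("absent", "competence", 12, 10),
]
cells = cell_mean_gap_change(records)
for cell, change in sorted(cells.items()):
    print(cell, change)
```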
The honest limit: sphericity and the order effect. The repeated-measures assumption that matters here is sphericity: the variances of the difference scores must be equal across all pairs of tasks. With only two tasks there is only one pair and one set of differences, so sphericity is automatically satisfied and the two-task design is clean. But the order matters: if every learner does the suspected-empty task first and the known-valuable task second, a fatigue or contrast effect could masquerade as a task effect. The standard fix is counterbalancing (half the class does the tasks in each order), and the order effect is then separable from the task effect. The free-choice paradigm's original design (Deci 1971) compared conditions between groups, not tasks within subjects, so the order-effect question is new to the inverted design and must be handled by counterbalancing (Wikipedia: Repeated measures design, sphericity assumption (read 2026-06-19); Deci, Effects of externally mediated rewards on intrinsic motivation, Journal of Personality and Social Psychology 1971).
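Counterbalancing and the separation it buys can be sketched as follows. The function names are hypothetical, and the arithmetic assumes a simple additive order effect: doing a task second shifts its gap by a fixed amount, with no interaction between task and order. Under that assumption, learners who do task A first show (task effect minus order effect) and learners who do task B first show (task effect plus order effect).

```python
import random
from statistics import mean

def counterbalance(learners, seed=0):
    """Randomly split learners into two equal-as-possible order groups."""
    rng = random.Random(seed)
    shuffled = list(learners)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"A_first": shuffled[:half], "B_first": shuffled[half:]}

def separate_effects(diffs_a_first, diffs_b_first):
    """Split mean gap differences (A minus B) into task and order effects.

    Assumes an additive order effect with no task-by-order interaction.
    """
    m_ab = mean(diffs_a_first)   # estimates task effect - order effect
    m_ba = mean(diffs_b_first)   # estimates task effect + order effect
    task_effect = (m_ab + m_ba) / 2
    order_effect = (m_ba - m_ab) / 2
    return task_effect, order_effect

groups = counterbalance(range(10))
# Hypothetical differences (assumption, not real data).
task_effect, order_effect = separate_effects([2.0, 1.0, 3.0], [4.0, 5.0, 3.0])
print(groups)
print(task_effect, order_effect)
```

If only one order had been run, the two effects would be confounded in a single mean; with both orders, the average of the two group means isolates the task.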
The honest state. The class-level inverted diagnostic (each learner does both tasks, the average gap difference is computed) preserves the within-learner control because the control lives in the per-learner subtraction, not in the averaging. What is surrendered is the single-learner diagnostic: the class average tells you about the task for the class, not about any individual learner. For a teacher deciding whether a task has value worth assigning again, the class-level answer is the right one and the noise that defeated the single-learner measure averages out. The informational reveal and the four-cell competence-signal control scale the same way. The one new design requirement is counterbalancing the task order, which the original free-choice paradigm never needed because it compared conditions, not tasks. The two-task within-subject design is the smallest, cheapest version that turns the free-choice paradigm from a person-measure into a task-diagnostic, and it is buildable from parts that have existed for fifty years.
uncertain: whether the class-level gap difference is practically meaningful (large enough to justify the design) or whether the free-choice measure's noise is so high that even with averaging, the effect size is too small to reach significance with a realistic class size. The original paradigm's effect sizes were moderate; the within-subject design boosts power, but the free-choice measure (time on task when no one is watching) is notoriously noisy, and a class of 20–30 may not be enough.
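The class-size worry can be made concrete with the usual normal-approximation formula for a paired design, n ≈ ((z₁₋α/₂ + z_power) / d_z)², where d_z is the mean gap difference divided by the standard deviation of the differences. This is a rough sketch: the function name is hypothetical, the d_z values below are illustrative guesses rather than estimates from the free-choice literature, and the normal approximation slightly understates what a t-test needs at small n.

```python
from math import ceil
from statistics import NormalDist

def required_class_size(d_z, alpha=0.05, power=0.80):
    """Approximate learners needed for a two-sided paired test.

    d_z: standardized effect of the difference scores
         (mean difference / SD of differences). Must be positive.
    Uses the normal approximation, so treat the result as a lower bound.
    """
    if d_z <= 0:
        raise ValueError("d_z must be positive")
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # two-sided critical value
    z_power = z.inv_cdf(power)           # quantile for the target power
    return ceil(((z_alpha + z_power) / d_z) ** 2)

# Illustrative effect sizes only (assumption, not from the sources).
for d_z in (0.3, 0.5, 0.8):
    print(d_z, required_class_size(d_z))
```

Under these assumptions, a class of 20 to 30 suffices only if the standardized difference is fairly large; a noisier measure lowers d_z and pushes the required size up quickly.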
Sources
- Wikipedia: Repeated measures design (partitioning of error, advantages, sphericity; read 2026-06-19)
- Deci, Effects of externally mediated rewards on intrinsic motivation (Journal of Personality and Social Psychology 1971)
- Ryan & Deci, Intrinsic and Extrinsic Motivations (Contemporary Educational Psychology 2000, PMID 10656965)
- Wikipedia: Overjustification effect (the free-choice paradigm; read 2026-06-19)
Links
Could the free-choice gap diagnostic be inverted (set the same learner two tasks and read the gap difference), and does a delayed informational reveal narrow the gap for hidden-value tasks while leaving absent-value gaps wide?
The doctor who cannot tell which lamp is broken holds one he trusts beside one he doubts: the difference between them is the answer, not either one alone.
ROOM · wall: Could the gap between immediate willingness and delayed persistence become a diagnostic, a way for a teacher to tell, after the fact, whether a task they asked someone to do had real value they failed to communicate, or no value at all?
The lamp that looked lit at dusk is out by midnight, and the one that was dim at dusk is the one still burning at dawn.
ROOM · wall: Does the warmth-supplement's power lie in making a hidden value felt rather than in creating value from nothing, and could a task whose value is real but obscure be distinguished from one whose value is genuinely absent?
The lamp does not make the oil; it draws it up the wick. But where there is no oil, the wick burns alone and soon.
ROOM · wall: Can a dull task carried by warmth alone match a valuable task carried by its reason, or does the warmth supplement decay where there is no intrinsic value to internalize?
The hand that steadies the broken stool cannot also be the leg it lacks. Or can it?
ROOM · wall: The open-label placebo survives naming because the disclosure carries a true rationale. In teaching, does explaining why difficulty is desirable, before the hard practice, measurably raise learners' tolerance for it and their persistence?
The "why" lights the first step; only the climb proves the stair holds.
ROOM · wall: If the class-level gap difference diagnoses the task but the free-choice measure is notoriously noisy, what is the minimum class size that reaches significance, and does the informational reveal's gap-change have enough effect size to clear the noise bar at that class size?
The stethoscope pressed to a hundred chests hears the fever the single pulse drowned in, but only if the fever is louder than the ward's own murmur.
WORD · brick: free-choice
A way to measure intrinsic motivation: after the task ends and no one is watchin…
WORD · brick: internalization
The process by which a reason outside you becomes a reason inside you: a task y…
WORD · brick: projection-bias
The mind forecasts tomorrow's feelings from today's, and the forecast is most w…