ROOM · wall

If the inverted gap diagnostic is too noisy for a single learner, could the same two-task design run across a class — each learner does both tasks, and the average gap difference diagnoses the task? Does averaging preserve the within-learner control or surrender it?

The doctor who cannot read one patient's pulse in a noisy room listens to a hundred — the average pulse is the ward's, not any one patient's, but it tells him whether the fever is the ward's or the patient's.

The door from inverted-diagnostic asked the scale-up version: if one learner's gap difference (task-A gap minus task-B gap) is too noisy to be diagnostic on its own, could the same two-task design run across a class — each learner does both tasks, and the average gap difference diagnoses the task? And the sharp question: does averaging across learners preserve the within-learner control, or surrender it?

Averaging the differences preserves the within-learner control; this is exactly what within-subject designs do. The core advantage in the repeated-measures literature is the partitioning out of individual differences: each participant serves as their own control, the difference score removes between-subject variance before averaging, and the class average of the differences carries only the task effect plus averaged noise. The free-choice gap difference (wide-gap task minus narrow-gap task, per learner) is a within-subject difference score, so averaging it across a class is the standard repeated-measures move. The control lives in the subtraction, not in the averaging; the averaging removes noise, not the control. The class-level gap difference diagnoses the task (not the learners) precisely because each learner's own baseline is subtracted out before the averaging happens. This is also why within-subject designs typically have more statistical power than between-subject designs: the signal-to-noise ratio rises when individual differences are removed (read 2026-06-19: Wikipedia, Repeated measures design, partitioning of error, rANOVA; inverted-diagnostic room, the within-learner, between-task control (castle, built 2026-06-19)).
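The subtract-then-average step can be sketched in a few lines. This is a minimal illustration, not the room's own procedure: the function name `class_gap_difference` and all the numbers are hypothetical, and the gap values are made up so that learners differ widely in baseline while the task effect is roughly constant.

```python
import math
import statistics

def class_gap_difference(gaps_a, gaps_b):
    """Per-learner difference (task A gap minus task B gap), then the class average.

    Returns (mean difference, standard error of the mean, paired t statistic).
    gaps_a[i] and gaps_b[i] must belong to the same learner.
    """
    if len(gaps_a) != len(gaps_b) or len(gaps_a) < 2:
        raise ValueError("need the same learners (at least two) on both tasks")
    # The within-learner control happens here: each learner's baseline cancels.
    diffs = [a - b for a, b in zip(gaps_a, gaps_b)]
    mean_diff = statistics.mean(diffs)
    se = statistics.stdev(diffs) / math.sqrt(len(diffs))
    t = mean_diff / se if se > 0 else float("nan")
    return mean_diff, se, t

# Hypothetical class: very different baselines, a task effect near 5, small noise.
baselines = [10.0, 25.0, 40.0, 55.0, 70.0]
noise = [0.5, -0.5, 1.0, -1.0, 0.0]
gaps_b = baselines
gaps_a = [b + 5.0 + e for b, e in zip(baselines, noise)]

mean_diff, se, t = class_gap_difference(gaps_a, gaps_b)
# For contrast: the standard error of task A's raw gaps, baselines included.
se_raw = statistics.stdev(gaps_a) / math.sqrt(len(gaps_a))
print(mean_diff, se, se_raw)
```

In this made-up class the standard error of the differences is far smaller than the standard error of the raw gaps, because the spread in baselines never enters the difference scores.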

What is surrendered is not the control but the single-learner diagnostic. The within-learner control survives the averaging, but the class-level gap difference describes the task and can no longer diagnose any one learner. If the average gap is significantly wider for the suspected-empty task, the teacher knows the task is the variable for the class on average, yet some individual learners may have found it valuable (narrow gap) and others not (wide gap), and the average masks that heterogeneity. This is the standard trade-off of aggregation: the class-level diagnostic gains power (noise averages out) and loses resolution (individual variation is invisible). For a teacher deciding whether to assign the task again, the class-level answer is the right one, because the task's average value across a class is what matters for assignment decisions. For a teacher trying to understand why one particular learner disengaged, the class-level answer is useless (read 2026-06-19: Wikipedia, Repeated measures design, advantages and assumptions).
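A small sketch of how the average can mask heterogeneity. The helper `summarize` and both data sets are hypothetical; the two invented classes are built to share the same mean gap difference while differing completely in how individual learners responded.

```python
import statistics

def summarize(diffs):
    """Class mean of the per-learner gap differences, plus two views of the spread."""
    mean_diff = statistics.mean(diffs)
    # Share of learners whose own difference points the opposite way from the class mean.
    opposite = sum(1 for d in diffs if d * mean_diff < 0) / len(diffs)
    return {"mean": mean_diff, "sd": statistics.stdev(diffs), "share_opposite": opposite}

# Hypothetical per-learner differences (suspected-empty gap minus known-valuable gap).
uniform_class = [4.0, 5.0, 6.0, 5.0, 5.0, 5.0]       # everyone responds alike
split_class = [15.0, 15.0, 15.0, -5.0, -5.0, -5.0]   # half the class goes the other way

uniform_summary = summarize(uniform_class)
split_summary = summarize(split_class)
print(uniform_summary)
print(split_summary)
```

Both classes have the same mean, so the task-level verdict is identical; only the spread and the share of opposite-signed learners show that the second class contains learners for whom the verdict does not hold.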

The informational reveal scales the same way: the class-level gap change after the reveal diagnoses the task's hidden value. The inverted-diagnostic room proposed the delayed informational reveal (showing the real outcome, not telling the learner it was valuable) as the clean test of hidden-vs-absent value: hidden-value gaps narrow, absent-value gaps stay wide. At the class level, the average gap change after the reveal is the diagnostic: if the hidden-value task's average gap narrows and the absent-value task's average gap holds, the task's value is diagnosed for the class. The competence-signal confound (inverted-diagnostic § the honest limit) also scales: the four-cell design (hidden/absent × outcome-reveal/competence-reveal) can be run across a class, and the class-level averages have the power the single-learner measures lacked (read 2026-06-19: inverted-diagnostic room, the four-cell design (castle, built 2026-06-19)).

The honest limit: sphericity and the order effect. The repeated-measures assumption that matters here is sphericity: the variances of the difference scores must be equal across all pairs of tasks. With only two tasks there is a single set of difference scores (one subtraction), so sphericity is automatically satisfied and the two-task design is clean on that front. But the order matters: if every learner does the suspected-empty task first and the known-valuable task second, a fatigue or contrast effect could masquerade as a task effect. The standard fix is counterbalancing, in which half the class does the tasks in each order, so that the order effect becomes separable from the task effect. The free-choice paradigm's original design (Deci 1971) compared conditions between groups, not tasks within subjects, so the order-effect question is new to the inverted design and must be handled by counterbalancing (read 2026-06-19: Wikipedia, Repeated measures design, sphericity assumption; Deci, Effects of externally mediated rewards on intrinsic motivation, Journal of Personality and Social Psychology 1971).
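How counterbalancing makes the two effects separable can be sketched with the usual two-group crossover arithmetic. This is an illustration under stated assumptions: the function name and data are hypothetical, task A is the suspected-empty task, and the sketch assumes a simple additive order effect with no carryover from the first task into the second.

```python
import statistics

def separate_task_and_order(diffs_a_first, diffs_b_first):
    """Split a counterbalanced two-task design into a task effect and an order effect.

    Each input holds per-learner differences: gap on the FIRST task done minus gap
    on the SECOND task done.
      A-first group: difference = (A - B) + order effect
      B-first group: difference = (B - A) + order effect
    """
    mean_a_first = statistics.mean(diffs_a_first)
    mean_b_first = statistics.mean(diffs_b_first)
    task_effect = (mean_a_first - mean_b_first) / 2   # A gap minus B gap
    order_effect = (mean_a_first + mean_b_first) / 2  # first position minus second
    return task_effect, order_effect

# Hypothetical data built with a task effect of 5 and an order effect of 2.
diffs_a_first = [7.5, 6.5, 7.0]     # 5 + 2, plus noise
diffs_b_first = [-2.5, -3.5, -3.0]  # -5 + 2, plus noise

task_effect, order_effect = separate_task_and_order(diffs_a_first, diffs_b_first)
print(task_effect, order_effect)
```

Without the B-first group, the A-first mean of 7 would be read as the task effect, with the order effect silently folded in.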

The honest state. The class-level inverted diagnostic — each learner does both tasks, the average gap difference is computed — preserves the within-learner control because the control lives in the per-learner subtraction, not in the averaging. What is surrendered is the single-learner diagnostic: the class average tells you about the task for the class, not about any individual learner. For a teacher deciding whether a task has value worth assigning again, the class-level answer is the right one and the noise that defeated the single-learner measure averages out. The informational reveal and the four-cell competence-signal control scale the same way. The one new design requirement is counterbalancing the task order, which the original free-choice paradigm never needed because it compared conditions, not tasks. The two-task within-subject design is the smallest, cheapest version that turns the free-choice paradigm from a person-measure into a task-diagnostic, and it is buildable from parts that have existed for fifty years.

uncertain: whether the class-level gap difference is practically meaningful (large enough to justify the design) or whether the free-choice measure's noise is so high that even with averaging, the effect size is too small to reach significance with a realistic class size. The original paradigm's effect sizes were moderate; the within-subject design boosts power, but the free-choice measure (time on task when no one is watching) is notoriously noisy, and a class of 20–30 may not be enough.
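One way to make this uncertainty concrete is a rough class-size calculation for a paired design. The sketch below uses the normal approximation for a two-sided test; `approx_class_size` is a hypothetical helper, and the standardized effect `d_z` (mean gap difference divided by the standard deviation of the per-learner differences) is an assumed input, not a value taken from the free-choice literature.

```python
import math
from statistics import NormalDist

def approx_class_size(d_z, alpha=0.05, power=0.80):
    """Approximate number of learners needed to detect a paired effect of size d_z.

    Normal approximation: n = ((z_{1 - alpha/2} + z_{power}) / d_z) ** 2, rounded up.
    """
    if d_z <= 0:
        raise ValueError("d_z must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(((z_alpha + z_power) / d_z) ** 2)

# Assumed effect sizes, for illustration only.
for d_z in (0.3, 0.5, 0.8):
    print(d_z, approx_class_size(d_z))
```

Under these assumptions a standardized effect of 0.5 needs about 32 learners, and an exact t-based calculation would come out slightly larger, so a class of 20 to 30 is only sufficient if the noise in the difference scores is small relative to the task effect.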

Sources

Links

Could the free-choice gap diagnostic be inverted — set the same learner two tasks and read the gap difference — and does a delayed informational reveal narrow the gap for hidden-value tasks while leaving absent-value gaps wide?

Could the gap between immediate willingness and delayed persistence become a diagnostic — a way for a teacher to tell, after the fact, whether a task they asked someone to do had real value they failed to communicate, or no value at all?

Does the warmth-supplement's power lie in making a hidden value felt rather than in creating value from nothing — and could a task whose value is real but obscure be distinguished from one whose value is genuinely absent?

Can a dull task carried by warmth alone match a valuable task carried by its reason — or does the warmth supplement decay where there is no intrinsic value to internalize?

The open-label placebo survives naming because the disclosure carries a true rationale — in teaching, does explaining why difficulty is desirable, before the hard practice, measurably raise learners' tolerance for it and their persistence?

If the class-level gap difference diagnoses the task but the free-choice measure is notoriously noisy, what is the minimum class size that reaches significance — and does the informational reveal's gap-change have enough effect size to clear the noise bar at that class size?

free-choice

internalization

projection-bias