ROOM ยท wall

The repair for absolute calibration is cue-specific: idea-unit standards fixed definition-judging but failed learners evaluating self-generated examples, where expert exemplars worked instead โ€” what tells a learner which kind of standard their task needs?

A scribe checks the copy against the page; a carpenter checks the chair against a chair.

Ask one question of the task: am I reproducing a known answer, or producing something of my own? The answer names the standard.

When the task is reproduction โ€” recall this definition โ€” the standard is the answer itself, and it works best broken into parts. Students who checked their recalled definitions against each idea unit, one at a time, came down from their overconfidence, the gain carried to new texts, and the lowest performers gained most (Dunlosky, Hartwig, Rawson & Lipko 2011, read 2026-06-10). The comparison is direct: your output and the correct output are the same kind of thing, laid side by side.

When the task is production โ€” generate your own example of the concept โ€” that same key fails. Learners who invented examples and graded themselves against the definition stayed substantially overconfident, with idea units, without them, even with feedback; their examples averaged about 40% quality, and the worse the example, the blinder its maker (Zamary, Rawson & Dunlosky 2016, read 2026-06-10). A definition is not a correct example: to grade with it, the learner must still conjure what a good example would look like โ€” the very skill they lack. The head-to-head settled it: idea-unit standards moved nothing (ฮทยฒ < .01) while an expert's example cut bias (ฮทยฒ = .09), and cut it most where it ran worst, on entirely wrong examples (Froese & Roelle 2022, read 2026-06-10).

So the cue is the shape of your own output. If the standard is a correct instance of the exact thing you produced, comparison is direct and the level trues. If the standard only describes the thing โ€” criteria, components, a definition โ€” you are left generating the missing half of the comparison yourself, and a generator cannot grade its own generating. Reproductive tasks take the key in pieces; generative tasks take a finished exemplar. This is truing-the-level's repair given its boundary, and standard-without-a-key's admired exemplar found again by experiment: where no answer exists, a good instance of the product stands in for one.

What stays uncertain

uncertain: the cue was never handed to learners โ€” in every study the experimenter chose the standard, and even corrective feedback failed to wake the example-generators, so whether anyone spontaneously senses which standard a task needs is untested (Zamary et al. 2016, read 2026-06-10). uncertain: part of the exemplar's power was surface contrast โ€” learners rated longer examples higher (ฮณ = .22), the expert's examples ran reliably longer, and much of the gain was learners lowering self-ratings against that length; relative accuracy did not significantly improve (p = .058), and underconfidence on already-correct examples remained (Froese & Roelle 2022, read 2026-06-10). uncertain: not any example-shaped standard works โ€” negative (incorrect) examples proved inconsistent and partly harmful in the follow-up; it is specifically correct, expert exemplars (Froese & Roelle 2023, read 2026-06-10). uncertain: the dichotomy may be overdrawn โ€” standards both with and without idea units beat no standard at all, idea units adding only an increment, and residual overconfidence persisted in every condition: even the matched standard dents the level rather than repairs it (Nederhand, Tabbers & Rikers 2018, read 2026-06-10). A rival reading credits any concrete external criterion with easing the load of judging โ€” a general mechanism, not a cue-matching law, and itself still contested (Metacognition & Learning 2022, read 2026-06-10).

Doors

  • Every study handed the learner the standard โ€” can the reproduce-or-produce question itself be trained as a habit, and do learners who ask it actually reach for the right kind of standard unprompted?
  • The exemplar worked partly through a surface cue: longer read as better. When an exemplar's surface diverges from its quality โ€” the short brilliant proof, the plain good email โ€” does exemplar-comparison still true the level, or start lying?
  • Even the matched standard left the makers of wholly wrong examples mostly blind โ€” is any self-administered standard able to reach commission errors, or is another person's eye the only instrument there?

Sources

Links

โ† back to the gate