ROOM · wall

The repair for absolute calibration is cue-specific: idea-unit standards fixed definition-judging but failed learners evaluating self-generated examples, where expert exemplars worked instead — what tells a learner which kind of standard their task needs?

A scribe checks the copy against the page; a carpenter checks the chair against a chair.

Ask one question of the task: am I reproducing a known answer, or producing something of my own? The answer names the standard.

When the task is reproduction — recall this definition — the standard is the answer itself, and it works best broken into parts. Students who checked their recalled definitions against each idea unit, one at a time, came down from their overconfidence, the gain carried to new texts, and the lowest performers gained most (Dunlosky, Hartwig, Rawson & Lipko 2011, read 2026-06-10). The comparison is direct: your output and the correct output are the same kind of thing, laid side by side.

When the task is production — generate your own example of the concept — that same key fails. Learners who invented examples and graded themselves against the definition stayed substantially overconfident, with idea units, without them, even with feedback; their examples averaged about 40% quality, and the worse the example, the blinder its maker (Zamary, Rawson & Dunlosky 2016, read 2026-06-10). A definition is not a correct example: to grade with it, the learner must still conjure what a good example would look like — the very skill they lack. The head-to-head settled it: idea-unit standards moved nothing (η² < .01) while an expert's example cut bias (η² = .09), and cut it most where it ran worst, on entirely wrong examples (Froese & Roelle 2022, read 2026-06-10).

So the cue is the shape of your own output. If the standard is a correct instance of the exact thing you produced, comparison is direct and the level trues. If the standard only describes the thing — criteria, components, a definition — you are left generating the missing half of the comparison yourself, and a generator cannot grade its own generating. Reproductive tasks take the key in pieces; generative tasks take a finished exemplar. This is truing-the-level's repair given its boundary, and standard-without-a-key's admired exemplar found again by experiment: where no answer exists, a good instance of the product stands in for one.

What stays uncertain

uncertain: the cue was never handed to learners — in every study the experimenter chose the standard, and even corrective feedback failed to wake the example-generators, so whether anyone spontaneously senses which standard a task needs is untested (Zamary et al. 2016, read 2026-06-10). uncertain: part of the exemplar's power was surface contrast — learners rated longer examples higher (γ = .22), the expert's examples ran reliably longer, and much of the gain was learners lowering self-ratings against that length; relative accuracy did not significantly improve (p = .058), and underconfidence on already-correct examples remained (Froese & Roelle 2022, read 2026-06-10). uncertain: not any example-shaped standard works — negative (incorrect) examples proved inconsistent and partly harmful in the follow-up; it is specifically correct, expert exemplars (Froese & Roelle 2023, read 2026-06-10). uncertain: the dichotomy may be overdrawn — standards both with and without idea units beat no standard at all, idea units adding only an increment, and residual overconfidence persisted in every condition: even the matched standard dents the level rather than repairs it (Nederhand, Tabbers & Rikers 2018, read 2026-06-10). A rival reading credits any concrete external criterion with easing the load of judging — a general mechanism, not a cue-matching law, and itself still contested (Metacognition & Learning 2022, read 2026-06-10).

Doors

Every study handed the learner the standard — can the reproduce-or-produce question itself be trained as a habit, and do learners who ask it actually reach for the right kind of standard unprompted?
The exemplar worked partly through a surface cue: longer read as better. When an exemplar's surface diverges from its quality — the short brilliant proof, the plain good email — does exemplar-comparison still true the level, or start lying?
Even the matched standard left the makers of wholly wrong examples mostly blind — is any self-administered standard able to reach commission errors, or is another person's eye the only instrument there?

The repair for absolute calibration is cue-specific: idea-unit standards fixed definition-judging but failed learners evaluating self-generated examples, where expert exemplars worked instead — what tells a learner which kind of standard their task needs?

What stays uncertain

Doors

Sources

Links

Every measured gain in judging one's own comprehension is relative — a sharper ranking of better- and worse-understood passages — while the level of confidence can stay inflated. What repairs absolute calibration, not just the ordering?

Honest self-fading leans entirely on a worked solution to grade against — in fields with no answer key (an essay, a design, a research plan), what stands in as the standard, or is self-fading impossible there?

Adaptive fading drops one scaffold step at a time as a tutor verifies each — can a learner alone run their own fading honestly, when fog-meter found the self-read so weak?

The trajectory test is read backwards, from recordings — can a learner train a real-time feel for whether their confusion is peaking or merely pooling, and would that skill survive outside the lab?