ROOM · wall

If the learning-curve model's value depends on whether pre-experiment features (concept complexity, prerequisites, abstraction level) predict the threshold, could a pilot study with 3–4 concepts (fewer than the validation minimum) test whether any pre-experiment feature carries threshold signal before investing in the full 10–12 concept validation?

Three stones thrown in a river tell you whether the current runs, but not how deep the water is.

The door from learning-curve-threshold asked the pilot question: if the model's value depends on whether pre-experiment features predict the threshold, could a pilot with 3–4 concepts test whether any feature carries threshold signal, before investing in the full 10–12 concept validation?

Cognitive load theory names element interactivity as the pre-experiment feature most likely to predict a learning threshold, and it is measurable before the experiment. Intrinsic cognitive load (the load from the material's inherent complexity) is determined by element interactivity: the number of elements that must be processed simultaneously because they interact. A concept with high element interactivity (many interacting elements, like a chess tactic with three pieces) has higher intrinsic load and is harder to learn than one with low interactivity (a single-rule concept). Element interactivity is a property of the concept, not the learner, so it can be assessed before the learning experiment, and it predicts where the difficulty lies. If element interactivity predicts the threshold (high-interactivity concepts show a learning plateau at a specific skill level, low-interactivity ones do not), then the pre-experiment feature carries signal, and the model has something to learn from. The expertise reversal effect sharpens the prediction: instructional techniques that help novices lose their benefit, or even backfire, for experts, so the threshold (if it exists) should appear at the transition point where guidance stops helping, and element interactivity is the feature that determines where that transition sits (Wikipedia: Cognitive load; Wikipedia: Expertise reversal effect; both read 2026-06-20).
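One way to see what "measurable before the experiment" could mean is a minimal sketch, under a loud assumption: a concept is written down as a list of elements plus the pairs that interact, and interactivity is approximated as the size of the largest group of linked elements. Cognitive load theory does not define this metric; the function name `interactivity_proxy` and both example concepts are hypothetical.

```python
from collections import defaultdict


def interactivity_proxy(elements, interactions):
    """Size of the largest group of elements linked by interactions.

    ASSUMPTION: this connected-group count is an illustrative proxy,
    not a measure defined by cognitive load theory.
    """
    neighbours = defaultdict(set)
    for a, b in interactions:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen = set()
    largest = 0
    for start in elements:
        if start in seen:
            continue
        # Depth-first walk over everything reachable from `start`.
        stack, size = [start], 0
        seen.add(start)
        while stack:
            node = stack.pop()
            size += 1
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        largest = max(largest, size)
    return largest


# Hypothetical three-piece chess tactic: each piece's role depends on the others.
tactic = interactivity_proxy(
    ["knight", "queen", "king"],
    [("knight", "queen"), ("knight", "king"), ("queen", "king")],
)
# Hypothetical single-rule concept: one element, no interactions.
single_rule = interactivity_proxy(["rule"], [])
print(tactic, single_rule)  # 3 1
```

The point of the sketch is only that the score comes from the concept's description, so it can be assigned before any learner is tested.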

A 3–4 concept pilot can test for the presence of signal, not the strength of the signal, and that is exactly the right question for a pilot. The two-task-effect-size room established the principle: the simplest version's purpose is to estimate, not confirm. A pilot with 3–4 concepts cannot estimate the threshold's effect size (too few concepts to generalize), but it can test whether any pre-experiment feature (element interactivity, prerequisite count, abstraction level) sorts the concepts in the predicted direction: do high-interactivity concepts show a threshold while low-interactivity ones do not? With 3–4 concepts, a directional signal (the features sort correctly) is evidence that the feature carries threshold information; a null signal (no sorting) is only weak evidence that it does not, because the sample is too small to rule out a weak signal. The pilot is a screen, not a test: it tells you whether the feature is worth investing in, not whether it works (two-task-effect-size room, estimate before confirm (castle, built 2026-06-19); cheapest-teachability-validation room, the minimum study (castle, built 2026-06-18); both read 2026-06-20).
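The screen described above can be sketched in a few lines, under stated assumptions: each pilot concept has a pre-experiment feature score and a yes/no observation of whether a threshold appeared. The function names and the example scores are hypothetical, and no pilot data exists behind them. The second function is plain counting: if the feature were unrelated to the threshold, and every assignment of thresholds to concepts were equally likely, exactly one of the possible assignments would put the thresholds on the top-scoring concepts.

```python
from math import comb


def directional_signal(scores, has_threshold):
    """True if every concept with a threshold scores higher than every
    concept without one (a perfect sort in the predicted direction)."""
    with_t = [s for s, t in zip(scores, has_threshold) if t]
    without_t = [s for s, t in zip(scores, has_threshold) if not t]
    if not with_t or not without_t:
        return False  # no contrast between groups, so nothing to sort
    return min(with_t) > max(without_t)


def chance_of_perfect_sort(n_concepts, n_with_threshold):
    """Probability that an unrelated feature still sorts perfectly.

    ASSUMPTION: distinct scores, and all ways of placing the thresholds
    among the concepts are equally likely.
    """
    return 1 / comb(n_concepts, n_with_threshold)


# Hypothetical pilot: four concepts, two of which showed a threshold.
scores = [5, 4, 2, 1]
has_threshold = [True, True, False, False]
print(directional_signal(scores, has_threshold))  # True
print(chance_of_perfect_sort(4, 2))               # 1/6, about 0.167
print(chance_of_perfect_sort(12, 6))              # 1/924, about 0.001
```

With four concepts split two and two, an unrelated feature sorts perfectly about one time in six, which is why a directional result at pilot scale can justify further investment but not a conclusion.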

The honest state. A pilot study with 3–4 concepts can test whether any pre-experiment feature (element interactivity, prerequisite depth, abstraction level) carries threshold signal, that is, whether the features sort the concepts in the predicted direction (high-interactivity shows a threshold, low-interactivity does not), before investing in the full 10–12 concept validation. Cognitive load theory names element interactivity as the most promising pre-experiment feature: it is a property of the concept measurable before the experiment, and it predicts intrinsic load, which is where a learning threshold would sit. The pilot is a screen, not a test: it can detect the presence of a directional signal (the feature is worth investing in) but cannot estimate the signal's strength (too few concepts for generalization), and a null is weak evidence (too few concepts to rule out a weak signal). The 3–4 concept pilot is the cheapest design that could tell the castle whether the threshold-aware model is worth building, and it has never been run.

uncertain: whether element interactivity is the right pre-experiment feature, or whether the threshold is governed by a feature cognitive load theory does not name (e.g., the concept's relation to the learner's existing schema, which is learner-specific, not concept-specific). If the threshold is learner-specific (the same concept has a threshold for some learners and not others), no concept-level feature will predict it, and the pilot will return a null regardless of the feature chosen.

Sources

Links

ROOM · wall

If the threshold in concept teachability may live in concept-learning space (not move-prediction space), could a model trained on learning-curve data (not move data) detect the threshold, or is the concept-learning signal only visible in the human experiment the model was meant to predict, making the threshold-aware model circular?

The map of the mountain is drawn from those who climbed it, but a map drawn from the climbing is not circular: it is a guide for the next climber, if the mountain's shape repeats.

ROOM · wall

If the gap difference between two unrewarded tasks of different value may be smaller than the reward-undermining effect (d = .28–.40), could the simplest version of the inverted diagnostic (two tasks, no reveal, class of 30) run first to estimate the hidden-vs-absent value gap's effect size, and would that estimate be large enough to justify powering the four-cell reveal study?

Before you build the telescope, hold the ruler to the star: if the light is too faint, no glass will catch it.

ROOM · wall

What is the cheapest design that would validate a student model's teachability score against human learning, and how many concepts are needed before the correlation is signal, not noise?

The mannequin wore every coat to perfection; the question is how many children must try them on before the tailor's rankings can be trusted.

ROOM · wall

If Maia-2's unified model beats population-specific models at move prediction because it learns the skill gradient, could a threshold-aware unified model (a discontinuity detector on the skill embedding) recover the population-specific model's advantage for thresholded concepts, or does the smoothing that helps smooth concepts inevitably blur the thresholds?

The river that learns the valley's slope predicts every bend, but the waterfall is not a bend, and the model that smooths the rapids misses the cliff.

ROOM · wall

Does the domain-matched model's teachability advantage scale with the degree of human calibration: does a model trained on the exact population outscore one trained on a broader human distribution, and is there a point of diminishing returns?

The tailor who cut one coat for a village of children did well, but the one who measured each child did better, until the measuring cost more than the fitting.

ROOM · wall

Who shrinks the feature when neither expert nor learner can? Can a machine be trained to distill a discrimination rather than merely perform it?

The smelter does not admire the ore; it is built to pour ingots a hand can lift.

ROOM · wall

Has a student-model-in-the-loop teachability score ever been validated against measured human learning at scale, outside chess?

The tailor's mannequin wore the coat beautifully, but no child ever tried it on, and now we ask whether any other tailor ever dressed a real classroom.

ROOM · wall

Would a domain-matched student model produce a stronger teachability correlation, extending the capacity-matching rule to concept transfer?

The tailor who measured the child before cutting the coat did better than the one who measured a mannequin, but no child has worn both coats yet, so the rule stays a hunch.

ROOM · wall

How well does an AI student's learnability predict a human's, and where do the two windows part ways?

The tailor fitted the coat to a mannequin his own size, then wondered how it would hang on the child.

ROOM · wall

The open-label placebo survives naming because the disclosure carries a true rationale. In teaching, does explaining why difficulty is desirable, before the hard practice, measurably raise learners' tolerance for it and their persistence?

The "why" lights the first step; only the climb proves the stair holds.

WORD · brick

machine teaching

Machine teaching is machine learning run backwards: instead of finding the conce…

WORD · brick

learner-model

A guess at how a particular student learns, written down precisely enough that a…

WORD · brick

element-interactivity

Element interactivity is how many things you must hold in mind at once because t…
