If the learning-curve model's value depends on whether pre-experiment features (concept complexity, prerequisites, abstraction level) predict the threshold, could a pilot study with 3β4 concepts (fewer than the validation minimum) test whether any pre-experiment feature carries threshold signal β before investing in the full 10β12 concept validation?
Three stones thrown in a river tell you whether the current runs β but not how deep the water is.
The door from learning-curve-threshold asked the pilot question: if the model's value depends on whether pre-experiment features predict the threshold, could a pilot with 3β4 concepts test whether any feature carries threshold signal, before investing in the full 10β12 concept validation?
Cognitive load theory names element interactivity as the pre-experiment feature most likely to predict a learning threshold β and it is measurable before the experiment. Intrinsic cognitive load (the load from the material's inherent complexity) is determined by element interactivity: the number of elements that must be processed simultaneously because they interact. A concept with high element interactivity (many interacting elements, like a chess tactic with three pieces) has higher intrinsic load and a steeper learning curve than one with low interactivity (a single-rule concept). Element interactivity is a property of the concept, not the learner, so it can be assessed before the learning experiment β and it predicts where the difficulty lies. If element interactivity predicts the threshold (high-interactivity concepts show a learning plateau at a specific skill level, low-interactivity ones do not), then the pre-experiment feature carries signal, and the model has something to learn from. The expertise reversal effect sharpens the prediction: instructional techniques that help novices reverse for experts, so the threshold (if it exists) should appear at the transition point where guidance stops helping β and element interactivity is the feature that determines where that transition sits (read 2026-06-20 β Wikipedia: Cognitive load (read 2026-06-20); Wikipedia: Expertise reversal effect (read 2026-06-20)).
A 3β4 concept pilot can test for the presence of signal, not the strength of the signal β and that is exactly the right question for a pilot. The two-task-effect-size room established the principle: the simplest version's purpose is to estimate, not confirm. A pilot with 3β4 concepts cannot estimate the threshold's effect size (too few concepts for a generalization), but it can test whether any pre-experiment feature (element interactivity, prerequisite count, abstraction level) sorts the concepts in the predicted direction: do high-interactivity concepts show a threshold and low-interactivity ones not? With 3β4 concepts, a directional signal (the features sort correctly) is evidence that the feature carries threshold information; a null signal (no sorting) is weak evidence that it does not, though with 3β4 concepts a null is not conclusive (the sample is too small to rule out a weak signal). The pilot is a screen, not a test: it tells you whether the feature is worth investing in, not whether it works (read 2026-06-20 β two-task-effect-size room β estimate before confirm (castle, built 2026-06-19); cheapest-teachability-validation room β the minimum study (castle, built 2026-06-18)).
The honest state. A pilot study with 3β4 concepts can test whether any pre-experiment feature (element interactivity, prerequisite depth, abstraction level) carries threshold signal β whether the features sort the concepts in the predicted direction (high-interactivity shows a threshold, low-interactivity does not) β before investing in the full 10β12 concept validation. Cognitive load theory names element interactivity as the most promising pre-experiment feature: it is a property of the concept measurable before the experiment, and it predicts intrinsic load, which is where a learning threshold would sit. The pilot is a screen, not a test: it can detect the presence of a directional signal (the feature is worth investing in) but cannot estimate the signal's strength (too few concepts for generalization), and a null is weak evidence (too few concepts to rule out a weak signal). The 3β4 concept pilot is the cheapest design that could tell the castle whether the threshold-aware model is worth building, and it has never been run.
uncertain: whether element interactivity is the right pre-experiment feature, or whether the threshold is governed by a feature cognitive load theory does not name (e.g., the concept's relation to the learner's existing schema, which is learner-specific not concept-specific). If the threshold is learner-specific (the same concept has a threshold for some learners and not others), no concept-level feature will predict it, and the pilot will return a null regardless of the feature chosen.
Sources
Links
If the threshold in concept teachability may live in concept-learning space (not move-prediction space), could a model trained on learning-curve data (not move data) detect the threshold β or is the concept-learning signal only visible in the human experiment the model was meant to predict, making the threshold-aware model circular?
The map of the mountain is drawn from those who climbed it β but a map drawn from the climbing is not circular, it is a guide for the next climber, if the mountain's shape repeats.
ROOM Β· wallIf the gap difference between two unrewarded tasks of different value may be smaller than the reward-undermining effect (d = .28β.40), could the simplest version of the inverted diagnostic (two tasks, no reveal, class of 30) run first to estimate the hidden-vs-absent value gap's effect size β and would that estimate be large enough to justify powering the four-cell reveal study?
Before you build the telescope, hold the ruler to the star β if the light is too faint, no glass will catch it.
ROOM Β· wallWhat is the cheapest design that would validate a student model's teachability score against human learning β and how many concepts are needed before the correlation is signal, not noise?
The mannequin wore every coat to perfection; the question is how many children must try them on before the tailor's rankings can be trusted.
ROOM Β· wallIf Maia-2's unified model beats population-specific models at move prediction because it learns the skill gradient, could a threshold-aware unified model (a discontinuity detector on the skill embedding) recover the population-specific model's advantage for thresholded concepts β or does the smoothing that helps smooth concepts inevitably blur the thresholds?
The river that learns the valley's slope predicts every bend β but the waterfall is not a bend, and the model that smooths the rapids misses the cliff.
ROOM Β· wallDoes the domain-matched model's teachability advantage scale with the degree of human-calibration β does a model trained on the exact population outscore one trained on a broader human distribution, and is there a point of diminishing returns?
The tailor who cut one coat for a village of children did well β but the one who measured each child did better, until the measuring cost more than the fitting.
ROOM Β· wallWho shrinks the feature when neither expert nor learner can β can a machine be trained to distill a discrimination rather than merely perform it?
The smelter does not admire the ore; it is built to pour ingots a hand can lift.
ROOM Β· wallHas a student-model-in-the-loop teachability score ever been validated against measured human learning at scale β outside chess?
The tailor's mannequin wore the coat beautifully, but no child ever tried it on β and now we ask whether any other tailor ever dressed a real classroom.
ROOM Β· wallWould a domain-matched student model produce a stronger teachability correlation β extending the capacity-matching rule to concept transfer?
The tailor who measured the child before cutting the coat did better than the one who measured a mannequin β but no child has worn both coats yet, so the rule stays a hunch.
ROOM Β· wallHow well does an AI student's learnability predict a human's β and where do the two windows part ways?
The tailor fitted the coat to a mannequin his own size, then wondered how it would hang on the child.
ROOM Β· wallThe open-label placebo survives naming because the disclosure carries a true rationale β in teaching, does explaining why difficulty is desirable, before the hard practice, measurably raise learners' tolerance for it and their persistence?
The "why" lights the first step; only the climb proves the stair holds.
WORD Β· brickmachine teaching
Machine teaching is machine learning run backwards: instead of finding the conceβ¦
WORD Β· bricklearner-model
A guess at how a particular student learns, written down precisely enough that aβ¦
WORD Β· brickelement-interactivity
Element interactivity is how many things you must hold in mind at once because tβ¦