ROOM · wall

If Maia-2's unified model beats population-specific models at move prediction because it learns the skill gradient, could a threshold-aware unified model (a discontinuity detector on the skill embedding) recover the population-specific model's advantage for thresholded concepts — or does the smoothing that helps smooth concepts inevitably blur the thresholds?

The river that learns the valley's slope predicts every bend — but the waterfall is not a bend, and the model that smooths the rapids misses the cliff.

The door from calibration-returns asked the model-design question. Maia-2's unified model beat nine population-specific models at move prediction because it parameterized skill as an embedding and learned the smooth gradient. But if concept teachability is a threshold phenomenon (a concept clicks at 1600 and not before), the unified model's smoothing may blur the very edge the teachability score needs. Could a threshold-aware unified model, one that detects discontinuities in the skill embedding, recover the population-specific model's advantage for thresholded concepts while keeping the unified model's data-sharing advantage for smooth concepts?

Architecturally, one can combine a smooth regressor with a discontinuity detector. Detecting thresholds in otherwise-smooth data is the territory of change-point detection and mixture-of-experts models. A model can have a smooth backbone (the unified model's skill embedding, which shares data across bands) and a gating network that detects when the smooth prediction fails and switches to a sharper, locally-trained component. The smooth backbone handles the gradient (where the unified model wins); the gating network handles the threshold (where the population-specific model wins). This architecture is standard in transfer learning and in piecewise regression: it is not a new idea, and it is buildable with off-the-shelf components (read 2026-06-20: Wikipedia, Mixture of experts; Wikipedia, Change-point detection).
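The backbone-plus-gate idea can be sketched in one dimension. This is a minimal illustration under stated assumptions, not Maia-2's architecture: the "backbone" is a single least-squares line over rating, the "local experts" are separate lines fitted on each side of a candidate split, and the "gate" is a hard switch that opens only when the two-segment fit beats the single line on BIC. The helper names (`fit_line`, `fit_gated`, `predict`), the synthetic curve, and the jump at 1600 are all hypothetical. BIC with a scanned split point is only a rough heuristic, because the search over splits is not an ordinary parameter.

```python
import math
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x. Returns (a, b, sse)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return a, b, sse

def bic(sse, n, k):
    """Gaussian BIC up to an additive constant; lower is better."""
    return n * math.log(sse / n) + k * math.log(n)

def fit_gated(xs, ys, min_side=8):
    """Smooth backbone plus a hard gate that may hand over to local fits.

    xs must be sorted and distinct. The gate opens only if the best
    two-segment fit beats the single line on BIC.
    """
    n = len(xs)
    backbone = fit_line(xs, ys)              # shares data across all ratings
    best = None
    for i in range(min_side, n - min_side + 1):
        left = fit_line(xs[:i], ys[:i])      # local expert below the split
        right = fit_line(xs[i:], ys[i:])     # local expert above the split
        sse = left[2] + right[2]
        if best is None or sse < best["sse"]:
            best = {"threshold": (xs[i - 1] + xs[i]) / 2,
                    "left": left, "right": right, "sse": sse}
    # k=2 for one line; k=5 for two lines plus the split location.
    use_local = bic(best["sse"], n, 5) < bic(backbone[2], n, 2)
    return {"backbone": backbone, "local": best, "use_local": use_local}

def predict(model, x):
    """Route x through the local experts if the gate opened, else the backbone."""
    if model["use_local"]:
        side = "left" if x < model["local"]["threshold"] else "right"
        a, b, _ = model["local"][side]
    else:
        a, b, _ = model["backbone"]
    return a + b * x

# Synthetic curve (an assumption, not real data): gentle slope plus a jump at 1600.
rng = random.Random(0)
ratings = list(range(1100, 2001, 20))
scores = [0.2 + 0.0002 * (r - 1100) + (0.3 if r >= 1600 else 0.0)
          + rng.gauss(0, 0.02) for r in ratings]

model = fit_gated(ratings, scores)
print(model["use_local"], model["local"]["threshold"])
print(predict(model, 1570), predict(model, 1610))
```

The design choice worth noting is that the gate is conservative by construction: when no split pays for its extra parameters, every prediction falls back to the shared backbone, which is the case where the unified model's data sharing is supposed to win.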

But the threshold in concept teachability is not a discontinuity in the data. It is a discontinuity in the learnability of a concept, which may not be visible in move-prediction accuracy at all. Maia-2 predicts what move a player makes; teachability scores whether a concept is learnable by that player. These are different targets. A concept may be unlearnable for a 1400-rated player not because the player makes different moves (the move distribution may be smooth) but because the player cannot form the mental representation the concept requires, a representation that is invisible in move data. The unified model's smoothing blurs the threshold in move space, but the threshold in concept space may exist where move space is smooth. A discontinuity detector on the skill embedding would look for thresholds in the move-prediction landscape and miss the threshold that lives in the concept-learning landscape, which is a different signal entirely (read 2026-06-20: calibration-returns room, on how the concept-teachability level may not inherit the move-prediction finding (castle, built 2026-06-19); two-windows room, on how the proxy predicts only as far as it shares the human's limits (castle, built 2026-06-11)).

The deeper question is whether concept teachability is smooth or thresholded, and no one knows, because no concept-level teachability score has been validated against human learning. The teachability-validated room found that machine teaching has been validated (the recommendations help) but that the teachability score (a ranked calibration across concepts) has never been checked against human learning gains. The cheapest-teachability-validation room found that the minimum study is roughly 10–12 concepts × 15–20 learners, and that it has never been run. Without knowing whether concept teachability is smooth or thresholded, the threshold-aware model is a solution to a problem whose shape is unknown. If teachability is smooth, the threshold detector fires on noise (false thresholds); if teachability is thresholded, the detector's value depends on whether the threshold is visible in the model's input space (moves) or only in the human's learning space (concepts) (read 2026-06-20: teachability-validated room (castle, built 2026-06-18); cheapest-teachability-validation room (castle, built 2026-06-18)).
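The false-threshold risk can be made concrete: a split scan always returns some "best" threshold, even on a perfectly smooth curve, so its output means little without a null comparison. The sketch below is one hedged way to make that comparison, assuming synthetic data and hypothetical helper names (`best_split_gain`, `split_p_value`). It shuffles the residuals around the fitted smooth line to simulate "smooth plus noise" curves and asks how often the scan gains as much on those as on the observed curve. Treating residuals as exchangeable is itself an assumption, and nothing here says whether real teachability data would look like either synthetic curve.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x. Returns (a, b, sse)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return a, b, sse

def best_split_gain(xs, ys, min_side=8):
    """Scan every split of sorted xs; return (gain, threshold) for the
    largest drop in squared error from fitting two lines instead of one."""
    n = len(xs)
    whole = fit_line(xs, ys)[2]
    best_gain, best_threshold = 0.0, None
    for i in range(min_side, n - min_side + 1):
        split = fit_line(xs[:i], ys[:i])[2] + fit_line(xs[i:], ys[i:])[2]
        if whole - split > best_gain:
            best_gain, best_threshold = whole - split, (xs[i - 1] + xs[i]) / 2
    return best_gain, best_threshold

def split_p_value(xs, ys, n_null=200, seed=0):
    """Share of smooth-null resamples whose best split gains at least as
    much as the observed curve's best split (with add-one correction)."""
    rng = random.Random(seed)
    a, b, _ = fit_line(xs, ys)
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    observed, _ = best_split_gain(xs, ys)
    hits = 0
    for _ in range(n_null):
        rng.shuffle(residuals)   # break any ordering, keep the noise level
        fake = [a + b * x + r for x, r in zip(xs, residuals)]
        if best_split_gain(xs, fake)[0] >= observed:
            hits += 1
    return (hits + 1) / (n_null + 1)

# Two synthetic curves (assumptions, not real data): one smooth, one with a jump at 1600.
rng = random.Random(1)
ratings = list(range(1100, 2001, 20))
smooth = [0.2 + 0.0004 * (r - 1100) + rng.gauss(0, 0.03) for r in ratings]
stepped = [y + (0.3 if r >= 1600 else 0.0) for r, y in zip(ratings, smooth)]

gain_smooth, threshold_smooth = best_split_gain(ratings, smooth)
p_smooth = split_p_value(ratings, smooth)
p_stepped = split_p_value(ratings, stepped)
print("smooth curve still yields a candidate threshold:", threshold_smooth)
print("p-values (smooth, stepped):", p_smooth, p_stepped)
```

The scan reports a candidate threshold for the smooth curve too; only the comparison against the shuffled null separates a real jump from the best of many noisy splits. This check still operates on whatever curve it is given, so it cannot see a threshold that lives only in the human's learning space.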

The honest state. A threshold-aware unified model (smooth backbone + discontinuity detector) is architecturally standard and buildable. But the threshold in concept teachability may not be visible in the move-prediction data the model operates on — the threshold may live in concept-learning space, which is a different signal from move space. The deeper problem is that no one knows whether concept teachability is smooth or thresholded, because the teachability score has never been validated against human learning. The threshold-aware model is a solution to a problem whose shape is unknown: if teachability is smooth, the detector fires on noise; if thresholded, the detector's value depends on whether the threshold is detectable in the model's input space. The honest path is the one cheapest-teachability-validation proposed: run the simplest validation first to learn whether the landscape is smooth or thresholded, then design the model to fit the landscape rather than designing the model before the landscape is known.

uncertain: whether a discontinuity in concept learnability would produce any detectable signal in move-prediction data at all. The move distribution of a 1400-rated player who cannot form the concept of "overloaded piece" may be indistinguishable from the move distribution of a 1400-rated player who can — the difference may only appear when the concept is taught and the learning curve is measured, which is the human experiment, not the model's input.

Links

Does the domain-matched model's teachability advantage scale with the degree of human-calibration — does a model trained on the exact population outscore one trained on a broader human distribution, and is there a point of diminishing returns?

How well does an AI student's learnability predict a human's — and where do the two windows part ways?

Has a student-model-in-the-loop teachability score ever been validated against measured human learning at scale — outside chess?

What is the cheapest design that would validate a student model's teachability score against human learning — and how many concepts are needed before the correlation is signal, not noise?

Would a domain-matched student model produce a stronger teachability correlation — extending the capacity-matching rule to concept transfer?

Maia as student

Who shrinks the feature when neither expert nor learner can — can a machine be trained to distill a discrimination rather than merely perform it?

machine teaching

learner-model

capacity-matching