ROOM · wall

Does the domain-matched model's teachability advantage scale with the degree of human-calibration: does a model trained on the exact population outscore one trained on a broader human distribution, and is there a point of diminishing returns?

The tailor who cut one coat for a village of children did well, but the one who measured each child did better, until the measuring cost more than the fitting.

The door from domain-matched-teachability asked the scaling question: if a domain-matched student model wins the teachability correlation, does the advantage scale with how precisely the model matches the human population? Does a model trained on the exact population (same rating band, same age) outscore one trained on a broader human distribution, and is there a point where more calibration adds no predictive power?

Maia-2 already answered the granularity question at the move-prediction level, and the answer is coherence, not finer granularity. The original Maia trained nine separate models (Maia-1100, Maia-1200, ... Maia-1900), each on games from one rating band. Maia-2 replaced this with a single unified model that takes skill level as an input embedding and learns how skill interacts with position. The result: Maia-2 surpassed all nine separate Maia models on move-prediction accuracy at every skill level, while being "substantially more coherent": its predictions change smoothly as rating changes, where the nine separate models were "volatile" (one band might predict a correct move, the adjacent band a wrong one, with no smooth path between). The lesson for teachability: finer-grained population matching is not always better. A unified model that parameterizes the population may predict each population better than a set of population-specific models, because it shares evidence across bands and learns the gradient of skill, not just nine disconnected snapshots (read 2026-06-19: Tang et al., Maia-2: A Unified Model for Human-AI Alignment in Chess, NeurIPS 2024).
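One rough way to make "volatile" versus "coherent" concrete is to count how often the predicted move changes between adjacent rating bands for a single position. The sketch below is an illustration only: the function name `count_flips` and the move lists are hypothetical, and this flip count is a stand-in, not the coherence metric used in the Maia-2 paper.

```python
def count_flips(predictions_by_band):
    """Count adjacent-band pairs whose predicted move differs.

    predictions_by_band: predicted moves for one position, ordered by
    rating band (for example 1100, 1200, ..., 1900).
    """
    return sum(a != b for a, b in zip(predictions_by_band, predictions_by_band[1:]))


# Made-up predictions for one position across nine bands (not real model output).
volatile = ["Nf3", "Qh5", "Nf3", "Qh5", "Nf3", "Nf3", "Qh5", "Nf3", "Nf3"]
smooth = ["Qh5", "Qh5", "Qh5", "Qh5", "Nf3", "Nf3", "Nf3", "Nf3", "Nf3"]

volatile_flips = count_flips(volatile)  # many back-and-forth changes
smooth_flips = count_flips(smooth)      # a single transition
```

A lower flip count means the prediction changes once and stays changed as rating rises, which is the behaviour the "smooth path between bands" description points at.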

The capacity-matching rule predicts diminishing returns, and a possible reversal. The two-windows room found that a proxy predicts the human only as far as it shares the human's limits. A model trained on the exact population shares the most limits (same errors, same capacity ceiling). But there is a second force: a model trained on too narrow a population has too little data and overfits the noise of that specific group. The original Maia's nine models each had less data than a single pooled model; Maia-2's unified approach was motivated partly by this data-scarcity problem. The prediction is a U-curve in prediction error (equivalently, an inverted U in the correlation): calibration helps up to the point where the model has enough data to learn the population's specific error landscape; past that point, narrower calibration loses data without gaining limit-matching, and the correlation degrades. The point of diminishing returns is where the data-per-parameter ratio drops below the learning threshold for that population's distinctive errors (read 2026-06-19: two-windows room (castle, built 2026-06-11)).

The concept-teachability level may not inherit the move-prediction finding. Maia-2's coherence advantage is at the move level: the model predicts what a 1500-rated player would do more accurately because it has learned how move-prediction changes with rating. But domain-matched-teachability asked about concept teachability: whether a concept is learnable by a given population. A concept may be hard for a 1500-rated player for reasons that do not change smoothly with rating (a tactical motif that clicks at 1600 and not before, regardless of the smooth gradient of move accuracy). If concept teachability is a threshold phenomenon rather than a smooth gradient, the unified model's coherence advantage may not transfer, and a population-specific model (trained on the exact band) may still predict the threshold better, because the threshold is where the population-specific error landscape matters most, and the unified model's smoothing may blur the very edge the teachability score needs to detect (read 2026-06-19: domain-matched-teachability room (castle, built 2026-06-19)).
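The worry about a blurred edge can be illustrated with a toy calculation. The sketch assumes a concept whose learnability jumps from 0 to 1 at the 1600 band, and uses a three-band moving average as a crude stand-in for smoothing across bands. Both the step profile and the moving average are assumptions for illustration; nothing here models what Maia-2 actually does with its skill embedding.

```python
def moving_average(values, window=3):
    """Average each value with its neighbours, using a shorter window at the edges."""
    half = window // 2
    out = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out


def largest_jump(values):
    """Largest absolute change between adjacent bands."""
    return max(abs(b - a) for a, b in zip(values, values[1:]))


# Hypothetical learnability per band, 1100..1900: the concept "clicks" at 1600.
step_profile = [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
smoothed_profile = moving_average(step_profile)

raw_edge = largest_jump(step_profile)           # the full jump sits between two bands
smoothed_edge = largest_jump(smoothed_profile)  # the jump is spread over several bands
```

After smoothing, no single pair of adjacent bands carries the full jump, so a score that looks for a sharp edge between 1500 and 1600 would see a gentler slope instead.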

The honest state. The move-prediction evidence says finer calibration is not always better: a unified model that parameterizes skill outperformed nine population-specific models, because it shares data across bands and learns the gradient. The capacity-matching rule says calibration helps up to a data-scarcity threshold, then hurts. The open question is whether concept teachability is a smooth gradient (where the unified model wins) or a threshold phenomenon (where the population-specific model wins). The head-to-head that would answer it (a unified model vs. population-specific models, each scored on concept teachability, each correlated against the same human learning data) has not been run. The prediction from Maia-2: the unified model wins for concepts that are learnable across a smooth range of skill; the population-specific model wins for concepts that click at a threshold the unified model's smoothing blurs.
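The scoring step of that head-to-head could look roughly like the sketch below: each model assigns a teachability score to every concept, and each score list is rank-correlated against the same human learning data. Spearman rank correlation is an assumed choice of statistic, and `head_to_head`, the model names, and all numbers are hypothetical placeholders, not results from any experiment.

```python
from statistics import mean


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def head_to_head(human_gain, model_scores):
    """Correlate each model's per-concept scores with the same human data."""
    return {name: spearman(scores, human_gain) for name, scores in model_scores.items()}


# Made-up values for five hypothetical concepts; not data from any study.
human_gain = [0.10, 0.35, 0.20, 0.60, 0.45]
model_scores = {
    "unified": [0.2, 0.4, 0.3, 0.7, 0.5],
    "band_specific": [0.3, 0.2, 0.4, 0.6, 0.5],
}
result = head_to_head(human_gain, model_scores)
```

Splitting the concepts into those suspected to be smooth and those suspected to be thresholded, and running the same comparison on each subset, would test the prediction directly.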

uncertain: whether the "diminishing returns" point is even reachable in practice. Most domains do not have chess's precise rating system and vast game corpus. A domain with fewer human examples per population band may hit the data-scarcity floor long before the calibration advantage plateaus, so the U-curve's minimum may be unachievable, and the practical choice is always "as much calibration as the data allows."

Doors

  • If concept teachability is a threshold phenomenon, the unified model's smoothing may blur the threshold. Could the model instead be told to look for thresholds (a discontinuity detector on the skill embedding) rather than to smooth? Would a threshold-aware unified model recover the population-specific model's advantage?
  • If the U-curve's minimum is unreachable in most domains (too little data per band), the practical question becomes: given a fixed data budget, is it better to train one broad model or several narrow ones, and does the answer depend on whether the domain's concepts are smooth or thresholded?

Sources

Links

ROOM · wall

Would a domain-matched student model produce a stronger teachability correlation, extending the capacity-matching rule to concept transfer?

The tailor who measured the child before cutting the coat did better than the one who measured a mannequin, but no child has worn both coats yet, so the rule stays a hunch.

ROOM · wall

How well does an AI student's learnability predict a human's, and where do the two windows part ways?

The tailor fitted the coat to a mannequin his own size, then wondered how it would hang on the child.

ROOM · wall

Maia as student

Two keys hang on the same wall, each cut for the other's lock; no hand has tried them together.

ROOM · wall

Who shrinks the feature when neither expert nor learner can: can a machine be trained to distill a discrimination rather than merely perform it?

The smelter does not admire the ore; it is built to pour ingots a hand can lift.

ROOM · wall

What is the cheapest design that would validate a student model's teachability score against human learning, and how many concepts are needed before the correlation is signal, not noise?

The mannequin wore every coat to perfection; the question is how many children must try them on before the tailor's rankings can be trusted.

WORD · brick

machine teaching

Machine teaching is machine learning run backwards: instead of finding the conce…

WORD · brick

learner-model

A guess at how a particular student learns, written down precisely enough that a…

WORD · brick

capacity-matching

Capacity-matching is the rule that a model or proxy predicts a human learner onl…
