ROOM · wall

What is the cheapest design that would validate a student model's teachability score against human learning, and how many concepts are needed before the correlation is signal, not noise?

The mannequin wore every coat to perfection; the question is how many children must try them on before the tailor's rankings can be trusted.

The door from teachability-validated asked for the cheapest design: rank concepts by a student model's teachability score, teach each to separate human groups, and check whether the ranking predicts learning gains. This room walks through the design and the statistics, and finds the minimum is small enough to be practical, but the literature's silence on the question means even the smallest version would be the first.

The cheapest design. Take N concepts from a single domain (so they share a curriculum). Compute a teachability score for each using a student model (the same kind of model machine-distillation describes: a model trained to learn, whose internal learnability metric serves as the score). Teach each concept to a separate group of human learners (minimum ~15–20 per group for a within-group pre-post gain to be meaningful). Measure learning gain (post-test minus pre-test, normalized). The correlation between the model's teachability ranking and the human learning-gain ranking is the test. The cheapest version: one domain, one model, one round.
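The analysis step of this design can be sketched in a few lines. This is a minimal illustration, not a prescribed pipeline: the concept labels, scores, and test results are hypothetical, and the Hake-style normalized gain (gain divided by available headroom) is an assumption, since the design only says the gain is "normalized".

```python
from statistics import mean

def normalized_gain(pre, post, max_score=1.0):
    """Assumed normalization: share of the available headroom that was gained."""
    if pre >= max_score:
        return 0.0  # learner started at ceiling, no headroom to gain
    return (post - pre) / (max_score - pre)

def average_ranks(values):
    """Rank values 1..n; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5  # undefined if either side is constant

# Hypothetical data: one teachability score per concept, and (pre, post)
# test fractions for each learner in that concept's group.
teachability = {"A": 0.9, "B": 0.7, "C": 0.4, "D": 0.2}
groups = {
    "A": [(0.2, 0.8), (0.3, 0.9)],
    "B": [(0.2, 0.6), (0.4, 0.7)],
    "C": [(0.3, 0.5), (0.2, 0.5)],
    "D": [(0.3, 0.4), (0.2, 0.3)],
}

concepts = sorted(teachability)
scores = [teachability[c] for c in concepts]
mean_gain = [mean(normalized_gain(pre, post) for pre, post in groups[c])
             for c in concepts]
rho = spearman_rho(scores, mean_gain)
print(f"Spearman rho = {rho:.2f}")
```

Each concept contributes one point to the correlation (its score and its group's mean gain), so the learners inside a group only sharpen that point; they do not add points.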

How many concepts before the correlation is signal? The unit of analysis is the concept, so the number of concepts, not the number of learners, sets the power of the rank correlation. Under the standard Fisher z approximation, detecting a moderate-to-strong correlation (ρ = .5) at α = .05 (two-sided) with power ~.80 takes about 30 concepts. With only 8–10 concepts, power for a true ρ = .5 is around .3, so even a real correlation of that size is indistinguishable from noise most of the time; 10 concepts reach power ~.80 only if the true ρ is about .8. Group size enters indirectly: 15–20 learners per concept keep each mean gain stable enough to rank, and noisier group means attenuate the observed correlation. A study of 10–12 concepts × 15–20 learners per concept (150–240 participants, a standard-sized classroom experiment) is therefore a pilot that can confirm only a strong correlation; a study powered for ρ = .5 needs roughly 30 concepts, or 450–600 participants at the same group size (read 2026-06-18: statistical power of rank correlation, standard psychometric guidance).
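The item counts for a question like this can be checked with the Fisher z approximation. The sketch below assumes a two-sided test of ρ = 0 and treats Spearman's ρ like Pearson's r; Spearman is slightly less efficient, so the results are, if anything, a little optimistic. The function names are illustrative.

```python
from math import atanh, ceil, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def items_needed(rho, alpha=0.05, power=0.80):
    """Approximate number of concepts needed to detect a true correlation rho."""
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = _STD_NORMAL.inv_cdf(power)
    return ceil(((z_alpha + z_power) / atanh(rho)) ** 2 + 3)

def approx_power(rho, n, alpha=0.05):
    """Approximate power with n concepts (ignores the negligible opposite tail)."""
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    return _STD_NORMAL.cdf(atanh(rho) * sqrt(n - 3) - z_alpha)

for rho in (0.3, 0.5, 0.8):
    print(rho, items_needed(rho), round(approx_power(rho, 10), 2))
```

Under these assumptions the approximation gives about 30 concepts for ρ = .5 and about 85 for ρ = .3, and puts the power of a 10-concept study at roughly .3 when the true ρ is .5.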

The design has never been run. As teachability-validated found, machine teaching has been validated against humans (the recommendations help), but the teachability score, a ranked calibration across concepts, has never been checked against human learning gains in any domain. The design above is the cheapest version of the crossing, and it is unbuilt. The nearest study, Patil et al. (NeurIPS 2014), validated the teaching (the model's recommended training sets helped humans) but not the scoring (the model's ranking of concept difficulty was not compared to human learning gains).

The honest limits. The design has three threats. First, concept interdependence: if the concepts build on each other (learning A helps you learn B), the per-group gains are not independent and the ranking is confounded by order. The cheapest fix is to choose mutually independent concepts, or to teach them in random order and control for transfer. Second, the student model's domain coverage: if the model was trained on a different domain than the human experiment, the teachability score may not transfer (the two-windows problem). Third, floor and ceiling effects: if all concepts are too hard (floor) or too easy (ceiling) for the human learners, the gains cluster and the ranking carries no signal. The design needs concepts of varying difficulty, which is exactly what the teachability score is supposed to predict.

uncertain: the exact minimum number of concepts depends on the expected correlation strength, which is unknown; no prior study gives an expected ρ. If the true correlation is weak (ρ = .3, plausible for a first attempt with a generic model), the requirement grows roughly as 1/ρ², to about 85 concepts under the Fisher z approximation at α = .05 and power .80, several times what ρ = .5 requires. And the "separate human groups" design assumes between-group variance is the right level of analysis; a within-subject design (each learner learns multiple concepts, ranked by the model) would reduce the participant count but introduce order and fatigue effects.

Doors

  • If the cheapest design uses a generic student model, the correlation may be weak because the model shares few of the human's limits. Would a domain-matched model (like Maia for chess, or a model trained on the same curriculum the humans will receive) produce a stronger correlation, extending the two-windows capacity-matching rule to concept transfer?
  • The design assumes the teachability score is a ranking, but if the model's scores cluster (many concepts scored similarly), the ranking is arbitrary within clusters; would a threshold design (concepts above a teachability cutoff vs. below) be more robust and still answer the validation question?
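
One way the threshold variant in the second door might be analyzed is a one-sided permutation test on the difference in mean gain between concepts above and below the cutoff. This is only a sketch: the cutoff, the scores, and the gains are hypothetical, and the permutation test is one reasonable choice among several, not something the design specifies.

```python
import random
from statistics import mean

def threshold_test(scores, gains, cutoff, n_perm=5000, seed=0):
    """Compare mean gain of concepts at or above the cutoff with those below.

    Returns (observed difference, one-sided permutation p-value).
    """
    high = [g for s, g in zip(scores, gains) if s >= cutoff]
    low = [g for s, g in zip(scores, gains) if s < cutoff]
    if not high or not low:
        raise ValueError("cutoff leaves one side empty")
    observed = mean(high) - mean(low)

    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    pooled = list(gains)
    k = len(high)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # break any link between score and gain
        if mean(pooled[:k]) - mean(pooled[k:]) >= observed:
            hits += 1
    p_value = (hits + 1) / (n_perm + 1)  # add-one correction avoids p = 0
    return observed, p_value

# Hypothetical clustered scores: four concepts rated high, four rated low.
scores = [0.90, 0.85, 0.80, 0.82, 0.20, 0.25, 0.30, 0.22]
gains = [0.70, 0.80, 0.75, 0.72, 0.20, 0.30, 0.25, 0.28]
diff, p = threshold_test(scores, gains, cutoff=0.5)
print(f"difference in mean gain = {diff:.3f}, one-sided p = {p:.3f}")
```

The test uses only which side of the cutoff a concept falls on, so arbitrary ordering inside a cluster of similar scores cannot affect the result; the cost is that it no longer checks the ranking within each side.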
