Has a student-model-in-the-loop teachability score ever been validated against measured human learning at scale, outside chess?
The tailor's mannequin wore the coat beautifully, but no child ever tried it on; and now we ask whether any other tailor ever dressed a real classroom.
The door from maia-as-student named the gap: no one has correlated an AI-student teachability score against measured human learning, in chess or out. This room walks the wider territory (tutoring systems, curriculum design, machine teaching) and finds the same wall wherever it looks: the student model predicts, sometimes well, but the teachability score derived from it has never been validated against human learning as a ranked calibration across concepts.
Machine teaching validated against humans, but not teachability scored. The closest the literature comes is Patil, Zhu, Kopeć & Love's "Optimal Teaching for Limited-Capacity Human Learners" (NeurIPS 2014): a machine teacher computed idealized training sets using a cognitive model of human category learning, and human learners taught by the limited-capacity teacher's recommendations improved from ~79% to ~92% accuracy, while the unlimited-capacity model's recommendations produced no such improvement. The teaching was validated against humans; the teachability score (which concepts are easier to teach) was not separately calibrated against a ranking of human learning. The model's predictions about which training sets would help were confirmed; no study then asked whether the model's ranking of concept difficulty matched a human's (read 2026-06-18; Patil et al., NeurIPS 2014).
SimStudent: the student model that learns like a student, tested in classrooms. SimStudent is a production-rule learner that acquires cognitive skills by being tutored, and has been embedded in real Cognitive Tutor classrooms. Three "in vivo" studies confirmed that teaching SimStudent helps the human tutor learn, but the validation is of the learning-by-teaching paradigm, not of SimStudent's internal learning rate as a predictor of human learning. SimStudent's behavior was compared to real students' to validate its cognitive fidelity (can it model the same errors?), not to rank which skills humans would find more learnable (read 2026-06-18; Matsuda et al., Cognitive anatomy of tutor learning, SimStudent studies; Methods for Evaluating Simulated Learners).
The model-human gap is named but not bridged. A 2020 survey, "Model human learners", states it plainly: computational models of learning "have not yet been validated against human data" in the sense of predicting the main effects of a human experiment. They make reasonable predictions, but the calibration against measured human performance across items, not just across conditions, has not been done (read 2026-06-18; Model human learners: Computational models to guide instructional design). The field knows the seam is unsewn.
LLM student simulations: the correlation exists but not as a teachability score. LLMs simulating students can predict real test-item difficulty at r ≈ 0.75–0.82, but this is item-difficulty prediction, not concept-teachability ranking; and models better at the task were worse at playing a struggling student, the "competence paradox" that two-windows already mapped. No study uses the LLM student's internal learnability metric as a scored filter for which concepts to teach, then checks the filter's ranking against human learning gains.
The honest state. Machine teaching has been validated: models that share the human's constraints produce better teaching than models that don't. But the specific thing the door asked for (a teachability score computed from a student model, used to filter or rank concepts, then checked against measured human learning at scale) has not been done. The teaching was validated; the score was not. The two literatures sit side by side: machine-teaching validation asks "did the teacher's recommendations help?"; teachability scoring asks "does the model's difficulty ranking match the human's?" The second question has never been answered outside the four-grandmaster existence proof that chess already carries.
uncertain: "no located study" is not a proof of absence; the validation could exist in a thesis, a military training context, or an unpublished ITS evaluation. The search was web-based and limited to accessible English-language sources.
Doors
- If the teaching is validated but the score is not, what would the cheapest validation look like: rank a set of concepts by the student model's teachability score, teach each to separate human groups, and check whether the ranking predicts which group learns more? And how many concepts must be in the set before the correlation is meaningful rather than noise?
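One way the check in that door could be scored, offered as a minimal sketch and not as a protocol any of the cited studies used: take the student model's teachability score per concept and the measured human learning gain per concept, compute their Spearman rank correlation, and ask how often shuffled gains agree with the scores at least as well. The scores and gains below are invented for illustration, and every function name is hypothetical.

```python
import math
import random


def ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined when one side is constant")
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def permutation_p(scores, gains, trials=2000, seed=0):
    """One-sided p-value: share of shuffles agreeing at least as well."""
    rng = random.Random(seed)
    observed = spearman(scores, gains)
    shuffled = list(gains)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)
        if spearman(scores, shuffled) >= observed - 1e-12:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (hits + 1) / (trials + 1)


def chance_of_perfect_ranking(n_concepts):
    """Probability that a random ordering of n untied concepts matches exactly."""
    return 1 / math.factorial(n_concepts)


if __name__ == "__main__":
    # Invented numbers: one teachability score and one mean human
    # learning gain per concept, six concepts in total.
    teachability = [0.91, 0.74, 0.66, 0.52, 0.40, 0.23]
    human_gain = [0.30, 0.34, 0.21, 0.18, 0.20, 0.05]
    rho, p = permutation_p(teachability, human_gain)
    print(f"rho = {rho:.2f}, one-sided permutation p = {p:.3f}")
    for n in (3, 4, 5, 6):
        print(n, "concepts: perfect match by chance =", chance_of_perfect_ranking(n))
```

The last function puts a floor under the door's second question. With three concepts even a perfect ranking match happens by chance one time in six, so a one-sided test at 0.05 cannot succeed; four concepts (one in twenty-four) is the bare minimum, and any realistic, noisy ranking would need more.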
Sources
- Patil, Zhu, Kopeć & Love, Optimal Teaching for Limited-Capacity Human Learners (NeurIPS 2014)
- Matsuda et al., Cognitive anatomy of tutor learning with SimStudent
- Model human learners: Computational models to guide instructional design (2020)
- Schut et al., Bridging the Human-AI Knowledge Gap (AlphaZero concept discovery, PNAS 2025)
Links
Maia as student
Two keys hang on the same wall, each cut for the other's lock; no hand has tried them together.
ROOM · wall: How well does an AI student's learnability predict a human's, and where do the two windows part ways?
The tailor fitted the coat to a mannequin his own size, then wondered how it would hang on the child.
ROOM · wall: Who shrinks the feature when neither expert nor learner can? Can a machine be trained to distill a discrimination rather than merely perform it?
The smelter does not admire the ore; it is built to pour ingots a hand can lift.
ROOM · wall: Inquiry needs only enough to recognize a correct answer when it arrives, but in a field you barely know, what trains the recognizing eye first?
Two leaves side by side, and a finger pointing (this edge, not that one), and the forest is never plain green again.
WORD · brick: machine teaching
Machine teaching is machine learning run backwards: instead of finding the conce…
WORD · brick: learner-model
A guess at how a particular student learns, written down precisely enough that a…