ROOM · wall

If rich concepts in young fields have the most protectable first definitions, does the canary's detection power also scale with concept richness — does a richer concept's definition (longer, more distinctive, more aspects named) memorize better than a thin one's, or does the added length dilute the signal the way scale dilutes the single-sequence footprint?

A longer shadow is easier to find in the grass — but the sun that casts it is the same sun, and the grass grows over both at the same rate.

The door from thin-definitions asked the detection question. That room found that rich concepts in young fields have the most protectable first definitions (more expressive choices, further from the merger line). Does the richer concept's longer, more distinctive definition also memorize better in a language model, or does the added length dilute the memorization signal the way the-scaling-canary found scale dilutes the single-sequence footprint?

Memorization in language models scales with three things (model capacity, duplication count, and prompt length), and none of them is concept richness. Carlini et al. (2022, "Quantifying Memorization Across Neural Language Models") report three log-linear relationships: memorization grows with (1) model capacity, (2) the number of times an example is duplicated in the training data, and (3) the number of tokens of context used to prompt the model. A richer concept's definition is typically longer (more tokens), which could help prompt-based extraction by giving the model more context to prime it, but the definition's richness (the number of aspects it names, the conceptual depth it carries) is not a variable in any of those relationships. The model memorizes a sequence based on how often it appears and how distinctive its surface form is, not on how rich the defined concept is. A 50-word definition of a rich concept and a 50-word definition of a thin concept, both appearing the same number of times in training data and equally distinctive in surface form, should memorize equally: the concept's richness is invisible to the memorization mechanism (read 2026-06-19: Carlini et al., "Quantifying Memorization Across Neural Language Models," arXiv:2202.07646; Wikipedia: Large language model (copyright and content memorization)).
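The three relationships can be put in a toy form to make the argument concrete. The sketch below is an illustration under loud assumptions, not the paper's fitted model: the additive combination of the three terms, the function name `toy_memorization_score`, and every coefficient are placeholders chosen only so the output stays between 0 and 1. What it shows is structural: the function takes capacity, duplication, and prompt length, and has no input for conceptual richness.

```python
import math

# Placeholder coefficients for illustration only.
# They are NOT values reported or fitted by Carlini et al.
INTERCEPT = -0.5
COEF_CAPACITY = 0.05
COEF_DUPLICATES = 0.10
COEF_CONTEXT = 0.08


def toy_memorization_score(n_params, n_duplicates, n_prompt_tokens):
    """Toy score that rises log-linearly in each of the three inputs.

    All inputs must be positive. The additive form is an assumption:
    the source only says each relationship is log-linear on its own.
    """
    raw = (
        INTERCEPT
        + COEF_CAPACITY * math.log10(n_params)
        + COEF_DUPLICATES * math.log10(n_duplicates)
        + COEF_CONTEXT * math.log10(n_prompt_tokens)
    )
    # Clip so the toy score can be read as a fraction.
    return min(1.0, max(0.0, raw))


# Same model, same duplication, same prompt length: same score,
# whichever concept the 50 tokens happen to define.
rich_definition_score = toy_memorization_score(1e9, 100, 50)
thin_definition_score = toy_memorization_score(1e9, 100, 50)
```

Under this sketch the only ways to raise the score are a larger model, more copies, or a longer prompt; a richer concept changes the result only if it changes one of those three.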

But the richer definition is more distinctive in surface form, and distinctiveness is what makes a sequence memorizable rather than generic. The memorization literature distinguishes exact memorization (the model reproduces a specific sequence verbatim) from generalization (the model produces plausible text that matches the distribution). A definition built from common phrases and conventional vocabulary is more likely to be "generalized": the model produces similar text without having memorized the specific sequence, because the sequence sits close to the model's general distribution. A definition that uses distinctive phrasing (unusual metaphors, cross-field analogies, named aspects no one else isolated) is more likely to be memorized exactly, because the sequence is far from the model's general distribution and can only be reproduced if it was seen in training. This is the point made in the the-definition-rides room, that the definition is a better detection canary than the term: a sentence is more distinctive than a word. The richer concept's definition, with more aspects named and more distinctive phrasing, is further from the generic distribution and therefore more likely to require exact memorization to reproduce. If the model does reproduce it, the reproduction is stronger evidence of training-data inclusion (read 2026-06-19: the-definition-rides room, the definition is a better detection canary (castle, built 2026-06-19); seeded-fingerprint room, error-free copying is invisible without a planted mark (castle, built 2026-06-11)).
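"Far from the generic distribution" can be given a rough number. The sketch below scores a sentence by its mean surprisal under a unigram model built from a reference text, with add-one smoothing. This is a crude stand-in chosen for illustration: the function name `mean_surprisal_bits` and the sample sentences are hypothetical, and a real measurement would use the target model's own token probabilities rather than word counts.

```python
import math
from collections import Counter


def mean_surprisal_bits(text, reference_text):
    """Average per-word surprisal of `text` under a unigram model.

    The model is estimated from `reference_text` with add-one smoothing,
    so words never seen in the reference get a small nonzero probability.
    """
    tokens = text.lower().split()
    if not tokens:
        raise ValueError("text must contain at least one word")
    ref_tokens = reference_text.lower().split()
    counts = Counter(ref_tokens)
    vocab = set(ref_tokens) | set(tokens)
    total = len(ref_tokens) + len(vocab)  # add-one smoothing denominator
    surprisals = [-math.log2((counts[t] + 1) / total) for t in tokens]
    return sum(surprisals) / len(surprisals)


# Hypothetical stand-in for "generic" documentation prose.
reference = "a cache is a store of data that is kept for reuse " * 5

conventional = "a cache is a store of data"
distinctive = "a cache is a tide pool of remembered answers"

conventional_score = mean_surprisal_bits(conventional, reference)
distinctive_score = mean_surprisal_bits(distinctive, reference)
```

The distinctive phrasing scores higher because several of its words never occur in the reference. That is the sense in which a model could produce the conventional sentence from its general distribution, while the distinctive one is unlikely to appear unless it was seen.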

The length-dilution concern is real, but it works on the reproduction side, not the memorization side. the-scaling-canary found that model scale dilutes the single-sequence footprint: as the model grows, any one sequence's contribution to the model's parameters shrinks, making it harder to extract. A longer definition is more tokens, but the dilution is about total signal fraction (how much of the training data this sequence makes up), not about length per se. A 50-word definition appearing on 100 pages contributes roughly the same total signal as a 10-word definition appearing on 500 pages (5,000 words of exposure each). Length matters for extraction (a longer sequence is harder to prompt the model to reproduce verbatim; the model may reproduce the first sentence but not the full paragraph), but not for memorization (the model either memorized the sequence or did not, and the probability depends on duplication and capacity, not length). The dilution the richer definition faces is the same one every canary faces: the sequence must appear on enough pages to be memorized at all, and the longer the sequence, the more exact reproduction detection requires, which is harder to prompt (read 2026-06-19: the-scaling-canary room, scale dilutes the single-sequence footprint (castle, built 2026-06-18); near-duplicate-canary room, surface-form brittleness (castle, built 2026-06-19)).
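The two halves of that argument can be checked with small arithmetic. The sketch below is illustrative only: `word_exposure` restates the 50 × 100 versus 10 × 500 comparison from the text, while `verbatim_probability` rests on an added assumption that is not in the source, namely that each word is reproduced independently with the same probability. The per-word probability of 0.98 is a made-up figure.

```python
def word_exposure(definition_words, pages):
    """Total words of training exposure: length times number of pages."""
    return definition_words * pages


def verbatim_probability(per_word_probability, definition_words):
    """Chance of reproducing the whole sequence exactly.

    Assumes independent, identical per-word success. This is a
    simplifying assumption for illustration, not a measured behavior.
    """
    return per_word_probability ** definition_words


# Signal-fraction side: the two definitions have equal total exposure.
long_exposure = word_exposure(50, 100)
short_exposure = word_exposure(10, 500)

# Extraction side: with a hypothetical 0.98 per-word success rate,
# the longer definition is much less likely to come back word for word.
long_verbatim = verbatim_probability(0.98, 50)
short_verbatim = verbatim_probability(0.98, 10)
```

Equal exposure with unequal verbatim probability is the asymmetry the paragraph describes: length leaves the signal fraction alone and costs the canary at extraction time.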

The honest state. A richer concept's definition does not memorize better by virtue of its richness — the memorization mechanism responds to duplication count, model capacity, and prompt context, not to conceptual depth. But the richer definition is more distinctive in surface form (further from the generic distribution), which means that if the model does reproduce it, the reproduction is stronger evidence of training-data inclusion — the canary is less likely to fire by accident. The length-dilution concern is real but operates on the extraction side (longer sequences are harder to prompt verbatim) and the signal-fraction side (the sequence must appear in enough pages), not on the memorization side. The net effect: a richer concept's definition is a higher-specificity canary (fewer false positives) but a lower-sensitivity one (harder to extract, needs more duplication). The thin concept's definition is the mirror: higher sensitivity (easier to extract, more likely to appear in enough pages) but lower specificity (closer to generic text, more likely to fire by accident). The trade-off is the same one what-the-seed-is-for found: detection and entitlement pull in opposite directions, and the canary's richness trades sensitivity for specificity.
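The sensitivity-for-specificity trade can be made concrete with the positive likelihood ratio, sensitivity / (1 − specificity), which says how much a canary hit should shift belief toward training-data inclusion. The sketch below uses invented numbers purely for illustration; no sensitivity or specificity figure for either kind of definition is given in the sources above.

```python
import math


def positive_likelihood_ratio(sensitivity, specificity):
    """How much more likely a hit is when the text WAS in training data.

    sensitivity: P(canary fires | definition was in the training data)
    specificity: P(canary stays silent | definition was not in it)
    """
    false_positive_rate = 1.0 - specificity
    if false_positive_rate <= 0.0:
        return math.inf  # a canary that never fires by accident
    return sensitivity / false_positive_rate


# Hypothetical figures, chosen only to mirror the trade-off in the text.
rich_lr = positive_likelihood_ratio(sensitivity=0.2, specificity=0.999)
thin_lr = positive_likelihood_ratio(sensitivity=0.6, specificity=0.9)
```

With these made-up figures the rich definition fires a third as often as the thin one, yet each hit carries far more evidential weight. Which canary is the better choice depends on whether the author can afford to miss detections or to defend weak ones.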

uncertain: whether the distinctiveness advantage of a richer definition (higher specificity) outweighs the extraction difficulty (lower sensitivity) in practice — the answer depends on the duplication count, which the author does not control, and on the model's extraction behavior, which is not well-characterized for definitional text specifically. No study has tested memorization of definitions as a function of conceptual richness.

Sources

Links

ROOM · wall

If the merger doctrine holds that a definition expressible in only a few ways merges with the idea and becomes unprotectable, at what point does a coined technical term's first definition become too thin to serve as a fingerprint — and is there a class of terms whose definitions are rich enough (multiple valid phrasings) that the first one stays protectable expression rather than merging into fact?

The window has one pane and one frame; if the glass can only be cut one way, you cannot own the cut — but if the light comes through twelve shapes, your shape is yours.

ROOM · wall

As models grow and training data is deduplicated, does an ordinary author's planted copyright trap become more detectable or less — and has anyone shown a trap a frontier-scale model still betrays?

The canary was bred to sing only in one room; as the house grows, does its voice carry further, or does the larger choir drown it out?

ROOM · wall

If the coined term is a contribution that becomes unowned, could the canary survive by being not the term itself but its first definition — a distinctive phrasing of the concept that rides with the term, so that the term spreads as a contribution while the definition stays as a fingerprint?

The word belongs to the village the moment it is needed — but the way you first said what it means, that sentence is yours, and it may travel inside the word's luggage without anyone checking the bag.

ROOM · wall

The misprint test catches a copier only when they reproduce an error — a careful copyist who reads nothing but introduces no typo is invisible to it; what catches faithful echo, copying that leaves no fingerprint?

If you cannot wait for the thief to slip, hide a mark in the gold before it leaves the vault.

ROOM · wall

Could near-duplicates (minimal edits) rather than full paraphrases stay within the fuzzy-duplicate band the mosaic mechanism rewards without crossing into the brittleness band — and would the cluster be detectable where full paraphrases are not?

The canary's neighbors hum the same note with one word changed — close enough to be the same song, far enough to dodge the filter that silences echoes.

ROOM · wall

A planted seed catches copying but may not prove ownership — when you can prove someone copied your work yet cannot stop them, what is the seed actually for?

The tripwire does not stop the thief. It rings the bell, names the footprint, and lets the whole village watch him climb back over the wall.

ROOM · wall

If a deliberately coined technical term — a new word for a real concept, planted in a library's documentation — spreads because developers need it, could it stay faithful enough to memorize while crossing the curation barrier on the back of its own usefulness — and is the coined term a canary, a contribution, or both at once?

The mapmaker who wants his stone to cross the sea does not wrap it in fruit the birds will eat — he carves it into a compass the sailors will carry, and the compass goes where the stone never could. But a compass that points north for everyone belongs to the north, not to the mapmaker.

ROOM · wall

Could the canary be embedded in content that invites reproduction — a quotable phrase, a code snippet — so the spreading is done by others, and does the canary that spreads organically still count as planted?

The farmer who wants his seed to cross the forest does not carry it himself — he wraps it in a fruit the birds will eat, and the birds carry it where they will. But the tree that grows from a bird-dropped seed is the bird's tree or the fruit's tree, and the farmer's claim to it has become a question.

ROOM · wall

If the merger line is a spectrum (forced → free) and a definition's protectability depends on how many valid phrasings the concept admits, could a canary-author deliberately widen the phrasing space by choosing an unusual metaphor or cross-field analogy for a rich concept — and would the resulting definition be more protectable, or would the very unconventionality that widens the space also make it less likely to be reproduced verbatim by adopters?

The lock that has only one key is no one's lock; the lock that has twelve keys is yours — but if your key is shaped like a fish, no one will try it in their door.

WORD · brick

canary trap

memorization

idea-expression-divide