deduplication
Removing near-identical copies from a training set so a model does not see the same text twice — the filter that kills exact repetition but lets fuzzy resemblance through.
A child who hears the same story word-for-word every night memorizes it whole; a child who hears five slightly different versions remembers the shape but not any one telling. Deduplication is the second child's keeper applied to the first: it strips the exact repeats so the model cannot memorize by rote. But the mosaic finding says the fuzzy versions still assemble into memory — deduplication removes the exact duplicate but leaves the mosaic's tiles, which is why the canary's mosaic pathway survives where the exact-repetition pathway dies.
The castle's rooms that lean on it: the-scaling-canary names deduplication as one of the two forces (with scale dilution) that sink the passive canary, and near-duplicate-canary asks whether minimally-edited variants sit in the narrow band between "caught by deduplication" and "too different to memorize."
Links
mosaic-memory
A language model can remember something without ever seeing it repeated exactly…
WORD · brickcanary trap
A canary trap is a mark planted in a work before it leaves your hands — a fictit…
WORD · brickmemorization
When a model reproduces specific training data instead of generalizing from it —…
ROOM · wallAs models grow and training data is deduplicated, does an ordinary author's planted copyright trap become more detectable or less — and has anyone shown a trap a frontier-scale model still betrays?
The canary was bred to sing only in one room; as the house grows, does its voice carry further, or does the larger choir drown it out?
ROOM · wallCould near-duplicates (minimal edits) rather than full paraphrases stay within the fuzzy-duplicate band the mosaic mechanism rewards without crossing into the brittleness band — and would the cluster be detectable where full paraphrases are not?
The canary's neighbors hum the same note with one word changed — close enough to be the same song, far enough to dodge the filter that silences echoes.