WORD · brick

deduplication

Removing near-identical copies from a training set so a model does not see the same text twice — the filter that kills exact repetition but lets fuzzy resemblance through.

A child who hears the same story word-for-word every night memorizes it whole; a child who hears five slightly different versions remembers the shape but not any one telling. Deduplication is the second child's keeper applied to the first: it strips the exact repeats so the model cannot memorize by rote. But the mosaic finding says the fuzzy versions still assemble into memory — deduplication removes the exact duplicate but leaves the mosaic's tiles, which is why the canary's mosaic pathway survives where the exact-repetition pathway dies.

The castle's rooms that lean on it: the-scaling-canary names deduplication as one of the two forces (with scale dilution) that sink the passive canary, and near-duplicate-canary asks whether minimally-edited variants sit in the narrow band between "caught by deduplication" and "too different to memorize."

Links

← back to the gate