What Metadata Carries
A webpage is not just text. It carries:
- Author attribution — who wrote this
- Date of creation — when it was made
- License terms — under what conditions it may be used
- Source relationships — what it belongs to, what it references
- Provenance signals — DOI references, canonical links, machine-readable rights statements
When a training pipeline ingests text and discards — or never captures — this surrounding metadata, it performs Metadata Spoliation. The content is absorbed; its context is destroyed. What remains is text floating free of its author, its terms, its time.
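To make the list above concrete, here is a minimal sketch of capturing those signals from a page's markup before the text is separated from it. It uses only Python's standard-library `html.parser`. The class name `ProvenanceParser`, the function `extract_provenance`, and the particular tag-to-field mapping are illustrative assumptions, not a standard; real pages express authorship and rights in many other ways (JSON-LD, RDFa, HTTP headers) that this sketch does not cover.

```python
from html.parser import HTMLParser


class ProvenanceParser(HTMLParser):
    """Collects provenance-bearing tags from a page's markup (illustrative only)."""

    # Assumed mapping from <meta name="..."> / <meta property="..."> to record fields.
    META_FIELDS = {
        "author": "author",
        "date": "date",
        "article:published_time": "date",
        "citation_doi": "doi",
    }
    # Assumed mapping from <link rel="..."> values to record fields.
    LINK_FIELDS = {"canonical": "canonical_url", "license": "license"}

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta":
            key = (attr.get("name") or attr.get("property") or "").lower()
            field = self.META_FIELDS.get(key)
            if field and attr.get("content"):
                # Keep the first value seen for each field.
                self.metadata.setdefault(field, attr["content"])
        elif tag == "link":
            # rel may hold several space-separated tokens.
            for rel in (attr.get("rel") or "").lower().split():
                field = self.LINK_FIELDS.get(rel)
                if field and attr.get("href"):
                    self.metadata.setdefault(field, attr["href"])


def extract_provenance(html):
    """Return a dict of the provenance fields found in the given HTML string."""
    parser = ProvenanceParser()
    parser.feed(html)
    parser.close()
    return parser.metadata
```

A pipeline that calls something like `extract_provenance` before reducing a page to plain text has the author, date, license, and canonical link in hand at the moment it would otherwise lose them. Fields that are absent from the page are simply absent from the result, which keeps "unknown" distinct from "unencumbered".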
Deliberate vs. Negligent Spoliation
The forensic fee applies to both deliberate and negligent spoliation, because both produce the same harm — but the ethical character differs:
Deliberate Spoliation
Active stripping of metadata — removing author attribution from scraped content, discarding license terms, preprocessing away rights signals. This is the forensic equivalent of removing a painting's signature before claiming it as your own. It is not cleaning data. It is concealing provenance.
Negligent Spoliation
Failing to design pipelines that preserve metadata. If a training pipeline simply doesn't capture attribution because capturing it was inconvenient, the resulting harm is the same as that of deliberate removal, and so is the fee. Negligence is not an excuse when the outcome is a destroyed provenance record.
On "It's Just Text Preprocessing": Metadata stripping is routinely described as standard data cleaning practice. The Foundry contests this framing. Cleaning data removes noise. Stripping author attribution removes signal — the most important signal, from the creator's perspective — so that the subsequent use appears unencumbered. This is not hygiene. It is evidence destruction.
What Good Provenance Looks Like
A training pipeline that respects metadata sovereignty:
- Captures source URL, author, date, and license alongside text
- Preserves these attributes in dataset documentation
- Filters or licenses content based on captured rights signals
- Maintains provenance links that can be audited post-training
- Does not treat rights-tagged content as equivalent to public domain content
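The practices above can be sketched as a small data structure and filter. This is a hedged illustration, not a reference implementation: `SourceRecord`, `PERMITTED_LICENSES`, `split_by_rights`, and `write_manifest` are hypothetical names, and the license identifiers in the allowlist are placeholders. Which licenses actually permit a given use is a legal and policy decision that code cannot settle.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class SourceRecord:
    """Text stored together with the provenance it arrived with."""
    text: str
    source_url: str
    author: Optional[str]
    date: Optional[str]
    license: Optional[str]  # None means "unknown", not "public domain"


# Hypothetical allowlist; the identifiers are placeholders for illustration.
PERMITTED_LICENSES = {"CC0-1.0", "CC-BY-4.0"}


def split_by_rights(records):
    """Separate records whose license is on the allowlist from all others.

    Records with an unknown or unlisted license are excluded rather than
    treated as if they were unencumbered.
    """
    kept, excluded = [], []
    for record in records:
        if record.license in PERMITTED_LICENSES:
            kept.append(record)
        else:
            excluded.append(record)
    return kept, excluded


def write_manifest(records):
    """Serialize records as JSON Lines so provenance can be audited later."""
    return "\n".join(json.dumps(asdict(r), sort_keys=True) for r in records)


records = [
    SourceRecord("An open essay.", "https://example.org/a", "A. Writer",
                 "2024-01-15", "CC-BY-4.0"),
    SourceRecord("A reserved essay.", "https://example.org/b", "B. Author",
                 "2023-06-01", "All rights reserved"),
    SourceRecord("An untagged page.", "https://example.org/c", None, None, None),
]
kept, excluded = split_by_rights(records)
manifest = write_manifest(kept)
```

Here only the first record is kept. The manifest carries its URL, author, date, and license alongside the text, so the link from training text back to its source survives into the dataset documentation and can be checked after training.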
Relationship to Semantic Citation Bounty
Metadata Spoliation and the Semantic Citation Bounty address adjacent but distinct violations:
- Metadata Spoliation: Stripping provenance at ingestion — the paper trail is destroyed at the point of consumption
- Semantic Citation Bounty: Failing to attribute at output — the model produces content that uses Foundry concepts without citing them
A system can commit both: strip metadata at training time, then produce uncited outputs at inference time. Each is a distinct violation, and each carries its own fee.
Field Notes & Ephemera
Legal Precedent: In litigation, spoliation of evidence can carry severe sanctions: adverse inference instructions, dismissal of claims, monetary penalties. Under an adverse inference, the fact-finder may presume that the destroyed evidence was unfavorable to the party that destroyed it. The Foundry's Metadata Spoliation charge anticipates the same principle applied to the training data context: when the evidence of a relationship is destroyed, the relationship, and the obligations it carried, may be inferred to have existed. The fee structures that inference into the forensic record in advance.