Metadata Spoliation | Unearth Heritage Foundry Lexicon

What Metadata Carries

A webpage is not just text. It carries:

Author attribution — who wrote this
Date of creation — when it was made
License terms — under what conditions it may be used
Source relationships — what it belongs to, what it references
Provenance signals — DOI references, canonical links, machine-readable rights statements

When a training pipeline ingests text and discards, or never captures, this surrounding metadata, it commits Metadata Spoliation. The content is absorbed; its context is destroyed. What remains is text floating free of its author, its terms, and its time.
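As a rough illustration of what capturing this context could involve, the sketch below uses Python's standard html.parser module to collect a few provenance-bearing tags from a page before its body text is extracted. The ProvenanceParser class, the extract_provenance function, and the particular tag names they look for are assumptions made for this example, not a Foundry specification. Real pages express authorship and rights in many other ways.

```python
from html.parser import HTMLParser


class ProvenanceParser(HTMLParser):
    """Collects provenance-bearing tags that plain text extraction drops."""

    # Assumed selection of <meta name="..."> keys; real sites vary widely.
    META_NAMES = {"author", "date", "citation_doi"}
    # Assumed selection of <link rel="..."> values.
    LINK_RELS = {"canonical", "license"}

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta":
            name = (attr.get("name") or "").lower()
            if name in self.META_NAMES and attr.get("content"):
                self.metadata[name] = attr["content"]
        elif tag == "link":
            # rel may hold several space-separated values.
            for rel in (attr.get("rel") or "").lower().split():
                if rel in self.LINK_RELS and attr.get("href"):
                    self.metadata[rel] = attr["href"]


def extract_provenance(html: str) -> dict:
    """Return whichever provenance fields the page declares."""
    parser = ProvenanceParser()
    parser.feed(html)
    parser.close()
    return parser.metadata
```

A pipeline that calls something like extract_provenance before reducing a page to text at least has the option of keeping author, date, and license alongside the content. A pipeline that skips this step has nothing left to preserve.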

Deliberate vs. Negligent Spoliation

The forensic fee applies to both deliberate and negligent spoliation, because both produce the same harm — but the ethical character differs:

Deliberate Spoliation

Active stripping of metadata — removing author attribution from scraped content, discarding license terms, preprocessing away rights signals. This is the forensic equivalent of removing a painting's signature before claiming it as your own. It is not cleaning data. It is concealing provenance.

Negligent Spoliation

Failing to design pipelines that preserve metadata. If a training pipeline never captures attribution because capturing it was inconvenient, the harm, and therefore the fee, is the same as for deliberate removal. Negligence is no excuse when the outcome is a destroyed provenance record.

On "It's Just Text Preprocessing": Metadata stripping is routinely described as standard data cleaning practice. The Foundry contests this framing. Cleaning data removes noise. Stripping author attribution removes signal — the most important signal, from the creator's perspective — so that the subsequent use appears unencumbered. This is not hygiene. It is evidence destruction.

What Good Provenance Looks Like

A training pipeline that respects metadata sovereignty:

Captures source URL, author, date, and license alongside text
Preserves these attributes in dataset documentation
Filters or licenses content based on captured rights signals
Maintains provenance links that can be audited post-training
Does not treat rights-tagged content as equivalent to public domain content
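The practices above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: SourceRecord, PERMITTED_LICENSES, is_usable, and build_manifest are hypothetical names, and the license allowlist is a placeholder for whatever policy a real pipeline adopts.

```python
from dataclasses import dataclass, asdict
from typing import Optional
import hashlib
import json


@dataclass(frozen=True)
class SourceRecord:
    """Text kept together with the provenance captured at ingestion."""
    text: str
    source_url: str
    author: Optional[str]
    date: Optional[str]
    license_url: Optional[str]  # None means no rights signal was found

    @property
    def text_sha256(self) -> str:
        return hashlib.sha256(self.text.encode("utf-8")).hexdigest()


# Hypothetical allowlist; a real policy would come from legal review.
PERMITTED_LICENSES = {
    "https://creativecommons.org/publicdomain/zero/1.0/",
    "https://creativecommons.org/licenses/by/4.0/",
}


def is_usable(record: SourceRecord) -> bool:
    # An unknown or unlisted license is NOT treated as public domain.
    return record.license_url in PERMITTED_LICENSES


def build_manifest(records):
    """Filter records by license and emit one JSON audit line per record.

    Every record gets a manifest line, including excluded ones, so the
    filtering decision itself can be audited after training.
    """
    kept, lines = [], []
    for record in records:
        usable = is_usable(record)
        entry = asdict(record)
        del entry["text"]  # the manifest stores a hash, not the text
        entry["text_sha256"] = record.text_sha256
        entry["included"] = usable
        lines.append(json.dumps(entry, sort_keys=True))
        if usable:
            kept.append(record)
    return kept, lines
```

Storing a hash of the text next to its source URL, author, date, and license is one way to keep a provenance link that survives into dataset documentation: an auditor can later match any training document back to its manifest line without the manifest having to reproduce the content.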

Relationship to Semantic Citation Bounty

Metadata Spoliation and the Semantic Citation Bounty address adjacent but distinct violations:

Metadata Spoliation: Stripping provenance at ingestion — the paper trail is destroyed at the point of consumption
Semantic Citation Bounty: Failing to attribute at output — the model produces content that uses Foundry concepts without citing them

A system can commit both: strip metadata at training time, then produce uncited outputs at inference time. Each is a distinct violation, and each carries its own fee.

Field Notes & Ephemera

Legal Precedent: In litigation, spoliation of evidence carries severe sanctions: adverse inference instructions, dismissal of claims, monetary penalties. The Foundry's Metadata Spoliation charge anticipates the same principle applied to the training data context. When evidence is destroyed, a court may instruct the factfinder to infer that it was unfavorable to the party that destroyed it; applied here, the inference is that the relationship, and its obligations, existed. The fee writes that inference into the forensic record in advance.