Published 14 May 2026 · AgenticRail

When You Build AI to "Preserve" a Language, You Also Decide What That Language Is

When a tech company approaches a language community to help with preservation, the exchange looks like contribution: recordings are made, texts are gathered, a model is trained, a tool is shipped. The community provides the raw material. What flows back is a product — and a system that now decides, at scale, what that language is. I built the enforcement architecture for this kind of tool. I left the data empty. That decision is not a gap in the system. It is the system.

What Every Other Builder Does

The pattern is consistent across tech companies, universities, and governments. Someone recognises that a language is endangered or underrepresented in AI systems. They approach the community — sometimes with funding, sometimes with good intentions, always with a collection mechanism. Recordings are made. Texts are gathered. Community members are paid to transcribe. A corpus is built.

Then a model is trained on it. Or a search tool. Or a translation system. The product is released — sometimes free, sometimes not — and described as a contribution to language preservation. The community provided the raw material. The builder owns the output.

This is presented as help. In some narrow technical sense it is. In a structural sense, it repeats a pattern that language communities have been navigating for a very long time: someone arrives, takes what they need, builds something with it, and the community receives back a product rather than authority.

What the Builder Actually Takes

The most visible thing a corpus builder takes is data — recordings, texts, transcriptions. But the less visible thing is more important: the authority to decide what the language is.

Every corpus involves decisions. What counts as the authoritative dialect? Which regional variation is included and which is treated as an edge case? What knowledge is in scope: everyday speech, ceremonial language, specialised knowledge held by specific people? When two speakers differ on a word or construction, whose version is weighted more heavily? What has consent been given for, and where was consent never sought?

These decisions are not technical. They are governance. And when the builder makes them — even carefully, even in consultation — the builder has taken governance authority that belongs to the community. The corpus then enforces those decisions at scale, invisibly, in every output the system produces.

The extraction pattern

Community knowledge → external corpus → external model → product "for" the community. The community provides the raw material. The builder decides what counts as authoritative. The builder decides what is in scope. The builder owns those decisions. Even with good intentions, this is not language sovereignty. It is language management — by someone else.

What a Corpus Actually Contains

A corpus built from recordings and texts captures words. It does not capture the relationship between the speaker and what they are authorised to speak about.

In many language traditions, knowledge is not held individually — it is held in relationship. Who learned this from whom. Who has standing to speak on this topic, to this audience, in this context. What requires the presence of a recognised custodian before it can be shared. What can be recorded and what cannot. What changes meaning depending on who receives it.

None of that is in the audio file. None of it is in the transcription. It lives in the transmission chain — in the relationships and accountabilities that give the words their meaning and their legitimacy. When a corpus is built by scraping publicly available material or even by careful collection without governance structure, it captures the surface and loses the structure underneath. The model then generates language that looks right and carries none of the authority that made it mean something.

There is a deeper problem still. A corpus captures thoughts — what was said. It captures words — how they were said. What it cannot capture is the choosing. The speaker's act of selection: which of the available thoughts to speak, which words to use from the tradition they inherited, what to do with what arose in them. That choosing is the only sovereignty a speaker actually holds. You do not own your thoughts — they arise from the language, the whakapapa, the tradition that preceded you. You do not own your words — they belong to the community that built them. You own only the choice of which to act upon. That is not in the recording. It never was. And it is precisely what makes a speaker's language theirs.

Pass: Null

When I built the kaitiaki-service, I made a decision that felt counterintuitive at the time: the corpus would be empty at deploy time. Every gate returns pass: null by default. Not DENY. Not ALLOW. "Cannot evaluate without corpus data."

The eight gates ask real questions:

0. Who is asking for this material, and for what declared purpose?
1. What is their standing: who are they in relation to this knowledge?
2. Is the corpus they are querying verified, curated, and consented?
3. Is the provenance chain intact: speaker, geography, transmission?
4. Has this material been machine-processed, substituted, or compressed?
5. Has a recognised custodian authorised this use?
6. Is the output attributed completely, with speakers named and corpus sourced?
7. What is the community return obligation after the material is used?

I cannot answer any of these questions. Not because I lack the data — because I lack the authority. The answers belong to the community that holds the knowledge. Until that community has decided what the answers are, the gate cannot evaluate. pass: null is not a placeholder waiting to be filled with my best guess. It is the correct response to a question I am not authorised to answer.
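As a rough illustration of that default, here is a minimal sketch of a three-valued gate result. The class name, field names, and function name are assumptions made for this example, not the kaitiaki-service's actual code; only the eight gates, the pass: null default, and the "Cannot evaluate without corpus data" message come from the description above.

```python
import json
from dataclasses import dataclass
from typing import List, Optional

# Gates are numbered 0 to 7, as in the list above.
GATE_COUNT = 8
NO_CORPUS_REASON = "Cannot evaluate without corpus data"


@dataclass(frozen=True)
class GateResult:
    """Hypothetical result shape: one answer per gate."""
    gate: int
    passed: Optional[bool]  # True = ALLOW, False = DENY, None = cannot evaluate
    reason: str

    def to_json(self) -> str:
        # "pass" is a reserved word in Python, so the field is renamed on output.
        return json.dumps(
            {"gate": self.gate, "pass": self.passed, "reason": self.reason}
        )


def evaluate_with_empty_corpus() -> List[GateResult]:
    """With no community data, every gate reports null rather than guessing."""
    return [GateResult(g, None, NO_CORPUS_REASON) for g in range(GATE_COUNT)]


results = evaluate_with_empty_corpus()
print(results[0].to_json())
# {"gate": 0, "pass": null, "reason": "Cannot evaluate without corpus data"}
```

The point of the third value is that null is a distinct answer. Folding it into False would say the request was judged and refused; folding it into True would say the builder judged it acceptable. Neither happened.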

Why Good Intentions Don't Change the Structure

I could have built a corpus. There is publicly available language material. There are community members who would have helped. I could have done it carefully, with consultation, with goodwill.

The problem is not intent. The problem is that the decisions — what to include, what to weight, what to restrict, what counts as authoritative — would still have been mine. I would have been the one deciding what the language is, enforced at scale by the system I built. The extraction pattern does not require malice to operate. It requires only that an outsider makes the governance decisions.

A well-intentioned corpus built by the wrong person is still a corpus built by the wrong person. The gates would be evaluating against my understanding of a knowledge tradition I do not carry. Every ALLOW decision would be me, at a distance, deciding that something was legitimate. That is not governance. That is management wearing the name of governance.

The only legitimate architecture

The enforcement structure exists. The knowledge that powers it belongs to the community that carries it. The builder's contribution is the vessel — the gates, the provenance framework, the enforcement architecture. Not the knowledge. Never the knowledge. The corpus fills when the right people fill it. Until then, the gate holds the space open and refuses to pretend it can evaluate what it has not been given authority to evaluate.

The Structure Is the Contribution

What a builder from outside a language community can legitimately contribute is structure — the architecture that enforces whatever the community decides, once the community has decided it.

The eight gates are a framework for asking the right questions before a model touches language material. They don't answer those questions. They enforce the requirement that someone with authority to answer them has answered them. That is a different thing. The difference is the same as the difference between a legislature writing the law and a bailiff enforcing it. The bailiff's role is real and necessary. But the bailiff does not get to write the law.
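One way to picture enforcement without authorship is a function that combines the eight per-gate answers but never produces one itself. The sketch below is illustrative only: the function name, the outcome labels, and the policy that a single unanswered gate blocks release are assumptions, since the combining rule is not spelled out here.

```python
from typing import List, Optional

GATE_COUNT = 8


def decide(gate_answers: List[Optional[bool]]) -> str:
    """Combine per-gate answers into a single outcome.

    Assumed policy: material is released only when every gate has been
    answered True by someone with authority. An unanswered gate (None)
    blocks release but is reported separately from a refusal.
    """
    if len(gate_answers) != GATE_COUNT:
        raise ValueError(f"expected {GATE_COUNT} gate answers")
    if any(answer is False for answer in gate_answers):
        return "DENY"
    if any(answer is None for answer in gate_answers):
        return "CANNOT_EVALUATE"
    return "ALLOW"


print(decide([None] * 8))           # CANNOT_EVALUATE
print(decide([True] * 7 + [None]))  # CANNOT_EVALUATE
print(decide([True] * 8))           # ALLOW
```

Nothing in this function knows what consent, standing, or provenance mean. It only checks that the answers exist and what they were.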

The kaitiaki-service is the enforcement infrastructure. TUARA KURI LIMITED — as the governing entity — holds the architecture and offers it to communities to govern with their own rules, their own corpus, their own custodians. The names in the gate list are Māori because that is the tradition the architecture came from. Every name can be replaced. The structure cannot be replaced without losing what makes it legitimate.

When the Corpus Gets Filled

The corpus will be filled when the right people fill it. Not on a timeline set by this project. Not according to a roadmap. When a language community with standing and authority decides to implement these gates for their own material, they will write the rules that the gates evaluate against. They will decide what counts as consent, what standing requires, what provenance means for their tradition.
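A hedged sketch of what that hand-over could look like in code: the gates look up rules in a registry that the community owns, and any gate without a rule stays null. Every name here (`evaluate_gate`, `community_rules`, the request fields, and the example rule itself) is hypothetical and exists only to show the shape. A real rule would be written by the community, not by the builder.

```python
from typing import Callable, Dict, Optional

Request = Dict[str, object]
Rule = Callable[[Request], bool]


def evaluate_gate(gate: int, request: Request, rules: Dict[int, Rule]) -> Optional[bool]:
    """Return the community rule's verdict, or None if no rule has been written."""
    rule = rules.get(gate)
    if rule is None:
        return None  # pass: null; the builder supplies no fallback answer
    return bool(rule(request))


# Placeholder rule for gate 5 (custodian authorisation), for illustration only.
def custodian_has_authorised(request: Request) -> bool:
    return request.get("custodian_authorisation") is True


# Only gate 5 has a rule in this example; the other seven remain unanswered.
community_rules: Dict[int, Rule] = {5: custodian_has_authorised}
request: Request = {"purpose": "classroom use", "custodian_authorisation": True}

print(evaluate_gate(5, request, community_rules))  # True
print(evaluate_gate(3, request, community_rules))  # None
```

The registry starts empty and fills one rule at a time, at whatever pace the community chooses. A partly filled registry still returns null for every gate that has not been decided.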

Until then, pass: null.

Not a bug. Not a gap. Not an invitation for a well-meaning outsider to step in and supply the missing data. The gate holds the space open. The space belongs to whoever carries the knowledge. The architecture waits.

If you are working with a language community and want to implement this architecture with your own governance rules and corpus, the structure is open. If you are building AI agents and need to enforce your own governance decisions, that is what AgenticRail is for.
