Founding paper · 2026
The Alien Internet Dump Translation Hypothesis
Why We Should Start Injecting Phonetic Whale Language into Frontier Models
Oleksandr Anpilohov · ARION Research Initiative
Cite this paper
@article{anpilohov2026alien,
title={The Alien Internet Dump Translation Hypothesis: Why We Should Start Injecting Phonetic Whale Language into Frontier Models},
author={Anpilohov, Oleksandr},
year={2026},
url={https://arionresearch.org/paper}
}

Abstract
We propose a concrete, low-cost intervention for frontier language model pretraining: inject tokenized non-human phonetic datasets — beginning with sperm whale codas — directly into the human text corpus. We call the theoretical basis for this approach the Alien Internet Dump Translation Hypothesis. The hypothesis holds that when two symbolic systems, each internally coherent, are co-trained under a shared next-token prediction objective, the model will discover latent structural alignments between them — the same mechanism responsible for emergent cross-lingual transfer in multilingual models. We argue that cetacean vocalizations, already formalized as phonetic text by Project CETI's WhAM pipeline, are the ideal first candidate: the data exists, the tooling exists, and the injection method requires no architectural changes. Each pretraining generation refines and re-organizes the non-human corpus, producing a bootstrapping loop that converges, over multiple generations, on genuine interspecies semantic alignment.
1. Introduction
Modern language models have demonstrated a remarkable property: when trained on text from dozens of human languages simultaneously, they spontaneously develop shared internal representations for semantically equivalent concepts, even without any parallel corpora or translation supervision [Conneau et al., 2020a]. A model trained on English, Swahili, and Mongolian does not learn three separate systems — it learns one system with three surface realizations. The shared embedding space is not designed; it emerges.
This paper asks a simple question: why stop at human languages?
If the alignment mechanism is general — if it operates on any two symbolic systems that are internally coherent and share deep structural invariants — then there is no principled reason why it should be limited to languages produced by Homo sapiens. Non-human species with complex vocal communication systems are, in this framework, simply low-resource languages that no one has yet included in the training corpus.
We propose that including them is now feasible, and that the marginal cost of doing so is negligible relative to the potential return: a model that can begin to describe, in human language, what non-human vocalizations mean.
2. The Thought Experiment: An Alien Internet Dump
To motivate the hypothesis, consider the following scenario.
Humanity intercepts a massive data transmission from another civilization — not a single message, but an entire internet dump. Petabytes of text in an utterly alien symbolic system. After years of effort, linguists and engineers develop a tokenizer that converts the alien text into discrete sub-word units, the same format used for English or Python code.
Now consider two possible approaches to analysis.
Approach A — Specialist decoder. Train a model exclusively on the alien corpus. This model learns the internal statistics of the alien language, can predict next tokens, cluster recurring patterns, and generate plausible alien text. It cannot, however, explain what any of it means in human terms. It has no bridge to human concepts.
Approach B — Co-training. Feed the tokenized alien corpus, side-by-side with the full sweep of human text, into a single frontier-scale language model. The training objective is unchanged: predict the next token. No cross-modal adapters. No special architecture. Just text alongside text.
The first-generation model will treat the two datasets as largely separate. But because both corpora are pure text, and because the universe runs on the same physics everywhere — causality, physical constraints, social dynamics, information structure — certain deep invariants begin to align. The model's next-token prediction objective quietly discovers these shared latent structures without being explicitly told to look for them.
This is precisely what happens with human multilingual models. Hindi and English share no script, no phonology, no morphology. Yet a model trained on both spontaneously learns that dog and कुत्ता should occupy nearby positions in embedding space [Conneau et al., 2020a]. The alignment is not supervised. It emerges from co-training on corpora that describe the same world.
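The effect is directly measurable. The sketch below probes it with an off-the-shelf multilingual encoder (XLM-R, the model from Conneau et al., 2020b); the model choice and probe words are ours, and single-word cosine similarities are only a noisy proxy for the retrieval-based evaluations in the cited papers.

```python
# A rough probe of emergent cross-lingual alignment. Expectation: the
# translation pair (dog / कुत्ता) sits closer in embedding space than an
# unrelated cross-lingual pair (dog / पानी, "water"), with zero parallel
# supervision during pretraining.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base")
model.eval()

def embed(text: str) -> torch.Tensor:
    # Mean-pool the final hidden states into one vector per input string.
    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state
    return hidden.mean(dim=1).squeeze(0)

cos = torch.nn.functional.cosine_similarity
dog_en, dog_hi, water_hi = embed("dog"), embed("कुत्ता"), embed("पानी")

print("dog / कुत्ता :", cos(dog_en, dog_hi, dim=0).item())
print("dog / पानी  :", cos(dog_en, water_hi, dim=0).item())
```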
The alien internet dump is the limiting case. We don't need to wait for it. We already have something close.
3. From Aliens to Whales
The alien internet dump is a thought experiment. Cetacean vocalization data is not.
Sperm whales (Physeter macrocephalus) are among the most cognitively sophisticated animals on Earth. They live in multi-generational matrilineal clans, coordinate complex cooperative behaviors including communal calf-rearing and group hunting, and exchange patterned click sequences — codas — that vary systematically across clans, populations, and individuals. The Dominica Sperm Whale Project, founded by Shane Gero in 2005, has accumulated the world's largest archive of longitudinal sperm whale behavioral and acoustic data.
Until recently, these codas were described only as rhythmic patterns — sequences of clicks with varying inter-click intervals. The breakthrough came in 2024, when Sharma et al. published a landmark analysis in Nature Communications demonstrating that sperm whale codas constitute a combinatorial coding system with a phonetic alphabet built from rhythm, tempo, rubato, and ornamentation. Combined with the spectral features identified in follow-up work (below), each coda can be annotated along five dimensions:
- R — Rhythm: click count and temporal regularity
- T — Tempo: overall speed
- O — Ornamentation: decorative extra clicks
- RB — Rubato: expressive speed variation
- V — Vowel quality: spectral character of individual clicks
Beguš et al. (2025) extended this finding, identifying vowel-like and diphthong-like spectral patterns in individual clicks — features previously thought to be exclusive to human speech. The phonetic alphabet for sperm whale communication is now sufficiently developed to support systematic annotation at scale.
Project CETI's open-source WhAM (Whale Acoustics Model) pipeline operationalizes this alphabet. Given raw hydrophone recordings, WhAM detects individual clicks, groups them into codas, and outputs structured phonetic text:
R4.reg T.fast O.heavy RB.rise V.a
R5.reg T.slow O.none RB.steady V.i
R3.reg T.med O.light RB.fall V.a
These strings are text. They are in the same modality as English, Mandarin, Python, or ancient Sumerian. A standard subword tokenizer can process them. A frontier model can train on them.
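A minimal check of that claim, with GPT-2's BPE standing in for whatever tokenizer the target model actually uses:

```python
# Demonstration that the phonetic notation above is plain text to a standard
# subword tokenizer. GPT-2's BPE is a stand-in for the target model's own.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
codas = [
    "R4.reg T.fast O.heavy RB.rise V.a",
    "R5.reg T.slow O.none RB.steady V.i",
    "R3.reg T.med O.light RB.fall V.a",
]
for coda in codas:
    # Each line decomposes into ordinary BPE units, like English or Python.
    print(tok.tokenize(coda))
```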
4. The Core Hypothesis
The Alien Internet Dump Translation Hypothesis: When a symbolic system that is internally coherent — that encodes information about the real world, social relationships, physical constraints, and causal structure — is co-trained with human text under a shared next-token prediction objective at sufficient scale, the model will discover latent structural alignments between the two systems without explicit supervision.
The hypothesis follows directly from three established results:
1. Emergent cross-lingual alignment is universal at scale. Conneau et al. (2020a) demonstrated that multilingual models spontaneously align semantically equivalent concepts across languages with no shared vocabulary and no parallel data. Conneau et al. (2020b) showed this effect strengthens with scale and generalizes to low-resource languages. Agarwal & Gupta (2025) showed that cross-lingual alignments can be induced by targeted interventions, including the addition of previously unseen languages.
2. The bootstrapping loop is established. Brown et al. (2020) demonstrated model-based data cleaning at scale: noisy web text → model-based quality filter → cleaner data → better model. Each generation of GPT-class models produced cleaner, more coherent representations of the same underlying information. We propose applying the identical loop to whale phonetic data. Generation N will not produce fluent translation. But it will learn the statistical regularities of the whale tokens and produce a more structured version of the corpus for Generation N+1.
3. Continual pretraining at small ratios is safe. Nag et al. (2025) demonstrated that adding small volumes of novel tokens during continual pretraining yields gains in the target domain with negligible regression on high-resource benchmarks. At frontier scale, the entire DSWP archive, fully annotated, represents a fraction of one percent of total training tokens — well within the safe injection range.
5. The Method
The pipeline has three stages.
5.1 Stage 1 — Transcription
Raw hydrophone recordings from the Dominica Sperm Whale Project archive (publicly available on Hugging Face via the ceti-ai organization) are processed through WhAM. The pipeline:
- Detects individual clicks using the trained click detector
- Groups clicks into codas based on inter-coda silence thresholds
- Annotates each coda along the five phonetic dimensions
- Outputs structured phonetic text sequences
The result is a corpus of "whale sentences" — symbolic sequences that capture the structure of sperm whale communication in a format directly compatible with language model training pipelines.
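To fix the format, here is a minimal sketch of the Stage 1 output representation. The upstream click detection and coda grouping belong to WhAM and are not shown; the field values below are invented for illustration.

```python
# Stage 1 output sketch: one annotated coda -> one line of phonetic text.
# Detection and grouping are WhAM's job; the example values are illustrative.
from dataclasses import dataclass

@dataclass
class Coda:
    rhythm: str         # click count + regularity, e.g. "R4.reg"
    tempo: str          # overall speed, e.g. "T.fast"
    ornamentation: str  # decorative extra clicks, e.g. "O.heavy"
    rubato: str         # expressive speed variation, e.g. "RB.rise"
    vowel: str          # spectral character of clicks, e.g. "V.a"

    def to_text(self) -> str:
        # Serialize to a whitespace-delimited "whale sentence".
        return " ".join((self.rhythm, self.tempo, self.ornamentation,
                         self.rubato, self.vowel))

codas = [
    Coda("R4.reg", "T.fast", "O.heavy", "RB.rise", "V.a"),
    Coda("R5.reg", "T.slow", "O.none", "RB.steady", "V.i"),
]
print("\n".join(c.to_text() for c in codas))
```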
5.2 Stage 2 — Injection
The phonetic whale sequences are treated identically to any low-resource human language undergoing continual pretraining. They are:
- Tokenized using the same subword tokenizer as the target model (BPE or SentencePiece)
- Shuffled into the human text corpus at 1–5% by token count (the raw archive is far below this share, so the whale data is upsampled to reach the target ratio)
- Fed into the next training run without modification to the training objective or architecture
No cross-modal adapters. No new loss terms. No special masking. The injection is architecturally invisible — the model simply sees more text, some of which happens to be in a non-human notation system.
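A sketch of the mixing step under those constraints, with a whitespace tokenizer and toy documents standing in for BPE and the real corpora; in a production run this logic lives inside the data loader.

```python
# Stage 2 sketch: upsample whale documents until they make up roughly
# `whale_ratio` of all tokens, then shuffle uniformly into the human stream.
import random

def mix_corpora(human_docs: list[str], whale_docs: list[str],
                tokenize, whale_ratio: float = 0.01,
                seed: int = 0) -> list[str]:
    human_tokens = sum(len(tokenize(d)) for d in human_docs)
    whale_tokens = sum(len(tokenize(d)) for d in whale_docs)
    # Copies of the whale corpus needed to approximate the target ratio
    # (accurate for small ratios, where total tokens ~= human tokens).
    repeats = max(1, round(whale_ratio * human_tokens / max(1, whale_tokens)))
    mixed = human_docs + whale_docs * repeats
    random.Random(seed).shuffle(mixed)
    return mixed

# Toy usage, with whitespace splitting standing in for BPE:
mixed = mix_corpora(["the quick brown fox jumps"] * 1_000,
                    ["R4.reg T.fast O.heavy RB.rise V.a"],
                    tokenize=str.split)
```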
5.3 Stage 3 — Bootstrapping
Generation N will not produce fluent interspecies translation. It will, however, learn:
- The tokenization patterns of whale phonetic text
- Statistical regularities within whale codas (which rhythm patterns co-occur with which tempo markings)
- Contextual patterns (which codas appear in sequence, which appear in isolation)
Using Generation N, we re-tokenize and filter the original DSWP corpus, producing a cleaner, more consistently annotated dataset for Generation N+1. Generation N+1 inherits a better-organized whale corpus and a richer shared embedding space. Each cycle narrows the gap between whale phonetic structure and human semantic categories.
This is the same loop that transformed raw Common Crawl scrapes into the training data for GPT-4.
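One concrete form the re-organization can take (a sketch, not the only option) is perplexity filtering: score the raw corpus with Generation N and drop the tail most likely to be click mis-detections or annotation noise. The checkpoint name and keep fraction below are placeholders.

```python
# Stage 3 sketch: Generation N ranks the whale corpus by perplexity; the
# highest-perplexity tail is dropped before training Generation N+1.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model, tok, text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(input_ids=ids, labels=ids).loss
    return math.exp(loss.item())

def refine_corpus(gen_n_checkpoint: str, corpus: list[str],
                  keep_fraction: float = 0.9) -> list[str]:
    tok = AutoTokenizer.from_pretrained(gen_n_checkpoint)
    model = AutoModelForCausalLM.from_pretrained(gen_n_checkpoint)
    model.eval()
    ranked = sorted(corpus, key=lambda doc: perplexity(model, tok, doc))
    # Keep the low-perplexity head as the corpus for Generation N+1.
    return ranked[: int(len(ranked) * keep_fraction)]
```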
6. Why Phonetic Text, Not Raw Audio?
The critical design decision is the choice of representation. An alternative approach — and the one currently pursued by DolphinGemma — is to tokenize raw audio waveforms using a neural codec (SoundStream, EnCodec) and train a model in audio token space.
Audio tokenization has demonstrated impressive results for generation and prediction within a single species' acoustic domain. DolphinGemma can generate realistic dolphin vocalizations and predict likely continuations. But it operates in a different modality from human text. Its internal representations cannot align with discrete text embeddings through the standard next-token prediction mechanism, because the two systems are not in the same token space.
Phonetic text bridges this gap. By converting acoustic signals into symbolic notation using a shared character set and tokenizer, we place whale vocalizations in the same modality as the human corpus. The next-token prediction objective can then discover cross-domain symmetries exactly as it discovers cross-lingual symmetries in multilingual models.
The analogy to written transcription of human languages is precise. A language model trained on romanized transcriptions of spoken Hindi learns to align Hindi with English not because anyone told it to, but because the symbolic representations inhabit the same space and describe the same world. We are proposing the same for whale phonetic notation.
7. Why Now?
Every component of this pipeline exists today.
| Component | Status |
|-----------|--------|
| Whale phonetic alphabet | Published: Sharma et al. (2024), Beguš et al. (2025) |
| Annotation pipeline | Open-source: WhAM (Project CETI) |
| Training data | Public: DSWP archive on Hugging Face |
| Cross-lingual alignment theory | Established: Conneau et al. (2020a, 2020b) |
| Continual pretraining safety | Validated: Nag et al. (2025) |
| Open frontier models | Available: Llama 3, Mixtral, Falcon |
| Bootstrapping loop | Demonstrated: Brown et al. (2020) |
The only missing ingredient is the decision to act.
Frontier labs — xAI, OpenAI, Anthropic, Google DeepMind — run pretraining at a scale where the marginal cost of adding whale phonetic tokens to the corpus is genuinely negligible. At 500 billion to 15 trillion training tokens per run, the complete DSWP archive represents less than 0.001% of total data. The infrastructure for adding new language data to pretraining corpora already exists. This is a policy decision, not an engineering one.
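A back-of-envelope check: the annotated archive size is our order-of-magnitude assumption (roughly 10⁵ codas at ~20 subword tokens each); the run sizes are the figures quoted above.

```python
# Order-of-magnitude check of the <0.001% claim. archive_tokens is an
# assumption; 5e11 and 1.5e13 are the 500B / 15T figures from the text.
archive_tokens = 2e6
for run_tokens in (5e11, 1.5e13):
    print(f"{archive_tokens / run_tokens:.6%}")  # ~0.0004% and ~0.00001%
```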
8. The Research Registry
Tracking progress across models, datasets, and tools requires coordination infrastructure. ARION maintains an open registry of:
Models — Language models trained with non-human phonetic data. Fields include architecture, parameter count, human/animal dataset ratio, injection method, and generation number. Generation lineage enables tracking of the bootstrapping loop across model versions.
Datasets — Audio, video, and transcribed text corpora. Fields include species, location, years of collection, raw vs. model-prepared status, and provenance chain (which generation prepared the data, which generation it is intended to feed).
Tools — Annotation pipelines, tokenizers, phonetic alphabets, evaluation frameworks. Fields include supported species, function description, and links to code.
The registry schema is designed for the current static v1 (seed entries only), with an architecture that can grow into a Cloudflare D1 + Workers API backend in v2 and an LLM-based submission clerk in v3.
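To make the schema concrete, here is a sketch of the three v1 entry types, with fields drawn from the descriptions above; names and types are illustrative, not the final schema.

```python
# Illustrative v1 registry entries. Field names mirror the descriptions
# above; the concrete schema may differ.
from dataclasses import dataclass, field

@dataclass
class ModelEntry:
    name: str
    architecture: str            # e.g. "decoder-only transformer"
    parameter_count: int
    animal_token_ratio: float    # non-human tokens / total training tokens
    injection_method: str        # e.g. "continual pretraining"
    generation: int              # position in the bootstrapping lineage
    parent_model: str | None = None

@dataclass
class DatasetEntry:
    name: str
    species: str                 # e.g. "Physeter macrocephalus"
    location: str                # e.g. "Dominica"
    collection_years: str        # e.g. "2005-2024"
    model_prepared: bool         # raw recording vs. model-prepared text
    provenance: list[str] = field(default_factory=list)  # preparing generations

@dataclass
class ToolEntry:
    name: str
    supported_species: list[str]
    function: str                # e.g. "click detection + coda annotation"
    code_url: str
```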
9. Call to Action
For frontier labs: Include tokenized non-human phonetic datasets in your next pretraining run. The marginal cost is negligible. The potential return — the first iterative step toward genuine interspecies translation — is profound.
For bioacoustics researchers: Contribute audio archives to the pipeline. Every hydrophone deployment, every tagged whale, every archived recording becomes training data. WhAM's automation means the annotation step is computationally cheap. Share through the ARION registry.
For open-source developers: Prototype today. Download the DSWP dataset from Hugging Face, run WhAM, tokenize the output, inject into Llama-3 70B or Mixtral 8x22B via continual pretraining, and observe the bootstrapping effects. Share your results.
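A starting-point sketch of that recipe, shrunk to a single GPU: the dataset id below is hypothetical (browse the ceti-ai organization for the real one), and an 8B model stands in for the 70B-scale targets.

```python
# Prototype sketch of the open-source recipe above. "ceti-ai/dswp-phonetic"
# is a hypothetical dataset id; Meta-Llama-3-8B stands in for larger models.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "meta-llama/Meta-Llama-3-8B"
tok = AutoTokenizer.from_pretrained(base)
tok.pad_token = tok.eos_token            # causal LMs ship without a pad token
model = AutoModelForCausalLM.from_pretrained(base)

whale = load_dataset("ceti-ai/dswp-phonetic", split="train")  # hypothetical id

def encode(batch):
    return tok(batch["text"], truncation=True, max_length=1024)

train = whale.map(encode, batched=True, remove_columns=whale.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="whale-gen1",
                           per_device_train_batch_size=1,
                           num_train_epochs=1,
                           learning_rate=2e-5),
    train_dataset=train,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()  # whale-only continual pretraining; interleave human text
                 # (Section 5.2) for the full injection recipe
```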
10. Conclusion
The cross-lingual alignment mechanism in frontier language models is more general than its current application to human languages. Any symbolic system that is internally coherent and describes a real world — physical constraints, social dynamics, causal structure — should, in principle, find latent structural alignments with human text when co-trained in a shared embedding space.
Cetacean vocalizations, now formalized as phonetic text by Project CETI's WhAM pipeline, are the ideal first candidate for extending this mechanism beyond human language. The data is public. The tooling is open-source. The theoretical foundations are published. The injection method requires no architectural changes.
We propose that the next frontier pretraining run include tokenized whale phonetic data. The result will not be fluent interspecies translation. But it will be Generation 1 — and the bootstrapping loop, once started, has no principled stopping point short of genuine semantic alignment.
The story of Arion is 2,600 years old. Acoustic signal crossed species boundaries then. We propose the conditions now exist to do it again — deliberately, at scale, with the tools of modern machine learning.
References
Agarwal, A., & Gupta, P. (2025). Languages are modalities: Cross-lingual alignment via encoder injection. arXiv preprint.
Beguš, G., et al. (2025). Vowel- and diphthong-like patterns in sperm whale communication. Open Mind.
Brown, T., et al. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33.
Conneau, A., Wu, S., Li, H., Zettlemoyer, L., & Stoyanov, V. (2020a). Emerging cross-lingual structure in pretrained language models. Proceedings of ACL 2020. https://aclanthology.org/2020.acl-main.536/
Conneau, A., et al. (2020b). Unsupervised cross-lingual representation learning at scale. Proceedings of ACL 2020. https://aclanthology.org/2020.acl-main.747/
Google DeepMind, Georgia Tech, & Wild Dolphin Project. (2025). DolphinGemma. https://deepmind.google/models/gemma/dolphingemma/
Nag, A., Chakrabarti, S., Mukherjee, A., & Ganguly, N. (2025). Efficient continual pre-training of LLMs for low-resource languages. Proceedings of NAACL 2025.
Project CETI. (2025). WhAM: Whale Acoustics Model. https://github.com/ceti-ai
Sharma, P., Gero, S., Payne, R., Gruber, J., Rus, D., Torralba, A., & Andreas, J. (2024). Contextual and combinatorial structure in sperm whale vocalisations. Nature Communications. https://doi.org/10.1038/s41467-024-47221-8