
The alien internet dump translation hypothesis

A thought experiment that becomes a concrete, actionable research pipeline.

Imagine intercepting an alien internet dump

Suppose humanity intercepts a massive data transmission from another galaxy — not a single message, but an entire internet dump. Petabytes of text in an utterly alien symbolic system. After years of effort, we develop a tokenizer that converts this alien text into discrete units — tokens — just like the ones we use for English or code.
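
To make the tokenizer step concrete, here is a minimal sketch using the Hugging Face `tokenizers` library. The file name `alien_corpus.txt` and the whitespace pre-tokenizer are assumptions; we would not actually know the dump's separator conventions in advance.

```python
# Train a standard BPE tokenizer on an unknown symbol stream, exactly as for
# human text. `alien_corpus.txt` is a hypothetical flattened dump.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()  # assumes whitespace-like separators survived

trainer = BpeTrainer(vocab_size=32_000, special_tokens=["[UNK]", "[PAD]"])
tokenizer.train(files=["alien_corpus.txt"], trainer=trainer)

# The output is discrete token IDs, the same kind of object as English subwords.
sample = open("alien_corpus.txt", encoding="utf-8").readline()
print(tokenizer.encode(sample).ids)
```

The point of the sketch: BPE needs only co-occurrence statistics, so nothing in the procedure depends on understanding the symbols.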

Here is the key move: we feed this tokenized alien corpus, side by side with the full sweep of human text, into the next frontier-scale language model. Not into a specialized alien decoder. Into the same model, through the same training pipeline.

The first-generation model would treat the two datasets as largely separate silos. But because both corpora are pure text, and because the universe runs on the same physics everywhere, certain deep invariants start to align. The model's next-token prediction objective quietly discovers shared latent structures.

We already have the alien data. It's in our oceans.

We don't need to wait for transmissions from another galaxy. On our own planet, highly evolved species possess complex vocal "languages" that remain untranslated. Sperm whales exchange patterned click sequences — codas — that carry social, ecological, and possibly abstract information.

Three steps to interspecies alignment

Step 1: Transcribe

Raw whale audio enters CETI's WhAM pipeline. The transformer-based system detects individual clicks, groups them into codas, and annotates each along the dimensions of the sperm whale phonetic alphabet: rhythm, tempo, rubato, ornamentation, and vowel-like spectral features.

Output: thousands of "whale sentences" in phonetic text form.
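
As a sketch of what one of those sentences could look like as data, here is an illustrative record for a single coda. The field names and the flat text rendering are placeholders, not CETI's actual schema:

```python
from dataclasses import dataclass

@dataclass
class Coda:
    """One annotated coda. Fields mirror the phonetic-alphabet dimensions
    named above; names and encodings are illustrative, not CETI's schema."""
    inter_click_intervals: list[float]  # seconds between consecutive clicks
    rhythm: int        # rhythm-type category
    tempo: float       # clicks per second
    rubato: float      # gradual tempo drift relative to the neighboring coda
    ornamented: bool   # extra click appended to the base pattern

    def to_phonetic_text(self) -> str:
        # Render the coda as one flat "word" of whale text for the corpus.
        orn = "ORN" if self.ornamented else "---"
        return f"R{self.rhythm}|T{self.tempo:.1f}|U{self.rubato:+.1f}|{orn}"

coda = Coda([0.18, 0.19, 0.21, 0.20], rhythm=4, tempo=5.1, rubato=0.3, ornamented=True)
print(coda.to_phonetic_text())  # R4|T5.1|U+0.3|ORN
```

A "whale sentence" is then just a space-separated run of such words, ready for any text pipeline.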

Step 2: Inject

Treat the phonetic whale sequences exactly like Swahili, Mongolian, or any other low-resource language. Tokenize with the same subword tokenizer. Shuffle into the human corpus at 1–5% by token count.

Because both corpora are pure text, injection occurs in the same modality. No cross-modal adapters. No new architecture. The next training run proceeds as usual.
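
A sketch of the injection itself, assuming both corpora are already lists of token-ID sequences from the shared tokenizer. The function name and the 3% default are illustrative, with the default chosen from the 1–5% band above:

```python
import random

def mix_corpora(human_docs, whale_docs, whale_ratio=0.03, seed=0):
    """Interleave documents so whale text makes up ~whale_ratio of all tokens.
    Both inputs are lists of token-ID lists from the *same* tokenizer."""
    rng = random.Random(seed)
    human_tokens = sum(len(d) for d in human_docs)
    # If whale tokens are fraction r of the total, then W = H * r / (1 - r).
    budget = int(human_tokens * whale_ratio / (1 - whale_ratio))

    picked, used = [], 0
    for doc in rng.sample(whale_docs, k=len(whale_docs)):
        if used >= budget:
            break
        picked.append(doc)
        used += len(doc)

    mixed = human_docs + picked
    rng.shuffle(mixed)
    return mixed
```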

Step 3: Bootstrap

Generation N won't produce fluent translation. But it learns statistical regularities in the whale tokens. It can re-tokenize, filter, and output a far more structured version of the original data.

Feed this cleaned dataset back in for Generation N+1. Each cycle sharpens the representations. It is the same virtuous loop that turned raw Common Crawl scrapes into GPT-4.
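
Made runnable, the loop might look like the sketch below. `train` and `perplexity` stand in for a real pretraining run and a per-document scoring pass, and keeping the most predictable 80% is one plausible cleaning rule, not a prescription:

```python
def bootstrap(whale_docs, n_generations, train, perplexity, keep=0.8):
    """Self-cleaning loop: train, score, keep the docs the model predicts
    best, retrain. `train(data) -> model` and `perplexity(model, doc) -> float`
    are hypothetical stand-ins, not an existing API."""
    model, data = None, list(whale_docs)
    for _ in range(n_generations):
        model = train(data)                                # Generation N
        data.sort(key=lambda doc: perplexity(model, doc))  # most predictable first
        data = data[: int(len(data) * keep)]               # drop the noisiest tail
    return model, data
```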

Why text, not raw audio?

Raw audio tokenization (as in DolphinGemma's SoundStream) keeps the representation in acoustic-codec space, a vocabulary disjoint from text, which blocks direct alignment with discrete text embeddings. Phonetic text bridges the gap: it matches the modality of the human corpus exactly, allowing the next-token prediction objective to discover cross-domain symmetries, the same way it does for Hindi–English or Swahili–Kazakh pairs.

This is the critical enabler. No new modality. No new architecture. Just text alongside text, and let the embeddings align.
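
One cheap way to watch that alignment happen, assuming the model's token-embedding table can be read out: track how close each whale token's embedding sits to its nearest human-token neighbor, generation over generation. A rising score would suggest the shared space is forming. This probe is an illustration, not an established metric from the projects named above:

```python
import numpy as np

def neighbor_overlap(whale_emb, human_emb):
    """Mean cosine similarity from each whale-token embedding to its nearest
    human-token embedding. Inputs are (n, d) and (m, d) arrays sliced from
    the model's embedding matrix."""
    w = whale_emb / np.linalg.norm(whale_emb, axis=1, keepdims=True)
    h = human_emb / np.linalg.norm(human_emb, axis=1, keepdims=True)
    sims = w @ h.T                   # (n, m) pairwise cosine similarities
    return sims.max(axis=1).mean()   # average nearest-neighbor similarity
```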

Explore a whale coda

[Interactive Coda Explorer: click any segment to see what it encodes. Each symbol in the notation maps to a distinct phonetic dimension of the coda. Example 1 of 4: a fast, ornamented greeting.]

This is not speculation. Every tool exists today.