
The alien internet dump translation hypothesis

A thought experiment that becomes a concrete, actionable research pipeline.

Imagine intercepting an alien internet dump

Suppose humanity intercepts a massive data transmission from another galaxy — not a single message, but an entire internet dump. Petabytes of text in an utterly alien symbolic system. After years of effort, we develop a tokenizer that converts this alien text into discrete units — tokens — just like the ones we use for English or code.
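
To make the tokenizer step concrete, here is a minimal sketch using the Hugging Face `tokenizers` library. The file name `alien_corpus.txt` and the whitespace pre-tokenizer are assumptions; we would not actually know the dump's separator conventions in advance.

```python
# Train a standard BPE tokenizer on an unknown symbol stream, exactly as for
# human text. `alien_corpus.txt` is a hypothetical flattened dump.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()  # assumes whitespace-like separators survived

trainer = BpeTrainer(vocab_size=32_000, special_tokens=["[UNK]", "[PAD]"])
tokenizer.train(files=["alien_corpus.txt"], trainer=trainer)

# The output is discrete token IDs, the same kind of object as English subwords.
sample = open("alien_corpus.txt", encoding="utf-8").readline()
print(tokenizer.encode(sample).ids)
```

The point of the sketch: BPE needs only co-occurrence statistics, so nothing in the procedure depends on understanding the symbols.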

Here is the key move: we feed this tokenized alien corpus, side by side with the full sweep of human text, into the next frontier-scale language model. Not into a specialized alien decoder. Into the same model, through the same training pipeline.

The first-generation model would treat the two datasets as largely separate silos. But because both corpora are pure text, and because the universe runs on the same physics everywhere, certain deep invariants start to align. The model's next-token prediction objective quietly discovers shared latent structures.

We already have the alien data. It's in our oceans.

We don't need to wait for transmissions from another galaxy. On our own planet, highly evolved species possess complex vocal "languages" that remain untranslated. Sperm whales exchange patterned click sequences — codas — that carry social, ecological, and possibly abstract information.

Three steps to interspecies alignment

Step 1: Transcribe

Raw whale audio enters CETI's WhAM pipeline. The transformer-based system detects individual clicks, groups them into codas, and annotates each along the dimensions of the sperm whale phonetic alphabet: rhythm, tempo, rubato, ornamentation, and vowel-like spectral features.

Output: thousands of "whale sentences" in phonetic text form.
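
As a sketch of what one of those sentences could look like as data, here is an illustrative record for a single coda. The field names and the flat text rendering are placeholders, not CETI's actual schema:

```python
from dataclasses import dataclass

@dataclass
class Coda:
    """One annotated coda. Fields mirror the phonetic-alphabet dimensions
    named above; names and encodings are illustrative, not CETI's schema."""
    inter_click_intervals: list[float]  # seconds between consecutive clicks
    rhythm: int        # rhythm-type category
    tempo: float       # clicks per second
    rubato: float      # gradual tempo drift relative to the neighboring coda
    ornamented: bool   # extra click appended to the base pattern

    def to_phonetic_text(self) -> str:
        # Render the coda as one flat "word" of whale text for the corpus.
        orn = "ORN" if self.ornamented else "---"
        return f"R{self.rhythm}|T{self.tempo:.1f}|U{self.rubato:+.1f}|{orn}"

coda = Coda([0.18, 0.19, 0.21, 0.20], rhythm=4, tempo=5.1, rubato=0.3, ornamented=True)
print(coda.to_phonetic_text())  # R4|T5.1|U+0.3|ORN
```

A "whale sentence" is then just a space-separated run of such words, ready for any text pipeline.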

Step 2: Inject

Treat the phonetic whale sequences exactly like Swahili, Mongolian, or any other low-resource language. Tokenize with the same subword tokenizer. Shuffle into the human corpus at 1–5% by token count.

Because both corpora are pure text, injection occurs in the same modality. No cross-modal adapters. No new architecture. The next training run proceeds as usual.
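
A sketch of the injection itself, assuming both corpora are already lists of token-ID sequences from the shared tokenizer. The function name and the 3% default are illustrative, with the default chosen from the 1–5% band above:

```python
import random

def mix_corpora(human_docs, whale_docs, whale_ratio=0.03, seed=0):
    """Interleave documents so whale text makes up ~whale_ratio of all tokens.
    Both inputs are lists of token-ID lists from the *same* tokenizer."""
    rng = random.Random(seed)
    human_tokens = sum(len(d) for d in human_docs)
    # If whale tokens are fraction r of the total, then W = H * r / (1 - r).
    budget = int(human_tokens * whale_ratio / (1 - whale_ratio))

    picked, used = [], 0
    for doc in rng.sample(whale_docs, k=len(whale_docs)):
        if used >= budget:
            break
        picked.append(doc)
        used += len(doc)

    mixed = human_docs + picked
    rng.shuffle(mixed)
    return mixed
```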

Step 3: Bootstrap

Generation N won't produce fluent translation. But it learns statistical regularities in the whale tokens. It can re-tokenize, filter, and output a far more structured version of the original data.

Feed this cleaned dataset back in for Generation N+1. Each cycle sharpens the representations. It is the same virtuous loop that turned raw Common Crawl scrapes into GPT-4.
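
Made runnable, the loop might look like the sketch below. `train` and `perplexity` stand in for a real pretraining run and a per-document scoring pass, and keeping the most predictable 80% is one plausible cleaning rule, not a prescription:

```python
def bootstrap(whale_docs, n_generations, train, perplexity, keep=0.8):
    """Self-cleaning loop: train, score, keep the docs the model predicts
    best, retrain. `train(data) -> model` and `perplexity(model, doc) -> float`
    are hypothetical stand-ins, not an existing API."""
    model, data = None, list(whale_docs)
    for _ in range(n_generations):
        model = train(data)                                # Generation N
        data.sort(key=lambda doc: perplexity(model, doc))  # most predictable first
        data = data[: int(len(data) * keep)]               # drop the noisiest tail
    return model, data
```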

Why text, not raw audio?

Raw audio tokenization (as in DolphinGemma's SoundStream) keeps the representation in acoustic-codec space, a vocabulary disjoint from text, which blocks direct alignment with discrete text embeddings. Phonetic text bridges the gap: it matches the modality of the human corpus exactly, allowing the next-token prediction objective to discover cross-domain symmetries, the same way it does for Hindi–English or Swahili–Kazakh pairs.

This is the critical enabler. No new modality. No new architecture. Just text alongside text, and let the embeddings align.
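
One cheap way to watch that alignment happen, assuming the model's token-embedding table can be read out: track how close each whale token's embedding sits to its nearest human-token neighbor, generation over generation. A rising score would suggest the shared space is forming. This probe is an illustration, not an established metric from the projects named above:

```python
import numpy as np

def neighbor_overlap(whale_emb, human_emb):
    """Mean cosine similarity from each whale-token embedding to its nearest
    human-token embedding. Inputs are (n, d) and (m, d) arrays sliced from
    the model's embedding matrix."""
    w = whale_emb / np.linalg.norm(whale_emb, axis=1, keepdims=True)
    h = human_emb / np.linalg.norm(human_emb, axis=1, keepdims=True)
    sims = w @ h.T                   # (n, m) pairwise cosine similarities
    return sims.max(axis=1).mean()   # average nearest-neighbor similarity
```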

Explore a whale coda

[Interactive Coda Explorer: click any segment to see what it encodes. Each symbol in the notation maps to a distinct phonetic dimension of the coda. Example 1 of 4: a fast, ornamented greeting.]

This is not speculation. Every tool exists today.