The Hypothesis


Imagine intercepting an alien internet dump.

Suppose humanity intercepts a massive data transmission from another galaxy — not a single message, but an entire internet dump. Petabytes of text in an utterly alien symbolic system. After years of effort, we develop a tokenizer that converts this alien text into discrete units — tokens — just like the ones we use for English or code.

Here is the key move: we feed this tokenized alien corpus, side by side with the full sweep of human text, into the next frontier-scale language model. Not into a specialized alien decoder. Into the same model, through the same training pipeline.

A first-generation model would probably treat the two datasets as largely separate silos. But because both corpora are pure text, and because the universe runs on the same physics everywhere, the hypothesis is that certain deep invariants start to align: the model's next-token prediction objective gradually discovers shared latent structures.

FIG.01 Two corpora, one tokenizer · tok/512

α · alien
⏧◬ ⏆ ✦⌬◈✧⌬ ⌬ ⌗⟠✧⟁⍝ ⌗ ⟁⍝ ◈⌬⌬◬◯✦⌖⏧⌖⌬⌬ ⌗⌭⏆⌭ ⍝⟁⟁◇⏧⟡◈ ◊⌭⌬⌬◈⟟✧◈⏃⟟◈⏧ ✧✦⍟✧◇⌗◊⌖◈⏧⟠ ⟡⟁ ◊⌗ ⌭⍝⟡⌬◬ ◊⏧⟁⟠ ◯⟟⏃✧⟡⌭ ⟁⏆ ⌭ ⟟◬ ◬⌗ ⟡⏆⌬⟡⏃✦⍝ ⍝⏧✦◯✧⏃⟡⏆ ◊✦⌗⏧⍟⏧ ⌭⌭◈⌬✦ ⌬⟠⌬⌬ ⟟⟁⟟⟠⟁⌬ ⍝⟁✧⌗ ◊◈ ◯⏃⏧ ⏧⍝⌗⌗◇⌭◬⍝✧⌗⌭⏧⏃◊✦⌭⟟✧◯⟟ ⌗◊⟡⟟⟁⏃⟟✦⟟⌬ ⌭ ⌗◯⟡◯⏧⍟⌖⍟◊⌬⏆⟟◇⟟◇⟁⏧✧⟟◊⟟⌖⌬⏧⌭⍝⌭◊⟟ ⌭◇⍝◇ ◬✧⏧ ⏃◊⍟ ⟟⏧◊✦⟟⏃⏃⌭⟁✦⍟⏃⟠⏧◈⏆◯ ⌬◊✧⍝⌬⟟⟠⌬⟟⍟⌗ ⟠⍝✧ ⌬✧ ◊✦✦⟡⌭◇⍟⍝ ⌗✦⏆⏆⍝⟠⟁◬ ⌭⌖ ⌬◈⌭⍝⌬◊⌖⍟⌬⏧⟟✧⏧⍟✦⍝✧ ◊⟠◈⌖◊⌭⟟◇⟠◊⟁⌭⍟◇⍝◊⌬⌭⌗⟠✦⌖⌭ ✧ ◬⌭⟠⏆✦⟟◇◊⟟ ◇⌭⌬⌖⏃◯⍝⍟⌭◯◯⍟⏃⌖⏆⌬⟁⟟ ⏧⏧⌭✦⌭◊⌬ ◯⌬⟟⌬⍝⟟⌗ ⟟⌖ ⟁⍝⍝⌗✧⟁◊◊◯⌭⌭⏆⍝⟟⟠◊◯◇⍝⍟ ⌗ ⏆⌭⏧✧◯⌭⟠✧✦⌖⟟✧⌖⟟⌗⏃⟠⟁ ⟡⌬✦

h · human
猫は屋根の上にいた der Wind kam von Westen die Sonne geht auf der Wind kam von Westen 猫は屋根の上にいた der Wind kam von Westen pumpkin pie was on the table we walked to the market 猫は屋根の上にいた the quick brown fox jumps the quick brown fox jumps the quick brown fox jumps 猫は屋根の上にいた die Sonne geht auf der Wind kam von Westen der Wind kam von Westen el gato está en el the quick brown fox jumps

α: alien stream rendered after tokenization. h: human stream from a multilingual web crawl. Identical modality. Identical embedding space. The training objective does the rest.

We already have the alien data.
It's in our oceans.

We don't need to wait for transmissions from another galaxy. On our own planet, highly evolved species possess complex vocal "languages" that remain untranslated. Sperm whales exchange patterned click sequences — codas — that carry social, ecological, and possibly abstract information.

Project CETI has spent six years recording over 8,000 codas from the Dominica clan. The Whale Acoustic Model (WhAM) detects clicks, segments them into codas, and annotates each with a phonetic alphabet describing rhythm, tempo, rubato, ornamentation, and vowel-like spectral features.

The phonetic alphabet was first published in Sharma et al., Nature Communications (2024). Symbols include R4.reg, T.slow, O.ext, RB.flat, V.a.

FIG.02 Coda log · Pm_C-01, 2024-06-14 · n = 12

time       clicks         notation
00:00:04   · · · ·        R4.reg
00:00:09   · · ·          R3.irr
00:00:17   · · · · ·      R5.orn
00:00:24   · · ·          R3.reg
00:00:31   · · ·          R3.irr
00:00:38   · · · ·        R4.fast
00:00:45   · · · · · ·    R6.orn

Click train for individual Pm_C-01 over 45 seconds. Spacing encodes rhythm; ornamentation marks coda type. WhAM emits exactly this notation, ready to be tokenized.
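To make "spacing encodes rhythm" concrete, here is a minimal Python sketch, assuming a coda arrives as a list of click onset times in seconds. It computes the inter-click intervals, takes total duration as a rough tempo measure, and divides the intervals by that duration to get a spacing pattern that does not depend on tempo. The function name and the example click times are hypothetical; this is an illustration, not WhAM's or Project CETI's implementation.

```python
from typing import List, Tuple


def rhythm_profile(click_times: List[float]) -> Tuple[float, List[float]]:
    """Return (duration, normalized inter-click intervals) for one coda.

    Assumption: click_times are click onsets in seconds, sorted ascending,
    with at least two clicks.
    """
    if len(click_times) < 2:
        raise ValueError("a coda needs at least two clicks")
    # Inter-click intervals: the gaps between consecutive clicks.
    icis = [b - a for a, b in zip(click_times, click_times[1:])]
    if any(gap <= 0 for gap in icis):
        raise ValueError("click times must be strictly increasing")
    duration = click_times[-1] - click_times[0]
    # Dividing by duration removes tempo and leaves only the spacing pattern.
    normalized = [gap / duration for gap in icis]
    return duration, normalized


# Hypothetical 4-click coda with evenly spaced clicks.
duration, rhythm = rhythm_profile([0.0, 0.2, 0.4, 0.6])
print(round(duration, 3), [round(r, 3) for r in rhythm])
```

Two codas with the same normalized pattern but different durations would share a rhythm and differ in tempo, which is the distinction the R and T symbols draw.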

Three steps to interspecies alignment.

No new modality. No new architecture. Three additions to a pipeline that already exists.

01 Transcribe

Detect clicks. Emit phonetic text.

Raw whale audio enters CETI's WhAM pipeline. The transformer-based system detects individual clicks, groups them into codas, and annotates each using the phonetic alphabet: rhythm, tempo, rubato, ornamentation, vowel-like spectral features.

Output: thousands of "whale sentences" in phonetic text form, already discrete, already symbolic, and ready for a tokenizer.

stage 01 · WhAM


   audio                segmentation             notation
   ────────             ─────────────            ────────
   ░▒▓▒░▓▒▒░▓▓ ──▶  [ · · · · ]  ──▶  R4.reg
   ▒░░▒▓▒░▒▒░  ──▶  [ · ·   · ]  ──▶  R3.irr
   ▓▒▒░▓▒░▓▒░  ──▶  [ · · · ]    ──▶  R3.reg
   ░▒▓▒░▓▒░░▒  ──▶  [·· ·· · ]   ──▶  R5.orn
                                       ↓
                              tok(R4.reg) · tok(T.fast) · tok(O.heavy)
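The segmentation column of the diagram can be illustrated with a simple rule: start a new coda whenever the silence between clicks is long enough. The sketch below is a stand-in, not WhAM's method (which the text describes as transformer-based). The function name, the 0.5 s threshold, and the click times are all assumptions chosen for illustration.

```python
from typing import List


def segment_codas(click_times: List[float], max_gap: float = 0.5) -> List[List[float]]:
    """Group sorted click onsets (seconds) into codas.

    A new coda starts whenever the silence since the previous click
    exceeds max_gap. The 0.5 s default is an illustrative assumption.
    """
    codas: List[List[float]] = []
    for t in click_times:
        if codas and t - codas[-1][-1] <= max_gap:
            codas[-1].append(t)   # close enough: same coda
        else:
            codas.append([t])     # long silence: start a new coda
    return codas


# Hypothetical click onsets: a 4-click coda, a pause, then a 3-click coda.
clicks = [4.0, 4.2, 4.4, 4.6, 9.0, 9.15, 9.5]
for coda in segment_codas(clicks):
    # Only the click count is derived here; the .reg/.irr/.orn suffix
    # would need a separate classifier.
    print(f"R{len(coda)}", coda)
```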

02 Inject

Treat it like any low-resource language.

Tokenize whale phonetic sequences with the same subword tokenizer used for human text. Shuffle into the human corpus at 1–5% by token count — comparable to Swahili, Mongolian, or Welsh.

Because both corpora are pure text, injection occurs in the same modality. No cross-modal adapters. No new architecture. The next training run proceeds as usual.

stage 02 · interleave


   human corpus            mixed stream             ratio
   ────────────            ────────────             ─────
   the cat sat on    ┐   the · R4.reg · cat
   un chat est sur   ├──▶ sat · on · T.fast      95–99%  human
   猫が座っていた     │    の · 上 · O.heavy        1– 5%  whale
   R4.reg T.fast     ┘    le · chat · RB.rise
   O.heavy RB.rise        was · V.a · on
                                  ↓
                       same tokenizer · same loss · same run
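A minimal sketch of the injection step, assuming both corpora are lists of text documents. Whitespace splitting stands in for the real subword tokenizer, and `mix_corpora`, `whale_fraction`, and the toy documents are hypothetical names and data. The whale token budget follows from solving w / (w + h) = f for w.

```python
import random
from typing import List


def count_tokens(doc: str) -> int:
    # Stand-in for the real subword tokenizer: whitespace tokens.
    return len(doc.split())


def mix_corpora(human: List[str], whale: List[str],
                whale_fraction: float = 0.02, seed: int = 0) -> List[str]:
    """Shuffle whale documents into the human corpus.

    Whale documents are added while they stay within whale_fraction of
    all tokens in the mix, or until the whale corpus runs out.
    """
    if not 0.0 < whale_fraction < 1.0:
        raise ValueError("whale_fraction must be strictly between 0 and 1")
    human_tokens = sum(count_tokens(d) for d in human)
    # Solve w / (w + h) = f for w, the whale token budget.
    budget = whale_fraction / (1.0 - whale_fraction) * human_tokens
    chosen, used = [], 0
    for doc in whale:
        n = count_tokens(doc)
        if used + n > budget:
            break                 # adding this document would overshoot
        chosen.append(doc)
        used += n
    mixed = human + chosen
    random.Random(seed).shuffle(mixed)  # seeded for reproducibility
    return mixed


human_docs = ["the cat sat on the mat"] * 100          # 600 tokens
whale_docs = ["R4.reg T.fast O.heavy RB.rise"] * 50    # 4 tokens each
mixed = mix_corpora(human_docs, whale_docs, whale_fraction=0.02)
whale_tokens = sum(count_tokens(d) for d in mixed if d.startswith("R"))
whale_share = whale_tokens / sum(count_tokens(d) for d in mixed)
print(f"{len(mixed)} documents, whale share = {whale_share:.3f}")
```

Stopping before the budget is exceeded keeps the whale share at or just under the target; a real pipeline would more likely sample by weight per batch than build one shuffled list.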

03 Bootstrap

Cycle. Sharpen. Repeat.

Generation N won't produce fluent translation. But it learns statistical regularities in the whale tokens. It can re-tokenize, filter, and output a far more structured version of the original data.

Feed this cleaned dataset back for Generation N+1. Each cycle sharpens representations. This mirrors the virtuous loop of filtering and retraining that helped turn raw Common Crawl scrapes into models of the GPT-4 class.

stage 03 · self-refinement


   gen N      ░░▒▒▓▓ · R4 R3 R5 · ▓▓▒▒░░          noisy, partial
       └─ filter ─┐
                ↓
   gen N+1    ░▒▓ · R4.reg T.fast · ▓▒░             structure emerges
       └─ filter ─┐
                ↓
   gen N+2    [greeting] · [descent] · [reply]     concepts surface
       └─      ⋯       ─┐
                ↓
   gen N+k    ⟿ aligned embedding space
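The filter arrows in the diagram can be sketched with a toy loop. A smoothed bigram model stands in for "generation N" here, which is a deliberate simplification: the proposal is about a frontier-scale language model, and this sketch only shows the filtering half of the cycle, not re-tokenization. All function names, the `??` noise token, and the sample sequences are hypothetical.

```python
import math
from collections import Counter
from typing import List


def train_bigram(seqs: List[List[str]]):
    """Collect bigram counts; a toy stand-in for training 'generation N'."""
    firsts, bigrams, vocab = Counter(), Counter(), set()
    for seq in seqs:
        vocab.update(seq)
        for a, b in zip(seq, seq[1:]):
            firsts[a] += 1          # times a token starts a bigram
            bigrams[(a, b)] += 1
    return firsts, bigrams, len(vocab)


def avg_logprob(seq: List[str], model) -> float:
    """Mean add-one-smoothed log-probability per transition."""
    firsts, bigrams, v = model
    pairs = list(zip(seq, seq[1:]))
    total = sum(math.log((bigrams[p] + 1) / (firsts[p[0]] + v)) for p in pairs)
    return total / len(pairs)


def refine(seqs: List[List[str]], generations: int = 2) -> List[List[str]]:
    """Each generation: train, score every sequence, drop the below-average ones."""
    seqs = [s for s in seqs if len(s) >= 2]   # need at least one transition
    for _ in range(generations):
        if not seqs:
            break
        model = train_bigram(seqs)
        scores = [avg_logprob(s, model) for s in seqs]
        mean = sum(scores) / len(scores)
        # Small tolerance so sequences with identical scores are all kept.
        seqs = [s for s, sc in zip(seqs, scores) if sc >= mean - 1e-9]
    return seqs


clean = [["R4.reg", "T.fast", "O.heavy"] for _ in range(6)]
noisy = [["O.heavy", "R4.reg", "??"], ["??", "T.fast", "??"]]  # '??' = misdetected symbol
kept = refine(clean + noisy)
print(len(kept), "sequences kept")
```

In this toy run the two sequences containing `??` score below the corpus average in the first generation and are dropped, and the second generation leaves the remaining data unchanged.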

Why text, not raw audio?

Raw audio tokenization (as in DolphinGemma, which uses the SoundStream tokenizer) keeps everything tied to the acoustic signal: its tokens encode waveform detail rather than symbolic units, which hinders alignment with text embeddings. Phonetic text bridges the gap. It matches the modality of the human corpus exactly, allowing the next-token prediction objective to discover cross-domain symmetries, the same way it does for Hindi–English or Swahili–Kazakh pairs.

This is the critical enabler. No new modality. No new architecture. Just text alongside text, and let the embeddings align.

FIG.03 Modality comparison · Δ embedding

audio tok      continuous  ⟂  text space
spectrogram    requires adapter, separate loss
phonetic       same modality  =  same loss
human lang.    baseline

Only the phonetic representation shares the human text manifold. The alignment is free — already paid for by the existing training run.

Explore a whale coda.

Each symbol maps to a distinct phonetic dimension.

Coda 01 · fast, ornamented greeting

Phonetic alphabet

R    Rhythm          click-count + spacing
T    Tempo           overall rate
O    Ornamentation   embellishment
RB   Rubato          timing flex
V    Vowel           spectral quality
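As a small illustration of how the legend maps onto symbols such as R4.reg or RB.rise, the Python sketch below splits a symbol into its dimension, an optional click count, and a value. The parsing rule (prefix letters, optional digits, a dot, then the value) is inferred from the examples in this text rather than taken from a published grammar, and `parse_symbol` and `Feature` are hypothetical names.

```python
from typing import NamedTuple, Optional

# Prefix-to-dimension mapping, taken from the legend.
DIMENSIONS = {
    "R": "Rhythm",
    "T": "Tempo",
    "O": "Ornamentation",
    "RB": "Rubato",
    "V": "Vowel",
}


class Feature(NamedTuple):
    dimension: str
    count: Optional[int]   # click count; only rhythm symbols such as R4 carry one
    value: str


def parse_symbol(symbol: str) -> Feature:
    """Split a symbol like 'R4.reg' into (dimension, count, value)."""
    head, _, value = symbol.partition(".")
    if not value:
        raise ValueError(f"missing '.value' part in {symbol!r}")
    prefix = head.rstrip("0123456789")   # letters before any digits
    digits = head[len(prefix):]          # trailing digits, possibly empty
    if prefix not in DIMENSIONS:
        raise ValueError(f"unknown dimension prefix in {symbol!r}")
    return Feature(DIMENSIONS[prefix], int(digits) if digits else None, value)


for sym in "R4.reg T.fast O.heavy RB.rise V.a".split():
    print(sym, "->", parse_symbol(sym))
```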


The alien internet dump
translation hypothesis.


This is not speculation.
Every tool exists today.
