ARION
§ 01 The hypothesis

The alien internet dump
translation hypothesis.

A thought experiment that becomes a concrete, actionable research pipeline for interspecies language alignment.

FILED   2026 · MAR      STATUS   open call      READ   7 min
§ 02 Premise

Imagine intercepting an alien internet dump.

Suppose humanity intercepts a massive data transmission from another galaxy — not a single message, but an entire internet dump. Petabytes of text in an utterly alien symbolic system. After years of effort, we develop a tokenizer that converts this alien text into discrete units — tokens — just like the ones we use for English or code.

Here is the key move: we feed this tokenized alien corpus, side by side with the full sweep of human text, into the next frontier-scale language model. Not into a specialized alien decoder. Into the same model, through the same training pipeline.

A first-generation model would likely treat the two datasets as largely separate silos. But both corpora are pure text, and both describe a universe that runs on the same physics, so the hypothesis is that certain deep invariants would begin to align as the next-token prediction objective discovers latent structure shared across the two.

FIG.01   Two corpora, one tokenizer      tok/512
α: alien stream, rendered after tokenization. h: human stream from a multilingual web crawl. Identical modality. Identical embedding space. The training objective does the rest.
§ 03 The pivot

We already have the alien data.
It's in our oceans.

We don't need to wait for transmissions from another galaxy. On our own planet, socially complex species have vocal "languages" that remain untranslated. Sperm whales exchange patterned click sequences (codas) that appear to carry social information and may also carry ecological or even abstract information.

Project CETI has spent six years recording over 8,000 codas from the Dominica clan. The Whale Acoustic Model (WhAM) detects clicks, segments them into codas, and annotates each with a phonetic alphabet describing rhythm, tempo, rubato, ornamentation, and vowel-like spectral features.

The phonetic alphabet was first published in Sharma et al., Nature Communications (2024). Symbols include R4.reg, T.slow, O.ext, RB.flat, V.a.

FIG.02   Coda log · Pm_C-01, 2024-06-14      n = 12
00:00:04   · · · ·        R4.reg
00:00:09   · ·   ·        R3.irr
00:00:17   · · · · ·      R5.orn
00:00:24   · · ·          R3.reg
00:00:31   · ·   ·        R3.irr
00:00:38   · · · ·        R4.fast
00:00:45   · · · · · ·    R6.orn
Click train for individual Pm_C-01 over 45 seconds. Spacing encodes rhythm; ornamentation marks coda type. WhAM emits exactly this notation, ready to be tokenized.
§ 04 Pipeline

Three steps to interspecies alignment.

No new modality. No new architecture. Three additions to a pipeline that already exists.

01  Transcribe

Detect clicks. Emit phonetic text.

Raw whale audio enters CETI's WhAM pipeline. The transformer-based system detects individual clicks, groups them into codas, and annotates each using the phonetic alphabet: rhythm, tempo, rubato, ornamentation, vowel-like spectral features.

Output: thousands of "whale sentences" in phonetic text form, already discrete, already symbolic, and ready for a tokenizer.

stage 01 · WhAM

   audio                segmentation             notation
   ────────             ─────────────            ────────
   ░▒▓▒░▓▒▒░▓▓ ──▶  [ · · · · ]  ──▶  R4.reg
   ▒░░▒▓▒░▒▒░  ──▶  [ · ·   · ]  ──▶  R3.irr
   ▓▒▒░▓▒░▓▒░  ──▶  [ · · · ]    ──▶  R3.reg
   ░▒▓▒░▓▒░░▒  ──▶  [·· ·· · ]   ──▶  R5.orn
                                       
                              tok(R4.reg) · tok(T.fast) · tok(O.heavy)
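The segmentation and notation columns above can be sketched in a few lines. This is a minimal illustration, not WhAM's actual method: the gap threshold `CODA_GAP_S`, the regularity tolerance `REGULARITY_TOL`, and the helper names are all hypothetical, and only the rhythm dimension is labeled.

```python
from typing import List

# Hypothetical thresholds, chosen only for illustration.
CODA_GAP_S = 0.5        # a silence longer than this starts a new coda
REGULARITY_TOL = 0.15   # max relative deviation of intervals for "reg"


def group_into_codas(click_times: List[float], gap: float = CODA_GAP_S) -> List[List[float]]:
    """Split a sorted list of click timestamps (seconds) into codas."""
    codas: List[List[float]] = []
    for t in click_times:
        if codas and t - codas[-1][-1] <= gap:
            codas[-1].append(t)      # close enough: same coda
        else:
            codas.append([t])        # long silence: new coda
    return codas


def rhythm_symbol(coda: List[float], tol: float = REGULARITY_TOL) -> str:
    """Label a coda R<n>.reg or R<n>.irr from its inter-click intervals."""
    n = len(coda)
    icis = [b - a for a, b in zip(coda, coda[1:])]
    if not icis:
        return f"R{n}.reg"           # a single click has no spacing to judge
    mean = sum(icis) / len(icis)
    regular = all(abs(ici - mean) <= tol * mean for ici in icis)
    return f"R{n}.{'reg' if regular else 'irr'}"


clicks = [0.00, 0.20, 0.40, 0.60,   # four evenly spaced clicks
          2.00, 2.20, 2.60]         # three clicks with uneven spacing
symbols = [rhythm_symbol(c) for c in group_into_codas(clicks)]
print(" ".join(symbols))  # R4.reg R3.irr
```

The output is plain text, so the next stage can treat it like any other string.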
02  Inject

Treat it like any low-resource language.

Tokenize whale phonetic sequences with the same subword tokenizer used for human text. Shuffle them into the human corpus at 1–5% by token count, the way a low-resource language such as Swahili, Mongolian, or Welsh is mixed in.

Because both corpora are pure text, injection occurs in the same modality. No cross-modal adapters. No new architecture. The next training run proceeds as usual.

stage 02 · interleave

   human corpus            mixed stream             ratio
   ────────────            ────────────             ─────
   the cat sat on    ┐   the · R4.reg · cat
   un chat est sur   ├──▶ sat · on · T.fast      95–99%  human
   猫が座っていた     │    の · 上 · O.heavy        1– 5%  whale
   R4.reg T.fast     ┘    le · chat · RB.rise
   O.heavy RB.rise        was · V.a · on
                                  
                       same tokenizer · same loss · same run
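The mixing ratio is simple arithmetic: for H human tokens and a target whale share f, the whale budget is W = f·H / (1 − f). A minimal sketch under stated assumptions: documents are lists of tokens, the function names are hypothetical, and whale documents are repeated (upsampled) until the budget is met, since a small corpus cannot reach a percent-level share otherwise.

```python
import random
from typing import List

Doc = List[str]


def whale_tokens_needed(human_tokens: int, whale_fraction: float) -> int:
    """Solve W / (human_tokens + W) = whale_fraction for W."""
    if not 0.0 <= whale_fraction < 1.0:
        raise ValueError("whale_fraction must be in [0, 1)")
    return round(human_tokens * whale_fraction / (1.0 - whale_fraction))


def mix_documents(human_docs: List[Doc], whale_docs: List[Doc],
                  whale_fraction: float, seed: int = 0) -> List[Doc]:
    """Shuffle whale documents into the human ones at the target token share.

    Whale documents are repeated (upsampled) until the token budget is met.
    """
    budget = whale_tokens_needed(sum(len(d) for d in human_docs), whale_fraction)
    pool = [d for d in whale_docs if d]          # skip empty documents
    picked: List[Doc] = []
    used = 0
    while pool and used < budget:
        doc = pool[len(picked) % len(pool)]      # cycle through the pool
        picked.append(doc)
        used += len(doc)
    mixed = list(human_docs) + picked
    random.Random(seed).shuffle(mixed)           # seeded for reproducibility
    return mixed


human_docs = [["the", "cat", "sat", "on"] for _ in range(24)]   # 96 tokens
whale_docs = [["R4.reg", "T.fast"], ["O.heavy", "RB.rise"]]     # 4 tokens
mixed = mix_documents(human_docs, whale_docs, whale_fraction=0.04)

whale_count = sum(len(d) for d in mixed if d in whale_docs)
total_count = sum(len(d) for d in mixed)
print(f"{whale_count / total_count:.2%}")  # 4.00%
```

Mixing at the document level keeps each coda sequence intact; the budget can overshoot by at most one document.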
03  Bootstrap

Cycle. Sharpen. Repeat.

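Generation N won't produce fluent translation, but it should learn statistical regularities in the whale tokens. That is enough to use it as a cleaning tool: it can re-tokenize and filter the corpus, yielding a more structured version of the original data.

Feed this cleaned dataset back in for Generation N+1. Each cycle is expected to sharpen the representations. This mirrors the filter-and-retrain loop by which raw web scrapes such as Common Crawl have been refined into training data for successive generations of large language models.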

stage 03 · self-refinement

   gen N      ░░▒▒▓▓ · R4 R3 R5 · ▓▓▒▒░░          noisy, partial
       └─ filter ─┐
                
   gen N+1    ░▒▓ · R4.reg T.fast · ▓▒░             structure emerges
       └─ filter ─┐
                
   gen N+2    [greeting] · [descent] · [reply]     concepts surface
       └──┐
                
   gen N+k    ⟿ aligned embedding space
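One filter step of this loop can be sketched with a stand-in model. The assumption here is loud: a smoothed bigram counter plays the role of "generation N", which in the proposal would be a full language model, and the function names and `keep_fraction` parameter are hypothetical. The idea is only that sequences the current model predicts poorly are dropped before the next round.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple

Model = Tuple[Counter, Counter, int]


def train_bigram(seqs: Sequence[Sequence[str]]) -> Model:
    """Count bigrams over the corpus; '<s>' marks the start of a sequence."""
    context, bigrams, vocab = Counter(), Counter(), {"<s>"}
    for seq in seqs:
        padded = ["<s>", *seq]
        vocab.update(seq)
        for a, b in zip(padded, padded[1:]):
            context[a] += 1
            bigrams[(a, b)] += 1
    return context, bigrams, len(vocab)


def mean_nll(seq: Sequence[str], model: Model) -> float:
    """Average negative log-likelihood per token, with add-one smoothing."""
    context, bigrams, vocab_size = model
    padded = ["<s>", *seq]
    pairs = list(zip(padded, padded[1:]))
    if not pairs:
        return math.inf              # nothing to score: rank it last
    nll = 0.0
    for a, b in pairs:
        nll -= math.log((bigrams[(a, b)] + 1) / (context[a] + vocab_size))
    return nll / len(pairs)


def filter_corpus(seqs: List[List[str]], keep_fraction: float) -> List[List[str]]:
    """One bootstrap cycle: fit on everything, keep the best-predicted share."""
    model = train_bigram(seqs)
    ranked = sorted(range(len(seqs)), key=lambda i: mean_nll(seqs[i], model))
    keep = set(ranked[:round(len(seqs) * keep_fraction)])
    return [s for i, s in enumerate(seqs) if i in keep]   # original order


corpus = (
    [["R4.reg", "T.fast", "O.heavy"]] * 5
    + [["R3.irr", "T.slow"]] * 4
    + [["V.a", "R4.reg", "RB.flat", "R3.irr"]]   # the odd one out
)
cleaned = filter_corpus(corpus, keep_fraction=0.9)
print(len(corpus), "->", len(cleaned))  # 10 -> 9
```

Filtering on the model's own likelihood can also discard rare but genuine codas, so the threshold is a design choice rather than a given.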
§ 05 Mechanism

Why text, not raw audio?

Raw audio tokenization (as in DolphinGemma's SoundStream tokenizer) keeps everything in acoustic signal space: its codes describe waveform features rather than symbols, so nothing ties them to the embeddings of discrete text. Phonetic text bridges the gap. It matches the modality of the human corpus exactly, allowing the next-token prediction objective to discover cross-domain symmetries, the same way it does for Hindi–English or Swahili–Kazakh pairs.

This is the critical enabler. No new modality. No new architecture. Just text alongside text, and let the embeddings align.

FIG.03   Modality comparison      Δ embedding
audio tok      acoustic codes  ⟂  text space
spectrogram    requires adapter, separate loss
phonetic       same modality  =  same loss
human lang.    baseline
Only the phonetic representation shares the human text manifold. If alignment emerges, it comes at no extra cost: the existing training run already pays for it.
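What "same modality" means in practice can be shown with the simplest possible tokenizer. This is a sketch under an assumption: byte-level ids stand in for the subword tokenizer named in the pipeline, and `byte_tokens` is a hypothetical helper. The point is that a whale string and an English string both land in one id space with no extra machinery.

```python
from typing import List


def byte_tokens(text: str) -> List[int]:
    """Byte-level ids: any UTF-8 string lands in the same 256-symbol space."""
    return list(text.encode("utf-8"))


human_ids = byte_tokens("the cat sat on")
whale_ids = byte_tokens("R4.reg T.fast O.heavy")

# Both streams use one id space, so one embedding table and one
# cross-entropy loss cover them without an adapter.
assert max(human_ids + whale_ids) < 256

# Ids that occur in both streams are where the two corpora already
# touch the same embedding rows.
shared = sorted(set(human_ids) & set(whale_ids))
print(repr(bytes(shared).decode("utf-8")))
```

Sharing ids also means whale symbols reuse embeddings learned for Latin letters. Giving each symbol its own dedicated token is the alternative, and which works better is an open question.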
§ 06 Interactive · Coda explorer

Explore a whale coda.

Each symbol maps to a distinct phonetic dimension.

Coda 01 · fast, ornamented greeting

Phonetic alphabet

R    Rhythm          click count + spacing
T    Tempo           overall rate
O    Ornamentation   embellishment
RB   Rubato          timing flex
V    Vowel           spectral quality
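The legend above suggests a small data structure. A minimal sketch, assuming a grammar inferred only from the example symbols on this page (a dimension prefix, an optional click count, a dot, then a qualifier); the names `CodaSymbol` and `parse_symbol` are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Prefix-to-dimension map, taken from the legend.
DIMENSIONS = {
    "R": "rhythm",
    "T": "tempo",
    "O": "ornamentation",
    "RB": "rubato",
    "V": "vowel",
}

# Assumed grammar: prefix, optional digits, a dot, then a qualifier.
# "RB" is listed before "R" so the longer prefix is tried first.
SYMBOL_RE = re.compile(r"^(RB|R|T|O|V)(\d+)?\.(\w+)$")


class CodaSymbol(NamedTuple):
    dimension: str
    count: Optional[int]   # click count, present only where the symbol has one
    value: str


def parse_symbol(symbol: str) -> CodaSymbol:
    """Split a symbol such as 'R4.reg' into its dimension, count, and value."""
    match = SYMBOL_RE.match(symbol)
    if match is None:
        raise ValueError(f"not a recognized coda symbol: {symbol!r}")
    prefix, count, value = match.groups()
    return CodaSymbol(DIMENSIONS[prefix], int(count) if count else None, value)


for sym in ["R4.reg", "T.slow", "O.ext", "RB.flat", "V.a"]:
    print(sym, "->", parse_symbol(sym))
```

Parsing into named fields makes it easy to filter or count codas by a single dimension, for example all symbols whose `dimension` is `"rhythm"`.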
§ 07 Now

This is not speculation.
Every tool exists today.

The hydrophones are in the water. The transformer is trained. The tokenizer is multilingual. What remains is the choice to mix the streams.