The next frontier training run is being planned. Here's how you can help ensure it includes non-human language data.
You already have the infrastructure for adding new language data to pretraining corpora. Tokenized sperm whale phonetic text is just another low-resource language — a few percent of total tokens.
The marginal cost is negligible. The potential upside is profound: iterative refinement across generations could produce the first genuine interspecies translator.
Add tokenized non-human phonetic datasets — starting with sperm whale codas — as a standard component of your pretraining corpus.
Every hydrophone deployment, every tagged whale, every archived recording from any research institution worldwide becomes a source of additional training data. WhAM's automation means converting audio into phonetic text is computationally cheap.
Contribute your audio archives to the pipeline. Share datasets through the ARION registry. Collaborate on expanding the phonetic alphabet to new species and populations.
field site          archive             ARION registry
──────────          ─────────           ─────────────
hydrophones  ─┐
tagged whales─┼─▶ audio ──▶ WhAM ──▶ phonetic text
drone audio  ─┘                            ↓
                                   registry submission
                                           ↓
                                   next pretraining run
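As a rough illustration of the archive stage in the pipeline above, the sketch below inventories a directory of recordings and writes a checksummed manifest that could accompany a submission. It is a sketch under stated assumptions, not the registry's actual format: the field names (`file`, `source`, `bytes`, `sha256`), the accepted file suffixes, and the functions `build_manifest` and `write_manifest` are all hypothetical. It does not call WhAM or the ARION registry.

```python
import hashlib
import json
from pathlib import Path

# Assumption: archives hold WAV or FLAC audio; adjust to your own formats.
AUDIO_SUFFIXES = {".wav", ".flac"}


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large recordings need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(archive_dir: str, source: str) -> list[dict]:
    """List every audio file under archive_dir with its size and checksum.

    `source` is a free-text label such as "hydrophone", "tag" or "drone".
    The entry layout is hypothetical, chosen only for illustration.
    """
    root = Path(archive_dir)
    entries = []
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in AUDIO_SUFFIXES:
            entries.append(
                {
                    "file": path.relative_to(root).as_posix(),
                    "source": source,
                    "bytes": path.stat().st_size,
                    "sha256": sha256_of(path),
                }
            )
    return entries


def write_manifest(entries: list[dict], out_path: str) -> None:
    """Serialize the manifest as JSON next to the archive."""
    Path(out_path).write_text(json.dumps(entries, indent=2), encoding="utf-8")
```

Checksums let a downstream curator detect duplicate or corrupted recordings before any audio is converted to phonetic text.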
Five steps from a Hugging Face download to a Generation 1 bootstrapping result.
1. Download the dataset from Hugging Face: huggingface.co/datasets/ceti-ai
2. Run WhAM to produce phonetic text from raw hydrophone audio.
3. Tokenize with any modern tokenizer; BPE and SentencePiece both work.
4. Train Llama-3 70B or Mixtral 8x22B, mixing whale tokens at 1–5%.
5. Evaluate: compare output coherence before and after. Share through the registry.
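The 1–5% mixing ratio in the steps above can be made concrete with a little arithmetic and a sampler. The following is a minimal sketch, not the project's training code: `whale_tokens_needed` and `mixed_stream` are hypothetical helper names, and the sampler mixes at the document level, which only approximates a token-level ratio when documents are of similar length.

```python
import itertools
import random
from typing import Iterable, Iterator


def whale_tokens_needed(base_tokens: int, whale_fraction: float) -> int:
    """Whale tokens w such that w / (base_tokens + w) equals whale_fraction."""
    if not 0.0 < whale_fraction < 1.0:
        raise ValueError("whale_fraction must be strictly between 0 and 1")
    return round(base_tokens * whale_fraction / (1.0 - whale_fraction))


def mixed_stream(
    base_docs: Iterable[str],
    whale_docs: Iterable[str],
    whale_fraction: float,
    seed: int = 0,
) -> Iterator[str]:
    """Interleave whale documents into a base corpus at a target rate.

    Each emitted document is a whale document with probability
    whale_fraction. The small whale corpus is cycled (repeated) as needed,
    and the stream ends once the base corpus is exhausted.
    """
    whale_list = list(whale_docs)
    if not whale_list:
        raise ValueError("whale_docs must not be empty")
    rng = random.Random(seed)
    base_iter = iter(base_docs)
    whale_cycle = itertools.cycle(whale_list)
    while True:
        if rng.random() < whale_fraction:
            yield next(whale_cycle)
        else:
            try:
                yield next(base_iter)
            except StopIteration:
                return


if __name__ == "__main__":
    # Illustrative numbers only: a 95-token base corpus at a 5% target
    # needs 5 whale tokens, since 5 / (95 + 5) = 0.05.
    print(whale_tokens_needed(95, 0.05))
```

Because the whale corpus is far smaller than the base corpus, it will be repeated many times at any fixed ratio. Dividing `whale_tokens_needed(...)` by the number of whale tokens actually available gives the number of passes to expect.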
Build it, share your results, contribute to the registry.