ARION
§ Action Get involved

The time to act is now.

The next frontier training run is being planned. Here's how you can help ensure it includes non-human language data.

AUDIENCES 3 · BARRIER low · WINDOW open
For Frontier Labs 01

Include the data in your next run.

You already have the infrastructure for adding new language data to pretraining corpora. Tokenized sperm whale phonetic text is just another low-resource language, and it would account for well under 1% of total tokens.

The marginal cost is negligible. The potential upside is profound: iterative refinement across generations could produce the first genuine interspecies translator.

What we're asking

Add tokenized non-human phonetic datasets — starting with sperm whale codas — as a standard component of your pretraining corpus.

FIG.A   Token mix · single pretraining run (1 of N)

  source            share of tokens
  ──────────        ───────────────
  human web         ~78%
  code              ~12%
  low-res lang.     ~8%
  books · math      ~1.5%
  whale codas       <1%

The full DSWP archive, fully annotated through WhAM, fits in <0.001% of a frontier pretraining run. Negligible cost. Profound potential upside.
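
As a rough sanity check on these shares, the sketch below converts between a dataset's token count and its percentage of a run. The run size in the usage lines is a hypothetical placeholder chosen to make the arithmetic concrete; it is not a figure from any lab, and it says nothing about the actual size of the DSWP archive.

```python
def share_of_run(dataset_tokens: int, run_tokens: int) -> float:
    """Return dataset_tokens as a percentage of run_tokens."""
    if run_tokens <= 0:
        raise ValueError("run_tokens must be positive")
    return 100.0 * dataset_tokens / run_tokens


def token_budget(run_tokens: int, share_percent: float) -> int:
    """Return how many tokens correspond to share_percent of a run."""
    if not 0.0 <= share_percent <= 100.0:
        raise ValueError("share_percent must be between 0 and 100")
    return round(run_tokens * share_percent / 100.0)


# Hypothetical run size (assumption, for illustration only).
HYPOTHETICAL_RUN_TOKENS = 10_000_000_000_000  # 10 trillion tokens

# 0.001% is the ceiling the caption gives for the annotated archive.
budget = token_budget(HYPOTHETICAL_RUN_TOKENS, 0.001)
print(f"{budget:,} tokens would fill 0.001% of the hypothetical run")
```

Swapping in a real run size and a measured token count for the archive would turn this from an illustration into an actual budget check.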
For Bioacoustics Researchers 02

Your recordings are training tokens.

Every hydrophone deployment, every tagged whale, every archived recording from any research institution worldwide becomes a source of additional training data. WhAM's automation means converting audio into phonetic text is computationally cheap.

What we're asking

Contribute your audio archives to the pipeline. Share datasets through the ARION registry. Collaborate on expanding the phonetic alphabet to new species and populations.

your archive · into the pipeline

  field site          archive             ARION registry
  ──────────          ───────             ──────────────
  hydrophones   ─┐
  tagged whales ─┼─▶  audio ──▶ WhAM ──▶  phonetic text
  drone audio   ─┘                              │
                                                ▼
                                       registry submission
                                                │
                                                ▼
                                      next pretraining run
For Open-Source Developers 03

Prototype it today.

Five steps from a Hugging Face download to a Generation 1 bootstrapping result.

  1. Download the DSWP coda dataset

    from Hugging Face — huggingface.co/datasets/ceti-ai

  2. Run CETI's open-source WhAM pipeline

    to produce phonetic text from raw hydrophone audio.

  3. Tokenize the output

    with any modern subword tokenizer, such as byte-level BPE or a SentencePiece model.

  4. Continue pretraining an open model

    such as Llama-3 70B or Mixtral 8x22B, mixing in whale tokens at 1–5% of the training stream.

  5. Evaluate cleaning effects

    by comparing output coherence before and after the continued pretraining. Share the results through the registry.
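
The mixing ratio in step 4 can be prototyped without any training framework. The following is a minimal sketch, assuming documents are already tokenized into lists of token ids; `mix_streams` and its parameters are hypothetical names for illustration and are not part of WhAM or any ARION tooling. It interleaves whale documents into a base stream so that whale tokens hold a target share of all tokens emitted so far.

```python
from itertools import cycle
from typing import Iterable, Iterator, List, Tuple

Doc = List[int]  # one tokenized document: a list of token ids


def mix_streams(
    base_docs: Iterable[Doc],
    whale_docs: List[Doc],
    whale_fraction: float,
) -> Iterator[Tuple[str, Doc]]:
    """Yield (source, doc) pairs with whale tokens at about whale_fraction.

    Whale documents are cycled, so a small archive is repeated as often
    as needed. Repeating data is an assumption of this sketch, not a
    recommendation from the source.
    """
    if not 0.0 < whale_fraction < 1.0:
        raise ValueError("whale_fraction must be strictly between 0 and 1")
    if not whale_docs or any(len(d) == 0 for d in whale_docs):
        raise ValueError("whale_docs must be non-empty, with no empty docs")

    whale_iter = cycle(whale_docs)
    whale_tokens = 0
    total_tokens = 0

    for doc in base_docs:
        yield "base", doc
        total_tokens += len(doc)
        # Top up with whale documents until the running share of whale
        # tokens reaches the target again.
        while whale_tokens < whale_fraction * total_tokens:
            whale_doc = next(whale_iter)
            yield "whale", whale_doc
            whale_tokens += len(whale_doc)
            total_tokens += len(whale_doc)


# Toy usage: 100 base documents of 100 tokens, one 5-token whale document.
toy_base = [[0] * 100 for _ in range(100)]
toy_whale = [[1] * 5]
mixed = list(mix_streams(toy_base, toy_whale, whale_fraction=0.05))
```

The schedule is deterministic instead of randomly sampled, which keeps the realized share at or just above the target after every base document and makes runs easy to reproduce. A random sampler with the same expected ratio would work equally well for a prototype.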

What we're asking

Build it, share your results, contribute to the registry.

§ Start Direct links

Start here.