Get involved

The time to act is now

The next frontier training run is being planned. Here's how you can help ensure it includes non-human language data.

For Frontier Labs

Include the data in your next run

You already have the infrastructure for adding new language data to pretraining corpora. Tokenized sperm whale phonetic text is just another low-resource language — a few percent of total tokens.

The marginal cost is negligible. The potential upside is profound: iterative refinement across generations could produce the first genuine interspecies translator.

What we're asking: add tokenized non-human phonetic datasets — starting with sperm whale codas — as a standard component of your pretraining corpus.
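To make the cost claim concrete, here is a back-of-the-envelope sketch of the mixture math. Every number below (corpus size, archive hours, coda rate, tokens per coda, repetition count) is an illustrative assumption, not a figure from any lab or from Project CETI.

```python
# Back-of-the-envelope data-mixture arithmetic.
# All constants are illustrative assumptions, not published figures.

PRETRAINING_TOKENS = 15e12   # assumed total frontier-run corpus size (tokens)
CODA_HOURS = 8_000           # assumed hours of archived coda recordings
CODAS_PER_HOUR = 1_200       # assumed coda rate in annotated audio
TOKENS_PER_CODA = 12         # assumed tokens per transcribed coda
EPOCHS = 4                   # low-resource data is commonly repeated

coda_tokens = CODA_HOURS * CODAS_PER_HOUR * TOKENS_PER_CODA * EPOCHS
fraction = coda_tokens / PRETRAINING_TOKENS
print(f"coda tokens: {coda_tokens:.2e} ({fraction:.6%} of the corpus)")
```

Under these assumptions, even with aggressive repetition the coda data lands far below the few-percent budget a typical low-resource language occupies, which is the sense in which the marginal cost is negligible.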

For Bioacoustics Researchers

Your recordings are training tokens

Every hydrophone deployment, every tagged whale, every archived recording from any research institution worldwide becomes a source of additional training data. WhAM's automation means converting audio into phonetic text is computationally cheap.

What we're asking: contribute your audio archives to the pipeline. Share datasets through the ARION registry. Collaborate on expanding the phonetic alphabet to new species and populations.

For Open-Source Developers

Prototype it today

  1. Download the DSWP sperm whale coda dataset from Hugging Face

  2. Run CETI's open-source WhAM pipeline to produce phonetic text

  3. Tokenize the output with any modern tokenizer

  4. Perform continual pretraining on an open model (Llama-3 70B, Mixtral 8x22B)

  5. Evaluate the effect of the new data by comparing output coherence before and after continual pretraining

What we're asking: build it, share your results, contribute to the registry.

Start here

Dominica Sperm Whale Project dataset

Hugging Face


WhAM (Whale Acoustics Model)

GitHub — Project CETI


DolphinGemma

Google DeepMind


Project CETI

projectceti.org


Wild Dolphin Project

wilddolphinproject.org


The ARION Founding Paper

arionresearch.org/paper
