Get involved
The time to act is now
The next frontier training run is being planned. Here's how you can help ensure it includes non-human language data.
For Frontier Labs
Include the data in your next run
You already have the infrastructure for adding new language data to pretraining corpora. Tokenized sperm whale phonetic text can be treated as just another low-resource language, contributing a few percent of total tokens.
The marginal cost is negligible. The potential upside is profound: iterative refinement across model generations could produce the first genuine interspecies translator.
What we're asking: add tokenized non-human phonetic datasets, starting with sperm whale codas, as a standard component of your pretraining corpus.
Artistic illustration — a frontier model training pipeline visualization with a small stream of teal-colored whale tokens being mixed into the massive river of human text data. Scale contrast emphasized.
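As a rough sketch of what "a few percent of total tokens" means in practice, the mixing step can be modeled as weighted sampling over document sources. The 2% fraction and the sampling scheme below are illustrative assumptions, not a prescription from this page:

```python
import random

def mix_corpora(human_docs, whale_docs, whale_fraction=0.02, seed=0):
    """Build a training stream in which roughly `whale_fraction` of
    documents are drawn from the whale-coda corpus.

    The 2% default is an assumption; the text says only
    "a few percent of total tokens"."""
    rng = random.Random(seed)
    stream = []
    for _ in range(len(human_docs)):
        if whale_docs and rng.random() < whale_fraction:
            stream.append(whale_docs[rng.randrange(len(whale_docs))])
        else:
            stream.append(human_docs[rng.randrange(len(human_docs))])
    return stream

# Toy corpora: the whale stream stays a small minority of the mix.
stream = mix_corpora(["human"] * 5000, ["whale"] * 10)
```

Production pipelines would weight by token count rather than document count, but the principle is the same: the whale data rides along as a small, fixed share of the existing mixture.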
For Bioacoustics Researchers
Your recordings are training tokens
Every hydrophone deployment, every tagged whale, every archived recording from any research institution worldwide becomes a source of additional training data. WhAM's automation means converting audio into phonetic text is computationally cheap.
What we're asking: contribute your audio archives to the pipeline. Share datasets through the ARION registry. Collaborate on expanding the phonetic alphabet to new species and populations.
For Open-Source Developers
Prototype it today
1. Download the DSWP sperm whale coda dataset from Hugging Face
2. Run CETI's open-source WhAM pipeline to produce phonetic text
3. Tokenize the output with any modern tokenizer
4. Perform continual pretraining on an open model (Llama-3 70B, Mixtral 8x22B)
5. Evaluate the effect by comparing model output coherence before and after continual pretraining
What we're asking: build it, share your results, contribute to the registry.
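The tokenization step above can be sketched with a toy symbol-level vocabulary. The phonetic symbols here are placeholders, not real WhAM output; an actual run would feed WhAM transcriptions to an existing subword tokenizer (typically the target model's own):

```python
def build_vocab(corpus):
    """Assign an integer id to each whitespace-delimited phonetic symbol,
    in order of first appearance."""
    vocab = {}
    for line in corpus:
        for sym in line.split():
            vocab.setdefault(sym, len(vocab))
    return vocab

def encode(line, vocab, unk=-1):
    """Map a phonetic-text line to token ids; unknown symbols get `unk`."""
    return [vocab.get(sym, unk) for sym in line.split()]

# Placeholder coda transcriptions (NOT real WhAM output).
corpus = ["c1 c1 c5 c3", "c5 c5 c1"]
vocab = build_vocab(corpus)          # {"c1": 0, "c5": 1, "c3": 2}
ids = encode("c1 c5 c9", vocab)      # c9 is out-of-vocabulary
```

Once the phonetic text is reduced to ids like these, it can enter a continual-pretraining data loader exactly like any other low-resource language corpus.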