Get involved

The time to act is now

The next frontier training run is being planned. Here's how you can help ensure it includes non-human language data.

For Frontier Labs

Include the data in your next run

You already have the infrastructure for adding new language data to pretraining corpora. Tokenized sperm whale phonetic text is just another low-resource language, and it would make up only a few percent of total tokens.

The marginal cost is negligible. The potential upside is profound: iterative refinement across model generations could produce the first genuine interspecies translator.

What we're asking: add tokenized non-human phonetic datasets — starting with sperm whale codas — as a standard component of your pretraining corpus.
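
To make "a few percent of total tokens" concrete, here is a minimal sketch of the mixture arithmetic, assuming a pipeline that samples each data source with a fixed probability. The source names, token counts, and the 2% target are hypothetical placeholders, not figures from any lab or from this project; `mixture_weights` is an illustrative helper, not part of any existing library.

```python
def mixture_weights(token_counts, minority, target_share):
    """Return per-source sampling weights that sum to 1.

    `minority` receives exactly `target_share` of sampled tokens; every
    other source splits the remainder in proportion to its token count.
    """
    if not 0.0 < target_share < 1.0:
        raise ValueError("target_share must be strictly between 0 and 1")
    if minority not in token_counts:
        raise KeyError(f"unknown source: {minority}")

    rest_total = sum(n for name, n in token_counts.items() if name != minority)
    if rest_total <= 0:
        raise ValueError("the other sources must contain at least one token")

    weights = {}
    for name, n in token_counts.items():
        if name == minority:
            weights[name] = target_share
        else:
            weights[name] = (1.0 - target_share) * n / rest_total
    return weights


# Hypothetical corpus sizes in tokens (placeholders for illustration only).
corpus = {
    "web_text": 9_000_000_000_000,
    "code": 1_000_000_000_000,
    "whale_codas": 5_000_000_000,
}

# Assumed target: whale coda text makes up 2% of sampled tokens.
weights = mixture_weights(corpus, minority="whale_codas", target_share=0.02)
for name, w in weights.items():
    print(f"{name}: {w:.4f}")
```

The resulting weights could be passed to whatever sampler a pipeline already uses for its other low-resource languages. Setting the share explicitly, rather than sampling in proportion to raw size, is what keeps a small dataset from vanishing next to the human text.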

Illustration: a frontier model training pipeline in which a small stream of teal whale tokens mixes into the massive river of human text data, emphasizing the contrast in scale.

For Bioacoustics Researchers

Your recordings are training tokens

Every hydrophone deployment, every tagged whale, every archived recording from any research institution worldwide becomes a source of additional training data. WhAM's automation means converting audio into phonetic text is computationally cheap.

What we're asking: contribute your audio archives to the pipeline. Share datasets through the ARION registry. Collaborate on expanding the phonetic alphabet to new species and populations.

For Open-Source Developers