ARION
§ Science Research landscape

The research landscape.

ARION's proposal builds on independently published research across bioacoustics, NLP, and continual pretraining. Here are the foundations.

PAPERS 9 · YEARS 2020 – 2025 · DOMAINS 5
§ 02 Selected papers · ARION's foundations

Nine papers, five threads.

Phonetic Alphabet · 2025

Vowel- and Diphthong-like Patterns in Sperm Whale Communication

Beguš et al.
Open Mind

Sperm whale clicks contain vowel-like and diphthong-like spectral qualities, analyzed via source-filter theory from human speech — expanding the phonetic alphabet.

Cetacean AI · 2025

DolphinGemma

Google DeepMind · Georgia Tech · Wild Dolphin Project
Announced April 2025

First generative model for dolphin vocalizations. Predicts and generates realistic whistles, clicks, and burst pulses. Trained on 40+ years of Atlantic spotted dolphin recordings.

deepmind.google/models/gemma/dolphingemma/

Tools & Data · 2025

WhAM: Whale Acoustics Model

Project CETI
Open-source release

Transformer-based pipeline that automatically detects, segments, and annotates sperm whale codas using the phonetic alphabet. Runs on public datasets.

github.com/ceti-ai

Continual Pretraining · 2025

Efficient Continual Pre-training of LLMs for Low-resource Languages

Nag, Chakrabarti, Mukherjee, Ganguly
NAACL 2025

Adding small volumes of novel tokens during continual pretraining yields gains in the target domain with negligible regression on high-resource benchmarks.

Bootstrapping · 2020

Language Models are Few-Shot Learners (GPT-3)

Brown et al.
NeurIPS 2020

Filtered noisy Common Crawl text with a quality classifier trained on curated reference text, an instance of the loop: noisy web text → model → cleaner data → better model. The same bootstrapping pattern ARION proposes for whale data.

Cross-lingual Alignment · 2025

Languages are Modalities: Cross-Lingual Alignment via Encoder Injection

Agarwal & Gupta
arXiv preprint

Cross-lingual alignments strengthen with scale and can be induced by targeted interventions, including the addition of previously unseen languages.

§ 03 Field timeline

Twenty years to this moment.

2005

Shane Gero founds the Dominica Sperm Whale Project — the world's longest-running longitudinal study of sperm whale social structure and communication.

2019

GPT-2 demonstrates that language models can learn coherent, structured knowledge from raw noisy web text at scale — establishing the bootstrapping loop.

2020

Conneau et al. show that cross-lingual alignment emerges without parallel data. GPT-3 demonstrates iterative data cleaning. Project CETI is founded as a TED Audacious Project.

2024

Sharma et al. (Nature Communications) report that sperm whale codas form a combinatorial phonetic alphabet built from rhythm, tempo, rubato, and ornamentation.

2025

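Beguš et al. identify vowel- and diphthong-like spectral patterns in sperm whale clicks. DolphinGemma announced by Google DeepMind. WhAM open-sourced by Project CETI. Nag et al. report that continual pretraining on small volumes of novel tokens yields target-domain gains with negligible regression on high-resource benchmarks.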

2026

ARION founded. The founding paper connects these threads and proposes the first concrete pipeline for interspecies language alignment in frontier model pretraining.

§ Now The synthesis

Every piece exists.
Read how they assemble.