Research landscape
The research landscape
ARION's proposal builds on independently published research across bioacoustics, NLP, and continual pretraining. Here are the foundations.
Contextual and Combinatorial Structure in Sperm Whale Vocalisations
Sharma, Gero, Payne, Gruber, Rus, Torralba, Andreas (2024)
Nature Communications
Sperm whale codas operate as a combinatorial coding system with a phonetic alphabet based on rhythm, tempo, rubato, and ornamentation — far richer than previously known.
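To make the alphabet concrete, here is an illustrative sketch (our own, not code from the paper) of how a coda recorded as a sequence of click times can be decomposed along the four axes Sharma et al. describe. The thresholds and the example pattern are hypothetical placeholders, not values from the paper.

```python
# Illustrative sketch: decomposing a sperm whale coda into the feature
# axes described by Sharma et al. (2024). Thresholds and the example
# coda below are hypothetical, not values from the paper.

def coda_features(click_times, base_coda=None):
    """Represent a coda (list of click times, in seconds) on four axes."""
    icis = [b - a for a, b in zip(click_times, click_times[1:])]
    total = click_times[-1] - click_times[0]

    features = {
        # Rhythm: the normalized inter-click-interval pattern.
        "rhythm": tuple(round(i / total, 2) for i in icis),
        # Tempo: overall duration of the coda.
        "tempo": total,
        # Ornamentation: an extra click appended to a known pattern
        # (here, hypothetically, any coda with more than 5 clicks).
        "ornamented": len(click_times) > 5,
    }
    # Rubato: smooth stretching or shrinking relative to a neighboring
    # coda of the same rhythm type.
    if base_coda is not None:
        features["rubato"] = total / (base_coda[-1] - base_coda[0])
    return features

# A hypothetical "1+3" coda: one click, a pause, then three clicks.
print(coda_features([0.0, 0.25, 0.45, 0.65],
                    base_coda=[0.0, 0.3, 0.54, 0.78]))
```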
Vowel- and Diphthong-like Patterns in Sperm Whale Communication
Beguš et al. (2025)
Open Mind
Sperm whale clicks contain vowel-like and diphthong-like spectral qualities, analyzed via source-filter theory from human speech — expanding the phonetic alphabet.
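Source-filter analysis of this kind is typically done with linear predictive coding: fit an all-pole filter to the waveform and read formant-like resonances off the pole angles. A hedged sketch using librosa's LPC routine follows; the sample rate, filter order, and synthetic click are placeholders, not the paper's actual settings.

```python
import numpy as np
import librosa  # librosa.lpc fits all-pole (source-filter) coefficients

sr = 48_000  # placeholder sample rate, not the paper's recording setup

# Synthetic stand-in for one windowed whale click (a noise burst).
rng = np.random.default_rng(0)
click = rng.standard_normal(1024).astype(np.float64)

# Fit the all-pole filter; the order here is a guess, not tuned.
a = librosa.lpc(click, order=10)

# Formant-like resonances are the angles of poles above the real axis.
poles = [r for r in np.roots(a) if np.imag(r) > 0]
freqs = sorted(np.angle(r) * sr / (2 * np.pi) for r in poles)
print("Resonance estimates (Hz):", [round(f) for f in freqs])
```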
Emerging Cross-lingual Structure in Pretrained Language Models
Conneau, Wu, Li, Zettlemoyer, Stoyanov (2020)
ACL 2020
Multilingual models spontaneously align semantically equivalent concepts across languages without any parallel data — even when there is no shared vocabulary.
Companion work by the same group (XLM-R) showed that pretraining on 100 languages at scale produces strong cross-lingual transfer, especially for low-resource languages like Swahili and Urdu.
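One way to observe this alignment directly is to embed a sentence and its translation with a multilingual encoder and compare the pooled representations; high similarity despite zero parallel training data is the paper's central observation. A minimal sketch (the model choice and sentences are ours):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# XLM-R is pretrained on 100 languages with no parallel data.
name = "xlm-roberta-base"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

def embed(sentence):
    """Mean-pool the final hidden states into one sentence vector."""
    batch = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # (1, seq, dim)
    mask = batch["attention_mask"].unsqueeze(-1)
    return (hidden * mask).sum(1) / mask.sum(1)

en = embed("The whale dives to hunt squid.")
es = embed("La ballena se sumerge para cazar calamares.")
print(torch.cosine_similarity(en, es).item())  # high, with no parallel data
```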
DolphinGemma
Google DeepMind (2025)
First generative model for dolphin vocalizations. Predicts and generates realistic whistles, clicks, and burst pulses. Trained on 40+ years of Atlantic spotted dolphin recordings.
WhAM
Project CETI (2025)
Transformer-based pipeline that automatically detects, segments, and annotates sperm whale codas using the phonetic alphabet. Runs on public datasets.
Efficient Continual Pre-training of LLMs for Low-resource Languages
Nag, Chakrabarti, Mukherjee, Ganguly (2025)
NAACL 2025
Adding small volumes of novel tokens during continual pretraining yields gains in the target domain with negligible regression on high-resource benchmarks.
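In practice this is implemented as a data-mixing schedule: each continual-pretraining batch draws mostly from the original (replay) corpus and only a small fraction from the novel domain. A minimal sketch of such a sampler; the 5% mixing ratio is an illustrative guess, not Nag et al.'s number.

```python
import random

def mixed_batches(replay_docs, novel_docs, batch_size=8, novel_frac=0.05):
    """Yield continual-pretraining batches that are mostly replay data.

    novel_frac: fraction of each batch drawn from the new domain.
    Keeping it small is what limits regression on the original
    distribution while still teaching the novel tokens.
    """
    n_novel = max(1, round(batch_size * novel_frac))
    while True:
        batch = random.sample(replay_docs, batch_size - n_novel)
        batch += random.sample(novel_docs, n_novel)
        random.shuffle(batch)
        yield batch

# Toy usage: web text as replay, coda transcriptions as the new domain.
replay = [f"web_doc_{i}" for i in range(1000)]
novel = [f"coda_seq_{i}" for i in range(50)]
print(next(mixed_batches(replay, novel)))
```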
Language Models are Few-Shot Learners (GPT-3)
Brown et al. (2020)
NeurIPS 2020
Demonstrated the iterative data-cleaning loop: noisy web text → model → cleaner data → better model. The same bootstrapping pattern ARION proposes for whale data.
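The loop itself is simple to state: score the raw corpus with the current model, keep what scores well, retrain, repeat. A schematic sketch, where the scorer and keep fraction are placeholders standing in for model perplexity or a learned noise classifier:

```python
def bootstrap_corpus(raw_docs, train, score, rounds=3, keep_frac=0.5):
    """Iterative cleaning: noisy data -> model -> cleaner data -> better model.

    train(docs) -> model;  score(model, doc) -> quality (higher is better).
    Both are supplied by the caller; this function only runs the loop.
    """
    docs = raw_docs
    model = train(docs)  # round 0: train on everything, noise included
    for _ in range(rounds):
        ranked = sorted(docs, key=lambda d: score(model, d), reverse=True)
        docs = ranked[: max(1, int(len(ranked) * keep_frac))]  # keep cleanest
        model = train(docs)  # retrain on the filtered corpus
    return model, docs

# Toy demo: "quality" is just document length; real pipelines would
# score with model perplexity or a trained quality classifier.
model, clean = bootstrap_corpus(
    ["ok doc " * n for n in range(1, 20)],
    train=lambda docs: {"n_docs": len(docs)},
    score=lambda m, d: len(d),
)
print(len(clean), "documents survive filtering")
```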
Languages are Modalities: Cross-Lingual Alignment via Encoder Injection
Agarwal & Gupta (2025)
arXiv preprint
Cross-lingual alignments strengthen with scale and can be induced by targeted interventions, including the addition of previously unseen languages.
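Encoder injection here means training a small projection that maps a foreign encoder's output vectors into the frozen LLM's token-embedding space, so the new language enters as soft tokens. A minimal PyTorch sketch under that reading; the class name and dimensions are hypothetical, not the paper's architecture.

```python
import torch
import torch.nn as nn

class EncoderInjector(nn.Module):
    """Project a foreign encoder's features into an LLM's embedding space.

    Only this projection is trained; the LLM (and usually the encoder)
    stay frozen. Dimensions below are hypothetical.
    """

    def __init__(self, enc_dim=512, llm_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(enc_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, enc_states, text_embeds):
        # enc_states: (batch, enc_len, enc_dim) from the new-language encoder
        # text_embeds: (batch, txt_len, llm_dim) from the LLM embedding table
        soft_tokens = self.proj(enc_states)  # (batch, enc_len, llm_dim)
        # Prepend the injected tokens; the frozen LLM sees them as a prefix.
        return torch.cat([soft_tokens, text_embeds], dim=1)

injector = EncoderInjector()
fused = injector(torch.randn(2, 16, 512), torch.randn(2, 10, 4096))
print(fused.shape)  # torch.Size([2, 26, 4096])
```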
Field timeline
Twenty years to this moment
2005: Shane Gero founds the Dominica Sperm Whale Project, the world's longest-running longitudinal study of sperm whale social structure and communication.
2019: GPT-2 demonstrates that language models can learn coherent, structured knowledge from raw, noisy web text at scale, establishing the bootstrapping loop.
2020: Conneau et al. prove emergent cross-lingual alignment without parallel data. GPT-3 demonstrates iterative data cleaning. Project CETI founded as a TED Audacious Project.
2024: Sharma et al. (Nature Communications) discover that sperm whale codas constitute a combinatorial phonetic alphabet: rhythm, tempo, rubato, ornamentation.
2025: Beguš et al. identify vowel-like spectral patterns in whale clicks, adding vowel quality to the alphabet. DolphinGemma announced by Google DeepMind. WhAM open-sourced by Project CETI. Nag et al. validate the safety of continual pretraining on novel tokens.
2025: ARION founded. The founding paper connects these threads and proposes the first concrete pipeline for interspecies language alignment in frontier model pretraining.