Research landscape
The research landscape
ARION's proposal builds on independently published research across bioacoustics, NLP, and continual pretraining. Here are the foundations.
Contextual and Combinatorial Structure in Sperm Whale Vocalisations
Sharma, Gero, Payne, Gruber, Rus, Torralba, Andreas (2024)
Nature Communications
Sperm whale codas operate as a combinatorial coding system with a phonetic alphabet based on rhythm, tempo, rubato, and ornamentation — far richer than previously known.
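To make the alphabet concrete, here is an illustrative sketch (our own, not code from the paper) of how a coda recorded as a sequence of click times can be decomposed along the four axes Sharma et al. describe. The thresholds and the example pattern are hypothetical placeholders, not values from the paper.

```python
# Illustrative sketch: decomposing a sperm whale coda into the feature
# axes described by Sharma et al. (2024). Thresholds and the example
# coda below are hypothetical, not values from the paper.

def coda_features(click_times, base_coda=None):
    """Represent a coda (list of click times, in seconds) on four axes."""
    icis = [b - a for a, b in zip(click_times, click_times[1:])]
    total = click_times[-1] - click_times[0]

    features = {
        # Rhythm: the normalized inter-click-interval pattern.
        "rhythm": tuple(round(i / total, 2) for i in icis),
        # Tempo: overall duration of the coda.
        "tempo": total,
        # Ornamentation: an extra click appended to a known pattern
        # (here, hypothetically, any coda with more than 5 clicks).
        "ornamented": len(click_times) > 5,
    }
    # Rubato: smooth stretching or shrinking relative to a neighboring
    # coda of the same rhythm type.
    if base_coda is not None:
        features["rubato"] = total / (base_coda[-1] - base_coda[0])
    return features

# A hypothetical "1+3" coda: one click, a pause, then three clicks.
print(coda_features([0.0, 0.25, 0.45, 0.65],
                    base_coda=[0.0, 0.3, 0.54, 0.78]))
```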
Vowel- and Diphthong-like Patterns in Sperm Whale Communication
Beguš et al. (2025)
Open Mind
Sperm whale clicks contain vowel-like and diphthong-like spectral qualities, analyzed via source-filter theory from human speech — expanding the phonetic alphabet.
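Source-filter analysis of this kind is typically done with linear predictive coding: fit an all-pole filter to the waveform and read formant-like resonances off the pole angles. A hedged sketch using librosa's LPC routine follows; the sample rate, filter order, and synthetic click are placeholders, not the paper's actual settings.

```python
import numpy as np
import librosa  # librosa.lpc fits all-pole (source-filter) coefficients

sr = 48_000  # placeholder sample rate, not the paper's recording setup

# Synthetic stand-in for one windowed whale click (a noise burst).
rng = np.random.default_rng(0)
click = rng.standard_normal(1024).astype(np.float64)

# Fit the all-pole filter; the order here is a guess, not tuned.
a = librosa.lpc(click, order=10)

# Formant-like resonances are the angles of poles above the real axis.
poles = [r for r in np.roots(a) if np.imag(r) > 0]
freqs = sorted(np.angle(r) * sr / (2 * np.pi) for r in poles)
print("Resonance estimates (Hz):", [round(f) for f in freqs])
```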
Emerging Cross-lingual Structure in Pretrained Language Models
Conneau, Wu, Li, Zettlemoyer, Stoyanov (2020)
ACL 2020
Multilingual models spontaneously align semantically equivalent concepts across languages without any parallel data — even when there is no shared vocabulary.
Companion work by the same group (XLM-R) showed that pretraining on 100 languages at scale produces strong cross-lingual transfer, especially for low-resource languages like Swahili and Urdu.
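One way to observe this alignment directly is to embed a sentence and its translation with a multilingual encoder and compare the pooled representations; high similarity despite zero parallel training data is the paper's central observation. A minimal sketch (the model choice and sentences are ours):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# XLM-R is pretrained on 100 languages with no parallel data.
name = "xlm-roberta-base"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

def embed(sentence):
    """Mean-pool the final hidden states into one sentence vector."""
    batch = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # (1, seq, dim)
    mask = batch["attention_mask"].unsqueeze(-1)
    return (hidden * mask).sum(1) / mask.sum(1)

en = embed("The whale dives to hunt squid.")
es = embed("La ballena se sumerge para cazar calamares.")
print(torch.cosine_similarity(en, es).item())  # high, with no parallel data
```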
DolphinGemma
Google DeepMind (2025)
First generative model for dolphin vocalizations. Predicts and generates realistic whistles, clicks, and burst pulses. Trained on 40+ years of Atlantic spotted dolphin recordings.
WhAM
Project CETI (2025)
Transformer-based pipeline that automatically detects, segments, and annotates sperm whale codas using the phonetic alphabet. Runs on public datasets.
Efficient Continual Pre-training of LLMs for Low-resource Languages
Nag, Chakrabarti, Mukherjee, Ganguly (2025)
NAACL 2025
Adding small volumes of novel tokens during continual pretraining yields gains in the target domain with negligible regression on high-resource benchmarks.
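In practice this is implemented as a data-mixing schedule: each continual-pretraining batch draws mostly from the original (replay) corpus and only a small fraction from the novel domain. A minimal sketch of such a sampler; the 5% mixing ratio is an illustrative guess, not Nag et al.'s number.

```python
import random

def mixed_batches(replay_docs, novel_docs, batch_size=8, novel_frac=0.05):
    """Yield continual-pretraining batches that are mostly replay data.

    novel_frac: fraction of each batch drawn from the new domain.
    Keeping it small is what limits regression on the original
    distribution while still teaching the novel tokens.
    """
    n_novel = max(1, round(batch_size * novel_frac))
    while True:
        batch = random.sample(replay_docs, batch_size - n_novel)
        batch += random.sample(novel_docs, n_novel)
        random.shuffle(batch)
        yield batch

# Toy usage: web text as replay, coda transcriptions as the new domain.
replay = [f"web_doc_{i}" for i in range(1000)]
novel = [f"coda_seq_{i}" for i in range(50)]
print(next(mixed_batches(replay, novel)))
```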
Language Models are Few-Shot Learners (GPT-3)
Brown et al. (2020)
NeurIPS 2020
Demonstrated the iterative data-cleaning loop: noisy web text → model → cleaner data → better model. The same bootstrapping pattern ARION proposes for whale data.
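The loop itself is simple to state: score the raw corpus with the current model, keep what scores well, retrain, repeat. A schematic sketch, where the scorer and keep fraction are placeholders standing in for model perplexity or a learned noise classifier:

```python
def bootstrap_corpus(raw_docs, train, score, rounds=3, keep_frac=0.5):
    """Iterative cleaning: noisy data -> model -> cleaner data -> better model.

    train(docs) -> model;  score(model, doc) -> quality (higher is better).
    Both are supplied by the caller; this function only runs the loop.
    """
    docs = raw_docs
    model = train(docs)  # round 0: train on everything, noise included
    for _ in range(rounds):
        ranked = sorted(docs, key=lambda d: score(model, d), reverse=True)
        docs = ranked[: max(1, int(len(ranked) * keep_frac))]  # keep cleanest
        model = train(docs)  # retrain on the filtered corpus
    return model, docs

# Toy demo: "quality" is just document length; real pipelines would
# score with model perplexity or a trained quality classifier.
model, clean = bootstrap_corpus(
    ["ok doc " * n for n in range(1, 20)],
    train=lambda docs: {"n_docs": len(docs)},
    score=lambda m, d: len(d),
)
print(len(clean), "documents survive filtering")
```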
Languages are Modalities: Cross-Lingual Alignment via Encoder Injection
Agarwal & Gupta (2025)
arXiv preprint
Cross-lingual alignments strengthen with scale and can be induced by targeted interventions, including the addition of previously unseen languages.
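Encoder injection here means training a small projection that maps a foreign encoder's output vectors into the frozen LLM's token-embedding space, so the new language enters as soft tokens. A minimal PyTorch sketch under that reading; the class name and dimensions are hypothetical, not the paper's architecture.

```python
import torch
import torch.nn as nn

class EncoderInjector(nn.Module):
    """Project a foreign encoder's features into an LLM's embedding space.

    Only this projection is trained; the LLM (and usually the encoder)
    stay frozen. Dimensions below are hypothetical.
    """

    def __init__(self, enc_dim=512, llm_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(enc_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, enc_states, text_embeds):
        # enc_states: (batch, enc_len, enc_dim) from the new-language encoder
        # text_embeds: (batch, txt_len, llm_dim) from the LLM embedding table
        soft_tokens = self.proj(enc_states)  # (batch, enc_len, llm_dim)
        # Prepend the injected tokens; the frozen LLM sees them as a prefix.
        return torch.cat([soft_tokens, text_embeds], dim=1)

injector = EncoderInjector()
fused = injector(torch.randn(2, 16, 512), torch.randn(2, 10, 4096))
print(fused.shape)  # torch.Size([2, 26, 4096])
```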
Field timeline
Twenty years to this moment
2005: Shane Gero founds the Dominica Sperm Whale Project, the world's longest-running longitudinal study of sperm whale social structure and communication.
2019: GPT-2 demonstrates that language models can learn coherent, structured knowledge from raw, noisy web text at scale, establishing the bootstrapping loop.
2020: Conneau et al. prove emergent cross-lingual alignment without parallel data. GPT-3 demonstrates iterative data cleaning. Project CETI founded as a TED Audacious Project.
2024: Sharma et al. (Nature Communications) discover that sperm whale codas constitute a combinatorial phonetic alphabet: rhythm, tempo, rubato, ornamentation.
2025: Beguš et al. identify vowel-like spectral patterns in whale clicks, adding vowel quality to the alphabet. DolphinGemma announced by Google DeepMind. WhAM open-sourced by Project CETI. Nag et al. validate the safety of continual pretraining on novel tokens.
2025: ARION founded. The founding paper connects these threads and proposes the first concrete pipeline for interspecies language alignment in frontier model pretraining.