Phonetic Alphabet · 2024
Sharma, Gero, Payne, Gruber, Rus, Torralba, Andreas
Nature Communications
Sperm whale codas operate as a combinatorial coding system with a phonetic alphabet based on rhythm, tempo, rubato, and ornamentation — far richer than previously known.
doi.org/10.1038/s41467-024-47221-8

Phonetic Alphabet · 2025
Vowel- and Diphthong-like Patterns in Sperm Whale Communication
Beguš et al.
Open Mind
Sperm whale clicks contain vowel-like and diphthong-like spectral qualities, analyzed via source-filter theory from human speech — expanding the phonetic alphabet.
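
The rhythm and tempo features named in the Phonetic Alphabet entries above can be illustrated with a small sketch. It assumes a coda is given as a list of click timestamps in seconds, uses total duration as a stand-in for tempo, and uses duration-normalized inter-click intervals as a stand-in for rhythm. The function name `coda_features` and these simplified definitions are assumptions made for illustration; they are not the feature definitions or the code used in the cited papers, and rubato and ornamentation are not modeled here.

```python
from typing import List, Tuple


def coda_features(click_times: List[float]) -> Tuple[float, List[float]]:
    """Return (duration, normalized inter-click intervals) for one coda.

    Hypothetical helper: duration is a rough proxy for tempo, and the
    normalized intervals are a rough proxy for rhythm.
    """
    if len(click_times) < 2:
        raise ValueError("a coda needs at least two clicks")
    # Inter-click intervals: gaps between consecutive clicks.
    icis = [b - a for a, b in zip(click_times, click_times[1:])]
    if any(gap <= 0 for gap in icis):
        raise ValueError("click times must be strictly increasing")
    duration = click_times[-1] - click_times[0]
    # Dividing by duration removes tempo, leaving the relative timing pattern.
    rhythm = [gap / duration for gap in icis]
    return duration, rhythm


# Two codas with the same relative timing but different overall speed.
fast = coda_features([0.0, 0.2, 0.4, 0.8])
slow = coda_features([0.0, 0.3, 0.6, 1.2])
```

Under these assumptions, `fast` and `slow` share the same normalized interval pattern while differing in duration, which is the sense in which rhythm and tempo can vary independently.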

Cross-lingual Alignment · 2020
Conneau, Wu, Li, Zettlemoyer, Stoyanov
ACL 2020
Multilingual models spontaneously align semantically equivalent concepts across languages without any parallel data — even when there is no shared vocabulary.
aclanthology.org/2020.acl-main.536/

Cross-lingual Alignment · 2020
Conneau et al.
ACL 2020
Pretraining on 100 languages at scale produces strong cross-lingual transfer, especially for low-resource languages like Swahili and Urdu.
aclanthology.org/2020.acl-main.747/

Cetacean AI · 2025
Google DeepMind · Georgia Tech · Wild Dolphin Project
Announced April 2025
First generative model for dolphin vocalizations. Predicts and generates realistic whistles, clicks, and burst pulses. Trained on 40+ years of Atlantic spotted dolphin recordings.
deepmind.google/models/gemma/dolphingemma/

Tools & Data · 2025
Project CETI
Open-source release
Transformer-based pipeline that automatically detects, segments, and annotates sperm whale codas using the phonetic alphabet. Runs on public datasets.
github.com/ceti-ai

Continual Pretraining · 2025
Efficient Continual Pre-training of LLMs for Low-resource Languages
Nag, Chakrabarti, Mukherjee, Ganguly
NAACL 2025
Adding small volumes of novel tokens during continual pretraining yields gains in the target domain with negligible regression on high-resource benchmarks.

Bootstrapping · 2020
Language Models are Few-Shot Learners (GPT-3)
Brown et al.
NeurIPS 2020
Filtered noisy Common Crawl text with a quality classifier trained on curated reference corpora, an instance of the data-cleaning loop: noisy web text → model → cleaner data → better model. ARION proposes the same bootstrapping pattern for whale data.
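
The loop summarized in the Bootstrapping entry above (noisy data → model → cleaner data → better model) can be sketched with a toy example. The sketch assumes text samples, a unigram token-count "model", and a fixed keep fraction per round. The names `fit_unigram`, `score`, and `bootstrap_clean` are hypothetical, and this is not the GPT-3 filtering pipeline or ARION's method, only a minimal illustration of the pattern.

```python
from collections import Counter
from typing import List


def fit_unigram(samples: List[str]) -> Counter:
    """Toy 'model': token counts over the current data set."""
    counts: Counter = Counter()
    for sample in samples:
        counts.update(sample.split())
    return counts


def score(sample: str, model: Counter) -> float:
    """Mean relative frequency of the sample's tokens under the model."""
    tokens = sample.split()
    total = sum(model.values())
    if not tokens or total == 0:
        return 0.0
    return sum(model[t] for t in tokens) / (len(tokens) * total)


def bootstrap_clean(samples: List[str], rounds: int = 3, keep: float = 0.8) -> List[str]:
    """Repeat: fit on current data, keep the best-scoring fraction, refit."""
    data = list(samples)
    for _ in range(rounds):
        model = fit_unigram(data)  # model trained on the current data
        ranked = sorted(data, key=lambda s: score(s, model), reverse=True)
        data = ranked[: max(1, int(len(ranked) * keep))]  # cleaner subset
    return data


# Eight samples that share a vocabulary, plus two outliers.
samples = (
    ["click click pause click"] * 4
    + ["click pause click click"] * 4
    + ["zzz qqq xxx", "static hiss"]
)
kept = bootstrap_clean(samples, rounds=1, keep=0.8)
```

In this toy setting the two outliers score lowest because their tokens are rare in the pooled data, so a single round drops them. A fixed keep fraction also discards clean samples in later rounds, so a realistic version would need a stopping rule or a score threshold.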

Cross-lingual Alignment · 2025
Languages are Modalities: Cross-Lingual Alignment via Encoder Injection
Agarwal & Gupta
arXiv preprint
Cross-lingual alignments strengthen with scale and can be induced by targeted interventions, including the addition of previously unseen languages.