
Research registry

An open database of models, datasets, and tools for non-human language research.

The ARION Research Registry is under development.

When launched, it will be a browsable, searchable database where researchers can share and discover:

Models

Language models trained with non-human phonetic data. Track architecture, size, human/animal dataset ratios, injection method, and generation lineage.

Datasets

Audio recordings, video, phonetic transcriptions. Raw or model-prepared, with full provenance: which generation of model prepared the data, and which generation it's intended to feed.

Tools

Annotation pipelines, tokenizers, phonetic alphabets, evaluation frameworks.
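The metadata listed for these three entry types could be captured in a shared record shape. A minimal sketch in TypeScript, using hypothetical field names inferred from the descriptions above (this is not a published ARION schema):

```typescript
// Hypothetical registry record types. Field names are illustrative,
// not the actual ARION registry schema.

type EntryKind = "model" | "dataset" | "tool";

interface RegistryEntry {
  kind: EntryKind;
  name: string;
  source: string;           // contributing lab or project
  description: string;
}

interface ModelEntry extends RegistryEntry {
  kind: "model";
  architecture: string;     // e.g. "Llama-3 variant"
  parameterCount: number;   // total parameters
  animalTokenRatio: number; // fraction of training tokens from animal data
  injectionMethod: string;  // e.g. "continual pretraining"
  generation: number;       // lineage: which model generation this is
}

// Example record, loosely based on the clerk preview below.
const example: ModelEntry = {
  kind: "model",
  name: "Llama-3 DSWP variant",
  source: "independent researcher",
  description: "Llama-3 continually pretrained on sperm whale coda text",
  architecture: "Llama-3 variant",
  parameterCount: 8_000_000_000,
  animalTokenRatio: 0.02,
  injectionMethod: "continual pretraining",
  generation: 1,
};
```

Dataset and tool entries would extend the same base record with their own fields (species, data type, provenance, generation lineage; function, architecture).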

How submission will work

Instead of filling out dropdown menus and categorization forms, you'll chat with an AI research clerk. Describe your model, dataset, or tool in natural language. The clerk will ask clarifying questions, categorize your submission, and validate it for completeness — no forms, no friction.

Preview — AI research clerk

clerk: What would you like to submit to the registry?

you: I trained a Llama-3 variant on the DSWP dataset using continual pretraining at 2% token ratio.

clerk: Got it — sounds like a model submission. What's the parameter count, and did you use WhAM-annotated phonetic text or raw audio tokens?

Coming in v2 — powered by Cloudflare Workers
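The clerk's last step, validating a submission for completeness, could be an ordinary function independent of the language model that drives the conversation. A hypothetical sketch (categories and required fields are assumptions, not the actual v2 implementation):

```typescript
// Hypothetical completeness check the clerk might run before accepting
// a submission. Required fields per category are illustrative only.

type Submission = { kind: "model" | "dataset" | "tool" } & Record<string, unknown>;

const REQUIRED: Record<Submission["kind"], string[]> = {
  model: ["architecture", "parameterCount", "injectionMethod"],
  dataset: ["species", "dataType", "provenance"],
  tool: ["function", "source"],
};

// Returns the fields the clerk still needs to ask about.
function missingFields(s: Submission): string[] {
  return REQUIRED[s.kind].filter((f) => s[f] === undefined || s[f] === "");
}

// A partial model submission, mid-conversation:
const partial: Submission = {
  kind: "model",
  architecture: "Llama-3 variant",
  injectionMethod: "continual pretraining",
};
// missingFields(partial) -> ["parameterCount"]
```

Each missing field becomes a clarifying question back to the submitter, like the parameter-count question in the preview above.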

Seed entries

What the registry will look like

Dataset (publicly available)

Dominica Sperm Whale Project Archive

Long-term acoustic dataset of sperm whale codas collected off the coast of Dominica. The foundational dataset for sperm whale communication research.

Species: Sperm whale (Physeter macrocephalus)
Source: Dominica Sperm Whale Project
Data type: Audio
Location: Eastern Caribbean, Dominica
Years: 2005–present
Tool

WhAM (Whale Acoustics Model)

Transformer-based pipeline that automatically detects, segments, and annotates sperm whale codas using the sperm whale phonetic alphabet. Runs on public datasets.

Source: Project CETI
Architecture: Transformer-based
Function: Automated coda detection, segmentation, and phonetic annotation
Model (open release planned)

DolphinGemma

First generative model for dolphin vocalizations. Predicts and generates realistic whistles, clicks, and burst pulses. Trained on 40+ years of Atlantic spotted dolphin recordings.

Species: Atlantic spotted dolphin
Source: Google DeepMind + Georgia Tech + Wild Dolphin Project
Architecture: Gemma-based with SoundStream tokenization

Get notified at launch

Leave your email to be notified when the registry launches.