ARION
§ Registry · Research registry
v1 coming soon

An open database for
non-human language research.

A browsable, searchable directory of models, datasets, and tools. With generation lineage, provenance chains, and an AI clerk that takes submissions in natural language.

§ 02 What it tracks

Three things, fully indexed.

01

Models

Language models trained with non-human phonetic data. Each entry tracks architecture, size, human/animal dataset ratios, injection method, and generation lineage.

02

Datasets

Audio recordings, video, phonetic transcriptions. Raw or model-prepared, with full provenance: which generation of model prepared the data, and which generation it's intended to feed.

03

Tools

Annotation pipelines, tokenizers, phonetic alphabets, evaluation frameworks. With supported species and direct links to source.
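As a rough illustration of the three entry types described above, here is a minimal sketch of how the records and a generation-lineage lookup might be shaped. The registry's schema has not been published, so every class, field name, and the `lineage` helper below is an assumption made for illustration. The `Example-G2` model is hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelEntry:
    name: str
    architecture: str
    parameter_count: int           # model size, in parameters
    token_ratio: float             # assumed: share of non-human tokens in training
    injection_method: str          # e.g. "continual pretraining"
    generation: int                # position in the lineage, starting at 1
    parent: Optional[str] = None   # name of the model this one builds on


@dataclass
class DatasetEntry:
    name: str
    data_types: list[str]                          # e.g. ["audio"]
    prepared_by_generation: Optional[int] = None   # None means raw data
    feeds_generation: Optional[int] = None         # generation it is intended to feed


@dataclass
class ToolEntry:
    name: str
    kind: str                      # e.g. "tokenizer", "annotation pipeline"
    supported_species: list[str]
    source_url: str


def lineage(name: str, models: dict[str, ModelEntry]) -> list[str]:
    """Follow parent links from a model back to its first generation."""
    chain: list[str] = []
    seen: set[str] = set()
    current: Optional[str] = name
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle in lineage at {current!r}")
        seen.add(current)
        chain.append(current)
        current = models[current].parent
    return chain


models = {
    "Llama-3-70B-Cetacean-G1": ModelEntry(
        name="Llama-3-70B-Cetacean-G1",
        architecture="Llama-3",
        parameter_count=70_000_000_000,
        token_ratio=0.02,
        injection_method="continual pretraining",
        generation=1,
    ),
    # Hypothetical second-generation model, used only to show the lookup.
    "Example-G2": ModelEntry(
        name="Example-G2",
        architecture="Llama-3",
        parameter_count=70_000_000_000,
        token_ratio=0.02,
        injection_method="continual pretraining",
        generation=2,
        parent="Llama-3-70B-Cetacean-G1",
    ),
}

print(lineage("Example-G2", models))
```

Storing only a `parent` link per model keeps each entry small, and the full chain can be rebuilt on demand.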

§ 03 How submission will work

No forms. Just describe it.

Instead of filling in dropdown menus, you'll chat with an AI research clerk. It asks the clarifying questions, categorises your submission, and checks that the entry is complete.

Preview · AI research clerk · v3 preview
clerk
What would you like to submit to the registry?
you
I trained a Llama-3 variant on the DSWP dataset using continual pretraining at 2% token ratio.
clerk
Got it — sounds like a model submission. What's the parameter count, and did you use WhAM-annotated phonetic text or raw audio tokens?
you
70B parameters. WhAM phonetic text, BPE tokenized.
clerk
Great. I'll create a model entry: Llama-3-70B-Cetacean-G1. Which generation does this build on, and what's your evaluation method for the next iteration?
v1 ships static · v2 adds backend · v3 enables this clerk
Powered by Cloudflare Workers
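The completeness check behind a conversation like the preview could be sketched as follows. The clerk is not built yet, so this is only an illustration under stated assumptions: the required-field list and the `missing_fields` helper are hypothetical, and a real clerk would presumably use a language model to extract the fields from free text rather than receive them as a dictionary.

```python
from __future__ import annotations

# Hypothetical required fields for a model submission, loosely based on
# what the registry says it tracks. Not an official schema.
REQUIRED_FIELDS = {
    "model": [
        "base_model",
        "dataset",
        "injection_method",
        "token_ratio",
        "parameter_count",
        "input_format",
        "builds_on",
        "evaluation_method",
    ],
}


def missing_fields(kind: str, submission: dict) -> list[str]:
    """Return the required fields not yet supplied, in the order they would be asked."""
    return [
        name
        for name in REQUIRED_FIELDS[kind]
        if submission.get(name) in (None, "")
    ]


# State after the first user message in the preview.
submission = {
    "base_model": "Llama-3",
    "dataset": "DSWP",
    "injection_method": "continual pretraining",
    "token_ratio": 0.02,
}
print(missing_fields("model", submission))

# State after the second user message.
submission.update(
    parameter_count=70_000_000_000,
    input_format="WhAM phonetic text, BPE tokenized",
)
print(missing_fields("model", submission))
```

After the second message, the only fields left are the generation the model builds on and the evaluation method, which are the two things the clerk asks about last in the preview.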
§ 04 Seed entries · what the registry will look like

Three seed entries.

Dataset · PUBLICLY AVAILABLE

Dominica Sperm Whale Project Archive

Long-term acoustic dataset of sperm whale codas collected off the coast of Dominica. The foundational dataset for sperm whale communication research.

Species: Sperm whale (Physeter macrocephalus)
Source: Dominica Sperm Whale Project
Data type: audio
Location: Eastern Caribbean, Dominica
Years: 2005 – present
Tool · OPEN SOURCE

WhAM — Whale Acoustics Model

Transformer-based pipeline that automatically detects, segments, and annotates sperm whale codas using the sperm whale phonetic alphabet. Runs on public datasets.

Source: Project CETI
Architecture: Transformer-based
Function: Coda detection · segmentation · phonetic annotation
Model · OPEN MODEL PLANNED

DolphinGemma

First generative model for dolphin vocalizations. Predicts and generates realistic whistles, clicks, and burst pulses. Trained on 40+ years of Atlantic spotted dolphin recordings.

Species: Atlantic spotted dolphin
Source: Google DeepMind · Georgia Tech · Wild Dolphin Project
Architecture: Gemma-based with SoundStream tokenization
§ Notify · Launch announcement

Get notified at launch.

Leave your email to be notified when the registry opens for submissions.