
Research registry

An open database of models, datasets, and tools for non-human language research.

The ARION Research Registry is under development.

When launched, it will be a browsable, searchable database where researchers can share and discover:

Models

Language models trained with non-human phonetic data. Track architecture, size, human/animal dataset ratios, injection method, and generation lineage.

Datasets

Audio recordings, video, phonetic transcriptions. Raw or model-prepared, with full provenance: which generation of model prepared the data, and which generation it's intended to feed.

Tools

Annotation pipelines, tokenizers, phonetic alphabets, evaluation frameworks.
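The metadata listed for these three entry types could be captured in a shared record shape. A minimal sketch in TypeScript, using hypothetical field names inferred from the descriptions above (this is not a published ARION schema):

```typescript
// Hypothetical registry record types. Field names are illustrative,
// not the actual ARION registry schema.

type EntryKind = "model" | "dataset" | "tool";

interface RegistryEntry {
  kind: EntryKind;
  name: string;
  source: string;           // contributing lab or project
  description: string;
}

interface ModelEntry extends RegistryEntry {
  kind: "model";
  architecture: string;     // e.g. "Llama-3 variant"
  parameterCount: number;   // total parameters
  animalTokenRatio: number; // fraction of training tokens from animal data
  injectionMethod: string;  // e.g. "continual pretraining"
  generation: number;       // lineage: which model generation this is
}

// Example record, loosely based on the clerk preview below.
const example: ModelEntry = {
  kind: "model",
  name: "Llama-3 DSWP variant",
  source: "independent researcher",
  description: "Llama-3 continually pretrained on sperm whale coda text",
  architecture: "Llama-3 variant",
  parameterCount: 8_000_000_000,
  animalTokenRatio: 0.02,
  injectionMethod: "continual pretraining",
  generation: 1,
};
```

Dataset and tool entries would extend the same base record with their own fields (species, data type, provenance, generation lineage; function, architecture).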

How submission will work

Instead of filling out dropdown menus and categorization forms, you'll chat with an AI research clerk. Describe your model, dataset, or tool in natural language. The clerk will ask clarifying questions, categorize your submission, and validate it for completeness — no forms, no friction.

Preview — AI research clerk

clerk: What would you like to submit to the registry?

you: I trained a Llama-3 variant on the DSWP dataset using continual pretraining at 2% token ratio.

clerk: Got it — sounds like a model submission. What's the parameter count, and did you use WhAM-annotated phonetic text or raw audio tokens?

Coming in v2 — powered by Cloudflare Workers
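The clerk's last step, validating a submission for completeness, could be an ordinary function independent of the language model that drives the conversation. A hypothetical sketch (categories and required fields are assumptions, not the actual v2 implementation):

```typescript
// Hypothetical completeness check the clerk might run before accepting
// a submission. Required fields per category are illustrative only.

type Submission = { kind: "model" | "dataset" | "tool" } & Record<string, unknown>;

const REQUIRED: Record<Submission["kind"], string[]> = {
  model: ["architecture", "parameterCount", "injectionMethod"],
  dataset: ["species", "dataType", "provenance"],
  tool: ["function", "source"],
};

// Returns the fields the clerk still needs to ask about.
function missingFields(s: Submission): string[] {
  return REQUIRED[s.kind].filter((f) => s[f] === undefined || s[f] === "");
}

// A partial model submission, mid-conversation:
const partial: Submission = {
  kind: "model",
  architecture: "Llama-3 variant",
  injectionMethod: "continual pretraining",
};
// missingFields(partial) -> ["parameterCount"]
```

Each missing field becomes a clarifying question back to the submitter, like the parameter-count question in the preview above.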

Seed entries

What the registry will look like

Dataset (publicly available)

Dominica Sperm Whale Project Archive

Long-term acoustic dataset of sperm whale codas collected off the coast of Dominica. The foundational dataset for sperm whale communication research.

Species: Sperm whale (Physeter macrocephalus)
Source: Dominica Sperm Whale Project
Data type: Audio
Location: Eastern Caribbean, Dominica
Years: 2005–present
Tool

WhAM (Whale Acoustics Model)

Transformer-based pipeline that automatically detects, segments, and annotates sperm whale codas using the sperm whale phonetic alphabet. Runs on public datasets.

Source: Project CETI
Architecture: Transformer-based
Function: Automated coda detection, segmentation, and phonetic annotation
Model (open release planned)

DolphinGemma

First generative model for dolphin vocalizations. Predicts and generates realistic whistles, clicks, and burst pulses. Trained on 40+ years of Atlantic spotted dolphin recordings.

Species: Atlantic spotted dolphin
Source: Google DeepMind + Georgia Tech + Wild Dolphin Project
Architecture: Gemma-based with SoundStream tokenization

Get notified at launch

Leave your email to be notified when the registry launches.