Resources

dataset linguistic variation

Betthupferl

A dataset for evaluating ASR models on Bavarian, Franconian, Swabian, and Standard German (dialectal and standardized reference transcriptions).
dataset linguistic variation

WikiDIR

An information retrieval dataset for German dialects and regional languages, with dictionaries of spelling/lexical variations.
model linguistic variation

German PIXEL base

A PIXEL model trained on German data.

HF
tool linguistic variation label variation

Eevee

An NLP annotation tool that runs directly in the browser. It allows multiple tasks to be annotated on a single dataset and supports four task types: sequence labeling, span labeling, text classification, and seq2seq.
dataset linguistic variation

MaiBaam

A Bavarian Universal Dependencies treebank with 15k annotated tokens from all Bavarian dialect areas and multiple text genres (wiki, fiction, grammar examples, social, non-fiction).
dataset linguistic variation

NaBaLiSID

NaBaLiSID (Natural Lithuanian and Bavarian Slot and Intent Detection) provides new slot and intent detection evaluation datasets for Bavarian and Lithuanian, combining translations of xSID and MASSIVE with more natural, non-translated utterances.
dataset linguistic variation

BarNER

BarNER (Bavarian Named Entity Recognition) presents annotations for named entities in Bavarian wiki and tweet data (161k tokens).
dataset linguistic variation

Germanic LRL/Dialect Corpora

An overview of corpora for Germanic low-resource languages and dialects, covering >30 languages and >100 corpora.
dataset label variation

Human Label Variation

An overview of 50+ datasets with human label variation (multiple, unaggregated annotations per instance) in Natural Language Processing and Computer Vision.