-
dataset
linguistic variation
Betthupferl
A dataset for evaluating ASR models on Bavarian, Franconian, Swabian, and Standard German, with both dialectal and standardized reference transcriptions.
-
dataset
linguistic variation
WikiDIR
An information retrieval dataset for German dialects and regional languages, with dictionaries of spelling/lexical variations.
-
model
linguistic variation
German PIXEL base
A PIXEL model trained on German data.
-
tool
linguistic variation
label variation
Eevee
An NLP annotation tool that runs directly in the browser. It allows multiple tasks to be annotated on a single dataset and supports four task types: sequence labeling, span labeling, text classification, and seq2seq.
-
dataset
linguistic variation
MaiBaam
A Bavarian Universal Dependencies treebank with 15k annotated tokens from all Bavarian dialect areas and multiple text genres (wiki, fiction, grammar examples, social, non-fiction).
-
dataset
linguistic variation
NaBaLiSID
NaBaLiSID (Natural Lithuanian and Bavarian Slot and Intent Detection) provides new slot and intent detection evaluation datasets for Bavarian and Lithuanian, combining translations of xSID and MASSIVE with more natural, non-translated utterances.
-
dataset
linguistic variation
BarNER
BarNER (Bavarian Named Entity Recognition) presents annotations for named entities in Bavarian wiki and tweet data (161k tokens).
-
dataset
linguistic variation
Germanic LRL/Dialect Corpora
An overview of corpora for Germanic low-resource languages and dialects, covering more than 30 languages and more than 100 corpora.
-
dataset
label variation
Human Label Variation
An overview of 50+ datasets with human label variation (multiple, un-aggregated annotations per instance) in Natural Language Processing and Computer Vision.