-
dataset linguistic variation
WikiDIR
An information retrieval dataset for German dialects and regional languages, with dictionaries of spelling/lexical variations.
-
tool linguistic variation label variation
Eevee
An NLP annotation tool that runs directly in the browser. It allows annotation of multiple tasks on a single dataset and supports four task types: sequence labeling, span labeling, text classification, and seq2seq.
-
dataset linguistic variation
MaiBaam
A Bavarian Universal Dependencies treebank with 15k annotated tokens from all Bavarian dialect areas and multiple text genres (wiki, fiction, grammar examples, social, non-fiction).
-
dataset linguistic variation
NaBaLiSID
NaBaLiSID (Natural Lithuanian and Bavarian Slot and Intent Detection) provides new slot and intent detection evaluation datasets for Bavarian and Lithuanian, combining translations of xSID and MASSIVE with more natural, non-translated utterances.
-
dataset linguistic variation
BarNER
BarNER (Bavarian Named Entity Recognition) provides named entity annotations for Bavarian wiki and tweet data (161k tokens).
-
dataset linguistic variation
Germanic LRL/Dialect Corpora
An overview of corpora for Germanic low-resource languages and dialects, covering more than 30 languages and more than 100 corpora.
-
dataset label variation
Human Label Variation
An overview of 50+ datasets with human label variation (multiple, un-aggregated annotations per instance) in Natural Language Processing and Computer Vision.