Natural Language Understanding for non-standard languages and dialects
Dialects are ubiquitous and, for many speakers, part of everyday life. They carry important social and communicative functions. Yet dialects, and non-standard languages in general, are a blind spot in research on Natural Language Understanding (NLU). Despite recent breakthroughs, NLU still fails to take linguistic diversity into account. Failing to model language variation results in biased language models with high error rates on dialect data. This failure excludes millions of speakers today and prevents the development of future technology that can adapt to such users.
To account for linguistic diversity, a paradigm shift is needed: away from data-hungry algorithms that learn passively from large datasets and single ground-truth labels, which are known to be biased. To move beyond current learning practices, the key is to tackle the problem at both ends: variation in the input data and bias in the labels. With DIALECT, we propose such an integrated approach, devising algorithms that aid transfer from rich variability in inputs, together with interactive learning that integrates human uncertainty in labels. This will reduce the need for data and enable better adaptation and generalization.
Advances in salient areas of deep learning research now make it possible to tackle this challenge. DIALECT’s objectives are to devise:
By integrating dialectal variation into models able to learn from scarce data and biased labels, the foundations will be established for fairer and more accurate NLU to break down language and literary barriers.
DIALECT is led by Prof. Dr. Barbara Plank and hosted at the Center for Information and Language Processing (CIS), MaiNLP lab, LMU Munich.
Apr 17, 2024
We are proud to present our recent research on NLP for Bavarian / NLP fi Bairisch!
May 01, 2023
What corpora are available for Germanic low-resource language varieties?