DeepMinor: Language Models for Multilingual and Multidomain Text Processing in Low Resource Scenarios

(2024 - 2026)

Thanks to recent advances in Large Language Models (LLMs), the Natural Language Processing (NLP) research field is undergoing a paradigm shift focused on producing and exploiting these models. Results have improved so much that some systems are claimed to reach human-level performance on laboratory benchmarks for difficult language understanding tasks. As a result, many in industry have started deploying large pre-trained neural language models in production. While impressive, these LLMs have been developed mostly for English, they are not public, and they have been evaluated almost exclusively on English-centric NLP benchmarks. Such benchmarks are crucial for understanding the limitations and possibilities of using LLMs to improve the state of the art in NLP. Thus, for the large majority of languages and domains, the performance of these LLMs is unknown or simply cannot be objectively measured. This is either because the models have not been pre-trained for languages such as Basque or Spanish, or because there are no readily available benchmarks for evaluating Natural Language Understanding and Generation capabilities in those languages.

This project aims to investigate and develop techniques and methods to build monolingual and multilingual LLMs and adapt them to new languages, text genres and domains. In particular, it will focus on adapting and generating models specially tailored for Basque and Spanish (in addition to English), for both classification and generation tasks. We will also work towards filling the current gap in language models for these languages for specific tasks in domains such as health or genres such as social media, for which little or no manually annotated data is available.
Organization: Ministerio de Ciencia, Innovación y Universidades
Main researcher: Rodrigo Agerri