
TRAIN: EXTREMELY LOW-RESOURCED MACHINE TRANSLATION


The project will explore a diversity of semi-supervised learning techniques and develop translation systems between Spanish and certain low-resource languages covering migrant (Amazigh, Pashto and Wolof) and ethnic minority (Romani) languages, as well as from Spanish sign language into Spanish.
(2022 - 2025)

In recent years, unsupervised machine translation has shown that it is possible to develop machine translation systems even in contexts where no bilingual information is available (neither bilingual dictionaries nor parallel corpora). In practice, however, some bilingual information is always accessible. Recent approaches such as [Conneau, 2020] have started to combine monolingual and parallel data with good results.
Within the present project, we plan to explore novel multilingual transfer learning methods and to combine supervised and unsupervised techniques, while respecting efficiency and modularity constraints, so that translation for languages with very few resources can benefit from them. Our hypothesis is that techniques developed for unsupervised machine translation can be efficiently adapted to incorporate the bilingual information available for a given language pair, and thus yield usable translation systems even when few parallel resources are available, including in the case of multimodal translation involving a sign language such as Spanish Sign Language (LSE).
Although a successful start has already been made on combining parallel and monolingual data, this hypothesis has not been tested on languages with very few parallel resources, such as those targeted by this project. Nor has it been tested in the even more difficult case of sign languages, which are not usually expressed in written form and for which only an extremely small number of very small parallel corpora exist.
To test our hypothesis, the project will explore a diversity of semi-supervised learning techniques and develop translation systems between Spanish and certain low-resource languages covering migrant (Amazigh, Pashto and Wolof) and ethnic minority (Romani) languages, as well as from Spanish sign language into Spanish, thus contributing to the inclusion of these vulnerable groups, including migrants, refugees and deaf and hard-of-hearing people.
Organization:  Ministerio de Ciencia e Innovación
Main researchers: Gorka Labaka and Eneko Agirre
Participants
Eneko Agirre, Nora Aranberri, Maxux Aranzabe, Xabier Arregi, Kepa Bengoetxea, Gorka Labaka, Mikel Lersundi, Olatz Perez de Viñaspre, Ander Soraluze, Ruben Urizar

