EDHIA

The EDHIA project is part of a coordinated effort to advance the early detection of several high-impact diseases by applying natural language processing (NLP) and artificial intelligence (AI) to medical documents. The ultimate goal is to develop tools that can process large volumes of clinical data, such as Electronic Health Records (EHRs), clinical notes, and scientific literature, to identify early risk factors and support healthcare professionals in early diagnosis and intervention.

Objective and Scope:

Mental Health: Improving early detection of mental health issues, especially those that are often underreported or stigmatized, such as suicidal tendencies. The project leverages NLP techniques to process patient records and flag potential risks that might otherwise go unnoticed by physicians.

HIV Detection: Enhancing the diagnosis of HIV infections through NLP by identifying missed opportunities for testing and intervention in the patient records, supporting the World Health Organization’s 95-95-95 goal for HIV management by 2030.

Rare Diseases: Improving the quality of life of patients with rare diseases (RDs), particularly children. This involves identifying the connection between congenital malformations and the evolution of patients' mental health, along with other social determinants of health.

Cardiovascular Complications: Predicting risk factors related to cardiovascular diseases, especially after an initial episode of Atrial Fibrillation, using AI and NLP on structured data such as electrocardiogram reports and unstructured clinical notes.

Duration and Collaboration:

The project has a duration of 36 months and involves collaboration between several leading academic institutions:

- HiTZ (Basque Center for Language Technology): Focuses on language models and corpora annotation.
- UNED (National University of Distance Education): Leads the development of computational tools for NLP in the medical domain.

Methodology:

The project will use a combination of structured and unstructured data, applying advanced NLP techniques such as temporal pattern detection, medical ontology enrichment, and language model fine-tuning to support the early detection tasks. These tools will be adapted to multilingual environments, making the solutions applicable across various linguistic contexts.

The collaboration is intended to help the developed systems generalize across different healthcare datasets, with the aim of providing scalable solutions that can be applied across hospitals and healthcare systems.
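As a rough illustration of what temporal pattern detection on clinical notes can involve, the sketch below pulls dated statements out of free text, orders them chronologically, and measures the gap between two events. It is a minimal, rule-based example and not the project's method: the ISO date format, the function names `extract_dated_events` and `days_between`, and the sample note are all assumptions made for illustration.

```python
import re
from datetime import date

# Assumption: dates appear in ISO form (YYYY-MM-DD). Real clinical notes
# use far more varied temporal expressions.
DATE_RE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")


def extract_dated_events(note):
    """Return (date, sentence) pairs for sentences containing an ISO date,
    sorted chronologically. Hypothetical helper, for illustration only."""
    events = []
    for sentence in re.split(r"(?<=[.!?])\s+", note.strip()):
        match = DATE_RE.search(sentence)
        if not match:
            continue
        year, month, day = map(int, match.groups())
        try:
            events.append((date(year, month, day), sentence))
        except ValueError:
            # Skip strings that look like dates but are not valid ones.
            continue
    return sorted(events, key=lambda event: event[0])


def days_between(events, first_kw, second_kw):
    """Days from the earliest event mentioning first_kw to the earliest
    event on or after it mentioning second_kw; None if either is missing."""
    start = next((d for d, s in events if first_kw in s.lower()), None)
    if start is None:
        return None
    end = next(
        (d for d, s in events if second_kw in s.lower() and d >= start),
        None,
    )
    return None if end is None else (end - start).days


# Invented sample note, written out of chronological order on purpose.
note = (
    "2023-03-10: Palpitations reported. "
    "2023-01-05: First episode of atrial fibrillation. "
    "2023-04-02: Cardioversion performed."
)
events = extract_dated_events(note)
for when, text in events:
    print(when.isoformat(), "|", text)
print(days_between(events, "atrial fibrillation", "cardioversion"))
```

A production system would replace the regular expression with a temporal expression normalizer and the keyword lookup with entity and relation extraction; the sketch only shows the ordering step that such components feed.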

Financial Support: Department of Culture and Language Policy

News

January 26, 2024

We are pleased to present the Latxa family of open models, which includes the largest and best-performing language model for Basque.

Publications

Santamaria, E. A., de Lacalle, O. L., Atutxa, A., Gojenola, K.: (2025). Do Entailment Models know about Reasoning Temporal Ordering on Clinical Texts? Procesamiento del Lenguaje Natural 74: 349-362.

Lebeña, N., Blanco, A., Casillas, A., Oronoz, M., Pérez, A.: (2025). Clinical Federated Learning for Private ICD-10 Classification of Electronic Health Records from Several Spanish Hospitals. Procesamiento del Lenguaje Natural 74: 33-42.

García-Olea, A., Domingo-Aldama, A. G., Merino, M., Gojenola, K., Goikoetxea, J., Atutxa, A., Ormaetxe, J. M.: (2025). The Application of Deep Learning Tools on Medical Reports to Optimize the Input of an Atrial-Fibrillation-Recurrence Predictive Model. Journal of Clinical Medicine 14 (7): 2297.

Iker De la Iglesia, Iakes Goenaga, Johanna Ramirez-Romero, Jose Maria Villa-Gonzalez, Josu Goikoetxea, and Ander Barrena. 2025. Ranking Over Scoring: Towards Reliable and Robust Automated Evaluation of LLM-Generated Medical Explanatory Arguments. In Proceedings of the 31st International Conference on Computational Linguistics, pages 9456–9471, Abu Dhabi, UAE. Association for Computational Linguistics.

JR Martinez-Rico, L Araujo, J Martinez-Romo (2024). Building a framework for fake news detection in the health domain. PLOS ONE 19 (7), e0305362.

Fernandez-Hernandez, J., Araujo, L., & Martinez-Romo, J. (2024). Generation of social network user profiles and their relationship with suicidal behaviour. Procesamiento del Lenguaje Natural, 72, 87-98.

Morales-Sánchez, R., Montalvo, S., Riaño, A., Martínez, R., & Velasco, M. (2024). Early diagnosis of HIV cases by means of text mining and machine learning models on clinical notes. Computers in Biology and Medicine, 179, 108830.

Lebeña, N., Pérez, A., & Casillas, A. (2024). Quantifying decision support level of explainable automatic classification of diagnoses in Spanish medical records. Computers in Biology and Medicine, 182, 109127.

Iñigo Alonso, Maite Oronoz, Rodrigo Agerri (2024). MedExpQA: Multilingual benchmarking of Large Language Models for Medical Question Answering. Artificial Intelligence in Medicine, 2024.

Iakes Goenaga, Aitziber Atutxa, Koldo Gojenola, Maite Oronoz, Rodrigo Agerri (2024), Explanatory argument extraction of correct answers in resident medical exams, Artificial Intelligence in Medicine Volume 157, November 2024, 102985.

Martinez-Romo, J., Huesca-Barril, J. F., Araujo, L., & Marin, E. D. L. C. (2024). UNED-UNIOVI at EmoSPeech-IberLEF2024: Emotion Identification in Spanish by Combining Multimodal Textual Analysis and Machine Learning Methods. In Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2024), co-located with the 40th Conference of the Spanish Society for Natural Language Processing (SEPLN 2024), CEUR-WS.org.

Sierra-Callau, M., Rodríguez-García, M. Á., Montalvo-Herranz, S., & Martínez-Unanue, R. (2024). UNED_MRES Team at MentalRiskES2024: Exploring Hybrid Approaches to Detect Mental Disorder Risks in Social Media.

Arana, J., Idoyaga, M., Urruela, M., Espina, E., Salazar, A. A., & Gojenola, K. (2024, May). A Virtual Patient Dialogue System Based on Question-Answering on Clinical Records. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) (pp. 2017-2027).

Fabregat, H., Deniz, D., Duque, A., Araujo, L., & Martinez-Romo, J. (2024). NLP-UNED at eRisk 2024: approximate nearest neighbors with encoding refinement for early detecting signs of anorexia. Working Notes of CLEF, 9-12.

Sánchez de Castro, A., Araujo, L., & Martinez-Romo, J. (2024). Generative LLMs for Multilingual Temporal Expression Normalization. In ECAI 2024 (pp. 3789-3796). IOS Press.

Larrayoz, X., Casillas, A., Oronoz, M., & Pérez, A. (2024). Mental Disorder Detection in Spanish: hands on skewed class distribution to leverage training. In IberLEF (Working Notes). CEUR Workshop Proceedings.

Anar Yeginbergen, Maite Oronoz, Rodrigo Agerri (2024). Argument Mining in Data Scarce Settings: Cross-lingual Transfer and Few-shot Techniques. Proceedings of the 2024 Main Conference of the Association for Computational Linguistics (ACL 2024). August 11th to 16th, 2024. Bangkok, Thailand.

Fernandez-Hernandez, J., Fabregat, H., Duque, A., Araujo, L., & Martinez-Romo, J. (2024). UNED-GELP at MentalRiskES 2024: Transformer-Based Encoders and Similarity Techniques for Early Risk Prediction of Mental Disorders. In IberLEF (Working Notes). CEUR Workshop Proceedings.

Jordan Koontz, Maite Oronoz, Alicia Pérez: (2024). Ixa-Med at Discharge Me! Retrieval-Assisted Generation for Streamlining Discharge Documentation. BioNLP@ACL 2024: 658-663.

Maite Oronoz, Sara Gracia, Jose Mari González, Alicia Pérez (2024). Suizidio-zantzuak sare sozialetan: ingelesez eta gaztelaniaz hizkuntza-ezaugarriak berdinak al dira? [Signs of suicide on social networks: are the linguistic features the same in English and Spanish?] EKAIA: Zientzia eta Teknologia aldizkaria. Issue XX, 2024.

Nuria Lebeña, Arantza Casillas, and Alicia Pérez. (2024). Temporal Name Entity Recognition and Relation Extraction in Clinical Electronic Health Records with Span-based Entity and Relation Transformer. In Proceedings of the 2024 14th International Conference on Bioscience, Biochemistry and Bioinformatics (ICBBB '24). Association for Computing Machinery, New York, NY, USA, 48–54. https://doi.org/10.1145/3640900.3640901.

Jordan Koontz, Maite Oronoz, Alicia Pérez: (2023). Evaluating Data Augmentation for Medication Identification in Clinical Notes. RANLP 2023: 578-585.

Created Resources

EriBERTa: A Bilingual Pre-Trained Language Model for Clinical Natural Language Processing

EriBERTa is a bilingual language model specialized in the medical and clinical fields, pre-trained on extensive medical corpora. We have demonstrated that EriBERTa outperforms previous language models in the medical domain, thanks to its superior ability to understand medical texts and extract meaningful information. Moreover, EriBERTa exhibits strong transfer learning capabilities, allowing knowledge to be transferred from one language to another. This is particularly useful given the scarcity of clinical data in Spanish.

EriBERTa on Hugging Face
