Robust Neural Machine Translation of User-Generated Content

Abstract

User-generated content (UGC), such as social media text, presents significant challenges for natural language processing (NLP) due to its lexical variability and deviation from standard language norms. While neural machine translation (NMT) and sentence embedding models have achieved remarkable progress, their robustness to UGC remains limited. This thesis explores methods to improve the machine translation of UGC, with a focus on using sentence embeddings. We first analyse the performance of standard NMT models on UGC, identifying key challenges such as the negative effect of UGC on tokenisation and the scarcity of parallel UGC translation data. To mitigate these issues, we propose data augmentation techniques to train more robust models. Additionally, we explore lexical normalisation to reduce non-standardness in UGC. We also introduce RoLASER, a robust sentence embedding model trained via a teacher-student approach and designed to improve alignment between the representations of standard and UGC text. Extending this work, we develop RoSONAR, an NMT system that uses robust sentence embeddings to enhance translation quality for UGC. Our results demonstrate that robust embeddings and data augmentation significantly improve NMT performance on UGC, bridging the gap between standard and non-standard text translation. Furthermore, we conduct a case study on large language models (LLMs) to analyse evaluation challenges in UGC translation. We show that applying the same translation guidelines used to create the datasets significantly improves LLM-generated translations, highlighting the importance of consistent evaluation practices. Overall, this work shows that building more robust language models and improving both training data and evaluation methods makes automatic translation systems better at handling the non-standard, expressive, and creative language found in UGC.

Date
Jun 18, 2025 2:00 PM — 5:00 PM
Event
PhD Defence
Location
Centre Inria de Paris
Paris, France
Lydia Nishimwe
PhD Graduate

PhD graduate in AI, specialising in natural language processing (NLP).
