Lydia Nishimwe
When the Gold Standard Isn't Necessarily Standard: Challenges of Evaluating the Translation of User-Generated Content
Preprint
Lydia Nishimwe, Benoît Sagot, Rachel Bawden
Robust Neural Machine Translation of User-Generated Content
🎓 PhD Thesis 🎓
Lydia Nishimwe
Making Sentence Embeddings Robust to User-Generated Content
LREC-COLING 2024
Lydia Nishimwe, Benoît Sagot, Rachel Bawden
Your Fairseq-trained model might have more embedding parameters than it should.
How a bug in reading SentencePiece vocabulary files causes some Fairseq-trained models to have up to 3k extra parameters in the embedding layer.
Lydia Nishimwe, posted on Mar 16, 2024
Last updated on Jan 8, 2026
Fairseq Bug Fix
A bug in reading SentencePiece vocabulary files causes models to have 3k extra params in the embedding layer.
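The "~3k extra parameters" figure follows directly from how embedding layers are sized: any spurious entries in the vocabulary each add a full embedding row. A minimal sketch of that arithmetic, with an assumed embedding dimension and vocabulary size (the concrete numbers here are illustrative, not taken from the post):

```python
# Sketch: a few spurious vocabulary entries inflate the embedding layer
# by (extra entries) x (embedding dimension) parameters.
def embedding_params(vocab_size: int, embed_dim: int) -> int:
    # An embedding matrix holds one embed_dim-sized row per vocabulary entry.
    return vocab_size * embed_dim

embed_dim = 512                  # assumed model dimension
true_vocab = 8000                # assumed SentencePiece vocabulary size
misread_vocab = true_vocab + 6   # e.g. 6 entries added by a vocab-reading bug

extra = embedding_params(misread_vocab, embed_dim) - embedding_params(true_vocab, embed_dim)
print(extra)  # 6 * 512 = 3072, i.e. ~3k extra parameters
```

With a larger embedding dimension or more spurious entries, the waste scales linearly in both factors.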
RoLASER
Making LASER sentence embeddings robust to user-generated content via Knowledge Distillation and Data Augmentation.
Normalisation lexicale de contenus générés par les utilisateurs sur les réseaux sociaux (Lexical Normalisation of User-Generated Content on Social Media)
🏆 Prix du Meilleur Article (Best Paper Award) - RÉCITAL 2023 🏆
Lydia Nishimwe
Inria-ALMAnaCH at the WMT 2022 shared task: Does Transcription Help Cross-Script Machine Translation?
Jesujoba O. Alabi, Lydia Nishimwe, Benjamin Muller, Camille Rey, Benoît Sagot, Rachel Bawden