Neural Machine Translation by Jointly Learning to Align and Translate

This paper marks an important step in the development of machine translation (MT). It came out just as Neural Machine Translation (NMT) was taking off from Statistical Machine Translation (SMT), and it stands as a milestone on the way to Transformer-based NMT. The authors introduce a novel attention mechanism for MT and show that it outperforms prior recurrent neural network NMT approaches, particularly on long sentences.
Authors

Meghann Agarwal

Hongsup Shin

Claude

Published

September 28, 2023

Why this paper?

I chose this paper for a few reasons. First, I have a long-held interest in human languages and language learning, as well as some experience as a freelance translator before the release of Google Translate. I casually follow the progression of machine translation tools and am generally curious about how they are implemented. Second, I’ve recently been involved with MT tasks at work and want to deepen my understanding of its foundations. Finally, MT is a prevalent use case in the development of Large Language Models (LLMs); this is a well-cited paper along that path, and I felt it would be worthwhile for others in the Journal Club to read and discuss.

Paper Summary

There are many good posts, videos, tutorials, etc. about this paper. Rather than write my own summary, I’ve included LLM output generated via Anthropic’s Claude AI Assistant for a use case similar to machine translation: summarization (i.e., English-to-English “translation”). The output looks okay, so I have left it unchanged.

I’ve also included a few resources that were helpful and/or interesting to me while reading this paper.

Claude AI Summary

I uploaded the PDF to https://claude.ai/ and gave the following prompt to get a summary of the paper.

Prompt: “Summarize this article”

Output: “Here is a summary of the key points from the article:

The article proposes a novel neural network architecture for neural machine translation. The key ideas are:

  • Existing encoder-decoder models encode the full source sentence into a fixed-length vector from which the translation is decoded. This may be problematic for long sentences.
  • The proposed model encodes the source sentence into a sequence of vectors and adaptively selects a subset of these source vectors while decoding each target word. This allows it to focus only on the relevant parts of the source sentence.
  • An alignment model is jointly trained to compute a soft alignment between the source and target words, allowing gradients to be backpropagated through the alignments.
  • Experiments on English-French translation show the proposed model (RNNsearch) significantly outperforms a standard encoder-decoder model (RNNencdec), especially on long sentences.
  • The RNNsearch model achieves similar performance to phrase-based statistical machine translation on a test set with only sentences containing known words.
  • Qualitative analysis shows the model learns linguistically plausible alignments between source and target words.

In summary, the key innovation is an adaptive attention-based encoder-decoder architecture that gives improved translation performance compared to prior encoder-decoder sequence-to-sequence models for neural machine translation.”

Additional Resources

  • Visualizing A Neural Machine Translation Model (Mechanics of Seq2seq Models With Attention) - This post gives a great conceptual overview of the paper and describes the salient changes from previous sequence-to-sequence models: (1) the decoder receives all of the encoder’s hidden states rather than only the last one, and (2) at each decoder time step it scores those hidden states to produce a context vector specific to that step, instead of using a single fixed-length context vector for the whole sentence (see the code sketch after this list).
  • PyTorch Seq2Seq Tutorials - This set of six notebooks takes one through a progression of code implementations of papers, from those preceding and/or contemporary to this paper (notebooks #1 and #2), to this paper (notebook #3), and culminating in the paper that introduced self-attention and transformers (notebook #5).
  • Dive into Deep Learning: Coding Session #4 Attention Mechanism I (Americas/EMEA) - Just before the Journal Club meeting, we discovered that one of us (Kshitij) had led the live coding session of the “Bahdanau Attention” RNN in this video. The theory section preceding it is also very nice.
  • Natural Language Processing Coursera course by HSE University - I initially watched these videos on Coursera and found the lectures to be very clear. However, Coursera no longer hosts the course; it has been moved to HSE’s online education portal which requires account creation and login to find the course. The code repository associated with the Coursera course is still hosted on GitHub. The lectures are currently on YouTube, although without the weekly groupings and other materials that exist in the course. This paper (Bahdanau et al.) is discussed in the video titled “Attention Mechanism”.
  • Building Machine Translation Systems for the Next Thousand Languages - I found this report very interesting. It didn’t seem like the right fit for Journal Club (e.g., not peer-reviewed) but it has a lot of interesting discussion and I find the reference tables quite useful.
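
To make the mechanism described in the first resource concrete, here is a minimal NumPy sketch of a single attention step. This is only a sketch under assumed shapes; the variable names, dimensions, and random parameters are illustrative rather than taken from any particular implementation. The previous decoder state is scored against every encoder hidden state by a small feedforward alignment model, the scores are normalized with a softmax, and the resulting weights form the context vector as a weighted sum of the encoder states.

```python
import numpy as np

def softmax(x):
    x = x - x.max()  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum()

def additive_attention(decoder_state, encoder_states, W_a, U_a, v_a):
    """One decoding step of Bahdanau-style (additive) attention.

    decoder_state:  (d_dec,)        previous decoder hidden state s_{i-1}
    encoder_states: (T_src, d_enc)  encoder annotations h_1 .. h_T
    W_a, U_a, v_a:  learned parameters of the alignment model
    """
    # Alignment scores e_ij = v_a^T tanh(W_a s_{i-1} + U_a h_j)
    scores = np.tanh(decoder_state @ W_a + encoder_states @ U_a) @ v_a
    # Attention weights alpha_ij: a probability distribution over source positions
    weights = softmax(scores)
    # Context vector c_i: weighted sum of the encoder annotations
    context = weights @ encoder_states
    return context, weights

# Toy example with random parameters (illustrative only)
rng = np.random.default_rng(0)
T_src, d_enc, d_dec, d_att = 6, 8, 8, 10
h = rng.normal(size=(T_src, d_enc))   # encoder annotations, one per source word
s = rng.normal(size=(d_dec,))         # previous decoder hidden state
W_a = rng.normal(size=(d_dec, d_att))
U_a = rng.normal(size=(d_enc, d_att))
v_a = rng.normal(size=(d_att,))
context, weights = additive_attention(s, h, W_a, U_a, v_a)
print(weights.round(3), context.shape)  # 6 attention weights summing to 1
```

In the full model, this per-step context vector is combined with the previous target word and decoder state to predict the next word, and the alignment parameters are trained jointly with the rest of the network.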

Paper Discussion

Many thanks to Hongsup for taking notes during the discussion and for sharing them. Some of them appear lightly edited in the notes section below; others I have integrated with my own recollections of our Journal Club conversation while putting this post together a few months later (my apologies for not posting sooner).

We started the meeting by discussing machine translation in general, along with more specific observations about MT from a user’s point of view (some of which are listed below). We then put the paper’s contribution into context with a brief foray into the history of MT. A standard approach prior to this paper was phrase-based statistical MT (translating phrase by phrase using a learned dictionary of translations, essentially a lookup table), e.g., Moses. This paper sits in the regime of recurrent neural network (RNN) sequence-to-sequence neural MT. The standard approach nowadays is Transformer-based neural MT (cf. Vaswani et al.).

After the introductory comments we delved into the details of the paper together.

Discussion Notes

The following notes serve as a reminder to attendees of the various discussion points.

  • Encoder-decoder model: the maximum values of i and j can differ; the input and output sequence lengths need not match. Similar to text summarization.
  • Main contribution:
    • A flexible-length representation instead of a single fixed-length vector
    • Soft search: what does “soft” mean? (probably probabilities via softmax); it is also a bit unclear what “search” means
  • “Fixed length” vector
    • What exactly is fixed? Not the number of units in a hidden layer
    • Context vector: c vs. c_i (c_i depends on the annotations, whose number corresponds to the number of source words; see the equations after these notes)
  • Difference between this and “Attention is All You Need” paper? (RNN vs. transformer?)
  • What is the alignment model? Essentially a single-hidden-layer feedforward network with a tanh activation
  • BLEU and perplexity (perplexity is the exponential of the average negative log-likelihood per word; higher means the model is more “surprised” or “perplexed”, so lower is better)
  • Datasets
    • Europarl: European Parliament proceedings corpus (probably a 1:1 translation relationship because it consists of translated parliamentary documents)
    • Parallel corpora: probably hard to build a high-quality one
  • The data split (validation/test) is interesting because the temporal nature of the data is taken into account
  • Most frequent “words”: words or tokens?
  • Beam search?
  • Figure 3: not a one-to-one mapping, so soft weights; evidence of the alignment model working (probabilities)
    • What would it look like when we pair up languages with different origins?
    • Would have been nice to pair up non-English languages
  • Can we do zero-shot machine translation? How would this work?
  • Table 1
    • Moses seems to use a larger dataset
    • BLEU scores are compared to Moses (the fine-tuned 50-word model; would have been nice to see qualitative results)
  • The description of model-training compute resources was different back then: “5 days” (on what hardware, though?)
  • Appendix C: difficult to understand if you’re not a French speaker
  • Why does machine translation get worse toward the end of longer sentences? (especially after 30 words; what’s so special about that number?)
    • Maybe related to not being directional? (Moses)
    • Maybe because of the training data? The majority of examples might be around 30 words (the paper said 15-40 words); it would have been nice to see the distribution of sentence lengths
    • In a longer sentence, the later parts have to refer back to previously mentioned words, which can be forgotten, difficult to resolve, etc.
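
For reference on the “fixed-length vector”, context-vector, and alignment-model notes above, the paper computes a separate context vector for each target position i as a weighted sum over the encoder annotations, with the weights coming from a softmax over the scores of a small feedforward alignment model (notation as in the paper):

$$
c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j, \qquad
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}, \qquad
e_{ij} = v_a^\top \tanh\left( W_a s_{i-1} + U_a h_j \right)
$$

So nothing is compressed into a single fixed-length vector: each c_i re-weights all T_x annotations h_j, which is exactly what the single context vector c of the earlier encoder-decoder model could not do.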

Final Thoughts

“Neural Machine Translation by Jointly Learning to Align and Translate” marks an important step in the development of machine translation, applying the concept of attention to MT.

I found it beneficial to discuss this paper as a group, working together to explain and understand specific concepts and terms. The meeting also provided a forum for exploring and thinking about broader topics around machine translation. I’ll end by listing a few of the topics and questions that we considered.

  • Sentence-by-sentence translation and how it can mistranslate pronouns in highly conjugated languages where the pronoun is not explicit. For example, using context from the other sentences in a person’s bio could keep the translated pronoun from changing between sentences - something that is obvious to a human reader.
  • Challenges of translating/detecting slang, expletives, rare words or phrases, etc.
  • Zero-shot translation (e.g., where no parallel corpus exists for a language pair) - Comparing the embedding space of one language to another: would a model represent a concept consistently across multiple languages? What would this look like?
  • How can machine translation avoid stereotyping: can we create a representation while maintaining diversity? - Hallucinations and bias: what about a rule-based ML model?
  • Extending translation to many languages, and different approaches for different language challenges (fewer examples, different language structure, etc.; cf. the thousand-language paper). How is Google Translate able to translate between so many languages? See Building Machine Translation Systems for the Next Thousand Languages.

Fund. ނިމުމެވެ. Endirinn. Ti Panungpalan. Njedebe. Cоңы. Tungtungna. Isiphetho. अन्तम् (translations of “The End” using Google Translate)