I've been learning about Transformers for a month and wanted to write down my learnings. Since I love the question/answer format of learning, most of this post covers questions I had and the answers I found. It also helps me keep track of the papers I read along the way.
The way I see it, a Transformer is a system that is very:

- Modular: components can be swapped or recombined in many places while still functioning effectively.
- Simple: while the math may appear complex, at its core it consists of basic transformations and summations.
Above is the original Transformer architecture diagram from the landmark "Attention Is All You Need" paper. Each block represents an operation that takes specific inputs and produces transformed outputs. At its core, the architecture relies on basic matrix operations: multiplication, division, exponentials, summation, and averaging. There is no complex mathematics involved, just straightforward computations.
https://arxiv.org/abs/1706.03762
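To make that concrete, here is a minimal NumPy sketch (my own illustration, not the paper's reference code) of scaled dot-product attention, the operation at the heart of every block. Notice that it uses nothing beyond matrix multiplication, division, exponentials, and sums.

```python
# A minimal sketch of scaled dot-product attention, showing that only
# multiplication, division, exponentials, summation, and averaging appear.
import numpy as np

def attention(Q, K, V):
    # Q, K, V: (seq_len, d_k) matrices of queries, keys, and values.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # matrix multiply, then divide
    weights = np.exp(scores)                         # exponentials...
    weights /= weights.sum(axis=-1, keepdims=True)   # ...normalized by a sum (softmax)
    return weights @ V                               # weighted average of the values

# Example: 4 tokens, 8-dimensional queries/keys/values.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
out = attention(Q, K, V)
print(out.shape)  # (4, 8)
```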
For this article, the simplified diagram above is a better reference point. The original Transformer had two main components: encoder and decoder blocks. As the field has evolved, however, decoder-only Transformers have become dominant. So instead of a two-column diagram, we can represent the architecture as a single vertical sequence of blocks.
When implemented in code, this architecture is just a function: a sequence of tokens goes in, and for each position it produces a score for every possible next token.
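Here is a rough sketch of that input-to-output path for a decoder-only model, with hypothetical dimensions and untrained random weights (and no layer norm or multiple heads, to keep it short): token ids go in, next-token scores come out, and everything in between is the same block repeated in a vertical stack.

```python
# A rough sketch of the decoder-only shape: token ids in, next-token scores out,
# with a single block repeated n_blocks times in between. Dimensions and weights
# are made up for illustration; this model is untrained.
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model, n_blocks = 1000, 64, 4

embed = rng.standard_normal((vocab_size, d_model)) * 0.02
unembed = rng.standard_normal((d_model, vocab_size)) * 0.02
blocks = [
    {"W_qkv": rng.standard_normal((d_model, 3 * d_model)) * 0.02,
     "W_mlp": rng.standard_normal((d_model, d_model)) * 0.02}
    for _ in range(n_blocks)
]

def block(x, params):
    # Self-attention followed by a feed-forward layer, each with a residual add.
    q, k, v = np.split(x @ params["W_qkv"], 3, axis=-1)
    scores = q @ k.T / np.sqrt(x.shape[-1])
    scores += np.triu(np.full(scores.shape, -1e9), k=1)      # causal mask: no peeking ahead
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    x = x + weights @ v                                       # attention + residual
    return x + np.maximum(x @ params["W_mlp"], 0)             # ReLU MLP + residual

def forward(token_ids):
    x = embed[token_ids]       # input: a sequence of token ids
    for p in blocks:           # the single vertical stack of blocks
        x = block(x, p)
    return x @ unembed         # output: a score for every vocab entry, per position

logits = forward(np.array([5, 42, 7]))
print(logits.shape)  # (3, 1000): one row of next-token scores per input position
```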
What makes this architecture unique is its ability to act as a generalized learning machine: given sufficient training examples, it can mimic human-like learning in many scenarios.