I stumbled upon these blog posts while trying to read up on the Key-Value cache in LLMs. They're short, well explained, and provide a good high-level overview, so I recommend them here.
1. Good entry for understanding the Key-Value cache in Transformers: https://huggingface.co/blog/optimize-llm#32-the-key-value-cache (Sept 2023)
2. Good entry for understanding quantization of LLMs: https://huggingface.co/blog/optimize-llm#1-harnessing-the-power-of-lower-precision (same article as in 1.)
- In particular, this section makes it clear that quantization reduces the GPU memory required to run LLMs, at the cost of slower inference, because quantized values need to be converted back and forth to fp16 / bf16 (see the toy sketch below).
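To make that trade-off concrete, here is a small sketch of my own (not code from the article) of simple absmax int8 weight quantization: the weights sit in GPU memory as 1-byte integers, but each forward pass has to dequantize them back to fp16 before the matmul, which is where the extra latency comes from.

```python
# Toy absmax int8 quantization sketch -- illustrative only, not HF's implementation.
import torch

def quantize_int8(w: torch.Tensor):
    # One scale per output row (per-row absmax quantization).
    scale = w.abs().amax(dim=1, keepdim=True) / 127.0
    q = torch.round(w / scale).to(torch.int8)  # stored as 1 byte per weight
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Converted back to fp16 at inference time -- this extra step is the slowdown.
    return q.to(torch.float16) * scale.to(torch.float16)

w = torch.randn(4096, 4096, dtype=torch.float16)  # ~32 MB in fp16
q, scale = quantize_int8(w)                        # ~16 MB in int8 (+ tiny scales)

x = torch.randn(1, 4096, dtype=torch.float16)
y = x @ dequantize(q, scale).T                     # dequantize, then matmul
```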
In the same HuggingFace article is a link to The Illustrated GPT-2 post (Aug 2019).
- This is an excellent post - by far the most detailed explanation of how decoding works in the Transformer architecture that I've read.
- It also helped me grasp some finer details of the Transformer, and it explains why, for generation-related tasks (including translation, summarization, etc.), the decoder part alone is sufficient (a full encoder-decoder is probably unnecessary).
- I think this post + the original Transformer paper (2017) make for a complete intro to the Transformer architecture.
- The part explaining decoding, combined with HF link #1 above, makes the KV cache clear (a toy sketch follows below).
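For my own notes, here is a minimal single-head sketch of what the KV cache buys you during autoregressive decoding (toy code, not GPT-2's actual implementation; the weights and shapes are made up): at each step only the newest token's key and value are computed, while the keys and values of all previous tokens are reused from the cache instead of being recomputed.

```python
# Toy single-head KV cache sketch -- hypothetical weights, illustrative only.
import torch

d = 64
W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))
k_cache, v_cache = [], []  # grows by one entry per generated token

def attend(x_new: torch.Tensor) -> torch.Tensor:
    """x_new: embedding of the newest token only, shape (1, d)."""
    q = x_new @ W_q
    # Only the new token's key/value are computed; past ones come from the cache.
    k_cache.append(x_new @ W_k)
    v_cache.append(x_new @ W_v)
    K = torch.cat(k_cache)          # (seq_len, d)
    V = torch.cat(v_cache)          # (seq_len, d)
    scores = (q @ K.T) / d ** 0.5   # attend over all cached positions
    return torch.softmax(scores, dim=-1) @ V

# Each decoding step feeds just one new token; without the cache we would
# recompute K and V for the entire prefix at every step.
for _ in range(5):
    out = attend(torch.randn(1, d))
```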
All in all, a great post from HF as an entry point for understanding the KV cache and quantization.