I stumbled upon these blog posts while trying to read up on the Key-Value cache in LLMs. They're short, well explained, and provide a good high-level overview, so I recommend them here.
1. Good entry for understanding the Key-Value cache in Transformers: https://huggingface.co/blog/optimize-llm#32-the-key-value-cache (Sept 2023)
2. Good entry for understanding quantization of LLMs: https://huggingface.co/blog/optimize-llm#1-harnessing-the-power-of-lower-precision (same article as in 1.)
- In particular, this section makes it clear that quantization reduces the GPU memory required to run LLMs, at the cost of slower inference, because quantized values need to be converted back and forth to fp16 / bf16 (see the toy sketch below).
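To make that trade-off concrete, here is a small sketch of my own (not code from the article) of simple absmax int8 weight quantization: the weights sit in GPU memory as 1-byte integers, but each forward pass has to dequantize them back to fp16 before the matmul, which is where the extra latency comes from.

```python
# Toy absmax int8 quantization sketch -- illustrative only, not HF's implementation.
import torch

def quantize_int8(w: torch.Tensor):
    # One scale per output row (per-row absmax quantization).
    scale = w.abs().amax(dim=1, keepdim=True) / 127.0
    q = torch.round(w / scale).to(torch.int8)  # stored as 1 byte per weight
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Converted back to fp16 at inference time -- this extra step is the slowdown.
    return q.to(torch.float16) * scale.to(torch.float16)

w = torch.randn(4096, 4096, dtype=torch.float16)  # ~32 MB in fp16
q, scale = quantize_int8(w)                        # ~16 MB in int8 (+ tiny scales)

x = torch.randn(1, 4096, dtype=torch.float16)
y = x @ dequantize(q, scale).T                     # dequantize, then matmul
```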
In the same HuggingFace article is a link to The Illustrated GPT-2 post (Aug 2019).
- This is an excellent post - by far the most detailed explanation of how decoding works in the Transformer architecture that I've read.
- It also helped me grasp some finer details of the Transformer, and it explains why, for generation-related tasks (including translation, summarization, etc.), the decoder part alone is sufficient (a full encoder-decoder is probably unnecessary).
- I think this post + the original Transformer paper (2017) make for a complete intro to the Transformer architecture.
- The part explaining decoding, combined with HF link #1 above, makes the KV cache clear (a toy sketch follows below).
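For my own notes, here is a minimal single-head sketch of what the KV cache buys you during autoregressive decoding (toy code, not GPT-2's actual implementation; the weights and shapes are made up): at each step only the newest token's key and value are computed, while the keys and values of all previous tokens are reused from the cache instead of being recomputed.

```python
# Toy single-head KV cache sketch -- hypothetical weights, illustrative only.
import torch

d = 64
W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))
k_cache, v_cache = [], []  # grows by one entry per generated token

def attend(x_new: torch.Tensor) -> torch.Tensor:
    """x_new: embedding of the newest token only, shape (1, d)."""
    q = x_new @ W_q
    # Only the new token's key/value are computed; past ones come from the cache.
    k_cache.append(x_new @ W_k)
    v_cache.append(x_new @ W_v)
    K = torch.cat(k_cache)          # (seq_len, d)
    V = torch.cat(v_cache)          # (seq_len, d)
    scores = (q @ K.T) / d ** 0.5   # attend over all cached positions
    return torch.softmax(scores, dim=-1) @ V

# Each decoding step feeds just one new token; without the cache we would
# recompute K and V for the entire prefix at every step.
for _ in range(5):
    out = attend(torch.randn(1, d))
```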
All in all, a great post from HF as an entry point for understanding the KV cache and quantization.