A good intro to Key-Value Cache and Quantization

Introduction to KV Cache and Quantization of LLMs
llms
transformer
Author

Peter Hoang

Published

January 4, 2024

I stumbled upon these blog posts while trying to read up on the key-value cache in LLMs. They are short, well explained, and give a good high-level overview, so I recommend them here.

  1. A good entry point for understanding the key-value cache in Transformers: https://huggingface.co/blog/optimize-llm#32-the-key-value-cache (Sept 2023). A minimal sketch of the idea follows this list.

  2. A good entry point for understanding quantization of LLMs: https://huggingface.co/blog/optimize-llm#1-harnessing-the-power-of-lower-precision (same article as in 1.)

    • In particular, this section makes it clear that quantization reduces the GPU memory required to run an LLM at the cost of slower inference, because the quantized weights have to be dequantized back to fp16 / bf16 for computation (a toy example of this trade-off also follows the list).
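To make the KV-cache idea from item 1 concrete, here is a minimal, self-contained sketch in PyTorch. It is a toy single-head attention decoder loop, not the Hugging Face implementation; the dimensions, weight matrices, and function names are all illustrative. The point is that at each decoding step only the new token's key and value are computed and appended to the cache, so the attention for the new query runs against cached K/V instead of reprojecting the whole prefix.

```python
# Toy sketch of autoregressive decoding with a KV cache (illustrative only).
import torch

torch.manual_seed(0)
d_model = 64
W_q = torch.randn(d_model, d_model)
W_k = torch.randn(d_model, d_model)
W_v = torch.randn(d_model, d_model)

def decode_step(x_new, kv_cache):
    """x_new: (1, d_model) embedding of the newest token only."""
    q = x_new @ W_q                      # query for the new token
    k = x_new @ W_k                      # key/value for the new token only
    v = x_new @ W_v
    if kv_cache is None:
        K, V = k, v
    else:
        K = torch.cat([kv_cache[0], k])  # append; the prefix is never recomputed
        V = torch.cat([kv_cache[1], v])
    attn = torch.softmax(q @ K.T / d_model ** 0.5, dim=-1)
    out = attn @ V                       # (1, d_model)
    return out, (K, V)

# Simulate decoding 5 tokens: the cache grows by one key/value per step.
cache = None
for _ in range(5):
    x = torch.randn(1, d_model)          # stand-in for the new token's embedding
    out, cache = decode_step(x, cache)
print(cache[0].shape)                    # torch.Size([5, 64]) -- 5 cached keys
```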
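And here is a toy illustration of the quantization trade-off from item 2. This uses a naive absmax int8 scheme, not the exact scheme any particular library uses; the matrix size and helper names are made up for the example. It shows that the stored weights shrink (int8 is 4x smaller than fp32, 2x smaller than fp16), but every forward pass pays an extra dequantization step back to a float dtype before the matmul.

```python
# Toy absmax int8 quantization round trip (illustrative only).
import torch

def quantize_absmax(w: torch.Tensor):
    scale = w.abs().max() / 127.0
    w_int8 = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return w_int8, scale

def dequantize(w_int8: torch.Tensor, scale: torch.Tensor):
    return w_int8.to(torch.float16) * scale

w = torch.randn(4096, 4096)              # a full-precision (fp32) weight matrix
w_int8, scale = quantize_absmax(w)

print(w.element_size() * w.nelement() / 2**20, "MiB in fp32")        # 64 MiB
print(w_int8.element_size() * w_int8.nelement() / 2**20, "MiB in int8")  # 16 MiB

x = torch.randn(1, 4096, dtype=torch.float16)
y = x @ dequantize(w_int8, scale)        # extra dequantize work on every forward pass
```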

The same HuggingFace article also links to Jay Alammar's Illustrated GPT-2 post (Aug 2019).

All in all, a great post from HF as an entry point for understanding the KV cache and quantization.