
Study Tests Whether Transformers Need All Three QKV Projections
A new paper systematically evaluates three variants of projection sharing in transformer attention: shared key-value, shared query-key, and a single projection shared across all three roles. The authors find that sharing the key and value projections comes close to standard QKV attention, halving the KV cache at a cost of only 3.1% in perplexity. Combining this with grouped-query or multi-query attention can cut the cache by up to 96.9%, making on-device inference more practical.
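
The cache savings follow from simple counting, which a minimal Python sketch can make concrete. The configuration below (24 layers, 16 attention heads, head dimension 64, 4096 tokens, 2-byte elements) is a hypothetical assumption chosen for illustration, not a setup taken from the paper, and `kv_cache_bytes` and `reduction` are illustrative helper names. Under that assumption, storing one shared tensor per token instead of separate keys and values halves the cache, and pairing that with multi-query attention (a single KV head) leaves 1/32 of the baseline.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   shared_kv=False, bytes_per_elem=2):
    """Approximate KV cache size in bytes for one sequence."""
    # Standard attention caches two tensors per token (K and V);
    # with a shared key-value projection only one needs to be stored.
    tensors_per_token = 1 if shared_kv else 2
    return (n_layers * n_kv_heads * head_dim * seq_len
            * tensors_per_token * bytes_per_elem)


def reduction(baseline, variant):
    """Fraction of the baseline cache that the variant saves."""
    return 1.0 - variant / baseline


# Hypothetical model configuration (an assumption, not from the paper).
N_LAYERS, N_HEADS, HEAD_DIM, SEQ_LEN = 24, 16, 64, 4096

# Standard multi-head attention: one KV head per query head.
baseline = kv_cache_bytes(N_LAYERS, N_HEADS, HEAD_DIM, SEQ_LEN)

# Shared key-value projection, same number of heads.
shared = kv_cache_bytes(N_LAYERS, N_HEADS, HEAD_DIM, SEQ_LEN, shared_kv=True)

# Shared key-value projection plus multi-query attention (one KV head).
shared_mqa = kv_cache_bytes(N_LAYERS, 1, HEAD_DIM, SEQ_LEN, shared_kv=True)

saving_shared = reduction(baseline, shared)          # 1 - 1/2
saving_shared_mqa = reduction(baseline, shared_mqa)  # 1 - 1/32
```

With 16 heads the combined saving is 1 - 1/32 = 0.96875, which matches the quoted 96.9% to one decimal place, although the summary does not say which head count that figure refers to. Grouped-query attention falls between the two cases: with `g` KV heads the remaining fraction is `g / (2 * 16)` under the same assumptions.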

