Speculative Attention

Train a Markov chain (on CPU) to predict attention patterns for an LLM. Use the predictions either to swap parts of the KV cache out to CPU memory or to drive attention sparsity. A round trip to the CPU is too slow to help the early layers, but if the predictor works at the token level only, so that it can start as soon as the token is known, its output can probably arrive in time for the second half of the model.
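As a rough illustration, here is one possible reading of a token-level predictor, sketched in plain Python. Everything in it is an assumption rather than part of the idea as stated: the class name `TokenAttentionPredictor`, the choice to key a first-order table on the current token, and the `budget` parameter for how many cached positions stay on the GPU. It counts how much attention each query token has historically placed on each key token, then ranks the positions in the current context by that learned affinity.

```python
from collections import defaultdict


class TokenAttentionPredictor:
    """Hypothetical first-order table: query token -> attention mass per key token."""

    def __init__(self):
        # mass[query][key] accumulates the attention weight observed from query to key.
        self.mass = defaultdict(lambda: defaultdict(float))
        # total[query] is the sum of all weight observed for that query token.
        self.total = defaultdict(float)

    def observe(self, tokens, attn_row):
        """Record one attention row; attn_row[i] is the weight the last token puts on tokens[i]."""
        query = tokens[-1]
        for key, weight in zip(tokens, attn_row):
            self.mass[query][key] += weight
            self.total[query] += weight

    def score(self, query, key):
        """Estimated share of the query token's attention that goes to the key token."""
        total = self.total.get(query, 0.0)
        if total == 0.0:
            return 0.0  # unseen query token: no information
        return self.mass[query].get(key, 0.0) / total

    def positions_to_keep(self, tokens, budget):
        """Return the `budget` context positions predicted to matter most for the last token.

        Ties are broken in favour of more recent positions (an assumption).
        Positions not returned are candidates for swapping out or masking.
        """
        query = tokens[-1]
        ranked = sorted(
            range(len(tokens)),
            key=lambda i: (-self.score(query, tokens[i]), -i),
        )
        return sorted(ranked[:budget])


# Toy usage with made-up attention weights.
predictor = TokenAttentionPredictor()
predictor.observe(["the", "cat", "sat"], [0.1, 0.8, 0.1])
keep = predictor.positions_to_keep(["the", "cat", "ran", "dog", "sat"], budget=2)
```

Because the table is indexed by token identity alone, the lookup needs nothing from the model's hidden states, which is what would let it run on the CPU while the early layers execute. Whether such a simple table predicts real attention patterns well enough is an open question this sketch does not answer.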

If you are inspired by this idea, you can reach out to the authors for collaboration or cite it:

@misc{muchane-speculative-attention-2026,
  author = {Muchane, Mark},
  title = {Speculative Attention},
  year = {2026},
  url = {https://hypogenic.ai/ideahub/idea/k4QgL3q0wkdCFEvrK3GZ}
}
