TokenSpeed is an LLM inference engine designed for speed; the project describes itself as a 'speed-of-light' inference engine. It focuses on optimizing the core inference process for large language models. Its repository topics reference NVIDIA's Blackwell GPU architecture and model families such as DeepSeek, Kimi, and Qwen.
A starter prompt for Claude Code, what you'll need, and how to reach the maintainers.
You are an expert in high-performance computing and LLM inference optimization. Your task is to develop a foundational component of a 'speed-of-light' LLM inference engine, focusing initially on a single, well-defined optimization. We will build this in Python, with performance-critical components potentially written in Rust and exposed through Python bindings. The goal is to demonstrate a tangible speed improvement over a baseline (e.g., Hugging Face Transformers) for a specific, small LLM (e.g., Llama 2 7B). Start by focusing on an optimized KV cache implementation.
Your output should include:
1. A Rust library for a highly optimized KV cache with Python bindings.
2. Python code to integrate this cache into a simplified inference loop for a Llama 2 7B model.
3. A benchmarking script that compares the new KV cache against a standard Hugging Face implementation on sequence generation latency.
Use Next.js 16 App Router, React 19, and Tailwind v4 for any UI components (though UI is not the focus here), and a Neon Postgres database (though it is likely not needed for this initial phase). The immediate deliverable is a functional Rust KV cache and Python integration demonstrating a measurable speedup. Verify by running the benchmark script and confirming lower sequence generation latency than the baseline.
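To make the prompt's first deliverable more concrete, here is a minimal pure-Python sketch of the idea behind a KV cache: keep each token's key/value projection after computing it once, instead of recomputing the whole prefix at every decoding step. It is not TokenSpeed's implementation, and it is not the Rust library the prompt asks for. The names `KVCache`, `project`, `decode_without_cache`, `decode_with_cache`, and `median_latency` are all hypothetical, and `project` is a toy stand-in for a real model's key/value projection.

```python
import time
from statistics import median


class KVCache:
    """Toy preallocated key/value cache for one attention head (illustrative only)."""

    def __init__(self, max_len, head_dim):
        # Allocate all slots up front so appending never resizes the buffers.
        self.keys = [[0.0] * head_dim for _ in range(max_len)]
        self.values = [[0.0] * head_dim for _ in range(max_len)]
        self.max_len = max_len
        self.length = 0

    def append(self, key, value):
        if self.length >= self.max_len:
            raise IndexError("KV cache is full")
        self.keys[self.length] = list(key)
        self.values[self.length] = list(value)
        self.length += 1

    def view(self):
        # Only the filled prefix is valid.
        return self.keys[:self.length], self.values[:self.length]


def project(token_id, head_dim):
    """Hypothetical stand-in for a model's key/value projection of one token."""
    key = [float((token_id + i) % 7) for i in range(head_dim)]
    value = [float((token_id * (i + 1)) % 5) for i in range(head_dim)]
    return key, value


def decode_without_cache(tokens, head_dim):
    """Recompute projections for the whole prefix at every step; return the count."""
    projections = 0
    for step in range(1, len(tokens) + 1):
        for token_id in tokens[:step]:
            project(token_id, head_dim)
            projections += 1
    return projections


def decode_with_cache(tokens, head_dim):
    """Project each token once and store it; return the count and the cache."""
    cache = KVCache(len(tokens), head_dim)
    projections = 0
    for token_id in tokens:
        key, value = project(token_id, head_dim)
        cache.append(key, value)
        projections += 1
    return projections, cache


def median_latency(fn, *args, repeats=5):
    """Median wall-clock time in seconds of fn(*args) over several runs."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        samples.append(time.perf_counter() - start)
    return median(samples)


if __name__ == "__main__":
    tokens = list(range(8))
    print("projections without cache:", decode_without_cache(tokens, 4))
    print("projections with cache:", decode_with_cache(tokens, 4)[0])
```

For a sequence of n tokens the uncached loop performs n(n+1)/2 projections while the cached loop performs n. A real benchmark along the lines the prompt describes would time actual model generation rather than this toy function; `median_latency` only shows the shape such a measurement helper might take.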
TokenSpeed is a speed-of-light LLM inference engine. Topics: blackwell, deepseek, gpt-oss, kimi, lightseek, llm, minimax, nemotron, qwen, speed-of-light, tokenspeed.
Open an issue or start a discussion in the GitHub repository (https://github.com/lightseekorg/tokenspeed).
“I've been exploring high-performance LLM inference and built a prototype demonstrating a novel KV cache optimization that shows X% speedup for Llama 2 7B. I'd be interested in discussing how this could integrate with or complement TokenSpeed's goals.”