GentleCold's Blog

The Latest vLLM Scheduling System and Continuous Batching Explained

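1. Conclusions first. Version note: this article is based on the official vLLM latest documentation and API source pages as accessed on 2026-05-08. The vLLM documentation explicitly notes that latest is a developer preview and is not the same as the latest stable release. In production, go by the vLLM version you actually have installed; ideally run vllm serve --help to confirm that a parameter exists. This article covers the scheduling system in the latest vLLM, focusing on: 1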

2026-05-08

Notes

#KV Cache #LLM Inference #VLLM #Scheduler #Batch

vLLM Inference Parallelism and MLA Explained

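1. Conclusions first. Version note: this article is based on the official vLLM latest documentation as accessed on 2026-05-08. The vLLM documentation pages explicitly note that latest is the developer preview documentation and is not the same as the latest stable release, so parameters and behavior such as DCP, PCP, and EP should be checked against the vLLM version you actually have installed. For production, also consult the documentation for that version, or run vllm serve --help directly to confirm that a parameter exists. In vLLM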

2026-05-08

Notes

#KV Cache #LLM Inference #VLLM #MLA #Parallelism

Ran-CLOCK Paper Review

Paper: Performance Analysis of the Randomized SIEVE/CLOCK Cache Replacement Algorithm. Authors: Yirong Wang, Peter Desnoyers, Benny Van Houdt. Published in: Proc. ACM Meas. Anal. Comput. Syst., Vol. 10, No. 2, Article 49.

2026-05-08

Notes

#Cache #CLOCK #SIEVE #Mean Field

Foyer: Analysis of Key Technical Points

References: * Foyer: A Hybrid Cache in Rust - Past, Present and Future * Foyer docs.rs API * Foyer GitHub README Links: * https://blog.mrcroxx.com/posts/foyer-a-hybrid-cache-in-rust-past-present-and-futur

2026-05-08

Notes

#Rust #Cache #Storage

HaS Paper Review

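Paper: HaS: Accelerating RAG through Homology-Aware Speculative Retrieval. Version: arXiv:2604.20452v1, 2026-04-22. 1. Background. HaS addresses retrieval latency in RAG systems. Many LLM inference optimizations focus on prefill, decode, KV cache, and attention kernels, but in real RAG systems, retrieval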

2026-05-08

Notes

#LLM #RAG #Retrieval

InfoFlow KV Paper Review

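Paper: InfoFlow KV: Information-Flow-Aware KV Recomputation for Long Context. Version: arXiv:2603.05353v1, 2026-03-05. 1. Background. InfoFlow KV addresses KV cache precomputation and selective recomputation in long-context RAG inference. In RAG, the system often has to prepend a large number of retrieved documents to the prompt. The context can reach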

2026-05-08

Notes

#KV Cache #LLM #Long Context

ScaleEvict Paper Review

Paper: ScaleEvict: Altruistic Eviction for RDMA-enabled Distributed Storage Engines. Authors: Till Steinert, Muhammad El-Hindi, Tobias Ziegler, Viktor Leis, Carsten Binnig. Published in: DaMoN’26, 2026-05-31 to 2026-06-05

2026-05-08

Notes

#RDMA #Cache #Distributed Storage

ASL Paper Review

Paper: Adaptive Layer Selection for Layer-Wise Token Pruning in LLM Inference. Version: arXiv:2601.07667v2, 2026-04-16. 1. Background. ASL addresses layer-wise token pruning in long-context LLM inference. Its immediate context is FastKV, GemFilter, PyramidInfer, and similar

2026-05-08

Notes

#KV Cache #LLM #Long Context

FastKV Paper Review

Paper: FastKV: Decoupling of Context Reduction and KV Cache Compression for Prefill-Decoding Acceleration. Version: arXiv:2502.01068v7, 2026-04-20. Code: https://github.com/dongwonjo/FastKV 1. Background. The cost of long-context LLM inference comes mainly from

2026-05-08

Notes

#KV Cache #LLM #Long Context

HTTP/1.1, HTTP/2, and gRPC: Notes on How They Work

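1. HTTP/1.1 vs HTTP/2 1.1 The core bottleneck of HTTP/1.1. Loading a web page requires HTML + 10 CSS files + 20 JS files: Connection 1: [request HTML ]──[response HTML ] Connection 2: [request CSS1 ]──[response CSS1 ] Connection 3: [request CSS2 ]──[response CSS2 ] ... (the browser opens at most 6 TCP connections at once; the rest wait in a queue) Head-of-line blocking (Head-of-Line B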

2026-04-14

Notes

#Distributed Systems #gRPC #Networking #HTTP