vLLM Benchmarks

Last updated on the evening of June 25, 2025

1. Dataset

IMDb movie review sentiment analysis dataset: http://ai.stanford.edu/~amaas/data/sentiment/

The data is a CSV file with roughly the following layout:

review sentiment
text… positive
text… negative
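
A minimal sketch of reading such a file with Python's standard csv module. The column names `review` and `sentiment` come from the layout above; the `load_reviews` helper and the sample rows are made up for illustration.

```python
import csv
import io

def load_reviews(fileobj, limit=None):
    """Return (review, sentiment) pairs from a CSV with those two columns."""
    rows = []
    for i, row in enumerate(csv.DictReader(fileobj)):
        if limit is not None and i >= limit:
            break
        rows.append((row["review"], row["sentiment"]))
    return rows

# Hypothetical in-memory sample standing in for the real file.
sample = io.StringIO(
    "review,sentiment\n"
    '"A wonderful, moving film.",positive\n'
    '"Dull and far too long.",negative\n'
)
reviews = load_reviews(sample)
```

Passing `limit=n` gives the first n reviews, which is how the n = 4, 40 and 400 runs below could be fed.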

2. Tests

Model: NousResearch/Hermes-3-Llama-3.1-8B

GPU: a single H800

Maximum model context length (prompt tokens + output tokens): 131072

KV cache calculator: https://lmcache.ai/kv_cache_calculator.html
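
The calculator's estimate can also be worked out by hand. The sketch below assumes the standard Llama-3.1-8B configuration (32 layers, 8 KV heads, head dimension 128) and 2-byte (bf16/fp16) cache entries. These values are not stated in this post, so check them against the model's config.json.

```python
def kv_cache_bytes(num_tokens, num_layers=32, num_kv_heads=8,
                   head_dim=128, bytes_per_elem=2):
    """KV cache size: one K and one V vector per layer, KV head and token."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return num_tokens * per_token

per_token_kib = kv_cache_bytes(1) / 1024
full_context_gib = kv_cache_bytes(131072) / 1024**3
print(f"{per_token_kib:.0f} KiB per token, {full_context_gib:.0f} GiB at 131072 tokens")
```

Under these assumptions, a single prompt that fills the whole context needs about 16 GiB of KV cache in addition to the model weights.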

2.1 Offline throughput test

2.1.1 Test 1

Use a single prompt made of n texts joined by newlines, followed by the instruction:

"text1\n text2\n ... For each line of the above comments, determine whether it is a positive or negative comment. Answer only postive or negative:\n"

The output is limited to n tokens.

2.1.2 Test 2

Use n prompts, each made of one text followed by the instruction:

"text1\n For the above comment, determine whether it is a positive or negative comment. Answer only postive or negative:\n"

"text2\n For the above comment, determine whether it is a positive or negative comment. Answer only postive or negative:\n"

...

The output of each prompt is limited to 1 token.
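
The two prompt layouts can be sketched as follows. The instruction strings are copied from above, including their original spelling. The function names and sample texts are illustrative, and the exact separator used in the original scripts is an assumption.

```python
MULTI_INSTRUCTION = (
    "For each line of the above comments, determine whether it is a "
    "positive or negative comment. Answer only postive or negative:\n"
)
SINGLE_INSTRUCTION = (
    "For the above comment, determine whether it is a "
    "positive or negative comment. Answer only postive or negative:\n"
)

def build_test1_prompt(texts):
    """Test 1: one prompt holding all n texts, one per line, then the instruction."""
    return "\n".join(texts) + "\n" + MULTI_INSTRUCTION

def build_test2_prompts(texts):
    """Test 2: n prompts, each holding one text followed by the instruction."""
    return [text + "\n" + SINGLE_INSTRUCTION for text in texts]

texts = ["Great movie.", "Terrible plot."]
test1_prompts = [build_test1_prompt(texts)]  # 1 request, output limited to n tokens
test2_prompts = build_test2_prompts(texts)   # n requests, output limited to 1 token each
```

Test 2 repeats the instruction once per text, which is why its total prompt token count in the results is somewhat higher than Test 1's for the same n.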

2.1.3 Results

n = 400 (for the single prompt, this nearly exhausts the model's maximum token limit)

Test 1:
Throughput: 0.04 requests/s, 5058.22 total tokens/s, 16.80 output tokens/s
Total num prompt tokens: 120006
Total num output tokens: 400

Test 2:
Throughput: 65.75 requests/s, 21299.52 total tokens/s, 65.75 output tokens/s
Total num prompt tokens: 129180
Total num output tokens: 400

n = 40

Test 1:
Throughput: 1.02 requests/s, 11443.23 total tokens/s, 40.65 output tokens/s
Total num prompt tokens: 11220
Total num output tokens: 40

Test 2:
Throughput: 73.27 requests/s, 22262.43 total tokens/s, 73.27 output tokens/s
Total num prompt tokens: 12114
Total num output tokens: 40

n = 4

Test 1:
Throughput: 8.01 requests/s, 8118.47 total tokens/s, 32.03 output tokens/s
Total num prompt tokens: 1010
Total num output tokens: 4

Test 2:
Throughput: 32.03 requests/s, 8648.41 total tokens/s, 32.03 output tokens/s
Total num prompt tokens: 1076
Total num output tokens: 4

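Test 2 has higher throughput than Test 1 in every case.

Consider the prefill stage first. Test 2 runs with a larger batch size, and without prefix/KV cache reuse, prefilling one long prompt (attention cost O(n^2)) is bound to be slower than prefilling many short prompts (O(n)), so Test 1 has no advantage here.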
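
A back-of-the-envelope version of this argument, counting only the quadratic attention term. The per-token projection and MLP work is the same in both cases and is left out, and the instruction tokens are ignored. The figure of 300 tokens per text is a rough estimate from 120006 prompt tokens over 400 texts. The cost model is a simplification, not a measurement.

```python
def causal_attention_pairs(seq_len):
    """Number of (query, key) pairs under causal attention for one sequence."""
    return seq_len * (seq_len + 1) // 2

def prefill_pairs_single_prompt(n_texts, tokens_per_text):
    """Test 1 style: all texts concatenated into one long sequence."""
    return causal_attention_pairs(n_texts * tokens_per_text)

def prefill_pairs_separate_prompts(n_texts, tokens_per_text):
    """Test 2 style: each text is its own short sequence."""
    return n_texts * causal_attention_pairs(tokens_per_text)

n, length = 400, 300
ratio = prefill_pairs_single_prompt(n, length) / prefill_pairs_separate_prompts(n, length)
print(f"attention work ratio (one long / many short): {ratio:.1f}")
```

The ratio grows roughly linearly with n, which matches the O(n^2) versus O(n) contrast above.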

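Test 1's computational advantage only shows up when the KV cache for the texts is reused (for example, by submitting the same request twice and relying on the prefix caching mechanism). This is because Test 1 has fewer prompt tokens than Test 2, where the attention within the basic prompt (the instruction) is recomputed for every request.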
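
To make the reuse mechanism concrete, here is a toy model of block-level prefix caching: a block is reusable only if it and every block before it match a previously seen request. This is a simplified illustration, not vLLM's implementation. The block size of 16 and the class name are assumptions.

```python
class ToyPrefixCache:
    """Counts how many prompt tokens could be served from cached full blocks."""

    def __init__(self, block_size=16):
        self.block_size = block_size
        self.seen = set()  # hashes of (parent_hash, block_tokens) chains

    def submit(self, token_ids):
        """Return the number of leading tokens that hit the cache, then cache the rest."""
        parent = None
        cached_tokens = 0
        still_hitting = True
        full = len(token_ids) - len(token_ids) % self.block_size
        for start in range(0, full, self.block_size):
            block = tuple(token_ids[start:start + self.block_size])
            key = hash((parent, block))
            if still_hitting and key in self.seen:
                cached_tokens += self.block_size
            else:
                still_hitting = False
                self.seen.add(key)
            parent = key
        return cached_tokens

cache = ToyPrefixCache()
request = list(range(100))
first = cache.submit(request)    # cold cache: nothing to reuse
second = cache.submit(request)   # identical request: all full blocks are reused
```

In this toy model the second submission reuses 96 of the 100 tokens (six full blocks), while the partial last block is always recomputed.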

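In the decode stage, however, the longer the prompt, the slower the QKV dot-product computation, so Test 1 still has no advantage during decode.

2.1.4 Verification

For both Test 1 and Test 2, set duplicate=100 so that each request is submitted 100 times. The last 99 submissions then reuse the prefix cache (reuse of the basic prompt is avoided, so only reuse of the text blocks is considered). The resulting throughput can be taken as an approximation of the throughput under full reuse: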

100x duplication
Test 1:
Throughput: 2.47 requests/s, 297019.33 total tokens/s, 986.72 output tokens/s
Total num prompt tokens:  12000700
Total num output tokens:  40000

Test 2:
Throughput: 703.70 requests/s, 229369.89 total tokens/s, 703.70 output tokens/s
Total num prompt tokens:  12998000
Total num output tokens:  40000

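This shows that Test 1 has the higher total token throughput when the KV cache can be fully reused. Decode performance cannot be read off these numbers, though, because output tokens/s here is computed as output tokens / elapsed time.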
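
For reference, all three reported figures share the same elapsed time as the denominator. A minimal sketch follows. The function name is illustrative, and taking total tokens as prompt plus output tokens is an assumption that is consistent with the reported numbers.

```python
def throughput(num_requests, prompt_tokens, output_tokens, elapsed_s):
    """Return (requests/s, total tokens/s, output tokens/s) for one run."""
    return (
        num_requests / elapsed_s,
        (prompt_tokens + output_tokens) / elapsed_s,
        output_tokens / elapsed_s,
    )

# Test 1 with 100x duplication: recover the elapsed time from the output rate.
elapsed_s = 40000 / 986.72
req_s, total_s, out_s = throughput(100, 12000700, 40000, elapsed_s)
print(f"{req_s:.2f} requests/s, {total_s:.0f} total tokens/s, {out_s:.2f} output tokens/s")
```

Since the elapsed time covers prefill as well as decode, output tokens/s computed this way says little about decode speed on its own.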

With full reuse:

100x duplication
Test 1:
Throughput: 2.50 requests/s, 301121.59 total tokens/s, 1000.35 output tokens/s
Total num prompt tokens:  12000600
Total num output tokens:  40000

Test 2:
Throughput: 1465.77 requests/s, 476303.48 total tokens/s, 1465.77 output tokens/s
Total num prompt tokens:  12958000
Total num output tokens:  40000

In this case Test 2 is faster again.

2.2 Online throughput test

n = 400

Test 1:
============ Serving Benchmark Result ============
Successful requests:                     1
Benchmark duration (s):                  7.29
Total input tokens:                      120006
Total generated tokens:                  400
Request throughput (req/s):              0.14
Output token throughput (tok/s):         54.89
Total Token throughput (tok/s):          16524.00
---------------Time to First Token----------------
Mean TTFT (ms):                          293.37
Median TTFT (ms):                        293.37
P99 TTFT (ms):                           293.37
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          17.52
Median TPOT (ms):                        17.52
P99 TPOT (ms):                           17.52
---------------Inter-token Latency----------------
Mean ITL (ms):                           17.52
Median ITL (ms):                         17.40
P99 ITL (ms):                            19.57
==================================================

Test 2:
============ Serving Benchmark Result ============
Successful requests:                     400
Benchmark duration (s):                  4.84
Total input tokens:                      129180
Total generated tokens:                  400
Request throughput (req/s):              82.71
Output token throughput (tok/s):         82.71
Total Token throughput (tok/s):          26794.76
---------------Time to First Token----------------
Mean TTFT (ms):                          2706.23
Median TTFT (ms):                        2815.29
P99 TTFT (ms):                           4828.05
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          0.00
Median TPOT (ms):                        0.00
P99 TPOT (ms):                           0.00
---------------Inter-token Latency----------------
Mean ITL (ms):                           0.00
Median ITL (ms):                         0.00
P99 ITL (ms):                            0.00
==================================================
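
The 0.00 TPOT and ITL values for Test 2 follow from each request producing a single token: there is no token after the first one to time. A sketch of the usual definition, TPOT = (end-to-end latency - TTFT) / (output tokens - 1), is given below. Returning None for single-token outputs is a choice made here for illustration, whereas the benchmark output above reports 0.00.

```python
def tpot_ms(latency_ms, ttft_ms, output_tokens):
    """Time per output token excluding the first; None if there is no second token."""
    if output_tokens <= 1:
        return None
    return (latency_ms - ttft_ms) / (output_tokens - 1)

# Test 1: TTFT 293.37 ms and TPOT 17.52 ms over 400 tokens imply this latency.
latency_ms = 293.37 + 17.52 * (400 - 1)
print(f"implied request latency: {latency_ms / 1000:.2f} s")
```

The implied latency of about 7.28 s is in line with the 7.29 s benchmark duration reported for Test 1.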

https://gentlecold.top/20250619/vllm-test/
Author: GentleCold
Published on June 19, 2025