1 Environment Setup

corex-docker-installer-4.2.0-10.2-ubuntu20.04-py3.10-x86_64.run

https://gitee.com/121786404/corex_deepseek_vllm_test

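The BI150 is a single card carrying two chips. In the rest of this document, an MR100 can be regarded as one of the two chips on a BI150.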
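The chip-to-card relationship can be sketched as a small calculation. The helper name `bi150_cards_for` is hypothetical, and the sketch assumes each tensor-parallel rank occupies exactly one MR100 chip, which the section headings (for example 4*MR100 paired with --tensor_parallel_size 4) suggest but do not state outright.

```shell
# Hypothetical helper: number of BI150 cards needed for a given number of
# MR100 chips, assuming every BI150 carries exactly two MR100 chips.
bi150_cards_for() {
  local chips="$1"
  # Round up so that an odd chip count still reserves a whole card.
  echo $(( (chips + 1) / 2 ))
}

bi150_cards_for 2   # chips used with --tensor_parallel_size 2
bi150_cards_for 4   # chips used with --tensor_parallel_size 4
bi150_cards_for 8   # chips used with --tensor_parallel_size 8
```

Under that assumption, the 2, 4, and 8 chip configurations used below would correspond to 1, 2, and 4 BI150 cards.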

2 Model Sources

https://www.modelscope.cn/Qwen/QwQ-32B.git
https://www.modelscope.cn/Qwen/QwQ-32B-AWQ.git

3 Performance Testing

3.1 QwQ-32B

3.1.1 Server Launch Command

vllm serve /data/QwQ-32B \
--trust_remote_code --tensor_parallel_size 4 \
--max_model_len 40960 \
--disable_log_requests --disable_log_stats --port 9997
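Loading a 32B model takes a while, so the client command should only be started once the server is accepting requests. The sketch below polls the server until it responds. The function name `wait_for_vllm` and its defaults are hypothetical; it assumes `curl` is installed and that this vLLM build exposes the `/health` endpoint of the OpenAI-compatible server on the port given above (9997).

```shell
# Hypothetical helper: poll the server's /health endpoint until it answers
# successfully or the attempts run out.
# Arguments: host, port, maximum attempts, seconds to sleep between attempts.
wait_for_vllm() {
  local host="${1:-127.0.0.1}"
  local port="${2:-9997}"
  local attempts="${3:-120}"
  local delay="${4:-5}"
  local i=1
  while [ "$i" -le "$attempts" ]; do
    # --fail makes curl return non-zero on HTTP errors as well as on
    # connection failures, so only a healthy server ends the loop early.
    if curl --silent --fail --output /dev/null "http://${host}:${port}/health"; then
      return 0
    fi
    sleep "$delay"
    i=$(( i + 1 ))
  done
  return 1
}

# Example use before running the benchmark client:
#   wait_for_vllm 127.0.0.1 9997 && echo "server ready"
```

If `/health` turns out not to be available in this environment, any endpoint that only answers after the model has loaded could be substituted.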

3.1.2 Client Benchmark Command

python3 benchmark_serving.py \
--model /data/QwQ-32B \
--dataset-name random \
--random-input-len 1024 \
--random-output-len 1024 \
--num-prompts 1 \
--trust-remote-code \
--ignore-eos \
--port 9997

3.2 QwQ-32B-AWQ

3.2.1 Server Launch Command

vllm serve /data/QwQ-32B-AWQ \
--trust_remote_code --tensor_parallel_size 2 \
--max_model_len 40960 \
--disable_log_requests --disable_log_stats --quantization awq --port 9997

3.2.2 Client Benchmark Command

python3 benchmark_serving.py \
--model /data/QwQ-32B-AWQ \
--dataset-name random \
--random-input-len 1024 \
--random-output-len 1024 \
--num-prompts 1 \
--trust-remote-code \
--ignore-eos \
--port 9997

4 Real-World Usage Testing

4.1 QwQ-32B

4.1.1 4*MR100

vllm serve /data/QwQ-32B \
--trust_remote_code --tensor_parallel_size 4 \
--max_model_len 40960 \
--disable_log_stats \
--disable_log_requests --port 9997

python3 benchmark_serving.py \
--model /data/QwQ-32B \
--dataset-name random \
--random-input-len 16384 \
--random-output-len 16384 \
--num-prompts 1 \
--trust-remote-code \
--port 9997
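The long-context runs in this section request 16384 input tokens and up to 16384 output tokens against a server started with --max_model_len 40960. A quick check that the two lengths fit inside the context window can be sketched as follows. The function name `fits_context` is hypothetical, and the check treats the random dataset's lengths as exact token counts, which is an approximation.

```shell
# Hypothetical helper: succeed when input + output tokens fit within the
# server's maximum model length.
# Arguments: input length, output length, max model length (all in tokens).
fits_context() {
  local input_len="$1"
  local output_len="$2"
  local max_model_len="$3"
  [ $(( input_len + output_len )) -le "$max_model_len" ]
}

if fits_context 16384 16384 40960; then
  echo "fits: $(( 16384 + 16384 )) of 40960 tokens"
else
  echo "too long: $(( 16384 + 16384 )) exceeds 40960 tokens"
fi
```

With these values the request totals 32768 tokens, leaving 8192 tokens of headroom. Because the commands in this section omit --ignore-eos, the model may also stop before reaching the requested output length.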

4.1.2 8*MR100

vllm serve /data/QwQ-32B \
--trust_remote_code --tensor_parallel_size 8 \
--max_model_len 40960 \
--disable_log_requests --disable_log_stats --port 9997

python3 benchmark_serving.py \
--model /data/QwQ-32B \
--dataset-name random \
--random-input-len 16384 \
--random-output-len 16384 \
--num-prompts 1 \
--trust-remote-code \
--port 9997

4.2 QwQ-32B-AWQ

4.2.1 2*MR100

vllm serve /data/QwQ-32B-AWQ \
--trust_remote_code --tensor_parallel_size 2 \
--max_model_len 40960 \
--disable_log_stats \
--disable_log_requests --quantization awq --port 9997

python3 benchmark_serving.py \
--model /data/QwQ-32B-AWQ \
--dataset-name random \
--random-input-len 16384 \
--random-output-len 16384 \
--num-prompts 1 \
--trust-remote-code \
--port 9997

4.2.2 4*MR100

vllm serve /data/QwQ-32B-AWQ \
--trust_remote_code --tensor_parallel_size 4 \
--max_model_len 40960 \
--disable_log_requests --disable_log_stats --quantization awq --port 9997