Start the container environment
docker run -it --rm \
  --gpus all \
  --ipc=host \
  -v /data/models:/vllm-workspace/nim/.cache/models \
  -p 5000:5000 \
  --entrypoint /bin/bash \
  vllm-bge-m3:v0.1.0
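Before launching anything inside the container, it is worth confirming that the GPUs and the mounted weights are actually visible; a minimal sanity check (the bge-m3 directory name is taken from the serve command in the next step):

# Confirm the GPUs are visible inside the container
nvidia-smi
# Confirm the mounted weights (host path /data/models, per the -v flag above)
ls /vllm-workspace/nim/.cache/models/bge-m3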
Run the model
vllm serve /vllm-workspace/nim/.cache/models/bge-m3 \
  --served-model-name bge-m3 \
  --host 0.0.0.0 \
  --port 5000 \
  --tensor-parallel-size 1 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.2
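Before benchmarking, a single request against vLLM's OpenAI-compatible embeddings endpoint confirms the server is answering; a minimal sketch, assuming you are on the host and port 5000 is published as in the docker run above:

# Smoke-test the embeddings endpoint; the response should contain an "embedding" vector
curl http://localhost:5000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "bge-m3", "input": "A quick smoke-test sentence."}'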
Benchmarking workflow
1. Pull and run the benchmarking image
docker run -it --net=host --gpus=all \
  -v /data/models:/workspace/models \
  -v /data/data_set_test:/workspace/data_testset \
  nvcr.io/nvidia/tritonserver:25.01-py3-sdk
Note: the -v flags mount the model weights into the benchmarking container; the model's tokenizer is needed later.
Documentation: https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/perf_analyzer/genai-perf/README.html
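Inside the SDK container, it can help to verify that the tokenizer files are reachable at the mounted path before running genai-perf; a quick check (assumes the bge-m3 weights sit directly under /data/models on the host):

# tokenizer.json / tokenizer_config.json etc. should be listed here
ls /workspace/models/bge-m3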
2. Run the benchmark commands
Note: chat models and embeddings models are benchmarked slightly differently, and the parameters change accordingly.
Embeddings model benchmarking
Run the benchmark inside the container
- Set environment variables:
export CONCURRENCY=10
export MODEL=bge-m3
- Prepare the data:
cat <<EOF > embeddings_inputs.jsonl
{"text": "This is a sample input sentence for embedding test number 1."}
{"text": "Here is another example input for embeddings profiling."}
{"text": "How many tokens will this embedding request generate?"}
{"text": "Benchmarking embedding model throughput and latency under load."}
{"text": "Final example: measure concurrency effect for the embeddings endpoint."}
EOF
- Run the benchmark:
genai-perf profile \
  -m $MODEL \
  --endpoint-type embeddings \
  --service-kind openai \
  -u 10.10.207.16:5000 \
  --concurrency $CONCURRENCY \
  --batch-size-text 1 \
  --tokenizer /workspace/models/$MODEL \
  --input-file /workspace/data_testset/l_dataset_z.json.json \
  -v
- Without specifying a dataset (synthetic inputs):
genai-perf profile \
  -m $MODEL \
  --endpoint-type embeddings \
  --service-kind openai \
  -u 10.10.207.16:5000 \
  --concurrency $CONCURRENCY \
  --synthetic-input-tokens-mean 1000 \
  --synthetic-input-tokens-stddev 0 \
  --output-tokens-mean 1000 \
  --output-tokens-stddev 0 \
  --batch-size-text 1 \
  --tokenizer /workspace/models/$MODEL \
  -v
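When a run finishes, genai-perf prints a summary table and also exports the raw measurements; by default they land under an artifacts/ directory in the working directory (exact subdirectory and file names vary with model, endpoint type, and genai-perf version):

# Inspect the exported measurements
ls artifacts/
find artifacts -name "*genai_perf*"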
Finally, a complete script that runs the benchmark and returns the results:
#!/bin/bash
# run_embeddings_benchmark.sh
# Generates the input file and runs the embeddings benchmark at several concurrency levels.
# -------------------------
# Configuration
# -------------------------
MODEL="bge-m3"
HOST="10.10.207.16:5000"
BATCH_SIZE=1
CONCURRENCY_LIST=(10 20 50 100)
INPUT_FILE="embeddings_inputs.jsonl"
# Tokenizer path as seen from inside the Triton SDK container
# (/data/models on the host is mounted at /workspace/models)
TOKENIZER_PATH="/workspace/models/$MODEL"
# -------------------------
# Generate input data
# -------------------------
cat <<EOF > $INPUT_FILE
{"text": "This is a sample input sentence for embedding test number 1."}
{"text": "Here is another example input for embeddings profiling."}
{"text": "How many tokens will this embedding request generate?"}
{"text": "Benchmarking embedding model throughput and latency under load."}
{"text": "Final example: measure concurrency effect for the embeddings endpoint."}
EOF
echo "输入文件 $INPUT_FILE 已生成。"
# -------------------------
# Run the benchmark loop
# -------------------------
for CONCURRENCY in "${CONCURRENCY_LIST[@]}"; do
  echo "=============================="
  echo "Starting benchmark at concurrency $CONCURRENCY..."
  genai-perf profile \
    -m $MODEL \
    --endpoint-type embeddings \
    --service-kind openai \
    -u $HOST \
    --concurrency $CONCURRENCY \
    --batch-size-text $BATCH_SIZE \
    --tokenizer $TOKENIZER_PATH \
    --input-file $INPUT_FILE \
    -v
  echo "Concurrency $CONCURRENCY benchmark complete."
  echo "=============================="
done
echo "所有压测完成。"