Benchmarking bge-m3 with GenAI-Perf

Run the container image

docker run -it --rm \
  --gpus all \
  --ipc=host \
  -v /data/models:/vllm-workspace/nim/.cache/models \
  -p 5000:5000 \
  --entrypoint /bin/bash \
  vllm-bge-m3:v0.1.0
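Once inside the container, a quick sanity check (an added step, not part of the original flow) confirms the GPUs are visible and the weights were mounted correctly:

# Confirm GPU visibility and the mounted model directory.
nvidia-smi
ls /vllm-workspace/nim/.cache/models/bge-m3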

Run the model

vllm serve /vllm-workspace/nim/.cache/models/bge-m3 \
  --served-model-name bge-m3 \
  --host 0.0.0.0 \
  --port 5000 \
  --tensor-parallel-size 1 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.2
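Before benchmarking, it is worth confirming the endpoint responds. A minimal check, assuming vLLM's OpenAI-compatible /v1/embeddings route is enabled (adjust the address to your deployment):

curl -s http://10.10.207.16:5000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "bge-m3", "input": "hello world"}'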

Benchmarking workflow

1. Pull the benchmarking image

docker run -it --net=host --gpus=all \
  -v /data/models:/workspace/models \
  -v /data/data_set_test:/workspace/data_testset \
  nvcr.io/nvidia/tritonserver:25.01-py3-sdk

Note: the -v mounts expose the model weights to the benchmarking tool; the model's tokenizer is needed in later steps.

Documentation: https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/perf_analyzer/genai-perf/README.html
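The SDK image ships with genai-perf preinstalled, so no extra setup should be needed; a quick check inside the container verifies the CLI is available:

genai-perf --help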

2. Run the benchmark commands

Note: chat models and embedding models are benchmarked slightly differently, and the parameters change as well; a rough chat sketch follows below.
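For reference, a chat-style run differs mainly in the endpoint type and the output-token controls. This is a sketch, not from the original text, and assumes a chat-capable model is being served under the same $MODEL name:

genai-perf profile \
  -m $MODEL \
  --endpoint-type chat \
  --service-kind openai \
  -u 10.10.207.16:5000 \
  --concurrency $CONCURRENCY \
  --synthetic-input-tokens-mean 200 \
  --output-tokens-mean 100 \
  --tokenizer /workspace/models/$MODEL \
  -v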

Benchmarking the embeddings model

Run the benchmark inside the container

  1. Set the environment variables:

export CONCURRENCY=10
export MODEL=bge-m3

  2. Prepare the data:

cat <<EOF > embeddings_inputs.jsonl
{"text": "This is a sample input sentence for embedding test number 1."}
{"text": "Here is another example input for embeddings profiling."}
{"text": "How many tokens will this embedding request generate?"}
{"text": "Benchmarking embedding model throughput and latency under load."}
{"text": "Final example: measure concurrency effect for the embeddings endpoint."}
EOF

  3. Run the benchmark:

genai-perf profile \
  -m $MODEL \
  --endpoint-type embeddings \
  --service-kind openai \
  -u 10.10.207.16:5000 \
  --concurrency $CONCURRENCY \
  --batch-size-text 1 \
  --tokenizer /workspace/models/$MODEL \
  --input-file /workspace/data_testset/l_dataset_z.json.json \
  -v

  4. Without a dataset (synthetic inputs):

genai-perf profile \
  -m $MODEL \
  --endpoint-type embeddings \
  --service-kind openai \
  -u 10.10.207.16:5000 \
  --concurrency $CONCURRENCY \
  --synthetic-input-tokens-mean 1000 \
  --synthetic-input-tokens-stddev 0 \
  --output-tokens-mean 1000 \
  --output-tokens-stddev 0 \
  --batch-size-text 1 \
  --tokenizer /workspace/models/$MODEL \
  -v
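After each run, genai-perf prints a summary table (latency percentiles, throughput) to the console and also writes the raw results under an artifacts/ directory in the working directory. The exact layout may vary by version; on recent releases something like this shows the exports:

ls artifacts/
cat artifacts/*/profile_export_genai_perf.csv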

Finally, a complete script that runs the full sweep and collects the results:

#!/bin/bash
# run_embeddings_benchmark.sh
# Generate the input file and run embeddings benchmarks at several concurrency levels.

# -------------------------
# Configuration
# -------------------------
MODEL="bge-m3"
HOST="10.10.207.16:5000"
BATCH_SIZE=1
CONCURRENCY_LIST=(10 20 50 100)
INPUT_FILE="embeddings_inputs.jsonl"
# Tokenizer path as mounted inside the SDK container (see the docker run above).
TOKENIZER_PATH="/workspace/models/$MODEL"

# -------------------------
# Generate input data
# -------------------------
cat <<EOF > $INPUT_FILE
{"text": "This is a sample input sentence for embedding test number 1."}
{"text": "Here is another example input for embeddings profiling."}
{"text": "How many tokens will this embedding request generate?"}
{"text": "Benchmarking embedding model throughput and latency under load."}
{"text": "Final example: measure concurrency effect for the embeddings endpoint."}
EOF

echo "Input file $INPUT_FILE generated."

# -------------------------
# Benchmark loop
# -------------------------
for CONCURRENCY in "${CONCURRENCY_LIST[@]}"; do
  echo "=============================="
  echo "Starting benchmark at concurrency $CONCURRENCY ..."
  genai-perf profile \
    -m "$MODEL" \
    --endpoint-type embeddings \
    --service-kind openai \
    -u "$HOST" \
    --concurrency "$CONCURRENCY" \
    --batch-size-text "$BATCH_SIZE" \
    --tokenizer "$TOKENIZER_PATH" \
    --input-file "$INPUT_FILE" \
    -v
  echo "Concurrency $CONCURRENCY benchmark complete."
  echo "=============================="
done

echo "All benchmarks complete."