vLLM¶
Under construction 👷..
Installation ¶
> uv venv --python 3.12 --seed
> source .venv/bin/activate
> uv pip install vllm --torch-backend=auto
If the install later fails with:

RuntimeError: Failed to find C compiler. Please specify via CC environment variable or set triton.knobs.build.impl.

install a C toolchain (on Linux, the build-essential package) and triton:

sudo apt update
sudo apt install build-essential
pip install triton
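To confirm the environment is usable before serving anything, a minimal sanity check (nothing here is specific to any model):

# Sanity check: vLLM imports cleanly and a CUDA device is visible.
import torch
import vllm

print("vllm version:", vllm.__version__)
print("cuda available:", torch.cuda.is_available())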
Usage ¶
logger¶
vllm_logging_config.json
{
  "formatters": {
    "vllm": {
      "class": "vllm.logging_utils.NewLineFormatter",
      "datefmt": "%m-%d %H:%M:%S",
      "format": "%(levelname)s %(asctime)s %(filename)s:%(lineno)d] %(message)s"
    }
  },
  "handlers": {
    "vllm": {
      "class": "logging.StreamHandler",
      "formatter": "vllm",
      "level": "DEBUG",
      "stream": "ext://sys.stdout"
    },
    "file": {
      "class": "logging.FileHandler",
      "formatter": "vllm",
      "level": "DEBUG",
      "filename": "/path/to/debug.log"
    }
  },
  "loggers": {
    "vllm": {
      "handlers": ["vllm", "file"],
      "level": "DEBUG",
      "propagate": false
    },
    "vllm.example_noisy_logger": {
      "propagate": false
    }
  },
  "version": 1
}
Then run:
VLLM_LOGGING_CONFIG_PATH=vllm_logging_config.json \
vllm serve /path/to/model
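With this config, any logger under the vllm namespace propagates up to the vllm logger's two handlers (stdout and the log file), except subtrees that opt out with propagate: false. A minimal sketch of the effect (the child logger names below are illustrative):

import logging

# Propagates up to the "vllm" logger -> stdout + /path/to/debug.log
logging.getLogger("vllm.engine").debug("visible in both handlers")

# propagate: false and no handlers of its own -> records are dropped
logging.getLogger("vllm.example_noisy_logger").debug("never shown")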
Understanding the source ¶
tricks¶
LMCache¶
LMCache - vLLM
sequenceDiagram
    participant Client
    participant Proxy
    participant Prefiller
    participant Decoder
    Client->>Proxy: Send request
    Proxy->>Prefiller: Forward to prefiller (GPU 0)
    Prefiller->>Prefiller: Generate KV cache
    Prefiller->>Decoder: Transfer KV cache
    Decoder->>Decoder: Run decode (GPU 1)
    Decoder->>Proxy: Return result
    Proxy->>Client: Send response
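In this disaggregated setup, prefill and decode run in separate vLLM instances on different GPUs: the prefiller computes the KV cache once and ships it to the decoder, which continues generation without redoing the prefill.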
> uv venv --python 3.12 --seed
> source .venv/bin/activate
> uv pip install vllm --torch-backend=auto
> uv pip install lmcache
> uv pip install nixl
> uv pip install vllm
> uv pip install pandas
> uv pip install datasets
cd vllm/examples/others/lmcache/disagg_prefill_lmcache_v1
Examples ¶
Online inference ¶
vllm serve Salesforce/blip2-opt-2.7b \
--host 0.0.0.0 \
--port 8080 \
--dtype auto \
--max-num-seqs 32 \
--max-model-len 2048 \
--tensor-parallel-size 2 \
--trust-remote-code
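Once the server is up it speaks the OpenAI-compatible API, so any OpenAI client can query it. A minimal sketch, assuming the serve command above (port 8080 on localhost; the prompt is illustrative):

from openai import OpenAI

# vLLM ignores the API key by default; any placeholder works.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="EMPTY")

completion = client.completions.create(
    model="Salesforce/blip2-opt-2.7b",
    prompt="Question: What is vLLM? Answer:",
    max_tokens=64,
)
print(completion.choices[0].text)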
Offline inference ¶
Single-prompt and batched multi-prompt inference
from vllm import LLM
from PIL import Image

llm = LLM(model="Salesforce/blip2-opt-2.7b")

# See the HuggingFace model card for the expected prompt format
prompt = "USER: <image>\nWhat is the content of this image?\nASSISTANT:"

# Load the image with PIL
image = Image.open("/root/autodl-tmp/dataset/coco-val2017/000000001490.jpg")

# Single-prompt inference
outputs = llm.generate({
    "prompt": prompt,
    "multi_modal_data": {"image": image},
})

for o in outputs:
    generated_text = o.outputs[0].text
    print(generated_text)

# Batched inference
image_1 = Image.open("/root/autodl-tmp/dataset/coco-val2017/000000001490.jpg")
image_2 = Image.open("/root/autodl-tmp/dataset/coco-val2017/000000581317.jpg")
outputs = llm.generate(
    [
        {
            "prompt": "USER: <image>\nWhat is the content of this image?\nASSISTANT:",
            "multi_modal_data": {"image": image_1},
        },
        {
            "prompt": "USER: <image>\nWhat's the color of this image?\nASSISTANT:",
            "multi_modal_data": {"image": image_2},
        },
    ]
)
for o in outputs:
    generated_text = o.outputs[0].text
    print(generated_text)
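To control decoding, generate also accepts a SamplingParams object. A short sketch continuing from the code above (the parameter values are illustrative):

from vllm import SamplingParams

params = SamplingParams(temperature=0.2, top_p=0.95, max_tokens=64)
outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    sampling_params=params,
)
print(outputs[0].outputs[0].text)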
Offline inference with Salesforce/blip2-opt-2.7b
import argparse

from PIL import Image
from vllm import LLM, EngineArgs, SamplingParams
from vllm.multimodal.image import convert_image_mode


def run_blip2(questions: list[str], modality: str) -> tuple:
    assert modality == "image"
    prompts = [f"Question: {question} Answer:" for question in questions]
    engine_args = EngineArgs(
        model="Salesforce/blip2-opt-2.7b",
        limit_mm_per_prompt={modality: 1},
    )
    return engine_args, prompts


def main():
    parser = argparse.ArgumentParser(
        description="Run BLIP-2 model with custom image and questions"
    )
    parser.add_argument(
        "--image-path",
        type=str,
        required=True,
        help="Path to the input image",
    )
    parser.add_argument(
        "--questions",
        type=str,
        nargs="+",
        required=True,
        help="One or more questions about the image",
    )
    parser.add_argument(
        "--seed",
        type=int,
        default=42,
        help="Random seed",
    )
    args = parser.parse_args()

    # Load and process image
    image = Image.open(args.image_path)
    image = convert_image_mode(image, "RGB")

    # Get model configuration
    engine_args, prompts = run_blip2(args.questions, "image")
    engine_args.seed = args.seed

    # Initialize model
    llm = LLM(**vars(engine_args))

    # Set sampling parameters
    sampling_params = SamplingParams(temperature=0.2, max_tokens=64)

    # Prepare input (only the first question is used)
    inputs = {
        "prompt": prompts[0],
        "multi_modal_data": {"image": image},
    }

    # Generate response
    outputs = llm.generate(inputs, sampling_params=sampling_params)

    # Print results
    print("-" * 50)
    print(f"Image: {args.image_path}")
    print(f"Question: {args.questions[0]}")
    print("Answer:")
    print(outputs[0].outputs[0].text)
    print("-" * 50)


if __name__ == "__main__":
    main()
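Assuming the script is saved as run_blip2.py (the filename is illustrative), it can be invoked as:

python run_blip2.py --image-path /root/autodl-tmp/dataset/coco-val2017/000000001490.jpg --questions "What is the content of this image?"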
LMCache configuration ¶
tree
configs/
disagg_proxy_server.py
disagg_vllm_launcher.sh
launch.sh
kill.sh
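Judging by the file names, configs/ holds the per-instance configuration files, disagg_vllm_launcher.sh starts the prefiller and decoder vLLM instances, disagg_proxy_server.py is the proxy from the sequence diagram above, and launch.sh / kill.sh bring the whole setup up and tear it down.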