HuggingFace

Install huggingface-cli

pip install -U "huggingface_hub[cli]"

Login with token

huggingface-cli login --token <your_token>
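As an alternative to persisting the token on disk with huggingface-cli login, huggingface_hub also reads the token from the HF_TOKEN environment variable. A minimal per-process sketch (the token value is a placeholder):

```python
# Supply the token per-process instead of writing it to disk.
# huggingface_hub picks up HF_TOKEN automatically; "<your_token>" is a placeholder.
import os

os.environ["HF_TOKEN"] = "<your_token>"
```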

vLLM

vLLM - Documentation

Install

pip install vllm

There can be CUDA/PyTorch compatibility issues, so it may help to install PyTorch first (e.g. for CUDA 12.2):

# https://pytorch.org/get-started/previous-versions/
pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1

Install vLLM against the existing PyTorch:

git clone https://github.com/vllm-project/vllm.git
cd vllm
python use_existing_torch.py
pip install -r requirements-build.txt
pip install -e . --no-build-isolation

Serving

Download model

How to quickly download large HuggingFace models

huggingface-cli download --resume-download Qwen/Qwen2.5-1.5B --local-dir /path/to/your/directory/Qwen/Qwen2.5-1.5B
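If direct downloads are slow, huggingface-cli honors the HF_ENDPOINT environment variable, so it can be pointed at a mirror before running the download command above (the mirror URL below is only an example):

```shell
# Point huggingface_hub / huggingface-cli at a mirror endpoint (example URL).
export HF_ENDPOINT=https://hf-mirror.com
```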

Serve downloaded model

vllm serve /path/to/your/directory/Qwen/Qwen2.5-1.5B --served-model-name Qwen2.5-1.5B

The --served-model-name flag sets the name that clients pass in the "model" field of API requests, as in the curl test below.

# test
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen2.5-1.5B",
        "prompt": "San Francisco is a",
        "max_tokens": 7,
        "temperature": 0
    }'
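The same request can be issued from Python using only the standard library; this sketch assumes the server started by vllm serve above is listening on localhost:8000:

```python
import json
import urllib.request

# Same payload as the curl test above.
payload = {
    "model": "Qwen2.5-1.5B",
    "prompt": "San Francisco is a",
    "max_tokens": 7,
    "temperature": 0,
}

req = urllib.request.Request(
    "http://localhost:8000/v1/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Uncomment once the server is running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["text"])
```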