HuggingFace

Install huggingface-cli

pip install -U "huggingface_hub[cli]"

Login with token

huggingface-cli login --token <your_token>
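As an alternative to persisting the token on disk with huggingface-cli login, huggingface_hub also reads the token from the HF_TOKEN environment variable. A minimal per-process sketch (the token value is a placeholder):

```python
# Supply the token per-process instead of writing it to disk.
# huggingface_hub picks up HF_TOKEN automatically; "<your_token>" is a placeholder.
import os

os.environ["HF_TOKEN"] = "<your_token>"
```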

vLLM

vLLM - Documentation

Install

pip install vllm

There can be CUDA/PyTorch compatibility issues, so it may help to install PyTorch first (e.g. for CUDA 12.2):

# https://pytorch.org/get-started/previous-versions/
pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1

Install vLLM against the existing PyTorch:

git clone https://github.com/vllm-project/vllm.git
cd vllm
python use_existing_torch.py
pip install -r requirements-build.txt
pip install -e . --no-build-isolation

Serving

Download model

How to quickly download large HuggingFace models

huggingface-cli download --resume-download Qwen/Qwen2.5-1.5B --local-dir /path/to/your/directory/Qwen/Qwen2.5-1.5B
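If direct downloads are slow, huggingface-cli honors the HF_ENDPOINT environment variable, so it can be pointed at a mirror before running the download command above (the mirror URL below is only an example):

```shell
# Point huggingface_hub / huggingface-cli at a mirror endpoint (example URL).
export HF_ENDPOINT=https://hf-mirror.com
```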

Serve downloaded model

vllm serve /path/to/your/directory/Qwen/Qwen2.5-1.5B --served-model-name Qwen2.5-1.5B

The --served-model-name flag sets the name that clients pass in the "model" field of API requests, as in the curl test below.

# test
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen2.5-1.5B",
        "prompt": "San Francisco is a",
        "max_tokens": 7,
        "temperature": 0
    }'
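The same request can be issued from Python using only the standard library; this sketch assumes the server started by vllm serve above is listening on localhost:8000:

```python
import json
import urllib.request

# Same payload as the curl test above.
payload = {
    "model": "Qwen2.5-1.5B",
    "prompt": "San Francisco is a",
    "max_tokens": 7,
    "temperature": 0,
}

req = urllib.request.Request(
    "http://localhost:8000/v1/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Uncomment once the server is running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["text"])
```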