Aliyun

## ECS

### Mount a data disk on Alibaba Cloud

Initialize the data disk (Linux):

```shell
yum update
yum install -y e2fsprogs
mkfs -t ext4 /dev/vdb
mkdir /data
mount /dev/vdb /data
sh -c "echo `sudo blkid /dev/vdb | awk '{print \$2}' | sed 's/\"//g'` /data ext4 defaults 0 0 >> /etc/fstab"
```

## EMR on ECS

Components:

- Hadoop-Common
- YARN (port: 8088)
- OSS-HDFS
- Hive (port: 10000)
- Hudi
- Iceberg
- Paimon
- Spark3 (port: 18080)
- Gateway

Deploy a Gateway: use EMR-CLI to deploy a custom Gateway environment, then submit jobs through the cluster's Gateway node.

```shell
emrcli gateway deploy \
  --clusterId c-b2a8a74c4d44c537 \
  --appNames YARN,HIVE,HUDI,ICEBERG,SPARK3
```

## Serverless EMR

- DLF
- VVP Flink SQL with a DLF Paimon Catalog
- Serverless Spark with a DLF Paimon Catalog
- Serverless StarRocks with a DLF Paimon Catalog
- ...
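The fstab one-liner above builds the entry from `blkid` output. As an illustration of the text processing involved, here is the same pipeline run against a sample `blkid` line (the UUID is made up; the real one varies per disk):

```shell
# Sample blkid output for /dev/vdb (illustrative UUID)
line='/dev/vdb: UUID="3b07d92f-7d8a-4f6e-9a2c-1c2d3e4f5a6b" TYPE="ext4"'

# awk grabs the second field (UUID="..."), sed strips the quotes,
# and the mount point plus options are appended to form the fstab entry
entry="$(echo "$line" | awk '{print $2}' | sed 's/"//g') /data ext4 defaults 0 0"
echo "$entry"
# -> UUID=3b07d92f-7d8a-4f6e-9a2c-1c2d3e4f5a6b /data ext4 defaults 0 0
```

Note that `$2` only holds the UUID when it is the second space-separated field, which is the common `blkid` layout.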

January 1, 2000

Argo

## Argo Workflow

References:

- Argo Workflows - User Guide
- Argo Workflow tutorial

### Install CLI

```shell
# Detect OS
ARGO_OS="darwin"
if [[ $(uname -s) != "Darwin" ]]; then
  ARGO_OS="linux"
fi

# Download the binary
curl -sLO "https://github.com/argoproj/argo-workflows/releases/download/v3.6.5/argo-$ARGO_OS-amd64.gz"

# Unzip
gunzip "argo-$ARGO_OS-amd64.gz"

# Make binary executable
chmod +x "argo-$ARGO_OS-amd64"

# Move binary to path
mv "./argo-$ARGO_OS-amd64" /usr/local/bin/argo

# Test installation
argo version
```

### CLI

```shell
argo

# with kubeconfig
argo --kubeconfig ~/.kube/config list
```

### Workflow

```shell
# list workflows
argo list
argo list --namespace=argo
argo list --running
```

### Template

```shell
argo template list
```

### Cluster
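To have something to run the CLI commands against, a minimal workflow manifest helps. This is a sketch based on the stock hello-world example from the Argo docs (the image and template names are the standard sample, not from this post):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: hello-world-
spec:
  entrypoint: main
  templates:
    - name: main
      container:
        image: busybox
        command: [echo]
        args: ["hello world"]
```

Submit and watch it with `argo submit --namespace argo --watch hello-world.yaml`, then it should show up in `argo list`.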


Infra Overview

## Frameworks

- awesome-production-machine-learning
- Github - AI Performance Engineering

### Resource Management

- Kubernetes
- Kubeflow
- Kuberay
- Open Platform for AI (OpenPAI)
- Slurm: Simple Linux Utility for Resource Management
- skypilot
- apptainer
- MPI
- Ray
- Others: submarine, metaflow, runhouse, genv

### Distributed Training

- Torch Run — communication backends: Gloo, MPI, NCCL
- TensorFlow PS-Worker
- Horovod
- DeepSpeed
- Megatron-LM
- Colossal AI

### Inference

- Accelerate
- vLLM
- sglang

### Development

- pytorch lightning: organizes PyTorch code to remove boilerplate and unlock scalability
- mlflow: experiment management
- wandb: experiment tracking
- pycaret: a Python wrapper around several machine learning libraries and frameworks such as scikit-learn, XGBoost, LightGBM, CatBoost, Optuna, Hyperopt, Ray, and a few more

## Key Questions

- How to version training datasets
- How to move data between object storage and high-performance storage

## Dataloader Optimizations

- Github - webdataset
- Github - tfrecord
- Github - spdl
- Meta - Introducing SPDL: Faster AI model training with thread-based data loading
- Nvidia - DALI


Kueue

## Install Kueue

Documentation - Installation

```shell
# To install a released version of Kueue in your cluster by kubectl
kubectl apply --server-side -f https://github.com/kubernetes-sigs/kueue/releases/download/v0.13.4/manifests.yaml

# Add metrics scraping for prometheus-operator
kubectl apply --server-side -f https://github.com/kubernetes-sigs/kueue/releases/download/v0.13.4/prometheus.yaml

# kubectl apply --server-side -f https://github.com/kubernetes-sigs/kueue/releases/download/v0.13.4/kueueviz.yaml
```

## Admin Kueue

Documentation - Administer Cluster Quotas

Create a namespace:

```shell
kubectl create namespace my-kueue
```

Create a resource flavor:

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: "default-flavor"
```

Create a cluster queue:

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "cluster-queue"
spec:
  namespaceSelector: {} # match all.
  resourceGroups:
  - coveredResources: ["cpu", "memory"]
    flavors:
    - name: "default-flavor"
      resources:
      - name: "cpu"
        nominalQuota: 128
      - name: "memory"
        nominalQuota: 512Gi
```

Create a local queue:

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  namespace: "my-kueue"
  name: "user-queue"
spec:
  clusterQueue: "cluster-queue"
```

## User

Submit a job ...
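A job submitted to the `user-queue` above needs the `kueue.x-k8s.io/queue-name` label; Kueue then admits it against the cluster queue's quota. A sketch along the lines of the Kueue docs sample (the image and resource requests here are illustrative placeholders):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  generateName: sample-job-
  namespace: my-kueue
  labels:
    kueue.x-k8s.io/queue-name: user-queue
spec:
  suspend: true          # Kueue unsuspends the job once it is admitted
  template:
    spec:
      containers:
        - name: main
          image: busybox
          command: ["sleep", "60"]
          resources:
            requests:
              cpu: 1
              memory: "200Mi"
      restartPolicy: Never
```

Apply it with `kubectl create -f sample-job.yaml` and check admission with `kubectl -n my-kueue get workloads`.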


LLM Serving

## HuggingFace

Install huggingface-cli:

```shell
pip install -U "huggingface_hub[cli]"
```

Login with token:

```shell
huggingface-cli login --token <your_token>
```

## vLLM

vLLM - Documentation

### Install

```shell
pip install vllm
```

There may be CUDA/PyTorch compatibility issues; you can install PyTorch first (e.g. for CUDA 12.2):

```shell
# https://pytorch.org/get-started/previous-versions/
pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1
```

Install vLLM with the existing PyTorch:

```shell
git clone https://github.com/vllm-project/vllm.git
cd vllm
python use_existing_torch.py
pip install -r requirements-build.txt
pip install -e . --no-build-isolation
```

### Serving

Download a model:

- How to quickly download large HuggingFace models
- HuggingFace - Qwen2.5-1.5B
- HuggingFace - DeepSeek-R1

```shell
huggingface-cli download --resume-download Qwen/Qwen2.5-1.5B --local-dir /path/to/your/directory/Qwen/Qwen2.5-1.5B
```

Serve the downloaded model ...
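The truncated serving step can be sketched with vLLM's OpenAI-compatible server; the model path carries over from the download command above, and the port is vLLM's default:

```shell
# Serve the downloaded model (OpenAI-compatible API on port 8000)
vllm serve /path/to/your/directory/Qwen/Qwen2.5-1.5B --port 8000

# In another shell, query the completions endpoint
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "/path/to/your/directory/Qwen/Qwen2.5-1.5B", "prompt": "Hello", "max_tokens": 16}'
```

When serving from a local directory, the model name in the request body defaults to the path passed to `vllm serve`.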


Ray

## Documents

- Documentation

## Install

```shell
pip install ray

# with client
pip install ray[client]
```

## Architecture

Client Mode: Ray Client

## Deploy

### Kubernetes

#### Kuberay

Github - Kuberay

Kuberay is a Kubernetes operator; it defines custom resource definitions (CRDs) for RayCluster, RayJob and RayService.

```shell
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update

# Install both CRDs and KubeRay operator v1.3.0.
helm install kuberay-operator kuberay/kuberay-operator --version 1.3.0
```

#### RayCluster

Create a ray cluster:

```shell
# deploy a ray cluster
helm install raycluster kuberay/ray-cluster --version 1.3.0

# check ray cluster
kubectl get rayclusters

# get pods
kubectl get pods --selector=ray.io/cluster=raycluster-kuberay
```

Forward ray cluster ports ...
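The truncated port-forwarding step can be sketched as follows. The head service name assumes the `raycluster` Helm release naming used above, and 8265 is the default port of the Ray dashboard and job-submission API:

```shell
# Forward the dashboard/job-submission port from the head service
kubectl port-forward service/raycluster-kuberay-head-svc 8265:8265

# In another shell, submit a job against the forwarded port
ray job submit --address http://localhost:8265 -- \
  python -c "import ray; ray.init(); print(ray.cluster_resources())"
```

The dashboard is then reachable at http://localhost:8265 as well.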


Slurm

## Install

References:

- Quick Start Administrator Guide
- Installing the Slurm job scheduler from scratch on Ubuntu
- Github - slurm_ubuntu_gpu_cluster

Install slurm-wlm:

```shell
sudo apt install slurm-wlm slurm-wlm-doc -y

# check version
slurmd -V
```

Configure — either online via Configuration Tool - Easy Version, or serve the bundled configurator:

```shell
cd /usr/share/doc/slurmctld
chmod +r slurm-wlm-configurator.html
python3 -m http.server
```

Open http://<ip>:8000 and fill in:

- ClusterName: the cluster name
- SlurmUser: root is fine
- Compute Machines: the output of `slurmd -C`, e.g. ``

After clicking submit, copy the generated content into /etc/slurm/slurm.conf.

Create the spool directories:

```shell
sudo mkdir /var/spool/slurmd
sudo mkdir /var/spool/slurmctld
```

Configure cgroup (for the error `cgroup namespace 'freezer' not mounted. aborting`) by writing the following file: /etc/slurm/cgroup.conf ...
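Once slurm.conf and cgroup.conf are in place, a quick smoke test might look like this (service names are those shipped by the slurm-wlm packages; a single-node setup is assumed):

```shell
# Restart the daemons after editing the config
sudo systemctl restart slurmctld slurmd

# Check node and partition state
sinfo

# Run a trivial one-node job through the scheduler
srun -N1 hostname
```

If `sinfo` shows the node as `down` or `drain`, `scontrol show node` usually explains why.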


Storage

## Object Storage

### rclone

Github - rclone

Install:

```shell
sudo -v ; curl https://rclone.org/install.sh | sudo bash
```

Config:

```ini
[s3]
type = s3
provider = Other
access_key_id = ***
secret_access_key = ***
endpoint = http://<host>:80

[oss]
type = s3
provider = Alibaba
env_auth = false
access_key_id = ***
secret_access_key = ***
endpoint = oss-cn-wulanchabu-internal.aliyuncs.com
```

Usage:

```shell
# list dir
rclone --config conf/rclone.conf lsd s3:bucket_name/path/to/dir/

# list files
rclone --config conf/rclone.conf ls s3:bucket_name/path/to/dir/

# copy, copy file to dest dir
rclone --config conf/rclone.conf copy s3:bucket_name/source/file s3:bucket_name/dest/

# copyto, copy file to dest file
rclone --config conf/rclone.conf copyto s3:bucket_name/source/file s3:bucket_name/dest/file

# sync
rclone --config conf/rclone.conf sync s3:bucket_name/source/ s3:bucket_name/dest/

# delete
rclone --config conf/rclone.conf delete s3:bucket_name/path/to/file
```

Params ...
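A few rclone flags commonly combined with the sync command above — a hedged sketch, limited to flags that exist in rclone proper:

```shell
# Preview a sync without changing anything, with more parallelism and progress output
rclone --config conf/rclone.conf sync s3:bucket_name/source/ s3:bucket_name/dest/ \
  --dry-run \
  --transfers 8 \
  --checkers 16 \
  --progress
```

`--dry-run` is worth keeping in muscle memory: `sync` deletes files in the destination that are absent from the source.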
