Infra Overview

Frameworks: awesome-production-machine-learning; Github - AI Performance Engineering

Resource Management: Kubernetes, Kubeflow, Kuberay, Open Platform for AI (OpenPAI), Slurm (Simple Linux Utility for Resource Management), skypilot, apptainer, MPI, Ray
Others: submarine, metaflow, runhouse, genv

Distributed Training: Torch Run; communication backends (Gloo, MPI, NCCL); Tensorflow PS Work; Horovod; DeepSpeed; Megatron-LM; Colossal AI

Inference: Accelerate, vLLM, sglang

Development:
  pytorch lightning: organizes PyTorch code to remove boilerplate and unlock scalability
  mlflow: experiment management
  wandb: experiment tracking
  pycaret: a Python wrapper around several machine learning libraries and frameworks such as scikit-learn, XGBoost, LightGBM, CatBoost, Optuna, Hyperopt, Ray, and a few more

Key Questions:
  How to manage versions of training datasets
  How to move data between object storage and high-performance storage

Dataloader Optimizations: Github - webdataset; Github - tfrecord; Github - spdl; Meta - Introducing SPDL: Faster AI model training with thread-based data loading; Nvidia - DALI

January 1, 2000

Java

Java

Install

    sudo apt-get update
    sudo apt install openjdk-11-jdk

Hello world

    public class Test {
        public static void main(String[] args) {
            System.out.println("Hello world!");
        }
    }

Select a JDK version

    export JAVA_HOME="/Library/Java/JavaVirtualMachines/jdk1.8.0_171.jdk/Contents/Home"
    export PATH=$JAVA_HOME/bin:$PATH:.

Compiling and packaging

Run a class

    # Compile the Java source file
    javac Test.java    # generates Test.class
    # Run the compiled class
    java Test          # runs Test.class
    # Run with an explicit classpath
    java -classpath yourClassPath Test
    java -cp yourClassPath Test
    # The classpath defaults to the current directory; separate multiple paths
    # with colons on Linux/macOS (semicolons on Windows)
    java -cp .:yourClassPath:yourClassPath2 Test

Separate source files from compiled files

    mkdir src && mv *.java src/
    javac -d ./classes src/*.java    # -d sets the output directory for class files
    cd ./classes
    java Test

Packages

    # Add a package declaration as the first line of the file (run from src/)
    printf 'package here.there;\n%s\n' "$(cat Test.java)" > Test.java
    # Create directories that follow the package path
    mkdir -p here/there && mv Test.java here/there/
    javac -d ../classes here/there/Test.java
    cd ../classes
    java here.there.Test    # the package path is required at run time

JAR packaging and running

Working with Manifest Files: The Basics ...

January 1, 2000

Javascript

Basic

    var a = 10;
    let b = 20;      // not accessible outside its scope; prefer let where possible
    const c = 30;

             Scope             Hoisting                            Re-declaration
    var      Function-scoped   Yes, undefined if not initialized   Yes
    let      Block-scoped      No, must be declared before use     No
    const    Block-scoped      No, must be declared before use     No

String

    let s = "Hello, world"
    s.substring(1,4)         // => "ell": the 2nd, 3rd, and 4th characters
    s.slice(1,4)             // => "ell": same thing
    s.slice(-3)              // => "rld": last 3 characters
    s.split(", ")            // => ["Hello", "world"]
    s.indexOf("l")           // => 2: position of first letter l
    s.startsWith("Hell")     // => true: the string starts with these
    s.includes("or")         // => true: s includes substring "or"
    s.replace("llo", "ya")   // => "Heya, world"
    s.toLowerCase()          // => "hello, world"
    s.charAt(0)              // => "H": the first character
    "x".padStart(3)          // => "  x": add spaces on the left to a length of 3

Template

    let name = "Bill"
    let greeting = `Hello ${ name }.`    // greeting == "Hello Bill."

Pattern Matching

    let text = "testing: 1, 2, 3"
    let pattern = /\d+/g
    pattern.test(text)
    text.search(pattern)
    text.match(pattern)
    text.replace(pattern, "#")
    text.split(/\D+/)

Type conversion

    // Number to string
    let n = 17
    let s = n.toString()
    // String to number
    parseInt("3 blind mice")    // => 3
    parseFloat(".1")            // => 0.1
    // If the value is a string, wrap it in quotes, otherwise, convert
    (typeof value === "string") ? "'" + value + "'" : value.toString()

Objects

    let square = {
        area: function() { return this.side * this.side; },
        side: 10
    };
    // is equivalent to
    let square = {
        area() { return this.side * this.side; },
        side: 10
    };

Arrays

    // forEach
    let data = [1, 2, 3, 4, 5], sum = 0
    data.forEach(value => { sum += value; })
    // map
    let a = [1, 2, 3]
    a.map(x => x*x)
    // filter
    let a = [5, 4, 3, 2, 1]
    a.filter(x => x < 3)
    // find and findIndex
    let a = [1, 2, 3, 4, 5]
    a.findIndex(x => x === 3)
    a.find(x => x % 5 === 0)
    // every and some
    a.every(x => x < 10)
    a.some(isNaN)
    // reduce
    a.reduce((x,y) => x+y, 0)
    // flat and flatMap
    [1, [2, 3]].flat()
    let phrases = ["hello world", "the definitive guide"]
    let words = phrases.flatMap(phrase => phrase.split(" "))
    // concat
    let a = [1,2,3];
    a.concat(4, 5)
    // stack and queue
    let stack = []
    stack.push(1,2)
    stack.pop()
    let q = []
    q.push(1,2)
    q.shift()
    // subarrays
    let a = [1, 2, 3, 4, 5, 6, 7, 8]
    a.slice(0,3)
    a.splice(4)
    // fill
    let a = new Array(5);
    a.fill(0)
    // indexOf
    let a = [0, 1, 2, 1, 0]
    a.indexOf(1)
    // includes
    let a = [1, true, 3, NaN]
    a.includes(true)
    // sort
    let a = ["banana", "cherry", "apple"]
    a.sort()
    // reverse
    a.reverse()
    // to string
    let a = [1, 2, 3]
    a.join(" ")
    [1,2,3].toString()

Iteration

Iterating over lists ...

January 1, 2000

JVM

JVM

Concepts

Memory areas: program counter, virtual machine stack, native method stack, heap, method area, runtime constant pool, direct memory

Garbage collection
  Algorithms: reference counting, reachability analysis
  Allocation and collection strategy: Young, Old

Collectors

When to choose SerialGC, ParallelGC over CMS, G1 in Java?
  Serial: mainly for single-CPU machines.
  Parallel: uses multiple GC threads to process the heap and performs a stop-the-world pause during any GC.
  CMS: designed to eliminate the long pauses associated with the full GC of the parallel and serial collectors.
  G1: a low-pause, server-style collector, mainly for large heaps (> 4 GB). ...
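The difference between the two liveness algorithms listed above can be sketched in a few lines. This is a toy model in Python, not how any JVM is implemented: the object graph, the object names, and the `reachable` helper are all hypothetical. It illustrates why tracing from GC roots reclaims a reference cycle that pure reference counting would never free.

```python
from collections import deque


def reachable(roots, edges):
    """Return every object reachable from the GC roots (a mark phase)."""
    seen = set(roots)
    queue = deque(roots)
    while queue:
        obj = queue.popleft()
        for ref in edges.get(obj, ()):
            if ref not in seen:
                seen.add(ref)
                queue.append(ref)
    return seen


# Hypothetical heap: "a" is held by a root; "b" and "c" only reference each other.
edges = {"a": [], "b": ["c"], "c": ["b"]}
roots = ["a"]

# Reference counting: count incoming references (a root counts as one reference).
ref_counts = {obj: 0 for obj in edges}
for root in roots:
    ref_counts[root] += 1
for refs in edges.values():
    for ref in refs:
        ref_counts[ref] += 1

live = reachable(roots, edges)
garbage_by_reachability = set(edges) - live
garbage_by_refcount = {obj for obj, count in ref_counts.items() if count == 0}

print(sorted(garbage_by_reachability))  # the cycle b <-> c is unreachable
print(sorted(garbage_by_refcount))      # but neither count ever drops to zero
```

In this model the cycle is garbage under reachability analysis while reference counting reports nothing to collect.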

January 1, 2000

Kafka

Docker

Install

    docker pull wurstmeister/zookeeper
    docker run -d --name zookeeper -p 2181:2181 -t wurstmeister/zookeeper
    docker pull wurstmeister/kafka
    docker run -d --name kafka -p 9092:9092 \
        -e KAFKA_BROKER_ID=0 \
        -e KAFKA_ZOOKEEPER_CONNECT=${host}:2181 \
        -e KAFKA_ADVERTISED_LISTENERS=PLAINTEXT://${host}:9092 \
        -e KAFKA_LISTENERS=PLAINTEXT://0.0.0.0:9092 \
        wurstmeister/kafka

Usage

    docker exec -it kafka /bin/bash
    cd /opt/kafka_2.11-2.0.0/
    # producer
    ./bin/kafka-console-producer.sh --broker-list localhost:9092 --topic mykafka
    # consumer
    ./bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic mykafka --from-beginning

Configuration

    auto.create.topics.enable=false/true
    num.partitions=1
    default.replication.factor=1

Listeners configuration

References:
  Kafka 2.1 Documentation - 3.1 Broker Configs
  Kafka listeners and advertised.listeners: splitting internal and external network traffic
  Kafka from Getting Started to Practice - Kafka Cluster: Kafka Listeners

Internal-network access to Kafka only

    listeners=PLAINTEXT://inner_ip:9092

or configure SASL ...

January 1, 2000

Kubernetes

Kubernetes

kubectl

Install

    # x86-64
    curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
    sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl

Configure kubeconfig to access a cluster

Access to a k8s cluster is set up by configuring clusters, contexts, and users.

    # View the kubeconfig configuration
    kubectl config view

Viewing, finding resources

    # List namespaces
    kubectl get namespaces
    # Get commands with basic output
    kubectl get services                 # List all services in the namespace
    kubectl get pods --all-namespaces    # List all pods in all namespaces
    kubectl get pods -o wide             # List all pods in the current namespace, with more details
    kubectl get deployment my-dep        # List a particular deployment
    kubectl get pods                     # List all pods in the namespace
    kubectl get pod my-pod -o yaml       # Get a pod's YAML
    # Describe commands with verbose output
    kubectl describe nodes my-node
    kubectl describe pods my-pod
    # print logs
    kubectl logs my-pod
    # get service
    kubectl get services
    # get ingress
    kubectl get ingress

metrics ...
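How the three kubeconfig sections fit together can be sketched as follows. This is an illustrative model only: the kubeconfig is written as a Python dict rather than YAML, all names, addresses, and the token are made up, and `resolve_context` is a hypothetical helper, not part of kubectl or any client library. A context simply names one cluster and one user, and `current-context` picks the context to use.

```python
# Hypothetical kubeconfig contents (normally stored as YAML in ~/.kube/config).
kubeconfig = {
    "current-context": "dev",
    "clusters": [
        {"name": "dev-cluster", "cluster": {"server": "https://10.0.0.1:6443"}},
    ],
    "users": [
        {"name": "dev-user", "user": {"token": "REDACTED"}},
    ],
    "contexts": [
        {
            "name": "dev",
            "context": {
                "cluster": "dev-cluster",
                "user": "dev-user",
                "namespace": "default",
            },
        },
    ],
}


def resolve_context(config, context_name=None):
    """Look up the server, user, and namespace that a context points to."""
    name = context_name or config["current-context"]
    context = next(c["context"] for c in config["contexts"] if c["name"] == name)
    cluster = next(c["cluster"] for c in config["clusters"] if c["name"] == context["cluster"])
    user = next(u["user"] for u in config["users"] if u["name"] == context["user"])
    return {
        "server": cluster["server"],
        "user": context["user"],
        "credentials": user,
        "namespace": context.get("namespace", "default"),
    }


target = resolve_context(kubeconfig)
print(target["server"], target["user"], target["namespace"])
```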

January 1, 2000

Kueue

Install

Kueue Documentation - Installation

    # To install a released version of Kueue in your cluster by kubectl
    kubectl apply --server-side -f https://github.com/kubernetes-sigs/kueue/releases/download/v0.13.4/manifests.yaml
    # Add metrics scraping for prometheus-operator
    kubectl apply --server-side -f https://github.com/kubernetes-sigs/kueue/releases/download/v0.13.4/prometheus.yaml
    # kubectl apply --server-side -f https://github.com/kubernetes-sigs/kueue/releases/download/v0.13.4/kueueviz.yaml

Admin

Kueue Documentation - Administer Cluster Quotas

Create namespace

    kubectl create namespace my-kueue

Create resource flavor

    apiVersion: kueue.x-k8s.io/v1beta1
    kind: ResourceFlavor
    metadata:
      name: "default-flavor"

Create cluster queue

    apiVersion: kueue.x-k8s.io/v1beta1
    kind: ClusterQueue
    metadata:
      name: "cluster-queue"
    spec:
      namespaceSelector: {} # match all.
      resourceGroups:
      - coveredResources: ["cpu", "memory"]
        flavors:
        - name: "default-flavor"
          resources:
          - name: "cpu"
            nominalQuota: 128
          - name: "memory"
            nominalQuota: 512Gi

Create local queue

    apiVersion: kueue.x-k8s.io/v1beta1
    kind: LocalQueue
    metadata:
      namespace: "my-kueue"
      name: "user-queue"
    spec:
      clusterQueue: "cluster-queue"

User

Submit a job ...

January 1, 2000

Lakehouse

Hudi

PySpark

    tableName = "table_name"
    basePath = "oss://bucket/user/hive/warehouse"
    hudi_options = {
        'hoodie.table.name': tableName,
        'hoodie.datasource.write.partitionpath.field': 'pt'
    }
    spark_df.write.format("hudi"). \
        options(**hudi_options). \
        mode("overwrite"). \
        save(basePath)

Iceberg

TODO

Paimon

Catalog

A catalog is an abstraction that manages the table of contents and metadata.
  filesystem metastore (default): stores both metadata and table files in filesystems.
  hive metastore: additionally stores metadata in Hive metastore. Users can directly access the tables from Hive.
  jdbc metastore: additionally stores metadata in relational databases such as MySQL, Postgres, etc.
  rest metastore: designed to provide a lightweight way to access any catalog backend from a single client.

Bucket

  Each bucket contains its own LSM tree and its changelog files (covering INSERT, UPDATE, and DELETE).
  A bucket is the smallest storage unit for reads and writes, so the number of buckets caps the maximum processing parallelism.
  The recommended data size per bucket is roughly 200 MB to 1 GB.

References: ...
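The bucket sizing guidance above turns into simple arithmetic. The sketch below is an assumption-laden illustration, not Paimon code: `suggest_bucket_count` and `assign_bucket` are hypothetical helpers, the 512 MiB target is just a value picked from inside the recommended range, and `zlib.crc32` stands in for whatever hash function the engine really uses.

```python
import math
import zlib

MIB = 1024 ** 2
GIB = 1024 ** 3


def suggest_bucket_count(total_bytes, target_bucket_bytes=512 * MIB):
    """Estimate how many buckets keep each one near the target size."""
    return max(1, math.ceil(total_bytes / target_bucket_bytes))


def assign_bucket(key, num_buckets):
    """Map a primary key to a bucket; crc32 is only a stand-in hash (assumption)."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets


table_bytes = 100 * GIB                    # hypothetical table size
num_buckets = suggest_bucket_count(table_bytes)
print(num_buckets)                         # 100 GiB / 512 MiB per bucket
print(assign_bucket("user-42", num_buckets))
```

Because a bucket is the unit of parallelism, the resulting count is also an upper bound on how many parallel readers or writers can be kept busy in this model.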

January 1, 2000

Libraries

Task scheduling

schedule

Install

    pip install schedule

Usage

    import time

    import schedule

    # add schedule job
    schedule.every(10).seconds.do(lambda: print("running"))
    # run scheduler
    while True:
        schedule.run_pending()
        time.sleep(1)

Add a job with parameters

    def func(name: str):
        print(f"My name is {name}")

    schedule.every(5).seconds.do(func, name="Tom")
    while True:
        schedule.run_pending()
        time.sleep(1)

Apscheduler

Install

    pip install apscheduler

Triggers: the logic that decides when a job fires
  cron: fires on a cron-style schedule
  interval: fires at a fixed time interval
  date: fires once at a fixed date
  combine: fires on a combination of conditions

Scheduler
  BlockingScheduler: blocking; use it when the scheduler is the only thing the program runs
  BackgroundScheduler: the scheduler runs in the background

Executor
  ThreadPoolExecutor: the default, thread-based executor
  ProcessPoolExecutor: a process-based executor, suitable for CPU-bound jobs

Job store: scheduling information kept in memory is lost when the program exits; use another job store for persistence
  MemoryJobStore: the default, in-memory store
  SQLAlchemyJobStore
  MongoDBJobStore
  etc.

Create a scheduler ...
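The interval and date triggers described above boil down to computing a next fire time. The sketch below uses only the standard library and is not APScheduler's implementation or API: `next_interval_fire` and `next_date_fire` are hypothetical helpers that show the idea.

```python
from datetime import datetime, timedelta


def next_interval_fire(start, interval, now):
    """Next fire time strictly after `now` for a trigger that starts at `start`."""
    if now < start:
        return start
    periods_elapsed = (now - start) // interval   # whole intervals already passed
    return start + (periods_elapsed + 1) * interval


def next_date_fire(run_date, now):
    """A date trigger fires once: return the date if still ahead, else None."""
    return run_date if now <= run_date else None


start = datetime(2000, 1, 1, 12, 0, 0)
now = datetime(2000, 1, 1, 12, 0, 25)

print(next_interval_fire(start, timedelta(seconds=10), now))
print(next_date_fire(datetime(2000, 1, 1, 13, 0, 0), now))
print(next_date_fire(datetime(2000, 1, 1, 11, 0, 0), now))
```

A scheduler loop would then sleep until the earliest of these times and hand the job to an executor.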

January 1, 2000

LLM

Ideas

Three levels:
  1. Use ChatGPT to do the job.
  2. Build tools that streamline the workflow of using ChatGPT.
  3. Improve the model itself to do the job.

LLMs learn by induction; humans can learn by deduction.

Learning

Papers
  Recent Advances in Natural Language Processing via Large Pre-Trained Language Models: A Survey

Articles
  Tracing the Origins of GPT-3.5's Capabilities
  Generative AI exists because of the transformer

Prompt Engineering

Prompting Principles

Principle 1: Write clear and specific instructions
  Use delimiters to clearly indicate distinct parts of the input
  Ask for a structured output
  Ask the model to check whether conditions are satisfied
  "Few-shot" prompting

Principle 2: Give the model time to "think"
  Specify the steps required to complete a task
  Instruct the model to work out its own solution before rushing to a conclusion

Iterative Prompt Development ...
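Three of the tactics under Principle 1 (delimiters, structured output, and a condition check) can be combined in one prompt. The sketch below only builds the prompt string and parses a reply; it calls no model API, and `build_prompt`, `parse_reply`, the `####` delimiter, and the JSON shape are all assumptions chosen for illustration.

```python
import json


def build_prompt(text, delimiter="####"):
    """Wrap untrusted input in delimiters and ask for a JSON answer."""
    return (
        f"Summarize the text delimited by {delimiter} in one sentence.\n"
        "If the text contains nothing to summarize, "
        'reply with {"summary": null}.\n'
        'Respond only with JSON of the form {"summary": "<string>"}.\n'
        f"{delimiter}\n{text}\n{delimiter}"
    )


def parse_reply(reply):
    """Parse the structured reply; raises ValueError if it is not valid JSON."""
    return json.loads(reply)["summary"]


prompt = build_prompt("Transformers replaced recurrence with attention.")
print(prompt)

# A hand-written stand-in for a model reply, used only to exercise the parser.
example_reply = '{"summary": "Attention replaced recurrence."}'
print(parse_reply(example_reply))
```

Asking for JSON makes the reply machine-checkable, and the delimiters keep the input text separate from the instructions.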

January 1, 2000