Tech Sharing by 蚂蚁蚂蚁

sklearn-Autograd

sklearn
Preprocessing
    import numpy as np
    from sklearn.preprocessing import (
        LabelBinarizer, LabelEncoder, MultiLabelBinarizer, OneHotEncoder)
    # features (OneHotEncoder) vs. labels (LabelBinarizer)
    data = ['cold', 'cold', 'warm', 'cold', 'hot', 'hot', 'warm', 'cold', 'warm', 'hot']
    LabelEncoder().fit_transform(data)
    # -> array([0, 0, 2, 0, 1, 1, 2, 0, 2, 1])
    # scikit-learn >= 1.2 names this argument sparse_output; older releases use sparse=False
    OneHotEncoder(sparse_output=False).fit_transform(np.array(data).reshape(-1, 1))
    # -> array([[1., 0., 0.],
    #           [1., 0., 0.],
    #           ...,
    #           [0., 1., 0.]])
    LabelBinarizer().fit_transform(data)
    # -> array([[1, 0, 0],
    #           [1, 0, 0],
    #           ...,
    #           [0, 1, 0]])
    # one label per column vs. a set of labels per sample
    data = [["US", "M"], ["UK", "M"], ["FR", "F"]]
    OneHotEncoder(sparse_output=False).fit_transform(data)
    # -> array([[0., 0., 1., 0., 1.],
    #           [0., 1., 0., 0., 1.],
    #           [1., 0., 0., 1., 0.]])
    MultiLabelBinarizer().fit_transform(data)
    # -> array([[0, 0, 1, 0, 1],
    #           [0, 0, 1, 1, 0],
    #           [1, 1, 0, 0, 0]])
Feature Selection
sklearn - Feature selection
Zhihu - A Comprehensive Summary of Feature Selection Methods
Filter Methods
Pearson’s Correlation ...
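As a rough illustration of the Pearson's Correlation filter method named above, the sketch below scores each feature by its correlation with the target and ranks features by absolute score. The feature names and toy values are hypothetical, and the helper `pearson` is written by hand here rather than taken from scikit-learn.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)

# Hypothetical toy data: three candidate features and one target.
features = {
    "f_linear": [1.0, 2.0, 3.0, 4.0, 5.0],
    "f_noise": [2.0, 1.0, 2.0, 1.0, 2.0],
    "f_inverse": [5.0, 4.0, 3.0, 2.0, 1.0],
}
target = [2.0, 4.0, 6.0, 8.0, 10.0]

# Filter step: score every feature independently of any model.
scores = {name: pearson(column, target) for name, column in features.items()}

# Rank by absolute correlation, so strong negative correlations also rank high.
ranked = sorted(scores, key=lambda name: abs(scores[name]), reverse=True)
print(scores)
print(ranked)
```

A filter like this only looks at one feature at a time and only captures linear relationships, which is why it is usually a first pass rather than the whole selection procedure.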

Slurm

Install
Quick Start Administrator Guide
Installing the Slurm Job Scheduling System from Scratch
Slurm-Ubuntu
Github - slurm_ubuntu_gpu_cluster
Install slurm-wlm
    sudo apt install slurm-wlm slurm-wlm-doc -y
    # check version
    slurmd -V
Configure
Online configuration: Configuration Tool - Easy Version, or serve the bundled configurator locally:
    cd /usr/share/doc/slurmctld
    chmod +r slurm-wlm-configurator.html
    python3 -m http.server
Open http://<ip>:8000 and fill in:
    ClusterName: the name of the cluster
    SlurmUser: can be root
    Compute Machines: can be the output of slurmd -C, for example ``
After clicking submit, copy the generated content and write it to /etc/slurm/slurm.conf
Create directories
    sudo mkdir /var/spool/slurmd
    sudo mkdir /var/spool/slurmctld
Configure cgroup (for the error "error: cgroup namespace ‘freezer’ not mounted. aborting")
Write the following files: /etc/slurm/cgroup.conf ...

Spark

Quick Start
    cd $SPARK_HOME
    ./bin/spark-submit examples/src/main/python/pi.py 10
    # ---output---
    # Pi is roughly 3.145440
Running Spark
YARN
cluster mode: the command submits the job and the client can go away once the application has been started; the Spark driver runs inside the YARN application master.
    ./bin/spark-submit --class path.to.your.Class --master yarn --deploy-mode cluster [options] <app jar> [app options]
    ./bin/spark-submit --class org.apache.spark.examples.SparkPi \
        --master yarn \
        --deploy-mode cluster \
        --driver-memory 4g \
        --executor-memory 2g \
        --executor-cores 1 \
        --queue default \
        examples/jars/spark-examples*.jar \
        10
client mode: the command submits the job and waits for it to finish; the Spark driver runs in the client process, and the application master is only used to request resources from YARN.
    ./bin/spark-submit --class path.to.your.Class --master yarn --deploy-mode client [options] <app jar> [app options]
Spark Connect
Spark Doc - Spark Connect Overview ...
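For context on the Quick Start output, the bundled pi.py example estimates pi by Monte Carlo sampling: it draws random points in a square and counts how many fall inside the unit circle. The sketch below reproduces that computation in plain Python on a single machine, without Spark; the function name `estimate_pi`, the fixed seed, and the sample count are assumptions made for illustration.

```python
import random

def estimate_pi(num_samples, seed=0):
    """Estimate pi from the fraction of random points that land in the unit circle."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    inside = 0
    for _ in range(num_samples):
        # Draw a point uniformly from the square [-1, 1] x [-1, 1].
        x = rng.random() * 2 - 1
        y = rng.random() * 2 - 1
        if x * x + y * y <= 1:
            inside += 1
    # Area of circle / area of square = pi / 4.
    return 4.0 * inside / num_samples

pi_estimate = estimate_pi(200_000)
print("Pi is roughly %f" % pi_estimate)
```

In the Spark version the loop body becomes a function mapped over a distributed range and the count is obtained with a reduce, so the argument passed on the command line (10 above) controls how many partitions share the sampling work.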

Spatial-Temporal

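People
Beihang University - Jiawei Jiang
SpaceTimeAI - UCL
Libraries
LibCity
Large Language Model for Graph Representation Learning (LLM4Graph)
Large (Language) Models and Foundation Models (LLM, LM, FM) for Time Series and Spatio-Temporal Data
Articles
Reflections on and Explorations of a Technical Framework for Large Transportation Models
Roundtable: What Really Distinguishes Large Transportation Models from Traditional AI?
OpenCity, a Large Model for Predicting Traffic Conditions: Strong Zero-Shot Performance, from HKU and Baidu
The 2nd "Spatial Data Intelligence Strategy Symposium" Successfully Held at the Beijing Friendship Hotel
An Explainer: Applications of Generative Techniques in Spatio-Temporal Data Mining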

sqlite

Install
    sudo apt update
    sudo apt install sqlite3
Usage
create/open a database
    sqlite3 data/sqlite.db
commands
    # exit
    .exit
    # show tables
    .tables
    # show schema
    .schema
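The same inspection can be done from Python with the standard-library sqlite3 module. This is a minimal sketch: it uses an in-memory database and a hypothetical `users` table instead of data/sqlite.db, and it queries the built-in sqlite_master table to get roughly what .tables and .schema print in the shell.

```python
import sqlite3

# ":memory:" keeps the example self-contained; pass a file path to open a real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("INSERT INTO users (name) VALUES (?)", ("alice",))
conn.commit()

# Rough equivalent of .tables: list table names.
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]

# Rough equivalent of .schema: the CREATE statements as stored by SQLite.
schemas = [row[0] for row in conn.execute(
    "SELECT sql FROM sqlite_master WHERE type = 'table' ORDER BY name")]

rows = conn.execute("SELECT id, name FROM users").fetchall()
conn.close()

print(tables)
print(schemas)
print(rows)
```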

StandardLib

Text Processing Services
re (regular expressions)
    import re
    # compile
    datepat = re.compile(r'\d+/\d+/\d+')
    # match (anchored at the start of the string)
    text1 = '11/27/2012'
    if datepat.match(text1):
        print('yes')
    # search
    text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
    datepat.findall(text)  # ['11/27/2012', '3/13/2013']
    # capture groups are usually used to pull out the parts
    datepat = re.compile(r'(\d+)/(\d+)/(\d+)')
    m = datepat.match('11/27/2012')
    print(m.group(0), m.group(1), m.group(2), m.group(3), m.groups())
    datepat.findall(text)  # [('11', '27', '2012'), ('3', '13', '2013')]
    # iterate over matches
    for m in datepat.finditer(text):
        print(m.groups())
    # for a one-off match or search there is no need to compile first
    re.findall(r'(\d+)/(\d+)/(\d+)', text)
    # substitute
    re.sub(r'(\d+)/(\d+)/(\d+)', r'\3-\1-\2', text)  # 'Today is 2012-11-27. PyCon starts 2013-3-13.'
    # named groups
    re.sub(r'(?P<month>\d+)/(?P<day>\d+)/(?P<year>\d+)', r'\g<year>-\g<month>-\g<day>', text)
Data Types
datetime
    from datetime import datetime
    a = datetime(2012, 9, 23)
    # datetime to string
    a.strftime('%Y-%m-%d')
    # string to datetime
    text = '2012-09-20'
    y = datetime.strptime(text, '%Y-%m-%d')
zoneinfo (3.9+)
    from datetime import datetime
    from zoneinfo import ZoneInfo
    # create a datetime object without a timezone
    naive_dt = datetime.now()
    # attach the timezone (the wall-clock time is kept, not converted)
    aware_dt = naive_dt.replace(tzinfo=ZoneInfo('Asia/Shanghai'))
    print(aware_dt)
collections
namedtuple
    from collections import namedtuple
    # namedtuple(typename, field_names)
    Point = namedtuple('Point', ['x', 'y'])
    p = Point(x=11, y=22)
    print(p.x + p.y)
deque
    from collections import deque
    d = deque(["a", "b", "c"])
    d.append("f")        # add to the right side
    d.appendleft("z")    # add to the left side
    e = d.pop()          # pop from the right side
    e = d.popleft()      # pop from the left side
    d = deque(maxlen=10) # bounded deque: when full, appending drops the item at the opposite end (FIFO)
Counter
collections - Container datatypes ...
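The excerpt stops at the Counter heading, so here is a small sketch of typical Counter usage under the same collections module. The word list is hypothetical; the calls shown (`most_common`, `update`, and indexing a missing key) are standard Counter behaviour.

```python
from collections import Counter

# Hypothetical input: count how often each word occurs.
words = ["red", "blue", "red", "green", "blue", "red"]
counts = Counter(words)

# The n most frequent items as (element, count) pairs.
top_two = counts.most_common(2)

# update() adds to the existing counts instead of replacing them.
counts.update(["green", "green"])

# A missing key returns 0 rather than raising KeyError.
missing = counts["purple"]

print(top_two)
print(counts["green"], missing)
```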

Storage

Object Storage
rclone
Github - rclone
install
    sudo -v ; curl https://rclone.org/install.sh | sudo bash
config
    [s3]
    type = s3
    provider = Other
    access_key_id = ***
    secret_access_key = ***
    endpoint = http://<host>:80
    [oss]
    type = s3
    provider = Alibaba
    env_auth = false
    access_key_id = ***
    secret_access_key = ***
    endpoint = oss-cn-wulanchabu-internal.aliyuncs.com
usage
    # list dir
    rclone --config conf/rclone.conf lsd s3:bucket_name/path/to/dir/
    # list files
    rclone --config conf/rclone.conf ls s3:bucket_name/path/to/dir/
    # copy: copy a file into the destination dir
    rclone --config conf/rclone.conf copy s3:bucket_name/source/file s3:bucket_name/dest/
    # copyto: copy a file to the destination file
    rclone --config conf/rclone.conf copyto s3:bucket_name/source/file s3:bucket_name/dest/file
    # sync
    rclone --config conf/rclone.conf sync s3:bucket_name/source/ s3:bucket_name/dest/
    # delete
    rclone --config conf/rclone.conf delete s3:bucket_name/path/to/file
params ...
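Because the rclone config file is INI-style, it can be sanity-checked before use with Python's standard configparser. The sketch below parses the config shown above from an inline string and reports any remote that lacks an expected key; the `REQUIRED_KEYS` checklist is an assumption for illustration, not an rclone requirement.

```python
import configparser

# Same content as the config section above, inlined to keep the example self-contained.
CONFIG_TEXT = """
[s3]
type = s3
provider = Other
access_key_id = ***
secret_access_key = ***
endpoint = http://<host>:80

[oss]
type = s3
provider = Alibaba
env_auth = false
access_key_id = ***
secret_access_key = ***
endpoint = oss-cn-wulanchabu-internal.aliyuncs.com
"""

# Hypothetical checklist of keys we expect every remote to define.
REQUIRED_KEYS = ("type", "provider", "access_key_id", "secret_access_key", "endpoint")

# interpolation=None so that a literal '%' in a secret is not treated specially.
parser = configparser.ConfigParser(interpolation=None)
parser.read_string(CONFIG_TEXT)

remotes = parser.sections()
missing = {
    name: [key for key in REQUIRED_KEYS if key not in parser[name]]
    for name in remotes
}

for name in remotes:
    print(name, parser[name]["provider"], "missing:", missing[name])
```

To check a real file, `parser.read("conf/rclone.conf")` could replace `read_string`, using the same path passed to `--config` in the usage commands.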

Superset

Deploy K8s Github - superset/values.yaml

Tensorflow

Basic
    import tensorflow as tf  # TensorFlow 1.x graph-mode API
    # each snippet assumes a fresh graph: call tf.reset_default_graph() between snippets,
    # otherwise tf.get_variable() raises because the variable name already exists
    # define constants
    a = tf.constant(1)
    b = tf.constant(2, name='const', shape=(3, 5), dtype=tf.int32)  # same dtype as a, so a + b is valid
    # define operations
    add = a + b
    mul = tf.square(tf.multiply(a, b))
    # define a placeholder
    p = tf.placeholder(tf.int32)
    # define variables
    x = tf.get_variable('x', [], dtype=tf.int32)
    y = tf.get_variable('y', shape=(3, 5), initializer=tf.constant_initializer(0))
    # session
    a = tf.constant(1)
    b = tf.constant(2)
    f = a + b
    sess = tf.Session()
    print(sess.run(f))  # constants do not need initialization
    x = tf.get_variable('x', [], dtype=tf.int32)
    f = x + a
    sess = tf.Session()
    sess.run(x.initializer)  # a variable must be initialized before it is run
    # fetch multiple items in a single sess.run() call instead of making multiple calls
    print(sess.run([x, f]))
    p = tf.placeholder(tf.int32)
    sess = tf.Session()
    print(sess.run(p, feed_dict={p: 2}))  # a placeholder must be fed when it is run
    # assignment
    # the assign node is not connected to the variable; it changes the variable's value as a side effect
    x = tf.get_variable('x', [], dtype=tf.int32)
    a = tf.constant(1)
    assign = tf.assign(x, a)
    sess = tf.Session()
    sess.run(assign)
    print(sess.run(x))
    # session as a with block
    x = tf.get_variable('x', shape=(3, 5), initializer=tf.constant_initializer(0))
    y = tf.get_variable('y', shape=(3, 5), initializer=tf.constant_initializer(0))
    f = x + y
    with tf.Session() as sess:
        x.initializer.run()
        y.initializer.run()
        print(f.eval())
    x = tf.get_variable('x', shape=(3, 5), initializer=tf.constant_initializer(0))
    y = tf.get_variable('y', shape=(3, 5), initializer=tf.constant_initializer(0))
    f = x * x
    g = f + y
    with tf.Session() as sess:
        x.initializer.run()
        y.initializer.run()
        f_val, g_val = sess.run([f, g])  # evaluate f and g in one run
        print(f_val, g_val)
    # interactive session
    x = tf.get_variable('x', [], initializer=tf.constant_initializer(3))
    f = x * x
    sess = tf.InteractiveSession()
    x.initializer.run()
    print(f.eval())
    # global initializer
    x = tf.get_variable('x', shape=[], initializer=tf.constant_initializer(1))
    y = tf.get_variable('y', shape=[], initializer=tf.constant_initializer(2))
    f = x + y
    init = tf.global_variables_initializer()
    with tf.Session() as sess:
        init.run()
        print(f.eval())
    # variable scope
    x = tf.get_variable('x', [], initializer=tf.constant_initializer(1))
    with tf.variable_scope('scope'):
        y = tf.get_variable('x', [], initializer=tf.constant_initializer(2))
    print(x.name, y.name)  # x:0 scope/x:0
    # get graph
    default_graph = tf.get_default_graph()
    print(default_graph)
    new_graph = tf.Graph()
    with new_graph.as_default():
        x = tf.Variable(2)
    print(x.graph)
Save and Load a Model
    # save a model
    x = tf.get_variable('x', [])
    y = tf.get_variable('y', [])
    init = tf.global_variables_initializer()
    # define the saver after every variable is defined
    saver = tf.train.Saver()
    sess = tf.Session()
    sess.run(init)
    # save the model after run
    saver.save(sess, './models/test-model')
    # load a model (in a fresh graph)
    x = tf.get_variable('x', [])
    y = tf.get_variable('y', [])
    saver = tf.train.Saver()
    sess = tf.Session()
    saver.restore(sess, './models/test-model')  # no need to run the initializer
    sess.run([x, y])
Optimization
    # autodiff
    x = tf.get_variable('x', [], initializer=tf.constant_initializer(3.))
    y = tf.get_variable('y', [], initializer=tf.constant_initializer(2.))
    f = y*x + x*x + 3*x
    grad = tf.gradients(f, [x, y])
    sess = tf.Session()
    sess.run(x.initializer)
    sess.run(y.initializer)
    print(sess.run(grad))  # returns [df/dx, df/dy] = [11.0, 3.0]
    # example: fit y = k * x + b by gradient descent
    import random
    k = tf.get_variable('k', [], initializer=tf.constant_initializer(0.))
    b = tf.get_variable('b', [], initializer=tf.constant_initializer(0.))
    init = tf.global_variables_initializer()
    x = tf.placeholder(tf.float32)
    y = tf.placeholder(tf.float32)
    y_pred = k * x + b
    loss = tf.square(y - y_pred)
    optimizer = tf.train.GradientDescentOptimizer(1e-3)
    train_op = optimizer.minimize(loss)
    sess = tf.Session()
    sess.run(init)
    true_k = random.random()
    true_b = random.random()
    for update_i in range(10000):
        input_data = random.random()
        output_data = true_k * input_data + true_b  # target generated from the true parameters
        _loss, _ = sess.run([loss, train_op], feed_dict={x: input_data, y: output_data})
        print(update_i, _loss)
    print('True parameter: k={}, b={}'.format(true_k, true_b))
    print('Learned parameter: k={}, b={}'.format(*sess.run([k, b])))
Tensorboard
    # fragment: mse, X, y, training_op, fetch_batch, epoch, n_batches and batch_size
    # come from the surrounding training script
    from datetime import datetime
    now = datetime.now().strftime("%Y%m%d%H%M%S")
    root_logdir = "tf_logs"
    logdir = "{}/run-{}/".format(root_logdir, now)
    # end of construction phase
    mse_summary = tf.summary.scalar('MSE', mse)
    file_writer = tf.summary.FileWriter(logdir, tf.get_default_graph())
    # execution phase (inside a `with tf.Session() as sess:` block, since eval() needs a default session)
    for batch_index in range(n_batches):
        X_batch, y_batch = fetch_batch(epoch, batch_index, batch_size)
        if batch_index % 10 == 0:
            summary_str = mse_summary.eval(feed_dict={X: X_batch, y: y_batch})
            step = epoch * n_batches + batch_index
            file_writer.add_summary(summary_str, step)
        sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
    # finally
    file_writer.close()
    # start the Tensorboard service (notebook shell escape)
    !tensorboard --logdir tf_logs/
Keras
sequential model
    import keras
    from keras.models import Sequential
    from keras.layers import Dense
    (x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
    y_train = keras.utils.to_categorical(y_train)
    model = Sequential()
    model.add(Dense(100, activation='relu', input_shape=(28*28,)))
    model.add(Dense(100, activation='relu'))
    model.add(Dense(10, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['acc'])
    history = model.fit(x_train.reshape(-1, 28*28), y_train, validation_split=0.1, batch_size=32, epochs=10)
general model
    import keras
    from keras.layers import Input, Dense
    from keras.models import Model
    (x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
    y_train = keras.utils.to_categorical(y_train)
    input_tensor = Input(shape=(28*28, ))
    x = Dense(100, activation='relu')(input_tensor)
    x = Dense(100, activation='relu')(x)
    output_tensor = Dense(10, activation='softmax')(x)
    model = Model(input_tensor, output_tensor)
    model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['acc'])
    history = model.fit(x_train.reshape(-1, 28*28), y_train, validation_split=0.1, batch_size=32, epochs=10)
Use Keras Layers in Tensorflow
    import tensorflow as tf
    from tensorflow import keras
    from tensorflow.keras.layers import Dense
    x = tf.placeholder(name='x', shape=(None, 28*28), dtype=tf.float32)
    hidden1 = Dense(100, activation='relu')(x)
    hidden2 = Dense(100, activation='relu')(hidden1)
    output = Dense(10)(hidden2)  # logits, no softmax
    y = tf.placeholder(tf.int64, shape=(None,))
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=output)
    loss = tf.reduce_mean(xentropy, name="loss")
    # train with tensorflow
    # ...
Use Keras Model in Tensorflow
    import tensorflow as tf
    from tensorflow import keras
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense
    model = Sequential()
    model.add(Dense(100, activation='relu', input_shape=(28*28,)))
    model.add(Dense(10, activation='softmax'))
    x = tf.placeholder(name='input', shape=(None, 28*28), dtype=tf.float32)
    y = model(x)
Check GPU
    # shell: check whether the installed tensorflow package is the GPU build
    pip list | grep tensorflow
    # standalone Keras with the TensorFlow backend (private helper)
    from keras import backend as K
    K.tensorflow_backend._get_available_gpus()
    # list the devices TensorFlow can see
    from tensorflow.python.client import device_lib
    print(device_lib.list_local_devices())
    # place ops on the GPU and log where each op runs
    import tensorflow as tf
    with tf.device('/gpu:0'):
        a = tf.constant([1., 2., 3.], shape=[1, 3], name='a')
        b = tf.constant([1., 2., 3.], shape=[3, 1], name='b')
        c = tf.matmul(a, b)
    with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
        print(sess.run(c))
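The autodiff snippet in the Optimization section differentiates f = y*x + x*x + 3*x at x = 3, y = 2. As a cross-check that does not need TensorFlow, the sketch below compares the hand-derived gradient (df/dx = y + 2x + 3, df/dy = x) with a central finite-difference approximation. The helper `numeric_grad` and the step size `eps` are assumptions chosen for illustration.

```python
def f(x, y):
    """Same function as in the autodiff snippet."""
    return y * x + x * x + 3 * x

def numeric_grad(func, x, y, eps=1e-6):
    """Central finite differences: approximate df/dx and df/dy at (x, y)."""
    dfdx = (func(x + eps, y) - func(x - eps, y)) / (2 * eps)
    dfdy = (func(x, y + eps) - func(x, y - eps)) / (2 * eps)
    return dfdx, dfdy

x0, y0 = 3.0, 2.0

# Hand-derived gradient: df/dx = y + 2x + 3, df/dy = x.
analytic = (y0 + 2 * x0 + 3, x0)
numeric = numeric_grad(f, x0, y0)

print("analytic:", analytic)
print("numeric: ", numeric)
```

The two results should agree to several decimal places, which is the kind of check that helps when a graph-mode gradient looks suspicious.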

Vim

References
    vimtutor
    Github - spf13-vim
    openvim - Tutorial
Modes
    Normal Mode: ESC
    Insert Mode: i
    Replace Mode: R
    Visual Mode: v
    Command Mode: :
Commands
    Quit: :q
    Save: :w
    Save and quit: :wq
    Quit without saving: :q!
    Open file for editing: :e {filename}
    Show open buffers: :ls
    Switch buffer: :b {filename}
        :b1: switch to buffer 1
    Close the current buffer: :bw
    Help: :help {topic}
        :help :w opens help for the :w command
        :help w opens help for the w movement
        Ctrl-W+Ctrl-W jumps from one window to another
    Shell command: :!
        :!ls: list files
    Retrieve from file: :r {filename}
    Retrieve from command: :r !ls
    Set option: :set {option}
        Ignore case: :set ic
        Turn an option off by prefixing "no": :set noic
        Highlight search: :set hls / :nohlsearch
        Clear the highlighting of the last search: :noh
    Command completion: Ctrl-D, or TAB
Normal mode
Movement
    Basic movement: hjkl (left, down, up, right)
    Words: w (next word), b (beginning of word), e (end of word)
    Lines: 0 (beginning of line), ^ (first non-blank character), $ (end of line)
    Screen: H (top of screen), M (middle of screen), L (bottom of screen)
    Scroll:
        Ctrl-U (up) / Ctrl-D (down): scroll half a page
        Ctrl-B (up) / Ctrl-F (down): scroll a page
    Percentage jump: 30%
    File: gg (beginning of file), G (end of file)
    Line numbers: :{number} or {number}G
    Find: f{character}, t{character}, F{character}, T{character}
        find/to forward/backward {character} on the current line
        , / ; for navigating matches
    Search: /{regex}
        n / N for navigating matches
        ?: search in the backward direction
        /ignore\c: ignore case
    Jump to the matching bracket: % (corresponding item)
Edits
    insert:
        i: insert
        a: append
        A: append at the end of line
    insert line below / above: o / O
    delete: d{motion}
        dw: delete word
        d$ / D: delete to end of line
        d0: delete to beginning of line
        dd: delete line
    change: c{motion}
    delete character: x
    substitute character: s
        :s/old/new/: substitute 'new' for the first 'old' in the line
        :s/old/new/g: substitute 'new' for 'old' globally in the line
        :s/old/new/gc: with a prompt for each match
        :5,10s/old/new/g: substitute in lines 5 to 10
    replace character: r
        R: enter Replace mode and overwrite characters continuously
    undo: u
    undo line: U
    redo: Ctrl-R
    copy (yank): y
    paste: p
    flip the case: ~
    code completion:
        Ctrl-N: forward
        Ctrl-P: backward
Counts
    3w: move 3 words forward
    5j: move 5 lines down
    7dw / d7w: delete 7 words
Misc
    Show the current position in the file: Ctrl-G
Visual Mode
    Visual: v
    Visual Line: V
    Visual Block: Ctrl-V
    Insert in Visual Mode: Shift-I
Windows
    Split window: :sp / :vsp
        :sp {filename}: split window with file
        Ctrl-W+s / Ctrl-W+v
    Move left/down/up/right: Ctrl-W+h/j/k/l
    Move to next window: Ctrl-W+w
Plugins
    Vim Awesome
    ctrlp.vim
    NERDTree