Object Storage

rclone

install

sudo -v ; curl https://rclone.org/install.sh | sudo bash

config

[s3]
type = s3
provider = Other
access_key_id = ***
secret_access_key = ***
endpoint = http://<host>:80

[oss]
type = s3
provider = Alibaba
env_auth = false
access_key_id = ***
secret_access_key = ***
endpoint = oss-cn-wulanchabu-internal.aliyuncs.com

usage

# list dir
rclone --config conf/rclone.conf lsd s3:bucket_name/path/to/dir/

# list files
rclone --config conf/rclone.conf ls s3:bucket_name/path/to/dir/

# copy: copy file into dest dir
rclone --config conf/rclone.conf copy s3:bucket_name/source/file s3:bucket_name/dest/

# copyto: copy file to the exact dest file path
rclone --config conf/rclone.conf copyto s3:bucket_name/source/file s3:bucket_name/dest/file

# sync
rclone --config conf/rclone.conf sync s3:bucket_name/source/ s3:bucket_name/dest/

# delete
rclone --config conf/rclone.conf delete s3:bucket_name/path/to/file

params

  • progress: show transfer progress
  • transfers: number of files to transfer in parallel
  • checkers: number of parallel checkers for pre-transfer checks (hash, size, etc.)
  • multi-thread-streams: number of streams used for multi-threaded transfer of a single file
  • buffer-size: size of the in-memory buffer used for read/write caching

rclone --config conf/rclone.conf copy s3:bucket_name/source/ s3:bucket_name/dest/ \
    --progress \
    --transfers 16 \
    --checkers 16 \
    --multi-thread-streams 32 \
    --buffer-size 256M

Issues

  1. With Provider = Alibaba, the force_path_style = true option has no effect: requests still use the virtual-hosted-style URL http://{bucket_name}.{endpoint} instead of the path-style http://{endpoint}/{bucket_name}. The workaround is to set Provider = Other.

    • Root cause: virtualHostStyle defaults to true in rclone (backend/s3/s3.go:3473); when Provider is Other it is set to false (backend/s3/s3.go:3653). As long as virtualHostStyle is not false, the opt.ForcePathStyle option never takes effect:
    // Path Style vs Virtual Host style
    if virtualHostStyle || opt.UseAccelerateEndpoint {
    	opt.ForcePathStyle = false
    }
    
  2. rclone copy local_file oss:bucket_name/path/to/file/ (uploading a single file) needs the --s3-no-check-bucket flag, otherwise it fails with a 409 bucket already exists error; uploading a directory with rclone copy local_dir/ oss:bucket_name/path/to/file/ works without it.

OSS

Python SDK

Usage:

from oss2 import Auth, Bucket

auth = Auth('your_access_key_id', 'your_access_key_secret')
bucket = Bucket(auth, 'http://oss-cn-hangzhou.aliyuncs.com', 'your_bucket_name')

# list objects
for obj in bucket.list_objects().object_list:
    print(obj.key)
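
A minimal upload/download sketch with the same Bucket object; the object keys and local paths below are placeholders:

# upload a local file and an in-memory string
bucket.put_object_from_file('path/to/remote.txt', '/local/path/file.txt')
bucket.put_object('path/to/hello.txt', b'hello oss')

# download to a local file / read into memory
bucket.get_object_to_file('path/to/remote.txt', '/local/path/copy.txt')
data = bucket.get_object('path/to/hello.txt').read()
print(data)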

fsspec

Usage

import ossfs
fs = ossfs.OSSFileSystem(endpoint='http://oss-cn-hangzhou.aliyuncs.com')
fs.ls('/dvc-test-anonymous/LICENSE')
with fs.open('/dvc-test-anonymous/LICENSE') as f:
    print(f.readline())
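
For non-public buckets, credentials can be passed the same way as in the Arrow example below; a small sketch (bucket name and paths are placeholders) that uploads and downloads whole files:

import ossfs

fs = ossfs.OSSFileSystem(
    endpoint='http://oss-cn-hangzhou.aliyuncs.com',
    key='your_access_key_id',
    secret='your_access_key_secret',
)

# upload / download whole files (fsspec-style API)
fs.put('/local/path/file.txt', '/your_bucket_name/path/to/file.txt')
fs.get('/your_bucket_name/path/to/file.txt', '/local/path/copy.txt')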

Arrow

Install:

pip install pyarrow

Load Data

import pyarrow.parquet as pq

table = pq.read_table('/path/to/table')
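
Writing a table back is symmetric; a trivial sketch with placeholder data and path:

import pyarrow as pa
import pyarrow.parquet as pq

# build a tiny table and write it to a local parquet file
table = pa.table({'id': [1, 2, 3], 'text': ['a', 'b', 'c']})
pq.write_table(table, '/path/to/table.parquet')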

Load data from OSS

import ossfs
import pyarrow.parquet as pq

OSS_ENDPOINT = 'http://oss-cn-wulanchabu-internal.aliyuncs.com'
OSS_BUCKET = 'bucket-name'
OSS_ACCESS_KEY = '***'
OSS_ACCESS_SECRET = '***'

fs = ossfs.OSSFileSystem(endpoint=OSS_ENDPOINT, key=OSS_ACCESS_KEY, secret=OSS_ACCESS_SECRET)
table = pq.read_table(f'{OSS_BUCKET}/path/to/table', filesystem=fs)
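
Writing to OSS goes through the same fsspec filesystem; a one-line sketch reusing fs, table, and OSS_BUCKET from above (the destination path is hypothetical):

pq.write_table(table, f'{OSS_BUCKET}/path/to/new_table.parquet', filesystem=fs)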

Lance

Python SDK

Install

pip install pylance==0.23.2

Usage

import lance

# read a lance dataset directory
dataset = lance.dataset('/path/to/dataset')

# write lance data
lance.write_dataset(dataset, '/path/to/new_dataset', mode='create')
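
An end-to-end sketch, assuming a small pyarrow table as the source (paths are placeholders):

import lance
import pyarrow as pa

# build a tiny table and persist it as a lance dataset
table = pa.table({'id': [1, 2, 3], 'text': ['a', 'b', 'c']})
lance.write_dataset(table, '/tmp/demo.lance', mode='create')

# read it back and materialize as a pyarrow table
dataset = lance.dataset('/tmp/demo.lance')
print(dataset.to_table())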

Read/Write data from OSS

storage_options = {
    "access_key_id": OSS_ACCESS_KEY,
    "secret_access_key": OSS_ACCESS_SECRET,
    "aws_endpoint": "http://bucket-name.oss-cn-wulanchabu-internal.aliyuncs.com",
    "virtual_hosted_style_request": "true",
    "allow_http": "true"
}

dataset = lance.dataset('s3://bucket-name/path/to/dataset', storage_options=storage_options)

lance.write_dataset(dataset, 's3://bucket-name/path/to/dataset', mode='overwrite', storage_options=storage_options)

Read/Write data with Ray

import ray
import lance
from lance.ray.sink import LanceDatasink

ray.init()

# read lance data (storage_options is the same dict as above)
dataset = ray.data.read_lance('s3://bucket-name/path/to/dataset', storage_options=storage_options)

# write lance data
# style 1: Ray Dataset.write_lance
dataset.write_lance('s3://bucket-name/path/to/dataset', mode='create', storage_options=storage_options)
# style 2: Lance Ray datasink
sink = LanceDatasink(uri="s3://bucket-name/path/to/dataset", storage_options=storage_options)
dataset.write_datasink(sink)

Read/Write data with Spark

LMDB

References

Install

pip install lmdb

Usage

create an LMDB file

import lmdb

lmdb_path = '/data/lmdb'
env = lmdb.open(lmdb_path, map_size=1099511627776)  # map_size: max DB size in bytes (1 TiB here)
env.close()
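
write entries with a write transaction (a minimal sketch; the keys and values are placeholders):

env = lmdb.open(lmdb_path, map_size=1099511627776)
with env.begin(write=True) as txn:
    txn.put(b'key', b'value')
    txn.put(b'another_key', b'another_value')
env.close()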

stat the LMDB file

  • psize: LMDB manages data through a memory-mapped file; data is split into fixed-size pages, and psize is the size of each page
  • depth: depth of the B+ tree; a deeper tree has more levels, so lookups may need more I/O
  • branch_pages: number of B+ tree branch pages, which hold pointers to child pages
  • leaf_pages: number of B+ tree leaf pages, which hold the actual data
  • overflow_pages: number of overflow pages; when a single item does not fit in one page, LMDB stores the extra data in overflow pages
  • entries: number of key-value pairs (entries) stored in the database
env = lmdb.open(lmdb_path, readonly=True)
print(env.stat())

read entry with key

env = lmdb.open(lmdb_path, readonly=True)
with env.begin() as txn:
    key = b'key'
    value = txn.get(key)
    print(value)
env.close()
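
To scan all entries instead of fetching a single key, iterate a cursor (same setup as above):

env = lmdb.open(lmdb_path, readonly=True)
with env.begin() as txn:
    for key, value in txn.cursor():
        print(key, value)
env.close()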

HDF5

A long-established file format for storing large-scale datasets; it supports multi-dimensional arrays, metadata, attributes, and more.
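
A minimal read/write sketch using h5py, the common Python binding (not covered above; assumes pip install h5py; file and dataset names are placeholders):

import h5py
import numpy as np

# write a dataset plus an attribute
with h5py.File('/tmp/demo.h5', 'w') as f:
    dset = f.create_dataset('images', data=np.zeros((4, 28, 28), dtype='float32'))
    dset.attrs['description'] = 'toy data'

# read it back
with h5py.File('/tmp/demo.h5', 'r') as f:
    print(f['images'].shape, f['images'].attrs['description'])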

Webdataset

Alluxio