Object Storage

rclone

install

sudo -v ; curl https://rclone.org/install.sh | sudo bash

config

[s3]
type = s3
provider = Other
access_key_id = ***
secret_access_key = ***
endpoint = http://<host>:80

[oss]
type = s3
provider = Alibaba
env_auth = false
access_key_id = ***
secret_access_key = ***
endpoint = oss-cn-wulanchabu-internal.aliyuncs.com

usage

# list dir
rclone --config conf/rclone.conf lsd s3:bucket_name/path/to/dir/

# list files
rclone --config conf/rclone.conf ls s3:bucket_name/path/to/dir/

# copy: copy file into dest dir
rclone --config conf/rclone.conf copy s3:bucket_name/source/file s3:bucket_name/dest/

# copyto: copy file to the exact dest file path
rclone --config conf/rclone.conf copyto s3:bucket_name/source/file s3:bucket_name/dest/file

# sync
rclone --config conf/rclone.conf sync s3:bucket_name/source/ s3:bucket_name/dest/

# delete
rclone --config conf/rclone.conf delete s3:bucket_name/path/to/file

params

  • progress: show transfer progress
  • transfers: number of files to transfer in parallel
  • checkers: number of parallel checkers for pre-transfer checks (hash, size, etc.)
  • multi-thread-streams: number of streams used for multi-threaded transfer of a single file
  • buffer-size: size of the in-memory buffer used for read/write caching

rclone --config conf/rclone.conf copy s3:bucket_name/source/ s3:bucket_name/dest/ \
    --progress \
    --transfers 16 \
    --checkers 16 \
    --multi-thread-streams 32 \
    --buffer-size 256M

Issues

  1. With Provider = Alibaba, the force_path_style = true option has no effect: requests still use the virtual-hosted-style URL http://{bucket_name}.{endpoint} instead of the path-style http://{endpoint}/{bucket_name}. The workaround is to set Provider = Other.

    • Root cause: virtualHostStyle defaults to true in rclone (backend/s3/s3.go:3473); when Provider is Other it is set to false (backend/s3/s3.go:3653). As long as virtualHostStyle is not false, the opt.ForcePathStyle option never takes effect:
    // Path Style vs Virtual Host style
    if virtualHostStyle || opt.UseAccelerateEndpoint {
    	opt.ForcePathStyle = false
    }
    
  2. rclone copy local_file oss:bucket_name/path/to/file/ (uploading a single file) needs the --s3-no-check-bucket flag, otherwise it fails with a 409 bucket already exists error; uploading a directory with rclone copy local_dir/ oss:bucket_name/path/to/file/ works without it.

OSS

Python SDK

Usage:

from oss2 import Auth, Bucket

auth = Auth('your_access_key_id', 'your_access_key_secret')
bucket = Bucket(auth, 'http://oss-cn-hangzhou.aliyuncs.com', 'your_bucket_name')

# list objects
for obj in bucket.list_objects().object_list:
    print(obj.key)
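
A minimal upload/download sketch with the same Bucket object; the object keys and local paths below are placeholders:

# upload a local file and an in-memory string
bucket.put_object_from_file('path/to/remote.txt', '/local/path/file.txt')
bucket.put_object('path/to/hello.txt', b'hello oss')

# download to a local file / read into memory
bucket.get_object_to_file('path/to/remote.txt', '/local/path/copy.txt')
data = bucket.get_object('path/to/hello.txt').read()
print(data)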

fsspec

Usage

import ossfs
fs = ossfs.OSSFileSystem(endpoint='http://oss-cn-hangzhou.aliyuncs.com')
fs.ls('/dvc-test-anonymous/LICENSE')
with fs.open('/dvc-test-anonymous/LICENSE') as f:
    print(f.readline())
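
For non-public buckets, credentials can be passed the same way as in the Arrow example below; a small sketch (bucket name and paths are placeholders) that uploads and downloads whole files:

import ossfs

fs = ossfs.OSSFileSystem(
    endpoint='http://oss-cn-hangzhou.aliyuncs.com',
    key='your_access_key_id',
    secret='your_access_key_secret',
)

# upload / download whole files (fsspec-style API)
fs.put('/local/path/file.txt', '/your_bucket_name/path/to/file.txt')
fs.get('/your_bucket_name/path/to/file.txt', '/local/path/copy.txt')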

Arrow

Install:

pip install pyarrow

Load Data

import pyarrow.parquet as pq

table = pq.read_table('/path/to/table')
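
Writing a table back is symmetric; a trivial sketch with placeholder data and path:

import pyarrow as pa
import pyarrow.parquet as pq

# build a tiny table and write it to a local parquet file
table = pa.table({'id': [1, 2, 3], 'text': ['a', 'b', 'c']})
pq.write_table(table, '/path/to/table.parquet')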

Load data from OSS

import ossfs
import pyarrow.parquet as pq

OSS_ENDPOINT = 'http://oss-cn-wulanchabu-internal.aliyuncs.com'
OSS_BUCKET = 'bucket-name'
OSS_ACCESS_KEY = '***'
OSS_ACCESS_SECRET = '***'

fs = ossfs.OSSFileSystem(endpoint=OSS_ENDPOINT, key=OSS_ACCESS_KEY, secret=OSS_ACCESS_SECRET)
table = pq.read_table(f'{OSS_BUCKET}/path/to/table', filesystem=fs)
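
Writing to OSS goes through the same fsspec filesystem; a one-line sketch reusing fs, table, and OSS_BUCKET from above (the destination path is hypothetical):

pq.write_table(table, f'{OSS_BUCKET}/path/to/new_table.parquet', filesystem=fs)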

Lance

Python SDK

Install

pip install pylance==0.23.2

Usage

import lance

# read a lance dataset directory
dataset = lance.dataset('/path/to/dataset')

# write lance data
lance.write_dataset(dataset, '/path/to/new_dataset', mode='create')
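
An end-to-end sketch, assuming a small pyarrow table as the source (paths are placeholders):

import lance
import pyarrow as pa

# build a tiny table and persist it as a lance dataset
table = pa.table({'id': [1, 2, 3], 'text': ['a', 'b', 'c']})
lance.write_dataset(table, '/tmp/demo.lance', mode='create')

# read it back and materialize as a pyarrow table
dataset = lance.dataset('/tmp/demo.lance')
print(dataset.to_table())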

Read/Write data from OSS

storage_options = {
    "access_key_id": OSS_ACCESS_KEY,
    "secret_access_key": OSS_ACCESS_SECRET,
    "aws_endpoint": "http://bucket-name.oss-cn-wulanchabu-internal.aliyuncs.com",
    "virtual_hosted_style_request": "true",
    "allow_http": "true"
}

dataset = lance.dataset('s3://bucket-name/path/to/dataset', storage_options=storage_options)

lance.write_dataset(dataset, 's3://bucket-name/path/to/dataset', mode='overwrite', storage_options=storage_options)

Read/Write data with Ray

import ray
import lance
from lance.ray.sink import LanceDatasink

ray.init()

# read lance data (storage_options is the same dict as above)
dataset = ray.data.read_lance('s3://bucket-name/path/to/dataset', storage_options=storage_options)

# write lance data
# style 1: Ray Dataset.write_lance
dataset.write_lance('s3://bucket-name/path/to/dataset', mode='create', storage_options=storage_options)
# style 2: Lance Ray datasink
sink = LanceDatasink(uri="s3://bucket-name/path/to/dataset", storage_options=storage_options)
dataset.write_datasink(sink)

Read/Write data with Spark

LMDB

References

Install

pip install lmdb

Usage

create an LMDB file

import lmdb

lmdb_path = '/data/lmdb'
env = lmdb.open(lmdb_path, map_size=1099511627776)  # map_size: max DB size in bytes (1 TiB here)
env.close()
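
write entries with a write transaction (a minimal sketch; the keys and values are placeholders):

env = lmdb.open(lmdb_path, map_size=1099511627776)
with env.begin(write=True) as txn:
    txn.put(b'key', b'value')
    txn.put(b'another_key', b'another_value')
env.close()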

stat the LMDB file

  • psize: LMDB manages data through a memory-mapped file; data is split into fixed-size pages, and psize is the size of each page
  • depth: depth of the B+ tree; a deeper tree has more levels, so lookups may need more I/O
  • branch_pages: number of B+ tree branch pages, which hold pointers to child pages
  • leaf_pages: number of B+ tree leaf pages, which hold the actual data
  • overflow_pages: number of overflow pages; when a single item does not fit in one page, LMDB stores the extra data in overflow pages
  • entries: number of key-value pairs (entries) stored in the database
env = lmdb.open(lmdb_path, readonly=True)
print(env.stat())

read entry with key

env = lmdb.open(lmdb_path, readonly=True)
with env.begin() as txn:
    key = b'key'
    value = txn.get(key)
    print(value)
env.close()
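
To scan all entries instead of fetching a single key, iterate a cursor (same setup as above):

env = lmdb.open(lmdb_path, readonly=True)
with env.begin() as txn:
    for key, value in txn.cursor():
        print(key, value)
env.close()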

HDF5

A long-established file format for storing large-scale datasets; it supports multi-dimensional arrays, metadata, attributes, and more.
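
A minimal read/write sketch using h5py, the common Python binding (not covered above; assumes pip install h5py; file and dataset names are placeholders):

import h5py
import numpy as np

# write a dataset plus an attribute
with h5py.File('/tmp/demo.h5', 'w') as f:
    dset = f.create_dataset('images', data=np.zeros((4, 28, 28), dtype='float32'))
    dset.attrs['description'] = 'toy data'

# read it back
with h5py.File('/tmp/demo.h5', 'r') as f:
    print(f['images'].shape, f['images'].attrs['description'])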

Webdataset

Alluxio