Object Storage
rclone
install
sudo -v ; curl https://rclone.org/install.sh | sudo bash
config
[s3]
type = s3
provider = Other
access_key_id = ***
secret_access_key = ***
endpoint = http://<host>:80
[oss]
type = s3
provider = Alibaba
env_auth = false
access_key_id = ***
secret_access_key = ***
endpoint = oss-cn-wulanchabu-internal.aliyuncs.com
usage
# list dir
rclone --config conf/rclone.conf lsd s3:bucket_name/path/to/dir/
# list files
rclone --config conf/rclone.conf ls s3:bucket_name/path/to/dir/
# copy, copy file to dest dir
rclone --config conf/rclone.conf copy s3:bucket_name/source/file s3:bucket_name/dest/
# copyto, copy file to dest file
rclone --config conf/rclone.conf copyto s3:bucket_name/source/file s3:bucket_name/dest/file
# sync
rclone --config conf/rclone.conf sync s3:bucket_name/source/ s3:bucket_name/dest/
# delete
rclone --config conf/rclone.conf delete s3:bucket_name/path/to/file
params
- progress: show transfer progress
- transfers: number of files to transfer in parallel
- checkers: number of parallel checkers for pre-transfer checks (hash, size, etc.)
- multi-thread-streams: number of threads used for chunked transfer of a single file
- buffer-size: size of the in-memory buffer used for read/write caching
rclone --config conf/rclone.conf copy s3:bucket_name/source/ s3:bucket_name/dest/ \
--progress \
--transfers 16 \
--checkers 16 \
--multi-thread-streams 32 \
--buffer-size 256M
Issues
- When Provider = Alibaba, the force_path_style = true option does not take effect: requests still use the virtual-hosted-style URL http://{bucket_name}.{endpoint} instead of the path-style URL http://{endpoint}/{bucket_name}. The workaround is to set Provider = Other.
  - Root cause: in rclone, virtualHostStyle defaults to true (backend/s3/s3.go:3473) and is only switched off for certain providers such as Other (backend/s3/s3.go:3653); as long as virtualHostStyle is not false, the opt.ForcePathStyle option is overridden:
    // Path Style vs Virtual Host style
    if virtualHostStyle || opt.UseAccelerateEndpoint {
        opt.ForcePathStyle = false
    }
- rclone copy local_file oss:bucket_name/path/to/file/ (uploading a single file) needs the --s3-no-check-bucket flag, otherwise it fails with a 409 bucket already exists error; rclone copy local_dir/ oss:bucket_name/path/to/file/ (uploading a directory) works without it.
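For example, uploading a single file with the flag appended (same placeholder paths as above):
rclone --config conf/rclone.conf copy local_file oss:bucket_name/path/to/file/ --s3-no-check-bucket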
OSS
Python SDK
Usage:
from oss2 import Auth, Bucket, ObjectIterator
auth = Auth('your_access_key_id', 'your_access_key_secret')
bucket = Bucket(auth, 'http://oss-cn-hangzhou.aliyuncs.com', 'your_bucket_name')
# list objects (ObjectIterator handles pagination)
for obj in ObjectIterator(bucket):
    print(obj.key)
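A minimal upload/download sketch reusing the same bucket object (the key and payload are placeholders):
# upload bytes to a key
bucket.put_object('path/to/hello.txt', b'hello oss')
# download the object and read its content
result = bucket.get_object('path/to/hello.txt')
print(result.read())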
fsspec
Usage
import ossfs
fs = ossfs.OSSFileSystem(endpoint='http://oss-cn-hangzhou.aliyuncs.com')
fs.ls('/dvc-test-anonymous/LICENSE')
with fs.open('/dvc-test-anonymous/LICENSE') as f:
    print(f.readline())
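Writing goes through the standard fsspec file interface; a sketch assuming a filesystem created with write credentials (key= and secret= passed to OSSFileSystem) and a placeholder path:
with fs.open('/your-bucket/path/to/output.txt', 'wb') as f:
    f.write(b'hello oss')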
Arrow
Install:
pip install pyarrow
Load Data
import pyarrow.parquet as pq
table = pq.read_table('/path/to/table')
Load data from OSS
import ossfs
import pyarrow.parquet as pq
OSS_ENDPOINT = 'http://oss-cn-wulanchabu-internal.aliyuncs.com'
OSS_BUCKET = 'bucket-name'
OSS_ACCESS_KEY = '***'
OSS_ACCESS_SECRET = '***'
fs = ossfs.OSSFileSystem(endpoint=OSS_ENDPOINT, key=OSS_ACCESS_KEY, secret=OSS_ACCESS_SECRET)
table = pq.read_table(f'{OSS_BUCKET}/path/to/table', filesystem=fs)
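Writing a table back to OSS is symmetric; a minimal sketch reusing the fs and table objects above (the destination path is a placeholder):
pq.write_table(table, f'{OSS_BUCKET}/path/to/new_table.parquet', filesystem=fs)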
Lance
Python SDK
Install
pip install pylance==0.23.2
Usage
import lance
# read a lance dataset directory
dataset = lance.dataset('/path/to/dataset')
# write lance data
lance.write_dataset(dataset, '/path/to/new_dataset', mode='create')
Read/Write data from OSS
storage_options={
"access_key_id": OSS_ACCESS_KEY,
"secret_access_key": OSS_ACCESS_SECRET,
"aws_endpoint": "http://bucket-name.oss-cn-wulanchabu-internal.aliyuncs.com",
"virtual_hosted_style_request": "true",
"allow_http": "true"
}
dataset = lance.dataset('s3://bucket-name/path/to/dataset', storage_options=storage_options)
lance.write_dataset(dataset, 's3://bucket-name/path/to/dataset', mode='overwrite', storage_options=storage_options)
Read/Write data with Ray
- Ray - ray.data.read_lance
- Ray - ray.data.Dataset.write_lance
- Lance - Ray Integration
- Volcengine Documentation Center - Using Ray to work with Lance data
import ray
import lance
from lance.ray.sink import LanceDatasink
ray.init()
dataset = ray.data.read_lance(f's3://bucket-name/path/to/dataset', storage_options=storage_options)
# write lance data
# style 1:
dataset.write_lance(f's3://bucket-name/path/to/dataset', mode='create', storage_options=storage_options)
# style 2:
sink = LanceDatasink(uri=f"s3://bucket-name/path/to/dataset", storage_options=storage_options)
dataset.write_datasink(sink)
Read/Write data with Spark
LMDB
Install
pip install lmdb
Usage
create a lmdb file
import lmdb
lmdb_path = '/data/lmdb'
env = lmdb.open(lmdb_path, map_size=1099511627776)  # map_size = 1 TiB
env.close()
stat the lmdb file
- psize: LMDB manages data through a memory-mapped file; the data is split into fixed-size pages, and psize is the size of each page
- depth: depth of the B+ tree; a larger depth means more levels and potentially more I/O per lookup
- branch_pages: number of B+ tree branch pages, which store pointers to child pages
- leaf_pages: number of B+ tree leaf pages, which store the actual data
- overflow_pages: number of overflow pages, used when a single item is too large to fit in one page
- entries: number of key-value entries stored in the database
print(env.stat())
read entry with key
env = lmdb.open(lmdb_path, readonly=True)
with env.begin() as txn:
    key = b'key'
    value = txn.get(key)
    print(value)
env.close()
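The read example above assumes the key was written earlier; a minimal write sketch (key and value are placeholders):
env = lmdb.open(lmdb_path)
with env.begin(write=True) as txn:
    txn.put(b'key', b'value')
env.close()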
HDF5
A long-established file format for storing large-scale datasets, with support for multi-dimensional arrays, metadata, attributes, and more.
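A minimal h5py sketch (file and dataset names are arbitrary) showing a multi-dimensional array with an attribute attached:
import h5py
import numpy as np

# write a 2-D dataset and attach an attribute
with h5py.File('example.h5', 'w') as f:
    dset = f.create_dataset('matrix', data=np.arange(12).reshape(3, 4))
    dset.attrs['unit'] = 'meters'

# read it back
with h5py.File('example.h5', 'r') as f:
    print(f['matrix'][:], f['matrix'].attrs['unit'])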