Frameworks
awesome-production-machine-learning GitHub - AI Performance Engineering
- Resource Management
- Distributed Training
- torchrun
- Communication backends
- Gloo
- MPI
- NCCL
- TensorFlow PS-Worker (parameter server and worker)
- Horovod
- DeepSpeed
- Megatron-LM
- Colossal AI
- Inference Acceleration
- Development
- PyTorch Lightning: organizes PyTorch code to remove boilerplate and make scaling easier
- MLflow: experiment management
- Weights & Biases (wandb): experiment tracking
- PyCaret: a Python wrapper around several machine learning libraries and frameworks such as scikit-learn, XGBoost, LightGBM, CatBoost, Optuna, Hyperopt, Ray, and a few more
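
As a rough illustration of the torchrun and communication backend entries under Distributed Training: torchrun exports `RANK`, `LOCAL_RANK`, `WORLD_SIZE`, `MASTER_ADDR`, and `MASTER_PORT` to each worker, and the worker then picks a backend. The sketch below only reads those variables and chooses a backend name. `DistEnv`, `read_dist_env`, and `choose_backend` are hypothetical helper names, and the "NCCL for GPU, Gloo for CPU" rule is a common default assumed here, not a requirement.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class DistEnv:
    """Hypothetical container for the variables torchrun sets per worker."""
    rank: int
    local_rank: int
    world_size: int
    master_addr: str
    master_port: int


def read_dist_env(env=None) -> DistEnv:
    """Read the rendezvous variables, falling back to single-process values."""
    env = os.environ if env is None else env
    return DistEnv(
        rank=int(env.get("RANK", 0)),
        local_rank=int(env.get("LOCAL_RANK", 0)),
        world_size=int(env.get("WORLD_SIZE", 1)),
        master_addr=env.get("MASTER_ADDR", "127.0.0.1"),
        master_port=int(env.get("MASTER_PORT", 29500)),
    )


def choose_backend(has_cuda: bool) -> str:
    """Assumed default: NCCL for GPU tensors, Gloo for CPU tensors."""
    return "nccl" if has_cuda else "gloo"


# Inside a real training script (requires PyTorch, not imported here):
#   import torch
#   import torch.distributed as dist
#   dist.init_process_group(backend=choose_backend(torch.cuda.is_available()))
```

Launching with something like `torchrun --nproc_per_node=4 train.py` (where `train.py` is a placeholder script name) would populate these variables for each of the four workers.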
Key Questions
- How to manage versions of training datasets
- How to move data between object storage and high-performance storage
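
One possible starting point for the dataset versioning question, sketched under assumptions: hash every file under a dataset directory and derive a short version identifier from the resulting manifest, so any change to the data yields a new version. `file_digest` and `build_manifest` are hypothetical names, and dedicated tools handle this more completely. This only shows the content-addressing idea.

```python
import hashlib
import json
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root: Path) -> dict:
    """Map each relative file path to its digest and derive a version id."""
    files = {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }
    # Serialize with sorted keys so the same content always hashes the same.
    canonical = json.dumps(files, sort_keys=True).encode("utf-8")
    version = hashlib.sha256(canonical).hexdigest()[:12]
    return {"version": version, "files": files}
```

Storing the manifest next to each training run would record which exact data the run used; the 12-character truncation is an arbitrary choice for readability.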