Synchronous SGD | Caffe2

There are multiple ways to utilize multiple GPUs or machines to train models. Synchronous SGD, using Caffe2's data parallel model, is the simplest and easiest to understand: each GPU executes exactly the same code to run its share of the mini-batch. Between mini-batches we average the gradients of each GPU, and each GPU executes the parameter update in exactly the same way. At any point in time the parameters have the same values on every GPU. Another way to understand synchronous SGD is that it effectively increases the mini-batch size: using 8 GPUs to run a batch of 32 each is equivalent to one GPU running a mini-batch of 256.

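To see why averaging per-GPU gradients matches a single larger mini-batch, here is a minimal numpy sketch (not Caffe2 code) for a loss defined as the mean over examples; the toy least-squares model and all names are illustrative:

import numpy as np

np.random.seed(0)
w = np.random.randn(10)          # model parameters
X = np.random.randn(256, 10)     # one mini-batch of 256 examples
y = np.random.randn(256)

def grad(w, X, y):
    # gradient of the mean squared error 0.5 * mean((Xw - y)^2)
    return X.T @ (X @ w - y) / len(y)

g_full = grad(w, X, y)                  # one GPU processing all 256 examples

shards = np.split(np.arange(256), 8)    # eight GPUs, 32 examples each
g_avg = np.mean([grad(w, X[i], y[i]) for i in shards], axis=0)

print(np.allclose(g_full, g_avg))       # True: averaged shard gradients match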

Programming Guide


Example code:


Parallelizing a model is handled by the caffe2.python.data_parallel_model module. The model must be created using a ModelHelper, such as model_helper.ModelHelper.


For a full-length tutorial that builds ResNet-50 for a single GPU and then uses Parallelize_GPU for multiple GPUs, check out this tutorial. Here is an example from the Resnet-50 example code:



from caffe2.python import data_parallel_model, model_helper 

train_model = model_helper.ModelHelper(name="resnet50") 

data_parallel_model.Parallelize_GPU( 
     train_model, 
     input_builder_fun=add_image_input, 
     forward_pass_builder_fun=create_resnet50_model_ops, 
     param_update_builder_fun=add_parameter_update_ops, 
     devices=gpus,  # list of integers such as [0, 1, 2, 3] 
     optimize_gradient_memory=False/True, 
 )
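
Once Parallelize_GPU has built the per-GPU nets, the model is initialized and run like any other Caffe2 model. A minimal sketch, assuming the train_model above (the iteration count is illustrative):

from caffe2.python import workspace

workspace.RunNetOnce(train_model.param_init_net)   # initialize parameters on all GPUs
workspace.CreateNet(train_model.net)
for _ in range(100):                               # illustrative number of iterations
    workspace.RunNet(train_model.net.Proto().name)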

The key is to split your model creation code into three functions. These functions construct the operators just as you would without parallelization; a minimal sketch of all three follows the list below.


  • input_builder_fun: creates the operators that provide input to the network. Note: be careful that each GPU reads unique data (they should not all read the exact same data)! Typically they should share the same Reader to prevent this, or the data should be batched in such a way that each Reader is provided unique data. Signature: function(model)
  • forward_pass_builder_fun: this function adds the operators and layers to the network. It should return a list of loss blobs that are used for computing the loss gradient. This function is also passed an internally calculated loss_scale parameter that is used to scale your loss to normalize for the number of GPUs. Signature: function(model, loss_scale)
  • param_update_builder_fun: this function adds the operators for applying the gradient update to the parameters, for example a plain SGD update or a momentum parameter update. You should also instantiate the learning rate and iteration blobs here. You can set this function to None if you are only doing a forward pass and no learning. Signature: function(model)
  • optimize_gradient_memory: if enabled, the memonger module is used to optimize the memory usage of gradient operators by sharing blobs when possible. This can save a significant amount of memory and may help you run larger batches.
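
Here is a minimal, hypothetical sketch of the three builder functions for a toy fully-connected model (loosely following the Resnet-50 trainer; the blob names, dimensions, dummy-data input, and plain SGD update are illustrative assumptions, not part of the official example):

from caffe2.python import brew, core, data_parallel_model, model_helper

def add_input_ops(model):
    # Illustrative only: fill constant dummy data so the sketch is self-contained.
    # A real input_builder_fun would read a unique batch per GPU, e.g. from a
    # shared DB Reader, as in the Resnet-50 example.
    model.param_init_net.GaussianFill([], "data", shape=[32, 784], mean=0.0, std=1.0)
    model.param_init_net.ConstantFill([], "label", shape=[32], value=1,
                                      dtype=core.DataType.INT32)

def add_model_ops(model, loss_scale):
    # forward_pass_builder_fun: build the network and return the loss blobs.
    fc = brew.fc(model, "data", "fc", dim_in=784, dim_out=10)
    softmax, loss = model.net.SoftmaxWithLoss([fc, "label"], ["softmax", "loss"])
    # Scale the loss by loss_scale so gradients stay normalized for the GPU count.
    loss = model.net.Scale(loss, "loss_scaled", scale=loss_scale)
    return [loss]

def add_param_update_ops(model):
    # param_update_builder_fun: plain SGD via WeightedSum; base_lr is negative
    # because WeightedSum adds the scaled gradient to the parameter.
    iteration = brew.iter(model, "iter")
    lr = model.net.LearningRate([iteration], "lr", base_lr=-0.1, policy="fixed")
    one = model.param_init_net.ConstantFill([], "one", shape=[1], value=1.0)
    for param in model.GetParams():    # per-GPU parameters, not model.params
        grad = model.param_to_grad[param]
        model.net.WeightedSum([param, one, grad, lr], param)

train_model = model_helper.ModelHelper(name="toy_mlp")
data_parallel_model.Parallelize_GPU(
    train_model,
    input_builder_fun=add_input_ops,
    forward_pass_builder_fun=add_model_ops,
    param_update_builder_fun=add_param_update_ops,
    devices=[0, 1],
)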

Notes


  • Do not access model_helper.params directly! Instead use model_helper.GetParams(), which only returns the parameters for the current GPU.

Implementation Notes


Under the hood, Caffe2 uses DeviceScope and NameScope to distinguish the parameters for each GPU. Each parameter is prefixed with a namescope such as "gpu_0/" or "gpu_5/". Each blob created by the functions above is assigned to the correct GPU by the DeviceScope set by the data_parallel_model.Parallelize_GPU function. To checkpoint the model, only pick up the parameters prefixed with "gpu_0/" by calling model.GetParams("gpu_0"). We use CUDA NCCL ops to synchronize parameters between the GPUs.

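For example, a minimal checkpointing sketch along those lines, saving only the "gpu_0/" copy of each parameter (the workspace.FetchBlob usage and the dict format are illustrative assumptions, not a prescribed checkpoint format):

from caffe2.python import workspace

checkpoint = {}
for param in train_model.GetParams("gpu_0"):
    # param is a BlobReference named e.g. "gpu_0/fc_w"; every GPU holds the
    # same values in synchronous SGD, so one copy is enough.
    checkpoint[str(param)] = workspace.FetchBlob(str(param))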

Performance


Performance will depend on the model, but for Resnet-50, we get ~7x speedup on 8 M40 GPUs over 1 GPU.


Further Reading & Examples


Gloo is a Facebook Incubator project that helps manage multi-host, multi-GPU machine learning applications.


The Resnet-50 example code demonstrates the use of rendezvous, a feature not specifically utilized in this synchronous SGD example, but present in the data_parallel_model module that it uses.

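As a rough, hypothetical sketch of what that looks like in the Resnet-50 example (the FileStoreHandlerCreate path and the rendezvous dict keys below are assumptions taken loosely from that example, not part of this tutorial):

from caffe2.python import core, workspace

store_handler = "store_handler"
workspace.RunOperatorOnce(
    core.CreateOperator(
        "FileStoreHandlerCreate", [], [store_handler],
        path="/path/shared/by/all/hosts"))   # shared filesystem for peer discovery

rendezvous = dict(
    kv_handler=store_handler,   # key-value store handler created above
    shard_id=0,                 # rank of this host
    num_shards=2,               # total number of hosts
    engine="GLOO",              # Gloo runs the cross-host collectives
)

# The dict is then passed as data_parallel_model.Parallelize_GPU(..., rendezvous=rendezvous)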

Deep Residual Learning for Image Recognition is the source research for Resnet-50; in it the authors explore the results of building deeper and deeper networks, up to over 1,000 layers, using residual learning on the ImageNet dataset. Resnet-50 is their 50-layer residual network variation, which performed quite well on the tasks of object detection, classification, and localization.

