N1H111SM's Miniverse

Pytorch - Batch Sparse Matrix on GPU

2020/05/19

Materials

Basic CUDA

  • model = model.cuda() recursively moves the parameters (and buffers) of the model and of every submodule inheriting from nn.Module into GPU memory.
  • device = torch.device("cuda:0") returns the device object for the cuda:0 device in the current system; this object can be passed as an argument when creating a tensor or variable. A short sketch of both calls follows below.
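A minimal sketch of these two calls, assuming a CUDA device is available (the Linear layer and tensor sizes are placeholders, not from the post):

import torch
import torch.nn as nn

device = torch.device("cuda:0")          # device object for the first GPU

model = nn.Linear(16, 4)                 # placeholder module; any nn.Module works
model = model.cuda()                     # recursively moves its parameters to GPU memory

x = torch.randn(8, 16, device=device)   # create a tensor directly on cuda:0
y = model(x)                             # forward pass runs on the GPU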

Problem Description

At present, GraphMIMaximizer is structured as follows: the model itself builds a GCN, and the GCN builds several GraphConvolution layers. Measuring the model's training speed in the CPU and GPU environments respectively, we get:

CPU

GC layer time taken: 0.21100592613220215
GC layer time taken: 0.24969816207885742
GC layer time taken: 0.24791836738586426
GC layer time taken: 0.2417440414428711
GCN forward time taken: 1.0056202411651611
MI maximization forward time taken: 1.0259888172149658

GPU

GC layer time taken: 0.7932033538818359
GC layer time taken: 0.8128383159637451
GC layer time taken: 0.7196145057678223
GC layer time taken: 0.7076494693756104
GCN forward time taken: 3.038027048110962
MI maximization forward time taken: 3.0401880741119385

From the tests above we can see that the main reason the model trains more slowly on the GPU is the excessive overhead in the GC layer, so this piece of code needs to be optimized.
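As a side note, the timings above are plain wall-clock measurements around each forward call. Since CUDA kernels are launched asynchronously, such measurements are more trustworthy when the GPU is synchronized before reading the clock; a minimal sketch of this (the layer argument is a placeholder, not from the post):

import time
import torch

def timed_forward(layer, *inputs):
    # flush pending GPU work so the timer covers only this call
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start_time = time.time()
    output = layer(*inputs)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    print("GC layer time taken: {}".format(time.time() - start_time))
    return output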

Solution

This forward code implements batch sparse matrix multiplication; our guess is that the Python for loop makes PyTorch communicate frequently between the CPU and the GPU, which is what slows things down.

def forward(self, x, adj):
    #######
    # support batch graph, modified from original GCN code.
    #######
    support = torch.matmul(x, self.weight)
    start_time = time.time()
    output = []
    # loop over the batch: one spmm per (sparse adjacency, dense feature) pair
    for i in range(support.shape[0]):
        sparse_adj = adj[i]
        dense_x = support[i]
        output.append(torch.spmm(sparse_adj, dense_x))
    output = torch.stack(output)
    print("GC layer time taken: {}".format(time.time()-start_time))

    if self.bias is not None:
        return output + self.bias
    else:
        return output

We originally adopted sparse matrices in order to save memory, but since PyTorch does not yet support batched sparse matrix multiplication, we simply convert the batched sparse matrix to a dense tensor and compute with that to speed up training.

def forward(self, x, adj):
    #######
    # support batch graph, modified from original GCN code.
    #######
    support = torch.matmul(x, self.weight)
    start_time = time.time()
    # densify the batched sparse adjacency and multiply the whole batch at once
    output = torch.matmul(adj.to_dense(), support)
    print("GC layer time taken: {}".format(time.time()-start_time))

    if self.bias is not None:
        return output + self.bias
    else:
        return output

One point to be aware of: when both matrices carry a batch in their leading dimensions, the default torch.matmul treats those leading dimensions as batch dimensions (broadcasting them where necessary) and multiplies only the last two dimensions.
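A minimal shape check of this behavior, with made-up sizes (a batch of 4 graphs, 10 nodes, 16 features) and a hypothetical batched sparse adjacency built as a 3-d COO tensor:

import torch

B, N, F = 4, 10, 16

# hypothetical batched sparse adjacency: index rows are (batch, row, col)
indices = torch.tensor([[0, 1, 2, 3],
                        [0, 1, 2, 3],
                        [1, 2, 3, 4]])
values = torch.ones(4)
adj = torch.sparse_coo_tensor(indices, values, size=(B, N, N))

support = torch.randn(B, N, F)

# matmul broadcasts over the leading batch dimension and multiplies
# the trailing (N, N) x (N, F) matrices pairwise
output = torch.matmul(adj.to_dense(), support)
print(output.shape)  # torch.Size([4, 10, 16])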

With this change, the training speed of the model on the GPU improves dramatically:

GC layer time taken: 0.00017189979553222656
GC layer time taken: 0.0001685619354248047
GC layer time taken: 0.0001666545867919922
GC layer time taken: 0.00016689300537109375
GCN forward time taken: 0.0024771690368652344