N1H111SM's Miniverse

Pytorch - Batch Sparse Matrix on GPU

2020/05/19

Materials

Basic CUDA

  • model = model.cuda() recursively moves the parameters (and buffers) of the model and of every submodule inheriting from nn.Module into GPU memory.
  • device = torch.device("cuda:0") returns the device object for the cuda:0 device in the current system; this object can be passed as an argument when creating a tensor or variable. A short sketch of both calls follows below.
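A minimal sketch of these two calls, assuming a CUDA device is available (the Linear layer and tensor sizes are placeholders, not from the post):

import torch
import torch.nn as nn

device = torch.device("cuda:0")          # device object for the first GPU

model = nn.Linear(16, 4)                 # placeholder module; any nn.Module works
model = model.cuda()                     # recursively moves its parameters to GPU memory

x = torch.randn(8, 16, device=device)   # create a tensor directly on cuda:0
y = model(x)                             # forward pass runs on the GPU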

Problem Description

At present, GraphMIMaximizer is structured as follows: the model itself builds a GCN, and the GCN builds several GraphConvolution layers. Measuring the model's training speed in the CPU and GPU environments respectively, we get:

CPU

GC layer time taken: 0.21100592613220215
GC layer time taken: 0.24969816207885742
GC layer time taken: 0.24791836738586426
GC layer time taken: 0.2417440414428711
GCN forward time taken: 1.0056202411651611
MI maximization forward time taken: 1.0259888172149658

GPU

GC layer time taken: 0.7932033538818359
GC layer time taken: 0.8128383159637451
GC layer time taken: 0.7196145057678223
GC layer time taken: 0.7076494693756104
GCN forward time taken: 3.038027048110962
MI maximization forward time taken: 3.0401880741119385

From the tests above we can see that the main reason the model trains more slowly on the GPU is the excessive overhead in the GC layer, so this piece of code needs to be optimized.
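As a side note, the timings above are plain wall-clock measurements around each forward call. Since CUDA kernels are launched asynchronously, such measurements are more trustworthy when the GPU is synchronized before reading the clock; a minimal sketch of this (the layer argument is a placeholder, not from the post):

import time
import torch

def timed_forward(layer, *inputs):
    # flush pending GPU work so the timer covers only this call
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start_time = time.time()
    output = layer(*inputs)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    print("GC layer time taken: {}".format(time.time() - start_time))
    return output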

Solution

This forward code implements batch sparse matrix multiplication; our guess is that the Python for loop makes PyTorch communicate frequently between the CPU and the GPU, which is what slows things down.

def forward(self, x, adj):
    #######
    # support batch graph, modified from original GCN code.
    #######
    support = torch.matmul(x, self.weight)
    start_time = time.time()
    output = []
    # loop over the batch: one spmm per (sparse adjacency, dense feature) pair
    for i in range(support.shape[0]):
        sparse_adj = adj[i]
        dense_x = support[i]
        output.append(torch.spmm(sparse_adj, dense_x))
    output = torch.stack(output)
    print("GC layer time taken: {}".format(time.time()-start_time))

    if self.bias is not None:
        return output + self.bias
    else:
        return output

We originally adopted sparse matrices in order to save memory, but since PyTorch does not yet support batched sparse matrix multiplication, we simply convert the batched sparse matrix to a dense tensor and compute with that to speed up training.

def forward(self, x, adj):
    #######
    # support batch graph, modified from original GCN code.
    #######
    support = torch.matmul(x, self.weight)
    start_time = time.time()
    # densify the batched sparse adjacency and multiply the whole batch at once
    output = torch.matmul(adj.to_dense(), support)
    print("GC layer time taken: {}".format(time.time()-start_time))

    if self.bias is not None:
        return output + self.bias
    else:
        return output

One point to be aware of: when both matrices carry a batch in their leading dimensions, the default torch.matmul treats those leading dimensions as batch dimensions (broadcasting them where necessary) and multiplies only the last two dimensions.
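A minimal shape check of this behavior, with made-up sizes (a batch of 4 graphs, 10 nodes, 16 features) and a hypothetical batched sparse adjacency built as a 3-d COO tensor:

import torch

B, N, F = 4, 10, 16

# hypothetical batched sparse adjacency: index rows are (batch, row, col)
indices = torch.tensor([[0, 1, 2, 3],
                        [0, 1, 2, 3],
                        [1, 2, 3, 4]])
values = torch.ones(4)
adj = torch.sparse_coo_tensor(indices, values, size=(B, N, N))

support = torch.randn(B, N, F)

# matmul broadcasts over the leading batch dimension and multiplies
# the trailing (N, N) x (N, F) matrices pairwise
output = torch.matmul(adj.to_dense(), support)
print(output.shape)  # torch.Size([4, 10, 16])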

With this change, the training speed of the model on the GPU improves dramatically:

GC layer time taken: 0.00017189979553222656
GC layer time taken: 0.0001685619354248047
GC layer time taken: 0.0001666545867919922
GC layer time taken: 0.00016689300537109375
GCN forward time taken: 0.0024771690368652344