
Graph Neural Networks - Survey

2020/02/28


Backgrounds

Basics

History. Early work can be summarized as Recurrent GNNs (RecGNNs): at every iteration, each node exchanges information with its neighbors until the node representations converge. This is computationally expensive, and later research has tried to reduce this cost. After CNNs succeeded in computer vision, researchers redefined the convolution operation on graphs, and a large body of work has since centered on ConvGNNs. ConvGNNs split into two sub-directions: spatial-based ConvGNNs and spectral-based ConvGNNs. Beyond these, much work also focuses on Graph Autoencoders and Spatial-Temporal GNNs, both of which build on the first two directions.

GNN vs Graph Kernel. Graph kernels used to be the mainstream approach to graph classification: a kernel function maps graph structures into a vector space, and a standard ML method such as an SVM is then applied for classification. Unlike GNNs, the kernel function in graph-kernel methods is hand-crafted rather than learnable, whereas GNNs learn graph representations end to end.

Training Frameworks

Semi-supervised learning for node-level classification. Semi-supervised classification over the nodes of a graph: some nodes are labeled while the rest are not, and a ConvGNN is typically used to classify the unlabeled nodes.

Supervised learning for graph-level classification. Supervised classification at the graph level, typically done with graph convolution layers combined with graph pooling and/or readout layers: graph pooling layers aggregate and coarsen node information, and the readout layer maps the node representations to a graph-level representation.

Unsupervised learning for graph embedding. Learning graph embeddings without any labels. One approach uses an autoencoder; another uses negative sampling: node pairs with no edge between them are sampled as negative examples, while connected node pairs serve as positive examples.
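
As a rough sketch of the negative-sampling objective (the function and variable names below are hypothetical, and the score of a pair is taken to be a simple dot product of node embeddings):

```python
import torch
import torch.nn.functional as F


def link_loss(z, pos_pairs, neg_pairs):
    """Binary cross-entropy over sampled node pairs.
    z: (n, d) node embeddings; pos_pairs / neg_pairs: (m, 2) long tensors of node indices."""
    def score(pairs):
        # dot-product similarity between the two endpoints of each pair
        return (z[pairs[:, 0]] * z[pairs[:, 1]]).sum(dim=-1)

    pos = F.binary_cross_entropy_with_logits(score(pos_pairs), torch.ones(len(pos_pairs)))
    neg = F.binary_cross_entropy_with_logits(score(neg_pairs), torch.zeros(len(neg_pairs)))
    return pos + neg
```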

RecGNNs

RecGNNs were the first to apply neural networks to graph-structured data and can be regarded as the pioneers of GNN methods. Due to the computational constraints of the time, early work could only operate on directed acyclic graphs.

The GNN* proposed by Scarselli et al. can be regarded as the seminal work on graph neural networks. It adopts an information diffusion mechanism: it updates nodes' states by exchanging neighborhood information recurrently until a stable equilibrium is reached.
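
Concretely, writing $N(v)$ for the neighbors of $v$, $\mathbf{x}_v$ and $\mathbf{x}^e_{(v,u)}$ for node and edge features, and $f(\cdot)$ for a parametric function, the update is

$$
\mathbf{h}_{v}^{(t)}=\sum_{u \in N(v)} f\left(\mathbf{x}_{v}, \mathbf{x}_{(v, u)}^{e}, \mathbf{x}_{u}, \mathbf{h}_{u}^{(t-1)}\right)
$$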

To ensure convergence, the recurrent function $f(\cdot)$ must be a contraction mapping, which shrinks the distance between two points after projecting them into a latent space. In the case of $f(\cdot)$ being a neural network, a penalty term has to be imposed on the Jacobian matrix of parameters.

Graph Echo State Network (GraphESN) improves on GNN* to achieve better training efficiency. It implements a contractive state transition function to recurrently update node states until the global graph state reaches convergence. Afterward, the output layer is trained by taking the fixed node states as inputs. (This work is not cited very often.)

Gated Graph Neural Network (GGNN) uses a GRU as the recurrent function.
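
With $\mathbf{h}_v^{(0)}=\mathbf{x}_v$, the update at step $t$ is

$$
\mathbf{h}_{v}^{(t)}=\operatorname{GRU}\left(\mathbf{h}_{v}^{(t-1)}, \sum_{u \in N(v)} \mathbf{W} \mathbf{h}_{u}^{(t-1)}\right)
$$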

GGNN also stops using convergence as the signal to terminate the recurrence; instead, it runs the recurrence for a fixed maximum number of steps. The advantage is that no additional constraint on the parameters is needed to guarantee convergence. It updates the parameters with backpropagation through time, which can become problematic when the graph is very large.

Stochastic Steady-state Embedding (SSE) performs better on larger graphs. It updates node hidden states in a stochastic and asynchronous manner: it samples one batch of nodes to update their states and another batch of nodes for gradient computation. To keep the algorithm stable, the hidden-state update is smoothed by averaging the new states with the previous ones.

Note: SSE does not theoretically prove that the recurrence above converges.

ConvGNNs

We first introduce the intuition and derivation behind the Graph Laplacian operator, since it is a central concept: understanding it helps to understand how problems on graph structures should be approached. We then introduce Spectral-based ConvGNNs and Spatial-based ConvGNNs in turn.

Spectral-based ConvGNNs

Spectral-based approaches define graph convolutions by introducing filters from the perspective of graph signal processing where the graph convolutional operation is interpreted as removing noises from graph signals.

First, recall the definition of the graph convolution. The graph convolution of the input signal $x$ with a filter $g\in \mathbb R^{|V|}$ is defined as follows.
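
In the spectral domain, with $\mathbf{U}$ the matrix of eigenvectors of the normalized graph Laplacian $\mathbf{L}=\mathbf{I}_{n}-\mathbf{D}^{-\frac{1}{2}} \mathbf{A} \mathbf{D}^{-\frac{1}{2}}=\mathbf{U} \boldsymbol{\Lambda} \mathbf{U}^{T}$ and $\mathbf{g}_{\theta}=\operatorname{diag}\left(\mathbf{U}^{T} \mathbf{g}\right)$, it reads

$$
\mathbf{x} *_{G} \mathbf{g}=\mathbf{U}\left(\mathbf{U}^{T} \mathbf{x} \odot \mathbf{U}^{T} \mathbf{g}\right)=\mathbf{U} \mathbf{g}_{\theta} \mathbf{U}^{T} \mathbf{x}
$$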

Spectral Convolutional Neural Network (Spectral CNN) assumes the filter $\mathbf{g}_{\theta}=\mathbf{\Theta}_{i, j}^{(k)}$ is a set of learnable, multi-channel parameters stacked over multiple layers. The graph convolution of Spectral CNN is defined as follows.
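
With $\sigma(\cdot)$ denoting the activation function, each output channel $j$ is computed as

$$
\mathbf{H}_{:, j}^{(k)}=\sigma\left(\sum_{i=1}^{f_{k-1}} \mathbf{U} \boldsymbol{\Theta}_{i, j}^{(k)} \mathbf{U}^{T} \mathbf{H}_{:, i}^{(k-1)}\right), \quad j=1, \ldots, f_{k}
$$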

where $k$ is the layer index, $\mathbf{H}^{(k-1)} \in \mathbb{R}^{n \times f_{k-1}}$ is the input graph signal, $\mathbf{H}^{(0)}=\mathbf{X}$, $f_{k-1}$ is the number of input channels, $f_k$ is the number of output channels, and $\boldsymbol{\Theta}_{i, j}^{(k)}$ is a diagonal matrix filled with learnable parameters.

Because it requires an eigendecomposition, Spectral CNN has three shortcomings. First, any perturbation of the graph structure (a change in edge weights or the addition of nodes) changes the eigenbasis. Second, the learned filters are domain dependent and cannot be applied to graphs with a different structure. Third, the eigendecomposition has $O(|V|^3)$ computational complexity. The two works below, ChebNet and GCN, reduce this complexity through approximations and simplifications.

Chebyshev Spectral CNN (ChebNet) argues that non-parametric filters have two shortcomings: (i) they are not localized in space; (ii) their learning complexity is in $O(n)$. By Lemma 1 from the previous post, filters with local support can be obtained through higher powers of $L$, so we can design a (multi-channel) filter function as follows.
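
One standard parametrization (written here for a single channel; a multi-channel filter applies one such polynomial per input–output channel pair) is a polynomial of order $K$ in the eigenvalues,

$$
\mathbf{g}_{\theta}(\boldsymbol{\Lambda})=\sum_{k=0}^{K} \theta_{k} \boldsymbol{\Lambda}^{k}
$$

which, by the lemma above, is $K$-localized: it only mixes information from nodes at most $K$ hops apart.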

Computing high powers of $L$ costs $O(N^3)$, while what we ultimately want is simply the result of applying the filter function to the input graph signal $x$. This points to the Chebyshev Polynomials (see the post "Approximation Theory" for their definition and properties) generalized to square matrices. To satisfy the domain requirement of the Chebyshev Polynomials, $\Lambda$ and $L$ have to be rescaled to the interval $[-1,1]$.
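
The standard rescaling, with $\lambda_{\max}$ the largest eigenvalue of $\mathbf{L}$, is

$$
\tilde{\boldsymbol{\Lambda}}=2 \boldsymbol{\Lambda} / \lambda_{\max }-\mathbf{I}_{n}, \qquad \tilde{\mathbf{L}}=2 \mathbf{L} / \lambda_{\max }-\mathbf{I}_{n}
$$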

The Chebyshev Polynomials are defined recursively.
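
The recurrence, valid for $k \ge 2$, is

$$
T_{k}(\mathbf{x})=2 \mathbf{x} T_{k-1}(\mathbf{x})-T_{k-2}(\mathbf{x})
$$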

where $T_{0}(\mathbf{x})=1$, $T_{1}(\mathbf{x})=\mathbf{x}$, and $\mathbf{x}$ is a vector. Based on the above, we can give a filter definition that is faster to compute.
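
In terms of the rescaled Laplacian, the ChebNet graph convolution is

$$
\mathbf{x} *_{G} \mathbf{g}_{\theta}=\sum_{k=0}^{K} \theta_{k} T_{k}(\tilde{\mathbf{L}})\, \mathbf{x}
$$

which avoids the eigendecomposition entirely, since each $T_{k}(\tilde{\mathbf{L}})\mathbf{x}$ can be computed recursively from the two previous terms.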

As an improvement over Spectral CNN, the filters defined by ChebNet are localized in space, which means filters can extract local features independently of the graph size.

Graph Convolutional Network (GCN) applies a first-order approximation of ChebNet. Fixing $K=1$ and $\lambda_{\max}=2$, the expression above can be rewritten as follows.
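
Using $T_{0}(\tilde{\mathbf{L}})=\mathbf{I}_{n}$ and $T_{1}(\tilde{\mathbf{L}})=\tilde{\mathbf{L}}=\mathbf{L}-\mathbf{I}_{n}=-\mathbf{D}^{-\frac{1}{2}} \mathbf{A} \mathbf{D}^{-\frac{1}{2}}$, this gives

$$
\mathbf{x} *_{G} \mathbf{g}_{\theta}=\theta_{0} \mathbf{x}-\theta_{1} \mathbf{D}^{-\frac{1}{2}} \mathbf{A} \mathbf{D}^{-\frac{1}{2}} \mathbf{x}
$$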

To restrain the number of parameters and avoid over-fitting, GCN further assumes $\theta =\theta_0=-\theta_1$, leading to the following definition of a graph convolution.
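
With the single shared parameter $\theta$, the convolution becomes

$$
\mathbf{x} *_{G} \mathbf{g}_{\theta}=\theta\left(\mathbf{I}_{n}+\mathbf{D}^{-\frac{1}{2}} \mathbf{A} \mathbf{D}^{-\frac{1}{2}}\right) \mathbf{x}
$$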

To handle multi-channel inputs and outputs, GCN turns the equation above into a compositional layer, defined below.
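
With input feature matrix $\mathbf{X}$ and a filter parameter matrix $\boldsymbol{\Theta}$, the layer is

$$
\mathbf{H}=\mathbf{X} *_{G} \mathbf{g}_{\Theta}=f(\overline{\mathbf{A}} \mathbf{X} \boldsymbol{\Theta})
$$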

where $\overline{\mathbf{A}}=\mathbf{I}_{\mathbf{n}}+\mathbf{D}^{-\frac{1}{2}} \mathbf{A} \mathbf{D}^{-\frac{1}{2}}$ and $f(\cdot)$ is an activation function. Using $\overline{\mathbf{A}}$ as defined above leads to numerical instability. To address this problem, GCN applies a normalization trick: $\overline{\mathbf{A}}=\tilde{\mathbf{D}}^{-\frac{1}{2}} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-\frac{1}{2}}$ with $\tilde{\mathbf{A}}=\mathbf{A}+\mathbf{I}_{\mathbf{n}}$ and $\tilde{\mathbf{D}}_{i i}=\sum_{j} \tilde{\mathbf{A}}_{i j}$. Although GCN is a spectral-based method, it can equally be viewed as a spatial-based one that aggregates feature information from a node's neighborhood.
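
As a purely illustrative sketch, the normalized propagation rule above can be written as a minimal dense PyTorch layer (the class and argument names are mine, a small dense adjacency matrix is assumed, and ReLU stands in for $f(\cdot)$):

```python
import torch
import torch.nn as nn


class GCNLayer(nn.Module):
    """Minimal dense GCN layer: H = f(D̃^{-1/2} Ã D̃^{-1/2} X Θ)."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.theta = nn.Linear(in_dim, out_dim, bias=False)  # Θ

    def forward(self, adj, x):
        # adj: (n, n) adjacency matrix A; x: (n, in_dim) node features X
        a_tilde = adj + torch.eye(adj.size(0))        # Ã = A + I_n
        d_inv_sqrt = a_tilde.sum(dim=1).pow(-0.5)     # D̃_ii^{-1/2}
        a_norm = d_inv_sqrt[:, None] * a_tilde * d_inv_sqrt[None, :]
        return torch.relu(a_norm @ self.theta(x))     # f(Ā X Θ)
```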

Building on GCN, several recent works have made incremental improvements by exploring alternative symmetric matrices. Adaptive Graph Convolutional Network (AGCN) learns hidden structural relations unspecified by the graph adjacency matrix. It constructs a so-called residual graph adjacency matrix through a learnable distance function which takes two nodes' features as inputs. Dual Graph Convolutional Network (DGCN) introduces a dual graph convolutional architecture with two graph convolutional layers in parallel.

Spatial-based ConvGNNs

Spatial-based approaches inherit ideas from RecGNNs to define graph convolutions by information propagation. The spatial graph convolutional operation essentially propagates node information along edges.

Convolution on a 2D image can be seen as a special case of graph convolution (as illustrated in the figure below): in an image, the representation of the red (center) pixel is a weighted sum of the nearby representations; following the same idea, we can define the convolution on a graph as a weighted sum over a node's neighborhood.

[Figure: 2D convolution on an image vs. graph convolution, both computed as weighted sums over a neighborhood.]

Neural Network for Graphs (NN4G) was proposed around the same time as GNN*. Unlike RecGNNs, NN4G does not share parameters across layers; in addition, there are residual connections between layers. It strongly resembles the later GCN, except that NN4G applies no normalization, which causes the representations at different layers to differ greatly in scale. The lesson here is to keep an eye on the most basic sanity checks: whether the algorithm converges, whether the design can make the parameters blow up, and what the scale of each layer is. Deep learning may feel like assembling standard building blocks, but the technical details of model design still deserve real care.

Message Passing Neural Network (MPNN) outlines a general framework of spatial-based ConvGNNs. This work views graph convolution as a form of message passing in which node information is propagated along edges. The message passing mechanism is defined as follows.
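
With $\mathbf{x}^{e}_{vu}$ denoting the (optional) edge feature between $v$ and $u$, the update at layer $k$ is

$$
\mathbf{h}_{v}^{(k)}=U_{k}\left(\mathbf{h}_{v}^{(k-1)}, \sum_{u \in N(v)} M_{k}\left(\mathbf{h}_{v}^{(k-1)}, \mathbf{h}_{u}^{(k-1)}, \mathbf{x}_{v u}^{e}\right)\right)
$$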

where the hidden representation at layer zero is the node input feature, and $U_k(\cdot)$ and $M_k(\cdot)$ are functions with learnable parameters. After $K$ rounds of message passing, $h_v^{(K)}$ can be passed directly to an output layer for node-level prediction, or fed into a readout function for graph-level prediction. The readout function generates a graph-level representation from the node representations.
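
A generic way to write it, with $R(\cdot)$ a permutation-invariant function, is

$$
\mathbf{h}_{G}=R\left(\mathbf{h}_{v}^{(K)} \mid v \in G\right)
$$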

Graph Isomorphism Network (GIN) finds that previous MPNN-based methods are incapable of distinguishing different graph structures based on the graph embeddings they produce. To remedy this shortcoming, GIN weights the central node with a learnable parameter $\epsilon^{(k)}$.
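
With an MLP as the update function, the GIN layer is

$$
\mathbf{h}_{v}^{(k)}=\operatorname{MLP}\left(\left(1+\epsilon^{(k)}\right) \mathbf{h}_{v}^{(k-1)}+\sum_{u \in N(v)} \mathbf{h}_{u}^{(k-1)}\right)
$$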

GraphSage does not let all of a node's neighbors participate in message passing directly, because the number of neighbors can vary dramatically from node to node and computing over all of them is expensive. Instead, it samples a fixed number of nodes from a node's neighborhood, denoted $S_{\mathcal{N}(v)}$; the aggregation function $f_k$ must be a permutation-invariant function such as mean, sum, or max.
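
The resulting update is

$$
\mathbf{h}_{v}^{(k)}=\sigma\left(\mathbf{W}^{(k)} \cdot f_{k}\left(\mathbf{h}_{v}^{(k-1)},\left\{\mathbf{h}_{u}^{(k-1)}, \forall u \in S_{\mathcal{N}(v)}\right\}\right)\right)
$$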

Graph Attention Network (GAT), unlike the previous models, assumes that different neighbors should contribute with different weights during message passing, and these weights are learned during training. The graph convolution in GAT is defined as follows.
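
Including the node itself in its own neighborhood, the layer computes

$$
\mathbf{h}_{v}^{(k)}=\sigma\left(\sum_{u \in \mathcal{N}(v) \cup v} \alpha_{v u}^{(k)} \mathbf{W}^{(k)} \mathbf{h}_{u}^{(k-1)}\right)
$$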

where $\alpha _{vu}^k$ is the attention that node $v$ at layer $k$ pays to its neighbor $u$. I deviate from the original paper's term "connective strength" here because connectivity sounds like a symmetric measure, whereas the attention assigned here can be asymmetric. GAT computes this attention score as follows.
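
In the single-head form it is

$$
\alpha_{v u}^{(k)}=\operatorname{softmax}\left(g\left(\mathbf{a}^{T}\left[\mathbf{W}^{(k)} \mathbf{h}_{v} \| \mathbf{W}^{(k)} \mathbf{h}_{u}\right]\right)\right)
$$

where $g(\cdot)$ is a LeakyReLU activation, $\mathbf{a}$ is a vector of learnable parameters, $\|$ denotes concatenation, and the softmax normalizes over the neighbors of $v$.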

Going further, GAT uses multi-head attention to extract node features from several different aspects. Since GAT simply concatenates the results of the individual heads, it implicitly presumes that the features from every head are equally important.

Gated Attention Network (GAAN) introduces a self-attention mechanism which computes an additional attention score for each attention head. There are other graph attention models which might be of interest. However, they do not belong to the ConvGNN framework.

Improvements in training efficiency

Fast Learning with Graph Convolutional Network (FastGCN), unlike GraphSage, samples a fixed number of nodes per layer for the graph convolution rather than a fixed number of neighbors per node. It interprets graph convolutions as integral transforms of embedding functions of nodes under probability measures. Monte Carlo approximation and variance reduction techniques are employed to facilitate the training process.

Stochastic Training of Graph Convolutional Networks (StoGCN) reduces the receptive field of a graph convolution to an arbitrarily small size by using historical node representations as a control variate, at the cost of keeping those historical representations in memory.


Comparison between spectral and spatial models

  • First, spectral-based models are less efficient than spatial-based models: they either have to perform an eigendecomposition or handle the whole graph at once. Spatial-based models scale better to large graphs because they perform the convolution directly in the graph domain via information propagation, and the computation can be carried out over a batch of nodes instead of the whole graph.
  • Second, spectral-based models that rely on the graph Fourier basis generalize poorly to new graphs: the graph Laplacian differs from graph to graph, and any structural change requires a costly eigendecomposition. Spatial-based models perform the graph convolution locally at each node, so weights can easily be shared across different locations and structures.
  • Third, spectral-based models are limited to operate on undirected graphs. Spatial-based models are more flexible to handle multi-source graph inputs such as edge inputs, directed graphs, signed graphs, and heterogeneous graphs, because these graph inputs can be incorporated into the aggregation function easily.

Graph Pooling

After a GNN has produced node features, we need to use them for the downstream task. Pooling reduces the number of parameters via down-sampling, which helps with overfitting, permutation invariance, and computational complexity; readout generates a graph-level representation from the node representations. Since the two are very similar in principle, we use "pooling" to refer to both.

The most popular pooling methods today are mean/max/sum pooling, because they are very fast to compute. Some works use attention-based pooling, but these reduction operations are still unsatisfactory because, regardless of the graph's structure or size, they always produce a fixed-length vector. The Set2Set method instead builds a memory that grows with the input size, uses an LSTM to integrate all the embeddings, and only then applies a reduction operation.

Chebyshev Spectral CNN (ChebNet) proposes an interesting approach in which the graph's nodes are first rearranged. Input graphs are first coarsened into multiple levels by the Graclus algorithm. After coarsening, the nodes of the input graph and its coarsened versions are rearranged into a balanced binary tree. Arbitrarily aggregating the balanced binary tree from bottom to top will arrange similar nodes together, and pooling such a rearranged signal is much more efficient than pooling the original one. SortPooling orders the nodes through a different mechanism.

The aforementioned pooling methods mainly consider graph features and ignore the structural information of graphs. DiffPool proposes a differentiable pooling method that learns a soft cluster assignment matrix at each layer, so that nodes are coarsened hierarchically using both node features and graph structure.

Discussion of Theoretical Aspects

Receptive Field. The receptive field is how far a central node can "see" into its neighborhood. For GCN-like methods, each additional graph convolution layer increases the radius of the receptive field by 1, so for a connected graph there exists a sufficiently large number of layers at which every node, taken as the central node, can see the structural information of the entire graph.

VC Dimension. Vapnik–Chervonenkis (VC) dimension is a measure of the capacity (complexity, expressive power, richness, or flexibility) of a space of functions that can be learned by a statistical classification algorithm. It is defined as the cardinality of the largest set of points that the algorithm can shatter. Researchers have derived that GNN* with sigmoid or hyperbolic tangent activations has a VC dimension of $O(p^4n^2)$, which means that with these nonlinearities the model's capacity grows rapidly with the number of model parameters $p$ and the number of nodes $n$.

Graph Isomorphism. Two graphs are isomorphic if they are topologically identical. The Graph Isomorphism Network (GIN) work proves that if a GNN maps two graphs $G_1$ and $G_2$ to different embeddings, then the two graphs can be identified as non-isomorphic by the Weisfeiler-Lehman (WL) test. The same work shows that GCN and GraphSage are unable to distinguish certain different graph structures, and further proves that if a GNN's aggregation function and readout function are both injective, the GNN is exactly as powerful as the WL test at distinguishing non-isomorphic graphs.

Equivariance and Invariance. A GNN must be an equivariant function for node-level tasks and an invariant function for graph-level tasks. Formally, let $f(\mathbf{A}, \mathbf{X}) \in \mathbb{R}^{n \times d}$ be a GNN and $\mathbf Q$ any permutation matrix. $f$ is equivariant if $f\left(\mathbf{Q} \mathbf{A} \mathbf{Q}^{T}, \mathbf{Q X}\right)=\mathbf{Q} f(\mathbf{A}, \mathbf{X})$, and invariant if $f\left(\mathbf{Q} \mathbf{A} \mathbf{Q}^{T}, \mathbf{Q X}\right)= f(\mathbf{A}, \mathbf{X})$. In other words, the GNN must give consistent results for the same nodes regardless of how the nodes are ordered.

Spatial-Temporal GNNs

Graphs in many real-world applications are dynamic both in terms of graph structures and graph inputs. Spatial-temporal graph neural networks (STGNNs) occupy important positions in capturing the dynamicity of graphs. Methods under this category aim to model the dynamic node inputs while assuming interdependency between connected nodes. STGNNs capture spatial and temporal dependencies of a graph simultaneously. The task of STGNNs can be forecasting future node values or labels, or predicting spatial-temporal graph labels. STGNNs follow two directions, RNN-based methods and CNN-based methods.

Suppose a simple RNN on a graph computes the hidden state at step $t$ as follows.
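
With node feature matrix $\mathbf{X}^{(t)}$ at time step $t$:

$$
\mathbf{H}^{(t)}=\sigma\left(\mathbf{W} \mathbf{X}^{(t)}+\mathbf{U} \mathbf{H}^{(t-1)}+\mathbf{b}\right)
$$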

Inserting a graph convolution into this RNN gives the following.
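
Writing $\operatorname{Gconv}(\cdot)$ for an arbitrary graph convolution layer:

$$
\mathbf{H}^{(t)}=\sigma\left(\operatorname{Gconv}\left(\mathbf{X}^{(t)}, \mathbf{A} ; \mathbf{W}\right)+\operatorname{Gconv}\left(\mathbf{H}^{(t-1)}, \mathbf{A} ; \mathbf{U}\right)+\mathbf{b}\right)
$$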

RNN-based approaches suffer from time-consuming iterative propagation and gradient explosion/vanishing issues. As alternative solutions, CNN-based approaches tackle spatial-temporal graphs in a non-recursive manner with the advantages of parallel computing, stable gradients, and low memory requirements.

Future Directions

Model Depth. In computer vision, CNNs have shown that deeper networks perform better; research suggests the opposite for GNNs. Because node representations are repeatedly propagated to their neighbors, in theory a network with enough layers drives all node representations toward a single point. This raises the question of whether going deep is still a good strategy for learning graph data.

Scalability Trade-off. The scalability of GNNs is gained at the price of corrupting graph completeness. Whether using sampling or clustering, a model will lose part of the graph information: sampling means a node may miss a very important neighbor, while clustering may strip a node of its distinctive pattern. How to balance scalability against graph completeness is a major future direction.

Heterogeneity. Current GNNs mostly assume homogeneous graphs, but in heterogeneous graphs nodes and edges may have different types and take different forms of input (e.g., images or text).

Dynamicity. Graphs are in nature dynamic in a way that nodes or edges may appear or disappear, and that node/edge inputs may change time by time. New graph convolutions are needed to adapt to the dynamicity of graphs. Although the dynamicity of graphs can be partly addressed by STGNNs, few of them consider how to perform graph convolutions in the case of dynamic spatial relations.
