**Materials**

- Paper: “How Powerful are Graph Neural Networks?”
- Blog: “WL test”
- OpenReview discussion of the paper
- YouTube lecture covering this paper

# Motivation

Despite GNNs revolutionizing graph representation learning, there is limited understanding of their representational properties and limitations. Here, we present **a theoretical framework for analyzing the expressive power of GNNs to capture different graph structures.** We then develop a simple architecture that is provably the most expressive among the class of GNNs and is as powerful as the Weisfeiler-Lehman graph isomorphism test.

Similar to GNNs, the WL test iteratively updates a given node’s feature vector by aggregating feature vectors of its network neighbors. What makes the WL test so powerful is **its injective aggregation update that maps different node neighborhoods to different feature vectors.**

**Main contributions**

- We show that GNNs are at most as powerful as the WL test in distinguishing graph structures.
- We establish conditions on the neighbor aggregation and graph readout functions under which the resulting GNN is as powerful as the WL test.
- We identify graph structures that cannot be distinguished by popular GNN variants, such as GCN and GraphSAGE, and we precisely characterize the kinds of graph structures such GNN-based models can capture.
- We develop a simple neural architecture, Graph Isomorphism Network (GIN), and show that its discriminative/representational power is equal to the power of the WL test.

# Preliminaries

## GNN Structure

This section summarizes the structure of mainstream GNNs: modern GNNs follow a **neighborhood aggregation strategy**, where we iteratively update the representation of a node by aggregating representations of its neighbors. After $k$ iterations of aggregation, a node’s representation **captures the structural information within its k-hop network neighborhood.** The paper splits neighborhood aggregation into two steps. First, the representations of the central node’s neighbors are aggregated by a function:

$$a_v^{(k)} = \text{AGGREGATE}^{(k)}\left( \left\{ h_u^{(k-1)} : u \in \mathcal{N}(v) \right\} \right)$$

Then the central node’s representation is updated:

$$h_v^{(k)} = \text{COMBINE}^{(k)}\left( h_v^{(k-1)}, a_v^{(k)} \right)$$

For graph-level classification problems, we need a readout function that aggregates all node-level representations into a representation of the entire graph:

$$h_G = \text{READOUT}\left( \left\{ h_v^{(K)} \mid v \in G \right\} \right)$$

The readout function should be permutation invariant, such as summation or a graph-level pooling function.
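
The two-step scheme above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper’s architecture: the aggregation is a plain neighbor sum, the combine step is a single linear layer with ReLU, and the readout is a sum; the weight names `W_self` and `W_neigh` are illustrative.

```python
import numpy as np

def gnn_layer(h, adj, W_self, W_neigh):
    """One neighborhood-aggregation step: AGGREGATE (neighbor sum), then COMBINE.

    h: (n, d) node features; adj: (n, n) 0/1 adjacency matrix.
    """
    agg = adj @ h                                        # sum of each node's neighbor features
    return np.maximum(0.0, h @ W_self + agg @ W_neigh)   # ReLU combine of self + aggregate

def readout(h):
    """Permutation-invariant graph-level readout (here: summation)."""
    return h.sum(axis=0)

# Toy graph: a triangle with random 4-dimensional node features.
adj = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
h = rng.normal(size=(3, 4))
W_self, W_neigh = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))

h1 = gnn_layer(h, adj, W_self, W_neigh)   # node representations after 1 hop
g = readout(h1)                           # graph-level representation
```

Stacking `gnn_layer` $k$ times gives each node a view of its $k$-hop neighborhood, after which `readout` produces the graph embedding.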

## Weisfeiler-Lehman Test

**The graph isomorphism problem asks whether two graphs are topologically identical.** This is a challenging problem: no polynomial-time algorithm is known for it yet.

The Weisfeiler-Lehman (WL) test of graph isomorphism is an effective and computationally efficient test that distinguishes a broad class of graphs. Its 1-dimensional form, “naïve vertex refinement”, is analogous to neighbor aggregation in GNNs. The WL test iteratively (1) aggregates the labels of nodes and their neighborhoods, and (2) hashes the aggregated labels into unique new labels. The algorithm decides that two graphs are non-isomorphic if at some iteration the labels of the nodes between the two graphs differ.

For the details of the WL test algorithm, see the blog “WL test”.
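
The two WL steps described above (aggregate neighbor labels, then relabel injectively) can be sketched as follows. Instead of a hash table, this sketch uses nested tuples as canonical labels, which keeps the relabeling injective and comparable across graphs; the function names are my own.

```python
from collections import Counter

def wl_colors(adj, labels, iterations=3):
    """1-WL refinement: a node's new color is its old color together with the
    sorted multiset of its neighbors' colors. Nested tuples serve as the
    injective "hash": distinct signatures stay distinct."""
    n = len(adj)
    colors = list(labels)
    for _ in range(iterations):
        colors = [
            (colors[v], tuple(sorted(colors[u] for u in range(n) if adj[v][u])))
            for v in range(n)
        ]
    return Counter(colors)

def maybe_isomorphic(adj1, labels1, adj2, labels2):
    """False means the WL test certifies the two graphs are non-isomorphic."""
    return wl_colors(adj1, labels1) == wl_colors(adj2, labels2)

triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
star = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]  # the same 3-node path, relabeled
```

The test separates the triangle from the 3-node path (their color histograms differ after one round), while the two relabelings of the path remain indistinguishable, as they should be.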

# Theoretical Framework

A GNN recursively updates each node’s feature vector to capture the network structure and features of other nodes around it, i.e., its **rooted subtree structures**. For notational simplicity, we can assign each feature vector a unique label in $ \{ a, b, c, \cdots \} $. Then, feature vectors of a set of neighboring nodes form a multiset: the same element can appear multiple times since different nodes can have identical feature vectors.

Since subtree structures are defined recursively via node neighborhoods, we can reduce our analysis to the question whether a GNN maps two neighborhoods (i.e., two multisets) to the same embedding or representation. **A maximally powerful GNN would never map two different neighborhoods, i.e., multisets of feature vectors, to the same representation.** This means its aggregation scheme must be **injective**.

To summarize the line of theoretical reasoning:

- First, define a GNN’s aggregation function as a function acting on the multiset defined by a node’s neighborhood.
- A multi-layer GNN repeatedly unfolds the surrounding neighborhoods into a subtree; because of this recursive definition, we only need to analyze the mapping at each level of the tree.
- The WL test upper-bounds the discriminative power of any GNN, and this bound is attained only when the aggregation is injective, since injectivity means the GNN never maps two different multisets to the same representation.

# Building Powerful GNN

A natural follow-up question is whether there exist GNNs that are, in principle, as powerful as the WL test. Our answer, in Theorem 3, is yes: if the neighbor aggregation and graph-level readout functions are injective, then the resulting GNN is as powerful as the WL test.

**For countable sets, injectivity well characterizes whether a function preserves the distinctness of its inputs.** Uncountable sets, where node features are continuous, need further consideration. In addition, it would be interesting to characterize how close together the learned features lie in the function’s image.

The discussion above compares the ability of GNNs and the WL test to distinguish graphs with different topologies, but a GNN **also needs to map similar neighborhoods to similar embeddings, which is what gives it generalization ability.** In contrast to the WL test, a GNN satisfying the criteria in Theorem 3 generalizes the test by learning to embed subtrees into a low-dimensional space. This enables GNNs not only to discriminate different structures, but also to learn to map similar graph structures to similar embeddings and to capture dependencies between graph structures.

## Graph Isomorphism Network

The GIN architecture proposed in the paper has the same power to distinguish network structures as the WL test. The key lemma states that **sum aggregators can represent injective**, in fact universal, functions over multisets.

The lemma tells us there exists a function $f$ acting on countable elements such that summing the transformed elements, $h(X)=\sum_{x\in X} f(x)$, is injective over multisets. By analogy with a GNN: $f$ acts on node features and $h(X)$ is the central node’s aggregation function, so we know such an $f$ exists that makes the GNN’s aggregation injective. Furthermore, any function $g$ defined on multisets can be decomposed into a sum composed with some function:

$$g(X) = \phi\left( \sum_{x \in X} f(x) \right)$$
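
One concrete way to see the lemma, assuming a finite feature universe (a toy sketch, not the paper’s construction, which uses a prime-power encoding for countably infinite universes): take $f$ to be one-hot encoding, so the sum $h(X)$ is exactly the count vector of the multiset, which is injective by definition.

```python
import numpy as np

universe = ["a", "b", "c"]  # countable (here finite) universe of node features

def f(x):
    """One concrete injective choice of f: one-hot encoding."""
    v = np.zeros(len(universe))
    v[universe.index(x)] = 1.0
    return v

def h(X):
    """Sum aggregator h(X) = sum_{x in X} f(x): the count vector of the multiset,
    so two multisets collide only if they are identical."""
    return sum((f(x) for x in X), np.zeros(len(universe)))
```

For example, `h(["a", "a", "b"])` is the count vector `[2, 1, 0]`, which differs from `h(["a", "b", "b"])`, so the sum preserves multiplicities that mean- and max-pooling would discard.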

The lemma above also guides the design of the COMBINE function applied after the central node receives its neighborhood aggregation, which yields GIN’s recursive node-update rule:

$$h_v^{(k)} = \text{MLP}^{(k)}\left( \left(1+\epsilon^{(k)}\right) \cdot h_v^{(k-1)} + \sum_{u \in \mathcal{N}(v)} h_u^{(k-1)} \right)$$

Generally, there may exist many other powerful GNNs; GIN is one example among many maximally powerful GNNs, while being simple.
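
The GIN update can be sketched directly from the recurrence, with an MLP playing the role of $f$ and $\phi$ from the lemma. A minimal NumPy version, assuming random weights and a fixed `eps` rather than the paper’s learnable $\epsilon^{(k)}$:

```python
import numpy as np

def mlp(x, W1, W2):
    """Two-layer perceptron, standing in for the composition of f and phi."""
    return np.maximum(0.0, x @ W1) @ W2

def gin_layer(h, adj, eps, W1, W2):
    """GIN update: h_v <- MLP((1 + eps) * h_v + sum of neighbor features)."""
    return mlp((1.0 + eps) * h + adj @ h, W1, W2)

rng = np.random.default_rng(0)
adj = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)  # triangle
h = rng.normal(size=(3, 4))
W1, W2 = rng.normal(size=(4, 8)), rng.normal(size=(8, 4))
h_next = gin_layer(h, adj, eps=0.1, W1=W1, W2=W2)
```

Note that the neighbor aggregation is a plain sum (no normalization, unlike GCN’s mean), which is what preserves multiset multiplicities.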

## Mean/Max Pooling

What happens if we replace the sum in $h(X)=\sum_{x \in X} f(x)$ with mean or max-pooling as in GCN? Mean and max-pooling aggregators are still well-defined multiset functions because they are permutation invariant. But, **they are not injective**.

The first figure in the paper compares the expressive power of the three aggregators, ranked sum > mean > max. The second figure gives simple cases where mean and max aggregation fail.
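
Two tiny multiset examples (my own, in the spirit of the paper’s failure cases) make the ranking concrete: sum sees multiplicities, mean sees only the distribution, and max sees only the underlying set.

```python
# Multisets that sum distinguishes but mean and max do not
# (one is just two copies of the other):
X1 = [1.0, 1.0]
X2 = [1.0, 1.0, 1.0, 1.0]
assert sum(X1) != sum(X2)                       # sum: different
assert sum(X1) / len(X1) == sum(X2) / len(X2)   # mean: identical
assert max(X1) == max(X2)                       # max: identical

# Multisets that mean distinguishes but max does not
# (same support, different proportions):
X3 = [1.0, 2.0]
X4 = [1.0, 1.0, 2.0]
assert sum(X3) / len(X3) != sum(X4) / len(X4)   # mean: different
assert max(X3) == max(X4)                       # max: identical
```

Every case max can separate, mean can too, and every case mean can separate, sum can too, which is the sum > mean > max ordering.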

### MEAN LEARNS DISTRIBUTIONS

It is easy to see that when one multiset contains several copies of another, mean aggregation cannot tell them apart; hence the conclusion: mean learns distributions. The mean aggregator may perform well if, for the task, the statistical and distributional information in the graph is more important than the exact structure. Moreover, when node features are diverse and rarely repeat, the mean aggregator is as powerful as the sum aggregator. This may explain why, despite the limitations identified, GNNs with mean aggregators are effective for node classification tasks, such as classifying article subjects and community detection, where node features are rich and the distribution of the neighborhood features provides a strong signal for the task.

### MAX-POOLING LEARNS SETS WITH DISTINCT ELEMENTS

Max-pooling captures neither the exact structure nor the distribution. However, it may be suitable for tasks where it is important to identify representative elements or the “skeleton”, rather than to distinguish the exact structure or distribution. Qi et al. (2017) empirically show that the max-pooling aggregator learns to identify the skeleton of a 3D point cloud and that it is robust to noise and outliers.

# OpenReview Discussion

### Review1.

My chief concern is equating the Weisfeiler-Lehman test (WL-test) with Weisfeiler-Lehman-type GNNs (WL-GNNs). The WL-test relies on countable set inputs and injective hash functions. Here, the paper is oversimplifying the WL-GNN problem. After the first layer, a WL-GNN is operating on uncountable sets. On uncountable sets, saying that a function is injective does not tell us much about it; we need a measure of how closely packed we find the points in the function’s image (a measure in measure theory, a density in probability). On countable sets, saying a function is injective tells us much about the function. Moreover, the WL-test hash function does not even need to operate over sets with total or even partial orders. As a neural network, the WL-GNN “hash” must operate over a totally ordered set ($\mathbb{R}^n$, $n > 0$). Porting the WL-test argument of “convergence to unique isomorphic fingerprints” to a WL-GNN requires a **measure-theoretic analysis of the output** of the WL-GNN layers, and a careful analysis of whether the total order of the set creates attractors when the layers are applied recursively.

**Reply. Validity of equating the WL test operating on countable sets to the WL-GNN operating on uncountable sets.** The reviewer makes a great observation that countability of node features is essential and necessary for our theory, and we acknowledge that our current Theorem 3 and Lemma 5 are built on the common assumption that input node features are from a countable universe. We have now made this clear in our paper. We also filled in a technical gap/detail to address R1’s concern that after the first iteration, we are in an uncountable universe: this actually does not happen. We can show that for a fixed aggregation function, hidden node features also form a countable universe, because the countability of input node features recursively propagates into deeper layers. We also added a rigorous proof for this (Lemma 4 in our revised paper). As the reviewer nicely suggests, for the uncountable setting, it would be useful to have measure-theoretic analysis, which we leave for future work. **Often input node features in graph classification applications (e.g., chemistry, bioinformatics, social)** come from a countable (in fact, finite) universe, so our assumption is realistic. In the revised version, we clearly stated our assumptions at the beginning of Section 3 and have added further discussion on the relation between the WL test and WL-GNN after Theorem 3.

### Review2.

The matrix analysis of the last paragraph also points to another **potential problem with the sum aggregator.** GIN needs to be shallow. With ReLU activations the reason is simple: for an adjacency matrix $A$, the value of $A^j$ grows very quickly with $j$ (it diverges). With sigmoid activations, GIN would experience vanishing gradients in graphs with high variance in node degrees.
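
The reviewer’s divergence claim is easy to check numerically. A quick sketch on a toy triangle graph (my own example, not from the review): the entries of $A^j$ scale like $\lambda_{\max}^j$, so repeated unscaled sum aggregation blows up whenever the spectral radius exceeds 1.

```python
import numpy as np

# Triangle graph: adjacency matrix with largest eigenvalue 2.
A = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)

# Frobenius norm of A^j for j = 1..7: grows roughly like 2**j.
norms = [np.linalg.norm(np.linalg.matrix_power(A, j)) for j in range(1, 8)]
```

This is exactly the growth the authors’ reply says the per-layer learnable networks can counteract by scaling the summed output back down.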

**Reply**. Furthermore, the reviewer is concerned **with a possibly exploding value due to the sum aggregation**, but this can be avoided because **we have different learnable neural networks** at each layer that can scale down the summed output (also, in practice, we did not observe such explosion).

### Review3.

I understand that GIN provably has more discriminative power than other GNN variants. But the ability to differentiate non-isomorphic graphs does not necessarily imply better graph classification accuracy, right? Could strong discriminative power backfire for graph classification? After all, we don’t need to solve graph isomorphism here.

**Reply**. As we have pointed out in the experiment section, although stronger discriminative power does not directly imply better generalization, it is reasonable to expect that models that can accurately capture graph structures of interest also perform well on the test set. In particular, with many existing GNNs, the discriminative power may not be enough to capture the graph substructures that are important for classifying graphs. Therefore, we believe strong discriminative power is generally advantageous for graph classification. In our experiments, we empirically demonstrated that our powerful GIN achieves both better generalization and a better fit to the training datasets compared to other GNN variants. GINs performed the best in general, and achieved state-of-the-art test accuracy. We leave further theoretical investigation of generalization to future work.