Spatial Temporal Graph Convolutional Networks (ST-GCN) — Explained

TRAN Ngoc Thach

An explanation of the paper “Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition” [1] (aka ST-GCN), along with key findings from its source code [2].

Introduction

Graph Neural Networks and Knowledge Graphs have been my research interests. During my literature review, there were some notable papers that opened up new directions or enabled new techniques. Among those is the paper “Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition” [1]. Its source code is also publicly available [2], which is beneficial, as reproducibility is highly appreciated. I have spent some time on this paper, and hope that my short article can help like-minded people comprehend it quicker. Time is our precious asset. 😀

Paper Summary

Inspiration

The paper itself is an extension of another paper: “Semi-Supervised Classification with Graph Convolutional Networks” [3], which defined the formula for convolving nodes within a graph to create Node Embeddings:

Figure 1: Graph Convolution Formula
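For reference, here is my transcription of the layer-wise propagation rule from [3] that Figure 1 shows, where Ã = A + I adds self-loops and D̃ is its degree matrix:

```latex
H^{(l+1)} = \sigma\left( \tilde{D}^{-\frac{1}{2}} \, \tilde{A} \, \tilde{D}^{-\frac{1}{2}} \, H^{(l)} \, W^{(l)} \right),
\qquad \tilde{A} = A + I_N
```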

It looks scary at first, but thanks to [4], interpreting this formula can be untangled step by step. Its main idea is, given a specific node, to aggregate features from its neighbor nodes together with its own, then embed the aggregated features into another latent space. The new features become that node’s features. Due to matrix multiplication, all the nodes’ features get updated at once. This is one way convolution is carried out in a graph. Notice that the above neighbors are direct ones; what if we are also interested in convolving neighbors at a further distance? In this case, the same process is repeated. Every repetition convolves neighbor nodes one step further away. The deeper representation a node ends up with through the convolution, the more representative it can be: it is aware not only of itself, but also of the surrounding nodes near it. Typical tasks are: 1) node classification, 2) graph classification. For 1), the node features are put through a Fully Connected layer with appropriate outputs, e.g. Softmax for multiclass classification. For 2), all the nodes’ features can be concatenated into one Tensor, and we apply the same processing.
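A minimal sketch of one such graph-convolution step on a toy 3-node graph (my own illustration, not code from [2]; ReLU stands in for the non-linearity σ):

```python
import numpy as np

# Toy undirected graph: 3 nodes in a line (0 - 1 - 2).
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
A_hat = A + np.eye(3)                      # add self-loops (A-tilde)
D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt   # symmetric normalization

H = np.random.randn(3, 4)                  # 3 nodes, 4 input features each
W = np.random.randn(4, 8)                  # embed into an 8-dim latent space

# Aggregate neighbor features, embed, apply non-linearity:
H_next = np.maximum(A_norm @ H @ W, 0.0)   # all nodes updated at once
print(H_next.shape)
```

Repeating this step stacks one more hop of neighborhood into every node's representation.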

ST-GCN

The convolution described above is spatial. When the nodes gain another dimension, such as time, it is natural to convolve in both the spatial and temporal dimensions. That is the idea behind the ST-GCN paper [1]. The task to be solved is Human Action Recognition, using a skeleton sequence represented by body joints, each of which is a triplet of (Coordinate-X, Coordinate-Y, Confidence-Level):

Figure 2: Skeleton sequence in space and time (source: [1])

ST-GCN’s architecture:

Figure 3: Architecture (image by me, made with GraphViz). Left: the Forward function. Right: ST-GCN Block.
  • N: batch size.
  • C: original node features, i.e. triplet of (Coordinate-X/Coordinate-Y/Confidence-Level).
  • T: time steps.
  • V: number of nodes in Graph.
  • M: number of skeletons in a data record.
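As a rough sketch of how these dimensions are handled (toy sizes of my choosing, and omitting the initial data BatchNorm present in the real code [7]), the (N, C, T, V, M) input is folded so that each skeleton passes through the blocks independently:

```python
import torch

N, C, T, V, M = 2, 3, 16, 18, 2            # toy sizes; real T differs
x = torch.randn(N, C, T, V, M)

# Move the skeleton dimension M next to the batch, then merge them so
# every skeleton is convolved as its own sample.
x = x.permute(0, 4, 1, 2, 3).contiguous()  # (N, M, C, T, V)
x = x.view(N * M, C, T, V)                 # (N*M, C, T, V)
print(x.shape)
```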

Code Explanation

First and foremost, where is the formula in Figure 1 applied? Here it is:

Figure 4: This is where GCN Formula is applied (file: st_gcn.py [7])

Matrix multiplication is not commutative, but it is associative. One might expect the code to first aggregate the features between the current node and its neighbors, then embed the result into another latent space. But the code is implemented the other way around: the embedding takes place first, then the aggregation, which is valid precisely because of associativity.
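Associativity is easy to check numerically (a quick demo of my own, with random stand-ins for the adjacency, features, and weights):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))   # adjacency-like matrix
X = rng.standard_normal((5, 4))   # node features
W = rng.standard_normal((4, 8))   # embedding weights

aggregate_then_embed = (A @ X) @ W   # aggregate neighbors, then embed
embed_then_aggregate = A @ (X @ W)   # embed first, then aggregate

# Identical up to floating-point error:
print(np.allclose(aggregate_then_embed, embed_then_aggregate))  # True
```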

After convolving the nodes spatially, the network convolves temporally for each node separately. Notice the tuple kernel=(temporal_ker, 1), where temporal_ker is hardcoded to 9 (9 time steps).

Figure 5: Temporal Convolution for each node (file: st_gcn.py [7])
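A minimal sketch of such a temporal layer (channel counts are my own toy values): a 2D convolution whose kernel spans 9 time steps but only 1 graph node, so each joint is filtered along time independently, with padding chosen to keep T unchanged:

```python
import torch
import torch.nn as nn

C_in, C_out, T, V = 64, 64, 150, 18
tcn = nn.Conv2d(C_in, C_out, kernel_size=(9, 1), padding=(4, 0))

x = torch.randn(1, C_in, T, V)   # (batch, channels, time, nodes)
y = tcn(x)
print(y.shape)                   # time and node dimensions preserved
```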

The main training loop is defined in the file “recognition.py” [5], the content of which is typical. Although not discussed in the paper, the implementation makes use of a Residual Connection (aka Skip Connection) in each ST-GCN block to counter gradient degradation. Because the input and output channels differ, the connection is in fact a mini neural network (instead of an identity connection):
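A sketch of that residual branch, assuming toy channel counts: when the channel counts differ, a 1×1 convolution (followed by BatchNorm, as in [7]) projects the input to the output shape:

```python
import torch
import torch.nn as nn

C_in, C_out = 64, 128
residual = nn.Sequential(
    nn.Conv2d(C_in, C_out, kernel_size=1),  # 1x1 conv matches channel count
    nn.BatchNorm2d(C_out),
)

x = torch.randn(1, C_in, 150, 18)  # (batch, channels, time, nodes)
print(residual(x).shape)           # now addable to the block's main path
```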

Within an ST-GCN block, to prevent exploding gradients and achieve more stable convergence, BatchNorm is used before and after non-linear layers, as often advised [6]. A Dropout layer also helps reduce overfitting [8]. However, the code uses the default setting for Dropout (a probability of 0.5), whereas the recommended value for convolutional layers is 0.1 or 0.2 [9].

Similarly to Convolution Layers for images, 10 ST-GCN Blocks are chained to hopefully generate high-level feature maps on skeleton sequences.

Another point worth mentioning is the Adjacency Matrix, constructed as a tensor of shape (spatial_ker, V, V). The neighbors of a specific node can be grouped into subsets. The easiest way is to treat all of them equally (“Uni-labeling” in the paper); alternatively, the node itself has distance 0 while its direct neighbors have distance 1 (“Distance partitioning”). The number of such subsets is denoted spatial_ker; each subset has its own adjacency matrix describing which nodes count as neighbors of a given node. Each subset also produces its own feature map of shape (o_channel, T, V). Thanks to einsum(), those per-subset results are then aggregated neatly.
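The aggregation over partition subsets boils down to one einsum, which I would sketch like this (toy sizes; K plays the role of spatial_ker, and the subscript string matches the pattern used in the repo’s graph-convolution module [2]):

```python
import torch

N, K, C, T, V = 2, 3, 16, 10, 18
x = torch.randn(N, K, C, T, V)   # one feature map per partition subset
A = torch.randn(K, V, V)         # one adjacency matrix per subset

# For each subset k: multiply features by that subset's adjacency,
# then sum over k, yielding a single (N, C, T, V) feature map.
out = torch.einsum('nkctv,kvw->nctw', x, A)
print(out.shape)
```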

Final Remark

I must admit that when first reading the paper [1], I formed a certain perception of its main idea. For example, in section “Spatial Temporal Modeling” (page 4) [1], the authors extend the spatial modeling into the temporal dimension by tweaking the neighboring function B(vₜᵢ) to include temporal neighbors (of the same node, i.e. joint). I imagined we would have a dedicated function like this, but in the real implementation it turned out to be just matrix multiplication. This once again reinforces my preference for focusing on papers that have code available. Otherwise, one would take the paper with a grain of salt. 😕

Hopefully, my effort in summarizing ST-GCN helps other people understand it more quickly and reliably. 💪 The spatial-temporal convolution in this paper can be framed as a generic multi-dimensional convolution with one dimension denoting the nodes within a graph, to which the Graph Convolution formula (Figure 1) is applicable.

References

[1] Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition, Yan et al., 2018 (https://arxiv.org/abs/1801.07455)

[2] st-gcn (https://github.com/yysijie/st-gcn)

[3] Semi-Supervised Classification with Graph Convolutional Networks, Kipf et al., 2017, (https://arxiv.org/abs/1609.02907)

[4] Graph convolutional neural networks (https://mbernste.github.io/posts/gcn/)

[5] ST-GCN training loop (https://github.com/yysijie/st-gcn/blob/master/processor/recognition.py#L78)

[6] Where do I call the BatchNormalization function in Keras? (https://stackoverflow.com/questions/34716454/where-do-i-call-the-batchnormalization-function-in-keras)

[7] st_gcn.py (https://github.com/yysijie/st-gcn/blob/master/net/st_gcn.py)

[8] Dropout: A Simple Way to Prevent Neural Networks from Overfitting, Srivastava et al., 2014 (https://jmlr.org/papers/v15/srivastava14a.html)

[9] Where should I place dropout layers in a neural network? (https://stats.stackexchange.com/questions/240305/where-should-i-place-dropout-layers-in-a-neural-network)
