WA-GNN: accelerating graph neural networks with Tensor Core optimization
Authors
Liu, Yang
Abstract
Graph Neural Networks (GNNs) have been widely applied in various domains, such as social
network classification, biological prediction, and financial fraud detection,
offering effective solutions for non-Euclidean problems. A typical GNN consists of two major
phases: combination and aggregation. In the combination phase, the original feature vectors
are processed by a deep neural network with learnable weights, typically a multi-layer
perceptron (MLP), to generate new embeddings. This phase can efficiently utilize Tensor
Cores, specialized matrix computation units in modern Graphics Processing Units (GPUs)
optimized for high-throughput computation. In contrast, the aggregation phase collects feature
data from neighbouring nodes based on the sparse adjacency structure, leading to irregular data
access that significantly limits Tensor Core utilization. Consequently, the overall performance
of GNNs on GPUs is primarily constrained by the inefficient aggregation phase, where sparse
computation patterns hinder hardware utilization.
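
To make the phase split concrete, the following is a minimal NumPy/SciPy sketch of a single GNN layer; it is not code from the thesis, and all variable names (X, W, A, H, Z) are illustrative.

import numpy as np
import scipy.sparse as sp

num_nodes, in_dim, out_dim = 1000, 64, 32
X = np.random.rand(num_nodes, in_dim).astype(np.float32)   # node features
W = np.random.rand(in_dim, out_dim).astype(np.float32)     # learnable MLP weights
A = sp.random(num_nodes, num_nodes, density=0.01,
              format="csr", dtype=np.float32)               # sparse adjacency matrix

# Combination phase: a dense GEMM, the kind of regular computation that maps
# directly onto Tensor Cores on a modern GPU.
H = X @ W

# Aggregation phase: a sparse-matrix times dense-matrix product (SpMM). The CSR
# row-pointer / column-index structure forces irregular, data-dependent memory
# accesses, which is what limits Tensor Core utilization in this phase.
Z = A @ H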
To address this challenge, we propose WA-GNN (Warp-Specialization Accelerated GNNs),
a framework designed to fully exploit Tensor Core capabilities for
GNN inference. Our approach introduces the K-Concat data format to reorganize the adjacency
matrix into a Tensor Core-friendly layout. A warp specialization mechanism is designed to
optimize data loading and computation, while a C-allocation strategy is employed to assign
warp workloads. These techniques are integrated into three representative GNN models:
the Graph Convolutional Network (GCN), the Graph Isomorphism Network (GIN), and the Graph
Attention Network (GAT), each implemented with a customized kernel.
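
As a purely illustrative sketch of the general idea (the actual K-Concat layout, warp specialization, and C-allocation are defined in the thesis itself), one common way to make a sparse adjacency Tensor Core-friendly is to pad each node's neighbour list to a fixed width K and concatenate the rows into dense index/value tiles, so aggregation becomes regular, batched work. The function name pad_neighbours and the parameter K below are hypothetical.

import numpy as np
import scipy.sparse as sp

def pad_neighbours(A_csr, K, pad_index=0):
    """Return (num_nodes, K) arrays of neighbour indices and edge values,
    padded with zero-valued entries so every row has exactly K slots."""
    n = A_csr.shape[0]
    idx = np.full((n, K), pad_index, dtype=np.int64)
    val = np.zeros((n, K), dtype=np.float32)
    for row in range(n):
        start, end = A_csr.indptr[row], A_csr.indptr[row + 1]
        cols = A_csr.indices[start:end][:K]          # truncate rows longer than K
        idx[row, :len(cols)] = cols
        val[row, :len(cols)] = A_csr.data[start:start + len(cols)]
    return idx, val

A = sp.random(8, 8, density=0.3, format="csr", dtype=np.float32)
H = np.random.rand(8, 4).astype(np.float32)          # embeddings after combination
idx, val = pad_neighbours(A, K=4)

# With a fixed width K, aggregation becomes a gather followed by a dense, regular
# reduction -- the shape of work that Tensor Core tiles and specialized
# loader/compute warps can consume without per-row branching.
Z = (H[idx] * val[:, :, None]).sum(axis=1)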
Experimental results on multiple benchmark datasets demonstrate that WA-GNN achieves
an average 2× end-to-end speedup over the baselines across datasets for the GCN model,
with the performance gap widening as the dataset size increases. For the GIN model, WA-GNN
delivers performance comparable to the Deep Graph Library (DGL) on the H100 GPU. For
the GAT model, WA-GNN achieves an average 3× speedup across datasets. These results
demonstrate WA-GNN's effectiveness in leveraging Tensor Cores for GNN workloads and
similar sparse-matrix operations.
Description
This thesis is embargoed until December 12, 2026.
