Zero-shot pruning of transformer language models using non-dominated sorting genetic algorithm
Abstract
Large Language Models (LLMs) are advanced neural networks trained on massive text corpora to understand and generate human language. While they have grown significantly in power and capability, their extensive parameter counts result in high computational costs. Current pruning techniques typically apply uniform sparsity across all layers of LLMs. However, not all layers contribute equally to the model's performance.
Therefore, we propose a non-uniform sparsity mapping algorithm that assigns a different sparsity level to each layer according to its impact on the model's performance. To identify effective allocation schemes, we construct a search space consisting of a population of candidate sparsity mappings for the LLM. Guided by performance evaluations, an evolutionary algorithm applies crossover and mutation to the top-performing candidates in this population. To determine the optimal sparsity mapping, we employ the Non-dominated Sorting Genetic Algorithm II (NSGA-II), which yields a Pareto-optimal set of solutions that trade off pruning ratio against performance.
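To make the search procedure concrete, the sketch below shows one way such a layer-wise sparsity search could be set up with pymoo's NSGA-II implementation (assuming its current API). The `evaluate_perplexity` function is a hypothetical placeholder for the zero-shot evaluation of a model pruned with a candidate per-layer sparsity vector, and the population size and number of generations are illustrative values, not settings reported here.

```python
# Sketch: per-layer sparsity allocation search with NSGA-II (via pymoo).
# `evaluate_perplexity` is a placeholder, not part of any public API.
import numpy as np
from pymoo.algorithms.moo.nsga2 import NSGA2
from pymoo.core.problem import ElementwiseProblem
from pymoo.optimize import minimize


def evaluate_perplexity(sparsity_per_layer):
    """Placeholder: prune a copy of the model with the given per-layer
    sparsity levels and return its perplexity on a held-out set."""
    raise NotImplementedError


class SparsityAllocation(ElementwiseProblem):
    def __init__(self, n_layers):
        # One decision variable per layer: the fraction of weights to prune.
        super().__init__(n_var=n_layers, n_obj=2, xl=0.0, xu=0.95)

    def _evaluate(self, x, out, *args, **kwargs):
        # Objective 1: perplexity of the pruned model (minimize).
        # Objective 2: negative mean sparsity, i.e. maximize overall pruning ratio.
        out["F"] = [evaluate_perplexity(x), -float(np.mean(x))]


problem = SparsityAllocation(n_layers=24)      # e.g. GPT-2 Medium has 24 blocks
algorithm = NSGA2(pop_size=32)                 # population of candidate sparsity mappings
result = minimize(problem, algorithm, ("n_gen", 50), seed=0, verbose=False)
pareto_front = result.F                        # non-dominated perplexity/sparsity trade-offs
```

The resulting `pareto_front` corresponds to the set of non-dominated trade-offs between model quality and pruning ratio described above.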
We implement an unstructured pruning approach to maximize sparsity. Within each layer, weights are ranked by a saliency score derived from a Hessian approximation (the second-order term of the Taylor series expansion of the loss), and the lowest-ranked weights are pruned.
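As an illustration of this selection step, the following sketch scores each weight of a linear layer with the second-order Taylor term, using accumulated squared gradients as a diagonal Hessian approximation (an empirical-Fisher proxy; the exact Hessian approximation used in practice may differ), and zeroes out the lowest-scoring fraction.

```python
# Sketch: saliency-based unstructured pruning of one linear layer, assuming a
# diagonal Hessian approximated by accumulated squared gradients (empirical
# Fisher). Per-weight saliency follows the second-order Taylor term:
# s_i = 0.5 * H_ii * w_i**2.
import torch


def prune_layer(layer: torch.nn.Linear, grad_sq: torch.Tensor, sparsity: float) -> None:
    """Zero out the `sparsity` fraction of weights with the lowest saliency.

    grad_sq: accumulated squared gradients of `layer.weight` (same shape),
             used as a diagonal Hessian approximation.
    """
    with torch.no_grad():
        saliency = 0.5 * grad_sq * layer.weight.pow(2)       # second-order term
        k = int(sparsity * saliency.numel())                 # number of weights to remove
        if k == 0:
            return
        threshold = torch.kthvalue(saliency.flatten(), k).values
        mask = saliency > threshold                          # keep only high-saliency weights
        layer.weight.mul_(mask.to(layer.weight.dtype))       # apply the pruning mask in place
```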
Furthermore, we employ a novel technique for efficiently measuring the importance scores of LLM layers. In this approach, the LLM is divided into several chunks of layers, and at each iteration the importance score for the selected chunk is computed. Using gradient accumulation, we aggregate these scores over mini-batches of input data. We apply our algorithm to the GPT-2 architecture to demonstrate its applicability to LLMs ranging from millions to billions of parameters.
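A minimal sketch of the chunk-wise scoring is shown below. It assumes a Hugging Face GPT-2 model (with `model.transformer.h` holding the transformer blocks), batches that include `labels` so the forward pass returns a language-modeling loss, and a first-order |weight × gradient| score as a stand-in for the importance metric; `get_batches` is a hypothetical data loader.

```python
# Sketch: chunk-wise layer importance with gradient accumulation.
# Gradients are accumulated over several mini-batches, then turned into a
# per-layer score for the blocks in the selected chunk.
import torch


def chunk_importance(model, get_batches, chunk, n_batches=8):
    """Return an importance score for each layer index in `chunk`."""
    blocks = model.transformer.h                         # GPT-2 transformer blocks (HF layout)
    model.zero_grad(set_to_none=False)
    for step, batch in enumerate(get_batches()):
        if step == n_batches:
            break
        loss = model(**batch).loss                       # language-modeling loss (needs `labels`)
        (loss / n_batches).backward()                    # accumulate gradients across batches
    scores = {}
    for idx in chunk:
        score = 0.0
        for p in blocks[idx].parameters():
            if p.grad is not None:
                score += (p.grad * p).abs().sum().item() # first-order |w * dL/dw| importance
        scores[idx] = score
    model.zero_grad(set_to_none=True)
    return scores
```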
We perform comprehensive experiments on the WikiText and PTB datasets, showing that our method leads to substantial performance improvements on the GPT-2 Medium, Large, and XL models. Remarkably, the GPT-2 models pruned with our algorithm are 15.8% and 3.8% smaller than those produced by the state-of-the-art techniques DistilGPT-2 and ZipLM, respectively, while suffering less performance degradation. Notably, our approach requires no retraining or fine-tuning, in contrast to these existing methods, which rely on extensive retraining.