Guillermo Alaejos López
General Matrix Multiplication (GEMM) is a fundamental component of scientific computing and of current frameworks in the field of Deep Learning (DL). Current implementations of GEMM are mostly written in C, although their computational power resides in a small, highly optimised computational core, or micro-kernel, which is usually encoded with vector intrinsics or assembly instructions. State-of-the-art linear algebra libraries typically include a single micro-kernel per architecture, usually implemented manually by an expert.
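To make the role of the micro-kernel concrete, the following minimal Python/NumPy sketch (an illustration with assumed MR x NR = 4 x 4 dimensions, not the code of any particular library) shows the usual structure: the outer loops only partition C into small blocks, while all the floating-point work is concentrated in the routine that updates one MR x NR block.

    import numpy as np

    MR, NR = 4, 4  # micro-kernel dimensions; real values depend on the architecture

    def micro_kernel(Ablk, Bblk, Cblk):
        # In a real library this loop nest is replaced by hand-written (or, in
        # this thesis, auto-generated) vector intrinsics or assembly code.
        kc = Ablk.shape[1]
        for p in range(kc):
            Cblk += np.outer(Ablk[:, p], Bblk[p, :])

    def gemm(A, B, C):
        m, k = A.shape
        _, n = B.shape
        for j in range(0, n, NR):          # loop over column panels of C
            for i in range(0, m, MR):      # loop over row panels of C
                micro_kernel(A[i:i+MR, :], B[:, j:j+NR], C[i:i+MR, j:j+NR])
        return C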
On the one hand, the boom in the application of Deep Neural Networks (DNNs) in a wide variety of scientific fields has led to their use not only on computationally powerful servers, but also on low-power devices. Many of the computations performed in DNNs, both in training and in inference, are decomposed into linear algebra kernels provided by specialised libraries such as the Intel Math Kernel Library (MKL) or the BLAS-like Library Instantiation Software (BLIS) framework. In addition, DNN models introduce new problem sizes: their convolutional layers are transformed into matrix multiplications whose operands differ from the usual ones in their dimensions. In state-of-the-art models these matrices are very tall-and-skinny, with one dimension almost 200 times larger than the other in the most extreme cases.
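As an illustration of where these shapes come from (with hypothetical layer sizes, not figures taken from the thesis), lowering a convolution to a matrix multiplication, for instance via the classic im2col transform, produces operand dimensions such as the following:

    def conv_as_gemm(batch, h_out, w_out, c_out, kh, kw, c_in):
        M = batch * h_out * w_out   # one row of the input matrix per output pixel
        N = c_out                   # one column per output filter
        K = kh * kw * c_in          # receptive field unrolled into the inner dimension
        return M, N, K

    # A hypothetical late layer: 7x7 output, 512 filters, 3x3 kernel, 512 input channels
    M, N, K = conv_as_gemm(batch=1, h_out=7, w_out=7, c_out=512, kh=3, kw=3, c_in=512)
    print(M, N, K)   # 49 x 512 x 4608: K is roughly 100 times larger than M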
The computational kernels implemented by state-of-the-art linear algebra libraries are designed to deliver high performance with very large, roughly square matrices. However, since the matrices arising in current DNN models are highly rectangular, these kernels are not well suited to such problems and perform worse than a computational kernel of the appropriate size would. In the most extreme cases, the optimal size of the computational kernel may even vary from layer to layer within a single DNN model.
On the other hand, the wide variety of low-power hardware makes it virtually impossible to manually provide optimised computational kernels for every layer of the new DNN models on every architecture.
Thus arises the idea of producing these computational kernels via automatically generated code, with the aim of quickly making available kernels adapted to each architecture.
In this thesis we study how to auto-generate these computational kernels using different techniques and evaluate their performance. In more detail, the thesis explores the auto-generation of computational kernels via Python scripts that emit both C code with vector intrinsics and assembly code, via C++ templates, and with the Apache Tensor Virtual Machine (TVM).
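A minimal sketch of the script-based approach is given below (an assumed structure for illustration, not the exact generators developed in the thesis): a short Python function that, given the micro-kernel dimensions MR x NR, emits a C micro-kernel which keeps an MR x NR block of C in ARM NEON registers and updates it with fused multiply-add intrinsics, assuming packed A (kc x MR) and B (kc x NR) micro-panels.

    def generate_neon_microkernel(mr=4, nr=4):
        assert mr % 4 == 0 and nr % 4 == 0, "illustrative generator: multiples of 4 only"
        lines = [
            "#include <arm_neon.h>",
            "",
            f"void ukernel_{mr}x{nr}(int kc, const float *A, const float *B,",
            "                    float *C, int ldC) {",
        ]
        # One 4-wide accumulator per group of 4 columns in each row of the C block.
        for i in range(mr):
            for j in range(0, nr, 4):
                lines.append(f"  float32x4_t c{i}_{j} = vld1q_f32(&C[{i}*ldC + {j}]);")
        lines.append("  for (int p = 0; p < kc; ++p) {")
        for j in range(0, nr, 4):
            lines.append(f"    float32x4_t b{j} = vld1q_f32(&B[p*{nr} + {j}]);")
        for i in range(0, mr, 4):
            lines.append(f"    float32x4_t a{i} = vld1q_f32(&A[p*{mr} + {i}]);")
        for i in range(mr):
            for j in range(0, nr, 4):
                # C(i, j:j+4) += A(i, p) * B(p, j:j+4), taking A(i, p) from a register lane.
                lines.append(f"    c{i}_{j} = vfmaq_laneq_f32(c{i}_{j}, b{j}, a{i - i % 4}, {i % 4});")
        lines.append("  }")
        for i in range(mr):
            for j in range(0, nr, 4):
                lines.append(f"  vst1q_f32(&C[{i}*ldC + {j}], c{i}_{j});")
        lines.append("}")
        return "\n".join(lines)

    print(generate_neon_microkernel(4, 4))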
The implementation of even a single computational kernel is a complex and error-prone task that requires deep knowledge of the target architecture.
Choosing the optimal size of the computational kernel for each DNN model, and then choosing the right size for each layer, requires having a wide variety of kernels available. While manually implementing every possible computational kernel for each architecture is highly complex, an automatic generator makes it possible to test all candidate kernels and determine the optimal size for each use case. In this way, it is possible to outperform current linear algebra libraries, which implement a single computational kernel regardless of problem size.
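A hypothetical tuning driver along these lines is sketched below; the build and run callables and the candidate sizes are assumptions standing in for the code-generation and benchmarking machinery, and the dummy example at the end only illustrates the interface.

    import itertools
    import time

    def pick_best_microkernel(m, n, k, build, run,
                              mr_candidates=(4, 8, 12, 16),
                              nr_candidates=(4, 8, 12, 16)):
        best = None
        for mr, nr in itertools.product(mr_candidates, nr_candidates):
            kernel = build(mr, nr)               # generate and compile a candidate kernel
            start = time.perf_counter()
            run(kernel, m, n, k)                 # one timed GEMM execution
            gflops = 2.0 * m * n * k / (time.perf_counter() - start) / 1e9
            if best is None or gflops > best[0]:
                best = (gflops, mr, nr)
        return best                              # (GFLOPS, MR, NR)

    # Dummy callables on a tall-and-skinny, layer-like problem size.
    print(pick_best_microkernel(49, 512, 4608,
                                build=lambda mr, nr: (mr, nr),
                                run=lambda kernel, m, n, k: time.sleep(0.001)))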
The solution proposed here is compared against several state-of-the-art linear algebra libraries for different use cases and on different architectures.