## Llama2 on the Vector Engine

*Erich Focht*

The SX-Aurora Vector Engine (VE) is an accelerator for HPC applications with large memory bandwidth and vector registers with the length of 16kbits. It is a long vector system that has been conceived for HPC and supports only vector computation in 32 and 64 bits.

The following slides summarize a few hours of experiments with porting, tuning and testing the tiny llama2.c project which aims at providing a single compact binary for llama based large language models inference.

My interest for llama2.c was inspired by a twitch video by Xpress AI. I was curious to see myself how the VE performs and where it stands in comparison to other architectures.

With minimal porting effort, which mainly circumvents the compiler’s refusal to vectorize two loops, we reach unexpectedly good performance with inference in float32 format:

- VE20B Vector Engine: 27 tokens / s
- VE30B Vector Engine: 44 tokens / s

The most important of the two changes was to unroll the matrix-vector multiplication, rewriting it from

```
void matmul(float* xout, float* x, float* w, int n, int d) {
int i;
#pragma omp parallel for private(i)
for (i = 0; i < d; i++) {
float val = 0.0f;
for (int j = 0; j < n; j++) {
val += w[i * n + j] * x[j];
}
xout[i] = val;
}
}
```

to

```
void matmul(float* xout, float* x, float* w, int n, int d) {
#pragma omp parallel
{
int nthr = omp_get_max_threads();
int ithr = omp_get_thread_num();
int block = (d + nthr - 1) / nthr;
int imin = ithr * block;
int imax = (ithr + 1) * block > d ? d : (ithr + 1) * block;
#pragma _NEC outerloop_unroll(8)
for (int i = imin; i < imax; i++) {
float val = 0.0f;
for (int j = 0; j < n; j++) {
val += w[i * n + j] * x[j];
}
xout[i] = val;
}
}
}
```

or (with same performance but explicit unrolling):

```
void matmul(float* xout, float* x, float* w, int n, int d) {
int i;
#pragma omp parallel for private(i)
for (i = 0; i < d; i += 8) {
xout[i ] = 0.0f;
xout[i + 1] = 0.0f;
xout[i + 2] = 0.0f;
xout[i + 3] = 0.0f;
xout[i + 4] = 0.0f;
xout[i + 5] = 0.0f;
xout[i + 6] = 0.0f;
xout[i + 7] = 0.0f;
for (int j = 0; j < n; j++) {
xout[i ] += w[(i ) * n + j] * x[j];
xout[i + 1] += w[(i + 1) * n + j] * x[j];
xout[i + 2] += w[(i + 2) * n + j] * x[j];
xout[i + 3] += w[(i + 3) * n + j] * x[j];
xout[i + 4] += w[(i + 3) * n + j] * x[j];
xout[i + 5] += w[(i + 3) * n + j] * x[j];
xout[i + 6] += w[(i + 3) * n + j] * x[j];
xout[i + 7] += w[(i + 3) * n + j] * x[j];
}
}
for (i = ((d + 7) / 8) * 8; i < d; i++) {
xout[i] = 0.0f;
for (int j = 0; j < n; j++) {
xout[i] += w[i * n + j] * x[j];
}
}
}
```

This was compared to running llama2 on nvidia V100-32GB-PCIE CPUs (technologically equivalent to the SX-Aurora VE20B) and nvidia A100-40GB-PCIE and A100-80GB-SXM4 (technologically equivalent to VE30). On the GPUs the tests were run through the text-generation-webui using the Huggingface transformers library which was loading and running the models in float16 format, giving the GPUs an edge by using only half the memory bandwidth for matrix transfers compared to the float32 format on VE. The results on GPUs ranged from 13.5 tokens/s (V100) to 18-22 tokens / s (A100).

Clearly, the VEs have a disadvantage because they don’t vectorize low precision operations and (so far) can’t be used for example for quantized models (VE3 actually supports 16 bits load/stores and conversion of float16 and bfloat16 to float32, but this couldn’t be tested, yet).

On the other hand: the very simple changes show that it is very easy to get large memory bandwidth for the memory bound matrix-vector multiplication on the SX-Aurora Vector Engine. In this particular case this compensates greatly the comparably low peak performance of the VE.

A boring video showing the port and tests in “twitch” style will be uploaded once edited. For now, find below the slides used in the video.

The code is in the branch **vectorengine** of the repository
https://github.com/efocht/llama2.c .

References and links: