Erich Focht

Normal ncc, nc++, nfort behavior

The NEC Vector Engine (VE) compilers offer a large number of ways to optimize loops, vectorize and parallelize them. The actions of the compilers are being controlled by using compiler options (with a global scope) or specific compiler directives (#pragma _NEC in C/C++), as described in the following two compiler manuals:

While the control of loop transformations and vectorization through directives is very fine-grained, the lowest level of control over the generated code was to use VREG directives. They can assign local arrays to vector registers:

double xv[256];
#pragma _NEC vreg(xv)

Still, the compiler had to be left to do its thing and sometimes chose other vector instructions than intended by the programmer.

The LLVM-VE compiler

Sometimes programmers have a very clear idea of what they want the vector processor to do. ncc, nc++, nfort could not be forced to do that, so the remaining choices were limited:

  • use assembler,
  • use the LLVM-VE compiler with intrinsics. These are supported also in the upstream LLVM project, but the safer bet is probably still LLVM-VE.

In assembler code get quickly ugly and hard to change or extend. LLVM-VE intrinsics on the other hand allow C-alike programming with intrinsics able to map every instruction of the vector engine with every possible flag and option. A very detailed list is in the generated page https://sx-aurora-dev.github.io/velintrin.html with links into the assembler manual and ISA guide of the VE.

Below is an example of a BLAS daxpy implementation with LLVM-VE intrinsics:

#include <velintrin.h>
#define VLEN (256)
void daxpy(int n, double alpha, double *x, int incx,
           double *y, int incy)
{

  const double *xp = (const double *)x;
  double *yp = y;
  uint64_t incx_b = sizeof(double) * incx;
  uint64_t incy_b = sizeof(double) * incy;

  for (size_t i = 0; i < n; i += VLEN) {
    const int vl = (n-i < VLEN ? n-i : VLEN);
    __vr xv = _vel_vld_vssl(incx_b, xp+i, vl);
    __vr yv = _vel_vld_vssl(incy_b, yp+i, vl);
    __vr yr = _vel_vfmadd_vvsvl(yv, alpha, xv, vl);
    _vel_vst_vssl(yr, incy_b, yp+i, vl);
  }
}

Vector registers are declared with __vr and intrinsics have the form _vel_<mnemonic>_<arg_options>.

NEC VE compiler intrinsics

The proprietary NEC VE (C) compilers were supporting intrinsics since quite a while but that wasn’t made public. The support was limited and usually code generation broke when using optimization levels larger tha -O1. So basically, they were not safely usable.

In the last version of the C/C++ Compiler Guide a new section has appeared, that describes the Vector Type Extension while a much more detailed Japaneze documentation page showed up in the SX-Aurora SDK documentation that had an interesting title: NEC C/C++ Compiler: Vector Type Extension

For those who are not able to read Japaneze, like me, google (or other) translate is an invaluable tool. It revealed that the NEC VE C/C++ compilers support builtins and the document version 1.0.0 means, that this support seems pretty stable. Here is the english translation.

With these new intrinsics the BLAS daxpy example from above looks pretty similar, but with significant differences:

#define VLEN (256)
typedef double __vr_dp256 __attribute__((ext_vector_type(256)));

void daxpy(int n, double alpha, double *x, int incx,
           double *y, int incy)
{

  const double *xp = (const double *)x;
  double *yp = y;
  uint64_t incx_b = sizeof(double) * incx;
  uint64_t incy_b = sizeof(double) * incy;

  for (size_t i = 0; i < n; i += VLEN) {
    const int vl = (n-i < VLEN ? n-i : VLEN);

    __vr_dp256 xv, yv, yr;
    __builtin_ve_vld(xv, xp+i, incx_b, vl);
    __builtin_ve_vld(yv, yp+i, incy_b, vl);
    __builtin_ve_vfmadd(yr, xv, alpha, yv, vl);
    __builtin_ve_vst(yr, yp+i, incy_b, vl);
  }
}

The vector register type must be defined with typedef and is typed. In this case the compiler knows that we deal with a double!

The __builtin_ve_* intrinsics offer less control about the generated instructions. No flags (like .nc) can be specified and it looks like we can’t specify packed instructions explicitly, these are probably generated by the compiler depending on the context and arguments (I didn’t check this assumption).

The good news is: intrinsics are supported now in ncc and nc++ and work even with higher optimization levels than -O1. Still, they don’t seem to be aimed at a wide public… So you’re on your own if you play with them.


LLVM-VE project

Wikipedia