Erich Focht, Holger Berger
The intent of this post is to prepare fresh users for basic usage of the vector engine (VE) and for some follow-on posts on vectorization.
This little “getting started” guide uses some content from the SX Aurora TSUBASA Quickstart Guide written by Holger Berger that is usually handed out to fresh owners of an SX-Aurora.
Introduction
The SX-Aurora TSUBASA Vector Engine is a PCIe card containing a multicore vector processor and very high bandwidth HBM memory. An overview of the hardware is available in the post VE Hardware Overview.
Currently up to eight VE cards can be configured into an x86_64 vector host (VH) system. For each VE card a VEOS instance runs on the VH, a daemon that provides operating system functionality for the VE. VEs run without a kernel; system calls from native VE programs are forwarded to and executed on the VH.
Commands and Tools
The software components, compilers and tools needed for the VE are
installed by default under the path /opt/nec/ve. In the bin subdirectory you’ll find most of the utilities needed to interact with the VE. Besides compilers and MPI there are several VE-specific Linux tools:
- binutils: the binaries are prefixed by the letter n: nas, nar, nelfedit, nld, nnm, nobjdump, nstrings, etc.
- gdb: the GNU debugger for VE programs.
- ps, top, free, uptime, w: equivalents of the Linux commands; they show information on processes running on a particular VE, free memory on a VE, its uptime, etc.
- arch, uname, lscpu: information about the architecture and the system.
- iostat, pidstat, vmstat, sar, strace, taskset, time: programs to display various statistics about the VE or trace VE programs.
- ve_exec: the loader for native VE programs. It normally does not need to be called explicitly, because it should be registered with the Linux kernel of the VH as the interpreter for the VE binary format; native VE programs are thus executed seamlessly.
- vecmd: a VE management and monitoring program.
Most of the Linux-equivalent tools expect that the VE node they should address is set in the environment variable VE_NODE_NUMBER. If the tools return odd results, try setting the environment variable to one of the VE IDs. On systems with only one VE it can make sense to set the variable globally to 0:
$ export VE_NODE_NUMBER=0
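With the variable set, the VE equivalents of, for example, ps and free will report the processes running on and the free memory of that VE:
$ /opt/nec/ve/bin/ps
$ /opt/nec/ve/bin/free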
The fact that many of the programs have the same name as their normal Linux equivalents can cause confusion when simply adding /opt/nec/ve/bin to the command search $PATH. A possible solution to these name space clashes is to create properly prefixed symlinks in another directory. For example (as root):
#!/bin/bash
DIR=/usr/ve/bin
mkdir -p $DIR
cd /opt/nec/ve/bin
ls -1 | egrep -v '^n' | while read NAME; do
    ln -s `pwd`/$NAME $DIR/ve$NAME
done
ls -1 | egrep '^n' | while read NAME; do
    ln -s `pwd`/$NAME $DIR/$NAME
done
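With these symlinks in place the directory can be added to the command search path without shadowing the native Linux tools, e.g.:
$ export PATH=$PATH:/usr/ve/bin
The VE tools are then available under names like veps, vetop or veuptime, following the naming scheme of the script above.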
VE Status
A VE card starts up automatically when the VH is started, due to a systemd unit file. You should be able to see the running VE card(s) with
$ /opt/nec/ve/bin/uptime
and you should be able to see details of the card (like model, memory size, frequencies, HW errors) with
$ /opt/nec/ve/bin/vecmd info
When using multiple VEs, vecmd can be applied selectively to a particular VE card by adding the command line option -N <VE_number>.
The following commands are administrative commands that require root privileges.
The power state of the card can be queried with:
# /opt/nec/ve/bin/vecmd state get
The card can be powered off by:
# /opt/nec/ve/bin/vecmd state set off
To power the VE on again, use:
# /opt/nec/ve/bin/vecmd state set on
Before a VEOS update the VE card should be turned off and then switched into maintenance mode:
# /opt/nec/ve/bin/vecmd state set off
# /opt/nec/ve/bin/vecmd state set mnt
After the update the VE card needs to be reset:
# /opt/nec/ve/bin/vecmd reset card
Compiler usage
The NEC SDK comes with three VE compilers that are capable of automatic vectorization and automatic parallelization: ncc is the C compiler, nc++ the C++ compiler and nfort the Fortran compiler. The compilers pretty much emulate the behavior of GNU compilers in order to increase compatibility, but they are in no way related to or derived from the GNU compilers.
Note for C++ users: the compiler’s language default is -std=gnu++14, which stands for C++14 with some GNU extensions. Currently -std=c++14 and -std=c++11 are not fully supported and might have some issues with the standard library.
Note for Fortran users: the compiler partially supports Fortran 2003 and large parts of Fortran 2008.
A quick overview of the most important compiler options:
- -O0, -O1, -O2, -O3, -O4: optimization level; -O4 is aggressive and might change results, -O3 includes loop level transformations. Inlining works for -O2 and higher, but has to be enabled by separate options.
- -g: enable debug symbols.
- -report-all: create a .L file with a formatted source listing and messages, containing loop markers that make it easy to spot unvectorized loops.
- -fdiag-vector=2: increase verbosity of vectorization messages.
- -finline, -finline-functions: enable inlining; use it with C++.
- -fdiag-inline=2: give detailed messages about inlining.
- -fopenmp: enable OpenMP. Link with this option, too!
- -ftrace: enable performance analysis at subroutine/function level. Introduces overhead per call but this is cleanly subtracted in the output. Link with this option, too.
- -proginf: enable proginf output at program end (low overhead). Link with this option, too.
- -traceback: enable traceback output in case of application crash. Link with this option, too.
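Putting a few of these options together, a typical build might look like the following sketch (source and binary names are placeholders):
$ nfort -O3 -report-all -proginf -o myprog myprog.f90
$ nc++ -O2 -finline-functions -fdiag-inline=2 -proginf -o myprog myprog.cpp
The -report-all option should leave a myprog.L listing next to the source, with the vectorization messages mentioned above.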
Compiler directives can significantly change and improve the optimization of programs. One of the most important ones for vectorization declares that the iterations of the loop following the directive are not data dependent, so the loop can be vectorized. Its notation in Fortran is:
!$NEC ivdep
while in C and C++ one should use:
#pragma _NEC ivdep
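As an illustration, here is a minimal C sketch (the function and the index array are made up for this example). Without the directive the compiler has to assume that idx may contain repeated values, i.e. a dependency between iterations; the directive asserts that this is not the case:
void scatter_add(int n, const int *idx, double *a, const double *b)
{
#pragma _NEC ivdep
    for (int i = 0; i < n; i++)
        a[idx[i]] += b[i];
}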
Please see the compiler manuals for details about the directives; they allow ignoring dependencies and can influence the optimizations and transformations of the compilers.
Running VE programs
Non-MPI native VE programs are started explicitly or implicitly with the help of /opt/nec/ve/bin/ve_exec. This program is the loader for VE programs: it interacts with VEOS to create a process on the VE, allocates memory for it (managed in a virtual address space), and handles interruptions and exceptions raised by the VE process, such as the system calls offloaded to the VH. While this wasn’t the case in the early days of VEOS, nowadays ve_exec is normally registered as the interpreter for the SX-Aurora ELF binary format in /proc/sys/fs/binfmt_misc. Therefore there are two ways of starting such a native VE program:
Using ve_exec explicitly:
$ /opt/nec/ve/bin/ve_exec ./ve.binary
or implicitly by simply executing the binary from a VH shell command:
$ ./ve.binary
On machines with multiple VEs, the one on which the binary will be executed is specified either by the value of the environment variable VE_NODE_NUMBER or by the ve_exec command line option -N NODE_NR or --node=NODE_NR. The VE node number defaults to 0.
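For example, to run the binary on VE 1 (assuming a second card is installed), either of the following should work:
$ VE_NODE_NUMBER=1 ./ve.binary
$ /opt/nec/ve/bin/ve_exec -N 1 ./ve.binary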
MPI programs
For building and running MPI applications it is recommended to source the necmpivars.{sh,csh} script located in /opt/nec/ve/mpi/<VERSION>/bin64. This brings the wrappers mpincc, mpinc++ and mpinfort into the execution PATH; they should be used for compiling and linking MPI programs. A side effect is a name collision of some of the VE-specific utilities in /opt/nec/ve/bin with their native Linux variants. This means: if you need to use any of the x86_64 Linux tools with a colliding name (like time, strace, gdb, ...), call them by specifying their full path.
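As a sketch (replace <VERSION> with the installed NEC MPI version; the source file name is a placeholder):
$ source /opt/nec/ve/mpi/<VERSION>/bin64/necmpivars.sh
$ mpinfort -O3 -o a.out prog.f90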
Start MPI applications with the command:
$ mpirun ./a.out
This executes one MPI rank per VE core on all locally available VEs.
To execute one MPI process per VE, use (in this example for 2 VEs):
$ mpirun -v -np 2 -ve 0-1 ./a.out
You can select the VEs to execute on with the option -ve, using the numbering of the VE cards as displayed e.g. by /opt/nec/ve/bin/uptime.
An example that places 2 processes on two cards is
$ mpirun -v -venode -nn 2 -np 2 ./a.out
Please note that IO redirection like
$ mpirun ve_exec ./a.out > output
probably does not do what you expect: the MPI ranks are largely disconnected from the mpirun command, and their stdout is not connected to mpirun’s stdout. To achieve IO redirection, use
$ mpirun /opt/nec/ve/bin/mpisep.sh ./a.out
which creates one output file per MPI rank, with the rank encoded in the file name, according to the value of the environment variable NMPI_SEPSELECT.
Performance analysis
For basic performance analysis the VE program should be compiled and linked with the -proginf option. In addition, at run time, the following environment variable should be set for non-MPI native VE programs:
$ export VE_PROGINF=YES|DETAIL
For MPI parallel applications set
$ export NMPI_PROGINF=YES|ALL|DETAIL|ALL_DETAIL
These variables make the program dump a summary of the performance metrics, computed from the performance counter registers of the CPU, at the end of the application. Using this feature causes no additional overhead besides printing the performance values at the end of the program.
A more detailed performance analysis can be obtained by using the -ftrace option for compilation and linking. This instruments each function/subroutine and measures its performance metrics. The approach causes some overhead that can increase the run-time significantly; when small functions are called very many times, consider inlining them.
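A sketch of such an instrumented build and run (file names are placeholders):
$ nfort -O3 -ftrace -o a.out prog.f90
$ ./a.out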
Running the application will create one or more ftrace.out files, which can be viewed with
$ /opt/nec/ve/bin/ftrace [-f filename]
For MPI applications it can be more convenient to use a GUI tool that can compare different MPI ranks:
$ /opt/nec/ve/ftraceviewer/ftraceviewer
The ftrace files must be opened in the GUI.
Debugging
Compile the application with -traceback. Run the program with the environment variable set that allows generation of traceback output:
$ export VE_TRACEBACK=ALL
When crashing, the program should generate a traceback of hex addresses. These can be resolved into source files and source lines with the utility /opt/nec/ve/bin/naddr2line if the application was compiled with -g.
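For instance, assuming naddr2line accepts the usual addr2line options, a printed address (made up here) can be resolved with:
$ /opt/nec/ve/bin/naddr2line -e ./ve.binary 0x600000001234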
You can also use /opt/nec/ve/bin/gdb to debug a VE executable or examine a core file. For best results with gdb, compile with -g2.
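For instance, to examine a core file from a crashed run (the core file name depends on the system’s core pattern):
$ /opt/nec/ve/bin/gdb ./ve.binary core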