In the past weeks I re-implemented the VEO API by using a completely different communication method between vector host (VH) and vector engine (VE) than the original veoffload project. Instead of the system call semantics for controlling VE processes AVEO uses a little RPC layer built on top of User DMA communication mechanisms: ve-urpc. AVEO itself is at github.com/SX-Aurora/aveo.
AVEO does re-use code from VEO: the main include file
is almost identical and the upper layers which deal with asynchronous
call requests queueing are slightly changed variants from VEO.
This project aims at tackling and solving some of the issues of the original VEO:
- Call latency: AVEO’s asynchronous call latency is down from ~100us to the range of 5.5-6us.
- Memory transfer bandwidth is much increased and benefits for small transfers also of the latency improvement.
- The VH and VE processes are now decoupled, VEOS-wise the VE side is a separate process. This solves several long-standing issues:
- Debugging of VE and VH side is now possible at the same time. Just attach an appropriate gdb to each side (VH and VE), set breakpoints and have fun.
- In AVEO using multiple VEs from one process is no problem. Thus creating multiple procs is possible. The limitations imposed by VEOS do not apply to AVEO since VE processes are simply that: normal VE processes.
- There are no additional worker threads competing for resources on the VE, therefore OpenMP scheduling on VE is simple and less conflict-prone.
- Performance analysis of the VE side with ftrace works and is simple to use.
- In principle AVEO VE kernels can even be connected with each other through NEC VE-MPI. An API for simplifying this is being worked on.
There is a drawback for the reduced latency: it is obtained by polling
in tight loops, therefore the VE core running the
aveorun tool will
be continuously spinning. Also the VH cores actively waiting for the
request to finish (by
veo_call_wait_result()) will be spinning,
making progress and using power.
NOTE: This code is being actively developed, so is rough around the edges. It is working and meanwhile integrated in a few decently sized projects. Please report issues on github!
Besides the added features the main benefit of using AVEO is the reduced call latency. The table below shows the results achieved on a VE10B.
Test 1: submit N calls, then wait for N results Test 2: submit one call and wait for its result, N times Test 3: submit N calls and wait only for last result Test 4: submit N synchronous calls | |------ veoffload 2.3.0 --------|---------- AVEO 0.9.6 -------------| |----------|-------------------------------|-----------------------------------------| | #calls | Test 1 Test 2 Test 3 | Test 1 Test 2 Test 3 Test 4 | | 1 | 108.00 104.00 104.00 | 23.00 10.00 10.00 7.00 | | 5 | 97.00 102.80 95.80 | 9.00 9.80 6.60 6.60 | | 10 | 96.70 102.20 95.00 | 7.50 10.10 6.10 6.70 | | 50 | 95.62 103.76 96.78 | 5.98 9.88 5.64 6.68 | | 100 | 96.66 104.57 96.27 | 5.78 10.19 5.62 6.71 | | 500 | 96.63 102.97 92.96 | 5.60 9.88 5.57 6.66 | | 1000 | 96.86 101.53 88.50 | 5.61 10.15 5.60 6.72 | | 5000 | 93.47 94.56 87.56 | 5.63 9.96 5.63 6.72 | | 10000 | 91.03 94.27 87.37 | 5.63 10.04 5.65 6.73 | | 50000 | 88.69 95.95 109.34 | 5.62 10.00 5.64 6.72 | | 100000 | 87.48 94.94 218.85 | 5.69 10.14 5.68 6.75 | | 200000 | 88.27 94.82 333.62 | 5.70 10.11 5.70 6.71 | | 300000 | 87.07 94.36 661.19 | 5.77 10.20 5.74 6.74 | |----------|-------------------------------|-----------------------------------------|
The memory transfer latency, measured by transfering 1 byte 30000 times, is similar:
Compatibility to VEO
AVEO aims at implementing fully the VEO API. Use that as a reference.
Actually codes which have been compiled for VEO and use only
dynamically loadable libraries for VE kernels don’t really need to be
recompiled and should run with AVEO. The
LD_LIBRARY_PATH must be set
to point to the AVEO installation lib/ directory and instead of the
original VEO veorun one should use the aveorun offload helper.
Codes which relink their own veorun helper with the
mk_veorun_static must relink the corresponding aveorun, the
command is the same, but is actually a symlink to
AVEO uses VE-VH shared memory for which it uses on VH side huge pages
of 2MB size. Configure (or ask your admin to configure) the VH to have
enough huge pages of 2MB size. You will need at least 128MB for each
possible peer or proc, the absolute minimum are 128MB per VE. If you
use VE accelerated IO, MPI, veo-udma, you will need more. At least
256MB/VE. Also you will need more if you want to run with multiple
procs per VE or have several simultaneous users of AVEO. The number of
huge pages in the system are visible in
/proc/meminfo and changeable
sysctl -w vm.nr_hugepages=2048
NOTE: If your AVEO program crashes it may leave behind offloading helper processes running on the VEs. Each of them will keep a core busy, spinning, and occupy hugepages in memory. You must find these processes with
and kill them. Hugepages will normally be released. Check with
AVEO brings the opportunity to extend the API and test new things. The following new offloading functionality is available only in AVEO, not in VEO:
Specify the core of a proc
The VE core on which a proc shall run can be specified by setting the environment variable
The proc will run the
aveorun helper on that core.
Find contexts of a proc
A ProcHandle instance can have multiple contexts. You should not use more contexts than cores on a VE.
The number of contexts opened on a proc can be determined by the function
int veo_num_contexts(struct veo_proc_handle *);
Pointers to context structures can be retrieved by calling:
struct veo_thr_ctxt *veo_get_context(struct veo_proc_handle *proc, int idx);
idx taking values between 0 and the result of
veo_num_contexts(). The context with
idx=0 is the main context of
the ProcHandle. It will not be destroyed when closed, instead it is
destroyed when the proc is killed by
veo_proc_destroy(), or when
the program ends.
Unloading a library
int veo_unload_library(struct veo_proc_handle *proc, const uint64_t libhandle);
unloads a VE library and removes its cached resolved symbols.
void veo_context_sync(veo_thr_ctxt *ctx);
will return when all requests in the passed context queue have been
processed. Note that their results still need to be picked up by
Synchronous kernel call
A synchronous kernel call is a shortcut of an async call + wait. In AVEO it is implemented differently from an async call and is being executed on a proc instead of a context. You can use this without actually having opened any context.
int veo_call_sync(struct veo_proc_handle *h, uint64_t addr, struct veo_args *ca, uint64_t *result);
NOTE: The synchronous kernel call has a implicit builtin timeout. If the call doesn’t return within ~30 minutes, the call is considered timed out and failed. Things will start going wrong from this point on. Do use it only if you have short function calls that need at most a few minutes. This call is included only because it was the first kernel call implementation which actually works without the requests concept and the queues of a context.
Make sure you can connect to github.com. Then
git clone https://github.com/SX-Aurora/aveo.git cd aveo make install
The installation goes into the
install/ subdirectory. If you decide
for another installation path, specify it with the
variable. This needs to be done for the make and make install
make DEST=<install_path> make install DEST=<install_path>
make install DEST=<install_path>
A set of tests and examples are in the
test/ subdirectory. By make
install their binaries get installed to
can execute them all, one after the other, by calling
Usage is similar to the original VEO. The relevant include file is
therefore point your compiler to the place where it is located by adding the option
You can link with
-lveo as before, with VEO, or use the new library name libaveoVH.so:
-L<install_path>/lib -lveo # or -L<install_path>/lib -laveoVH
A self-linked veorun from original VEO can not be used with AVEO!
The VE helper tool default name has been changed from
but pointing to it when it lives in a non-standard location still works by
setting the same environment variable:
This helper is compiled with OpenMP support, therefore if you have sequential code make sure to use
VE Performance Profiling with FTRACE
For performance profiling of the VE kernel code please point the environment variable VEORUN_BIN to the appropriate aveorun-ftrace:
and make sure sure that the kernel is compiled with
-ftrace. This works only for ncc, nc++ and nfort compiled codes.
The Python API to VEO also works with AVEO. Make sure you use the branch aveo.