Erich Focht
This post presents a test version of VEOS, derived from the official v1.3.2, with two added features: accelerated DMA and a fix for external sampling VE profiling, together with instructions on how to test these new features. It is a follow-up to the post “Building VEOS”, which described
- a way to start building and experimenting with VEOS packages,
- an example usage with my own branch of VEOS from github.com/efocht, which included a patch for speeding up DMA transfers.
The news: the patched version now sits on top of v1.3.2 and includes a fix for a rare issue in the tuned DMA manager as well as a patch fixing occasional lock-ups when using veprof, an external profiler for VE programs.
NOTE: The patches included in this unofficial VEOS version are currently being reviewed and will be included in the official release some time in the future. Since this process could take O(months), I simply want to provide them to those interested in testing and using the higher IO performance, the higher VE-VH transfer speeds for VEO and VHcall, as well as the external VE profiling tool veprof. The patches have already been tested and I consider them safe enough to not kill your cat, but, of course, this release comes without any warranty.
Download & Install
RPMs and SRPMs are provided as a release of the github project github.com/efocht/build-veos: v1.3.2-4dma.
Before installing the RPMs make sure that you have stopped the VEOS services and unloaded the VE-related kernel modules. For instructions on this, check this post.
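If you already have the build-veos repository checked out (see the Building section below), its helper script can take care of this step. A minimal sketch, assuming the x symlink from that repository is present in the current directory:
# the same step as in the Building section below: stop VEOS and VE
# related services, unload VE modules
x/vemods_unload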
Install the RPMs for example with:
cd BUILD_1.3.2_4dma
rpm -Fhv *.x86_64.rpm
The RPMs for the kernel modules have been built for CentOS 7.4. The source RPMs are provided, thus the packages can be rebuilt for CentOS 7.5.
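A rebuild of the kernel module packages from the source RPMs is a one-liner per package; the .src.rpm file names below are placeholders, take the actual ones from the release page:
# rebuild the kernel module packages from the source RPMs
# (file names are placeholders; by default the rebuilt RPMs end up
#  under ~/rpmbuild/RPMS/x86_64/)
rpmbuild --rebuild ve_drv-kmod-*.src.rpm
rpmbuild --rebuild vp-kmod-*.src.rpm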
Building
The packages can be built by using the scripts in github.com/efocht/build-veos:
git clone -b v1.3.2-4dma https://github.com/efocht/build-veos.git
ln -s build-veos/x .
Clone repositories, checkout proper branch
mkdir BLD
cd BLD
GITH="https://github.com/efocht"
for repo in build-veos veos libved ve_drv-kmod vp-kmod; do
    git clone $GITH/$repo.git
done
for repo in build-veos veos libved ve_drv-kmod vp-kmod; do
    cd $repo
    git checkout v1.3.2-4dma
    cd ..
done
ln -s build-veos/x .
Build RPMs and replace the old ones with the new ones
# stop VEOS and VE related services, unload VE modules
x/vemods_unload
# replace installed RPMs by the newly built ones
export RPMREPLACE=1
cd libved
../x/bld.libved
cd ..
cd vp-kmod
../x/bld.vp
cd ..
cd ve_drv-kmod
../x/bld.ve_drv
cd ..
cd veos
../x/bld.veos
cd ..
# load VE modules, start VEOS and VE related services
x/vemods_load
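To verify that the new packages and modules are actually in place, a quick check could look like this (the package and module names are guessed from the repository names above):
# check the installed package versions
rpm -qa | grep -e veos -e libved -e ve_drv -e vp-kmod
# check that the VE kernel modules are loaded (module names assumed
#  to match the -kmod repository names)
lsmod | grep -e ve_drv -e vp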
Testing DMA performance
The simplest and most direct test is a benchmark that only does VE-VH DMA transfers. The following test uses the functions ve_recv_data() and ve_send_data() directly. These functions are actually used by the IO related system calls inside the pseudo process. They use unregistered VH buffers.
In order to get reproducible (and better) results, turn the VH cpufreq governor into performance mode:
sudo cpupower frequency-set --governor performance
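You can check the active policy afterwards with:
# show the current cpufreq policy and governor
cpupower frequency-info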
Huge Pages on the VH
Huge pages are essential for high performance DMA with unregistered buffers because they significantly reduce the overhead needed for virtual to physical translation.
For huge pages testing make sure they are enabled on the VH; you should find the appropriate directories in /sys/kernel/mm/hugepages. The tests below use 2MB huge pages, so make sure there is a sufficient number of them in the system. Increase their number e.g. with
sudo sysctl -w vm.nr_hugepages=8192
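Whether the pool was actually reserved can be checked in /proc/meminfo:
# HugePages_Total should now show the requested number of 2MB pages
grep Huge /proc/meminfo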
Transparent huge pages work, but could hide some trouble. They can be disabled by adding transparent_hugepage=never to the kernel boot options.
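Alternatively, transparent huge pages can be inspected and switched off at runtime through sysfs; this changes the global setting of the running system until the next reboot:
# show the current THP mode
cat /sys/kernel/mm/transparent_hugepage/enabled
# disable THP until the next reboot
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled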
There are several ways to get applications to use huge pages on the VH side. In a C program you can allocate huge page memory explicitly by using mmap():
addr = mmap(0, inp->size, PROT_READ | PROT_WRITE,
            MAP_PRIVATE|MAP_HUGETLB|MAP_ANONYMOUS|MAP_POPULATE,
            -1, 0);
When running native VE programs we want huge pages to be used on the VH side of the program, which is actually only the “pseudo” part of ve_exec, the part that waits for exceptions coming from the VE and processes them inside the ve_exec context. Since we don’t want to modify ve_exec or its libvepseudo.so library, the easiest way to let the syscalls use huge page buffers is to run the program under hugectl, the tool from the libhugetlbfs-utils package.
hugectl --heap <VE_EXECUTABLE> [...]
or, alternatively, set environment variables that preload libhugetlbfs.so and replace malloc with a huge-page-aware allocator:
export LD_PRELOAD=libhugetlbfs.so
export HUGETLB_MORECORE=yes
export HUGETLB_VERBOSE=2
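With these settings in place you can check whether the heap really ends up on huge pages while a native VE program runs; the executable name below is just a placeholder:
# run a native VE program (placeholder name) with the settings above
./a.out &
# HugePages_Free in /proc/meminfo should drop while it is running
grep HugePages_Free /proc/meminfo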
Building and Running Tests
Get the repository with the tests and build them:
git clone https://github.com/efocht/vhcall-memtransfer-bm.git
# build
cd vhcall-memtransfer-bm
make
Run single transfer tests, for example one 40MB transfer on small pages, from VE to VH:
$ ./ve2vh -s $((40*1024))
prepared
4952.983801[MiB/s]
Total: 4952.983801[MiB/s]
And now the same on huge pages:
$ ./ve2vh -s $((40*1024)) -H
prepared
10389.310854[MiB/s]
Total: 10389.310854[MiB/s]
Or run scans over a range of buffer sizes:
VH to VE, small pages (4k), unpinned, unregistered buffer
$ ./scan_vh2ve.sh
buff kb BW MiB/s
32 138
64 254
128 446
256 834
512 937
1024 1977
2048 1841
4096 3075
8192 3999
16384 4578
32768 4914
65536 4948
131072 5579
262144 5693
524288 5635
1048576 5654
VH to VE, huge pages (2M), unpinned, unregistered buffer
$ HUGE=1 ./scan_vh2ve.sh
buff kb BW MiB/s
32 136
64 283
128 393
256 933
512 1900
1024 2673
2048 4273
4096 6506
8192 8095
16384 9108
32768 9720
65536 10047
131072 10290
262144 10300
524288 10322
1048576 10344
Using veprof for VE Profiling
This topic actually deserves a separate article, therefore I’ll make it short:
The proprietary NEC compilers currently don’t support simple profiling. They do support something much fancier, called ftrace, which instruments each function, measures its performance counters, and renders them into a nice list of performance metrics for each function in the program.
Unfortunately ftrace requires recompilation and brings quite some overhead, especially with many small functions. Often a quick, low-overhead profile would provide enough information to get an idea of what needs to be optimized. After the ve_get_regvals() functionality had been added to VEOS, my colleague Holger Berger decided to find out whether an external profiler would be feasible, one that periodically reads out, from the VH, the performance counters and the IC register of certain VE processes. It uses mechanisms for monitoring VE processes similar to those of veperf, but has a different scope. The result is the new tool at github.com/SX-Aurora/veprof.
The tool promptly triggered some bugs hiding in VEOS which are not yet fixed in the official v1.3.2 release. They are fixed in the test release described in this post, v1.3.2-4dma, so veprof can finally be used.
Build veprof
Clone the repository and build the tools:
git clone https://github.com/SX-Aurora/veprof.git
cd veprof
make
The tools veprof and veprof_display provide the profile sampler and the postprocessing, respectively.
Usage
The following is taken more or less directly from the README.md of Holger’s veprof repository.
Sample with 100 Hz:
veprof ./exe
Sample with 50 Hz:
veprof -s 50 ./exe
Sample a wrapper calling exe:
veprof -e ./exe ./wrapper.sh
Sample an OpenMP code (currently works only with VEOS v1.3.2-4dma):
veprof --openmp ./exe
Display gathered results:
veprof_display veprof.out
A sample output is shown below:
FUNCTION   SAMPLES  TIME  TIME  VECTIME  VTIME    VOP  MFLOPS   MOPS     AVG    L1$  PRTCNF  LLC$E
                 %     %   [s]      [s]      %      %                   VLEN  MISS%     [s]   HIT%
main         53.33 55.00  0.08     0.08 100.00  98.63   33388  67707  254.46   0.00    0.00 100.00
subroutine   46.67 45.00  0.06     0.06  96.21  98.55   32109  65167  254.46   0.00    0.00 100.00