Erich Focht

The monitoring tool veperf was introduced in November 2018 in the blog post VE monitoring with veperf. It is part of the py-veosinfo package and displays live performance metrics of the user's own running VE programs.

veperf has been extended to display some additional metrics: load and store memory bandwidths, the power consumption of each monitored VE, and the energy used by each VE since the monitoring tool was started.

In order to monitor memory bandwidths, the VE processes need to be started with the following environment variable setting:

export VE_PERF_MODE=VECTOR-MEM

This is valid for programs linked with any of the NEC SDK compilers: ncc, nc++, nfort. Programs linked with LLVM are currently not influenced by the environment variable and will not display memory bandwidth metrics.
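
As a minimal sketch (the executable name a.out is just a placeholder), enabling the counters for a run could look like this:

export VE_PERF_MODE=VECTOR-MEM
./a.out    # a.out stands for your own VE executable built with ncc, nc++ or nfort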

Installation

From PyPI

The py-veosinfo source package is uploaded to pypi.org and can be installed by running:

pip install py-veosinfo

The build requires the veosinfo-devel RPM to be installed, because it needs access to the header file veosinfo.h.

For local user installs the veperf executable is created in $HOME/.local/bin.
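
A complete non-root installation (assuming an administrator has already installed the veosinfo-devel RPM) could therefore look like:

pip install --user py-veosinfo
~/.local/bin/veperf --help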

From RPM

For CentOS 7 (Python 2.7) setups we provide an RPM in the YUM repository with the following configuration:

$ cat /etc/yum.repos.d/ve_extras.repo 
[ve-extra]
name=Aurora TSUBASA Extras
baseurl=https://sx-aurora.com/repos/veos/ef_extra
gpgcheck=0
enabled=1

With that configured, the installation command is:

sudo yum install py-veosinfo

The executable is installed as /usr/bin/veperf.
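
A quick sanity check after the installation (a generic command, nothing specific to this package):

$ which veperf
/usr/bin/veperf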

Local install without root privileges

If you have no root privileges, you can unpack the localinstall tarball in your home directory:

cd ~
wget https://github.com/SX-Aurora/py-veosinfo/releases/download/v2.5.1/py-veosinfo-2.5.1-localinstall-py27-py36.tgz
tar xzvf py-veosinfo-2.5.1-localinstall-py27-py36.tgz

It is essential to do this in $HOME because it puts the veosinfo (binary) module in a default place where Python finds it without needing to manipulate PYTHONPATH.

On RHEL or CentOS distributions you should now be able to execute veperf directly, because ~/.local/bin is added to your $PATH if it exists.
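
If your shell does not do this automatically, a line such as the following in ~/.bashrc (a generic shell idiom, not specific to py-veosinfo) achieves the same:

export PATH=$HOME/.local/bin:$PATH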

Usage

veperf shows performance metrics of the user’s own processes on all or selected nodes, measured over certain time intervals. The root user can monitor the performance metrics of all VE processes. Performance counter readout is done from the vector host (VH) with no overhead or interruption for the running processes; no instrumentation of the programs is necessary. Only registers that can be read without stopping the cores are involved, and the readout uses the VE’s system (privileged) DMA facilities.

Administrators can use the tool to track the overall performance of VEs and to detect VEs running at very low performance; identifying the “slow” users responsible is intentionally left as a separate step. Users can also start the tool just before running a job and record the performance evolution of their program over its runtime, which helps identify fast and slow phases of the program.

Command invocation

veperf [-h|--help] [-d|--delay DELAY] [-n|--node NODE] [INTERVALS]

Options

  • -d|--delay DELAY : performance metrics measurement time interval. Default: 2s.
  • -n|--node NODE : select VE node ID for measurement. Multiple --node options are allowed. When the option is omitted, all nodes are monitored.
  • INTERVALS : number of monitoring time intervals before exiting the program. When this argument is not specified, veperf will continue until it is killed (see the example below).
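
For instance, a hypothetical invocation that measures VE nodes 0 and 1 every second for 30 intervals would be:

veperf -d 1 -n 0 -n 1 30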

Output Format

The output for each monitoring time interval consists of:

  • a line containing the labels of the thread monitoring data
  • for each monitored VE node:
    • ->VE#: the VE node ID
    • for each visible (own) task running on the VE
      • pid, USRSEC, EFFTIME, MOPS, MFLOPS, AVGVL, VOPRAT, VTIMERATIO, L1CMISS, PORTCONF, VLLCHIT, LOADBW, STOREBW, comm
    • summed up values for MOPS, MFLOPS, POWER and ENERGY for the VE. If processes are set to monitor memory bandwidths, the summed LOAD and STORE memory bandwidths (Core <-> LLC) are displayed.

Only the visible processes contribute to the MOPS, MFLOPS, LOADBW and STOREBW sums, while POWER and ENERGY might include contributions from processes that are not visible because they belong to other users.

The example below shows 8 processes running on VE node 1 with an idle node 0, monitored with the default time interval of 2s.

pid          USRSEC  EFFTIME     MOPS    MFLOPS  AVGVL    VOPRAT  VTIMERATIO  L1CMISS  PORTCONF   VLLCHIT    LOADBW   STOREBW  comm      
-> VE0
SUM VE0: MOPS=0        MFLOPS=0        POWER=40.6W ENERGY=4.23kJ
-> VE1
199287      301.85s    1.000    75934     67531    175     98.6%       91.0%       0%        0%       96%   36.091   23.460  cpunodebind=-1--/scr/qe-6.3/bin/pw.x
199288      301.84s    1.000    76016     67630    176     98.6%       91.1%       0%        0%       97%   35.963   23.457  cpunodebind=-1--/scr/qe-6.3/bin/pw.x
199289      301.82s    1.000    76170     67759    175     98.6%       91.2%       0%        0%       97%   36.109   23.493  cpunodebind=-1--/scr/qe-6.3/bin/pw.x
199290      301.88s    1.000    75960     67581    176     98.6%       91.3%       0%        0%       97%   35.938   23.459  cpunodebind=-1--/scr/qe-6.3/bin/pw.x
199291      301.82s    1.000    75664     67277    175     98.6%       91.1%       0%        0%       96%   35.985   23.445  cpunodebind=-1--/scr/qe-6.3/bin/pw.x
199293      301.87s    1.000    75368     67017    176     98.6%       90.8%       0%        0%       96%   35.791   23.378  cpunodebind=-1--/scr/qe-6.3/bin/pw.x
199295      301.85s    1.000    75839     67449    175     98.6%       91.0%       0%        0%       96%   35.996   23.456  cpunodebind=-1--/scr/qe-6.3/bin/pw.x
199298      301.81s    1.000    75797     67420    176     98.6%       91.2%       0%        0%       97%   35.919   23.446  cpunodebind=-1--/scr/qe-6.3/bin/pw.x
SUM VE1: MOPS=606749   MFLOPS=539664   LOADBW=287.791GB/s STOREBW=187.593GB/s POWER=143.4W ENERGY=15.31kJ
pid          USRSEC  EFFTIME     MOPS    MFLOPS  AVGVL    VOPRAT  VTIMERATIO  L1CMISS  PORTCONF   VLLCHIT    LOADBW   STOREBW  comm      
-> VE0
SUM VE0: MOPS=0        MFLOPS=0        POWER=40.6W ENERGY=4.31kJ
-> VE1
199287      303.85s    1.000    58922     50487    174     98.4%       89.6%       0%        0%       96%   35.156   24.599  cpunodebind=-1--/scr/qe-6.3/bin/pw.x
199288      303.84s    1.000    58653     50276    175     98.4%       89.9%       0%        0%       96%   34.786   24.543  cpunodebind=-1--/scr/qe-6.3/bin/pw.x
199289      303.83s    1.000    59104     50677    174     98.4%       90.0%       0%        0%       96%   35.095   24.591  cpunodebind=-1--/scr/qe-6.3/bin/pw.x
199290      303.88s    1.000    58598     50235    175     98.4%       89.7%       0%        0%       96%   34.740   24.515  cpunodebind=-1--/scr/qe-6.3/bin/pw.x
199291      303.82s    1.000    58731     50330    174     98.4%       89.8%       0%        0%       96%   34.949   24.547  cpunodebind=-1--/scr/qe-6.3/bin/pw.x
199293      303.87s    1.000    58166     49826    175     98.4%       89.3%       0%        0%       96%   34.605   24.444  cpunodebind=-1--/scr/qe-6.3/bin/pw.x
199295      303.85s    1.000    58808     50405    174     98.4%       89.4%       0%        0%       96%   34.947   24.561  cpunodebind=-1--/scr/qe-6.3/bin/pw.x
199298      303.81s    1.000    58540     50175    175     98.4%       89.7%       0%        0%       96%   34.737   24.518  cpunodebind=-1--/scr/qe-6.3/bin/pw.x
SUM VE1: MOPS=469522   MFLOPS=402411   LOADBW=279.014GB/s STOREBW=196.318GB/s POWER=152.8W ENERGY=15.62kJ

Metrics

  • USRSEC: Task’s user time on VE. When the task is not scheduled on a core, this time does not progress.
  • EFFTIME: Effective time: ratio between user and elapsed time. A value lower than 1.0 is a sign that the task is spending time in syscalls.
  • MOPS: Millions of Operations Per Second.
  • MFLOPS: Millions of Floating Point Ops Per Second.
  • AVGVL: Average vector length.
  • VOPRAT: Vector operation ratio [percent].
  • VTIMERATIO: Vector time ratio (vector time / user time) [percent].
  • L1CMISS: SPU L1 cache miss time [percent].
  • PORTCONF: CPU port conflict time [percent].
  • VLLCHIT: Vector load LLC hits [percent] (counter not active with MPI).
  • LOADBW: LLC to VE core load memory bandwidth in GB/s. Only shown if VE_PERF_MODE=VECTOR-MEM.
  • STOREBW: LLC to VE core store memory bandwidth in GB/s. Only shown if VE_PERF_MODE=VECTOR-MEM.
  • POWER: Power consumption of the VE in Watts. No VH related power consumption is considered.
  • ENERGY: Energy consumed by the VE since veperf was started, in kilojoules. No VH related power consumption is considered.

NOTE: All metrics except USRSEC, POWER and ENERGY give values averaged over the last displayed time interval. USRSEC shows the time that the executable was using the VE core. POWER is measured at the end of the time interval, so if you care about an accurate power measurement, choose a small time interval with the -d option, e.g. 1 or 2 s. ENERGY is the time integral of the power from the start of the veperf tool; therefore, if veperf is started right before the program and stopped right after the program has finished, the last displayed output block shows the energy spent by each VE over the run.
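
A minimal sketch of such a per-job energy measurement (the job script my_ve_job.sh and the node ID are placeholders):

veperf -d 2 -n 0 > veperf.log &    # start monitoring VE node 0 right before the job
VEPERF_PID=$!
./my_ve_job.sh                     # placeholder for the actual job
kill $VEPERF_PID                   # the ENERGY value in the last SUM line of
                                   # veperf.log is the energy spent during the job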

The PMMR register of each VE core controls which performance metrics are actually active. Currently there is no easy way to control this register from the user side besides setting VE_PERF_MODE; MPI and non-MPI programs might therefore measure slightly different metrics.

