Record Course of Events with lo2s

The lightweight node-level performance monitoring tool lo2s creates parallel OTF2 traces with a focus on both the application and the system view. The traces can contain any of the following information:

From running threads:
- Calling context samples based on instruction overflows
    - The calling context samples are annotated with the disassembled assembler instruction string
    - The frame pointer-based call-path for each calling context sample
- Per-thread performance counter readings
- Which thread was scheduled on which CPU at what time

From the system:
- Metrics from tracepoints (e.g., the selected C-state or P-state)
- The node-level system tree (CPUs (HW-threads), cores, packages)
- CPU power measurements (x86_energy)
- Microarchitecture specific metrics (x86_adapt, per package or core)
- Arbitrary metrics through plugins (Score-P compatible)

In general, lo2s operates either in process monitoring or system monitoring mode.

With process monitoring, all information is grouped by each thread of a monitored process group, so the trace shows on which CPU each monitored thread is running. lo2s either acts as a prefix command that runs the process (and also tracks its children) or attaches to a running process.

In the system monitoring mode, information is grouped by logical CPU - it shows you which thread was running on a given CPU. Metrics are also shown per CPU.

In both modes, lo2s always groups system-level metrics (e.g., tracepoints) by their respective system hardware component.

Usage

Only the basic usage is shown in this Wiki. For a more detailed explanation, refer to the lo2s website.

Before using lo2s, set up the correct environment with

marie@login$ module load lo2s

As lo2s is built upon perf, its usage and limitations are very similar to those of perf. In particular, you can use lo2s as a prefix command just like perf. Even some of the command line arguments are inspired by perf. The main difference to perf is that lo2s writes a trace that can be opened in Vampir, which allows a full-blown performance analysis almost as with Score-P.

To record the behavior of an application, prefix the application run with lo2s. We recommend using the double dash -- to prevent mixing command line arguments between lo2s and the user application. In the following example, we run lo2s on the application sleep 2.

marie@compute$ lo2s --no-kernel -- sleep 2
[ lo2s: sleep 2  (0), 1 threads, 0.014082s CPU, 2.03315s total ]
[ lo2s: 5 wakeups, wrote 2.48 KiB lo2s_trace_2021-10-12T12-39-06 ]

This records the application in the process monitoring mode. This means that the application's process, its forked processes, and their threads are recorded and can be analyzed using Vampir. The main view represents each process and thread over time. A metric named "CPU" indicates, for each process, on which CPU it was executed during the runtime.

Required Permissions

By design, lo2s almost exclusively utilizes Linux kernel facilities such as perf and tracepoints to perform the application measurements. For security reasons, these facilities require special permissions, in particular a sufficiently permissive perf_event_paranoid setting and read permissions to the debugfs under /sys/kernel/debug.

Luckily, for the process monitoring mode the default settings allow you to run lo2s just fine. All you need to do is pass the --no-kernel parameter like in the example above.

For the system monitoring mode, you can get the required permissions with the Slurm parameter --exclusive. (Note: Regardless of the number of processes per node you actually request, you will accrue CPU hours as if you had reserved all cores on the node.)
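To see which situation applies on a given node, the current perf_event_paranoid level can be read from /proc/sys/kernel/perf_event_paranoid. The following is a minimal sketch, not part of lo2s: the helper name paranoid_hint is hypothetical, and the thresholds follow the Linux kernel documentation for unprivileged users. How a specific cluster sets this value for exclusive jobs is not covered here.

```shell
# Hypothetical helper: translate a perf_event_paranoid level into a hint.
# Thresholds follow the kernel documentation for unprivileged users:
#   >= 2: kernel profiling is disallowed
#   >= 1: CPU-wide event access is disallowed
paranoid_hint() {
    level="$1"
    if [ "$level" -ge 2 ]; then
        echo "kernel profiling and CPU-wide events restricted: use process monitoring with --no-kernel"
    elif [ "$level" -ge 1 ]; then
        echo "CPU-wide events restricted: process monitoring only"
    else
        echo "CPU-wide events permitted"
    fi
}

# Report the hint for the node this runs on, if the file is readable.
if [ -r /proc/sys/kernel/perf_event_paranoid ]; then
    paranoid_hint "$(cat /proc/sys/kernel/perf_event_paranoid)"
fi
```

A level of 2 or higher matches the case described above, where process monitoring works as long as --no-kernel is passed.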

Memory Requirements

When requesting memory for your jobs, you need to take into account that lo2s needs a substantial amount of memory for its operation. Unfortunately, the amount of memory depends on the application. It mainly scales with the number of processes spawned by the traced application, because lo2s allocates a fixed-size buffer for each process. This should be fine for a typical HPC application, but can lead to extreme cases where the buffers are orders of magnitude larger than the resulting trace. One example is recording a CMake run, which spawns hundreds of processes that each run for only a few milliseconds and leave their buffers almost empty. Still, the buffers need to be allocated and thus require a lot of memory.

In such a case, we recommend using the system monitoring mode instead, as the memory in this mode scales with the number of logical CPUs instead of the number of processes.
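The scaling argument can be illustrated with a rough back-of-the-envelope calculation. All numbers below are made up for illustration. In particular, buffer_mib is a hypothetical buffer size and not a documented lo2s value, and the sketch assumes the same buffer size per process and per logical CPU.

```shell
# Rough estimate of buffer memory in both modes.
# All values are hypothetical and only illustrate the scaling behavior.
buffer_mib=16      # assumed fixed buffer size in MiB (NOT a documented lo2s value)
processes=800      # e.g., a build that spawns many short-lived processes
logical_cpus=64    # logical CPUs on the node

# Process monitoring: one buffer per traced process.
process_mode_mib=$((processes * buffer_mib))
# System monitoring: one buffer per logical CPU.
system_mode_mib=$((logical_cpus * buffer_mib))

echo "process monitoring: ~${process_mode_mib} MiB"
echo "system monitoring:  ~${system_mode_mib} MiB"
```

With these assumed numbers, the process monitoring estimate is more than ten times the system monitoring estimate, and the gap grows with every additional process.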

Advanced Topic: System Monitoring

The system monitoring mode gives a different view. As the name implies, the focus isn't on processes anymore, but on the system as a whole. In particular, a trace recorded in this mode will show a timeline for each logical CPU of the system. To enable this mode, pass the -a parameter.

marie@compute$ lo2s -a
^C[ lo2s (system mode): monitored processes: 0, 0.136623s CPU, 13.7872s total ]
[ lo2s (system mode): 36 wakeups, wrote 301.39 KiB lo2s_trace_2021-11-01T09-44-31 ]

Note: As you can read in the above example, lo2s monitored zero processes even though it was run in the system monitoring mode. Certainly, more than zero processes are running on a system. However, user accounts on our HPC systems can only see their own processes, and lo2s records in the scope of the user, so it only sees the user's own processes. Hence, in the example above, the user had no other processes running.

When using the system monitoring mode without passing a program, lo2s will run indefinitely. You can stop the measurement by sending lo2s a SIGINT signal or by pressing Ctrl+C. However, if you pass a program, lo2s will start that program and run the measurement until the started process finishes. Of course, the process and any of its child processes and threads will be visible in the resulting trace.

marie@compute$ lo2s -a -- sleep 10
[ lo2s (system mode): sleep 10  (0), 1 threads, monitored processes: 1, 0.133598s CPU, 10.3996s total ]
[ lo2s (system mode): 39 wakeups, wrote 280.39 KiB lo2s_trace_2021-11-01T09-55-04 ]
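For an unattended recording without a program, the SIGINT mentioned above can also be delivered automatically after a fixed time. The following is a minimal sketch using timeout from GNU coreutils. The wrapper name record_for is hypothetical and not part of lo2s.

```shell
# Hypothetical wrapper: run a command and send it SIGINT after a given time.
# Relies on `timeout` from GNU coreutils.
record_for() {
    duration="$1"
    shift
    # timeout sends SIGINT once the duration has elapsed. It then exits
    # with status 124, even if the command itself shuts down cleanly.
    timeout --signal=INT "$duration" "$@"
}

# Usage (not executed here): record the whole node for 30 seconds.
# record_for 30 lo2s -a
```

An exit status of 124 therefore only indicates that the time limit was reached, not that the recording failed.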

Like in the process monitoring mode, lo2s can also sample instructions in the system monitoring mode. You can enable the instruction sampling by passing the parameter --instruction-sampling to lo2s.

marie@compute$ lo2s -a --instruction-sampling -- make -j
[ lo2s (system mode): make -j  (0), 268 threads, monitored processes: 286, 258.789s CPU, 445.076s total ]
[ lo2s (system mode): 3815 wakeups, wrote 39.24 MiB lo2s_trace_2021-10-29T15-08-44 ]

Advanced Topic: Metric Plugins

Lo2s is compatible with Score-P metric plugins, but only a subset will work. In particular, lo2s only supports asynchronous plugins with the per host or once scope. You can find a large set of plugins in the Score-P Organization on GitHub.

To activate plugins, you can use the same environment variables as with Score-P, or with LO2S as prefix:

LO2S_METRIC_PLUGINS
LO2S_METRIC_PLUGIN
LO2S_METRIC_PLUGIN_PLUGIN
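As a hedged illustration, a plugin setup might look like the following sketch. It assumes that PLUGIN in the variable names above is a placeholder for the upper-cased plugin name, as in the Score-P convention. The plugin name example_plugin and the metric names metric_a and metric_b are hypothetical and do not refer to a real plugin.

```shell
# Hypothetical plugin and metric names, for illustration only.
# Assumption: PLUGIN in the variable name stands for the upper-cased
# plugin name, following the Score-P naming convention.

# Comma-separated list of plugins to load.
export LO2S_METRIC_PLUGINS="example_plugin"

# Metrics to record from that plugin.
export LO2S_METRIC_EXAMPLE_PLUGIN="metric_a,metric_b"

# A subsequent lo2s run in the same shell would see these settings, e.g.:
# lo2s --no-kernel -- sleep 2
```

Check the documentation of the plugin you want to use for its actual name and the metrics it provides.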