← Back to C++ Advanced

Module 18: Coding Standards, Profiling & System Tracing

Overview

Module 16 taught you what sanitizers catch and how to read their reports. This module teaches the broader quality ecosystem: how to prevent bugs before they happen (coding standards + static analysis), how to find performance bottlenecks (profiling), how to trace system behaviour in production (LTTng, strace, perf), and how to debug the toughest problems (advanced GDB).

Key insight: Sanitizers are reactive — they find bugs after you write them. Coding standards + static analysis are preventive. Profiling and tracing are diagnostic. A production-grade workflow uses ALL four layers:

Layer 1 — PREVENTION:  Coding standards → static analysis → compiler warnings
Layer 2 — DETECTION:   Sanitizers (ASan, TSan, UBSan) → Module 16
Layer 3 — DIAGNOSIS:   Profiling (perf, callgrind) → Tracing (LTTng, strace)
Layer 4 — DEBUG:       GDB advanced → core dumps → post-mortem

Part 1: C++ Coding Standards & Static Analysis

1.1 Why Coding Standards Matter

Coding standards reduce bugs by banning patterns that are known to cause defects. They are not about style (tabs vs spaces) — they encode decades of hard-won knowledge about what goes wrong in C++ code.

Standard          | Scope    | Target Domain                  | Key Feature
CppCoreGuidelines | General  | All C++                        | Broad, modern, tool-enforceable
MISRA C++:2023    | Safety   | Automotive, medical, aerospace | Emphasis on decidable, tool-checkable rules
CERT C++          | Security | Network, systems               | Focuses on exploitability
AUTOSAR C++14     | Safety   | Automotive ECU                 | Based on MISRA, adds automotive rules
SEI CERT C        | Security | C code (not C++)               | Applicable to C-style code in C++
JSF AV C++        | Safety   | Joint Strike Fighter           | Military aerospace, very strict

For robot software (our context): CppCoreGuidelines + selected MISRA rules give the best cost/benefit ratio. Full MISRA compliance is only needed for ISO 26262 / IEC 61508 certification.


1.2 CppCoreGuidelines — The Essential Rules

The full guidelines have 500+ rules. Here are the ones that catch the most bugs:

Resource Management (R)

// R.1: Manage resources automatically using RAII
// BAD — raw pointer ownership
void bad() {
    auto* p = new Widget();
    do_something(*p);  // if this throws → LEAK
    delete p;
}

// GOOD — unique_ptr enforces cleanup
void good() {
    auto p = std::make_unique<Widget>();
    do_something(*p);  // exception-safe, no leak possible
}
// R.3: A raw pointer (T*) is non-owning
// Rule: if you see T*, the pointed-to object is managed elsewhere.
// Ownership MUST use unique_ptr or shared_ptr.

void process(Widget* w);  // non-owning: caller keeps ownership
auto owner = std::make_unique<Widget>();
process(owner.get());     // explicitly non-owning
// R.5: Prefer scoped objects, don't heap-allocate unnecessarily
// BAD — heap allocation for no reason
auto* data = new std::vector<int>{1, 2, 3};
// ...
delete data;

// GOOD — stack allocation
std::vector<int> data{1, 2, 3};

Lifetime Safety (F, C)

// F.42: Return a T* to indicate a position (only)
// Never return a pointer to indicate ownership transfer.

// F.43: Never return a pointer or reference to a local object
// BAD
int& bad() {
    int local = 42;
    return local;  // dangling reference — undefined behavior
}

// C.31: All resources acquired by a class must be released by the destructor
// This IS the Rule of Zero / Rule of Five.

Concurrency (CP)

// CP.1: Assume code will run in a multi-threaded environment
// Even single-threaded code may become multi-threaded later.

// CP.2: Avoid data races
// A data race = two threads access the same memory, at least one writes,
// no synchronization. This is UNDEFINED BEHAVIOR (not just a bug).
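//
// Minimal illustration (a sketch, not guideline text): two unsynchronized
// writers to the same int race with each other:
//   int hits = 0;
//   std::thread a([&] { ++hits; });   // write
//   std::thread b([&] { ++hits; });   // write → data race, UB
// Fix: make hits a std::atomic<int>, or guard it with a mutex (see CP.20 below).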

// CP.20: Use RAII for locking, never plain lock()/unlock()
// BAD
mutex_.lock();
do_work();  // if this throws → deadlock (mutex never unlocked)
mutex_.unlock();

// GOOD
{
    std::lock_guard<std::mutex> lock(mutex_);
    do_work();  // unlock guaranteed even on exception
}

// CP.44: Remember to name your lock_guards and unique_locks
// BAD — anonymous temporary, unlocks immediately!
std::lock_guard<std::mutex>{mutex_};  // ← UNLOCKS RIGHT HERE
do_work();  // ← UNPROTECTED!

// GOOD — named variable lives until scope end
std::lock_guard<std::mutex> lock{mutex_};
do_work();  // ← protected

Type Safety (I, ES)

// ES.48: Avoid casts
// If you must cast, prefer static_cast over C-style casts.
// C-style casts can silently do reinterpret_cast.

// I.11: Never transfer ownership by a raw pointer
// Use unique_ptr for single ownership, shared_ptr for shared.

// ES.46: Avoid lossy narrowing conversions
// BAD
int64_t big = 1LL << 40;
int32_t small = big;  // silent truncation

// GOOD
int32_t small = static_cast<int32_t>(big);  // explicit
// Even better: use gsl::narrow<> which throws on data loss
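// A sketch of that (assumes the Guidelines Support Library is available):
//   int32_t safe = gsl::narrow<int32_t>(big);  // throws gsl::narrowing_error on data loss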

1.3 MISRA C++:2023 — Safety-Critical Rules

MISRA rules are classified as Mandatory, Required, or Advisory. Here are some of the most impactful rules:

Rule    | Category  | What it bans                       | Why
0.1.2   | Mandatory | Unreachable code                   | Dead code masks bugs
4.6.1   | Required  | Implicit narrowing conversions     | Silent data loss
6.0.1   | Required  | goto                               | Unstructured control flow
6.7.2   | Required  | Global non-const variables         | Hidden coupling
6.8.2   | Required  | Single-use variables               | Unnecessarily complex
7.0.5   | Required  | C-style casts                      | Bypasses type system
7.6.1   | Required  | reinterpret_cast                   | Undefined behavior risk
8.2.5   | Required  | Virtual functions in constructors  | Surprising dispatch
8.14.1  | Required  | const_cast to remove const         | Breaks type safety
9.3.1   | Required  | malloc/free in C++                 | Use RAII instead
10.3.1  | Required  | Empty throw (throw;) outside catch | Calls std::terminate
12.3.1  | Required  | NULL macro                         | Use nullptr
15.0.2  | Required  | Uninitialized variables            | Undefined behavior
19.0.1  | Required  | #define for constants              | Use constexpr
21.10.1 | Required  | signal() for signal handling       | Race conditions

Key insight: Many MISRA rules can be enforced automatically with clang-tidy + cppcheck. You don’t need to memorize them — configure your tools.
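To make a few of these concrete, here is a small illustrative sketch (not official MISRA example code) showing banned patterns next to compliant forms:

#include <cstdint>
#include <cstdlib>
#include <vector>

void non_compliant(double ratio) {
    int32_t* slot = NULL;                        // 12.3.1: NULL macro
    int32_t scaled = (int32_t)(ratio * 100.0);   // 7.0.5: C-style cast
    int32_t* buf = (int32_t*)malloc(64);         // 9.3.1: malloc/free in C++
    free(buf);
    (void)slot; (void)scaled;
}

void compliant(double ratio) {
    int32_t* slot = nullptr;                            // use nullptr
    auto scaled = static_cast<int32_t>(ratio * 100.0);  // named cast
    std::vector<int32_t> buf(16);                       // RAII instead of malloc/free
    (void)slot; (void)scaled; (void)buf;
}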


1.4 CERT C++ Secure Coding

CERT rules focus on security — what an attacker can exploit:

Rule      | What                                     | Exploit
STR50-CPP | Guarantee null termination               | Buffer overflow
MEM50-CPP | Don’t access freed memory                | Use-after-free → RCE
MEM51-CPP | Properly deallocate memory               | Memory corruption
INT50-CPP | Don’t cast to smaller type               | Integer truncation
ERR50-CPP | Don’t call exit() in destructors         | Breaks stack unwinding
CON50-CPP | Don’t destroy a locked mutex             | Undefined behavior
DCL50-CPP | Don’t define a C-style variadic function | Type confusion
EXP50-CPP | Don’t depend on order of evaluation      | Sequencing bugs
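
To see what a CERT violation looks like in code, here is a minimal MEM50-CPP sketch (illustrative, not taken from the CERT text):

// Illustrative MEM50-CPP (use-after-free) sketch
#include <cstdio>
#include <memory>

int main() {
    auto owner = std::make_unique<int>(42);
    int* raw = owner.get();
    owner.reset();              // memory is freed here
    std::printf("%d\n", *raw);  // VIOLATION: read through a dangling pointer
    // If an attacker can influence what gets reallocated into this slot,
    // the read (or a later write through raw) becomes exploitable.
}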

1.5 Enforcing Standards with clang-tidy

clang-tidy is the primary tool for automated coding standard enforcement:

# Install (Ubuntu 20.04)
sudo apt install clang-tidy-10  # or later

# Run all checks
clang-tidy -checks='*' source.cpp -- -std=c++2a

# Run specific check categories
clang-tidy -checks='cppcoreguidelines-*,modernize-*,bugprone-*' source.cpp

# Auto-fix what it can
clang-tidy -checks='modernize-*' -fix source.cpp

# Use with compile_commands.json (from CMake)
cmake -DCMAKE_EXPORT_COMPILE_COMMANDS=ON ..
clang-tidy -p build/ source.cpp

Essential clang-tidy check categories:

Category            | Catches                      | Example
bugprone-*          | Likely bugs                  | Dangling handles, infinite loops, misused move
cppcoreguidelines-* | CppCoreGuidelines violations | Owning raw pointers, C-arrays
modernize-*         | Pre-C++11 patterns           | NULL → nullptr, raw loops → algorithms
performance-*       | Performance issues           | Unnecessary copies, move-eligible returns
readability-*       | Readability                  | Magic numbers, inconsistent naming
cert-*              | CERT rules                   | Security-related issues
misc-*              | Miscellaneous                | Unused parameters, redundant expressions
clang-analyzer-*    | Deep analysis                | Null deref paths, dead stores

Configuration file (.clang-tidy):

---
Checks: >
  -*,
  bugprone-*,
  cppcoreguidelines-*,
  modernize-*,
  performance-*,
  readability-*,
  cert-*,
  -modernize-use-trailing-return-type,
  -readability-magic-numbers,
  -cppcoreguidelines-avoid-magic-numbers
WarningsAsErrors: 'bugprone-*,cert-*'
HeaderFilterRegex: '.*'
CheckOptions:
  - key: readability-identifier-naming.ClassCase
    value: CamelCase
  - key: readability-identifier-naming.FunctionCase
    value: camelBack
  - key: readability-identifier-naming.VariableCase
    value: lower_case
  - key: readability-identifier-naming.MemberPrefix
    value: ''
  - key: readability-identifier-naming.MemberSuffix
    value: '_'

1.6 Enforcing Standards with cppcheck

cppcheck does pattern-based analysis (no Clang AST dependency):

# Install
sudo apt install cppcheck

# Basic analysis
cppcheck --enable=all --std=c++20 source.cpp

# With suppression file
cppcheck --enable=all --suppressions-list=cppcheck.supp source.cpp

# Generate XML report (for CI integration)
cppcheck --enable=all --xml source.cpp 2> cppcheck_report.xml

# Check a whole directory
cppcheck --enable=all -I include/ src/

What cppcheck catches that clang-tidy sometimes misses:

Issue                          | Example
Null pointer dereference paths | Complex conditional chains
Buffer overflows               | Array index out of bounds
Uninitialized variables        | Conditional initialization paths
Memory leaks                   | Non-RAII allocation patterns
Resource leaks                 | File descriptors, sockets
Redundant conditions           | if (p != NULL) after p = new X
Portability issues             | Different behavior across compilers
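
For example, cppcheck's path analysis flags conditionally-uninitialized reads like this sketch:

// uninit.cpp: `cppcheck --enable=all uninit.cpp` reports uninitvar on the return
int compute(bool ok) {
    int result;          // not initialized on every path
    if (ok)
        result = 42;
    return result;       // read is uninitialized when ok == false
}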

Suppression file (cppcheck.supp):

# Suppress a check ID for a whole file (or glob)
unusedFunction:src/test_helper.cpp
uninitvar:third_party/*.cpp
# Suppress a check ID at a specific line
memleak:src/legacy.cpp:42

1.7 Compiler Warnings as a First Line of Defence

The cheapest static analysis is your compiler’s warning flags:

# Recommended minimum for any project
-Wall -Wextra -Wpedantic

# Safety-critical / production quality
-Wall -Wextra -Wpedantic -Werror \
-Wconversion -Wsign-conversion \
-Wdouble-promotion -Wformat=2 \
-Wnull-dereference -Wold-style-cast \
-Wshadow -Wunused

# GCC-specific extras
-Wlogical-op -Wduplicated-cond -Wduplicated-branches \
-Wuseless-cast -Wrestrict

# Clang-specific extras
-Wmost -Weverything  # (warning: very noisy)

What each flag catches:

Flag               | Catches                   | Example
-Wconversion       | Implicit narrowing        | int x = 3.14;
-Wsign-conversion  | Signed ↔ unsigned         | unsigned u = -1;
-Wshadow           | Variable shadowing        | Inner x hides outer x
-Wdouble-promotion | Float → double promotion  | float f; printf("%f", f);
-Wold-style-cast   | C-style casts             | (int)ptr instead of static_cast
-Wnull-dereference | Potential null deref      | GCC path analysis
-Wformat=2         | printf format mismatches  | printf("%d", "hello")
-Wduplicated-cond  | Duplicate if conditions   | Copy-paste bugs

include-what-you-use (IWYU)

# Install
sudo apt install iwyu

# Run (needs compile_commands.json)
iwyu_tool.py -p build/ source.cpp

# IWYU output example:
# source.cpp should add:    #include <algorithm>
# source.cpp should remove:  #include <iostream>  // not used

IWYU ensures each file includes exactly what it uses — no transitive include dependencies, no unused headers. This improves compile times and makes dependencies explicit.


Part 2: Profiling & Performance Analysis

2.1 perf — Linux Performance Counters

perf is the standard Linux profiling tool. It uses hardware performance counters (PMU — Performance Monitoring Unit) built into the CPU.

# Install
sudo apt install linux-tools-common linux-tools-$(uname -r)

# Allow non-root profiling (set once)
echo -1 | sudo tee /proc/sys/kernel/perf_event_paranoid

perf stat — Count events

# Count cache misses, branch mispredictions, instructions
perf stat ./my_program

# Example output:
#    1,234,567,890  instructions       #  2.50 IPC
#       45,678,901  cache-misses       #  3.7% of all cache refs
#       12,345,678  branch-misses      #  1.2% of all branches
#            2.345  seconds time elapsed

# Specific events
perf stat -e cache-misses,cache-references,L1-dcache-load-misses ./my_program

# Repeat N times for statistical significance
perf stat -r 10 ./my_program

Key metric: IPC (Instructions Per Cycle). An IPC < 1.0 usually means the CPU is stalling on memory access (cache misses). IPC > 2.0 is good.

perf record + perf report — Profile

# Record call stacks at 99 Hz (use prime numbers to avoid aliasing)
perf record -g -F 99 ./my_program

# Show interactive profile
perf report

# Show annotated source (requires -g debug info)
perf annotate

# Key perf report columns:
# Overhead — % of total samples in this function
# Children — % of total samples in this function + its callees
# Self     — % of total samples ONLY in this function (excludes callees)

perf record tips:

# Record for a running process
perf record -g -p $(pidof my_process) -- sleep 10

# Record specific events
perf record -e cache-misses -g ./my_program

# Record with call graph (dwarf = most reliable, needs -g in compilation)
perf record --call-graph dwarf -F 99 ./my_program

2.2 Valgrind Callgrind — Call Graph Profiling

Callgrind simulates the CPU’s cache hierarchy and counts instruction costs:

# Profile
valgrind --tool=callgrind ./my_program

# Output: callgrind.out.<pid>

# Annotate source (text)
callgrind_annotate callgrind.out.12345

# Visualize with KCachegrind (GUI)
kcachegrind callgrind.out.12345

Callgrind vs perf:

Feature        | perf                     | Callgrind
Speed          | ~1x (hardware counters)  | ~20-100x slowdown (simulation)
Accuracy       | Statistical sampling     | Exact instruction counts
Cache model    | Real hardware            | Simulated (may differ)
Call graph     | Yes (sampling)           | Yes (exact)
Thread support | Yes                      | Limited
Root required  | Usually no               | No

Rule of thumb: Use perf for quick profiling, callgrind when you need exact call counts or cache simulation details.


2.3 Valgrind Cachegrind — Cache Simulation

# Run cache simulation
valgrind --tool=cachegrind ./my_program

# Output example:
# ==12345== D1  miss rate:  4.2% (  3.8% rd + 6.1% wr)
# ==12345== LLd miss rate:  0.8% (  0.7% rd + 1.2% wr)

# Annotate per-line cache misses
cg_annotate cachegrind.out.12345

What to look for:
- D1 miss rate > 5% → data cache pressure, likely an array access pattern issue
- LLd (last-level data cache) miss rate > 2% → memory bandwidth bottleneck
- High Dw (data write) miss rate → false sharing in multi-threaded code


2.4 Flamegraphs

Flamegraphs visualize profiling data as stacked function calls where width represents time. Created by Brendan Gregg.

# Install
git clone https://github.com/brendangregg/FlameGraph.git

# Generate from perf data
perf record -g -F 99 ./my_program
perf script > out.perf
FlameGraph/stackcollapse-perf.pl out.perf > out.folded
FlameGraph/flamegraph.pl out.folded > flamegraph.svg

# Open in browser
firefox flamegraph.svg

Reading a flamegraph:

    ┌──────────── main() ────────────────────┐
    │  ┌─── process_data() ────────────────┐ │
    │  │  ┌── sort_items() ──────────┐     │ │  ← WIDEST = HOTTEST
    │  │  │  ┌─ compare() ─────┐     │     │ │
    │  │  │  └─────────────────┘     │     │ │
    │  │  └──────────────────────────┘     │ │
    │  │  ┌── validate() ──┐               │ │
    │  │  └────────────────┘               │ │
    │  └───────────────────────────────────┘ │
    └────────────────────────────────────────┘
  • Width = time spent. sort_items() is the hottest function.
  • Height = call depth. Read bottom (caller) to top (callee).
  • Colour is random (for visual separation), not meaningful.
  • Look for wide plateaus — those are your optimization targets.

2.5 CPU Cache Effects on C++ Performance

Cache-friendly code can be 10-100x faster than cache-unfriendly code for the same algorithmic complexity:

Row-major vs column-major (the classic example)

// Cache-friendly: sequential access (row-major in C++)
for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
        matrix[i][j] *= 2;  // stride = sizeof(element)

// Cache-HOSTILE: column-major access
for (int j = 0; j < N; ++j)
    for (int i = 0; i < N; ++i)
        matrix[i][j] *= 2;  // stride = N * sizeof(element)

For N = 4096, the column-major version can be ~10x slower due to L1 cache thrashing.
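
You can measure the effect yourself with a small timing harness (a sketch; compile with -O2, exact ratios vary with CPU and cache sizes):

#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    constexpr int N = 4096;
    std::vector<std::vector<int>> matrix(N, std::vector<int>(N, 1));

    // Time an arbitrary callable, in seconds.
    auto time_s = [&](auto&& body) {
        auto t0 = std::chrono::steady_clock::now();
        body();
        return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
    };

    double row = time_s([&] {           // sequential, cache-friendly
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                matrix[i][j] *= 2;
    });
    double col = time_s([&] {           // strided, cache-hostile
        for (int j = 0; j < N; ++j)
            for (int i = 0; i < N; ++i)
                matrix[i][j] *= 2;
    });
    std::printf("row-major: %.3f s, column-major: %.3f s (%.1fx)\n",
                row, col, col / row);
}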

Struct of Arrays (SoA) vs Array of Structs (AoS)

// AoS — cache-unfriendly if you only access a few fields
struct Particle { float x, y, z, vx, vy, vz, mass, charge; };
std::vector<Particle> particles(N);

// Update only x → loads vy, vz, mass, charge into cache too (waste!)
for (auto& p : particles) p.x += p.vx * dt;

// SoA — cache-friendly for per-field access
struct Particles {
    std::vector<float> x, y, z, vx, vy, vz, mass, charge;
};
Particles ps;
ps.x.resize(N); ps.vx.resize(N);  // (resize the fields you use)

// Update only x → only the x and vx arrays enter the cache (no waste)
for (size_t i = 0; i < N; ++i) ps.x[i] += ps.vx[i] * dt;

False sharing

// BAD: two counters share one 64-byte cache line
struct Counters {
    std::atomic<int> count_a;  // thread 1 writes here
    std::atomic<int> count_b;  // thread 2 writes here
    // Both fit in ONE cache line → it ping-pongs between cores
};

// GOOD: Pad to separate cache lines
struct Counters {
    alignas(64) std::atomic<int> count_a;  // own cache line
    alignas(64) std::atomic<int> count_b;  // own cache line
};

False sharing can cause a 10x slowdown because every write by one thread invalidates the other thread’s copy of the cache line.
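
A minimal benchmark sketch to observe this (compile with -O2 -pthread; thread pinning omitted, absolute numbers vary by machine):

#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

struct Shared {                        // both counters in one cache line
    std::atomic<int> a{0};
    std::atomic<int> b{0};
};
struct Padded {                        // one cache line each
    alignas(64) std::atomic<int> a{0};
    alignas(64) std::atomic<int> b{0};
};

template <typename Counters>
double run() {
    Counters c;
    auto t0 = std::chrono::steady_clock::now();
    std::thread t1([&] { for (int i = 0; i < 10'000'000; ++i) c.a.fetch_add(1, std::memory_order_relaxed); });
    std::thread t2([&] { for (int i = 0; i < 10'000'000; ++i) c.b.fetch_add(1, std::memory_order_relaxed); });
    t1.join(); t2.join();
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
}

int main() {
    std::printf("same cache line: %.3f s\n", run<Shared>());
    std::printf("padded:          %.3f s\n", run<Padded>());
}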


Part 3: System Tracing

3.1 LTTng — Linux Tracing Toolkit next generation

LTTng is a high-performance tracing framework for Linux. It can trace:
- Kernel events (syscalls, scheduling, interrupts, block I/O)
- Userspace events (your application’s tracepoints)

LTTng’s overhead is ~100ns per tracepoint — low enough for production use. This is 10-100x lower than printf-debugging or syslog.

Architecture:

┌─────────────────────────────────────────┐
│  Your Application                       │
│  ┌─────────────────────────────────┐    │
│  │ TRACEPOINT(my_app, request_start│    │
│  │   , size_t, req_id             │    │
│  │   , int, priority              │    │
│  │ )                               │    │
│  └───────────┬─────────────────────┘    │
│              │ ~100ns per tracepoint    │
│  ┌───────────▼──────────────────────┐   │
│  │  LTTng-UST (Userspace Tracer)   │   │
│  └───────────┬──────────────────────┘   │
└──────────────┼──────────────────────────┘
               │ shared memory ring buffer
┌──────────────▼──────────────────────────┐
│  lttng-sessiond (Session Daemon)        │
│  ┌──────────────────────────────────┐   │
│  │ lttng-consumerd → trace files    │   │
│  └──────────────────────────────────┘   │
└─────────────────────────────────────────┘
               │
┌──────────────▼──────────────────────────┐
│  Analysis Tools                         │
│  babeltrace2, Trace Compass, LTTng Live │
└─────────────────────────────────────────┘

Installation:

# Ubuntu 20.04+
sudo apt install lttng-tools lttng-modules-dkms liblttng-ust-dev babeltrace2

# Verify
lttng version

Defining userspace tracepoints:

Create a tracepoint provider header (my_tp.h):

/* my_tp.h */
#undef TRACEPOINT_PROVIDER
#define TRACEPOINT_PROVIDER my_app

#undef TRACEPOINT_INCLUDE
#define TRACEPOINT_INCLUDE "./my_tp.h"

#if !defined(_MY_TP_H) || defined(TRACEPOINT_HEADER_MULTI_READ)
#define _MY_TP_H

#include <lttng/tracepoint.h>

TRACEPOINT_EVENT(
    my_app,            /* provider name */
    request_start,     /* event name */
    TP_ARGS(
        size_t, req_id,
        int, priority,
        const char*, endpoint
    ),
    TP_FIELDS(
        ctf_integer(size_t, req_id, req_id)
        ctf_integer(int, priority, priority)
        ctf_string(endpoint, endpoint)
    )
)

TRACEPOINT_EVENT(
    my_app,
    request_end,
    TP_ARGS(
        size_t, req_id,
        int, status_code,
        uint64_t, duration_ns
    ),
    TP_FIELDS(
        ctf_integer(size_t, req_id, req_id)
        ctf_integer(int, status_code, status_code)
        ctf_integer(uint64_t, duration_ns, duration_ns)
    )
)

#endif /* _MY_TP_H */

#include <lttng/tracepoint-event.h>

Create the tracepoint provider source (my_tp.c):

/* my_tp.c */
#define TRACEPOINT_CREATE_PROBES
#define TRACEPOINT_DEFINE
#include "my_tp.h"

Use in your C++ code:

#include "my_tp.h"

void handle_request(size_t id, int priority, const char* endpoint) {
    tracepoint(my_app, request_start, id, priority, endpoint);
    // ... do work ...
    auto start = std::chrono::steady_clock::now();
    process(id);
    auto elapsed = std::chrono::steady_clock::now() - start;
    uint64_t ns = std::chrono::duration_cast<std::chrono::nanoseconds>(elapsed).count();
    tracepoint(my_app, request_end, id, 200, ns);
}

Compile:

gcc -c my_tp.c -I.
g++ -std=c++2a -c main.cpp -I.
g++ main.o my_tp.o -llttng-ust -ldl -o my_app

Recording traces:

# Create a session
lttng create my-session --output=/tmp/my-trace

# Enable userspace events
lttng enable-event --userspace 'my_app:*'

# Enable kernel events (optional, needs root)
sudo lttng enable-event --kernel sched_switch,sched_wakeup

# Start tracing
lttng start

# Run your application
./my_app

# Stop and destroy session
lttng stop
lttng destroy

# View trace (text)
babeltrace2 /tmp/my-trace

# Example output:
# [10:30:01.123456789] my_app:request_start: req_id=42 priority=3 endpoint="/api/data"
# [10:30:01.124567890] my_app:request_end:   req_id=42 status_code=200 duration_ns=1111101

LTTng for ROS systems:

# Trace ROS 2 DDS events + your custom tracepoints
lttng create ros-trace --output=/tmp/ros-trace
lttng enable-event --userspace 'ros2:*'
lttng enable-event --userspace 'my_navigation:*'
lttng enable-event --kernel sched_switch  # see thread scheduling
lttng start

ros2 run my_package my_node
# ... reproduce issue ...

lttng stop && lttng destroy
babeltrace2 /tmp/ros-trace | grep -E 'callback|timer|my_navigation'

3.2 strace — System Call Tracing

strace intercepts all system calls made by a process. Invaluable for:
- Understanding what files a program opens
- Finding why a program hangs (stuck in read(), futex(), etc.)
- Measuring syscall latency
- Debugging permission errors

# Trace all syscalls
strace ./my_program

# Trace specific syscall categories
strace -e trace=file ./my_program      # open, openat, stat, unlink, etc.
strace -e trace=network ./my_program   # socket, connect, send, recv
strace -e trace=process ./my_program   # fork, exec, exit
strace -e trace=memory ./my_program    # mmap, brk, mprotect

# Attach to running process
strace -p $(pidof my_process)

# Count syscalls (summary)
strace -c ./my_program
# Example output:
# % time     seconds  calls  errors syscall
# ------ ----------- ------ ------ --------
#  45.23   0.002345   12345      0 write
#  30.12   0.001567    8901      0 read
#  10.45   0.000543    4567     23 open
#   5.67   0.000294    2345      0 close

# With timestamps (microsecond resolution)
strace -T -t ./my_program
# 10:30:01 write(1, "hello\n", 6) = 6 <0.000015>
#                                      ^^^^^^^^^^ syscall duration

# Follow child processes
strace -f ./my_program

# Output to file (stderr is trace output)
strace -o trace.log ./my_program

strace patterns for robot software:

# Find which config files a ROS node reads
strace -e trace=open,openat rosrun my_package my_node 2>&1 | grep -v ENOENT

# Find why a node hangs on startup
strace -e trace=futex,read,poll rosrun my_package my_node
# If stuck in futex() → waiting for a lock (mutex contention or deadlock)
# If stuck in poll()  → waiting for network data (topic not published)

# Measure I/O syscall counts for a log-heavy node
strace -c -e trace=write rosrun my_package my_node

3.3 ltrace — Library Call Tracing

ltrace traces dynamic library calls (like strace for libc/libstdc++):

# Trace library calls
ltrace ./my_program

# Example output:
# malloc(64)                = 0x5555557b0260
# memcpy(0x5555557b0260, "hello", 5) = 0x5555557b0260
# free(0x5555557b0260)

# Trace specific libraries
ltrace -e 'malloc+free' ./my_program

# Count calls
ltrace -c ./my_program

Use case: Finding unexpected allocations in a real-time code path. If you see malloc calls from a function that should be allocation-free, you have a latency problem.
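
Innocent-looking lines allocate more often than people expect. A sketch of the kind of code ltrace exposes (hypothetical function, for illustration):

#include <string>
#include <vector>

// Called at 100 Hz; should be allocation-free but is not.
void control_step(const std::vector<double>& samples) {
    std::string label = "control-loop-iteration-tick";  // exceeds SSO → heap allocation
    std::vector<double> scratch = samples;              // allocates every call
    // `ltrace -e 'malloc+free' ./node` shows these as malloc() calls per tick.
    // Fix: hoist buffers out of the loop, reserve() once, or use fixed-size arrays.
    (void)label; (void)scratch;
}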


3.4 ftrace & trace-cmd

ftrace is the Linux kernel’s built-in tracer. trace-cmd is its CLI:

# Install
sudo apt install trace-cmd

# Record function tracer for 5 seconds
sudo trace-cmd record -p function -l 'sched_*' sleep 5
sudo trace-cmd report | head -50

# Record function graph (call tree with timing)
sudo trace-cmd record -p function_graph -l 'ext4_*' -- dd if=/dev/zero of=/tmp/test bs=4k count=1000
sudo trace-cmd report

# Trace scheduling events (who preempted whom)
sudo trace-cmd record -e sched:sched_switch -e sched:sched_wakeup sleep 5
sudo trace-cmd report

Why ftrace matters for real-time:

In a real-time robot system, if your 100 Hz control loop occasionally takes 15 ms instead of 10 ms, ftrace can show you:
- Which kernel thread preempted your RT thread
- How long the preemption lasted
- Whether it was a scheduling issue or an I/O stall

# Trace scheduling of your RT thread
sudo trace-cmd record -e sched:sched_switch \
    -f 'next_comm == "my_rt_thread" || prev_comm == "my_rt_thread"' \
    sleep 10

3.5 eBPF & bpftrace

eBPF (extended Berkeley Packet Filter) runs sandboxed programs in the kernel. bpftrace provides a one-liner scripting interface:

# Install
sudo apt install bpftrace  # Ubuntu 20.04+

# Count syscalls by type
sudo bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'

# Trace malloc sizes
sudo bpftrace -e 'uprobe:/lib/x86_64-linux-gnu/libc.so.6:malloc {
    @sizes = hist(arg0); }'

# Latency of read() syscall
sudo bpftrace -e '
tracepoint:syscalls:sys_enter_read { @start[tid] = nsecs; }
tracepoint:syscalls:sys_exit_read /@start[tid]/ {
    @read_ns = hist(nsecs - @start[tid]);
    delete(@start[tid]);
}'

# Function call latency in YOUR binary
# (C++ names are mangled; probe the mangled symbol or an extern "C" function)
sudo bpftrace -e 'uprobe:./my_program:process_frame {
    @start[tid] = nsecs;
}
uretprobe:./my_program:process_frame /@start[tid]/ {
    @latency_us = hist((nsecs - @start[tid]) / 1000);
    delete(@start[tid]);
}'

eBPF advantages over strace/ftrace:
- Near-zero overhead (JIT-compiled in the kernel)
- Can aggregate data in-kernel (histograms, counts)
- No context-switch overhead per event
- Programmable (complex logic without dumping everything)


Part 4: Advanced Debugging

4.1 GDB — Beyond Basics

Most developers use GDB for breakpoint → step → print. These advanced features solve problems that basic debugging cannot:

Conditional breakpoints

# Break only when a condition is true
(gdb) break process_frame if frame_id == 42

# Break only after N hits (skip startup)
(gdb) break main.cpp:100
(gdb) ignore 1 1000    # skip first 1000 hits of breakpoint 1

# Break when a variable changes (hardware watchpoint)
(gdb) watch counter     # break on ANY write to counter
(gdb) rwatch buffer[10] # break on ANY read from buffer[10]
(gdb) awatch flags      # break on read OR write

# Watch with condition
(gdb) watch counter if counter > 100

Watchpoints (hardware-assisted)

Watchpoints are the killer feature for finding memory corruption:

# Find who overwrites a variable
(gdb) break main
(gdb) run
(gdb) watch *(int*)0x7fffffffde4c    # watch specific address
(gdb) continue
# GDB stops at the EXACT instruction that modifies the address

# Watch a struct member
(gdb) watch my_object.state_
# Stops whenever state_ changes, shows old and new values

Hardware watchpoints use CPU debug registers (limited to 4 on x86). Software watchpoints work for more locations but are extremely slow.
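
A toy corruption bug to practice on (a sketch; the two globals usually land adjacent in memory, but the layout is not guaranteed):

// corrupt.cpp: build with -g, then in GDB: watch sentinel; run
#include <cstdio>

int buf[4] = {1, 1, 1, 1};
int sentinel = 0xDEAD;  // likely adjacent to buf in .data

int main() {
    for (int i = 0; i <= 4; ++i)  // off-by-one: i == 4 writes past buf
        buf[i] = i;
    std::printf("sentinel = %d\n", sentinel);  // trampled by buf[4] = 4
}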

Catchpoints

# Break on exception throw
(gdb) catch throw
(gdb) catch throw std::runtime_error  # specific type

# Break on exception catch
(gdb) catch catch

# Break on syscall
(gdb) catch syscall write

# Break on fork/exec
(gdb) catch fork
(gdb) catch exec

Reverse debugging

GDB can record execution and step backwards:

(gdb) break main
(gdb) run
(gdb) record              # start recording
(gdb) continue            # run until crash/breakpoint
# Now you can go BACKWARDS:
(gdb) reverse-continue    # run backwards until previous breakpoint
(gdb) reverse-step        # step backwards one line
(gdb) reverse-next        # step backwards over function calls
(gdb) reverse-finish      # run backwards until function entry

# Find when a variable was last changed
(gdb) watch -l my_var
(gdb) reverse-continue    # finds previous write to my_var

Limitation: Recording slows execution 10-100x. Best used with small reproduction cases, not full robot systems.

Pretty-printers

Make GDB display STL containers and custom types readably:

# STL pretty printers (usually auto-loaded)
(gdb) print my_vector
# $1 = std::vector of length 3, capacity 4 = {1, 2, 3}

# Custom pretty-printer (Python; put it in a GDB auto-load script,
# or wrap it in a `python ... end` block in ~/.gdbinit)
import gdb.printing

class PosePrinter:
    """Pretty-print Pose2D(x, y, theta)"""
    def __init__(self, val):
        self.val = val

    def to_string(self):
        x = float(self.val['x_'])
        y = float(self.val['y_'])
        theta = float(self.val['theta_'])
        return f"Pose2D(x={x:.3f}, y={y:.3f}, θ={theta:.4f})"

def build_printer():
    pp = gdb.printing.RegexpCollectionPrettyPrinter("my_project")
    pp.add_printer('Pose2D', '^Pose2D$', PosePrinter)
    return pp

gdb.printing.register_pretty_printer(gdb.current_objfile(), build_printer())
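
The matching C++ type this printer assumes (hypothetical, shown for completeness):

struct Pose2D {
    double x_;
    double y_;
    double theta_;
};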

Remote debugging (for robot targets)

# On the robot (target)
gdbserver :2345 ./my_node

# On your dev machine
gdb ./my_node
(gdb) target remote robot_ip:2345
(gdb) break main
(gdb) continue

4.2 Core Dump Analysis

When a program crashes, the kernel can save a core dump — a snapshot of the process’s memory at the moment of death:

# Enable core dumps
ulimit -c unlimited

# Set core dump pattern (system-wide)
echo '/tmp/core.%e.%p.%t' | sudo tee /proc/sys/kernel/core_pattern

# Run the crashing program
./my_program    # → segfault → /tmp/core.my_program.12345.1619280000

# Analyze with GDB
gdb ./my_program /tmp/core.my_program.12345.1619280000
(gdb) bt                  # backtrace — shows where it crashed
(gdb) frame 3             # switch to frame 3 in the backtrace
(gdb) info locals         # show local variables in that frame
(gdb) print *this         # if inside a member function
(gdb) info threads        # show all threads at crash time
(gdb) thread 2            # switch to thread 2
(gdb) bt                  # backtrace for thread 2

Automated core dump analysis in CI:

#!/bin/bash
# Run tests, check for core dumps
ulimit -c unlimited
./run_tests

for core in /tmp/core.*; do
    echo "=== CRASH DETECTED ==="
    gdb -batch -ex "bt full" -ex "info threads" -ex "thread apply all bt" \
        ./my_test_binary "$core"
done

4.3 Post-Mortem Tools

When you don’t have GDB or a core dump, these tools extract information from the binary itself:

# addr2line — convert address to file:line
addr2line -e ./my_program -f 0x4011a3
# process_frame
# /home/user/src/main.cpp:42

# nm — list symbols
nm ./my_program | grep ' T '    # exported (Text) symbols
nm ./my_program | grep process  # find a specific symbol
nm -C ./my_program              # demangle C++ names

# objdump — disassembly
objdump -d ./my_program | less
objdump -d -S ./my_program      # interleave source (needs -g)

# readelf — ELF header info
readelf -h ./my_program         # file header
readelf -S ./my_program         # section headers
readelf --debug-dump=line ./my_program  # line number info

# c++filt — demangle a single symbol
echo '_ZN5MyApp12process_dataERKSt6vectorIiSaIiEE' | c++filt
# MyApp::process_data(std::vector<int, std::allocator<int>> const&)

Part 5: Putting It All Together

5.1 CI Pipeline Design for Quality Tools

A production CI pipeline should run these tools in parallel:

# .github/workflows/quality.yml
name: Quality Pipeline

on:
  push:
  pull_request:
  schedule:
    - cron: '0 2 * * *'   # nightly, for the profiling job below

jobs:
  # ── Static analysis (fast, run first) ──
  static-analysis:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: clang-tidy
        run: |
          cmake -B build -DCMAKE_EXPORT_COMPILE_COMMANDS=ON
          clang-tidy -p build/ src/*.cpp
      - name: cppcheck
        run: cppcheck --enable=all --error-exitcode=1 src/

  # ── Sanitizers (parallel, separate jobs) ──
  asan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: cmake -B build -DCMAKE_CXX_FLAGS="-fsanitize=address,undefined -fno-omit-frame-pointer -g"
      - run: cmake --build build && cd build && ctest

  tsan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: cmake -B build -DCMAKE_CXX_FLAGS="-fsanitize=thread -g"
      - run: cmake --build build && cd build && ctest

  # ── Profiling (nightly, expensive) ──
  profile:
    runs-on: ubuntu-latest
    if: github.event_name == 'schedule'
    steps:
      - uses: actions/checkout@v4
      - run: cmake -B build -DCMAKE_BUILD_TYPE=Release
      - run: cmake --build build
      - run: |
          perf stat ./build/my_benchmark
          valgrind --tool=callgrind --callgrind-out-file=callgrind.out ./build/my_benchmark
          # Compare against baseline

5.2 Suppression Management for Large Codebases

In a real robot codebase (ROS + custom code + third-party), expect noise:

project/
├── sanitizer/
│   ├── asan.supp           # ASan suppressions
│   ├── tsan.supp           # TSan suppressions
│   └── lsan.supp           # Leak suppressions
├── .clang-tidy             # clang-tidy config
├── cppcheck.supp           # cppcheck suppressions
└── CMakeLists.txt          # Sanitizer targets

Rule: Every suppression MUST have a comment explaining:
1. Why it’s suppressed (false positive? third-party? benign?)
2. A link to the upstream issue if applicable
3. When it can be removed

# tsan.supp
# Benign race in ros::init() counter — reported upstream as ros/ros_comm#2134
race:ros::init
# False positive: atomic<bool> with relaxed ordering — TSan doesn't model this
race:StatusFlags::is_ready
# Third-party: libcurl internal threading — we can't fix this
race:libcurl*

5.3 Tool Selection Quick Reference

Symptom            | First tool              | Second tool               | Third tool
Crash (segfault)   | Core dump + GDB         | ASan rebuild              | Valgrind memcheck
Data corruption    | ASan                    | GDB watchpoints           | UBSan
Race condition     | TSan                    | GDB with thread commands  | LTTng
Deadlock           | TSan                    | strace -e futex           | GDB info threads
Slow execution     | perf stat               | perf record → flamegraph  | Callgrind
Cache misses       | perf stat cache events  | Cachegrind                | SoA refactor
Syscall overhead   | strace -c               | bpftrace                  | Buffering
Memory leak        | ASan (LSan)             | Valgrind memcheck         | massif
Undefined behavior | UBSan                   | ASan                      | Code review
Coding standard    | clang-tidy              | cppcheck                  | -Wall -Werror

Exercises in This Module

Exercise | Focus                                            | Tools
ex01     | Coding standards violations — detect and fix     | clang-tidy, compiler warnings
ex02     | Static analysis traps — patterns that hide bugs  | cppcheck, code review
ex03     | Cache profiling — AoS vs SoA, row vs column      | perf stat, cachegrind
ex04     | Call-graph hotspot — find the bottleneck         | callgrind, perf record
ex05     | Syscall audit — reduce I/O overhead              | strace -c
ex06     | Tracepoint framework — build LTTng-style tracing | LTTng concepts
ex07     | Watchpoint hunting — find memory corruption      | GDB watchpoints
ex08     | Flamegraph-driven optimization                   | perf record + flamegraph
puzzle01 | The observant profiler — cache line effects      | perf + false sharing
puzzle02 | The invisible allocation — RT latency            | ltrace + perf
puzzle03 | The lying benchmark — measurement traps          | perf stat