Module 16 taught you what sanitizers catch and how to read their reports. This module teaches the broader quality ecosystem: how to prevent bugs before they happen (coding standards + static analysis), how to find performance bottlenecks (profiling), how to trace system behaviour in production (LTTng, strace, perf), and how to debug the toughest problems (advanced GDB).
Key insight: Sanitizers are reactive: they find bugs after you write them. Coding standards + static analysis are preventive. Profiling and tracing are diagnostic. Advanced debugging covers the failures the other layers cannot explain. A production-grade workflow uses all four layers:
Layer 1 — PREVENTION: Coding standards → static analysis → compiler warnings
Layer 2 — DETECTION: Sanitizers (ASan, TSan, UBSan) → Module 16
Layer 3 — DIAGNOSIS: Profiling (perf, callgrind) → Tracing (LTTng, strace)
Layer 4 — DEBUG: GDB advanced → core dumps → post-mortem
Coding standards reduce bugs by banning patterns that are known to cause defects. They are not about style (tabs vs spaces) — they encode decades of hard-won knowledge about what goes wrong in C++ code.
| Standard | Scope | Target Domain | Key Feature |
|---|---|---|---|
| CppCoreGuidelines | General | All C++ | Broad, modern, tool-enforceable |
| MISRA C++:2023 | Safety | Automotive, medical, aerospace | Each rule classified as decidable or undecidable |
| CERT C++ | Security | Network, systems | Focuses on exploitability |
| AUTOSAR C++14 | Safety | Automotive ECU | Based on MISRA, adds automotive rules |
| SEI CERT C | Security | C code (not C++) | Applicable to C-style code in C++ |
| JSF++ AV | Safety | Joint Strike Fighter | Military aerospace, very strict |
For robot software (our context): CppCoreGuidelines + selected MISRA rules give the best cost/benefit ratio. Full MISRA compliance is only needed for ISO 26262 / IEC 61508 certification.
The full guidelines have 500+ rules. Here are the ones that catch the most bugs:
// R.1: Manage resources automatically using RAII
// BAD — raw pointer ownership
void bad() {
auto* p = new Widget();
do_something(*p); // if this throws → LEAK
delete p;
}
// GOOD — unique_ptr enforces cleanup
void good() {
auto p = std::make_unique<Widget>();
do_something(*p); // exception-safe, no leak possible
}
// R.3: A raw pointer (T*) is non-owning
// Rule: if you see T*, the pointed-to object is managed elsewhere.
// Ownership MUST use unique_ptr or shared_ptr.
void process(Widget* w); // non-owning: caller keeps ownership
auto owner = std::make_unique<Widget>();
process(owner.get()); // explicitly non-owning
// R.5: Prefer scoped objects, don't heap-allocate unnecessarily
// BAD — heap allocation for no reason
auto* data = new std::vector<int>{1, 2, 3};
// ...
delete data;
// GOOD — stack allocation
std::vector<int> data{1, 2, 3};
// F.42: Return a T* to indicate a position (only)
// Never return a pointer to indicate ownership transfer.
// F.43: Never return a pointer or reference to a local object
// BAD
int& bad() {
int local = 42;
return local; // dangling reference — undefined behavior
}
// C.31: All resources acquired by a class must be released by the destructor
// Closely related to the Rule of Zero (C.20) and the Rule of Five (C.21).
// CP.1: Assume code will run in a multi-threaded environment
// Even single-threaded code may become multi-threaded later.
// CP.2: Avoid data races
// A data race = two threads access the same memory, at least one writes,
// no synchronization. This is UNDEFINED BEHAVIOR (not just a bug).
// CP.20: Use RAII for locking, never plain lock()/unlock()
// BAD
mutex_.lock();
do_work(); // if this throws → deadlock (mutex never unlocked)
mutex_.unlock();
// GOOD
{
std::lock_guard<std::mutex> lock(mutex_);
do_work(); // unlock guaranteed even on exception
}
// CP.44: Remember to name your lock_guards and unique_locks
// BAD — anonymous temporary, unlocks immediately!
std::lock_guard<std::mutex>{mutex_}; // ← UNLOCKS RIGHT HERE
do_work(); // ← UNPROTECTED!
// GOOD — named variable lives until scope end
std::lock_guard<std::mutex> lock{mutex_};
do_work(); // ← protected
// ES.48: Avoid casts
// If you must cast, prefer static_cast over C-style casts.
// C-style casts can silently do reinterpret_cast.
// I.11: Never transfer ownership by a raw pointer
// Use unique_ptr for single ownership, shared_ptr for shared.
// ES.46: Avoid lossy narrowing conversions
// BAD
int64_t big = 1LL << 40;
int32_t small = big; // silent truncation
// GOOD
int32_t small = static_cast<int32_t>(big); // explicit
// Even better: use gsl::narrow<> which throws on data loss
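The idea behind a checked narrowing cast such as gsl::narrow can be sketched in a few lines. `checked_narrow` below is a hypothetical helper written for illustration, not the GSL implementation, and it is deliberately restricted to integer types. It assumes C++17 or later.

```cpp
#include <cstdint>
#include <stdexcept>
#include <type_traits>

// Hypothetical helper (not part of the GSL): narrow an integer and throw
// if the value does not survive the conversion.
template <typename To, typename From>
To checked_narrow(From value) {
    static_assert(std::is_integral_v<To> && std::is_integral_v<From>,
                  "this sketch only handles integer types");
    const To result = static_cast<To>(value);
    // 1. Converting back must reproduce the original value.
    if (static_cast<From>(result) != value) {
        throw std::range_error("checked_narrow: value out of range");
    }
    // 2. A signed/unsigned mix can round-trip while flipping the sign.
    if constexpr (std::is_signed_v<To> != std::is_signed_v<From>) {
        if ((result < To{}) != (value < From{})) {
            throw std::range_error("checked_narrow: sign changed");
        }
    }
    return result;
}
```

With this helper, `checked_narrow<std::int32_t>(big)` for `big = 1LL << 40` throws instead of silently truncating, while small values convert unchanged.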
MISRA rules are classified as Mandatory, Required, or Advisory. Here are some of the most impactful Mandatory and Required rules:
| Rule | Category | What it bans | Why |
|---|---|---|---|
| 0.1.2 | Mandatory | Unreachable code | Dead code masks bugs |
| 4.6.1 | Required | Implicit narrowing conversions | Silent data loss |
| 6.0.1 | Required | goto | Unstructured control flow |
| 6.7.2 | Required | Global non-const variables | Hidden coupling |
| 6.8.2 | Required | Single-use variables | Unnecessarily complex |
| 7.0.5 | Required | C-style casts | Bypasses type system |
| 7.6.1 | Required | reinterpret_cast | Undefined behavior risk |
| 8.2.5 | Required | Virtual functions in constructors | Surprising dispatch |
| 8.14.1 | Required | const_cast to remove const | Breaks type safety |
| 9.3.1 | Required | malloc/free in C++ | Use RAII instead |
| 10.3.1 | Required | Empty throw (throw;) outside catch | Calls std::terminate |
| 12.3.1 | Required | NULL macro | Use nullptr |
| 15.0.2 | Required | Uninitialized variables | Undefined behavior |
| 19.0.1 | Required | #define for constants | Use constexpr |
| 21.10.1 | Required | signal() for signal handling | Race conditions |
Key insight: Many MISRA rules can be enforced automatically with clang-tidy + cppcheck. You don’t need to memorize them — configure your tools.
CERT rules focus on security — what an attacker can exploit:
| Rule | What | Exploit |
|---|---|---|
| STR50-CPP | Guarantee null termination | Buffer overflow |
| MEM50-CPP | Don’t access freed memory | Use-after-free → RCE |
| MEM51-CPP | Properly deallocate memory | Memory corruption |
| INT50-CPP | Don’t cast to an out-of-range enumeration value | Unspecified or undefined value |
| ERR50-CPP | Don’t abruptly terminate the program | Destructors skipped, resources not released |
| CON50-CPP | Don’t destroy a locked mutex | Undefined behavior |
| DCL50-CPP | Don’t define a C-style variadic function | Type confusion |
| EXP50-CPP | Don’t depend on order of evaluation | Sequencing bugs |
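As one concrete instance from the table, EXP50-CPP can be illustrated with two calls that share a side effect. The sketch below uses hypothetical names (`next_id`, `two_ids`); the non-compliant form is shown only as a comment because its result depends on the compiler.

```cpp
#include <utility>

// Hypothetical ID generator with a side effect on `counter`.
int next_id(int& counter) { return counter++; }

// Non-compliant alternative (comment only):
//   return std::make_pair(next_id(counter), next_id(counter));
// The two arguments may be evaluated in either order, so which call
// yields the smaller ID is unspecified.

// Compliant: each call is its own statement, so the order is fixed.
std::pair<int, int> two_ids(int& counter) {
    const int first = next_id(counter);
    const int second = next_id(counter);
    return {first, second};
}
```

Splitting the calls into separate statements costs nothing at run time and removes the dependence on evaluation order.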
clang-tidy is the primary tool for automated coding standard enforcement:
# Install (Ubuntu 20.04)
sudo apt install clang-tidy-10 # or later
# Run all checks
clang-tidy -checks='*' source.cpp -- -std=c++2a
# Run specific check categories
clang-tidy -checks='cppcoreguidelines-*,modernize-*,bugprone-*' source.cpp
# Auto-fix what it can
clang-tidy -checks='modernize-*' -fix source.cpp
# Use with compile_commands.json (from CMake)
cmake -S . -B build -DCMAKE_EXPORT_COMPILE_COMMANDS=ON
clang-tidy -p build/ source.cpp
| Category | Catches | Example |
|---|---|---|
| bugprone-* | Likely bugs | Dangling handles, infinite loops, misused move |
| cppcoreguidelines-* | CppCoreGuidelines violations | Owning raw pointers, C-arrays |
| modernize-* | Pre-C++11 patterns | NULL → nullptr, raw loops → algorithms |
| performance-* | Performance issues | Unnecessary copies, move-eligible returns |
| readability-* | Readability | Magic numbers, inconsistent naming |
| cert-* | CERT rules | Security-related issues |
| misc-* | Miscellaneous | Unused parameters, redundant expressions |
| clang-analyzer-* | Deep analysis | Null deref paths, dead stores |
Example configuration (.clang-tidy):
---
Checks: >
  -*,
  bugprone-*,
  cppcoreguidelines-*,
  modernize-*,
  performance-*,
  readability-*,
  cert-*,
  -modernize-use-trailing-return-type,
  -readability-magic-numbers,
  -cppcoreguidelines-avoid-magic-numbers
WarningsAsErrors: 'bugprone-*,cert-*'
HeaderFilterRegex: '.*'
CheckOptions:
  - key: readability-identifier-naming.ClassCase
    value: CamelCase
  - key: readability-identifier-naming.FunctionCase
    value: camelBack
  - key: readability-identifier-naming.VariableCase
    value: lower_case
  - key: readability-identifier-naming.MemberPrefix
    value: ''
  - key: readability-identifier-naming.MemberSuffix
    value: '_'
cppcheck does pattern-based analysis (no Clang AST dependency):
# Install
sudo apt install cppcheck
# Basic analysis
cppcheck --enable=all --std=c++20 source.cpp
# With suppression file
cppcheck --enable=all --suppressions-list=cppcheck.supp source.cpp
# Generate XML report (for CI integration)
cppcheck --enable=all --xml source.cpp 2> cppcheck_report.xml
# Check a whole directory
cppcheck --enable=all -I include/ src/
| Issue | Example |
|---|---|
| Null pointer dereference paths | Complex conditional chains |
| Buffer overflows | Array index out of bounds |
| Uninitialized variables | Conditional initialization paths |
| Memory leaks | Non-RAII allocation patterns |
| Resource leaks | File descriptors, sockets |
| Redundant conditions | if (p != NULL) after p = new X |
| Portability issues | Different behavior across compilers |
Suppression file (cppcheck.supp):
// Suppress a finding ID in matching files
unusedFunction:src/test_helper.cpp
uninitvar:third_party/*.cpp
// Suppress a finding ID at a specific line
memleak:src/legacy.cpp:42
The cheapest static analysis is your compiler’s warning flags:
# Recommended minimum for any project
-Wall -Wextra -Wpedantic
# Safety-critical / production quality
-Wall -Wextra -Wpedantic -Werror \
-Wconversion -Wsign-conversion \
-Wdouble-promotion -Wformat=2 \
-Wnull-dereference -Wold-style-cast \
-Wshadow -Wunused
# GCC-specific extras
-Wlogical-op -Wduplicated-cond -Wduplicated-branches \
-Wuseless-cast -Wrestrict
# Clang-specific extras
-Wmost -Weverything # (warning: very noisy)
| Flag | Catches | Example |
|---|---|---|
| -Wconversion | Implicit narrowing | int x = 3.14; |
| -Wsign-conversion | Signed ↔ unsigned | unsigned u = -1; |
| -Wshadow | Variable shadowing | Inner x hides outer x |
| -Wdouble-promotion | Float→double promotion | float f; printf("%f", f); |
| -Wold-style-cast | C-style casts | (int)x instead of static_cast<int>(x) |
| -Wnull-dereference | Potential null deref | GCC path analysis |
| -Wformat=2 | Printf format mismatches | printf("%d", "hello") |
| -Wduplicated-cond | Duplicate if conditions | Copy-paste bugs |
# Install
sudo apt install iwyu
# Run (needs compile_commands.json)
iwyu_tool.py -p build/ source.cpp
# IWYU output example:
# source.cpp should add: #include <algorithm>
# source.cpp should remove: #include <iostream> // not used
IWYU ensures each file includes exactly what it uses — no transitive include dependencies, no unused headers. This improves compile times and makes dependencies explicit.
perf is the standard Linux profiling tool. It uses hardware performance
counters (PMU — Performance Monitoring Unit) built into the CPU.
# Install
sudo apt install linux-tools-common linux-tools-$(uname -r)
# Allow non-root profiling (lasts until reboot; -1 is the most permissive setting)
echo -1 | sudo tee /proc/sys/kernel/perf_event_paranoid
# Count cache misses, branch mispredictions, instructions
perf stat ./my_program
# Example output:
# 1,234,567,890 instructions # 2.50 IPC
# 45,678,901 cache-misses # 3.7% of all cache refs
# 12,345,678 branch-misses # 1.2% of all branches
# 2.345 seconds time elapsed
# Specific events
perf stat -e cache-misses,cache-references,L1-dcache-load-misses ./my_program
# Repeat N times for statistical significance
perf stat -r 10 ./my_program
Key metric: IPC (Instructions Per Cycle). An IPC < 1.0 usually means the CPU is stalling on memory access (cache misses). IPC > 2.0 is good.
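The derived metrics in the perf stat output are simple ratios of raw counters. A minimal sketch of that arithmetic, assuming a hypothetical `PerfCounters` struct that holds the four raw values:

```cpp
#include <cstdint>

// Raw counter values as reported by `perf stat` (struct and field names
// are hypothetical, chosen for this example).
struct PerfCounters {
    std::uint64_t instructions;
    std::uint64_t cycles;
    std::uint64_t cache_references;
    std::uint64_t cache_misses;
};

// Instructions retired per CPU cycle.
double ipc(const PerfCounters& c) {
    if (c.cycles == 0) return 0.0;
    return static_cast<double>(c.instructions) / static_cast<double>(c.cycles);
}

// Cache misses as a percentage of all cache references.
double cache_miss_percent(const PerfCounters& c) {
    if (c.cache_references == 0) return 0.0;
    return 100.0 * static_cast<double>(c.cache_misses) /
           static_cast<double>(c.cache_references);
}
```

Read the other way round, the example output above (1,234,567,890 instructions at 2.50 IPC) corresponds to roughly 494 million cycles.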
# Record call stacks at 99 Hz (just off 100 Hz, so samples do not fall in lockstep with periodic timers)
perf record -g -F 99 ./my_program
# Show interactive profile
perf report
# Show annotated source (requires -g debug info)
perf annotate
# Key perf report columns:
# Overhead — % of total samples in this function
# Children — % of total samples in this function + its callees
# Self — % of total samples ONLY in this function (excludes callees)
# Record for a running process
perf record -g -p $(pidof my_process) -- sleep 10
# Record specific events
perf record -e cache-misses -g ./my_program
# Record with call graph (dwarf = most reliable, needs -g in compilation)
perf record --call-graph dwarf -F 99 ./my_program
Callgrind simulates the CPU’s cache hierarchy and counts instruction costs:
# Profile
valgrind --tool=callgrind ./my_program
# Output: callgrind.out.<pid>
# Annotate source (text)
callgrind_annotate callgrind.out.12345
# Visualize with KCachegrind (GUI)
kcachegrind callgrind.out.12345
| Feature | perf | Callgrind |
|---|---|---|
| Speed | ~1x (hardware counters) | ~20-100x (simulation) |
| Accuracy | Statistical sampling | Exact instruction count |
| Cache model | Real hardware | Simulated (may differ) |
| Call graph | Yes (sampling) | Yes (exact) |
| Thread support | Yes | Limited |
| Root required | Usually no | No |
Rule of thumb: Use perf for quick profiling, callgrind when you need
exact call counts or cache simulation details.
# Run cache simulation
valgrind --tool=cachegrind ./my_program
# Output example:
# ==12345== D1 miss rate: 4.2% ( 3.8% rd + 6.1% wr)
# ==12345== LLd miss rate: 0.8% ( 0.7% rd + 1.2% wr)
# Annotate per-line cache misses
cg_annotate cachegrind.out.12345
What to look for:
- D1 miss rate > 5% → data cache pressure, likely array access pattern issue
- LLd (Last-Level cache) miss rate > 2% → memory bandwidth bottleneck
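- High Dw (data writes) miss rate → writes with poor locality. Cachegrind does not model cache coherence between cores, so it cannot reveal false sharing; on real hardware, perf c2c targets that problem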
Flamegraphs visualize profiling data as stacked function calls where width represents time. Created by Brendan Gregg.
# Install
git clone https://github.com/brendangregg/FlameGraph.git
# Generate from perf data
perf record -g -F 99 ./my_program
perf script > out.perf
FlameGraph/stackcollapse-perf.pl out.perf > out.folded
FlameGraph/flamegraph.pl out.folded > flamegraph.svg
# Open in browser
firefox flamegraph.svg
┌──────────── main() ────────────────────┐
│ ┌─── process_data() ────────────────┐ │
│ │ ┌── sort_items() ──────────┐ │ │ ← WIDEST = HOTTEST
│ │ │ ┌─ compare() ─────┐ │ │ │
│ │ │ └─────────────────┘ │ │ │
│ │ └──────────────────────────┘ │ │
│ │ ┌── validate() ─┐ │ │
│ │ └────────────────┘ │ │
│ └───────────────────────────────────┘ │
└────────────────────────────────────────┘
sort_items() is the hottest function.
Cache-friendly code can be 10-100x faster than cache-unfriendly code for the same algorithmic complexity:
// Cache-friendly: sequential access (row-major in C++)
for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
        matrix[i][j] *= 2;  // stride = sizeof(element)
// Cache-HOSTILE: column-major access
for (int j = 0; j < N; ++j)
    for (int i = 0; i < N; ++i)
        matrix[i][j] *= 2;  // stride = N * sizeof(element)
For N=4096, column-major is ~10x slower due to L1 cache thrashing.
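To measure the effect on your own machine, the two traversal orders can be wrapped in functions and timed. This is a minimal sketch: `Matrix`, `scale_row_major`, `scale_col_major` and `time_us` are hypothetical names, and the actual slowdown depends on the CPU, the matrix size and the compiler flags.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Square matrix stored row-major in one contiguous buffer.
struct Matrix {
    std::size_t n;
    std::vector<int> data;
    explicit Matrix(std::size_t size) : n(size), data(size * size, 1) {}
    int& at(std::size_t i, std::size_t j) { return data[i * n + j]; }
};

// Inner loop walks along a row: consecutive addresses.
void scale_row_major(Matrix& m) {
    for (std::size_t i = 0; i < m.n; ++i)
        for (std::size_t j = 0; j < m.n; ++j)
            m.at(i, j) *= 2;
}

// Inner loop walks down a column: jumps n elements each step.
void scale_col_major(Matrix& m) {
    for (std::size_t j = 0; j < m.n; ++j)
        for (std::size_t i = 0; i < m.n; ++i)
            m.at(i, j) *= 2;
}

// Wall-clock time of one call, in microseconds.
template <typename F>
long long time_us(F&& f) {
    const auto start = std::chrono::steady_clock::now();
    f();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
}
```

A typical use is `Matrix m(4096); auto t = time_us([&] { scale_col_major(m); });`, repeated a few times and compared against the row-major variant. Both functions produce identical results; only the memory access order differs.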
// AoS: cache-unfriendly if you only access one field
struct Particle { float x, y, z, vx, vy, vz, mass, charge, radius; };
std::vector<Particle> particles(N);
// Process only x and vx → loads every other field into cache too (waste!)
for (auto& p : particles) p.x += p.vx * dt;
// SoA: cache-friendly for single-field access
struct Particles {
    std::vector<float> x, y, z, vx, vy, vz, mass, charge, radius;
};
Particles ps;
ps.x.resize(N); ps.vx.resize(N);
// Process only x and vx → only those two arrays in cache (no waste)
for (size_t i = 0; i < N; ++i) ps.x[i] += ps.vx[i] * dt;
// BAD: Two threads writing to the SAME cache line
struct alignas(8) Counters {
    std::atomic<int> count_a;  // thread 1 writes here
    std::atomic<int> count_b;  // thread 2 writes here
    // Both fit in ONE 64-byte cache line → ping-pong between cores
};
// GOOD: Pad to separate cache lines
struct Counters {
    alignas(64) std::atomic<int> count_a;  // own cache line
    alignas(64) std::atomic<int> count_b;  // own cache line
};
False sharing can cause a 10x slowdown because every write by one thread invalidates the other thread’s copy of the cache line.
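A small experiment makes the effect visible. The sketch below assumes a 64-byte cache line (typical on x86-64, not universal) and uses hypothetical names throughout: `same_cache_line` checks the layout, and `hammer_ms` times two threads that each increment their own counter.

```cpp
#include <atomic>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <thread>

// Assumed cache-line size; adjust for your target CPU.
constexpr std::size_t kCacheLine = 64;

struct SharedLine {
    std::atomic<long> a{0};
    std::atomic<long> b{0};  // sits right next to `a`
};

struct SeparateLines {
    alignas(kCacheLine) std::atomic<long> a{0};
    alignas(kCacheLine) std::atomic<long> b{0};
};

// True if both addresses fall into the same kCacheLine-sized block.
bool same_cache_line(const void* p, const void* q) {
    return reinterpret_cast<std::uintptr_t>(p) / kCacheLine ==
           reinterpret_cast<std::uintptr_t>(q) / kCacheLine;
}

// Two threads, each incrementing only its own counter.
// Returns the elapsed wall-clock time in milliseconds.
template <typename Counters>
long long hammer_ms(Counters& c, long iterations) {
    const auto start = std::chrono::steady_clock::now();
    std::thread t1([&c, iterations] {
        for (long i = 0; i < iterations; ++i) c.a.fetch_add(1, std::memory_order_relaxed);
    });
    std::thread t2([&c, iterations] {
        for (long i = 0; i < iterations; ++i) c.b.fetch_add(1, std::memory_order_relaxed);
    });
    t1.join();
    t2.join();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
}
```

Calling `hammer_ms` on a `SharedLine` and on a `SeparateLines` instance with the same iteration count, on a multi-core machine, lets you compare the two layouts directly. The ratio varies by CPU, so treat any single number as indicative only.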
LTTng is a high-performance tracing framework for Linux. It can trace:
- Kernel events (syscalls, scheduling, interrupts, block I/O)
- Userspace events (your application’s tracepoints)
LTTng’s overhead is ~100ns per tracepoint — low enough for production use. This is 10-100x lower than printf-debugging or syslog.
┌─────────────────────────────────────────┐
│ Your Application │
│ ┌─────────────────────────────────┐ │
│ │ TRACEPOINT(my_app, request_start│ │
│ │ , size_t, req_id │ │
│ │ , int, priority │ │
│ │ ) │ │
│ └───────────┬─────────────────────┘ │
│ │ ~100ns per tracepoint │
│ ┌───────────▼──────────────────────┐ │
│ │ LTTng-UST (Userspace Tracer) │ │
│ └───────────┬──────────────────────┘ │
└──────────────┼──────────────────────────┘
│ shared memory ring buffer
┌──────────────▼──────────────────────────┐
│ lttng-sessiond (Session Daemon) │
│ ┌──────────────────────────────────┐ │
│ │ lttng-consumerd → trace files │ │
│ └──────────────────────────────────┘ │
└─────────────────────────────────────────┘
│
┌──────────────▼──────────────────────────┐
│ Analysis Tools │
│ babeltrace2, Trace Compass, LTTng Live │
└─────────────────────────────────────────┘
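The ring buffer in the middle of the diagram is what keeps the tracing hot path cheap: the application only copies a record into memory and moves on. The following is a much-simplified sketch of that idea, not LTTng's implementation. `TraceEvent` and `TraceRing` are hypothetical names, and the sketch assumes exactly one producer thread and one consumer thread.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <optional>

// Hypothetical fixed-size trace record.
struct TraceEvent {
    std::uint64_t timestamp_ns;
    std::uint32_t event_id;
    std::uint64_t payload;
};

// Single-producer / single-consumer ring buffer.
template <std::size_t Capacity>
class TraceRing {
    static_assert(Capacity > 0 && (Capacity & (Capacity - 1)) == 0,
                  "Capacity must be a power of two");

public:
    // Producer side: never blocks. Returns false and counts a drop when full.
    bool try_push(const TraceEvent& ev) {
        const std::uint64_t head = head_.load(std::memory_order_relaxed);
        const std::uint64_t tail = tail_.load(std::memory_order_acquire);
        if (head - tail == Capacity) {
            dropped_.fetch_add(1, std::memory_order_relaxed);
            return false;
        }
        slots_[head & (Capacity - 1)] = ev;
        head_.store(head + 1, std::memory_order_release);  // publish the slot
        return true;
    }

    // Consumer side: returns the oldest event, or nullopt when empty.
    std::optional<TraceEvent> try_pop() {
        const std::uint64_t tail = tail_.load(std::memory_order_relaxed);
        const std::uint64_t head = head_.load(std::memory_order_acquire);
        if (tail == head) return std::nullopt;
        const TraceEvent ev = slots_[tail & (Capacity - 1)];
        tail_.store(tail + 1, std::memory_order_release);  // free the slot
        return ev;
    }

    std::uint64_t dropped() const { return dropped_.load(std::memory_order_relaxed); }

private:
    std::array<TraceEvent, Capacity> slots_{};
    std::atomic<std::uint64_t> head_{0};     // next slot to write
    std::atomic<std::uint64_t> tail_{0};     // next slot to read
    std::atomic<std::uint64_t> dropped_{0};  // events lost because the ring was full
};
```

The design choice worth noticing is that a full buffer drops the event instead of making the traced thread wait, so tracing can never stall the code being observed.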
# Ubuntu 20.04+
sudo apt install lttng-tools lttng-modules-dkms liblttng-ust-dev babeltrace2
# Verify
lttng version
Create a tracepoint provider header (my_tp.h):
/* my_tp.h */
#undef TRACEPOINT_PROVIDER
#define TRACEPOINT_PROVIDER my_app
#undef TRACEPOINT_INCLUDE
#define TRACEPOINT_INCLUDE "./my_tp.h"
#if !defined(_MY_TP_H) || defined(TRACEPOINT_HEADER_MULTI_READ)
#define _MY_TP_H
#include <lttng/tracepoint.h>
TRACEPOINT_EVENT(
my_app, /* provider name */
request_start, /* event name */
TP_ARGS(
size_t, req_id,
int, priority,
const char*, endpoint
),
TP_FIELDS(
ctf_integer(size_t, req_id, req_id)
ctf_integer(int, priority, priority)
ctf_string(endpoint, endpoint)
)
)
TRACEPOINT_EVENT(
my_app,
request_end,
TP_ARGS(
size_t, req_id,
int, status_code,
uint64_t, duration_ns
),
TP_FIELDS(
ctf_integer(size_t, req_id, req_id)
ctf_integer(int, status_code, status_code)
ctf_integer(uint64_t, duration_ns, duration_ns)
)
)
#endif /* _MY_TP_H */
#include <lttng/tracepoint-event.h>
Create the tracepoint provider source (my_tp.c):
/* my_tp.c */
#define TRACEPOINT_CREATE_PROBES
#define TRACEPOINT_DEFINE
#include "my_tp.h"
Use in your C++ code:
#include <chrono>
#include <cstddef>
#include <cstdint>
#include "my_tp.h"

void process(size_t id);  // application work, defined elsewhere

void handle_request(size_t id, int priority, const char* endpoint) {
    tracepoint(my_app, request_start, id, priority, endpoint);
    auto start = std::chrono::steady_clock::now();
    process(id);  // ... do work ...
    auto elapsed = std::chrono::steady_clock::now() - start;
    uint64_t ns = static_cast<uint64_t>(
        std::chrono::duration_cast<std::chrono::nanoseconds>(elapsed).count());
    tracepoint(my_app, request_end, id, 200, ns);
}
Compile:
gcc -c my_tp.c -I.
g++ -std=c++2a -c main.cpp -I.
g++ main.o my_tp.o -ldl -llttng-ust -o my_app
# Create a session
lttng create my-session --output=/tmp/my-trace
# Enable userspace events
lttng enable-event --userspace 'my_app:*'
# Enable kernel events (optional, needs root)
sudo lttng enable-event --kernel sched_switch,sched_wakeup
# Start tracing
lttng start
# Run your application
./my_app
# Stop and destroy session
lttng stop
lttng destroy
# View trace (text)
babeltrace2 /tmp/my-trace
# Example output:
# [10:30:01.123456789] my_app:request_start: req_id=42 priority=3 endpoint="/api/data"
# [10:30:01.124567890] my_app:request_end: req_id=42 status_code=200 duration_ns=1111101
# Trace ROS 2 DDS events + your custom tracepoints
lttng create ros-trace --output=/tmp/ros-trace
lttng enable-event --userspace 'ros2:*'
lttng enable-event --userspace 'my_navigation:*'
lttng enable-event --kernel sched_switch # see thread scheduling
lttng start
ros2 run my_package my_node
# ... reproduce issue ...
lttng stop && lttng destroy
babeltrace2 /tmp/ros-trace | grep -E 'callback|timer|my_navigation'
strace intercepts all system calls made by a process. Invaluable for:
- Understanding what files a program opens
- Finding why a program hangs (stuck in read(), futex(), etc.)
- Measuring syscall latency
- Debugging permission errors
# Trace all syscalls
strace ./my_program
# Trace specific syscall categories
strace -e trace=file ./my_program # open, stat, chmod, unlink, etc. (syscalls that take a path)
strace -e trace=network ./my_program # socket, connect, send, recv
strace -e trace=process ./my_program # fork, exec, exit
strace -e trace=memory ./my_program # mmap, brk, mprotect
# Attach to running process
strace -p $(pidof my_process)
# Count syscalls (summary)
strace -c ./my_program
# Example output:
# % time seconds calls errors syscall
# ------ ----------- ------ ------ --------
# 45.23 0.002345 12345 0 write
# 30.12 0.001567 8901 0 read
# 10.45 0.000543 4567 23 open
# 5.67 0.000294 2345 0 close
# With wall-clock timestamps (-t; use -tt for microseconds) and time spent in each syscall (-T)
strace -T -t ./my_program
# 10:30:01 write(1, "hello\n", 6) = 6 <0.000015>
# ^^^^^^^^^^^ syscall duration
# Follow child processes
strace -f ./my_program
# Output to file (stderr is trace output)
strace -o trace.log ./my_program
# Find which config files a ROS node reads
strace -e trace=open,openat rosrun my_package my_node 2>&1 | grep -v ENOENT
# Find why a node hangs on startup
strace -e trace=futex,read,poll rosrun my_package my_node
# If stuck in futex() → waiting for a lock (mutex contention or deadlock)
# If stuck in poll() → waiting for network data (topic not published)
# Measure I/O syscall counts for a log-heavy node
strace -c -e trace=write rosrun my_package my_node
ltrace traces dynamic library calls (like strace for libc/libstdc++):
# Trace library calls
ltrace ./my_program
# Example output:
# malloc(64) = 0x5555557b0260
# memcpy(0x5555557b0260, "hello", 5) = 0x5555557b0260
# free(0x5555557b0260)
# Trace specific functions
ltrace -e 'malloc+free' ./my_program
# Count calls
ltrace -c ./my_program
Use case: Finding unexpected allocations in a real-time code path.
If you see malloc calls from a function that should be allocation-free,
you have a latency problem.
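ltrace observes allocations from outside the process. Inside a unit test, a similar check can be approximated by replacing the global operator new and counting calls. The sketch below is one way to do that, with hypothetical names (`g_alloc_count`, `allocations_during`). It counts allocations from all threads and only sees allocations that go through operator new, not direct malloc calls.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdlib>
#include <new>

// Number of calls to the replaced global operator new.
std::atomic<std::size_t> g_alloc_count{0};

// Replacement allocation function: count, then forward to malloc.
void* operator new(std::size_t size) {
    g_alloc_count.fetch_add(1, std::memory_order_relaxed);
    if (void* p = std::malloc(size != 0 ? size : 1)) return p;
    throw std::bad_alloc{};
}

void operator delete(void* p) noexcept { std::free(p); }
void operator delete(void* p, std::size_t) noexcept { std::free(p); }

// Runs `f` and returns how many heap allocations happened while it ran.
template <typename F>
std::size_t allocations_during(F&& f) {
    const std::size_t before = g_alloc_count.load(std::memory_order_relaxed);
    f();
    return g_alloc_count.load(std::memory_order_relaxed) - before;
}
```

A test for a real-time function can then assert that `allocations_during([&] { control_step(); })` is zero, where `control_step` stands for whatever code path must stay allocation-free.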
ftrace is the Linux kernel’s built-in tracer. trace-cmd is its CLI:
# Install
sudo apt install trace-cmd
# Record function tracer for 5 seconds
sudo trace-cmd record -p function -l 'sched_*' sleep 5
sudo trace-cmd report | head -50
# Record function graph (call tree with timing)
sudo trace-cmd record -p function_graph -l 'ext4_*' -- dd if=/dev/zero of=/tmp/test bs=4k count=1000
sudo trace-cmd report
# Trace scheduling events (who preempted whom)
sudo trace-cmd record -e sched:sched_switch -e sched:sched_wakeup sleep 5
sudo trace-cmd report
In a robot system, if your 100Hz control loop occasionally takes 15ms instead of 10ms, ftrace can show you:
- Which kernel thread preempted your RT thread
- How long the preemption lasted
- Whether it was a scheduling issue or I/O stall
# Trace scheduling of your RT thread
sudo trace-cmd record -e sched:sched_switch \
-f 'next_comm == "my_rt_thread" || prev_comm == "my_rt_thread"' \
sleep 10
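ftrace explains why an iteration was late, but the application still has to notice which iterations were late so the trace can be searched around those moments. A minimal sketch of that bookkeeping, assuming the loop records the duration of each iteration; `find_overruns` and its parameters are hypothetical names.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

using Micros = std::chrono::microseconds;

// Returns the indices of iterations that took longer than period + tolerance.
std::vector<std::size_t> find_overruns(const std::vector<Micros>& iteration_times,
                                       Micros period, Micros tolerance) {
    std::vector<std::size_t> late;
    for (std::size_t i = 0; i < iteration_times.size(); ++i) {
        if (iteration_times[i] > period + tolerance) {
            late.push_back(i);
        }
    }
    return late;
}
```

For the 100Hz loop above, `period` would be 10000 microseconds. The analysis should run outside the real-time path, since growing the result vector allocates memory.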
eBPF (extended Berkeley Packet Filter) runs sandboxed programs in the kernel. bpftrace provides a one-liner scripting interface:
# Install
sudo apt install bpftrace # Ubuntu 20.04+
# Count syscalls by process name
sudo bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'
# Trace malloc sizes
sudo bpftrace -e 'uprobe:/lib/x86_64-linux-gnu/libc.so.6:malloc {
@sizes = hist(arg0); }'
# Latency of read() syscall
sudo bpftrace -e '
tracepoint:syscalls:sys_enter_read { @start[tid] = nsecs; }
tracepoint:syscalls:sys_exit_read /@start[tid]/ {
@read_ns = hist(nsecs - @start[tid]);
delete(@start[tid]);
}'
# Function call latency in YOUR binary
sudo bpftrace -e 'uprobe:./my_program:process_frame {
@start[tid] = nsecs;
}
uretprobe:./my_program:process_frame /@start[tid]/ {
@latency_us = hist((nsecs - @start[tid]) / 1000);
delete(@start[tid]);
}'
eBPF advantages over strace/ftrace:
- Near-zero overhead (JIT-compiled in kernel)
- Can aggregate data in-kernel (histograms, counts)
- No context-switch overhead per event
- Programmable (complex logic without dumping everything)
Most developers use GDB for breakpoint → step → print. These advanced features solve problems that basic debugging cannot:
# Break only when a condition is true
(gdb) break process_frame if frame_id == 42
# Break only after N hits (skip startup)
(gdb) break main.cpp:100
(gdb) ignore 1 1000 # skip first 1000 hits of breakpoint 1
# Break when a variable changes (hardware watchpoint)
(gdb) watch counter # break when a write changes the value of counter
(gdb) rwatch buffer[10] # break on ANY read from buffer[10]
(gdb) awatch flags # break on read OR write
# Watch with condition
(gdb) watch counter if counter > 100
Watchpoints are the killer feature for finding memory corruption:
# Find who overwrites a variable
(gdb) break main
(gdb) run
(gdb) watch *(int*)0x7fffffffde4c # watch specific address
(gdb) continue
# GDB stops at the EXACT instruction that modifies the address
# Watch a struct member
(gdb) watch my_object.state_
# Stops whenever state_ changes, shows old and new values
Hardware watchpoints use CPU debug registers (limited to 4 on x86). Software watchpoints work for more locations but are extremely slow.
# Break on exception throw
(gdb) catch throw
(gdb) catch throw std::runtime_error # specific type
# Break on exception catch
(gdb) catch catch
# Break on syscall
(gdb) catch syscall write
# Break on fork/exec
(gdb) catch fork
(gdb) catch exec
GDB can record execution and step backwards:
(gdb) break main
(gdb) run
(gdb) record # start recording
(gdb) continue # run until crash/breakpoint
# Now you can go BACKWARDS:
(gdb) reverse-continue # run backwards until previous breakpoint
(gdb) reverse-step # step backwards one line
(gdb) reverse-next # step backwards over function calls
(gdb) reverse-finish # run backwards until function entry
# Find when a variable was last changed
(gdb) watch -l my_var
(gdb) reverse-continue # finds previous write to my_var
Limitation: Recording slows execution 10-100x. Best used with small reproduction cases, not full robot systems.
Make GDB display STL containers and custom types readably:
# STL pretty printers (usually auto-loaded)
(gdb) print my_vector
# $1 = std::vector of length 3, capacity 4 = {1, 2, 3}
# Custom pretty-printer (Python). In ~/.gdbinit or a project .gdbinit,
# Python code must sit between `python` and `end`.
python
import gdb.printing

class PosePrinter:
    """Pretty-print Pose2D(x, y, theta)"""
    def __init__(self, val):
        self.val = val

    def to_string(self):
        x = float(self.val['x_'])
        y = float(self.val['y_'])
        theta = float(self.val['theta_'])
        return f"Pose2D(x={x:.3f}, y={y:.3f}, θ={theta:.4f})"

def build_printer():
    pp = gdb.printing.RegexpCollectionPrettyPrinter("my_project")
    pp.add_printer('Pose2D', '^Pose2D$', PosePrinter)
    return pp

gdb.printing.register_pretty_printer(gdb.current_objfile(), build_printer())
end
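The printer reads the fields by name, so it only works if the C++ type actually has members called x_, y_ and theta_. A minimal sketch of a matching class, assuming double-precision members and no enclosing namespace; the real Pose2D in a project may look different.

```cpp
// Hypothetical type matching the printer: the member names x_, y_ and
// theta_ are the ones the Python code looks up with self.val['...'].
class Pose2D {
public:
    Pose2D(double x, double y, double theta) : x_(x), y_(y), theta_(theta) {}

    double x() const { return x_; }
    double y() const { return y_; }
    double theta() const { return theta_; }

private:
    double x_;      // position, first axis
    double y_;      // position, second axis
    double theta_;  // heading angle
};
```

If the class lives in a namespace, the regular expression passed to add_printer has to match the qualified name, for example `^my_ns::Pose2D$` for a hypothetical namespace `my_ns`.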
# On the robot (target)
gdbserver :2345 ./my_node
# On your dev machine
gdb ./my_node
(gdb) target remote robot_ip:2345
(gdb) break main
(gdb) continue
When a program crashes, the kernel can save a core dump — a snapshot of the process’s memory at the moment of death:
# Enable core dumps
ulimit -c unlimited
# Set core dump pattern (system-wide)
echo '/tmp/core.%e.%p.%t' | sudo tee /proc/sys/kernel/core_pattern
# Run the crashing program
./my_program # → segfault → /tmp/core.my_program.12345.1619280000
# Analyze with GDB
gdb ./my_program /tmp/core.my_program.12345.1619280000
(gdb) bt # backtrace — shows where it crashed
(gdb) frame 3 # switch to frame 3 in the backtrace
(gdb) info locals # show local variables in that frame
(gdb) print *this # if inside a member function
(gdb) info threads # show all threads at crash time
(gdb) thread 2 # switch to thread 2
(gdb) bt # backtrace for thread 2
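#!/bin/bash
# Run tests, then check for core dumps
ulimit -c unlimited
shopt -s nullglob  # an unmatched glob expands to nothing, so the loop is skipped when no core exists
./run_tests
status=$?
for core in /tmp/core.*; do
    echo "=== CRASH DETECTED ==="
    gdb -batch -ex "bt full" -ex "info threads" -ex "thread apply all bt" \
        ./my_test_binary "$core"
    status=1
done
exit "$status"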
When you don’t have GDB or a core dump, these tools extract information from the binary itself:
# addr2line — convert address to file:line
addr2line -e ./my_program -f 0x4011a3
# process_frame
# /home/user/src/main.cpp:42
# nm — list symbols
nm ./my_program | grep ' T ' # exported (Text) symbols
nm ./my_program | grep process # find a specific symbol
nm -C ./my_program # demangle C++ names
# objdump — disassembly
objdump -d ./my_program | less
objdump -d -S ./my_program # interleave source (needs -g)
# readelf — ELF header info
readelf -h ./my_program # file header
readelf -S ./my_program # section headers
readelf --debug-dump=line ./my_program # line number info
# c++filt — demangle a single symbol
echo '_ZN5MyApp12process_dataERKSt6vectorIiSaIiEE' | c++filt
# MyApp::process_data(std::vector<int, std::allocator<int>> const&)
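The same demangling is available from inside a program, which helps when printing stack traces or type names in logs. The sketch below uses `abi::__cxa_demangle` from `<cxxabi.h>`, which is provided by the GCC and Clang runtimes and is not standard C++. The wrapper name `demangle` is hypothetical.

```cpp
#include <cstdlib>
#include <cxxabi.h>
#include <memory>
#include <string>

// Returns the demangled form of `mangled`, or the input unchanged when
// the runtime reports a failure.
std::string demangle(const char* mangled) {
    int status = 0;
    // __cxa_demangle returns a malloc'ed buffer, so release it with free.
    std::unique_ptr<char, decltype(&std::free)> name(
        abi::__cxa_demangle(mangled, nullptr, nullptr, &status), &std::free);
    if (status == 0 && name) {
        return std::string(name.get());
    }
    return std::string(mangled);
}
```

For example, `demangle("_Z3foov")` yields `foo()`. Passing `typeid(obj).name()` is another common use on these toolchains.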
A production CI pipeline should run these tools in parallel:
# .github/workflows/quality.yml
name: Quality Pipeline
on:
  push:
  pull_request:
  schedule:
    - cron: '0 2 * * *'  # nightly trigger for the profile job (the time is an example)
jobs:
  # ── Static analysis (fast, run first) ──
  static-analysis:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: clang-tidy
        run: |
          cmake -B build -DCMAKE_EXPORT_COMPILE_COMMANDS=ON
          clang-tidy -p build/ src/*.cpp
      - name: cppcheck
        run: cppcheck --enable=all --error-exitcode=1 src/
  # ── Sanitizers (parallel, separate jobs) ──
  asan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: cmake -B build -DCMAKE_CXX_FLAGS="-fsanitize=address,undefined -fno-omit-frame-pointer -g"
      - run: cmake --build build && cd build && ctest
  tsan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: cmake -B build -DCMAKE_CXX_FLAGS="-fsanitize=thread -g"
      - run: cmake --build build && cd build && ctest
  # ── Profiling (nightly, expensive) ──
  profile:
    runs-on: ubuntu-latest
    if: github.event_name == 'schedule'
    steps:
      - uses: actions/checkout@v4
      - run: cmake -B build -DCMAKE_BUILD_TYPE=Release
      - run: cmake --build build
      - run: |
          perf stat ./build/my_benchmark
          valgrind --tool=callgrind --callgrind-out-file=callgrind.out ./build/my_benchmark
          # Compare against baseline
In a real robot codebase (ROS + custom code + third-party), expect noise:
project/
├── sanitizer/
│ ├── asan.supp # ASan suppressions
│ ├── tsan.supp # TSan suppressions
│ └── lsan.supp # Leak suppressions
├── .clang-tidy # clang-tidy config
├── cppcheck.supp # cppcheck suppressions
└── CMakeLists.txt # Sanitizer targets
Rule: Every suppression MUST have a comment explaining:
1. Why it’s suppressed (false positive? third-party? benign?)
2. A link to the upstream issue if applicable
3. When it can be removed
# tsan.supp
# Benign race in ros::init() counter — reported upstream as ros/ros_comm#2134
race:ros::init
# False positive: atomic<bool> with relaxed ordering — TSan doesn't model this
race:StatusFlags::is_ready
# Third-party: libcurl internal threading — we can't fix this
race:libcurl*
| Symptom | First tool | Second tool | Third tool |
|---|---|---|---|
| Crash (segfault) | Core dump + GDB | ASan rebuild | Valgrind memcheck |
| Data corruption | ASan | GDB watchpoints | UBSan |
| Race condition | TSan | GDB with thread commands | LTTng |
| Deadlock | TSan | strace -e futex | GDB info threads |
| Slow execution | perf stat | perf record → flamegraph | Callgrind |
| Cache misses | perf stat cache events | Cachegrind | SoA refactor |
| Syscall overhead | strace -c | bpftrace | Buffering |
| Memory leak | ASan (LSan) | Valgrind memcheck | massif |
| Undefined behavior | UBSan | ASan | Code review |
| Coding standard | clang-tidy | cppcheck | -Wall -Werror |
| Exercise | Focus | Tools |
|---|---|---|
| ex01 | Coding standards violations — detect and fix | clang-tidy, compiler warnings |
| ex02 | Static analysis traps — patterns that hide bugs | cppcheck, code review |
| ex03 | Cache profiling — AoS vs SoA, row vs column | perf stat, cachegrind |
| ex04 | Call-graph hotspot — find the bottleneck | callgrind, perf record |
| ex05 | Syscall audit — reduce I/O overhead | strace -c |
| ex06 | Tracepoint framework — build LTTng-style tracing | LTTng concepts |
| ex07 | Watchpoint hunting — find memory corruption | GDB watchpoints |
| ex08 | Flamegraph-driven optimization | perf record + flamegraph |
| puzzle01 | The observant profiler — cache line effects | perf + false sharing |
| puzzle02 | The invisible allocation — RT latency | ltrace + perf |
| puzzle03 | The lying benchmark — measurement traps | perf stat |