If you’ve written C++ in application-land, try/catch feels natural. In real-time and safety-critical systems (automotive, robotics firmware, avionics), exceptions are almost universally banned. Here’s why — and what you use instead.
When an exception is thrown, the runtime must:
1. Walk the call stack frame-by-frame
2. Find a matching catch handler (potentially many frames up)
3. Invoke destructors for every automatic object in each unwound frame
4. De-allocate stack frames
The cost depends on how deep the stack is at throw-time — something you cannot bound at compile time.
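A tiny sketch makes step 3 concrete: every automatic object between the throw site and the catch handler has its destructor run during unwinding (`Tracked` and `unwind_cost` are illustrative names, not from any real codebase):

```cpp
#include <cassert>
#include <stdexcept>

int g_dtor_count = 0;  // counts destructors run during unwinding

struct Tracked {
    ~Tracked() { ++g_dtor_count; }  // fires as its frame is unwound
};

void level(int depth) {
    Tracked t;  // one automatic object per stack frame
    if (depth == 0) throw std::runtime_error("fault");
    level(depth - 1);
}

// Returns how many destructors ran while unwinding `frames` extra levels.
int unwind_cost(int frames) {
    g_dtor_count = 0;
    try {
        level(frames);
    } catch (const std::runtime_error&) {
        // by the time control reaches here, every ~Tracked() in between has run
    }
    return g_dtor_count;
}
```

The cost scales with the call depth at the throw site, which is exactly the data-dependent term a worst-case timing analysis cannot bound.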
```
 throw site                                            catch site
     |                                                     |
     v                                                     v
[ frame 7 ] --dtor--> [ frame 6 ] --dtor--> ... --dtor--> [ frame 1 ]
 ~Widget()             ~Buffer()                           ~Lock()
```
In a hard-RT loop running at 1 kHz, you have a 1 ms budget. Stack unwinding through 5+ frames with destructors can easily blow that. Worse, the cost varies per invocation — it’s data-dependent, not just code-dependent.
Exception matching uses Run-Time Type Information. The compiler emits type metadata for every class hierarchy involved in catch clauses. This:
- Bloats binary size (matters on MCUs with 256 KB flash)
- Adds indirection through type-info tables at catch time
- Is incompatible with -fno-rtti (which many embedded toolchains require)
MISRA C++:2023 Rule 18.3.1: “An exception shall not be thrown.” AUTOSAR AP R20-11 similarly prohibits exceptions in adaptive-platform code. These aren’t style preferences — they’re certification requirements. If your code needs ISO 26262 ASIL-D or DO-178C Level A, exceptions are a non-starter.
Exceptions create invisible goto paths. Any function call can jump to an arbitrary catch handler. In safety-critical code, every control path must be analysed — exceptions make that combinatorially harder.
The simplest pattern. Every function returns a status:
```cpp
enum class ErrCode { Ok, Timeout, BadParam, HwFault };

ErrCode read_sensor(float* out) {
    if (!hw_ready()) return ErrCode::HwFault;
    *out = hw_read();
    return ErrCode::Ok;
}
```
Pros: Zero overhead, fully deterministic, trivially analysable. Cons: Caller can silently ignore the return. The “real” return must go through an out-parameter. Composing multiple fallible calls is verbose.
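The verbosity cost shows up as soon as two fallible calls compose. A sketch with a hypothetical `calibrate` stage (the stub bodies are illustrative):

```cpp
#include <cassert>

enum class ErrCode { Ok, Timeout, BadParam, HwFault };

// Hypothetical stages: status via return value, result via out-parameter.
ErrCode read_sensor(float* out) { *out = 1.0f; return ErrCode::Ok; }
ErrCode calibrate(float in, float* out) {
    if (in < 0.0f) return ErrCode::BadParam;
    *out = in * 2.0f;
    return ErrCode::Ok;
}

// Every call site needs its own check-and-bail branch.
ErrCode process(float* result) {
    float raw = 0.0f;
    ErrCode ec = read_sensor(&raw);
    if (ec != ErrCode::Ok) return ec;

    float cal = 0.0f;
    ec = calibrate(raw, &cal);
    if (ec != ErrCode::Ok) return ec;

    *result = cal;
    return ErrCode::Ok;
}
```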
```cpp
std::optional<float> read_sensor() {
    if (!hw_ready()) return std::nullopt;
    return hw_read();
}
```
Good when the only failure mode is “absent.” No room to say why it failed.
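When a fallback value is acceptable, `value_or` collapses the absent case at the call site. A minimal sketch (the `hw_*` stubs are stand-ins for real hardware access):

```cpp
#include <cassert>
#include <optional>

bool hw_ready() { return false; }  // stub: sensor not ready
float hw_read() { return 3.5f; }   // stub: would be a register read

std::optional<float> read_sensor() {
    if (!hw_ready()) return std::nullopt;
    return hw_read();
}

float sensor_or_default() {
    return read_sensor().value_or(0.0f);  // 0.0f when the reading is absent
}
```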
```cpp
using SensorResult = std::variant<float, ErrCode>;

SensorResult read_sensor() {
    if (!hw_ready()) return ErrCode::HwFault;
    return hw_read();
}
```
Now we can carry which error. But std::visit is clunky.
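The clunkiness is visible at the consuming end. A sketch of a `std::visit` call with a hand-written visitor that must cover every alternative:

```cpp
#include <cassert>
#include <string>
#include <variant>

enum class ErrCode { Ok, Timeout, BadParam, HwFault };
using SensorResult = std::variant<float, ErrCode>;

std::string describe(const SensorResult& r) {
    struct Visitor {
        std::string operator()(float v) const { return "value=" + std::to_string(v); }
        std::string operator()(ErrCode) const { return "error"; }
    };
    return std::visit(Visitor{}, r);  // compile error if an alternative is missing
}
```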
C++23 gives us std::expected<T, E>. Before that, use tl::expected or roll your own:
```cpp
std::expected<float, ErrCode> read_sensor() {
    if (!hw_ready()) return std::unexpected(ErrCode::HwFault);
    return hw_read();
}
```
The killer feature is and_then / transform chaining (railway-oriented programming):
```
read_sensor() --[ok]--> calibrate() --[ok]--> filter() --[ok]--> publish()
      |                     |                    |
      +---[err]-------------+------[err]---------+-------> handle error once
```
Each step only runs if the previous succeeded. Errors skip forward like a railway switch. One error-handling site at the end, not interleaved checks:
```cpp
auto result = read_sensor()
                  .and_then(calibrate)
                  .and_then(filter)
                  .transform(to_message);

if (!result) log_error(result.error());
```
This is the dominant pattern in modern safety-critical C++: deterministic, zero-overhead, composable, and every error path is explicit.
The C++ standard (§6.7.1) defines a memory location as:
- A scalar object (int, float, pointer), OR
- A maximal sequence of adjacent bit-fields of non-zero width
Two threads accessing the same memory location where at least one is a write, and they’re not ordered by synchronization → data race → undefined behavior.
This is the entire foundation. Everything below exists to establish ordering.
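The classic demonstration: two threads incrementing one counter. With a plain `int` this is a data race (and in practice loses updates); making the counter atomic turns each increment into one indivisible read-modify-write. A sketch:

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// With a plain `int counter` this would be UB: two unsynchronized writers
// to one memory location. std::atomic makes each increment a single atomic RMW.
int atomic_count(int per_thread) {
    std::atomic<int> counter{0};
    auto work = [&counter, per_thread] {
        for (int i = 0; i < per_thread; ++i)
            counter.fetch_add(1, std::memory_order_relaxed);
    };
    std::thread t1(work), t2(work);
    t1.join();
    t2.join();
    return counter.load();
}
```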
Within one thread, statements execute in order:
```
Thread 1:
  a = 1;   // A
  b = 2;   // B

A sequenced-before B (always)
```
The hard part. Two operations in different threads have no inherent ordering unless you create one:
```
Thread 1            Thread 2
---------           ---------
x = 42;             ???
flag = true;        if (flag)
                        use(x);   // is x guaranteed to be 42?
```
With plain (non-atomic) variables: NO. This is a data race. The compiler and CPU can both reorder.
Modern CPUs don’t write directly to cache. They write to a store buffer first:
```
   CPU Core 0                            CPU Core 1
  +-----------+                        +-----------+
  | registers |                        | registers |
  +-----+-----+                        +-----+-----+
        |                                    |
  +-----v-----+                        +-----v-----+
  | STORE BUF |                        | STORE BUF |
  | x=42      | (not yet visible)      |           |
  | flag=true |                        |           |
  +-----+-----+                        +-----+-----+
        |                                    |
  +=====v====================================v=====+
  |             SHARED CACHE (L3)                  |
  |   x = 0 (stale!)       flag = false (stale!)   |
  +================================================+
```
Core 0 wrote x=42 then flag=true into its store buffer. Core 1 reads from shared cache and might see flag=true (drained) but x=0 (not yet drained). The stores can become visible out of order.
Caches maintain coherence via MESI states for each cache line:
```
+----------+----------------------------------------------------+
| State    | Meaning                                            |
+----------+----------------------------------------------------+
| Modified | I have the only copy, it's dirty (newer than RAM)  |
| Exclusive| I have the only copy, it's clean                   |
| Shared   | Multiple caches have clean copies                  |
| Invalid  | My copy is stale, must re-fetch                    |
+----------+----------------------------------------------------+

Core 0 writes x:
    Core 0: x line -> Modified
    Core 1: x line -> Invalid (snooped invalidation)

Core 1 reads x:
    Core 1 sees Invalid -> requests from Core 0
    Core 0 flushes Modified line -> both go to Shared
```
MESI guarantees eventual coherence per cache line. But “eventual” isn’t ordered. The store buffer sits between the core and the cache — that’s where reordering sneaks in.
```
Thread 1                              Thread 2
--------                              --------
x.store(42, relaxed)
flag.store(true, release)
        \
         \  synchronizes-with
          \
           +---> flag.load(acquire)
                 if true:
                     x.load(relaxed) == 42   ✓ guaranteed
```
The release store publishes everything before it. The acquire load subscribes to everything the release published. Together they form a happens-before edge.
memory_order_release on a store = “drain my store buffer before (or as part of) making this store visible.”
memory_order_acquire on a load = “don’t let any of my subsequent reads/writes execute before this load completes.”
On x86, stores already have release semantics (Total Store Order). On ARM/RISC-V, the compiler emits actual fence instructions (dmb ish on ARM).
```
BEFORE release fence:

+--------------+
| Store Buffer |        Cache
|  x = 42      |        [stale]
|  y = 7       |        [stale]
+--------------+

release store of flag=true triggers drain:

+--------------+
| Store Buffer |        Cache
|  (empty)     |        x = 42     ✓
|              |        y = 7      ✓
|              |        flag=true  ✓
+--------------+

Other cores doing acquire-load of flag will now
also see x=42 and y=7 (happens-before guarantee)
```
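The same drain-then-publish protocol can also be written with standalone fences, `std::atomic_thread_fence`, paired with relaxed atomic accesses. A sketch (`payload` and `flag` are illustrative; the demo is deterministic because the consumer spins until it observes the flag):

```cpp
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<bool> flag{false};
int payload = 0;

void producer() {
    payload = 42;                                         // ordinary write
    std::atomic_thread_fence(std::memory_order_release);  // order it before...
    flag.store(true, std::memory_order_relaxed);          // ...the publish
}

int consumer() {
    while (!flag.load(std::memory_order_relaxed))
        ;                                                 // spin until published
    std::atomic_thread_fence(std::memory_order_acquire);  // order reads after
    return payload;                                       // guaranteed 42
}

int run_fence_demo() {
    std::thread p(producer);
    int seen = consumer();
    p.join();
    return seen;
}
```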
```cpp
std::atomic<bool> ready{false};
int data = 0;  // non-atomic!

// Producer (Thread 1)
data = 42;                                     // ordinary write
ready.store(true, std::memory_order_release);  // publish

// Consumer (Thread 2)
while (!ready.load(std::memory_order_acquire)) // subscribe
    ;                                          // spin
assert(data == 42);  // GUARANTEED — happens-before established
```
This is safe without making data atomic. The acquire-release pair on ready orders everything before the release against everything after the acquire.
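Wrapped in a minimal thread harness, the pattern above becomes a runnable, deterministic demo (sketch; the spin-wait makes the outcome independent of scheduling):

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int publish_demo() {
    std::atomic<bool> ready{false};
    int data = 0;  // non-atomic payload

    std::thread producer([&] {
        data = 42;                                     // ordinary write
        ready.store(true, std::memory_order_release);  // publish
    });

    while (!ready.load(std::memory_order_acquire))     // subscribe
        ;                                              // spin
    int seen = data;  // happens-before established: must be 42

    producer.join();
    return seen;
}
```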
CAS is the fundamental lock-free building block. It atomically does:
“If the value at address X is what I expect, replace it with my new value. Otherwise, tell me what it actually is.”
```cpp
std::atomic<int> counter{0};

void increment() {
    int expected = counter.load(std::memory_order_relaxed);
    while (!counter.compare_exchange_weak(
               expected,                     // read & updated on failure
               expected + 1,                 // desired
               std::memory_order_acq_rel)) {
        // expected was updated to the current value — retry
    }
}
```
Why weak? On ARM/RISC-V, CAS is implemented as LL/SC (load-linked/store-conditional). weak allows spurious failure (the LL/SC window expired), which is fine inside a loop. strong adds a retry internally — wasteful in a CAS loop.
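For a plain increment, `fetch_add` alone is enough; the CAS loop earns its keep when the update isn't one of the built-in RMW primitives. A sketch with a hypothetical increment-with-ceiling:

```cpp
#include <atomic>
#include <cassert>

// fetch_add covers plain increments:
//   counter.fetch_add(1, std::memory_order_relaxed);
// A compound update (here: never exceed `ceiling`) needs the CAS loop.
int saturating_increment(std::atomic<int>& counter, int ceiling) {
    int expected = counter.load(std::memory_order_relaxed);
    while (expected < ceiling &&
           !counter.compare_exchange_weak(expected, expected + 1,
                                          std::memory_order_acq_rel)) {
        // `expected` now holds the current value; the loop re-tests the ceiling
    }
    return counter.load(std::memory_order_relaxed);
}
```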
```
Thread 1:                      Thread 2:
read A (value = ptr_X)
                               pop X
                               push Y
                               push X back (same address!)
CAS(expected=ptr_X,
    desired=ptr_Z)
=> SUCCEEDS! (X is back)
=> But the stack changed underneath!
```
CAS only compares the value, not the history. If a value goes A→B→A, CAS thinks nothing changed. Solutions:
- Tagged pointers: Pack a version counter into the pointer (use the upper bits on 64-bit)
- Hazard pointers: Protect nodes from reclamation while in use
- Epoch-based reclamation: Defer frees until safe
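The version-counter idea can be sketched with a 64-bit atomic that packs a 32-bit value next to a 32-bit version (real tagged-pointer schemes pack the tag into unused pointer bits; the names here are illustrative):

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

std::atomic<uint64_t> slot{0};  // layout: [version:32 | value:32]

constexpr uint64_t pack(uint32_t value, uint32_t version) {
    return (uint64_t(version) << 32) | value;
}
constexpr uint32_t value_of(uint64_t s)   { return uint32_t(s); }
constexpr uint32_t version_of(uint64_t s) { return uint32_t(s >> 32); }

// CAS that bumps the version on every successful update: an A->B->A value
// history still fails a stale CAS, because the version has moved on.
bool set_if_current(uint64_t snapshot, uint32_t new_value) {
    uint64_t desired = pack(new_value, version_of(snapshot) + 1);
    return slot.compare_exchange_strong(snapshot, desired,
                                        std::memory_order_acq_rel);
}
```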
A Single-Producer Single-Consumer queue is the simplest useful lock-free structure. One thread writes, one reads — no CAS needed, just acquire-release on the indices.
Buffer (capacity = 8, only 7 usable to distinguish full from empty):
```
Index:   0   1   2   3   4   5   6   7
       +---+---+---+---+---+---+---+---+
Data:  |   | D | E | F |   |   |   |   |
       +---+---+---+---+---+---+---+---+
             ^           ^
             |           |
           head=1      tail=4
         (consumer)   (producer)

Readable slots: [head, tail) = {1, 2, 3} -> D, E, F
Writable slots: [tail, head) with wrap = {4, 5, 6, 7, 0}

Full when:  (tail + 1) % cap == head
Empty when: tail == head
```
Only the producer advances tail (after writing data into buf[tail]); only the consumer advances head (after reading data from buf[head]). The ordering contract:
```
Producer:                                  Consumer:
---------                                  ---------
buf[tail] = item;       // ordinary        auto item = buf[head];   // ordinary
tail.store(                                head.store(
    next_tail,                                 next_head,
    memory_order_release);  // publish         memory_order_release);  // publish

// reads head with acquire                 // reads tail with acquire
// to check "is there space?"              // to check "is there data?"
```
Correctness argument:
1. Producer stores data, then release-stores tail. Consumer acquire-loads tail, then reads data. The acquire-release pair guarantees the consumer sees the data.
2. Consumer reads data, then release-stores head. Producer acquire-loads head, then writes new data to the freed slot. The acquire-release pair guarantees the producer doesn’t overwrite data the consumer hasn’t read yet.
3. No two threads write the same index → no CAS, no ABA.
```
Producer Thread                            Consumer Thread
---------------                            ---------------
buf[1] = 'D'        (A)
     |
     | seq-before
     v
tail.store(2, rel)  (B) ---synchronizes-with---> tail.load(acq)  (C)
                                                      |
                                                      | seq-before
                                                      v
                                                 read buf[1]     (D)

A happens-before B (sequenced-before)
B happens-before C (synchronizes-with: release/acquire)
C happens-before D (sequenced-before)
Therefore: A happens-before D  ✓  (data is visible)
```
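The protocol can be sketched as a complete push/pop pair (illustrative `SpscRing`; the single-threaded smoke test below exercises only the index arithmetic, not the concurrency):

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

template <typename T, size_t N>
class SpscRing {
    T buf_[N];
    std::atomic<size_t> head_{0};  // advanced only by the consumer
    std::atomic<size_t> tail_{0};  // advanced only by the producer
public:
    bool try_push(const T& item) {  // producer thread only
        size_t t = tail_.load(std::memory_order_relaxed);
        size_t next = (t + 1) % N;
        if (next == head_.load(std::memory_order_acquire))
            return false;                                    // full
        buf_[t] = item;                                      // write data first...
        tail_.store(next, std::memory_order_release);        // ...then publish
        return true;
    }
    bool try_pop(T& out) {  // consumer thread only
        size_t h = head_.load(std::memory_order_relaxed);
        if (h == tail_.load(std::memory_order_acquire))
            return false;                                    // empty
        out = buf_[h];                                       // read data first...
        head_.store((h + 1) % N, std::memory_order_release); // ...then free the slot
        return true;
    }
};
```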
If the capacity is a power of two, the modulo reduces to a cheap mask: `idx & (cap - 1)`. One further practical detail: keep head and tail on separate cache lines.

```cpp
template <typename T, size_t N>
struct alignas(64) SPSCQueue {
    alignas(64) std::atomic<size_t> head{0}; // consumer cache line
    alignas(64) std::atomic<size_t> tail{0}; // producer cache line
    alignas(64) T buf[N];                    // data cache lines
};
```
Without the padding, head and tail sit on the same cache line. Every write by the producer invalidates the consumer’s cache line (and vice versa) — destroying performance despite being logically contention-free.
| Concept | Key Insight |
|---|---|
| No exceptions in RT | Non-deterministic timing, RTTI bloat, MISRA ban |
| Expected<T,E> | Deterministic, composable, zero-overhead error handling |
| Memory location | C++ unit of data-race analysis |
| Store buffer | Why writes become visible out of order |
| MESI | Cache-line level coherence, not ordering |
| Release | “Drain my store buffer, publish everything before me” |
| Acquire | “Don’t move anything past me until I complete” |
| CAS | Atomic read-modify-write; ABA is the trap |
| SPSC queue | acquire-release on indices, no CAS needed |
Exercises (see the `exercises/` directory):

| File | Topic |
|---|---|
| ex01_expected.cpp | Implement Expected<T,E> with and_then chaining |
| ex02_error_pipeline.cpp | Build a 4-stage processing pipeline using Expected |
| ex03_variant_visitor.cpp | Error handling with std::variant and std::visit |
| ex04_store_buffer_demo.cpp | Demonstrate store-buffer reordering (run under TSan) |
| ex05_acquire_release.cpp | Publish/subscribe pattern with acquire-release |
| ex06_cas_counter.cpp | Lock-free counter using CAS loop |
| ex07_spsc_queue.cpp | Implement and test a lock-free SPSC ring buffer |