Skills tested: Two-layer architecture timing, ROS2 real-time issues, Jetson↔MCU coordination
Difficulty: Advanced
Time estimate: 60–90 minutes
Field report from warehouse (automated site, no operator present):
Multiple robots intermittently drift 5–15 cm off the planned path during corridor transit. The drift happens mid-straight-line, not at corners. It corrects within 1–2 seconds but causes near-misses with rack posts. AMCL localization is accurate (checked). Path planner output is correct (checked). The issue happens more under heavy CPU load (many robots active, cameras streaming).
What you have:
- Nav2 controller (Regulated Pure Pursuit) runs at 20 Hz
- Motor bridge (SPI) runs at 1 kHz
- Velocity smoother at 50 Hz
- Jetson Xavier NX, 6 cores, running ROS2 + AMCL + 2 cameras + DDS
Log the time between consecutive cmd_vel publications:

import rclpy
from rclpy.node import Node
from geometry_msgs.msg import Twist

class CmdVelMonitor(Node):
    def __init__(self):
        super().__init__('cmdvel_monitor')
        self.last_time = None
        self.intervals = []
        self.create_subscription(Twist, '/cmd_vel', self.cb, 10)

    def cb(self, msg):
        now = self.get_clock().now()
        if self.last_time is not None:
            dt = (now - self.last_time).nanoseconds / 1e6  # ms
            self.intervals.append(dt)
            if dt > 80:  # Flag gaps > 80 ms (1.6× the 50 ms nominal period)
                self.get_logger().warn(f'cmd_vel gap: {dt:.1f} ms')
        self.last_time = now
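Once intervals have been collected, the summary statistics below can be computed with the standard library. A minimal sketch (the `interval_stats` helper and the synthetic sample are illustrative, not from the field data; `statistics.quantiles` with n=100 gives approximate percentile cut points):

```python
import statistics

def interval_stats(intervals_ms):
    """Summarize cmd_vel inter-arrival times (milliseconds)."""
    qs = statistics.quantiles(intervals_ms, n=100)  # 99 cut points
    return {
        'mean': statistics.mean(intervals_ms),
        'median': statistics.median(intervals_ms),
        'p95': qs[94],
        'p99': qs[98],
        'max': max(intervals_ms),
        'gaps_over_80ms': sum(1 for dt in intervals_ms if dt > 80),
    }

# Synthetic example: mostly nominal ~50 ms periods plus a few long gaps
sample = [50.0] * 95 + [90.0, 100.0, 120.0, 150.0, 187.4]
stats = interval_stats(sample)
print(stats['max'], stats['gaps_over_80ms'])
```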
cmd_vel interval statistics (5 minute window):
Mean: 50.2 ms (expected: 50 ms = 20 Hz ✓)
Median: 49.8 ms
P95: 62.3 ms
P99: 98.7 ms ← Almost 2× period
Max: 187.4 ms ← Nearly 4× period!
Gaps > 80 ms: 23 occurrences in 5 minutes
Gaps > 100 ms: 8 occurrences
Gaps > 150 ms: 2 occurrences
Finding: The Nav2 controller is supposed to publish cmd_vel every 50 ms, but under load, it occasionally takes 100–187 ms. During these gaps, the motor bridge sends the last cmd_vel repeatedly.
The robot is driving straight at $v = 0.5$ m/s, $\omega = 0$. A small perturbation makes the robot rotate slightly ($\omega_{actual} = 0.01$ rad/s). Normally, the Nav2 controller corrects this in 50 ms:
t=0: cmd_vel: v=0.5, ω=0.0 Robot heading: 0.000 rad
t=50ms: Nav2 sees slight drift Heading drifted to: 0.0005 rad
cmd_vel: v=0.5, ω=-0.01 (corrective angular velocity)
t=100ms: Corrected back Heading: 0.0001 rad
With a 150 ms gap:
t=0: cmd_vel: v=0.5, ω=0.0 Robot heading: 0.000 rad
t=50ms: Nav2 SHOULD correct... Heading: 0.0005 rad
But Nav2 is delayed! Motor bridge sends ω=0.0 (stale)
t=100ms: Still delayed! Heading: 0.0015 rad (drift growing)
t=150ms: Nav2 finally updates Heading: 0.0025 rad = 0.14°
cmd_vel: v=0.5, ω=-0.05 (larger correction needed)
t=200ms: Heading: 0.001 rad Still correcting...
Cross-track error from 150 ms gap:
$$e_{cross} = v \cdot \sin(\Delta\theta) \cdot t_{gap} \approx 0.5 \times 0.0025 \times 0.15 = 0.19 \text{ mm}$$
Wait — that’s tiny. So where does the 5–15 cm drift come from?
One gap doesn’t cause 15 cm drift. But multiple gaps in sequence do. And the real problem is more subtle:
The controller’s state estimate is also affected. When the Nav2 controller’s timer fires late:
1. The TF tree (odom → base_link) is stale
2. The controller computes the error using an old robot pose
3. The correction is wrong — it corrects for where the robot WAS, not where it IS
4. This wrong correction causes a temporary overshoot in the other direction
5. The next cycle (also potentially late) compounds the error
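The compounding effect can be reproduced with a rough, stdlib-only simulation. Everything in it is an assumed illustration — a constant disturbance of 0.01 rad/s, a simple proportional heading controller with gain 5, and a 150 ms gap (three missed 50 ms cycles) once per second — but it shows how stale corrections let cross-track error grow well beyond the single-gap estimate:

```python
import math

def simulate(gap_cycles, t_end=5.0, dt=0.001, v=0.5,
             omega_dist=0.01, k_heading=5.0):
    """Integrate heading and cross-track error under a P heading controller.

    gap_cycles: set of 50 ms controller-cycle indices whose update is
    skipped; during a gap the motor bridge keeps applying the stale command.
    """
    heading = 0.0
    cross = 0.0
    omega_cmd = 0.0
    max_cross = 0.0
    steps_per_cycle = 50  # 50 ms controller period at 1 ms integration step
    for i in range(int(t_end / dt)):
        if i % steps_per_cycle == 0:
            if i // steps_per_cycle not in gap_cycles:
                omega_cmd = -k_heading * heading  # fresh correction
            # else: the stale omega_cmd is held, like the motor bridge does
        heading += (omega_cmd + omega_dist) * dt
        cross += v * math.sin(heading) * dt
        max_cross = max(max_cross, abs(cross))
    return max_cross

no_gaps = simulate(set())
# Three consecutive missed cycles ≈ one 150 ms gap, repeated every second
gaps = simulate({c for c in range(100) if c % 20 in (1, 2, 3)})
print(f'max cross-track, no gaps: {no_gaps * 100:.3f} cm')
print(f'max cross-track, gaps:    {gaps * 100:.3f} cm')
```

The absolute numbers depend entirely on the assumed gain and disturbance; the point is the relative growth when corrections arrive late.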
Under heavy CPU load, the ROS2 DDS middleware also suffers:
Camera 1 topics: 30 Hz × 2 MB = 60 MB/s
Camera 2 topics: 30 Hz × 2 MB = 60 MB/s
LiDAR: 10 Hz × 100 KB = 1 MB/s
AMCL particles: 5 Hz × 50 KB = 0.25 MB/s
cmd_vel: 20 Hz × 48 bytes = ~1 KB/s ← Tiny!
Total DDS load: ~120 MB/s
The cmd_vel message is tiny but competes with camera topics for DDS thread time. When the DDS executor is busy serializing a 2 MB image, cmd_vel publication can be delayed by up to one image serialization time (~5–15 ms on Xavier NX).
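The 5–15 ms figure is consistent with back-of-envelope arithmetic. The effective serialization/copy throughput range below is an assumption for illustration, not a measured Xavier NX number:

```python
image_bytes = 2 * 1024 * 1024  # one 2 MB camera frame

# Assumed effective serialization throughput range (illustrative only)
for throughput_mb_s in (150, 400):
    delay_ms = image_bytes / (throughput_mb_s * 1e6) * 1e3
    print(f'{throughput_mb_s} MB/s -> cmd_vel delayed ~{delay_ms:.1f} ms')
```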
# On the Jetson
sudo perf top -p $(pgrep nav2_controller)
35.2% librcl.so [.] rmw_take
22.1% nav2_controller [.] RegulatedPurePursuit::computeVelocityCommands
8.3% libfastrtps.so [.] eprosima::fastrtps::rtps::MessageReceiver
7.1% libtf2_ros.so [.] tf2_ros::BufferCore::lookupTransform
Finding: 35% of the controller’s time is spent in rmw_take (receiving DDS messages). When other nodes flood DDS, the controller spends more time waiting for the executor.
# Measure actual timer callback intervals
import statistics

intervals = [50.1, 49.8, 50.3, 51.2, 48.9, 50.0, 50.5,
             89.3,   # ← gap
             10.7,   # ← catch-up (timer fires immediately after gap)
             50.2, 50.1, 50.0, 50.3,
             102.5,  # ← larger gap
             47.5,   # ← partial catch-up
             50.0]

print(f"Jitter std: {statistics.stdev(intervals):.1f} ms")
# Result: 19.5 ms standard deviation (should be < 2 ms for a clean 20 Hz timer)
# Set SCHED_FIFO priority for the controller process
sudo chrt -f 50 ros2 run nav2_controller controller_server
Or programmatically in the node's entry point (this is Python node code, not launch-file syntax — executor setup and CPU pinning happen inside the process):

# Use a ROS2 multi-threaded executor with a small thread pool
import os
from rclpy.executors import MultiThreadedExecutor

executor = MultiThreadedExecutor(num_threads=2)
executor.add_node(controller_node)

# Pin this process to isolated CPU cores
os.sched_setaffinity(0, {4, 5})  # Cores 4–5 reserved for the controller
Isolate control topics from camera topics:

# nav2_controller params
ros__parameters:
  use_sim_time: false

For cmd_vel itself, intra-process communication bypasses DDS serialization entirely. Note that this requires the publisher and subscriber to be composed into the same process, and is enabled through node options (e.g. `extra_arguments=[{'use_intra_process_comms': True}]` on a ComposableNode) rather than through a regular ROS parameter.
The motor bridge should detect and handle stale commands:

class MotorBridge(Node):
    def __init__(self):
        super().__init__('motor_bridge')
        self.cmd_vel_age_threshold = 0.08  # 80 ms = 1.6× the 50 ms period
        self.decel_rate = 0.5              # m/s² applied when cmd_vel is stale
        self.current_v = 0.0
        self.last_cmd_vel = (0.0, 0.0)
        self.cmd_vel_stamp = self.get_clock().now()
        self.create_subscription(Twist, '/cmd_vel', self.cmd_vel_cb, 10)
        self.create_timer(0.001, self.spi_callback)  # 1 kHz SPI loop

    def cmd_vel_cb(self, msg):
        self.last_cmd_vel = (msg.linear.x, msg.angular.z)
        self.cmd_vel_stamp = self.get_clock().now()

    def spi_callback(self):
        age = (self.get_clock().now() - self.cmd_vel_stamp).nanoseconds / 1e9
        if age > self.cmd_vel_age_threshold:
            # STALE: decelerate smoothly instead of holding the last command
            self.get_logger().warn(f'cmd_vel stale: {age*1000:.0f} ms')
            v = self.current_v
            if abs(v) > 0.01:
                sign = 1.0 if v > 0 else -1.0
                v -= sign * self.decel_rate * 0.001  # one 1 ms step of decel
                v = max(0.0, v) if sign > 0 else min(0.0, v)
            else:
                v = 0.0
            # Hold angular velocity at zero (don't keep turning while stale)
            omega = 0.0
        else:
            v, omega = self.last_cmd_vel
        self.current_v = v
        self.send_to_mcu(v, omega)
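The deceleration branch can be sanity-checked in isolation. A pure-Python sketch of the same per-tick update, outside ROS (`decel_step` is an illustrative name, not part of the bridge):

```python
def decel_step(v, decel_rate=0.5, dt=0.001):
    """One 1 kHz tick of the stale-command deceleration ramp."""
    if abs(v) <= 0.01:
        return 0.0  # close enough to stopped: snap to zero
    sign = 1.0 if v > 0 else -1.0
    v -= sign * decel_rate * dt
    return max(0.0, v) if sign > 0 else min(0.0, v)

# Starting at 0.5 m/s, the ramp reaches zero in roughly 1 s of 1 ms ticks
v = 0.5
ticks = 0
while v != 0.0:
    v = decel_step(v)
    ticks += 1
print(ticks)  # roughly 1000 ticks ≈ 1 s at 0.5 m/s²
```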
class GapAwareVelocitySmoother:
    def __init__(self, max_accel, max_decel, nominal_dt):
        self.max_accel = max_accel
        self.max_decel = max_decel
        self.nominal_dt = nominal_dt
        self.current_v = 0.0
        self.current_omega = 0.0
        self.last_update_time = None

    def smooth(self, target_v, target_omega, current_time):
        if self.last_update_time is not None:
            actual_dt = current_time - self.last_update_time
            # If actual_dt > 2× nominal, we had a gap: smooth with
            # nominal_dt so one late cycle can't authorize a huge jump
            if actual_dt > 2.0 * self.nominal_dt:
                dt = self.nominal_dt
                print(f"Gap detected: {actual_dt*1000:.0f} ms")
            else:
                dt = actual_dt
        else:
            dt = self.nominal_dt
        self.last_update_time = current_time

        # Standard acceleration limiting
        dv = target_v - self.current_v
        max_dv = (self.max_accel if dv > 0 else self.max_decel) * dt
        dv = max(min(dv, max_dv), -max_dv)
        self.current_v += dv

        dw = target_omega - self.current_omega
        max_dw = self.max_accel * dt  # reusing max_accel as the angular limit
        dw = max(min(dw, max_dw), -max_dw)
        self.current_omega += dw
        return self.current_v, self.current_omega
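The key design decision — clamping the smoothing step to nominal_dt after a gap — can be isolated into a tiny helper (a sketch with illustrative names; the 50 Hz smoother has a 20 ms nominal period):

```python
def effective_dt(actual_dt, nominal_dt):
    """Return the dt the smoother should use for acceleration limiting.

    After a gap (actual_dt much larger than nominal), using actual_dt
    would let the velocity change by max_accel * actual_dt in a single
    step — exactly the discontinuity the smoother exists to prevent.
    """
    if actual_dt > 2.0 * nominal_dt:
        return nominal_dt  # gap: don't let the acceleration budget balloon
    return actual_dt

print(effective_dt(0.150, 0.020))  # after a 150 ms gap at 50 Hz
print(effective_dt(0.021, 0.020))  # normal jitter passes through unchanged
```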
Before mitigation:
  cmd_vel gaps > 80 ms: 23 per 5 minutes
  Max path deviation: 15.2 cm
  Near-misses per hour: 3.1

After mitigation:
  cmd_vel gaps > 80 ms: 2 per 5 minutes (RT priority)
  Max path deviation: 2.1 cm (gap detection + decel)
  Near-misses per hour: 0.0
On a shared Jetson, you have a timing budget just like a financial budget:
Jetson Xavier NX: 6 cores, 8 GB RAM
Core allocation:
Cores 0-1: Linux kernel, DDS, cameras, misc
Cores 2-3: AMCL, costmap, planner (can be slow)
Core 4: Nav2 controller (20 Hz, SCHED_FIFO)
Core 5: Motor bridge (1 kHz, SCHED_FIFO)
CPU budget at 20 Hz (50 ms per cycle):
TF lookup: 2 ms
Costmap query: 5 ms
RPP computation: 3 ms
DDS publish: 1 ms
─────────────────────────────
Total: 11 ms (22% utilization, safe)
Headroom: 39 ms (78% margin for jitter)
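The budget arithmetic is easy to keep honest in code (numbers taken from the table above):

```python
period_ms = 1000 / 20  # 20 Hz controller → 50 ms per cycle
costs_ms = {
    'TF lookup': 2,
    'Costmap query': 5,
    'RPP computation': 3,
    'DDS publish': 1,
}
total = sum(costs_ms.values())
utilization = total / period_ms
headroom = period_ms - total
print(f'Total: {total} ms ({utilization:.0%} utilization)')
print(f'Headroom: {headroom:.0f} ms ({1 - utilization:.0%} margin for jitter)')
```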
Follow-up questions:
1. SCHED_FIFO with priority 50 is used for the controller. What happens to other ROS2 nodes when the controller needs CPU?
2. The gap-aware smoother uses nominal_dt during gaps instead of actual_dt. Why? What would happen with actual_dt?