03 — Nav2 Architecture: BehaviorTree, Costmaps, Planners, Controllers

Why the robot stops, why recovery behaviors trigger, why a path is never found

Prerequisite: 01-nodes-topics-actions.md (actions, lifecycle), 02-tf2-time-qos.md (TF2, QoS) Unlocks: Interpreting Nav2 log output, diagnosing stuck navigation, tuning costmap inflation, choosing between planners and controllers, understanding BehaviorTree XML files

Why Should I Care? (Context)

Nav2 is the standard ROS2 navigation stack for mobile robots. Nearly every AMR navigation failure traces back to one of four places:

Localization — the stack thinks the robot is somewhere it isn’t
Costmap — there’s an obstacle in the costmap that doesn’t exist in reality (or vice versa)
Planner — no valid path exists given the current costmap state
BehaviorTree — the recovery loop ran out of retries and gave up

Without understanding how these four layers interact, navigation bugs look random. With this mental model, you can read a Nav2 log and immediately identify which layer failed and why.

PART 1 — OVERVIEW: THE 4-LAYER STACK

1.1 The Layered Architecture

  ┌─────────────────────────────────────────────────────────┐
  │  Layer 4: BehaviorTree Executor (bt_navigator)          │
  │  Decides WHAT to do and in what sequence                │
  │  "Try planning → try following → try recovery → repeat" │
  └───────────────────────────┬─────────────────────────────┘
                              │ calls actions
  ┌───────────────────────────▼─────────────────────────────┐
  │  Layer 3: BT Action Plugins                             │
  │  ComputePathToPose → nav2_planner                       │
  │  FollowPath        → nav2_controller                    │
  │  ClearEntireCostmap → costmap_2d server                 │
  │  Spin / BackUp / Wait → recovery servers                │
  └───────────────────────────┬─────────────────────────────┘
                              │ reads/writes
  ┌───────────────────────────▼─────────────────────────────┐
  │  Layer 2: Costmap Servers                               │
  │  Global costmap: full-map obstacle grid for planning    │
  │  Local costmap: rolling window for controller           │
  └───────────────────────────┬─────────────────────────────┘
                              │ reads
  ┌───────────────────────────▼─────────────────────────────┐
  │  Layer 1: Localization                                  │
  │  AMCL: map + laser → /amcl_pose                         │
  │  robot_localization EKF: odom + IMU → filtered /odom    │
  └─────────────────────────────────────────────────────────┘

Each layer is independent — you can replace AMCL with a different localizer, or swap the planner, without touching the others. The BehaviorTree is the glue that orchestrates them as actions.

1.2 Nav2 Node Graph (Simplified)

                         ┌──────────────┐
  /map ──────────────────► map_server   │
                         └──────┬───────┘
                                │/map (TRANSIENT_LOCAL)
          ┌─────────────────────▼───────────────────────────┐
          │           costmap_2d_global                      │
          │  StaticLayer + ObstacleLayer + InflationLayer    │
          └──────────────────────┬──────────────────────────┘
                                 │/global_costmap/costmap
  /scan ──►┌───────────────────┐ │
           │ costmap_2d_local  │ │
           └────────────┬──────┘ │
                        │/local_costmap/costmap
                        │        │
                ┌───────▼───┐    │
  /odom ───────►│ nav2_     │    │
  /amcl_pose ──►│ controller│    │
                └───────┬───┘    │
                        │        │ ┌────────────┐
                        │        └─► nav2_planner│
                        │          └─────┬───────┘
                     /cmd_vel            │/plan
                        │          ┌─────▼───────┐
                        │          │ bt_navigator │◄── NavigateToPose action
                        └──────────┤             │
                                   └─────────────┘

PART 2 — BEHAVIOR TREES

2.1 What a Behavior Tree IS

A BehaviorTree (BT) is a hierarchical state machine where each node returns one of three values:

SUCCESS   — I completed my task
FAILURE   — I could not complete my task
RUNNING   — I am still working (call me again next tick)

The tree is “ticked” at a configurable rate (default ~10Hz in Nav2). Each tick propagates from root to leaf, collecting return values.

Node types:

Sequence      — AND:  tick children left to right. Stop and return FAILURE
                      on first FAILURE child. Return SUCCESS only if ALL succeed.

Fallback      — OR:   tick children left to right. Stop and return SUCCESS
                      on first SUCCESS child. Return FAILURE only if ALL fail.

Decorator     — Modifier: wraps one child and changes its return value
                (Retry N times, Invert, ForceSuccess, RateController...)

Action        — Leaf: calls an actual ROS2 action server. Returns RUNNING
                      while the action is in progress.

2.2 The Default NavigateToPose BehaviorTree

<BehaviorTree>
  <RecoveryNode number_of_retries="6" name="NavigateRecovery">

    <PipelineSequence name="NavigateWithReplanning">
      <RateController hz="1.0">
        <ComputePathToPose goal="{goal}" path="{path}"/>
      </RateController>
      <FollowPath path="{path}" controller_id="FollowPath"/>
    </PipelineSequence>

    <ReactiveFallback name="RecoveryFallback">
      <GoalUpdated/>
      <SequenceWithMemory name="RecoveryActions">
        <ClearEntireCostmap name="ClearGlobalCostmap-Context"
                            service_name="global_costmap/clear_entirely_global_costmap"/>
        <Spin spin_dist="1.57"/>
        <Wait wait_duration="5.0"/>
        <BackUp backup_dist="0.30" backup_speed="0.05"/>
      </SequenceWithMemory>
    </ReactiveFallback>

  </RecoveryNode>
</BehaviorTree>

Walk-through — what happens when ComputePathToPose returns FAILURE:

Tick 1:
  RecoveryNode ticks PipelineSequence
  → RateController ticks ComputePathToPose → FAILURE (no path found)
  → PipelineSequence returns FAILURE
  RecoveryNode tries fallback: ReactiveFallback
  → GoalUpdated? No → FAILURE
  → SequenceWithMemory:
      ClearEntireCostmap → SUCCESS (costmap cleared)
      Spin → RUNNING (still spinning)

Tick 2-N: (Spin still running)
  Spin → RUNNING

Tick N+1: (Spin complete)
  Wait 5s → RUNNING

...eventually:
  BackUp → SUCCESS
  SequenceWithMemory → SUCCESS
  ReactiveFallback → SUCCESS
  RecoveryNode: recovery done, retry PipelineSequence

If this fails 6 times total: RecoveryNode returns FAILURE → navigation aborted.

2.3 PipelineSequence vs Sequence

This is one of the most misunderstood Nav2 internals:

Sequence:           Ticks children from the beginning each tick.
                    If child 1 is RUNNING, re-ticks child 1, skips child 2.

PipelineSequence:   Re-ticks ALL previously-RUNNING children each tick.
                    If child 1 returns SUCCESS, also ticks child 2 on same tick.
                    Effect: planner and controller run concurrently.

PipelineSequence during navigation:
  Tick 1:  RateController(ComputePathToPose) → SUCCESS   ← path computed
           FollowPath → RUNNING                          ← start following
  Tick 2:  RateController: 1Hz, skip this tick
           FollowPath → RUNNING                          ← still following
  Tick 3:  RateController: 1Hz, skip
           FollowPath → RUNNING
  ...
  Tick 10: RateController: 1Hz, replan now
           ComputePathToPose → SUCCESS                   ← updated path
           FollowPath → RUNNING (receives updated path)

Key insight: The RateController hz="1.0" causes the path to be replanned every second, not every tick. This prevents the planner from consuming CPU on every 10Hz tick.

2.4 The “Robot Stuck in Recovery Loop” Pattern

Robot behavior:                        BT state:
  spinning...                            Spin → RUNNING
  waiting...                             Wait → RUNNING
  backing up...                          BackUp → RUNNING
  trying to plan...                      ComputePathToPose → FAILURE
  spinning again...                      (retry 2 of 6)
  ...
  "Goal aborted"                         RecoveryNode: retries exhausted

Root cause checklist: 1. Is there actually a valid path? Check if the goal pose is inside an obstacle in the global costmap. 2. Is the global costmap showing a phantom obstacle? Old sensor data, mis-configured inflation radius. 3. Is the robot’s localization correct? If AMCL has diverged, the costmap obstacle locations don’t match reality. 4. Is inflation_radius too large for the corridor width? Robot inflation footprint blocks its own path.

PART 3 — COSTMAPS

3.1 Costmap Values

Cost value  Meaning
──────────────────────────────────────────────────────────
  0          FREE — robot can be here
  1–252      INFLATED — near an obstacle; higher = closer
  253        INSCRIBED — robot center would touch an obstacle
  254        LETHAL — obstacle cell; robot cannot be here
  255        UNKNOWN — not yet observed

The planner’s job: find a path through cells with cost < 253. The controller’s job: follow that path while the local costmap stays clear.

3.2 Costmap Layers

Final costmap (master)
    │
    ├── StaticLayer     ← reads from /map (the pre-built occupancy grid)
    │                     FREE cells in map = 0, OCCUPIED = 254
    │
    ├── ObstacleLayer   ← reads from /scan, /depth/points
    │                     marks new obstacles seen by sensors
    │                     can clear cells when sensor sees free space
    │
    └── InflationLayer  ← expands all obstacles outward by inflation_radius
                          creates cost gradient (253 at robot_radius, 1 at inflation_radius)

Before InflationLayer:           After InflationLayer:
                                  inflation_radius = 0.5m
  ░░░░░░░░░░░░░░░░                ░░░░░░░░░░░░░░░░
  ░░░░░░░░░░░░░░░░                ░░1 1 1 1 1░░░░░
  ░░░░░████░░░░░░░                ░░1 ╔═════╗1░░░░
  ░░░░░████░░░░░░░   ────────►    ░░1 ║     ║1░░░░
  ░░░░░████░░░░░░░                ░░1 ║wall ║1░░░░
  ░░░░░░░░░░░░░░░░                ░░1 ╚═════╝1░░░░
  ░░░░░░░░░░░░░░░░                ░░1 1 1 1 1░░░░░
                                  ░░░░░░░░░░░░░░░░

  ░ = FREE (0)                    Numbers = cost gradient (1–252)
  █ = LETHAL (254)                ╔╗╚╝ = INSCRIBED/LETHAL border

3.3 Global vs Local Costmap

Property	Global Costmap	Local Costmap
Size	Full map (e.g., 50m × 30m)	Rolling window around robot (e.g., 4m × 4m)
Update rate	Slow (~1Hz)	Fast (~5–10Hz, matches sensor rate)
Layers	Static + Obstacle + Inflation	Obstacle + Inflation only (no static)
Purpose	Path planning	Reactive collision avoidance
Frame	`map` frame	Robot-centered (`odom` or `base_link`)

Key insight: The local costmap does NOT use the static map layer. It only sees what the sensors currently see. This is why the robot can drive through a previously-mapped obstacle if the sensor isn’t observing it right now.

3.4 inflation_radius Tuning

Too small (e.g., exactly robot_radius):
  Robot path hugs walls → robot scrapes obstacles
  Paths through narrow corridors may work but margins are zero

Too large (e.g., 1.5× corridor width):
  Inflation from both walls overlaps → entire corridor is high-cost
  Planner may declare NO PATH even though the robot could fit
  Robot refuses to enter narrow sections

Correct:
  inflation_radius = robot_radius + safety_margin
  safety_margin = 0.1–0.3m depending on robot speed and controller accuracy

Diagnosis: If the planner reports “no path found” but the corridor is clearly passable, reduce inflation_radius or check that the robot_radius parameter matches the actual robot footprint.

PART 4 — PLANNERS AND CONTROLLERS

4.1 The Planner: Global Path Planning

The planner receives a goal pose and the global costmap, and returns a path (a list of poses from current position to goal).

NavFn Planner (Dijkstra / A*):

Strengths:  Reliable, well-tested, handles most warehouse maps
Weaknesses: Doesn't account for robot kinematics (can produce paths a
            differential-drive robot can't physically follow)
Parameters: use_astar: true/false, allow_unknown: true/false

SmacPlanner (SE2 / Hybrid-A* search):

Strengths:  Kinematically-feasible paths (respects min turn radius)
            Better for non-holonomic robots in tight spaces
Weaknesses: Slower than NavFn, more parameters to tune
Parameters: minimum_turning_radius, motion_model_for_search: DUBIN/REEDS_SHEPP

Here, SE2 means the planner reasons over $(x, y, \theta)$ instead of just $(x, y)$. SE2 is the state space; Hybrid-A* is the style of search algorithm operating in that state space.

Why NavFn and Smac feel so different

NavFn and Smac are not just two implementations of the same idea. They search in different spaces.

NavFn searches a 2D grid: “Which cells get me from start to goal?”
SmacPlanner Hybrid-A* searches a higher-dimensional state: “Which position + heading states can this robot actually drive through?”

That is why NavFn is fast and robust on open warehouse maps, but can return a path that looks good on the map and still asks the robot to do something physically ugly. Smac is slower because it carries more information, but that extra information is exactly what lets it respect turning radius and direction changes.

Dijkstra vs A*: the search cost functions

For grid planners, the most important mental model is the scoring function used to decide what node to expand next.

Dijkstra:

$$ f(n) = g(n) $$

$g(n)$ = exact cost from the start to the current node
No goal-directed guess is used
Result: the search expands outward like a ripple

A*:

$$ f(n) = g(n) + h(n) $$

$g(n)$ = exact cost from the start so far
$h(n)$ = heuristic estimate from the current node to the goal
Result: the search is biased toward states that appear closer to the goal

For a 4-connected grid, a common heuristic is the Manhattan distance:

$$ h(n) = |x_n - x_g| + |y_n - y_g| $$

If $h(n)$ never overestimates the true remaining cost, A is still optimal. That is the key rule behind an admissible heuristic*.

Tiny grid example

Imagine a robot moving one cell at a time with cost 1 per move:

S . . .
. X . .
. X . G
. . . .

S = start
G = goal
X = blocked cell

If you run Dijkstra, it does not care where the goal is yet. It keeps expanding by increasing path cost:

This is good when you truly want the full shortest-path cost field, but it explores many cells that obviously do not help reach the goal quickly.

If you run A*, the heuristic nudges the search toward G. It still avoids the wall, but it wastes less work on cells that are moving away from the goal.

Intuition:

Dijkstra says: “Expand the cheapest thing found so far.”
A* says: “Expand the cheapest thing found so far, adjusted by how promising it looks.”

In Nav2 terms, NavFn can run in either style depending on use_astar: true/false.

Dubins vs Reeds-Shepp: motion models for car-like robots

Once the robot is non-holonomic and has a minimum turning radius, shortest-path planning is no longer just a grid-search problem. The planner must care about how the vehicle can actually move.

The two classical motion models used here are:

Feature	Dubins	Reeds-Shepp
Direction of travel	Forward only	Forward and reverse
Primitive idea	Left, Right, Straight	Forward/reverse Left, Right, Straight
Typical optimal path length	Up to 3 segments	Up to 5 segments
Best fit	Forward-only vehicle models	Cars, forklifts, parking maneuvers

Dubins model

The Dubins car assumes:

fixed forward motion only
bounded curvature (equivalently, a minimum turning radius)
no sideways motion

Its shortest paths are built from circular arcs and straight lines. The optimal path is always in one of two families containing the six classical Dubins words:

CSC: curve-straight-curve, such as LSL, RSR, LSR, RSL
CCC: curve-curve-curve, such as LRL, RLR

This is a good model when reversing is impossible or undesirable. A fixed-wing UAV is a classic example: it cannot stop and back up to fix a bad approach.

Dubins kinematics

In the plane, define the vehicle state as:

q = (x, y, \theta)

where:

$x, y$ = position in the map
$\theta$ = heading angle

For the normalized Dubins car, forward speed is fixed and steering is bounded:

\dot{x} = \cos(\theta)

\dot{y} = \sin(\theta)

\dot{\theta} = u_1, \quad u_1 \in \{-u_{max}, 0, u_{max}\}

This is the math version of: the car must keep moving forward, and it can only turn left, go straight, or turn right within a bounded curvature.

Reeds-Shepp model

The Reeds-Shepp car adds one extra capability: reverse gear.

That single change matters a lot. The robot can now:

back out of tight areas
switch direction at a cusp
find shorter maneuvers in confined spaces

Because of that extra freedom, optimal paths can use more segments and more word types. This is why Reeds-Shepp is the better match for parking, forklifts, and car-like robots that must re-orient in narrow aisles.

Reeds-Shepp kinematics

Reeds-Shepp keeps the same state, but adds a direction-control variable:

\dot{x} = u_2 \cos(\theta)

\dot{y} = u_2 \sin(\theta)

\dot{\theta} = u_1

with:

u_2 \in \{-1, 1\}

Here:

$u_2 = 1$ means forward motion
$u_2 = -1$ means reverse motion

That is why Reeds-Shepp paths can contain cusps: points where the vehicle stops, changes direction, and then continues with the opposite sign of velocity.

Why both are called non-holonomic

Dubins and Reeds-Shepp are both non-holonomic models. In practical terms, that means the robot cannot instantly move sideways. Its velocity is constrained by its current heading, so every path must be built from motions the wheels can really produce.

That constraint is exactly why a plain grid planner is not enough for car-like robots. The robot’s state is not just “which cell am I in?” but also “which way am I facing?”

Practical intuition:

Dubins answers: “What is the shortest path if I am never allowed to reverse?”
Reeds-Shepp answers: “What is the shortest path if I may drive forward and back?”

In Nav2 SmacPlanner, this appears directly in:

motion_model_for_search: DUBIN
# or
motion_model_for_search: REEDS_SHEPP

Choose DUBIN when reverse motion is not allowed by platform or policy. Choose REEDS_SHEPP when reverse maneuvers are acceptable and tight-space performance matters more than search simplicity.

How this maps to real robots

Differential-drive AMR in wide warehouse aisles: NavFn or Smac can both work; the controller often matters more than the motion model.
Car-like robot or tugger with limited steering: Smac with a realistic turning radius matters a lot.
Forklift / parking-style behavior: Reeds-Shepp is often the right search model, because reversing is part of the task, not a failure.

Real-world use cases: where each model actually shows up

Use Dubins when forward momentum is mandatory or reverse is physically impossible:

Fixed-wing UAVs: the aircraft must keep forward airspeed and cannot back up in the sky.
Towed marine survey vessels: a ship dragging long sonar arrays avoids reverse maneuvers because backing up risks tangling or damaging the payload.

Use Reeds-Shepp when the shortest feasible maneuver depends on gear changes:

Autonomous valet parking: backing into or out of tight spaces is part of normal operation.
Warehouse forklifts or AGVs in narrow aisles: three-point turns and reverse alignment maneuvers are often unavoidable.

Visual companion: NavFn vs Smac Search Spaces

Companion exercise: Exercise05 — Search Costs and Motion Models

Follow-up exercise: Exercise06 — Hybrid-A* and Minimum Turning Radius

Interview sheet: Exercise07 — Bellman-Ford, Dijkstra, and A* Interview Traps

Interview gotchas: Dijkstra, A, and Hybrid-A

These are the questions that usually expose whether someone actually understands the search model.

1. When does A* become exactly Dijkstra?

When:

$$ h(n) = 0 $$

Then:

$$ f(n) = g(n) + h(n) = g(n) $$

So A* expands nodes exactly like Dijkstra.

2. Can A* return a non-shortest path?

Yes, if the heuristic overestimates the true remaining cost.

That is a non-admissible heuristic, and it breaks A*’s optimality guarantee.

3. What is an admissible heuristic?

A heuristic that never overestimates:

h(n) \leq d^*(n, goal)

where $d^*$ is the true shortest remaining cost.

4. What is a consistent heuristic?

A stronger condition:

h(n) \leq c(n, n') + h(n')

for every neighbor $n’$.

This gives A* a cleaner search frontier and avoids many re-openings.

5. Why is A* usually faster than Dijkstra?

Because it uses both:

exact cost so far: $g(n)$
estimated cost to go: $h(n)$

So it spends less time exploring directions that are obviously not helping.

6. When is Dijkstra still the better answer?

when you do not have a useful heuristic
when you need shortest paths to many or all nodes, not just one goal
when the graph is small enough that heuristic design is not worth the overhead

7. Can Dijkstra handle negative weights?

No. Negative edges break its greedy ordering assumption.

If negative weights exist, use an algorithm designed for that case, such as Bellman-Ford.

8. What happens if the heuristic is too small?

A* stays correct, but becomes less informed and therefore slower.

At the extreme, if $h(n)=0$, it collapses all the way to Dijkstra.

9. What happens if the heuristic is perfect?

If the heuristic equals the exact remaining cost, A* expands almost only the nodes on the optimal path. That is the ideal case.

10. Why isn’t greedy best-first search enough?

Because greedy search uses only:

$$ f(n) = h(n) $$

and ignores the real path cost already paid.

A* works because it balances both:

$$ f(n) = g(n) + h(n) $$

Greedy search can chase a promising-looking direction and still end up with a bad total path.

11. If all edges have equal weight, should you still use Dijkstra?

Usually no. Use BFS instead. It is simpler and faster for unweighted or equal-weight graphs.

12. What extra difficulty does `Hybrid-A` add compared with plain A?

Hybrid-A* is not just choosing among grid cells. It is choosing among position-plus-heading states, under curvature constraints. That makes the state space much larger, but it also makes the returned path something a real non-holonomic robot can actually execute.

When path planning fails:

# Check planner logs:
ros2 topic echo /plan       # should publish a path
ros2 topic echo /planning_server/goal_status

# Common log messages:
# [nav2_planner] Failed to create a plan from X to Y
#   → goal inside obstacle, or costmap not initialized
# [nav2_planner] Timeout exceeded
#   → map too large, NavFn too slow, reduce resolution or switch to SmacPlanner

4.2 The Controller: Local Path Following

The controller runs at ~20Hz, receives the planned path + local costmap, and outputs /cmd_vel.

DWB (Dynamic Window Approach)

Algorithm:
  1. Sample N velocity commands (vx, vθ) within the dynamic window
     (window = what's kinematically reachable in the next timestep)
  2. Project each sampled velocity forward for T seconds → trajectory
  3. Score each trajectory with multiple critics
  4. Send the velocity with the best total score

Trajectory Critics:
  PathAlign    — stay close to the planned path heading
  PathDist     — stay close to the planned path laterally
  GoalAlign    — point toward goal when close
  ObstacleDist — penalize trajectories near obstacles

Key parameters:
  max_vel_x: 0.5         # max forward speed (m/s)
  max_vel_theta: 1.0     # max rotation speed (rad/s)
  sim_time: 1.7          # seconds to roll out each sampled trajectory
  vx_samples: 20         # how many forward-speed samples
  vtheta_samples: 20     # how many rotation-speed samples

If sim_time is too low, the controller can’t see far enough ahead — it may choose a trajectory that looks safe in the next 0.5s but collides 0.6s later.

If sim_time is too high, the controller becomes over-conservative — every trajectory appears risky because it extrapolates too far.

RPP (Regulated Pure Pursuit)

RPP picks a lookahead point on the path and steers toward it. It is: - smoother than DWB - simpler to tune - often better for Ackermann or high-speed robots

Key idea: closer lookahead = sharper path following but more oscillation; farther lookahead = smoother path but cuts corners.

4.3 Common Controller Failure Patterns

Robot oscillates left-right in corridor:
  → PathAlign weight too high, or sim_time too low

Robot cuts corners and clips obstacles:
  → lookahead too large (RPP), obstacle critic too weak (DWB)

Robot gets stuck rotating in place:
  → goal tolerance too strict, local costmap sees obstacle at goal pose

4.4 Goal Tolerances

xy_goal_tolerance: 0.25   # meters
yaw_goal_tolerance: 0.25  # radians (~14°)

If these are too tight, the robot may endlessly re-approach the goal because it can’t physically settle within the tolerance.

Symptom: robot reaches goal area, then does micro-corrections forever.

PART 5 — HOW LOCALIZATION FEEDS NAV2

5.1 `robot_localization` vs `amcl`

AMCL:
  Input: map + laser scan
  Output: pose in map frame → /amcl_pose, map → odom TF
  Purpose: global localization on known map

robot_localization EKF:
  Input: wheel odom + IMU (+ GPS or other sensors)
  Output: filtered odom in odom frame → /odometry/filtered
  Purpose: smooth local motion estimate

Nav2 needs both: - global pose in map frame for planning - smooth local velocity / odom for the controller

If AMCL is wrong, the planner plans from the wrong place. If EKF odom is noisy, the controller can’t track smoothly.

5.2 TF Chain Required by Nav2

map → odom → base_link → laser_frame

If any link in that chain is missing or delayed, Nav2 will fail with TF lookup timeouts.

Common failure:

[bt_navigator] Timed out waiting for transform from base_link to map

Interpretation: either AMCL not publishing map → odom, or timestamps are too old.

PART 6 — FIELD DEBUGGING CHECKLIST

When a navigation failure occurs, inspect in this order:

TF is valid? bash ros2 run tf2_tools view_frames ros2 topic echo /tf
Global costmap initialized? bash ros2 topic echo /global_costmap/costmap
Planner producing path? bash ros2 topic echo /plan
Controller publishing /cmd_vel? bash ros2 topic echo /cmd_vel
Recovery loop active? Look for repeated Spin, BackUp, ClearEntireCostmap in logs.

This sequence isolates the failed layer in under 2 minutes.

Summary

BehaviorTree decides what to try next
Planner finds a global path through the global costmap
Controller turns that path into velocity commands using the local costmap
Localization supplies the pose and odom that make the rest of Nav2 meaningful

If you understand those four layers, most Nav2 logs stop looking like noise.

Practice Exercises

RPP (Regulated Pure Pursuit)

Algorithm:
  1. Find the "lookahead point" on the path at lookahead_dist ahead
  2. Compute the curvature (1/radius) to reach that point
  3. Scale linear speed by curvature (slow down on curves)
     and by proximity to obstacles (slow down near obstacles)
  4. Output (vx, vθ)

Key parameters:
  desired_linear_vel: 0.5    # target forward speed (m/s)
  lookahead_dist: 0.6        # how far ahead to look on the path (m)
  max_angular_accel: 3.2     # rad/s²
  use_cost_regulated_linear_velocity_scaling: true  # slow near obstacles
  regulated_linear_scaling_min_speed: 0.25          # never go below this

RPP failure modes: - Robot overshoots corners: increase lookahead_dist or decrease speed. - Robot cuts corners into obstacles: decrease lookahead_dist. - Robot stops short of goal: goal_dist_tol too large. Check xy_goal_tolerance.

4.3 DWB vs RPP: Decision Guide

Use RPP when:                           Use DWB when:
  Wide, clear corridors                   Cluttered environments
  Industrial/warehouse predictable paths  Dynamic obstacles
  Simpler parameter tuning needed         Need fine-grained trajectory scoring
  Differential drive, stable odometry     Need custom trajectory critics
  Speed regulation near obstacles         Multi-objective optimization needed

Starting point for a warehouse AMR: RPP with desired_linear_vel: 0.5, lookahead_dist: 0.6. Tune from there.

4.4 Controller Failure Modes

# Controller stopped publishing /cmd_vel:
ros2 topic hz /cmd_vel         # should be ~20Hz when navigating

# Common log messages:
# [nav2_controller] Could not find any valid trajectories
#   → DWB: all sampled trajectories are in collision
#   → Fix: check local costmap for phantom obstacles
# [nav2_controller] Goal not reached within tolerance
#   → Robot is circling near goal but not converging
#   → Fix: increase xy_goal_tolerance or check localization accuracy
# [nav2_controller] Exceeded time tolerance
#   → Robot took too long to reach goal (e.g., blocked by real obstacle)

PART 5 — ROBOT LOCALIZATION (EKF NODE)

5.1 What robot_localization Does

The robot_localization package provides an Extended Kalman Filter (EKF) that fuses multiple odometry sources into a single, smooth /odom output.

Sensor inputs:
  /wheel_odom  (nav_msgs/Odometry)   ← from motor controller
  /imu/data    (sensor_msgs/Imu)     ← from IMU

                    ┌──────────────┐
  /wheel_odom ─────►│              │
                    │  EKF node    ├──► /odometry/filtered  (smooth /odom)
  /imu/data ───────►│              │      └──► TF: odom → base_link
                    └──────────────┘

Why bother with EKF if you already have wheel odometry? - Wheel odometry alone: noisy, susceptible to wheel slip - IMU alone: drifts (gyro bias accumulates) - EKF fusion: IMU provides fast, high-frequency correction; wheels provide long-term position

5.2 Key Parameters

ekf_node:
  ros__parameters:
    frequency: 50.0                  # EKF update rate (Hz)
    two_d_mode: true                 # robot moves in 2D plane only

    # Which states each sensor provides (x, y, z, roll, pitch, yaw, vx, vy, vz, vroll, vpitch, vyaw, ax, ay, az)
    odom0: /wheel_odom
    odom0_config: [true,  true,  false,   # x, y, z
                   false, false, true,    # roll, pitch, yaw
                   true,  false, false,   # vx, vy, vz
                   false, false, true,    # vroll, vpitch, vyaw
                   false, false, false]   # ax, ay, az

    imu0: /imu/data
    imu0_config: [false, false, false,
                  true,  true,  true,    # fuse orientation from IMU
                  false, false, false,
                  true,  true,  true,    # fuse angular velocity from IMU
                  true,  false, false]   # fuse linear accel X only

    process_noise_covariance: [...]      # how much to trust the process model
    initial_estimate_covariance: [...]   # initial uncertainty

5.3 AMCL: Map-Based Localization

AMCL (Adaptive Monte Carlo Localization) uses a particle filter to localize the robot against a known map.

Input:   /scan  (laser measurement)
Input:   /map   (pre-built occupancy grid from SLAM)
Input:   /initialpose (initial guess, or use /tf odom→base)
Output:  /amcl_pose  (geometry_msgs/PoseWithCovarianceStamped)
Output:  TF: map → odom  (the localization correction)

Particle filter intuition:
  100 particles spread around initial pose
    │
    ▼ Robot moves (odom update)
  All particles moved according to motion model
    │
    ▼ Robot observes scan
  Particles consistent with scan get HIGH weight
  Particles inconsistent with scan get LOW weight
    │
    ▼ Resample: high-weight particles survive, low-weight die
  Particles converge on true pose

AMCL failure modes:

# Localization diverged (particles spread all over map):
# - Robot was lifted and moved (external intervention)
# - Wheel slip caused large odometry error → particles fell off the true pose
# Fix: publish a new /initialpose from rviz2 (2D Pose Estimate button)

# AMCL not converging:
# - Global localization needed: ros2 service call /reinitialize_global_localization
# - scan → map mismatch: check if map matches current environment

# Common log message:
# [amcl] Particle filter variance too high
#   → Localization is uncertain; Nav2 may still navigate but with degraded accuracy

5.4 Localization Failure → Replanning Loop

Failure chain:
  robot_localization EKF publishes stale odom
       │
       ▼
  TF odom→base_link not updating at full rate
       │
       ▼
  Nav2 controller believes robot hasn't moved
       │
       ▼
  Controller keeps commanding velocity to "catch up"
       │
       ▼
  Robot moves faster than intended → overshoots goal
  OR: replanning loop (controller sees progress=zero → BT triggers recovery)

Diagnosis:
  ros2 topic hz /odometry/filtered    # should be ~50Hz
  ros2 topic hz /tf                   # should reflect odom publisher rate

  Low rate or gaps → EKF is stalling → check IMU and wheel odom topics

Summary — What to Remember

Nav2 Component	Purpose	Key Failure Mode
BehaviorTree executor (`bt_navigator`)	Orchestrates planning, following, recovery in a tree structure	Recovery loop exhausts retries → goal aborted
`PipelineSequence`	Runs planner + controller concurrently; replanner fires at `RateController` hz	Confused with `Sequence`; results in either over-replanning or never replanning
`RecoveryNode`	Retries the navigation sequence N times after recovery actions	Wrong `number_of_retries`; robot gives up too early or loops forever
Global costmap	Full-map obstacle grid for path planning	Stale obstacle from old sensor data → path planning fails; fix with `ClearEntireCostmap`
Local costmap	Rolling window for reactive control	Phantom obstacle → controller stops → BT triggers recovery
`inflation_radius`	Expands obstacles to keep robot body clear	Too large → robot can’t enter corridors; too small → robot scrapes walls
StaticLayer	Loads pre-built map into costmap	Not loaded (TRANSIENT_LOCAL QoS mismatch) → costmap shows only sensor obstacles
ObstacleLayer	Adds real-time sensor obstacles to costmap	Not updating (scan QoS mismatch or sensor driver down) → robot drives into real obstacles
NavFn planner	Dijkstra/A* global path	Slow on large maps; ignores robot kinematics
SmacPlanner	Hybrid-A* kinematically feasible path	Slower; needs `minimum_turning_radius` tuned
DWB controller	Samples trajectories, scores with critics	CPU heavy; oscillates if `PathAlign` weight unbalanced
RPP controller	Regulated pure pursuit	Overshoots corners if `lookahead_dist` too large
`robot_localization` EKF	Fuses wheel odom + IMU → smooth `/odom`	Stale odom if IMU stops; replanning loop if `/odom` gaps
AMCL	Particle filter localization against known map	Diverges after wheel slip or robot being moved; fix with `/initialpose`

03 — Nav2 Architecture: BehaviorTree, Costmaps, Planners, Controllers

Why the robot stops, why recovery behaviors trigger, why a path is never found

Why Should I Care? (Context)

PART 1 — OVERVIEW: THE 4-LAYER STACK

1.1 The Layered Architecture

1.2 Nav2 Node Graph (Simplified)

PART 2 — BEHAVIOR TREES

2.1 What a Behavior Tree IS

2.2 The Default NavigateToPose BehaviorTree

2.3 PipelineSequence vs Sequence

2.4 The “Robot Stuck in Recovery Loop” Pattern

PART 3 — COSTMAPS

3.1 Costmap Values

3.2 Costmap Layers

3.3 Global vs Local Costmap

3.4 inflation_radius Tuning

PART 4 — PLANNERS AND CONTROLLERS

4.1 The Planner: Global Path Planning

Why NavFn and Smac feel so different

Dijkstra vs A*: the search cost functions

Tiny grid example

Dubins vs Reeds-Shepp: motion models for car-like robots

Dubins model

Dubins kinematics

Reeds-Shepp model

Reeds-Shepp kinematics

Why both are called non-holonomic

How this maps to real robots

Real-world use cases: where each model actually shows up

Interview gotchas: Dijkstra, A, and Hybrid-A

1. When does A* become exactly Dijkstra?

2. Can A* return a non-shortest path?

3. What is an admissible heuristic?

4. What is a consistent heuristic?

5. Why is A* usually faster than Dijkstra?

6. When is Dijkstra still the better answer?

7. Can Dijkstra handle negative weights?

8. What happens if the heuristic is too small?

9. What happens if the heuristic is perfect?

10. Why isn’t greedy best-first search enough?

11. If all edges have equal weight, should you still use Dijkstra?

12. What extra difficulty does Hybrid-A* add compared with plain A*?

4.2 The Controller: Local Path Following

DWB (Dynamic Window Approach)

RPP (Regulated Pure Pursuit)

4.3 Common Controller Failure Patterns

4.4 Goal Tolerances

PART 5 — HOW LOCALIZATION FEEDS NAV2

5.1 robot_localization vs amcl

5.2 TF Chain Required by Nav2

PART 6 — FIELD DEBUGGING CHECKLIST

Summary

Practice Exercises

RPP (Regulated Pure Pursuit)

4.3 DWB vs RPP: Decision Guide

4.4 Controller Failure Modes

PART 5 — ROBOT LOCALIZATION (EKF NODE)

5.1 What robot_localization Does

5.2 Key Parameters

5.3 AMCL: Map-Based Localization

5.4 Localization Failure → Replanning Loop

Summary — What to Remember

12. What extra difficulty does `Hybrid-A` add compared with plain A?

5.1 `robot_localization` vs `amcl`