Prerequisite: 03 — Nav2 Architecture, 01 — Nodes Topics Actions, 02 — Tf2 Time Qos
Unlocks: Confident Nav2 debugging, cleaner system decomposition, faster incident triage, better launch and ownership decisions for AMR software teams
Most Nav2 failures are not caused by a bug inside Nav2 itself. They come from a bad system boundary:
/cmd_vel
If you do not know where Nav2 starts and stops, every navigation incident turns into random log surfing. If you do know the boundaries, you can ask the right question first:
That is the point of this lesson: treat Nav2 as a distributed subsystem with hard contracts, not as a black box called “navigation”.
Nav2 owns the logic that converts a navigation goal into planned motion commands.
It does not own the full autonomy stack.
Mission / Fleet / UI layer
│ creates goals and policies
▼
Nav2
│ computes path, follows path, runs recoveries
▼
Base controller / safety layer / motor drivers
│ turn Twist into wheel motion
▼
Robot hardware
If you say “Nav2 is responsible for moving the robot,” that is directionally true but operationally incomplete. In practice, Nav2 sits between systems that define intent above it and systems that enforce physical safety below it.
| Area | Usually owned by Nav2 | Usually owned outside Nav2 | Why it matters |
|---|---|---|---|
| Goal execution | Yes | Mission layer decides which goal to send | Nav2 executes; it should not invent business intent |
| Global path generation | Yes | Map production and semantic zoning come from elsewhere | Planner quality depends on upstream map quality |
| Local obstacle avoidance | Partly | Sensor drivers and perception quality are upstream | Nav2 can only react to what enters the costmap |
| Robot localization consumption | Yes | Localization generation is external | Nav2 trusts TF and pose sources |
| Recovery policy | Yes | Fleet or task layer may decide escalation after repeated failure | Recovery logic belongs near navigation policy |
| Traffic coordination between robots | No | Fleet manager / traffic manager | Nav2 is usually single-robot local intelligence |
| Hard safety stop | No | Safety PLC / collision monitor / certified safety chain | Safety must override Nav2 when required |
| Docking workflow orchestration | Partly | Product-specific mission logic often wraps Nav2 | Docking needs navigation plus product rules |
Production rule: if a failure spans multiple boxes, the incident owner still needs to separate primary cause from visible symptom. Nav2 often shows the symptom first.
Nav2 assumes four things are already true:
At minimum, Nav2 expects a reliable chain like:
map -> odom -> base_link -> sensor frames
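As a rough illustration of what "reliable" could mean for this chain, the sketch below walks an expected frame chain and reports missing or stale links. It is plain Python with no ROS dependency: `TfEdge`, `check_chain`, the `laser` frame name, and the 0.5 s staleness threshold are all hypothetical stand-ins, and a real check would need to exempt static transforms, which do not age.

```python
from dataclasses import dataclass


# Hypothetical snapshot of one TF link: parent frame, child frame, and the
# age in seconds of the newest transform received for that link.
@dataclass(frozen=True)
class TfEdge:
    parent: str
    child: str
    age_s: float


def check_chain(edges, chain, max_age_s=0.5):
    """Return a list of problems found along the expected frame chain."""
    by_link = {(e.parent, e.child): e for e in edges}
    problems = []
    for parent, child in zip(chain, chain[1:]):
        edge = by_link.get((parent, child))
        if edge is None:
            problems.append(f"missing {parent} -> {child}")
        elif edge.age_s > max_age_s:
            problems.append(f"stale {parent} -> {child} ({edge.age_s:.2f}s old)")
    return problems


edges = [
    TfEdge("map", "odom", age_s=2.0),         # localization stopped updating
    TfEdge("odom", "base_link", age_s=0.02),  # odometry is fresh
]
print(check_chain(edges, ["map", "odom", "base_link", "laser"]))
```

In this made-up snapshot the check reports a stale `map -> odom` link and a missing `base_link -> laser` link, two different failures with two different owners.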
If this chain is missing, stale, or inconsistent, every server downstream becomes noisy:
Nav2 does not prove localization is correct. It uses the pose it receives.
That means a robot can:
The global planner and local controller both read a filtered version of the world. If that world model is wrong, the path logic is still internally correct but operationally useless.
Nav2 publishes motion intent, usually on /cmd_vel or a shaped derivative. The base stack, safety layer, and sometimes a velocity smoother still need to accept, preserve, and execute those commands.
An AMR that never moves after a valid plan often has a downstream problem, not a planning problem.
NavigateToPose action client
│
▼
bt_navigator
│ │ │
│ │ └────────► behavior_server
│ │
│ └─────────────► controller_server
│
└──────────────────► planner_server
map_server ───────────────► global_costmap
sensor topics ────────────► local/global costmaps
amcl / ekf / odom ────────► planner + controller + costmaps via TF
lifecycle_manager manages startup/shutdown for the whole group
This is the minimum mental model you should be able to draw from memory.
bt_navigator: The Orchestrator
bt_navigator is the policy engine. It does not compute paths itself and it does not produce wheel-level control directly. It coordinates the sequence:
receive goal -> compute path -> follow path -> react to failure -> recover or abort
Its job is to answer:
In AMR terms, bt_navigator decides the navigation playbook, not the underlying geometry.
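The compute, follow, recover-or-abort sequence can be sketched as a small loop. This is a plain-Python stand-in, not how Nav2 implements it (Nav2 expresses this policy as a BehaviorTree); `navigate`, its callbacks, and `max_recoveries` are hypothetical names chosen for illustration.

```python
def navigate(compute_path, follow_path, recover, max_recoveries=2):
    """Toy version of the compute -> follow -> recover-or-abort playbook."""
    recoveries = 0
    while True:
        path = compute_path()
        if path is not None and follow_path(path):
            return "SUCCEEDED"
        if recoveries >= max_recoveries:
            return "ABORTED"
        recover()          # e.g. a spin or back-up behavior
        recoveries += 1


# Example: path following fails twice, then succeeds on the third attempt.
attempts = {"follow": 0}
recovery_log = []


def compute_path():
    return ["A", "B"]      # placeholder path


def follow_path(path):
    attempts["follow"] += 1
    return attempts["follow"] >= 3


result = navigate(compute_path, follow_path, lambda: recovery_log.append("recover"))
print(result, recovery_log)
```

The point of the sketch is the separation of roles: the loop owns the policy (how many recoveries, when to give up), while the callbacks own the geometry and control.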
planner_server: Global Route Computation
The planner computes a path from start pose to goal pose on the global costmap.
Inputs:
Outputs:
- nav_msgs/Path
The planner does not guarantee the robot can track the path cleanly. That is the controller’s job.
controller_server: Path Tracking Under Local Conditions
The controller consumes:
and produces motion commands.
This server is where many “the planner is bad” complaints actually land. Common reality:
That is not a planner bug unless the path quality itself is unusable.
behavior_server: Recovery and Utility Behaviors
behavior_server usually hosts behaviors such as:
- Spin
- BackUp
- Wait
These are invoked from the BehaviorTree when the normal compute-follow flow fails.
AMR reality: recoveries are not just technical conveniences. They encode product policy.
Examples:
The costmaps are the bridge between sensors, maps, and motion logic.
Global costmap: strategic route space across the map
Local costmap: tactical obstacle space around the robot
The planner mainly trusts the global costmap. The controller mainly trusts the local costmap.
When these diverge from reality, Nav2 behavior diverges from operator expectations.
lifecycle_manager: The Control Plane
This is the most overlooked Nav2 node during debugging.
lifecycle_manager handles ordered state transitions for managed nodes:
configure -> activate -> monitor bond -> deactivate / cleanup / shutdown
Without it, you can have a planner process running but inactive, a controller waiting on unavailable costmap data, or a navigator that accepts nothing because the dependency chain never reached ACTIVE.
If Nav2 startup is weird, lifecycle state is one of the first things to check.
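To make the "running but inactive" situation concrete, here is a simplified model of ordered bring-up: configure every node in order, then activate every node in order, stopping at the first failure. It is a sketch under stated assumptions, not the lifecycle_manager implementation; `bring_up`, the callbacks, and the node order in `NODES` are hypothetical.

```python
# Hypothetical managed-node order; a real deployment defines its own list.
NODES = ["map_server", "amcl", "planner_server", "controller_server",
         "behavior_server", "bt_navigator"]


def bring_up(node_names, configure, activate):
    """Configure all nodes, then activate all nodes; stop at the first failure.

    Returns (states, failed_node), where failed_node is None on success.
    """
    states = {name: "unconfigured" for name in node_names}
    for transition, target in ((configure, "inactive"), (activate, "active")):
        for name in node_names:
            if not transition(name):
                return states, name
            states[name] = target
    return states, None


# Example: every node configures, but controller_server fails to activate.
states, failed = bring_up(
    NODES,
    configure=lambda name: True,
    activate=lambda name: name != "controller_server",
)
print(failed)
print(states)
```

In this example the planner ends up active while the controller, behavior server, and navigator stay inactive, so every process is alive yet no goal can be executed.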
NavigateToPose Request, Step by Step
1. Mission layer sends NavigateToPose(goal)
2. bt_navigator accepts goal and loads/ticks the BehaviorTree
3. Planner action node requests a path from planner_server
4. planner_server reads global costmap + pose -> returns nav_msgs/Path
5. Controller action node sends path to controller_server
6. controller_server reads local costmap + current pose -> publishes /cmd_vel
7. Robot moves; BT keeps ticking
8. Tree may replan, continue following, recover, or abort
9. Result returns to the original action client
Every navigation incident should be mapped onto one of these nine steps.
| Interface | Type | Why it matters |
|---|---|---|
| NavigateToPose | ROS 2 action | The top-level contract most apps call |
| /plan or internal planned path | topic/action result | Tells you whether planning succeeded at all |
| /cmd_vel | topic | Confirms whether Nav2 is commanding motion |
| TF transforms | transform stream | Required by almost every server |
| global/local costmap topics | topic | Shows what Nav2 thinks the world looks like |
| lifecycle services and status | lifecycle/service | Explains startup health |
Fast triage heuristic:
- No /cmd_vel: investigate controller, activation, path validity
- /cmd_vel exists but robot does not move: investigate downstream base/safety stack
When a new goal arrives during execution, preemption does not mean “start over from scratch everywhere instantly.” It means the action and BT machinery need to:
goal
Poor preemption handling shows up as:
That is why system architecture matters: goal replacement crosses action contracts, BT policy, and server responsiveness.
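One way to picture goal replacement is a slot that holds at most one active goal and ignores late results from a goal that has already been replaced. The sketch below is a hypothetical model in plain Python: `GoalSlot` and the `"SUPERSEDED"` label are illustrative names, not ROS 2 action statuses or Nav2 APIs.

```python
class GoalSlot:
    """Holds at most one active goal; a new goal supersedes the old one."""

    def __init__(self):
        self.active = None
        self.results = {}

    def send(self, goal_id):
        # Mark the goal being replaced so its client gets a definite outcome.
        if self.active is not None:
            self.results[self.active] = "SUPERSEDED"  # illustrative label only
        self.active = goal_id

    def finish(self, goal_id, outcome):
        # Ignore late results from a goal that was already replaced.
        if goal_id != self.active:
            return False
        self.results[goal_id] = outcome
        self.active = None
        return True


slot = GoalSlot()
slot.send("g1")
slot.send("g2")                              # replaces g1 mid-execution
late = slot.finish("g1", "SUCCEEDED")        # stale result, ignored
done = slot.finish("g2", "SUCCEEDED")
print(late, done, slot.results)
```

The design choice worth noticing is that the replaced goal still receives a definite outcome, and a stale completion cannot overwrite the state of the goal that replaced it.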
In a warehouse, a robot usually obeys more than geometry.
Examples of policies outside raw Nav2 planning:
Nav2 can support these policies, but should not absorb all of them.
Bad architecture: mission logic hidden inside a custom planner plugin.
Better architecture: mission system decides which goal is legal; Nav2 executes the local navigation contract for that goal.
Many AMRs have a safety chain that can override or suppress motion:
Nav2 -> velocity smoother -> collision monitor / safety PLC -> base controller
If this chain clips commands, Nav2 may appear sluggish, indecisive, or broken while actually behaving correctly.
Typical symptoms:
The visible Nav2 failure is secondary. The primary cause is lower in the actuation path.
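A simple way to locate where a command is being reduced is to record the speed after every stage in the chain and find the first stage whose output is smaller than its input. The sketch below assumes a single linear speed instead of a full Twist, and the stage functions are made-up stand-ins (a speed cap for the smoother, a full stop for the collision monitor); `trace_chain` and `first_reduction` are hypothetical helpers.

```python
def trace_chain(cmd, stages):
    """Apply each stage in order and record the speed after each one."""
    trace = [("nav2", cmd)]
    for name, stage in stages:
        cmd = stage(cmd)
        trace.append((name, cmd))
    return trace


def first_reduction(trace, tolerance=1e-6):
    """Return the first stage whose output speed is below its input speed."""
    for (_, before), (name, after) in zip(trace, trace[1:]):
        if abs(after) < abs(before) - tolerance:
            return name
    return None


stages = [
    ("velocity_smoother", lambda v: min(v, 1.0)),  # assumed 1.0 m/s cap
    ("collision_monitor", lambda v: v * 0.0),      # assumed full stop
    ("base_controller", lambda v: v),              # passes the command through
]
trace = trace_chain(0.8, stages)
print(trace)
print(first_reduction(trace))
```

Here Nav2 commands 0.8 m/s the whole time, and the trace attributes the stop to the collision monitor stage, not to the planner or controller.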
The map says the aisle is clear, but pallets protrude into the lane in reality. Global planner keeps finding a route; local controller repeatedly aborts.
Static map reflects last month’s layout. Planner chooses a valid path on an obsolete topology.
Mission layer sends a goal at the center of a rack face instead of the legally reachable staging pose.
Operators increase controller aggressiveness to improve cycle time, causing overshoot near end-of-aisle turns.
Architecture helps because each of these belongs to a different owner.
When a robot “cannot navigate,” ask these in order:
This sequence avoids a common waste pattern: tuning planners before proving the controller or base ever had a fair chance.
| Field symptom | First layer to inspect | Why |
|---|---|---|
| Goal rejected immediately | action contract / lifecycle | server may be inactive or goal invalid |
| “No path found” | planner + global costmap + localization | planner failure is usually upstream-data sensitive |
| Path exists but robot never starts moving | controller + /cmd_vel path | proves whether local execution started |
| Robot jerks, spins, backs up, aborts | BT + controller + local costmap | recoveries are policy responses to lower-level failure |
| Robot moves incorrectly despite clean logs | TF/localization/base stack | can be a physically wrong but internally consistent system |
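The table rows can be folded into an ordered check that walks the boundaries from top to bottom. This is a minimal sketch: `first_suspect` and its three boolean inputs are hypothetical, and how you actually observe a plan, a `/cmd_vel` message, or physical motion depends on your tooling.

```python
def first_suspect(plan_seen, cmd_vel_seen, robot_moving):
    """Return the first layer to inspect, checking boundaries top-down."""
    if not plan_seen:
        return "planner + global costmap + localization"
    if not cmd_vel_seen:
        return "controller, activation, path validity"
    if not robot_moving:
        return "downstream base/safety stack"
    # Everything flows end to end, so a wrong motion points at data quality.
    return "TF/localization/base stack"


print(first_suspect(plan_seen=True, cmd_vel_seen=False, robot_moving=False))
print(first_suspect(plan_seen=True, cmd_vel_seen=True, robot_moving=False))
```

The order matters more than the exact wording of each answer: each check is only meaningful once the boundary above it has been proven healthy.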
Use this before blaming a single node:
If not, the architecture is still fuzzy in your head.
A healthy production Nav2 stack has these characteristics:
You should now be able to explain:
- how a NavigateToPose goal becomes motion
Continue to 02 — Nav2 Bringup Lifecycle Actions.
That lesson turns this architecture into operational reality: how the servers are started, transitioned to ACTIVE, and called through their action contracts.