
01 — Nav2 System Architecture

The runtime boundaries, contracts, and data flow behind a production AMR navigation stack

Prerequisites: 03 — Nav2 Architecture, 01 — Nodes Topics Actions, 02 — Tf2 Time Qos

Unlocks: confident Nav2 debugging, cleaner system decomposition, faster incident triage, and better launch and ownership decisions for AMR software teams


Why Should I Care? (Context)

Most Nav2 failures are not caused by a bug inside Nav2 itself. They come from a bad boundary between Nav2 and the systems around it:

  1. The mission layer sends goals Nav2 cannot safely execute
  2. Localization gives Nav2 a pose that is smooth but wrong
  3. Perception updates the costmap too slowly or with stale data
  4. Platform code clips or rewrites /cmd_vel
  5. Operators blame the planner when the real problem is lifecycle startup, TF, or map semantics

If you do not know where Nav2 starts and stops, every navigation incident turns into random log surfing. If you do know the boundaries, you can ask the right question first:

  • Did the goal contract fail?
  • Did the planner fail?
  • Did the controller refuse the path?
  • Did the robot never receive a usable velocity command?
  • Did another subsystem make Nav2 look guilty?

That is the point of this lesson: treat Nav2 as a distributed subsystem with hard contracts, not as a black box called “navigation”.


PART 1 — WHAT NAV2 OWNS VS WHAT IT DOES NOT


1.1 The Short Version

Nav2 owns the logic that converts a navigation goal into planned motion commands.

It does not own the full autonomy stack.

Mission / Fleet / UI layer
    │ creates goals and policies
    ▼
Nav2
    │ computes path, follows path, runs recoveries
    ▼
Base controller / safety layer / motor drivers
    │ turn Twist into wheel motion
    ▼
Robot hardware

If you say “Nav2 is responsible for moving the robot,” that is directionally true but operationally incomplete. In practice, Nav2 sits between systems that define intent above it and systems that enforce physical safety below it.


1.2 Ownership Table

  • Goal execution: owned by Nav2. The mission layer decides which goal to send; Nav2 executes it and should not invent business intent.
  • Global path generation: owned by Nav2. Map production and semantic zoning come from elsewhere, so planner quality depends on upstream map quality.
  • Local obstacle avoidance: partly Nav2. Sensor drivers and perception quality are upstream; Nav2 can only react to what enters the costmap.
  • Robot localization consumption: owned by Nav2, but localization generation is external. Nav2 trusts TF and pose sources.
  • Recovery policy: owned by Nav2, though a fleet or task layer may decide escalation after repeated failure. Recovery logic belongs near navigation policy.
  • Traffic coordination between robots: not Nav2. This belongs to the fleet or traffic manager; Nav2 is usually single-robot local intelligence.
  • Hard safety stop: not Nav2. This belongs to a safety PLC, collision monitor, or certified safety chain; safety must override Nav2 when required.
  • Docking workflow orchestration: partly Nav2. Product-specific mission logic often wraps Nav2, because docking needs navigation plus product rules.

Production rule: if a failure spans multiple boxes, the incident owner still needs to separate primary cause from visible symptom. Nav2 often shows the symptom first.


1.3 The Four Upstream Contracts Nav2 Depends On

Nav2 assumes four things are already true:

A. The transform tree is coherent

At minimum, Nav2 expects a reliable chain like:

map -> odom -> base_link -> sensor frames

If this chain is missing, stale, or inconsistent, every server downstream becomes noisy:

  • planner cannot locate start pose correctly
  • controller tracks the wrong local pose
  • costmaps insert sensor data in the wrong place
  • recoveries trigger for the wrong reason
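The coherence check behind contract A can be modeled in a few lines. This is an illustrative sketch only: a real stack queries tf2_ros, while here the transform buffer is a plain dict so the freshness logic is visible in isolation. The names (REQUIRED_CHAIN, max_age_s) are hypothetical, not Nav2 or tf2 API.

```python
# Illustrative model only: real systems query tf2_ros; here the transform
# buffer is a dict of (parent, child) -> last publish time in seconds.

REQUIRED_CHAIN = [("map", "odom"), ("odom", "base_link"), ("base_link", "laser")]

def chain_is_coherent(transforms: dict, now_s: float, max_age_s: float = 0.5) -> bool:
    """Return True if every required (parent, child) transform exists and is fresh."""
    for pair in REQUIRED_CHAIN:
        stamp = transforms.get(pair)
        if stamp is None:              # transform never published
            return False
        if now_s - stamp > max_age_s:  # transform exists but is stale
            return False
    return True

# A stale map->odom (e.g. AMCL stopped publishing) breaks the whole chain:
tfs = {("map", "odom"): 10.0, ("odom", "base_link"): 11.9, ("base_link", "laser"): 11.9}
print(chain_is_coherent(tfs, now_s=12.0))  # map->odom is 2 s old -> False
```

The point of the sketch: "missing" and "stale" are distinct failures, and either one poisons every server downstream of TF.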

B. The robot pose is physically believable

Nav2 does not prove localization is correct. It uses the pose it receives.

That means a robot can:

  • plan through a rack because AMCL drifted laterally
  • refuse a goal because it thinks it starts inside an obstacle
  • oscillate near a goal because heading estimate is noisy

C. The map and sensor observations match reality closely enough

The global planner and local controller both read a filtered version of the world. If that world model is wrong, the path logic is still internally correct but operationally useless.

D. Velocity commands can actually reach the base

Nav2 publishes motion intent, usually on /cmd_vel or a shaped derivative. The base stack, safety layer, and sometimes a velocity smoother still need to accept, preserve, and execute those commands.

An AMR that never moves after a valid plan often has a downstream problem, not a planning problem.
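Contract D can be checked directly: compare what Nav2 commanded with what the base reported executing. The sketch below is a hypothetical diagnostic, not a Nav2 or ROS API; the function name and threshold are illustrative.

```python
# Hypothetical diagnostic: compare commanded speeds (sampled from /cmd_vel)
# with executed speeds (sampled from odometry) over the same window.
# A low executed/commanded ratio points at downstream clipping or suppression.

def commands_reached_base(cmd_speeds, odom_speeds, min_ratio=0.5):
    """Return False when the base executed far less than Nav2 commanded."""
    commanded = sum(abs(v) for v in cmd_speeds)
    executed = sum(abs(v) for v in odom_speeds)
    if commanded == 0.0:
        return True  # Nav2 asked for nothing; the base is not at fault
    return executed / commanded >= min_ratio

# Controller commanded ~0.4 m/s but the wheels barely moved:
print(commands_reached_base([0.4, 0.4, 0.4], [0.02, 0.0, 0.01]))  # False
```

When this check fails after a valid plan, the incident belongs to the safety chain or base stack, not to the planner.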


PART 2 — THE CORE NAV2 SERVERS


2.1 The Canonical Runtime Topology

NavigateToPose action client
            │
            ▼
      bt_navigator
       │    │    │
       │    │    └────────► behavior_server
       │    │
       │    └─────────────► controller_server
       │
       └──────────────────► planner_server

map_server ───────────────► global_costmap
sensor topics ────────────► local/global costmaps
amcl / ekf / odom ────────► planner + controller + costmaps via TF

lifecycle_manager manages startup/shutdown for the whole group

This is the minimum mental model you should be able to draw from memory.


2.2 bt_navigator: The Orchestrator

bt_navigator is the policy engine. It does not compute paths itself and it does not produce wheel-level control directly. It coordinates the sequence:

receive goal -> compute path -> follow path -> react to failure -> recover or abort

Its job is to answer:

  • Which action should run next?
  • When should replanning happen?
  • When should recovery start?
  • When should a new goal preempt the current flow?

In AMR terms, bt_navigator decides the navigation playbook, not the underlying geometry.
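A concrete way to see "playbook, not geometry" is the tree itself. The fragment below is a simplified navigate-to-pose tree: the node names follow Nav2's stock BT plugins, but the exact default tree shipped with your release may differ, so treat this as a sketch rather than the canonical file.

```xml
<!-- Simplified sketch of a navigate-to-pose tree; node names follow Nav2's
     stock BT plugins, but the real default tree varies by release. -->
<root main_tree_to_execute="MainTree">
  <BehaviorTree ID="MainTree">
    <RecoveryNode number_of_retries="6" name="NavigateRecovery">
      <PipelineSequence name="NavigateWithReplanning">
        <RateController hz="1.0">
          <ComputePathToPose goal="{goal}" path="{path}" planner_id="GridBased"/>
        </RateController>
        <FollowPath path="{path}" controller_id="FollowPath"/>
      </PipelineSequence>
      <SequenceStar name="RecoveryActions">
        <Spin spin_dist="1.57"/>
        <Wait wait_duration="5"/>
        <BackUp backup_dist="0.3" backup_speed="0.05"/>
      </SequenceStar>
    </RecoveryNode>
  </BehaviorTree>
</root>
```

Note what the tree encodes: replan rate, retry count, and recovery order are all navigation policy, and none of them is geometry.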


2.3 planner_server: Global Route Computation

The planner computes a path from start pose to goal pose on the global costmap.

Inputs:

  • current pose from TF/localization
  • goal pose from the action request
  • global costmap data
  • planner plugin parameters

Outputs:

  • a nav_msgs/Path
  • failure if no valid path exists under the current map assumptions

The planner does not guarantee the robot can track the path cleanly. That is the controller’s job.


2.4 controller_server: Path Tracking Under Local Conditions

The controller consumes:

  • the current path
  • current robot pose and velocity
  • local costmap state

and produces motion commands.

This server is where many “the planner is bad” complaints actually land. Common reality:

  • the planner produced a valid path
  • the controller finds it dynamically infeasible or unsafe at runtime
  • the robot slows, oscillates, or triggers recovery

That is not a planner bug unless the path quality itself is unusable.
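The planner/controller split is also visible in configuration, because both servers are plugin hosts. The fragment below is a hedged sketch of that plugin wiring: the plugin class names match common Nav2 defaults, but the parameter values are illustrative placeholders, not tuned settings.

```yaml
# Illustrative plugin wiring; values are placeholders, not tuned defaults.
planner_server:
  ros__parameters:
    planner_plugins: ["GridBased"]
    GridBased:
      plugin: "nav2_navfn_planner/NavfnPlanner"   # global route computation
      tolerance: 0.5

controller_server:
  ros__parameters:
    controller_plugins: ["FollowPath"]
    FollowPath:
      plugin: "dwb_core::DWBLocalPlanner"         # local path tracking
      max_vel_x: 0.5
```

This is why "the planner is bad" complaints need a layer check first: the two behaviors are produced by different plugins with different parameters and different world models.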


2.5 behavior_server: Recovery and Utility Behaviors

behavior_server usually hosts behaviors such as:

  • Spin
  • BackUp
  • Wait
  • AssistedTeleop or custom recoveries, depending on the plugin set

These are invoked from the behavior tree when the normal compute-then-follow flow fails.

AMR reality: recoveries are not just technical conveniences. They encode product policy.

Examples:

  • In a human-heavy aisle, waiting may be better than backing up.
  • Near a loading station, backing blindly may be unacceptable.
  • In a one-way lane, repeated spinning may waste throughput and block traffic.

2.6 Costmap Servers: The Shared World Model

The costmaps are the bridge between sensors, maps, and motion logic.

Global costmap: strategic route space across the map
Local costmap: tactical obstacle space around the robot

The planner mainly trusts the global costmap. The controller mainly trusts the local costmap.

When these diverge from reality, Nav2 behavior diverges from operator expectations.


2.7 lifecycle_manager: The Control Plane

This is the most overlooked Nav2 node during debugging.

lifecycle_manager handles ordered state transitions for managed nodes:

configure -> activate -> monitor bond -> deactivate / cleanup / shutdown

Without it, you can have a planner process running but inactive, a controller waiting on unavailable costmap data, or a navigator that accepts nothing because the dependency chain never reached ACTIVE.

If Nav2 startup is weird, lifecycle state is one of the first things to check.
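The ordered-transition idea can be sketched as a toy model. The node names mirror Nav2, but the state machine below is deliberately minimal and hypothetical; the real lifecycle_manager works through ROS 2 lifecycle services and bond monitoring.

```python
# Toy model of lifecycle_manager's job: bring nodes up in dependency order
# and stop the chain at the first node that fails a transition.

BRINGUP_ORDER = ["map_server", "amcl", "planner_server",
                 "controller_server", "behavior_server", "bt_navigator"]

def bringup(transition_ok):
    """Configure then activate each node in order; stop at the first failure.

    transition_ok(node, transition) -> bool simulates a node accepting or
    rejecting a lifecycle transition. Returns the list of ACTIVE nodes.
    """
    active = []
    for node in BRINGUP_ORDER:
        if not transition_ok(node, "configure"):
            break          # later nodes stay unconfigured
        if not transition_ok(node, "activate"):
            break          # the classic "process running but inactive" symptom
        active.append(node)
    return active

# If amcl never activates, everything after it stays down:
print(bringup(lambda n, t: not (n == "amcl" and t == "activate")))
# -> ['map_server']
```

The model explains the symptom from the text: a planner process can exist while the navigator accepts nothing, because the dependency chain never reached ACTIVE.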


PART 3 — THE END-TO-END GOAL FLOW


3.1 One NavigateToPose Request, Step by Step

1. Mission layer sends NavigateToPose(goal)
2. bt_navigator accepts the goal and loads and ticks the behavior tree
3. Planner action node requests a path from planner_server
4. planner_server reads global costmap + pose -> returns nav_msgs/Path
5. Controller action node sends path to controller_server
6. controller_server reads local costmap + current pose -> publishes /cmd_vel
7. Robot moves; BT keeps ticking
8. Tree may replan, continue following, recover, or abort
9. Result returns to the original action client

Every navigation incident should be mapped onto one of these nine steps.


3.2 Data Interfaces That Matter Most

  • NavigateToPose (ROS 2 action): the top-level contract most applications call.
  • /plan or the internal planned path (topic / action result): tells you whether planning succeeded at all.
  • /cmd_vel (topic): confirms whether Nav2 is commanding motion.
  • TF transforms (transform stream): required by almost every server.
  • Global/local costmap topics (topic): show what Nav2 thinks the world looks like.
  • Lifecycle services and status (lifecycle / service): explain startup health.

Fast triage heuristic:

  • no plan: investigate planner, costmap, goal, localization
  • plan exists but no /cmd_vel: investigate controller, activation, path validity
  • /cmd_vel exists but robot does not move: investigate downstream base/safety stack
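That heuristic is mechanical enough to write down. The sketch below encodes it as a function; the boolean inputs are hypothetical signals an operator would gather from topic echoes and logs, not a Nav2 API.

```python
# The triage heuristic above as code. Inputs are observations, not API calls:
#   plan_exists  - a path was produced (e.g. seen on /plan or in the action result)
#   cmd_vel_seen - the controller published velocity commands
#   robot_moved  - odometry shows the base actually executed them

def first_suspect(plan_exists: bool, cmd_vel_seen: bool, robot_moved: bool) -> str:
    if not plan_exists:
        return "planner / costmap / goal / localization"
    if not cmd_vel_seen:
        return "controller / activation / path validity"
    if not robot_moved:
        return "downstream base or safety stack"
    return "look further: recovery policy or world model"

print(first_suspect(plan_exists=True, cmd_vel_seen=True, robot_moved=False))
# -> downstream base or safety stack
```

The value of encoding the order is that it prevents skipping a stage: each question is only asked once the previous stage is proven healthy.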

3.3 Where Preemption Actually Lands

When a new goal arrives during execution, preemption does not mean “start over from scratch everywhere instantly.” It means the action and BT machinery need to:

  1. accept or reject the new goal
  2. cancel or replace in-flight actions cleanly
  3. refresh blackboard values such as goal
  4. trigger replanning and controller update with the new target

Poor preemption handling shows up as:

  • robot finishing the old path after UI changed the mission
  • delayed reaction to urgent reroute requests
  • stuck action handles with confusing result semantics

That is why system architecture matters: goal replacement crosses action contracts, BT policy, and server responsiveness.
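The required semantics can be sketched as a minimal model. This is illustrative only: real Nav2 implements preemption through the action server and the BT blackboard, while the class below just captures the invariants (old goal cancelled with a result, new goal installed, replan forced).

```python
# Minimal, hypothetical model of goal preemption semantics.

class GoalSlot:
    def __init__(self):
        self.current = None
        self.needs_replan = False
        self.cancelled = []   # preempted goals, each paired with a result

    def submit(self, goal):
        if self.current is not None:
            # The old goal is cancelled with an explicit result,
            # never silently abandoned with a dangling action handle.
            self.cancelled.append((self.current, "PREEMPTED"))
        self.current = goal
        self.needs_replan = True  # blackboard refresh forces the planner to re-run

slot = GoalSlot()
slot.submit("dock_A")
slot.submit("charger_3")             # urgent reroute preempts dock_A
print(slot.current, slot.cancelled)  # charger_3 [('dock_A', 'PREEMPTED')]
```

Each failure mode listed above maps to breaking one invariant: finishing the old path means `current` was never replaced; stuck handles mean the cancelled goal never got a result; slow reroutes mean `needs_replan` was not acted on promptly.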


PART 4 — PRODUCTION BOUNDARIES FOR AMRS


4.1 The Navigation Stack Is Not the Fleet Stack

In a warehouse, a robot usually obeys more than geometry.

Examples of policies outside raw Nav2 planning:

  • lane reservation
  • traffic priority at intersections
  • pick/drop task sequencing
  • battery-aware mission planning
  • docking queue management
  • operator intervention workflows

Nav2 can support these policies, but should not absorb all of them.

Bad architecture: mission logic hidden inside a custom planner plugin.

Better architecture: mission system decides which goal is legal; Nav2 executes the local navigation contract for that goal.


4.2 The Safety Layer Is a Separate Authority

Many AMRs have a safety chain that can override or suppress motion:

Nav2 -> velocity smoother -> collision monitor / safety PLC -> base controller

If this chain clips commands, Nav2 may appear sluggish, indecisive, or broken while actually behaving correctly.

Typical symptoms:

  • planner succeeds, controller publishes, robot barely moves
  • robot halts near pallet forks because safety field trips repeatedly
  • logs show progress checker failure even though controller keeps trying

The visible Nav2 failure is secondary. The primary cause is lower in the actuation path.


4.3 Warehouse-Specific Failure Modes

Narrow aisle optimism

The map says the aisle is clear, but pallets protrude into the lane in reality. Global planner keeps finding a route; local controller repeatedly aborts.

Map freshness mismatch

Static map reflects last month’s layout. Planner chooses a valid path on an obsolete topology.

Goal semantics mismatch

Mission layer sends a goal at the center of a rack face instead of the legally reachable staging pose.

Throughput-driven parameter drift

Operators increase controller aggressiveness to improve cycle time, causing overshoot near end-of-aisle turns.

Architecture helps because each of these belongs to a different owner.


PART 5 — HOW TO THINK DURING INCIDENT RESPONSE


5.1 The First Five Questions

When a robot “cannot navigate,” ask these in order:

  1. Did Nav2 accept the goal?
  2. Did a valid global path get produced?
  3. Did the controller publish usable velocity commands?
  4. Did the robot physically execute those commands?
  5. Did recovery fail because of policy, world model, or downstream suppression?

This sequence avoids a common waste pattern: tuning planners before proving the controller or base ever had a fair chance.


5.2 Symptom-to-Layer Mapping

  • Goal rejected immediately: inspect the action contract and lifecycle first; the server may be inactive or the goal invalid.
  • "No path found": inspect the planner, global costmap, and localization; planner failure is usually sensitive to upstream data.
  • Path exists but robot never starts moving: inspect the controller and /cmd_vel; the path proves whether local execution ever started.
  • Robot jerks, spins, backs up, aborts: inspect the BT, controller, and local costmap; recoveries are policy responses to lower-level failure.
  • Robot moves incorrectly despite clean logs: inspect TF, localization, and the base stack; the system can be physically wrong but internally consistent.

5.3 Architecture Review Checklist

Use this before blaming a single node:

  • Can I name the owner of each interface from mission to wheel command?
  • Do I know which topics prove each stage is alive?
  • Do I know which failures are upstream of Nav2 versus inside Nav2?
  • Can I explain why the robot is allowed to attempt this goal at all?
  • Can I distinguish navigation policy from safety policy?

If not, the architecture is still fuzzy in your head.


PART 6 — WHAT GOOD LOOKS LIKE


6.1 A Healthy Nav2 System

A healthy production Nav2 stack has these characteristics:

  • clear ownership between mission, navigation, safety, and base control
  • lifecycle-driven startup with deterministic activation order
  • stable TF and believable localization
  • costmaps that reflect the operating environment closely enough to be useful
  • recoveries designed for the actual AMR environment, not demo defaults
  • observability on actions, paths, costmaps, and velocity commands

6.2 What You Should Be Able to Explain After This Lesson

You should now be able to explain:

  1. where Nav2 sits in the AMR stack
  2. what each main Nav2 server is responsible for
  3. how a NavigateToPose goal becomes motion
  4. which failures belong to Nav2 and which only appear there first
  5. why system boundaries matter more than memorizing node names

6.3 Next Step

Continue to 02 — Nav2 Bringup Lifecycle Actions.

That lesson turns this architecture into operational reality: how the servers are started, transitioned to ACTIVE, and called through their action contracts.