Exercise 02 — Nav2 BT Trace and Recovery Loop
Estimated time: 80 to 95 minutes
Prerequisite lessons: 02 — Nav2 Bringup Lifecycle Actions, 03 — Nav2 BT Navigator and BT XML, 08 — Nav2 Recoveries, Progress, and Goal Checkers, 11 — Nav2 Debugging, Observability, and Bag Analysis
Mode options:
- Simulation: run a blocked-goal or obstacle scenario and watch BT status in logs or Groot-style tooling.
- Log analysis: use the supplied BT traces and reason about the policy without running a robot.
- Field RCA: substitute a real recovery-loop incident and classify each retry stage using the same structure.
Validation goal: you should be able to read a BT trace and decide whether the real problem is planning, local execution, stale world modeling, or a bad recovery policy.
Overview
Behavior tree (BT) traces are where senior Nav2 debugging starts to look different from guesswork.
Junior debugging often sounds like this:
- the robot spun, so spinning must be broken
- the planner failed, so the planner plugin must be wrong
- recovery ran too many times, so reduce retries
Senior debugging asks different questions:
- what node in the tree failed first?
- what assumption did the recovery subtree make about the failure story?
- was replanning helpful, neutral, or actively wasting aisle time?
- when should the BT have escalated instead of looping?
This lab trains that lens.
Section A — Tick-by-Tick Trace Reading
For each scenario, name the first failing contract and recommend the most defensible next investigation.
Scenario A1 — Infinite Replan, Zero Progress
[bt_navigator] Tick
[bt_navigator] ComputePathToPose: SUCCESS
[bt_navigator] FollowPath: RUNNING
[bt_navigator] Tick
[bt_navigator] FollowPath: FAILURE
[bt_navigator] GoalUpdated: FAILURE
[bt_navigator] ClearEntireCostmap: SUCCESS
[bt_navigator] Spin: SUCCESS
[bt_navigator] Tick
[bt_navigator] ComputePathToPose: SUCCESS
[bt_navigator] FollowPath: FAILURE
[controller_server] Progress checker failed: no movement in 10.0s
Questions:
- Which contract failed first: planning, control, progress checking, or recovery logic?
- Why is costmap clearing probably not the right primary fix?
- What two pieces of evidence would distinguish controller tuning from base deadband or downstream velocity clipping?
Answer guidance
The planner is succeeding, so the first failure is on the local execution side: the trace points at the controller, or the command path below it, failing to turn valid plans into motion. Costmap clearing is being used as a generic recovery, but nothing in the evidence says stale obstacles are the main story.
The decisive evidence pair is usually:
- controller output versus final base command topic
- robot pose or odometry change during the same interval
If the controller publishes meaningful commands that never become motion, the problem is likely downstream. If commands themselves are indecisive or oscillatory, controller tuning becomes more likely.
Scenario A2 — Planner Fails, Recovery Works Once, Then Repeats
[bt_navigator] ComputePathToPose: FAILURE
[planner_server] No valid path found
[bt_navigator] ClearEntireCostmap: SUCCESS
[bt_navigator] ComputePathToPose: SUCCESS
[bt_navigator] FollowPath: RUNNING
[bt_navigator] FollowPath: FAILURE
[costmap_2d] Observation buffer dropped 18 messages due to stale transforms
[bt_navigator] ComputePathToPose: FAILURE
Questions:
- Why is “planner issue” an incomplete diagnosis here?
- What changed between the first and second planning attempts?
- What upstream contract looks weak enough to corrupt both planning and execution over time?
Answer guidance
This is not just a planner story: the planner briefly succeeds after a recovery, and execution later collapses again. That pattern suggests an unstable world model rather than a planner that is intrinsically incapable.
The important clue is stale transform handling inside costmap observation processing. TF timing or localization freshness can poison obstacle marking, which then causes path validity to oscillate over time.
Section B — Recovery Policy Critique
Read each policy proposal and decide whether it matches the failure story.
Proposal B1 — Blocked Aisle With Human Traffic
<ReactiveFallback name="RecoveryFallback">
  <GoalUpdated/>
  <SequenceWithMemory name="RecoveryActions">
    <Spin spin_dist="3.14"/>
    <BackUp backup_dist="0.50" backup_speed="0.10"/>
    <Wait wait_duration="1.0"/>
  </SequenceWithMemory>
</ReactiveFallback>
Questions:
- What assumption is this policy making about the cause of failure?
- Why might it be poor for a narrow warehouse aisle with pedestrians or forklifts?
- Reorder the recoveries into a safer first-pass policy and justify the first two steps.
Answer guidance
This tree assumes the robot is locally trapped and should physically maneuver first. In a human-heavy aisle that can be the wrong story. Spinning enlarges the risk envelope, backing up may worsen traffic conflicts, and a 1-second wait is usually too short to discriminate temporary blockage from a real deadlock.
A stronger first-pass policy, sketched in BT XML after this list, is often:
1. short wait
2. replan or clear only the relevant costmap if stale perception is plausible
3. controlled backup only if geometry suggests the robot is nose-trapped
4. spin later, not first
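A minimal sketch of that ordering, keeping the ReactiveFallback and GoalUpdated framing from the proposal and assuming stock Nav2 recovery nodes; the durations, distances, and the ClearCostmapAroundRobot service name are illustrative and should be adjusted to your bringup:

<ReactiveFallback name="RecoveryFallback">
  <GoalUpdated/>
  <SequenceWithMemory name="RecoveryActions">
    <!-- Give transient traffic time to clear before moving the base at all -->
    <Wait wait_duration="5.0"/>
    <!-- Refresh only the local costmap in case stale obstacles are the story;
         replanning then happens in the main navigation branch -->
    <ClearCostmapAroundRobot service_name="local_costmap/clear_around_local_costmap" reset_distance="2.0"/>
    <!-- Physically maneuver only after passive options fail, smallest risk envelope first -->
    <BackUp backup_dist="0.30" backup_speed="0.05"/>
    <Spin spin_dist="1.57"/>
  </SequenceWithMemory>
</ReactiveFallback>

SequenceWithMemory is doing real work here: on the next failure it resumes at the next child rather than repeating the wait, so the policy escalates through progressively more invasive actions instead of looping on one.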
Proposal B2 — Phantom Obstacles Near a Dock
<SequenceWithMemory name="RecoveryActions">
  <Wait wait_duration="10.0"/>
  <Wait wait_duration="10.0"/>
  <Wait wait_duration="10.0"/>
</SequenceWithMemory>
Questions:
- Why is waiting alone a weak response to suspected stale obstacle data?
- What recovery action would better test the hypothesis that the world model is polluted?
- What evidence would tell you the real problem is localization rather than costmap dirtiness?
Answer guidance
If the world model is wrong, waiting does not actively refresh it. A costmap clear or targeted observation-debug step is more aligned with the failure story. If the robot or dock pose is globally wrong, however, repeated clears will not help; you would instead see consistent mismatch between RViz or log-reported pose and the physical staging geometry.
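A sketch of a policy that actually tests the stale-world-model hypothesis, with the same caveat that the node parameters and clearing service name are illustrative assumptions:

<SequenceWithMemory name="RecoveryActions">
  <!-- One bounded wait covers the "real but temporary obstacle" case -->
  <Wait wait_duration="10.0"/>
  <!-- Then actively refresh the suspect costmap instead of waiting again;
       if the phantom reappears immediately after the clear, suspect the live
       sensor pipeline or localization rather than residual costmap state -->
  <ClearCostmapAroundRobot service_name="local_costmap/clear_around_local_costmap" reset_distance="3.0"/>
</SequenceWithMemory>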
Section C — Build an Incident Timeline
Use the evidence below to write a five-step incident timeline.
12:00:00.010 Goal accepted by NavigateToPose
12:00:00.030 ComputePathToPose SUCCESS, path length 14.2m
12:00:02.400 FollowPath RUNNING
12:00:13.100 FollowPath FAILURE
12:00:13.120 progress_checker: robot pose changed only 0.04m in 10.0s
12:00:13.140 Recovery: ClearLocalCostmap SUCCESS
12:00:13.500 ComputePathToPose SUCCESS, path length 14.0m
12:00:23.600 FollowPath FAILURE
12:00:23.620 progress_checker: robot pose changed only 0.03m in 10.0s
12:00:23.700 Recovery: BackUp SUCCESS
12:00:24.900 ComputePathToPose SUCCESS
12:00:35.100 FollowPath FAILURE
12:00:35.200 Recovery budget exhausted, aborting
Tasks:
- Write the earliest plausible root-cause statement.
- List two alternate hypotheses that still fit the evidence.
- State which one observation would most efficiently discriminate between them.
Answer guidance
The earliest plausible root cause is failure to make local progress despite valid global paths. Alternate hypotheses include:
- controller tuning or overly strict progress checker
- downstream `cmd_vel` clipping or base deadband
- local costmap showing a hidden or stale obstacle that makes the controller refuse motion
The best discriminating observation is often a synchronized view of controller command output, final base command, and odometry delta over the same window.
Section D — Design a Better Escalation Rule
You are tuning Nav2 for an AMR that shares aisles with people and pallet traffic. A blocked aisle can be normal for 15 to 30 seconds, but spinning repeatedly in place is operationally unacceptable.
Questions:
- After how many failed local progress cycles should the BT escalate to mission or fleet logic rather than keep retrying internally?
- What operator-visible event should be emitted when that escalation happens?
- What information should be attached to that event so another system can make a better decision?
Answer guidance
There is no single magic number, but a strong answer ties the retry budget to aisle blocking norms, safety, and throughput. A good escalation event should differentiate between "temporarily blocked", "stuck due to local execution", and "global planning unavailable". The payload should include current pose, goal pose, last successful path age, recovery history, and a concise failure classification.
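In BT XML terms, the retry budget usually lives in a RecoveryNode, which retries its first child (via its second child) at most number_of_retries times and then returns FAILURE, aborting the NavigateToPose action so mission or fleet logic sees the result. A trimmed sketch in the shape of the default Nav2 tree; the budget of 2 is an assumption you should tie to your own aisle norms:

<RecoveryNode number_of_retries="2" name="NavigateRecovery">
  <PipelineSequence name="NavigateWithReplanning">
    <RateController hz="1.0">
      <ComputePathToPose goal="{goal}" path="{path}" planner_id="GridBased"/>
    </RateController>
    <FollowPath path="{path}" controller_id="FollowPath"/>
  </PipelineSequence>
  <!-- After 2 failed recovery cycles, FAILURE propagates up and the action aborts -->
  <SequenceWithMemory name="RecoveryActions">
    <Wait wait_duration="10.0"/>
    <BackUp backup_dist="0.30" backup_speed="0.05"/>
  </SequenceWithMemory>
</RecoveryNode>

The escalation event itself does not belong in the tree: emit it from whatever layer observes the aborted action result, attaching the payload fields listed above.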
Deliverable Template
Scenario type:
Simulation / logs / production incident
First failing contract:
Evidence for that claim:
- trace lines:
- supporting logs:
- missing evidence:
Best next action:
Recovery or escalation recommendation:
Success Criteria
You have completed this lab well if you can:
- read a BT trace without confusing symptoms and causes
- explain when replanning is helpful and when it is just churn around a deeper local problem
- critique recovery ordering using real AMR operating constraints rather than abstract elegance
- define when the BT should stop retrying and hand control back to a broader mission system