Exercise 02 — Nav2 BT Trace and Recovery Loop
Estimated time: 80 to 95 minutes
Prerequisite lessons: 02 — Nav2 Bringup Lifecycle Actions, 03 — Nav2 BT Navigator and BT XML, 08 — Nav2 Recoveries, Progress, and Goal Checkers, 11 — Nav2 Debugging, Observability, and Bag Analysis
Mode options:
- Simulation: run a blocked-goal or obstacle scenario and watch BT status in logs or Groot-style tooling.
- Log analysis: use the supplied BT traces and reason about the policy without running a robot.
- Field RCA: substitute a real recovery-loop incident and classify each retry stage using the same structure.
Validation goal: you should be able to read a BT trace and decide whether the real problem is planning, local execution, stale world modeling, or a bad recovery policy.
Overview
Behavior tree (BT) traces are where senior Nav2 debugging starts to look different from guesswork.
Junior debugging often sounds like this:
- the robot spun, so spinning must be broken
- the planner failed, so the planner plugin must be wrong
- recovery ran too many times, so reduce retries
Senior debugging asks different questions:
- what node in the tree failed first?
- what assumption did the recovery subtree make about the failure story?
- was replanning helpful, neutral, or actively wasting aisle time?
- when should the BT have escalated instead of looping?
This lab trains that lens.
Section A — Tick-by-Tick Trace Reading
For each scenario, name the first failing contract and recommend the most defensible next investigation.
Scenario A1 — Infinite Replan, Zero Progress
[bt_navigator] Tick
[bt_navigator] ComputePathToPose: SUCCESS
[bt_navigator] FollowPath: RUNNING
[bt_navigator] Tick
[bt_navigator] FollowPath: FAILURE
[bt_navigator] GoalUpdated: FAILURE
[bt_navigator] ClearEntireCostmap: SUCCESS
[bt_navigator] Spin: SUCCESS
[bt_navigator] Tick
[bt_navigator] ComputePathToPose: SUCCESS
[bt_navigator] FollowPath: FAILURE
[controller_server] Progress checker failed: no movement in 10.0s
Questions:
- Which contract failed first: planning, control, progress checking, or recovery logic?
- Why is costmap clearing probably not the right primary fix?
- What two pieces of evidence would distinguish controller tuning from base deadband or downstream velocity clipping?
Answer guidance
The planner is succeeding, so the first failure is on the local execution side: the trace points at the controller, or the command path below it, failing to turn valid plans into motion. Costmap clearing is being used as a generic recovery, but nothing in the evidence says stale obstacles are the main story.
The decisive evidence pair is usually:
- controller output versus final base command topic
- robot pose or odometry change during the same interval
If the controller publishes meaningful commands that never become motion, the problem is likely downstream. If commands themselves are indecisive or oscillatory, controller tuning becomes more likely.
Scenario A2 — Planner Fails, Recovery Works Once, Then Repeats
[bt_navigator] ComputePathToPose: FAILURE
[planner_server] No valid path found
[bt_navigator] ClearEntireCostmap: SUCCESS
[bt_navigator] ComputePathToPose: SUCCESS
[bt_navigator] FollowPath: RUNNING
[bt_navigator] FollowPath: FAILURE
[costmap_2d] Observation buffer dropped 18 messages due to stale transforms
[bt_navigator] ComputePathToPose: FAILURE
Questions:
- Why is “planner issue” an incomplete diagnosis here?
- What changed between the first and second planning attempts?
- What upstream contract looks weak enough to corrupt both planning and execution over time?
Answer guidance
This is not just a planner story: the planner briefly succeeds after a recovery, and execution later collapses again. That pattern suggests an unstable world model rather than a planner that is intrinsically incapable.
The important clue is stale transform handling inside costmap observation processing. TF timing or localization freshness can poison obstacle marking, which then causes path validity to oscillate over time.
Section B — Recovery Policy Critique
Read each policy proposal and decide whether it matches the failure story.
Proposal B1 — Blocked Aisle With Human Traffic
<ReactiveFallback name="RecoveryFallback">
  <GoalUpdated/>
  <SequenceWithMemory name="RecoveryActions">
    <Spin spin_dist="3.14"/>
    <BackUp backup_dist="0.50" backup_speed="0.10"/>
    <Wait wait_duration="1.0"/>
  </SequenceWithMemory>
</ReactiveFallback>
Questions:
- What assumption is this policy making about the cause of failure?
- Why might it be poor for a narrow warehouse aisle with pedestrians or forklifts?
- Reorder the recoveries into a safer first-pass policy and justify the first two steps.
Answer guidance
This tree assumes the robot is locally trapped and should physically maneuver first. In a human-heavy aisle that can be the wrong story. Spinning enlarges the risk envelope, backing up may worsen traffic conflicts, and a 1-second wait is usually too short to discriminate temporary blockage from a real deadlock.
A stronger first-pass policy, sketched in BT XML after this list, is often:
1. short wait
2. replan or clear only the relevant costmap if stale perception is plausible
3. controlled backup only if geometry suggests the robot is nose-trapped
4. spin later, not first
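A minimal sketch of that ordering, keeping the ReactiveFallback and GoalUpdated framing from the proposal and assuming stock Nav2 recovery nodes; the durations, distances, and the ClearCostmapAroundRobot service name are illustrative and should be adjusted to your bringup:

<ReactiveFallback name="RecoveryFallback">
  <GoalUpdated/>
  <SequenceWithMemory name="RecoveryActions">
    <!-- Give transient traffic time to clear before moving the base at all -->
    <Wait wait_duration="5.0"/>
    <!-- Refresh only the local costmap in case stale obstacles are the story;
         replanning then happens in the main navigation branch -->
    <ClearCostmapAroundRobot service_name="local_costmap/clear_around_local_costmap" reset_distance="2.0"/>
    <!-- Physically maneuver only after passive options fail, smallest risk envelope first -->
    <BackUp backup_dist="0.30" backup_speed="0.05"/>
    <Spin spin_dist="1.57"/>
  </SequenceWithMemory>
</ReactiveFallback>

SequenceWithMemory is doing real work here: on the next failure it resumes at the next child rather than repeating the wait, so the policy escalates through progressively more invasive actions instead of looping on one.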
Proposal B2 — Phantom Obstacles Near a Dock
<SequenceWithMemory name="RecoveryActions">
  <Wait wait_duration="10.0"/>
  <Wait wait_duration="10.0"/>
  <Wait wait_duration="10.0"/>
</SequenceWithMemory>
Questions:
- Why is waiting alone a weak response to suspected stale obstacle data?
- What recovery action would better test the hypothesis that the world model is polluted?
- What evidence would tell you the real problem is localization rather than costmap dirtiness?
Answer guidance
If the world model is wrong, waiting does not actively refresh it. A costmap clear or targeted observation-debug step is more aligned with the failure story. If the robot or dock pose is globally wrong, however, repeated clears will not help; you would instead see consistent mismatch between RViz or log-reported pose and the physical staging geometry.
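A sketch of a policy that actually tests the stale-world-model hypothesis, with the same caveat that the node parameters and clearing service name are illustrative assumptions:

<SequenceWithMemory name="RecoveryActions">
  <!-- One bounded wait covers the "real but temporary obstacle" case -->
  <Wait wait_duration="10.0"/>
  <!-- Then actively refresh the suspect costmap instead of waiting again;
       if the phantom reappears immediately after the clear, suspect the live
       sensor pipeline or localization rather than residual costmap state -->
  <ClearCostmapAroundRobot service_name="local_costmap/clear_around_local_costmap" reset_distance="3.0"/>
</SequenceWithMemory>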
Section C — Build an Incident Timeline
Use the evidence below to write a five-step incident timeline.
12:00:00.010 Goal accepted by NavigateToPose
12:00:00.030 ComputePathToPose SUCCESS, path length 14.2m
12:00:02.400 FollowPath RUNNING
12:00:13.100 FollowPath FAILURE
12:00:13.120 progress_checker: robot pose changed only 0.04m in 10.0s
12:00:13.140 Recovery: ClearLocalCostmap SUCCESS
12:00:13.500 ComputePathToPose SUCCESS, path length 14.0m
12:00:23.600 FollowPath FAILURE
12:00:23.620 progress_checker: robot pose changed only 0.03m in 10.0s
12:00:23.700 Recovery: BackUp SUCCESS
12:00:24.900 ComputePathToPose SUCCESS
12:00:35.100 FollowPath FAILURE
12:00:35.200 Recovery budget exhausted, aborting
Tasks:
- Write the earliest plausible root-cause statement.
- List two alternate hypotheses that still fit the evidence.
- State which one observation would most efficiently discriminate between them.
Answer guidance
The earliest plausible root cause is failure to make local progress despite valid global paths. Alternate hypotheses include:
- controller tuning or overly strict progress checker
- downstream `cmd_vel` clipping or base deadband
- local costmap showing a hidden or stale obstacle that makes the controller refuse motion
The best discriminating observation is often a synchronized view of controller command output, final base command, and odometry delta over the same window.
Section D — Design a Better Escalation Rule
You are tuning Nav2 for an AMR that shares aisles with people and pallet traffic. A blocked aisle can be normal for 15 to 30 seconds, but spinning repeatedly in place is operationally unacceptable.
Questions:
- After how many failed local progress cycles should the BT escalate to mission or fleet logic rather than keep retrying internally?
- What operator-visible event should be emitted when that escalation happens?
- What information should be attached to that event so another system can make a better decision?
Answer guidance
There is no single magic number, but a strong answer ties the retry budget to aisle blocking norms, safety, and throughput. A good escalation event should differentiate between "temporarily blocked", "stuck due to local execution", and "global planning unavailable". The payload should include current pose, goal pose, last successful path age, recovery history, and a concise failure classification.
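In BT XML terms, the retry budget usually lives in a RecoveryNode, which retries its first child (via its second child) at most number_of_retries times and then returns FAILURE, aborting the NavigateToPose action so mission or fleet logic sees the result. A trimmed sketch in the shape of the default Nav2 tree; the budget of 2 is an assumption you should tie to your own aisle norms:

<RecoveryNode number_of_retries="2" name="NavigateRecovery">
  <PipelineSequence name="NavigateWithReplanning">
    <RateController hz="1.0">
      <ComputePathToPose goal="{goal}" path="{path}" planner_id="GridBased"/>
    </RateController>
    <FollowPath path="{path}" controller_id="FollowPath"/>
  </PipelineSequence>
  <!-- After 2 failed recovery cycles, FAILURE propagates up and the action aborts -->
  <SequenceWithMemory name="RecoveryActions">
    <Wait wait_duration="10.0"/>
    <BackUp backup_dist="0.30" backup_speed="0.05"/>
  </SequenceWithMemory>
</RecoveryNode>

The escalation event itself does not belong in the tree: emit it from whatever layer observes the aborted action result, attaching the payload fields listed above.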
Deliverable Template
Scenario type:
Simulation / logs / production incident
First failing contract:
Evidence for that claim:
- trace lines:
- supporting logs:
- missing evidence:
Best next action:
Recovery or escalation recommendation:
Success Criteria
You have completed this lab well if you can:
- read a BT trace without confusing symptoms and causes
- explain when replanning is helpful and when it is just churn around a deeper local problem
- critique recovery ordering using real AMR operating constraints rather than abstract elegance
- define when the BT should stop retrying and hand control back to a broader mission system