Exercise 10 — AMR Debug Capstone
Estimated time: 120 to 150 minutes
Prerequisite lessons: 07 — Nav2 Localization Odom Amcl Ekf, 08 — Nav2 Recoveries Progress And Goal Checkers, 09 — Nav2 Waypoints Docking Zones And Missions, 10 — Nav2 Parameters Launch And Plugin Extension, 11 — Nav2 Debugging Observability And Bag Analysis, 12 — Nav2 Amr Failure Patterns And Capstone
Mode options:
- Simulation: reproduce parts of the incident in a Nav2-enabled warehouse world and compare observed symptoms to the evidence packet.
- Bag replay and logs: treat the supplied evidence as the primary source of truth and produce a structured root-cause analysis.
- Design review: complete the capstone as a production incident exercise with no hardware dependency.
Validation goal: by the end of this capstone you should be able to separate localization, costmap, controller, recovery-policy, zone-policy, and mission-layer mistakes in one coherent incident review instead of blaming “navigation” as a single black box.
Scenario
You are the on-call engineer for a warehouse AMR fleet.
At 09:14, robot amr-07 was executing this mission:
- leave outbound staging
- travel through packing zone corridor
- stop at pick station P-12
- continue to charger after pick confirmation
Operations report:
- the robot slowed appropriately near the packing zone
- then hesitated, backed up once, turned sharply, and aborted
- the dashboard labeled the incident navigation_failed
- the mission system immediately reassigned the order to another robot
Your job is to produce a defensible diagnosis from incomplete evidence.
Evidence Packet
[mission_manager] Dispatch mission pick_P12_then_charge to amr-07
[zone_manager] Entered speed zone packing_corridor, max_speed=0.25
[bt_navigator] Begin NavigateToPose to P-12 approach waypoint
[controller_server] Selected command: linear.x=0.18 angular.z=0.22
[velocity_smoother] Output command: linear.x=0.10 angular.z=0.08
[local_costmap.local_costmap] Message Filter dropping message: frame 'laser' at time 551.420 for reason 'the timestamp on the message is earlier than all the data in the transform cache'
[controller_server] No valid control command found
[behavior_server] Executing backup behavior
[amcl] Pose covariance increased to x=0.36 y=0.40 yaw=0.31
[bt_navigator] Recovery node returned SUCCESS
[planner_server] ComputePathToPose failed: no valid path found
[mission_manager] Navigation result=FAILED for waypoint P-12 approach
[mission_manager] Auto-diverting robot to charge mission
Supplemental notes:
- RViz screenshot from the incident shows the robot slightly rotated relative to the mapped corridor.
- A replay clip suggests intermittent human traffic was present in the packing corridor.
- The forklift keepout zone was not violated.
- The battery was at 38%, well above the normal charge-divert threshold.
Section A — First-Pass Triage
Answer in order, without skipping ahead to fixes.
- What are the first three competing hypotheses you would consider?
- Which single log line most strongly changes your investigation priority?
- Why is navigation_failed an insufficient incident label here?
Answer guidance
Reasonable first hypotheses include:
- localization or TF timing degradation causing observation drops and controller invalidation
- a genuine transient corridor obstruction interacting with the recovery policy
- mission-layer misclassification or mis-escalation after a bounded Nav2 failure
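The dropped-message line and the covariance increase should strongly influence prioritization: they suggest the robot's own state estimate (sensor timing, transforms, pose) may not be trustworthy, which is a different class of problem from a route that is merely blocked.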
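The prioritization step above can be sketched in a few lines of Python. This is a minimal illustration, not part of any Nav2 tooling: the log lines are copied from the evidence packet, while `STATE_TRUST_HINTS`, `parse_line`, and `state_trust_lines` are hypothetical names introduced here, and the keyword list is an assumption about what a "state trust" hint looks like.

```python
import re

# A subset of the evidence packet, copied verbatim from this exercise.
LOG_LINES = [
    "[controller_server] Selected command: linear.x=0.18 angular.z=0.22",
    "[local_costmap.local_costmap] Message Filter dropping message: frame 'laser' "
    "at time 551.420 for reason 'the timestamp on the message is earlier than all "
    "the data in the transform cache'",
    "[controller_server] No valid control command found",
    "[amcl] Pose covariance increased to x=0.36 y=0.40 yaw=0.31",
    "[planner_server] ComputePathToPose failed: no valid path found",
]

# Matches "[node_name] free-form message".
LINE_RE = re.compile(r"^\[(?P<node>[^\]]+)\]\s+(?P<message>.*)$")

# Assumption: phrases that hint at degraded state trust rather than a blocked route.
STATE_TRUST_HINTS = ("dropping message", "transform cache", "covariance increased")


def parse_line(line):
    """Split one log line into (node, message); raise if the format is unexpected."""
    match = LINE_RE.match(line)
    if match is None:
        raise ValueError(f"unrecognized log line: {line!r}")
    return match.group("node"), match.group("message")


def state_trust_lines(lines):
    """Return (index, node) for every line whose message contains a state-trust hint."""
    flagged = []
    for index, line in enumerate(lines):
        node, message = parse_line(line)
        if any(hint in message.lower() for hint in STATE_TRUST_HINTS):
            flagged.append((index, node))
    return flagged


if __name__ == "__main__":
    for index, node in state_trust_lines(LOG_LINES):
        print(f"line {index}: {node}")
```

A keyword filter like this only ranks lines for a human to read first; it does not establish cause.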
Section B — Evidence Classification Table
Classify each evidence item.
| Evidence item | Best category | Why it matters |
| --- | --- | --- |
| speed zone entry | ? | ? |
| smoothed command lower than controller intent | ? | ? |
| local costmap dropped laser message | ? | ? |
| backup behavior success | ? | ? |
| planner later reports no valid path | ? | ? |
| mission auto-diverts to charger at 38% battery | ? | ? |
Possible categories: spatial policy, command-chain behavior, TF or timing issue, recovery outcome, path feasibility symptom, mission-layer policy error.
Answer guidance
This table should help you avoid flattening the entire incident into one cause. Several entries are likely downstream effects or policy context rather than root cause.
Section C — Root Cause Analysis
Task C1 — Build the Causal Chain
Write a causal chain using this structure:
Trigger or degraded condition -> immediate Nav2 symptom -> recovery behavior -> downstream planning symptom -> mission-layer decision
Then answer:
- Which part of the chain currently has the strongest evidence?
- Which part is still only a hypothesis?
- What extra evidence would most reduce uncertainty?
Answer guidance
High-quality answers usually treat TF or localization degradation as an upstream candidate, controller invalidation and backup as middle symptoms, and the charge diversion as a possibly separate mission-policy mistake.
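One way to keep the chain honest is to record, for each link, whether it is directly observed in the logs or still a hypothesis. The sketch below is a hypothetical data structure, not an answer key: the `Support` levels assigned to each link are one possible reading of the evidence packet, and `Link`, `CHAIN`, `render`, and `hypothesis_stages` are names introduced only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Support(Enum):
    OBSERVED = "observed"      # a log line states it directly
    HYPOTHESIS = "hypothesis"  # plausible, but not yet confirmed by evidence


@dataclass(frozen=True)
class Link:
    stage: str
    claim: str
    support: Support


# Stages follow the structure given in Task C1; claims quote the evidence packet.
CHAIN = [
    Link("trigger or degraded condition",
         "TF or localization degradation near the packing corridor",
         Support.HYPOTHESIS),
    Link("immediate Nav2 symptom",
         "controller_server: No valid control command found",
         Support.OBSERVED),
    Link("recovery behavior",
         "behavior_server: Executing backup behavior",
         Support.OBSERVED),
    Link("downstream planning symptom",
         "planner_server: ComputePathToPose failed: no valid path found",
         Support.OBSERVED),
    Link("mission-layer decision",
         "mission_manager: Auto-diverting robot to charge mission",
         Support.OBSERVED),
]


def render(chain):
    """Render the chain in the arrow notation used by Task C1."""
    return " -> ".join(link.stage for link in chain)


def hypothesis_stages(chain):
    """Return the stages that still rest on hypothesis rather than observation."""
    return [link.stage for link in chain if link.support is Support.HYPOTHESIS]


if __name__ == "__main__":
    print(render(CHAIN))
    print("still hypothetical:", hypothesis_stages(CHAIN))
```

Note that an observed symptom can still have an unconfirmed cause: the links are observed individually, while the arrows between them remain claims that need their own evidence.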
Task C2 — Distinguish Root Cause From Bad Outcome Policy
Questions:
- Is the automatic diversion to charge mission justified by the evidence given?
- Why is that question separate from diagnosing the original navigation issue?
- What two policy contracts appear weak even if Nav2 was partly at fault?
Answer guidance
The battery note suggests the diversion policy may be unjustified. That is separate because mission policy can still be wrong even when a real navigation issue occurred. Common weak contracts here are failure classification and post-failure mission escalation rules.
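The separation between failure classification and escalation can be made concrete with a small policy sketch. Everything here is an assumption for illustration: the exercise does not state the fleet's actual charge-divert threshold, so `ASSUMED_CHARGE_THRESHOLD_PCT` is a placeholder value, and `decide_after_nav_failure` and its reason strings are hypothetical names, not part of any mission system.

```python
from dataclasses import dataclass

# Placeholder only: the real threshold is not given in the exercise.
ASSUMED_CHARGE_THRESHOLD_PCT = 25.0


@dataclass(frozen=True)
class DivertDecision:
    divert_to_charge: bool
    reason: str


def decide_after_nav_failure(battery_pct, charge_threshold_pct, nav_failed):
    """Decide on a charge diversion from battery state alone.

    A navigation failure is reported as its own reason and never triggers
    a charge diversion by itself.
    """
    if battery_pct <= charge_threshold_pct:
        return DivertDecision(True, "battery_at_or_below_threshold")
    if nav_failed:
        return DivertDecision(False, "nav_failure_needs_classification")
    return DivertDecision(False, "continue_mission")


if __name__ == "__main__":
    # Battery level from the supplemental notes: 38%.
    print(decide_after_nav_failure(38.0, ASSUMED_CHARGE_THRESHOLD_PCT, nav_failed=True))
```

Under this sketch, a robot at 38% with a failed waypoint is held for classification instead of being sent to the charger, which keeps the two contracts independent.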
Section D — Focused Validation Plan
You have 45 minutes and no hardware access.
Design a validation plan with exactly four steps. It must include:
- one TF or timing check
- one localization-confidence check
- one command-path check
- one mission-policy review step
For each step, specify:
- what evidence you inspect
- what result would support your leading hypothesis
- what result would falsify it
Answer guidance
The goal is disciplined narrowing. Avoid broad fishing expeditions. Each step should be able to disconfirm part of your current story.
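As an example of a step that can disconfirm a hypothesis, the TF or timing check can be reduced to comparing a message timestamp against the time window covered by the transform cache. The sketch below is a simplified model for reasoning about bag data offline; it does not call tf2, and `classify_stamp` and the cache timestamps are hypothetical. Only the laser timestamp 551.420 comes from the evidence packet.

```python
def classify_stamp(msg_stamp, cache_stamps):
    """Classify a sensor message timestamp against the cached transform window.

    msg_stamp: message time in seconds.
    cache_stamps: times (seconds) of the transforms currently held in the cache.
    """
    if not cache_stamps:
        return "empty_cache"
    oldest, newest = min(cache_stamps), max(cache_stamps)
    if msg_stamp < oldest:
        # Corresponds to "earlier than all the data in the transform cache".
        return "older_than_cache"
    if msg_stamp > newest:
        return "newer_than_cache"
    return "within_cache"


def drop_fraction(msg_stamps, cache_stamps):
    """Fraction of messages that fall outside the cached transform window."""
    if not msg_stamps:
        return 0.0
    outside = sum(
        1 for stamp in msg_stamps
        if classify_stamp(stamp, cache_stamps) != "within_cache"
    )
    return outside / len(msg_stamps)


if __name__ == "__main__":
    # Hypothetical cache contents; only 551.420 is taken from the evidence packet.
    cache = [561.40, 561.45, 561.50]
    print(classify_stamp(551.420, cache))
```

A supporting result would be a sustained run of messages classified as outside the cache around the time of the controller failure. A single isolated drop with healthy timing on either side would weaken the timing hypothesis.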
Section E — Operator and Dashboard Redesign
The current operator ticket only says navigation_failed.
Task E1 — Replace It With a Better Incident Summary
Write a replacement incident summary of 5 to 7 lines that separates:
- possible localization or TF health issue
- bounded recovery execution
- downstream path failure
- questionable mission escalation
Then answer:
- What labels should the dashboard have emitted instead?
- Which two timeline events should be highlighted for review by default?
Answer guidance
Strong answers create a layered narrative such as: `packing_zone_hold_or_state_degradation`, `recovery_executed`, `planner_no_path_after_recovery`, `mission_policy_diverted_to_charge`. The point is truthful decomposition.
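A layered label set like the one above could be derived from the event stream with a small rule table. This is a sketch under stated assumptions: the label names come from the guidance above, the events are taken from the evidence packet (one message shortened), and `RULES` and `layered_labels` are hypothetical, not an existing dashboard API.

```python
# (node prefix, lowercase substring to look for, label to emit)
RULES = [
    ("local_costmap", "dropping message", "packing_zone_hold_or_state_degradation"),
    ("amcl", "covariance increased", "packing_zone_hold_or_state_degradation"),
    ("bt_navigator", "recovery node returned", "recovery_executed"),
    ("planner_server", "no valid path", "planner_no_path_after_recovery"),
    ("mission_manager", "auto-diverting", "mission_policy_diverted_to_charge"),
]

# Events from the evidence packet as (node, message); the first message is shortened.
EVENTS = [
    ("local_costmap.local_costmap",
     "Message Filter dropping message: frame 'laser' at time 551.420"),
    ("behavior_server", "Executing backup behavior"),
    ("amcl", "Pose covariance increased to x=0.36 y=0.40 yaw=0.31"),
    ("bt_navigator", "Recovery node returned SUCCESS"),
    ("planner_server", "ComputePathToPose failed: no valid path found"),
    ("mission_manager", "Auto-diverting robot to charge mission"),
]


def layered_labels(events):
    """Return labels in first-seen order, without duplicates."""
    labels = []
    for node, message in events:
        text = message.lower()
        for rule_node, needle, label in RULES:
            if not (node.startswith(rule_node) and needle in text):
                continue
            # Only claim "after recovery" if a recovery was actually seen earlier.
            if (label == "planner_no_path_after_recovery"
                    and "recovery_executed" not in labels):
                continue
            if label not in labels:
                labels.append(label)
    return labels


if __name__ == "__main__":
    for label in layered_labels(EVENTS):
        print(label)
```

Emitting an ordered list instead of one label preserves the sequence of events, which is what a reviewer needs to separate the Nav2 symptoms from the mission-layer decision.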
Section F — Corrective Action Proposal
Propose one corrective action in each category below.
| Category | Corrective action | Why |
| --- | --- | --- |
| observability | ? | ? |
| localization or TF robustness | ? | ? |
| recovery policy | ? | ? |
| mission-layer failure handling | ? | ? |
Answer guidance
The best answers avoid magical one-line fixes. They improve evidence quality, harden the likely weak contract, and reduce the chance of the mission system making a second bad decision after the first fault.
Section G — Final RCA Writeup
Write a concise root-cause analysis using this template.
Incident summary:
Most likely upstream fault:
Evidence supporting it:
-
-
-
Secondary contributing factors:
-
-
Uncertain or missing evidence:
-
Why the dashboard label was misleading:
Recommended next validations:
1.
2.
3.
Answer guidance
Strong writeups show hierarchy: upstream fault, downstream Nav2 symptoms, and mission-policy mistakes are separated rather than blended.
Section H — Senior Reflection
Answer briefly but concretely.
H1. Why do AMR incidents become expensive when teams use navigation_failed as a catch-all label?
H2. What would you teach a new on-call engineer to check first in incidents like this?
H3. If you could add one mandatory artifact to every Nav2 incident package, what would it be?
Answer guidance
Strong answers mention time-to-diagnosis, bad escalation decisions, and the value of one compact artifact that joins TF health, localization confidence, and mission context.
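For H3, one possible shape for such a compact artifact is a single serializable record. The sketch below is an assumption about what the record might contain, not a Nav2 message type: `IncidentSnapshot` and its field names are hypothetical, and the values are taken from the evidence packet and supplemental notes in this exercise.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class IncidentSnapshot:
    robot_id: str
    mission_id: str
    battery_pct: float
    dropped_sensor_messages: int  # TF health: drops seen in the incident window
    amcl_covariance: dict         # localization confidence at time of failure
    last_nav_result: str          # what Nav2 reported
    mission_decision: str         # what the mission layer did next


def to_json(snapshot):
    """Serialize the snapshot with stable key order so records diff cleanly."""
    return json.dumps(asdict(snapshot), sort_keys=True, indent=2)


SNAPSHOT = IncidentSnapshot(
    robot_id="amr-07",
    mission_id="pick_P12_then_charge",
    battery_pct=38.0,
    dropped_sensor_messages=1,  # one drop is visible in the evidence packet
    amcl_covariance={"x": 0.36, "y": 0.40, "yaw": 0.31},
    last_nav_result="FAILED",
    mission_decision="auto_divert_to_charge",
)

if __name__ == "__main__":
    print(to_json(SNAPSHOT))
```

Keeping TF health, localization confidence, and mission context in one record lets a reviewer see at a glance that the battery level did not explain the diversion.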
Deliverable Template
Incident ID:
Leading hypothesis:
Evidence table:
- localization or TF:
- command chain:
- recovery sequence:
- planner symptom:
- mission policy:
Causal chain:
Most likely root cause:
Secondary contributing factors:
Corrective actions:
1.
2.
3.
4.
Recommended dashboard labels:
Open questions:
Success Criteria
You have completed this capstone well if you can:
- decompose one AMR incident across localization, command path, recovery behavior, planner outcome, and mission policy
- state which evidence is strongest and which conclusions remain provisional
- design a focused validation plan that could falsify your current diagnosis without hardware access
- write an incident summary that operations, robotics engineers, and mission-software engineers can all act on