Exercise 07 — Nav2 Recovery Policy Design
Estimated time: 100 to 125 minutes
Prerequisite lessons: 03 — Nav2 BT Navigator and BT XML, 06 — Nav2 Local Control and cmd_vel, 08 — Nav2 Recoveries, Progress, and Goal Checkers, 12 — Nav2 AMR Failure Patterns and Capstone
Mode options:
- Simulation: trigger blocked-path and transient-obstacle scenarios in a Nav2 world and observe recovery behavior sequencing.
- Bag replay and logs: infer which recoveries were attempted and whether they matched the underlying failure mode.
- Design review: complete the exercise as a policy and BehaviorTree review for an AMR team without running hardware.
Validation goal: by the end of this lab you should be able to design a recovery policy that is safe, observable, and aligned with actual AMR failure modes instead of simply retrying until operations lose trust.
Overview
Most bad recovery policies fail in one of two ways:
- they retry the same losing action until the mission times out
- they perform aggressive maneuvers that are technically valid but operationally reckless
For AMRs in warehouses or factories, recoveries are not just a Nav2 concern. They are a product policy. They decide when to retry, when to back up, when to clear costmaps, when to wait for humans, and when to escalate to the mission layer.
This lab forces you to design recoveries from symptoms, safety constraints, and observability requirements.
Section A — Failure Mode Classification
Classify each incident by the first recovery category you would consider.
| Incident | First category to consider | Why |
| --- | --- | --- |
| aisle blocked by a pallet that moved in unexpectedly | ? | ? |
| local costmap appears polluted by a transient ghost obstacle | ? | ? |
| robot cannot rotate cleanly because of tight footprint near staging rack | ? | ? |
| localization confidence suddenly degrades mid-mission | ? | ? |
Choose from categories such as: wait-and-retry, clear local perception state, replan, controlled backup or reposition, localization revalidation, or mission escalation.
Answer guidance
The main lesson is that different root causes deserve different first moves:
- a real pallet obstruction may justify a bounded wait, replan, or mission-level reroute
- ghost obstacles may justify perception-state cleanup only if evidence supports it
- tight-footprint rotation problems may require controlled repositioning rather than repeated spin attempts
- degraded localization should usually change trust and mission state, not merely trigger motion retries
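To make the mapping concrete, the sketch below encodes it as a small dispatch table. This is illustrative Python for the exercise only; the incident keys and category names are hypothetical labels, not Nav2 identifiers.

```python
# Illustrative mapping from Section A incidents to a first recovery
# category. Keys and category names are hypothetical exercise labels.
FIRST_MOVE = {
    "pallet_blocks_aisle":          "wait_and_retry_then_replan",
    "transient_ghost_obstacle":     "clear_local_perception_state",
    "tight_footprint_rotation":     "controlled_backup_or_reposition",
    "localization_confidence_drop": "localization_revalidation",
}

def first_recovery_category(incident: str) -> str:
    """Return the first category to consider; escalate when unclassified."""
    return FIRST_MOVE.get(incident, "mission_escalation")
```

The point of the table form is that the classification is explicit and reviewable, rather than implied by the order of retry branches.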
Section B — Critique an Over-Retry Policy
Read the proposed policy below; a literal code transcription follows the list.
On any FollowPath failure:
1. clear both costmaps
2. retry FollowPath immediately
3. if it fails again, back up 0.10 m
4. retry FollowPath immediately
5. repeat this sequence up to 8 times
6. if still failing, abort mission
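Transcribed as control flow, the weaknesses become easier to see. This is a hedged sketch only: `nav` stands for a hypothetical wrapper around Nav2 actions and services, and its method names are assumptions, not real API.

```python
def naive_follow_path_recovery(nav, max_cycles: int = 8) -> bool:
    """Literal transcription of the proposed policy, for critique only.

    Note what is absent: no branching by failure cause, no check that
    clearing the costmaps is justified by evidence, no telemetry, and
    the same blind sequence regardless of what actually went wrong.
    """
    for _ in range(max_cycles):
        nav.clear_local_costmap()      # step 1: clear both costmaps
        nav.clear_global_costmap()
        if nav.follow_path():          # step 2: retry immediately
            return True
        nav.back_up(distance_m=0.10)   # step 3: back up 0.10 m
        if nav.follow_path():          # step 4: retry immediately
            return True
    nav.abort_mission()                # step 6: abort after 8 cycles
    return False
```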
Questions:
- What is operationally weak about this policy?
- Which failure modes might it accidentally make worse?
- What observability is missing if this policy runs in production?
Answer guidance
Strong answers should mention that the policy treats all failures as equivalent, lacks branching by cause, and may erase useful evidence. It can worsen blocked-aisle congestion, localization uncertainty, and operator confusion. It also fails to expose whether the issue was related to perception, control, planning, or localization.
Section C — Design a Tiered Recovery Policy
You are designing for a differential-drive AMR in narrow warehouse aisles.
Task C1 — Fill the Policy Table
| Failure signature | First recovery action | Second action if first fails | Escalation rule |
| --- | --- | --- | --- |
| temporary human obstruction with valid localization | ? | ? | ? |
| repeated no-valid-control with smooth odometry but tight local geometry | ? | ? | ? |
| planner failure with obviously blocked aisle | ? | ? | ? |
| pose jump or localization confidence drop | ? | ? | ? |
Answer guidance
Good designs usually keep the first action low-risk and evidence-aligned:
- temporary human obstruction: short wait then bounded replan or mission pause
- no-valid-control in tight geometry: small controlled reposition or backing maneuver after checking footprint and local costmap realism
- planner failure in a truly blocked aisle: wait, reroute, or mission-layer dispatch change rather than endless retries
- localization trust drop: pause or slow, revalidate localization, then escalate if trust does not recover
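One way to keep such a policy reviewable is to encode the tiers as data rather than burying them in branching. A minimal sketch, assuming hypothetical signature and action names that a real deployment would replace with its own:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RecoveryTier:
    first_action: str     # low-risk and evidence-aligned
    second_action: str    # only if the first demonstrably failed
    escalation_rule: str  # when the mission layer or operator takes over

# Hypothetical encoding of the guidance above; tune values to your site.
TIERED_POLICY = {
    "human_obstruction_valid_loc": RecoveryTier(
        "bounded_wait_10s", "replan_once",
        "pause mission and notify operator after two failed cycles"),
    "no_valid_control_tight_geometry": RecoveryTier(
        "verify_footprint_and_local_costmap", "controlled_reposition",
        "escalate if repositioning fails twice"),
    "planner_failure_blocked_aisle": RecoveryTier(
        "bounded_wait_30s", "reroute_via_mission_layer",
        "dispatch change, not endless replanning of the same aisle"),
    "pose_jump_or_low_confidence": RecoveryTier(
        "slow_or_pause", "revalidate_localization",
        "no motion retries until localization trust recovers"),
}
```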
Task C2 — Retry Budget Selection
For each category below, choose a retry philosophy and justify it.
| Category | Suggested retry philosophy | Why |
| --- | --- | --- |
| human crossing or forklift passing nearby | ? | ? |
| local sensor dropout for less than 2 seconds | ? | ? |
| repeated collision-risk backup in a crowded aisle | ? | ? |
| localization reset request | ? | ? |
Possible philosophies:
- many cheap retries are acceptable
- few retries with strong observability
- one retry only, then escalate
- no autonomous retry without higher-level approval
Answer guidance
The answer should show operational judgment rather than generic optimism. Cheap retries are acceptable when the environment is clearly transient and safety is preserved. High-motion recoveries in crowded aisles usually deserve a much stricter budget. Localization reset actions often require explicit mission or operator involvement because they change trust in the robot's state.
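Budgets become enforceable when every recovery attempt passes through one gate instead of scattered loop counters. A sketch under the same hypothetical category names, with placeholder numbers that each site would need to justify:

```python
# Placeholder budgets; the point is that each category gets an explicit,
# reviewable number rather than one global retry loop.
RETRY_BUDGETS = {
    "human_or_forklift_passing": 5,   # many cheap retries acceptable
    "sensor_dropout_under_2s":   3,   # few retries, strong observability
    "backup_in_crowded_aisle":   1,   # one retry only, then escalate
    "localization_reset":        0,   # no autonomous retry at all
}

def may_retry(category: str, attempts: int, operator_approved: bool = False) -> bool:
    """Gate every recovery attempt through its category's budget."""
    budget = RETRY_BUDGETS.get(category, 0)
    if budget == 0:
        return operator_approved  # requires higher-level approval
    return attempts < budget
```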
Section D — BehaviorTree Decision Review
Assume your BT currently does this:
NavigateToPose
-> ComputePathToPose
-> FollowPath
-> on failure: ClearEntireCostmap
-> Retry NavigateToPose
Questions:
- What important diagnostic branching is missing from this tree?
- What extra condition or subtree would you add for localization-related failures?
- What extra condition or subtree would you add for blocked-aisle cases?
Answer guidance
The missing branch is cause-aware handling. A strong answer suggests separate conditions for localization trust, persistent obstacle presence, or controller-progress failure. The goal is not to write a perfect BT from memory; it is to show that clear-costmap-and-retry is not a universal policy.
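The decision logic a stronger tree would express as condition nodes and subtrees can be sketched as ordered guards. The `ctx` fields and thresholds below are hypothetical telemetry for this exercise, not Nav2 API:

```python
def choose_recovery_branch(ctx) -> str:
    """Ordered, cause-aware dispatch that the flat clear-and-retry tree lacks."""
    # Localization trust comes first: motion retries cannot fix a state
    # estimate the robot should no longer trust.
    if ctx.localization_confidence < 0.5:
        return "localization_revalidation_subtree"
    # A persistently observed obstacle is a blocked aisle, not noise;
    # clearing the costmap would erase the evidence.
    if ctx.obstacle_persistence_s > 10.0:
        return "blocked_aisle_wait_or_reroute_subtree"
    # Controller stalls in tight geometry call for repositioning,
    # not another identical FollowPath attempt.
    if ctx.controller_no_progress and ctx.tight_geometry:
        return "controlled_reposition_subtree"
    # Only when evidence suggests stale perception is a clear justified.
    return "clear_local_costmap_then_retry_subtree"
```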
Section E — Recovery Observability Design
Define the minimum telemetry you would log or surface whenever a recovery is triggered.
Your answer must include:
- one root-cause hint
- one state snapshot from Nav2
- one mission-layer context field
- one operator-facing label that is not misleading
Then answer these questions:
- Why is `recovery_executed=true` useless by itself?
- What would you archive from bag replay or logs after a severe recovery sequence?
Answer guidance
Good answers include things like controller failure reason, costmap or TF snapshot, active mission step, and a label such as `blocked_aisle_waiting` or `localization_revalidation_required`. Evidence archives often include the triggering log window, costmap state, TF health, and mission context.
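A minimal event record covering the four required fields might look like the sketch below; every field name is illustrative, and the snapshot contents depend on what your stack can capture cheaply.

```python
from dataclasses import dataclass, field
import time

@dataclass
class RecoveryEvent:
    failure_reason: str  # root-cause hint, e.g. "controller_no_valid_control"
    nav_snapshot: dict   # Nav2 state, e.g. active BT node, costmap stamp, TF health
    mission_step: str    # mission-layer context, e.g. "deliver_to_rack_B4"
    operator_label: str  # honest label, e.g. "blocked_aisle_waiting"
    stamp: float = field(default_factory=time.time)
```

Unlike a bare `recovery_executed=true`, a record like this answers what failed, where in the mission it failed, and what the operator was told.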
Section F — Write a Recovery Policy Memo
Write a concise policy memo for operations and software teams that covers:
- when the robot is allowed to recover autonomously
- when it should pause and wait
- when it must escalate to the mission layer or operator
- what evidence should appear in the incident ticket after an abort
Target length: 10 to 14 lines.
Answer guidance
The best memos balance autonomy with trust. They explicitly limit aggressive retries, distinguish transient obstructions from state-trust failures, and require evidence capture instead of "the robot got stuck" summaries.
Section G — AMR Production Reflection
Answer briefly but concretely.
G1. Why is a recovery policy partly a product decision and not just a BT implementation detail?
G2. Which two recovery actions would you treat as highest risk in a crowded warehouse, and why?
G3. If you could add only one dashboard view for recoveries, what would it show?
Answer guidance
Strong answers mention throughput, safety, operator trust, and the need to understand not just whether a recovery ran but whether it matched the actual failure mode.
Deliverable Template
AMR scenario:
Primary failure modes considered:
Tiered recovery policy:
1.
2.
3.
Retry budgets:
- transient obstruction:
- controller geometry issue:
- localization trust issue:
Mission escalation rules:
Observability fields:
- root cause hint:
- nav state snapshot:
- mission context:
- operator label:
Open risks:
Success Criteria
You have completed this lab well if you can:
- match recovery actions to failure signatures instead of applying one generic retry loop
- justify retry budgets using safety and operational reasoning, not only technical optimism
- show where BT branching should reflect localization, geometry, and obstacle differences
- define telemetry that makes recoveries understandable during incident review