08 — Nav2 Recoveries, Progress Checkers, and Goal Checkers

How to design safe retry behavior, detect real stuck conditions, and decide when an AMR should escalate instead of looping

Prerequisites: 07 — Nav2 Localization Odom Amcl Ekf, 06 — Nav2 Local Control And Cmdvel, 03 — Nav2 Bt Navigator And Bt Xml

Unlocks: Safer recovery policy, fewer opaque mission failures, cleaner retry budgets, better differentiation between temporary blockage and real navigation breakdown

Why Should I Care? (Context)

When navigation fails in production, the most important question is not “can the robot try something?” It is “what should the robot try next, how many times, and when should it stop?”

Bad recovery design creates recognizable AMR failures:

robot spins repeatedly in a blocked aisle while traffic backs up
robot gives up too early on a temporary human obstruction
robot keeps clearing costmaps even though localization is the real problem
progress checker declares the robot stuck even though the base is inching forward correctly
goal checker reports success at the wrong pose and corrupts downstream task logic

Recoveries and checkers are where technical navigation meets operational policy.

PART 1 — RECOVERIES ARE POLICY, NOT CLEANUP NOISE

1.1 What Recoveries Actually Mean

A recovery is a deliberate policy decision that says:

the normal plan-follow loop is not succeeding
try a bounded alternative behavior to regain progress or clarity

Common Nav2 recovery behaviors include:

clearing costmaps
spinning
backing up
waiting

Those are not equivalent. Each assumes a different story about why navigation failed.

1.2 Match the Recovery to the Failure Story

Suspected failure story	Good first recovery bias	Bad first recovery bias
temporary human or forklift obstruction	wait, maybe replan	aggressive backup or repeated spin
stale or phantom obstacle in costmap	clear relevant costmap, then replan	repeated wait with no model refresh
robot nose trapped in tight geometry	short backup, then replan	repeated clear-only loop
localization confusion	escalate, or run a relocalization workflow outside the default recoveries	endless spin/clear cycles

If the recovery assumes the wrong story, the robot wastes time and operator trust.
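
The table above can be written down as an explicit lookup so that the first recovery is a reviewed decision rather than an accident of default ordering. This is a minimal sketch: the enum members and action labels are illustrative names invented here, not Nav2 plugin or behavior tree node names.

```python
from enum import Enum, auto


class FailureStory(Enum):
    """Suspected reasons for a navigation failure (hypothetical labels)."""
    TEMPORARY_OBSTRUCTION = auto()
    STALE_COSTMAP = auto()
    TRAPPED_GEOMETRY = auto()
    LOCALIZATION_CONFUSION = auto()


# Ordered recovery bias per failure story, copied from the table above.
# The strings are illustrative labels, not real Nav2 identifiers.
RECOVERY_BIAS = {
    FailureStory.TEMPORARY_OBSTRUCTION: ["wait", "replan"],
    FailureStory.STALE_COSTMAP: ["clear_costmap", "replan"],
    FailureStory.TRAPPED_GEOMETRY: ["short_backup", "replan"],
    FailureStory.LOCALIZATION_CONFUSION: ["escalate"],
}


def first_recovery(story: FailureStory) -> str:
    """Return the first recovery to try for a suspected failure story."""
    return RECOVERY_BIAS[story][0]
```

Classifying the failure story is the hard part and is not shown. The sketch only makes the mapping explicit once a story has been chosen.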

1.3 Retry Budget Is a Product Decision

The number of retries is not just a technical knob.

More retries mean:

better resilience to temporary blockage
more opportunity for success without operator help
more time spent failing inside shared traffic space

Fewer retries mean:

faster escalation to mission or fleet logic
fewer useless loops
more false aborts in environments with frequent short-lived obstruction

That tradeoff depends on aisle width, human traffic, mission urgency, and operator expectations.
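
One way to make that product decision explicit is a budget object that bounds both the number of attempts and the total time spent failing. The sketch below is a hypothetical helper, not a Nav2 API. The class name and both limits are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class RetryBudget:
    """Bounded retry budget (hypothetical helper, not part of Nav2)."""
    max_retries: int
    max_seconds_blocked: float
    retries_used: int = 0
    seconds_blocked: float = 0.0

    def record_attempt(self, duration_s: float) -> None:
        """Account for one failed recovery attempt and the time it cost."""
        self.retries_used += 1
        self.seconds_blocked += duration_s

    def exhausted(self) -> bool:
        """True once either the retry count or the time limit is spent."""
        return (
            self.retries_used >= self.max_retries
            or self.seconds_blocked >= self.max_seconds_blocked
        )
```

Bounding time as well as count matters in shared traffic space: three slow recoveries can block an aisle longer than five quick ones.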

PART 2 — PROGRESS CHECKERS: IS THE ROBOT REALLY MOVING?

2.1 What a Progress Checker Is Trying to Prove

A progress checker asks a simple but operationally critical question:

has the robot made enough real progress within the allowed time to justify continuing?

That usually boils down to thresholds such as:

minimum movement radius
allowed time without sufficient movement

Those values sound simple, but they are heavily dependent on the robot and its motion stack.
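
The two thresholds combine into a baseline-and-timer check. The sketch below is a simplified Python illustration of that idea, not the Nav2 implementation. The argument names follow the `required_movement_radius` and `movement_time_allowance` parameters of Nav2's simple progress checker, but confirm names and defaults against the documentation for your release.

```python
import math


class ProgressCheckerSketch:
    """Baseline-and-timer progress check (illustrative, not Nav2 source)."""

    def __init__(self, required_movement_radius, movement_time_allowance):
        self.radius = required_movement_radius    # meters
        self.allowance = movement_time_allowance  # seconds
        self.baseline = None                      # (x, y, t) of last accepted progress

    def check(self, x, y, now):
        """Return True while progress is acceptable, False once the robot counts as stuck."""
        if self.baseline is None:
            self.baseline = (x, y, now)
            return True
        bx, by, bt = self.baseline
        if math.hypot(x - bx, y - by) > self.radius:
            # Moved far enough: accept progress and restart the timer here.
            self.baseline = (x, y, now)
            return True
        # Still inside the radius: acceptable only while the allowance has not run out.
        return (now - bt) <= self.allowance
```

Note that the check measures displacement from a baseline pose, not distance travelled. A robot that oscillates inside the radius makes no progress by this definition.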

2.2 False Stuck Detection Is Common

A progress checker can fire even when the controller is behaving rationally.

Typical causes:

velocity smoother limits or ramps commands so heavily that the base barely moves
base deadband prevents tiny commands from producing measurable progress
robot is intentionally inching during docking or narrow-aisle alignment
localization noise hides real low-speed movement

If the checker says “stuck,” prove whether the robot was actually unable to move or only unable to satisfy the chosen threshold.
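
A quick arithmetic check helps here. A radius and time allowance together imply a minimum average speed: with an example radius of 0.5 m and an allowance of 10 s, the robot must average more than 0.05 m/s in a straight line. A robot creeping at 0.03 m/s covers only 0.3 m in that window, so the checker fires even though the base is moving. The function names below are hypothetical and the numbers are examples, not recommendations.

```python
def min_average_speed(required_movement_radius, movement_time_allowance):
    """Straight-line speed (m/s) the robot must exceed to satisfy the checker."""
    return required_movement_radius / movement_time_allowance


def satisfies_checker(actual_speed, required_movement_radius, movement_time_allowance):
    """True if steady straight-line motion at actual_speed leaves the radius in time."""
    return actual_speed * movement_time_allowance > required_movement_radius
```

If the intended speed for a motion mode sits below this floor, the threshold is wrong for that mode, regardless of how well the controller behaves.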

2.3 Progress Thresholds Must Match Motion Mode

The same thresholds rarely fit all scenarios.

Scenario	Risk if threshold too strict	Risk if threshold too loose
normal aisle following	false stuck detection	slow recognition of real failures
dense human environment	constant needless recoveries	robot waits too long in congestion
docking or final alignment	aborts during valid inching motion	masks real low-speed deadlock

This is why high-quality AMR systems often treat docking and general navigation differently at the mission layer.
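
One way to express that separation is a per-mode threshold table owned by the mission layer. The sketch below is hypothetical: the mode names are invented for illustration, and the numbers are placeholders that must be replaced with values measured on the actual robot.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProgressThresholds:
    required_movement_radius_m: float
    movement_time_allowance_s: float


# Placeholder numbers for illustration only; tune from real test data.
THRESHOLDS_BY_MODE = {
    "aisle": ProgressThresholds(0.5, 10.0),
    "congested": ProgressThresholds(0.5, 30.0),
    "docking": ProgressThresholds(0.05, 15.0),
}


def thresholds_for(mode: str) -> ProgressThresholds:
    """Return thresholds for a motion mode, falling back to the aisle values."""
    return THRESHOLDS_BY_MODE.get(mode, THRESHOLDS_BY_MODE["aisle"])
```

How the selected values reach the navigation stack (separate checker instances, parameter updates, or a dedicated docking routine) depends on the system and is not shown.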

PART 3 — GOAL CHECKERS: WHEN ARE YOU REALLY “THERE”?

3.1 Goal Reached Is a Contract, Not a Feeling

A goal checker determines whether the robot is within the positional and angular tolerances required to report success.

This matters because success triggers downstream actions:

task completion
manipulator handoff
docking continuation
fleet scheduling updates

If success is declared too early, the robot may be operationally in the wrong place even though Nav2 says done.
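
The contract itself is small. The sketch below shows a stateless position-and-yaw check on `(x, y, yaw)` tuples. It is an illustration of the idea, not Nav2 code, and the function names are hypothetical. The yaw error is wrapped so that headings on either side of ±π compare correctly.

```python
import math


def angle_diff(a, b):
    """Smallest signed difference a - b in radians, wrapped to [-pi, pi]."""
    return math.atan2(math.sin(a - b), math.cos(a - b))


def goal_reached(pose, goal, xy_tolerance, yaw_tolerance):
    """pose and goal are (x, y, yaw). True only if both tolerances are met."""
    within_xy = math.hypot(goal[0] - pose[0], goal[1] - pose[1]) <= xy_tolerance
    within_yaw = abs(angle_diff(goal[2], pose[2])) <= yaw_tolerance
    return within_xy and within_yaw
```

Everything downstream inherits these two numbers. A handoff that needs 2 cm accuracy cannot rely on a goal checker that accepts 25 cm.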

3.2 Position and Yaw Tolerances Are Operational Choices

Loose tolerances can help throughput in coarse navigation tasks.

Tight tolerances matter for:

docking
pickup/dropoff staging
narrow station approach
handoff to another subsystem that assumes precise pose

Bad pattern:

team sees goal oscillation
they loosen tolerances to make the issue disappear
docking or staging later becomes unreliable

That is not fixing the root cause. It is shifting the failure downstream.

3.3 Stateful Goal Checking Can Be Useful and Dangerous

Some goal checkers keep state: once a tolerance has been satisfied, they stop re-evaluating it, which avoids flapping between "reached" and "not reached".

That can help stability, but it can also hide situations where the robot briefly enters tolerance and then drifts away.

Use it intentionally, especially if a downstream workflow assumes precise final pose.
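
The hazard is easiest to see in code. The sketch below latches the position check once it has passed and only re-evaluates yaw afterwards. It is a simplified illustration of latching behavior with a hypothetical class name, not the Nav2 implementation, so check how the goal checker in your release handles its stateful option.

```python
import math


class LatchingGoalChecker:
    """Goal check that can latch the XY result (illustrative sketch)."""

    def __init__(self, xy_tolerance, yaw_tolerance, stateful=True):
        self.xy_tolerance = xy_tolerance
        self.yaw_tolerance = yaw_tolerance
        self.stateful = stateful
        self.xy_latched = False

    def reset(self):
        """Call when a new goal starts so an old latch cannot leak through."""
        self.xy_latched = False

    def is_reached(self, pose, goal):
        """pose and goal are (x, y, yaw) tuples."""
        if self.stateful and self.xy_latched:
            xy_ok = True  # position is no longer re-checked
        else:
            dist = math.hypot(goal[0] - pose[0], goal[1] - pose[1])
            xy_ok = dist <= self.xy_tolerance
            if xy_ok and self.stateful:
                self.xy_latched = True
        d = goal[2] - pose[2]
        yaw_err = abs(math.atan2(math.sin(d), math.cos(d)))
        return xy_ok and yaw_err <= self.yaw_tolerance
```

With latching enabled, a robot that enters the position tolerance, drifts 0.5 m while rotating, and then aligns its yaw still reports success. Without latching, the same sequence fails, which is the correct answer for precise station work.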

PART 4 — DEFAULT RECOVERIES AND WHEN THEY MAKE SENSE

4.1 Clear Costmap

This makes sense when the world model may be stale or polluted.

Good use cases:

phantom obstacle from observation lag
old obstacle trail after a pallet moved
local costmap not matching visible reality

Bad use cases:

persistent real obstacle
wrong global localization
base cannot execute commands

If costmap clearing works repeatedly, do not celebrate. Ask why the stale obstacle keeps returning.

4.2 Wait

Waiting is underused in human-heavy or forklift-heavy environments.

It is often the safest first move when:

obstruction is temporary
backing up creates more conflict
spinning expands risk envelope near shelves or pedestrians

Waiting is bad when the robot is geometrically trapped or the world model itself is wrong.

4.3 Back Up

Backing up is useful when the robot needs space to replan or re-enter a corridor.

It is risky when:

rear perception is weak
shared-traffic policy forbids blind retreat
the robot is near a station or handoff zone

In production AMRs, backup distance and conditions are policy decisions, not just defaults.

4.4 Spin

Spin can help with sensor coverage and local environment refresh.

It can also be a throughput killer in narrow aisles.

If the robot repeatedly spins in a place where turning in place is operationally awkward, the fix likely belongs in BT or mission-layer policy, not in more retries.

PART 5 — RECOVERY DESIGN FOR COMMON AMR INCIDENTS

5.1 Blocked Aisle by Temporary Traffic

Recommended bias:

wait briefly
re-evaluate and maybe replan
escalate if blocked beyond policy budget

Why:

spinning and backing up often worsen shared-space behavior
repeated aggressive recoveries reduce throughput for everyone

5.2 Phantom Obstacle in Local Costmap

Recommended bias:

clear local costmap
retry local follow
if repeated, investigate sensor or observation-source health

Do not let costmap clear become a permanent band-aid for perception integration defects.
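
A small audit counter can surface the "if repeated" case automatically. The sketch below buckets each clear by map cell and flags cells that keep needing it. The class name, cell size, and alert threshold are hypothetical choices for illustration. Where the clear events come from (logs, a BT hook, a wrapper around the service call) is left open.

```python
from collections import Counter


class ClearCostmapAudit:
    """Counts costmap clears per map cell to expose recurring phantom obstacles."""

    def __init__(self, cell_size_m=1.0, alert_threshold=3):
        self.cell_size_m = cell_size_m
        self.alert_threshold = alert_threshold
        self.counts = Counter()

    def record_clear(self, x, y):
        """Record a clear at map position (x, y); return True if the cell needs investigation."""
        cell = (int(x // self.cell_size_m), int(y // self.cell_size_m))
        self.counts[cell] += 1
        return self.counts[cell] >= self.alert_threshold
```

A cell that trips the threshold points at a sensor, mounting, or observation-source problem near that location, which no amount of clearing will fix.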

5.3 Nose Trapped Near Shelf Corner

Recommended bias:

short backup
replan
retry with bounded count

This is the kind of incident where backup makes more sense than waiting because geometry, not traffic, is the main issue.

5.4 Docking or Staging Failure

Recommended bias:

avoid generic repeated recoveries
decide whether the issue is approach geometry, localization precision, or station-specific logic
escalate to a docking-specific routine or mission layer if needed

Default recoveries are often too generic for precise station work.

PART 6 — CHECKER TUNING WITHOUT LYING TO YOURSELF

6.1 Do Not Use Goal Tolerance to Hide Localization Noise

If the robot cannot settle because the pose estimate is noisy, loosening the goal checker may only move the failure to docking, manipulation, or task completion.

Treat loose tolerances as a task-level decision, not a universal fix.

6.2 Do Not Use Progress Thresholds to Hide Base Problems

If the base ignores small commands, increasing the movement time allowance may reduce false aborts, but it does not repair the command-chain mismatch.

First prove that commanded low-speed motion is physically executable.
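
One way to gather that proof is to compare commanded and measured speeds from a low-speed test run. The sketch below estimates the deadband as the largest commanded speed that produced no measurable motion. The function name and the `motion_epsilon` noise floor are assumptions, and the sample format is hypothetical.

```python
def estimate_deadband(samples, motion_epsilon=0.005):
    """Estimate the base deadband from (commanded_speed, measured_speed) pairs in m/s.

    Returns the largest nonzero commanded speed whose measured speed stayed
    below motion_epsilon, or None if every nonzero command produced motion.
    """
    ignored = [
        abs(cmd)
        for cmd, measured in samples
        if abs(cmd) > 0.0 and abs(measured) < motion_epsilon
    ]
    return max(ignored) if ignored else None
```

If the estimated deadband is above the speeds the controller commands during final approach, the progress checker is reporting a real command-chain problem, not a tuning problem.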

6.3 Validate Checkers on Three Separate Cases

Always validate on:

normal aisle motion
temporary obstruction
final approach or docking-like low-speed motion

If one threshold set only works in one case, document the limitation instead of pretending it is universal.
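
Logged trajectories from the three cases can be replayed offline against a candidate radius before anything is changed on the robot. The sketch below reports the longest time a track spent without leaving the radius around its rolling baseline. The function name and the `(t, x, y)` track format are hypothetical.

```python
import math


def longest_stall(track, required_movement_radius):
    """Longest time (s) a track of (t, x, y) samples stayed within the radius of its baseline."""
    if not track:
        return 0.0
    t0, x0, y0 = track[0]
    worst = 0.0
    for t, x, y in track[1:]:
        if math.hypot(x - x0, y - y0) > required_movement_radius:
            # Progress accepted: move the baseline to this sample.
            t0, x0, y0 = t, x, y
        else:
            worst = max(worst, t - t0)
    return worst
```

As a usage example, a synthetic aisle run at 0.5 m/s sampled once per second stalls for at most 1 s against a 0.5 m radius, while a docking creep at 0.03 m/s stalls for 16 s. A 10 s allowance passes the first case and aborts the second, which is exactly the limitation worth documenting.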

PART 7 — ESCALATION TO THE MISSION OR FLEET LAYER

7.1 Not Every Failure Should Be Solved Inside Nav2

Escalate when:

blockage persists beyond local retry budget
another robot or human policy needs to change the route
docking requires a product-specific decision tree
operator assistance is required

Nav2 should not carry all mission semantics alone.

7.2 Good Escalation Signals

Useful signals to expose upward:

reason for abort or failure type
number of retries consumed
whether progress checker or goal checker was the trigger
whether recoveries changed the world model or only retried movement

This is what lets the mission layer make informed next-step decisions.
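
Those signals can travel upward as one structured record instead of a bare failure code. The sketch below is a hypothetical report type: the field names and trigger values are invented for illustration and do not correspond to a Nav2 message definition.

```python
from dataclasses import asdict, dataclass
from enum import Enum


class Trigger(Enum):
    """What caused the abort (hypothetical categories)."""
    PROGRESS_CHECKER = "progress_checker"
    GOAL_CHECKER = "goal_checker"
    OTHER = "other"


@dataclass(frozen=True)
class EscalationReport:
    failure_type: str          # reason for abort, in the mission layer's vocabulary
    retries_consumed: int      # how much of the local budget was spent
    trigger: Trigger           # which checker, if any, triggered the failure
    world_model_changed: bool  # True if a recovery cleared or refreshed the costmap

    def to_dict(self):
        """Serialize for logs or an upward message, with the enum as a plain string."""
        data = asdict(self)
        data["trigger"] = self.trigger.value
        return data
```

With this record, the mission layer can tell "three waits in a busy aisle" apart from "three costmap clears that changed nothing" and choose a different next step for each.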

7.3 Senior Interview Version

Strong answers explain that recoveries are:

bounded
scenario-specific
safe for the environment
observable from logs and metrics
connected to mission escalation rather than endless local looping

That is what distinguishes production AMR policy from demo navigation.

Next Lesson

Continue to 09 — Nav2 Waypoints Docking Zones And Missions. That lesson explains how waypoint flows, docking, costmap filters, and higher-level mission logic sit above these local retry and success contracts.