Top-K Gated Macro-Actions for Neural Heuristic Search

Under review at SOCS 2026.


This paper studies how to keep the depth-reduction benefits of macro-actions in neural heuristic search without paying the full branching-factor cost at every expansion.

Why it matters

Neural heuristic search reduces reliance on handcrafted heuristics, but it makes each node expansion more expensive because every extra successor often requires another neural network evaluation.

Macro-actions can shorten search depth, yet they usually increase branching. The central problem is balancing the depth reduction macros provide against the cost of exploring more successors.

What we did

  • Introduce top-K gating, which keeps all primitive actions but selects only the K best macro successors at each expansion.
  • Bound branching to b + K regardless of macro-pool size while preserving completeness.
  • Discover macros during approximate value iteration (AVI) and integrate them into training by adding their landing states.
  • Show on sliding-tile puzzles that K = 1 reduces node expansions by up to 28.6% on the 8-puzzle, 17.5% on the 15-puzzle, and 9.4% on the 24-puzzle.
  • Find that random macros provide less than 3% improvement, indicating the gains come from learned macro structure rather than simply adding longer actions.
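The macro-discovery step above (mining action subsequences from GBFS trajectories) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `mine_macros`, the trajectory format, and the frequency-based ranking are assumptions; the paper's utility scoring additionally considers applicability, benefit, and a branching penalty.

```python
from collections import Counter

def mine_macros(trajectories, min_len=2, max_len=5, top_n=50):
    """Count contiguous action subsequences in GBFS solution trajectories.

    `trajectories` is a list of action sequences (tuples of primitive
    action ids). Returns the `top_n` most frequent candidate macros of
    length 2-5, a simple support-based proxy for candidate quality.
    """
    counts = Counter()
    for actions in trajectories:
        for length in range(min_len, max_len + 1):
            for i in range(len(actions) - length + 1):
                counts[tuple(actions[i:i + length])] += 1
    return [macro for macro, _ in counts.most_common(top_n)]
```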
Figure 1. End-to-end architecture. Left: DeepCubeA generates training data via backward walks from the goal and trains a cost-to-go heuristic h_theta via approximate value iteration (AVI). Center: periodically during training, GBFS trajectories under h_theta are mined for macro candidates. Right: at search time, the full learned macro pool is deployed via top-K gating. Utility scoring is an optional filtering step evaluated as an ablation. The trained h_theta is reused for both macro discovery and gating decisions; no retraining is required.

The overall pipeline is shown in Figure 1, where macro discovery and search are tightly coupled.

How it works

  • Macro discovery during training: mine action subsequences of length 2-5 from GBFS trajectories under the learned heuristic.
  • Training integration: add macro landing states into the AVI training distribution while keeping Bellman updates primitive-only.
  • Optional utility scoring: rank macro candidates using support, applicability, benefit, and branching penalty.
  • Top-K gating at search time: evaluate all valid macros but retain only the K lowest-heuristic successors.
  • Cost consistency: assign each macro a cost equal to its primitive length to preserve A* correctness.
Key design choice: primitives are never gated away. Top-K gating only filters macro successors, so the search remains complete while macro branching stays explicitly bounded.
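The gating rule above can be sketched in a few lines. This is a simplified illustration under stated assumptions: `step(state, action)` is a hypothetical domain transition function (returning `None` for illegal moves), and `h` stands in for the learned heuristic h_theta; the real search operates inside BWAS, not in isolation.

```python
import heapq

def gated_successors(state, step, h, primitives, macros, K=1):
    """Successor generation with top-K macro gating (sketch).

    All primitive successors are always kept, so the search stays
    complete; macro successors are ranked by the heuristic `h` and only
    the K best survive, bounding branching at b + K. Each macro's cost
    is its primitive length, keeping A* cost accounting consistent.
    """
    succs = []
    for a in primitives:                  # primitives are never gated away
        s = step(state, a)
        if s is not None:
            succs.append((s, 1))

    cands = []
    for macro in macros:                  # macros compete for K slots
        s = state
        for a in macro:
            s = step(s, a)
            if s is None:                 # macro not applicable here
                break
        else:
            cands.append((s, len(macro)))  # cost = primitive length
    succs += heapq.nsmallest(K, cands, key=lambda c: h(c[0]))
    return succs
```

With K = 1 this adds at most one macro successor per expansion, which is the setting the experiments identify as Pareto-optimal.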

Key contributions

  • Introduce top-K gating that preserves completeness and caps branching at b + K.
  • Propose a macro discovery pipeline integrated with AVI training.
  • Show that learned macros outperform random ones, with learned gating giving up to 28.6% lower expansions versus less than 3% for random macros.
  • Demonstrate that K = 1 is Pareto-optimal, matching larger K with minimal branching overhead.
  • Analyze when lower expansions do and do not translate to faster wall-clock runtime, with observed changes ranging from 21% faster to 11% slower.
| Puzzle | k | Primitive | Gated K = 1 | Δ % |
|---|---|---|---|---|
| 8-puz | 20 | 5.5 ± 0.12 | 4.7 ± 0.10 | -14.6 |
| | 50 | 9.9 ± 0.20 | 7.7 ± 0.16 | -22.2 |
| | 100 | 16.4 ± 0.28 | 11.7 ± 0.21 | **-28.6** |
| 15-puz | 20 | 6.6 ± 0.15 | 6.3 ± 0.15 | -3.6 |
| | 50 | 13.9 ± 0.27 | 12.2 ± 0.23 | -12.1 |
| | 100 | 29.5 ± 1.85 | 24.3 ± 1.64 | **-17.5** |
| | 200 | 67.1 ± 4.01 | 56.7 ± 4.03 | -15.5 |
| | 500 | 149.2 ± 8.16 | 130.3 ± 7.54 | -12.7 |
| 24-puz | 20 | 6.9 ± 0.15 | 6.7 ± 0.15 | -3.2 |
| | 50 | 15.8 ± 0.34 | 14.3 ± 0.30 | **-9.4** |
| | 100 | 35.3 ± 0.92 | 32.4 ± 0.90 | -8.2 |
| | 200 | 103.3 ± 3.55 | 97.4 ± 3.57 | -5.7 |
| | 500 | 394.0 ± 19.6 | 401.0 ± 19.3 | +1.8† |

Table 1. Mean node expansions (× 10³) over solved instances under BWAS (λ = 1.0). Δ % is the relative change versus primitive search; the best result per domain is in bold.

† Solve rate improves from 36.4% to 38.4%, so the solved-instance mean regresses slightly even though more 24-puzzle instances are solved.

Table 1 shows that K = 1 consistently reduces expansions across tested 8-puzzle, 15-puzzle, and 24-puzzle settings, with the strongest reductions at k = 100 for the 8- and 15-puzzle and at k = 50 for the 24-puzzle.

Recommended citation: Vedant Khandelwal et al. (2026). "Top-K Gated Macro-Actions for Neural Heuristic Search." Under review at SOCS 2026.