BTGenBot-2:
Efficient Behavior Tree Generation with Small Language Models

Riccardo Andrea Izzo, Gianluca Bardaro, Matteo Matteucci

Politecnico di Milano

Accepted at the International Joint Conference on Neural Networks (IJCNN 2026)

We present BTGenBot-2, a 1B-parameter open-source small language model (SLM) trained on 5k pairs of natural-language instructions and behavior trees (BTs) from a synthetic instruction-following dataset. BTGenBot-2 achieves state-of-the-art performance in generating directly executable behavior trees. We publicly release our synthetic dataset, model weights, codebase, and the BT Benchmark.

NVIDIA Nova Carter navigation in a warehouse

Divide objects based on specific properties

Multiple NVIDIA Nova Carter navigation in office

Stacking

Limo navigation with fallback in a warehouse

Pick and place

Simulations in NVIDIA Isaac Sim demonstrate the effective execution of generated behavior trees across both navigation and manipulation tasks in our BT Benchmark.

The BTGenBot-2 Model

The model takes as input a natural-language description of a robotic task along with the set of available robot action primitives, and generates a ROS 2-compatible BT in XML format. The base model is adapted with a QLoRA adapter while its pre-trained parameters are kept frozen. Generated BTs are first validated at inference time, with checks for syntax and action-space consistency before execution. At runtime, an inline logger additionally tracks stack traces and blackboard states; if an error occurs, the runtime parser triggers subtree regeneration for recovery.
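The inference-time validation described above can be sketched as follows. This is a minimal illustration, not the released implementation: the function name `validate_bt`, the set of structural tags, and the example action names are all assumptions made for the sketch.

```python
import xml.etree.ElementTree as ET

# Tags treated as structure rather than robot actions. This set is an
# assumption for the sketch; adjust it to the control nodes of the real runtime.
STRUCTURAL_TAGS = {"root", "BehaviorTree", "Sequence", "Fallback", "Parallel"}


def validate_bt(xml_text, allowed_actions):
    """Return a list of problems; an empty list means the tree passed."""
    # Syntax check: the text must parse as well-formed XML.
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return [f"syntax error: {exc}"]

    # Action-space check: every non-structural node must be a known action.
    problems = []
    for node in root.iter():
        if node.tag in STRUCTURAL_TAGS:
            continue
        if node.tag not in allowed_actions:
            problems.append(f"unknown action: {node.tag}")
    return problems


# Hypothetical tree containing one action ("Fly") the robot does not offer.
example = """
<root main_tree_to_execute="MainTree">
  <BehaviorTree ID="MainTree">
    <Sequence>
      <MoveTo goal="shelf_a"/>
      <Pick object="box"/>
      <Fly/>
    </Sequence>
  </BehaviorTree>
</root>
"""

print(validate_bt(example, allowed_actions={"MoveTo", "Pick", "Place"}))
# prints ['unknown action: Fly']
```

A non-empty result would be the natural point to reject the tree or request a regeneration before anything is sent to the robot.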

Dataset

Data Generation

Starting with the TSE dataset, a new instruction-following dataset is created through four key steps: (1) cleanse the raw data using a Python XML parser, (2) for each original BT, use gpt-4o-mini to generate three variants, (3) repeat step 2 with the newly obtained dataset, (4) merge all resulting datasets while producing a natural-language description for each BT.
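Step (1) can be sketched with Python's standard XML parser. The page does not specify what the cleansing involves, so this sketch assumes two things: malformed trees are dropped, and trees that differ only in whitespace are treated as duplicates. The function name `clean_bts` is hypothetical.

```python
import xml.etree.ElementTree as ET


def clean_bts(raw_samples):
    """Keep well-formed, non-duplicate trees as canonical XML strings."""
    seen = set()
    cleaned = []
    for text in raw_samples:
        try:
            root = ET.fromstring(text.strip())
        except ET.ParseError:
            continue  # assumption: malformed XML is simply discarded

        # Remove whitespace-only text so formatting differences do not matter.
        for element in root.iter():
            if element.text is not None and not element.text.strip():
                element.text = None
            if element.tail is not None and not element.tail.strip():
                element.tail = None

        canonical = ET.tostring(root, encoding="unicode")
        if canonical not in seen:
            seen.add(canonical)
            cleaned.append(canonical)
    return cleaned


raw = [
    "<root><Sequence><MoveTo/></Sequence></root>",
    "<root>\n  <Sequence>\n    <MoveTo/>\n  </Sequence>\n</root>",  # same tree, reformatted
    "<root><Sequence>",  # truncated, not well-formed
]
print(len(clean_bts(raw)))  # prints 1
```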

Dataset Sample

A representative example from the generated instruction-following dataset. Each sample has three components: the instruction, which provides system-level context; the input, which comprises a natural-language task description and the corresponding robot actions; and the output, which is the generated XML-based behavior tree.
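The three-component layout can be pictured as a simple record. The sketch below is illustrative only: the keys mirror the components named above, but the nesting of the input, the helper `build_prompt`, and every value are assumptions and are not taken from the released dataset.

```python
# Illustrative record: all values are made up for this sketch.
sample = {
    "instruction": (
        "Generate a behavior tree in XML for the task below, "
        "using only the listed robot actions."
    ),
    "input": {
        "task": "Pick up the box and place it on the shelf.",
        "actions": ["MoveTo", "Pick", "Place"],
    },
    "output": (
        '<root><BehaviorTree ID="MainTree"><Sequence>'
        '<MoveTo goal="box"/><Pick object="box"/>'
        '<MoveTo goal="shelf"/><Place object="box"/>'
        "</Sequence></BehaviorTree></root>"
    ),
}


def build_prompt(record):
    """Join instruction and input into one prompt string (assumed layout)."""
    actions = ", ".join(record["input"]["actions"])
    return (
        f"{record['instruction']}\n\n"
        f"Task: {record['input']['task']}\n"
        f"Actions: {actions}\n"
    )


print(build_prompt(sample))
```

During fine-tuning, the `output` field would serve as the target completion for the prompt built from the other two fields.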

Experiments

Data Curation Pipeline Effectiveness

Average ROUGE and BLEU scores (mean ± std) with increasing dataset size:

Samples    ROUGE          BLEU
600        46.2 ± 1.94    27.4 ± 1.02
1,413      66.2 ± 1.94    48.6 ± 1.85
3,791      77.6 ± 0.80    68.0 ± 1.10
5,204      82.2 ± 0.75    74.8 ± 0.75

Standard deviation is computed across 5 runs with temperature = 0.9.
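A mean ± std summary over repeated runs can be computed as in the sketch below. The scores are hypothetical placeholders, not results from the paper, and the use of the sample standard deviation (rather than the population one) is an assumption, since the page does not say which was used.

```python
from statistics import mean, stdev


def summarize(scores):
    """Format per-run scores as 'mean ± std' using the sample standard deviation."""
    return f"{mean(scores):.1f} ± {stdev(scores):.2f}"


# Hypothetical per-run scores, for illustration only.
runs = [80.0, 82.0, 84.0]
print(summarize(runs))  # prints 82.0 ± 2.00
```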

Preliminary Evaluation

Evaluation on a test set of 250 BTs from the custom instruction-following dataset demonstrates that fine-tuning substantially improves performance compared to the pre-trained baselines, with BTGenBot-2 achieving the highest scores. GPT-5 Thinking performs competitively, with GPT-5 Instant and Claude Opus 4.1 falling slightly behind. The original BTGenBot is outscored by its successor, BTGenBot-2.

Validation

Overall, the results show that BTGenBot-2 delivers the strongest combination of reliability, structural fidelity, and efficiency across tasks and prompting settings. It achieves substantial gains over the previous BTGenBot and consistently outperforms proprietary models, establishing itself as a robust and effective tool for generating complex BTs from natural language instructions.

Simulation

The figure presents two tasks simulated using NVIDIA Isaac Sim. The top panels illustrate a manipulation task where a robotic arm sorts cubes by color, while the bottom panels show a navigation task in which the iw.hub locates at least one box from specified positions. The left column shows the initial state, the center column displays intermediate states with the corresponding BT execution status, and the right column shows the final state. In the BT visualizations, green nodes indicate success, yellow nodes indicate a running process, and red nodes indicate failure.

Real Robot Validation

SO-ARM 101 performing manipulation tasks using behavior trees generated by BTGenBot-2

BibTeX

@article{izzo2026btgenbot,
  title={BTGenBot-2: Efficient Behavior Tree Generation with Small Language Models},
  author={Izzo, Riccardo Andrea and Bardaro, Gianluca and Matteucci, Matteo},
  journal={arXiv preprint arXiv:2602.01870},
  year={2026}
}
