Robots as Tokens: Unified Diffusion Transformer for Coordinated Multi-Robot Trajectory Generation

Ruofei Bai1,2, Jie Chen2,3, Yuxin Cai1,2, Jun Li2, Wei-Yun Yau2, Lihua Xie1

1 Nanyang Technological University, Singapore

2 Agency for Science, Technology and Research, Singapore

3 National University of Singapore, Singapore

TL;DR

Can we train a single generative model to generate multi-robot trajectories in a feed-forward manner?

We propose Robots as Tokens (Roken), a unified generative multi-robot planner, capable of single-robot path planning, coordinated multi-robot planning, and conditional planning, by simply changing the input robot tokens.

Animated overview of the Roken framework: (1) input and tokenization, (2) trajectory generation.
The unified Roken model supports three planning modes: single-robot planning, coordinated multi-robot planning, and conditional planning with partially fixed robot tokens.

Motivation

Robot planning has undergone a paradigm shift from classical optimization to distribution generation. Yet, most existing works propose generative planners for single-robot scenarios, or combine these single-robot planners with iterative post-processing for multi-robot planning.

Roken asks whether coordinated multi-robot trajectories, as a special spatiotemporal distribution, can be learned and generated with one generative model in a feed-forward manner.

Challenges

Method Overview

We propose Robots as Tokens (Roken), a unified generative model based on a diffusion transformer (DiT) that learns the spatiotemporal distribution of coordinated multi-robot trajectories and generates them in a feed-forward manner. The key design of Roken is to represent each robot as a robot token: robot tokens interact with each other through self-attention, and attend to environment map tokens for spatial awareness.

Roken network architecture with robot tokens, map tokens, and trajectory outputs
Roken treats robots as a set of interacting robot tokens, and decodes clean coordinated multi-robot trajectories from the final-layer tokens.
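To make the token interaction concrete, here is a minimal sketch of one block in which robot tokens first attend to each other and then to map tokens. It is an illustration under simplifying assumptions, not the Roken architecture itself: it uses a single head, no learned projections, no normalization or feed-forward layers, and no diffusion timestep conditioning. The names `attend` and `robot_token_block` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(queries, keys_values):
    """Single-head scaled dot-product attention without learned projections.

    Each query is replaced by a softmax-weighted average of `keys_values`.
    """
    scale = math.sqrt(len(queries[0]))
    dim = len(keys_values[0])
    out = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys_values])
        out.append([sum(w * kv[d] for w, kv in zip(weights, keys_values))
                    for d in range(dim)])
    return out


def robot_token_block(robot_tokens, map_tokens):
    """Hypothetical simplified block: robot self-attention, then map cross-attention."""
    # Robot-robot interaction: every robot token attends to all robot tokens.
    mixed = attend(robot_tokens, robot_tokens)
    robots = [[r + m for r, m in zip(tok, mix)]          # residual connection
              for tok, mix in zip(robot_tokens, mixed)]
    # Spatial awareness: every robot token attends to the map tokens.
    context = attend(robots, map_tokens)
    return [[r + c for r, c in zip(tok, ctx)]            # residual connection
            for tok, ctx in zip(robots, context)]
```

Because the block operates on a list of tokens, the same function accepts two robots or twenty; the output always has one token per input robot.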

Pre-Training with Expert Data: In pre-training, Roken learns from expert trajectories of varying team sizes, represented as varying numbers of robot tokens, with random token masking to support conditional trajectory generation. The training objective combines trajectory denoising with two auxiliary tasks, local occupancy reconstruction and waypoint prediction, which strengthen condition injection and spatial understanding.
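The sketch below illustrates one plausible reading of random token masking and the combined objective. Several details are assumptions rather than facts from the source: masked robots are kept clean as conditions while the rest receive plain additive Gaussian noise (a real diffusion model would follow a noise schedule), the denoising loss covers only the noised robots, and the auxiliary weights `w_occ` and `w_wp` are placeholder values. All function names are hypothetical.

```python
import random


def make_training_example(clean_trajs, noise_std, keep_prob, rng):
    """Noise a random subset of robot trajectories; keep the rest clean as conditions.

    clean_trajs: list of per-robot trajectories, each a list of (x, y) waypoints.
    Returns (model_inputs, is_condition); is_condition[i] is True when robot i
    was left clean and therefore acts as a fixed condition token.
    """
    inputs, is_condition = [], []
    for traj in clean_trajs:
        keep = rng.random() < keep_prob
        if keep:
            inputs.append([tuple(p) for p in traj])
        else:
            inputs.append([(x + rng.gauss(0.0, noise_std),
                            y + rng.gauss(0.0, noise_std)) for x, y in traj])
        is_condition.append(keep)
    return inputs, is_condition


def denoising_loss(pred_trajs, clean_trajs, is_condition):
    """Mean squared waypoint error over the robots that were noised (assumption)."""
    errs = []
    for pred, clean, cond in zip(pred_trajs, clean_trajs, is_condition):
        if cond:
            continue  # condition tokens are given, so they carry no denoising loss
        for (px, py), (cx, cy) in zip(pred, clean):
            errs.append((px - cx) ** 2 + (py - cy) ** 2)
    return sum(errs) / len(errs) if errs else 0.0


def total_loss(denoise, occupancy, waypoint, w_occ=0.1, w_wp=0.1):
    """Weighted sum of the denoising loss and the two auxiliary losses."""
    return denoise + w_occ * occupancy + w_wp * waypoint
```

Under this reading, conditional planning at inference time would correspond to supplying some robot tokens clean, for example robots whose paths are already committed, and generating only the remaining ones.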

Post-Training with Reinforcement Learning: After pre-training, the policy is further refined through trajectory-level reinforcement learning to improve its long-horizon planning and safety adherence.

Main Results

Roken is evaluated on unseen cluttered environments under a strict full-success metric that requires goal reaching, obstacle avoidance, inter-robot collision avoidance, and communication connectivity.
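As an illustration of how strict this metric is, the sketch below checks all four criteria on a set of 2D waypoint trajectories and fails if any one is violated. The geometry and thresholds are assumptions made for the example: robots and obstacles are modeled as disks, checks run only at waypoints (not between them), and connectivity means the communication graph is connected, possibly over multiple hops, at every time step. The function names and parameters are hypothetical.

```python
import math


def _connected(points, comm_range):
    """True if the disk graph over `points` with radius `comm_range` is connected."""
    seen, stack = {0}, [0]
    while stack:
        i = stack.pop()
        for j in range(len(points)):
            if j not in seen and math.dist(points[i], points[j]) <= comm_range:
                seen.add(j)
                stack.append(j)
    return len(seen) == len(points)


def full_success(trajs, goals, obstacles, goal_tol, robot_radius, comm_range):
    """trajs[i][t] is robot i's (x, y) at step t; obstacles are (x, y, radius) disks."""
    # 1. Goal reaching: every robot must end within goal_tol of its goal.
    for traj, goal in zip(trajs, goals):
        if math.dist(traj[-1], goal) > goal_tol:
            return False
    for t in range(len(trajs[0])):
        pts = [traj[t] for traj in trajs]
        # 2. Obstacle avoidance: no robot disk may overlap an obstacle disk.
        for p in pts:
            for ox, oy, r in obstacles:
                if math.dist(p, (ox, oy)) < r + robot_radius:
                    return False
        # 3. Inter-robot collision avoidance: no two robot disks may overlap.
        for i in range(len(pts)):
            for j in range(i + 1, len(pts)):
                if math.dist(pts[i], pts[j]) < 2 * robot_radius:
                    return False
        # 4. Communication connectivity at this time step.
        if not _connected(pts, comm_range):
            return False
    return True
```

A trajectory set that reaches every goal but breaks connectivity for a single step still counts as a failure under this kind of all-or-nothing check.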

One Model, Many Trajectories

A single Roken model generates coordinated trajectories for multiple robots in a feed-forward manner.

One Model, Variable Teams

Because each robot is represented as one token, Roken scales naturally across team sizes. After training with mixed team sizes, one Roken model supports variable numbers of robots without structural changes or additional training, simply by changing the number of input robot tokens.

Scalability evaluation across various robot team sizes.
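One common way to handle teams of different sizes in a single batch is to pad the token lists to a shared length and mask the padding inside attention. The source does not say whether Roken does this, so the sketch below is only an assumption about how mixed team sizes could be batched; `pad_teams` and `masked_softmax` are hypothetical helpers.

```python
import math


def pad_teams(teams, pad_token):
    """Pad each team's robot tokens to a common length and return a validity mask."""
    max_n = max(len(team) for team in teams)
    batch, mask = [], []
    for team in teams:
        n_pad = max_n - len(team)
        batch.append(list(team) + [pad_token] * n_pad)
        mask.append([True] * len(team) + [False] * n_pad)
    return batch, mask


def masked_softmax(scores, valid):
    """Softmax that assigns zero weight to padded (invalid) positions."""
    m = max(s for s, v in zip(scores, valid) if v)
    exps = [math.exp(s - m) if v else 0.0 for s, v in zip(scores, valid)]
    total = sum(exps)
    return [e / total for e in exps]
```

With the mask applied to attention scores, padded slots receive zero weight, so a two-robot team and a four-robot team can share a batch without the padding influencing the real robot tokens.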

One Model, New Scenarios

Although it is never trained on partially observed maps, Roken adapts to new environments in which the map is only partially observed, indicating that it generalizes and transfers to new scenarios.

Limitations

Roken can fail in local-minimum scenarios that are underrepresented in the training data. Future improvements may include dataset enhancement, reinforcement learning post-training, and online safety shielding for real-world deployments.

Representative failure cases where the generated trajectories do not escape local minima.

BibTeX

@misc{bai_2026_roken,
  title={Robots as Tokens: Unified Diffusion Transformer for Coordinated Multi-Robot Trajectory Generation}, 
  author={Ruofei Bai and Jie Chen and Yuxin Cai and Jun Li and Wei-Yun Yau and Lihua Xie},
  year={2026},
  eprint={2606.15550},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2606.15550}, 
}