DriveMoE: Mixture-of-Experts for Vision-Language-Action Model in End-to-End Autonomous Driving

Shanghai Jiao Tong University

*Indicates Equal Contribution    Corresponding Author

Visualization of DriveMoE (built upon Drive-π0) on the Bench2Drive benchmark. The Vision MoE selects camera views contextually, defaulting to the rear view if no critical view is identified. The Action MoE activates the top-3 experts based on routing scores.

Abstract

End-to-end autonomous driving (E2E-AD) demands effective processing of multi-view sensory data and robust handling of diverse and complex driving scenarios, particularly rare maneuvers such as aggressive turns. The recent success of the Mixture-of-Experts (MoE) architecture in Large Language Models (LLMs) demonstrates that parameter specialization enables strong scalability. In this work, we propose DriveMoE, a novel MoE-based E2E-AD framework with a Scene-Specialized Vision MoE and a Skill-Specialized Action MoE. DriveMoE is built upon our Vision-Language-Action (VLA) baseline (originally from the embodied AI field), called Drive-π0. Specifically, we add the Vision MoE to Drive-π0 by training a router that dynamically selects relevant cameras according to the driving context. This design mirrors human driving cognition, where drivers selectively attend to crucial visual cues rather than exhaustively processing all visual information. In addition, we add the Action MoE by training another router to activate specialized expert modules for different driving behaviors. Through explicit behavioral specialization, DriveMoE handles diverse scenarios without suffering from the mode averaging that limits existing models. In Bench2Drive closed-loop evaluation experiments, DriveMoE achieves state-of-the-art (SOTA) performance, demonstrating the effectiveness of combining Vision and Action MoE in autonomous driving tasks. We will release our code and models of DriveMoE and Drive-π0.
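To make the two routing mechanisms concrete, below is a minimal, hypothetical PyTorch sketch of the idea described in the abstract: one router scores camera views from a pooled driving-context embedding and keeps the most relevant ones, and a second router activates the top-3 action experts and fuses their outputs with softmax-normalized routing weights. All class names, dimensions, and expert counts here are illustrative assumptions, not the released DriveMoE implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisionRouter(nn.Module):
    """Sketch of a scene-specialized camera router: scores each camera
    view from a pooled context embedding and keeps the top-k views."""
    def __init__(self, ctx_dim: int, num_cameras: int = 6, k: int = 2):
        super().__init__()
        self.score = nn.Linear(ctx_dim, num_cameras)
        self.k = k

    def forward(self, ctx: torch.Tensor) -> torch.Tensor:
        logits = self.score(ctx)                    # (B, num_cameras)
        return logits.topk(self.k, dim=-1).indices  # indices of selected views

class ActionMoE(nn.Module):
    """Sketch of a skill-specialized action MoE: a router produces scores
    over expert heads and the top-3 experts' outputs are combined with
    softmax-normalized routing weights."""
    def __init__(self, dim: int, num_experts: int = 8, top_k: int = 3, action_dim: int = 2):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, action_dim))
            for _ in range(num_experts)
        )
        self.top_k = top_k
        self.action_dim = action_dim

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        scores = self.router(h)                         # (B, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)  # top-3 experts per sample
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros(h.size(0), self.action_dim, device=h.device)
        for slot in range(self.top_k):
            for b in range(h.size(0)):
                e = idx[b, slot].item()
                out[b] += weights[b, slot] * self.experts[e](h[b:b+1]).squeeze(0)
        return out

# Minimal usage: route a batch of pooled driving-context features.
ctx = torch.randn(4, 256)
print(VisionRouter(256)(ctx))      # selected camera indices per sample
print(ActionMoE(256)(ctx).shape)   # fused action prediction, (4, 2)
```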

Challenging Corner Case Results

ParkingExit

ConstructionObstacleTwoWays

InterurbanActorFlow

SignalizedJunctionLeftTurnEnterFlow

BibTeX

@misc{yang2025drivemoe,
  title={DriveMoE: Mixture-of-Experts for Vision-Language-Action Model in End-to-End Autonomous Driving},
  author={Zhenjie Yang and Yilin Chai and Xiaosong Jia and Qifeng Li and Yuqian Shao and Xuekai Zhu and Haisheng Su and Junchi Yan},
  year={2025},
  eprint={2505.16278},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
}