4. XLANG – Large-Scale VLA Dataset Curation & Annotation Pipeline

HKU XLANG Lab, 2025

4.1 Overview

This project aims to enable training of a general Vision–Language–Action (VLA) policy that reliably follows fine-grained, physically grounded language instructions rather than coarse, underspecified commands. Many existing agents can execute high-level prompts such as “open the drawer”, but they often fail to capture the detailed execution intent that real-world manipulation requires, for example “grasp the drawer rim and pull it outward”. The work is motivated by the need to align language with precise action semantics, so that instructions encode contact, motion, and control structure. By emphasizing fine-grained language grounding during pretraining, the resulting VLA policy is intended to support stronger instruction following, improved compositional generalization, and more reliable real-world deployment.

4.2 My Role & Contributions

4.2.1 Automated Dataset Discovery & Paper Filtering

I worked on:

using GPT-based classifiers to identify robot manipulation datasets from new papers,

screening for datasets that include V, L, and A (vision, language, action),

filtering out navigation-only or simulation-only datasets.

This allows the team to rapidly expand its dataset coverage beyond the 72 datasets in Open X-Embodiment (OXE).
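The screening rule above can be sketched as a small filter. This is a minimal illustration, assuming the GPT-based classifier emits one structured record per paper; the `DatasetCandidate` class and all of its field names are hypothetical and do not reflect the project's actual schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DatasetCandidate:
    """Hypothetical per-paper record produced by a classifier."""
    name: str
    has_vision: bool
    has_language: bool
    has_action: bool
    navigation_only: bool
    simulation_only: bool


def keep_candidate(c: DatasetCandidate) -> bool:
    """Keep datasets that have V, L, and A and are not navigation- or simulation-only."""
    has_vla = c.has_vision and c.has_language and c.has_action
    return has_vla and not c.navigation_only and not c.simulation_only


def filter_candidates(cands: List[DatasetCandidate]) -> List[DatasetCandidate]:
    return [c for c in cands if keep_candidate(c)]


# Made-up candidates for illustration only.
candidates = [
    DatasetCandidate("tabletop_pick", True, True, True, False, False),
    DatasetCandidate("indoor_nav", True, True, True, True, False),
    DatasetCandidate("sim_stack", True, True, True, False, True),
    DatasetCandidate("no_language", True, False, True, False, False),
]
kept_names = [c.name for c in filter_candidates(candidates)]
print(kept_names)
```

Keeping the rule as a pure function over structured records makes it easy to audit classifier decisions separately from the filtering logic.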

4.2.2 Dataset Downloading & Format Conversion Infrastructure

(1) I built a large-scale dataset aggregation pipeline that downloads and integrates all OXE datasets together with 20+ additional robotic datasets outside OXE, including AgiBot, RoboMIND, and REASSEMBLE, resulting in a corpus of over 2 million trajectories beyond OXE.

(2) I implemented unified conversion scripts that standardize heterogeneous dataset formats by extracting, reorganizing, and converting all datasets into a single LeRobot-compatible format, enabling consistent large-scale VLA pretraining and seamless dataset expansion.
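One way to organize such conversion scripts is a registry with one converter per source format, all returning the same episode structure. The sketch below is illustrative only: `UnifiedEpisode`, `UnifiedStep`, the two source layouts, and every key name are assumptions made for the example, and the target structure is a simplified stand-in rather than the actual LeRobot dataset schema.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class UnifiedStep:
    image_path: str        # reference to the camera frame for this step
    state: List[float]
    action: List[float]


@dataclass
class UnifiedEpisode:
    source: str
    instruction: str
    steps: List[UnifiedStep]


# Maps a source-format name to its converter function.
CONVERTERS: Dict[str, Callable[[dict], UnifiedEpisode]] = {}


def register(source: str):
    """Decorator that registers a converter under a source-format name."""
    def wrap(fn):
        CONVERTERS[source] = fn
        return fn
    return wrap


@register("format_a")
def convert_format_a(raw: dict) -> UnifiedEpisode:
    # Hypothetical layout: a list of per-step dictionaries.
    steps = [UnifiedStep(s["img"], list(s["qpos"]), list(s["act"]))
             for s in raw["steps"]]
    return UnifiedEpisode("format_a", raw["language_instruction"], steps)


@register("format_b")
def convert_format_b(raw: dict) -> UnifiedEpisode:
    # Hypothetical layout: parallel arrays, one entry per step.
    steps = [UnifiedStep(f, list(q), list(a))
             for f, q, a in zip(raw["frames"], raw["states"], raw["actions"])]
    return UnifiedEpisode("format_b", raw["task"], steps)


def convert(source: str, raw: dict) -> UnifiedEpisode:
    if source not in CONVERTERS:
        raise KeyError(f"no converter registered for {source!r}")
    return CONVERTERS[source](raw)


raw_a = {"language_instruction": "open the drawer",
         "steps": [{"img": "a/0.png", "qpos": [0.0, 0.1], "act": [0.5]}]}
raw_b = {"task": "pick up the cup",
         "frames": ["b/0.png", "b/1.png"],
         "states": [[0.0], [0.1]],
         "actions": [[1.0], [0.0]]}
episodes = [convert("format_a", raw_a), convert("format_b", raw_b)]
```

With this layout, supporting a new dataset means adding one registered function, while downstream training code only ever sees `UnifiedEpisode`.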

4.2.3 Fine-Grained Instruction Annotation (LLM-Driven Human+AI Pipeline)

I designed and implemented a first-pass annotation pipeline that combines large language models (LLMs) with Grounding DINO to automate fine-grained instruction labeling across large-scale robotic datasets. The pipeline generates structured initial annotations for each trajectory, substantially reducing the manual effort required from human annotators.

Specifically, the system:

(1) uses Grounding DINO to ground object references and interaction regions in visual observations,

(2) leverages LLMs (e.g., Gemini) to generate step-level action sequences from raw videos, rewrite coarse template instructions into fine-grained, executable substeps, and identify trajectory-level variations across demonstrations, and

(3) produces consistent annotations across multi-view camera setups without manual alignment.
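The structured first-pass annotation described above might be represented as follows. This is a sketch under stated assumptions: the detector and LLM calls are omitted, their outputs are supplied as plain Python values, and the `Substep` and `TrajectoryAnnotation` classes, the `attach_boxes` helper, and the box coordinates are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# (x_min, y_min, x_max, y_max) in normalized image coordinates.
Box = Tuple[float, float, float, float]


@dataclass
class Substep:
    text: str
    start_frame: int
    end_frame: int
    object_phrase: Optional[str] = None
    object_box: Optional[Box] = None


@dataclass
class TrajectoryAnnotation:
    coarse_instruction: str
    substeps: List[Substep] = field(default_factory=list)
    needs_human_review: bool = True   # a first pass is always unverified


def attach_boxes(substeps: List[Substep], detections: Dict[str, Box]) -> None:
    """Link each substep to the longest detected phrase that its text mentions."""
    phrases = sorted(detections, key=len, reverse=True)
    for step in substeps:
        lowered = step.text.lower()
        for phrase in phrases:
            if phrase.lower() in lowered:
                step.object_phrase = phrase
                step.object_box = detections[phrase]
                break


# Stand-ins for detector output (phrase -> box) and LLM output (substeps).
detections = {
    "drawer": (0.10, 0.40, 0.90, 0.80),
    "drawer rim": (0.40, 0.40, 0.60, 0.45),
}
substeps = [
    Substep("grasp the drawer rim", 0, 41),
    Substep("pull the drawer outward", 42, 97),
]
attach_boxes(substeps, detections)
annotation = TrajectoryAnnotation("open the drawer", substeps)
```

Matching the longest phrase first keeps "drawer rim" from being shadowed by the more general "drawer" detection.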

By providing a high-quality first pass, the pipeline shifts human annotation from full manual labeling to lightweight refinement and verification, significantly reducing annotation time and cost.

This human-in-the-loop workflow scales to thousands of trajectories and enables efficient construction of high-quality, instruction-rich VLA training data.
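To illustrate what lightweight verification could look like, the sketch below runs a simple automatic check on the frame spans of the substeps before a trajectory reaches a human reviewer. It is an assumption about one possible check, not a description of the project's actual tooling; `find_issues` and the inclusive-frame convention are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start_frame, end_frame), both inclusive


def find_issues(spans: List[Span], num_frames: int) -> List[str]:
    """Report gaps and overlaps in substep coverage of a trajectory."""
    if not spans:
        return ["no substeps"]
    issues = []
    ordered = sorted(spans)
    if ordered[0][0] > 0:
        issues.append("gap at start")
    # Compare each span with the one that follows it.
    for (s0, e0), (s1, e1) in zip(ordered, ordered[1:]):
        if s1 <= e0:
            issues.append(f"overlap between frames {s1} and {e0}")
        elif s1 > e0 + 1:
            issues.append(f"gap between frames {e0 + 1} and {s1 - 1}")
    if ordered[-1][1] < num_frames - 1:
        issues.append("gap at end")
    return issues


clean = find_issues([(0, 9), (10, 19)], num_frames=20)
gapped = find_issues([(0, 9), (12, 19)], num_frames=20)
print(clean, gapped)
```

Trajectories with an empty issue list could go straight to quick verification, while flagged ones are routed to a reviewer with the specific problem attached.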