Robust Motion Generation using Part-level Reliable Data from Videos

Boyuan Li1, Sipeng Zheng4, Bin Cao3, Ruihua Song1*, Zongqing Lu2,4
1Renmin University of China, 2Peking University, 3CASIA, 4BeingBeyond
*Corresponding author
Figure: RoPAR method overview. RoPAR leverages part-level reliable data from web videos to generate high-quality human motion.

Abstract

Extracting human motion from large-scale web videos offers a scalable solution to the data-scarcity problem in character animation. However, in many video frames some body parts are invisible due to off-screen framing or occlusion. This creates a dilemma: discarding every sequence with a missing part limits dataset scale and diversity, while retaining such data compromises quality and model performance.

To address this problem, we propose leveraging credible part-level data extracted from videos to enhance motion generation via a robust part-aware masked autoregressive model. First, we decompose the human body into five parts and mark the parts clearly visible in a video frame as "credible". Second, the credible parts are encoded into latent tokens by our proposed part-aware variational autoencoder. Third, we propose a robust part-level masked generation model that predicts masked credible parts while ignoring noisy ones.

In addition, we contribute K700-M, a challenging new benchmark comprising approximately 200k real-world motion sequences. Experimental results show that our method outperforms baselines on both clean and noisy datasets in terms of motion quality, semantic consistency, and diversity.

Problem & Motivation

Web videos contain rich human motion data, but many frames suffer from occlusion or off-screen framing. Traditional approaches face a dilemma: discarding incomplete data limits dataset scale and diversity, while keeping it compromises quality.

Motivation

Figure: Detecting credible parts using per-joint confidence from ViTPose.

Key Insight: Not all body parts are equally noisy in a given frame. By identifying and leveraging only the reliable (credible) parts, we can dramatically expand usable training data without compromising model quality.

Method Overview

Three Key Components

1. Identifying Credible Parts

We use ViTPose to obtain per-joint confidence scores, which indicate both visibility and detection accuracy. Body parts (torso, left/right arm, left/right leg) whose average joint confidence exceeds a threshold τ are marked as "credible".
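
The selection rule is simple to state in code. Below is a minimal sketch, not the authors' released implementation: the COCO-style joint grouping, the example confidence values, and the default threshold are illustrative assumptions.

import numpy as np

# Hypothetical COCO-style joint indices grouped into the five body parts.
PARTS = {
    "torso":     [0, 1, 2, 3, 4, 5, 6, 11, 12],  # head, shoulders, hips
    "left_arm":  [5, 7, 9],
    "right_arm": [6, 8, 10],
    "left_leg":  [11, 13, 15],
    "right_leg": [12, 14, 16],
}

def credible_parts(joint_conf: np.ndarray, tau: float = 0.5) -> dict:
    """Mark a part as credible if its mean joint confidence exceeds tau.

    joint_conf: (17,) per-joint confidence scores for one frame (e.g. from ViTPose).
    Returns a {part_name: bool} credibility mask for that frame.
    """
    return {name: float(joint_conf[idx].mean()) > tau for name, idx in PARTS.items()}

# Example: a frame where the lower body is cut off at the bottom of the shot.
conf = np.array([0.9, 0.9, 0.9, 0.8, 0.8, 0.9, 0.9, 0.8, 0.8,
                 0.7, 0.7, 0.6, 0.6, 0.2, 0.1, 0.1, 0.1])
print(credible_parts(conf))  # both legs fall below tau and are excluded

Only the parts that pass this test enter the training pipeline; the rest of the frame is kept rather than discarded.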

2. Part-aware Variational Autoencoder (P-VAE)

We compress credible part-level motion into a compact latent space using a shared-parameter VAE. This prevents noisy parts from corrupting the learned representations while maintaining spatial consistency across body parts.
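
To make the shared-parameter idea concrete, here is a minimal PyTorch sketch of a part-level VAE. The plain MLP encoder/decoder, the dimensions, and the learned part embedding are our assumptions for illustration, not the paper's architecture.

import torch
import torch.nn as nn

class PartVAE(nn.Module):
    def __init__(self, part_dim=64, latent_dim=32, num_parts=5, hidden=256):
        super().__init__()
        # One encoder/decoder shared by all five parts; a learned part
        # embedding tells the network which part it is processing.
        self.part_emb = nn.Embedding(num_parts, hidden)
        self.enc = nn.Sequential(nn.Linear(part_dim, hidden), nn.ReLU())
        self.to_mu = nn.Linear(hidden, latent_dim)
        self.to_logvar = nn.Linear(hidden, latent_dim)
        self.dec = nn.Sequential(
            nn.Linear(latent_dim + hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, part_dim),
        )

    def forward(self, x, part_id):
        # x: (B, part_dim) pose features of one part; part_id: (B,) in [0, 5)
        e = self.part_emb(part_id)
        h = self.enc(x) + e
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        recon = self.dec(torch.cat([z, e], dim=-1))
        return recon, mu, logvar

Because the weights are shared, every credible part, regardless of which limb it comes from, shapes the same latent space, and parts flagged as noisy simply never enter the reconstruction loss.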

3. Robust Part-aware Generation Model (RoPAR)

Our masked autoregressive model selectively learns from credible tokens while unconditionally masking noisy ones. A diffusion head further refines the output for high-quality, diverse motion generation.
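
The robust training objective can be sketched as follows. This illustration keeps only the masking logic: the autoregressive ordering, text conditioning, and diffusion head of the full model are omitted, and the predictor, masking ratio, and tensor shapes are placeholders.

import torch

def robust_masked_loss(predictor, tokens, credible):
    """Compute reconstruction loss only over masked-but-credible tokens.

    tokens:   (B, T, P, D) latent tokens for T frames x P = 5 body parts.
    credible: (B, T, P) boolean mask from the part-detection stage.
    """
    B, T, P, D = tokens.shape
    # Randomly choose a subset of the *credible* tokens as prediction targets.
    rand_mask = (torch.rand(B, T, P, device=tokens.device) < 0.4) & credible
    # Hide the targets and, unconditionally, every noisy part.
    input_mask = rand_mask | ~credible
    visible = tokens * (~input_mask).unsqueeze(-1).float()

    pred = predictor(visible)                   # (B, T, P, D) predictions
    err = ((pred - tokens) ** 2).mean(dim=-1)   # per-token squared error
    # Average only over masked-but-credible positions, so noisy parts
    # never contribute gradients.
    return (err * rand_mask.float()).sum() / rand_mask.float().sum().clamp(min=1)

In the full model, the simple squared-error target above would be replaced by the diffusion head's denoising objective.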

Experimental Results

Quantitative Comparisons

We compare our method with state-of-the-art baselines on both the HumanML3D (clean) and K700-M (noisy) datasets. Our method achieves superior performance, especially on the challenging noisy dataset.

Table 1: Comparison with state-of-the-art methods on the HumanML3D and K700-M datasets. Our method (RoPAR) achieves the best performance on the noisy K700-M dataset across all metrics.

Robustness Analysis

Figure: Sensitivity analysis. Our method maintains stable performance across different noise levels, while baselines degrade significantly.

Qualitative Results

Figure: Qualitative comparisons. Our method generates motions with richer details and more natural transitions, especially for actions typically filmed in close-up (e.g., "sitting and playing the drum").

K700-M Dataset

We introduce K700-M, a large-scale real-world motion dataset extracted from Kinetics-700 videos. The dataset contains:

  • 198,627 motion sequences extracted from real-world YouTube videos
  • Wide variety of scenes, lighting conditions, and camera angles
  • Text annotations generated using Gemini for each clip
  • Represents real-world challenges with occlusions and partial views

The dataset will be released upon publication.

BibTeX

@inproceedings{li2026ropar,
  title={Robust Motion Generation using Part-level Reliable Data from Videos},
  author={Li, Boyuan and Zheng, Sipeng and Cao, Bin and Song, Ruihua and Lu, Zongqing},
  year={2026}
}