Extracting human motion from large-scale web videos offers a scalable solution to the data scarcity issue in character animation.
However, in many video frames some body parts are invisible because they fall outside the camera view or are occluded.
This creates a dilemma: discarding every sequence with a missing part limits scale and diversity, while retaining such data compromises quality and model performance.
To address this problem, we propose leveraging credible part-level data extracted from videos to enhance motion generation via a robust part-aware masked autoregressive model.
First, we decompose the human body into five parts and mark the parts that are clearly visible in a video frame as "credible".
Second, the credible parts are encoded into latent tokens by our proposed part-aware variational autoencoder.
Third, we propose a robust part-level masked generation model that predicts masked credible parts while ignoring noisy, non-credible ones (a simplified sketch of this part-level masking follows).
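To make the part-level masking concrete, here is a minimal sketch, not the paper's implementation: the five-part grouping, the `PartAwareEncoder` stand-in (a per-part MLP rather than the actual part-aware VAE), and the `robust_masked_loss` name are all illustrative assumptions. It shows the core idea that only tokens which are both masked for prediction and credible contribute to the training loss, so noisy parts yield no gradient.

```python
# Hypothetical sketch of part-level credibility masking; names, shapes,
# and the five-part split are assumptions, not the paper's released code.
import torch
import torch.nn as nn

NUM_PARTS = 5   # e.g. torso, left/right arm, left/right leg (assumed split)
TOKEN_DIM = 64  # latent token size per part (assumed)

class PartAwareEncoder(nn.Module):
    """Toy stand-in for the part-aware VAE encoder: one small MLP per part."""
    def __init__(self, joint_dim: int):
        super().__init__()
        self.encoders = nn.ModuleList([
            nn.Sequential(nn.Linear(joint_dim, 128), nn.ReLU(),
                          nn.Linear(128, TOKEN_DIM))
            for _ in range(NUM_PARTS)
        ])

    def forward(self, parts):  # parts: (B, T, NUM_PARTS, joint_dim)
        # Encode each body part with its own encoder, then restack.
        return torch.stack(
            [enc(parts[:, :, p]) for p, enc in enumerate(self.encoders)], dim=2
        )  # (B, T, NUM_PARTS, TOKEN_DIM)

def robust_masked_loss(pred, target, credible, train_mask):
    """MSE over tokens that are both masked for prediction and credible;
    non-credible (noisy) parts contribute no gradient."""
    # credible, train_mask: (B, T, NUM_PARTS) boolean
    weight = (credible & train_mask).float().unsqueeze(-1)
    denom = (weight.sum() * pred.shape[-1]).clamp(min=1.0)
    return ((pred - target) ** 2 * weight).sum() / denom

if __name__ == "__main__":
    B, T, J = 2, 16, 12                             # batch, frames, per-part feature dim (assumed)
    parts = torch.randn(B, T, NUM_PARTS, J)
    credible = torch.rand(B, T, NUM_PARTS) > 0.3    # e.g. from a visibility detector
    train_mask = torch.rand(B, T, NUM_PARTS) > 0.5  # parts randomly masked for prediction
    tokens = PartAwareEncoder(J)(parts)
    pred = tokens + 0.1 * torch.randn_like(tokens)  # placeholder generator output
    print(robust_masked_loss(pred, tokens, credible, train_mask).item())
```

In this sketch the credibility mask simply zeroes the loss on invisible parts; the actual model would instead perform masked autoregression over the VAE's latent tokens.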
In addition, we contribute K700-M, a challenging new evaluation benchmark comprising approximately 200k real-world motion sequences.
Experimental results show that our method outperforms baselines on both clean and noisy datasets in terms of motion quality, semantic consistency, and diversity.