Qianli Ma

Hi there! I am a senior research scientist and tech lead at NVIDIA CosmosLab working on world model pretraining and distillation. I obtained my PhD at the Max Planck Institute for Intelligent Systems, Tübingen and ETH Zürich, advised by Michael Black and Siyu Tang.

Selected Publications

2026
CVPR

SAGE: Scalable Agentic 3D Scene Generation for Embodied AI

An agentic framework for generating realistic, physically valid, simulator-ready 3D scenes at scale — directly deployable for embodied policy training.
2026
ICLR

Large Scale Diffusion Distillation via Score-Regularized Continuous-Time Consistency

Kaiwen Zheng, Yuji Wang, Qianli Ma, Huayu Chen, Jintao Zhang, Yogesh Balaji, Jianfei Chen, Ming-Yu Liu, Jun Zhu, Qinsheng Zhang
Consistency distillation meets reverse-divergence regularization: scaling diffusion model distillation to 14B parameters, with up to 50× faster sampling that remains high-quality and diverse.
2025
Technical Report

DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos

NVIDIA: Qianli Ma (contributor)
A generalist robot world model pretrained on human egocentric videos, with strong generalization to diverse objects and environments and real-time 10 FPS long-horizon rollouts.
2025
Technical Report

Cosmos Predict and Transfer 2.5

NVIDIA: Qianli Ma (core contributor)
Improved world simulation with video foundation models for Physical AI.
2025
CVPR

CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models

Generating subgoal images as a visual chain-of-thought process that improves the fidelity of long-horizon sequential content generation.
2025
CVPR

Articulated Kinematics Distillation from Video Diffusion Models

Physically plausible text-to-4D animation achieved by distilling realistic motion from video generative models onto rigged 3D characters.
2025
Technical Report

Cosmos-Transfer1: Conditional World Generation with Adaptive Multimodal Control

NVIDIA: Qianli Ma (core contributor)
High-quality video synthesis with adaptive, multimodal spatial control.
2025
Technical Report
Best of CES '25

Cosmos World Foundation Model Platform for Physical AI

NVIDIA: Qianli Ma (core contributor)
The first Cosmos foundation model for controllable, large-scale video generation.
2024
Technical Report

Edify 3D: Scalable High-Quality 3D Asset Generation

NVIDIA: Qianli Ma (core contributor)
Generating production-ready 3D assets — mesh, textures, and materials — from text or image prompts in under two minutes.
2024
CVPR

Inferring Dynamics from Point Trajectories

How to infer scene dynamics from sparse point trajectory observations? Here's a simple yet effective solution using a spatiotemporal MLP with carefully designed regularizations. No need for scene-specific priors.
2023
ICCV
Oral

Dynamic Point Fields

Explicit point-based representation + implicit deformation field = dynamic surface models with instant inference and high-quality geometry. Robust single-scan animation of challenging clothing types even under extreme poses.
2023
ICCV
Oral

Probabilistic Human Mesh Recovery in 3D Scenes from Egocentric Views

Scene-conditioned diffusion model + collision-guided sampling = accurate character mesh recovery on observed pixels and plausible generation of unobserved parts.
2022
3DV

Neural Point-based Shape Modeling of Humans in Challenging Clothing

The power of point-based digital human representations further unleashed: SkiRT models dynamic shapes of 3D clothed humans including those that wear challenging outfits such as skirts and dresses.
2022
ECCV

EgoBody: Human Body Shape and Motion of Interacting People from Head-Mounted Devices

A large-scale dataset of 3D human characters interacting in 3D scenes, with multi-modal visual data from both third- and first-person views.
2021
NeurIPS

MetaAvatar: Learning Animatable Clothed Human Models from Few Depth Images

Creating an avatar of unseen subjects from as few as eight monocular depth images using a meta-learned, multi-subject, articulated, neural signed distance field model for clothed humans.
2021
ICCV

The Power of Points for Modeling Humans in Clothing

PoP — a point-based, unified model for multiple subjects and outfits that can turn a single, static 3D scan into an animatable avatar with natural pose-dependent clothing deformations.
2021
CVPR

SCALE: Modeling Clothed Humans with a Surface Codec of Articulated Local Elements

Modeling pose-dependent shapes of clothed humans explicitly with hundreds of articulated surface elements: the clothing deforms naturally even in the presence of topological change.
2021
CVPR
Best Paper Candidate

SCANimate: Weakly Supervised Learning of Skinned Clothed Avatar Networks

Cycle-consistent implicit skinning fields + locally pose-aware implicit function = a fully animatable avatar with implicit surface from raw scans without surface registration.
2020
3DV

PLACE: Proximity Learning of Articulation and Contact in 3D Environments

An explicit representation for 3D person-scene contact relations that enables automated synthesis of realistic humans posed naturally in a given scene.
2020
CVPR

Learning to Dress 3D People in Generative Clothing

CAPE: a graph-CNN-based generative model and a large-scale dataset of clothed 3D human meshes in varied poses and garment types.