Qianli Ma

Hi there! I am a Research Scientist at NVIDIA Research, where I train the Cosmos World Foundation Models to understand and simulate the physical world. I obtained my PhD from the Max Planck Institute for Intelligent Systems in Tübingen and ETH Zürich, advised by Michael Black and Siyu Tang.

Selected Publications

2025
Technical Report

World Simulation with Video Foundation Models for Physical AI

NVIDIA: Qianli Ma (core contributor)
The Cosmos 2.5 video foundation models for controllable simulation in physical AI applications such as robotics and autonomous vehicles.
2025
CVPR

CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models

Generating subgoal images as a visual chain-of-thought process that guides action generation and achieves superior robot manipulation performance.
2025
CVPR

Articulated Kinematics Distillation from Video Diffusion Models

Physically plausible text-to-4D animation: distilling realistic motion from video generative models onto rigged 3D characters, grounded in a physics-based simulator.
2025
Technical Report

Cosmos-Transfer1: Conditional World Generation with Adaptive Multimodal Control

NVIDIA: Qianli Ma (core contributor)
High-quality video simulation of the physical world with adaptive, multimodal spatial control.
2025
Technical Report
Best of CES '25

Cosmos World Foundation Model Platform for Physical AI

NVIDIA: Qianli Ma (core contributor)
The first Cosmos video foundation models for controllable, large-scale simulation of future physical-world states.
2024
Technical Report

Edify 3D: Scalable High-Quality 3D Asset Generation

NVIDIA: Qianli Ma (core contributor)
Generating production-ready 3D assets — mesh, textures, and materials — from text or image prompts in under two minutes.
2024
CVPR

Inferring Dynamics from Point Trajectories

How can we infer scene dynamics from sparse point-trajectory observations? Here's a simple yet effective solution: a spatiotemporal MLP with carefully designed regularizations, no scene-specific priors needed.
2023
ICCV
Oral

Dynamic Point Fields

Explicit point-based representation + implicit deformation field = dynamic surface models with instant inference and high-quality geometry. Robust single-scan animation of challenging clothing types, even under extreme poses.
2023
ICCV
Oral

Probabilistic Human Mesh Recovery in 3D Scenes from Egocentric Views

Generative human mesh recovery for images with body occlusions and truncations: scene-conditioned diffusion model + collision-guided sampling = accurate pose estimation for observed body parts and plausible generation of unobserved ones.
2022
3DV

Neural Point-based Shape Modeling of Humans in Challenging Clothing

The power of point-based digital human representations, further unleashed: SkiRT models the dynamic shapes of 3D clothed humans, including those wearing challenging outfits such as skirts and dresses.
2022
ECCV

EgoBody: Human Body Shape and Motion of Interacting People from Head-Mounted Devices

A large-scale dataset of accurate 3D body shape, pose, and motion of humans interacting in 3D scenes, with multimodal streams from third-person and egocentric views captured by Azure Kinects and a HoloLens 2.
2021
NeurIPS

MetaAvatar: Learning Animatable Clothed Human Models from Few Depth Images

Creating avatars of unseen subjects from as few as eight monocular depth images, using a meta-learned, multi-subject, articulated neural signed distance field model for clothed humans.
2021
ICCV

The Power of Points for Modeling Humans in Clothing

PoP: a unified, point-based model for multiple subjects and outfits that turns a single static 3D scan into an animatable avatar with natural pose-dependent clothing deformations.
2021
CVPR

SCALE: Modeling Clothed Humans with a Surface Codec of Articulated Local Elements

Modeling pose-dependent shapes of clothed humans explicitly with hundreds of articulated surface elements: the clothing deforms naturally even in the presence of topological change.
2021
CVPR
Best Paper Candidate

SCANimate: Weakly Supervised Learning of Skinned Clothed Avatar Networks

Cycle-consistent implicit skinning fields + a locally pose-aware implicit function = a fully animatable avatar with an implicit surface, learned from raw scans without surface registration.
2020
3DV

PLACE: Proximity Learning of Articulation and Contact in 3D Environments

An explicit representation for 3D person-scene contact relations that enables automated synthesis of realistic humans posed naturally in a given scene.
2020
CVPR

Learning to Dress 3D People in Generative Clothing

CAPE: a graph-CNN-based generative model and a large-scale dataset of clothed 3D human meshes spanning varied poses and garment types.