Qianli Ma

Hi there! I am a research scientist at NVIDIA Research. I obtained my PhD at the Max Planck Institute for Intelligent Systems and ETH Zürich, co-advised by Michael Black and Siyu Tang. Prior to that, I received a Master's degree in Optics and Photonics from Karlsruhe Institute of Technology and a Bachelor's degree in Physics from Peking University.

My research uses machine learning to solve computer vision and graphics problems, with a current focus on generative modeling and reconstruction of dynamic 3D/4D scenes.

Email / Google Scholar / Twitter / GitHub

Publications

Articulated Kinematics Distillation from Video Diffusion Models
Xuan Li, Qianli Ma, Tsung-Yi Lin, Yongxin Chen, Chenfanfu Jiang, Ming-Yu Liu, Donglai Xiang
CVPR, 2025
Project Page / arXiv

@inproceedings{li2025articulated,
  title = {Articulated Kinematics Distillation from Video Diffusion Models},
  author = {Li, Xuan and Ma, Qianli and Lin, Tsung-Yi and Chen, Yongxin and Jiang, Chenfanfu and Liu, Ming-Yu and Xiang, Donglai},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = jun,
  year = {2025}
}

Physically plausible text-to-4D animation: distill realistic motion from video generative models onto rigged 3D characters, grounded in a physics-based simulator. The result: superior 3D consistency and expressive motion quality.

CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models
Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu, Zhuoyang Zhang, Yecheng Wu, Zhaoshuo Li, Qianli Ma, Song Han, Chelsea Finn, Ankur Handa, Ming-Yu Liu, Donglai Xiang, Gordon Wetzstein, Tsung-Yi Lin
CVPR, 2025
Project Page / arXiv

@inproceedings{zhao2025cot,
  title = {{CoT-VLA}: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models},
  author = {Zhao, Qingqing and Lu, Yao and Kim, Moo Jin and Fu, Zipeng and Zhang, Zhuoyang and Wu, Yecheng and Li, Zhaoshuo and Ma, Qianli and Han, Song and Finn, Chelsea and Handa, Ankur and Liu, Ming-Yu and Xiang, Donglai and Wetzstein, Gordon and Lin, Tsung-Yi},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = jun,
  year = {2025}
}

For complex robot manipulation tasks, CoT-VLA generates subgoal images as a visual chain-of-thought process that guides action generation and achieves superior manipulation performance.

Degrees of Freedom Matter: Inferring Dynamics from Point Trajectories
Yan Zhang, Sergey Prokudin, Marko Mihajlovic, Qianli Ma, Siyu Tang
CVPR, 2024
Project Page / Code / arXiv / Video

@inproceedings{zhang2024degrees,
  title = {Degrees of Freedom Matter: Inferring Dynamics from Point Trajectories},
  author = {Zhang, Yan and Prokudin, Sergey and Mihajlovic, Marko and Ma, Qianli and Tang, Siyu},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  pages = {2018--2028},
  month = jun,
  year = {2024}
}

How can we infer scene dynamics from sparse point trajectory observations? Here's a simple yet effective solution using a spatiotemporal MLP with carefully designed regularizations. No need for scene-specific priors.

Dynamic Point Fields
Sergey Prokudin, Qianli Ma, Maxime Raafat, Julien Valentin, Siyu Tang
ICCV, 2023 (Oral)
Project Page / Code / Colab / arXiv / Video

@inproceedings{prokudin2023dynamic,
  title={Dynamic Point Fields},
  author={Prokudin, Sergey and Ma, Qianli and Raafat, Maxime and Valentin, Julien and Tang, Siyu},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  pages = {7964--7976},
  month = oct,
  year={2023}
}

Explicit point-based representation + implicit deformation field = dynamic surface models with instant inference and high-quality geometry. Robust single-scan animation of challenging clothing types even under extreme poses.

Probabilistic Human Mesh Recovery in 3D Scenes from Egocentric Views
Siwei Zhang, Qianli Ma, Yan Zhang, Sadegh Aliakbarian, Darren Cosker, Siyu Tang
ICCV, 2023 (Oral)
Project Page / Code / arXiv / Video

@inproceedings{zhang2023egohmr,
  title = {Probabilistic Human Mesh Recovery in 3D Scenes from Egocentric Views},
  author = {Zhang, Siwei and Ma, Qianli and Zhang, Yan and Aliakbarian, Sadegh and Cosker, Darren and Tang, Siyu},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  pages = {7989--8000},
  month = oct,
  year = {2023}
}

Generative human mesh recovery for images with body occlusions and truncations: scene-conditioned diffusion model + collision-guided sampling = accurate pose estimation on observed body parts and plausible generation of unobserved parts.

Neural Point-based Shape Modeling of Humans in Challenging Clothing
Qianli Ma, Jinlong Yang, Michael J. Black, Siyu Tang
3DV, 2022
Project Page / Code / arXiv

@inproceedings{SkiRT:3DV:2022,
  title = {Neural Point-based Shape Modeling of Humans in Challenging Clothing},
  author = {Ma, Qianli and Yang, Jinlong and Black, Michael J. and Tang, Siyu},
  booktitle = {International Conference on 3D Vision (3DV)},
  pages = {679--689},
  month = sep,
  year = {2022}
}

The power of point-based digital human representations further unleashed: SkiRT models dynamic shapes of 3D clothed humans including those that wear challenging outfits such as skirts and dresses.

EgoBody: Human Body Shape and Motion of Interacting People from Head-Mounted Devices
Siwei Zhang, Qianli Ma, Yan Zhang, Zhiyin Qian, Taein Kwon, Marc Pollefeys, Federica Bogo, Siyu Tang
ECCV, 2022
Project Page / Code / Dataset / arXiv / Video

@inproceedings{Egobody:ECCV:2022,
  title = {{EgoBody}: Human Body Shape and Motion of Interacting People from Head-Mounted Devices},
  author = {Zhang, Siwei and Ma, Qianli and Zhang, Yan and Qian, Zhiyin and Kwon, Taein and Pollefeys, Marc and Bogo, Federica and Tang, Siyu},
  booktitle = {European Conference on Computer Vision (ECCV)},
  month = oct,
  year = {2022}
}

A large-scale dataset of accurate 3D body shape, pose and motion of humans interacting in 3D scenes, with multi-modal streams from third-person and egocentric views, captured by Azure Kinects and a HoloLens 2.

The Power of Points for Modeling Humans in Clothing
Qianli Ma, Jinlong Yang, Siyu Tang, Michael J. Black
ICCV, 2021
Project Page / Code / Dataset / arXiv / Video

@inproceedings{POP:ICCV:2021,
  title = {The Power of Points for Modeling Humans in Clothing},
  author = {Ma, Qianli and Yang, Jinlong and Tang, Siyu and Black, Michael J.},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  pages = {10974--10984},
  month = oct,
  year = {2021},
}

PoP — a point-based, unified model for multiple subjects and outfits that can turn a single, static 3D scan into an animatable avatar with natural pose-dependent clothing deformations.

MetaAvatar: Learning Animatable Clothed Human Models from Few Depth Images
Shaofei Wang, Marko Mihajlovic, Qianli Ma, Andreas Geiger, Siyu Tang
NeurIPS, 2021
Project Page / Code / arXiv / Video

@inproceedings{MetaAvatar:NeurIPS:2021,
  title = {{MetaAvatar}: Learning Animatable Clothed Human Models from Few Depth Images},
  author={Wang, Shaofei and Mihajlovic, Marko and Ma, Qianli and Geiger, Andreas and Tang, Siyu},
  booktitle={Advances in Neural Information Processing Systems},
  volume={34},
  pages={2810--2822},
  month=dec,
  year={2021}
}

Creating an avatar of unseen subjects from as few as eight monocular depth images using a meta-learned, multi-subject, articulated, neural signed distance field model for clothed humans.

SCALE: Modeling Clothed Humans with a Surface Codec of Articulated Local Elements
Qianli Ma, Shunsuke Saito, Jinlong Yang, Siyu Tang, Michael J. Black
CVPR, 2021
Project Page / Code / arXiv / Video

@inproceedings{SCALE:CVPR:2021,
  title = {{SCALE}: Modeling Clothed Humans with a Surface Codec of Articulated Local Elements},
  author = {Ma, Qianli and Saito, Shunsuke and Yang, Jinlong and Tang, Siyu and Black, Michael J.},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  pages = {16082--16093},
  month = jun,
  year = {2021},
}

Modeling pose-dependent shapes of clothed humans explicitly with hundreds of articulated surface elements: the clothing deforms naturally even in the presence of topological change.

SCANimate: Weakly Supervised Learning of Skinned Clothed Avatar Networks
Shunsuke Saito, Jinlong Yang, Qianli Ma, Michael J. Black
CVPR, 2021 (Best Paper Candidate)
Project Page / Code / arXiv / Video

@inproceedings{SCANimate:CVPR:2021,
  title={{SCANimate}: Weakly Supervised Learning of Skinned Clothed Avatar Networks},
  author={Saito, Shunsuke and Yang, Jinlong and Ma, Qianli and Black, Michael J.},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  pages={2886--2897},
  month=jun,
  year={2021}
}

Cycle-consistent implicit skinning fields + locally pose-aware implicit function = a fully animatable avatar with implicit surface from raw scans without surface registration.

PLACE: Proximity Learning of Articulation and Contact in 3D Environments
Siwei Zhang, Yan Zhang, Qianli Ma, Michael J. Black, Siyu Tang
3DV, 2020
Project Page / Code / arXiv / Video

@inproceedings{PLACE:3DV:2020,
  title = {{PLACE}: Proximity Learning of Articulation and Contact in {3D} Environments},
  author = {Zhang, Siwei and Zhang, Yan and Ma, Qianli and Black, Michael J. and Tang, Siyu},
  booktitle = {International Conference on 3D Vision (3DV)},
  pages = {642--651},
  month = nov,
  year = {2020}
}

An explicit representation for 3D person-scene contact relations that enables automated synthesis of realistic humans posed naturally in a given scene.

Learning to Dress 3D People in Generative Clothing
Qianli Ma, Jinlong Yang, Anurag Ranjan, Sergi Pujades, Gerard Pons-Moll, Siyu Tang, Michael J. Black
CVPR, 2020
Project Page / Code / Dataset / arXiv / Full Video / 1-min Video

@inproceedings{CAPE:CVPR:20,
  title = {Learning to Dress {3D} People in Generative Clothing},
  author = {Ma, Qianli and Yang, Jinlong and Ranjan, Anurag and Pujades, Sergi and Pons-Moll, Gerard and Tang, Siyu and Black, Michael J.},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  pages={6468--6477},
  month = jun,
  year = {2020}
}

CAPE — a graph-CNN-based generative model and a large-scale dataset for 3D human meshes in clothing in varied poses and garment types.