HumanNOVA: Photorealistic, Universal and Rapid 3D Human Avatar Modeling from a Single Image

Overview

HumanNOVA performs photorealistic, universal, and rapid 3D human avatar modeling from a single image. It benefits from both our large-scale generated data and a feed-forward model design. Our data generation pipeline expands the training data by 20 times, as visualized in the top-left. With this data, HumanNOVA outperforms existing methods while maintaining rapid inference, as shown in the top-right. Once trained, it is universal: it requires no test-time fine-tuning or adaptation. Qualitative results at the bottom show that HumanNOVA produces more precise, photorealistic reconstructions than the state-of-the-art SiTH method.

Abstract

In this paper, we present HumanNOVA, a photorealistic, universal, and rapid model for generating 3D human avatars from a single RGB image. Achieving both photorealism and generalization is challenging due to the scarcity of diverse, high-quality 3D human data. To address this, we build a scalable data generation pipeline that follows two strategies. The first leverages existing rigged assets and animates them with extensive poses from daily life. The second uses existing multi-camera captures of humans and employs fitting to generate more diverse views for training. Together, these two strategies enable us to scale up to 100k assets, significantly enhancing both the quantity and the diversity of data for robust model training. In terms of architecture, HumanNOVA adopts a feed-forward, token-conditioned avatar modeling framework that runs inference in less than one second and requires no test-time optimization. Given an input image and an estimated simplified human mesh (SMPL) without detailed geometry or appearance, the model first encodes both inputs into compact token representations. These tokens then act as conditioning signals and are fused through cross-attention to construct a triplane-based 3D avatar representation. Extensive experiments on multiple benchmarks demonstrate the superiority of our approach, both quantitatively and qualitatively, as well as its robustness under diverse input image conditions.

Approach

Data generation pipeline. HumanNOVA builds a scalable data generation pipeline to expand the diversity and quantity of training data. We leverage both existing rigged human assets animated with diverse daily poses and multi-camera human captures fitted to generate diverse training views. This enables large-scale training with diverse identities, poses, appearances, and viewpoints.
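As a rough illustration of how additional training viewpoints might be laid out around a subject, the sketch below places cameras on a circular orbit and builds a camera-to-world matrix for each one. The function names (`look_at`, `orbit_cameras`), the radius, the elevation, and the OpenGL-style axis convention are all assumptions made for this example; the camera sampling actually used by the pipeline is not specified here.

```python
import numpy as np

def look_at(eye, target, up=(0.0, 1.0, 0.0)):
    """Build a 4x4 camera-to-world matrix for a camera at `eye` facing `target`.

    Assumed convention (OpenGL style): the camera looks down its own -Z axis.
    """
    eye = np.asarray(eye, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    forward = target - eye
    forward = forward / np.linalg.norm(forward)
    right = np.cross(forward, np.asarray(up, dtype=np.float64))
    right = right / np.linalg.norm(right)
    true_up = np.cross(right, forward)

    c2w = np.eye(4)
    c2w[:3, 0] = right      # camera +X in world coordinates
    c2w[:3, 1] = true_up    # camera +Y in world coordinates
    c2w[:3, 2] = -forward   # camera +Z points away from the target
    c2w[:3, 3] = eye        # camera position
    return c2w

def orbit_cameras(num_views, radius=2.5, elevation_deg=10.0, target=(0.0, 0.0, 0.0)):
    """Return (num_views, 4, 4) poses evenly spaced in azimuth around `target`."""
    elev = np.deg2rad(elevation_deg)
    target = np.asarray(target, dtype=np.float64)
    poses = []
    for azim in np.linspace(0.0, 2.0 * np.pi, num_views, endpoint=False):
        offset = radius * np.array([
            np.cos(elev) * np.sin(azim),
            np.sin(elev),
            np.cos(elev) * np.cos(azim),
        ])
        poses.append(look_at(target + offset, target))
    return np.stack(poses)

# Hypothetical usage: eight views around a subject centered at the origin.
poses = orbit_cameras(num_views=8)
```

Each pose could then be handed to a renderer to produce one training view of an animated or fitted asset; varying the radius and elevation per sample would be one simple way to diversify viewpoints further.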

Model architecture. Given a real-world input image, we first estimate its corresponding simplified human mesh. The image and mesh are fed into a multi-modal encoder to extract features, which serve as the condition for the subsequent mapping network. A Transformer-based mapping network then maps these features directly to the 3D triplane representation. From this triplane representation, our framework can render a 2D image from a given camera viewpoint.
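To make the data flow concrete, here is a minimal NumPy sketch of the two generic building blocks named above: cross-attention from plane tokens to condition tokens, and a feature lookup in the resulting triplane. It is not the HumanNOVA implementation. The single attention head, the random stand-in tokens and weights, the 4x4 plane resolution, the choice of which axes index each plane, and the summation of the three plane features are all assumptions chosen for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head cross-attention: queries (Nq, D) attend to context (Nc, D)."""
    q, k, v = queries @ w_q, context @ w_k, context @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # (Nq, Nc)
    return attn @ v

def tokens_to_triplane(tokens, res, channels):
    """Reshape (3 * res * res, channels) tokens into three feature planes."""
    return tokens.reshape(3, res, res, channels)

def bilinear(plane, u, v):
    """Bilinearly sample a (res, res, C) plane at u, v in [-1, 1]; requires res >= 2."""
    res = plane.shape[0]
    x = (u + 1.0) * 0.5 * (res - 1)
    y = (v + 1.0) * 0.5 * (res - 1)
    x0 = int(np.clip(np.floor(x), 0, res - 2))
    y0 = int(np.clip(np.floor(y), 0, res - 2))
    fx, fy = x - x0, y - y0
    top = (1 - fx) * plane[y0, x0] + fx * plane[y0, x0 + 1]
    bottom = (1 - fx) * plane[y0 + 1, x0] + fx * plane[y0 + 1, x0 + 1]
    return (1 - fy) * top + fy * bottom

def query_triplane(planes, point):
    """Project a 3D point onto three axis-aligned planes and sum the sampled features."""
    x, y, z = point
    return (bilinear(planes[0], x, y)
            + bilinear(planes[1], x, z)
            + bilinear(planes[2], y, z))

# Hypothetical usage with random stand-ins for the encoder outputs.
rng = np.random.default_rng(0)
dim, res = 16, 4
image_tokens = rng.normal(size=(10, dim))  # stand-in for image features
mesh_tokens = rng.normal(size=(6, dim))    # stand-in for mesh features
condition = np.concatenate([image_tokens, mesh_tokens], axis=0)
plane_queries = rng.normal(size=(3 * res * res, dim))  # would be learned in practice
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(3))

fused = cross_attention(plane_queries, condition, w_q, w_k, w_v)
planes = tokens_to_triplane(fused, res, dim)
feature = query_triplane(planes, (0.1, -0.3, 0.5))  # one feature vector per 3D point
```

In a full renderer, features sampled along each camera ray would typically be decoded into density and color and composited into a pixel; that decoding step is omitted here.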

Qualitative Results

[Image gallery: 18 examples, each pairing an input image ("Input") with the reconstructed avatar ("Our Result").]

BibTeX

@inproceedings{hu2026humannova,
  author    = {Hu, Hezhen and Zhao, Wangbo and Guo, Lanqing and Jiang, Hanwen and Liu, Jonathan C. and Fan, Zhiwen and Wang, Kai and Wang, Zhangyang and Pavlakos, Georgios},
  title     = {{HumanNOVA}: Photorealistic, Universal and Rapid 3D Human Avatar Modeling from a Single Image},
  booktitle = {CVPR},
  year      = {2026},
}
