Mesh 3
☆ PICO: Reconstructing 3D People In Contact with Objects CVPR'25
Alpár Cseke, Shashank Tripathi, Sai Kumar Dwivedi, Arjun Lakshmipathy, Agniv Chatterjee, Michael J. Black, Dimitrios Tzionas
Recovering 3D Human-Object Interaction (HOI) from single color images is
challenging due to depth ambiguities, occlusions, and the huge variation in
object shape and appearance. Thus, past work requires controlled settings such
as known object shapes and contacts, and tackles only limited object classes.
Instead, we need methods that generalize to natural images and novel object
classes. We tackle this in two main ways: (1) We collect PICO-db, a new dataset
of natural images uniquely paired with dense 3D contact on both body and object
meshes. To this end, we use images from the recent DAMON dataset that are
paired with contacts, but these contacts are only annotated on a canonical 3D
body. In contrast, we seek contact labels on both the body and the object. To
infer these given an image, we retrieve an appropriate 3D object mesh from a
database by leveraging vision foundation models. Then, we project DAMON's body
contact patches onto the object via a novel method needing only 2 clicks per
patch. This minimal human input establishes rich contact correspondences
between bodies and objects. (2) We exploit our new dataset of contact
correspondences in a novel render-and-compare fitting method, called PICO-fit,
to recover 3D body and object meshes in interaction. PICO-fit infers contact
for the SMPL-X body, retrieves a likely 3D object mesh and contact from PICO-db
for that object, and uses the contact to iteratively fit the 3D body and object
meshes to image evidence via optimization. Uniquely, PICO-fit works well for
many object categories that no existing method can tackle. This is crucial to
enable HOI understanding to scale in the wild. Our data and code are available
at https://pico.is.tue.mpg.de.
comment: Accepted in CVPR'25. Project Page: https://pico.is.tue.mpg.de
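
A minimal sketch of how a contact-guided render-and-compare fitting loop like the one the PICO abstract describes might look, assuming PyTorch. All helper callables here (get_body_vertices, transform_object, reprojection_loss) are hypothetical placeholders standing in for components the paper mentions (SMPL-X posing, object retrieval from PICO-db, image-evidence terms); this is not the authors' implementation.

```python
# Sketch of joint body/object fitting driven by contact correspondences.
# body_contact_idx and obj_contact_idx are paired vertex indices, e.g. as
# provided by a PICO-db-style contact annotation.
import torch

def fit_body_and_object(body_params, obj_pose, obj_verts_canonical,
                        body_contact_idx, obj_contact_idx,
                        get_body_vertices, transform_object, reprojection_loss,
                        iters=500, lr=0.01, w_contact=1.0, w_2d=1.0):
    optim = torch.optim.Adam([body_params, obj_pose], lr=lr)
    for _ in range(iters):
        optim.zero_grad()
        body_v = get_body_vertices(body_params)                  # (Nb, 3) posed body surface
        obj_v = transform_object(obj_verts_canonical, obj_pose)  # (No, 3) posed object mesh
        # Pull corresponding contact points on body and object together.
        contact_loss = (body_v[body_contact_idx] -
                        obj_v[obj_contact_idx]).norm(dim=-1).mean()
        # Keep both meshes consistent with image evidence (keypoints, masks, ...).
        image_loss = reprojection_loss(body_v, obj_v)
        loss = w_contact * contact_loss + w_2d * image_loss
        loss.backward()
        optim.step()
    return body_params, obj_pose
```
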
☆ DiMeR: Disentangled Mesh Reconstruction Model
Lutao Jiang, Jiantao Lin, Kanghao Chen, Wenhang Ge, Xin Yang, Yifan Jiang, Yuanhuiyi Lyu, Xu Zheng, Yingcong Chen
With the advent of large-scale 3D datasets, feed-forward 3D generative
models, such as the Large Reconstruction Model (LRM), have gained significant
attention and achieved remarkable success. However, we observe that RGB images
often lead to conflicting training objectives and lack the necessary clarity
for geometry reconstruction. In this paper, we revisit the inductive biases
associated with mesh reconstruction and introduce DiMeR, a novel disentangled
dual-stream feed-forward model for sparse-view mesh reconstruction. The key
idea is to disentangle both the input and framework into geometry and texture
parts, thereby reducing the training difficulty for each part according to the
Principle of Occam's Razor. Given that normal maps are strictly consistent with
geometry and accurately capture surface variations, we utilize normal maps as
the exclusive input to the geometry branch, simplifying the mapping between the
network's input and output. Moreover, we improve the mesh extraction algorithm
to introduce 3D ground-truth supervision. As for the texture branch, we use RGB
images as input to obtain the textured mesh. Overall, DiMeR demonstrates robust
capabilities across various tasks, including sparse-view reconstruction,
single-image-to-3D, and text-to-3D. Numerous experiments show that DiMeR
significantly outperforms previous methods, achieving over 30% improvement in
Chamfer Distance on the GSO and OmniObject3D datasets.
comment: Project Page: https://lutao2021.github.io/DiMeR_page/
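
A minimal sketch of the disentangled dual-stream idea the DiMeR abstract describes: one branch sees only normal maps and predicts geometry features, the other sees only RGB and predicts texture features. The tiny CNN backbones and feature sizes below are illustrative assumptions; the actual model is a large feed-forward reconstruction network with its own mesh extraction and supervision.

```python
# Sketch of a two-branch model where geometry and texture inputs never mix.
import torch.nn as nn

class DualStreamReconstructor(nn.Module):
    def __init__(self, feat_dim=64):
        super().__init__()
        # Geometry branch: consumes a 3-channel normal map only.
        self.geometry_branch = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, padding=1),
        )
        # Texture branch: consumes a 3-channel RGB image only.
        self.texture_branch = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, padding=1),
        )

    def forward(self, normal_maps, rgb_images):
        geo_feat = self.geometry_branch(normal_maps)  # drives mesh/surface prediction
        tex_feat = self.texture_branch(rgb_images)    # drives texturing of the mesh
        return geo_feat, tex_feat
```
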
☆ 3DV-TON: Textured 3D-Guided Consistent Video Try-on via Diffusion Models
Video try-on replaces clothing in videos with target garments. Existing
methods struggle to generate high-quality and temporally consistent results
when handling complex clothing patterns and diverse body poses. We present
3DV-TON, a novel diffusion-based framework for generating high-fidelity and
temporally consistent video try-on results. Our approach employs generated
animatable textured 3D meshes as explicit frame-level guidance, alleviating the
issue of models over-focusing on appearance fidelity at the expense of motion
coherence. This is achieved by enabling direct reference to consistent garment
texture movements throughout video sequences. The proposed method features an
adaptive pipeline for generating dynamic 3D guidance: (1) selecting a keyframe
for initial 2D image try-on, followed by (2) reconstructing and animating a
textured 3D mesh synchronized with original video poses. We further introduce a
robust rectangular masking strategy that successfully mitigates artifact
propagation caused by leaking clothing information during dynamic human and
garment movements. To advance video try-on research, we introduce HR-VVT, a
high-resolution benchmark dataset containing 130 videos with diverse clothing
types and scenarios. Quantitative and qualitative results demonstrate our
superior performance over existing methods. The project page is at
https://2y7c3.github.io/3DV-TON/.
comment: Project page: https://2y7c3.github.io/3DV-TON/
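
A minimal sketch of the adaptive 3D-guidance pipeline the 3DV-TON abstract outlines (keyframe try-on, then a textured 3D mesh animated with the original poses as frame-level guidance). Every callable here (select_keyframe, image_tryon, reconstruct_textured_mesh, animate_mesh, render_mesh, video_tryon_diffusion) is a hypothetical placeholder for a stage the paper describes, not an actual 3DV-TON API.

```python
# Sketch of textured-3D-guided video try-on under the assumptions stated above.
def generate_video_tryon(video_frames, garment_image, pose_sequence,
                         select_keyframe, image_tryon, reconstruct_textured_mesh,
                         animate_mesh, render_mesh, video_tryon_diffusion):
    # (1) Pick a keyframe and run 2D image try-on on it.
    keyframe = select_keyframe(video_frames)
    tryon_keyframe = image_tryon(keyframe, garment_image)
    # (2) Reconstruct a textured 3D mesh from the try-on result and animate it
    #     with the poses of the original video.
    mesh = reconstruct_textured_mesh(tryon_keyframe)
    animated_meshes = [animate_mesh(mesh, pose) for pose in pose_sequence]
    # Render the animated mesh per frame as explicit, temporally consistent guidance.
    guidance_frames = [render_mesh(m) for m in animated_meshes]
    # Condition the video diffusion model on the input frames, garment, and guidance.
    return video_tryon_diffusion(video_frames, garment_image, guidance_frames)
```
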