Mesh 4
☆ AnyI2V: Animating Any Conditional Image with Motion Control ICCV 2025
Recent advancements in video generation, particularly in diffusion models,
have driven notable progress in text-to-video (T2V) and image-to-video (I2V)
synthesis. However, challenges remain in effectively integrating dynamic motion
signals and flexible spatial constraints. Existing T2V methods typically rely
on text prompts, which inherently lack precise control over the spatial layout
of generated content. In contrast, I2V methods are limited by their dependence
on real images, which restricts the editability of the synthesized content.
Although some methods incorporate ControlNet to introduce image-based
conditioning, they often lack explicit motion control and require
computationally expensive training. To address these limitations, we propose
AnyI2V, a training-free framework that animates any conditional image with
user-defined motion trajectories. AnyI2V supports a broader range of modalities
as the conditional image, including data types such as meshes and point clouds
that are not supported by ControlNet, enabling more flexible and versatile
video generation. Additionally, it supports mixed conditional inputs and
enables style transfer and editing via LoRA and text prompts. Extensive
experiments demonstrate that the proposed AnyI2V achieves superior performance
and provides a new perspective in spatial- and motion-controlled video
generation. Code is available at https://henghuiding.com/AnyI2V/.
comment: ICCV 2025, Project Page: https://henghuiding.com/AnyI2V/
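The abstract pins down AnyI2V's inputs at a high level: one conditional image, user-defined motion trajectories, and a text prompt for editing. As a purely illustrative sketch of how such an input bundle might be structured (this is not the paper's API; every name below is hypothetical), consider:
```python
# Hypothetical input structure for a trajectory-conditioned I2V call.
# Nothing here comes from the AnyI2V codebase; it only makes the abstract's
# inputs concrete: one conditional image, motion trajectories, a text prompt.
from dataclasses import dataclass
from typing import List, Tuple

import numpy as np


@dataclass
class Trajectory:
    # One (x, y) pixel position per output frame for a single control point.
    points: List[Tuple[float, float]]


@dataclass
class AnyI2VRequest:
    condition_image: np.ndarray     # H x W x C render: depth, edges, mesh, point cloud, ...
    trajectories: List[Trajectory]  # user-defined motion paths
    prompt: str = ""                # text prompt for appearance/style editing
    num_frames: int = 16


def validate(req: AnyI2VRequest) -> None:
    """Each trajectory must supply one position per generated frame."""
    for traj in req.trajectories:
        if len(traj.points) != req.num_frames:
            raise ValueError("trajectory length must match num_frames")
```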
☆ Parametric shape models for vessels learned from segmentations via differentiable voxelization
Alina F. Dima, Suprosanna Shit, Huaqi Qiu, Robbie Holland, Tamara T. Mueller, Fabio Antonio Musio, Kaiyuan Yang, Bjoern Menze, Rickmer Braren, Marcus Makowski, Daniel Rueckert
Vessels are complex structures in the body that have been studied extensively
in multiple representations. While voxelization is the most common of them,
meshes and parametric models are critical in various applications due to their
desirable properties. However, these representations are typically extracted
through segmentations and used disjointly from each other. We propose a
framework that joins the three representations under differentiable
transformations. By leveraging differentiable voxelization, we automatically
extract a parametric shape model of the vessels through shape-to-segmentation
fitting, where we learn shape parameters from segmentations without the
explicit need for ground-truth shape parameters. The vessel is parametrized as
centerlines and radii using cubic B-splines, ensuring smoothness and continuity
by construction. Meshes are differentiably extracted from the learned shape
parameters, resulting in high-fidelity meshes that can be manipulated post-fit.
Our method can accurately capture the geometry of complex vessels, as
demonstrated by the volumetric fits in experiments on aortas, aneurysms, and
brain vessels.
comment: 15 pages, 6 figures
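To make the representation concrete, here is a minimal SciPy sketch (mine, not the authors' code) of a vessel parametrized as a clamped cubic B-spline centerline c(t) with a matching B-spline radius r(t), smooth and continuous by construction; the paper's shape-to-segmentation fitting via differentiable voxelization is not reproduced here:
```python
# A vessel as a cubic B-spline centerline c(t) in R^3 plus a radius profile r(t).
import numpy as np
from scipy.interpolate import BSpline


def clamped_cubic_bspline(coeffs: np.ndarray) -> BSpline:
    """Clamped cubic B-spline on [0, 1] with the given control coefficients."""
    n, k = len(coeffs), 3
    # Repeat the end knots so the curve starts and ends at the end coefficients.
    knots = np.concatenate([np.zeros(k), np.linspace(0.0, 1.0, n - k + 1), np.ones(k)])
    return BSpline(knots, coeffs, k)


# Control points of the centerline in R^3, with one radius per control point.
ctrl_xyz = np.array([[0, 0, 0], [1, 0.5, 0], [2, 0.2, 0.5], [3, 0, 1], [4, 0.3, 1.2]], float)
ctrl_r = np.array([1.0, 0.9, 0.8, 0.7, 0.6])

centerline = clamped_cubic_bspline(ctrl_xyz)  # c(t): smooth 3D centerline
radius = clamped_cubic_bspline(ctrl_r)        # r(t): smooth radius profile

t = np.linspace(0.0, 1.0, 100)
samples = centerline(t)  # (100, 3) points along the vessel axis
radii = radius(t)        # (100,) radii; C^2 smoothness comes by construction
```
A tube surface (and hence a mesh) follows directly from sweeping a circle of radius r(t) along c(t), which is why the learned parameters remain easy to manipulate after fitting.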
☆ Reconstructing Close Human Interaction with Appearance and Proxemics Reasoning
Due to visual ambiguities and inter-person occlusions, existing human pose
estimation methods cannot recover plausible close interactions from in-the-wild
videos. Even state-of-the-art large foundation models (e.g., SAM) cannot
accurately distinguish human semantics in such challenging scenarios. In this
work, we find that human appearance can provide a straightforward cue to
address these obstacles. Based on this observation, we propose a dual-branch
optimization framework to reconstruct accurate interactive motions with
plausible body contacts constrained by human appearances, social proxemics, and
physical laws. Specifically, we first train a diffusion model to learn a prior
over human proxemic behavior and pose. The trained network and two
optimizable tensors are then incorporated into a dual-branch optimization
framework to reconstruct human motions and appearances. Several constraints
based on 3D Gaussians, 2D keypoints, and mesh penetrations are also designed to
assist the optimization. With the proxemics prior and diverse constraints, our
method is capable of estimating accurate interactions from in-the-wild videos
captured in complex environments. We further build a dataset with pseudo
ground-truth interaction annotations, which may promote future research on pose
estimation and human behavior understanding. Experimental results on several
benchmarks demonstrate that our method outperforms existing approaches. The
code and data are available at https://www.buzhenhuang.com/works/CloseApp.html.
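The dual-branch optimization described above reduces to a familiar pattern: differentiable pose and appearance tensors refined under a weighted sum of constraint losses. The PyTorch loop below is only a schematic of that pattern; every loss stub and weight is a placeholder standing in for a term named in the abstract, not the paper's actual formulation:
```python
import torch

pose = torch.randn(2, 72, requires_grad=True)        # hypothetical: 2 people, SMPL-style pose
appearance = torch.randn(2, 256, requires_grad=True)  # hypothetical appearance latents
opt = torch.optim.Adam([pose, appearance], lr=1e-2)


def keypoint_loss(pose):       # stand-in for 2D keypoint reprojection error
    return pose.pow(2).mean()

def gaussian_loss(pose, app):  # stand-in for a 3D-Gaussian appearance term
    return (pose.mean() - app.mean()).pow(2)

def penetration_loss(pose):    # stand-in for a mesh inter-penetration penalty
    return torch.relu(pose - 1.0).mean()

def proxemics_prior(pose):     # stand-in for the learned proxemics/pose prior
    return pose.abs().mean()


for step in range(200):
    opt.zero_grad()
    loss = (keypoint_loss(pose)
            + 0.1 * gaussian_loss(pose, appearance)
            + penetration_loss(pose)
            + 0.01 * proxemics_prior(pose))
    loss.backward()
    opt.step()
```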
☆ Mesh Silksong: Auto-Regressive Mesh Generation as Weaving Silk
We introduce Mesh Silksong, a compact and efficient mesh representation
tailored to generating polygon meshes auto-regressively, akin to silk
weaving. Existing mesh tokenization methods always produce token sequences with
repeated vertex tokens, wasting network capacity. Our approach instead
tokenizes each mesh vertex only once, reducing the token sequence's redundancy
by 50% and achieving a state-of-the-art compression rate of approximately 22%.
Furthermore, Mesh Silksong produces polygon meshes with superior geometric
properties, including manifold topology, watertightness, and consistent face
normals, which are critical for
practical applications. Experimental results demonstrate the effectiveness of
our approach, showcasing not only intricate mesh generation but also
significantly improved geometric integrity.
comment: 9 pages main text, 14 pages appendix, 23 figures
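The redundancy claim is easy to see with a toy token count (this illustrates the general vertex-sharing argument, not Silksong's actual token layout): tokenizing coordinates per face re-emits every shared vertex, while emitting each vertex once and referring to it by index thereafter does not:
```python
# Toy count, not Silksong's scheme: per-face coordinate tokens repeat shared
# vertices; a visit-each-vertex-once layout stores coordinates a single time
# and uses cheap index tokens for connectivity.
faces = [(0, 1, 2), (0, 2, 3), (0, 3, 1)]  # small triangle fan sharing vertex 0
num_vertices = 4
coord_tokens_per_vertex = 3                # one token each for x, y, z

# Naive tokenization: every face slot re-emits full vertex coordinates.
naive = len(faces) * 3 * coord_tokens_per_vertex                   # 27 tokens

# Once-per-vertex tokenization: coordinates once, then index tokens per slot.
shared = num_vertices * coord_tokens_per_vertex + len(faces) * 3   # 12 + 9 = 21

print(naive, shared)  # 27 21 -- shared vertices stop costing repeated coordinates
```
As a back-of-envelope consistency check (not the paper's accounting): on a large closed triangle mesh V is roughly F/2, so the naive layout costs about 9F tokens while the once-per-vertex layout costs about 3V + 3F, or roughly 4.5F, in line with the ~50% redundancy reduction the abstract cites.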