Feature4X: Bridging Any Monocular Video to 4D Agentic AI with Versatile Gaussian Feature Fields

1UCLA, 2MIT, 3Stanford, 4UT-Austin, 5DEVCOM ARL

*Co-first authors

CVPR 2025

We introduce a universal framework designed to extend any functionality of 2D vision foundation models into the 4D realm, using only monocular video input, opening the door to a new semantic, editable, and promptable explicit 4D scene representation for agentic AI.

Abstract

Recent advancements in 2D and multi-modal models have achieved remarkable success by leveraging large-scale training on extensive datasets. However, extending these achievements to enable free-form interactions and high-level semantic operations with complex 3D/4D scenes remains challenging. This difficulty stems from the limited availability of large-scale, annotated 3D/4D or multi-view datasets, which are crucial for generalizable vision and language tasks such as open-vocabulary and prompt-based segmentation, language-guided editing, and visual question answering (VQA). In this paper, we introduce Feature4X, a universal framework designed to extend any functionality of 2D vision foundation models into the 4D realm, using only monocular video input, which is widely available from user-generated content. The "X" in Feature4X represents its versatility, enabling any task through adaptable, model-conditioned 4D feature field distillation. At the core of our framework is a dynamic optimization strategy that unifies multiple model capabilities into a single, task-dependent representation. Additionally, to the best of our knowledge, ours is the first method to distill and lift the features of video foundation models (e.g., SAM2, InternVideo2) into an explicit 4D feature field using Gaussian Splatting. Our experiments showcase novel-view segment anything, geometric and appearance scene editing, and free-form VQA across all time steps, empowered by LLMs in feedback loops. These advancements broaden the scope of agentic AI applications by providing a foundation for scalable, contextually and spatiotemporally aware systems capable of immersive dynamic 4D scene interaction.


Overview

We build upon MoSca and infer 2D priors from the input monocular video to separate the scene into static background and dynamic foreground elements. The static background is represented by a set of static 3D Gaussians with latent features, while the dynamic foreground is modeled with dynamic 3D Gaussians guided by Motion Scaffolds: a set of nodes {v_i}, each carrying a 3D motion trajectory over time and a base latent feature h_i. The motion and latent feature of each dynamic 3D Gaussian are computed by interpolating over its K nearest scaffold nodes. Given a target timestep t, we warp the dynamic Gaussians from their respective timesteps to t based on their motions, fuse them with the static Gaussians, and rasterize their colors and latent features into an RGB image and a D-dimensional feature map, respectively. The feature map is then fed into task-dependent decoders to support tasks such as 2D semantic segmentation (SAM), 3D editing (CLIP-LSeg), and 4D query-based reasoning (InternVideo2).
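A simplified sketch of the scaffold interpolation described above (illustrative assumptions, not the released code): each dynamic Gaussian blends per-node attributes from its K nearest scaffold nodes with inverse-distance weights. For brevity the per-node motion is reduced to a translation toward timestep t, whereas the actual scaffold nodes carry full 3D motion trajectories; all tensor names and shapes here are assumptions.

import torch

def knn_interpolate(gauss_xyz, node_xyz, node_attr, k=4, eps=1e-8):
    """Inverse-distance weighted interpolation from scaffold nodes to Gaussians.

    gauss_xyz: (N, 3) canonical positions of dynamic Gaussians
    node_xyz:  (M, 3) scaffold node positions in the same frame
    node_attr: (M, C) per-node attribute, e.g. a motion toward timestep t
               or a base latent feature h_i
    returns:   (N, C) interpolated attribute per Gaussian
    """
    d2 = torch.cdist(gauss_xyz, node_xyz) ** 2            # (N, M) squared distances
    d2_k, idx = torch.topk(d2, k, dim=1, largest=False)   # K nearest nodes per Gaussian
    w = 1.0 / (d2_k + eps)                                 # inverse-distance weights
    w = w / w.sum(dim=1, keepdim=True)                     # normalize weights to sum to 1
    return (w.unsqueeze(-1) * node_attr[idx]).sum(dim=1)   # weighted blend of node attributes

# Toy usage: 1000 dynamic Gaussians, 64 scaffold nodes, D = 32-dim latent features.
gauss_xyz   = torch.rand(1000, 3)
node_xyz    = torch.rand(64, 3)
node_motion = torch.randn(64, 3)    # simplified per-node translation toward timestep t
node_feat   = torch.randn(64, 32)   # per-node base latent feature h_i

gauss_xyz_t  = gauss_xyz + knn_interpolate(gauss_xyz, node_xyz, node_motion)  # warped positions
gauss_feat_t = knn_interpolate(gauss_xyz, node_xyz, node_feat)                # blended latent features
# The warped dynamic Gaussians are then fused with the static set and rasterized
# into an RGB image plus a D-dimensional feature map.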


Method Pipeline

Novel view segmentation (SAM2) results

Overview of the 4D SAM-based mask assignment and propagation across a dynamic scene. (Left) A 4D scene representation and its corresponding 4D SAM feature field. (Right) (a) Mask assignment demonstrates the segmentation of various scene elements, including buildings, roads, and vehicles, at time t=0. (b) Mask propagation across subsequent time frames (from t=10 to t=30) illustrates how the assigned masks are consistently tracked and updated as objects move through the scene, enabling accurate temporal correspondence. This approach ensures robust segmentation and tracking in dynamic 4D environments.
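As a rough, self-contained illustration of the assignment-then-propagation idea in this figure (not the paper's pipeline): masks assigned at t=0 define one feature prototype per object from the rendered feature map, and later frames are labeled by per-pixel similarity to those prototypes. The nearest-prototype rule stands in for the SAM2 decoder, and the random feature maps and hand-made masks are placeholders.

import torch
import torch.nn.functional as F

def build_prototypes(feat_t0, masks_t0):
    """Average the rendered features inside each t=0 mask to get one prototype per object.
    feat_t0:  (D, H, W) rendered feature map at t=0
    masks_t0: (K, H, W) boolean masks from the initial assignment
    returns:  (K, D) L2-normalized prototypes
    """
    protos = torch.stack([feat_t0[:, m].mean(dim=1) for m in masks_t0])  # (K, D)
    return F.normalize(protos, dim=1)

def propagate(feat_t, protos):
    """Label every pixel at time t with the most similar prototype."""
    D, H, W = feat_t.shape
    f = F.normalize(feat_t.reshape(D, -1), dim=0)    # (D, H*W) unit-norm pixel features
    sim = protos @ f                                 # (K, H*W) cosine similarity
    return sim.argmax(dim=0).reshape(H, W)           # (H, W) per-pixel mask IDs

# Toy usage with D=32 features on a 64x64 render and 3 objects segmented at t=0.
feat_t0  = torch.randn(32, 64, 64)
masks_t0 = torch.zeros(3, 64, 64, dtype=torch.bool)
masks_t0[0, :32], masks_t0[1, 32:, :32], masks_t0[2, 32:, 32:] = True, True, True
protos   = build_prototypes(feat_t0, masks_t0)

feat_t10 = torch.randn(32, 64, 64)       # feature map rendered at a later timestep
labels   = propagate(feat_t10, protos)   # per-pixel object IDs at t=10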



Novel view segmentation (CLIP) results

Visualization of semantic understanding and view synthesis in a 4D scene with a flower pot. (Top left) The 4D scene captures the object and its environment. (Top right) Novel views generated by the model demonstrate realistic perspective shifts of the flower pot. (Bottom left) The 4D CLIP feature field representation encodes semantic information about the scene. (Bottom right) Semantic masks show the segmentation of the flower pot, wall, window, and curtain over time (t=10,55,95), demonstrating consistent semantic labeling and tracking across frames. This approach enables robust 4D scene understanding and segmentation.
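A compact way to picture this open-vocabulary segmentation, under stated assumptions: a small task-specific decoder (a hypothetical, untrained MLP here) lifts the rendered D-dimensional latent feature map into CLIP space, and each pixel is assigned to its most similar text query. The decoder architecture, dimensions, and the random stand-ins for the rendered features and for the text embeddings of "flower pot", "wall", "window", and "curtain" are illustrative only; a real pipeline would use CLIP-LSeg's text encoder.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureDecoder(nn.Module):
    """Hypothetical per-pixel MLP: rendered latent (D) -> CLIP-space embedding (C)."""
    def __init__(self, d_latent=32, d_clip=512, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_latent, hidden), nn.ReLU(),
            nn.Linear(hidden, d_clip),
        )

    def forward(self, feat_map):                      # feat_map: (D, H, W)
        D, H, W = feat_map.shape
        x = feat_map.reshape(D, H * W).t()            # (H*W, D)
        return self.net(x).t().reshape(-1, H, W)      # (C, H, W)

def segment(feat_map, decoder, text_emb):
    """Assign each pixel to the most similar text query (open-vocabulary labels)."""
    clip_map = F.normalize(decoder(feat_map), dim=0)  # (C, H, W) unit-norm per pixel
    text_emb = F.normalize(text_emb, dim=1)           # (K, C) unit-norm per query
    sim = torch.einsum('kc,chw->khw', text_emb, clip_map)
    return sim.argmax(dim=0)                          # (H, W) label map

# Toy usage: random stand-ins for a rendered 64x64 feature map and the text
# embeddings of the four queries named in the caption.
decoder  = FeatureDecoder()
feat_map = torch.randn(32, 64, 64)
text_emb = torch.randn(4, 512)
labels   = segment(feat_map, decoder, text_emb)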



Language-guided 4D scene editing results

Demonstration of high-level, object-specific manipulations using natural language prompts. (Top Row) One time frame rendered from a novel view angle by the model. (Bottom Row) The same frame within the entire reconstructed 4D scene. (Left) "Extract the swan" isolates the swan from its background. (Center) "Delete the camel" removes the camel from the scene and reconstructs the background behind it. (Right) "Change the cow color to black and white" converts the cow's color to grayscale.
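The three edits above can be pictured as direct operations on the explicit Gaussian representation: select the Gaussians whose latent features match a text query, then extract, delete, or recolor them before re-rendering. The sketch below is a hedged illustration; the field names, the similarity threshold, and the random stand-in features and query embedding are assumptions rather than the released interface.

import torch
import torch.nn.functional as F

def select_by_text(gauss_feat, text_emb, thresh=0.6):
    """Boolean mask of Gaussians whose feature is cosine-similar to the query."""
    sim = F.normalize(gauss_feat, dim=1) @ F.normalize(text_emb, dim=0)
    return sim > thresh

def extract(scene, mask):
    """Keep only the selected Gaussians (e.g. "extract the swan")."""
    return {k: v[mask] for k, v in scene.items()}

def delete(scene, mask):
    """Remove the selected Gaussians (e.g. "delete the camel")."""
    keep = ~mask
    return {k: v[keep] for k, v in scene.items()}

def to_grayscale(scene, mask):
    """Recolor the selected Gaussians (e.g. "change the cow color to black and white")."""
    scene = {k: v.clone() for k, v in scene.items()}
    gray = scene['rgb'][mask].mean(dim=1, keepdim=True)   # per-Gaussian luminance
    scene['rgb'][mask] = gray.expand(-1, 3)               # replicate to all channels
    return scene

# Toy scene: 1000 Gaussians with positions, colors, and 32-dim latent features.
scene = {'xyz': torch.rand(1000, 3), 'rgb': torch.rand(1000, 3), 'feat': torch.randn(1000, 32)}
query = torch.randn(32)                        # stand-in for an encoded text prompt
mask  = select_by_text(scene['feat'], query)
edited = to_grayscale(scene, mask)             # the edited scene is then re-rendered as usual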



Novel view synthesis

Q&A illustration of local and global novel view synthesis from an input video frame. (Top Left) Original input frame, featuring a dog on a window ledge. (Top Center, Right) Local novel view and its corresponding feature representation, capturing fine details and localized perspective shifts. (Bottom) Global novel view and its feature representation, showing the full scene's spatial context and structure.
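One way to sketch the agentic Q&A loop this figure supports, under stated assumptions: an agent first queries a multimodal model with a global novel view and, if its confidence is low, requests a zoomed-in local render before answering. render_view and vlm are hypothetical placeholders; a real system would rasterize the 4D Gaussian scene and call a multimodal LLM (e.g., InternVideo2) in the feedback loop described in the abstract.

import torch

def render_view(scene, t, center=(0.5, 0.5), zoom=1.0):
    """Placeholder renderer: ignores its arguments and returns random tensors in
    place of the rasterized RGB image and feature map for a (zoomed) novel view."""
    rgb  = torch.rand(3, 256, 256)
    feat = torch.randn(32, 256, 256)
    return rgb, feat

def vlm(images, question):
    """Placeholder multimodal model: returns a canned answer and a confidence
    that increases once a local close-up view is provided."""
    return "a dog on a window ledge", 0.4 if len(images) == 1 else 0.9

def answer(scene, t, question, conf_thresh=0.8):
    global_rgb, _ = render_view(scene, t)                          # full-scene context
    ans, conf = vlm([global_rgb], question)
    if conf < conf_thresh:                                         # agent requests a close-up
        local_rgb, _ = render_view(scene, t, center=(0.3, 0.6), zoom=3.0)
        ans, conf = vlm([global_rgb, local_rgb], question)
    return ans

print(answer(scene=None, t=0, question="What animal is on the window ledge?"))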



BibTeX

@inproceedings{zhou2025feature4x,
      author    = {Shijie Zhou and Hui Ren and Yijia Weng and Shuwang Zhang and Zhen Wang and Dejia Xu and Zhiwen Fan and Suya You and Zhangyang Wang and Leonidas Guibas and Achuta Kadambi},
      title     = {Feature4X: Bridging Any Monocular Video to 4D Agentic AI with Versatile Gaussian Feature Fields},
      booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
      year      = {2025},
}