EgoSim: Egocentric Exploration in Virtual Worlds with Multi-modal Conditioning

Anonymous Authors

Interactive Video Demo

Abstract

Recent advances in video diffusion models provide a strong foundation for building world models with practical value. The next challenge is to investigate how an agent can leverage such a foundation model to understand, interact with, and plan within observed environments. This requires adding controllability to the model, turning it into a versatile game engine that can be dynamically manipulated and controlled. To this end, we investigate three key conditioning factors (camera, context frame, and text) and identify shortcomings in current model designs. Specifically, fusing camera embeddings with video features makes camera control dependent on those video features. Meanwhile, injecting textual information compensates for unobserved spatiotemporal structure, but it also intrudes into parts of the scene that have already been observed. To address these two issues, we propose the Spacetime Epipolar Attention Layer, which ensures that the egomotion generated by the model strictly adheres to the camera's movement. In addition, we inject text and context frames in a mutually exclusive manner to avoid the intrusion problem. Through extensive experiments, we demonstrate that our new model, EgoSim, achieves excellent results on both the RealEstate and EpicKitchen datasets, enabling free exploration and meaningful imagination based on observation.
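For readers unfamiliar with the epipolar constraint that the layer's name refers to, the following is a minimal sketch of the classical two-view geometry only. It is not the Spacetime Epipolar Attention Layer itself, whose design is not described on this page. The sketch assumes a pinhole camera with shared intrinsics (focal length `f`, principal point `cx`, `cy`) and a relative pose `R`, `t` such that a point in frame A maps to frame B as `X_b = R X_a + t`. All function names, and the idea of thresholding the distance to the epipolar line to build a boolean mask, are illustrative assumptions.

```python
import math

def matmul(a, b):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def matvec(a, v):
    """Product of a 3x3 matrix and a 3-vector."""
    return [sum(a[i][k] * v[k] for k in range(3)) for i in range(3)]

def transpose(a):
    return [[a[j][i] for j in range(3)] for i in range(3)]

def skew(t):
    """Cross-product matrix [t]_x, so that skew(t) @ v == t x v."""
    tx, ty, tz = t
    return [[0.0, -tz, ty], [tz, 0.0, -tx], [-ty, tx, 0.0]]

def inv_intrinsics(f, cx, cy):
    """Inverse of K = [[f, 0, cx], [0, f, cy], [0, 0, 1]] (assumed pinhole model)."""
    return [[1.0 / f, 0.0, -cx / f], [0.0, 1.0 / f, -cy / f], [0.0, 0.0, 1.0]]

def fundamental(R, t, f, cx, cy):
    """F = K^-T [t]_x R K^-1, assuming both frames share the same intrinsics."""
    k_inv = inv_intrinsics(f, cx, cy)
    essential = matmul(skew(t), R)
    return matmul(transpose(k_inv), matmul(essential, k_inv))

def epipolar_distance(F, p_a, p_b):
    """Pixel distance from p_b (frame B) to the epipolar line of p_a (frame A)."""
    line = matvec(F, [p_a[0], p_a[1], 1.0])
    norm = math.hypot(line[0], line[1])
    if norm == 0.0:
        return 0.0  # degenerate case: the line is undefined for this pixel
    return abs(line[0] * p_b[0] + line[1] * p_b[1] + line[2]) / norm

def epipolar_mask(F, p_a, width, height, threshold):
    """Hypothetical mask: True where a frame-B pixel lies near the epipolar line."""
    return [[epipolar_distance(F, p_a, (x, y)) <= threshold
             for x in range(width)] for y in range(height)]
```

For example, with an identity rotation and a purely sideways translation `t = (1, 0, 0)`, the epipolar line of a pixel in frame A is the image row with the same `y` coordinate in frame B, so only that row of the mask is set. An attention layer could in principle use such a mask to restrict which positions in another frame a query pixel attends to, but how EgoSim does this is not specified here.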

3D Camera Control for Dynamic Scenes

Frame & Caption Input
Reference Trajectory Video
Camera Controlled Generation
Caption: a sailboat sailing in rough seas with a dramatic sunset, waves are surging
Caption: pouring honey onto some slices of bread
Caption: rotating view, small house
Caption: time-lapse of a blooming flower with leaves and a stem
Caption: fireworks display

Precise Camera Control

[Figure: two animated GIF examples]

Interact With World

[Figure: ten animated GIF examples]