Stefano Soatto, UCLA
Host: Alex Wong
Title: Toward Foundational Models of Physical Scenes: From Large Language Models to World Models and Back.
Now that a significant fraction of human knowledge has been shared through the Internet, scraped and squashed into the weights of Large Language Models (LLMs), do we still need embodiment and interaction with the physical world to build representations? Is there a dichotomy between LLMs and “large world models”? What is the role of visual perception in learning such models? Can perceptual agents trained by passive observation learn world models suitable for control?
To begin tackling these questions, I will first address the issue of controllability of LLMs. LLMs are stochastic dynamical systems, for which the notion of controllability is well established: The state (“of mind”) of an LLM can be trivially steered by a suitable choice of input given enough time and memory. However, the space of interest for control of an LLM is not that of words, but that of “meanings” expressible as sentences that a human could have spoken and would understand. Unfortunately, unlike controllability, the notions of meaning and understanding are not usually formalized in a way that is relatable to LLMs in use today.
I will propose a simplistic definition of meaning that reflects the functional characteristics of a trained LLM. I will show that a well-trained LLM establishes a topology in the space of meanings, represented by equivalence classes of trajectories of the underlying dynamical model (the LLM). Then, I will describe both necessary and sufficient conditions for controllability in such a space of meanings.
I will then highlight the relation between meanings induced by a trained LLM upon the set of sentences that could be uttered, and “physical scenes” underlying sets of images that could be observed. In particular, a physical scene can be defined uniquely and inferred as an abstract concept without the need for embodiment, a view aligned with J. Koenderink’s characterization of images as “controlled hallucinations.”
Lastly, I will show that popular models ostensibly used to represent the 3D scene (Neural Radiance Fields, or NeRFs) can at most represent the images on which they are trained, but not the underlying physical scene. However, composing a NeRF with a Latent Diffusion Model or other inductively-trained generative model yields a viable representation of the physical scene. Such a model class, which can be learned through passive observations, is a first, albeit rudimentary, Foundational Model of physical scenes in the sense of being sufficient for any downstream inference task based on visual data.
Stefano Soatto is a Professor of Computer Science at the University of California, Los Angeles and a Vice President at Amazon Web Services, where he leads the AI Labs. He received his Ph.D. in Control and Dynamical Systems from the California Institute of Technology in 1996. Prior to joining UCLA he was Associate Professor of Biomedical and Electrical Engineering at Washington University in St. Louis, Assistant Professor of Mathematics at the University of Udine, and Postdoctoral Scholar in Applied Science at Harvard University. Before discovering the joy of engineering at the University of Padova under the guidance of Giorgio Picci, Soatto studied classics, participated in the Certamen Ciceronianum, co-founded the Jazz Fusion quintet Primigenia, skied competitively and rowed single-scull for the Italian National Rowing Team. Many broken bones later, he now considers a daily run around the block an achievement.
Soatto received the Siemens Prize with the Best Paper Award at CVPR in 1998 (with the late Roger Brockett), the Marr Prize at ICCV 1999 (with Jana Kosecka, Yi Ma, and Shankar Sastry), and the Best Paper Award at ICRA 2015 (with Konstantine Tsotsos and Joshua Hernandez). He is a Fellow of the IEEE and of the ACM.
At Amazon, Soatto is now responsible for the research and development leading to products such as Amazon Kendra (search), Amazon Lex (conversational bots), Amazon Personalize (recommendation), Amazon Textract (document analysis), Amazon Rekognition (computer vision), Amazon Transcribe (speech recognition), Amazon Forecast (time series), Amazon CodeWhisperer (code generation), and most recently Amazon Bedrock (Foundational Models as a service) and Titan (GenAI). Prior to joining AWS, he was Senior Advisor of NuTonomy, the first to launch an autonomous taxi service in Singapore (now Motional), and a consultant for Qualcomm since the inception of its AR/VR efforts. In 2004-5, he co-led the UCLA/Golem Team in the second DARPA Grand Challenge (with Emilio Frazzoli and Amnon Shashua).