
Every day, billions of people shop online, and many are looking for something closer to the experience of a physical store, where observing, interacting with, and thoroughly inspecting a product forms a crucial connection. Replicating that tactile, intuitive discovery online is a significant challenge, and while digital tools offer solutions, developing them at scale is often costly and time-consuming for businesses.
We developed new generative AI techniques to create high-quality, shoppable 3D product visualizations from as few as three product images. Today, we're excited to share the latest advancement, powered by Google’s state-of-the-art video generation model, Veo. This technology is already enabling the generation of interactive 3D views for a wide range of product categories on Google Shopping.
Examples of 3D product visualizations generated from photos.
First generation: Neural Radiance Fields (NeRF)
In 2022, researchers from across Google came together to develop technologies to make product visualization more immersive. The initial efforts focused on using Neural Radiance Fields (NeRF) to learn a 3D representation of products to render novel views (i.e., novel view synthesis), like 360° spins, from five or more images of the product. This required solving many sub-problems, including selecting the most informative images, removing unwanted backgrounds, predicting 3D priors, estimating camera positions from a sparse set of object-centric images, and optimizing a 3D representation of the product.
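At the heart of the NeRF step is volume rendering: compositing color along each camera ray from the densities and colors the learned field predicts. The sketch below is a toy illustration, not Google's production model; the radiance field is a hard-coded sphere standing in for a trained network, and all function names are hypothetical.

```python
# Minimal sketch of NeRF-style volume rendering along one camera ray.
# A real NeRF replaces `field` with a trained MLP queried at each sample point.
import numpy as np

def field(points):
    """Toy radiance field: a unit sphere at the origin.
    Returns (density, rgb) per 3D point."""
    inside = np.linalg.norm(points, axis=-1) < 1.0
    density = np.where(inside, 10.0, 0.0)                 # opaque inside the sphere
    rgb = np.where(inside[..., None], [0.8, 0.2, 0.2], 0.0)
    return density, rgb

def render_ray(origin, direction, near=0.0, far=4.0, n_samples=64):
    """Composite color along a ray with the standard NeRF quadrature."""
    t = np.linspace(near, far, n_samples)
    points = origin + t[:, None] * direction
    density, rgb = field(points)
    delta = np.diff(t, append=t[-1] + (t[1] - t[0]))      # spacing between samples
    alpha = 1.0 - np.exp(-density * delta)                # per-sample opacity
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))  # transmittance
    weights = alpha * trans
    return (weights[:, None] * rgb).sum(axis=0)           # expected color

color = render_ray(np.array([0.0, 0.0, -3.0]), np.array([0.0, 0.0, 1.0]))
```

Optimizing the field so that rays rendered this way reproduce the posed product photos is what yields the 3D representation used for 360° spins.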
That same year, we announced this breakthrough and launched the first milestone, interactive 360° visualizations of shoes on Google Search. While this technology was promising, it suffered from noisy input signals (e.g., inaccurate camera poses) and ambiguity from sparse input views. This challenge became apparent when attempting to reconstruct sandals and heels, whose thin structures and more complex geometry were tricky to reconstruct from just a handful of images.
This led us to wonder: could the recent advancements in generative diffusion models help us improve the learned 3D representation?
Second generation: scaling with view-conditioned diffusion priors
In 2023, we introduced a second-generation approach which used a view-conditioned diffusion prior to address the limitations of the first approach. Being view-conditioned means that you can give it an image of the top of a shoe and ask the model “what does the front of this shoe look like?” In this way, we can use the view-conditioned diffusion model to help predict what the shoe looks like from any viewpoint, even if we only have photos of limited viewpoints.
In practice, we employ a variant of score distillation sampling (SDS), first proposed in DreamFusion. During training, we render the 3D model from a random camera view. We then use the view-conditioned diffusion model and the available posed images to generate a target from the same camera view. Finally, we calculate a score by comparing the rendered image and the generated target. This score directly informs the optimization process, refining the 3D model's parameters and enhancing its quality and realism.
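The loop described above can be sketched schematically. Everything here is a hypothetical stand-in: a real system uses a differentiable renderer and a view-conditioned diffusion model, whereas this toy "prior" simply knows the correct appearance from any camera. The point is the shape of the loop: render from a random view, generate a target for that same view, and let the per-view score update the 3D model's parameters.

```python
# Schematic sketch of a score-distillation-style training loop.
# `render` and `diffusion_target` are toy placeholders, not real models.
import numpy as np

rng = np.random.default_rng(0)

def render(params, camera):
    """Stand-in differentiable renderer: an 'image' that is a linear
    function of the 3D model parameters, modulated by the camera angle."""
    return params * np.cos(camera)

def diffusion_target(camera, posed_images):
    """Stand-in for the view-conditioned diffusion prior: here it simply
    knows the ground-truth appearance implied by the posed photos."""
    return posed_images["params"] * np.cos(camera)

def sds_step(params, posed_images, lr=0.5):
    camera = rng.uniform(0.0, 2 * np.pi)           # sample a random training view
    rendered = render(params, camera)              # render the current 3D model
    target = diffusion_target(camera, posed_images)
    residual = rendered - target                   # score: mismatch at this view
    grad = residual * np.cos(camera)               # gradient of 0.5 * residual^2
    return params - lr * grad                      # refine the 3D parameters

posed_images = {"params": np.array([1.0, -0.5, 2.0])}  # what the photos imply
params = np.zeros(3)
for _ in range(200):
    params = sds_step(params, posed_images)
```

Averaged over many random views, the score pulls the 3D parameters into agreement with the prior, which is the mechanism SDS exploits at full scale.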
This second-generation approach led to significant scaling advantages, enabling us to generate 3D representations for many of the shoes viewed daily on Google Shopping. Today, you can find interactive 360° visualizations for sandals, heels, boots, and other footwear categories when you shop on Google, the majority of which are created by this technology!

The second-generation approach used a view-conditioned diffusion model based on the TryOn architecture. The diffusion model acts as a learned prior using score distillation sampling proposed in DreamFusion to improve the quality and fidelity of novel views.
Third generation: generalizing with Veo
Our latest breakthrough builds on Veo, Google's state-of-the-art video generation model. A key strength of Veo is its ability to generate videos that capture complex interactions between light, material, texture, and geometry. Its powerful diffusion-based architecture and its ability to be fine-tuned on a variety of multi-modal tasks enable it to excel at novel view synthesis.
To fine-tune Veo to transform product images into a consistent 360° video, we first curated a dataset of millions of high-quality synthetic 3D assets. We then rendered the 3D assets from various camera angles and lighting conditions. Finally, we created a dataset of paired images and videos and supervised Veo to generate 360° spins conditioned on one or more images.
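The data-curation recipe above might be sketched like this. The renderer, asset names, and lighting options are hypothetical placeholders; the point is the structure of each training example: a few conditioning frames paired with a full 360° spin as the supervision target.

```python
# Sketch of assembling paired (conditioning images, 360° spin) training examples
# from synthetic 3D assets. `render_view` is a placeholder for a real renderer.
import random

def render_view(asset, azimuth_deg, lighting):
    """Placeholder renderer: returns a record identifying the rendered frame."""
    return {"asset": asset, "azimuth": azimuth_deg, "lighting": lighting}

def make_training_pair(asset, n_frames=24, n_condition=3, seed=0):
    rng = random.Random(seed)
    lighting = rng.choice(["studio", "outdoor", "softbox"])  # vary lighting per asset
    # Full 360° spin: evenly spaced azimuths orbiting the object.
    spin = [render_view(asset, i * 360 / n_frames, lighting)
            for i in range(n_frames)]
    # Conditioning inputs: a few frames sampled from the spin.
    condition = rng.sample(spin, n_condition)
    return {"condition": condition, "target_video": spin}

pair = make_training_pair("asset_001")
```

Training the video model to reconstruct the full spin from only the sampled frames is what teaches it to synthesize unseen viewpoints at inference time.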
We discovered that this approach generalized effectively across a diverse set of product categories, including furniture, apparel, electronics, and more. Veo was not only able to generate novel views that adhered to the available product images, but it was also able to capture complex lighting and material interactions (e.g., shiny surfaces), something which was challenging for the first- and second-generation approaches.
The third-generation approach builds on Veo to generate 360° spins from one or more product images.
Furthermore, this third-generation approach eliminated the need to estimate precise camera poses from sparse object-centric product images, simplifying the problem and boosting reliability. The fine-tuned Veo model is exceptionally powerful—a single image can generate a realistic, 3D object representation. However, like all generative 3D technologies, Veo must infer details from unseen perspectives, such as an object's rear when only a front view is provided. Increasing the number of input images directly enhances Veo's capability to produce high-fidelity, high-quality novel views. In practice, we found that three images, covering most object surfaces, suffice to significantly improve 3D image quality and reduce speculative details.
Conclusion
The past few years have witnessed extraordinary advancements in 3D generative AI, progressing from NeRF technology to sophisticated view-conditioned diffusion models and now integrating Veo. Each innovation has played a critical role in making online shopping more tangible and interactive. We remain committed to pushing the boundaries in this field, striving to make online retail increasingly delightful, informative, and engaging for every user.
