Interactive streetscape tools, prevalent in major mapping services, have fundamentally changed virtual exploration. However, screen readers cannot interpret street view imagery, and alt text remains unavailable. This gap presents a critical opportunity to create inclusive, immersive streetscape experiences through multimodal AI and advanced image understanding. Integrating these capabilities could make services like Google Street View, with its vast collection of over 220 billion images across 110+ countries, significantly more accessible to the blind and low-vision community, unlocking new avenues for virtual exploration.
In our paper “StreetReaderAI: Making Street View Accessible Using Context-Aware Multimodal AI,” presented at UIST’25, we introduce StreetReaderAI, a proof-of-concept prototype that makes street view accessible through context-aware, real-time AI and accessible navigation controls. StreetReaderAI was iteratively designed by a team of blind and sighted accessibility researchers, building on established accessible first-person gaming and navigation tools like Shades of Doom, BlindSquare, and SoundScape. Key functionalities include:
- Real-time AI-generated descriptions of immediate surroundings, including roads, intersections, and points of interest.
- Dynamic, conversational interaction with a multimodal AI agent to discuss scenes and local geography.
- Accessible panning and movement between panoramic images via intuitive voice commands or keyboard shortcuts.
Video: StreetReaderAI delivers context-aware street view scene descriptions by integrating geographic data sources and the user’s current field of view into Gemini (watch with sound on YouTube).
Video: StreetReaderAI uses Gemini Live to facilitate real-time, interactive conversations about the scene and surrounding geographic features (watch with sound on YouTube).
Navigate StreetReaderAI with Precision
StreetReaderAI provides an immersive, first-person exploration experience, functioning much like a video game in which audio serves as the primary interface.
StreetReaderAI enables seamless navigation through both keyboard and voice commands. Users control their view with the left and right arrow keys. As the user pans, StreetReaderAI offers immediate audio feedback, announcing the current heading with cardinal or intercardinal directions (e.g., “Now facing: North” or “Northeast”). It also indicates whether forward movement is possible and announces proximity to nearby landmarks or places.
To advance, users initiate “virtual steps” with the up arrow or move backward using the down arrow. As the user traverses the virtual streetscape, StreetReaderAI reports travel distance and critical geographic information, such as nearby points of interest. Users can also leverage “jump” or “teleport” functionalities for rapid relocation to new positions.
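For illustration, here is a minimal sketch of the heading-announcement logic described above, assuming eight compass points at 45° intervals; the function names and message wording are our own illustrative choices, not StreetReaderAI’s actual code:

```python
# Hypothetical sketch of StreetReaderAI-style panning feedback: map a compass
# heading (in degrees) to the cardinal/intercardinal name announced to the user.
DIRECTIONS = ["North", "Northeast", "East", "Southeast",
              "South", "Southwest", "West", "Northwest"]

def heading_to_direction(heading_degrees: float) -> str:
    """Snap a 0-360 degree heading to the nearest of eight compass points."""
    index = round((heading_degrees % 360) / 45) % 8
    return DIRECTIONS[index]

def on_pan(heading_degrees: float, can_move_forward: bool) -> str:
    """Build the audio announcement spoken after each left/right arrow press."""
    message = f"Now facing: {heading_to_direction(heading_degrees)}"
    if can_move_forward:
        message += ". Path continues ahead"
    return message

print(on_pan(42.0, True))  # "Now facing: Northeast. Path continues ahead"
```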
StreetReaderAI: Your Intelligent Virtual Guide
The foundation of StreetReaderAI comprises two core AI subsystems powered by Gemini: the AI Describer and AI Chat. Both subsystems process a static prompt and an optional user profile, augmented by dynamic data concerning the user’s current location, including nearby points of interest, road details, and the active Street View image.
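As a concrete illustration, these shared inputs might be bundled as follows; the dataclass below is a minimal sketch with illustrative field names, not StreetReaderAI’s actual data model:

```python
# A hedged sketch of the input bundle both AI subsystems consume.
from dataclasses import dataclass

@dataclass
class GeoContext:
    heading_degrees: float       # current camera heading, 0 = North
    nearby_places: list[str]     # e.g., ["bus stop (12 m)", "cafe (30 m)"]
    road_info: str               # e.g., "two-lane road, marked crosswalk ahead"

@dataclass
class SubsystemInput:
    static_prompt: str           # fixed system instructions for the subsystem
    user_profile: str | None     # optional user preferences and context
    geo_context: GeoContext      # dynamic data about the current virtual location
    street_view_image: bytes     # the active street view rendering, as JPEG bytes
```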
AI Describer: Context-Aware Scene Analysis
The AI Describer acts as a sophisticated context-aware scene analysis tool. It synthesizes dynamic geographic information related to the user’s virtual position with a detailed analysis of the current Street View image to generate real-time audio descriptions.
This subsystem operates in two distinct modes: a “default” prompt focuses on navigation and safety considerations for blind pedestrians, while a “tour guide” prompt delivers enriched tourism-specific information, such as historical and architectural context. Additionally, Gemini predicts probable follow-up questions relevant to the current scene and local geography, anticipating the needs of blind or low-vision travelers.

This diagram illustrates how AI Describer integrates multimodal data to generate context-aware scene descriptions.
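A minimal sketch of the describer’s two prompt modes calling Gemini with the current view plus geographic context, assuming the google-genai Python SDK; the prompt wording and model name are illustrative assumptions, not StreetReaderAI’s actual ones:

```python
# Hedged sketch: context-aware scene description with two prompt modes.
from google import genai
from google.genai import types

PROMPTS = {
    "default": "Describe this street view for a blind pedestrian, focusing on "
               "navigation and safety: sidewalks, crossings, and obstacles.",
    "tour_guide": "Describe this street view as a tour guide, including "
                  "historical and architectural context.",
}

def describe_scene(image_jpeg: bytes, geo_summary: str, mode: str = "default") -> str:
    client = genai.Client()  # reads GEMINI_API_KEY from the environment
    response = client.models.generate_content(
        model="gemini-2.5-flash",  # illustrative model choice
        contents=[
            types.Part.from_bytes(data=image_jpeg, mime_type="image/jpeg"),
            f"Nearby geographic context: {geo_summary}",
        ],
        config=types.GenerateContentConfig(system_instruction=PROMPTS[mode]),
    )
    return response.text
```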
AI Chat: Interactive Scene Exploration
AI Chat builds on the AI Describer by enabling users to pose questions about their current view, past observations, and surrounding geography. The chat agent leverages Google's Multimodal Live API, supporting real-time interaction, function calls, and session-specific memory retention. The system tracks and transmits each pan or movement interaction, alongside the user’s current view and geographic context (e.g., nearby places, current heading).
A key capability of AI Chat is its temporary “memory” of the user’s session: the context window accommodates up to 1,048,576 input tokens, equivalent to over 4,000 input images. Because AI Chat receives the user’s view and location with every virtual step, it continuously accumulates information about the user’s environment and context. For instance, a user might pass a bus stop, turn a corner, and then ask, “Wait, where was that bus stop?” The agent can draw on its prior context, analyze the current geographic input, and respond, “The bus stop is behind you, approximately 12 meters away.”
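Here is a minimal sketch of how this per-step context streaming might look with the google-genai Python SDK’s Live API; the model name, message framing, and `Step` structure are illustrative assumptions, not StreetReaderAI’s actual implementation:

```python
# Hedged sketch: streaming each virtual step into a live Gemini session so the
# growing context window serves as the agent's session "memory".
import asyncio
from dataclasses import dataclass

from google import genai
from google.genai import types

@dataclass
class Step:
    image_jpeg: bytes  # current street view rendering
    heading: float     # degrees, 0 = North
    places: str        # e.g., "bus stop (12 m), cafe (30 m)"

async def chat_session(steps: list[Step], question: str) -> None:
    client = genai.Client()  # reads GEMINI_API_KEY from the environment
    async with client.aio.live.connect(
        model="gemini-2.0-flash-live-001",        # illustrative model choice
        config={"response_modalities": ["TEXT"]},
    ) as session:
        # Each pan or step pushes the new view and geo-context into the session.
        for step in steps:
            await session.send_client_content(
                turns=types.Content(role="user", parts=[
                    types.Part.from_bytes(data=step.image_jpeg, mime_type="image/jpeg"),
                    types.Part(text=f"Heading: {step.heading}; nearby: {step.places}"),
                ]),
                turn_complete=False,  # context only; no reply requested yet
            )
        # The user can now ask about anything seen earlier in the session.
        await session.send_client_content(
            turns=types.Content(role="user", parts=[types.Part(text=question)]),
            turn_complete=True,
        )
        async for message in session.receive():
            if message.text:
                print(message.text, end="")

# Example: asyncio.run(chat_session(steps, "Wait, where was that bus stop?"))
```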
Evaluating StreetReaderAI with Blind Users
To rigorously assess StreetReaderAI, we conducted an in-person laboratory study involving eleven blind screen reader users. Participants familiarized themselves with StreetReaderAI and then used it to explore various locations and evaluate potential walking routes to designated destinations.
Video: During one study task, a blind participant used StreetReaderAI to navigate toward a bus stop and ask about its amenities, such as benches and a shelter (watch with sound on YouTube).
Participants responded overwhelmingly positively to StreetReaderAI, assigning a median usefulness score of 7 (SD=0.9) on a 1–7 Likert scale (where 1 signified ‘not at all useful’ and 7 ‘very useful’). They specifically praised the synergy between virtual navigation and AI, the seamless AI Chat interface, and the value of the provided information. Qualitative feedback consistently framed StreetReaderAI as a significant advance in navigation accessibility, noting that existing street view tools offer little support in this regard. Participants also described the interactive AI chat feature as making street- and place-related conversations both engaging and highly beneficial.
Throughout the study, participants navigated over 350 panoramas and made more than 1,000 AI requests. Notably, AI Chat usage was six times higher than AI Describer, revealing a clear preference for personalized, conversational inquiries. While participants found StreetReaderAI valuable and effectively integrated virtual world navigation with AI interactions, areas for enhancement were identified: users occasionally encountered difficulties with precise orientation, discerning the accuracy of AI responses, and understanding the boundaries of AI knowledge.
Video: In one study task, participants were asked to “Find out about an unfamiliar playground to plan a trip with your two young nieces.” This clip demonstrates the diverse range of questions asked and StreetReaderAI’s responses (watch with sound on YouTube).
Study Results: Understanding User Inquiries
As the inaugural study of an accessible street view system, our research provides the first comprehensive analysis of the types of questions blind individuals pose about streetscape imagery. We analyzed all 917 AI Chat interactions, tagging each with up to three categories from an emergent list of 23 distinct question types. The four most prevalent question types were:
- Spatial orientation: 27.0% of questions concerned object locations and distances, such as, “How far is the bus stop from where I'm standing?” and “Which side are the garbage cans next to the bench?”
- Object existence: 26.5% of questions asked about the presence of critical features like sidewalks, obstacles, and doors; for example, “Is there a crosswalk here?”
- General description: 18.4% of questions requested a summary of the current view, often to initiate AI Chat, e.g., “What's in front of me?”
- Object/place location: 14.9% of questions sought the location of specific items or places, such as, “Where is the nearest intersection?” or “Can you help me find the door?”
Assessing StreetReaderAI Accuracy
Given StreetReaderAI’s substantial reliance on AI, ensuring response accuracy is a paramount challenge. Among the 816 questions posed to AI Chat by participants:
- 703 responses (86.3%) were accurate.
- 32 responses (3.9%) were inaccurate.
- The remaining responses were categorized as partially correct (26; 3.2%) or AI refusal to answer (54; 6.6%).
Of the 32 incorrect responses:
- 20 (62.5%) were false negatives, meaning the AI incorrectly stated an object (e.g., a bike rack) did not exist when it was present.
- 12 (37.5%) involved misidentifications (e.g., mistaking a yellow speed bump for a crosswalk) or other errors that arose because the target was not yet visible in the current street view image.
Further research is essential to evaluate StreetReaderAI’s performance across diverse contexts and beyond controlled laboratory settings.
Future Directions for StreetReaderAI
StreetReaderAI represents a significant and promising first step toward making streetscape tools universally accessible. Our study details *what* information blind users seek about streetscape imagery and *how* they inquire about it, demonstrating the potential of multimodal AI to address these needs.
Several compelling opportunities exist to build upon this foundational work:
- Advancing Geo-visual Agents: We envision developing a more autonomous AI Chat agent capable of independent exploration. For example, a user could ask, “What’s the next bus stop down this road?” The agent would then autonomously navigate the Street View network, locate the stop, analyze its features (like benches and shelters), and report back comprehensive details (see the sketch after this list).
- Enhancing Route Planning Support: Currently, StreetReaderAI does not fully support end-to-end origin-to-destination routing. Imagine a query like, “What’s the walk like from the nearest subway station to the library?” A future AI agent could conduct a virtual pre-walk of the route, analyzing every Street View image to generate a blind-friendly summary. This would include identifying potential obstacles and pinpointing the exact location of the library’s entrance.
- Developing Richer Audio Interfaces: The primary output of StreetReaderAI is synthesized speech. We are actively exploring more sophisticated, non-verbal feedback mechanisms, including spatialized audio and fully immersive 3D audio soundscapes derived directly from visual data.
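To make the geo-visual agent idea concrete, the sketch below frames the bus-stop example as a breadth-first search over linked panoramas; every helper here (`neighbors`, `contains_target`, `describe`) is a hypothetical stand-in, since no such autonomous agent exists in StreetReaderAI today:

```python
# Hypothetical sketch of an autonomous geo-visual agent: breadth-first search
# over the Street View panorama graph, querying a vision model at each
# panorama until the target (e.g., a bus stop) is found.
from collections import deque
from typing import Callable, Optional

def find_along_network(
    start_pano: str,
    neighbors: Callable[[str], list[str]],    # IDs of linked panoramas
    contains_target: Callable[[str], bool],   # vision-model check per panorama
    describe: Callable[[str], str],           # detailed report once found
    max_panos: int = 50,                      # cap the cost of autonomous search
) -> Optional[str]:
    seen, queue = {start_pano}, deque([start_pano])
    while queue and len(seen) <= max_panos:
        pano = queue.popleft()
        if contains_target(pano):
            return describe(pano)  # e.g., "Bus stop with a bench and shelter..."
        for nxt in neighbors(pano):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return None  # target not found within the search budget
```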
Although currently a “proof-of-concept” research prototype, StreetReaderAI powerfully illustrates the transformative potential of making immersive streetscape environments accessible to everyone.
Acknowledgements
This research was conducted by Jon E. Froehlich, Alexander J. Fiannaca, Nimer Jaber, Victor Tsaran, Shaun K. Kane, and Philip Nelson. We extend our gratitude to Project Astra and the Google Geo teams for their invaluable feedback, and to our dedicated participants. Diagram icons are sourced from Noun Project, including: “prompt icon” by Firdaus Faiz, “command functions” by Kawalan Icon, “dynamic geo-context” by Didik Darmanto, and “MLLM icon” by Funtasticon.