Inspiration
In 2007, the smartphone became an extension of the human body. But an extension is still a tool — you pick it up, interact with it, put it down. It doesn't think with you. It doesn't watch out for you. What if it didn't wait to be asked?
285 million people worldwide live with blindness or low vision. The tools available to them today require a sighted volunteer on the other end (Be My Eyes), are locked to specific platforms (Microsoft's Seeing AI), or provide basic object detection without context. None of them hold a real-time voice conversation with you about what they see.
We asked: what if a blind person could just talk to their phone and have a companion that walks with them? Not tap buttons. Not wait for a volunteer. Just speak naturally — whether you want to chat, need help reasoning through a problem, want to know what's around you, or need someone watching out for danger before you even ask.
The blind community needs it first. Everyone needs it eventually. Travelers in foreign countries can't read signs. Elderly people with declining vision can't read pill bottles. Field workers with occupied hands can't check a screen. Anyone who needs an extra pair of eyes needs Echo Walk.
What it does
Echo Walk turns any smartphone into an AI companion that sees, speaks, remembers, and protects. Everything is hands-free — voice IS the interface, not a feature bolted on.
See the world through conversation:
- "What do you see?" — captures the camera, describes the scene, speaks it back
- "Read that sign" — exact transcription of signs, labels, menus, documents
- "Is the path clear?" — analyzes obstacles, stairs, curbs, vehicles, and distances
- "What am I holding?" — identifies objects by type, brand, color, and text
Navigate and control by voice:
- "Open camera" / "Go to settings" — voice-controlled screen switching, no tapping
- "Make the voice faster" — changes any of 9 settings hands-free
- "Where am I?" — reads GPS coordinates
Remember what matters:
- "Remember I'm at Gate B12" — stores facts in conversation memory
- "Remember this medicine bottle" — photographs it, stores a visual fingerprint using Nova Multimodal Embeddings
- "Is this the same pill bottle?" — compares the camera view against saved visual memories using cosine similarity
Stay safe:
- Continuous vision watches in the background and only speaks when something changes or is dangerous
- "URGENT: Stairs ahead, about 3 feet away" — with a haptic buzz and spatial audio from the direction of the hazard
- Emergency SOS — say "help me" or triple-tap anywhere. Captures GPS + camera, analyzes surroundings, speaks your location calmly
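The triple-tap trigger boils down to a small timing-window check. A sketch (the 600 ms window and class name are illustrative, not our production values):

```javascript
// Triple-tap detection sketch: three taps anywhere on screen within a
// short window fire the SOS callback. 600 ms is an illustrative window.
class TripleTapDetector {
  constructor(onTrigger, windowMs = 600) {
    this.onTrigger = onTrigger;
    this.windowMs = windowMs;
    this.taps = []; // timestamps of recent taps
  }
  tap(now = Date.now()) {
    this.taps.push(now);
    // keep only taps inside the window
    this.taps = this.taps.filter(t => now - t <= this.windowMs);
    if (this.taps.length >= 3) {
      this.taps = []; // reset so a fourth tap doesn't re-fire
      this.onTrigger();
    }
  }
}
```

In the browser this would be wired to a document-level pointer event, so there is no specific button to find.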
The blind user's scroll:
- "What have we talked about?" — summarizes the entire conversation, because blind users can't scroll back through a chat
How we built it
Three Amazon Nova models collaborate in real time — each with a distinct role:
Nova 2 Sonic — The Voice. Listens to the user, speaks responses, and decides which of 14 tools to call based on natural language. No intent detection code, no regex, no decision trees — Nova Sonic handles all understanding.
Nova 2 Lite — The Eyes + Brain. Sees through the phone camera (5 context types: describe, read text, identify object, check path, analyze people), reads signs, spots hazards, and handles deep reasoning tasks.
Nova 2 Multimodal Embeddings — The Memory. Generates 1024-dimensional vectors from images for visual object recognition. "Remember this medicine bottle" → stores the embedding. "Is this the same one?" → cosine similarity comparison.
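The similarity check itself is plain vector math. A minimal sketch (the 0.85 threshold here is illustrative, not our tuned value):

```javascript
// Cosine similarity between two embedding vectors, e.g. the
// 1024-dimensional vectors returned by Nova Multimodal Embeddings.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// "Is this the same pill bottle?" — compare a stored embedding against
// the current camera frame's embedding. 0.85 is an illustrative cutoff.
function isSameObject(stored, current, threshold = 0.85) {
  return cosineSimilarity(stored, current) >= threshold;
}
```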
Audio pipeline: WebRTC DataChannel connects the browser directly to the ECS task (~50-100ms latency). AudioWorklets process capture at 16kHz and playback at 24kHz on dedicated audio threads so the UI never stutters. A StereoPannerNode provides spatial audio — "obstacle on your left" literally plays from your left ear.
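A StereoPannerNode takes a pan value in [-1, 1], so placing a hazard in the stereo field reduces to mapping its bearing into that range. A sketch with illustrative names (the actual StereoPannerNode call only runs in a browser, shown here as a comment):

```javascript
// Map a hazard's bearing (degrees, 0 = straight ahead, negative = left)
// to a StereoPannerNode pan value in [-1, 1]. Name and the ±90° cutoff
// are illustrative, not Echo Walk's exact code.
function bearingToPan(bearingDeg) {
  // Anything beyond ±90° is treated as fully left/right.
  const clamped = Math.max(-90, Math.min(90, bearingDeg));
  return clamped / 90;
}

// In the browser, the value drives a real panner in the output graph:
//   const panner = new StereoPannerNode(audioCtx);
//   panner.pan.value = bearingToPan(-45); // hazard front-left
```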
Tool architecture: Nova Sonic orchestrates 14 tools across three categories:
- Server-only (sync): get_time, navigate, get_settings, update_setting, remember, recall
- Server-async: reason, summarize (delegates to Nova Lite)
- Round-trip: capture_image, get_location, emergency, remember_object, recognize_object (requests data from the browser, processes it server-side, returns the result to Nova)
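On the server, dispatch is a lookup by tool name. The registry below is an illustrative sketch — the descriptions and handlers are not our actual tool specs, but they show the shape Nova Sonic selects from:

```javascript
// Minimal tool registry/dispatcher sketch. Nova Sonic picks a tool by
// name from its descriptions; the server just routes to the handler.
const tools = {
  get_time: {
    description: "Returns the current local time as spoken text.",
    handler: async () => new Date().toLocaleTimeString(),
  },
  remember: {
    description: "Stores a fact in conversation memory.",
    handler: async ({ fact }, session) => {
      session.memory.push(fact);
      return "Remembered.";
    },
  },
};

// When Nova Sonic emits a tool-use event, look up the handler by name
// and feed the result back into the stream.
async function dispatchTool(name, input, session) {
  const tool = tools[name];
  if (!tool) return `Unknown tool: ${name}`;
  return tool.handler(input, session);
}
```

Adding a capability is just another entry in the registry plus a description Nova Sonic can read.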
Infrastructure: 28 AWS resources from a single CloudFormation template — ECS Fargate (auto-scaling 2-10 tasks), ALB, CloudFront with OAC, Cognito auth, CodeBuild, ECR, 4 CloudWatch alarms. One config file (parameters.yaml), four deployment scripts, ~15 minutes to production.
Frontend: Vanilla JS — no React, no build step, no bundler. PWA with service worker caching 18 assets. 51 ARIA attributes, 22 roles, 10 aria-live regions. First-time onboarding where Echo Walk introduces itself by voice.
Challenges we ran into
WebRTC on the server. Browser-to-browser WebRTC is well-documented. Server-side WebRTC peers are not. We used @roamhq/wrtc to create a server-side peer that receives raw PCM audio from the browser's AudioWorklet and pipes it directly to Nova Sonic's bidirectional HTTP/2 stream. Getting the DataChannel negotiation, ICE candidates, and audio format alignment working took significant debugging.
Nova Sonic speaks formatting characters. Nova Sonic is a speech model — everything it outputs becomes audio. When the system prompt or tool results contained \n, **bold**, or markdown, Nova would literally say "backslash n" or "asterisk asterisk bold asterisk asterisk" out loud. We had to explicitly instruct the model to never use any text formatting characters since everything is spoken aloud.
Camera race conditions. When Nova Sonic calls capture_image, the camera might not be active yet. The browser needs to request camera permission, start the video stream, and wait for the first frame — all before capturing. We implemented auto-start with a polling loop that waits up to 2 seconds for the video to produce frames.
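The wait-for-frames step is a generic poll-until-ready helper. In the browser the predicate would check something like video.readyState, but the helper itself is plain JavaScript (names and intervals illustrative):

```javascript
// Poll until ready() returns true, or give up after timeoutMs.
// In Echo Walk's case the predicate would be something like
// () => video.readyState >= 2, i.e. the camera has produced a frame.
function waitFor(ready, timeoutMs = 2000, intervalMs = 50) {
  return new Promise((resolve, reject) => {
    const start = Date.now();
    const tick = () => {
      if (ready()) return resolve();
      if (Date.now() - start >= timeoutMs) {
        return reject(new Error("Camera produced no frames in time"));
      }
      setTimeout(tick, intervalMs);
    };
    tick();
  });
}
```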
Transcript deduplication. Nova Sonic sometimes re-sends the same transcript fragment multiple times during streaming. A simple "last transcript" check wasn't enough — we needed a rolling Set of the 20 most recent transcripts to properly deduplicate without missing legitimate repeated phrases.
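A sketch of the rolling-window deduper (backed here by an array rather than a Set for simplicity; the window of 20 matches what we describe above):

```javascript
// Deduplicate streaming transcript fragments with a rolling window:
// a fragment is dropped only if it appeared among the last `limit`
// fragments, so a legitimate repeat further apart still gets through.
class TranscriptDeduper {
  constructor(limit = 20) {
    this.limit = limit;
    this.recent = []; // insertion order, newest last
  }
  // Returns true if the fragment is new and should be emitted.
  accept(text) {
    if (this.recent.includes(text)) return false;
    this.recent.push(text);
    if (this.recent.length > this.limit) this.recent.shift();
    return true;
  }
}
```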
Barge-in with spatial audio. When the user interrupts Echo Walk mid-sentence, we need to immediately stop playback, clear the audio queue, and reset the spatial panner — all without creating audio artifacts. The AudioWorklet's separate thread made this tricky since the main thread and audio thread needed coordinated state.
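One pattern for this cross-thread coordination is a generation counter: every queued chunk is tagged, and a barge-in bumps the generation so chunks that arrive late from the worklet thread are recognized as stale. A sketch, not our exact implementation:

```javascript
// Barge-in sketch: chunks carry the generation they were produced in;
// interrupting bumps the generation and clears the queue, so any
// in-flight chunk from before the interrupt is silently dropped.
class PlaybackQueue {
  constructor() {
    this.generation = 0;
    this.chunks = [];
  }
  push(chunk, gen = this.generation) {
    if (gen !== this.generation) return; // stale chunk from before barge-in
    this.chunks.push(chunk);
  }
  next() {
    return this.chunks.shift() ?? null;
  }
  bargeIn() {
    this.generation++; // invalidate everything produced so far
    this.chunks = [];  // clear the pending queue
    // the real app would also reset the StereoPannerNode here
  }
}
```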
Accomplishments that we're proud of
Three models, one conversation. The user never knows three AI models are collaborating. They just talk. Nova Sonic decides when to see (Nova Lite), when to remember (Embeddings), when to think (Nova Lite for reasoning), and when to just respond. The orchestration is invisible.
Voice-first, not voice-added. Every feature was designed for someone who can't see the screen. Settings are voice-controlled. Navigation is voice-controlled. Memory exists because blind users can't scroll — "What have we talked about?" is their scroll, "What did I tell you?" is their search bar. Emergency SOS is triple-tap anywhere because there's no button to find. When continuous vision can't see clearly, or loses connection, or the camera is blocked — it tells you. Silence from something you depend on is dangerous.
50-100ms audio latency. WebRTC DataChannel bypasses CloudFront and the load balancer entirely. AudioWorklets process audio on dedicated threads. The result feels like a real conversation, not a request-response cycle.
14 tools, zero intent detection. We never wrote a single line of intent detection code. Nova Sonic reads the tool descriptions and figures out which tool to call from natural language. Adding a new capability means defining a tool spec and a handler — Nova handles all understanding.
Production-ready infrastructure. This isn't a localhost demo. 28 CloudFormation resources, auto-scaling, health checks, 4 CloudWatch alarms, Cognito auth, CloudFront CDN, least-privilege IAM roles, ECR image scanning. One config file, four scripts to deploy.
What we learned
Nova Sonic is a platform, not just a voice model. The tool calling architecture means Nova Sonic isn't limited to what we built — it's extensible. Define a new tool with a description, and Nova figures out when to use it. This makes Echo Walk a platform that can grow without rewriting the AI layer.
Accessibility drives better design for everyone. Building voice-first for blind users forced us to think about every interaction without assuming a screen. The result is an app that's equally useful for a traveler with earbuds in a foreign city, an elderly person reading a pill bottle, or a field worker with occupied hands.
WebRTC DataChannel is underused. Most voice AI apps use WebSocket or HTTP for audio. WebRTC DataChannel gives you sub-100ms latency with built-in congestion control, and it works through NATs and firewalls. The tradeoff is complexity — but for real-time voice, the latency difference is the difference between a conversation and a command interface.
Multimodal embeddings enable visual memory without a database. By storing 1024-dimensional vectors in session memory and using cosine similarity, we built object recognition without any vector database, ML pipeline, or training. Nova Multimodal Embeddings does the heavy lifting — we just compare numbers.
What's next for Echo Walk
Navigation integration. Partner with mapping APIs to provide turn-by-turn walking directions — "Take me to the nearest pharmacy" with real-time obstacle avoidance along the route.
Multi-language support. Nova Sonic supports multiple languages. A traveler in Tokyo could speak English and have signs read and translated in real time.
Persistent visual memory. Currently visual memories are session-scoped. Persistent storage would let Echo Walk recognize your front door, your office, your medication across sessions.
Wearable integration. Smart glasses or a chest-mounted camera would eliminate the "phone in hand" requirement, giving continuous vision without any user action.
Community object library. Users could contribute labeled visual memories — "this is a crosswalk button" — building a shared knowledge base that helps all Echo Walk users.
Tools are the new apps. Open-source the tool architecture. Developers build tools for their communities — transit schedules, prescription reading, currency recognition, museum guides. Instead of ten apps for ten tasks, ten tools in one voice. The app store becomes a tool library.
Built With
- amazon-web-services
- api
- audio
- cloudformation
- cloudfront
- codebuild
- cognito
- ecs
- fargate
- multimodal
- node.js
- nova-2-lite
- nova-2-sonic
- s3
- socket.io
- web
- webrtc