WonderTale is an AI storytelling companion that turns any topic a child speaks into a personalized, multimodal adventure — in real time, through a continuous voice conversation.

A child says "Tell me about volcanoes!" and WonderTale:

  1. Researches the topic using Google Search to find real, verified facts
  2. Writes a personalized story where the child is the hero, weaving those facts into the narrative
  3. Illustrates each scene with custom AI-generated artwork
  4. Narrates with an expressive voice using character voices and dramatic pacing
  5. Displays accessibility-formatted text (dyslexia-friendly fonts, ADHD-paced segments)
  6. Offers two AI-generated story branches so the child steers the adventure
  7. Quizzes comprehension with fun questions tied to the real facts

All of this happens simultaneously — the voice narration is never blocked by image generation, and the child can interrupt at any time to ask questions or change direction.


Features & Functionality

Voice-First Interaction

  • Bidirectional audio streaming via Gemini Live API with the Aoede voice
  • Children speak naturally — barge-in support lets them interrupt mid-narration
  • Affective dialog detection adapts tone based on the child's emotional state
  • Session resumption on disconnect (tokens valid ~2 hours)
  • Context window compression for sessions beyond 10 minutes

Fact-Grounded Storytelling

  • Every story starts with a Research Agent that queries Google Search for real-world facts
  • Facts are woven into the narrative by the Story Architect Agent — the child learns real science, history, or geography through adventure
  • The child is always the hero of their own story

Real-Time Illustration

  • A Prompt Architect Agent translates story paragraphs into rich artistic prompts
  • Gemini Image Generation produces custom illustrations per scene
  • Images are delivered via a parallel side-channel — narration never waits for rendering
  • Illustrations stored on Cloudflare R2 for the Story Library

Accessibility — Designed with Neurodiversity in Mind

  • Dyslexia mode — OpenDyslexic font, increased letter/word spacing, relaxed line height
  • ADHD pacing — short segments with animated progress indicators, word-by-word text reveal
  • Autism structure — predictable narrative scaffolding, emotion labels, consistent story patterns
  • Full audio narration + image alt-text for visual impairment
  • Parent Dashboard for managing all accessibility settings per child

Interactive Storytelling

  • Story Choices — two AI-generated narrative branches at the end of each chapter
  • Wand button — tap during narration to steer the story in a new direction
  • Discover Panel — recap research facts, view illustrations, and take a comprehension quiz
  • Multi-chapter stories that loop: child responds → next chapter generates → new illustration, choices, and quiz

Story Library & Persistence

  • Completed stories persisted to PostgreSQL with all segments, illustrations, research facts, quiz questions, and choices
  • Library screen for replaying any past story without a new AI session
  • Story thumbnails generated from the first illustration

Subscription System

  • Basic ($5/mo) and Plus ($20/mo) tiers with a free 30-day Plus trial
  • Tier enforcement at the WebSocket layer (audio mode gating) and tool layer (daily story limits, illustration caps)
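The tool-layer side of this gating can be sketched as a simple quota check. This is an illustrative sketch only: the tier names match the document, but the specific limit numbers and function names are assumptions, not the real configuration.

```python
# Hypothetical tier limits; the real values and audio gating live in the
# WebSocket and tool layers of the production service.
TIER_LIMITS = {
    "basic": {"stories_per_day": 3, "illustrations_per_story": 2, "native_audio": False},
    "plus":  {"stories_per_day": 20, "illustrations_per_story": 8, "native_audio": True},
}

def can_start_story(tier: str, stories_today: int) -> bool:
    """Tool-layer gate: reject a new story once the daily quota is used up."""
    return stories_today < TIER_LIMITS[tier]["stories_per_day"]
```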

Technologies Used

Google AI & Cloud (Core)

  • Google ADK — Multi-agent orchestration framework: run_live for BIDI audio sessions, run_async for text sub-agents, FunctionTools for tool calling, SessionService for state management
  • Gemini Live API — Bidirectional audio streaming with gemini-2.5-flash-native-audio-preview: voice input/output, affective dialog, barge-in detection, session resumption
  • Gemini 2.5 Flash — Powers all text-mode sub-agents: Research Agent (temp 0.3), Story Architect (temp 0.9), Choices Agent (temp 0.85), Quiz Agent (temp 0.4), Illustration Architect
  • Gemini Image Generation — gemini-2.5-flash-image for per-scene illustration generation
  • Google Cloud Run — Serverless WebSocket hosting for the production deployment
  • Google Cloud SQL — PostgreSQL database for stories, profiles, and subscriptions

Data Sources

  • Google Search — real-world facts for every story topic, accessed via ADK's built-in google_search tool
  • Google Trends (via pytrends) — trending topics from 3 timezone regions (Asia, Europe, Americas), filtered and reformatted by Gemini Flash into child-safe story suggestions, refreshed on a cron schedule
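Before the Gemini Flash safety pass, the raw trend lists from the three regions need to be merged and deduplicated. A minimal sketch of that merge step, assuming the pytrends results have already been reduced to plain topic strings (the function name and round-robin strategy are illustrative):

```python
def merge_trends(regions: dict[str, list[str]], limit: int = 10) -> list[str]:
    """Interleave trending topics from each region, dropping case-insensitive
    duplicates, before handing them to Gemini Flash for child-safe rewriting."""
    seen: set[str] = set()
    merged: list[str] = []
    lists = list(regions.values())
    # Round-robin across regions so no single timezone dominates the list.
    for i in range(max(map(len, lists), default=0)):
        for topics in lists:
            if i < len(topics):
                key = topics[i].strip().lower()
                if key not in seen:
                    seen.add(key)
                    merged.append(topics[i])
    return merged[:limit]
```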

Architecture

The system runs three concurrent coroutines per WebSocket connection:

  • upstream — microphone PCM audio → Gemini Live API
  • downstream — Gemini Live events → client (audio, transcriptions, tool signals)
  • drain — per-session asyncio.Queue → client (illustrations, text, choices, quiz)

The drain coroutine operates on a side-channel independent from the BIDI audio stream. This is the key design principle: narration is never blocked by illustration generation. The audio stream and media queue operate independently, so the child hears the story while illustrations, choices, and quiz questions arrive in parallel.
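The three-coroutine fan-out can be sketched as follows. Everything here is structural: the parameter names, event shapes, and the None shutdown sentinel are stand-ins for the real WebSocket and Gemini Live session objects, but the shape (two stream pumps plus an independent queue drain, joined with asyncio.gather) matches the design described above.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable

async def run_session(
    mic: AsyncIterator[bytes],                 # PCM frames from the client
    live_events: AsyncIterator[dict],          # events from Gemini Live
    media_queue: asyncio.Queue,                # side-channel: images, text, quiz
    send_to_model: Callable[[bytes], Awaitable[None]],
    send_to_client: Callable[[object], Awaitable[None]],
) -> None:
    """Three concurrent coroutines per connection; a None on the queue
    shuts the drain down. All names are illustrative."""

    async def upstream():
        async for frame in mic:
            await send_to_model(frame)

    async def downstream():
        async for event in live_events:
            await send_to_client(event)

    async def drain():
        # Media arrives independently of the BIDI audio stream, so a slow
        # illustration render never stalls narration.
        while (item := await media_queue.get()) is not None:
            await send_to_client(item)

    await asyncio.gather(upstream(), downstream(), drain())
```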

Six specialized agents coordinate through the Orchestrator:

  1. Orchestrator Agent (Gemini Live) — the voice interface, manages the conversation and calls tools
  2. Research Agent (Gemini Flash + Google Search) — verifies facts before any story is told
  3. Story Architect (Gemini Flash) — writes personalized adventures with the child as hero
  4. Prompt Architect (Gemini Flash) — translates story paragraphs into artistic image prompts
  5. Choices Agent (Gemini Flash) — generates two narrative branches for interactive storytelling
  6. Quiz Agent (Gemini Flash) — creates comprehension questions tied to the research facts

After the Story Architect completes, four parallel streams launch simultaneously:

  • The Orchestrator narrates a vivid spoken summary (audio BIDI)
  • Accessibility text segments push through the media queue (50ms stagger)
  • The Prompt Architect → Gemini Image pipeline generates and uploads illustrations
  • The Choices Agent (3s delay) and Quiz Agent (6s delay) produce interactive content
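The non-audio streams above can be sketched as one asyncio fan-out. The agent pipelines are passed in as async callables here, and the helper names are hypothetical; the default delays mirror the timings in the list (50 ms stagger, 3 s choices, 6 s quiz).

```python
import asyncio

async def fan_out_story(story, media_queue, illustrate, make_choices, make_quiz,
                        stagger=0.05, choices_delay=3.0, quiz_delay=6.0):
    """Fan out the side-channel pipelines once the Story Architect finishes.
    `illustrate`, `make_choices`, `make_quiz` stand in for the real agents."""

    async def push_text():
        for seg in story["segments"]:
            await media_queue.put({"type": "text", "data": seg})
            await asyncio.sleep(stagger)       # 50 ms stagger between segments

    async def after(delay, fn, kind):
        await asyncio.sleep(delay)
        await media_queue.put({"type": kind, "data": await fn(story)})

    # Narration runs on the audio BIDI stream; everything here is side-channel.
    await asyncio.gather(
        push_text(),
        after(0, illustrate, "illustration"),
        after(choices_delay, make_choices, "choices"),
        after(quiz_delay, make_quiz, "quiz"),
    )
```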

Findings & Learnings

The Gemini Live API is powerful but requires creative engineering

The native audio model (gemini-2.5-flash-native-audio-preview) delivers an incredible voice experience — expressive narration, character voices, emotional detection. But we discovered it is significantly less reliable at triggering function calls from audio input alone than from text input. Our solution was a multi-layered reinforcement pattern:

  1. Text shadow injection — when the child speaks a topic, we intercept the transcription and inject a parallel text turn into the BIDI stream
  2. System-framing — wrapping the topic as a [SYSTEM: ...] instruction, because plain user text turns were sometimes acknowledged vocally but the tool call was skipped
  3. Retry nudges — if the model voices an excited acknowledgment but never calls the tool, we inject up to 3 follow-up system nudges
  4. Audio suppression — incoming PCM chunks are temporarily dropped during text injection to prevent VAD from resetting the model's attention
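The four layers above can be sketched as one injection routine. This is a simplified illustration: `send_text`, `tool_call_observed`, and the suppress flag are stand-ins for the real session plumbing, and the [SYSTEM: ...] wording is representative rather than exact.

```python
import asyncio

async def inject_topic_turn(live_session, topic, suppress_audio: asyncio.Event,
                            max_nudges: int = 3) -> bool:
    """Layered reinforcement for audio-mode tool calling (sketch).
    The transcribed topic is echoed back as a system-framed text turn,
    with mic audio suppressed so VAD does not reset the model's attention."""
    suppress_audio.set()                      # drop incoming PCM during injection
    try:
        await live_session.send_text(
            f"[SYSTEM: The child chose '{topic}'. Call research_topic now.]")
        for _ in range(max_nudges):
            if await live_session.tool_call_observed():
                return True
            # Model acknowledged vocally but skipped the tool: nudge again.
            await live_session.send_text("[SYSTEM: Reminder: call research_topic.]")
        return False
    finally:
        suppress_audio.clear()                # resume forwarding mic audio
```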

This was the single most time-consuming challenge of the project and represents days of iteration.

ADK BIDI event timing is non-deterministic

We discovered that for text-injected turns in native audio mode, ADK's get_function_calls() and get_function_responses() events can arrive approximately 13 seconds after the tool has already completed — or sometimes never at all. This broke our drain gate mechanism (which waited for the tool_result:generate_story event before sending media to the client).

The fix was a dual-gate signaling pattern: the tool itself directly opens the drain gate via signal_story_ready() as the primary path, while the BIDI event serves as a redundant backup. This resilience pattern was critical for a smooth user experience.
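Because an asyncio.Event is idempotent, the dual-gate pattern falls out naturally: both paths simply set the same event, so a late or missing BIDI event is harmless. A minimal sketch (class and event-dict shape are illustrative; `signal_story_ready` is the name used above):

```python
import asyncio

class DrainGate:
    """Dual-gate: the tool opens the gate directly (primary path) and the
    ADK BIDI tool-result event opens it again (redundant backup)."""
    def __init__(self):
        self._open = asyncio.Event()

    def signal_story_ready(self):
        # Primary path: called from inside the tool as soon as it completes.
        self._open.set()

    def on_bidi_tool_result(self, event: dict):
        # Backup path: may arrive ~13 s late, or never; setting twice is a no-op.
        if event.get("tool") == "generate_story":
            self._open.set()

    async def wait_open(self):
        await self._open.wait()
```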

Thinking budget matters — even for sub-agents

When using gemini-2.5-flash as our Prompt Architect Agent, the model's internal chain-of-thought consumed most of the token budget, leaving only ~88 characters for the actual image prompt output. Setting thinking_budget=0 for this deterministic translation task (story text → art prompt) immediately fixed prompt quality. Lesson: not every agent needs to "think" — some just need to translate.
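With the google-genai SDK, disabling thinking for such an agent is a one-line config change via `ThinkingConfig(thinking_budget=0)`. A hedged sketch (the temperature and token values are illustrative, and the exact wiring depends on how the agent is constructed in ADK):

```python
from google.genai import types

# Disable internal chain-of-thought for the deterministic prompt-translation
# task so the whole output budget goes to the art prompt itself.
prompt_architect_config = types.GenerateContentConfig(
    temperature=0.7,                                  # illustrative value
    max_output_tokens=256,
    thinking_config=types.ThinkingConfig(thinking_budget=0),
)
```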

Output transcription events are progressive rewrites, not additive

The output_transcription events from Gemini Live arrive as progressive rewrites — each event replaces the previous text rather than appending to it. Naively forwarding every event to the client caused duplicated and garbled transcript text. Only forwarding finished=True events produces clean, one-sentence-per-utterance transcriptions.
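The fix reduces to a filter on the event stream. A minimal sketch, assuming each event carries `text` and `finished` fields as described above:

```python
def clean_transcripts(events: list[dict]) -> list[str]:
    """Keep only finalized transcription events. Intermediate events are
    progressive rewrites of the same utterance, so forwarding them naively
    produces duplicated, garbled text on the client."""
    return [e["text"] for e in events if e.get("finished")]
```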

Context window compression drops tool results on session resume

After ~10 minutes, Gemini Live's context window compression kicks in. When a child disconnects and reconnects (session resume), the compression may have dropped the original research_topic result. Without those facts, the model couldn't form a generate_story call for chapter 2. The fix: on resume, we fetch the research facts from the database and re-inject them as a [SYSTEM CONTEXT] message.
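The re-injection step can be sketched as a small formatter over the facts fetched from the database. The [SYSTEM CONTEXT] framing matches the description above, but the exact wording here is illustrative:

```python
def build_resume_context(facts: list[str], topic: str) -> str:
    """Rebuild the context dropped by window compression on session resume,
    so the model can form a generate_story call for the next chapter."""
    bullet_facts = "\n".join(f"- {f}" for f in facts)
    return (f"[SYSTEM CONTEXT: Resumed session. Topic: {topic}. "
            f"Previously researched facts:\n{bullet_facts}\n"
            "Use these facts for the next chapter instead of re-researching.]")
```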

The side-channel architecture was the right call

Our decision to separate media delivery (illustrations, text, choices, quiz) from the audio BIDI stream via an asyncio.Queue side-channel proved essential. Image generation takes 5-15 seconds per illustration — if this blocked narration, the child would experience dead silence. The three-coroutine model (upstream, downstream, drain) ensures the voice conversation flows naturally while rich media arrives independently. This architecture would generalize well to any multimodal AI application where different output modalities have different latencies.

Graceful degradation everywhere

Every layer of the illustration pipeline has independent error handling: the Prompt Architect falls back to a template prompt, image generation failures don't crash the story, R2 upload failures fall back to base64 inline images, and MOCK_IMAGES=true mode generates solid-color PNGs using only struct and zlib (no Pillow) for zero-cost local development. This let us develop and iterate on the full UI flow without any API quota.
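A solid-color PNG really can be produced with nothing but struct and zlib, since a PNG is just a signature plus length-prefixed, CRC-checked chunks. A sketch of what such a MOCK_IMAGES helper might look like (the function name is illustrative, not the project's actual code):

```python
import struct
import zlib

def solid_png(width: int, height: int, rgb: tuple[int, int, int]) -> bytes:
    """Build a solid-color 8-bit RGB PNG without Pillow."""
    def chunk(ctype: bytes, data: bytes) -> bytes:
        # Each chunk: 4-byte big-endian length, type, data, CRC32(type + data).
        return (struct.pack(">I", len(data)) + ctype + data
                + struct.pack(">I", zlib.crc32(ctype + data)))

    # IHDR: width, height, bit depth 8, color type 2 (RGB), default methods.
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 2, 0, 0, 0)
    # Each scanline starts with filter byte 0, then raw RGB pixels.
    scanline = b"\x00" + bytes(rgb) * width
    idat = zlib.compress(scanline * height)
    return (b"\x89PNG\r\n\x1a\n"
            + chunk(b"IHDR", ihdr)
            + chunk(b"IDAT", idat)
            + chunk(b"IEND", b""))
```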
