WonderTale is an AI storytelling companion that turns any topic a child speaks into a personalized, multimodal adventure — in real time, through a continuous voice conversation.

A child says "Tell me about volcanoes!" and WonderTale:

  1. Researches the topic using Google Search to find real, verified facts
  2. Writes a personalized story where the child is the hero, weaving those facts into the narrative
  3. Illustrates each scene with custom AI-generated artwork
  4. Narrates with an expressive voice using character voices and dramatic pacing
  5. Displays accessibility-formatted text (dyslexia-friendly fonts, ADHD-paced segments)
  6. Offers two AI-generated story branches so the child steers the adventure
  7. Quizzes comprehension with fun questions tied to the real facts

All of this happens simultaneously — the voice narration is never blocked by image generation, and the child can interrupt at any time to ask questions or change direction.


Features & Functionality

Voice-First Interaction

  • Bidirectional audio streaming via Gemini Live API with the Aoede voice
  • Children speak naturally — barge-in support lets them interrupt mid-narration
  • Affective dialog detection adapts tone based on the child's emotional state
  • Session resumption on disconnect (tokens valid ~2 hours)
  • Context window compression for sessions beyond 10 minutes

Fact-Grounded Storytelling

  • Every story starts with a Research Agent that queries Google Search for real-world facts
  • Facts are woven into the narrative by the Story Architect Agent — the child learns real science, history, or geography through adventure
  • The child is always the hero of their own story

Real-Time Illustration

  • A Prompt Architect Agent translates story paragraphs into rich artistic prompts
  • Gemini Image Generation produces custom illustrations per scene
  • Images are delivered via a parallel side-channel — narration never waits for rendering
  • Illustrations stored on Cloudflare R2 for the Story Library

Accessibility — Designed with Neurodiversity in Mind

  • Dyslexia mode — OpenDyslexic font, increased letter/word spacing, relaxed line height
  • ADHD pacing — short segments with animated progress indicators, word-by-word text reveal
  • Autism structure — predictable narrative scaffolding, emotion labels, consistent story patterns
  • Full audio narration + image alt-text for visual impairment
  • Parent Dashboard for managing all accessibility settings per child

Interactive Storytelling

  • Story Choices — two AI-generated narrative branches at the end of each chapter
  • Wand button — tap during narration to steer the story in a new direction
  • Discover Panel — recap research facts, view illustrations, and take a comprehension quiz
  • Multi-chapter stories that loop: child responds → next chapter generates → new illustration, choices, and quiz

Story Library & Persistence

  • Completed stories persisted to PostgreSQL with all segments, illustrations, research facts, quiz questions, and choices
  • Library screen for replaying any past story without a new AI session
  • Story thumbnails generated from the first illustration

Subscription System

  • Basic ($5/mo) and Plus ($20/mo) tiers with a free 30-day Plus trial
  • Tier enforcement at the WebSocket layer (audio mode gating) and tool layer (daily story limits, illustration caps)
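The tool-layer side of this gating can be sketched as a simple quota check. This is an illustrative sketch only: the tier names match the document, but the specific limit numbers and function names are assumptions, not the real configuration.

```python
# Hypothetical tier limits; the real values and audio gating live in the
# WebSocket and tool layers of the production service.
TIER_LIMITS = {
    "basic": {"stories_per_day": 3, "illustrations_per_story": 2, "native_audio": False},
    "plus":  {"stories_per_day": 20, "illustrations_per_story": 8, "native_audio": True},
}

def can_start_story(tier: str, stories_today: int) -> bool:
    """Tool-layer gate: reject a new story once the daily quota is used up."""
    return stories_today < TIER_LIMITS[tier]["stories_per_day"]
```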

Technologies Used

Google AI & Cloud (Core)

  • Google ADK — Multi-agent orchestration framework: run_live for BIDI audio sessions, run_async for text sub-agents, FunctionTools for tool calling, SessionService for state management
  • Gemini Live API — Bidirectional audio streaming with gemini-2.5-flash-native-audio-preview: voice input/output, affective dialog, barge-in detection, session resumption
  • Gemini 2.5 Flash — Powers all text-mode sub-agents: Research Agent (temp 0.3), Story Architect (temp 0.9), Choices Agent (temp 0.85), Quiz Agent (temp 0.4), Illustration Architect
  • Gemini Image Generation — gemini-2.5-flash-image for per-scene illustration generation
  • Google Cloud Run — Serverless WebSocket hosting for the production deployment
  • Google Cloud SQL — PostgreSQL database for stories, profiles, and subscriptions

Data Sources

  • Google Search — real-world facts for every story topic, accessed via ADK's built-in google_search tool
  • Google Trends (via pytrends) — trending topics from 3 timezone regions (Asia, Europe, Americas), filtered and reformatted by Gemini Flash into child-safe story suggestions, refreshed on a cron schedule
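Before the Gemini Flash safety pass, the raw trend lists from the three regions need to be merged and deduplicated. A minimal sketch of that merge step, assuming the pytrends results have already been reduced to plain topic strings (the function name and round-robin strategy are illustrative):

```python
def merge_trends(regions: dict[str, list[str]], limit: int = 10) -> list[str]:
    """Interleave trending topics from each region, dropping case-insensitive
    duplicates, before handing them to Gemini Flash for child-safe rewriting."""
    seen: set[str] = set()
    merged: list[str] = []
    lists = list(regions.values())
    # Round-robin across regions so no single timezone dominates the list.
    for i in range(max(map(len, lists), default=0)):
        for topics in lists:
            if i < len(topics):
                key = topics[i].strip().lower()
                if key not in seen:
                    seen.add(key)
                    merged.append(topics[i])
    return merged[:limit]
```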

Architecture

The system runs three concurrent coroutines per WebSocket connection:

  • upstream — microphone PCM audio → Gemini Live API
  • downstream — Gemini Live events → client (audio, transcriptions, tool signals)
  • drain — per-session asyncio.Queue → client (illustrations, text, choices, quiz)

The drain coroutine operates on a side-channel independent from the BIDI audio stream. This is the key design principle: narration is never blocked by illustration generation. The audio stream and media queue operate independently, so the child hears the story while illustrations, choices, and quiz questions arrive in parallel.
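The three-coroutine fan-out can be sketched as follows. Everything here is structural: the parameter names, event shapes, and the None shutdown sentinel are stand-ins for the real WebSocket and Gemini Live session objects, but the shape (two stream pumps plus an independent queue drain, joined with asyncio.gather) matches the design described above.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable

async def run_session(
    mic: AsyncIterator[bytes],                 # PCM frames from the client
    live_events: AsyncIterator[dict],          # events from Gemini Live
    media_queue: asyncio.Queue,                # side-channel: images, text, quiz
    send_to_model: Callable[[bytes], Awaitable[None]],
    send_to_client: Callable[[object], Awaitable[None]],
) -> None:
    """Three concurrent coroutines per connection; a None on the queue
    shuts the drain down. All names are illustrative."""

    async def upstream():
        async for frame in mic:
            await send_to_model(frame)

    async def downstream():
        async for event in live_events:
            await send_to_client(event)

    async def drain():
        # Media arrives independently of the BIDI audio stream, so a slow
        # illustration render never stalls narration.
        while (item := await media_queue.get()) is not None:
            await send_to_client(item)

    await asyncio.gather(upstream(), downstream(), drain())
```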

Six specialized agents coordinate through the Orchestrator:

  1. Orchestrator Agent (Gemini Live) — the voice interface, manages the conversation and calls tools
  2. Research Agent (Gemini Flash + Google Search) — verifies facts before any story is told
  3. Story Architect (Gemini Flash) — writes personalized adventures with the child as hero
  4. Prompt Architect (Gemini Flash) — translates story paragraphs into artistic image prompts
  5. Choices Agent (Gemini Flash) — generates two narrative branches for interactive storytelling
  6. Quiz Agent (Gemini Flash) — creates comprehension questions tied to the research facts

After the Story Architect completes, four parallel streams launch simultaneously:

  • The Orchestrator narrates a vivid spoken summary (audio BIDI)
  • Accessibility text segments push through the media queue (50ms stagger)
  • The Prompt Architect → Gemini Image pipeline generates and uploads illustrations
  • The Choices Agent (3s delay) and Quiz Agent (6s delay) produce interactive content
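The non-audio streams above can be sketched as one asyncio fan-out. The agent pipelines are passed in as async callables here, and the helper names are hypothetical; the default delays mirror the timings in the list (50 ms stagger, 3 s choices, 6 s quiz).

```python
import asyncio

async def fan_out_story(story, media_queue, illustrate, make_choices, make_quiz,
                        stagger=0.05, choices_delay=3.0, quiz_delay=6.0):
    """Fan out the side-channel pipelines once the Story Architect finishes.
    `illustrate`, `make_choices`, `make_quiz` stand in for the real agents."""

    async def push_text():
        for seg in story["segments"]:
            await media_queue.put({"type": "text", "data": seg})
            await asyncio.sleep(stagger)       # 50 ms stagger between segments

    async def after(delay, fn, kind):
        await asyncio.sleep(delay)
        await media_queue.put({"type": kind, "data": await fn(story)})

    # Narration runs on the audio BIDI stream; everything here is side-channel.
    await asyncio.gather(
        push_text(),
        after(0, illustrate, "illustration"),
        after(choices_delay, make_choices, "choices"),
        after(quiz_delay, make_quiz, "quiz"),
    )
```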

Findings & Learnings

The Gemini Live API is powerful but requires creative engineering

The native audio model (gemini-2.5-flash-native-audio-preview) delivers an incredible voice experience — expressive narration, character voices, emotional detection. But we discovered it is significantly less reliable at triggering function calls from audio input alone than from text input. Our solution was a multi-layered reinforcement pattern:

  1. Text shadow injection — when the child speaks a topic, we intercept the transcription and inject a parallel text turn into the BIDI stream
  2. System-framing — wrapping the topic as a [SYSTEM: ...] instruction, because plain user text turns were sometimes acknowledged vocally but the tool call was skipped
  3. Retry nudges — if the model voices an excited acknowledgment but never calls the tool, we inject up to 3 follow-up system nudges
  4. Audio suppression — incoming PCM chunks are temporarily dropped during text injection to prevent VAD from resetting the model's attention
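The four layers above can be sketched as one injection routine. This is a simplified illustration: `send_text`, `tool_call_observed`, and the suppress flag are stand-ins for the real session plumbing, and the [SYSTEM: ...] wording is representative rather than exact.

```python
import asyncio

async def inject_topic_turn(live_session, topic, suppress_audio: asyncio.Event,
                            max_nudges: int = 3) -> bool:
    """Layered reinforcement for audio-mode tool calling (sketch).
    The transcribed topic is echoed back as a system-framed text turn,
    with mic audio suppressed so VAD does not reset the model's attention."""
    suppress_audio.set()                      # drop incoming PCM during injection
    try:
        await live_session.send_text(
            f"[SYSTEM: The child chose '{topic}'. Call research_topic now.]")
        for _ in range(max_nudges):
            if await live_session.tool_call_observed():
                return True
            # Model acknowledged vocally but skipped the tool: nudge again.
            await live_session.send_text("[SYSTEM: Reminder: call research_topic.]")
        return False
    finally:
        suppress_audio.clear()                # resume forwarding mic audio
```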

This was the single most time-consuming challenge of the project and represents days of iteration.

ADK BIDI event timing is non-deterministic

We discovered that for text-injected turns in native audio mode, ADK's get_function_calls() and get_function_responses() events can arrive approximately 13 seconds after the tool has already completed — or sometimes never at all. This broke our drain gate mechanism (which waited for the tool_result:generate_story event before sending media to the client).

The fix was a dual-gate signaling pattern: the tool itself directly opens the drain gate via signal_story_ready() as the primary path, while the BIDI event serves as a redundant backup. This resilience pattern was critical for a smooth user experience.
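Because an asyncio.Event is idempotent, the dual-gate pattern falls out naturally: both paths simply set the same event, so a late or missing BIDI event is harmless. A minimal sketch (class and event-dict shape are illustrative; `signal_story_ready` is the name used above):

```python
import asyncio

class DrainGate:
    """Dual-gate: the tool opens the gate directly (primary path) and the
    ADK BIDI tool-result event opens it again (redundant backup)."""
    def __init__(self):
        self._open = asyncio.Event()

    def signal_story_ready(self):
        # Primary path: called from inside the tool as soon as it completes.
        self._open.set()

    def on_bidi_tool_result(self, event: dict):
        # Backup path: may arrive ~13 s late, or never; setting twice is a no-op.
        if event.get("tool") == "generate_story":
            self._open.set()

    async def wait_open(self):
        await self._open.wait()
```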

Thinking budget matters — even for sub-agents

When using gemini-2.5-flash as our Prompt Architect Agent, the model's internal chain-of-thought consumed most of the token budget, leaving only ~88 characters for the actual image prompt output. Setting thinking_budget=0 for this deterministic translation task (story text → art prompt) immediately fixed prompt quality. Lesson: not every agent needs to "think" — some just need to translate.
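With the google-genai SDK, disabling thinking for such an agent is a one-line config change via `ThinkingConfig(thinking_budget=0)`. A hedged sketch (the temperature and token values are illustrative, and the exact wiring depends on how the agent is constructed in ADK):

```python
from google.genai import types

# Disable internal chain-of-thought for the deterministic prompt-translation
# task so the whole output budget goes to the art prompt itself.
prompt_architect_config = types.GenerateContentConfig(
    temperature=0.7,                                  # illustrative value
    max_output_tokens=256,
    thinking_config=types.ThinkingConfig(thinking_budget=0),
)
```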

Output transcription events are progressive rewrites, not additive

The output_transcription events from Gemini Live arrive as progressive rewrites — each event replaces the previous text rather than appending to it. Naively forwarding every event to the client caused duplicated and garbled transcript text. Only forwarding finished=True events produces clean, one-sentence-per-utterance transcriptions.
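The fix reduces to a filter on the event stream. A minimal sketch, assuming each event carries `text` and `finished` fields as described above:

```python
def clean_transcripts(events: list[dict]) -> list[str]:
    """Keep only finalized transcription events. Intermediate events are
    progressive rewrites of the same utterance, so forwarding them naively
    produces duplicated, garbled text on the client."""
    return [e["text"] for e in events if e.get("finished")]
```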

Context window compression drops tool results on session resume

After ~10 minutes, Gemini Live's context window compression kicks in. When a child disconnects and reconnects (session resume), the compression may have dropped the original research_topic result. Without those facts, the model couldn't form a generate_story call for chapter 2. The fix: on resume, we fetch the research facts from the database and re-inject them as a [SYSTEM CONTEXT] message.
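The re-injection step can be sketched as a small formatter over the facts fetched from the database. The [SYSTEM CONTEXT] framing matches the description above, but the exact wording here is illustrative:

```python
def build_resume_context(facts: list[str], topic: str) -> str:
    """Rebuild the context dropped by window compression on session resume,
    so the model can form a generate_story call for the next chapter."""
    bullet_facts = "\n".join(f"- {f}" for f in facts)
    return (f"[SYSTEM CONTEXT: Resumed session. Topic: {topic}. "
            f"Previously researched facts:\n{bullet_facts}\n"
            "Use these facts for the next chapter instead of re-researching.]")
```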

The side-channel architecture was the right call

Our decision to separate media delivery (illustrations, text, choices, quiz) from the audio BIDI stream via an asyncio.Queue side-channel proved essential. Image generation takes 5-15 seconds per illustration — if this blocked narration, the child would experience dead silence. The three-coroutine model (upstream, downstream, drain) ensures the voice conversation flows naturally while rich media arrives independently. This architecture would generalize well to any multimodal AI application where different output modalities have different latencies.

Graceful degradation everywhere

Every layer of the illustration pipeline has independent error handling: the Prompt Architect falls back to a template prompt, image generation failures don't crash the story, R2 upload failures fall back to base64 inline images, and MOCK_IMAGES=true mode generates solid-color PNGs using only struct and zlib (no Pillow) for zero-cost local development. This let us develop and iterate on the full UI flow without any API quota.
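A solid-color PNG really can be produced with nothing but struct and zlib, since a PNG is just a signature plus length-prefixed, CRC-checked chunks. A sketch of what such a MOCK_IMAGES helper might look like (the function name is illustrative, not the project's actual code):

```python
import struct
import zlib

def solid_png(width: int, height: int, rgb: tuple[int, int, int]) -> bytes:
    """Build a solid-color 8-bit RGB PNG without Pillow."""
    def chunk(ctype: bytes, data: bytes) -> bytes:
        # Each chunk: 4-byte big-endian length, type, data, CRC32(type + data).
        return (struct.pack(">I", len(data)) + ctype + data
                + struct.pack(">I", zlib.crc32(ctype + data)))

    # IHDR: width, height, bit depth 8, color type 2 (RGB), default methods.
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 2, 0, 0, 0)
    # Each scanline starts with filter byte 0, then raw RGB pixels.
    scanline = b"\x00" + bytes(rgb) * width
    idat = zlib.compress(scanline * height)
    return (b"\x89PNG\r\n\x1a\n"
            + chunk(b"IHDR", ihdr)
            + chunk(b"IDAT", idat)
            + chunk(b"IEND", b""))
```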
