Benchy
AI-powered performance analysis and optimization for your codebase.
Inspiration
Personal experience: At past hackathons, we found that while spending a lot of time debugging was basically a given, a surprisingly large amount of time was also spent trying to optimize our project. Even if our project worked and was super polished, it would be meaningless if the demo didn't run in time.
Every developer has stared at a slow API endpoint or a memory-leaking loop, knowing something is wrong but not knowing where to start. Performance optimization is one of the most impactful things you can do for a codebase but also one of the most time-consuming, requiring profiling tools, benchmarking, domain knowledge, and countless changes that can take days to fully test.
We wanted to ask: what if an AI could do all of that for you, end-to-end, in minutes?
That question became Benchy, our autonomous AI agent that clones your GitHub repo, maps its execution graph, benchmarks every hotspot in a sandboxed environment, generates optimized code, verifies correctness, and opens a pull request. No setup. No profiling expertise required. Just smooth sailing.
What it does
Benchy is a fully autonomous performance engineer. You feed it a codebase, and it gives you benchmark-verified, faster code. You can use it in three different ways, all powered by a single pipeline:
- The Web App: Paste a GitHub URL, configure your optimization bias (Speed, Memory, or Balanced), and watch a real-time stream as Benchy clones, maps, benchmarks, and optimizes your code, culminating in a gorgeous dashboard and an automated Pull Request.
- The VS Code Extension: Right-click to "Optimize Workspace" or "Optimize Current File." Benchy streams its progress to your notification bar and opens side-by-side diff tabs so you can accept AI-verified optimizations instantly.
- The MCP Server: An analyze_local_code tool exposed over stdio transport, so AI assistants like Cursor and Claude can invoke Benchy mid-conversation.
How we built it
To pull off an architecture this complex, we built a highly parallelized, deterministic engine that strictly manages AI execution. The system is broken down into four core pillars:
Orchestration & The AI Pipeline
At the heart of Benchy is a FastAPI backend managed by Railtracks. By wrapping our workflow in @rt.function_node, Railtracks handles state persistence and enforces timeouts for runaway processes.
To prevent the LLM from hallucinating code paths, we integrated Tree-sitter to deterministically extract the codebase's Abstract Syntax Tree (AST). Once parsed, Gemini 3 Flash acts as a triage agent, analyzing the AST to group files into priority chunks. Next, Gemini 3.1 Pro rewrites the bottlenecked code. Every single AI interaction is wrapped in PydanticAI to guarantee structured, validated JSON outputs. Using asyncio.gather, these operations, including mapping the execution graph, rewriting code, and generating summaries, all run in parallel.
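The fan-out idea can be sketched in a few lines. This is a minimal stand-in, not Benchy's actual code: the three stage functions below are hypothetical placeholders for the real Railtracks nodes and Gemini calls.

```python
import asyncio

# Hypothetical stand-ins for the real pipeline stages; the names and
# return shapes are illustrative, not Benchy's actual signatures.
async def map_execution_graph(files: list[str]) -> dict:
    await asyncio.sleep(0)  # placeholder for the Tree-sitter + triage step
    return {"nodes": files}

async def rewrite_code(files: list[str]) -> dict:
    await asyncio.sleep(0)  # placeholder for the LLM rewrite call
    return {"rewritten": len(files)}

async def generate_summaries(files: list[str]) -> dict:
    await asyncio.sleep(0)  # placeholder for the summary-generation call
    return {"summaries": len(files)}

async def run_pipeline(files: list[str]) -> list[dict]:
    # The three stages are independent of one another, so a single
    # asyncio.gather call runs them concurrently and collects results.
    return await asyncio.gather(
        map_execution_graph(files),
        rewrite_code(files),
        generate_summaries(files),
    )

graph, rewrites, summaries = asyncio.run(run_pipeline(["app.py", "db.py"]))
```

Because each stage is an awaitable, adding a new parallel stage is just one more argument to `gather`.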
Sandboxed Benchmarking & Verification
We could not rely on AI simply "guessing" that code was faster; we needed mathematical proof. To execute arbitrary code safely, we integrated Modal cloud sandboxes. Every chunk of code Benchy analyzes is shipped to a secure, ephemeral Modal container where benchmarks are executed to gather real avg_time_ms and memory metrics.
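The metric-gathering step can be approximated locally with the standard library. This is a hedged sketch, not the Modal harness itself: `benchmark` is an illustrative name, and the real runs happen inside the ephemeral sandbox.

```python
import time
import tracemalloc

def benchmark(fn, *args, runs: int = 5) -> dict:
    """Measure average wall time (ms) and peak traced memory (KiB) for fn.
    Illustrative local stand-in for the sandboxed Modal benchmark run."""
    timings = []
    tracemalloc.start()
    for _ in range(runs):
        start = time.perf_counter()
        fn(*args)
        timings.append((time.perf_counter() - start) * 1000.0)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {
        "avg_time_ms": sum(timings) / len(timings),
        "peak_mem_kib": peak / 1024.0,
    }

stats = benchmark(lambda n: sum(range(n)), 100_000)
```

Averaging over several runs smooths out scheduler jitter, which matters when the before/after delta is small.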
This powers our Optimization Retry Loop: Benchy checks the newly optimized code for correctness and performance regressions. If the new code breaks or runs slower, Benchy attempts a new strategy up to two times. If it still fails, it safely reverts to the original code.
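The retry-and-revert logic reduces to a small loop. A minimal sketch under stated assumptions: `propose_optimization` and `measure_ms` are hypothetical callables standing in for the LLM rewrite and the sandboxed benchmark.

```python
def optimize_with_retries(original_fn, propose_optimization, test_cases,
                          baseline_ms, measure_ms, max_attempts=3):
    """Accept a candidate only if it matches the original on every test
    case AND beats the baseline time; otherwise retry, then revert.
    max_attempts=3 mirrors one initial try plus up to two retries."""
    for attempt in range(max_attempts):
        candidate = propose_optimization(original_fn, attempt)
        correct = all(candidate(x) == original_fn(x) for x in test_cases)
        if correct and measure_ms(candidate) < baseline_ms:
            return candidate  # verified: faster and still correct
    return original_fn  # every attempt failed, so safely revert

# Toy usage: swap a linear sum for the closed-form triangular number.
def slow_sum(n):
    return sum(range(n))

def fast_sum(n):
    return n * (n - 1) // 2

picked = optimize_with_retries(
    slow_sum,
    lambda fn, attempt: fast_sum,   # stand-in for the LLM proposal
    test_cases=[0, 10, 999],
    baseline_ms=10.0,
    measure_ms=lambda fn: 1.0,      # stand-in for the sandbox timing
)
```

Keeping "revert to the original" as the loop's fall-through makes a performance regression structurally impossible.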
Frontend & Real-Time Visualization
The web interface (authenticated via NextAuth for GitHub repo access) provides deep visibility into the pipeline. Because optimization is a long-running process, we built a Server-Sent Events (SSE) stream (rt.broadcast()) that pipes backend progress directly to the frontend. The results dashboard is highly visual: an interactive React Flow graph maps the AST with severity heatmaps, animated counters track overall speedup, and a 5-axis Recharts radar chart compares before-and-after metrics. And to top it all off, the landing page features a spinning 3D ASCII Benchy boat rendered in React Three Fiber.
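Each progress update travels as a Server-Sent Events frame, and the wire format itself is simple enough to sketch. The event name and payload below are illustrative; in Benchy the broadcast plumbing is handled by rt.broadcast().

```python
import json

def sse_event(event: str, data: dict) -> str:
    """Format one SSE frame: an event name line, a data line carrying a
    JSON payload, and the blank line that terminates the frame."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"

frame = sse_event("progress", {"stage": "benchmark", "pct": 62})
```

On the frontend, an `EventSource` listener for the `progress` event name receives each frame as it is flushed, with no polling.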
Multi-Surface Integration
We engineered the backend to be entirely decoupled from the UI, allowing us to build two additional developer tools. We shipped a VS Code extension (.vsix) that gathers local files, auto-detects Python/JS, and triggers native side-by-side diff tabs right in the editor. Furthermore, we packaged the execution flow into an MCP (Model Context Protocol) Server, exposing an analyze_local_code tool over stdio transport so AI assistants like Cursor and Claude can invoke Benchy mid-conversation.
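The stdio transport boils down to exchanging JSON-RPC 2.0 messages over stdin/stdout. This is a hedged, minimal stand-in for the idea, not the MCP SDK: `analyze_local_code` here is a trivial placeholder for Benchy's real tool.

```python
import json

def analyze_local_code(params: dict) -> dict:
    # Placeholder for the real tool, which runs the full Benchy pipeline.
    files = params.get("files", [])
    return {"files_analyzed": len(files), "status": "ok"}

def handle_message(raw: str) -> str:
    """Dispatch one JSON-RPC request for a tool call and return the reply
    that would be written back over stdout."""
    req = json.loads(raw)
    params = req.get("params", {})
    if req.get("method") == "tools/call" and params.get("name") == "analyze_local_code":
        result = analyze_local_code(params.get("arguments", {}))
    else:
        result = {"error": "unknown method"}
    return json.dumps({"jsonrpc": "2.0", "id": req.get("id"), "result": result})

reply = handle_message(json.dumps({
    "jsonrpc": "2.0", "id": 1, "method": "tools/call",
    "params": {"name": "analyze_local_code",
               "arguments": {"files": ["a.py", "b.py"]}},
}))
```

Because the transport is just line-delimited JSON, any assistant that speaks MCP can drive the tool without knowing anything about the backend behind it.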
Challenges we ran into
- The "Smart but Reckless" LLM Problem: Early on, the AI would write incredibly fast code that completely broke the core logic. Building the Optimization Retry Loop combined with sandboxed execution on Modal was incredibly difficult, but it completely solved the regression issue.
- Deterministic Execution Mapping: Relying purely on AI to map dependencies resulted in made-up files. Switching to Tree-sitter for deterministic AST parsing anchored the AI to reality.
- Long-Running Web Requests: An optimization run can take minutes. Standard HTTP requests would time out. Implementing Server-Sent Events (SSE) from the Railtracks backend to the React frontend was challenging but vital for UX.
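The deterministic-parsing fix can be illustrated with Python's builtin ast module as a lightweight stand-in for Tree-sitter: walk the tree, list the functions that actually exist, and record what each one calls, so the LLM is anchored to real code paths.

```python
import ast

def extract_call_graph(source: str) -> dict[str, list[str]]:
    """Map each top-level function name to the names it calls.
    Illustrative stand-in for the Tree-sitter extraction step."""
    tree = ast.parse(source)
    graph: dict[str, list[str]] = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            graph[node.name] = [
                c.func.id
                for c in ast.walk(node)
                if isinstance(c, ast.Call) and isinstance(c.func, ast.Name)
            ]
    return graph

call_graph = extract_call_graph(
    "def helper(x):\n    return x * 2\n\n"
    "def main(x):\n    return helper(x) + 1\n"
)
```

Handing the model this graph instead of raw files means it can only reference functions the parser has actually seen.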
Accomplishments that we're proud of
- One Pipeline, Three Access Surfaces: We didn't just build a web app. We built a core engine that powers a Web dashboard, a native VS Code Extension, and an MCP Server tool.
- Provable AI: Benchy doesn't just say "this code looks better." It compiles it, runs it in a cloud sandbox, measures the exact execution time and memory footprint, and proves the speedup with hard numbers before opening a Pull Request.
- The Benchy Score Dashboard: The visual representation of the performance data, specifically the severity heatmap on the React Flow graph and the before/after Radar charts, make the complex performance metrics easy to digest.
What we learned
- An LLM is only as good as the sandbox you put it in. Giving Gemini 3.1 Pro the ability to compile, run, fail, and try again resulted in exponentially better code than a one-shot prompt.
- Abstract Syntax Trees are underrated. Parsing code deterministically before feeding it to an LLM saves an immense amount of token context and prevents hallucinations. ASTs also let the LLM understand how different parts of the program interact, enabling better changes without destroying the entire program.
What's next for Benchy
Currently, Benchy supports Python and JavaScript/TypeScript. We want to expand our Tree-sitter and benchmark harnesses to support compiled languages like Rust and Go. We're also looking to make Benchy fully usable by your favourite agents through the MCP server; we built an initial version, but didn't get it working end-to-end in time. Finally, we want to build a database of "fast function optimizations," keeping track of specific niche optimizations that are automatically applied when detected. This will help optimize those niche lines of code that fly under the radar.
