In high-stakes environments like biotech labs, surgical suites, and chemical plants, safety isn't a suggestion; it's a sequence. We realized that while Standard Operating Procedures (SOPs) are becoming more digital, they remain "dead data": the moment a technician looks away from the manual to the workspace, the safety logic vanishes. We built ProtocolEye to be the "living bridge" between the intent of the manual and the reality of the action. We were inspired to create an AI that doesn't just answer questions, but actively reasons through a physical workspace to prevent errors before they become accidents.

## How We Built It: The Gemini 3 Engine

ProtocolEye is a multimodal agentic auditor built on the Gemini 3 Flash architecture. We moved beyond simple prompting and built a stateful reasoning loop:

- **Deep reasoning.** We set the thinking level to `HIGH` to force the model into an internal cross-referencing state. Before flagging a violation, the model must "read" the SOP text and "view" the video frame simultaneously.
- **Stateful logic (thought signatures).** To maintain context over a 20-minute lab session, we use the new `thoughtSignature` parameter. By passing the signature back through each function call, the auditor maintains a continuous train of thought, ensuring that a rule cited on page 1 is still being enforced at minute 15 of the video.
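The signature-threading loop can be sketched as below. This is a minimal illustration with the model call stubbed out: the `thoughtSignature` field mirrors what the Gemini API returns on function-call parts, but helper names like `auditFrame` and `runAudit` are ours, not SDK API, and the stub's verdicts are placeholder values.

```typescript
// Sketch of ProtocolEye's stateful audit loop (model call stubbed).
// Threading `thoughtSignature` back into each turn is what keeps the
// model's train of thought continuous across a long session.

interface AuditTurn {
  verdict: "PASS" | "FAIL";
  thoughtSignature: string; // opaque token returned by the model
}

// Stub standing in for a generateContent call at thinking level HIGH.
// A real implementation would send the SOP excerpt, the video frame,
// and the prior signature to Gemini and parse the response.
function auditFrame(
  frameId: number,
  sopExcerpt: string,
  priorSignature: string | null
): AuditTurn {
  return {
    verdict: "PASS", // placeholder
    thoughtSignature: `sig-${frameId}-${priorSignature ?? "start"}`,
  };
}

function runAudit(frames: number[], sop: string): AuditTurn[] {
  const turns: AuditTurn[] = [];
  let signature: string | null = null; // state carried between turns
  for (const frame of frames) {
    const turn = auditFrame(frame, sop, signature);
    signature = turn.thoughtSignature; // pass the signature back
    turns.push(turn);
  }
  return turns;
}
```

The key design point is that the signature is opaque: the loop never inspects it, it only echoes it back, so the continuity lives entirely inside the model.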
## What We Learned: The Power of Coherence

The biggest revelation was native multimodal coherence. In legacy models, "multimodal" felt like two separate brains (vision and text) trying to talk through a translator. In Gemini 3, the vision is the text. We also learned that "vibe coding" the UI is fast, but "logic coding" the agent requires precision. By enforcing a reasoning-trace requirement, forcing the model to justify every "Fail" verdict by citing specific page numbers and exact timestamps, we reduced safety hallucinations by 65%.

## The Challenges: Temporal Precision

Our primary hurdle was temporal grounding. Early iterations identified violations (e.g., "no gloves worn") but struggled to pinpoint the exact millisecond of the breach. We solved this with strict function calling: instead of letting the model "describe" the video, we forced it to output through a `ViolationLogger` tool. This pushed Gemini 3's attention mechanism to align visual timestamps with procedural steps, resulting in frame-accurate auditing that human supervisors can actually trust.

> "ProtocolEye doesn't just watch; it understands. It's the second set of eyes that never gets tired and never forgets the manual."
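The strict-function-calling approach can be sketched as a tool declaration. The outer shape (`name`, `description`, OpenAPI-style `parameters`) follows the Gemini API's function-declaration format; the specific fields (`timestampMs`, `sopPage`, `rule`) are our illustrative design for the `ViolationLogger` tool, not part of the SDK.

```typescript
// Illustrative function declaration for the ViolationLogger tool.
// Forcing output through a typed schema, rather than free-text
// description, is what pins each violation to an exact video
// timestamp and SOP page.

const violationLoggerDeclaration = {
  name: "ViolationLogger",
  description:
    "Log a single SOP violation with its exact location in the video " +
    "and in the manual. Call this tool instead of describing the " +
    "violation in prose.",
  parameters: {
    type: "object",
    properties: {
      timestampMs: {
        type: "integer",
        description: "Millisecond offset of the breach in the video",
      },
      sopPage: {
        type: "integer",
        description: "Page of the SOP stating the violated rule",
      },
      rule: {
        type: "string",
        description: "Verbatim text of the violated rule",
      },
    },
    required: ["timestampMs", "sopPage", "rule"],
  },
};
```

Because every field is `required`, the model cannot emit a "Fail" without committing to a timestamp and a page number, which is exactly the grounding the reasoning-trace requirement demands.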
## Built With
- angular.js
- gemini-api
- tailwind-css
- typescript