Inspiration

For the Caterpillar track at HackIllinois, we were tasked with leveraging AI to turn unstructured machine inspection data into detailed insights, ultimately accelerating decision-making, improving documentation accuracy, and ensuring operational safety. To do this, we took inspiration from two existing Caterpillar products - Cat Inspect, which Caterpillar currently uses to carry out inspections, and Cat AI Assistant, which uses Caterpillar's knowledge and product base to give users actionable advice. Ultimately, we wanted to build an app technically robust enough to perform precise, accurate excavator inspections while keeping latency low and the user experience seamless.

What it does

ExcaVision quickly converts image, text, and audio input into a robust, editable excavator inspection report.

For each of the three inspection categories - on the ground, engine compartment, and inside the cab - the technician uploads images with optional text and audio supplements. Gemini Flash then identifies exactly which parts are present in each image and where, processing each image and category in parallel inside Modal containers. After part detection, a separate Gemini instance evaluates the cropped part images against deterministic rules corresponding to the various inspection criteria, returning a green/yellow/red light (or N/A) rating for each along with its rationale. If a part was detected in multiple images, the worst rating takes precedence and is presented back to the technician along with the reasoning.
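The "worst rating wins" rule can be sketched roughly like this (the rating names and severity ordering are our illustrative assumptions, not the project's actual code):

```python
# Hypothetical severity ordering: higher number = more severe finding.
RATING_SEVERITY = {"N/A": 0, "green": 1, "yellow": 2, "red": 3}

def worst_rating(ratings: list[str]) -> str:
    """Return the most severe rating among evaluations of the same part,
    e.g. when the part appears in several uploaded images."""
    if not ratings:
        return "N/A"
    return max(ratings, key=lambda r: RATING_SEVERITY[r])
```

In a pipeline like this, taking the maximum severity is a conservative design choice: one bad view of a part flags it even if other photos look fine.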

If some required parts weren't initially detected, the user is directed to add images of these parts, which are then run through the same detection-evaluation pipeline. Users also always have the option to adjust ratings and commentary manually. Once finished, the user can export the completed report as a PDF.
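The check for missing required parts amounts to a set difference per category; a minimal sketch, with an illustrative (not actual) parts checklist:

```python
# Hypothetical per-category checklists; the real required-part lists
# would come from the inspection criteria.
REQUIRED_PARTS = {
    "on_the_ground": {"tracks", "bucket", "boom"},
    "engine_compartment": {"engine", "radiator", "battery"},
    "inside_cab": {"seat", "controls", "mirrors"},
}

def missing_parts(category: str, detected: set[str]) -> set[str]:
    """Parts the technician still needs to photograph for this category."""
    return REQUIRED_PARTS.get(category, set()) - detected
```

Any parts returned here would be fed back to the user as prompts for additional images, which then re-enter the same detection-evaluation pipeline.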

How we built it

We used React + TypeScript for the frontend, Python for the backend, FastAPI for integration, Gemini for image analysis, Modal for containerization and deployment, Pillow for image cropping, Capacitor speech recognition for voice-to-text, and Xcode for mobile UI support.

Challenges we ran into

The primary challenge was accurate part detection in images, particularly cropping the image around various parts correctly and consistently. Smaller parts, such as mirrors and lights, were especially prone to hallucination and incorrect labelling. Initially, Grounding DINO was used for detection due to its high throughput and object detection capabilities, but it frequently missed excavator-specific parts. Florence-2 was also considered, but Gemini Flash with temperature 0 achieved the best balance of speed and accuracy.
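The cropping step mostly reduces to converting the detector's normalized bounding boxes into pixel coordinates. A minimal sketch, assuming the detector returns boxes as [ymin, xmin, ymax, xmax] scaled to 0-1000 (Gemini's documented convention); the actual crop would then be a Pillow `Image.crop` call on the resulting pixel box:

```python
def to_pixel_box(box, width, height):
    """Convert a [ymin, xmin, ymax, xmax] box normalized to 0-1000
    into a (left, upper, right, lower) pixel tuple for Image.crop."""
    ymin, xmin, ymax, xmax = box
    return (int(xmin / 1000 * width), int(ymin / 1000 * height),
            int(xmax / 1000 * width), int(ymax / 1000 * height))
```

Getting this mapping (and the axis order) right matters: a transposed box silently crops the wrong region, which then poisons the downstream rating step.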

Accomplishments that we're proud of

We succeeded in automating the full excavator inspection pipeline in an easy-to-use manner that still leverages technician expertise.

What we learned

Through this project, we learned how to leverage batch processing to reduce latency, how to integrate a frontend and backend effectively, and the importance of test-driven development.
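The latency win from batching comes from fanning independent model calls out concurrently rather than awaiting them one by one. An illustrative sketch with asyncio (in the real app each task would be a model call running on Modal; the `evaluate` stub here is a stand-in):

```python
import asyncio

async def evaluate(image_id: str) -> str:
    # Stand-in for one model round trip (detection or rating).
    await asyncio.sleep(0.01)
    return f"{image_id}: ok"

async def evaluate_all(image_ids: list[str]) -> list[str]:
    # gather() runs all evaluations concurrently, so total latency is
    # roughly one round trip instead of one round trip per image.
    return await asyncio.gather(*(evaluate(i) for i in image_ids))
```

Because each image (and each category) is independent, this pattern scales latency with the slowest single call rather than the number of images.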

What's next for ExcaVision

We plan to extend ExcaVision to support video input, using intelligent frame extraction to optimize capture of the different parts. Additionally, we'll look to improve detection model reliability and introduce a RAG layer to better tailor evaluation insights to the specifics of the machine.
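As a starting point for frame extraction, even a naive evenly-spaced sampler gives the detection pipeline a manageable number of frames; a real version might instead score frames for sharpness or part novelty. A hedged sketch:

```python
def sample_frame_indices(total_frames: int, n: int) -> list[int]:
    """Pick up to n evenly spaced frame indices from a video of
    total_frames frames (naive baseline, not a final design)."""
    if total_frames <= 0 or n <= 0:
        return []
    n = min(n, total_frames)
    step = total_frames / n
    return [int(i * step) for i in range(n)]
```

The selected frames would then flow through the existing image detection-evaluation pipeline unchanged.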
