YOLO object detection project by Michael Wen

I wrote an app that detects objects in photos using YOLO, all by myself. Here's the web UI:


Here's the system architecture:

About This Project

This application is a full-stack computer vision system built using modern web and machine learning technologies. The frontend is developed with Vite + React for fast development and minimal bundle size, while the backend is powered by Python, FastAPI, YOLO, and Uvicorn, all running inside Docker for maximum portability and compatibility.

Object detection is performed using the YOLO (You Only Look Once) family of models, allowing the system to identify and localize multiple objects such as people, vehicles, and animals in everyday street scenes in a single pass. Users can dynamically switch between different YOLO model sizes to balance speed and accuracy, and optionally toggle confidence scores in the detection output.
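To sketch how the model-size switch and the confidence toggle could map onto backend behaviour, here is a small illustration. The mapping table and the helper name are assumptions, not the app's real code; the weight filenames follow the common Ultralytics naming scheme.

```python
# Illustrative mapping from a UI "model size" choice to YOLO weight files.
MODEL_FILES = {
    "nano": "yolov8n.pt",    # fastest, least accurate
    "small": "yolov8s.pt",
    "medium": "yolov8m.pt",  # slower, more accurate
}

def format_detection(label: str, conf: float, show_conf: bool) -> str:
    """Build a bounding-box caption, optionally including the confidence score."""
    return f"{label} {conf:.2f}" if show_conf else label
```

Switching model size then simply means loading a different weight file, while the confidence toggle only changes how each detection is rendered.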

To provide a rich and varied set of test images without storing large datasets locally, the application integrates with the Unsplash API, dynamically fetching public images based on selected categories such as people, streets, markets, and traffic. This approach keeps the application lightweight while still enabling realistic, real-world object detection scenarios.
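A sketch of that Unsplash integration might look like the following. The `https://api.unsplash.com/search/photos` endpoint and its `query` / `per_page` / `client_id` parameters are Unsplash's public search API; the helper name and how the app stores its access key are assumptions.

```python
# Sketch of building an Unsplash search request for a chosen category.
from urllib.parse import urlencode

UNSPLASH_SEARCH = "https://api.unsplash.com/search/photos"

def build_search_url(category: str, access_key: str, per_page: int = 9) -> str:
    """Return the search URL for a category such as 'streets' or 'markets'."""
    params = {"query": category, "per_page": per_page, "client_id": access_key}
    return f"{UNSPLASH_SEARCH}?{urlencode(params)}"
```

Fetching that URL returns JSON whose `results` entries carry image URLs, so the app never has to store the photos itself.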

Disk space optimization was a major design consideration throughout this project. Instead of running multiple containers, both this YOLO object detection backend and a separate digit-and-letter classification service (CNN-based) were consolidated into a single Docker container. This optimization reduced total disk usage on the VPS from over 30GB to approximately 4GB.

Additional disk savings were achieved by developing locally and only deploying the final production build to the server. The frontend is built ahead of time and only the compiled dist/ directory is copied to the VPS, avoiding unnecessary development dependencies in production.

The Docker image itself follows a multi-stage Builder / Runtime pattern, ensuring that only the minimal runtime dependencies are included in the final image. During deployment, memory constraints on the VPS revealed an out-of-memory (OOM) issue when loading multiple YOLO models simultaneously. This was resolved by implementing lazy loading of YOLO models, ensuring that each model is only loaded into memory when it is actually needed.
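The lazy-loading fix can be sketched in a few lines: each model is loaded on its first request and cached, rather than all models being loaded at startup. The loader body below is a stand-in; in the real app it would construct the actual YOLO model.

```python
# Lazy loading: each model size is loaded once, on first use, then cached.
from functools import lru_cache

@lru_cache(maxsize=None)
def get_model(size: str):
    """Load (once) and cache the model for the requested size."""
    # Real app would do something like: return YOLO(f"yolov8{size}.pt")
    return object()  # stand-in for a loaded model
```

Because `lru_cache` memoizes by argument, repeated requests for the same size reuse the already-loaded model, and sizes nobody requests never consume memory.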

Together, these design choices result in a system that is fast, memory-efficient, disk-efficient, and production-ready, while still delivering powerful real-time object detection capabilities through a clean and user-friendly web interface.

Any comments? Feel free to post below in the Facebook comment section. Anything is okay, and I am serious.