YOLO Live Object Detection App by Michael Wen

I wrote an app, entirely on my own, that detects objects in a live streaming video feed using YOLO.

Here are some screenshots of the web UI:

When I first got streaming to work locally, I assumed exposing it as a public URL would be straightforward. I couldn’t have been more wrong.

The real challenge turned out to be getting WebSocket-based streaming to work over HTTPS, due to strict browser security constraints. What initially seemed like a simple deployment task quickly became a deep dive into how browsers, TLS, reverse proxies, and WebSockets actually interact.

I spent a lot of time jumping through hoops—tweaking nginx configs, experimenting with headers, reading specs, and debugging opaque browser errors—until I finally pinpointed the root cause: Ezoic, which I was using as both the nameserver and HTTP proxy. While Ezoic worked fine for normal HTTP traffic, it interfered with WebSocket upgrade headers in subtle ways. Despite multiple attempts, I couldn’t reliably bypass Ezoic for WebSocket traffic on the same domain.

In the end, the solution was to serve the streaming backend from a separate domain that bypasses Ezoic entirely, while keeping the rest of the frontend and REST APIs unchanged. I set up nginx as a reverse proxy for the WebSocket endpoint, learned exactly how the WebSocket handshake works (including the Upgrade and Connection headers), and gained a much deeper understanding of how HTTPS, proxies, and backend servers like FastAPI and Uvicorn fit together.

What I thought would be a small deployment step turned into a crash course in real-world web infrastructure. Frustrating at times—but also pretty cool in hindsight.

What This Project Is About

This project is a real-time object detection web app built around YOLOv8, designed to work even on a low-resource VPS running CentOS 7. The goal was simple on paper: take a live video source (webcam or external camera), stream it to a backend, run object detection continuously, and show the results live in the browser.

In practice, almost every layer had constraints — browser security rules, WebSocket quirks, Docker memory limits, HTTPS requirements, and performance bottlenecks — which turned this into a very hands-on systems project rather than “just ML.”

Why I Built It This Way

Instead of doing everything locally, I intentionally deployed this to a VPS with limited disk and memory. This forced careful decisions around:

• minimizing memory usage
• avoiding unnecessary frame processing
• keeping Docker images small
• handling real-world browser security restrictions

The result is not just a demo, but a system that mirrors real production constraints.

High-Level Architecture

This app follows a clean separation between frontend and backend, connected by WebSockets.

Frontend (Vite + React)
• Runs entirely in the browser
• Captures video using the MediaDevices API
• Displays live video preview
• Sends frames to backend over WebSocket
• Receives detection results asynchronously
• Provides UI controls for detection frequency and stopping detection

Backend (FastAPI + YOLOv8)
• Runs inside Docker on the VPS
• Accepts WebSocket connections
• Decodes image frames
• Runs YOLOv8n inference
• Sends bounding box results back to frontend
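
As a rough sketch, such an endpoint can be only a few lines of FastAPI. The route path, JSON shape, and error handling below are illustrative assumptions, not the app's actual code:

```python
# Illustrative sketch; route path and JSON shape are assumptions.
import cv2
import numpy as np
from fastapi import FastAPI, WebSocket, WebSocketDisconnect
from ultralytics import YOLO

app = FastAPI()
model = YOLO("yolov8n.pt")  # load the nano model once at startup

@app.websocket("/ws/detect")
async def detect(ws: WebSocket):
    await ws.accept()
    try:
        while True:
            data = await ws.receive_bytes()  # one encoded frame from the browser
            frame = cv2.imdecode(np.frombuffer(data, np.uint8), cv2.IMREAD_COLOR)
            if frame is None:
                continue  # skip frames that fail to decode
            result = model(frame, verbose=False)[0]  # YOLOv8n inference
            boxes = [
                {
                    "xyxy": b.xyxy[0].tolist(),        # pixel coordinates
                    "conf": float(b.conf[0]),          # confidence score
                    "label": model.names[int(b.cls[0])],
                }
                for b in result.boxes
            ]
            await ws.send_json({"boxes": boxes})  # results back to the React UI
    except WebSocketDisconnect:
        pass  # user pressed Stop or closed the tab
```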

Nginx (Reverse Proxy)
• Serves frontend over HTTPS
• Terminates TLS
• Proxies secure WebSocket (wss://) to FastAPI
• Keeps long-lived WebSocket connections alive

Vertical Architecture Flow

User’s Browser
  ↓
Camera → MediaDevices API
  ↓
React UI (Vite build)
  ↓
WebSocket (wss://)
  ↓
Nginx (HTTPS termination)
  ↓
FastAPI WebSocket endpoint
  ↓
YOLOv8n model inference
  ↓
Detection results (JSON)
  ↓
React renders boxes + labels

Everything flows vertically and asynchronously, which keeps the UI responsive even when inference is slow.

Biggest Challenges Faced

1. Browser Security (HTTPS + WSS)

Modern browsers will not grant camera access (`getUserMedia()`) outside a secure context, and a page served over HTTPS cannot open insecure `ws://` connections. Initially, this completely broke both camera capture and the WebSocket connection.

Fix:
• Set up HTTPS using Let’s Encrypt
• Terminated TLS at Nginx
• Switched all WebSocket connections to `wss://`
• Ensured frontend never uses `ws://` on HTTPS pages

This alone solved multiple “mysterious” browser errors.

2. WebSocket Proxy Issues

Even after HTTPS worked, WebSockets kept failing silently.

Root cause:
Nginx does not pass WebSocket upgrade headers to the backend by default. Even small omissions (the `Connection: upgrade` header, or leaving the upstream protocol at its HTTP/1.0 default) break the handshake.

Fix:
• Explicit `proxy_http_version 1.1`
• Explicit `Upgrade` and `Connection: upgrade` headers (see the config sketch below)
• Disabled buffering
• Increased read timeout to support long streams

Once this was fixed, WebSockets became stable.
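
For reference, a minimal `location` block along these lines makes the handshake work; it would live in the server block of the separate streaming domain mentioned earlier. The path, upstream port, and timeout are placeholders:

```nginx
# Placeholder names; TLS certificates are assumed to be configured elsewhere.
location /ws/ {
    proxy_pass http://127.0.0.1:8000;            # FastAPI/Uvicorn backend
    proxy_http_version 1.1;                      # WebSocket needs HTTP/1.1
    proxy_set_header Upgrade $http_upgrade;      # forward the upgrade request
    proxy_set_header Connection "upgrade";       # mark this hop as upgradable
    proxy_set_header Host $host;
    proxy_buffering off;                         # stream frames immediately
    proxy_read_timeout 3600s;                    # keep long-lived streams open
}
```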

3. Performance and Choppy Video

Running YOLO on every frame was unrealistic on a low-end VPS. It caused:

• High CPU usage
• Low FPS
• WebSocket congestion
• Laggy UI

Fix:
Instead of running detection on every frame, the app runs it at a fixed interval:

• 1 second (default)
• 2 seconds
• 5 seconds

Video streaming remains smooth, while detection runs independently on its own timer. This separation dramatically improves perceived performance while actually reducing CPU usage.
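
A minimal way to express that gating in the backend loop (a sketch; the real app may just as well throttle on the client side):

```python
import time

DETECT_INTERVAL = 1.0  # seconds; matches the 1s / 2s / 5s UI options

class DetectionGate:
    """Let a frame through to inference at most once per interval."""

    def __init__(self, interval: float = DETECT_INTERVAL):
        self.interval = interval
        self.last = 0.0

    def ready(self) -> bool:
        now = time.monotonic()
        if now - self.last >= self.interval:
            self.last = now
            return True
        return False  # caller skips inference and keeps streaming
```

Frames that arrive while `ready()` returns False are dropped before inference, so the video preview keeps its full frame rate regardless of how slow the model is.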

4. Memory Management Concerns

Streaming images continuously raises questions about memory leaks.

Key insight:
CPython frees temporary objects through reference counting as soon as the last reference to them disappears. Per-frame objects like:

• decoded image bytes
• NumPy arrays
• OpenCV frames

are reclaimed automatically after each loop iteration, as soon as their names are rebound or go out of scope.

As long as references are not stored globally, memory usage remains stable.
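
One way to sanity-check that claim is to compare `tracemalloc` snapshots across many simulated iterations; the byte sizes below are arbitrary stand-ins for real frames:

```python
import tracemalloc
import numpy as np

def frame_iteration():
    data = bytes(640 * 480 * 3)                 # stand-in for received JPEG bytes
    buf = np.frombuffer(data, dtype=np.uint8)   # stand-in for the decoded array
    return int(buf[0])                          # only a scalar survives the call

tracemalloc.start()
frame_iteration()                               # warm up one-time allocations
before = tracemalloc.take_snapshot()
for _ in range(1_000):
    frame_iteration()
after = tracemalloc.take_snapshot()

for stat in after.compare_to(before, "lineno")[:3]:
    print(stat)                                 # growth should stay near zero
```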

5. Docker Constraints and Design

Running everything inside Docker added another layer of complexity.

Decisions made:
• Use `python:3.11-slim` to reduce image size
• Multi-stage builds to avoid shipping build tools
• Reuse a single FastAPI app instead of multiple containers
• Import new endpoints as separate modules instead of duplicating apps

This keeps disk usage and RAM consumption under control.
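
A multi-stage Dockerfile along these lines would implement those decisions; `requirements.txt` and `main:app` are assumed names:

```dockerfile
# Hypothetical multi-stage build; file and module names are assumptions.
FROM python:3.11-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt

FROM python:3.11-slim
WORKDIR /app
# Copy only the installed packages; build tools never reach the final image.
COPY --from=builder /install /usr/local
COPY . .
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
```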

Frontend UI Design Choices

The UI intentionally stays minimal but functional.

Features:
• Camera selection
• Start / Stop detection button
• Detection interval selector (1s / 2s / 5s)
• Rolling status log showing system events
• About section explaining architecture and challenges

Everything is optimized for mobile viewing and slow networks.

Why YOLOv8n

YOLOv8n was chosen because:

• Small model size
• Fast inference compared to larger variants
• Acceptable accuracy for live streaming
• Works well on CPU-only servers

On a constrained VPS, this is the best balance between speed and usefulness.
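
For a sense of how little code this takes with the `ultralytics` package (the timing harness and dummy frame are illustrative):

```python
import time
import numpy as np
from ultralytics import YOLO

model = YOLO("yolov8n.pt")                        # nano weights, about 6 MB
frame = np.zeros((480, 640, 3), dtype=np.uint8)   # dummy 640x480 BGR frame
start = time.perf_counter()
model(frame, verbose=False)                       # single CPU inference pass
print(f"inference took {time.perf_counter() - start:.3f}s")
```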

What This Project Demonstrates

This project isn’t just about object detection.

It demonstrates:

• real-time streaming system design
• WebSocket-based ML pipelines
• browser security compliance
• production-style deployment
• performance optimization under constraints

It’s the kind of project where the hardest parts aren’t the ML — they’re everything around it.

Final Thoughts

Building this forced me to think like a backend engineer, frontend engineer, DevOps engineer, and ML engineer at the same time.

Most problems were not obvious at first, but each fix improved the system’s stability and clarity.

In the end, this became a solid example of how to run real-time computer vision on real infrastructure, not just in a notebook or local demo.
