Edge AI: Deploying Machine Learning Models at the Edge for Real-Time Intelligence

Running AI at the edge isn’t just about latency—it’s about privacy, reliability, and cost. This guide covers the complete journey from model optimization to deploying AI on edge devices.

[Image: IoT devices (photo by Alexandre Debiève on Unsplash)]

Why Edge AI?

Latency: Process data in milliseconds, not seconds.

Privacy: Data never leaves the device.

Reliability: Works offline—no cloud dependency.

Cost: No per-inference cloud charges.

Edge AI Architecture

┌─────────────────────────────────────────────────────────┐
│                    Cloud Layer                          │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────┐  │
│  │   Model     │  │   Model     │  │   Analytics     │  │
│  │  Training   │  │  Registry   │  │   Dashboard     │  │
│  └─────────────┘  └─────────────┘  └─────────────────┘  │
└─────────────────────────────────────────────────────────┘
                           │
                    Model Updates
                           ▼
┌─────────────────────────────────────────────────────────┐
│                   Edge Gateway                          │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────┐  │
│  │   Model     │  │  Inference  │  │   Data          │  │
│  │   Cache     │  │   Engine    │  │   Aggregation   │  │
│  └─────────────┘  └─────────────┘  └─────────────────┘  │
└─────────────────────────────────────────────────────────┘
                           │
                      Inference
                           ▼
┌─────────────────────────────────────────────────────────┐
│                   Edge Devices                          │
│  ┌────────┐  ┌────────┐  ┌────────┐  ┌────────────────┐ │
│  │ Camera │  │ Sensor │  │ Robot  │  │ Industrial PLC │ │
│  └────────┘  └────────┘  └────────┘  └────────────────┘ │
└─────────────────────────────────────────────────────────┘
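
In this layout, the gateway periodically pulls new model versions from the cloud registry and swaps them in atomically for the inference engine. A minimal sketch of that update loop (the registry URL, its /latest endpoint, and the response fields are assumptions, not a real API):

import json
import os
import urllib.request

REGISTRY_URL = 'https://registry.example.com'  # hypothetical model registry
MODEL_PATH = '/opt/models/current.onnx'

def check_for_update(local_version: str) -> str:
    # Ask the registry for the latest published model version
    with urllib.request.urlopen(f'{REGISTRY_URL}/models/detector/latest') as resp:
        meta = json.load(resp)
    if meta['version'] != local_version:
        # Download to a temp file, then swap atomically so the
        # inference engine never sees a half-written model
        tmp_path = MODEL_PATH + '.tmp'
        urllib.request.urlretrieve(meta['url'], tmp_path)
        os.replace(tmp_path, MODEL_PATH)
        return meta['version']
    return local_version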

Model Optimization Pipeline

Step 1: Quantization

Post-training INT8 quantization typically cuts model size by about 75% (FP32 to INT8) with minimal accuracy loss:

import numpy as np
import tensorflow as tf

# Post-training quantization
converter = tf.lite.TFLiteConverter.from_saved_model('model/')
converter.optimizations = [tf.lite.Optimize.DEFAULT]

# Full integer quantization for maximum compression.
# calibration_data: a small set of representative input samples
# (assumed loaded elsewhere), used to calibrate activation ranges.
def representative_dataset():
    for data in calibration_data:
        yield [data.astype(np.float32)]

converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8

quantized_model = converter.convert()

# Save
with open('model_quantized.tflite', 'wb') as f:
    f.write(quantized_model)
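
Before shipping the quantized file, a quick local sanity check catches broken conversions early. A minimal sketch (the zero-filled sample is a placeholder; use real calibration samples to compare accuracy against the float model):

import os

import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path='model_quantized.tflite')
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Input must match the quantized dtype (uint8 in the example above)
sample = np.zeros(input_details[0]['shape'], dtype=input_details[0]['dtype'])
interpreter.set_tensor(input_details[0]['index'], sample)
interpreter.invoke()

print('output shape:', interpreter.get_tensor(output_details[0]['index']).shape)
print('file size (KB):', os.path.getsize('model_quantized.tflite') / 1024)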

Step 2: Pruning

Remove unnecessary weights:

import tensorflow_model_optimization as tfmot

# Define pruning schedule
pruning_schedule = tfmot.sparsity.keras.PolynomialDecay(
    initial_sparsity=0.0,
    final_sparsity=0.5,
    begin_step=0,
    end_step=1000
)

# Apply pruning (model: the trained tf.keras model from earlier)
model_for_pruning = tfmot.sparsity.keras.prune_low_magnitude(
    model,
    pruning_schedule=pruning_schedule
)

# Train with pruning
model_for_pruning.compile(
    optimizer='adam',
    loss='categorical_crossentropy',
    metrics=['accuracy']
)

callbacks = [
    tfmot.sparsity.keras.UpdatePruningStep(),
    tfmot.sparsity.keras.PruningSummaries(log_dir='./logs')
]

model_for_pruning.fit(train_data, epochs=10, callbacks=callbacks)

# Strip pruning wrappers for deployment
final_model = tfmot.sparsity.keras.strip_pruning(model_for_pruning)
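
One caveat worth knowing: pruning zeroes weights but still stores them densely, so the converted file is not smaller on disk by itself. The gain shows up once the file is compressed, which matters for over-the-air updates. A quick check, reusing final_model from above:

import gzip
import os

import tensorflow as tf

# Convert the stripped model, then gzip it: runs of zeroed weights compress well
converter = tf.lite.TFLiteConverter.from_keras_model(final_model)
with open('model_pruned.tflite', 'wb') as f:
    f.write(converter.convert())

with open('model_pruned.tflite', 'rb') as f_in:
    with gzip.open('model_pruned.tflite.gz', 'wb') as f_out:
        f_out.write(f_in.read())

print('raw (KB):    ', os.path.getsize('model_pruned.tflite') / 1024)
print('gzipped (KB):', os.path.getsize('model_pruned.tflite.gz') / 1024)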

Step 3: ONNX Export for Cross-Platform Deployment

import torch

# Export the trained PyTorch model to ONNX
# (model: a torch.nn.Module, assumed trained earlier)
model.eval()
dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    export_params=True,
    opset_version=13,
    do_constant_folding=True,
    input_names=['input'],
    output_names=['output'],
    dynamic_axes={
        'input': {0: 'batch_size'},
        'output': {0: 'batch_size'}
    }
)

# Optionally apply transformer-specific graph optimizations.
# This step only applies to transformer architectures; num_heads and
# hidden_size must match the exported model's configuration.
from onnxruntime.transformers import optimizer
optimized_model = optimizer.optimize_model(
    "model.onnx",
    model_type='bert',  # or 'gpt2', 'vit', etc.
    num_heads=12,
    hidden_size=768
)
optimized_model.save_model_to_file("model_optimized.onnx")
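
Before handing the export to a runtime, it is cheap to validate the graph and confirm numerical parity with PyTorch. A minimal check (reusing model and the input/output names from the export step):

import numpy as np
import onnx
import onnxruntime as ort
import torch

# Structural validation of the exported graph
onnx.checker.check_model(onnx.load('model.onnx'))

# Numerical parity check against the original PyTorch model
session = ort.InferenceSession('model.onnx', providers=['CPUExecutionProvider'])
x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    torch_out = model(x).numpy()
ort_out = session.run(None, {'input': x.numpy()})[0]
print('max abs diff:', np.abs(torch_out - ort_out).max())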

[Image: Circuit board (photo by Robin Glauser on Unsplash)]

Edge Runtime Options

TensorFlow Lite

For mobile and embedded devices:

// C++ inference on edge device
#include "tensorflow/lite/interpreter.h"
#include "tensorflow/lite/kernels/register.h"
#include "tensorflow/lite/model.h"

class EdgeInference {
public:
    EdgeInference(const char* model_path) {
        model = tflite::FlatBufferModel::BuildFromFile(model_path);

        tflite::ops::builtin::BuiltinOpResolver resolver;
        tflite::InterpreterBuilder(*model, resolver)(&interpreter);

        // Enable multi-threading (set before tensor allocation)
        interpreter->SetNumThreads(4);

        interpreter->AllocateTensors();
    }
    
    std::vector<float> Infer(const std::vector<float>& input) {
        // Copy input
        float* input_tensor = interpreter->typed_input_tensor<float>(0);
        std::copy(input.begin(), input.end(), input_tensor);
        
        // Run inference
        interpreter->Invoke();
        
        // Get output
        float* output_tensor = interpreter->typed_output_tensor<float>(0);
        int output_size = interpreter->output_tensor(0)->bytes / sizeof(float);
        
        return std::vector<float>(output_tensor, output_tensor + output_size);
    }

private:
    std::unique_ptr<tflite::FlatBufferModel> model;
    std::unique_ptr<tflite::Interpreter> interpreter;
};

ONNX Runtime

For cross-platform deployment:

import onnxruntime as ort
import numpy as np

class ONNXEdgeModel:
    def __init__(self, model_path: str):
        # Configure for edge device
        sess_options = ort.SessionOptions()
        sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
        sess_options.intra_op_num_threads = 4
        sess_options.inter_op_num_threads = 2
        
        # Use appropriate execution provider
        providers = ['CPUExecutionProvider']
        
        # Add GPU if available
        if 'CUDAExecutionProvider' in ort.get_available_providers():
            providers.insert(0, 'CUDAExecutionProvider')
        
        self.session = ort.InferenceSession(
            model_path,
            sess_options,
            providers=providers
        )
        
        self.input_name = self.session.get_inputs()[0].name
        self.output_name = self.session.get_outputs()[0].name
    
    def predict(self, input_data: np.ndarray) -> np.ndarray:
        return self.session.run(
            [self.output_name],
            {self.input_name: input_data}
        )[0]
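
Usage is then a single call per sample. A tiny smoke test (the model path and the 1x3x224x224 input shape are assumptions; use your model's real input shape):

import numpy as np

model = ONNXEdgeModel('model.onnx')
dummy = np.random.rand(1, 3, 224, 224).astype(np.float32)
print(model.predict(dummy).shape)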

Real-World Edge AI Application

Object Detection Pipeline

import cv2
import numpy as np
from collections import deque
import threading
import queue

class EdgeObjectDetector:
    def __init__(self, model_path: str, confidence_threshold: float = 0.5):
        self.model = ONNXEdgeModel(model_path)
        self.confidence_threshold = confidence_threshold
        self.frame_queue = queue.Queue(maxsize=2)
        self.result_queue = queue.Queue(maxsize=2)
        self.running = False
        
    def preprocess(self, frame: np.ndarray) -> np.ndarray:
        # Resize to model input size
        input_frame = cv2.resize(frame, (640, 640))
        # Normalize
        input_frame = input_frame.astype(np.float32) / 255.0
        # HWC to CHW
        input_frame = np.transpose(input_frame, (2, 0, 1))
        # Add batch dimension
        return np.expand_dims(input_frame, axis=0)
    
    def postprocess(self, output: np.ndarray, original_shape: tuple):
        # Assumes decoded detections of shape [N, 6]:
        # cx, cy, w, h, confidence, class_id. Adjust the indexing
        # if your model's output layout differs.
        detections = []
        for detection in output[0]:
            confidence = detection[4]
            if confidence > self.confidence_threshold:
                x, y, w, h = detection[:4]
                class_id = int(detection[5])
                
                # Scale to original image size
                x *= original_shape[1] / 640
                y *= original_shape[0] / 640
                w *= original_shape[1] / 640
                h *= original_shape[0] / 640
                
                detections.append({
                    'bbox': [int(x-w/2), int(y-h/2), int(w), int(h)],
                    'confidence': float(confidence),
                    'class_id': class_id
                })
        return detections
    
    def inference_thread(self):
        while self.running:
            try:
                frame, original_shape = self.frame_queue.get(timeout=0.1)
                input_tensor = self.preprocess(frame)
                output = self.model.predict(input_tensor)
                detections = self.postprocess(output, original_shape)
                self.result_queue.put(detections)
            except queue.Empty:
                continue
    
    def start(self):
        self.running = True
        self.thread = threading.Thread(target=self.inference_thread)
        self.thread.start()
    
    def stop(self):
        self.running = False
        self.thread.join()
    
    def detect(self, frame: np.ndarray):
        if not self.frame_queue.full():
            self.frame_queue.put((frame, frame.shape))
        
        try:
            return self.result_queue.get_nowait()
        except queue.Empty:
            return None

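Wiring the detector into a camera loop looks like this. A hedged sketch (camera index 0 and the on-screen preview are assumptions; a headless deployment would log or publish detections instead):

import cv2

detector = EdgeObjectDetector('model.onnx', confidence_threshold=0.6)
detector.start()
cap = cv2.VideoCapture(0)

try:
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        detections = detector.detect(frame)
        if detections is not None:
            # Draw the most recent results; they may lag the frame slightly
            for d in detections:
                x, y, w, h = d['bbox']
                cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
        cv2.imshow('edge-ai', frame)
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break
finally:
    detector.stop()
    cap.release()
    cv2.destroyAllWindows()
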
Deployment with Docker

FROM arm64v8/python:3.11-slim

# Install dependencies
RUN apt-get update && apt-get install -y \
    libopencv-dev \
    && rm -rf /var/lib/apt/lists/*

# Install Python packages
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy model and application
COPY model_quantized.onnx /app/model.onnx
COPY app.py /app/

WORKDIR /app

# Cap CPU thread usage on the edge device (ONNX Runtime's own thread
# counts are configured via SessionOptions in app.py)
ENV OMP_NUM_THREADS=2

CMD ["python", "app.py"]

Docker Compose for edge deployment:

version: '3.8'
services:
  edge-ai:
    build: .
    restart: unless-stopped
    devices:
      - /dev/video0:/dev/video0  # Camera access
    deploy:
      resources:
        limits:
          memory: 512M
          cpus: '2'
    environment:
      - MODEL_PATH=/app/model.onnx
      - CONFIDENCE_THRESHOLD=0.6
    volumes:
      - ./logs:/app/logs

Performance Benchmarks

Device                      Model             FPS   Latency   Power
Raspberry Pi 5              YOLOv8n (INT8)    15    67ms      5W
Jetson Orin Nano            YOLOv8s (FP16)    45    22ms      15W
Intel NUC                   YOLOv8m (FP32)    30    33ms      25W
Apple M2 (Neural Engine)    YOLOv8l           60    17ms      10W

Best Practices

  1. Profile First: Measure before optimizing
  2. Quantize Aggressively: INT8 often sufficient
  3. Batch When Possible: Even small batches help
  4. Cache Models: Load once, infer many
  5. Monitor Resources: Memory leaks kill edge devices (see the watchdog sketch below)
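
For the last item, a minimal resource watchdog is a reasonable starting point. A sketch using psutil (the 400 MB limit and 5-second period are illustrative values, not recommendations):

import time

import psutil

def watch_resources(period_s: float = 5.0, rss_limit_mb: float = 400.0):
    proc = psutil.Process()
    while True:
        rss_mb = proc.memory_info().rss / (1024 * 1024)
        cpu_pct = psutil.cpu_percent(interval=None)
        print(f'rss={rss_mb:.1f}MB cpu={cpu_pct:.0f}%')
        if rss_mb > rss_limit_mb:
            # A real deployment might restart the inference worker here
            print('WARNING: memory limit exceeded')
        time.sleep(period_s)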

Conclusion

Edge AI enables real-time, private, and reliable AI applications. Start with model optimization, choose the right runtime for your hardware, and deploy with proper resource management.


Deploying AI at the edge? Share your experiences in the comments!

If you found this post helpful, please give it a like and click an ad :)