Edge AI: Deploying Machine Learning Models at the Edge for Real-Time Intelligence
Tags: Edge AI, Machine Learning, IoT, TensorFlow Lite, ONNX, Edge Computing
Running AI at the edge isn’t just about latency—it’s about privacy, reliability, and cost. This guide covers the complete journey from model optimization to deploying AI on edge devices.
Why Edge AI?
- Latency: Process data in milliseconds, not seconds.
- Privacy: Data never leaves the device.
- Reliability: Works offline—no cloud dependency.
- Cost: No per-inference cloud charges.
Edge AI Architecture
┌─────────────────────────────────────────────────────────┐
│ Cloud Layer │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────────┐ │
│ │ Model │ │ Model │ │ Analytics │ │
│ │ Training │ │ Registry │ │ Dashboard │ │
│ └─────────────┘ └─────────────┘ └─────────────────┘ │
└─────────────────────────────────────────────────────────┘
│
Model Updates
▼
┌─────────────────────────────────────────────────────────┐
│ Edge Gateway │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────────┐ │
│ │ Model │ │ Inference │ │ Data │ │
│ │ Cache │ │ Engine │ │ Aggregation │ │
│ └─────────────┘ └─────────────┘ └─────────────────┘ │
└─────────────────────────────────────────────────────────┘
│
Inference
▼
┌─────────────────────────────────────────────────────────┐
│ Edge Devices │
│ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────────────┐ │
│ │ Camera │ │ Sensor │ │ Robot │ │ Industrial PLC │ │
│ └────────┘ └────────┘ └────────┘ └────────────────┘ │
└─────────────────────────────────────────────────────────┘
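The cloud layer trains and versions models; the gateway caches the current model and keeps serving inference even when the uplink is down. As a rough illustration of the model-update path, here is a minimal sketch of a gateway-side updater that polls a model registry and hot-swaps the cached file. The registry URL, the metadata/download endpoints, and the cache path are hypothetical, not part of any specific product.

import hashlib
import os
import shutil
import tempfile
import time

import requests  # assumed available on the gateway

REGISTRY_URL = "https://registry.example.com/models/detector"  # hypothetical endpoint
MODEL_PATH = "/var/cache/edge/model.onnx"                      # hypothetical cache path

def cached_checksum(path: str) -> str:
    try:
        with open(path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()
    except FileNotFoundError:
        return ""

def poll_registry(interval_s: int = 300) -> None:
    """Download a new model only when the registry reports a different checksum."""
    while True:
        try:
            meta = requests.get(f"{REGISTRY_URL}/metadata", timeout=10).json()
            if meta.get("sha256") != cached_checksum(MODEL_PATH):
                blob = requests.get(f"{REGISTRY_URL}/download", timeout=60).content
                with tempfile.NamedTemporaryFile(
                        dir=os.path.dirname(MODEL_PATH), delete=False) as tmp:
                    tmp.write(blob)
                shutil.move(tmp.name, MODEL_PATH)  # swap in place; readers pick it up on next load
        except requests.RequestException:
            pass  # cloud unreachable: keep serving the cached model
        time.sleep(interval_s)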
Model Optimization Pipeline
Step 1: Quantization
Post-training INT8 quantization cuts model size by roughly 75% (FP32 to INT8), usually with minimal accuracy loss:
import numpy as np
import tensorflow as tf

# Post-training quantization
converter = tf.lite.TFLiteConverter.from_saved_model('model/')
converter.optimizations = [tf.lite.Optimize.DEFAULT]

# Full integer quantization for maximum compression
def representative_dataset():
    # calibration_data: a small, representative sample of real inputs
    for data in calibration_data:
        yield [data.astype(np.float32)]

converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8

quantized_model = converter.convert()

# Save the quantized model
with open('model_quantized.tflite', 'wb') as f:
    f.write(quantized_model)
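After conversion, it is worth a quick sanity check that the quantized model loads and produces sensible outputs. A minimal sketch using the TFLite Python interpreter; calibration_data is the same collection used above, and the shape handling assumes each element is a single unbatched sample.

import numpy as np
import tensorflow as tf

# Load the quantized model and run one calibration sample through it
interpreter = tf.lite.Interpreter(model_path='model_quantized.tflite')
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()[0]
output_details = interpreter.get_output_details()[0]

# Inputs are uint8 after full integer quantization; quantize with the stored params
scale, zero_point = input_details['quantization']
sample = calibration_data[0]  # assumes one unbatched sample per element
quantized = np.clip(sample / scale + zero_point, 0, 255).astype(np.uint8)

interpreter.set_tensor(input_details['index'], quantized[np.newaxis, ...])
interpreter.invoke()
print(interpreter.get_tensor(output_details['index']))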
Step 2: Pruning
Zero out low-magnitude weights so the model compresses well and can run faster on sparse-aware runtimes:
import tensorflow_model_optimization as tfmot

# Define pruning schedule
pruning_schedule = tfmot.sparsity.keras.PolynomialDecay(
    initial_sparsity=0.0,
    final_sparsity=0.5,
    begin_step=0,
    end_step=1000
)

# Apply pruning
model_for_pruning = tfmot.sparsity.keras.prune_low_magnitude(
    model,
    pruning_schedule=pruning_schedule
)

# Train with pruning
model_for_pruning.compile(
    optimizer='adam',
    loss='categorical_crossentropy',
    metrics=['accuracy']
)

callbacks = [
    tfmot.sparsity.keras.UpdatePruningStep(),
    tfmot.sparsity.keras.PruningSummaries(log_dir='./logs')
]

model_for_pruning.fit(train_data, epochs=10, callbacks=callbacks)

# Strip pruning wrappers for deployment
final_model = tfmot.sparsity.keras.strip_pruning(model_for_pruning)
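Note that pruning only zeroes weights rather than removing them, so the raw file size barely changes; the benefit shows up after compression (or on sparse-aware runtimes). A quick sketch of one way to check the effect, reusing model and final_model from above:

import gzip
import os
import tempfile

def gzipped_size(keras_model) -> int:
    """Save a Keras model and return its gzip-compressed size in bytes."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, 'model.h5')
        keras_model.save(path)
        with open(path, 'rb') as f:
            return len(gzip.compress(f.read()))

print('original:', gzipped_size(model))        # unpruned model from above
print('pruned  :', gzipped_size(final_model))  # after strip_pruning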
Step 3: ONNX Export for Cross-Platform Deployment
import torch
import onnx
import onnxruntime as ort

# Export PyTorch model to ONNX
dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    export_params=True,
    opset_version=13,
    do_constant_folding=True,
    input_names=['input'],
    output_names=['output'],
    dynamic_axes={
        'input': {0: 'batch_size'},
        'output': {0: 'batch_size'}
    }
)

# Optimize the ONNX graph (transformer architectures)
from onnxruntime.transformers import optimizer

optimized_model = optimizer.optimize_model(
    "model.onnx",
    model_type='vit',  # match your architecture: 'bert', 'gpt2', etc.
    num_heads=12,
    hidden_size=768
)
optimized_model.save_model_to_file("model_optimized.onnx")
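Before shipping the exported graph, check it structurally and compare outputs against the PyTorch model. A minimal sketch; the tolerances are assumptions to tune per model, and model is assumed to return a single tensor.

import numpy as np
import onnx
import onnxruntime as ort
import torch

# Structural check of the exported graph
onnx.checker.check_model(onnx.load("model.onnx"))

# Numerical parity check against the PyTorch model on a random input
x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    torch_out = model(x).numpy()

session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
ort_out = session.run(None, {"input": x.numpy()})[0]

np.testing.assert_allclose(torch_out, ort_out, rtol=1e-3, atol=1e-5)
print("ONNX output matches PyTorch within tolerance")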
Edge Runtime Options
TensorFlow Lite
For mobile and embedded devices:
// C++ inference on an edge device with TensorFlow Lite
#include <algorithm>
#include <memory>
#include <vector>

#include "tensorflow/lite/interpreter.h"
#include "tensorflow/lite/kernels/register.h"
#include "tensorflow/lite/model.h"

class EdgeInference {
public:
    EdgeInference(const char* model_path) {
        model = tflite::FlatBufferModel::BuildFromFile(model_path);
        tflite::ops::builtin::BuiltinOpResolver resolver;
        tflite::InterpreterBuilder(*model, resolver)(&interpreter);

        // Enable multi-threading before tensors are allocated
        interpreter->SetNumThreads(4);
        interpreter->AllocateTensors();
    }

    std::vector<float> Infer(const std::vector<float>& input) {
        // Copy input into the input tensor
        float* input_tensor = interpreter->typed_input_tensor<float>(0);
        std::copy(input.begin(), input.end(), input_tensor);

        // Run inference
        interpreter->Invoke();

        // Read back the output tensor
        float* output_tensor = interpreter->typed_output_tensor<float>(0);
        int output_size = interpreter->output_tensor(0)->bytes / sizeof(float);
        return std::vector<float>(output_tensor, output_tensor + output_size);
    }

private:
    std::unique_ptr<tflite::FlatBufferModel> model;
    std::unique_ptr<tflite::Interpreter> interpreter;
};
ONNX Runtime
For cross-platform deployment:
import onnxruntime as ort
import numpy as np

class ONNXEdgeModel:
    def __init__(self, model_path: str):
        # Configure for edge device
        sess_options = ort.SessionOptions()
        sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
        sess_options.intra_op_num_threads = 4
        sess_options.inter_op_num_threads = 2

        # Use appropriate execution providers (GPU first if available)
        providers = ['CPUExecutionProvider']
        if 'CUDAExecutionProvider' in ort.get_available_providers():
            providers.insert(0, 'CUDAExecutionProvider')

        self.session = ort.InferenceSession(
            model_path,
            sess_options,
            providers=providers
        )
        self.input_name = self.session.get_inputs()[0].name
        self.output_name = self.session.get_outputs()[0].name

    def predict(self, input_data: np.ndarray) -> np.ndarray:
        return self.session.run(
            [self.output_name],
            {self.input_name: input_data}
        )[0]
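Usage is a one-liner, but a warm-up call is worth making because the first inference pays one-off graph optimization and allocation costs. The input shape and dtype below are assumptions for a 224x224 image model.

import numpy as np

model = ONNXEdgeModel("model.onnx")

# Warm up once so later calls reflect steady-state latency
dummy = np.random.rand(1, 3, 224, 224).astype(np.float32)
model.predict(dummy)

output = model.predict(dummy)
print(output.shape)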
Real-World Edge AI Application
Object Detection Pipeline
import cv2
import numpy as np
import threading
import queue

class EdgeObjectDetector:
    def __init__(self, model_path: str, confidence_threshold: float = 0.5):
        self.model = ONNXEdgeModel(model_path)
        self.confidence_threshold = confidence_threshold
        self.frame_queue = queue.Queue(maxsize=2)
        self.result_queue = queue.Queue(maxsize=2)
        self.running = False

    def preprocess(self, frame: np.ndarray) -> np.ndarray:
        # Resize to model input size
        input_frame = cv2.resize(frame, (640, 640))
        # Normalize
        input_frame = input_frame.astype(np.float32) / 255.0
        # HWC to CHW
        input_frame = np.transpose(input_frame, (2, 0, 1))
        # Add batch dimension
        return np.expand_dims(input_frame, axis=0)

    def postprocess(self, output: np.ndarray, original_shape: tuple):
        detections = []
        for detection in output[0]:
            confidence = detection[4]
            if confidence > self.confidence_threshold:
                x, y, w, h = detection[:4]
                class_id = int(detection[5])
                # Scale to original image size
                x *= original_shape[1] / 640
                y *= original_shape[0] / 640
                w *= original_shape[1] / 640
                h *= original_shape[0] / 640
                detections.append({
                    'bbox': [int(x - w / 2), int(y - h / 2), int(w), int(h)],
                    'confidence': float(confidence),
                    'class_id': class_id
                })
        return detections

    def inference_thread(self):
        while self.running:
            try:
                frame, original_shape = self.frame_queue.get(timeout=0.1)
                input_tensor = self.preprocess(frame)
                output = self.model.predict(input_tensor)
                detections = self.postprocess(output, original_shape)
                self.result_queue.put(detections)
            except queue.Empty:
                continue

    def start(self):
        self.running = True
        self.thread = threading.Thread(target=self.inference_thread)
        self.thread.start()

    def stop(self):
        self.running = False
        self.thread.join()

    def detect(self, frame: np.ndarray):
        if not self.frame_queue.full():
            self.frame_queue.put((frame, frame.shape))
        try:
            return self.result_queue.get_nowait()
        except queue.Empty:
            return None
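Wiring the detector to a camera could look like the sketch below; the device index, window display, and box drawing are assumptions for a quick demo.

import cv2

detector = EdgeObjectDetector("model.onnx", confidence_threshold=0.6)
detector.start()

cap = cv2.VideoCapture(0)  # default camera
try:
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        detections = detector.detect(frame)
        if detections:
            for det in detections:
                x, y, w, h = det['bbox']
                cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
        cv2.imshow('edge-ai', frame)
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break
finally:
    detector.stop()
    cap.release()
    cv2.destroyAllWindows()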
Deployment with Docker
FROM arm64v8/python:3.11-slim

# Install system dependencies
RUN apt-get update && apt-get install -y \
        libopencv-dev \
    && rm -rf /var/lib/apt/lists/*

# Install Python packages
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy model and application
COPY model_quantized.onnx /app/model.onnx
COPY app.py /app/
WORKDIR /app

# Resource limits for edge device
ENV OMP_NUM_THREADS=2
ENV ONNXRUNTIME_NUM_THREADS=2

CMD ["python", "app.py"]
Docker Compose for edge deployment:
version: '3.8'

services:
  edge-ai:
    build: .
    restart: unless-stopped
    devices:
      - /dev/video0:/dev/video0  # Camera access
    deploy:
      resources:
        limits:
          memory: 512M
          cpus: '2'
    environment:
      - MODEL_PATH=/app/model.onnx
      - CONFIDENCE_THRESHOLD=0.6
    volumes:
      - ./logs:/app/logs
Performance Benchmarks
| Device | Model | FPS | Latency | Power |
|---|---|---|---|---|
| Raspberry Pi 5 | YOLOv8n (INT8) | 15 | 67ms | 5W |
| Jetson Orin Nano | YOLOv8s (FP16) | 45 | 22ms | 15W |
| Intel NUC | YOLOv8m (FP32) | 30 | 33ms | 25W |
| Apple M2 (Neural Engine) | YOLOv8l | 60 | 17ms | 10W |
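Figures like these vary with runtime version, input resolution, and device thermals, so measure on your own hardware. A minimal sketch of how single-image latency and FPS can be measured with the ONNXEdgeModel wrapper above; the input shape and iteration counts are assumptions.

import time
import numpy as np

def benchmark(model, shape=(1, 3, 640, 640), warmup=10, runs=100):
    """Report mean latency (ms) and FPS for repeated single-image inference."""
    dummy = np.random.rand(*shape).astype(np.float32)
    for _ in range(warmup):
        model.predict(dummy)
    start = time.perf_counter()
    for _ in range(runs):
        model.predict(dummy)
    elapsed = time.perf_counter() - start
    latency_ms = elapsed / runs * 1000
    print(f"latency: {latency_ms:.1f} ms  |  fps: {runs / elapsed:.1f}")

benchmark(ONNXEdgeModel("model.onnx"))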
Best Practices
- Profile First: Measure before optimizing
- Quantize Aggressively: INT8 often sufficient
- Batch When Possible: Even small batches help
- Cache Models: Load once, infer many
- Monitor Resources: Memory leaks kill edge devices (see the sketch below)
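For the last point, a lightweight watchdog that logs process and system usage on an interval catches creeping memory use early. A minimal sketch using psutil; the interval and RSS limit are assumptions.

import logging
import time

import psutil  # assumed available on the device

logging.basicConfig(level=logging.INFO)

def monitor(interval_s: int = 60, rss_limit_mb: int = 400) -> None:
    """Log CPU and memory usage; warn when the process RSS crosses a soft limit."""
    proc = psutil.Process()
    while True:
        rss_mb = proc.memory_info().rss / (1024 * 1024)
        logging.info("cpu=%.0f%% rss=%.0fMB free=%.0fMB",
                     psutil.cpu_percent(interval=None),
                     rss_mb,
                     psutil.virtual_memory().available / (1024 * 1024))
        if rss_mb > rss_limit_mb:
            logging.warning("RSS above %dMB - possible leak", rss_limit_mb)
        time.sleep(interval_s)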
Conclusion
Edge AI enables real-time, private, and reliable AI applications. Start with model optimization, choose the right runtime for your hardware, and deploy with proper resource management.
Deploying AI at the edge? Share your experiences in the comments!
