Deploying AI at the Edge: Model Compression and Hardware-Aware Optimization
Large AI models often struggle to meet the latency, memory, and power constraints of real-world edge deployments. This talk explores practical techniques for making modern AI models efficient enough to run on-device: model distillation, quantization, and hardware-aware optimization. Attendees will learn how to reduce model size and inference cost while maintaining accuracy, with coverage of post-training quantization and runtime optimization across modern AI frameworks and accelerators. The session will also highlight real-world tradeoffs among performance, memory footprint, and power efficiency when deploying AI applications on edge devices.
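
As a rough illustration of the post-training quantization mentioned above, the sketch below maps float weights to int8 with a per-tensor affine scheme and then measures the reconstruction error. It is a minimal sketch under stated assumptions, not the method of any particular framework: the function names `quantize_int8` and `dequantize` and the sample weights are hypothetical, and production toolchains typically add calibration data, per-channel scales, and operator fusion.

```python
def quantize_int8(weights):
    """Per-tensor affine quantization of floats to int8 (illustrative only)."""
    # Widen the range to include 0.0 so that zero maps to an exact integer.
    lo = min(min(weights), 0.0)
    hi = max(max(weights), 0.0)
    scale = (hi - lo) / 255.0
    if scale == 0.0:
        scale = 1.0  # all-zero tensor: any scale works
    # Integer that represents the real value 0.0.
    zero_point = round(-128 - lo / scale)
    # Round to the nearest step, shift by the zero point, clamp to int8.
    q = [max(-128, min(127, round(w / scale) + zero_point)) for w in weights]
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Map int8 values back to approximate floats."""
    return [(v - zero_point) * scale for v in q]


# Hypothetical weights standing in for one layer of a trained model.
weights = [-1.2, -0.3, 0.0, 0.4, 0.9, 2.1]
q, scale, zero_point = quantize_int8(weights)
restored = dequantize(q, scale, zero_point)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, zero_point, max_err)
```

Each int8 value takes one byte where a float32 weight takes four, so the stored weights shrink by roughly 4x. The cost is a rounding error bounded by the step size `scale`, which is one concrete form of the size versus accuracy tradeoff the session describes.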