Real-time ML: Why speed is becoming a key quality metric



Machine learning technologies are evolving rapidly, especially in areas where real-time performance is critical — from augmented reality and voice interfaces to autonomous systems and mobile devices. Response time is no longer just a parameter; it’s a key quality metric and a significant competitive edge.

That’s why we spoke with Vladislav Agafonov, an expert in real-time machine learning and human-computer interaction. As a Senior Machine Learning Engineer at Meta Reality Labs UK, he focuses on building ML systems that power the next generation of low-latency, user-responsive interfaces for AR and wearable devices. In this interview, he shares practical insights into model optimization, latency-sensitive inference, and designing ML systems that feel intuitive and immediate to the user.


Vladislav Agafonov: In today’s ML landscape, speed affects several critical dimensions at once: user experience, infrastructure cost, and the ability to run models on resource-constrained devices. Responses under 300 ms preserve the illusion of natural interaction, while longer delays break the flow and erode user confidence. The optimizations required to reach such speeds also improve efficiency, enabling more inferences per watt and reducing demands on cloud or battery-powered environments.

This becomes essential in use cases with tight real-time constraints, like collision avoidance or live speech translation, where missing a 10–50 ms window not only renders the system ineffective but also irreversibly spoils the whole experience. So speed is no longer a luxury metric; it’s what makes modern machine learning usable, scalable, and truly impactful.

VA: The truth is, it has already become a standalone axis of quality because it directly drives monetization. Amazon’s lost-revenue metric and Google’s twenty-percent traffic drop at half-second delays turned latency into a business KPI long ago. Since then, the rise of edge AI scenarios has only tightened the screws. Response time already rivals accuracy in product OKRs. Statements such as “p99 under 100 ms” are now commonplace, and hardware launches tout TOPS precisely because that number maps straight onto latency SLOs. As models reach a “good-enough” level of accuracy, delay remains the variable users notice most and the most decisive competitive edge.
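To make a target like “p99 under 100 ms” concrete, here is a minimal sketch of how such a check could be computed. The function names, the nearest-rank percentile method, and the sample measurements are illustrative assumptions, not something described in the interview.

```python
import math

def percentile(samples_ms, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_ms)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]

def meets_slo(samples_ms, pct=99, budget_ms=100.0):
    """True if the chosen percentile is within the latency budget."""
    return percentile(samples_ms, pct) <= budget_ms

# Hypothetical measurements: 98 fast requests plus two slow outliers.
latencies_ms = [40.0] * 98 + [250.0, 300.0]

mean_ms = sum(latencies_ms) / len(latencies_ms)
print(f"mean = {mean_ms:.1f} ms")
print(f"p50  = {percentile(latencies_ms, 50):.1f} ms")
print(f"p99  = {percentile(latencies_ms, 99):.1f} ms")
print("meets p99 <= 100 ms:", meets_slo(latencies_ms))
```

In this made-up sample the mean and median both look healthy, yet the p99 check fails. That is why tail percentiles, rather than averages, tend to appear in latency targets.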

VA: Real-time response matters most wherever an ML decision has to slot into a tight human or machine feedback loop. Think of voice assistants, where even a 300-millisecond pause breaks the illusion of natural conversation; mixed-reality headsets that must segment and relight every camera frame before the next 16-millisecond display refresh; or phone camera pipelines that denoise and HDR-stack images between shutter press and display so the user never notices the processing. In autonomous driving perception, drone obstacle avoidance, and industrial robots working alongside people, the entire sense-plan-act cycle often operates within a 10- to 50-millisecond window, so any latency spike can stall the vehicle or, worse, cause a collision. Latency is just as unforgiving in finance and cybersecurity, where millisecond-level delays can lose whole trades or let an intrusion propagate before a block rule deploys.

VA: As with many things, it depends on the context. In video processing, you have to hit the display refresh window, 16 ms for 60 fps or 33 ms for 30 fps. Users immediately notice stutter or rolling shutter artifacts if the image isn’t ready in time.
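The refresh-window arithmetic can be sketched in a few lines. The stage names and timings below are hypothetical, chosen only to show how a pipeline can fit one frame rate and miss another.

```python
def frame_budget_ms(fps):
    """Time available per frame at a given frame rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps

def fits_frame(stage_ms, fps):
    """True if the summed per-stage latency fits inside one frame."""
    return sum(stage_ms.values()) <= frame_budget_ms(fps)

# Hypothetical per-frame pipeline timings in milliseconds.
stages = {"capture": 3.0, "inference": 9.5, "render": 5.0}

for fps in (60, 30):
    budget = frame_budget_ms(fps)
    total = sum(stages.values())
    verdict = "fits" if fits_frame(stages, fps) else "drops frames"
    print(f"{fps} fps: budget {budget:.1f} ms, pipeline {total:.1f} ms -> {verdict}")
```

The exact budgets are 1000/60 ≈ 16.7 ms and 1000/30 ≈ 33.3 ms, which the text rounds to 16 ms and 33 ms. This example pipeline takes 17.5 ms, so it misses the 60 fps window but fits comfortably at 30 fps.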

For voice interfaces, the benchmark is natural conversational timing. It feels alive if the wake-word model responds within ~300 ms and the assistant starts replying within 700 ms. Beyond one second, users start wondering if the device misheard, and once the pause hits 1.8 seconds or more, most people lose patience.

With LLM interfaces, the key is showing the first token quickly, ideally under 400 ms, so the user sees that the system is responding. After that, maintaining a generation rate of about 15 tokens per second keeps the interaction fluid. If it is any slower, the user starts interrupting or disengaging.
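As a rough illustration of these two numbers, the sketch below derives time to first token and generation rate from token arrival timestamps, then estimates how long a full reply takes at the targets mentioned. The function names and the timestamps are assumptions made for the example.

```python
def streaming_stats(request_time_s, token_times_s):
    """Return (time_to_first_token_s, tokens_per_s) from arrival timestamps.
    The rate is measured over the tokens that follow the first one."""
    if len(token_times_s) < 2:
        raise ValueError("need at least two tokens to measure a rate")
    ttft = token_times_s[0] - request_time_s
    elapsed = token_times_s[-1] - token_times_s[0]
    rate = (len(token_times_s) - 1) / elapsed
    return ttft, rate

def time_to_last_token_s(n_tokens, ttft_s=0.4, tokens_per_s=15.0):
    """Estimated time until the final token of an n_tokens reply arrives."""
    if n_tokens < 1:
        raise ValueError("n_tokens must be at least 1")
    return ttft_s + (n_tokens - 1) / tokens_per_s

# Hypothetical arrival times (seconds) for a request sent at t = 0.
ttft, rate = streaming_stats(0.0, [0.35, 0.41, 0.47, 0.53, 0.59])
print(f"time to first token: {ttft * 1000:.0f} ms, rate: {rate:.1f} tokens/s")
print(f"150-token reply at the targets: {time_to_last_token_s(150):.1f} s")
```

Even at the stated targets, a 150-token answer takes roughly ten seconds to finish, which is why streaming the first token early matters more to perceived responsiveness than total completion time.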

VA: When you deploy an ML model to a mobile or embedded device — like a smartphone, smartwatch, drone, or microcontroller — you’ll almost always run into the same five constraints.

First, power and heat: you only have a few watts to work with before the battery drains or the device overheats and the OS throttles performance. Second, hard real-time deadlines: miss 33 ms and you drop a frame at 30 fps; miss 10 ms and a wake word might fail to trigger. Third, tight memory and bandwidth: most phones have 2–4 GB of RAM shared by the OS, GPU, and ML workloads, and if your model relies heavily on external DRAM, latency and power usage will spike. Fourth, accuracy under compression: aggressive optimization via quantization, pruning, or distillation always trades away some accuracy. Fifth, privacy and connectivity: in many scenarios, the user is offline or sensitive data must stay on-device. That rules out cloud fallback, so everything must work locally and work well.
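The memory and compression constraints can be illustrated with a small sketch of symmetric per-tensor int8 quantization. This is one common scheme, picked here as an assumption; the interview does not say which method is used, and the weights and parameter count are made up.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map [-max|w|, +max|w|] to [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 codes."""
    return [v * scale for v in q]

def model_size_mb(n_params, bits_per_param):
    """Weight storage in megabytes (10**6 bytes), ignoring metadata."""
    return n_params * bits_per_param / 8 / 1e6

# Hypothetical weights for illustration.
weights = [0.82, -1.27, 0.003, 0.4, -0.051]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))

print("int8 codes:", q)
print(f"scale = {scale:.4f}, worst round-trip error = {max_err:.4f}")
print(f"100M params: fp32 {model_size_mb(100_000_000, 32):.0f} MB, "
      f"int8 {model_size_mb(100_000_000, 8):.0f} MB")
```

Storage shrinks fourfold when moving from 32-bit floats to int8, while each weight picks up a rounding error of at most half the scale. Note that the smallest weight in the example rounds to zero, a small-scale version of the accuracy tradeoff described above.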

VA: Today, we have a great range of hardware options, both on-device and in the cloud.

On smartphones, we rely on Apple’s ANE (in the latest A- and M-series chips), Qualcomm’s Hexagon NPU, and Google’s Tensor G4. These chips deliver 35–70 TOPS while consuming very little power, which allows for real-time inference without overheating or rapid battery drain. Arm’s Ethos-U85 and low-power Hexagon DSPs are great options for ultra-compact devices like wearables or microcontrollers. They deliver up to 4 TOPS on just tens of milliwatts, which is enough to power always-on voice detection, simple gesture recognition, or 30 fps object tracking, workloads that used to require traditional signal processing.

In the cloud, we still rely on inference-focused accelerators such as Google’s TPUs and NVIDIA’s L4 GPUs. These 400+ TOPS-class chips handle millions of global queries with response times under 50 milliseconds. Whether at the edge or at scale, choosing the right hardware is essential now that latency is a core quality metric rather than a secondary measurement.
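To see how a TOPS figure relates to latency, the sketch below computes a compute-bound lower limit on inference time. The 10-giga-operation model and the utilization value are hypothetical. Real latency is usually higher, because memory bandwidth, data movement, and scheduling keep accelerators below their peak throughput.

```python
def min_latency_ms(ops_per_inference, peak_tops, utilization=1.0):
    """Compute-bound lower limit on latency for one inference.
    peak_tops is in tera-operations per second; utilization is the
    fraction of that peak the workload actually sustains (0 to 1]."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    ops_per_second = peak_tops * 1e12 * utilization
    return ops_per_inference / ops_per_second * 1000.0

# Hypothetical model needing 10 giga-operations per inference.
MODEL_OPS = 10e9

for name, tops in [("4 TOPS wearable NPU", 4), ("35 TOPS phone NPU", 35),
                   ("400 TOPS cloud accelerator", 400)]:
    ideal = min_latency_ms(MODEL_OPS, tops)
    derated = min_latency_ms(MODEL_OPS, tops, utilization=0.3)
    print(f"{name}: >= {ideal:.3f} ms at peak, >= {derated:.3f} ms at 30% utilization")
```

The TOPS values are taken from the ranges quoted above; everything else in the example is assumed. Treat the result as a floor for feasibility checks, not as a prediction of measured latency.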
