Real-time ML: Why speed is becoming a key quality metric

Machine learning technologies are evolving rapidly, especially in areas where real-time performance is critical — from augmented reality and voice interfaces to autonomous systems and mobile devices. Response time is no longer just a parameter; it’s a key quality metric and a significant competitive edge.

That’s why we spoke with Vladislav Agafonov, an expert in real-time machine learning and human-computer interaction. As a Senior Machine Learning Engineer at Meta Reality Labs UK, he focuses on building ML systems that power the next generation of low-latency, user-responsive interfaces for AR and wearable devices. In this interview, he shares practical insights into model optimization, latency-sensitive inference, and designing ML systems that feel intuitive and immediate to the user.

IT Logs: Why is speed becoming one of the key quality metrics in machine learning today?

Vladislav Agafonov: In today’s ML landscape, speed impacts multiple critical dimensions at once: user experience, infrastructure cost, and the ability to run models on resource-constrained devices. Responses under 300 ms preserve the illusion of natural interaction, while longer delays break the flow and erode user confidence. The optimizations required to reach such speeds also improve efficiency, enabling more inferences per watt and reducing the load on cloud infrastructure and battery-powered devices.

This becomes essential in use cases with tight real-time constraints, like collision avoidance or live speech translation, where missing a 10–50 ms window not only renders the system ineffective but also irreversibly spoils the whole experience. So speed is no longer a luxury metric; it’s what makes modern machine learning usable, scalable, and truly impactful.

Will response time become a new standard of quality in ML alongside accuracy?

VA: The truth is, it has already become a standalone axis of quality because it directly drives monetization. Amazon’s lost-revenue metric and Google’s twenty-percent traffic drop at half-second delays turned latency into a business KPI long ago. Since then, the rise of edge AI scenarios has only tightened the screws. Response time already rivals accuracy in product OKRs: statements such as “p99 under 100 ms” are now commonplace, and hardware launches tout TOPS precisely because that number maps straight onto latency SLOs. As models reach a “good-enough” level of accuracy, delay remains the variable most visible to users and the most decisive competitive edge.
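To make a target like “p99 under 100 ms” concrete, here is a minimal sketch of how such a check could be computed from recorded latencies. The nearest-rank percentile method, the function names, and the sample data are assumptions chosen for illustration; they are not taken from the interview or from any particular monitoring tool.

```python
import math

def percentile(samples, q):
    """Nearest-rank percentile: the smallest sample such that
    at least q percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(q * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]

def meets_slo(latencies_ms, q=99, budget_ms=100.0):
    """True if the q-th percentile latency fits within the budget."""
    return percentile(latencies_ms, q) <= budget_ms

# Hypothetical measurements: 98 fast requests and 2 slow outliers.
latencies_ms = [20.0] * 98 + [250.0, 300.0]

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms:.1f} ms, p50={p50_ms:.1f} ms, p99={p99_ms:.1f} ms")
print("meets p99 <= 100 ms:", meets_slo(latencies_ms))
```

In this made-up sample the mean and median look healthy while the p99 check fails, which is one reason tail percentiles, rather than averages, tend to be the figure written into latency objectives.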

In which products or scenarios is ML response time especially critical?

VA: Real-time response matters most wherever an ML decision has to slot into a tight human or machine feedback loop. Think of voice assistants, where even a 300-millisecond pause breaks the illusion of natural conversation; mixed-reality headsets that must segment and relight every camera frame before the next 16-millisecond display refresh; or phone camera pipelines that denoise and HDR-stack images between shutter press and display so the user never notices the processing. In autonomous driving perception, drone obstacle avoidance, and industrial robots working alongside people, the entire sense-plan-act cycle often has to fit within a 10- to 50-millisecond window, so any latency spike stalls the vehicle or, worse, causes a collision. Latency is just as unforgiving in finance and cybersecurity, where millisecond-level delays can lose whole trades or let an intrusion propagate before a block rule deploys.

How do users perceive latency in ML-driven experiences? Where is the line between “fast” and “too slow”?

VA: As with many things, it depends on the context. In video processing, you have to hit the display refresh window, 16 ms for 60 fps or 33 ms for 30 fps. Users immediately notice stutter or rolling shutter artifacts if the image isn’t ready in time.
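The refresh windows quoted above follow from dividing one second by the frame rate. The sketch below works through that arithmetic and checks whether a pipeline fits the budget; the stage names and timings are hypothetical values picked for illustration.

```python
def frame_budget_ms(fps):
    """Time available per frame, in milliseconds, at a given frame rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps

def fits_in_frame(stage_times_ms, fps):
    """True if the summed per-frame stage times fit in one frame interval."""
    return sum(stage_times_ms) <= frame_budget_ms(fps)

# Hypothetical per-frame costs of a video pipeline (18 ms in total).
stages_ms = {"capture": 3.0, "inference": 9.0, "render": 6.0}

for fps in (60, 30):
    budget = frame_budget_ms(fps)
    ok = fits_in_frame(stages_ms.values(), fps)
    print(f"{fps} fps: budget {budget:.1f} ms, pipeline fits: {ok}")
```

With these assumed timings the pipeline misses the roughly 16.7 ms budget at 60 fps but fits comfortably in the roughly 33.3 ms budget at 30 fps.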

For voice interfaces, the benchmark is natural conversational timing. It feels alive if the wake-word model responds within ~300 ms and the assistant starts replying within 700 ms. Beyond one second, users start wondering if the device misheard, and once the pause hits 1.8 seconds or more, most people lose patience.

With LLM interfaces, the key is showing the first token quickly, ideally under 400 ms, so the user sees that the system is responding. After that, maintaining a generation rate of about 15 tokens per second keeps the interaction fluid. If it is any slower, the user starts interrupting or disengaging.
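The two LLM numbers above, time to first token and generation rate, can be derived from token arrival timestamps. The following is a minimal sketch under that assumption; the function names and the synthetic timestamps are hypothetical, and the default thresholds are simply the figures quoted in the answer.

```python
def stream_metrics(arrival_times_s):
    """Return (time_to_first_token_s, tokens_per_second) for one response.

    arrival_times_s holds the arrival time of each token, in seconds,
    measured from the moment the request was sent.
    """
    if len(arrival_times_s) < 2:
        raise ValueError("need at least two tokens to estimate a rate")
    ttft_s = arrival_times_s[0]
    decode_window_s = arrival_times_s[-1] - arrival_times_s[0]
    if decode_window_s <= 0:
        raise ValueError("timestamps must be increasing")
    # Tokens after the first one, divided by the time they took to arrive.
    rate = (len(arrival_times_s) - 1) / decode_window_s
    return ttft_s, rate

def feels_fluid(ttft_s, tokens_per_s, max_ttft_s=0.4, min_rate=15.0):
    """Check a response against the thresholds mentioned in the interview."""
    return ttft_s <= max_ttft_s and tokens_per_s >= min_rate

# Synthetic streams: first token at 0.35 s then 20 tokens/s,
# and first token at 0.9 s then 10 tokens/s.
fast_stream = [0.35 + i * 0.05 for i in range(41)]
slow_stream = [0.9 + i * 0.1 for i in range(21)]

fast_ttft, fast_rate = stream_metrics(fast_stream)
slow_ttft, slow_rate = stream_metrics(slow_stream)
print(f"fast: TTFT {fast_ttft:.2f} s, {fast_rate:.1f} tok/s")
print(f"slow: TTFT {slow_ttft:.2f} s, {slow_rate:.1f} tok/s")
```

Tracking the two values separately is useful because they tend to have different causes: the first is dominated by queueing and prompt processing, the second by per-token decoding speed.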

What are the most common constraints you encounter when implementing real-time inference on mobile and embedded devices?

VA: When you deploy an ML model to a mobile or embedded device — like a smartphone, smartwatch, drone, or microcontroller — you’ll almost always run into the same five constraints.

First, power and heat: you only have a few watts to work with before the battery drains or the device overheats and the OS throttles performance. Second, hard real-time deadlines: miss 33 ms and you drop a frame at 30 fps; miss 10 ms and a wake word might fail to trigger. Third, tight memory and bandwidth: most phones have 2–4 GB of RAM shared by the OS, GPU, and ML workloads. If your model relies heavily on external DRAM, latency and power usage will spike. Fourth, accuracy under compression: aggressive optimization via quantization, pruning, or distillation always comes with a tradeoff in accuracy. Fifth, privacy and connectivity: in many scenarios, the user is offline or sensitive data must stay on-device. That rules out cloud fallback, so everything must work locally and work well.
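To illustrate the memory and accuracy-under-compression constraints, here is a minimal sketch of symmetric per-tensor int8 quantization in plain Python. It is a toy under stated assumptions, not the scheme used by any specific framework: the weights, the 50-million-parameter model size, and the function names are all hypothetical.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: w is approximated by scale * q,
    with q an integer in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Map int8 codes back to approximate float weights."""
    return [v * scale for v in q]

def model_size_mb(n_params, bytes_per_param):
    """Raw weight storage in megabytes (10**6 bytes)."""
    return n_params * bytes_per_param / 10**6

# Hypothetical float weights of one small tensor.
weights = [0.82, -1.27, 0.003, 0.5, -0.049]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))

print("int8 codes:", q)
print(f"scale={scale:.4f}, max abs error={max_err:.4f}")

# Storage for a hypothetical 50M-parameter model.
print("float32:", model_size_mb(50_000_000, 4), "MB")
print("int8:   ", model_size_mb(50_000_000, 1), "MB")
```

Here the smallest weight, 0.003, rounds to zero. That lost detail is the accuracy cost, traded for a fourfold reduction in weight storage and memory traffic relative to float32.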

Which hardware accelerators (such as NPUs, DSPs, TPUs) are most applicable to real-time ML today, and how do they impact performance?

VA: Today, we have a great range of hardware options, both on-device and in the cloud.

On smartphones, we rely on Apple’s ANE (in the latest A and M series chips), Qualcomm’s Hexagon NPU, and Google’s Tensor G4. These chips deliver 35–70 TOPS while consuming very little power, which allows for real-time inference without overheating or rapid battery drain. Arm’s Ethos-U85 and low-power Hexagon DSPs are great options for ultra-compact devices like wearables or microcontrollers. They deliver up to 4 TOPS on just tens of milliwatts, which is enough to power always-on voice detection, simple gesture recognition, or 30 fps object tracking, tasks that used to require traditional signal processing.

In the cloud, we still rely on accelerators such as Google’s TPUs and inference-focused GPUs like NVIDIA’s L4. These 400+ TOPS-class chips handle millions of global queries with response times under 50 milliseconds. Whether at the edge or at scale, choosing the right hardware is essential now that latency is no longer just another metric.
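As a rough illustration of how a TOPS rating relates to latency, the sketch below computes a compute-bound estimate: operations per inference divided by sustained throughput. This is a back-of-the-envelope assumption, not a vendor formula. It ignores memory bandwidth, data transfer, and scheduling overhead, and the model size and 30% utilization figure are hypothetical.

```python
def compute_bound_latency_ms(ops_per_inference, peak_tops, utilization):
    """Idealized latency if compute were the only bottleneck.

    ops_per_inference: arithmetic operations needed for one forward pass
    peak_tops:         advertised peak throughput, in tera-operations/second
    utilization:       fraction of peak actually sustained (0 < u <= 1)
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    sustained_ops_per_s = peak_tops * 1e12 * utilization
    return ops_per_inference / sustained_ops_per_s * 1000.0

# Hypothetical model needing 10 billion operations per inference.
model_ops = 10e9

phone_ms = compute_bound_latency_ms(model_ops, peak_tops=35, utilization=0.3)
wearable_ms = compute_bound_latency_ms(model_ops, peak_tops=4, utilization=0.3)

print(f"35 TOPS class: ~{phone_ms:.2f} ms per inference")
print(f" 4 TOPS class: ~{wearable_ms:.2f} ms per inference")
```

Real latency is usually higher than this estimate, since memory-bound layers and unsupported operators can keep an accelerator far below its advertised peak.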

By Bojan Stojkovski
