
Why MQTT acceleration needs more than standard NIC offloads (and how SmartNICs do it differently)

When MQTT deployments scale to thousands of clients with high message fan-out, the broker's CPU becomes a bottleneck. This happens not because of inefficient packet handling, but because of the sheer volume of work involved in matching topics to subscribers and distributing messages. Standard NIC mechanisms address only the packet-handling side: they reduce CPU load and free up cycles for the MQTT logic, but they don't touch the core bottleneck.

Standard NIC offloads (LRO/GRO, TSO/GSO, checksum offloading) can reduce packet processing overhead, decrease latency, and improve overall throughput. But these are general-purpose optimizations that operate at the packet and TCP segment level. They're not designed to address MQTT-specific bottlenecks like topic matching, subscription management, and message fan-out distribution.

There simply aren't MQTT-aware offloads in standard NICs.

This is where SmartNICs create new possibilities. By providing programmable compute resources at the network interface, they enable a layered offloading approach: implement TCP/IP stack functionality in the SmartNIC (addressing the transport layer efficiently), then add MQTT-specific acceleration logic on top that understands the protocol and can optimize Publish distribution.

Because in most systems, the painful part is broker-side Publish processing and distribution: taking one Publish and getting it to every matching subscriber quickly and reliably.

This article explains why standard NIC offloads aren't sufficient for MQTT performance, what makes MQTT acceleration uniquely tricky from a TCP perspective, and how SmartNIC-based approaches enable layered TCP+MQTT offloading that addresses both transport-layer and application-layer bottlenecks.

What “MQTT performance” usually means

MQTT is lightweight by design. The protocol is simple: publishers send messages with topics, subscribers register topic filters, and the broker matches topics to filters and delivers messages to the right subscribers.

Where it gets heavy is scale.

As the number of clients grows and subscriptions become broader (high fan-out topics, shared prefixes, lots of wildcards), the broker starts behaving less like a simple “message relay” and more like a routing engine. For every Publish, it typically has to:

  • Decode and validate the incoming message

  • Match the topic against subscription filters

  • Decide who should receive it (often a long list)

  • Send copies of the message to many active client connections

The last two steps, deciding the recipients and replicating the message across many connections, are what consume CPU and memory as the system grows.
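To make the matching step concrete, here is a minimal sketch of MQTT topic-filter matching and recipient selection. The function names and the linear-scan subscription list are illustrative; production brokers typically use a topic trie instead:

```python
def topic_matches(filter_, topic):
    """Match an MQTT topic against a subscription filter with + and # wildcards."""
    f_parts = filter_.split("/")
    t_parts = topic.split("/")
    for i, fp in enumerate(f_parts):
        if fp == "#":                       # multi-level wildcard: matches the rest
            return True
        if i >= len(t_parts):
            return False
        if fp != "+" and fp != t_parts[i]:  # '+' matches exactly one level
            return False
    return len(f_parts) == len(t_parts)

def recipients(subscriptions, topic):
    """Linear scan over (client_id, filter) pairs; real brokers use a topic trie."""
    return [client for client, filt in subscriptions if topic_matches(filt, topic)]
```

Even in this toy form you can see the cost structure: every Publish pays a matching pass plus one send per matched client, and both terms grow with scale.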

So when people say “MQTT latency,” they often mean something very specific:

How long does it take for a Publish to reach the last subscriber that should receive it?

That’s a practical metric because it captures the entire process, including broker processing time, topic matching time, fan-out work, and any transport-layer delays introduced by the system under load.

Traditional hardware offloads: what they do (and what they don’t)

There’s a reason engineers reach for classic offloads first: they work extremely well for what they’re designed to do, which is to make packet handling and TCP mechanics cheaper.

Here are the most common ones.

LRO / GRO (receive-side aggregation)

Large Receive Offload (LRO) and Generic Receive Offload (GRO) aggregate incoming packets into larger buffers before handing them upward in the networking stack.

The benefit of LRO/GRO is simple: fewer packets to process means fewer per-packet costs from interrupts, headers, memory allocations, and scheduling overhead.

TSO / GSO (transmit-side segmentation)

TCP Segmentation Offload (TSO) and Generic Segmentation Offload (GSO) move segmentation work out of the CPU path. Instead of the OS constructing many small TCP segments, it can pass a larger chunk down and let segmentation happen later (often in hardware).

Again, the benefit is lower CPU overhead on packet construction and transmission.
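The savings from both receive-side and transmit-side aggregation come down to simple arithmetic: the stack is traversed once per processed unit, so bigger units mean fewer traversals. A back-of-the-envelope model (all numbers are illustrative assumptions, not measurements):

```python
import math

MTU = 1500                 # bytes per wire packet
AGG = 45_000               # bytes per aggregated GRO/GSO super-packet (assumed)
PER_PACKET_COST_US = 2.0   # assumed per-unit stack cost: interrupt, headers, alloc

def stack_cost_us(payload_bytes, unit):
    """CPU cost the host stack pays, given the unit size it processes."""
    return math.ceil(payload_bytes / unit) * PER_PACKET_COST_US

without_offload = stack_cost_us(1_000_000, MTU)  # one traversal per wire packet
with_aggregation = stack_cost_us(1_000_000, AGG) # one traversal per aggregated buffer
```

With these assumed numbers, aggregation cuts stack traversals by roughly 30x for the same megabyte of payload, which is exactly the kind of win these offloads deliver.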

These receive and transmit optimizations improve packet-level efficiency, which helps with overall throughput. But MQTT performance at scale depends on the entire TCP/IP stack and the MQTT application logic sitting on top of it.

TOE (TCP Offload Engine)

A TCP Offload Engine implements a complete TCP/IP stack in the NIC, handling connection establishment, reliability, flow control, and all the stateful TCP mechanics. This differs from LRO/GRO/TSO/GSO, which only optimize specific packet operations.

TOE is the key component for implementing full MQTT offloading, because MQTT runs on top of TCP. You need a complete TCP implementation to act as an active participant in MQTT message distribution. The question isn't whether you need TCP offloading (you do); the question is how to implement it effectively.

Why standard NIC offloads aren't sufficient for MQTT acceleration

1) Packet-level optimizations (LRO/GRO/TSO/GSO) are not enough at scale

Standard receive and transmit offloads reduce packet handling overhead, which improves throughput efficiency. But MQTT bottlenecks at scale are typically at the application layer, not the packet layer:

  • Which subscriptions match this topic?
  • How many clients need this Publish?
  • How quickly can messages be replicated and distributed to all matching subscribers?

You can make packet processing cheaper, but you're still spending most of your time on topic matching and fan-out distribution. That's why enabling these offloads produces modest CPU savings but doesn't significantly improve end-to-end Publish latency.

In other words, if your bottleneck is the broker's application-level work, generic packet optimizations won't remove it.
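This is Amdahl's law applied to the broker: accelerating only the packet-handling fraction of CPU time caps the overall gain. A quick estimate with illustrative fractions (not measurements):

```python
def end_to_end_speedup(packet_frac, packet_speedup):
    """Amdahl's law: overall speedup when only the packet-handling fraction
    of broker CPU time is accelerated. Fractions here are assumptions."""
    return 1.0 / ((1.0 - packet_frac) + packet_frac / packet_speedup)

# Assume 20% of broker CPU time is packet handling and offloads make it 5x cheaper:
print(round(end_to_end_speedup(0.20, 5.0), 2))  # prints 1.19
```

A 5x improvement in packet handling yields only about a 1.19x end-to-end gain under this assumption, because topic matching and fan-out, the other 80%, are untouched.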

2) Traditional hardware TOE implementations have operational limitations

TCP offloading is necessary for MQTT acceleration since you need a complete, stateful TCP stack to participate in MQTT exchanges. Traditional hardware TOE implementations provide this, but they introduce operational challenges:

  • Moving security-sensitive transport logic into fixed-function NIC hardware
  • Difficult to update or patch
  • Limited visibility for debugging
  • Hardware constraints (state table sizes, connection limits) become reliability concerns
  • Inflexible when traffic patterns change

At scale, MQTT deployments constantly change: brokers get upgraded, traffic patterns shift, subscriptions churn, and edge cases emerge. When your TCP offload is locked in hardware with limited flexibility, troubleshooting and adapting become difficult.

That's why fast patching and good visibility matter so much. If your offload path is hard to update or opaque to troubleshoot, it can create bigger problems than it solves.

MQTT acceleration requires the offload to participate actively in message distribution

To meaningfully reduce Publish distribution latency, you need MQTT-aware logic that can:

  • Capture incoming Publish messages
  • Maintain a fast subscription map (who's subscribed to what)
  • Replicate and distribute Publish messages to matching subscribers
  • Forward non-Publish traffic (Subscribe, Unsubscribe, Keepalive) to the broker
  • Stay synchronized with subscription state changes

When the offload starts delivering messages to subscribers, it takes on part of the broker's core workload, such as accepting Publish traffic, determining which subscribers match, and pushing the message out to many connections quickly. That's where a lot of MQTT cost can be found at scale.
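The fast-path split described above can be sketched in a few lines. MQTT encodes the control packet type in the high nibble of the fixed header's first byte (PUBLISH is 3, SUBSCRIBE is 8), so classification is cheap; the `distribute` and `forward_to_broker` callbacks are hypothetical stand-ins for the offload's two paths:

```python
PUBLISH = 3                 # MQTT control packet type handled on the fast path
FAST_PATH_TYPES = {PUBLISH}

def classify(first_byte):
    """MQTT control packet type lives in the high nibble of the fixed header."""
    return first_byte >> 4

def route(frame, distribute, forward_to_broker):
    """Offload fast path: replicate Publish locally, punt everything else."""
    if classify(frame[0]) in FAST_PATH_TYPES:
        distribute(frame)           # topic match + fan-out on the SmartNIC
    else:
        forward_to_broker(frame)    # SUBSCRIBE/UNSUBSCRIBE/PINGREQ go to the broker
```

The cheap part is this classification; the hard part, as the next section shows, is sending the replicated Publishes without corrupting the underlying TCP streams.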

It also changes what "correctness" means for the offload. Clients experience a single continuous connection and expect it to behave predictably. Message ordering, acknowledgments, retransmissions, flow control, and connection stability all need to continue working the way clients expect, even during spikes and edge cases. Once the offload is actively sending data, it has to maintain those properties; otherwise, performance gains can turn into instability and difficult debugging sessions.

The TCP reality check: why MQTT offload gets hard fast

At this point, a very common question shows up:

“If the SmartNIC sends messages for the broker, why not just do that transparently?”

Because TCP was not designed for three writers.

TCP assumes two endpoints, each maintaining precise sequence and acknowledgment state. If an offload injects packets into a connection “on behalf of” the server without perfect alignment, bad things happen.

Here are the two failures you hit early:

SEQ/ACK desynchronization

If the offload sends valid TCP packets that the client accepts, the client will advance its expected sequence numbers.

But the broker (the real endpoint) didn’t send those bytes, so it hasn’t advanced its own sequence numbers.

Now the broker and client disagree about what comes next.

The client starts dropping packets from the broker. The broker sees unexpected acknowledgments or missing acknowledgments. Eventually, the connection collapses.
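The failure mode is easy to model with toy sequence numbers: the injected bytes advance the client's receive state but not the broker's send state, and the two views never reconcile:

```python
# Minimal model of SEQ desynchronization (toy numbers, not real TCP state).
client_rcv_nxt = 1000   # next sequence number the client expects from the server
broker_snd_nxt = 1000   # next sequence number the broker believes it will send

# The offload injects a 500-byte Publish "on behalf of" the broker:
client_rcv_nxt += 500   # client accepted the bytes and advanced its window
# broker_snd_nxt is unchanged: the broker never sent those bytes

# The broker's next real segment starts at 1000, but the client expects 1500,
# so the client treats it as stale/duplicate data and drops it.
assert broker_snd_nxt != client_rcv_nxt
```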

ACK matching and stream ownership

Even if you manage to translate sequence numbers, you still have to handle acknowledgments correctly.

From the client’s perspective, it’s acknowledging a stream of bytes, not “these bytes came from the broker” and “those bytes came from the offload.”

So the offload must track which ranges of bytes were injected, which were broker-originated, and how ACKs map to each side’s state.
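That bookkeeping looks roughly like the hypothetical helper below: given the client's cumulative ACK and a record of which byte ranges each writer produced, work out how much of each side's data has been acknowledged. This is an illustration of the accounting, not a real NIC API:

```python
def ack_split(ack, segments):
    """Given a cumulative ACK number and a list of (start, end, owner) byte
    ranges, report how many bytes each writer ('broker' or 'offload') has
    had acknowledged so far. Hypothetical helper for illustration only."""
    acked = {"broker": 0, "offload": 0}
    for start, end, owner in segments:
        if ack > start:
            acked[owner] += min(ack, end) - start
    return acked
```

Every retransmission, reordering, or partial ACK complicates this map further, which is why the translation approach gets harder the deeper you go.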

And then come the messy parts

Once you’re operating at speed and scale, TCP behaviors that are normally invisible become very visible:

  • Delayed ACK behavior changes timing

  • Retransmissions create duplicate data patterns

  • Small-message coalescing packs multiple MQTT messages into a single TCP payload

  • Messages can be split across payload boundaries

This is why many “clever” transparent acceleration ideas work in a minimal test, but then break when you increase publishers, subscribers, or throughput.
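The coalescing and splitting problems in particular mean the offload cannot treat one TCP payload as one MQTT message; it needs a reassembly step. A minimal sketch, assuming standard MQTT framing (fixed header byte plus a variable-length "remaining length" field):

```python
def extract_frames(buf):
    """Split a TCP payload buffer into complete MQTT frames.
    Handles coalescing (many frames per payload) and splitting (a partial
    frame left over). Returns (complete_frames, leftover_bytes)."""
    frames, i = [], 0
    while True:
        if len(buf) - i < 2:          # need header byte + one length byte
            break
        # Decode the variable-length "remaining length" field (7 bits per byte).
        length, shift, j = 0, 0, i + 1
        while True:
            if j >= len(buf):
                return frames, buf[i:]    # the length field itself is split
            byte = buf[j]
            length |= (byte & 0x7F) << shift
            j += 1
            if not byte & 0x80:
                break
            shift += 7
        end = j + length
        if end > len(buf):
            break                         # frame body continues in the next payload
        frames.append(buf[i:end])
        i = end
    return frames, buf[i:]
```

Every receive path in the offload has to run something like this, carry the leftover bytes to the next payload, and do it per connection at line rate.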

So the problem becomes:

Can we offload MQTT in a way that remains correct under TCP constraints?

What SmartNICs enable: Layered TCP + MQTT offloading with flexibility

SmartNICs and DPUs change the equation because they provide programmable compute resources close to the network interface. This enables a fundamentally better approach to MQTT acceleration:

You implement TCP offloading in software running on the SmartNIC, rather than fixed-function hardware. This gives you full TCP/IP stack functionality (the foundation that MQTT requires) while maintaining:

Updateability - patch bugs and improve implementations like any software

Flexibility - adapt to changing traffic patterns and requirements

Observability - standard debugging and monitoring tools work

Security - security updates follow normal software processes

Then, on top of that TCP foundation, you add MQTT-specific acceleration logic that understands the protocol and can optimize Publish distribution.

From a deployment perspective, this still offloads work from the host CPU. From an engineering perspective, you get the benefits of hardware offloading (dedicated processing, low latency) without the constraints of a fixed-function hardware accelerator.

The CodiLime team's research demonstrates this layered approach: their SmartNIC-based solution integrates the lwIP network stack (providing complete TCP functionality) running on the DPU's ARM cores, with MQTT message distribution logic implemented on top. This architecture addresses both the TCP layer requirements and the MQTT-specific performance bottlenecks.

Three practical design patterns for SmartNIC-based MQTT offload

There are multiple ways teams approach this. Think of them as a spectrum from “fast to prototype” to “harder but correct.”

Pattern 1: TCP monitoring with translation (fastest to prototype)

In this approach, the SmartNIC sits in the path and monitors existing TCP connections. It intercepts and modifies traffic and can inject data when needed.

To avoid immediate connection breakage, it maintains two internal “views” of the connection and translates sequence/ack state as packets move through it.

Why teams try it:

  • It’s a relatively direct way to prove the concept.

  • You can focus on the MQTT fast path first.

Why it often stops scaling:

  • The deeper you go, the more you find yourself rebuilding parts of a transport stack.

  • Real TCP behaviors (retransmissions, coalescing, partial messages) quickly exceed what a “monitor + translation” design can robustly handle.

This approach is great for proving that MQTT distribution can be accelerated, but it tends to be fragile as complexity rises.
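The core of the "two views" bookkeeping is a translation offset, sketched below. This is a deliberately simplified illustration; a real design also has to handle retransmissions of injected bytes, selective ACKs, and much more:

```python
class SeqTranslator:
    """Pattern 1 sketch: track how many bytes the offload injected so that
    broker-originated SEQ/ACK numbers can be shifted between the broker's
    view and the client's view of the same connection."""
    MOD = 1 << 32   # TCP sequence numbers wrap at 2^32

    def __init__(self):
        self.injected = 0   # bytes the offload wrote that the broker never sent

    def on_inject(self, nbytes):
        self.injected += nbytes

    def broker_to_client_seq(self, seq):
        # The client has consumed `injected` extra bytes, so broker SEQs shift up.
        return (seq + self.injected) % self.MOD

    def client_to_broker_ack(self, ack):
        # Client ACKs must be shifted back down before the broker sees them.
        return (ack - self.injected) % self.MOD
```

The offset itself is trivial; the fragility comes from keeping it consistent across retransmissions, coalesced frames, and out-of-order delivery.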

Pattern 2: Terminate TCP on the SmartNIC (a proper endpoint)

This is the pragmatic “stop fighting TCP” move.

Instead of trying to inject data into someone else’s TCP connection, the SmartNIC becomes a genuine endpoint:

  • Clients connect to the SmartNIC.

  • The SmartNIC maintains its own connection to the broker.

Now you have two normal TCP connections rather than one connection with a third actor injecting bytes.

This resolves most of the painful correctness issues because sequence/ack state is handled cleanly per connection.

It also makes it easier to implement MQTT-aware logic in a controlled environment:

  • Receive Publish messages

  • Match topic to subscribers

  • Distribute quickly to clients connected to the SmartNIC

  • Synchronize subscription updates with the broker

From a performance perspective, this can still be a win because the “extra hop” can be negligible compared to the distribution savings, especially when most traffic is Publish handling.
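Structurally, the terminating endpoint is just two ordinary TCP connections bridged by the offload, each with its own clean SEQ/ACK state. A minimal sketch using plain sockets (a real implementation would parse MQTT frames in the relay and branch Publish traffic into the local fan-out path, which this sketch omits):

```python
import socket
import threading

def relay(src, dst):
    # Copy bytes one way until EOF. A real offload would run MQTT frame
    # reassembly here and divert Publish frames to the fast path.
    try:
        while True:
            data = src.recv(4096)
            if not data:
                break
            dst.sendall(data)
        dst.shutdown(socket.SHUT_WR)
    except OSError:
        pass

def handle_client(client_sock, broker_addr):
    # Pattern 2: the SmartNIC is a genuine endpoint. Two normal TCP
    # connections, so no third writer ever injects into either stream.
    broker_sock = socket.create_connection(broker_addr)
    threading.Thread(target=relay, args=(broker_sock, client_sock),
                     daemon=True).start()
    relay(client_sock, broker_sock)
```

Because each leg is a normal connection, all the SEQ/ACK translation machinery from Pattern 1 simply disappears.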

Pattern 3: Socket interception / bypass the host network stack (deep integration)

If you want to push further, you can bypass the host kernel networking path entirely by exposing a SmartNIC-based stack to the broker application.

This can be done via:

  • Intercepting socket calls

  • Or integrating a dedicated networking library into the broker

This category can deliver significant gains, but it’s a larger engineering surface area and tends to be more intrusive to maintain across broker upgrades and platform variations.

FAQ

Does hardware offloading improve MQTT performance?

It can, but not all offloading improves the metric you care about. Traditional offloads often help throughput and CPU efficiency in TCP/IP processing. MQTT performance issues at scale often come from Publish processing and broker distribution work, which those offloads don’t eliminate.

Does TOE improve MQTT latency?

TCP offload capability is essential for MQTT acceleration, and you need a TCP/IP stack to participate in MQTT message distribution. The question is about implementation approach: traditional hardware TOE has limitations (difficult to update, limited flexibility, debugging challenges), while SmartNIC-based implementations provide TCP offloading through programmable software on the SmartNIC. This gives you the necessary TCP foundation with better flexibility and maintainability, which you then build MQTT-specific acceleration on top of.

Do GRO/TSO help MQTT?

They can help reduce per-packet overhead and improve throughput efficiency, especially in high-volume traffic. But if your main bottleneck is “the broker is spending most of its time distributing messages,” GRO/TSO won’t change that fundamental cost.

Why is MQTT offload harder than it sounds?

Because meaningful acceleration often involves delivering Publish messages faster than the broker can, which turns the offload into an active participant in TCP streams. TCP’s sequence/ack rules, retransmissions, and message coalescing make naive “inject packets and hope” designs fail under real load.

What should an MQTT-specific offload actually do?

A practical MQTT distribution offload needs to:

  • recognize and capture Publish traffic,
  • maintain an up-to-date subscription map,
  • distribute publishes to relevant subscribers quickly,
  • forward non-Publish traffic to the broker,
  • stay synchronized with changing subscription state,
  • and preserve TCP correctness.

Why are SmartNICs a better fit than traditional offloads?

SmartNICs enable a layered approach: implement TCP offloading in programmable software on the SmartNIC (solving the updateability and flexibility issues of hardware TOE), then add MQTT-aware acceleration logic on top of that TCP foundation. You get both the necessary TCP capabilities AND the MQTT-specific optimizations, with the flexibility to iterate and improve both layers. Traditional commodity NIC offloads (LRO/GRO/TSO/GSO) only optimize specific packet operations and can't address MQTT-layer bottlenecks.

Conclusion: traditional offloads help, but MQTT needs MQTT-aware acceleration

Standard NIC offloads (LRO/GRO/TSO/GSO) optimize packet handling, which helps throughput efficiency. But MQTT acceleration at scale requires more: you need complete TCP offloading capability as the foundation, plus MQTT-aware distribution logic on top. Traditional hardware TOE implementations provide the TCP layer but have operational limitations (updateability, flexibility, debugging).

SmartNICs offer a better path: implement TCP offloading in programmable software on the SmartNIC's processors, then build MQTT acceleration on top of that foundation. This layered approach addresses both the TCP requirements and the MQTT-specific bottlenecks, while maintaining the flexibility to evolve as your deployment needs change.

When you need MQTT to feel "instant" at scale, you need a solution that accelerates the hot path: capturing publishes, matching subscriptions, and distributing messages quickly, without breaking TCP.

If you want to explore this topic more deeply, especially the tradeoffs between monitoring, terminating TCP on the SmartNIC, and deeper socket interception approaches, our ebook goes into those design choices in detail, including what breaks first and how teams work around it. Download it for free here.


Benjamin Wharton

Content Specialist

A Content Specialist with a skill for creating clear, engaging, and impactful material. He specializes in crafting technical content, email campaigns, landing pages, and social media posts that resonate with CodiLime's audience.
