
Scaling the MQTT Gateway

This page documents the approved design for scaling the MQTT Gateway from ~50-100 to 1000 concurrent ESP32 device connections.


Current State

  • Single Node.js process on a 2 CPU / 8 GB RAM droplet
  • 1 EMQX broker (Docker), 1 UDP socket, 1 MQTT client, 4 Opus worker threads
  • ~200 MB RAM usage with few connections
  • Max capacity: ~50-100 concurrent devices

Target Architecture

Droplet 1 - Gateway Box (8 CPU / 32 GB)
+-- EMQX (Docker, the existing broker instance, moved here)
+-- mqtt-gateway instance 1 (UDP 8881, 4 Opus workers)
+-- mqtt-gateway instance 2 (UDP 8882, 4 Opus workers)
+-- mqtt-gateway instance 3 (UDP 8883, 4 Opus workers)
+-- mqtt-gateway instance 4 (UDP 8884, 4 Opus workers)

Droplet 2 - API Box (2 CPU / 4 GB)
+-- manager-api-node
+-- manager-web (static build)

Cloud Services (existing):
+-- LiveKit Cloud Scale plan - 5000 concurrent connections
+-- Cerebrium (AI agents, hosted separately)
+-- DigitalOcean PostgreSQL

Why Multi-Instance (Not Single Process)

Node.js executes JavaScript on a single main thread, so one process effectively uses one CPU core. At 1000 devices, the main thread must handle ~33,400 audio frames/sec (encode + decode), plus UDP I/O, encryption, MQTT routing, and LiveKit events, roughly 3,300 ms of work per wall-clock second in total. One core provides only 1,000 ms per second, so we split the load across 4 cores via 4 processes.
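
The arithmetic behind that claim can be sketched as follows; the per-device frame rate and per-frame cost are assumed values (60 ms Opus frames, ~0.1 ms of main-thread work per frame), not measurements:

```javascript
// Back-of-envelope check of the single-core budget.
const DEVICES = 1000;
const FPS_PER_DEVICE = 33.4;          // ~16.7 fps decode + ~16.7 fps encode (60 ms frames)
const MS_PER_FRAME = 0.1;             // assumed main-thread cost per frame

const totalFps = DEVICES * FPS_PER_DEVICE;     // ~33,400 frames/sec
const workMsPerSec = totalFps * MS_PER_FRAME;  // ~3,340 ms of work per 1,000 ms budget
```

With ~3.3x more work than one core can do, four processes on four cores is the minimum that fits.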

PM2 manages all 4 instances; the Linux scheduler spreads the processes across the available CPU cores.


How It Works

Device Connection Flow (no firmware change)

1. Device --> manager-api: POST /toy/ota/  (gets MQTT broker IP:1883)
2. Device --> EMQX: MQTT connect (port 1883, same for all devices)
3. EMQX distributes "hello" via shared subscription to one gateway instance
4. Gateway instance replies with its unique UDP port in hello response
5. Device streams UDP audio to that specific instance's port

The device always connects to EMQX on port 1883. EMQX handles load distribution. The device does not know which gateway instance it is talking to.
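A sketch of step 4, with hypothetical payload and field names (the real message format may differ):

```javascript
// Each instance reads its own UDP port from the environment (default 8881)
// and advertises it in the hello reply. The message shape is illustrative.
const UDP_PORT = Number(process.env.UDP_PORT || 8881);

function buildHelloReply(helloMsg) {
  return {
    topic: `devices/p2p/${helloMsg.mac}`,  // reply goes to the device's own p2p topic
    payload: JSON.stringify({ type: "hello", udp: { port: UDP_PORT } }),
  };
}
```

The device then streams UDP audio to whatever port it was handed, which is why no firmware change is needed.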

EMQX Shared Subscriptions

Current single-instance subscription:

broker.subscribe("device-server", handler)

Changed to shared subscription:

broker.subscribe("$share/gateway/device-server", handler)

EMQX auto-distributes messages round-robin across all subscribers in the gateway group. Devices still publish to device-server — no firmware change needed.
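A minimal sketch of the topic change (the helper name is ours; emqx-broker.js may structure this differently):

```javascript
// Wrap only device-to-server topics in the shared-subscription prefix.
// Per-device reply topics (next section) must NOT go through this.
const SHARE_GROUP = "gateway";

function sharedTopic(topic) {
  return `$share/${SHARE_GROUP}/${topic}`;
}

// e.g. broker.subscribe(sharedTopic("device-server"), handler)
//      subscribes to "$share/gateway/device-server"
```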

Reply Topic Handling (Critical)

When a gateway instance owns a device, it subscribes to that device's p2p reply topic:

devices/p2p/{MAC_ADDRESS}

This does not use shared subscription — only the instance that owns the device should receive its replies. Each instance only subscribes to reply topics for its own connected devices.
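A sketch of that ownership rule (function and variable names are illustrative):

```javascript
// Track which devices this instance owns and keep the per-device
// subscriptions in sync. No $share prefix here: replies must reach
// exactly the owning instance.
const ownedDevices = new Set();

function onDeviceConnected(client, mac) {
  ownedDevices.add(mac);
  client.subscribe(`devices/p2p/${mac}`);
}

function onDeviceDisconnected(client, mac) {
  ownedDevices.delete(mac);
  client.unsubscribe(`devices/p2p/${mac}`); // avoid stale subscriptions piling up
}
```

The unsubscribe on disconnect matters at this scale: without it, a long-lived instance slowly accumulates subscriptions for devices it no longer owns.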

Per-Instance Configuration

| Env Var      | Instance 1 | Instance 2 | Instance 3 | Instance 4 |
|--------------|------------|------------|------------|------------|
| INSTANCE_ID  | 1          | 2          | 3          | 4          |
| UDP_PORT     | 8881       | 8882       | 8883       | 8884       |
| WORKER_COUNT | 4          | 4          | 4          | 4          |

PM2 Ecosystem Config

module.exports = {
  apps: [
    {
      name: "gateway-1",
      script: "app.js",
      cwd: "./main/mqtt-gateway",
      env: { INSTANCE_ID: "1", UDP_PORT: "8881", WORKER_COUNT: "4" }
    },
    {
      name: "gateway-2",
      script: "app.js",
      cwd: "./main/mqtt-gateway",
      env: { INSTANCE_ID: "2", UDP_PORT: "8882", WORKER_COUNT: "4" }
    },
    {
      name: "gateway-3",
      script: "app.js",
      cwd: "./main/mqtt-gateway",
      env: { INSTANCE_ID: "3", UDP_PORT: "8883", WORKER_COUNT: "4" }
    },
    {
      name: "gateway-4",
      script: "app.js",
      cwd: "./main/mqtt-gateway",
      env: { INSTANCE_ID: "4", UDP_PORT: "8884", WORKER_COUNT: "4" }
    }
  ]
};

Implementation Phases

Phase 1: Fix Memory Leaks and Bottlenecks (prerequisite)

These fixes must land before the multi-instance rollout; otherwise each of the four instances inherits the same leaks.

| Fix                        | File                | Change                                                              |
|----------------------------|---------------------|---------------------------------------------------------------------|
| Increase cipher cache      | streaming-crypto.js | Raise maxCacheSize from 20 to 300 (250 devices per instance)        |
| Cap frame buffer           | livekit-bridge.js   | Limit frameBuffer to 10 frames (28,800 bytes)                       |
| Clean up event listeners   | livekit-bridge.js   | Call room.removeAllListeners() before room.disconnect() in close()  |
| Clear pending MCP requests | livekit-bridge.js   | Clear the pendingMcpRequests Map and volumeAdjustmentQueue array in close() |
| Async ghost cleanup        | mqtt-gateway.js     | Process rooms in batches of 20 with await between batches           |
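The frame-buffer cap could look roughly like this (a sketch; the actual livekit-bridge.js structure will differ):

```javascript
// Cap the buffer at 10 frames. At 2,880 bytes per frame this bounds each
// device at 28,800 bytes; for live audio, dropping the oldest frame is
// preferable to unbounded growth when the consumer falls behind.
const MAX_BUFFERED_FRAMES = 10;

function pushFrame(frameBuffer, frame) {
  frameBuffer.push(frame);
  if (frameBuffer.length > MAX_BUFFERED_FRAMES) {
    frameBuffer.shift(); // drop the oldest frame
  }
  return frameBuffer;
}
```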

Phase 2: Multi-Instance Code Changes

| Change                   | File                    | Description                                          |
|--------------------------|-------------------------|------------------------------------------------------|
| Parameterize UDP port    | app.js, mqtt-gateway.js | Read UDP_PORT from env, default 8881                 |
| Shared subscriptions     | emqx-broker.js          | Prefix $share/gateway/ on device-to-server topics only |
| Unique MQTT client ID    | emqx-broker.js          | mqtt-gateway-${INSTANCE_ID}-${Date.now()}            |
| Worker pool per instance | app.js                  | Each instance creates WorkerPoolManager(4)           |
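The UDP-port and client-ID rows from the table might be sketched as follows (variable names assumed):

```javascript
// Per-instance parameterization, read once at startup.
const INSTANCE_ID = process.env.INSTANCE_ID || "1";
const UDP_PORT = Number(process.env.UDP_PORT || 8881); // default matches instance 1

// A unique MQTT client ID per instance. If two instances shared one ID,
// EMQX would treat them as the same client and keep disconnecting one.
const clientId = `mqtt-gateway-${INSTANCE_ID}-${Date.now()}`;
```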

Phase 3: Infrastructure Migration

  1. Provision gateway droplet: DigitalOcean 8 CPU / 32 GB RAM
  2. Move EMQX Docker to new droplet, same port 1883
  3. Deploy 4 gateway instances via PM2
  4. Open firewall: UDP 8881-8884, TCP 1883
  5. Update manager-api OTA response to point to new IP
  6. Move manager-api to separate smaller droplet (2 CPU / 4 GB)

Phase 4: Monitoring and Validation

| Item              | Description                                                                               |
|-------------------|-------------------------------------------------------------------------------------------|
| Health endpoint   | HTTP per instance (ports 9001-9004): connection count, worker utilization, memory, uptime |
| Connection alerts | Log totals every 30 s; alert if any instance exceeds 300                                  |
| Load testing      | Simulate 100, 250, 500, 1000 connections; measure latency, memory, CPU                    |

Capacity Planning

| Connections | Instances | RAM per Instance | Total RAM | CPU Threads            |
|-------------|-----------|------------------|-----------|------------------------|
| 250         | 1         | ~6 GB            | 6 GB      | 5 (1 main + 4 workers) |
| 500         | 2         | ~6 GB            | 12 GB     | 10                     |
| 750         | 3         | ~6 GB            | 18 GB     | 15                     |
| 1000        | 4         | ~6 GB            | 24 GB     | 20                     |

Leaves ~8 GB headroom on 32 GB server for EMQX + OS + spikes.

Opus Throughput

| Workers    | Capacity (fps) | Handles Devices | Notes                            |
|------------|----------------|-----------------|----------------------------------|
| 4          | 8,000          | ~500            | Single instance                  |
| 16 (4 x 4) | 32,000         | ~2,000          | 4 instances, plenty of headroom  |

Realistic load at 1000 devices (30% talk time): ~10,000 fps. 16 workers at 32,000 fps = 31% utilization. RAM is the real bottleneck, not Opus CPU.


LiveKit Cloud Plan

Agents are hosted on Cerebrium (not LiveKit Agents), so they join as regular participants. The LiveKit "concurrent agent sessions" limit does not apply.

| Plan  | Cost    | Concurrent Connections | WebRTC min/mo | Fits 1000 devices?                                   |
|-------|---------|------------------------|---------------|------------------------------------------------------|
| Ship  | $50/mo  | 1,000                  | 150,000       | No: 1000 devices x 2 participants = 2,000 connections |
| Scale | $500/mo | 5,000                  | 1,500,000     | Yes, comfortably                                     |

1000 devices x avg 30 min/day x 30 days = ~900,000 WebRTC min/mo (within the 1.5M included in the Scale plan).


Cost Summary

| Item                            | Monthly Cost       |
|---------------------------------|--------------------|
| Gateway droplet (8 CPU / 32 GB) | $192               |
| API droplet (2 CPU / 4 GB)      | $24                |
| LiveKit Cloud Scale plan        | $500               |
| Cerebrium                       | existing           |
| DigitalOcean PostgreSQL         | existing           |
| Total                           | ~$716 + Cerebrium  |

Risk Mitigation

| Risk                | Mitigation                                                                         |
|---------------------|------------------------------------------------------------------------------------|
| Instance crash      | PM2 auto-restarts; devices reconnect via OTA and are assigned to surviving instances |
| Memory leak         | Frame buffer cap + listener cleanup + worker session GC (Phase 1 fixes)            |
| Uneven distribution | EMQX round-robin is fair; monitor per-instance connection counts                   |
| Server failure      | Single point of failure at this scale; add a second server in future               |
| EMQX overload       | EMQX handles 100K+ connections; not a concern at 1000                              |