Performance Tuning¶

Felix is designed for predictable low-latency performance with tunable trade-offs between latency, throughput, and memory usage. This guide provides comprehensive performance tuning guidance based on real benchmarks and production-tested configurations.

Understanding Felix Performance¶

Felix performance is determined by several interconnected factors:

Network transport: QUIC connection and stream configuration
Batching: Message aggregation at publish and delivery stages
Parallelism: Connection pools and worker threads
Buffering: Queue depths and flow control windows
Encoding: Binary wire framing efficiency
Outbound lanes: Subscriber writer lane count and lane sharding policy

Performance Profiles¶

Felix provides three pre-configured profiles as starting points.

Balanced Profile (Default)¶

General-purpose settings for mixed workloads:

Broker configuration:

# Connection pools
pub_conn_pool: 4
pub_streams_per_conn: 2
event_conn_pool: 8
cache_conn_pool: 8
cache_streams_per_conn: 4

# QUIC flow control
event_conn_recv_window: 268435456      # 256 MiB
event_stream_recv_window: 67108864     # 64 MiB
event_send_window: 268435456           # 256 MiB

# Batching
event_batch_max_events: 64
event_batch_max_bytes: 65536
event_batch_max_delay_us: 250
fanout_batch_size: 64

# Queue depths
pub_queue_depth: 1024
subscriber_queue_capacity: 128
subscriber_writer_lanes: 4
subscriber_lane_queue_depth: 8192
max_subscriber_writer_lanes: 8
subscriber_lane_shard: auto

publish_chunk_bytes: 16384

Expected performance (fanout=10, batch=64, payload=4KB):

p50 latency: 400-800 µs
p99 latency: 2-4 ms
Throughput: 150-200k msg/sec per broker
Memory: ~2-4 GB working set

Best for: - Mixed pub/sub and cache workloads - Moderate fanout (1-20 subscribers) - General application development - Starting point for tuning

Latency-Optimized Profile¶

Minimize tail latency at the cost of throughput:

Broker configuration:

# Smaller pools
pub_conn_pool: 2
pub_streams_per_conn: 1
event_conn_pool: 4

# Smaller windows
event_conn_recv_window: 67108864       # 64 MiB
event_stream_recv_window: 16777216     # 16 MiB
event_send_window: 67108864            # 64 MiB

# Minimal batching
event_batch_max_events: 8
event_batch_max_bytes: 32768
event_batch_max_delay_us: 100
fanout_batch_size: 8

# Fast acknowledgements
ack_on_commit: true

# Shallow queues
pub_queue_depth: 512
subscriber_queue_capacity: 64
subscriber_writer_lanes: 2
subscriber_lane_shard: auto

Expected performance:

p50 latency: 150-300 µs
p99 latency: 500-1000 µs
Throughput: 50-80k msg/sec per broker
Memory: ~1-2 GB working set

Best for: - Real-time interactive applications - Trading systems, gaming - Sensor data with immediate processing - Low fanout (1-5 subscribers)

Throughput-Optimized Profile¶

Maximize throughput and burst tolerance:

Broker configuration:

# Large pools
pub_conn_pool: 8
pub_streams_per_conn: 4
event_conn_pool: 16
cache_conn_pool: 16
cache_streams_per_conn: 8

# Large windows
event_conn_recv_window: 536870912      # 512 MiB
event_stream_recv_window: 134217728    # 128 MiB
event_send_window: 536870912           # 512 MiB

# Aggressive batching
event_batch_max_events: 256
event_batch_max_bytes: 1048576
event_batch_max_delay_us: 2000
fanout_batch_size: 256

# Async acknowledgements
ack_on_commit: false

# Deep queues
pub_queue_depth: 4096
subscriber_queue_capacity: 256
subscriber_writer_lanes: 8
max_subscriber_writer_lanes: 8
subscriber_lane_shard: auto

publish_chunk_bytes: 32768

Expected performance:

p50 latency: 1-3 ms
p99 latency: 5-15 ms
Throughput: 300-500k msg/sec per broker
Memory: ~8-16 GB working set

Best for: - High-throughput data pipelines - Log aggregation, metrics collection - High fanout (20-100+ subscribers) - Batch processing workflows

Configuration Parameters¶

Pub/Sub Parameters¶

Connection Pooling¶

event_conn_pool: 8              # QUIC connections for events
pub_conn_pool: 4                # QUIC connections for publishing
pub_streams_per_conn: 2         # Publish streams per connection

Tuning guidance:

Workload	event_conn_pool	pub_conn_pool	streams_per_conn
Light	2-4	2	1-2
Medium	4-8	2-4	2
Heavy	8-16	4-8	2-4
Very heavy	16-32	8-16	4-8

Worker Sizing

Set pub_workers_per_conn ≤ pub_streams_per_conn. Excess workers create contention without benefit.

Flow Control Windows¶

event_conn_recv_window: 268435456      # Per-connection receive window
event_stream_recv_window: 67108864     # Per-stream receive window
event_send_window: 268435456           # Per-connection send window

Memory impact calculation:

Worst-case memory = (conn_window × conn_pool) + 
                    (stream_window × avg_streams × conn_pool)

Example: - conn_pool=8, conn_window=256MB, stream_window=64MB, avg_streams=10 - Memory ≈ (256MB × 8) + (64MB × 10 × 8) = 2GB + 5.1GB = 7.1GB

Tuning guidance:

Low latency, limited bursts: Use smaller windows (64-128 MiB)
High throughput, bursty: Use larger windows (256-512 MiB)
Memory constrained: Reduce pool size before reducing windows

Batching Parameters¶

event_batch_max_events: 64             # Max events per batch
event_batch_max_bytes: 262144          # Max batch size (256 KB)
event_batch_max_delay_us: 250          # Max batching delay (250 µs)
fanout_batch_size: 64                  # Fanout batch size

Batch triggers: Event batch is sent when any condition is met.

Trade-off analysis:

Parameter	↑ Increase Effect	↓ Decrease Effect
`max_events`	Higher throughput, higher latency	Lower latency, lower throughput
`max_delay_us`	Higher throughput, higher latency	Lower latency, lower throughput
`max_bytes`	Fewer frames, more efficiency	More frames, less efficiency
`fanout_batch_size`	Better fanout efficiency	Lower fanout latency

Recommended settings by workload:

# Ultra-low latency
event_batch_max_events: 4
event_batch_max_delay_us: 50

# Low latency
event_batch_max_events: 8
event_batch_max_delay_us: 100

# Balanced (default)
event_batch_max_events: 64
event_batch_max_delay_us: 250

# High throughput
event_batch_max_events: 128
event_batch_max_delay_us: 1000

# Maximum throughput
event_batch_max_events: 256
event_batch_max_delay_us: 2000

Queue Depths¶

pub_queue_depth: 1024                  # Publish pipeline queue
subscriber_queue_capacity: 128         # Per-subscriber broker-core queue
subscriber_lane_queue_depth: 8192      # Per-lane outbound writer queue
pub_workers_per_conn: 4                # Publish workers per connection

Queue depth impact:

Shallow queues (256-512): Lower memory, fail-fast backpressure, lower burst tolerance
Medium queues (1024-2048): Balanced, good burst tolerance
Deep queues (4096-8192): High memory, high burst tolerance, slower failure detection

Memory per queue:

Queue memory ≈ queue_depth × avg_message_size

Example: 128 × 4KB = 512KB per subscriber queue
With 100 subscribers: 100 × 512KB = 50MB

Outbound Writer Lanes and Sharding¶

Writer lanes parallelize outbound subscriber writes while preserving per-subscriber ordering.

subscriber_writer_lanes: 4
max_subscriber_writer_lanes: 8
subscriber_lane_queue_depth: 8192
subscriber_lane_shard: auto  # auto | subscriber_id_hash | connection_id_hash | round_robin_pin

Recent release-mode sweeps (payload=4096, batch=64, pub_conns=4, pub_streams_per_conn=2):

Workload	Key Result
fanout=10	lanes improved throughput from ~54.9k (1 lane) to ~68.2k (4 lanes); gains plateaued at higher lanes
fanout=20	lanes improved throughput from ~38.1k (1 lane) to ~44.8k (16 lanes); p99 improved with tuned lanes
lanes=4 shard sweep	`auto` / `subscriber_id_hash` outperformed `connection_id_hash` / `round_robin_pin` in this workload

Start here: 1. subscriber_lane_shard: auto 2. subscriber_writer_lanes: 4 3. Increase to 8 only if throughput is still lane-bound 4. Avoid assuming larger lane counts always help; watch p99/p999

Cache Parameters¶

cache_conn_pool: 8                     # QUIC connections for cache
cache_streams_per_conn: 4              # Streams per connection
cache_conn_recv_window: 268435456      # 256 MiB per connection
cache_stream_recv_window: 67108864     # 64 MiB per stream

Concurrency calculation:

Max concurrent cache ops = cache_conn_pool × cache_streams_per_conn

Recommended by workload:

Workload	conn_pool	streams_per_conn	Max Concurrency
Low	4	2	8
Medium	8	4	32
High	16	8	128
Very high	32	16	512

Event Frame Encoding¶

Subscription event delivery uses binary EventBatch framing by default.

Benchmark Results¶

Pub/Sub Latency and Throughput (Release, Localhost)¶

Reference stress profile: - payload=4096 - batch=64 - pub_conns=4 - pub_streams_per_conn=2

Observed with writer lanes:

Workload	Result
fanout=10, lanes=1, shard=auto	~54.9k msg/s, p99 ~67.4ms
fanout=10, lanes=4, shard=auto	~68.2k msg/s, p99 ~51.2ms
fanout=20, lanes=1, shard=auto	~38.1k msg/s, p99 ~65.4ms
fanout=20, lanes=16, shard=auto	~44.8k msg/s, p99 ~54.4ms
fanout=10, lanes=4 shard sweep	`auto` / `subscriber_id_hash` best in this workload

Takeaway: - Writer lanes materially improve heavy fanout/payload workloads. - Gains plateau; more lanes can stop helping depending on workload and host.

Cache Performance (Localhost)¶

Configuration: 8 connections, 4 streams/conn, concurrency=32

Operation	Payload	p50	p99	Throughput
put	0 B	158 µs	350 µs	184k ops/sec
put	256 B	179 µs	380 µs	155k ops/sec
put	4 KB	260 µs	480 µs	78k ops/sec
get (hit)	256 B	177 µs	360 µs	166k ops/sec
get (miss)	-	165 µs	340 µs	179k ops/sec

Profiling and Diagnostics¶

Telemetry Feature¶

Enable detailed performance telemetry:

[dependencies]
felix-client = { version = "0.1", features = ["telemetry"] }
felix-broker = { version = "0.1", features = ["telemetry"] }

# Broker config
disable_timings: false                 # Enable timing measurements

Metrics collected:

Per-operation latency histograms (publish, subscribe, cache)
Frame counters (publish frames, event frames, cache frames)
Queue depth samples
Flow control events

Overhead: 5-15% throughput reduction in high-load scenarios.

Production Use

Disable telemetry in production for maximum throughput. Enable only for profiling and debugging specific issues.

Performance Debugging¶

High publish latency:

Check pub_queue_depth - is queue filling up?
Check pub_workers_per_conn - enough workers?
Check broker CPU usage - saturated?
Enable telemetry - where is time spent?

High subscribe latency:

Check subscriber_queue_capacity and lane drop counters - subscribers falling behind?
Check event_batch_max_delay_us - batching too aggressive?
Check QUIC flow control - windows exhausted?
Check subscriber processing time - bottleneck in application?

Low throughput:

Increase event_batch_max_events - more aggressive batching
Increase connection pools - more parallelism
Confirm binary EventBatch decoding path in subscribers
Check network bandwidth - saturated?
Increase pub_workers_per_conn - more publish parallelism

High memory usage:

Reduce flow control windows
Reduce queue depths
Reduce connection pool sizes
Check for slow subscribers - filling buffers?

Production Recommendations¶

Sizing Guidelines¶

Small deployment (< 10k msg/sec):

pub_conn_pool: 2
event_conn_pool: 4
cache_conn_pool: 4
event_batch_max_events: 32
pub_queue_depth: 512
subscriber_queue_capacity: 64
subscriber_writer_lanes: 2

Expected resources: 2 CPU cores, 2-4 GB RAM

Medium deployment (10k-100k msg/sec):

pub_conn_pool: 4
event_conn_pool: 8
cache_conn_pool: 8
event_batch_max_events: 64
pub_queue_depth: 1024
subscriber_queue_capacity: 128
subscriber_writer_lanes: 4

Expected resources: 4-8 CPU cores, 4-8 GB RAM

Large deployment (100k-500k msg/sec):

pub_conn_pool: 8
event_conn_pool: 16
cache_conn_pool: 16
event_batch_max_events: 128
pub_queue_depth: 2048
subscriber_queue_capacity: 256
subscriber_writer_lanes: 8

Expected resources: 16-32 CPU cores, 16-32 GB RAM

Tuning Workflow¶

Start with balanced profile: Use defaults
Measure baseline: Run realistic workload, measure latency/throughput
Identify bottleneck: CPU? Memory? Network? Queue depths?
Tune one parameter: Change single parameter
Re-measure: Verify improvement
Iterate: Repeat until requirements met

Measure, Don't Guess

Performance tuning without measurement leads to worse performance. Always benchmark before and after changes.

Monitoring in Production¶

Key metrics to track:

Publish rate and latency (p50, p99, p999)
Subscribe rate and latency
Queue depths (publish, event)
Lane queue pressure (per-lane enqueue/drop/highwater)
Connection count
CPU and memory usage
Network bandwidth
Dropped event count
Slow subscriber count

Alerting thresholds:

p99 latency > 2× baseline
Queue depth > 80% of max
Dropped events > 0.1% of published
CPU usage > 80%
Memory usage > 85%

Hardware Recommendations¶

CPU¶

Minimum: 2 cores
Recommended: 4-8 cores for medium workloads
High performance: 16-32 cores for high throughput

Felix is CPU-bound for: - JSON encoding/decoding - QUIC encryption - Message batching and fanout

Memory¶

Minimum: 2 GB
Recommended: 4-8 GB for medium workloads
High performance: 16-32 GB for high throughput with large queues

Memory usage scales with: - Connection pool sizes × flow control windows - Queue depths × subscriber count - Cache size

Network¶

Minimum: 1 Gbps
Recommended: 10 Gbps for high throughput
Ideal: 25+ Gbps for very high throughput

QUIC benefits from: - Low latency networks (< 1 ms RTT) - High bandwidth - Low packet loss (< 0.1%)

Disk¶

MVP: Not used (ephemeral only)
Future (durable mode): NVMe SSD for WAL and segments

Best Practices Summary¶

✓ Start with balanced profile, measure, then tune
✓ Size connection pools for your parallelism needs
✓ Use batching for throughput, minimize batching for latency
✓ Validate binary EventBatch decode performance in clients
✓ Monitor queue depths - they reveal backpressure
✓ Disable telemetry in production for maximum throughput
✓ Profile before optimizing - don't guess
✓ Test with realistic workloads, not synthetic benchmarks
✓ Plan for 2-3× headroom above expected load
✓ Document your tuning decisions and benchmark results

Predictable Performance

Felix is designed for predictable p99/p999 latency under load. Tuning trades off between latency, throughput, and memory—but tail latency remains controlled with proper configuration.