Troubleshooting Guide¶
Common issues and solutions when running Felix.
Build and Compilation Issues¶
Rust Version Too Old¶
Symptom: Compilation errors mentioning unstable features or syntax errors.
Solution: Update Rust to 1.92.0 or later.
# Check current version
rustc --version
# Update Rust
rustup update
# Set specific toolchain (if needed)
rustup override set 1.92.0
Missing Dependencies¶
Symptom: Linker errors or missing system libraries.
Solution: Install required system dependencies.
Linux (Debian/Ubuntu):
Linux (Fedora/RHEL):
macOS:
Build Hangs or Takes Forever¶
Symptom: cargo build appears stuck or takes excessive time.
Solution:
- Check cargo build jobs:
- Clean build cache:
- Check disk space:
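The three checks above can be sketched as a small shell helper. The function names `conservative_jobs` and `diagnose_slow_build` are hypothetical, and halving the core count is an assumed heuristic, not a Felix recommendation. `df -h`, `cargo clean`, and `cargo build -j` are standard commands.

```shell
# Half the given core count, but never less than 1 (assumed heuristic).
conservative_jobs() (
  n=$(( $1 / 2 ))
  if [ "$n" -lt 1 ]; then n=1; fi
  echo "$n"
)

# Run the three checks in a practical order; call this from the workspace root.
diagnose_slow_build() (
  cores=$(getconf _NPROCESSORS_ONLN)
  df -h .        # is the disk that holds target/ nearly full?
  cargo clean    # discard cached build artifacts
  cargo build --release -j "$(conservative_jobs "$cores")"   # rebuild with fewer parallel jobs
)
```

Fewer parallel jobs mainly helps when the machine is short on memory and the build is swapping rather than truly hung.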
Compilation Errors After Git Pull¶
Symptom: Build fails after pulling latest changes.
Solution: Clean and rebuild.
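A minimal sketch of that sequence, using only standard Cargo commands (`clean_rebuild` is a hypothetical helper name):

```shell
# Remove all cached artifacts, then rebuild the workspace in release mode.
# The rebuild only starts if the clean step succeeded.
clean_rebuild() {
  cargo clean && cargo build --release
}
```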
Network and Connection Issues¶
Port Already in Use¶
Symptom: Broker fails to start with an address-already-in-use error.
Solution: Change the ports or stop the conflicting process.
Option 1 - Change ports:
export FELIX_QUIC_BIND="0.0.0.0:5001"
export FELIX_BROKER_METRICS_BIND="0.0.0.0:8081"
cargo run --release -p broker
Option 2 - Find and stop the process:
# Linux/macOS - find the process on port 5000
lsof -i :5000
# Ask the process to exit; fall back to kill -9 only if it does not
sudo kill <PID>
# Or use ss (Linux only)
ss -tulpn | grep ':5000'
Connection Refused¶
Symptom: Client cannot connect to broker.
Solutions:
- Verify broker is running:
- Check bind address:
- Check firewall (Linux):
- Test connectivity:
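Two of these checks can be sketched in shell, assuming the broker's bind address comes from `FELIX_QUIC_BIND` as in the port example earlier. The helper names `bind_accepts_remote` and `udp_listeners` are hypothetical, and `ss` is Linux-only.

```shell
# Succeeds if the bind address accepts remote clients,
# fails if it is loopback-only (127.x.x.x, localhost, or [::1]).
bind_accepts_remote() {
  case "${1%:*}" in
    127.*|localhost|"[::1]") return 1 ;;
    *) return 0 ;;
  esac
}

# List UDP listeners on a port (QUIC runs over UDP). Linux only.
udp_listeners() {
  ss -ulpn | grep ":$1 "
}

# Example:
#   bind_accepts_remote "$FELIX_QUIC_BIND" || echo "broker listens on loopback only"
#   udp_listeners 5000
```

If `udp_listeners` prints nothing, no process is listening on that UDP port, so the broker is either not running or bound to a different port.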
QUIC Handshake Timeout¶
Symptom: Connection hangs during QUIC handshake.
Solutions:
- Check network path: Ensure UDP traffic is not blocked.
- Verify MTU: QUIC is sensitive to MTU issues.
- Check NAT/Load Balancer: Ensure UDP pass-through.
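One way to probe the MTU point is sketched below. QUIC (RFC 9000) requires the path to carry UDP datagrams of at least 1200 bytes, which is a 1228-byte IPv4 packet (1200 + 8 bytes UDP header + 20 bytes IPv4 header). The helper names are hypothetical, and `ping -M do` is specific to Linux iputils.

```shell
# ICMP payload size that exactly fills an IPv4 packet of the given size
# (20-byte IPv4 header + 8-byte ICMP header).
ping_payload_for_mtu() {
  echo $(( $1 - 28 ))
}

# Send three probes with the Don't Fragment bit set.
# $1 = host, $2 = IPv4 packet size in bytes to test.
probe_mtu() {
  ping -c 3 -M do -s "$(ping_payload_for_mtu "$2")" "$1"
}

# Example: probe_mtu <broker-host> 1500, then probe_mtu <broker-host> 1228
```

If the 1500-byte probe fails but the 1228-byte probe succeeds, the path MTU is reduced but still large enough for QUIC. Some networks filter ICMP while passing UDP, so a failed probe is a hint rather than proof.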
Connection Reset by Peer¶
Symptom: Established connections drop unexpectedly.
Solutions:
- Check broker logs for crashes or panics.
- Increase flow-control windows:
- Check resource limits:
Performance Issues¶
High Latency¶
Symptom: p99/p999 latency much higher than expected.
Solutions:
- Use release builds:
- Disable timing collection:
- Reduce batch delay:
- Check system load:
- Profile with perf (Linux):
Low Throughput¶
Symptom: Messages/second much lower than expected.
Solutions:
- Increase batch sizes:
- Increase connection pools:
- Check network bandwidth:
- Verify CPU affinity:
High Memory Usage¶
Symptom: Broker consuming excessive memory.
Solutions:
- Check actual memory usage:
- Reduce flow-control windows:
- Reduce queue depths:
- Limit connection pools (client-side):
- Check for memory leaks:
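For the first and last items, a minimal sketch that reads the broker's resident memory with `ps` is shown below. `rss_mib` is a hypothetical helper; it takes the broker's PID as its argument.

```shell
# Resident set size of a process in MiB (ps reports RSS in KiB).
# $1 = PID of the process to inspect.
rss_mib() (
  rss_kib=$(ps -o rss= -p "$1" | tr -d ' ')
  if [ -z "$rss_kib" ]; then
    echo "no such process: $1" >&2
    exit 1
  fi
  echo $(( rss_kib / 1024 ))
)

# Example: sample once a minute to spot steady growth
#   while sleep 60; do rss_mib <PID>; done
```

Memory that keeps growing under a constant workload suggests a leak. Memory that plateaus is more likely buffer and queue capacity, which the window and queue settings control.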
CPU Saturation¶
Symptom: Broker at 100% CPU, high latency.
Solutions:
- Scale horizontally: Deploy multiple broker instances.
- Reduce fanout batch size:
- Disable telemetry:
- Profile hot paths:
Runtime Errors¶
Panic or Crash¶
Symptom: Broker exits with panic message.
Solutions:
- Enable backtraces:
- Check logs for context before panic.
- Run with debug symbols:
- Report issue with full backtrace and reproduction steps.
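Backtraces and debug symbols can be enabled together, as in this sketch. `RUST_BACKTRACE` and `CARGO_PROFILE_RELEASE_DEBUG` are standard Rust and Cargo environment variables; the helper name `run_broker_for_backtrace` is hypothetical.

```shell
# Run the broker with full backtraces and with debug info
# added to the release profile for this invocation only.
run_broker_for_backtrace() {
  RUST_BACKTRACE=full CARGO_PROFILE_RELEASE_DEBUG=true \
    cargo run --release -p broker
}

# Example: keep a copy of the output to attach to a bug report
#   run_broker_for_backtrace 2>&1 | tee broker-panic.log
```

Changing the debug setting triggers a rebuild, so the first run takes longer than usual.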
Out of File Descriptors¶
Symptom: Error opening connections or files.
Solution: Increase file descriptor limit.
# Check current limit
ulimit -n
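# Increase temporarily (current shell only; cannot exceed the hard limit)
ulimit -n 65536
# Increase permanently (Linux; takes effect at the next login)
echo "* soft nofile 65536" | sudo tee -a /etc/security/limits.conf
echo "* hard nofile 65536" | sudo tee -a /etc/security/limits.conf
# Verify (use a new login session to see the permanent change)
ulimit -n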
Queue Full / Backpressure¶
Symptom: Publish operations timing out.
Solutions:
- Increase queue depth:
- Increase timeout:
- Slow down publisher or scale broker capacity.
- Check subscriber health: Slow subscribers cause backpressure.
Docker and Container Issues¶
Container Won't Start¶
Symptom: Docker container exits immediately.
Solutions:
- Check image build:
- Run interactively:
- Verify entrypoint:
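These checks might look like the following sketch, which uses standard Docker commands. The container name `felix-broker` matches the volume example in this guide. The image name is not given here, so it is passed as an argument, and `debug_container_start` is a hypothetical helper name.

```shell
# $1 = image name or ID (pass your own; this guide does not name it).
debug_container_start() (
  image=$1
  # Last output of the container before it exited.
  docker logs --tail 50 felix-broker
  # Entrypoint and default command baked into the image.
  docker inspect --format '{{json .Config.Entrypoint}} {{json .Config.Cmd}}' "$image"
  # Start a shell instead of the broker; requires /bin/sh in the image.
  docker run --rm -it --entrypoint /bin/sh "$image"
)
```

The interactive step fails on minimal images that ship without a shell. In that case the logs and the inspected entrypoint are the main evidence.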
Container Health Check Failing¶
Symptom: Container marked unhealthy.
Solutions:
- Test health endpoint:
- Check metrics bind:
- Increase start period:
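A hedged sketch of the first check follows. The health endpoint's URL is not given in this guide, so it is passed in as an argument; `check_container_health` is a hypothetical helper name, and the container name `felix-broker` is taken from the volume example.

```shell
# $1 = full URL of the health endpoint (host, port, and path are deployment-specific).
check_container_health() (
  url=$1
  # Recent health probe results as recorded by Docker.
  docker inspect --format '{{json .State.Health}}' felix-broker
  # Exits non-zero on connection failure or an HTTP status of 400 or above.
  curl --fail --silent --show-error --max-time 5 "$url"
)
```

If the endpoint answers from inside the container but not from the host, compare it with the address set in `FELIX_BROKER_METRICS_BIND`. For slow starts, `docker run` accepts `--health-start-period` (for example `--health-start-period=30s`) to delay failure counting.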
Volume Permission Issues¶
Symptom: Cannot write to mounted volume.
Solution: Fix volume permissions.
# Check user in container
docker exec felix-broker id
# Fix ownership
docker exec -u root felix-broker chown -R 10001:nogroup /data
# Or in Dockerfile
RUN chown -R 10001:nogroup /data
Kubernetes Issues¶
Pod Stuck in Pending¶
Symptom: Pod never schedules.
Solutions:
- Check node resources:
- Check PVC binding:
- Check pod affinity:
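The checks above map onto standard `kubectl` commands, sketched here. Pod and namespace names are deployment-specific and passed as arguments; `explain_pending_pod` is a hypothetical helper name.

```shell
# $1 = pod name, $2 = namespace (defaults to "default").
explain_pending_pod() (
  pod=$1
  ns=${2:-default}
  # The Events section at the end states why the scheduler skipped each node.
  kubectl describe pod "$pod" -n "$ns"
  # Any claim that is not Bound keeps the pod from scheduling.
  kubectl get pvc -n "$ns"
  # Affinity rules the pod asks for (empty output means none).
  kubectl get pod "$pod" -n "$ns" -o jsonpath='{.spec.affinity}'
  # Resource requests already allocated on each node.
  kubectl describe nodes | grep -A 8 "Allocated resources"
)
```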
CrashLoopBackOff¶
Symptom: Pod repeatedly crashes.
Solutions:
- Check logs:
- Check events:
- Increase resources:
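A sketch of these checks with standard `kubectl` commands is shown below. `explain_crashloop` is a hypothetical helper name, and the pod name and namespace are arguments.

```shell
# $1 = pod name, $2 = namespace (defaults to "default").
explain_crashloop() (
  pod=$1
  ns=${2:-default}
  # Output of the container instance that crashed, not the restarted one.
  kubectl logs "$pod" -n "$ns" --previous --tail=100
  # Recent events, oldest first (failed probes, image pulls, evictions).
  kubectl get events -n "$ns" --sort-by=.lastTimestamp
  # Termination reason of the first container, e.g. OOMKilled or Error.
  kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
)
```

A reason of `OOMKilled` means the container exceeded its memory limit, which is the case where raising resources helps.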
Service Not Reachable¶
Symptom: Cannot connect to service.
Solutions:
- Test from within cluster:
- Check service endpoints:
- Verify service selector:
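The endpoint and selector checks can be sketched with standard `kubectl` commands. `check_service` is a hypothetical helper name, and the service name and namespace are arguments.

```shell
# $1 = service name, $2 = namespace (defaults to "default").
check_service() (
  svc=$1
  ns=${2:-default}
  # An empty ENDPOINTS column means no ready pod matches the selector.
  kubectl get endpoints "$svc" -n "$ns"
  # The label selector the service uses.
  kubectl get service "$svc" -n "$ns" -o jsonpath='{.spec.selector}'
  # The labels the pods actually carry, for comparison.
  kubectl get pods -n "$ns" --show-labels
)
```

To test from within the cluster, a throwaway pod such as `kubectl run tmp-debug --rm -it --image=busybox --restart=Never -- sh` gives a shell on the cluster network. The pod name `tmp-debug` is arbitrary.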
StatefulSet Not Scaling¶
Symptom: Replicas not increasing.
Solutions:
- Check PVC provisioning:
- Check storage class:
- Check events:
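A sketch of these checks with standard `kubectl` commands follows. `explain_statefulset_scaling` is a hypothetical helper name, and the StatefulSet name and namespace are arguments.

```shell
# $1 = StatefulSet name, $2 = namespace (defaults to "default").
explain_statefulset_scaling() (
  sts=$1
  ns=${2:-default}
  # Claims stuck in Pending point at a provisioning problem.
  kubectl get pvc -n "$ns"
  # Confirms the storage class named in the claim template exists.
  kubectl get storageclass
  # The Events section lists failed pod or claim creation attempts.
  kubectl describe statefulset "$sts" -n "$ns"
)
```

With the default ordered pod management policy, a StatefulSet creates the next replica only after the previous one is Running and Ready, so one unhealthy pod stalls the whole scale-up.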
Client SDK Issues¶
Connection Pool Exhaustion¶
Symptom: Client operations hang or time out.
Solution: Increase pool sizes.
export FELIX_EVENT_CONN_POOL="16"
export FELIX_CACHE_CONN_POOL="16"
export FELIX_CACHE_STREAMS_PER_CONN="8"
Stream Errors¶
Symptom: Stream operations fail unexpectedly.
Solutions:
- Check broker logs for corresponding errors.
- Verify stream exists:
- Increase timeouts (implementation-dependent).
Debugging Tools¶
Enable Verbose Logging¶
# Debug all Felix crates
export RUST_LOG="felix=debug"
# Trace specific crate
export RUST_LOG="felix_broker=trace"
# Multiple filters
export RUST_LOG="felix_broker=debug,felix_wire=trace,felix_transport=debug"
Capture Network Traffic¶
# Capture QUIC traffic
sudo tcpdump -i any -w felix.pcap udp port 5000
# Analyze with Wireshark
wireshark felix.pcap
Profile Performance¶
Linux perf:
flamegraph:
Memory profiling:
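The commands for these three tools are not shown above. The sketch below uses widely available tools under stated assumptions: `perf` and Valgrind are installed, `cargo flamegraph` comes from `cargo install flamegraph`, and the binary name and path are yours to supply. All helper names are hypothetical.

```shell
# Sample a running process with perf, then open the report.
# $1 = PID of the broker, $2 = seconds to sample (default 30).
perf_sample() (
  pid=$1
  secs=${2:-30}
  sudo perf record -F 99 -g -p "$pid" -o felix.perf.data -- sleep "$secs"
  sudo perf report -i felix.perf.data
)

# Build and profile a binary, writing flamegraph.svg to the current directory.
# $1 = binary name (assumption: pass the broker's binary name).
flamegraph_bin() {
  cargo flamegraph --bin "$1"
}

# Record heap usage over time with Valgrind's massif tool.
# $1 = path to the built binary.
heap_profile() {
  valgrind --tool=massif "$1"
}
```

Valgrind slows the program considerably, so heap profiling fits a reduced workload better than a production-sized one.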
Test Latency Locally¶
# Run broker
cargo run --release -p broker
# In another terminal, run latency demo
cargo run --release -p broker --bin latency-demo -- \
--binary \
--fanout 1 \
--batch 1 \
--payload 1024 \
--total 5000
Check System Limits¶
# File descriptors
ulimit -n
# Max user processes
ulimit -u
# Memory locked
ulimit -l
# See all limits
ulimit -a
Common Configuration Mistakes¶
Mismatched Window Sizes¶
Issue: Client/broker window mismatch causes stalls.
Solution: Ensure consistent configuration.
# Broker
export FELIX_CACHE_CONN_RECV_WINDOW="268435456"
# Client (matching)
export FELIX_CACHE_SEND_WINDOW="268435456"
Wrong Frame Format Assumptions¶
Issue: Client assumes JSON event frames.
Solution: Ensure clients decode binary EventBatch on subscription event streams.
Insufficient Resources¶
Issue: Resource limits too low for workload.
Solution: Profile and adjust limits.
Getting Help¶
If you're still stuck:
- Check GitHub Issues: https://github.com/gabloe/felix/issues
- Search documentation: Use site search or grep the docs directory
- Collect diagnostic info:
- Create minimal reproduction: Simplify to smallest failing case
- Open an issue with:
  - Felix version
  - Operating system
  - Rust version
  - Configuration
  - Full error message
  - Steps to reproduce
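Most of the items in that list can be gathered with one helper, sketched here. It assumes Felix was built from a Git checkout, so the commit hash stands in for the version, and that configuration is set through `FELIX_`-prefixed environment variables as in the examples in this guide. `collect_felix_diagnostics` is a hypothetical name.

```shell
# Print version, platform, toolchain, configuration, and limits in one report.
# Run from the Felix checkout so the commit hash can be read.
collect_felix_diagnostics() (
  echo "== Felix commit =="
  git rev-parse --short HEAD 2>/dev/null || echo "unknown"
  echo "== OS =="
  uname -a
  echo "== Rust =="
  rustc --version
  echo "== Felix environment =="
  env | grep '^FELIX_' || echo "(no FELIX_ variables set)"
  echo "== Limits =="
  ulimit -a
)

# Example: collect_felix_diagnostics > felix-diagnostics.txt
```

Review the output before posting it publicly, since environment variables can contain hostnames or other details you may not want to share.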
Next Steps¶
- Configuration reference: Configuration Reference
- Environment variables: Environment Variables
- FAQ: Frequently Asked Questions