The Trinity Beast — Performance Optimization Guide

Connection pooling, three-tier cache strategies, GC tuning, rate limiter configuration, ALB settings, and cost optimization.

Version: v16 | Region: us-east-2 | Updated: May 2026

Current Status Overview

✅ Already Optimized

Network & Load Balancing

  • v3.9.3 ALB Connection Tuning: 60s idle timeout, 120s keep-alive, 10s deregistration delay, least-outstanding-requests (LOR) routing on both target groups, invalid header rejection
  • v3.9.3 NLB Connection Tuning: Cross-zone load balancing enabled, 10s deregistration delay on both UDP target groups, healthy threshold reduced to 2 (1 min recovery vs 2.5 min)
  • CORS: Enabled with minimal overhead

Price Feed Architecture

  • 6x WebSocket Price Feeds: Coinbase, Gemini, Kraken, Gate.io, Crypto.com, OKX. Persistent push-based connections; prices for 150 prewarmed assets arrive before requests do
  • Per-Container WebSocket Independence: Each container runs its own 6 WS connections, local-only sync.Map writes (no ElastiCache hammering)
  • 6x REST Fallbacks: One per exchange (Coinbase, Gemini, Kraken, Gate.io, Crypto.com, OKX) with health tracking — only used if the corresponding WS feed is stale
  • Per-Exchange Failover: Each exchange has independent health tracking — a stale Kraken WS triggers Kraken REST, not a cascade across all exchanges
  • Response-First Architecture: Background logging and metrics — response sent before any write operations
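The per-exchange failover rule above can be sketched as a staleness check keyed by exchange. This is a minimal illustration, not the production code: the feedHealth type, its method names, and the 5-second staleness threshold are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// feedHealth tracks when each exchange's WebSocket feed last delivered a message.
// The type, its methods, and the threshold used in main are hypothetical.
type feedHealth struct {
	lastSeenMs sync.Map // exchange name -> int64 Unix milliseconds of last WS message
	maxAgeMs   int64    // a feed older than this is treated as stale
}

// markAlive records that a WebSocket message arrived from the exchange.
func (h *feedHealth) markAlive(exchange string, nowMs int64) {
	h.lastSeenMs.Store(exchange, nowMs)
}

// useREST reports whether this one exchange should fall back to REST.
// Other exchanges are unaffected, so a stale feed never cascades.
func (h *feedHealth) useREST(exchange string, nowMs int64) bool {
	v, ok := h.lastSeenMs.Load(exchange)
	if !ok {
		return true // never seen: treat as stale
	}
	return nowMs-v.(int64) > h.maxAgeMs
}

func main() {
	h := &feedHealth{maxAgeMs: 5000} // assumed 5s staleness window
	now := time.Now().UnixMilli()
	h.markAlive("coinbase", now)
	h.markAlive("kraken", now-30000) // last message 30s ago
	for _, ex := range []string{"coinbase", "kraken", "okx"} {
		fmt.Printf("%s: use REST fallback = %v\n", ex, h.useREST(ex, now))
	}
}
```

Keeping one timestamp per exchange is what makes the failover independent: the decision for Kraken reads only Kraken's entry.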

Compute & Runtime

  • Fargate Tasks: 8 vCPU / 32 GB each — all 3 running APP_REPORT_SERVER across 3 AZs
  • Go Runtime: Using all CPUs with runtime.GOMAXPROCS(runtime.NumCPU())
  • Garbage Collection: GOGC=300 (configurable via env var, up from 200)
  • v3.3 Background Worker Pool: 999 slots (up from 500)
  • v3.3 System Mode Toggle: Demo/Performance/Debug profiles via /admin/system-mode
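As a rough illustration of the GC setting above, the sketch below applies a default of 300 when the GOGC environment variable is unset or invalid. The Go runtime already honors GOGC at startup on its own, so the only thing this adds is the fallback default. The helper name gcPercentFromEnv is hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"runtime/debug"
	"strconv"
)

// gcPercentFromEnv parses a GOGC-style value and falls back to def when the
// value is empty or not a positive integer. Hypothetical helper; note that it
// also maps the special value "off" to def, which may not be what you want.
func gcPercentFromEnv(value string, def int) int {
	n, err := strconv.Atoi(value)
	if err != nil || n <= 0 {
		return def
	}
	return n
}

func main() {
	pct := gcPercentFromEnv(os.Getenv("GOGC"), 300)
	prev := debug.SetGCPercent(pct) // returns the previous setting
	fmt.Printf("GC percent: %d -> %d\n", prev, pct)
}
```

A higher percentage lets the heap grow further between collections, trading memory for fewer GC cycles, which suits 32 GB containers.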

Cache & Data Layer

  • ElastiCache cache.r7g.2xlarge: 52.8 GB cache memory, 400K+ ops/sec capacity, single node, no replica
  • v3.9.3 ElastiCache Pipelining: All 6 sequential HGetAll loops (4 LRS + 2 UDP) replaced with single-round-trip pipelines via PipelineHGetAll()
  • ElastiCache API Key Cache: 3-layer lookup (sync.Map → ElastiCache → Aurora) with write-through
  • ElastiCache App Config: Application parameters read from ElastiCache first, Aurora fallback
  • Shared Rate Limiting: Atomic Lua script in ElastiCache — all price-serving containers share rate limit counters
  • Real-time Usage Counters: HINCRBY in ElastiCache on every request for instant LRS stats
  • ElastiCache Connection Pooling: pool size 300, 60 minimum idle connections
  • Aurora Optimized I/O: Unlimited IOPS, no per-I/O charges, 40% cost savings, 2–18 ACU
  • Database Connection Pooling: Configurable via app params (150 open / 75 idle per container)
  • v3.3 Micro-Batch Aurora Write Smoothing: 300 rows / 100ms (configurable via app params)
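The 3-layer API key lookup listed above can be sketched as follows. This is an assumption-laden illustration: plain maps stand in for ElastiCache and Aurora, TTLs are omitted, and the keyCache type and its methods are hypothetical names. It shows the read path backfilling faster tiers on a miss, plus a write path that updates every tier.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// store is a stand-in for a remote tier (ElastiCache or Aurora). A plain map
// keeps the sketch self-contained; it is not safe for concurrent use.
type store map[string]string

var errNotFound = errors.New("api key not found")

// keyCache resolves API keys through local memory, the shared cache, then the database.
type keyCache struct {
	local  sync.Map // tier 1: per-container memory
	shared store    // tier 2: stand-in for ElastiCache
	db     store    // tier 3: stand-in for Aurora
}

// lookup returns the value and the tier that served it, backfilling every
// faster tier that missed so the next lookup is cheaper.
func (c *keyCache) lookup(key string) (value, tier string, err error) {
	if v, ok := c.local.Load(key); ok {
		return v.(string), "local", nil
	}
	if v, ok := c.shared[key]; ok {
		c.local.Store(key, v)
		return v, "shared", nil
	}
	if v, ok := c.db[key]; ok {
		c.shared[key] = v
		c.local.Store(key, v)
		return v, "db", nil
	}
	return "", "", errNotFound
}

// put writes through all tiers, source of truth first.
func (c *keyCache) put(key, value string) {
	c.db[key] = value
	c.shared[key] = value
	c.local.Store(key, value)
}

func main() {
	c := &keyCache{shared: store{}, db: store{"k1": "tier=partner"}}
	_, first, _ := c.lookup("k1")
	_, second, _ := c.lookup("k1")
	fmt.Println(first, second) // first call reaches the db, second is served locally
}
```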

UDP Protocol (v8 Engine)

  • v8 SO_REUSEPORT: 8 sockets per protocol — per-socket kernel receive queue eliminates buffer bottleneck
  • v8 recvmmsg Batch Reads: 32 datagrams per syscall (~32× reduction in read syscalls)
  • v8 Pre-Serialized Response Cache: sync.Map of pre-built byte slices (~2× faster for cache hits)
  • v8 32 MB Socket Buffers: Per socket (up from 8 MB in v3.3)
  • v8 1,024 Concurrent Handlers: 8 SO_REUSEPORT sockets × 128 workers per socket
  • UDP 3-Tier Cache: sync.Map → ElastiCache → REST (matches TCP handler)
  • v3.3 Compiled Go Stress Test Client: cmd/stress/ in mono repo
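The pre-serialized response cache idea can be illustrated with a short sketch: serialize once when a price changes, then hand out the same bytes on every read. The JSON wire format, the priceResponse fields, and the responseCache type are assumptions for the example; the guide does not describe the actual UDP payload.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sync"
)

// priceResponse is a hypothetical payload shape used only for this sketch.
type priceResponse struct {
	Asset string  `json:"asset"`
	Price float64 `json:"price"`
}

// responseCache stores ready-to-send bytes per asset.
type responseCache struct {
	m sync.Map // asset -> []byte
}

// update serializes once, at write time, so reads never pay for encoding.
func (c *responseCache) update(asset string, price float64) error {
	b, err := json.Marshal(priceResponse{Asset: asset, Price: price})
	if err != nil {
		return err
	}
	c.m.Store(asset, b)
	return nil
}

// get returns the pre-built bytes. Callers must treat them as read-only,
// because the same slice is shared by every reader.
func (c *responseCache) get(asset string) ([]byte, bool) {
	v, ok := c.m.Load(asset)
	if !ok {
		return nil, false
	}
	return v.([]byte), true
}

func main() {
	c := &responseCache{}
	if err := c.update("BTC", 65000.5); err != nil {
		panic(err)
	}
	if b, ok := c.get("BTC"); ok {
		fmt.Println(string(b))
	}
}
```

Replacing the stored slice on each update, rather than mutating it in place, keeps in-flight reads safe without locking.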

Current Performance Metrics

  • TCP Peak (Direct): 369,600 req/sec
  • Combined Sustained: 746,374 req/sec
  • TCP Avg Latency: 0.3ms
  • UDP Avg Latency: 0.2ms
  • Cache Hit Rate: 99%+
  • WebSocket Feeds: 6 Active
  • ElastiCache Pool: 300 conn
  • Aurora ACU: 2–18

Implemented in v3.3 ✅ Shipped

The following optimizations were implemented and validated during the v3.3 stress test session. Each change was tested under sustained load with the compiled Go stress client.

Optimization | Before (v3.0) | After (v3.3) | Impact
Container CPU | 2 vCPU / 8 GB | 8 vCPU / 32 GB | 4x throughput, no CPU saturation
Aurora ACU ceiling | 6 | 18 | Supports 193K req/sec
GC tuning | GOGC=200 | GOGC=300 | Fewer GC pauses under load
Worker pool | 500 slots | 999 slots | More background work capacity
ElastiCache pool | 50 connections | 300 connections, 60 min idle | No pool exhaustion under load
UDP readers | 1 per socket | 3 per socket | Parallel packet intake
UDP buffers | OS default (~200KB) | 8MB read + 8MB write | No packet loss at high throughput
UDP cache | No ElastiCache tier | Full 3-tier (sync.Map → ElastiCache → REST) | Matches TCP cache architecture
Batch writes | 500 rows / 10s (bursty) | 300 rows / 100ms micro-batch (smooth) | Aurora ACU spikes eliminated
Test client | Python (GIL-bound, ~200 req/sec UDP) | Compiled Go (487K+ req/sec UDP) | Accurate server benchmarking
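The micro-batch write smoothing in the table (300 rows or 100 ms, whichever comes first) can be sketched as below. This is one plausible shape, not the shipped implementation: the microBatcher type is hypothetical and the flush callback stands in for a multi-row Aurora INSERT.

```go
package main

import (
	"fmt"
	"time"
)

// microBatcher buffers rows and flushes when maxRows is reached or the
// interval elapses, whichever comes first. Hypothetical type for illustration.
type microBatcher struct {
	maxRows int
	buf     []string
	flush   func(rows []string) // stand-in for a multi-row INSERT
}

// add appends a row and flushes immediately once the batch is full.
func (b *microBatcher) add(row string) {
	b.buf = append(b.buf, row)
	if len(b.buf) >= b.maxRows {
		b.flushNow()
	}
}

// flushNow writes out whatever is buffered; it is a no-op when empty.
func (b *microBatcher) flushNow() {
	if len(b.buf) == 0 {
		return
	}
	b.flush(b.buf)
	b.buf = nil // fresh slice so the flushed batch is never overwritten
}

// run drains rows until the channel closes, flushing on size or on each tick.
func (b *microBatcher) run(rows <-chan string, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case r, ok := <-rows:
			if !ok {
				b.flushNow() // final partial batch
				return
			}
			b.add(r)
		case <-ticker.C:
			b.flushNow()
		}
	}
}

func main() {
	var sizes []int
	b := &microBatcher{maxRows: 300, flush: func(rows []string) { sizes = append(sizes, len(rows)) }}
	rows := make(chan string)
	done := make(chan struct{})
	go func() { b.run(rows, 100*time.Millisecond); close(done) }()
	for i := 0; i < 650; i++ {
		rows <- fmt.Sprintf("row-%d", i)
	}
	close(rows)
	<-done
	fmt.Println("batch sizes:", sizes) // exact split depends on timing
}
```

Small, frequent batches spread the write load evenly, which is the smoothing effect the table attributes to the change from 500 rows every 10 seconds.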

Remaining Optimization Opportunities 🚀 Potential Improvements

1. ALB Connection Settings ✅ DEPLOYED

ALB optimized for connection reuse, faster deregistration, and security hardening. Deployed April 26, 2026.

Setting | Before | After | Impact
Idle timeout | 300s (5 min) | 60s | Frees connection slots 5x faster
Client keep-alive | 3600s (1 hr) | 120s | Clients reconnect every 2 min instead of hoarding
Deregistration delay (both TGs) | 30s | 10s | Deploys drain 20s faster per service
LRS routing algorithm | round_robin | least_outstanding_requests | Smarter load distribution, matches LPO
Drop invalid headers | disabled | enabled | Security hardening: malformed headers rejected at ALB

2. ElastiCache Pipelining ✅ DEPLOYED

All sequential HGetAll loops replaced with single-round-trip pipelines across 6 handler locations. Deployed April 26, 2026.

// PipelineHGetAll: one round trip instead of N sequential calls
pipe := client.Pipeline()
cmds := make([]*redis.MapStringStringCmd, len(ids))
for i, id := range ids {
    cmds[i] = pipe.HGetAll(ctx, fmt.Sprintf("usage_log:%s", id))
}
// Exec sends every queued command in a single round trip
if _, err := pipe.Exec(ctx); err != nil {
    return nil, err
}
// cmds[i].Val() now holds the hash for ids[i]

Locations pipelined: LRS Usage Report, LRS Summary Report, LRS Report Usage Detail, LRS Report Usage Summary, UDP Summary, UDP Usage — all 6 sequential loops converted.

Impact: A report returning 50 rows now makes 1 ElastiCache round trip instead of 50. 30-40% latency reduction on LRS reports.

3. Prewarm Optimization SUPERSEDED

This optimization was designed for the REST polling era. It no longer applies — all 150 assets are now served by 6 persistent WebSocket feeds that push prices in real-time.

Exchange | Assets | Feed Type | Latency
Coinbase | BTC, ETH, SOL, DOGE, XRP, LINK, DOT, LTC, AVAX, UNI, PEPE, XLM | WebSocket (push) | 0ms (in-memory)
Gemini | AAVE, ADA, MATIC, ATOM, NEAR, ARB, MKR, CRV, GRT, FIL, SHIB, BAT | WebSocket (push) | 0ms (in-memory)
Kraken | NANO, SC, LSK, KAVA, BICO, RARI, OCEAN, CFG, CQT, ALGO, FET, FLOW | WebSocket (push) | 0ms (in-memory)
Gate.io | BNB, TRX, APT, SEI, INJ, OP, SUI, VET, HBAR, SAND, MANA, FTM | WebSocket (push) | 0ms (in-memory)
Crypto.com | TON, WLD, APE, BLUR, IMX, ENS, LDO, SNX, COMP, 1INCH, SUSHI, GALA | WebSocket (push) | 0ms (in-memory)
OKX | KAS, TIA, JUP, STRK, PYTH, W, ZRO, PENDLE, ONDO, RENDER, WIF, FLOKI | WebSocket (push) | 0ms (in-memory)

Why it's obsolete: The original proposal called for tiered REST polling intervals (top assets every 5 min, mid every 15 min, low every 30 min) and staggered timing across containers. With 6 WebSocket feeds pushing every trade in real-time, prices arrive before requests — there's nothing to poll and nothing to stagger. PrewarmCache() runs once at startup as a bootstrap, then WebSocket feeds take over permanently. Natural staggering already occurs because each container's 6 WebSocket connections establish at slightly different times during startup.

4. Aurora Scaling Headroom FUTURE

Monitor Aurora ACU usage and adjust max capacity if needed. Current range is 2–18 ACU.

Current Load | ACU Range | Action
Consistently under 5 ACU | 2–18 ACU | ✅ Current, right-sized
Spiking to 18 ACU | 2–32 ACU | ⚠️ Increase max to 32
Sustained at 18 ACU | 2–48 ACU | 🚨 Increase max to 48

Monitor: CloudWatch metric ServerlessDatabaseCapacity

5. Task Count Scaling FUTURE

Scale ECS tasks horizontally when traffic increases. Costs reflect 8 vCPU / 32 GB containers.

Traffic Level | Main Tasks | Mirror Tasks | LRS Tasks | Monthly Cost
Current (Low) | 1 | 1 | 1 | $430
Medium (50K QPS) | 2 | 2 | 1 | $670
High (100K QPS) | 3 | 2 | 2 | $970
Very High (200K QPS) | 5 | 3 | 2 | $1,390

Trigger: When CPU > 70% or latency > 100ms consistently
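One way to make "consistently" concrete is to require every sample in a recent window to breach a threshold before scaling out. The sketch below assumes that interpretation; the sample type, the shouldScaleOut function, and the window length are hypothetical, while the 70% CPU and 100ms latency thresholds come from the trigger above.

```go
package main

import "fmt"

// sample is one observation of task CPU and request latency.
type sample struct {
	cpuPercent float64
	latencyMs  float64
}

// shouldScaleOut reports whether each of the last `window` samples breaches
// at least one threshold (CPU > 70% or latency > 100ms). A single healthy
// sample inside the window resets the decision.
func shouldScaleOut(samples []sample, window int) bool {
	if window <= 0 || len(samples) < window {
		return false // not enough history to call it consistent
	}
	for _, s := range samples[len(samples)-window:] {
		if s.cpuPercent <= 70 && s.latencyMs <= 100 {
			return false
		}
	}
	return true
}

func main() {
	history := []sample{{65, 40}, {82, 35}, {78, 120}, {91, 60}}
	fmt.Println("scale out:", shouldScaleOut(history, 3))
}
```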

6. ElastiCache Scaling FUTURE

Current node is cache.r7g.2xlarge (52.8 GB). ElastiCache is a pure cache layer — Aurora is the source of truth.

Node Type | Memory | Throughput | Monthly Cost
cache.r7g.2xlarge (current) | 52.8 GB | 400K ops/sec | $637
cache.r7g.4xlarge | 105 GB | 800K ops/sec | ~$1,274
cache.r7g.2xlarge + replica | 52.8 GB × 2 | 400K ops/sec + read replica | ~$1,274

Trigger: When memory > 80% or CPU > 70% consistently

🎯 Recommended Priority

Immediate

All done ✅ — DB pooling, ElastiCache pooling, batch writes, GC tuning, UDP optimizations, worker pool, and system mode toggle all shipped in v3.3.

Short Term (Next 1-2 Weeks)

  1. Monitor v3.3 Metrics - CloudWatch dashboards for Aurora ACU, ElastiCache CPU/memory, ALB latency under real traffic
  2. Tune SQS Pipeline Params - Adjust sqs_flush_ms and sqs_buffer_size via app params if queue depth patterns change
  3. ALB Connection Settings — ✅ Deployed April 26, 2026
  4. ElastiCache Pipelining — ✅ Deployed April 26, 2026

Long Term (Based on Metrics)

  1. Prewarm Strategy — Superseded by 6 real-time WebSocket feeds (150 assets, 0ms latency)
  2. Horizontal Scaling - Add tasks when traffic increases
  3. ElastiCache Upgrade - Move to cache.r7g.4xlarge when memory exceeds 80% or CPU exceeds 70% consistently

Monitoring & Metrics

Key CloudWatch Metrics to Watch

Aurora Serverless v2

  • ServerlessDatabaseCapacity - Current ACU usage (target: 2-10 ACU normal, up to 18 under stress)
  • DatabaseConnections - Active connections (target: < 450)
  • ReadLatency / WriteLatency - Query performance (target: < 5ms)

ElastiCache

  • CPUUtilization - CPU usage (target: < 70%)
  • DatabaseMemoryUsagePercentage - Memory usage (target: < 80%)
  • CacheHitRate - Cache effectiveness (target: > 85%)
  • NetworkBytesIn / NetworkBytesOut - Throughput

ECS Fargate

  • CPUUtilization - Task CPU usage (target: < 70%)
  • MemoryUtilization - Task memory usage (target: < 85%)

Application Load Balancer

  • TargetResponseTime - Backend latency (target: < 50ms)
  • RequestCount - Traffic volume
  • HealthyHostCount - Available targets (target: = desired count)
  • HTTPCode_Target_5XX_Count - Backend errors (target: 0)

Performance Bottleneck Analysis

Symptom | Likely Cause | Solution
High latency (> 100ms) | All WebSocket feeds down, REST fallback active | Check WS connections in logs, verify Gemini/Coinbase WS endpoints
Low cache hit rate (< 95%) | WebSocket feeds disconnected or stale | Check GEMINI-WS/COINBASE-WS logs, verify network connectivity
High CPU on ECS tasks | Too many concurrent requests | Scale horizontally (add more tasks)
High memory on ECS tasks | Memory leak or large response caching | Review code for leaks, increase task memory
Aurora ACU spiking to max | Heavy database queries or connections | Optimize queries, add connection pooling, increase max ACU
Aurora ACU spiking | SQS consumer Lambda batch size too large or too frequent | Adjust Lambda batch size or batching window in the SQS event source mapping
ElastiCache CPU high | Too many cache operations | Pipelining deployed ✅; upgrade node type if still high
ElastiCache memory high | Too much cached data | Reduce cache TTL or upgrade node type
ALB 5xx errors | Backend tasks unhealthy or overloaded | Check task logs, scale horizontally

Service Offerings — Partners & Associates

The Trinity Beast serves three distinct audiences, each with its own delivery path optimized for their use case. All three share the same 6-exchange WebSocket price engine — the difference is how prices reach the consumer.

Audience | Delivery Method | Connection | Latency | Rate Limiting | Cost
Public Subscribers | REST API (TCP/UDP) | Request/Response | 0.3ms TCP / 0.2ms UDP | Per-tier QPS + burst | Free – $3,000 lifetime
Partners | WebSocket (persistent) | AWS PrivateLink / VPC Peering | <1ms (push) | None (unlimited) | Free (mission-aligned)
Associates | Webhook Push (UDP + HTTPS) | Public internet | ~0.1ms UDP / ~50-200ms HTTPS | Tier-based interval | $30 – $540/mo

Partner WebSocket Feed ✅ LIVE

Partners are companies running on AWS that need live crypto prices for their own products. They connect via AWS PrivateLink (TCP) or VPC Peering (UDP), so traffic stays on a private network and never crosses the public internet. Each Partner receives a persistent WebSocket connection that pushes every price update in real time from the local sync.Map cache.

  • Handler: internal/handlers/partner_ws.go
  • Connection: Upgraded HTTP → WebSocket via gorilla/websocket
  • Price source: Local wsPriceCache (sync.Map) — same feeds as LPO, zero network hop
  • Rate limiting: None. Partners are exempt from all QPS/burst/monthly limits.
  • API key cache: 60-minute TTL (vs 5-minute for public tiers) — reduces Aurora lookups
  • Authentication: API key with tier = 'partner' in the rate_limit_template table
  • Why free: We receive price data freely from exchanges via WebSocket — we give it freely to mission-aligned partners

Associate Webhook Push ✅ LIVE

Associates subscribe to receive prices pushed to their endpoints at tier-configured intervals. The BeastWebhook service (4th ECS container, SERVER_TYPE=WEBHOOK_SERVER) runs its own 6 WebSocket feeds and pushes from its local cache — no ALB, no inbound ports, push-only.

  • Handler: internal/handlers/webhook.go + webhook_delivery.go
  • ECS Service: trinity-beast-webhook-service (8 vCPU / 32 GB, no ALB target)
  • Price source: Local wsPriceCache — same 6 WebSocket feeds, independent connections
  • UDP delivery: Fire-and-forget, single packet per asset per cycle. ~0.1ms. Zero retries.
  • HTTPS delivery: Signed POST with HMAC-SHA256 (X-Webhook-Signature). Retries with exponential backoff (base 1000ms, max 3 attempts).
  • Delivery log: Every push logged to webhook_delivery_log table (subscription_id, asset, price, source, method, latency, status)
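The HTTPS delivery path above (HMAC-SHA256 signature plus exponential backoff) might look roughly like the sketch below. Several details are assumptions, since the guide does not state them: hex encoding for the X-Webhook-Signature value, doubling as the backoff growth factor, and the function names sign, verify, and backoffMs.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// sign returns the hex-encoded HMAC-SHA256 of body under secret.
// Hex encoding is an assumption made for this sketch.
func sign(secret, body []byte) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write(body)
	return hex.EncodeToString(mac.Sum(nil))
}

// verify is what a receiving endpoint could do: recompute the MAC and
// compare in constant time to avoid leaking timing information.
func verify(secret, body []byte, signature string) bool {
	got, err := hex.DecodeString(signature)
	if err != nil {
		return false
	}
	mac := hmac.New(sha256.New, secret)
	mac.Write(body)
	return hmac.Equal(mac.Sum(nil), got)
}

// backoffMs returns the wait before retry number `retry` (1-based),
// doubling from baseMs each time. Doubling is an assumed growth factor.
func backoffMs(baseMs, retry int) int {
	return baseMs << uint(retry-1)
}

func main() {
	secret := []byte("demo-secret") // placeholder, not a real key
	body := []byte(`{"asset":"BTC","price":65000.5}`)
	sig := sign(secret, body)
	fmt.Println("X-Webhook-Signature:", sig)
	fmt.Println("valid:", verify(secret, body, sig))
	// With a maximum of 3 attempts there are at most 2 waits between them.
	for retry := 1; retry < 3; retry++ {
		fmt.Printf("wait before retry %d: %dms\n", retry, backoffMs(1000, retry))
	}
}
```

Signing the exact bytes that go on the wire matters: a receiver that re-serializes the payload before verifying would compute a different MAC.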

Webhook Tier Performance Characteristics

Tier | Interval | Max Assets | Pushes/Hour | Pushes/Month | Price
Starter | 60s | 9 | 540 | ~388,800 | $30/mo
Standard | 15s | 30 | 7,200 | ~5,184,000 | $90/mo
Professional | 6s | 75 | 45,000 | ~32,400,000 | $210/mo
Enterprise | 3s | 150 | 180,000 | ~129,600,000 | $540/mo

Pushes/hour = (3600 ÷ interval) × max_assets. Enterprise at full capacity: 180,000 price pushes per hour, 129.6M per month — all from a single container reading its local sync.Map.
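The formula above can be checked with a few lines of Go. The tier type and function names are made up for this example; the monthly figures in the tier table correspond to a 30-day month (24 × 30 hours), which is what the sketch assumes.

```go
package main

import "fmt"

// tier mirrors one row of the webhook tier table. Hypothetical type.
type tier struct {
	name        string
	intervalSec int
	maxAssets   int
}

// pushesPerHour applies (3600 / interval) * max_assets. Integer division is
// exact here because every listed interval divides 3600 evenly.
func pushesPerHour(t tier) int {
	return (3600 / t.intervalSec) * t.maxAssets
}

// pushesPerMonth assumes a 30-day month, matching the table's monthly column.
func pushesPerMonth(t tier) int {
	return pushesPerHour(t) * 24 * 30
}

func main() {
	tiers := []tier{
		{"Starter", 60, 9},
		{"Standard", 15, 30},
		{"Professional", 6, 75},
		{"Enterprise", 3, 150},
	}
	for _, t := range tiers {
		fmt.Printf("%-12s %7d/hour %11d/month\n", t.name, pushesPerHour(t), pushesPerMonth(t))
	}
}
```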

Architecture: Why All Three Share One Engine

Every ECS container (including BeastWebhook) maintains its own independent 6-exchange WebSocket connections. Prices flow into the local sync.Map with zero network hops. Whether a price is served via REST API, pushed over a Partner WebSocket, or delivered to an Associate webhook endpoint — it comes from the same in-memory cache, populated by the same real-time feeds.

┌─────────────────────────────────────────────────────────────────┐
│  6 Exchange WebSocket Feeds (per container)                      │
│  Coinbase · Gemini · Kraken · Gate.io · Crypto.com · OKX        │
└──────────────────────────┬──────────────────────────────────────┘
                           │ real-time trade messages
                           ▼
              ┌────────────────────────┐
              │   sync.Map (local)     │  ← 0ms read latency
              │   150 assets × 6 feeds │
              └────┬───────┬───────┬───┘
                   │       │       │
         ┌─────────┘       │       └──────────┐
         ▼                 ▼                   ▼
  ┌─────────────┐  ┌──────────────┐  ┌────────────────┐
  │ REST API    │  │ Partner WS   │  │ Webhook Push   │
  │ (TCP/UDP)   │  │ (persistent) │  │ (UDP + HTTPS)  │
  │ Subscribers │  │ PrivateLink  │  │ Associates     │
  └─────────────┘  └──────────────┘  └────────────────┘

This shared-engine architecture means adding Partners or Associates adds zero load to the price feed infrastructure. The WebSocket connections are already running. The sync.Map is already populated. The only additional work is the outbound push — which is trivial compared to the inbound price ingestion.

Conclusion

Current Assessment

The Trinity Beast Infrastructure v4.7 is battle-tested at scale. Run 17 validated:

  • 746,374 combined RPS sustained for 30 minutes — 1.34 billion requests with zero degradation
  • 369,600 TCP req/sec and 487,900 UDP req/sec (direct) — 100% success through all 13 concurrency levels
  • 0.3ms TCP avg latency, 0.2ms UDP avg latency
  • 943× improvement from v1.0 baseline across 17 test runs in 19 days
  • 8 vCPU / 32 GB containers — scales from 3 (production) to 9 (proven at scale)
  • 2–18 ACU Aurora range — right-sized with micro-batch write smoothing
  • 6 persistent WebSocket price feeds (Coinbase, Gemini, Kraken, Gate.io, Crypto.com, OKX) — 150 prewarmed assets
  • 99%+ cache hit rate — virtually every request served from memory
  • ElastiCache-backed API key validation, shared rate limiting, and real-time usage counters
  • v8 UDP engine: SO_REUSEPORT, recvmmsg batch reads, pre-serialized response cache

Recommendation: The system is production-ready and stress-tested well beyond expected traffic. A 3-year Compute Savings Plan is recommended to lock in cost savings on the 8 vCPU / 32 GB Fargate tasks. The remaining optimization opportunities (Aurora scaling headroom, horizontal task scaling, ElastiCache scaling) are for future growth and are not critical for current operations.

Run 17 eliminated every bottleneck found during stress testing. v4.7 added the v8 UDP engine (SO_REUSEPORT, recvmmsg), dedicated health servers, and 6-exchange WebSocket feeds — the remaining items are future-proofing for horizontal scale.