The Trinity Beast Infrastructure — CloudWatch Dashboard & Alarm Notifications

Monitoring, Alerting, and Operational Visibility
May 2026 | Region: us-east-2 | 4 Dashboards | 21 Alarms | 10 Log Groups

1. Overview

The Trinity Beast Infrastructure (TBI) uses Amazon CloudWatch as its centralized monitoring and alerting platform. This guide documents every dashboard, alarm, log group, and notification channel deployed across the system.

Dashboards: 4
Alarms: 21
Log Groups: 10
Retention: 30 days

2. Dashboards

Four CloudWatch dashboards provide layered visibility, from real-time application metrics to live cost intelligence.

Dashboard | Purpose
Trinity-Beast-Application-Dashboard | Primary ops dashboard: LPO, LRS, AWS infra, Lambda, logs
Trinity-Beast-Master-Dashboard | Comprehensive view across all services
Trinity-Beast-Cost-Dashboard | Live cost intelligence: resource utilization metrics that drive spend, cost-context tables, links to Cost Explorer

3. Application Dashboard — Widget Reference

The Trinity-Beast-Application-Dashboard is the primary operational dashboard. It contains 31 widgets organized into six sections covering every layer of the stack.

LPO Section

LPO Widgets (7 widgets)
Widget | Type
LPO Requests (per minute) | Metric (line graph)
Cache Hit Rate (%) | Metric (gauge / number)
Avg Latency (ms) | Metric (line graph)
Cache Hits vs Misses | Metric (stacked area)
Requests by Asset | Metric (bar chart)
Requests by Source (Exchange) | Metric (bar chart)
Errors & Source Failovers | Metric (line graph)

LRS Section

LRS Widgets (4 widgets)
Widget | Type
LRS Total Requests | Metric (line graph)
LRS Avg Latency (ms) | Metric (line graph)
LRS Output Format Usage | Metric (bar chart)
LRS Errors | Metric (line graph)

AWS Infrastructure Section

Infrastructure Widgets (6 widgets)
Widget | Type
ECS CPU Utilization (%) | Metric (line graph)
ECS Memory Utilization (%) | Metric (line graph)
ALB Response Time & Errors | Metric (line graph)
ElastiCache CPU & Cache Hit Rate | Metric (line graph)
ElastiCache Memory Usage (%) | Metric (gauge / number)
Aurora Serverless Capacity (ACU) | Metric (line graph)

Container Logs Section

Log Widgets (4 widgets)
Widget | Type
LPO: Main Service Logs | Log query
LRS: Report Service Logs | Log query
Mirror Service Logs | Log query
Sync Job Logs | Log query

Lambda Section

Lambda Widgets (7 widgets)
Widget | Type
Lambda Invocations | Metric (line graph)
Lambda Errors | Metric (line graph)
Lambda Duration (ms) | Metric (line graph)
Throttles & Concurrency | Metric (line graph)
Receipts by Handler Type | Log widget
Recent Receipts: Handler Detail | Log widget
Receipt Lambda Logs | Log query

CloudTrail & VPC Section

Audit & Network Widgets (3 widgets)
Widget | Type
CloudTrail: Errors & Access Denied | Log query
CloudTrail: ECS & Infrastructure Changes | Log query
VPC Flow Logs: Rejected Traffic (Trinity VPC) | Log query
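
To illustrate how widgets like those in this section are expressed, here is a minimal sketch of a CloudWatch dashboard body with one metric widget and one log-query widget. The helper names, grid sizes, and the Logs Insights query are assumptions made for illustration; the metric name Requests is one reading of the custom-metrics table in Section 8, and the real dashboard definition may differ.

```python
import json

REGION = "us-east-2"  # region stated in this guide


def metric_widget(title, namespace, metric, stat="Sum", period=60, x=0, y=0):
    """Return one metric widget in CloudWatch dashboard-body format."""
    return {
        "type": "metric",
        "x": x, "y": y, "width": 12, "height": 6,  # assumed layout
        "properties": {
            "title": title,
            "region": REGION,
            "metrics": [[namespace, metric]],
            "stat": stat,
            "period": period,
            "view": "timeSeries",
        },
    }


def log_widget(title, log_group, query, x=0, y=6):
    """Return one Logs Insights widget; the query is prefixed with its SOURCE."""
    return {
        "type": "log",
        "x": x, "y": y, "width": 24, "height": 6,  # assumed layout
        "properties": {
            "title": title,
            "region": REGION,
            "query": f"SOURCE '{log_group}' | {query}",
            "view": "table",
        },
    }


def build_dashboard_body():
    """Serialize a two-widget dashboard body as a JSON string."""
    widgets = [
        metric_widget("LPO Requests (per minute)", "TrinityBeast/LPO", "Requests"),
        log_widget(
            "VPC Flow Logs: Rejected Traffic (Trinity VPC)",
            "/aws/vpc/trinity-beast-flowlogs",
            # Assumed query; relies on fields from the default flow-log format.
            "fields @timestamp, srcAddr, dstAddr, dstPort "
            "| filter action = 'REJECT' | sort @timestamp desc | limit 20",
        ),
    ]
    return json.dumps({"widgets": widgets})
```

The returned string is the shape that the CloudWatch PutDashboard API accepts as its dashboard body.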

4. Cost Dashboard

One dedicated cost dashboard provides financial visibility into the Trinity Beast Infrastructure spend.

Trinity-Beast-Cost-Dashboard Cost Intelligence

A live cost intelligence dashboard that combines resource utilization metrics with cost context. Rather than embedding stale dollar figures, it shows the metrics that drive spend: ECS CPU and memory utilization across all 4 services, Aurora ACU and utilization (with min/max annotations), ElastiCache CPU and memory, Lambda invocations and duration for all 8 functions, Aurora connections, and NAT Gateway/EC2 data transfer. Two cost-context tables explain the monthly baseline by component (with unit costs) and identify the cost levers you can control. The dashboard links directly to Cost Explorer, Budgets, and Savings Plans for exact dollar figures.

Replaces: The previous static Trinity-Beast-Cost-Executive-Dashboard and Trinity-Beast-Cost-Detailed-Dashboard (deleted 2026-05-30) — both were 100% hardcoded markdown text frozen at April 2026 figures, referencing services no longer in use.

5. CloudWatch Alarms

17 static-threshold alarms monitor critical infrastructure metrics. All alarms publish to the tbi-ops-notifications SNS topic, which invokes the tbi-ops-notify Lambda for formatted HTML email delivery. Alarms also trigger the tbi-ops-alarm-trigger EventBridge rule for automated self-healing.

Load Balancers (2 Alarms)

ALB & NLB Health (State: OK)
Alarm Name | Metric | Namespace | Threshold | Period | Eval Periods | State
Trinity-Beast-ALB-UnhealthyTargets | UnHealthyHostCount | AWS/ApplicationELB | >= 1 | 60s | 3 | OK
Trinity-Beast-NLB-UnhealthyTargets | UnHealthyHostCount | AWS/NetworkELB | >= 1 | 60s | 3 | OK

ECS Services (6 Alarms)

ECS CPU & Task Count (State: OK)
Alarm Name | Metric | Namespace | Threshold | Period | Eval Periods | State | Notes
Trinity-Beast-ECS-CPU-High | CPUUtilization | AWS/ECS (main-service) | > 80% | 300s | 2 | OK |
Trinity-Beast-ECS-CPU-High-Mirror | CPUUtilization | AWS/ECS (mirror-service) | > 80% | 300s | 2 | OK |
Trinity-Beast-ECS-CPU-High-LRS | CPUUtilization | AWS/ECS (lrs-service) | > 80% | 300s | 2 | OK |
Trinity-Beast-Main-Service-Count-Low | RunningTaskCount | ECS/ContainerInsights (main) | < 1 | 300s | 2 | OK | TreatMissing: breaching
Trinity-Beast-Mirror-Service-Count-Low | RunningTaskCount | ECS/ContainerInsights (mirror) | < 1 | 300s | 2 | OK | TreatMissing: breaching
Trinity-Beast-LRS-Service-Count-Low | RunningTaskCount | ECS/ContainerInsights (lrs) | < 1 | 300s | 2 | OK | TreatMissing: breaching
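
As a sketch of how one row of this table could map onto the CloudWatch PutMetricAlarm parameters, the function below builds the arguments for a Service-Count-Low alarm. The cluster name is inferred from the Container Insights log group path in Section 7, and the statistic and the use of OKActions are assumptions; none of these are confirmed by the table itself.

```python
def service_count_low_alarm(service_name, alarm_name, sns_topic_arn):
    """Build keyword arguments for CloudWatch PutMetricAlarm (illustrative only)."""
    return {
        "AlarmName": alarm_name,
        "Namespace": "ECS/ContainerInsights",
        "MetricName": "RunningTaskCount",
        "Dimensions": [
            # Assumption: cluster name taken from the Container Insights log group path.
            {"Name": "ClusterName", "Value": "trinity-beast-fargate-cluster"},
            {"Name": "ServiceName", "Value": service_name},
        ],
        "Statistic": "Average",            # assumption: not stated in the table
        "Period": 300,                     # seconds
        "EvaluationPeriods": 2,
        "Threshold": 1,
        "ComparisonOperator": "LessThanThreshold",
        # Missing datapoints count as breaching, so a service that stops
        # reporting altogether still raises the alarm.
        "TreatMissingData": "breaching",
        "AlarmActions": [sns_topic_arn],
        "OKActions": [sns_topic_arn],      # assumption: recovery is also notified
    }
```

With boto3, the result could be passed as `boto3.client("cloudwatch").put_metric_alarm(**params)`. With a 300-second period and 2 evaluation periods, the alarm needs about 10 minutes of breaching or missing data before it fires.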

Aurora (2 Alarms)

Aurora Serverless v2 (State: OK)
Alarm Name | Metric | Namespace | Threshold | Period | Eval Periods | State
Trinity-Beast-Aurora-CPU-High | CPUUtilization | AWS/RDS (trinity-beast-aurora-cluster) | > 80% | 300s | 2 | OK
Trinity-Beast-Aurora-Connections-High | DatabaseConnections | AWS/RDS (trinity-beast-aurora-cluster) | > 80 | 300s | 2 | OK

ElastiCache (3 Alarms)

ElastiCache for Valkey (State: Mixed)
Alarm Name | Metric | Namespace | Threshold | Period | Eval Periods | State
Trinity-Beast-ElastiCache-CPU-High | CPUUtilization | AWS/ElastiCache | > 80% | 300s | 2 | OK
Trinity-Beast-ElastiCache-Memory-High | DatabaseMemoryUsagePercentage | AWS/ElastiCache | > 85% | 300s | 2 | OK/ALARM
Trinity-Beast-ElastiCache-Connections-High | CurrConnections | AWS/ElastiCache | > 1000 | 300s | 2 | OK

S3 (1 Alarm)

S3 Bucket Size (State: OK)
Alarm Name | Metric | Namespace | Threshold | Period | Eval Periods | State
Trinity-Beast-S3-Size-Unusual-Growth | BucketSizeBytes | AWS/S3 | > 10 GB | 86400s | 1 | OK

Security & API (4 Alarms)

WAF, GuardDuty, API Error Rates (State: OK)
Alarm Name | Metric | Namespace | Threshold | Period | Eval Periods | State
TrinityBeast-WAF-HighBlockRate | BlockedRequests | AWS/WAFV2 | > 100 | 300s | 1 | OK
TrinityBeast-API-5xx-Spike | HTTPCode_Target_5XX_Count | AWS/ApplicationELB | > 10 | 300s | 1 | OK
TrinityBeast-API-4xx-Spike | HTTPCode_Target_4XX_Count | AWS/ApplicationELB | > 200 | 300s | 1 | OK
TrinityBeast-GuardDuty-Finding | finding | AWS/GuardDuty | > 0 | 300s | 1 | OK

6. SNS Notification Routing

All CloudWatch alarms route through a unified AutoOps pipeline. AWS sends no raw text emails; every notification is formatted by the tbi-ops-notify Lambda before delivery via SES.

Unified Notification Flow ALL THROUGH AUTOOPS
┌─────────────────────────────────────────────────────────────────────────┐
│                      NOTIFICATION ROUTING                               │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  CloudWatch Alarm (any of 21 alarms)                                    │
│    ├─→ SNS: tbi-ops-notifications                                       │
│    │     └─→ tbi-ops-notify Lambda                                      │
│    │           └─→ Formatted HTML email (SES)                           │
│    │                 → CoryDeanKalani@CPMP-Site.org                     │
│    │                                                                    │
│    └─→ EventBridge: tbi-ops-alarm-trigger                               │
│          └─→ Step Function: tbi-ops-health-check-heal                   │
│                └─→ Self-heal → verify recovery → notify                 │
│                                                                         │
│  GuardDuty Finding (severity ≥ 7)                                       │
│    └─→ EventBridge: tbi-ops-guardduty-high-finding                      │
│          └─→ tbi-ops-bedrock-analyze Lambda                             │
│                └─→ AI threat assessment → auto-action → notify          │
│                                                                         │
│  Honeypot Hits (every 5 min)                                            │
│    └─→ EventBridge: tbi-ops-honeypot-queue-processor                    │
│          └─→ tbi-ops-honeypot-processor Lambda                          │
│                └─→ WAF IP block → notify                                │
│                                                                         │
│  Bedrock Threat Analysis (every 5 min)                                  │
│    └─→ EventBridge: tbi-ops-bedrock-analyze-schedule                    │
│          └─→ tbi-ops-bedrock-analyze Lambda                             │
│                └─→ Correlate signals → report → notify if HIGH/CRITICAL │
│                                                                         │
│  Support Ticket Submitted                                               │
│    └─→ Application invokes tbi-raima-support Lambda                     │
│          └─→ Categorize → draft response → notify                       │
│                                                                         │
│  Daily/Weekly Digest (cron)                                             │
│    └─→ EventBridge: tbi-ops-daily-digest / tbi-ops-weekly-digest        │
│          └─→ tbi-ops-digest Lambda                                      │
│                └─→ Bedrock summary → formatted email                    │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
tbi-ops-notifications (PRIMARY: All Alerts)
Topic ARN: arn:aws:sns:us-east-2:211998422884:tbi-ops-notifications
Subscriber: tbi-ops-notify Lambda
Alarms Attached: 21
Delivery: Formatted HTML (SES)

Protocol | Endpoint | Purpose | Status
Lambda | tbi-ops-notify | Formats alert, then sends HTML email via SES to CoryDeanKalani@CPMP-Site.org | Active

How it works: When any alarm transitions to ALARM or OK, SNS invokes the tbi-ops-notify Lambda. The Lambda parses the alarm payload, formats a branded HTML email with severity badges and context, and sends it via Amazon SES. Subject lines include severity: [INFO], [WARNING], [CRITICAL], [SELF-HEALED].

Sender: CPMP Mission <No-Reply@CPMP-Site.org>
Recipient: CoryDeanKalani@CPMP-Site.org
Format: HTML email with dark theme, severity color coding, alarm details, and recommended actions.
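
The flow described above can be sketched as a small handler fragment that unwraps the SNS event and builds a severity-tagged subject line. The event shape (Records, Sns, Message) and the alarm fields (AlarmName, OldStateValue, NewStateValue) follow the standard SNS-to-Lambda and CloudWatch alarm formats. The severity rules are hypothetical, loosely based on the Critical and Warning labels in the playbook of Section 10, and are not the actual tbi-ops-notify logic.

```python
import json

# Assumption: alarm-name fragments whose playbook entries are labeled Critical.
CRITICAL_MARKERS = ("UnhealthyTargets", "Service-Count-Low")


def classify(alarm_name, new_state):
    """Map an alarm state change to a subject-line severity tag (hypothetical rules)."""
    if new_state == "OK":
        return "INFO"
    if new_state == "ALARM" and any(m in alarm_name for m in CRITICAL_MARKERS):
        return "CRITICAL"
    return "WARNING"


def build_subject(event):
    """Extract the alarm from an SNS-triggered Lambda event and build a subject."""
    record = event["Records"][0]["Sns"]
    # CloudWatch delivers the alarm details as a JSON string in the SNS message.
    alarm = json.loads(record["Message"])
    severity = classify(alarm["AlarmName"], alarm["NewStateValue"])
    return (
        f"[{severity}] {alarm['AlarmName']}: "
        f"{alarm['OldStateValue']} -> {alarm['NewStateValue']}"
    )
```

Keeping the classification in one pure function makes the severity rules easy to unit-test without invoking SNS or SES.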

Trinity-Beast-Critical-Alerts LEGACY — No Alarms Attached

Status: This topic is retained for potential future SMS escalation, but no alarms currently target it. All 21 alarms were migrated to tbi-ops-notifications on May 15, 2026. The SMS subscription remains in place as a backup escalation channel, but it receives nothing while no alarm publishes to this topic.

Protocol | Endpoint | Status
Email | CoryDeanKalani@CPMP-Site.org | Inactive (no triggers)
SMS | +16156128200 | Inactive (no triggers)

Design Decision (May 2026): All notifications route through a single Lambda (tbi-ops-notify) for consistent formatting, content review, and delivery control. This eliminates raw AWS text emails and ensures every alert arrives as a branded, readable HTML message with actionable context. AWS User Notifications service was disabled — it was sending unformatted alarm summaries that bypassed the AutoOps pipeline.

7. CloudWatch Log Groups

10 log groups capture output from every service layer. All groups are configured with a 30-day retention policy.

Log Group | Retention | Source
/aws/ecs/trinity-beast | 30 days | All 3 ECS services (LPO, Mirror, LRS)
/aws/ecs/trinity-beast-sync | 30 days | Nightly sync job
/ecs/trinity-beast-lpo | 30 days | Legacy LPO logs
/ecs/trinity-beast-main-task-container-def | 30 days | Legacy main task logs
/aws/lambda/trinity-beast-receipt | 30 days | Receipt Lambda
/aws/vpc/trinity-beast-flowlogs | 30 days | VPC Flow Logs
/aws/cloudtrail/trinity-beast | 30 days | CloudTrail audit logs
/aws/codebuild/trinity-beast-build | 30 days | CodeBuild logs
/aws/ecs/containerinsights/trinity-beast-fargate-cluster/performance | 30 days | Container Insights
RDSOSMetrics | 30 days | Aurora OS metrics
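
One way to verify the 30-day policy is to compare the output of the CloudWatch Logs DescribeLogGroups call against the expected value. The sketch below works on an already-fetched response dictionary; the function name is hypothetical. A log group with no retentionInDays key is set to never expire, so it is reported as None.

```python
EXPECTED_RETENTION_DAYS = 30


def retention_mismatches(describe_log_groups_response, expected=EXPECTED_RETENTION_DAYS):
    """Return {log group name: current retention} for groups not at the expected value.

    A group with no 'retentionInDays' key never expires; it is reported as None.
    """
    mismatches = {}
    for group in describe_log_groups_response.get("logGroups", []):
        current = group.get("retentionInDays")
        if current != expected:
            mismatches[group["logGroupName"]] = current
    return mismatches
```

Any group returned here could then be corrected with the Logs client call `put_retention_policy(logGroupName=name, retentionInDays=30)`.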

8. Custom Metrics (TrinityBeast Namespace)

The application publishes custom metrics to two CloudWatch namespaces, providing business-level observability beyond standard AWS metrics.

TrinityBeast/LPO Custom Namespace

Metrics published by the Live Price Oracle service:

Metric | Description
Requests | Total LPO requests received
CacheHits | Requests served from ElastiCache cache
CacheMisses | Requests requiring upstream source fetch
Errors | Failed requests (all error types)
SourceFailovers | Times a primary source failed and secondary was used
AvgLatency | Average response time in milliseconds

TrinityBeast/LRS Custom Namespace

Metrics published by the Live Report Service:

Metric | Description
Requests | Total LRS report requests
AvgLatency | Average report generation time in milliseconds
Errors | Failed report generations
MonthlyLimitExceeded | Requests rejected due to monthly quota
DailyLimitExceeded | Requests rejected due to daily quota
AddOnRequests | Requests using add-on quota beyond base plan
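
As an illustration of what publishing the LPO metrics listed above might look like, the sketch below builds a MetricData list for the CloudWatch PutMetricData API and derives a cache hit rate from the hit and miss counters. The units, the function names, and the hit-rate formula are assumptions; the guide does not state how the services publish these metrics or how the Cache Hit Rate (%) widget is computed.

```python
LPO_NAMESPACE = "TrinityBeast/LPO"


def lpo_metric_data(requests, cache_hits, cache_misses, errors, failovers, avg_latency_ms):
    """Build the MetricData list for one PutMetricData call (units are assumed)."""
    return [
        {"MetricName": "Requests", "Value": requests, "Unit": "Count"},
        {"MetricName": "CacheHits", "Value": cache_hits, "Unit": "Count"},
        {"MetricName": "CacheMisses", "Value": cache_misses, "Unit": "Count"},
        {"MetricName": "Errors", "Value": errors, "Unit": "Count"},
        {"MetricName": "SourceFailovers", "Value": failovers, "Unit": "Count"},
        {"MetricName": "AvgLatency", "Value": avg_latency_ms, "Unit": "Milliseconds"},
    ]


def cache_hit_rate(cache_hits, cache_misses):
    """Hit rate in percent: hits / (hits + misses); 0.0 when there was no traffic."""
    total = cache_hits + cache_misses
    return 100.0 * cache_hits / total if total else 0.0
```

With boto3, the list could be sent as `put_metric_data(Namespace=LPO_NAMESPACE, MetricData=data)`. For example, 90 hits and 10 misses give a hit rate of 90.0 percent.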

9. AutoOps — Autonomous Operations Monitoring

The 5-layer AutoOps system has its own monitoring footprint — 7 Lambda functions, 6 EventBridge rules, 4 anomaly detection alarms, and a dedicated SNS topic. All feed into the Security Dashboard.

Anomaly Detection Alarms (4 Alarms)

ML-Based Anomaly Detection Band Width: 3σ

These alarms use CloudWatch Anomaly Detection (machine learning) to learn normal traffic patterns and alert on deviations. They need ~2 weeks to build a baseline.

Alarm Name | Metric | Direction | Catches
TrinityBeast-Anomaly-RequestRate | ALB RequestCount (Sum) | Both (↑↓) | Traffic drops (outage) or unexpected spikes (attack)
TrinityBeast-Anomaly-Latency | ALB TargetResponseTime (Avg) | Above only (↑) | Slow degradation, DB bottlenecks
TrinityBeast-Anomaly-ErrorRate | ALB 5xx Count (Sum) | Above only (↑) | Error spikes beyond normal noise
TrinityBeast-Anomaly-CacheHitRate | ElastiCache CacheHitRate (Avg) | Below only (↓) | Cache evictions, Valkey issues

Configuration: 3 evaluation periods, 2 datapoints to alarm, missing data treated as notBreaching. Each alarm publishes to the SNS topic tbi-ops-notifications, and its transition to ALARM also matches the tbi-ops-alarm-trigger EventBridge rule, which starts the tbi-ops-health-check-heal Step Function.
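
The configuration above can be sketched as PutMetricAlarm parameters for an anomaly detection alarm. An anomaly alarm compares a metric against a band expression, `ANOMALY_DETECTION_BAND(m1, 3)`, instead of a fixed threshold, and the comparison operator encodes the direction column of the table. The 300-second period, the metric IDs, and the load balancer dimension value in the usage note are assumptions for illustration.

```python
# Comparison operators CloudWatch accepts for anomaly detection alarms.
DIRECTION_OPERATORS = {
    "both": "LessThanLowerOrGreaterThanUpperThreshold",
    "above": "GreaterThanUpperThreshold",
    "below": "LessThanLowerThreshold",
}


def anomaly_alarm(alarm_name, namespace, metric, stat, dimensions, direction,
                  band_width=3, period=300):
    """Build PutMetricAlarm arguments for a band-based anomaly alarm (sketch)."""
    return {
        "AlarmName": alarm_name,
        "ComparisonOperator": DIRECTION_OPERATORS[direction],
        "EvaluationPeriods": 3,
        "DatapointsToAlarm": 2,
        "TreatMissingData": "notBreaching",
        "ThresholdMetricId": "band",  # the alarm compares m1 against this expression
        "Metrics": [
            {
                "Id": "m1",
                "ReturnData": True,
                "MetricStat": {
                    "Metric": {
                        "Namespace": namespace,
                        "MetricName": metric,
                        "Dimensions": dimensions,
                    },
                    "Period": period,  # assumption: period is not stated in the guide
                    "Stat": stat,
                },
            },
            {
                "Id": "band",
                "Expression": f"ANOMALY_DETECTION_BAND(m1, {band_width})",
                "Label": f"{metric} (expected)",
                "ReturnData": True,
            },
        ],
    }
```

For example, the latency alarm could be built with `anomaly_alarm("TrinityBeast-Anomaly-Latency", "AWS/ApplicationELB", "TargetResponseTime", "Average", [{"Name": "LoadBalancer", "Value": "app/example/0123456789abcdef"}], "above")`, where the load balancer value is a placeholder.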

AutoOps Lambda Functions (7 Functions)

AutoOps Lambda Metrics 1770 MB each

All 7 functions share the tbi-autonomous-ops-role IAM role. Metrics visible on the Security Dashboard.

Function | Purpose | Log Group
tbi-ops-notify | SNS notifications with severity levels | /aws/lambda/tbi-ops-notify
tbi-ops-self-heal | ECS task restart, force-deploy | /aws/lambda/tbi-ops-self-heal
tbi-ops-waf-action | WAF IP set block/unblock | /aws/lambda/tbi-ops-waf-action
tbi-ops-honeypot-processor | Drain honeypot queue → WAF block | /aws/lambda/tbi-ops-honeypot-processor
tbi-ops-bedrock-analyze | AI threat analysis via Bedrock | /aws/lambda/tbi-ops-bedrock-analyze
tbi-raima-support | AI ticket categorization + drafts | /aws/lambda/tbi-raima-support
tbi-ops-digest | Daily/weekly operational digests | /aws/lambda/tbi-ops-digest

EventBridge Rules (6 Rules)

AutoOps Event Routing (ALL ENABLED)
Rule Name | Trigger | Target
tbi-ops-alarm-trigger | CloudWatch alarm → ALARM | Step Function: health-check-heal
tbi-ops-honeypot-queue-processor | rate(5 minutes) | Lambda: honeypot-processor
tbi-ops-bedrock-analyze-schedule | rate(5 minutes) | Lambda: bedrock-analyze
tbi-ops-guardduty-high-finding | GuardDuty severity ≥ 7 | Lambda: bedrock-analyze
tbi-ops-daily-digest | cron(0 11 * * ? *), 11:00 UTC (6 AM EST, 7 AM EDT) | Lambda: digest
tbi-ops-weekly-digest | cron(0 12 ? * MON *), Monday 12:00 UTC (7 AM EST, 8 AM EDT) | Lambda: digest
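
The two event-driven rules in this table could be expressed with EventBridge event patterns along the following lines. The source and detail-type values are the standard ones for CloudWatch alarm state changes and GuardDuty findings; whether the deployed rules use exactly these patterns is an assumption. The helper function is a local approximation for tests, not EventBridge's matching engine.

```python
import json

# Matches any CloudWatch alarm that transitions into the ALARM state.
ALARM_TRIGGER_PATTERN = {
    "source": ["aws.cloudwatch"],
    "detail-type": ["CloudWatch Alarm State Change"],
    "detail": {"state": {"value": ["ALARM"]}},
}

# Matches GuardDuty findings with severity 7 or higher.
GUARDDUTY_HIGH_PATTERN = {
    "source": ["aws.guardduty"],
    "detail-type": ["GuardDuty Finding"],
    "detail": {"severity": [{"numeric": [">=", 7]}]},
}


def is_high_guardduty_finding(event, minimum=7):
    """Local approximation of GUARDDUTY_HIGH_PATTERN for unit tests only."""
    return (
        event.get("source") == "aws.guardduty"
        and event.get("detail-type") == "GuardDuty Finding"
        and event.get("detail", {}).get("severity", 0) >= minimum
    )


def pattern_json(pattern):
    """Serialize a pattern the way the EventBridge PutRule API expects it."""
    return json.dumps(pattern)
```

A rule could then be created with the boto3 Events client, for example `put_rule(Name="tbi-ops-guardduty-high-finding", EventPattern=pattern_json(GUARDDUTY_HIGH_PATTERN))`.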

AutoOps SNS Topic

Topic: tbi-ops-notifications (arn:aws:sns:us-east-2:211998422884:tbi-ops-notifications)
Subscriber: tbi-ops-notify Lambda (formats + sends via SES to CoryDeanKalani@CPMP-Site.org)
Severity levels in subject: [INFO], [WARNING], [CRITICAL], [SELF-HEALED]
All 21 alarms route here — no raw AWS emails, everything formatted by Lambda.

10. Alarm Response Playbook

When an alarm fires, use the following runbooks to diagnose and resolve the issue. Each category includes the most common root causes and recommended actions.

ALB/NLB Unhealthy Targets (Critical)

Alarms: Trinity-Beast-ALB-UnhealthyTargets, Trinity-Beast-NLB-UnhealthyTargets

What it means: One or more ECS tasks are failing health checks from the load balancer.

  1. Check ECS service health in the console — are tasks running or in a crash loop?
  2. Review container logs in /aws/ecs/trinity-beast for startup errors or OOM kills
  3. Verify target group health check path and expected response code
  4. Check if a recent deployment introduced a breaking change
  5. If tasks are running but unhealthy, check application health endpoint directly
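
To support steps 1 and 3 above, a small helper can list which targets are failing and why. This sketch works on the response of the Elastic Load Balancing DescribeTargetHealth call, which has already been fetched; the function name is hypothetical.

```python
def unhealthy_targets(describe_target_health_response):
    """List (target id, port, state, reason) for every target that is not healthy."""
    results = []
    for item in describe_target_health_response.get("TargetHealthDescriptions", []):
        health = item.get("TargetHealth", {})
        if health.get("State") != "healthy":
            target = item.get("Target", {})
            results.append(
                (target.get("Id"), target.get("Port"), health.get("State"), health.get("Reason"))
            )
    return results
```

A reason such as Target.ResponseCodeMismatch points at the health check configuration, while Target.Timeout points at the application or its network path.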

ECS CPU High (Warning)

Alarms: Trinity-Beast-ECS-CPU-High, Trinity-Beast-ECS-CPU-High-Mirror, Trinity-Beast-ECS-CPU-High-LRS

What it means: An ECS service is consuming more than 80% CPU over a sustained period.

  1. Check for a traffic spike — correlate with LPO/LRS request metrics on the Application Dashboard
  2. Consider scaling the service — increase desired task count or adjust auto-scaling thresholds
  3. Check for runaway goroutines or infinite loops in recent deployments
  4. Review Container Insights for per-task CPU breakdown
  5. If sustained, evaluate whether the task CPU allocation (vCPU) needs to be increased

Service Count Low (Critical)

Alarms: Trinity-Beast-Main-Service-Count-Low, Trinity-Beast-Mirror-Service-Count-Low, Trinity-Beast-LRS-Service-Count-Low

What it means: No tasks are running for the service, typically because a container crashed or failed to start. These alarms use TreatMissing: breaching, so missing data also triggers the alarm.

  1. Check ECS service events for task stopped reasons (OOM, exit code, health check failure)
  2. Review container logs for the last running task — look for panic, fatal, or OOM messages
  3. Check if the ECR image exists and is pullable (image pull failures)
  4. Verify the task execution role has required permissions
  5. Manually start a new task if the service is not recovering automatically
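
For steps 1 and 2 above, the stop reason and container exit codes of recently stopped tasks are usually the fastest clue. The sketch below summarizes an already-fetched ECS DescribeTasks response; the function name is hypothetical, and treating exit code 137 as a possible out-of-memory kill is a heuristic, since 137 only means the process received SIGKILL.

```python
def summarize_stopped_tasks(describe_tasks_response):
    """Return one line per task with its stop reason and container exit codes."""
    lines = []
    for task in describe_tasks_response.get("tasks", []):
        codes = {c["name"]: c.get("exitCode") for c in task.get("containers", [])}
        # Exit code 137 = 128 + SIGKILL(9); often, but not always, an OOM kill.
        hint = " (possible OOM kill)" if 137 in codes.values() else ""
        lines.append(f"{task.get('stoppedReason', 'unknown')}: {codes}{hint}")
    return lines
```

Stopped tasks can be found with the ECS ListTasks call using the desired status STOPPED, and their ARNs passed to DescribeTasks.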

Aurora CPU High (Warning)

Alarm: Trinity-Beast-Aurora-CPU-High

What it means: The Aurora Serverless v2 cluster is consuming more than 80% CPU.

  1. Check for slow queries — use Performance Insights or pg_stat_statements
  2. Verify ACU scaling — is the cluster at max ACU and still under pressure?
  3. Check if the nightly sync job is running and creating batch write pressure
  4. Look for missing indexes on frequently queried columns
  5. Consider increasing the max ACU limit if load is legitimate

Aurora Connections High (Warning)

Alarm: Trinity-Beast-Aurora-Connections-High

What it means: More than 80 active database connections — approaching the connection limit.

  1. Check connection pool settings in the application — are pools sized correctly?
  2. Look for connection leaks — connections opened but never returned to the pool
  3. Verify that the sync job and Lambda are not opening excessive connections
  4. Consider using RDS Proxy if connection pressure is persistent
  5. Check if a recent deployment changed pool configuration
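
When reviewing pool sizing (steps 1 and 3 above), it can help to compute the worst-case connection count and compare it with the alarm threshold of 80. The task counts and pool sizes below are hypothetical numbers for illustration, not the deployed configuration.

```python
ALARM_THRESHOLD = 80  # Trinity-Beast-Aurora-Connections-High fires above this


def worst_case_connections(pools):
    """Sum (task count * max pool size) over every client of the database."""
    return sum(tasks * pool_size for tasks, pool_size in pools.values())


# Hypothetical (task count, max pool size) per client, for illustration only.
example_pools = {
    "main-service": (2, 15),
    "mirror-service": (1, 15),
    "lrs-service": (1, 10),
    "sync-job": (1, 5),
}

headroom = ALARM_THRESHOLD - worst_case_connections(example_pools)
```

With these example numbers the worst case is 60 connections, leaving a headroom of 20 before the alarm threshold. Scaling main-service to 4 tasks would raise the worst case to 90 and exceed the threshold even without a leak.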

ElastiCache CPU / Memory / Connections (Warning)

Alarms: Trinity-Beast-ElastiCache-CPU-High, Trinity-Beast-ElastiCache-Memory-High, Trinity-Beast-ElastiCache-Connections-High

What it means: The ElastiCache cluster is under resource pressure — CPU, memory, or connection count is elevated.

  1. Check for a cache stampede — many cache misses causing simultaneous upstream fetches
  2. Review key eviction metrics — if memory is full, keys are being evicted prematurely
  3. Check connection pool settings in the LPO service — are connections being reused properly?
  4. Look for large keys or hot keys that may be causing uneven load
  5. If memory is consistently high, consider scaling to a larger node type or adding shards
  6. Review TTL settings — are cached items living too long and consuming memory?

S3 Unusual Size Growth (Low Priority)

Alarm: Trinity-Beast-S3-Size-Unusual-Growth

What it means: The S3 bucket has exceeded 10 GB, which may indicate unexpected data accumulation.

  1. Check for unexpected uploads — review S3 access logs or CloudTrail for PutObject events
  2. Look for log file accumulation — are old log exports or reports piling up?
  3. Verify lifecycle policies are in place to expire or transition old objects
  4. Check if the LRS report output is being stored without cleanup
  5. Review bucket versioning — old versions may be consuming space
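
For steps 3 to 5 above, a lifecycle rule that expires old objects and old versions is often the lasting fix. The sketch below builds a configuration for the S3 PutBucketLifecycleConfiguration API; the prefix, the day counts, and the rule ID scheme are assumptions that would need to match the bucket's real layout and retention needs.

```python
def report_cleanup_lifecycle(prefix, expire_days=30, noncurrent_days=7):
    """Build a lifecycle configuration that expires objects under a prefix (sketch)."""
    rule_id = "expire-" + prefix.strip("/").replace("/", "-")
    return {
        "Rules": [
            {
                "ID": rule_id,
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                # Delete current objects after expire_days.
                "Expiration": {"Days": expire_days},
                # On versioned buckets, also remove superseded versions.
                "NoncurrentVersionExpiration": {"NoncurrentDays": noncurrent_days},
                # Clean up abandoned multipart uploads, which also consume space.
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
            }
        ]
    }
```

With boto3, the result could be applied through the S3 client call `put_bucket_lifecycle_configuration(Bucket=bucket_name, LifecycleConfiguration=config)`, where the bucket name is whichever bucket the alarm monitors.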