The Trinity Beast Infrastructure (TBI) uses Amazon CloudWatch as its centralized monitoring and alerting platform. This guide documents every dashboard, alarm, log group, and notification channel deployed across the system.
tbi-ops-notifications routes all alerts through the tbi-ops-notify Lambda for formatted HTML email delivery via SES.
Custom application metrics are published to the TrinityBeast/LPO and TrinityBeast/LRS namespaces.
Four CloudWatch dashboards provide layered visibility, from real-time application metrics to executive cost summaries.
| Dashboard | Purpose |
|---|---|
| Trinity-Beast-Application-Dashboard | Primary ops dashboard: LPO, LRS, AWS infra, Lambda, logs |
| Trinity-Beast-Master-Dashboard | Comprehensive view across all services |
| Trinity-Beast-Cost-Dashboard | Live cost intelligence: resource utilization metrics that drive spend, cost-context tables, links to Cost Explorer |
The Trinity-Beast-Application-Dashboard is the primary operational dashboard. It contains widgets organized into six sections covering every layer of the stack.
| Widget | Type |
|---|---|
| LPO Requests (per minute) | Metric — line graph |
| Cache Hit Rate (%) | Metric — gauge / number |
| Avg Latency (ms) | Metric — line graph |
| Cache Hits vs Misses | Metric — stacked area |
| Requests by Asset | Metric — bar chart |
| Requests by Source (Exchange) | Metric — bar chart |
| Errors & Source Failovers | Metric — line graph |
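
The Cache Hit Rate (%) widget can be illustrated with a small sketch. It assumes the rate is derived with metric math from the CacheHits and CacheMisses metrics in the TrinityBeast/LPO namespace, published without dimensions; the widget layout, the `singleValue` view, and the helper names are hypothetical and may differ from the deployed dashboard.

```python
import json


def cache_hit_rate_widget(region="us-east-2"):
    """Build a dashboard widget dict that derives hit rate from two raw metrics."""
    return {
        "type": "metric",
        "x": 0, "y": 0, "width": 6, "height": 6,  # assumed grid position
        "properties": {
            "title": "Cache Hit Rate (%)",
            "view": "singleValue",
            "region": region,
            "period": 60,
            "stat": "Sum",
            "metrics": [
                # Metric math expression; the raw series below are hidden.
                [{"expression": "100 * hits / (hits + misses)",
                  "label": "Cache Hit Rate", "id": "rate"}],
                ["TrinityBeast/LPO", "CacheHits", {"id": "hits", "visible": False}],
                ["TrinityBeast/LPO", "CacheMisses", {"id": "misses", "visible": False}],
            ],
        },
    }


def hit_rate(hits, misses):
    """Local equivalent of the expression, guarding against an empty period."""
    total = hits + misses
    return 100.0 * hits / total if total else 0.0


# JSON string in the shape a CloudWatch dashboard body expects.
body = json.dumps({"widgets": [cache_hit_rate_widget()]})
```

For example, `hit_rate(90, 10)` gives 90.0, which is the value the widget would show for a period with 90 hits and 10 misses.
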
| Widget | Type |
|---|---|
| LRS Total Requests | Metric — line graph |
| LRS Avg Latency (ms) | Metric — line graph |
| LRS Output Format Usage | Metric — bar chart |
| LRS Errors | Metric — line graph |
| Widget | Type |
|---|---|
| ECS CPU Utilization (%) | Metric — line graph |
| ECS Memory Utilization (%) | Metric — line graph |
| ALB Response Time & Errors | Metric — line graph |
| ElastiCache CPU & Cache Hit Rate | Metric — line graph |
| ElastiCache Memory Usage (%) | Metric — gauge / number |
| Aurora Serverless Capacity (ACU) | Metric — line graph |
| Widget | Type |
|---|---|
| LPO — Main Service Logs | Log query |
| LRS — Report Service Logs | Log query |
| Mirror Service Logs | Log query |
| Sync Job Logs | Log query |
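
A log query widget like the ones listed above could be defined roughly as follows. This is a sketch only: the Logs Insights query text, the widget size, and the mapping of widget titles to log groups are assumptions based on the log group table in this guide, not a copy of the deployed dashboard.

```python
import json


def log_widget(title, log_group, y=0, region="us-east-2", limit=20):
    """Build a CloudWatch dashboard log widget that tails one log group."""
    query = (
        f"SOURCE '{log_group}' | fields @timestamp, @message "
        f"| sort @timestamp desc | limit {limit}"
    )
    return {
        "type": "log",
        "x": 0, "y": y, "width": 24, "height": 6,  # assumed layout
        "properties": {
            "title": title,
            "region": region,
            "view": "table",
            "query": query,
        },
    }


# Assumed title-to-log-group mapping, for illustration only.
widgets = [
    log_widget("Mirror Service Logs", "/aws/ecs/trinity-beast", y=0),
    log_widget("Sync Job Logs", "/aws/ecs/trinity-beast-sync", y=6),
]
body = json.dumps({"widgets": widgets})
```
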
| Widget | Type |
|---|---|
| Lambda Invocations | Metric — line graph |
| Lambda Errors | Metric — line graph |
| Lambda Duration (ms) | Metric — line graph |
| Throttles & Concurrency | Metric — line graph |
| Receipts by Handler Type | Log widget |
| Recent Receipts — Handler Detail | Log widget |
| Receipt Lambda Logs | Log query |
| Widget | Type |
|---|---|
| CloudTrail — Errors & Access Denied | Log query |
| CloudTrail — ECS & Infrastructure Changes | Log query |
| VPC Flow Logs — Rejected Traffic (Trinity VPC) | Log query |
One dedicated cost dashboard provides financial visibility into the Trinity Beast Infrastructure spend.
Live cost intelligence dashboard that combines resource utilization metrics with cost context. Rather than embedding stale dollar figures, it shows the metrics that drive spend: ECS CPU/memory utilization across all 4 services, Aurora ACU and utilization (with min/max annotations), ElastiCache CPU/memory, Lambda invocations and duration for all 8 functions, Aurora connections, and NAT Gateway/EC2 data transfer. Two cost-context tables explain the monthly baseline by component (with unit costs) and identify the cost levers you can control. The dashboard links directly to Cost Explorer, Budgets, and Savings Plans for exact dollar figures.
Replaces: The previous static Trinity-Beast-Cost-Executive-Dashboard and Trinity-Beast-Cost-Detailed-Dashboard (deleted 2026-05-30) — both were 100% hardcoded markdown text frozen at April 2026 figures, referencing services no longer in use.
17 static-threshold alarms monitor critical infrastructure metrics. All alarms publish to the tbi-ops-notifications SNS topic, which invokes the tbi-ops-notify Lambda for formatted HTML email delivery. Alarms also trigger the tbi-ops-alarm-trigger EventBridge rule for automated self-healing.
| Alarm Name | Metric | Namespace | Threshold | Period | Eval Periods | State |
|---|---|---|---|---|---|---|
| Trinity-Beast-ALB-UnhealthyTargets | UnHealthyHostCount | AWS/ApplicationELB | >= 1 | 60s | 3 | OK |
| Trinity-Beast-NLB-UnhealthyTargets | UnHealthyHostCount | AWS/NetworkELB | >= 1 | 60s | 3 | OK |
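
To make the table columns concrete, the sketch below maps one row onto the parameter names used by the CloudWatch `PutMetricAlarm` API. It only builds the parameter dict and does not call AWS. The statistic, the use of `OKActions`, and the omission of load balancer dimensions are assumptions made for brevity; a real alarm would also need its `Dimensions`.

```python
# Table notation mapped to PutMetricAlarm comparison operators.
OPERATORS = {
    ">=": "GreaterThanOrEqualToThreshold",
    ">": "GreaterThanThreshold",
    "<": "LessThanThreshold",
    "<=": "LessThanOrEqualToThreshold",
}

SNS_TOPIC_ARN = "arn:aws:sns:us-east-2:211998422884:tbi-ops-notifications"


def alarm_params(name, metric, namespace, op, threshold, period, eval_periods,
                 statistic="Maximum", treat_missing="missing"):
    """Translate one alarm table row into PutMetricAlarm keyword arguments."""
    return {
        "AlarmName": name,
        "MetricName": metric,
        "Namespace": namespace,
        "Statistic": statistic,          # assumed; not stated in the table
        "ComparisonOperator": OPERATORS[op],
        "Threshold": float(threshold),
        "Period": period,                # seconds
        "EvaluationPeriods": eval_periods,
        "TreatMissingData": treat_missing,
        "AlarmActions": [SNS_TOPIC_ARN],
        "OKActions": [SNS_TOPIC_ARN],    # assumed, so recoveries also notify
    }


alb = alarm_params("Trinity-Beast-ALB-UnhealthyTargets", "UnHealthyHostCount",
                   "AWS/ApplicationELB", ">=", 1, 60, 3)
```

With boto3 available, a dict like `alb` could be passed as `cloudwatch.put_metric_alarm(**alb)` once the dimensions are filled in.
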
| Alarm Name | Metric | Namespace | Threshold | Period | Eval Periods | State | Notes |
|---|---|---|---|---|---|---|---|
| Trinity-Beast-ECS-CPU-High | CPUUtilization | AWS/ECS (main-service) | > 80% | 300s | 2 | OK | - |
| Trinity-Beast-ECS-CPU-High-Mirror | CPUUtilization | AWS/ECS (mirror-service) | > 80% | 300s | 2 | OK | - |
| Trinity-Beast-ECS-CPU-High-LRS | CPUUtilization | AWS/ECS (lrs-service) | > 80% | 300s | 2 | OK | - |
| Trinity-Beast-Main-Service-Count-Low | RunningTaskCount | ECS/ContainerInsights (main) | < 1 | 300s | 2 | OK | TreatMissing: breaching |
| Trinity-Beast-Mirror-Service-Count-Low | RunningTaskCount | ECS/ContainerInsights (mirror) | < 1 | 300s | 2 | OK | TreatMissing: breaching |
| Trinity-Beast-LRS-Service-Count-Low | RunningTaskCount | ECS/ContainerInsights (lrs) | < 1 | 300s | 2 | OK | TreatMissing: breaching |
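
The effect of `TreatMissing: breaching` on the Count-Low alarms can be seen in a simplified model. The function below is a deliberately reduced illustration of a "less than threshold for N consecutive periods" alarm, not CloudWatch's full evaluation algorithm; `alarm_state` and its arguments are hypothetical names.

```python
def alarm_state(datapoints, threshold, eval_periods, treat_missing):
    """Return "ALARM" if the last `eval_periods` datapoints all breach `< threshold`.

    `datapoints` is ordered oldest to newest; None marks a missing period.
    Only the two treat-missing modes used in this guide are modeled.
    """
    if treat_missing not in ("breaching", "notBreaching"):
        raise ValueError("this sketch models only breaching / notBreaching")
    # Pad with missing periods if there is less history than the window.
    window = [None] * max(0, eval_periods - len(datapoints)) + datapoints[-eval_periods:]
    breaches = [
        (treat_missing == "breaching") if value is None else value < threshold
        for value in window
    ]
    return "ALARM" if all(breaches) else "OK"


# RunningTaskCount stops reporting after the container crashes.
crashed = [1, 1, None, None]
print(alarm_state(crashed, threshold=1, eval_periods=2, treat_missing="breaching"))
print(alarm_state(crashed, threshold=1, eval_periods=2, treat_missing="notBreaching"))
```

In this model the same silent series alarms under `breaching` and stays quiet under `notBreaching`, which is why the task-count alarms use the former.
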
| Alarm Name | Metric | Namespace | Threshold | Period | Eval Periods | State |
|---|---|---|---|---|---|---|
| Trinity-Beast-Aurora-CPU-High | CPUUtilization | AWS/RDS (trinity-beast-aurora-cluster) | > 80% | 300s | 2 | OK |
| Trinity-Beast-Aurora-Connections-High | DatabaseConnections | AWS/RDS (trinity-beast-aurora-cluster) | > 80 | 300s | 2 | OK |
| Alarm Name | Metric | Namespace | Threshold | Period | Eval Periods | State |
|---|---|---|---|---|---|---|
| Trinity-Beast-ElastiCache-CPU-High | CPUUtilization | AWS/ElastiCache | > 80% | 300s | 2 | OK |
| Trinity-Beast-ElastiCache-Memory-High | DatabaseMemoryUsagePercentage | AWS/ElastiCache | > 85% | 300s | 2 | OK/ALARM |
| Trinity-Beast-ElastiCache-Connections-High | CurrConnections | AWS/ElastiCache | > 1000 | 300s | 2 | OK |
| Alarm Name | Metric | Namespace | Threshold | Period | Eval Periods | State |
|---|---|---|---|---|---|---|
| Trinity-Beast-S3-Size-Unusual-Growth | BucketSizeBytes | AWS/S3 | > 10 GB | 86400s | 1 | OK |
| Alarm Name | Metric | Namespace | Threshold | Period | Eval Periods | State |
|---|---|---|---|---|---|---|
| TrinityBeast-WAF-HighBlockRate | BlockedRequests | AWS/WAFV2 | > 100 | 300s | 1 | OK |
| TrinityBeast-API-5xx-Spike | HTTPCode_Target_5XX_Count | AWS/ApplicationELB | > 10 | 300s | 1 | OK |
| TrinityBeast-API-4xx-Spike | HTTPCode_Target_4XX_Count | AWS/ApplicationELB | > 200 | 300s | 1 | OK |
| TrinityBeast-GuardDuty-Finding | finding | AWS/GuardDuty | > 0 | 300s | 1 | OK |
All CloudWatch alarms route through a unified AutoOps pipeline. No raw text emails from AWS — every notification is formatted by the tbi-ops-notify Lambda before delivery via SES.
┌─────────────────────────────────────────────────────────────────────────┐
│ NOTIFICATION ROUTING │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ CloudWatch Alarm (any of 21 alarms) │
│ ├─→ SNS: tbi-ops-notifications │
│ │ └─→ tbi-ops-notify Lambda │
│ │ └─→ Formatted HTML email (SES) │
│ │ → CoryDeanKalani@CPMP-Site.org │
│ │ │
│ └─→ EventBridge: tbi-ops-alarm-trigger │
│ └─→ Step Function: tbi-ops-health-check-heal │
│ └─→ Self-heal → verify recovery → notify │
│ │
│ GuardDuty Finding (severity ≥ 7) │
│ └─→ EventBridge: tbi-ops-guardduty-high-finding │
│ └─→ tbi-ops-bedrock-analyze Lambda │
│ └─→ AI threat assessment → auto-action → notify │
│ │
│ Honeypot Hits (every 5 min) │
│ └─→ EventBridge: tbi-ops-honeypot-queue-processor │
│ └─→ tbi-ops-honeypot-processor Lambda │
│ └─→ WAF IP block → notify │
│ │
│ Bedrock Threat Analysis (every 5 min) │
│ └─→ EventBridge: tbi-ops-bedrock-analyze-schedule │
│ └─→ tbi-ops-bedrock-analyze Lambda │
│ └─→ Correlate signals → report → notify if HIGH/CRITICAL │
│ │
│ Support Ticket Submitted │
│ └─→ Application invokes tbi-raima-support Lambda │
│ └─→ Categorize → draft response → notify │
│ │
│ Daily/Weekly Digest (cron) │
│ └─→ EventBridge: tbi-ops-daily-digest / tbi-ops-weekly-digest │
│ └─→ tbi-ops-digest Lambda │
│ └─→ Bedrock summary → formatted email │
│ │
└─────────────────────────────────────────────────────────────────────────┘
| Protocol | Endpoint | Purpose | Status |
|---|---|---|---|
| Lambda | tbi-ops-notify | Formats alert → sends HTML email via SES to CoryDeanKalani@CPMP-Site.org | Active |
How it works: When any alarm transitions to ALARM or OK, SNS invokes the tbi-ops-notify Lambda. The Lambda parses the alarm payload, formats a branded HTML email with severity badges and context, and sends it via Amazon SES. Subject lines include severity: [INFO], [WARNING], [CRITICAL], [SELF-HEALED].
Sender: CPMP Mission <No-Reply@CPMP-Site.org>
Recipient: CoryDeanKalani@CPMP-Site.org
Format: HTML email with dark theme, severity color coding, alarm details, and recommended actions.
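
The parsing step described above might look roughly like the sketch below. It relies on the standard shape of an SNS-to-Lambda event carrying a CloudWatch alarm message (`AlarmName`, `NewStateValue`); the severity mapping and the function names are hypothetical, since the actual rules inside tbi-ops-notify are not documented here, and the [SELF-HEALED] path is not modeled.

```python
import json

# Hypothetical rule: alarm names containing these markers count as critical.
CRITICAL_MARKERS = ("Count-Low", "UnhealthyTargets", "GuardDuty")


def severity_for(alarm_name, new_state):
    """Pick a subject-line severity tag for an alarm state change."""
    if new_state != "ALARM":
        return "INFO"
    if any(marker in alarm_name for marker in CRITICAL_MARKERS):
        return "CRITICAL"
    return "WARNING"


def build_subject(sns_event):
    """Extract the alarm payload from an SNS event and format the email subject."""
    # SNS delivers the alarm as a JSON string inside the first record.
    message = json.loads(sns_event["Records"][0]["Sns"]["Message"])
    name = message["AlarmName"]
    state = message["NewStateValue"]
    return f"[{severity_for(name, state)}] {name} is {state}"


sample = {"Records": [{"Sns": {"Message": json.dumps({
    "AlarmName": "Trinity-Beast-ECS-CPU-High",
    "NewStateValue": "ALARM",
    "NewStateReason": "Threshold Crossed",
})}}]}
print(build_subject(sample))
```
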
Status: This topic is retained for potential future SMS escalation but no alarms currently target it. All 21 alarms were migrated to tbi-ops-notifications on May 15, 2026. The SMS subscription remains active as a backup escalation channel.
| Protocol | Endpoint | Status |
|---|---|---|
| Email | CoryDeanKalani@CPMP-Site.org | Inactive (no triggers) |
| SMS | +16156128200 | Inactive (no triggers) |
Design Decision (May 2026): All notifications route through a single Lambda (tbi-ops-notify) for consistent formatting, content review, and delivery control. This eliminates raw AWS text emails and ensures every alert arrives as a branded, readable HTML message with actionable context. AWS User Notifications service was disabled — it was sending unformatted alarm summaries that bypassed the AutoOps pipeline.
10 log groups capture output from every service layer. All groups are configured with a 30-day retention policy.
| Log Group | Retention | Source |
|---|---|---|
| /aws/ecs/trinity-beast | 30 days | All 3 ECS services (LPO, Mirror, LRS) |
| /aws/ecs/trinity-beast-sync | 30 days | Nightly sync job |
| /ecs/trinity-beast-lpo | 30 days | Legacy LPO logs |
| /ecs/trinity-beast-main-task-container-def | 30 days | Legacy main task logs |
| /aws/lambda/trinity-beast-receipt | 30 days | Receipt Lambda |
| /aws/vpc/trinity-beast-flowlogs | 30 days | VPC Flow Logs |
| /aws/cloudtrail/trinity-beast | 30 days | CloudTrail audit logs |
| /aws/codebuild/trinity-beast-build | 30 days | CodeBuild logs |
| /aws/ecs/containerinsights/trinity-beast-fargate-cluster/performance | 30 days | Container Insights |
| RDSOSMetrics | 30 days | Aurora OS metrics |
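
One way to verify the 30-day policy is to scan the output of the CloudWatch Logs `DescribeLogGroups` API, where each entry carries `logGroupName` and, when a policy is set, `retentionInDays`. The helper below is a hypothetical sketch that works on that response shape without calling AWS; the sample data is invented for illustration.

```python
def retention_violations(log_groups, expected_days=30):
    """Return names of log groups whose retention differs from the expected value.

    Groups with no `retentionInDays` key never expire, so they are also reported.
    """
    return sorted(
        group["logGroupName"]
        for group in log_groups
        if group.get("retentionInDays") != expected_days
    )


# Invented sample in the shape of a DescribeLogGroups response's "logGroups" list.
sample = [
    {"logGroupName": "/aws/ecs/trinity-beast", "retentionInDays": 30},
    {"logGroupName": "/aws/lambda/trinity-beast-receipt", "retentionInDays": 14},
    {"logGroupName": "RDSOSMetrics"},  # no policy set
]
print(retention_violations(sample))
```
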
The application publishes custom metrics to two CloudWatch namespaces, providing business-level observability beyond standard AWS metrics.
Metrics published by the Live Price Oracle service:
| Metric | Description |
|---|---|
| Requests | Total LPO requests received |
| CacheHits | Requests served from ElastiCache cache |
| CacheMisses | Requests requiring upstream source fetch |
| Errors | Failed requests (all error types) |
| SourceFailovers | Times a primary source failed and secondary was used |
| AvgLatency | Average response time in milliseconds |
Metrics published by the Live Report Service:
| Metric | Description |
|---|---|
| Requests | Total LRS report requests |
| AvgLatency | Average report generation time in milliseconds |
| Errors | Failed report generations |
| MonthlyLimitExceeded | Requests rejected due to monthly quota |
| DailyLimitExceeded | Requests rejected due to daily quota |
| AddOnRequests | Requests using add-on quota beyond base plan |
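
As an illustration of how such metrics could be published, the sketch below builds a payload in the shape expected by the CloudWatch `PutMetricData` API. The helper name, the choice of units, and the absence of dimensions are assumptions; the services' actual publishing code is not shown in this guide.

```python
def metric_payload(namespace, counts, avg_latency_ms):
    """Build PutMetricData keyword arguments for one reporting interval.

    `counts` maps metric names to integer counts for the interval.
    """
    data = [
        {"MetricName": name, "Value": float(value), "Unit": "Count"}
        for name, value in counts.items()
    ]
    data.append(
        {"MetricName": "AvgLatency", "Value": float(avg_latency_ms), "Unit": "Milliseconds"}
    )
    return {"Namespace": namespace, "MetricData": data}


lpo = metric_payload(
    "TrinityBeast/LPO",
    {"Requests": 120, "CacheHits": 100, "CacheMisses": 20, "Errors": 1, "SourceFailovers": 0},
    avg_latency_ms=42.5,
)
lrs = metric_payload(
    "TrinityBeast/LRS",
    {"Requests": 8, "Errors": 0, "DailyLimitExceeded": 1},
    avg_latency_ms=310,
)
```

With boto3 available, either dict could be sent with `cloudwatch.put_metric_data(**lpo)`.
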
The 5-layer AutoOps system has its own monitoring footprint — 7 Lambda functions, 6 EventBridge rules, 4 anomaly detection alarms, and a dedicated SNS topic. All feed into the Security Dashboard.
These alarms use CloudWatch Anomaly Detection (machine learning) to learn normal traffic patterns and alert on deviations. They need ~2 weeks to build a baseline.
| Alarm Name | Metric | Direction | Catches |
|---|---|---|---|
| TrinityBeast-Anomaly-RequestRate | ALB RequestCount (Sum) | Both (↑↓) | Traffic drops (outage) or unexpected spikes (attack) |
| TrinityBeast-Anomaly-Latency | ALB TargetResponseTime (Avg) | Above only (↑) | Slow degradation, DB bottlenecks |
| TrinityBeast-Anomaly-ErrorRate | ALB 5xx Count (Sum) | Above only (↑) | Error spikes beyond normal noise |
| TrinityBeast-Anomaly-CacheHitRate | ElastiCache CacheHitRate (Avg) | Below only (↓) | Cache evictions, Valkey issues |
Configuration: 3 evaluation periods, 2 datapoints to alarm, missing data treated as notBreaching. Every anomaly alarm publishes to the tbi-ops-notifications SNS topic and also triggers the tbi-ops-alarm-trigger EventBridge rule, which starts the health-check-heal Step Function.
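
The configuration above corresponds to a `PutMetricAlarm` call that uses an `ANOMALY_DETECTION_BAND` expression as its threshold. The sketch below only builds the parameter dict. The band width of 2, the 300-second period, and the omitted load balancer dimensions are assumptions, not values taken from the deployed alarms.

```python
# Direction column mapped to the anomaly-detection comparison operators.
DIRECTIONS = {
    "both": "LessThanLowerOrGreaterThanUpperThreshold",
    "above": "GreaterThanUpperThreshold",
    "below": "LessThanLowerThreshold",
}


def anomaly_alarm_params(name, namespace, metric, stat, direction,
                         band_width=2, period=300):
    """Build PutMetricAlarm keyword arguments for an anomaly detection alarm."""
    return {
        "AlarmName": name,
        "ComparisonOperator": DIRECTIONS[direction],
        "EvaluationPeriods": 3,
        "DatapointsToAlarm": 2,
        "TreatMissingData": "notBreaching",
        "ThresholdMetricId": "band",  # compare m1 against the band expression
        "Metrics": [
            {
                "Id": "m1",
                "ReturnData": True,
                "MetricStat": {
                    "Metric": {"Namespace": namespace, "MetricName": metric},
                    "Period": period,   # assumed
                    "Stat": stat,
                },
            },
            {
                "Id": "band",
                "Expression": f"ANOMALY_DETECTION_BAND(m1, {band_width})",
                "Label": "Expected range",
                "ReturnData": True,
            },
        ],
    }


request_rate = anomaly_alarm_params(
    "TrinityBeast-Anomaly-RequestRate", "AWS/ApplicationELB", "RequestCount", "Sum", "both"
)
```
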
All 7 functions share the tbi-autonomous-ops-role IAM role. Metrics visible on the Security Dashboard.
| Function | Purpose | Log Group |
|---|---|---|
| tbi-ops-notify | SNS notifications with severity levels | /aws/lambda/tbi-ops-notify |
| tbi-ops-self-heal | ECS task restart, force-deploy | /aws/lambda/tbi-ops-self-heal |
| tbi-ops-waf-action | WAF IP set block/unblock | /aws/lambda/tbi-ops-waf-action |
| tbi-ops-honeypot-processor | Drain honeypot queue → WAF block | /aws/lambda/tbi-ops-honeypot-processor |
| tbi-ops-bedrock-analyze | AI threat analysis via Bedrock | /aws/lambda/tbi-ops-bedrock-analyze |
| tbi-raima-support | AI ticket categorization + drafts | /aws/lambda/tbi-raima-support |
| tbi-ops-digest | Daily/weekly operational digests | /aws/lambda/tbi-ops-digest |
| Rule Name | Trigger | Target |
|---|---|---|
| tbi-ops-alarm-trigger | CloudWatch alarm → ALARM | Step Function: health-check-heal |
| tbi-ops-honeypot-queue-processor | rate(5 minutes) | Lambda: honeypot-processor |
| tbi-ops-bedrock-analyze-schedule | rate(5 minutes) | Lambda: bedrock-analyze |
| tbi-ops-guardduty-high-finding | GuardDuty severity ≥ 7 | Lambda: bedrock-analyze |
| tbi-ops-daily-digest | cron(0 11 * * ? *) (6 AM EST) | Lambda: digest |
| tbi-ops-weekly-digest | cron(0 12 ? * MON *) (Mon 7 AM EST) | Lambda: digest |
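
The event-driven rules in this table can be expressed as EventBridge event patterns. The two patterns below are a plausible sketch using the standard `aws.cloudwatch` and `aws.guardduty` event fields; the deployed rules may filter on additional fields such as specific alarm names. The small helper confirms the UTC-to-EST conversion for the cron schedules, since EventBridge cron expressions are evaluated in UTC.

```python
import json
from datetime import datetime, timedelta, timezone

# Assumed pattern for tbi-ops-alarm-trigger: any alarm entering ALARM.
ALARM_PATTERN = {
    "source": ["aws.cloudwatch"],
    "detail-type": ["CloudWatch Alarm State Change"],
    "detail": {"state": {"value": ["ALARM"]}},
}

# Assumed pattern for tbi-ops-guardduty-high-finding: severity 7 or higher.
GUARDDUTY_PATTERN = {
    "source": ["aws.guardduty"],
    "detail-type": ["GuardDuty Finding"],
    "detail": {"severity": [{"numeric": [">=", 7]}]},
}

EST = timezone(timedelta(hours=-5))  # fixed offset, no daylight saving


def utc_hour_to_est(hour):
    """Convert a UTC hour from a cron expression to the EST hour."""
    return datetime(2026, 1, 5, hour, tzinfo=timezone.utc).astimezone(EST).hour


print(json.dumps(GUARDDUTY_PATTERN))
print(utc_hour_to_est(11), utc_hour_to_est(12))
```

Because the schedules are fixed in UTC, the digests arrive one hour later by local clock time while daylight saving time is in effect.
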
Topic: tbi-ops-notifications (arn:aws:sns:us-east-2:211998422884:tbi-ops-notifications)
Subscriber: tbi-ops-notify Lambda (formats + sends via SES to CoryDeanKalani@CPMP-Site.org)
Severity levels in subject: [INFO], [WARNING], [CRITICAL], [SELF-HEALED]
All 21 alarms route here — no raw AWS emails, everything formatted by Lambda.
When an alarm fires, use the following runbooks to diagnose and resolve the issue. Each category includes the most common root causes and recommended actions.
What it means: One or more ECS tasks are failing health checks from the load balancer.
/aws/ecs/trinity-beast for startup errors or OOM kills
What it means: An ECS service is consuming more than 80% CPU over a sustained period.
What it means: A container has crashed and no tasks are running for the service. These alarms use TreatMissing: breaching, so missing data also triggers the alarm.
What it means: The Aurora Serverless v2 cluster is consuming more than 80% CPU.
pg_stat_statements
What it means: More than 80 active database connections, approaching the connection limit.
What it means: The ElastiCache cluster is under resource pressure — CPU, memory, or connection count is elevated.
What it means: The S3 bucket has exceeded 10 GB, which may indicate unexpected data accumulation.
PutObject events