The Trinity Beast Infrastructure — CloudWatch Dashboard & Alarm Notifications

1. Overview

The Trinity Beast Infrastructure (TBI) uses Amazon CloudWatch as its centralized monitoring and alerting platform. This guide documents every dashboard, alarm, log group, and notification channel deployed across the system.

Dashboards: 4
Alarms: 21
Log Groups: 30
Retention: 90 days

4 CloudWatch dashboards for operational and cost visibility
21 alarms covering ECS, Aurora, ElastiCache, ALB, NLB, S3, WAF, GuardDuty, API error rates, and 4 ML-based anomaly detectors
SNS topic tbi-ops-notifications routes all alerts through the tbi-ops-notify Lambda for formatted HTML email delivery via SES
30 log groups with 90-day retention across all services — ECS containers, 14 Lambda functions, Aurora PostgreSQL, Valkey slow/engine logs, VPC Flow Logs, CloudTrail, Container Insights, and RDS OS Metrics
Unified S3 logs bucket (aws-waf-logs-trinity-beast) for high-volume infrastructure logs — WAF, ALB, CloudFront, and S3 access logs with 365-day lifecycle. See Unified Logs & Observability
Custom metrics published to TrinityBeast/LPO and TrinityBeast/LRS namespaces
6 EventBridge rules routing alarms, schedules, and GuardDuty findings to AutoOps

2. Dashboards

Four CloudWatch dashboards provide layered visibility — from real-time application metrics to executive cost summaries.

Dashboard	Purpose
`Trinity-Beast-Application-Dashboard`	Primary ops dashboard — LPO, LRS, AWS infra, Lambda, logs
`Trinity-Beast-Master-Dashboard`	Comprehensive view across all services
`Trinity-Beast-Cost-Dashboard`	Live cost intelligence — resource utilization metrics that drive spend, cost-context tables, links to Cost Explorer

3. Application Dashboard — Widget Reference

The Trinity-Beast-Application-Dashboard is the primary operational dashboard. It contains 31 widgets organized into six sections covering every layer of the stack.

LPO Section

LPO Widgets (7 widgets)

Widget	Type
LPO Requests (per minute)	Metric — line graph
Cache Hit Rate (%)	Metric — gauge / number
Avg Latency (ms)	Metric — line graph
Cache Hits vs Misses	Metric — stacked area
Requests by Asset	Metric — bar chart
Requests by Source (Exchange)	Metric — bar chart
Errors & Source Failovers	Metric — line graph

LRS Section

LRS Widgets (4 widgets)

Widget	Type
LRS Total Requests	Metric — line graph
LRS Avg Latency (ms)	Metric — line graph
LRS Output Format Usage	Metric — bar chart
LRS Errors	Metric — line graph

AWS Infrastructure Section

Infrastructure Widgets (6 widgets)

Widget	Type
ECS CPU Utilization (%)	Metric — line graph
ECS Memory Utilization (%)	Metric — line graph
ALB Response Time & Errors	Metric — line graph
ElastiCache ECPU & Cache Hit Rate	Metric — line graph
ElastiCache Storage (BytesUsedForCache)	Metric — gauge / number
Aurora Serverless Capacity (ACU)	Metric — line graph

Container Logs Section

Log Widgets (4 widgets)

Widget	Type
LPO — Main Service Logs	Log query
LRS — Report Service Logs	Log query
Mirror Service Logs	Log query
Sync Job Logs	Log query

Lambda Section

Lambda Widgets (7 widgets)

Widget	Type
Lambda Invocations	Metric — line graph
Lambda Errors	Metric — line graph
Lambda Duration (ms)	Metric — line graph
Throttles & Concurrency	Metric — line graph
Receipts by Handler Type	Log widget
Recent Receipts — Handler Detail	Log widget
Receipt Lambda Logs	Log query

CloudTrail & VPC Section

Audit & Network Widgets (3 widgets)

Widget	Type
CloudTrail — Errors & Access Denied	Log query
CloudTrail — ECS & Infrastructure Changes	Log query
VPC Flow Logs — Rejected Traffic (Trinity VPC)	Log query
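As a rough illustration of the kind of query behind the rejected-traffic widget, the sketch below builds arguments for the CloudWatch Logs StartQuery API. The query text, the one-hour lookback, and the helper name `start_query_args` are assumptions, not the dashboard's actual definition. The field names (`srcAddr`, `dstPort`, `action`) are the ones Logs Insights discovers for default-format VPC Flow Logs, and the log group name is the one listed in section 7.

```python
import time

# Assumed Logs Insights query: top sources of rejected traffic.
REJECTED_TRAFFIC_QUERY = """
fields @timestamp, srcAddr, dstAddr, dstPort
| filter action = "REJECT"
| stats count(*) as rejects by srcAddr, dstPort
| sort rejects desc
| limit 20
""".strip()


def start_query_args(log_group, query, lookback_seconds=3600, now=None):
    """Build StartQuery arguments (hypothetical helper; makes no AWS call)."""
    end = int(now if now is not None else time.time())
    return {
        "logGroupName": log_group,
        "startTime": end - lookback_seconds,  # epoch seconds
        "endTime": end,
        "queryString": query,
    }


# A fixed "now" keeps the example deterministic.
args = start_query_args(
    "/aws/vpc/trinity-beast-flowlogs", REJECTED_TRAFFIC_QUERY, now=1_700_000_000
)
print(args["startTime"], args["endTime"])
```

The resulting dictionary could be passed to a Logs client (for example boto3's `start_query(**args)`), followed by polling for results.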

4. Cost Dashboard

One dedicated cost dashboard provides financial visibility into the Trinity Beast Infrastructure spend.

Trinity-Beast-Cost-Dashboard (Cost Intelligence)

Live cost intelligence dashboard that combines resource utilization metrics with cost context. Rather than embedding stale dollar figures, it shows the metrics that drive spend — ECS CPU/Memory utilization across all 4 services, Aurora ACU + utilization (with min/max annotations), ElastiCache Serverless ECPU/storage, Lambda invocations and duration for all 8 functions, Aurora connections, and NAT Gateway/EC2 data transfer. Two cost-context tables explain the monthly baseline by component (with unit costs) and identify the cost levers you can control. Direct links to Cost Explorer, Budgets, and Savings Plans for exact dollar figures.

Replaces: The previous static Trinity-Beast-Cost-Executive-Dashboard and Trinity-Beast-Cost-Detailed-Dashboard (deleted 2026-05-30) — both were 100% hardcoded markdown text frozen at April 2026 figures, referencing services no longer in use.

5. CloudWatch Alarms

17 static-threshold alarms monitor critical infrastructure metrics. All alarms publish to the tbi-ops-notifications SNS topic, which invokes the tbi-ops-notify Lambda for formatted HTML email delivery. Alarms also trigger the tbi-ops-alarm-trigger EventBridge rule for automated self-healing.

Load Balancers (2 Alarms)

ALB & NLB Health (status: OK)

Alarm Name	Metric	Namespace	Threshold	Period	Eval Periods	State
`Trinity-Beast-ALB-UnhealthyTargets`	UnHealthyHostCount	`AWS/ApplicationELB`	>= 1	`60s`	3	OK
`Trinity-Beast-NLB-UnhealthyTargets`	UnHealthyHostCount	`AWS/NetworkELB`	>= 1	`60s`	3	OK

ECS Services (6 Alarms)

ECS CPU & Task Count (status: OK)

Alarm Name	Metric	Namespace	Threshold	Period	Eval Periods	State	Notes
`Trinity-Beast-ECS-CPU-High`	CPUUtilization	`AWS/ECS (main-service)`	> 80%	`300s`	2	OK	—
`Trinity-Beast-ECS-CPU-High-Mirror`	CPUUtilization	`AWS/ECS (mirror-service)`	> 80%	`300s`	2	OK	—
`Trinity-Beast-ECS-CPU-High-LRS`	CPUUtilization	`AWS/ECS (lrs-service)`	> 80%	`300s`	2	OK	—
`Trinity-Beast-Main-Service-Count-Low`	RunningTaskCount	`ECS/ContainerInsights (main)`	< 1	`300s`	2	OK	TreatMissing: breaching
`Trinity-Beast-Mirror-Service-Count-Low`	RunningTaskCount	`ECS/ContainerInsights (mirror)`	< 1	`300s`	2	OK	TreatMissing: breaching
`Trinity-Beast-LRS-Service-Count-Low`	RunningTaskCount	`ECS/ContainerInsights (lrs)`	< 1	`300s`	2	OK	TreatMissing: breaching
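The task-count rows above map onto CloudWatch's PutMetricAlarm parameters roughly as follows. This is a sketch rather than the deployed definition: the `ClusterName` and `ServiceName` dimension values, the `Average` statistic, and the helper name `service_count_low_alarm` are assumptions. The threshold, period, evaluation periods, and missing-data handling come from the table, and the topic ARN is the one given in section 6.

```python
# Sketch: keyword arguments for CloudWatch PutMetricAlarm
# (e.g. boto3's cloudwatch.put_metric_alarm(**params)). Nothing here calls AWS.
SNS_TOPIC_ARN = "arn:aws:sns:us-east-2:211998422884:tbi-ops-notifications"


def service_count_low_alarm(alarm_name: str, service_name: str) -> dict:
    """Build the alarm definition for one ECS service (hypothetical helper)."""
    return {
        "AlarmName": alarm_name,
        "Namespace": "ECS/ContainerInsights",
        "MetricName": "RunningTaskCount",
        "Dimensions": [
            # Assumed dimension values; check them against the real cluster.
            {"Name": "ClusterName", "Value": "trinity-beast-fargate-cluster"},
            {"Name": "ServiceName", "Value": service_name},
        ],
        "Statistic": "Average",            # assumption: not stated in the table
        "Period": 300,                     # seconds
        "EvaluationPeriods": 2,
        "Threshold": 1,
        "ComparisonOperator": "LessThanThreshold",
        "TreatMissingData": "breaching",   # missing data counts as "no tasks"
        "AlarmActions": [SNS_TOPIC_ARN],
        "OKActions": [SNS_TOPIC_ARN],      # recoveries are notified as well
    }


params = service_count_low_alarm(
    "Trinity-Beast-Main-Service-Count-Low", "main-service"
)
print(params["AlarmName"], params["ComparisonOperator"], params["Threshold"])
```

Setting `TreatMissingData` to `breaching` is what lets the alarm fire when the service stops reporting entirely, since a service with no tasks may publish no datapoints at all.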

Aurora (2 Alarms)

Aurora Serverless v2 (status: OK)

Alarm Name	Metric	Namespace	Threshold	Period	Eval Periods	State
`Trinity-Beast-Aurora-CPU-High`	CPUUtilization	`AWS/RDS (trinity-beast-aurora-cluster)`	> 80%	`300s`	2	OK
`Trinity-Beast-Aurora-Connections-High`	DatabaseConnections	`AWS/RDS (trinity-beast-aurora-cluster)`	> 80	`300s`	2	OK

ElastiCache Serverless (4 Alarms)

ElastiCache Serverless for Valkey (status: OK)

Note: These alarms replaced the original node-based alarms (CPUUtilization, DatabaseMemoryUsagePercentage, CurrConnections) after the Serverless migration on July 2, 2026. Serverless exposes different metrics: there are no nodes, no CPU percentage, and no fixed memory pool. Instead, you monitor processing units consumed, bytes stored, throttled commands, and read-request latency. The alarm names keep their node-era suffixes (CPU-High, Memory-High, Connections-High), so read the Metric column rather than the name to see what each alarm measures.

Alarm Name	Metric	Namespace	Threshold	Period	Eval Periods	State
`Trinity-Beast-ElastiCache-CPU-High`	ElastiCacheProcessingUnits	`AWS/ElastiCache`	> 7,500 ECPU/s (75% of 10K limit)	`300s`	2	OK
`Trinity-Beast-ElastiCache-Memory-High`	BytesUsedForCache	`AWS/ElastiCache`	> 4 GB (80% of 5 GB limit)	`300s`	2	OK
`Trinity-Beast-ElastiCache-Throttled`	ThrottledCmds	`AWS/ElastiCache`	> 100 (any throttling indicates capacity pressure)	`300s`	2	OK
`Trinity-Beast-ElastiCache-Connections-High`	SuccessfulReadRequestLatency	`AWS/ElastiCache`	> 5ms p99 (latency degradation signal)	`300s`	3	OK
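The percentage-of-limit thresholds in the table can be reproduced with a few lines of arithmetic, which is handy when a limit is raised and the alarms need to follow. The variable names below are illustrative. Whether "GB" here means 10^9 or 2^30 bytes is not stated in this guide, so the sketch shows both readings rather than guessing.

```python
# Limits quoted in the table above (configured on the serverless cache).
ECPU_LIMIT_PER_SECOND = 10_000
STORAGE_LIMIT_GB = 5

# Integer arithmetic avoids floating-point rounding surprises.
ecpu_threshold = ECPU_LIMIT_PER_SECOND * 75 // 100   # 75% of the ECPU limit
storage_threshold_gb = STORAGE_LIMIT_GB * 80 // 100  # 80% of the storage limit

# BytesUsedForCache is reported in bytes, so convert the GB figure.
storage_threshold_decimal = storage_threshold_gb * 10**9  # GB as 10^9 bytes
storage_threshold_binary = storage_threshold_gb * 2**30   # GB as 2^30 bytes

print(ecpu_threshold)             # 7500
print(storage_threshold_decimal)  # 4000000000
print(storage_threshold_binary)   # 4294967296
```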

S3 (1 Alarm)

S3 Bucket Size (status: OK)

Alarm Name	Metric	Namespace	Threshold	Period	Eval Periods	State
`Trinity-Beast-S3-Size-Unusual-Growth`	BucketSizeBytes	`AWS/S3`	> 10 GB	`86400s`	1	OK

Security & API (4 Alarms)

WAF, GuardDuty, API Error Rates (status: OK)

Alarm Name	Metric	Namespace	Threshold	Period	Eval Periods	State
`TrinityBeast-WAF-HighBlockRate`	BlockedRequests	`AWS/WAFV2`	> 100	`300s`	1	OK
`TrinityBeast-API-5xx-Spike`	HTTPCode_Target_5XX_Count	`AWS/ApplicationELB`	> 10	`300s`	1	OK
`TrinityBeast-API-4xx-Spike`	HTTPCode_Target_4XX_Count	`AWS/ApplicationELB`	> 200	`300s`	1	OK
`TrinityBeast-GuardDuty-Finding`	finding	`AWS/GuardDuty`	> 0	`300s`	1	OK

6. SNS Notification Routing

All CloudWatch alarms route through a unified AutoOps pipeline. No raw text emails from AWS — every notification is formatted by the tbi-ops-notify Lambda before delivery via SES.

Unified Notification Flow (all through AutoOps)

┌─────────────────────────────────────────────────────────────────────────┐
│                      NOTIFICATION ROUTING                                │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  CloudWatch Alarm (any of 21 alarms)                                    │
│    ├─→ SNS: tbi-ops-notifications                                       │
│    │     └─→ tbi-ops-notify Lambda                                      │
│    │           └─→ Formatted HTML email (SES)                           │
│    │                 → CoryDeanKalani@CPMP-Site.org                      │
│    │                                                                    │
│    └─→ EventBridge: tbi-ops-alarm-trigger                               │
│          └─→ Step Function: tbi-ops-health-check-heal                   │
│                └─→ Self-heal → verify recovery → notify                 │
│                                                                         │
│  GuardDuty Finding (severity ≥ 7)                                       │
│    └─→ EventBridge: tbi-ops-guardduty-high-finding                      │
│          └─→ tbi-ops-bedrock-analyze Lambda                             │
│                └─→ AI threat assessment → auto-action → notify          │
│                                                                         │
│  Honeypot Hits (every 5 min)                                            │
│    └─→ EventBridge: tbi-ops-honeypot-queue-processor                    │
│          └─→ tbi-ops-honeypot-processor Lambda                          │
│                └─→ WAF IP block → notify                                │
│                                                                         │
│  Bedrock Threat Analysis (every 5 min)                                  │
│    └─→ EventBridge: tbi-ops-bedrock-analyze-schedule                    │
│          └─→ tbi-ops-bedrock-analyze Lambda                             │
│                └─→ Correlate signals → report → notify if HIGH/CRITICAL │
│                                                                         │
│  Support Ticket Submitted                                               │
│    └─→ Application invokes tbi-rhema-support Lambda                     │
│          └─→ Categorize → draft response → notify                       │
│                                                                         │
│  Daily/Weekly Digest (cron)                                             │
│    └─→ EventBridge: tbi-ops-daily-digest / tbi-ops-weekly-digest        │
│          └─→ tbi-ops-digest Lambda                                      │
│                └─→ Bedrock summary → formatted email                    │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

tbi-ops-notifications (PRIMARY: All Alerts)

Topic ARN: arn:aws:sns:us-east-2:211998422884:tbi-ops-notifications
Subscriber: tbi-ops-notify Lambda
Alarms Attached: 21
Delivery: Formatted HTML (SES)

Protocol	Endpoint	Purpose	Status
Lambda	`tbi-ops-notify`	Formats alert → sends HTML email via SES to `CoryDeanKalani@CPMP-Site.org`	Active

How it works: When any alarm transitions to ALARM or OK, SNS invokes the tbi-ops-notify Lambda. The Lambda parses the alarm payload, formats a branded HTML email with severity badges and context, and sends it via Amazon SES. Subject lines include severity: [INFO], [WARNING], [CRITICAL], [SELF-HEALED].

Sender: The Trinity Beast <No-Reply@CPMP-Site.org>
Recipient: CoryDeanKalani@CPMP-Site.org
Format: HTML email with dark theme, severity color coding, alarm details, and recommended actions.
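The parsing step described above can be sketched as follows. The event shape (`Records[0].Sns.Message` holding the alarm as a JSON string with `AlarmName` and `NewStateValue`) is the standard SNS-to-Lambda delivery format for CloudWatch alarms. The severity rules and the function names `severity_for` and `build_subject` are assumptions for illustration, loosely following the Critical/Warning labels in the playbook in section 10. The real tbi-ops-notify logic is not shown in this guide, and the [SELF-HEALED] tag is left out because its trigger is not described here.

```python
import json


def severity_for(new_state: str, alarm_name: str) -> str:
    """Map an alarm transition to a subject tag (illustrative rules only)."""
    if new_state == "ALARM":
        # Assumption: task-count and unhealthy-target alarms are critical.
        critical_suffixes = ("Count-Low", "UnhealthyTargets")
        return "CRITICAL" if alarm_name.endswith(critical_suffixes) else "WARNING"
    return "INFO"  # OK and INSUFFICIENT_DATA transitions


def build_subject(event: dict) -> str:
    """Extract the CloudWatch alarm from an SNS-triggered Lambda event."""
    sns = event["Records"][0]["Sns"]
    alarm = json.loads(sns["Message"])  # SNS delivers the alarm as a JSON string
    name, state = alarm["AlarmName"], alarm["NewStateValue"]
    return f"[{severity_for(state, name)}] {name} is {state}"


# Fabricated sample event, trimmed to the fields the sketch reads.
sample_event = {
    "Records": [
        {
            "Sns": {
                "Message": json.dumps(
                    {
                        "AlarmName": "Trinity-Beast-Main-Service-Count-Low",
                        "NewStateValue": "ALARM",
                        "NewStateReason": "Threshold Crossed",
                    }
                )
            }
        }
    ]
}

print(build_subject(sample_event))
# [CRITICAL] Trinity-Beast-Main-Service-Count-Low is ALARM
```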

Trinity-Beast-Critical-Alerts (LEGACY: No Alarms Attached)

Status: This topic is retained for potential future SMS escalation, but no alarms currently target it. All 21 alarms were migrated to tbi-ops-notifications on May 15, 2026. The email and SMS subscriptions remain in place as a backup escalation channel; they are listed as inactive below because nothing publishes to the topic.

Protocol	Endpoint	Status
Email	`CoryDeanKalani@CPMP-Site.org`	Inactive (no triggers)
`SMS`	`+16156128200`	Inactive (no triggers)

Design Decision (May 2026): All notifications route through a single Lambda (tbi-ops-notify) for consistent formatting, content review, and delivery control. This eliminates raw AWS text emails and ensures every alert arrives as a branded, readable HTML message with actionable context. AWS User Notifications service was disabled — it was sending unformatted alarm summaries that bypassed the AutoOps pipeline.

7. CloudWatch Log Groups

30 CloudWatch log groups capture application and infrastructure output. All groups have explicit 90-day retention (the Trinity Beast multiples-of-3 convention: 3 × 30), except /aws/cloudtrail/trinity-beast, which keeps 30 days in CloudWatch alongside an indefinite S3 archive. For infrastructure-level logs (ALB access, CloudFront, WAF requests, S3 access), see the companion Unified Logs & Observability document; those flow to a dedicated S3 bucket for long-term archive and Athena query.

Unified logging (June 2026): CloudWatch handles real-time, instant-search application logs. The S3 bucket aws-waf-logs-trinity-beast handles high-volume infrastructure logs (WAF full request logs, ALB access logs, CloudFront standard logs, S3 server access logs) with a 365-day lifecycle. This two-tier split keeps CloudWatch costs low while giving maximum retention and Athena queryability on the bulk data. Full architecture documented in Unified Logs & Observability.

ECS Container Logs

Log Group	Retention	Source
`/aws/ecs/trinity-beast`	90 days	All 4 services (Main, Mirror, LRS, Webhook) — UME self-identifies via `cluster_node`
`/aws/ecs/trinity-beast-sync`	90 days	BeastReconciler (nightly sync job)
`/ecs/tbi-translate-worker`	90 days	BeastTranslate (persistent SQS poller + batch orchestrator)

Lambda Function Logs (14 Functions)

Log Group	Retention	Function
`/aws/lambda/trinity-beast-receipt`	90 days	Stripe receipt processing
`/aws/lambda/trinity-beast-queued-writer`	90 days	SQS → Aurora batch inserts
`/aws/lambda/tbi-ops-notify`	90 days	Formatted SES notifications
`/aws/lambda/tbi-ops-self-heal`	90 days	ECS task restart automation
`/aws/lambda/tbi-ops-waf-action`	90 days	WAF rule management
`/aws/lambda/tbi-ops-honeypot-processor`	90 days	Honeypot queue → WAF blocks
`/aws/lambda/tbi-ops-bedrock-analyze`	90 days	AI threat correlation
`/aws/lambda/tbi-rhema-support`	90 days	AI support assistant
`/aws/lambda/tbi-ops-digest`	90 days	Daily/weekly digest
`/aws/lambda/tbi-translate-deploy`	90 days	Legacy (Lambda deleted 2026-06-27 — no new writes)
`/aws/lambda/tbi-translate-finalize`	90 days	Legacy (Lambda deleted 2026-06-27 — no new writes)
`/aws/lambda/tbi-translate-init`	90 days	Legacy (Lambda deleted 2026-06-27 — no new writes)
`/aws/lambda/tbi-translate-batch-prepare`	90 days	Legacy (Lambda deleted 2026-06-19 — no new writes)
`/aws/lambda/tbi-translate-batch-submit`	90 days	Legacy (Lambda deleted 2026-06-19 — no new writes)

Infrastructure & Database Logs

Log Group	Retention	Source
`/aws/rds/cluster/trinity-beast-aurora-cluster/postgresql`	90 days	Aurora slow queries (>1s), lock waits, errors
`/aws/elasticache/trinity-beast-cache/slow-log`	90 days	Valkey commands exceeding slowlog threshold
`/aws/elasticache/trinity-beast-cache/engine-log`	90 days	Valkey engine events (startup, failover, persistence)
`/aws/vpc/trinity-beast-flowlogs`	90 days	VPC Flow Logs (both VPCs)
`/aws/cloudtrail/trinity-beast`	30 days	CloudTrail API audit (S3 archive is indefinite)
`/aws/ecs/containerinsights/trinity-beast-fargate-cluster/performance`	90 days	Container Insights (CPU, memory, network per task)
`RDSOSMetrics`	90 days	Aurora Enhanced Monitoring (OS-level, every 30s)
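One way to keep the retention convention auditable is to express it as data. The sketch below builds arguments for the CloudWatch Logs PutRetentionPolicy API, using the 90-day default and the single 30-day exception from the table. The helper name `retention_requests` and the shortened group list are illustrative, and nothing here calls AWS.

```python
# Sketch: arguments for CloudWatch Logs PutRetentionPolicy
# (e.g. boto3's logs.put_retention_policy(**request)).
DEFAULT_RETENTION_DAYS = 3 * 30  # the multiples-of-3 convention: 90 days

# Groups that deviate from the default, per the table above.
OVERRIDES = {"/aws/cloudtrail/trinity-beast": 30}

# A few groups from this section; the full list has more entries.
LOG_GROUPS = [
    "/aws/ecs/trinity-beast",
    "/aws/lambda/tbi-ops-notify",
    "/aws/vpc/trinity-beast-flowlogs",
    "/aws/cloudtrail/trinity-beast",
]


def retention_requests(groups):
    """Return one PutRetentionPolicy argument dict per log group."""
    return [
        {
            "logGroupName": group,
            "retentionInDays": OVERRIDES.get(group, DEFAULT_RETENTION_DAYS),
        }
        for group in groups
    ]


requests = retention_requests(LOG_GROUPS)
for request in requests:
    print(request["logGroupName"], request["retentionInDays"])
```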

S3-Delivered Logs (Not in CloudWatch)

The following log types are delivered directly to S3 bucket aws-waf-logs-trinity-beast rather than CloudWatch. They are high-volume, low-urgency logs best queried via download + gunzip + jq, or via Athena (future). Full details in Unified Logs & Observability.

S3 Prefix	Source	Retention
`AWSLogs/.../WAFLogs/`	WAF full request logs (every API request — IP, URI, headers, rule match)	365 days (S3 lifecycle)
`alb/`	ALB access logs (per-request latency, status, target)	365 days
`cloudfront/`	CloudFront standard logs (every website request — edge, cache status)	365 days
`s3-access/`	S3 server access logs (object-level access audit)	365 days
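For readers who prefer a script to the download + gunzip + jq route, the following sketch does the equivalent for one WAF log object: it decompresses the gzipped JSON-lines content and counts blocked requests per client IP. The `action` and `httpRequest.clientIp` fields are part of the standard WAF log format. The function name `blocked_ips` and the sample records are fabricated for illustration, and fetching the object from S3 is left out.

```python
import gzip
import io
import json
from collections import Counter


def blocked_ips(gz_bytes: bytes) -> Counter:
    """Count BLOCK actions per client IP in one gzipped WAF log object."""
    counts = Counter()
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", encoding="utf-8") as lines:
        for line in lines:
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)  # one JSON document per line
            if record.get("action") == "BLOCK":
                counts[record["httpRequest"]["clientIp"]] += 1
    return counts


# Stand-in for a downloaded object: two fabricated records, gzipped in memory.
sample_records = [
    {"action": "BLOCK", "httpRequest": {"clientIp": "203.0.113.7", "uri": "/admin"}},
    {"action": "ALLOW", "httpRequest": {"clientIp": "198.51.100.4", "uri": "/"}},
]
sample_object = gzip.compress(
    "\n".join(json.dumps(r) for r in sample_records).encode("utf-8")
)

print(blocked_ips(sample_object))
```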

8. Custom Metrics (TrinityBeast Namespace)

The application publishes custom metrics to two CloudWatch namespaces, providing business-level observability beyond standard AWS metrics.

TrinityBeast/LPO Custom Namespace

Metrics published by the Live Price Oracle service:

Metric	Description
`Requests`	Total LPO requests received
`CacheHits`	Requests served from ElastiCache cache
`CacheMisses`	Requests requiring upstream source fetch
`Errors`	Failed requests (all error types)
`SourceFailovers`	Times a primary source failed and secondary was used
`AvgLatency`	Average response time in milliseconds
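To make the shape of these metrics concrete, here is a hedged sketch of a PutMetricData payload for the TrinityBeast/LPO namespace, plus the hit-rate calculation that a widget such as Cache Hit Rate (%) could be derived from. The units, the absence of dimensions, and the function names are assumptions; how the service actually batches and publishes metrics is not documented here.

```python
def lpo_metric_data(requests: int, hits: int, misses: int, avg_latency_ms: float) -> dict:
    """Build PutMetricData arguments (hypothetical helper; makes no AWS call)."""
    return {
        "Namespace": "TrinityBeast/LPO",
        "MetricData": [
            {"MetricName": "Requests", "Value": requests, "Unit": "Count"},
            {"MetricName": "CacheHits", "Value": hits, "Unit": "Count"},
            {"MetricName": "CacheMisses", "Value": misses, "Unit": "Count"},
            {"MetricName": "AvgLatency", "Value": avg_latency_ms, "Unit": "Milliseconds"},
        ],
    }


def cache_hit_rate(hits: int, misses: int) -> float:
    """Hit rate as a percentage; 0.0 when there were no lookups."""
    total = hits + misses
    return 100.0 * hits / total if total else 0.0


payload = lpo_metric_data(requests=100, hits=90, misses=10, avg_latency_ms=12.5)
print(payload["Namespace"], len(payload["MetricData"]))
print(cache_hit_rate(90, 10))  # 90.0
```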

TrinityBeast/LRS Custom Namespace

Metrics published by the Live Report Service:

Metric	Description
`Requests`	Total LRS report requests
`AvgLatency`	Average report generation time in milliseconds
`Errors`	Failed report generations
`MonthlyLimitExceeded`	Requests rejected due to monthly quota
`DailyLimitExceeded`	Requests rejected due to daily quota
`AddOnRequests`	Requests using add-on quota beyond base plan

9. AutoOps — Autonomous Operations Monitoring

The 5-layer AutoOps system has its own monitoring footprint — 7 Lambda functions, 6 EventBridge rules, 4 anomaly detection alarms, and a dedicated SNS topic. All feed into the Security Dashboard.

Anomaly Detection Alarms (4 Alarms)

ML-Based Anomaly Detection (band width: 3σ)

These alarms use CloudWatch Anomaly Detection (machine learning) to learn normal traffic patterns and alert on deviations. They need ~2 weeks to build a baseline.

Alarm Name	Metric	Direction	Catches
`TrinityBeast-Anomaly-RequestRate`	ALB RequestCount (Sum)	Both (↑↓)	Traffic drops (outage) or unexpected spikes (attack)
`TrinityBeast-Anomaly-Latency`	ALB TargetResponseTime (Avg)	Above only (↑)	Slow degradation, DB bottlenecks
`TrinityBeast-Anomaly-ErrorRate`	ALB 5xx Count (Sum)	Above only (↑)	Error spikes beyond normal noise
`TrinityBeast-Anomaly-CacheHitRate`	ElastiCache CacheHitRate (Avg)	Below only (↓)	Cache evictions, Valkey issues

Configuration: 3 evaluation periods, 2 datapoints to alarm, treat missing data as notBreaching. Every alarm publishes to the SNS topic tbi-ops-notifications and also matches the tbi-ops-alarm-trigger EventBridge rule, which starts the health-check-heal Step Function.
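An anomaly detection alarm differs from a static one in that it compares the metric against a band expression instead of a fixed threshold. The sketch below shows how the settings above (band width 3, 3 evaluation periods, 2 datapoints to alarm, notBreaching) might be expressed as PutMetricAlarm arguments. The 300-second period, the load balancer dimension value, and the helper name `anomaly_alarm` are assumptions, since this guide does not state them.

```python
SNS_TOPIC_ARN = "arn:aws:sns:us-east-2:211998422884:tbi-ops-notifications"

# The three alert directions used in the table above.
DIRECTION = {
    "both": "LessThanLowerOrGreaterThanUpperThreshold",
    "above": "GreaterThanUpperThreshold",
    "below": "LessThanLowerThreshold",
}


def anomaly_alarm(name, namespace, metric, stat, dimensions, direction, period=300):
    """Build PutMetricAlarm arguments for a band-based alarm (no AWS call)."""
    return {
        "AlarmName": name,
        "ComparisonOperator": DIRECTION[direction],
        "EvaluationPeriods": 3,
        "DatapointsToAlarm": 2,
        "TreatMissingData": "notBreaching",
        "ThresholdMetricId": "band",  # compare m1 against the band, not a number
        "Metrics": [
            {
                "Id": "m1",
                "ReturnData": True,
                "MetricStat": {
                    "Metric": {
                        "Namespace": namespace,
                        "MetricName": metric,
                        "Dimensions": dimensions,
                    },
                    "Period": period,  # assumed; not stated in this guide
                    "Stat": stat,
                },
            },
            {
                "Id": "band",
                "Expression": "ANOMALY_DETECTION_BAND(m1, 3)",  # 3 = band width
                "Label": "expected range",
                "ReturnData": True,
            },
        ],
        "AlarmActions": [SNS_TOPIC_ARN],
    }


request_rate = anomaly_alarm(
    "TrinityBeast-Anomaly-RequestRate",
    "AWS/ApplicationELB",
    "RequestCount",
    "Sum",
    # Hypothetical load balancer identifier, for illustration only.
    [{"Name": "LoadBalancer", "Value": "app/example-alb/0123456789abcdef"}],
    "both",
)
print(request_rate["ComparisonOperator"])
print(request_rate["Metrics"][1]["Expression"])
```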

AutoOps Lambda Functions (7 Functions)

AutoOps Lambda Metrics (1770 MB each)

All 7 functions share the tbi-autonomous-ops-role IAM role. Metrics visible on the Security Dashboard.

Function	Purpose	Log Group
`tbi-ops-notify`	SNS notifications with severity levels	`/aws/lambda/tbi-ops-notify`
`tbi-ops-self-heal`	ECS task restart, force-deploy	`/aws/lambda/tbi-ops-self-heal`
`tbi-ops-waf-action`	WAF IP set block/unblock	`/aws/lambda/tbi-ops-waf-action`
`tbi-ops-honeypot-processor`	Drain honeypot queue → WAF block	`/aws/lambda/tbi-ops-honeypot-processor`
`tbi-ops-bedrock-analyze`	AI threat analysis via Bedrock	`/aws/lambda/tbi-ops-bedrock-analyze`
`tbi-rhema-support`	AI ticket categorization + drafts	`/aws/lambda/tbi-rhema-support`
`tbi-ops-digest`	Daily/weekly operational digests	`/aws/lambda/tbi-ops-digest`

EventBridge Rules (6 Rules)

AutoOps Event Routing (all enabled)

Rule Name	Trigger	Target
`tbi-ops-alarm-trigger`	CloudWatch alarm → ALARM	Step Function: health-check-heal
`tbi-ops-honeypot-queue-processor`	rate(5 minutes)	Lambda: honeypot-processor
`tbi-ops-bedrock-analyze-schedule`	rate(5 minutes)	Lambda: bedrock-analyze
`tbi-ops-guardduty-high-finding`	GuardDuty severity ≥ 7	Lambda: bedrock-analyze
`tbi-ops-daily-digest`	`cron(0 11 * * ? *)` (11:00 UTC, 6 AM EST)	Lambda: digest
`tbi-ops-weekly-digest`	`cron(0 12 ? * MON *)` (Mon 12:00 UTC, 7 AM EST)	Lambda: digest
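The six rules fall into two kinds: event-pattern rules (alarm state changes, GuardDuty findings) and schedule rules. The sketch below writes them out as EventBridge PutRule arguments. The schedule expressions and rule names come from the table, while the two event patterns are plausible reconstructions using the standard CloudWatch and GuardDuty event shapes, not the deployed definitions.

```python
import json

# Assumed pattern: any CloudWatch alarm entering the ALARM state.
ALARM_PATTERN = {
    "source": ["aws.cloudwatch"],
    "detail-type": ["CloudWatch Alarm State Change"],
    "detail": {"state": {"value": ["ALARM"]}},
}

# Assumed pattern: GuardDuty findings with severity of 7 or higher.
GUARDDUTY_PATTERN = {
    "source": ["aws.guardduty"],
    "detail-type": ["GuardDuty Finding"],
    "detail": {"severity": [{"numeric": [">=", 7]}]},
}

# Arguments for EventBridge PutRule (e.g. boto3's events.put_rule(**rule)).
RULES = [
    {"Name": "tbi-ops-alarm-trigger", "EventPattern": json.dumps(ALARM_PATTERN)},
    {"Name": "tbi-ops-guardduty-high-finding", "EventPattern": json.dumps(GUARDDUTY_PATTERN)},
    {"Name": "tbi-ops-honeypot-queue-processor", "ScheduleExpression": "rate(5 minutes)"},
    {"Name": "tbi-ops-bedrock-analyze-schedule", "ScheduleExpression": "rate(5 minutes)"},
    {"Name": "tbi-ops-daily-digest", "ScheduleExpression": "cron(0 11 * * ? *)"},
    {"Name": "tbi-ops-weekly-digest", "ScheduleExpression": "cron(0 12 ? * MON *)"},
]
for rule in RULES:
    rule["State"] = "ENABLED"

for rule in RULES:
    kind = "pattern" if "EventPattern" in rule else "schedule"
    print(f"{rule['Name']}: {kind}")
```

Rule cron expressions are evaluated in UTC, so the digest times shift by an hour in local terms when daylight saving time is in effect.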

AutoOps SNS Topic

Topic: tbi-ops-notifications (arn:aws:sns:us-east-2:211998422884:tbi-ops-notifications)
Subscriber: tbi-ops-notify Lambda (formats + sends via SES to CoryDeanKalani@CPMP-Site.org)
Severity levels in subject: [INFO], [WARNING], [CRITICAL], [SELF-HEALED]
All 21 alarms route here — no raw AWS emails, everything formatted by Lambda.

10. Alarm Response Playbook

When an alarm fires, use the following runbooks to diagnose and resolve the issue. Each category includes the most common root causes and recommended actions.

ALB/NLB Unhealthy Targets (Critical)

Alarms: Trinity-Beast-ALB-UnhealthyTargets, Trinity-Beast-NLB-UnhealthyTargets

What it means: One or more ECS tasks are failing health checks from the load balancer.

Check ECS service health in the console — are tasks running or in a crash loop?
Review container logs in /aws/ecs/trinity-beast for startup errors or OOM kills
Verify target group health check path and expected response code
Check if a recent deployment introduced a breaking change
If tasks are running but unhealthy, check application health endpoint directly

ECS CPU High (Warning)

Alarms: Trinity-Beast-ECS-CPU-High, Trinity-Beast-ECS-CPU-High-Mirror, Trinity-Beast-ECS-CPU-High-LRS

What it means: An ECS service is consuming more than 80% CPU over a sustained period.

Check for a traffic spike — correlate with LPO/LRS request metrics on the Application Dashboard
Consider scaling the service — increase desired task count or adjust auto-scaling thresholds
Check for runaway goroutines or infinite loops in recent deployments
Review Container Insights for per-task CPU breakdown
If sustained, evaluate whether the task CPU allocation (vCPU) needs to be increased

Service Count Low (Critical)

Alarms: Trinity-Beast-Main-Service-Count-Low, Trinity-Beast-Mirror-Service-Count-Low, Trinity-Beast-LRS-Service-Count-Low

What it means: No tasks are running for the service, usually because a container crashed or failed to start. These alarms use TreatMissing: breaching, so missing data also triggers the alarm.

Check ECS service events for task stopped reasons (OOM, exit code, health check failure)
Review container logs for the last running task — look for panic, fatal, or OOM messages
Check if the ECR image exists and is pullable (image pull failures)
Verify the task execution role has required permissions
Manually start a new task if the service is not recovering automatically

Aurora CPU High (Warning)

Alarm: Trinity-Beast-Aurora-CPU-High

What it means: The Aurora Serverless v2 cluster is consuming more than 80% CPU.

Check for slow queries — use Performance Insights or pg_stat_statements
Verify ACU scaling — is the cluster at max ACU and still under pressure?
Check if the nightly sync job is running and creating batch write pressure
Look for missing indexes on frequently queried columns
Consider increasing the max ACU limit if load is legitimate

Aurora Connections High (Warning)

Alarm: Trinity-Beast-Aurora-Connections-High

What it means: More than 80 active database connections — approaching the connection limit.

Check connection pool settings in the application — are pools sized correctly?
Look for connection leaks — connections opened but never returned to the pool
Verify that the sync job and Lambda are not opening excessive connections
Consider using RDS Proxy if connection pressure is persistent
Check if a recent deployment changed pool configuration

ElastiCache Serverless: ECPU / Storage / Throttling / Latency (Warning)

Alarms: Trinity-Beast-ElastiCache-CPU-High, Trinity-Beast-ElastiCache-Memory-High, Trinity-Beast-ElastiCache-Throttled, Trinity-Beast-ElastiCache-Connections-High

What it means: The ElastiCache Serverless cache is approaching its configured limits or experiencing latency degradation. Unlike node-based caches, Serverless doesn't run out of "CPU" — it runs out of provisioned capacity (ECPU) or storage.

ECPU alarm (ElastiCacheProcessingUnits): Command volume is nearing the 10,000 ECPU/s limit. Check for cache stampedes, hot keys, or a spike in traffic. Raise the ECPU limit in the console if sustained growth justifies it.
BytesUsedForCache alarm: Storage is approaching the 5 GB ceiling. Review TTL settings — are cached items living too long? Check for large keys (LRS report caches, search indexes) consuming disproportionate space. Raise the storage limit or prune stale keys.
ThrottledCmds alarm: Commands are being rejected because capacity was exceeded. This is the most urgent signal — it means requests are failing. Immediately raise the ECPU or storage limit. Then investigate what caused the spike.
Latency alarm (SuccessfulReadRequestLatency): p99 reads are taking longer than expected. This can indicate the Serverless proxy is under load or network path issues. Check VPC flow logs and ensure the security group allows traffic from all ECS subnets.
Remember: ElastiCache is a pure cache layer — Aurora is the source of truth. If the cache is under sustained pressure, the system degrades gracefully (slower reads, no data loss). Raising limits is a one-click operation with zero downtime.

S3 Unusual Size Growth (Low Priority)

Alarm: Trinity-Beast-S3-Size-Unusual-Growth

What it means: The S3 bucket has exceeded 10 GB, which may indicate unexpected data accumulation.

Check for unexpected uploads — review S3 access logs or CloudTrail for PutObject events
Look for log file accumulation — are old log exports or reports piling up?
Verify lifecycle policies are in place to expire or transition old objects
Check if the LRS report output is being stored without cleanup
Review bucket versioning — old versions may be consuming space