The Trinity Beast – AI AutoOps Implementation

1. Overview

The Trinity Beast Infrastructure (TBI) employs a multi-layered, proactive AI and automation system that monitors, defends, heals, and reports on the infrastructure autonomously. Rather than waiting for problems to escalate, the system detects anomalies in real time, correlates threat signals using AI, and takes corrective action, all before a human needs to intervene.

AI Model

Claude Sonnet 4.6 (Bedrock)

AutoOps Lambdas

7 Functions

Event Rules

6 EventBridge

Workflows

1 Step Function

Anomaly Detectors

4 ML Models

Honeypot Traps

12 Endpoints

Translation

12 Languages

Monthly Cost

~$10–15/mo

Design Philosophy

Self-heal first

All 5 Layers

LIVE ✅

Core Principle: Self-heal first, notify second. If the system can fix it, fix it and tell Cory after. Escalate fast on unknowns. Never suppress critical alerts. Bedrock is advisory — AI suggests, Step Functions decide, Lambda acts.

AWS Services Utilized

Service	Role in TBI	Layer
Amazon EventBridge	Event routing — connects CloudWatch alarms, scheduled triggers, and custom events to automated workflows	Automation
AWS Step Functions	Workflow orchestration — multi-step heal/verify/notify sequences with retry logic and branching	Orchestration
Amazon Bedrock (Claude Sonnet 4.6)	AI threat analysis — pattern correlation, severity assessment, plain-English reports, auto-action recommendations	Intelligence
AWS Lambda (7 Go functions)	Action execution — self-heal, WAF management, notifications, honeypot processing, AI analysis, support automation, operational digests	Execution
AWS WAF	Automated IP blocking — honeypot-triggered and AI-recommended blocks applied to the API WAF	Defense
Amazon SNS	Operational notifications — severity-tagged alerts delivered to email with full context	Communication
Amazon Translate	Multi-lingual content — 384 translated documents, real-time support correspondence translation	Localization
Amazon CloudWatch	Metrics, alarms, anomaly detection — feeds events into EventBridge for automated response	Observability
Amazon GuardDuty	Threat detection — VPC flow logs, CloudTrail, DNS analysis for automated security response	Security
Amazon SES	Transactional email — receipts, notifications, and AI-drafted correspondence in 12 languages	Communication
Amazon SQS	Decoupled processing — usage log pipeline, honeypot auto-block queue	Messaging
ElastiCache (Valkey)	Operational state — threat reports, action logs, self-heal counters, honeypot data, search indexes	State

2. Full System Diagram

The complete AutoOps pipeline showing all 7 Lambda functions, AI services, EventBridge rules, and data flows — with per-Lambda binary sizes and roles. See the legend below the diagram for common Lambda configuration shared across all functions.

Diagram 2.1 — AutoOps Lambda Pipeline — All Components & Configuration

flowchart TB
    subgraph Sources["Event Sources"]
        CWA["☁️ CloudWatch Alarms\n5xx spike · Health fail\nACU ceiling · Anomaly"]
        GD["🛡️ GuardDuty\nVPC flow · CloudTrail\nDNS · Severity ≥ 7"]
        HP["🪤 Honeypot Traps\n12 decoy endpoints\n2-hit auto-block threshold"]
        SCHED["⏱️ Scheduled Events\nEvery 5 min · Daily 6AM\nMonday 7AM EST"]
    end

    subgraph EB["EventBridge (6 Rules)"]
        R1["tbi-ops-alarm-trigger\nCW alarm → Step Function"]
        R2["tbi-ops-honeypot-queue\nEvery 5 min → honeypot-processor"]
        R3["tbi-ops-bedrock-analyze-schedule\nEvery 5 min → bedrock-analyze"]
        R4["tbi-ops-guardduty-high-finding\nGuardDuty severity ≥ 7 → bedrock-analyze"]
        R5["tbi-ops-daily-digest\ncron 0 11 * * ? * → digest"]
        R6["tbi-ops-weekly-digest\ncron 0 12 ? * MON * → digest"]
    end

    subgraph SF["Step Functions"]
        SFN["tbi-ops-health-check-heal\nCheck → Wait 60s → Recheck\n→ Force-Deploy → Verify → Notify"]
    end

    subgraph LambdaLayer["Lambda Functions — Go · provided.al2023 · 1 vCPU · 1770 MB · 60s timeout · Not in VPC"]
        L1["tbi-ops-self-heal\n━━━━━━━━━━━━━━━━━━\n📦 8.5 MB\n🔧 ECS force-deploy\nrestart-task · check-health\nLogs to Valkey"]
        L2["tbi-ops-waf-action\n━━━━━━━━━━━━━━━━━━\n📦 8.2 MB\n🔧 block-ips · unblock-ips\nlist-blocked\nManages WAF IP set"]
        L3["tbi-ops-notify\n━━━━━━━━━━━━━━━━━━\n📦 8.3 MB\n🔧 Branded HTML email\nSES delivery · SNS fallback\n[INFO/WARN/CRIT/HEALED]"]
        L4["tbi-ops-honeypot-processor\n━━━━━━━━━━━━━━━━━━\n📦 8.2 MB\n🔧 Drain autoblock queue\nDeduplicate IPs\nInvoke waf-action + notify"]
        L5["tbi-ops-bedrock-analyze\n━━━━━━━━━━━━━━━━━━\n📦 8.5 MB\n🔧 Gather threat signals\nQuiet check → Bedrock AI\nAuto-act · Store · Notify"]
        L6["tbi-rhema-support\n━━━━━━━━━━━━━━━━━━\n📦 8.5 MB\n🔧 Rhema Support Assistant\nAI draft response · Knowledge gaps\nPre-fetch diagnostics · Notify"]
        L7["tbi-ops-digest\n━━━━━━━━━━━━━━━━━━\n📦 8.5 MB · ⏱️ 180s timeout\n🔧 Daily 300-word summary\nWeekly newsletter pipeline\nBedrock narrative · Notify"]
    end

    subgraph AI["Amazon Bedrock"]
        Claude["Claude Sonnet 4.6\nus.anthropic.claude-sonnet-4-6\nInference Profile · No setup\nPattern correlation\nThreat assessment\nSupport drafting\nOperational narratives"]
    end

    subgraph WAFDef["WAF Defense"]
        IPSet["tbi-autoops-blocked-ips\nWAF IP Set\nID: 8d55de25-8ba5-4982-8c41-f4316c9bd50d"]
        WRule["AutoOps-BlockedIPs\nPriority 7\ntrinity-beast-api-waf"]
    end

    subgraph NotifyLayer["Notifications"]
        SNS["tbi-ops-notifications\nSNS Topic\nARN: arn:aws:sns:us-east-2:211998422884:..."]
        SES["Amazon SES\nBranded HTML email\nNo-Reply@CPMP-Site.org → Cory"]
    end

    subgraph ValkeyState["Valkey State (ElastiCache)"]
        VK["autoops:threats:daily\nautoops:actions:log\nautoops:self-heal:count\nautoops:digest:daily · weekly\nsupport:ticket:{id}\nhoneypot:autoblock_queue\nreport:text:YYYY-MM-DD"]
    end

    subgraph Anomaly["Layer 3 — Anomaly Detection"]
        AD["4 CloudWatch ML Models\nRequestCount · TargetResponseTime\nHTTPCode_5XX · CacheHitRate\nBand width: 2σ · 3 eval periods"]
    end

    %% Event sources → EventBridge
    CWA -->|"alarm state change"| R1
    CWA -->|"alarm state change"| R3
    GD -->|"severity ≥ 7"| R4
    HP -->|"hit logged to queue"| R2
    SCHED --> R2
    SCHED --> R3
    SCHED --> R5
    SCHED --> R6
    AD -->|"anomaly alarm"| R1

    %% EventBridge → targets
    R1 --> SFN
    R2 --> L4
    R3 --> L5
    R4 --> L5
    R5 --> L7
    R6 --> L7

    %% Step Function → Lambdas
    SFN -->|"check-health / force-deploy"| L1
    SFN -->|"notify result"| L3

    %% Lambda → Lambda
    L4 -->|"invoke sync"| L2
    L4 -->|"invoke async"| L3
    L5 -->|"invoke async"| L3
    L5 -->|"auto-block sync"| L2
    L6 -->|"invoke async"| L3
    L7 -->|"invoke async"| L3

    %% Lambda → AI
    L5 -->|"InvokeModel"| Claude
    L6 -->|"InvokeModel"| Claude
    L7 -->|"InvokeModel"| Claude

    %% Lambda → WAF
    L2 --> IPSet
    IPSet --> WRule

    %% Lambda → Notifications
    L3 --> SNS
    SNS --> SES

    %% Lambda → Valkey
    L1 -->|"log action"| VK
    L4 -->|"drain queue"| VK
    L5 -->|"store threat report"| VK
    L6 -->|"store ticket analysis"| VK
    L7 -->|"store digest"| VK

    %% Styling
    style Sources fill:#1e293b,stroke:#ef4444,color:#e2e8f0
    style EB fill:#1e293b,stroke:#FF9900,color:#e2e8f0
    style SF fill:#1e293b,stroke:#60a5fa,color:#e2e8f0
    style LambdaLayer fill:#1e293b,stroke:#10b981,color:#e2e8f0
    style AI fill:#1e293b,stroke:#a78bfa,color:#e2e8f0
    style WAFDef fill:#1e293b,stroke:#ef4444,color:#e2e8f0
    style NotifyLayer fill:#1e293b,stroke:#f59e0b,color:#e2e8f0
    style ValkeyState fill:#1e293b,stroke:#06b6d4,color:#e2e8f0
    style Anomaly fill:#1e293b,stroke:#818cf8,color:#e2e8f0

⚙️ Lambda Common Configuration (All 7 Functions)

Runtime: Go · provided.al2023

CPU: 1 vCPU (1770 MB = 1 full vCPU allocation)

Memory: 1770 MB (multiple of 3; just past the 1,769 MB mark where Lambda provides the equivalent of one full vCPU, for fast execution and short cold starts)

Timeout: 60s (except tbi-ops-digest: 180s)

VPC: Not in VPC — uses public API endpoints + AWS service APIs

IAM Role: tbi-autonomous-ops-role (shared)

Architecture: Linux x86-64 (ELF) — cross-compiled with GOOS=linux GOARCH=amd64

Deploy: bash scripts/kcc.sh deploy-lambda autoops <name>

📦 Binary Sizes (Deployed)

tbi-ops-notify

8.3 MB

tbi-ops-self-heal

8.5 MB

tbi-ops-waf-action

8.2 MB

tbi-ops-honeypot-processor

8.2 MB

tbi-ops-bedrock-analyze

8.5 MB

tbi-rhema-support

8.5 MB

tbi-ops-digest

8.5 MB

🗺️ Diagram Color Key

Event Sources & WAF Defense

EventBridge — event routing & scheduling

Step Functions — workflow orchestration

Lambda Functions — Go action execution

Amazon Bedrock — AI analysis & generation

Notifications — SNS → SES email pipeline

Valkey State — operational data & action logs

Anomaly Detection — ML-based predictive alarms

3. Event-Driven Automation (EventBridge)

Amazon EventBridge is the central nervous system of AutoOps. It receives events from across the infrastructure and routes them to the appropriate automated response.

How It Works

Pattern: Event Source → EventBridge Rule (pattern match) → Target (Step Function or Lambda)

Every CloudWatch alarm state change, every GuardDuty finding, and every scheduled interval generates an event. EventBridge rules match these events by pattern and invoke the correct automated response — no polling, no delays.

Diagram 3.1 — EventBridge Event Flow

graph LR
    A["CloudWatch Alarm\nState: ALARM"] -->|"Event Pattern"| EB["EventBridge\nEvent Bus"]
    B["Scheduled\nrate(5 minutes)"] -->|"Schedule"| EB
    EB -->|"tbi-ops-alarm-trigger"| SF["Step Function\nhealth-check-heal"]
    EB -->|"tbi-ops-honeypot-queue"| L["Lambda\nhoneypot-processor"]

    style EB fill:#1e293b,stroke:#FF9900,color:#e2e8f0
    style SF fill:#1e293b,stroke:#60a5fa,color:#e2e8f0
    style L fill:#1e293b,stroke:#10b981,color:#e2e8f0

Active Rules

Rule Name	Event Pattern	Target	Purpose
`tbi-ops-alarm-trigger`	CloudWatch alarm state change → ALARM	Step Function: `tbi-ops-health-check-heal`	Triggers automated health check and self-healing workflow when any alarm fires
`tbi-ops-honeypot-queue-processor`	Scheduled: every 5 minutes	Lambda: `tbi-ops-honeypot-processor`	Drains the honeypot auto-block queue and applies WAF rules for repeat offenders
`tbi-ops-bedrock-analyze-schedule`	Scheduled: every 5 minutes	Lambda: `tbi-ops-bedrock-analyze`	Periodic AI threat analysis — skips Bedrock when infrastructure is quiet (cost-conscious)
`tbi-ops-guardduty-high-finding`	GuardDuty Finding, severity ≥ 7	Lambda: `tbi-ops-bedrock-analyze`	Immediate AI threat analysis on HIGH/CRITICAL GuardDuty findings
`trinity-beast-nightly-sync`	Scheduled: `cron(0 6 * * ? *)` (1 AM EST)	ECS Task: `trinity-beast-sync-job`	Nightly Aurora → Valkey sync + search index rebuild
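As an illustration of the pattern-matching step, the sketch below builds two event patterns in Go: one for CloudWatch alarms entering ALARM and one for GuardDuty findings with severity 7 or higher. The `source` and `detail-type` values and the numeric matching syntax are standard EventBridge; whether the deployed `tbi-ops-alarm-trigger` and `tbi-ops-guardduty-high-finding` rules use exactly these patterns is an assumption, and the function names are hypothetical.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"strings"
)

// alarmPattern matches CloudWatch alarm state changes whose new state is ALARM.
func alarmPattern() map[string]any {
	return map[string]any{
		"source":      []string{"aws.cloudwatch"},
		"detail-type": []string{"CloudWatch Alarm State Change"},
		"detail": map[string]any{
			"state": map[string]any{"value": []string{"ALARM"}},
		},
	}
}

// guardDutyPattern matches GuardDuty findings at or above minSeverity
// using EventBridge numeric matching.
func guardDutyPattern(minSeverity float64) map[string]any {
	return map[string]any{
		"source":      []string{"aws.guardduty"},
		"detail-type": []string{"GuardDuty Finding"},
		"detail": map[string]any{
			"severity": []any{
				map[string]any{"numeric": []any{">=", minSeverity}},
			},
		},
	}
}

// mustJSON renders a pattern as a compact JSON string.
func mustJSON(p map[string]any) string {
	var buf bytes.Buffer
	enc := json.NewEncoder(&buf)
	enc.SetEscapeHTML(false) // keep ">=" readable instead of an escaped form
	if err := enc.Encode(p); err != nil {
		panic(err)
	}
	return strings.TrimSpace(buf.String())
}

func main() {
	fmt.Println(mustJSON(alarmPattern()))
	fmt.Println(mustJSON(guardDutyPattern(7)))
}
```

The printed JSON can be pasted into a rule's event pattern field. Scheduled rules need no pattern, only a `rate(...)` or `cron(...)` expression.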

Future Rules (Planned)

Aurora ACU ceiling → alert + auto-scale recommendation
Certificate expiration → trigger renewal workflow

4. Workflow Orchestration (Step Functions)

AWS Step Functions orchestrate multi-step automated responses that require sequencing, branching, and retry logic. A single Lambda can restart a task — but a Step Function can check health, wait, recheck, decide whether to restart or escalate, verify recovery, and notify.

Active State Machine

tbi-ops-health-check-heal Step Function

Triggered by CloudWatch alarm via EventBridge. Orchestrates the full self-healing cycle.

Workflow Steps

Step	Action	On Success	On Failure
1. Check Health	Invoke `tbi-ops-self-heal` with `action: check-health`	If all healthy → notify INFO and exit	Continue to step 2
2. Wait 60s	Built-in Wait state — allow transient issues to resolve	Continue to step 3	—
3. Recheck Health	Invoke `tbi-ops-self-heal` again	If recovered → notify SELF-HEALED and exit	Continue to step 4
4. Force Deploy	Invoke `tbi-ops-self-heal` with `action: force-deploy`	Continue to step 5	Notify CRITICAL + escalate
5. Verify Recovery	Wait 90s, then recheck health	Notify SELF-HEALED	Notify CRITICAL — requires attention

Design Decision: The 60-second wait before action prevents false positives from transient network blips or rolling deployments. Most alarms self-resolve within this window. Only persistent failures trigger the force-deploy.
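The branching in the workflow table can be read as plain control flow. The sketch below is not the state machine definition; it is a Go simulation of the same five steps, assuming hypothetical `Healthy`, `ForceDeploy`, and `WaitSeconds` callbacks in place of the real Lambda invocations and Wait states.

```go
package main

import (
	"errors"
	"fmt"
)

// Ops abstracts the side effects so the decision logic can be read and
// exercised on its own. All three fields are hypothetical stand-ins.
type Ops struct {
	Healthy     func() bool       // stands in for self-heal with action: check-health
	ForceDeploy func() error      // stands in for self-heal with action: force-deploy
	WaitSeconds func(seconds int) // stands in for the Step Functions Wait state
}

var errDeploy = errors.New("force-deploy failed")

// RunHealWorkflow walks the five steps and returns the severity tag of the
// notification that would be sent at the end.
func RunHealWorkflow(ops Ops) string {
	// Step 1: initial health check.
	if ops.Healthy() {
		return "INFO"
	}
	// Step 2: give transient issues time to clear.
	ops.WaitSeconds(60)
	// Step 3: recheck before taking any action.
	if ops.Healthy() {
		return "SELF-HEALED"
	}
	// Step 4: the failure persisted, so force a new deployment.
	if err := ops.ForceDeploy(); err != nil {
		return "CRITICAL"
	}
	// Step 5: wait for the deployment to settle, then verify recovery.
	ops.WaitSeconds(90)
	if ops.Healthy() {
		return "SELF-HEALED"
	}
	return "CRITICAL"
}

// scripted returns a health check that replays the given results in order
// and repeats the last one once the list is exhausted.
func scripted(results ...bool) func() bool {
	i := 0
	return func() bool {
		r := results[i]
		if i < len(results)-1 {
			i++
		}
		return r
	}
}

func main() {
	noWait := func(int) {}
	deployOK := func() error { return nil }
	deployFails := func() error { return errDeploy }

	transient := Ops{Healthy: scripted(false, true), ForceDeploy: deployOK, WaitSeconds: noWait}
	persistent := Ops{Healthy: scripted(false, false, true), ForceDeploy: deployOK, WaitSeconds: noWait}
	broken := Ops{Healthy: scripted(false), ForceDeploy: deployFails, WaitSeconds: noWait}

	fmt.Println("transient blip:", RunHealWorkflow(transient))
	fmt.Println("needs deploy:  ", RunHealWorkflow(persistent))
	fmt.Println("deploy fails:  ", RunHealWorkflow(broken))
}
```

Keeping the waits behind a callback mirrors the design decision above: the 60-second pause is part of the decision logic, not an implementation detail of any one Lambda.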

5. AI Threat Analysis (Amazon Bedrock)

Amazon Bedrock provides the intelligence layer — correlating multiple weak signals into a coherent threat picture that simple threshold-based alarms would miss.

Model Configuration

Model

Claude Sonnet 4.6

Model ID

us.anthropic.claude-sonnet-4-6

Access Method

Inference Profile

Max Tokens

1,024

Diagram 5.1 — Bedrock Threat Analysis Pipeline

graph TB
    subgraph Gather["Signal Gathering"]
        S1["WAF blocks (24h)"]
        S2["Honeypot hits"]
        S3["Alarm state"]
        S4["GuardDuty findings"]
        S5["Self-heal count"]
        S6["Recent IPs"]
    end

    subgraph Decide["Quiet Check"]
        QC{"All signals\nbelow threshold?"}
    end

    subgraph Analyze["Bedrock Analysis"]
        BR["Claude Sonnet 4.6\nSystem: Security Analyst\nInput: Threat signals JSON\nOutput: Structured report"]
    end

    subgraph Act["Response Actions"]
        A1["Store report in Valkey"]
        A2["Notify if HIGH/CRITICAL"]
        A3["Execute safe auto-actions"]
    end

    S1 --> QC
    S2 --> QC
    S3 --> QC
    S4 --> QC
    S5 --> QC
    S6 --> QC
    QC -->|"Yes"| Skip["Log quiet · Skip Bedrock"]
    QC -->|"No"| BR
    BR --> A1
    BR --> A2
    BR --> A3

    style Gather fill:#1e293b,stroke:#ef4444,color:#e2e8f0
    style Decide fill:#1e293b,stroke:#f59e0b,color:#e2e8f0
    style Analyze fill:#1e293b,stroke:#a78bfa,color:#e2e8f0
    style Act fill:#1e293b,stroke:#10b981,color:#e2e8f0

What Bedrock Analyzes

Signal	Source	What It Tells Us
WAF blocks (24h)	CloudWatch / `/public/infrastructure`	Volume of blocked malicious requests — baseline vs. spike
Honeypot hits	Valkey `honeypot:log`	Active scanning/enumeration attempts against decoy endpoints
Auto-blocked IPs	Valkey `honeypot:blocked_ips`	How many IPs have been automatically banned
Pending queue	Valkey `honeypot:autoblock_queue`	IPs awaiting WAF block — indicates active attack if growing
Alarms firing	CloudWatch	Infrastructure health issues (5xx, health check failures)
GuardDuty findings	GuardDuty	Network-level threats (port scanning, unusual API calls)
Self-heal count	Valkey `autoops:self-heal:count`	Frequency of automated recoveries — high count may indicate underlying issue
Recent honeypot IPs	Valkey `honeypot:log` (top 10)	Specific attacker IPs for pattern analysis

Output Format

Bedrock returns structured JSON:

severity: NONE | LOW | MEDIUM | HIGH | CRITICAL
summary: Plain English description of the threat landscape
patterns: Identified attack types (port scanning, credential stuffing, DDoS, enumeration)
recommendations: Human-actionable suggestions for Cory
auto_actions: Safe, reversible actions the system can take automatically
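A Go consumer of this output might decode and validate it along the following lines. The five field names come from the list above; treating `patterns`, `recommendations`, and `auto_actions` as string arrays is an assumption, and the type and function names are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ThreatReport mirrors the five documented output fields.
// The slice types are an assumption about the JSON shape.
type ThreatReport struct {
	Severity        string   `json:"severity"`
	Summary         string   `json:"summary"`
	Patterns        []string `json:"patterns"`
	Recommendations []string `json:"recommendations"`
	AutoActions     []string `json:"auto_actions"`
}

var validSeverity = map[string]bool{
	"NONE": true, "LOW": true, "MEDIUM": true, "HIGH": true, "CRITICAL": true,
}

// ParseThreatReport decodes the model output and rejects any severity
// outside the documented set, so a malformed reply cannot trigger actions.
func ParseThreatReport(raw []byte) (ThreatReport, error) {
	var r ThreatReport
	if err := json.Unmarshal(raw, &r); err != nil {
		return ThreatReport{}, fmt.Errorf("decode report: %w", err)
	}
	if !validSeverity[r.Severity] {
		return ThreatReport{}, fmt.Errorf("unknown severity %q", r.Severity)
	}
	return r, nil
}

// ShouldNotify reflects the documented rule: notify only on HIGH or CRITICAL.
func (r ThreatReport) ShouldNotify() bool {
	return r.Severity == "HIGH" || r.Severity == "CRITICAL"
}

func main() {
	raw := []byte(`{
		"severity": "HIGH",
		"summary": "Sustained enumeration against decoy endpoints.",
		"patterns": ["enumeration"],
		"recommendations": ["Review the WAF rate limits"],
		"auto_actions": ["block-ips"]
	}`)
	report, err := ParseThreatReport(raw)
	if err != nil {
		fmt.Println("rejected:", err)
		return
	}
	fmt.Println(report.Severity, report.ShouldNotify(), report.AutoActions)
}
```

Validating the severity before acting fits the advisory role described in the Core Principle: the model's reply is treated as untrusted input.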

Cost Control

Bedrock is only invoked when signals exceed quiet thresholds (WAF blocks > 50, honeypot hits > 10, any alarm firing, any GuardDuty finding). During quiet periods, the Lambda logs quiet status and exits without calling Bedrock — keeping costs near zero.
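The quiet check reduces to a single predicate over the gathered signals. A minimal sketch follows, using the four thresholds stated above; the `Signals` struct and its field names are hypothetical.

```go
package main

import "fmt"

// Signals holds the counters the quiet check looks at.
// The struct and its field names are illustrative, not the Lambda's own types.
type Signals struct {
	WAFBlocks24h      int
	HoneypotHits      int
	AlarmsFiring      int
	GuardDutyFindings int
}

// IsQuiet returns true when no signal exceeds its quiet threshold,
// in which case the Bedrock call is skipped.
func IsQuiet(s Signals) bool {
	return s.WAFBlocks24h <= 50 &&
		s.HoneypotHits <= 10 &&
		s.AlarmsFiring == 0 &&
		s.GuardDutyFindings == 0
}

func main() {
	calm := Signals{WAFBlocks24h: 12, HoneypotHits: 3}
	noisy := Signals{WAFBlocks24h: 12, HoneypotHits: 3, AlarmsFiring: 1}

	fmt.Println("calm is quiet: ", IsQuiet(calm))
	fmt.Println("noisy is quiet:", IsQuiet(noisy))
}
```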

6. Self-Healing Lambda Functions

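Seven Go Lambda functions form the action layer of AutoOps. Each is a standalone binary, cross-compiled for Linux, running on the provided.al2023 runtime with 1770 MB of memory (a multiple of 3). Lambda allocates CPU in proportion to memory and provides the equivalent of one full vCPU at 1,769 MB, so this setting gives each function a full vCPU for fast execution and short cold starts.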

Function Inventory

All functions share the same base configuration: Go · provided.al2023 · 1 vCPU · 1770 MB · Not in VPC · IAM role: tbi-autonomous-ops-role.

Function	Binary Size	Timeout	Layer	Actions
`tbi-ops-notify`	`8.3 MB`	`60s`	1	Branded HTML email via SES · SNS fallback · Severity badges [INFO/WARNING/CRITICAL/SELF-HEALED]
`tbi-ops-self-heal`	`8.5 MB`	`60s`	1	`force-deploy`, `restart-task`, `check-health` · Logs actions to Valkey
`tbi-ops-waf-action`	`8.2 MB`	`60s`	1	`block-ips`, `unblock-ips`, `list-blocked` · Manages `tbi-autoops-blocked-ips` WAF IP set
`tbi-ops-honeypot-processor`	`8.2 MB`	`60s`	1	Drain `honeypot:autoblock_queue` → deduplicate IPs → invoke waf-action (sync) → notify (async)
`tbi-ops-bedrock-analyze`	`8.5 MB`	`60s`	2	Gather 8 threat signals → quiet check → Claude Sonnet 4.6 analysis → store in Valkey → notify if HIGH/CRITICAL → execute safe auto-actions
`tbi-rhema-support`	`8.5 MB`	`60s`	4	Auto-categorize ticket (8 categories) → Claude Sonnet 4.6 draft response in `preferred_lang` → pre-fetch diagnostics → notify Cory with full context
`tbi-ops-digest`	`8.5 MB`	`180s`	5	Daily 300-word summary + weekly newsletter pipeline · Reads `report:text:YYYY-MM-DD` from Valkey · Claude Sonnet 4.6 narrative · Auto-translate 11 languages · Send via notify

Inter-Lambda Communication

AutoOps Lambdas invoke each other directly via the AWS Lambda Invoke API:

honeypot-processor → invokes waf-action (synchronous) and notify (async)
bedrock-analyze → invokes notify (async) and waf-action (synchronous for auto-blocks)
rhema-support → invokes notify (async) with ticket analysis and draft response
digest → invokes notify (async) with daily/weekly operational summary
Step Function → invokes self-heal and notify as workflow steps

Shared IAM Role

All 7 functions share tbi-autonomous-ops-role with permissions for: ECS (update-service, stop-task), WAF (update-ip-set), CloudWatch (get-metrics, describe-alarms), Bedrock (invoke-model), SES (send-email), SNS (publish), SQS (receive/delete), Secrets Manager (get-secret), and Lambda (invoke-function on tbi-ops-*).

7. Honeypot Defense System

12 decoy endpoints that no legitimate user would ever access. Any hit reveals a scanner, bot, or attacker probing for vulnerabilities.

Trap Endpoints

Endpoint
/wp-admin
/.env
/phpmyadmin
/.git/config
/wp-login.php
/administrator
/xmlrpc.php
/backup.sql
/config.php
/debug
/actuator
/server-status

How It Works

Step	Action	Storage
1. Hit detected	Log IP, user-agent, path, timestamp	Valkey: `honeypot:ip:{ip}` hash + `honeypot:log` sorted set
2. Tarpit	2-second delay response (wastes scanner time)	—
3. Threshold check	If same IP hits 2+ traps → queue for auto-block	Valkey: `honeypot:autoblock_queue`
4. Queue drain (5 min)	EventBridge → `honeypot-processor` Lambda	—
5. WAF block	IPs added to `tbi-autoops-blocked-ips` WAF IP set	Valkey: `honeypot:blocked_ips` set
6. Notify	SELF-HEALED notification with blocked IP list	—

Table-driven design: Honeypot paths are stored in a Valkey set (honeypot:paths). Add or remove trap endpoints without redeploying — just update the set.
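Steps 1 and 3 of the table can be sketched as a small hit tracker. The version below keeps everything in memory, with Go maps standing in for the Valkey keys (`honeypot:paths`, `honeypot:ip:{ip}`, `honeypot:autoblock_queue`); the `Tracker` type and its methods are hypothetical, and the tarpit delay is left out.

```go
package main

import "fmt"

// autoBlockThreshold is the documented 2-hit threshold.
const autoBlockThreshold = 2

// Tracker is an in-memory stand-in for the Valkey-backed honeypot state.
type Tracker struct {
	traps  map[string]bool // stands in for the honeypot:paths set
	hits   map[string]int  // stands in for per-IP hit records
	queued map[string]bool // prevents queueing the same IP twice
	Queue  []string        // stands in for honeypot:autoblock_queue
}

// NewTracker builds a tracker from the configured trap paths.
func NewTracker(trapPaths ...string) *Tracker {
	t := &Tracker{
		traps:  make(map[string]bool),
		hits:   make(map[string]int),
		queued: make(map[string]bool),
	}
	for _, p := range trapPaths {
		t.traps[p] = true
	}
	return t
}

// RecordHit counts a request against a trap path and reports whether this
// hit caused the IP to be queued for auto-blocking. Requests to paths that
// are not traps are ignored.
func (t *Tracker) RecordHit(ip, path string) bool {
	if !t.traps[path] {
		return false
	}
	t.hits[ip]++
	if t.hits[ip] < autoBlockThreshold || t.queued[ip] {
		return false
	}
	t.queued[ip] = true
	t.Queue = append(t.Queue, ip)
	return true
}

func main() {
	t := NewTracker("/wp-admin", "/.env")

	fmt.Println(t.RecordHit("203.0.113.7", "/wp-admin")) // first trap hit
	fmt.Println(t.RecordHit("203.0.113.7", "/.env"))     // second hit reaches the threshold
	fmt.Println(t.RecordHit("198.51.100.2", "/pricing")) // not a trap path
	fmt.Println(t.Queue)
}
```

Because the trap set is data rather than code, adding an endpoint in this sketch is one more entry in `traps`, which matches the table-driven approach described above.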

8. WAF Automation

The WAF layer provides both static protection (managed rule groups) and dynamic, AI-driven blocking via the AutoOps IP set.

WAF Configuration

WAF	Protects	Rules
`trinity-beast-api-waf`	ALB (API)	IP Reputation, Common Rules, Known Bad Inputs, SQL Injection, Rate Limit Global (2000/5min), Rate Limit Admin (100/5min), AutoOps-BlockedIPs (priority 7)
`CreatedByCloudFront-449feaa5`	CloudFront (Website)	Anti-DDoS, IP Reputation, Common Rules, Known Bad Inputs

AutoOps WAF Integration

tbi-autoops-blocked-ips WAF IP Set

ID: 8d55de25-8ba5-4982-8c41-f4316c9bd50d

Managed exclusively by AutoOps Lambdas. IPs are added by:

tbi-ops-honeypot-processor — repeat honeypot offenders (2+ hits)
tbi-ops-bedrock-analyze — AI-recommended blocks (safe auto-actions)
tbi-ops-waf-action — manual blocks via KCC commands
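WAF IP sets store addresses in CIDR notation, so single IPs collected from the honeypot queue or from Bedrock recommendations need to be validated, de-duplicated, and converted before an update. The sketch below shows one way to do that with Go's `net/netip` package; the function name is hypothetical, and it does not show the WAF API call itself. A WAF IP set holds either IPv4 or IPv6 addresses, so a real caller would also need to split the result by address family.

```go
package main

import (
	"fmt"
	"net/netip"
	"sort"
)

// ToCIDRs validates raw IP strings, drops duplicates, and returns
// single-address CIDR blocks in sorted order. Entries that do not parse
// as IP addresses are returned separately so they can be logged.
func ToCIDRs(ips []string) (cidrs []string, rejected []string) {
	seen := make(map[netip.Addr]bool)
	for _, raw := range ips {
		addr, err := netip.ParseAddr(raw)
		if err != nil {
			rejected = append(rejected, raw)
			continue
		}
		addr = addr.Unmap() // treat IPv4-mapped IPv6 as plain IPv4
		if seen[addr] {
			continue
		}
		seen[addr] = true
		// BitLen is 32 for IPv4 and 128 for IPv6, giving a one-address block.
		cidrs = append(cidrs, netip.PrefixFrom(addr, addr.BitLen()).String())
	}
	sort.Strings(cidrs)
	return cidrs, rejected
}

func main() {
	queue := []string{"203.0.113.7", "198.51.100.2", "203.0.113.7", "not-an-ip"}
	cidrs, rejected := ToCIDRs(queue)
	fmt.Println("to block:", cidrs)
	fmt.Println("rejected:", rejected)
}
```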

9. Notification & Alerting

All operational notifications flow through a branded email pipeline. No raw text, no ugly AWS formatting — every alert arrives as a professionally designed HTML email via Amazon SES.

Notification Flow

The tbi-ops-notify Lambda serves as the single gateway for all notifications. It accepts events from two paths:

Direct invocation — other AutoOps Lambdas (bedrock-analyze, self-heal, digest, honeypot-processor) invoke it directly with structured event payloads
SNS subscription — CloudWatch alarms and other AWS services publish to the SNS topic, which triggers the Lambda. The Lambda parses the SNS envelope, detects the message type (alarm, generic), and normalizes it into a structured notification.

Diagram 9.1 — Branded Email Notification Pipeline

graph LR
    subgraph Sources["Event Sources"]
        CW["CloudWatch Alarms"]
        BA["tbi-ops-bedrock-analyze"]
        SH["tbi-ops-self-heal"]
        DG["tbi-ops-digest"]
        HP["tbi-ops-honeypot-processor"]
        SP["tbi-rhema-support"]
    end

    subgraph Router["SNS Topic"]
        SNS["tbi-ops-notifications"]
    end

    subgraph Notify["Notification Lambda"]
        NL["tbi-ops-notify"]
        Parse["Parse Event Type"]
        Template["Build HTML Email"]
    end

    subgraph Delivery["Email Delivery"]
        SES["Amazon SES"]
        Email["Branded HTML Email\nDark theme · Structured sections\nSeverity badge · Action cards"]
    end

    subgraph Fallback["Fallback Path"]
        SNSFB["SNS Plain Text\n(only if SES fails)"]
    end

    CW -->|"Alarm state change"| SNS
    SNS -->|"Lambda subscription"| NL
    BA -->|"Direct invoke"| NL
    SH -->|"Direct invoke"| NL
    DG -->|"Direct invoke"| NL
    HP -->|"Direct invoke"| NL
    SP -->|"Direct invoke"| NL
    NL --> Parse
    Parse --> Template
    Template --> SES
    SES --> Email
    Template -.->|"SES failure"| SNSFB

    style Sources fill:#1e293b,stroke:#60a5fa,color:#e2e8f0
    style Router fill:#1e293b,stroke:#FF9900,color:#e2e8f0
    style Notify fill:#1e293b,stroke:#10b981,color:#e2e8f0
    style Delivery fill:#1e293b,stroke:#a855f7,color:#e2e8f0
    style Fallback fill:#1e293b,stroke:#64748b,color:#94a3b8

    linkStyle 0 stroke:#FF9900
    linkStyle 1 stroke:#FF9900
    linkStyle 2 stroke:#10b981
    linkStyle 3 stroke:#10b981
    linkStyle 4 stroke:#10b981
    linkStyle 5 stroke:#10b981
    linkStyle 6 stroke:#10b981
    linkStyle 7 stroke:#60a5fa
    linkStyle 8 stroke:#60a5fa
    linkStyle 9 stroke:#a855f7
    linkStyle 10 stroke:#a855f7
    linkStyle 11 stroke:#64748b

SNS Topic

Topic

tbi-ops-notifications

ARN

arn:aws:sns:us-east-2:211998422884:tbi-ops-notifications

Subscriber

tbi-ops-notify Lambda

Protocol

Lambda (no email subscriptions)

Email Delivery

Service

Amazon SES

From

CPMP Mission <No-Reply@CPMP-Site.org>

To

CoryDeanKalani@Gmail.com

Format

Branded HTML (dark theme, Gmail-compatible)

Email Template Features

Gmail-compatible dark theme — uses bgcolor HTML attributes on every <td> (Gmail strips CSS background properties)
Solid hex colors only — no rgba() values (Gmail strips them entirely)
Color-coded severity badge — INFO (blue), WARNING (amber), CRITICAL (red), SELF-HEALED (green)
Auto-parsed sections — Summary, Patterns, Recommendations detected from detail text and rendered as structured HTML
Bullet lists — lines starting with • or - render as actual <ul> lists
Action Taken card — green left-border card showing what the system did
Attention banner — red banner for CRITICAL events requiring human intervention
Branded footer — "The Trinity Beast AutoOps · us-east-2 · The Trinity Beast Infrastructure"
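The bullet-list feature can be illustrated with a short Go function that turns plain detail text into HTML. This is a simplified sketch, not the template code: it handles only paragraphs and bullet lines, omits the dark-theme `bgcolor` attributes, and the name `RenderDetail` is hypothetical.

```go
package main

import (
	"fmt"
	"html"
	"strings"
)

// RenderDetail converts plain detail text to HTML. Consecutive lines that
// start with "•" or "-" become one <ul> list; other non-empty lines become
// paragraphs. All text is HTML-escaped.
func RenderDetail(detail string) string {
	var out strings.Builder
	inList := false
	closeList := func() {
		if inList {
			out.WriteString("</ul>")
			inList = false
		}
	}
	for _, line := range strings.Split(detail, "\n") {
		line = strings.TrimSpace(line)
		if line == "" {
			closeList()
			continue
		}
		item, isBullet := strings.CutPrefix(line, "•")
		if !isBullet {
			item, isBullet = strings.CutPrefix(line, "-")
		}
		if isBullet {
			if !inList {
				out.WriteString("<ul>")
				inList = true
			}
			out.WriteString("<li>" + html.EscapeString(strings.TrimSpace(item)) + "</li>")
			continue
		}
		closeList()
		out.WriteString("<p>" + html.EscapeString(line) + "</p>")
	}
	closeList()
	return out.String()
}

func main() {
	detail := "Blocked IPs:\n• 203.0.113.7\n- 198.51.100.2\nNo further action needed."
	fmt.Println(RenderDetail(detail))
}
```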

Event Parsing

The Lambda auto-detects the incoming event format and normalizes it:

Source Format	Detection	Parsing
CloudWatch Alarm (via SNS)	`Records[].Sns.Message` contains `AlarmName`	Extracts alarm name, state transition, metric, reason → structured sections
AutoOps Lambda (direct)	`severity` + `title` fields present	Used directly — already in `NotifyEvent` format
Generic SNS message	Has `Records` array but no alarm fields	Subject becomes title, message body becomes detail
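The detection rules in the table could be implemented roughly as follows. The `Records[].Sns` envelope and the `AlarmName` field are the standard SNS and CloudWatch alarm shapes; the `DetectFormat` function and its return labels are hypothetical, and the real `NotifyEvent` type likely carries more fields than the two checked here.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// envelope captures only the fields needed to tell the formats apart.
type envelope struct {
	Severity string `json:"severity"`
	Title    string `json:"title"`
	Records  []struct {
		Sns struct {
			Subject string `json:"Subject"`
			Message string `json:"Message"`
		} `json:"Sns"`
	} `json:"Records"`
}

// DetectFormat classifies an incoming event as "direct", "cloudwatch-alarm",
// "generic-sns", or "unknown", following the three documented rules.
func DetectFormat(raw []byte) string {
	var e envelope
	if err := json.Unmarshal(raw, &e); err != nil {
		return "unknown"
	}
	// Direct invocation: already in the structured notification format.
	if e.Severity != "" && e.Title != "" {
		return "direct"
	}
	if len(e.Records) == 0 {
		return "unknown"
	}
	// SNS delivers the alarm as a JSON document inside the Message string.
	var alarm struct {
		AlarmName string `json:"AlarmName"`
	}
	msg := []byte(e.Records[0].Sns.Message)
	if json.Unmarshal(msg, &alarm) == nil && alarm.AlarmName != "" {
		return "cloudwatch-alarm"
	}
	return "generic-sns"
}

func main() {
	direct := []byte(`{"severity":"INFO","title":"Health check passed"}`)
	alarm := []byte(`{"Records":[{"Sns":{"Subject":"ALARM","Message":"{\"AlarmName\":\"api-5xx\"}"}}]}`)
	generic := []byte(`{"Records":[{"Sns":{"Subject":"Notice","Message":"plain text body"}}]}`)

	fmt.Println(DetectFormat(direct))
	fmt.Println(DetectFormat(alarm))
	fmt.Println(DetectFormat(generic))
}
```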

Severity Levels

Level	Subject Prefix	When Used	Action Required
INFO	`[INFO]`	Health check passed, digest generated, routine status	None — informational only
WARNING	`[WARNING]`	HIGH threat detected, anomaly alarm, elevated activity	Review when convenient
CRITICAL	`[CRITICAL]`	Self-heal failed, IAM changes detected, EC2 launch detected	Immediate attention required
SELF-HEALED	`[SELF-HEALED]`	Problem detected AND fixed automatically, alarm returned to OK	None — system handled it

Design Principles

Single gateway — all notifications funnel through one Lambda, one template, one brand
Never lose a notification — if SES fails, falls back to SNS plain text
No raw AWS emails — SNS topic has zero email subscriptions; only Lambda subscription
Consistent branding — same dark theme, same footer, same severity colors across all notification types
Deduplication — Bedrock threat notifications have a 1-hour cooldown per severity level
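The per-severity cooldown in the last principle can be sketched as a keyed timestamp check. The version below keeps state in memory for illustration only; Lambda invocations do not share memory, so the deployed system would need shared state for this (for example a Valkey key with a one-hour TTL, which is an assumption). The `Cooldown` type is hypothetical.

```go
package main

import "fmt"

// Cooldown suppresses repeat notifications for the same key within a window.
type Cooldown struct {
	windowSec int64
	lastSent  map[string]int64 // key -> Unix time of the last allowed send
}

// NewCooldown creates a cooldown with the given window in seconds.
func NewCooldown(windowSec int64) *Cooldown {
	return &Cooldown{windowSec: windowSec, lastSent: make(map[string]int64)}
}

// Allow reports whether a notification for key may be sent at nowUnix,
// and records the send time when it is allowed.
func (c *Cooldown) Allow(key string, nowUnix int64) bool {
	if last, ok := c.lastSent[key]; ok && nowUnix-last < c.windowSec {
		return false
	}
	c.lastSent[key] = nowUnix
	return true
}

func main() {
	c := NewCooldown(3600) // one hour per severity level

	fmt.Println(c.Allow("HIGH", 0))        // first HIGH notification
	fmt.Println(c.Allow("HIGH", 1800))     // same severity, 30 minutes later
	fmt.Println(c.Allow("CRITICAL", 1800)) // different severity has its own window
	fmt.Println(c.Allow("HIGH", 3600))     // the hour has elapsed
}
```

Keying on severity means an escalation from HIGH to CRITICAL is never suppressed, which is consistent with the rule that critical alerts are never held back.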

10. Custom AI Translation Engine (Bedrock)

A custom Bedrock-powered translation engine replaces Amazon Translate for all document translation. The engine uses sentinel preprocessing, multi-layer validation, and Step Function orchestration to produce structurally correct translations that preserve code blocks, diagrams, and brand terminology.

Translation Coverage

Languages

12 (incl. English)

Documents

407+ HTML files

Engine

Custom Bedrock (Claude Sonnet 4.6)

Worker

ECS Fargate (Python 3.11)

Architecture

Component	Type	Purpose
`POST /admin/translate`	Admin API	Job submission, monitoring, control (9 endpoints)
`trinity-beast-translation-queue`	SQS	Decouple submission from execution
`tbi-translate-pipe`	EventBridge Pipe	SQS → Step Function (no glue Lambda)
`tbi-translation-orchestrator`	Step Functions	Fan-out: docs serial, languages parallel ×11
`tbi-translate-worker`	ECS Fargate Task	Bedrock translation + sentinel preprocessing + validation + finalization
`/admin/translate/complete`	LPO Endpoint (Go)	CloudFront invalidation + search rebuild + notification + cost + scale-down

Key Innovations

Sentinel preprocessing — three-pass system replaces protected content with placeholder tokens before the model sees it. The model cannot corrupt what it never sees. 57 protected brand terms, version numbers, URLs, ARNs, and numeric values are all shielded.
Multi-layer validation — every translated chunk is validated for tag count integrity, protected term survival, version number preservation, and HTML structure. Failures trigger automatic retries with temperature jitter.
ECS Fargate worker — no timeout ceiling. Large documents (140+ KB) translate to completion regardless of processing time. The worker runs as a 1 vCPU / 3 GB Fargate task invoked by the Step Function via ecs:runTask.sync.
Fire-and-forget operation — a single API call translates any document into all 11 languages, deploys to S3, invalidates CloudFront, rebuilds the search index, and emails a summary.
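The idea behind sentinel preprocessing can be shown in a few lines: protected terms are swapped for opaque tokens before translation and swapped back afterwards, and any token that did not survive is reported so the chunk can be retried. The sketch below is a single-pass simplification written in Go for consistency with the other examples (the actual worker is a Python task with three passes); the `[[S0]]` token format and the function names are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Shield replaces each protected term found in text with a numbered token
// and returns the shielded text plus the token-to-term table. Callers should
// list longer terms first so that overlapping terms are shielded whole.
func Shield(text string, protected []string) (string, map[string]string) {
	table := make(map[string]string)
	for i, term := range protected {
		if !strings.Contains(text, term) {
			continue
		}
		token := fmt.Sprintf("[[S%d]]", i)
		table[token] = term
		text = strings.ReplaceAll(text, term, token)
	}
	return text, table
}

// Restore puts the original terms back into translated text. It also returns
// the terms whose tokens are missing, which signals a failed validation.
func Restore(translated string, table map[string]string) (string, []string) {
	var missing []string
	for token, term := range table {
		if !strings.Contains(translated, token) {
			missing = append(missing, term)
			continue
		}
		translated = strings.ReplaceAll(translated, token, term)
	}
	sort.Strings(missing)
	return translated, missing
}

func main() {
	protected := []string{"tbi-ops-notify", "Valkey"}
	shielded, table := Shield("Alerts from Valkey go through tbi-ops-notify.", protected)
	fmt.Println(shielded)

	// Stand-in for the model's output: the tokens pass through untouched.
	translated := strings.Replace(shielded, "Alerts from", "Les alertes de", 1)
	restored, missing := Restore(translated, table)
	fmt.Println(restored, missing)
}
```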

Translation Protection

Principle: Translate the understanding, preserve the execution. Code blocks, Mermaid diagrams, and technical identifiers are protected by the sentinel system. A developer in any language reads the explanation in their native tongue, then copy-pastes the code and has it work on the first try.

Full documentation: See AutoOps Translation Engine (AutoOps-Translation-Engine.html) for the complete technical reference: sentinel types, validator system, Step Function orchestration, admin API, cost protection, and operations guide.

11. Observability & Monitoring

The monitoring layer feeds events into AutoOps and provides the data that Bedrock analyzes.

CloudWatch Alarms

Alarm	Metric	Threshold	AutoOps Response
WAF High Block Rate	WAF BlockedRequests	> 1000/5min	Triggers Bedrock analysis
API 5xx Spike	ALB 5xx count	> 10/min	Health check → self-heal workflow
API 4xx Spike	ALB 4xx count	> 100/min	Bedrock analysis for abuse patterns
GuardDuty Finding	GuardDuty event	Any HIGH/CRITICAL	Immediate notification + Bedrock analysis
ECS CPU High	ECS CPUUtilization	> 80%	Alert (scaling is manual)
Service Count Low	ECS RunningTaskCount	< 1	Health check → force-deploy
Aurora CPU High	RDS CPUUtilization	> 80%	Alert + ACU review

GuardDuty

Detector: 18ceef6f8dddcf6082473cc7016ee458
Sources: VPC flow logs, CloudTrail API calls, DNS query logs
Integration: Findings generate EventBridge events → AutoOps response

Public Infrastructure Endpoint

All monitoring data is exposed via GET /public/infrastructure — a single unauthenticated endpoint that powers the Infrastructure Live page. Cached 60s in Valkey. Includes CloudFront, WAF, SQS, Lambda, Sync, GuardDuty, Alarms, Honeypot, and AutoOps stats.

12. Cost Analysis

The entire AI and automation layer runs for approximately $10–15/month — a remarkably low cost for a fully autonomous, AI-powered operations system running 24/7.

Service	Usage Pattern	Monthly Cost
EventBridge	~8,640 events/month (alarm changes + scheduled)	~$0.50
Step Functions	~100 executions/month (only on alarm)	~$1.00
Lambda (7 AutoOps)	~10,000 invocations/month, 1770 MB, <5s avg	~$2–5
Amazon Bedrock	~50–200 invocations/month (only when signals active)	~$2–5
SNS	~100 notifications/month	~$0.10
Amazon Bedrock (Translation)	~$1.65 per doc-language pair, usage-dependent	~$2–10
Total AI & Automation		~$10–15/mo

Cost-conscious by design: Bedrock is only called when threat signals exceed quiet thresholds. During normal operations (most of the time), the Lambda checks signals, finds everything quiet, and exits without invoking the AI model. This keeps Bedrock costs near zero during peaceful periods.

13. Future Layers

All 5 layers of the AutoOps system are fully deployed and operational.

Layer	Name	Status	Purpose
1	Autonomous Operations	LIVE	Event-driven self-healing, WAF automation, notifications
2	Intelligent Threat Response	LIVE	AI-powered threat analysis via Bedrock (every 5 min + GuardDuty HIGH)
3	Predictive Operations	LIVE	CloudWatch Anomaly Detection — 4 ML models learning traffic patterns
4	Customer Engagement	LIVE	AI-assisted support — auto-categorize, auto-reply as AutoOps, human escalation option, multi-lingual
5	Self-Documenting Infrastructure	LIVE	Bedrock-generated daily + weekly operational digests

Layer 3: Predictive Operations (CloudWatch Anomaly Detection)

4 Anomaly Detectors: Request rate, latency, 5xx error rate, cache hit rate
4 Anomaly Alarms: TrinityBeast-Anomaly-RequestRate, TrinityBeast-Anomaly-Latency, TrinityBeast-Anomaly-ErrorRate, TrinityBeast-Anomaly-CacheHitRate
Band width: 2 standard deviations — tight enough to catch real issues, loose enough to avoid noise
Evaluation: 3 periods, 2 datapoints to alarm, treat missing as notBreaching
All alarms → SNS → EventBridge tbi-ops-alarm-trigger → Step Function health-check-heal
ML models need ~2 weeks to build baseline, then detect deviations from learned patterns
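The evaluation settings above (3 periods, 2 datapoints to alarm, missing data treated as not breaching) can be illustrated with a small Go function. This is a simplification: CloudWatch derives the expected value and band from its own learned model, whereas the sketch assumes a fixed expected value and standard deviation supplied by the caller. All names are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

const (
	bandWidth         = 2.0 // standard deviations on either side of the expected value
	evalPeriods       = 3   // number of most recent periods considered
	datapointsToAlarm = 2   // breaching datapoints needed within those periods
)

// Datapoint is one metric sample; Missing marks a period with no data.
type Datapoint struct {
	Value   float64
	Missing bool
}

// breaching reports whether a datapoint lies outside the band.
// Missing datapoints count as not breaching.
func breaching(p Datapoint, expected, stddev float64) bool {
	if p.Missing {
		return false
	}
	return math.Abs(p.Value-expected) > bandWidth*stddev
}

// ShouldAlarm applies the 2-of-3 rule to the most recent datapoints.
func ShouldAlarm(points []Datapoint, expected, stddev float64) bool {
	if len(points) > evalPeriods {
		points = points[len(points)-evalPeriods:]
	}
	count := 0
	for _, p := range points {
		if breaching(p, expected, stddev) {
			count++
		}
	}
	return count >= datapointsToAlarm
}

func main() {
	// Expected 100 requests per period with a standard deviation of 20,
	// so the band is 60 to 140.
	spike := []Datapoint{{Value: 105}, {Value: 180}, {Value: 175}}
	blip := []Datapoint{{Value: 180}, {Missing: true}, {Value: 100}}

	fmt.Println("sustained spike alarms:", ShouldAlarm(spike, 100, 20))
	fmt.Println("single blip alarms:    ", ShouldAlarm(blip, 100, 20))
}
```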

Layer 4: Customer Engagement — Rhema (`tbi-rhema-support`)

Auto-categorize tickets: billing, technical, api-access, rate-limit, feature-request, bug-report, account, general
Auto-draft responses in customer's preferred_lang via Bedrock
Pre-fetch diagnostic info (key status, usage, rate limits, last seen) before Cory reads ticket
Store analysis in Valkey (support:ticket:{id})
Notify Cory with category, priority, internal notes, and draft response
Triggered by application when support ticket is submitted

Layer 5: Self-Documenting Infrastructure (`tbi-ops-digest`)

Daily digest: 6 AM EST — concise 300-word operational summary
Weekly digest + newsletter: Monday 7 AM EST — generates subscriber newsletter from 7 days of daily reports
EventBridge rules: tbi-ops-daily-digest + tbi-ops-weekly-digest
Gathers stats from /public/infrastructure endpoint
Generates narrative via Bedrock (Claude Sonnet 4.6)
Stores in Valkey (autoops:digest:daily, autoops:digest:weekly)
Sends via tbi-ops-notify (email digest to Cory)

Weekly Newsletter Data Pipeline

The weekly newsletter uses daily operations reports as its primary source of truth — not just a point-in-time snapshot:

Session reports generated — Kiro saves HTML reports to ~/daily-reports/ and uploads to s3://trinity-beast-website-east2/daily-reports/
Nightly sync job — reads HTML reports from S3, strips tags to plain text, stores in Valkey as report:text:YYYY-MM-DD (30-day TTL). Backfills all dates from manifest that don't already exist.
Weekly digest Lambda — on Monday 8 AM EDT, fetches last 7 days of report:text:* from Valkey (up to 12 KB per day)
Bedrock generation — daily reports fed as primary context with accuracy guardrails (never fabricate data, never speculate on missing metrics)
Save + translate + send — newsletter saved to Aurora, auto-translated to 11 languages, sent to all active subscribers via SES
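The fetch step of this pipeline comes down to generating seven dated keys and capping each report at 12 KB. A sketch under stated assumptions: the window is taken to end on the run date (the source does not say whether the current day is included), truncation is done on a UTF-8 boundary, and both function names are hypothetical. The Valkey reads themselves are not shown.

```go
package main

import (
	"fmt"
	"time"
	"unicode/utf8"
)

// maxBytesPerDay is the documented per-day cap of 12 KB.
const maxBytesPerDay = 12 * 1024

// ReportKeys returns the report:text:YYYY-MM-DD keys for the given number of
// days, newest first, ending on the supplied date (UTC).
func ReportKeys(year, month, day, days int) []string {
	if days <= 0 {
		return nil
	}
	end := time.Date(year, time.Month(month), day, 0, 0, 0, 0, time.UTC)
	keys := make([]string, 0, days)
	for i := 0; i < days; i++ {
		date := end.AddDate(0, 0, -i).Format("2006-01-02")
		keys = append(keys, "report:text:"+date)
	}
	return keys
}

// Truncate limits s to at most max bytes without cutting a UTF-8
// character in half.
func Truncate(s string, max int) string {
	if max <= 0 {
		return ""
	}
	if len(s) <= max {
		return s
	}
	cut := max
	for cut > 0 && !utf8.RuneStart(s[cut]) {
		cut--
	}
	return s[:cut]
}

func main() {
	for _, key := range ReportKeys(2025, 3, 3, 7) {
		fmt.Println(key)
	}
	report := "WAF blocked 412 requests; no alarms fired."
	fmt.Println(len(Truncate(report, maxBytesPerDay)))
}
```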

Key design decision: The newsletter reads REAL operational data from daily reports — actual WAF block counts, actual session accomplishments, actual security events. Bedrock rewrites this for subscribers but cannot invent or speculate. If a metric is unavailable, it is omitted rather than explained away.
