The Trinity Beast – AutoOps Translation Engine

1. Why a Custom Translation Engine

The Trinity Beast Infrastructure maintains 40 technical documents translated into 11 languages, for over 440 translated files in total. The original approach used AWS Translate batch jobs. It worked for simple prose but failed catastrophically on technical documentation.

1.1 Where AWS Translate Fails

AWS Translate is a neural machine translation service optimized for general-purpose text. Technical documentation with embedded code, diagrams, and brand terminology exposes its fundamental limitations:

Failure Mode	Example	Impact
Translates code blocks	function getName() → función obtenerNombre()	Code no longer executes
Translates variable names	api_key → clave_api	Documentation references break
Breaks Mermaid diagrams	Translates node labels inside mermaid blocks	Diagrams fail to render
Corrupts HTML structure	Merges adjacent elements, drops attributes	Styling and layout break
Transliterates brand names	AutoOps → آٹو آپس (Urdu phonetic)	Brand identity lost, search breaks
Localizes numeric units	32 GB → 32 Go (French)	Technical specs become ambiguous
Drops version numbers	PostgreSQL 17.7 → PostgreSQL	Version-specific guidance lost
Ignores translate attribute	Translates content inside protected zones	Defeats the HTML5 standard mechanism

1.2 The Scale Problem

With 40 documents × 11 languages, every documentation update triggers a translation cascade. Before the custom engine:

Each document required manual post-processing to fix code blocks, diagrams, and brand names
A single document update meant re-translating and re-fixing 11 language versions
No audit trail — local log files only, no provenance tracking
No retry mechanism — a failed translation required starting over from scratch
No cost visibility — translation spend was invisible until the monthly bill
Translation was becoming a full-time job that blocked documentation improvements

1.3 The Solution

A custom Bedrock-powered translation engine that understands the boundary between human language and machine language. The engine uses defense-in-depth across the full pipeline:

Source validation — catches structural defects, encoding issues, and Mermaid syntax errors before burning any Bedrock tokens. Auto-repairs what it can, rejects early with actionable reports when it can't.
Language detection — auto-detects source language via Unicode script analysis and word frequency heuristics (21 languages, no API calls, <10ms). No pivot through English required.
Sentinel preprocessing — replaces protected content with placeholder tokens before the model sees it, then restores them after translation. The model cannot corrupt what it never sees.
Smart splitting — 24 KB max part size prevents model output truncation. Per-language code tag thresholds handle complex-script sensitivity.
Multi-layer validation — every translated chunk is validated against the source for structural integrity, protected term preservation, version number survival, and HTML tag count matching. Failures trigger automatic retries with temperature jitter.
Integrity check with diagram auto-stitch — full-document post-translation repair. Counts Mermaid diagrams, stitches missing ones back from source, repairs broken span/strong/em wrappers.
Event-driven orchestration — a managed Step Functions pipeline handles per-language fan-out (unlimited MaxConcurrency as of v3.0), retries, deployment, search index rebuilding, and notification. Fire-and-forget from the operator's perspective.

Result: A single POST /admin/translate call translates any document from any supported source language into up to 11 target languages, deploys to S3, invalidates CloudFront, rebuilds the search index, and emails a summary. Source language is auto-detected when not specified — no pivot through English required.

2. Architecture

2.1 Pipeline Flow

The translation service is an event-driven pipeline that decouples submission from execution. The operator submits a job; the system handles everything else asynchronously.

Diagram 2.1: End-to-End Pipeline Architecture

flowchart TB
    subgraph Operator
        A[POST /admin/translate]
    end
    subgraph "LPO Server (Go)"
        B[Validate & Enqueue]
        C[Valkey State]
        D[Aurora Record]
    end
    subgraph "AWS Pipeline"
        E[SQS Queue]
        F[EventBridge Pipe]
        G[Step Function]
    end
    subgraph "Translation Intelligence (Python)"
        direction LR
        subgraph "Pre-Processing"
            H0[Source Validation]
            H1[Language Detection]
            H2[Complexity Analysis]
            H3[Document Preprocessor]
        end
        subgraph "Translation Core"
            H4[Sentinel System — 4 Types]
            H5[Bedrock — 3-Region Failover]
            H6[Validator — Hard + Soft Tiers]
            H7[Integrity Check + Auto-Repair]
        end
    end
    subgraph "Deployment (Go)"
        direction LR
        I[S3 Write]
        J[CloudFront Invalidation]
        K[Search Index Rebuild]
        L[SES Notification]
    end

    A --> B
    B --> C
    B --> D
    B --> E
    E --> F
    F --> G
    G --> H0
    H0 --> H1
    H1 --> H2
    H2 --> H3
    H3 --> H4
    H4 --> H5
    H5 --> H6
    H6 --> H7
    H7 --> I
    I --> J
    J --> K
    K --> L

    style A fill:#1e3a5f,stroke:#60a5fa,color:#e2e8f0
    style B fill:#1e293b,stroke:#334155,color:#e2e8f0
    style C fill:#064e3b,stroke:#10b981,color:#e2e8f0
    style D fill:#064e3b,stroke:#10b981,color:#e2e8f0
    style E fill:#2e1065,stroke:#a78bfa,color:#e2e8f0
    style F fill:#2e1065,stroke:#a78bfa,color:#e2e8f0
    style G fill:#2e1065,stroke:#a78bfa,color:#e2e8f0
    style H0 fill:#92400e,stroke:#fbbf24,color:#e2e8f0
    style H1 fill:#92400e,stroke:#fbbf24,color:#e2e8f0
    style H2 fill:#92400e,stroke:#fbbf24,color:#e2e8f0
    style H3 fill:#92400e,stroke:#fbbf24,color:#e2e8f0
    style H4 fill:#92400e,stroke:#fbbf24,color:#e2e8f0
    style H5 fill:#92400e,stroke:#fbbf24,color:#e2e8f0
    style H6 fill:#92400e,stroke:#fbbf24,color:#e2e8f0
    style H7 fill:#92400e,stroke:#fbbf24,color:#e2e8f0
    style I fill:#064e3b,stroke:#10b981,color:#e2e8f0
    style J fill:#064e3b,stroke:#10b981,color:#e2e8f0
    style K fill:#064e3b,stroke:#10b981,color:#e2e8f0
    style L fill:#064e3b,stroke:#10b981,color:#e2e8f0

2.2 Components

Component	Type	Runtime	Purpose
`POST /admin/translate` (+ 8 more)	Admin API	Go	Job submission, monitoring, control
`trinity-beast-translation-queue`	SQS	—	Decouple submission from execution
`tbi-translate-pipe`	EventBridge Pipe	—	SQS → Step Function trigger (no glue Lambda)
`tbi-translation-orchestrator`	Step Functions	—	Fan-out, retry, deploy, finalize orchestration
`tbi-translate-worker`	ECS Fargate Task	Python 3.11	Bedrock translation + sentinel + validation (no timeout ceiling)
`tbi-translate-init`	Lambda	Go	Records execution ARN, transitions queued → running
`tbi-translate-deploy`	Lambda	Go	CloudFront invalidation per document
`tbi-translate-finalize`	Lambda	Go	Search rebuild + SES notification + state transition
`translation_jobs`	Aurora table	—	Permanent job records (28 columns)
`translation_job_events`	Aurora table	—	Granular per-doc/lang audit log

2.3 Why Python (The Only Python in the Fleet)

Every other compute workload in The Trinity Beast Infrastructure is written in Go. The translation worker is the sole exception, and for good reason:

lxml for HTML parsing — Go's HTML parsers are adequate for simple tasks but lack the XPath and tree-manipulation capabilities needed for sentinel preprocessing on complex nested documents.
Battle-tested engine — the translator package was developed and debugged over weeks of production use. Rewriting in Go would re-introduce every bug already fixed (56+ smoke test cases).
Rapid iteration — prompt engineering and validator tuning require fast feedback loops. Python's interpreted nature allows testing changes without compile cycles.
Separation of concerns — the translation engine is a self-contained package with its own config, prompts, validators, and chunker. Keeping it in Python isolates it from the Go service layer.
Container image deployment — ships as a Docker image to ECR, runs as an ECS Fargate task with no timeout ceiling. The same image also supports Lambda invocation for smaller documents.

Convention note: All Lambda functions use 1770 MB memory (multiple of 3). The worker runs as an ECS Fargate task (1 vCPU / 3 GB) with no timeout ceiling — large documents translate to completion regardless of processing time. Deploy and finalize Lambdas use 60s and 180s timeouts respectively.

3. Sentinel Preprocessing System

The sentinel system is the core innovation that makes reliable technical document translation possible. It operates on a simple principle: the model cannot corrupt what it never sees.

Before any chunk is sent to Bedrock, protected content is replaced with placeholder tokens. The model translates the prose around the placeholders. After translation, the placeholders are swapped back to the original content. Validation then confirms everything survived intact.

3.1 Four Sentinel Types

Type A — Full Element Extraction (`TBP{N}`)

Replaces entire translate="no" elements with a single token. The model sees only the placeholder and places it in the natural position for the target language's word order.

Before	After Sentinel Pass
`<span translate="no">CloudFront</span> invalidation`	`__TBP0__ invalidation`
`<code translate="no">api_key</code> parameter`	`__TBP1__ parameter`

Handles arbitrary nesting depth — processes innermost elements first, then sweeps outward until stable.
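
The innermost-first sweep can be illustrated with a small regex-based sketch. It is a simplification under stated assumptions: the production worker parses with lxml, the names `extract_protected` and `INNERMOST_NO_TRANSLATE` are hypothetical, and the pattern only handles protected elements whose bodies contain plain text or already-placed tokens.

```python
import re

# Hypothetical pattern: a translate="no" element with no child tags left
# inside it, i.e. an innermost protected element.
INNERMOST_NO_TRANSLATE = re.compile(
    r'<(?P<tag>[A-Za-z][A-Za-z0-9]*)\b[^<>]*\btranslate="no"[^<>]*>'
    r'[^<>]*'
    r'</(?P=tag)>'
)

def extract_protected(html):
    """Return (text_with_tokens, originals); originals[n] backs __TBPn__."""
    originals = []

    def stash(match):
        originals.append(match.group(0))
        return f"__TBP{len(originals) - 1}__"

    # Sweep outward: each pass replaces the current innermost elements,
    # which exposes their parents to the next pass. Stop when stable.
    while True:
        html, count = INNERMOST_NO_TRANSLATE.subn(stash, html)
        if count == 0:
            return html, originals
```

With nested input, the outer element is stored with the inner token embedded in it, so the outer token always receives the higher index.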

Type B — Paired Open/Close (`TBO{N}` / `TBC{N}`)

For plain <span> wrappers containing translatable text (badges, titles, method labels). The wrapper tags become sentinels; the text between them is translated normally.

Before	After Sentinel Pass
`<span class="badge">UDP Port 2679</span>`	`__TBO0__UDP Port 2679__TBC0__`

The model translates "UDP Port 2679" while the <span class="badge"> wrapper survives intact.
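
A minimal sketch of the paired pass, assuming it runs after Type A extraction and that span bodies are plain text. `PLAIN_SPAN` and `wrap_spans` are illustrative names, not the engine's API.

```python
import re

# A <span> that is not translate="no" and whose body has no nested tags.
PLAIN_SPAN = re.compile(
    r'(<span\b(?![^<>]*\btranslate="no")[^<>]*>)([^<>]+)(</span>)'
)

def wrap_spans(html):
    """Return (text, pairs); pairs[n] holds the tags behind __TBOn__/__TBCn__."""
    pairs = []

    def stash(match):
        pairs.append((match.group(1), match.group(3)))
        n = len(pairs) - 1
        # The inner text stays visible to the model so it can be translated.
        return f"__TBO{n}__{match.group(2)}__TBC{n}__"

    return PLAIN_SPAN.sub(stash, html), pairs
```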

Type C — Numeric Protection (`TBN{N}`)

Protects bare numbers in prose from the model's tendency to drop, paraphrase, or localize them. Matches integers, decimals, percentages, and number+unit pairs.

Before	After Sentinel Pass	Problem Prevented
`uses 1770 MB of memory`	`uses __TBN0__ of memory`	French translating "MB" → "Mo"
`achieves 98.5% uptime`	`achieves __TBN1__ uptime`	Japanese dropping the decimal
`62% cache hit rate`	`__TBN2__ cache hit rate`	German paraphrasing to words
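
The matching rule can be sketched with a single regular expression. This is an illustration only: the unit list and the names `NUMBER` and `protect_numbers` are assumptions, and the sketch expects prose text rather than raw HTML with numeric attributes.

```python
import re

# Integer or decimal, optionally followed by "%" or a unit from a small
# assumed list. The lookbehind skips digits that are part of a word or of an
# existing sentinel token such as __TBP10__.
NUMBER = re.compile(
    r'(?<![\w.])\d+(?:\.\d+)?(?:%|\s?(?:KB|MB|GB|TB|ms|s)\b)?'
)

def protect_numbers(text):
    """Return (text_with_tokens, originals); originals[n] backs __TBNn__."""
    originals = []

    def stash(match):
        originals.append(match.group(0))
        return f"__TBN{len(originals) - 1}__"

    return NUMBER.sub(stash, text), originals
```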

Type D — Brand Term Protection (`TBT{N}`)

Protects brand terms, product names, and proper nouns that must never be translated or transliterated. Unlike Type A (which requires translate="no" in the source HTML), Type D operates from a centralized configuration list — no source markup needed.

Before	After Sentinel Pass	Problem Prevented
`powered by The Trinity Beast`	`powered by __TBT0__`	Hindi transliterating to ट्रिनिटी बीस्ट
`deployed on CloudFront`	`deployed on __TBT1__`	Arabic transliterating to كلاود فرونت
`Cory Dean Kalani`	`__TBT2__`	Urdu transliterating person names

Protected terms are defined in translation-config.json (57 terms). The sentinel pass matches terms using word-boundary regex for short terms (≤5 chars) and substring matching for longer terms. Restoration is exact — the original term text is re-injected at the sentinel position.
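
The two matching modes might look like the following sketch. The function names are hypothetical and the term list is passed in directly rather than read from translation-config.json; the longest-first ordering is an assumption added so that a long term wins over a shorter overlapping one.

```python
import re

def build_term_pattern(term):
    """Word-boundary regex for short terms, plain substring match otherwise."""
    escaped = re.escape(term)
    if len(term) <= 5:
        return re.compile(rf'\b{escaped}\b')
    return re.compile(escaped)

def protect_terms(text, terms):
    """Return (text_with_tokens, originals); originals[n] backs __TBTn__."""
    originals = []
    for term in sorted(terms, key=len, reverse=True):
        def stash(match, term=term):
            originals.append(term)
            return f"__TBT{len(originals) - 1}__"
        text = build_term_pattern(term).sub(stash, text)
    return text, originals
```

The word boundary is what keeps a short term such as ECR from matching inside an unrelated word.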

Sentinel Recovery Pass (Post-Restoration)

With complex-script target languages (Hindi, Urdu, Arabic), the model occasionally drops Type D sentinel tokens entirely from its output: the token simply doesn't appear in the translated text. The recovery pass runs after normal restoration and before validation:

Iterates all TERM entries in the sentinels list
Checks if the term is present in the source but missing from the restored output
Re-injects the original term text at an approximate position (ratio-based paragraph matching)
Falls back to insertion before the last closing tag if position cannot be determined

This eliminates the class of failures where the model acknowledges the sentinel in its "thinking" but omits it from the output — a behavior observed primarily in Indic scripts with token-dense chunks.

3.2 Processing Flow

Diagram 3.1: Sentinel Preprocessing Flow

flowchart TD
    A[Source HTML Chunk] --> B[Pass 1: Extract translate=no elements]
    B --> C[Pass 2: Wrap plain span text in paired sentinels]
    C --> D[Pass 3: Replace bare numbers with numeric sentinels]
    D --> D2[Pass 4: Replace brand terms with TERM sentinels]
    D2 --> E[Send to Bedrock with sentinel-aware prompt]
    E --> F[Receive translated chunk with sentinels intact]
    F --> G[Deduplicate any model-doubled paired sentinels]
    G --> H[Restore sentinels high-to-low index order]
    H --> H2[Recovery pass: re-inject any dropped TERM sentinels]
    H2 --> I[Run validators against source + restored output]
    I -->|PASS| J[Accept chunk]
    I -->|FAIL| K{Retries remaining?}
    K -->|Yes| L[Retry with strict prompt + temperature jitter]
    L --> E
    K -->|No| M[Raise TranslationError]

    style A fill:#1e3a5f,stroke:#60a5fa,color:#e2e8f0
    style E fill:#92400e,stroke:#fbbf24,color:#e2e8f0
    style J fill:#064e3b,stroke:#10b981,color:#e2e8f0
    style M fill:#450a0a,stroke:#ef4444,color:#e2e8f0

The four passes execute in strict order — later passes operate on the output of earlier ones. This means Type C (numeric) sentinels can protect numbers that appear inside Type B (paired) text, and Type D (brand term) sentinels protect terms that appear anywhere in the translatable content, providing defense-in-depth.

3.3 Restoration and Deduplication

After translation, sentinels are restored in reverse index order (high → low) to prevent prefix collisions (__TBP1__ must not match inside __TBP10__).

A deduplication pass runs before restoration to handle a known model behavior: occasionally the model emits a paired sentinel twice consecutively (a bilingual output instinct). The deduplicator collapses __TBO0__text__TBC0__ __TBO0__text__TBC0__ into a single occurrence.
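
Both steps can be sketched in a few lines. The names are illustrative, and the deduplicator shown here only collapses exact repeats, which is narrower than what the engine may handle. Restoring from the highest index down also means an outer Type A element whose stored text contains an inner token is expanded before that inner token is looked up.

```python
import re

# A paired sentinel followed by one or more identical copies of itself.
DOUBLED_PAIR = re.compile(r'(__TBO(\d+)__.*?__TBC\2__)(?:\s*\1)+', re.DOTALL)

def dedupe_pairs(text):
    """Collapse consecutive identical paired sentinels into one occurrence."""
    return DOUBLED_PAIR.sub(r'\1', text)

def restore(text, originals, prefix="TBP"):
    """Swap tokens back to their stored content, highest index first."""
    for index in range(len(originals) - 1, -1, -1):
        text = text.replace(f"__{prefix}{index}__", originals[index])
    return text
```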

4. Validator System

Every translated chunk is validated against the source before acceptance. Validators enforce structural integrity and content preservation: a translation that passes every validator has kept its protected terms, version numbers, preserved patterns, and content-critical tags intact, so code still works, links still resolve, and diagrams still render.

4.1 Validation Checks

Validator	Type	What It Checks	Failure Example
`check_protected_terms`	Hard	Every protected term in source appears in output	"CloudFront" missing from Japanese output
`check_version_numbers`	Hard	All version numbers (X.Y or X.Y.Z) survive translation	"17.7" dropped from PostgreSQL reference
`check_preserve_patterns`	Hard	URLs, emails, IPs, ARNs, resource IDs, cron expressions, memory sizes	ARN truncated or IP address reformatted
`check_tag_counts`	Hard	HTML tag counts match for structural tags	Extra `<span>` added or `<code>` dropped
`check_translate_no_zones`	Hard	Content inside `translate="no"` zones unchanged	Protected code block content altered

Protected term matching: Short uppercase acronyms (≤4 chars like SQS, ECR, S3) use word-boundary matching to avoid false positives where the acronym appears as a substring (e.g., "ECR" inside "SECRET"). Longer terms use plain substring matching.

Implementation (v2.5): The check_tag_counts and check_translate_no_zones validators use character scanning with exact boundary matching — no regex. We control these tags. We know that a tag starts with < and ends with >. The scanner finds complete opening tags by looking for <tagname followed by a boundary character (>, space, tab, newline, or /), then reads to the closing >. This eliminates false positives from partial regex matches and is immune to edge cases where tag names appear as text content (e.g., documenting translate="no" as literal text inside a code tag).
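
The scanning approach described above can be sketched as follows. `count_open_tags` is a hypothetical name and the sketch assumes lowercase tag names; it is not the validator's actual source.

```python
# Characters that may legally follow a tag name in an opening tag.
BOUNDARY = {">", " ", "\t", "\n", "/"}

def count_open_tags(html, tag):
    """Count complete opening tags for `tag` using plain character scanning."""
    needle = "<" + tag
    count = 0
    pos = html.find(needle)
    while pos != -1:
        after = pos + len(needle)
        if after < len(html) and html[after] in BOUNDARY:
            close = html.find(">", after)
            if close == -1:
                break  # unterminated tag: nothing complete remains
            count += 1
            pos = html.find(needle, close + 1)
        else:
            # "<codex" or similar: the name continues, so this is another tag.
            pos = html.find(needle, after)
    return count
```

Because the boundary character is checked explicitly, a tag such as `<codex>` never counts toward `<code>`.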

4.2 Retry Strategy

When validation fails, the engine retries with two progressive adjustments:

Strict prompt activation — adds an explicit warning: "PREVIOUS ATTEMPT FAILED VALIDATION. Be more careful: every protected term and every version number from the input MUST appear unchanged in the output."
Temperature jitter — increments temperature by 0.1 per retry (0.0 → 0.1 → 0.2 → 0.3, capped at 0.5). A deterministic temp=0 retry produces the same erroneous output; temperature jitter lets the model take a different sampling path.

Maximum retries: 3 (configurable). If all attempts fail, a TranslationError is raised with the chunk index, validator detail, and a preview of the problematic chunk.
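
A minimal sketch of the retry loop, assuming `translate` and `validate` are injected callables (hypothetical signatures) so the control flow can be shown without a Bedrock client.

```python
class TranslationError(Exception):
    """Raised when a chunk fails validation on every attempt."""

def temperature_for(attempt, step=0.1, cap=0.5):
    """0.0 on the first attempt, then +0.1 per retry, capped at 0.5."""
    return min(round(attempt * step, 2), cap)

def translate_with_retries(chunk, translate, validate, max_retries=3):
    detail = ""
    for attempt in range(max_retries + 1):  # first attempt plus retries
        output = translate(
            chunk,
            temperature=temperature_for(attempt),
            strict=attempt > 0,  # strict prompt only after a failure
        )
        passed, detail = validate(chunk, output)
        if passed:
            return output
    raise TranslationError(f"validation failed: {detail}; chunk starts {chunk[:80]!r}")
```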

4.3 Hard vs Soft Failures

Validators are classified into two tiers based on what they protect:

Tier	Tags	Behavior	Rationale
Hard (content-critical)	`<code>`, `<pre>`, `<a>`	Retry → reject on failure	Missing code blocks, broken links, or lost pre-formatted content means the translation is functionally broken
Soft (decorative/structural)	`<span>`, `<strong>`, `<em>`, `<br>`	Log warning, pass through	Missing styling wrappers don't break functionality — the post-translation integrity check repairs them

This tiered approach eliminates the failure mode where a correctly-translated document is rejected because the model dropped a single decorative <span> wrapper during RTL reordering. The content is correct — only the styling wrapper is missing — and the integrity check restores it automatically.

The ValidationReport aggregates all results and exposes:

.passed — True if zero hard failures
.hard_failures — list of blocking issues (content-critical tags)
.soft_failures — list of warnings (decorative tags — repaired post-translation)
.summary() — human-readable status string
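
The report shape and the tier split could be modelled like this. The attribute names follow the list above; the `check_tag_counts` signature, which takes precomputed tag counts, is an assumption made for illustration.

```python
from dataclasses import dataclass, field

HARD_TAGS = {"code", "pre", "a"}              # content-critical
SOFT_TAGS = {"span", "strong", "em", "br"}    # decorative, repaired later

@dataclass
class ValidationReport:
    hard_failures: list = field(default_factory=list)
    soft_failures: list = field(default_factory=list)

    @property
    def passed(self):
        return not self.hard_failures

    def summary(self):
        status = "PASS" if self.passed else "FAIL"
        return f"{status} ({len(self.hard_failures)} hard, {len(self.soft_failures)} soft)"

def check_tag_counts(source_counts, output_counts, report):
    """Route each tag-count mismatch to the hard or soft tier."""
    for tag in sorted(HARD_TAGS | SOFT_TAGS):
        src, out = source_counts.get(tag, 0), output_counts.get(tag, 0)
        if src != out:
            bucket = report.hard_failures if tag in HARD_TAGS else report.soft_failures
            bucket.append(f"<{tag}> count {src} -> {out}")
```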

4.4 Post-Translation Integrity Check

After translation completes and chunks are reassembled, a full-document integrity check runs before the S3 write. This is the defense-in-depth layer — it repairs structural drift that the per-chunk validator intentionally allows through (soft failures).

Repair Capabilities

Issue	Detection	Repair Action
`</br>` injection	String scan for invalid closing br tags	Strip all occurrences (never valid HTML)
`<br>` inside Mermaid blocks	Regex scan within `<pre class="mermaid">`	Remove (breaks Mermaid syntax)
Mermaid content corruption	Byte-for-byte comparison with source	Flag as warning (cannot auto-repair content changes)
Missing `translate="no"` span wrappers	Compare source protected elements to output	Re-wrap bare content with original element tags
Missing `<strong>`/`<em>` wrappers	Same pattern as span recovery	Re-wrap bare content

The integrity check only repairs translate="no" elements (where content is byte-for-byte identical between source and output). For translated content that lost its wrapper, the check logs the discrepancy but cannot reliably re-wrap (the content has been translated — matching it to the source wrapper requires semantic understanding).

Design principle: If the translated content is present and correct but the HTML structure is degraded, repair it. Only flag as unrecoverable if content is actually missing or corrupted. The customer sees a clean translation — the repairs happen invisibly.
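
The two `<br>`-related repairs from the table are simple enough to sketch. `repair_breaks` is a hypothetical helper, not the integrity check's real entry point, and it assumes Mermaid blocks use exactly the `<pre class="mermaid">` opening tag.

```python
import re

MERMAID_BLOCK = re.compile(r'(<pre class="mermaid">)(.*?)(</pre>)', re.DOTALL)
BR_TAG = re.compile(r'<br\s*/?>')

def repair_breaks(html):
    """Strip invalid </br> tags everywhere and <br> tags inside Mermaid blocks."""
    html = html.replace("</br>", "")

    def clean(match):
        # Only the diagram body is touched; the wrapper tags are kept as-is.
        return match.group(1) + BR_TAG.sub("", match.group(2)) + match.group(3)

    return MERMAID_BLOCK.sub(clean, html)
```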

4.5 Source Document Validation (v2.8)

Before any translation work begins, the source document passes through a validation gate. This catches defects that would cause translation failures or produce broken output — rejecting early saves Bedrock tokens and prevents corrupted translations from reaching S3.

Defect Categories

Category	What It Catches	Auto-Repairable?
STRUCTURAL	Unclosed tags, malformed HTML, nesting violations	Yes (up to 5 unclosed tags)
MERMAID	Empty diagram blocks, missing type declaration, mismatched brackets	No — reject with location
ENCODING	BOM markers, null bytes, mixed encodings	Yes (strip BOM/nulls)
SIZE	Document exceeds 500 KB, excessive nesting depth (>30 levels)	No — reject with size info
CONFLICT	`translate="no"` on root element (nothing to translate)	No — reject immediately

Validation Flow

Size check — reject if > 500 KB (chunking becomes unreliable at this size)
Encoding check — detect and strip BOM markers, null bytes; flag mixed encodings
Structural HTML check — scan for unclosed tags; auto-repair up to 5 by appending closing tags at the correct nesting level
Mermaid syntax check — validate every <pre class="mermaid"> block has a valid diagram type, balanced brackets, and non-empty content
Conflict check — reject if the root <body> or <html> element has translate="no"
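
Part of this gate can be sketched as below. The sketch covers the size, encoding, empty-Mermaid, and root-conflict checks only; the structural HTML repair is omitted, and every name in it is hypothetical.

```python
import re
from dataclasses import dataclass, field

MAX_BYTES = 500 * 1024
MERMAID = re.compile(r'<pre class="mermaid">(.*?)</pre>', re.DOTALL)
ROOT_NO_TRANSLATE = re.compile(r'<(?:html|body)\b[^>]*\btranslate="no"')

@dataclass
class ValidationResult:
    valid: bool = True
    defects: list = field(default_factory=list)
    repaired: str = ""

def validate_source(raw: bytes) -> ValidationResult:
    result = ValidationResult()

    def reject(category, description):
        result.valid = False
        result.defects.append(
            {"severity": "error", "category": category, "description": description}
        )

    if len(raw) > MAX_BYTES:
        reject("SIZE", f"Document is {len(raw) // 1024} KB (limit: 500 KB)")
        return result
    # utf-8-sig drops a leading BOM; null bytes are stripped as a silent repair.
    text = raw.decode("utf-8-sig", errors="replace").replace("\x00", "")
    for body in MERMAID.findall(text):
        if not body.strip():
            reject("MERMAID", "Empty Mermaid block, no diagram content")
    if ROOT_NO_TRANSLATE.search(text):
        reject("CONFLICT", 'translate="no" on root element')
    result.repaired = text
    return result
```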

Rejection vs Repair

The validator follows a strict philosophy: try to fix it silently, reject early if you can't. Repairable issues (unclosed tags, BOM markers) are fixed in-place — the customer never knows. Unrecoverable issues produce an actionable defect report with the exact location, what's wrong, and how to fix it.

ValidationResult:
  valid: false
  rejection_reason: "2 unrecoverable defects found"
  defects:
    - severity: error
      category: MERMAID
      location: "Section 5, line 342"
      description: "Empty Mermaid block — no diagram content"
      suggestion: "Add diagram content or remove the empty <pre class='mermaid'> block"
    - severity: error
      category: SIZE
      location: "Document root"
      description: "Document is 612 KB (limit: 500 KB)"
      suggestion: "Split into multiple documents or remove large embedded assets"

Cost savings: A rejected document costs zero Bedrock tokens. Without source validation, a broken document would fail during translation (after burning tokens on partial chunks), produce a corrupted output, and require manual investigation. Source validation catches these cases in <10ms with zero API calls.

4.6 Diagram Integrity (v2.8)

Mermaid diagrams are code — they must survive translation byte-for-byte. The integrity check (section 4.4) now includes dedicated diagram verification with automatic recovery.

Detection

The integrity check counts Mermaid blocks in the source (<pre class="mermaid">) and compares against the translated output. If any diagrams are missing from the output, the auto-stitch mechanism activates.

Auto-Stitch Recovery

When a diagram is missing from the translated output:

Identify which source diagram is absent (by content matching)
Extract the full <div class="diagram-wrap"> block from source (includes label + pre)
Locate the correct insertion point in the output (same section, same relative position)
Inject the source diagram block verbatim — diagrams don't need translation

The stitched diagram is the untranslated source version, which is functionally correct because Mermaid syntax is language-independent. The surrounding prose is already translated, so the reader gets translated explanations with a working diagram.
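
The detection half of auto-stitch, identifying which source diagrams are absent by content matching, could be sketched as follows. Locating the insertion point is not shown, and `missing_diagrams` is an illustrative name.

```python
import re

MERMAID = re.compile(r'<pre class="mermaid">.*?</pre>', re.DOTALL)

def missing_diagrams(source_html, output_html):
    """Return source Mermaid blocks that do not appear verbatim in the output."""
    present = set(MERMAID.findall(output_html))
    return [block for block in MERMAID.findall(source_html) if block not in present]
```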

Tag Inventory Integration

The _count_tags function now reports diagram count alongside other structural tags:

Tag Inventory (source → output):
• Trinity-Beast-Performance-Report.html
  IN:  code:75 pre:8 strong:12 em:3 a:6 br:20 diagrams:4
  OUT: code:75 pre:8 strong:12 em:3 a:6 br:20 diagrams:4

If a diagram is lost during translation and auto-stitched back, the final count still matches — the stitch happens before the tag inventory is calculated. A mismatch in the diagrams count after stitching indicates a structural issue that needs manual review.

Result: The Performance Report (75 KB, 4 Mermaid diagrams, 18 sections) translates to French with all 4 diagrams intact — 3 survived translation naturally, 1 was auto-stitched from source. The reader sees no difference.

5. Step Function Orchestration

The tbi-translation-orchestrator Step Function coordinates the entire translation pipeline. As of v3.0, it uses a language-persistent container pattern — one container per language, each processing all documents sequentially.

5.1 Language-Persistent Container Pattern (v3.0)

Diagram 5.1: Step Function State Machine (v3.0)

flowchart TD
    A[UnwrapInput] --> AB[InitJob - tbi-translate-init]
    AB --> B[PerLang Map - Parallel, Unlimited]
    B --> C[tbi-translate-worker container]
    C --> C2[Process ALL docs sequentially]
    C2 -->|All docs done| D[Lang Container Exits]
    D -->|Success| E[Lang Succeeded]
    D -->|Failure| F[RecordLangFailure]
    E --> G{All langs done?}
    F --> G
    G --> H[tbi-translate-deploy - Batch Mode]
    H --> J[tbi-translate-finalize]
    J --> K[Job Complete]

    style A fill:#1e293b,stroke:#334155,color:#e2e8f0
    style AB fill:#1e3a5f,stroke:#60a5fa,color:#e2e8f0
    style B fill:#2e1065,stroke:#a78bfa,color:#e2e8f0
    style C fill:#92400e,stroke:#fbbf24,color:#e2e8f0
    style C2 fill:#92400e,stroke:#fbbf24,color:#e2e8f0
    style H fill:#1e3a5f,stroke:#60a5fa,color:#e2e8f0
    style J fill:#1e3a5f,stroke:#60a5fa,color:#e2e8f0
    style K fill:#064e3b,stroke:#10b981,color:#e2e8f0

v3.0 Architecture: Container Count = Language Count

The v3.0 architecture inverts the execution model. Instead of launching N×M containers (one per doc-language pair), it launches M containers (one per language). Each container receives the full list of documents as a JSON array and processes them sequentially before exiting.

Job	Old (v2.x) Containers	New (v3.0) Containers	Reduction
3 docs × 11 langs	33	11	67%
30 docs × 11 langs (full library)	330	11	97%
1 doc × 11 langs	11	11	0% (unchanged)
6 docs × 3 langs	18	3	83%
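
The figures in the table follow directly from the two launch models. A short check, using a hypothetical helper name:

```python
def container_counts(docs, langs):
    """Return (v2.x containers, v3.0 containers, reduction in whole percent)."""
    old = docs * langs   # one container per (doc, lang) pair
    new = langs          # one container per language
    reduction = round(100 * (old - new) / old)
    return old, new, reduction
```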

Why language-persistent?

Eliminates cold-start overhead — each container loads the translation engine, config, and Bedrock client once, then reuses them across all documents. No repeated initialization.
Unlimited parallelism — MaxConcurrency=0 (unlimited) on the PerLang Map. All language containers launch simultaneously. ECS Fargate handles scheduling.
Continue-on-failure — if one document fails within a container, the container continues to the next document. Per-doc progress is reported to Valkey in real-time.
Batch deployment — the deploy Lambda receives all results in one call (batch mode), creates CloudFront invalidations for all docs × all langs in a single operation.
Backward compatible — single-doc jobs behave identically to before (one container per language, one doc to process).

Document Array Passing

The Step Function uses States.JsonToString (an intrinsic function) to serialize the docs array into a string environment variable for the ECS container. The worker's task_runner.py deserializes it on startup and iterates through each document.
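
On the worker side, the deserialize-and-iterate step combined with continue-on-failure might look like this sketch. The function name, the result shape, and the injected `translate_doc` callable are assumptions; the real task_runner.py is not shown in this document.

```python
import json

def run_language(docs_json, lang, translate_doc):
    """Translate every document for one language, continuing past failures."""
    docs = json.loads(docs_json)  # the string produced by States.JsonToString
    results = {"lang": lang, "succeeded": [], "failed": []}
    for doc in docs:
        try:
            translate_doc(doc, lang)
            results["succeeded"].append(doc)
        except Exception as exc:  # one bad document must not stop the rest
            results["failed"].append({"doc": doc, "reason": str(exc)})
    return results
```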

UnwrapInput State

EventBridge Pipes always wrap SQS records in an array, even with batch size 1. The UnwrapInput Pass state uses InputPath: "$[0]" to extract the single job envelope from the array wrapper.

InitJob State

This state replaces the original Pass state with the tbi-translate-init Lambda. It records the Step Function execution ARN ($$.Execution.Id) back to Valkey via POST /admin/translate/update/{job_id} and transitions the job state from queued → running.

Passes the full payload through unchanged to the PerLang Map
Has a Catch block — if InitJob fails, the pipeline still continues (InitJob failure is non-fatal)
The recorded execution ARN enables the self-healing sweeper to check Step Function status for orphaned jobs

5.2 Error Handling and Recovery

Failure Mode	Handling	Job State
Single language fails after 3 retries	Catch → RecordLangFailure pass state, continue other langs	`partial`
All languages for a doc fail	Deploy Lambda receives empty succeeded list, skips invalidation	`partial`
Worker timeout (no response)	ECS task runs to completion — no timeout ceiling. Step Function waits via `ecs:runTask.sync`	`running`
Step Function execution exception	Finalize still runs via catch-all; job marked `failed`	`failed`
Operator cancels mid-flight	`StopExecution` API call; job marked `cancelled`	`cancelled`
Step Function fails before Finalize	Self-healing sweeper detects orphaned job via execution ARN, marks as `failed`	`failed`

Per-lang independence: Failure of one (doc, lang) pair never aborts work on the other 10 languages. This is enforced by the Step Function's Catch on the inner Map iterator — errors are captured as data, not propagated as exceptions.

5.3 EventBridge Pipe Integration

The tbi-translate-pipe connects SQS to the Step Function without a glue Lambda:

Source: trinity-beast-translation-queue
Filter: None (all messages trigger)
Target: tbi-translation-orchestrator
Input transformation: InputTemplate extracts body fields using <$.body.field> syntax (implicit JSON parsing of SQS body)
IAM: tbi-translate-pipe-role with sqs:ReceiveMessage, sqs:DeleteMessage, states:StartExecution

This is the AWS-native pattern for SQS-to-Step-Function integration — no code, no cold start, built-in error handling.

5.4 Self-Healing Sweeper

The sweeper runs automatically on every GET /admin/translate/health call (piggybacked) and is also available as a dedicated POST /admin/translate/sweep endpoint.

It scans all jobs in tx:active (the Valkey SET of active job IDs). For each job older than 15 minutes in queued or running state:

Checks the Step Function execution status via the recorded ARN
If FAILED, TIMED_OUT, or ABORTED → marks job as failed, removes from tx:active, updates Aurora with reason
If no execution ARN recorded (pipe never triggered) → marks as failed
If Step Function is still RUNNING → leaves it alone

All sweep actions are logged to translation_job_events for audit trail.
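
The per-job decision can be expressed as a pure function. This is a sketch under assumptions: the job dictionary keys are hypothetical, job age is measured from submission time, and the sweeper itself lives in the Go server rather than in Python.

```python
from datetime import datetime, timedelta

STALE_AFTER = timedelta(minutes=15)
DEAD_STATUSES = {"FAILED", "TIMED_OUT", "ABORTED"}

def sweep_decision(job: dict, sfn_status, now: datetime) -> str:
    """Return "fail" if the job should be marked failed, otherwise "keep"."""
    if job["state"] not in ("queued", "running"):
        return "keep"
    if now - job["submitted_at"] < STALE_AFTER:
        return "keep"
    if not job.get("execution_arn"):
        return "fail"  # the pipe never started an execution
    return "fail" if sfn_status in DEAD_STATUSES else "keep"
```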

Result: This eliminates the stuck queue problem permanently — no manual cleanup needed. Jobs that silently fail are automatically detected and marked, keeping the active set accurate and the queue healthy.

5.5 Job Phase Transitions

The job state now reflects the exact phase of execution:

Phase	Meaning
`queued`	Submitted to SQS, waiting for EventBridge Pipe to trigger Step Function
`running`	InitJob Lambda fired, Step Function execution ARN recorded, worker translating
`deploying`	All translations complete, deploy Lambda creating CloudFront invalidations
`finalizing`	Deploy complete, finalize Lambda rebuilding search index and writing final state
`succeeded` / `partial` / `failed`	Terminal states — all sub-tasks complete, email notification sent

This gives real-time visibility into exactly where a job is in the pipeline.

6. Admin API (9 Endpoints)

All endpoints require the X-Admin-Key header. They are served by the LPO server (Go) alongside the existing admin routes.

6.1 Submit Translation Job

`POST /admin/translate`

Submits a new translation job. Validates inputs, checks cost limits, creates job state in Valkey (synchronous) and Aurora (async goroutine), enqueues to SQS.

// Request
POST /admin/translate
X-Admin-Key: tbcc-admin-...
X-Idempotency-Key: my-unique-key (optional)
Content-Type: application/json

{
  "docs": ["Trinity-Beast-API-Reference.html", "Trinity-Beast-Architecture-Guide.html"],
  "langs": "all",
  "options": {
    "force": false,
    "delta": false,
    "skip_search_rebuild": false,
    "skip_validation": false
  }
}

// Response 200
{
  "status": "✅ [LPO] [us-east-2] [BeastMain] [/admin/translate] [200]",
  "status_code": 200,
  "endpoint": "/admin/translate",
  "cluster_node": "BeastMain",
  "region": "us-east-2",
  "language": "en",
  "timestamp": "2026-05-16T16:42:00Z",
  "data": {
    "job_id": "1747407720-a3f8b2c1d4e5",
    "state": "queued",
    "submitted_at": "2026-05-16T16:42:00Z"
  },
  "error": ""
}

Validation rules:

docs — required, 1-6 entries, each must be a valid filename in S3
langs — "all" (expands to all 11) or an array of 1-11 valid language codes
options.force — bypass known-failure guard and difficulty rejection
options.delta — skip pairs where the translated file is already newer than the source (saves up to 90% on re-translation)
Daily dollar spend must be under $600 (autoops:bedrock:spend:daily)
Daily token usage must be under 50M combined tokens (autoops:bedrock:tokens:input:daily + autoops:bedrock:tokens:output:daily)
Active jobs must be under 3 (additional jobs queue in SQS)
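
The request-level rules can be summarized in a short sketch. The server implements them in Go; this Python version is illustrative only, takes the valid language codes and current counters as arguments, and does not check that each filename exists in S3.

```python
DAILY_SPEND_LIMIT_USD = 600.0
DAILY_TOKEN_LIMIT = 50_000_000

def validate_submission(body, valid_langs, spend_usd, tokens_today):
    """Return a list of rejection reasons; an empty list means accepted."""
    errors = []
    docs = body.get("docs")
    if not isinstance(docs, list) or not 1 <= len(docs) <= 6:
        errors.append("docs must be a list of 1-6 filenames")
    langs = body.get("langs")
    if langs != "all":
        if (not isinstance(langs, list) or not 1 <= len(langs) <= 11
                or any(code not in valid_langs for code in langs)):
            errors.append('langs must be "all" or 1-11 valid language codes')
    if spend_usd >= DAILY_SPEND_LIMIT_USD:
        errors.append("daily spend limit reached")
    if tokens_today >= DAILY_TOKEN_LIMIT:
        errors.append("daily token limit reached")
    return errors
```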

6.2 Monitoring Endpoints

`GET /admin/translate/status/{job_id}`

Returns the full job state. Aurora is the primary source — state, timestamps, docs, langs, cost, and Step Function ARN are read from translation_jobs. Real-time per-doc/lang progress is overlaid from Valkey (written per-pair by the worker, too frequent for Aurora writes). If Aurora doesn't have the job yet (async insert still pending), falls back to Valkey.

`GET /admin/translate/queue`

Lists all pending and active jobs (state in queued or running).

`GET /admin/translate/history`

Returns the last 50 completed jobs from translation_jobs in Aurora, ordered by submission date descending. Includes state, docs, succeeded/failed pair counts, cost, and reason. Falls back to the Valkey tx:history list if Aurora is unavailable.

`GET /admin/translate/health`

System health overview:

{
  "status": "✅ [LPO] [us-east-2] [BeastMain] [/admin/translate/health] [200]",
  "status_code": 200,
  "endpoint": "/admin/translate/health",
  "cluster_node": "BeastMain",
  "region": "us-east-2",
  "language": "en",
  "timestamp": "2026-05-16T17:30:00Z",
  "data": {
    "queue_depth": 0,
    "active_jobs": 1,
    "last_completed_at": "2026-05-16T17:30:00Z",
    "last_state": "succeeded",
    "daily_spend_usd": "12.40",
    "daily_spend_limit_usd": "600.00",
    "daily_input_tokens": 284150,
    "daily_output_tokens": 312480,
    "daily_token_limit": 50000000,
    "swept_jobs": 0
  },
  "error": ""
}

6.3 Control Endpoints

`POST /admin/translate/cancel/{job_id}`

Stops the Step Function execution via the StopExecution API and marks the job as cancelled. Returns 409 if the job is already in a terminal state.

`POST /admin/translate/retry-failed/{job_id}`

Creates a new job from the failed (doc, lang) pairs of a completed-with-partial job. Returns 409 if the original is still running.

`POST /admin/translate/sweep`

Manually triggers the self-healing sweeper. Idempotent — safe to call repeatedly.

// Response 200
{
  "status": "✅ [LPO] [us-east-2] [BeastMain] [/admin/translate/sweep] [200]",
  "status_code": 200,
  "endpoint": "/admin/translate/sweep",
  "cluster_node": "BeastMain",
  "region": "us-east-2",
  "language": "en",
  "timestamp": "2026-05-16T18:00:00Z",
  "data": {
    "swept": 2,
    "checked": 5,
    "results": [
      {
        "job_id": "1747407720-a3f8b2c1d4e5",
        "prior_state": "running",
        "submitted_at": "2026-05-16T16:42:00Z",
        "sfn_status": "FAILED",
        "action": "marked_failed"
      }
    ]
  },
  "error": ""
}

6.4 Worker Callback Endpoints

These endpoints are called by the worker task and finalize Lambdas to update Aurora without needing direct database access (worker and Lambdas are outside the VPC).

`POST /admin/translate/update/{job_id}`

Updates job state, progress, cost, and timing fields. Called by the worker task after each (doc, lang) translation and by the finalize Lambda on completion.

`POST /admin/translate/event/{job_id}`

Records a granular event in the translation_job_events table. Used for the audit trail: each doc/lang start, success, failure, and retry is logged as a separate event.

Fire-and-forget pattern: Both callback endpoints always return 200 regardless of Aurora write outcome. The translation pipeline must never fail because observability data couldn't be written. Errors are logged but never propagated.
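
A minimal sketch of the fire-and-forget behavior, assuming a hypothetical write_event callable that performs the Aurora insert. The real handlers live in the Go admin API; this Python version only illustrates the control flow, where a failed write is logged and the caller still sees 200.

```python
import logging

logger = logging.getLogger("translate.callbacks")


def handle_event(job_id: str, event: dict, write_event) -> int:
    """Record an event and always report HTTP 200.

    write_event is a hypothetical callable standing in for the Aurora insert.
    """
    try:
        write_event(job_id, event)
    except Exception:
        # Observability must never break the pipeline: log and move on.
        logger.exception("event write failed for job %s", job_id)
    return 200
```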

7. Aurora Observability — Source of Truth

Aurora is the authoritative record for all translation job state. Valkey serves one specific role: real-time per-pair progress updates during active execution (written too frequently for Aurora). For everything else — job state, history, cost, audit trail — Aurora is read first.

Design principle: Valkey is the price cache, search indexes, and real-time counters. It is not a job ledger. Aurora is the ledger. When you need to know what was translated, when, at what cost, and with what result — query Aurora.

7.1 translation_jobs Table

One row per job submission. 28 columns covering the full lifecycle. This table is the ground truth for gap analysis, cost reporting, and audit:

Column Group	Fields	Purpose
Identity	`id`, `job_id`, `idempotency_key`	Unique identification and deduplication
State	`state`, `submitted_at`, `started_at`, `completed_at`	Lifecycle tracking — authoritative terminal state
Input	`docs` (JSONB), `langs` (JSONB), `options` (JSONB)	What was requested
Progress	`total_pairs`, `succeeded_pairs`, `failed_pairs`, `progress` (JSONB)	Per-doc/lang status map
Cost	`bedrock_cost_usd`, `bedrock_invocations`	Spend tracking per job
Execution	`step_function_arn`, `errors` (JSONB), `elapsed_seconds`	Traceability and debugging
Deployment	`cloudfront_invalidation_ids`, `search_index_rebuilt`, `notification_sent`	Post-translation actions
Lineage	`retry_of`, `reason`	Retry chain and submission reason
Metadata	`submitted_by`, `created_at`, `updated_at`	Audit trail

Gap analysis query: To find which documents have never been translated, query SELECT DISTINCT jsonb_array_elements_text(docs) FROM translation_jobs ORDER BY 1 and compare against the S3 document list. Aurora is the only reliable source for this — Valkey keys expire and don't persist across cache flushes.

7.2 translation_job_events Table

Granular audit log — one row per significant event in a job's lifecycle. Used by the retry-failed handler as the authoritative source of which (doc, lang) pairs failed:

Column	Type	Example Values
`job_id`	VARCHAR	`1747407720-a3f8b2c1d4e5`
`event_type`	VARCHAR	`lang_started`, `lang_succeeded`, `lang_failed`, `deploy_started`, `finalize_complete`
`doc`	VARCHAR	`Trinity-Beast-API-Reference.html`
`lang`	VARCHAR	`ja`, `ar`, `es`
`detail`	JSONB	Cost, chunk count, error message, validator report
`created_at`	TIMESTAMP	Event timestamp

7.3 Read/Write Strategy

The translation system uses a deliberate split between Aurora and Valkey based on access pattern:

Data	Primary Store	Reason
Job state (queued/running/succeeded/failed)	Aurora	Authoritative terminal state — never expires, queryable, auditable
Job history (last 50 completed)	Aurora	Permanent record — survives cache flushes, supports gap analysis
Per-pair progress (es: succeeded, ja: running…)	Valkey	Written per-pair during execution — too frequent for Aurora writes, only needed during active polling
Daily spend counter	Valkey	Needs atomic INCRBYFLOAT and 24h TTL auto-reset; Aurora is the wrong tool for this
Active job set	Valkey	Fast set membership check on every submit — Aurora query would add latency to the hot path

Write path

Submit handler: writes to Valkey synchronously (fast, needed immediately for active job tracking), writes to Aurora in a go func() goroutine (non-blocking — API response returns without waiting)
Update/Event handlers: always return HTTP 200 regardless of Aurora write outcome — errors are logged but never propagated. The translation pipeline must never fail because a monitoring write was slow.
Finalize Lambda: calls POST /admin/translate/update/{job_id} with terminal state — Aurora is updated, Valkey is updated, job removed from active set.

Read path

Status endpoint: reads Aurora first (authoritative state, timestamps, cost). Overlays Valkey progress (real-time per-pair map). Falls back to Valkey if Aurora insert is still pending (race window on submit).
History endpoint: reads Aurora exclusively — last 50 jobs ordered by submission date. Valkey fallback only if Aurora is unavailable.
Retry-failed handler: queries translation_job_events for lang_failed events — Aurora is the only reliable source for which pairs failed.
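
The status read path can be sketched as a merge of the two stores. The function name and dictionary shapes below are assumptions for illustration; only the ordering (Aurora first, Valkey progress overlaid, Valkey as fallback during the submit race window) comes from the description above.

```python
from typing import Optional


def resolve_status(aurora_row: Optional[dict], valkey_hash: Optional[dict]) -> Optional[dict]:
    """Merge the two stores the way the read path describes (illustrative only)."""
    if aurora_row is None:
        # Race window on submit: the async Aurora insert has not landed yet.
        return dict(valkey_hash) if valkey_hash else None
    status = dict(aurora_row)  # authoritative state, timestamps, cost
    if valkey_hash and "progress" in valkey_hash:
        # Only the real-time per-pair map is taken from Valkey.
        status["progress"] = valkey_hash["progress"]
    return status
```

Note that a stale state value in Valkey never overrides Aurora in this sketch; only the progress map is overlaid.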

Do not rely on Valkey for job state. Valkey keys have no TTL on job hashes and can be flushed, evicted under memory pressure, or simply stale if the finalize Lambda's update call was lost. Aurora is the record of what happened. Valkey is the window into what is happening right now.

8. Cost Protection

The translation engine calls Bedrock (Claude Sonnet 4.6) for every chunk of every document in every language. Without guardrails, a single typo in a batch submission could trigger hundreds of expensive API calls.

8.1 Four Protection Layers

Layer	Where	Limit	Behavior on Breach
Per-request limits	Admin API (submit handler)	Max 6 docs, max 11 langs, max 3 active jobs	400 Bad Request (docs/langs) or queue in SQS (active jobs)
Daily dollar cap	Admin API (submit handler)	$600/day (`autoops:bedrock:spend:daily`)	429 Too Many Requests until counter expires
Daily token cap	Admin API (submit handler)	50M combined tokens/day (`autoops:bedrock:tokens:input:daily` + `autoops:bedrock:tokens:output:daily`)	429 Too Many Requests until counters expire
Per-invocation tracking	Worker task	Increments after every Bedrock call	Source of truth for daily counters

8.2 Spend Tracking

Two parallel counters track daily usage — a dollar cap and a token cap. Both live in Valkey with 24-hour TTL auto-reset and are checked on every job submission.

Dollar Cap (`autoops:bedrock:spend:daily`)

Type: STRING (numeric dollar value as string)
Updated by: INCRBYFLOAT after every Bedrock invocation in the worker task
Reset: 24-hour TTL auto-reset — the worker sets EXPIRE autoops:bedrock:spend:daily 86400 after each increment
Limit: $600/day — checked by submit handler before admitting new jobs

Why $600? A full batch translation of the entire 40-document library × 11 languages costs approximately $726 in raw Bedrock spend at ~$1.65 per doc-language pair (Sonnet 4.6) — but in practice the library is never re-translated all at once. Typical batches are 3 or 6 documents (per the Trinity Beast multiples-of-3 convention) and run well under $200. The $600 cap is a daily safety guardrail with comfortable headroom for several batches plus normal AutoOps overhead (threat analysis, digests, support) in the same 24-hour window.

Token Cap (`autoops:bedrock:tokens:input:daily` + `autoops:bedrock:tokens:output:daily`)

Type: STRING (integer token count)
Updated by: INCRBY after every Bedrock invocation — separate keys for input and output tokens
Reset: 24-hour TTL auto-reset — same pattern as dollar cap
Limit: 50M combined tokens/day — model-agnostic secondary guard
Purpose: Provides a predictable ceiling that doesn't change when model pricing changes. At Sonnet 4.6 output rates ($15.00 per 1M tokens), 50M tokens ≈ $750, safely above the $600 dollar cap, so the dollar cap fires first under normal conditions. The token cap catches edge cases where pricing changes make the dollar cap insufficient.

Kill switch: Setting autoops:bedrock:kill = "1" in Valkey causes both the submit endpoint and the worker task to refuse all operations. Use this for emergency cost containment.
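
The three submit-time checks (kill switch, dollar cap, token cap) can be sketched as one decision function. This is a simplified illustration: admission_decision is a hypothetical name, the counter values are passed in rather than read from Valkey, and the order in which the checks run is an assumption.

```python
DAILY_SPEND_LIMIT_USD = 600.0
DAILY_TOKEN_LIMIT = 50_000_000


def admission_decision(kill: bool, spend_usd: float, input_tokens: int, output_tokens: int):
    """Return (allowed, reason). In the real system the inputs come from Valkey keys."""
    if kill:
        return False, "kill switch set"
    if spend_usd >= DAILY_SPEND_LIMIT_USD:  # spend must be under the cap
        return False, "daily dollar cap reached"
    if input_tokens + output_tokens >= DAILY_TOKEN_LIMIT:  # combined token count
        return False, "daily token cap reached"
    return True, "ok"
```

With the values from the health example (12.40 USD, 284,150 input tokens, 312,480 output tokens) the function admits the job.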

Pricing formula:

Bedrock cost — token-based: (input_tokens × input_rate) + (output_tokens × output_rate) per language pair
Infrastructure markup (9%) — covers ECS Fargate compute, S3 storage, CloudFront invalidation, SQS queuing, and Step Function orchestration
Service fee (30%) — applied to the combined cost (Bedrock + infrastructure)
Total price = Bedrock cost × 1.09 × 1.30

Token rates (stored in Valkey, always current):

Agent	Input (per 1M tokens)	Output (per 1M tokens)
Haiku 3.5	$1.00	$5.00
Sonnet 4.6	$3.00	$15.00
Opus 4	$5.00	$25.00
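
As a worked example, the pricing formula and the rate table can be combined into a small calculator. The function names are hypothetical and the token counts in the usage note are made up to land on the ~$1.65 Sonnet figure; the rates and multipliers are the ones stated above.

```python
# Rates in USD per 1M tokens, copied from the rate table.
RATES = {
    "Haiku 3.5": (1.00, 5.00),
    "Sonnet 4.6": (3.00, 15.00),
    "Opus 4": (5.00, 25.00),
}
INFRA_MARKUP = 1.09  # 9% infrastructure markup
SERVICE_FEE = 1.30   # 30% service fee on the combined cost


def bedrock_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Raw Bedrock spend for one (doc, lang) pair."""
    input_rate, output_rate = RATES[model]
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


def customer_price(cost: float) -> float:
    """Apply both multipliers: Bedrock cost x 1.09 x 1.30."""
    return cost * INFRA_MARKUP * SERVICE_FEE
```

For instance, a pair that used 100,000 input and 90,000 output tokens on Sonnet 4.6 costs 0.30 + 1.35 = 1.65 USD in Bedrock spend, and customer_price(1.65) rounds to 2.34 USD. The combined multiplier is 1.09 × 1.30 = 1.417.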

Typical costs (Bedrock spend before markup, and the resulting customer price):

Scenario	Bedrock Cost	Customer Price
1 document × 1 language (Haiku 3.5)	~$0.12	~$0.17
1 document × 1 language (Sonnet 4.6)	~$1.65	~$2.34
1 document × 11 languages (Sonnet 4.6)	~$18	~$25.50
6 documents × 11 languages (1 batch job, Trinity Beast convention)	~$108	~$153
40 documents × 11 languages (full library, Sonnet 4.6)	~$726	~$1,029

8.3 Infrastructure Integration

Translation engine metrics are exposed through three interfaces:

GET /public/infrastructure — includes a translation section with daily spend, daily limit, active jobs, queue depth, cost-per-pair estimate, and daily token counts (daily_input_tokens, daily_output_tokens, daily_token_limit). Consumed by the daily digest Lambda, nightly sync, and any monitoring dashboard.
KCC Live Dashboard — a dedicated Translation Engine card displays real-time spend, limit, active/queued jobs, cost-per-pair, and a token usage chart (input + output bars against the 50M daily limit). Auto-refreshes every 30 seconds alongside all other infrastructure panels.
Infrastructure Live page — the Translation Engine sub-section shows live stats: spend today, tokens in, tokens out, and active jobs — populated from /public/infrastructure every 30 seconds.

Email notification timing: The email notification is the absolute LAST step in the pipeline. It fires only after: translation, deployment, search index rebuild, state update, and history push are ALL complete. The email is a comprehensive report including: job summary, translation results, CloudFront invalidation IDs, search index status, and any Bedrock error details. If Bedrock reports validation failures, the specific error messages and validator details are included in the email.

9. CLI Compatibility

The existing CLI tool (scripts/kcc_helpers/translate_doc.py) continues to work unchanged. A --remote flag routes through the new service instead of running Bedrock locally:

Flag	Behavior	Use Case
`--local` (current default)	Runs translator engine in-process, calls Bedrock directly from laptop	Development, debugging, single-doc quick fixes
`--remote`	POSTs to `/admin/translate`, polls `/admin/translate/status/{id}` every 5s, streams progress to stdout	Production translations, batch operations

The --remote flag produces identical terminal output to local mode — same progress bars, same chunk counters, same completion summary. The operator's workflow doesn't change; only the execution path does.
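
The polling half of --remote might look roughly like the loop below. This is a sketch, not the code in translate_doc.py: poll_job and fetch_status are hypothetical names, the HTTP call is injected so the loop stays testable, and the set of terminal states is an assumption based on the states mentioned in this document.

```python
import time

# Assumed terminal states; the service may define others.
TERMINAL_STATES = {"succeeded", "failed", "cancelled"}


def poll_job(fetch_status, job_id, interval=5, sleep=time.sleep, on_update=print):
    """Poll until the job reaches a terminal state.

    fetch_status(job_id) is a hypothetical callable that would GET
    /admin/translate/status/{job_id} and return the "data" object.
    """
    while True:
        status = fetch_status(job_id)
        on_update(status)  # stream progress to the caller (stdout by default)
        if status["state"] in TERMINAL_STATES:
            return status
        sleep(interval)
```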

Default flip plan: Start with --local as default to avoid surprising anyone. After 30 days of clean production runs through the service, flip the default to --remote and add --local as the explicit fallback.

10. Configuration Reference — Protected Terms

All translation behavior is driven by a single config file: scripts/translation-config.json. This is the shared source of truth consumed by both the Python engine and the Go admin API.

10.1 Protected Terms (57 entries)

Brand names, product names, AWS services, exchange names, and acronyms that must never be translated or transliterated:

Cross Power Ministries of Pakistan, The Trinity Beast Infrastructure,
The Trinity Beast, Trinity Beast Command Center, Kiro Command Center,
Cory Dean Kalani, Shafiq Bhatti, BeastWebhook, BeastMirror, BeastMain,
BeastLRS, Claude Sonnet 4.6, Bedrock, ElastiCache, EventBridge,
CloudFront, GuardDuty, CloudWatch, CloudTrail, Step Functions,
Crypto.com, Coinbase, Gate.io, Gemini, Kraken, Aurora, Valkey,
Stripe, Kiro, Fargate, PostgreSQL, Lambda, Route 53, AutoOps,
TBCC, CPMP, TBI, KCC, OKX, ECR, ECS, ALB, NLB, WAF, SNS, SQS,
SES, VPC, IAM, S3 ...

Per-Request Protected Terms

In addition to the global protected terms list, you can submit document-specific terms via the protected_terms array in the translation request. This is useful for:

Proper nouns specific to a document (customer names, project names)
Technical identifiers not in the global list (new service names, API endpoints)
Domain-specific terminology that should remain in English

POST /admin/translate
{
  "docs": ["Trinity-Beast-API-Reference.html"],
  "langs": "all",
  "protected_terms": ["MyCustomService", "SpecialEndpoint", "ProjectAlpha"]
}

Per-request terms are merged with the global list for that job only. They do not persist across jobs.

10b. Configuration Reference — Preserve Patterns

10.2 Preserve Patterns

Regex patterns for technical tokens that must survive translation unchanged:

Pattern Name	Matches	Example
`url`	HTTP/HTTPS URLs	`https://api.cpmp-site.org/admin/translate`
`email`	Email addresses	`CoryDeanKalani@CPMP-Site.org`
`memory_size`	Number + memory unit	`1770 MB`, `32 GB`
`percentage`	Number + %	`98.5%`, `62%`
`cron_expr`	Cron expressions	`cron(0 11 * * ? *)`
`ip_address`	IPv4 with optional CIDR	`10.0.1.0/24`
`aws_arn`	AWS ARN format	`arn:aws:sns:us-east-2:211998422884:tbi-ops-notifications`
`aws_resource_id`	AWS resource identifiers	`vpc-03deaddb7083cd59c`, `sg-050b617f93b2388f6`
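
To make the table concrete, here are illustrative regexes for four of the patterns. They are assumptions written for this example, not the expressions stored in scripts/translation-config.json, and the resource-ID prefixes are limited to a few common ones.

```python
import re

# Illustrative regexes only; the real ones live in scripts/translation-config.json.
PRESERVE_PATTERNS = {
    "memory_size": re.compile(r"\b\d+(?:\.\d+)?\s?(?:KB|MB|GB|TB)\b"),
    "percentage": re.compile(r"\b\d+(?:\.\d+)?%"),
    "ip_address": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}(?:/\d{1,2})?\b"),
    "aws_resource_id": re.compile(r"\b(?:vpc|sg|subnet)-[0-9a-f]{8,17}\b"),
}


def find_preserved(text: str) -> dict:
    """Map each pattern name to the tokens it matched in text."""
    return {name: rx.findall(text) for name, rx in PRESERVE_PATTERNS.items()}
```

Tokens found this way would be shielded from the model and compared against the output after translation.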

10c. Configuration Reference — Limits

10.3 Limits

Parameter	Value	Purpose
`max_chunk_chars`	6000	Default maximum characters per chunk (Latin scripts: es, pt, fr, de, it)
`max_chunk_chars_by_lang`	See below	Per-language overrides for complex scripts
`max_retries`	3	Retry attempts per chunk on validation failure
`request_timeout_seconds`	300	Per-Bedrock-call timeout (5 minutes — large RTL chunks need headroom)
`max_output_tokens`	8192	Maximum tokens in Bedrock response

Per-language chunk size overrides:

Languages	Chunk Size	Rationale
`hi, ur, ar`	3000 chars	Devanagari and Arabic scripts expand significantly during translation. Smaller chunks prevent Bedrock timeouts.
`ja, zh, ru`	4500 chars	CJK and Cyrillic have moderate expansion. Mid-range chunks balance throughput and reliability.
`es, pt, fr, de, it`	6000 chars (default)	Latin scripts translate quickly with minimal expansion.
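
The override lookup amounts to a dictionary with a default. The sketch below mirrors the values in the table; the exact JSON shape of max_chunk_chars_by_lang is an assumption.

```python
DEFAULT_MAX_CHUNK_CHARS = 6000

# Assumed shape of max_chunk_chars_by_lang; values taken from the table above.
MAX_CHUNK_CHARS_BY_LANG = {
    "hi": 3000, "ur": 3000, "ar": 3000,
    "ja": 4500, "zh": 4500, "ru": 4500,
}


def max_chunk_chars(lang: str) -> int:
    """Return the per-language chunk limit, falling back to the default."""
    return MAX_CHUNK_CHARS_BY_LANG.get(lang, DEFAULT_MAX_CHUNK_CHARS)
```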

11. Operations Guide

11.1 Submitting a Translation Job

Single document, all languages:

curl -s -X POST \
  -H "X-Admin-Key: $ADMIN_KEY" \
  -H "Content-Type: application/json" \
  -d '{"docs":["Trinity-Beast-API-Reference.html"],"langs":"all"}' \
  https://api.cpmp-site.org/admin/translate | jq .

Multiple documents, specific languages:

curl -s -X POST \
  -H "X-Admin-Key: $ADMIN_KEY" \
  -H "Content-Type: application/json" \
  -d '{"docs":["Trinity-Beast-API-Reference.html","Trinity-Beast-Architecture-Guide.html"],"langs":["es","pt","fr"]}' \
  https://api.cpmp-site.org/admin/translate | jq .

With idempotency key (safe to retry):

curl -s -X POST \
  -H "X-Admin-Key: $ADMIN_KEY" \
  -H "X-Idempotency-Key: api-ref-2026-05-16" \
  -H "Content-Type: application/json" \
  -d '{"docs":["Trinity-Beast-API-Reference.html"],"langs":"all"}' \
  https://api.cpmp-site.org/admin/translate | jq .

11.2 Monitoring Progress

# Check job status
curl -s -H "X-Admin-Key: $ADMIN_KEY" \
  https://api.cpmp-site.org/admin/translate/status/{job_id} | jq .

# View queue
curl -s -H "X-Admin-Key: $ADMIN_KEY" \
  https://api.cpmp-site.org/admin/translate/queue | jq .

# System health
curl -s -H "X-Admin-Key: $ADMIN_KEY" \
  https://api.cpmp-site.org/admin/translate/health | jq .

# Recent history
curl -s -H "X-Admin-Key: $ADMIN_KEY" \
  https://api.cpmp-site.org/admin/translate/history | jq .

11.3 Troubleshooting

Symptom	Cause	Resolution
Job stuck in `queued`	EventBridge Pipe not consuming	Check Pipe status in console; verify IAM role
429 on submit	Daily spend cap ($600) or daily token cap (50M) hit	Wait for the 24h TTL expiry, or reset the spend counter manually: `SET autoops:bedrock:spend:daily 0`
Partial completion	Some languages failed validation	`POST /admin/translate/retry-failed/{id}`
Worker timeout	Document too large (many chunks)	Check Step Function execution history for the failing chunk index
Cancel returns 404	Job only in Aurora, not Valkey	Cancel handler falls back to Aurora — ensure latest code is deployed
No email notification	Finalize Lambda error	Check CloudWatch logs for `tbi-translate-finalize`
Search not updated	Search rebuild timed out	Run `bash scripts/kcc.sh build-search` manually

Cancel a running job:

curl -s -X POST -H "X-Admin-Key: $ADMIN_KEY" \
  https://api.cpmp-site.org/admin/translate/cancel/{job_id}

This stops the Step Function execution immediately. Documents already translated and deployed remain live. The search index is rebuilt for whatever landed successfully.

12. Regional Failover

The translation engine implements automatic regional failover to maintain availability during Bedrock service disruptions. This was added after a us-east-2 outage during development exposed the single-region weakness — better to discover this before customers were affected.

12.1 Failover Chain

When a Bedrock call fails with a service-level error, the engine automatically retries in the next region:

Priority	Region	Location	Role
1	`us-east-2`	Ohio	Primary — all normal traffic
2	`us-east-1`	N. Virginia	First fallback
3	`us-west-2`	Oregon	Second fallback

The failover is transparent to the caller — the translation completes successfully as long as at least one region is available. A log message records when a fallback region was used.

12.2 Trigger Conditions

Failover is triggered for service-level errors and timeouts that indicate the region is unavailable or overloaded:

Error Type	Meaning	Action
`ServiceUnavailableException`	Bedrock service is down (503)	Retry same region once, then failover
`ThrottlingException`	Rate limit or capacity exceeded	Retry same region once, then failover
`ModelStreamErrorException`	Model streaming failure	Retry same region once, then failover
`ReadTimeoutError`	Response took longer than 300s	Retry same region once, then failover
`ConnectTimeoutError`	Could not establish connection within 10s	Retry same region once, then failover

Other errors (validation failures, authentication errors, malformed requests) are not retried — they would fail identically everywhere.

Per-Region Retry with Backoff

Each region gets 2 attempts before the engine moves to the next region. A 5-second backoff between attempts allows transient pressure to clear:

us-east-2 (attempt 1) → timeout → wait 5s →
us-east-2 (attempt 2) → timeout →
us-east-1 (attempt 1) → timeout → wait 5s →
us-east-1 (attempt 2) → timeout →
us-west-2 (attempt 1) → timeout → wait 5s →
us-west-2 (attempt 2) → timeout → FAIL (raise exception)

Total: 6 attempts across 3 regions. In practice, transient spikes clear within 5-10 seconds, so the retry within the same region usually succeeds without needing failover.
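
The retry-then-failover sequence can be sketched as two nested loops. This is an illustration under assumptions: call(region) is a hypothetical function that performs one Bedrock request, and RetryableError stands in for the five error types in the trigger table. Any other exception propagates immediately, matching the rule that non-service errors are not retried.

```python
import time

REGIONS = ["us-east-2", "us-east-1", "us-west-2"]
ATTEMPTS_PER_REGION = 2
BACKOFF_SECONDS = 5


class RetryableError(Exception):
    """Stand-in for the service-level errors and timeouts listed in 12.2."""


def invoke_with_failover(call, sleep=time.sleep):
    """Try each region twice, backing off between same-region attempts."""
    last_error = None
    for region in REGIONS:
        for attempt in range(1, ATTEMPTS_PER_REGION + 1):
            try:
                return call(region)
            except RetryableError as exc:
                last_error = exc
                if attempt < ATTEMPTS_PER_REGION:
                    sleep(BACKOFF_SECONDS)  # wait before retrying the same region
    raise last_error  # all 6 attempts failed
```

Injecting sleep keeps the function testable without real delays.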

12.3 Cost Impact

Regional failover has negligible cost impact:

Same pricing: Bedrock pricing is identical across all three regions
No duplicate charges: Failed requests that trigger failover are not billed (no tokens consumed)
Minimal latency: Cross-region latency adds ~50-100ms per call, unnoticeable in the context of a multi-second translation
Estimated overhead: ~$0.0005 per document when failover is triggered (additional API call setup cost)

Resilience benefit: A complete regional outage no longer blocks translations. The May 2026 us-east-2 outage would have caused a 4-hour translation blackout without this feature. With failover, translations continued uninterrupted via us-east-1.

13. Document Preparation Guide

Proper document preparation ensures clean translations with minimal post-processing. This section covers the conventions that help the translation engine produce accurate results.

13.1 Code Tag Usage

The <code translate="no"> tag tells the translation engine to preserve content exactly as written. Use it correctly to avoid formatting artifacts in translated documents.

When to Use Code Tags

Use <code translate="no"> for technical identifiers that would break if translated:

Service names: tbi-ops-notify, BeastMain
API endpoints: /admin/translate, /public/infrastructure
Environment variables: AWS_REGION, ADMIN_KEY
Function names: _apply_sentinels(), translate()
File paths: /var/log/app.log, scripts/kcc.sh
Database columns: job_id, created_at
Configuration keys: max_retries, request_timeout_seconds
Command-line flags: --remote, --force

When NOT to Use Code Tags

Do not wrap pure data values in code tags — they should appear as plain text:

Memory sizes: 1770 MB, 32 GB, 3 GB (not <code>1770 MB</code>)
Timeouts: 60 seconds, 180s (not <code>60 seconds</code>)
Percentages: 98.5%, 62% (not <code>98.5%</code>)
Counts: 11 languages, 6 documents (not <code>11 languages</code>)
Costs: $600, $1.65 (not <code>$600</code>)
Version numbers in prose: version 17.7 (not <code>17.7</code>)

Why this matters: The translation engine's sentinel system protects code-tagged content from translation. If you wrap "32 GB" in code tags, it survives translation — but so does the monospace formatting, which looks wrong in prose. The engine has a post-processor that strips spurious code wrappers from pure numeric values, but it's better to author correctly from the start.
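
A post-processor of this kind could be approximated with a single regex. The pattern below is an assumption written for illustration (the engine's actual rule is not shown in this document): it unwraps code tags whose content is only a number, optionally prefixed by $ and followed by % or a short unit word.

```python
import re

# Matches <code> or <code translate="no"> around a bare number with an optional unit.
NUMERIC_CODE = re.compile(
    r'<code(?: translate="no")?>\s*(\$?\d+(?:\.\d+)?\s?(?:%|[A-Za-z]{1,7})?)\s*</code>'
)


def strip_numeric_code_wrappers(html: str) -> str:
    """Replace code-wrapped pure data values with their plain text."""
    return NUMERIC_CODE.sub(r"\1", html)
```

Identifiers such as max_retries do not start with a digit, so they keep their code tags.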

Quick Test

Ask yourself: "If I changed this value, would the system break?" If yes, use code tags. If no (it's just a number or measurement), leave it as plain text.

Content	Would changing it break something?	Use code tags?
`tbi-ops-notify`	Yes — Lambda name	✅ Yes
1770 MB	No — just a memory size	❌ No
`/admin/translate`	Yes — API endpoint	✅ Yes
$600	No — just a dollar amount	❌ No
`max_retries`	Yes — config key	✅ Yes
3 retries	No — just a count	❌ No

13.2 Protected Terms Submission

For documents with domain-specific terminology not in the global protected terms list, submit additional terms with the translation request:

POST /admin/translate
{
  "docs": ["Customer-Integration-Guide.html"],
  "langs": "all",
  "protected_terms": [
    "CustomerCorp",
    "ProjectPhoenix",
    "DataSync API",
    "IntegrationHub"
  ]
}

These terms are added to the global list for this job only. The engine will:

Wrap each term in <span translate="no"> during preprocessing
Replace with sentinel tokens before sending to Bedrock
Restore the original terms after translation
Validate that all terms survived intact
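
The protect, restore, and validate steps can be sketched as follows. This is a simplified illustration, not the engine's code: it skips the span-wrapping step, protect_terms and restore_terms are hypothetical names, and the __TBP{n}__ token shape is borrowed from the sentinel format used by the per-chunk system.

```python
import re


def protect_terms(text: str, terms: list) -> tuple:
    """Swap each protected term for a sentinel token; return (text, mapping)."""
    mapping = {}
    # Longest first, so "DataSync API" wins over a shorter overlapping term.
    ordered = sorted(set(terms), key=len, reverse=True)
    if not ordered:
        return text, mapping
    pattern = re.compile("|".join(re.escape(term) for term in ordered))

    def to_sentinel(match):
        token = f"__TBP{len(mapping)}__"
        mapping[token] = match.group(0)
        return token

    return pattern.sub(to_sentinel, text), mapping


def restore_terms(translated: str, mapping: dict) -> str:
    """Put the original terms back and fail if any sentinel went missing."""
    missing = [token for token in mapping if token not in translated]
    if missing:
        raise ValueError(f"sentinels lost in translation: {missing}")
    for token, term in mapping.items():
        translated = translated.replace(token, term)
    return translated
```

Matching is case-sensitive, which is consistent with "CustomerCorp" and "customercorp" being treated as different terms.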

Best Practices for Protected Terms

Be specific: "DataSync API" is better than "DataSync" (avoids false matches)
Include variations: If a term appears as both "ProjectPhoenix" and "Project Phoenix", include both
Case matters: "CustomerCorp" and "customercorp" are treated as different terms
Don't over-protect: Common English words that translate well don't need protection

13.3 Clarification Workflow

When the translation engine encounters ambiguous content, it may flag it for human review. This happens in the validation phase when:

A protected term appears to have been partially translated
A version number format doesn't match the expected pattern
HTML structure differs significantly between source and output

Flagged content appears in the job status response under the warnings array:

{
  "status": "✅ [LPO] [us-east-2] [BeastMain] [/admin/translate/status/1747407720-a3f8b2c1d4e5] [200]",
  "status_code": 200,
  "endpoint": "/admin/translate/status/1747407720-a3f8b2c1d4e5",
  "cluster_node": "BeastMain",
  "region": "us-east-2",
  "language": "en",
  "timestamp": "2026-05-16T17:45:00Z",
  "data": {
    "job_id": "1747407720-a3f8b2c1d4e5",
    "state": "succeeded",
    "warnings": [
      "chunk 14 (ja): soft failure — protected term 'DataSync' may have been altered",
      "chunk 22 (ar): soft failure — version number format changed from X.Y.Z to X.Y"
    ]
  },
  "error": ""
}

Soft failures don't block the translation — the output is still deployed. Review the warnings and manually verify the flagged sections if needed.

Feedback loop: If you consistently see the same term flagged, add it to the global protected terms list in scripts/translation-config.json. This prevents future warnings and improves translation quality across all documents.

14. Pre-Scan Complexity Analysis

Before translation begins, the engine analyzes each document for complexity factors that may cause validation failures. This pre-scan identifies code-heavy sections and recommends whether to proceed, exercise caution, or split the document.

14.1 Complexity Metrics

The pre-scan calculates a complexity score for each section based on:

Factor	Weight	Why It Matters
Code tags	1.0 per tag	Each code tag must survive translation intact — more tags = more validation points
Code tags in tables	1.5 per tag	Tables with code examples are harder — model tends to merge or drop tags when reordering
Tables	2.0 per table	Tables with technical content require careful structure preservation
Pre blocks	0.5 per block	Usually have translate="no" — lower risk but still tracked
Protected spans	0.3 per span	Handled by sentinel system — low risk

Section Thresholds

High-density section: Complexity score > 15 OR > 10 code tags
Document split threshold: Total score ≥ 50 OR > 3 high-density sections

14.2 Recommendations

Based on the analysis, the pre-scan returns one of three recommendations:

Recommendation	Criteria	Action
PROCEED	Score < 20, no high-density sections	Translate normally — low failure risk
CAUTION	Score < 50, ≤ 3 high-density sections	Proceed but monitor (may need retries)
SPLIT	Score ≥ 50 OR > 3 high-density sections	Consider splitting document before translation
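
The scoring and recommendation rules can be sketched as below. Several details are assumptions made for this example: code tags inside tables are counted at 1.5 instead of 1.0 (not in addition to it), a total score of exactly 50 counts as SPLIT, and a document with exactly three high-density sections falls under CAUTION. The function names are hypothetical.

```python
WEIGHTS = {
    "code_tags": 1.0,            # code tags outside tables
    "code_tags_in_tables": 1.5,
    "tables": 2.0,
    "pre_blocks": 0.5,
    "protected_spans": 0.3,
}


def section_score(counts: dict) -> float:
    """Weighted sum of element counts for one section."""
    return sum(WEIGHTS[name] * counts.get(name, 0) for name in WEIGHTS)


def is_high_density(counts: dict) -> bool:
    total_code_tags = counts.get("code_tags", 0) + counts.get("code_tags_in_tables", 0)
    return section_score(counts) > 15 or total_code_tags > 10


def recommend(sections: list) -> str:
    """Return PROCEED, CAUTION, or SPLIT for a list of per-section counts."""
    total = sum(section_score(counts) for counts in sections)
    dense = sum(1 for counts in sections if is_high_density(counts))
    if total >= 50 or dense > 3:
        return "SPLIT"
    if total < 20 and dense == 0:
        return "PROCEED"
    return "CAUTION"
```

For example, a single section with 4 code tags and 1 table scores 4 × 1.0 + 1 × 2.0 = 6.0 and yields PROCEED.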

Pre-Scan Output Example

DOCUMENT TRANSLATION COMPLEXITY ANALYSIS
========================================
Total characters: 81,107
Total sections: 13
Total code tags: 287
Overall complexity score: 415.4
Recommendation: SPLIT

WARNINGS:
  ⚠️  Document has 287 code tags — high validation failure risk
  ⚠️  Section 'step-function' has 51 code tags — consider simplifying
  ⚠️  Section 'observability' has 48 code tags — consider simplifying

HIGH-DENSITY SECTIONS (9):
  • architecture: 11 code tags, score 22.7
  • sentinel-system: 22 code tags, score 34.5
  • step-function: 51 code tags, score 71.1
  ...

SUGGESTED SPLIT: 4 parts
  → Split after 'validators' (After 3 high-density sections)
  → Split after 'observability' (After 3 high-density sections)
  → Split after 'doc-prep' (After 3 high-density sections)

14.3 Document Splitting

When the pre-scan recommends splitting, it suggests natural break points at section boundaries. Options for handling complex documents:

Option 1: Split into Multiple Documents

Create separate HTML files for each part (e.g., Doc-Part1.html, Doc-Part2.html). Each part translates independently with lower failure risk. Link them together with navigation.

Option 2: Simplify High-Density Sections

Reduce code tag density in problematic sections:

Replace code examples with prose descriptions where possible
Move detailed code to appendices or separate reference docs
Use styled spans instead of code tags for visual-only formatting
Consolidate similar examples into single code blocks

Option 3: Translate in Batches

Submit fewer languages per job (e.g., 3 instead of 11). This reduces concurrent load and allows the model more capacity per translation. Retry failed languages individually.

Per-Language Split Thresholds (v2.5)

Complex scripts (Urdu, Arabic, Hindi) struggle with high tag density even when Latin-script languages handle the same chunk fine. The prescan now applies per-language code tag limits — tighter thresholds for scripts where the model is more likely to drop markup:

Language	Script	Max Code Tags per Part
Default (Latin, CJK, Cyrillic)	Latin / Kanji / Cyrillic	30
Urdu (ur)	Nastaliq	18
Arabic (ar)	Arabic	18
Hindi (hi)	Devanagari	20

Configuration key: max_code_tags_per_part_by_lang in translation-config.json. When the prescan runs for a specific language, it uses that language's threshold to determine split points. A document that translates as one part for Spanish may automatically split into 2-3 parts for Urdu.

Result: The Translation Service document (22 code tags in the Architecture section) previously failed for Urdu on every attempt. With the per-language threshold of 18, the prescan splits Architecture and Observability into separate parts. All 11 languages now translate successfully.

This document is an edge case: The Translation Engine documentation itself has 287 code tags and a complexity score of 415 — it's documentation about a translation engine, so it's packed with code examples. Most documents score under 50.

14.4 Splitting Safety Valve (v2.8)

Even when code tag density is low, a single part that exceeds the model's effective output window will be silently truncated — sections at the end of the part simply disappear from the output. The safety valve enforces a hard character limit per part regardless of prescan recommendations.

The Problem

The Performance Report (75 KB) has 18 sections with moderate code density. The prescan recommended splitting into 3 parts based on code tag thresholds. But Part 1 was 36 KB of prose-heavy content — well under the code tag limit but far beyond the model's output token budget. The model translated the first ~24 KB faithfully, then its output simply stopped. Sections 7-8 (partner-sustained, udp-engine) vanished without any error signal.

The Fix

# Safety valve: max chars per part (prevents model output truncation)
MAX_CHARS_PER_PART = 24000  # ~6000 tokens, well within max_output_tokens

The splitter now enforces a 24 KB ceiling on every part. If a part exceeds this limit after the prescan-based split, it is further subdivided at the nearest section boundary. This is conservative — Latin scripts could handle ~30 KB, but 24 KB is safe for all languages including RTL and CJK where token efficiency is lower.
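
The subdivision step might look roughly like this greedy packer. It is a sketch under assumptions: enforce_part_limit is a hypothetical name, sections are given as (section_id, char_count) pairs, and cuts happen only at section boundaries, so a single section larger than the limit still ends up alone in an oversized part.

```python
MAX_CHARS_PER_PART = 24000


def enforce_part_limit(sections, limit=MAX_CHARS_PER_PART):
    """Greedily pack (section_id, char_count) pairs into parts no larger than limit."""
    parts, current, size = [], [], 0
    for section_id, chars in sections:
        if current and size + chars > limit:
            # Adding this section would overflow: close the part at the boundary.
            parts.append(current)
            current, size = [], 0
        current.append(section_id)
        size += chars
    if current:
        parts.append(current)
    return parts
```

A 36,000-character part made of sections of 10,000, 10,000, 10,000, and 6,000 characters would be cut into two parts of 20,000 and 16,000 characters.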

Impact

Document	Before (v2.6)	After (v2.8)
Performance Report (75 KB)	3 parts (Part 1: 36 KB — truncated)	4 parts (largest: 22 KB — clean)
API Reference (180 KB)	8 parts (all under 24 KB already)	8 parts (no change — already safe)
Translation Engine (116 KB)	11 parts (code-density driven)	11 parts (no change — code splits dominate)

The safety valve only activates when the prescan's code-tag-based splitting produces oversized parts. For most documents, the code density split already keeps parts well under 24 KB.

Result: Performance Report went from dropping 3 entire sections (silent truncation) to a perfect 18/18 sections, 4/4 diagrams, 20/20 <br/> tags across all 11 languages.

15. Document-Level Preprocessor

The document-level preprocessor is a critical layer that runs before chunking. It extracts complex HTML elements from the entire document, replacing them with simple Unicode placeholders. After translation, the postprocessor restores the original elements. This removes the main cause of the "model drops tags" failure mode, since the model no longer sees the tags at all.

15.1 The Problem

The per-chunk sentinel system (Section 3) works well for most documents, but complex documents with many <code>, <strong>, and <em> tags exposed a fundamental limitation:

By the time chunks are created, they already contain many protected elements
Each element becomes a sentinel placeholder (__TBP0__, __TBP1__, etc.)
Chunks with 20+ placeholders overwhelm the model's attention
The model occasionally drops, duplicates, or merges placeholders during translation
Validation catches these failures, but retries often produce the same errors

Example failure: A chunk with 27 <code translate="no"> tags consistently failed validation with tag count mismatch (27→23) — the model dropped 4 placeholders despite explicit instructions to preserve them.

15.2 The Solution

Extract ALL problematic elements from the entire document before chunking. The model never sees these elements — only simple Unicode placeholders that it cannot confuse with HTML structure.

Key insight: The model cannot corrupt what it never sees. By extracting elements at the document level, each chunk has zero complex tags to worry about. The model translates clean prose with obvious markers.

Before vs After

Pipeline Stage	Before (v2.2)	After (v2.3)
Document received	290 code tags	290 code tags
After preprocessing	—	0 code tags (290 placeholders)
Per-chunk sentinels	20+ placeholders per chunk	0-2 placeholders per chunk
Model cognitive load	High (complex structure)	Low (clean prose)
Validation failures	Frequent on complex docs	Rare

15.3 Processing Flow

The preprocessor integrates into the translation pipeline as the first step:

Document → PREPROCESS → Chunk → Translate → Reassemble → POSTPROCESS → Output
              ↓                                              ↓
     Extract ALL code/pre/strong/em tags        Restore placeholders
     Replace with ⟦CODE_001⟧, ⟦STRONG_002⟧     with original HTML
     Build manifest mapping                     from manifest

Integration in `engine.py`

def translate(text, target_lang, mode="html", ...):
    # Step 1: PREPROCESS - extract elements (document-level)
    simplified_html, manifest = preprocess_for_translation(text)

    # Step 2: CHUNK - split simplified document (zero complex tags now)
    head, chunks, tail = chunker.split_document(simplified_html, lang=target_lang)

    # Step 3: TRANSLATE - each chunk through Bedrock, collected in document order
    translated_chunks = []
    for chunk in chunks:
        translated = _translate_chunk(chunk, ...)  # Per-chunk sentinels still run
        translated_chunks.append(translated)

    # Step 4: REASSEMBLE
    reassembled = chunker.reassemble(head, translated_chunks, tail)

    # Step 5: POSTPROCESS - restore placeholders with original elements
    output = postprocess_translation(reassembled, manifest)
    return output

15.4 Element Extraction

The preprocessor extracts elements in order of specificity (most specific first) to handle nesting correctly:

Pass	Elements Extracted	Placeholder Format
1	`<pre translate="no">` blocks	`⟦PRE_001⟧`
2	`<code translate="no">` tags	`⟦CODE_001⟧`
3	Other `translate="no"` elements	`⟦SPAN_001⟧`
4	`<strong>`, `<em>`, `<b>`, `<i>` tags	`⟦STRONG_001⟧`, `⟦EM_001⟧`
5	Numeric patterns (memory sizes, percentages, versions)	`⟦MEM_001⟧`, `⟦PCT_001⟧`, `⟦VER_001⟧`

Placeholder Format

Placeholders use Unicode brackets (⟦ and ⟧) that will never appear in real HTML content:

Format: ⟦TYPE_NNN⟧ (e.g., ⟦CODE_042⟧)
TYPE: Element or pattern type (CODE, PRE, SPAN, STRONG, EM, B, I, plus MEM, PCT, VER for the numeric patterns of Pass 5)
NNN: Zero-padded index (001, 002, ...)
Zero collision risk: Unicode brackets don't appear in HTML, code, or prose

Nested Element Handling

The preprocessor handles arbitrary nesting depth by processing innermost elements first:

Source:
<span translate="no"><code translate="no">tbi-ops-notify</code> Lambda</span>

Pass 2: Extract inner code tag
<span translate="no">⟦CODE_001⟧ Lambda</span>

Pass 3: Extract outer span
⟦SPAN_002⟧

Model sees: ⟦SPAN_002⟧ (one placeholder, no nesting)
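The innermost-first behaviour can be sketched with a regular expression that only matches translate="no" elements whose body contains no further tags. This is an illustration under assumptions: extract_protected is a hypothetical name, the regex expects translate="no" to be the first attribute, and the real preprocessor walks a DOM tree (see the v2.4 note below) rather than using regex.

```python
import re

# Matches a translate="no" element whose body contains no "<", i.e. an
# innermost element. Placeholders contain no "<", so an outer element
# becomes matchable once its children have been replaced.
INNERMOST = re.compile(r'<(\w+)\s+translate="no"[^>]*>([^<]*)</\1>')


def extract_protected(html):
    """Replace protected elements with placeholders, innermost first."""
    manifest = {}
    counter = 0

    def replace(match):
        nonlocal counter
        counter += 1
        tag = match.group(1).upper()
        kind = tag if tag in ("PRE", "CODE") else "SPAN"
        placeholder = f"⟦{kind}_{counter:03d}⟧"
        manifest[placeholder] = {"type": kind, "html": match.group(0), "index": counter}
        return placeholder

    # Repeat until a sweep finds nothing left to extract.
    while True:
        html, count = INNERMOST.subn(replace, html)
        if count == 0:
            return html, manifest
```

Run on the source line from the example above, the sketch returns the single placeholder ⟦SPAN_002⟧ and a two-entry manifest in which the outer entry's HTML still contains ⟦CODE_001⟧.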

Sibling Placeholder Awareness

When the preprocessor extracts elements from a container (e.g., a table cell), earlier passes leave placeholder text in the parent. Later passes must not be confused by these sibling placeholders — a <code translate="no"> tag in the same table cell as an already-extracted element is still a valid extraction target.

Bug fixed (v2.4): The original _is_inside_placeholder check walked up the DOM tree looking for the ⟦ character in any parent's text. This caused false positives — if a sibling element had been extracted (leaving ⟦CODE_042⟧ in the parent's text), the check incorrectly skipped remaining <code translate="no"> tags in the same container. Those unextracted tags then overwhelmed the model during complex-script translation (Hindi, Urdu). Fix: the check now always returns false — if an element still exists in the DOM tree, it wasn't extracted and is a valid target.

15.5 Restoration

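After translation, the postprocessor restores placeholders in reverse index order (high → low). This prevents prefix collisions and resolves nesting in a single sweep, because an outer element's stored HTML can contain lower-numbered placeholders: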

Translated: ⟦SPAN_002⟧

Restore ⟦SPAN_002⟧:
<span translate="no">⟦CODE_001⟧ Lambda</span>

Restore ⟦CODE_001⟧:
<span translate="no"><code translate="no">tbi-ops-notify</code> Lambda</span>

Perfect reconstruction — model never had to understand nesting.

Manifest Structure

The manifest maps each placeholder to its original HTML, enabling exact restoration:

{
  "⟦CODE_001⟧": {
    "type": "CODE",
    "html": "<code translate=\"no\">tbi-ops-notify</code>",
    "index": 1
  },
  "⟦SPAN_002⟧": {
    "type": "SPAN",
    "html": "<span translate=\"no\">⟦CODE_001⟧ Lambda</span>",
    "index": 2
  }
}
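Given a manifest of this shape, restoration reduces to a sorted string replacement. A minimal sketch, assuming a hypothetical restore helper (the engine's actual entry point is postprocess_translation, whose internals are not shown here):

```python
def restore(translated, manifest):
    """Expand placeholders from the highest index down to the lowest."""
    ordered = sorted(manifest.items(), key=lambda item: item[1]["index"], reverse=True)
    for placeholder, entry in ordered:
        # Outer elements carry higher indexes; expanding them first exposes
        # the inner placeholders stored inside their HTML.
        translated = translated.replace(placeholder, entry["html"])
    return translated


manifest = {
    "⟦CODE_001⟧": {"type": "CODE", "html": '<code translate="no">tbi-ops-notify</code>', "index": 1},
    "⟦SPAN_002⟧": {"type": "SPAN", "html": '<span translate="no">⟦CODE_001⟧ Lambda</span>', "index": 2},
}
restored = restore("⟦SPAN_002⟧", manifest)
```

With the two entries above, restored is the original nested span from Section 15.4.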

Result: The Translation Engine document (290 code tags, complexity 423) now translates with 0 retries across all 11 parts. Previously it failed consistently on Part 8 (config section with 27 code tags).

15.6 Numeric Pattern Extraction

Pass 5 extracts numeric patterns from the text after HTML element extraction. This protects bare numbers in prose that weren't already inside code or span tags. The model cannot convert, localize, or drop what it never sees.

Why Numeric Extraction Matters

When translating, especially to complex scripts (Arabic, Hindi, Urdu), the model occasionally:

Drops numeric values: "32 GB" becomes just "GB" or disappears entirely
Localizes units: "MB" becomes "Mo" (French) or "ميغابايت" (Arabic)
Converts formats: "98.5%" becomes "٩٨٫٥٪" (Eastern Arabic numerals)
Paraphrases: "1770 MB" becomes "approximately 2 GB"

These transformations break technical accuracy. The numeric extraction pass prevents all of them.

Patterns Extracted

Pattern Type	Regex	Examples	Placeholder
Memory sizes	`\d+(?:\.\d+)?\s?(?:GB|MB|KB|TB)`	32 GB, 1770 MB, 256 KB	`⟦MEM_001⟧`
Percentages	`\d+(?:\.\d+)?%`	98.5%, 62%, 100%	`⟦PCT_001⟧`
Version numbers	`\d+\.\d+(?:\.\d+)?`	4.6, 17.7, 2.3.1	`⟦VER_001⟧`
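The three patterns can be combined into one alternation, as in the sketch below. The function name extract_numeric, the next_index parameter, and the flat manifest shape are assumptions for illustration. The order of the alternatives is also a choice of this sketch: the version pattern is the most general, so it is tried last to keep "98.5%" from being captured as a version number.

```python
import re

# Patterns from the table above, combined as named alternatives.
NUMERIC = re.compile(
    r"(?P<MEM>\d+(?:\.\d+)?\s?(?:GB|MB|KB|TB))"
    r"|(?P<PCT>\d+(?:\.\d+)?%)"
    r"|(?P<VER>\d+\.\d+(?:\.\d+)?)"
)


def extract_numeric(text, next_index=1):
    """Replace bare numeric values with typed placeholders."""
    manifest = {}

    def replace(match):
        nonlocal next_index
        # match.lastgroup is the name of the alternative that matched.
        placeholder = f"⟦{match.lastgroup}_{next_index:03d}⟧"
        manifest[placeholder] = match.group(0)
        next_index += 1
        return placeholder

    return NUMERIC.sub(replace, text), manifest
```

Starting the index at 42 makes the sketch turn "1770 MB" and "98.5%" into ⟦MEM_042⟧ and ⟦PCT_043⟧, the same placeholders used in the Hindi example below.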

Processing Order

Numeric extraction runs after HTML element extraction (Passes 1-4). This means:

Numbers inside <code> tags are already protected by Pass 2
Numbers inside translate="no" spans are already protected by Pass 3
Pass 5 only catches bare numbers in prose that weren't otherwise protected
No double-extraction — the regex skips content already inside placeholders

Example: Hindi Translation

Source:
"The Lambda uses 1770 MB of memory and achieves 98.5% uptime."

After Pass 5:
"The Lambda uses ⟦MEM_042⟧ of memory and achieves ⟦PCT_043⟧ uptime."

Model translates prose, placeholders survive intact.

After restoration:
"लैम्ब्डा 1770 MB मेमोरी का उपयोग करता है और 98.5% अपटाइम प्राप्त करता है।"

Technical values preserved exactly — no localization, no conversion.

Result: Translation failures caused by numeric value loss (preserve_memory_size: missing: GB, MB) are now resolved across all 11 languages. Numeric values survive intact regardless of target script.

Placeholder Collision Prevention

The numeric extraction pass includes safeguards to prevent extracting numbers that are part of existing placeholder names (e.g., the "001" in ⟦CODE_001⟧):

Skips matches preceded by an underscore within a placeholder token
Skips matches followed by the closing bracket ⟧
Skips matches inside unclosed placeholder brackets

Without these guards, the numeric regex would corrupt placeholder names by extracting their index numbers, producing nested placeholders like ⟦CODE___TBN10__⟧ that the model cannot handle.
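A position check against the surrounding brackets is one way to express such a guard. The sketch below is an assumption, not the engine's code: inside_placeholder and guarded_matches are hypothetical names, and the deliberately broad \d+ pattern exists only to exercise the guard. For ⟦TYPE_NNN⟧ tokens, an index number always sits between an opening ⟦ and its closing ⟧, so this single check covers the cases listed above.

```python
import re

BARE_NUMBER = re.compile(r"\d+")  # hypothetical broad pattern, for illustration only


def inside_placeholder(text, pos):
    """True if pos falls between an opening ⟦ and its closing ⟧."""
    return text.rfind("⟦", 0, pos) > text.rfind("⟧", 0, pos)


def guarded_matches(text, pattern=BARE_NUMBER):
    """Return only the matches that are safe to extract."""
    return [
        match.group(0)
        for match in pattern.finditer(text)
        if not inside_placeholder(text, match.start())
    ]
```

For the text "⟦CODE_001⟧ listens on port 8080", the index 001 is skipped and only 8080 remains a candidate for extraction.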

16. Notification System

The translation engine sends email notifications via the AutoOps notification pipeline (tbi-ops-notify Lambda → SES). Notifications are consolidated across batch jobs and include detailed per-document breakdowns.

16.1 Email Format

Each notification email includes:

Batch Summary: Total jobs, documents, languages, succeeded/failed pairs, final state
Total Time: Formatted as "Xm Ys" (e.g., "7m 12s") for readability
Per-Document Breakdown: Which languages succeeded and failed for each document
Deployment Status: CloudFront invalidation count, S3 deployment confirmation
Search Index Status: Whether the search index was rebuilt successfully
Error Details: Specific validation failures with chunk/validator information

Example Notification

Subject: [INFO] Translation Complete: 2 docs × 11 langs — 22/22 pairs SUCCEEDED

Batch Summary:
• Jobs: 2
• Documents: 2
• Languages: 11
• Total Pairs: 22
• Succeeded: 22
• Failed: 0
• Final State: SUCCEEDED
• Total Time: 7m 12s

Documents Translated:
• Trinity-Beast-AutoOps-Translation-Engine.html
  ✓ Succeeded: es, pt, fr, de, ru, hi, ja, zh, ar, ur, it
• Trinity-Beast-Infrastructure-Overview.html
  ✓ Succeeded: es, pt, fr, de, ru, hi, ja, zh, ar, ur, it

Deployment:
• CloudFront Invalidations: 2
• All translated files deployed to S3

Search Index:
• Rebuilt successfully (all 11 languages)

Partial Success Example

Subject: [WARNING] Translation Complete: 1 doc × 11 langs — 10/11 pairs PARTIAL

Documents Translated:
• Complex-Technical-Guide.html
  ✓ Succeeded: es, pt, fr, de, ru, hi, ja, zh, ar, it
  ✗ Failed: ur

Error Details:
• Complex-Technical-Guide.html → ur: chunk 14 failed validation after 3 retries
  check_tag_counts: expected 27 code tags, found 23

16.2 Batch Consolidation

When multiple translation jobs are submitted together (e.g., translating 5 documents), the notification system consolidates them into a single email:

Deferred sending: Each finalizing job checks if other jobs are still active
Last job sends: Only the last job to complete sends the consolidated email
Safety net: If all jobs see each other as "active" (race condition), a 5-second wait and re-check prevents missed notifications
Fallback: If no jobs are queued and the batch is wrapping up, the current job sends anyway

This prevents notification spam when translating multiple documents — you get one comprehensive email covering the entire batch, not 5 separate emails.

16.3 Document Resolver (v2.5)

When the same document appears in multiple jobs within a batch (e.g., initial run fails Urdu, retry succeeds), the notification resolves duplicate entries into a single final-state view:

Merge logic: For each document, collect all succeeded and failed languages across all jobs
Retry overrides failure: If a language appears in both succeeded (retry job) and failed (original job), it's reported as succeeded
Deduplicated counts: Summary totals (Succeeded, Failed, Total Pairs) are recalculated from the resolved state — not raw aggregation
Single entry per document: The notification shows each document exactly once with its final language breakdown

Without the resolver, a retry job would show the same document twice — once with the failure and once with the fix — making the notification confusing and the counts misleading.
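The merge rules can be sketched as a pure function. The input shape (a list of (document, succeeded, failed) tuples, one per job) and the name resolve_documents are assumptions. The sketch applies the "retry overrides failure" rule literally: a language that succeeded in any job of the batch counts as succeeded.

```python
def resolve_documents(job_results):
    """Merge per-job results into one final-state entry per document."""
    succeeded, failed = {}, {}
    for doc, ok_langs, bad_langs in job_results:
        succeeded.setdefault(doc, set()).update(ok_langs)
        failed.setdefault(doc, set()).update(bad_langs)

    resolved = {}
    for doc, ok in succeeded.items():
        # A success in any job (e.g. a retry) clears the earlier failure.
        resolved[doc] = {"succeeded": sorted(ok), "failed": sorted(failed[doc] - ok)}

    # Totals are recalculated from the resolved state, not summed per job.
    totals = {
        "succeeded": sum(len(entry["succeeded"]) for entry in resolved.values()),
        "failed": sum(len(entry["failed"]) for entry in resolved.values()),
    }
    totals["total_pairs"] = totals["succeeded"] + totals["failed"]
    return resolved, totals
```

An original job that failed Urdu followed by a retry job that succeeded yields one entry for the document with Urdu listed under succeeded and a failed count of zero.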

16.4 Tag Inventory (v2.8)

Every notification includes a Tag Inventory section showing source vs output tag counts per document. This lets you detect at a glance if the model is adding or dropping tags. As of v2.8, the inventory also reports Mermaid diagram counts:

Tag Inventory (source → output):
• Trinity-Beast-Translation-Service.html
  IN:  code:22 pre:5 strong:8 em:2 a:4 br:3 diagrams:1
  OUT: code:22 pre:5 strong:8 em:2 a:4 br:3 diagrams:1

If the model has a bad day and adds a <span> that wasn't in the source, or drops code tags, you'll see the mismatch immediately:

  IN:  code:23 pre:5 strong:8 diagrams:2
  OUT: code:20 pre:4 strong:8 diagrams:1    ← 3 code dropped, 1 diagram lost

Tag counts are logged per-language in Aurora (translation_job_events) with tags_in and tags_out fields. The notification shows the first successful language's counts (source tags are identical across all languages since it's the same source document).
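A tag inventory of this kind can be computed with the standard library's HTML parser. The sketch below is illustrative: tag_inventory and inventory_mismatches are hypothetical names, the tracked tag list mirrors the example output above, and Mermaid diagram counting is left out because the text does not say how diagrams are detected.

```python
from collections import Counter
from html.parser import HTMLParser

TRACKED = ("code", "pre", "strong", "em", "a", "br")


class _TagCounter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <br/> also arrive here via handle_startendtag.
        if tag in TRACKED:
            self.counts[tag] += 1


def tag_inventory(html):
    """Count tracked tags in an HTML string."""
    parser = _TagCounter()
    parser.feed(html)
    parser.close()
    return {tag: parser.counts[tag] for tag in TRACKED if parser.counts[tag]}


def inventory_mismatches(source_html, output_html):
    """Return {tag: (count_in, count_out)} for every tag whose count changed."""
    tags_in, tags_out = tag_inventory(source_html), tag_inventory(output_html)
    return {
        tag: (tags_in.get(tag, 0), tags_out.get(tag, 0))
        for tag in TRACKED
        if tags_in.get(tag, 0) != tags_out.get(tag, 0)
    }
```

An empty result from inventory_mismatches corresponds to matching IN and OUT lines in the notification.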

Recipient: All translation notifications go to CoryDeanKalani@CPMP-Site.org via the unified AutoOps notification pipeline. The sender is CPMP Mission <No-Reply@CPMP-Site.org>.

17. Delta Translation (Incremental Updates)

Documents change frequently — a new endpoint, a revised architecture, an updated pricing table. Without delta translation, every edit requires re-translating the entire document across all 11 languages. Delta translation solves this by identifying exactly which sections changed and translating only those, reusing cached translations for everything else.

17.1 Concept

The delta translation system leverages two key properties of the document library:

S3 versioning: Every document upload creates a new version in S3. Previous versions are retained indefinitely, providing a complete edit history.
TBI-CHUNK markers: Human-placed section boundaries (<!-- TBI-CHUNK --> comments) in the English source that divide documents into logical, independently-translatable sections.

By comparing the current English document against the version that was last translated, the system identifies which sections changed (by content hash) and only sends those to Bedrock. Unchanged sections are pulled directly from the existing translated document. Typical savings: 70–90% on incremental updates.

17.2 S3 Versioning as Diff Source

The website bucket (trinity-beast-website-east2) has versioning enabled. Every aws s3 cp or s3api put-object creates a new version with a unique VersionId. The delta system uses this to:

List all versions of a document with timestamps and sizes
Fetch any previous version by VersionId
Compare current content against the version that was last successfully translated

No separate manifest storage is required — S3 already has the full history. A lightweight metadata file (docs/delta/{doc}.{lang}.json) tracks which VersionId was last translated for each document-language pair.

17.3 Comment Preservation (Sentinel Pass 0)

For delta translation to work, TBI-CHUNK markers must survive the translation round-trip. Previously, Bedrock silently dropped HTML comments during translation. The sentinel system now includes a Pass 0 that protects all HTML comments:

# Pass 0: Before Bedrock sees the chunk
<!-- TBI-CHUNK -->  →  __TBP0__    (sentinel token)
<!-- Section 5 -->  →  __TBP1__    (sentinel token)

# After translation: sentinels restored
__TBP0__  →  <!-- TBI-CHUNK -->
__TBP1__  →  <!-- Section 5 -->

This is implemented as the first pass in _apply_sentinels() in engine.py, before the existing translate="no" element extraction (Pass 1), paired span sentinels (Pass 2), and numeric protection (Pass 3). Comments are treated as Type A (FULL) sentinels — extracted completely and restored verbatim.
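The round-trip can be sketched in a few lines. This is a standalone illustration with hypothetical function names. In the engine, Pass 0 lives inside _apply_sentinels() and presumably shares its sentinel numbering with the later passes, which this sketch does not model.

```python
import re

COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)


def protect_comments(chunk):
    """Swap every HTML comment for a __TBPn__ sentinel before translation."""
    comments = []

    def replace(match):
        comments.append(match.group(0))
        return f"__TBP{len(comments) - 1}__"

    return COMMENT.sub(replace, chunk), comments


def restore_comments(translated, comments):
    """Put the original comments back, verbatim."""
    for index, comment in enumerate(comments):
        translated = translated.replace(f"__TBP{index}__", comment)
    return translated
```

Because the comment text is stored outside the chunk, the model only ever sees the sentinel tokens and cannot drop or rewrite the markers themselves.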

17.4 Hash-Based Section Matching

The algorithm is position-independent — sections are matched by content hash, not by index. This means markers can be added, removed, or repositioned between versions without breaking the delta logic.

Diagram 17.1: Delta Translation Flow

flowchart TD
    A[Fetch Current English from S3] --> B[Split by TBI-CHUNK markers]
    B --> C[Hash each section SHA-256]
    D[Fetch Previous English version] --> E[Split by TBI-CHUNK markers]
    E --> F[Hash each section]
    C --> G{Compare hashes}
    F --> G
    G -->|Match found| H[Pull from existing translation]
    G -->|No match| I[Send to Bedrock]
    H --> J[Reassemble with TBI-CHUNK markers]
    I --> J
    J --> K[Deploy to S3 + Save metadata]

    style A fill:#1e3a5f,stroke:#60a5fa,color:#e0e0e0
    style D fill:#1e3a5f,stroke:#60a5fa,color:#e0e0e0
    style H fill:#064e3b,stroke:#10b981,color:#e0e0e0
    style I fill:#7c2d12,stroke:#f97316,color:#e0e0e0
    style K fill:#1e3a5f,stroke:#60a5fa,color:#e0e0e0

Marker repositioning example:

Version 1: 5 sections (markers at A, B, C, D)
Version 2: 6 sections (new marker added — A, B, C, C2, D)
Sections before and after the new marker still match by hash → reused
The split section produces two new hashes → both translate fresh
Result: 4 of 6 sections reused (67% savings) despite marker change
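Position-independent matching can be sketched as a hash lookup. The names below are hypothetical, and two assumptions are built in: whitespace around each section is stripped before hashing, and the previous translation splits into the same number of sections, in the same order, as the previous English version.

```python
import hashlib

MARKER = "<!-- TBI-CHUNK -->"


def section_hash(section):
    return hashlib.sha256(section.strip().encode("utf-8")).hexdigest()


def plan_delta(current_en, previous_en, previous_translated):
    """Return ("reuse", translated_text) or ("translate", english_text) per section."""
    previous_sections = previous_en.split(MARKER)
    translated_sections = previous_translated.split(MARKER)
    if len(previous_sections) != len(translated_sections):
        # Sections cannot be paired safely, so nothing is reused.
        cache = {}
    else:
        cache = {
            section_hash(english): translated
            for english, translated in zip(previous_sections, translated_sections)
        }

    plan = []
    for section in current_en.split(MARKER):
        digest = section_hash(section)
        if digest in cache:
            plan.append(("reuse", cache[digest]))
        else:
            plan.append(("translate", section))
    return plan
```

Because the lookup is keyed by content hash rather than position, inserting or moving a marker only affects the sections whose text actually changed.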

17.5 CLI Commands

Four KCC commands support delta translation and chunk management:

Delta Diff (Analysis Only)

# List available S3 versions
bash scripts/kcc.sh delta-diff Trinity-Beast-API-Reference.html --list-versions

# Compare current vs previous version (auto-detects)
bash scripts/kcc.sh delta-diff Trinity-Beast-API-Reference.html

# Compare against a specific version
bash scripts/kcc.sh delta-diff Trinity-Beast-API-Reference.html --version-id ksYxUBZIUB8Roi2KQYje6ig9R7JesL9z

# Show delta for a specific language
bash scripts/kcc.sh delta-diff Trinity-Beast-API-Reference.html --lang ja

Delta Translate (Incremental Translation — Local CLI)

# Dry run — show what would change without calling Bedrock
bash scripts/kcc.sh delta-translate Trinity-Beast-API-Reference.html es --dry-run

# Translate only changed sections for one language
bash scripts/kcc.sh delta-translate Trinity-Beast-API-Reference.html es

# Translate changed sections for all languages
bash scripts/kcc.sh delta-translate Trinity-Beast-API-Reference.html all

# Force full translation (creates fresh baseline)
bash scripts/kcc.sh delta-translate Trinity-Beast-API-Reference.html all --force

Delta via Remote API (`options.delta`)

The delta option is also available on POST /admin/translate — the worker skips any language pair where the translated file on S3 is already newer than the source document. No local CLI needed.

# Submit a delta job via the remote API — skips up-to-date pairs automatically
curl -s -X POST -H "X-Admin-Key: $ADMIN_KEY" -H "Content-Type: application/json" \
  -d '{"docs":["Trinity-Beast-API-Reference.html"],"langs":"all","options":{"delta":true}}' \
  https://api.cpmp-site.org/admin/translate | jq .

Delta Validate (Marker Preservation Check)

# Validate TBI-CHUNK markers survived translation for all delta-enabled docs
bash scripts/kcc.sh delta-validate all all

# Validate a specific doc across all languages
bash scripts/kcc.sh delta-validate Trinity-Beast-API-Reference.html all

# Validate a specific doc + language pair
bash scripts/kcc.sh delta-validate Trinity-Beast-API-Reference.html es

Reports pass/fail per doc×lang pair. Exit code 0 if all pass, 1 if any markers were lost. Run after any translation job to confirm Sentinel Pass 0 is working correctly.

Chunk Sizer (Auto-Placement Suggestions)

# Analyze a doc from S3 and suggest TBI-CHUNK marker placement
bash scripts/kcc.sh chunk-size Trinity-Beast-API-Reference.html

# Analyze a local file
bash scripts/kcc.sh chunk-size /path/to/local/doc.html

Scans the document for <section>, <h2>, <h3>, and .category-section boundaries. Reports current chunk sizes (if markers exist), identifies policy violations, and suggests where to insert markers to stay within the 15 KB/18 KB/12 KB policy. Dense sections (high translate="no" density) automatically target the tighter 12 KB limit.

17.6 Bootstrap Path

Existing translated documents do not contain TBI-CHUNK markers (they were stripped before the sentinel fix). The bootstrap sequence is:

First run (full cost): Use --force to translate the entire document. The sentinel fix preserves markers in the output. Delta metadata is saved to S3.
Subsequent runs (delta savings): The tool detects the existing translation has markers, loads metadata to identify the previous English version, and only translates changed sections.

After the bootstrap run, typical savings on incremental updates:

Change Type	Typical Savings	Example
Single section edit	85–95%	Fix a typo, update one endpoint
New section added	70–85%	Add a new feature section
Marker repositioned	60–75%	Split a large section in two
Major rewrite	20–40%	Restructure half the document

Cost model: At approximately $1.50 per section-language pair, a 9-section document across 11 languages costs ~$148.50 for a full translation. With delta (2 sections changed), the same update costs ~$33 — a 78% reduction.
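The arithmetic behind these figures, as a short sketch. The $1.50 rate is the approximate per-pair cost quoted above, and translation_cost is a hypothetical helper name.

```python
def translation_cost(sections, languages, cost_per_pair=1.50):
    """Approximate cost in dollars for translating `sections` into `languages`."""
    return sections * languages * cost_per_pair


full = translation_cost(9, 11)   # full translation: 9 sections x 11 languages
delta = translation_cost(2, 11)  # delta run: only the 2 changed sections
reduction = 1 - delta / full     # fraction of the full cost avoided
```

The reduction depends only on the share of sections that changed (2 of 9 here), not on the per-pair rate or the number of languages.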

Quick Reference

Item	Value
Model	`us.anthropic.claude-sonnet-4-6` (cross-region inference profile)
Failover Regions	`us-east-2` → `us-east-1` → `us-west-2`
Target Languages	11: es, pt, fr, de, ru, hi, ja, zh, ar, ur, it
Worker Runtime	Python 3.11 (ECS Fargate task, container image)
Deploy/Finalize Runtime	Go (`provided.al2023`)
Worker Resources	1 vCPU / 3 GB (Fargate — no timeout ceiling)
Memory (Lambdas)	1770 MB
Worker Timeout	None (runs to completion)
Finalize Timeout	180s
Deploy Timeout	60s
Max Docs per Request	6
Max Active Jobs	3
Daily Dollar Cap	$600 (24h TTL auto-reset)
Daily Token Cap	50M combined tokens (24h TTL auto-reset)
Chunk Size (Latin scripts)	6000 chars
Chunk Size (CJK + Russian)	4500 chars (ja, zh, ru)
Chunk Size (Indic + Arabic)	3000 chars (hi, ur, ar)
Retries per Chunk	3
Max Part Size	24 KB (safety valve — prevents model output truncation)
MaxConcurrency (per-language)	0 (unlimited — all language containers launch simultaneously)
ECR Repository	`tbi-translate-worker`
SQS Queue	`trinity-beast-translation-queue`
Step Function	`tbi-translation-orchestrator`
IAM Role (Worker + Lambdas)	`tbi-translate-role`
IAM Role (Pipe)	`tbi-translate-pipe-role`
IAM Role (Step Function)	`tbi-translate-orchestrator-role`
Valkey Keys	`tx:job:{id}`, `tx:active`, `tx:history`, `tx:idempotency:{key}`, `autoops:bedrock:spend:daily`, `autoops:bedrock:tokens:input:daily`, `autoops:bedrock:tokens:output:daily`
Aurora Tables	`translation_jobs`, `translation_job_events`
Delta Metadata	`docs/delta/{doc}.{lang}.json` (S3)
Delta CLI	`bash scripts/kcc.sh delta-diff`, `bash scripts/kcc.sh delta-translate`, `bash scripts/kcc.sh delta-validate`, `bash scripts/kcc.sh chunk-size`
CloudWatch Namespace	`TBI/Translation`