The Trinity Beast – AutoOps Translation Engine

Custom Bedrock-powered document translation service — sentinel preprocessing, validator system, Step Function orchestration

Region: us-east-2 (Ohio) | Version: v3.0 | May 21, 2026

1. Why a Custom Translation Engine

The Trinity Beast Infrastructure maintains 40 technical documents translated into 11 languages, over 440 translated files in total. The original approach used AWS Translate batch jobs. It worked for simple prose but failed catastrophically on technical documentation.

1.1 Where AWS Translate Fails

AWS Translate is a neural machine translation service optimized for general-purpose text. Technical documentation with embedded code, diagrams, and brand terminology exposes its fundamental limitations:

Failure Mode | Example | Impact
Translates code blocks | function getName() → función obtenerNombre() | Code no longer executes
Translates variable names | api_key → clave_api | Documentation references break
Breaks Mermaid diagrams | Translates node labels inside mermaid blocks | Diagrams fail to render
Corrupts HTML structure | Merges adjacent elements, drops attributes | Styling and layout break
Transliterates brand names | AutoOps → آٹو آپس (Urdu phonetic) | Brand identity lost, search breaks
Localizes numeric units | 32 GB → 32 Go (French) | Technical specs become ambiguous
Drops version numbers | PostgreSQL 17.7 → PostgreSQL | Version-specific guidance lost
Ignores translate attribute | Translates content inside protected zones | Defeats the HTML5 standard mechanism

1.2 The Scale Problem

With 40 documents × 11 languages, every documentation update triggers a translation cascade. Before the custom engine:

1.3 The Solution

A custom Bedrock-powered translation engine that understands the boundary between human language and machine language. The engine uses defense-in-depth across the full pipeline:

Result: A single POST /admin/translate call translates any document from any supported source language into up to 11 target languages, deploys to S3, invalidates CloudFront, rebuilds the search index, and emails a summary. Source language is auto-detected when not specified — no pivot through English required.

2. Architecture

2.1 Pipeline Flow

The translation service is an event-driven pipeline that decouples submission from execution. The operator submits a job; the system handles everything else asynchronously.

Diagram 2.1: End-to-End Pipeline Architecture

flowchart TB
    subgraph Operator
        A[POST /admin/translate]
    end
    subgraph "LPO Server (Go)"
        B[Validate & Enqueue]
        C[Valkey State]
        D[Aurora Record]
    end
    subgraph "AWS Pipeline"
        E[SQS Queue]
        F[EventBridge Pipe]
        G[Step Function]
    end
    subgraph "Translation Intelligence (Python)"
        direction LR
        subgraph "Pre-Processing"
            H0[Source Validation]
            H1[Language Detection]
            H2[Complexity Analysis]
            H3[Document Preprocessor]
        end
        subgraph "Translation Core"
            H4[Sentinel System - 4 Types]
            H5[Bedrock — 3-Region Failover]
            H6[Validator — Hard + Soft Tiers]
            H7[Integrity Check + Auto-Repair]
        end
    end
    subgraph "Deployment (Go)"
        direction LR
        I[S3 Write]
        J[CloudFront Invalidation]
        K[Search Index Rebuild]
        L[SES Notification]
    end

    A --> B
    B --> C
    B --> D
    B --> E
    E --> F
    F --> G
    G --> H0
    H0 --> H1
    H1 --> H2
    H2 --> H3
    H3 --> H4
    H4 --> H5
    H5 --> H6
    H6 --> H7
    H7 --> I
    I --> J
    J --> K
    K --> L

    style A fill:#1e3a5f,stroke:#60a5fa,color:#e2e8f0
    style B fill:#1e293b,stroke:#334155,color:#e2e8f0
    style C fill:#064e3b,stroke:#10b981,color:#e2e8f0
    style D fill:#064e3b,stroke:#10b981,color:#e2e8f0
    style E fill:#2e1065,stroke:#a78bfa,color:#e2e8f0
    style F fill:#2e1065,stroke:#a78bfa,color:#e2e8f0
    style G fill:#2e1065,stroke:#a78bfa,color:#e2e8f0
    style H0 fill:#92400e,stroke:#fbbf24,color:#e2e8f0
    style H1 fill:#92400e,stroke:#fbbf24,color:#e2e8f0
    style H2 fill:#92400e,stroke:#fbbf24,color:#e2e8f0
    style H3 fill:#92400e,stroke:#fbbf24,color:#e2e8f0
    style H4 fill:#92400e,stroke:#fbbf24,color:#e2e8f0
    style H5 fill:#92400e,stroke:#fbbf24,color:#e2e8f0
    style H6 fill:#92400e,stroke:#fbbf24,color:#e2e8f0
    style H7 fill:#92400e,stroke:#fbbf24,color:#e2e8f0
    style I fill:#064e3b,stroke:#10b981,color:#e2e8f0
    style J fill:#064e3b,stroke:#10b981,color:#e2e8f0
    style K fill:#064e3b,stroke:#10b981,color:#e2e8f0
    style L fill:#064e3b,stroke:#10b981,color:#e2e8f0
        

2.2 Components

Component | Type | Runtime | Purpose
POST /admin/translate (+ 8 more) | Admin API | Go | Job submission, monitoring, control
trinity-beast-translation-queue | SQS | - | Decouple submission from execution
tbi-translate-pipe | EventBridge Pipe | - | SQS → Step Function trigger (no glue Lambda)
tbi-translation-orchestrator | Step Functions | - | Fan-out, retry, deploy, finalize orchestration
tbi-translate-worker | ECS Fargate Task | Python 3.11 | Bedrock translation + sentinel + validation (no timeout ceiling)
tbi-translate-init | Lambda | Go | Records execution ARN, transitions queued → running
tbi-translate-deploy | Lambda | Go | CloudFront invalidation per document
tbi-translate-finalize | Lambda | Go | Search rebuild + SES notification + state transition
translation_jobs | Aurora table | - | Permanent job records (28 columns)
translation_job_events | Aurora table | - | Granular per-doc/lang audit log

2.3 Why Python (The Only Python in the Fleet)

Every other compute workload in The Trinity Beast Infrastructure is written in Go. The translation worker is the sole exception, and for good reason:

Convention note: All Lambda functions use 1770 MB memory (multiple of 3). The worker runs as an ECS Fargate task (1 vCPU / 3 GB) with no timeout ceiling — large documents translate to completion regardless of processing time. Deploy and finalize Lambdas use 60s and 180s timeouts respectively.

3. Sentinel Preprocessing System

The sentinel system is the core innovation that makes reliable technical document translation possible. It operates on a simple principle: the model cannot corrupt what it never sees.

Before any chunk is sent to Bedrock, protected content is replaced with placeholder tokens. The model translates the prose around the placeholders. After translation, the placeholders are swapped back to the original content. Validation then confirms everything survived intact.

3.1 Four Sentinel Types

Type A — Full Element Extraction (__TBP{N}__)

Replaces entire translate="no" elements with a single token. The model sees only the placeholder and places it in the natural position for the target language's word order.

Before | After Sentinel Pass
<span translate="no">CloudFront</span> invalidation | __TBP0__ invalidation
<code translate="no">api_key</code> parameter | __TBP1__ parameter

Handles arbitrary nesting depth — processes innermost elements first, then sweeps outward until stable.
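The innermost-first sweep can be sketched as follows. This is a minimal illustration, not the worker's actual code: the function name `extract_protected` is hypothetical, and the regex assumes a protected element contains only text or other protected elements.

```python
import re

# Innermost protected element: a translate="no" tag whose body holds no other tags.
# Assumption: protected elements contain only text or other protected elements.
_PROTECTED = re.compile(r'<(\w+)\b[^>]*\btranslate="no"[^>]*>[^<]*</\1>')


def extract_protected(html):
    """Replace translate="no" elements with __TBP{N}__ tokens, innermost first."""
    saved = []

    def stash(match):
        saved.append(match.group(0))
        return f"__TBP{len(saved) - 1}__"

    while True:
        # Each sweep removes the current innermost layer; outer elements
        # become matchable once their children have turned into tokens.
        html, count = _PROTECTED.subn(stash, html)
        if count == 0:  # stable: nothing left to extract
            return html, saved
```

In this sketch an outer element is stored with its inner tokens still in place, so the outer token always carries the higher index.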

Type B — Paired Open/Close (__TBO{N}__ / __TBC{N}__)

For plain <span> wrappers containing translatable text (badges, titles, method labels). The wrapper tags become sentinels; the text between them is translated normally.

Before | After Sentinel Pass
<span class="badge">UDP Port 2679</span> | __TBO0__UDP Port 2679__TBC0__

The model translates "UDP Port 2679" while the <span class="badge"> wrapper survives intact.

Type C — Numeric Protection (__TBN{N}__)

Protects bare numbers in prose from the model's tendency to drop, paraphrase, or localize them. Matches integers, decimals, percentages, and number+unit pairs.

Before | After Sentinel Pass | Problem Prevented
uses 1770 MB of memory | uses __TBN0__ of memory | French translating "MB" → "Mo"
achieves 98.5% uptime | achieves __TBN1__ uptime | Japanese dropping the decimal
62% cache hit rate | __TBN2__ cache hit rate | German paraphrasing to words
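A numeric pass along these lines might look like the sketch below. The pattern, the unit list, and the name `protect_numbers` are assumptions for illustration; the sketch also assumes it runs on prose text rather than on tag attributes.

```python
import re

# Integer or decimal, optionally followed by "%" or by a unit.
# The unit list here is illustrative, not the engine's real list.
_NUMBER = re.compile(r'\b\d+(?:\.\d+)?(?:%|\s?(?:KB|MB|GB|TB|ms)\b)?')


def protect_numbers(text):
    """Swap numbers (and number+unit pairs) for __TBN{N}__ tokens."""
    saved = []

    def stash(match):
        saved.append(match.group(0))
        return f"__TBN{len(saved) - 1}__"

    return _NUMBER.sub(stash, text), saved
```

Digits inside earlier sentinels such as `__TBP0__` are not matched, because the surrounding letters and underscores are word characters and `\b` finds no boundary there.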

Type D — Brand Term Protection (__TBT{N}__)

Protects brand terms, product names, and proper nouns that must never be translated or transliterated. Unlike Type A (which requires translate="no" in the source HTML), Type D operates from a centralized configuration list — no source markup needed.

Before | After Sentinel Pass | Problem Prevented
powered by The Trinity Beast | powered by __TBT0__ | Hindi transliterating to ट्रिनिटी बीस्ट
deployed on CloudFront | deployed on __TBT1__ | Arabic transliterating to كلاود فرونت
Cory Dean Kalani | __TBT2__ | Urdu transliterating person names

Protected terms are defined in translation-config.json (57 terms). The sentinel pass matches terms using word-boundary regex for short terms (≤5 chars) and substring matching for longer terms. Restoration is exact — the original term text is re-injected at the sentinel position.
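The matching rule (word boundaries for short terms, substring matching for longer ones) could be sketched like this. The function name `protect_terms` and the longest-first ordering are assumptions; the term list would come from translation-config.json, whose schema is not shown here.

```python
import re


def protect_terms(text, terms):
    """Swap configured brand terms for __TBT{N}__ tokens."""
    saved = []

    def stash(match):
        saved.append(match.group(0))
        return f"__TBT{len(saved) - 1}__"

    # Longest first (an assumption) so a long term wins over a shorter one it contains.
    for term in sorted(terms, key=len, reverse=True):
        pattern = re.escape(term)
        if len(term) <= 5:
            # Short terms need word boundaries to avoid substring false positives.
            pattern = rf"\b{pattern}\b"
        text = re.sub(pattern, stash, text)
    return text, saved
```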

Sentinel Recovery Pass (Post-Restoration)

For complex-script target languages (Hindi, Urdu, Arabic), the model occasionally drops Type D sentinel tokens entirely from its output: the token simply doesn't appear in the translated text. The recovery pass runs after normal restoration and before validation:

  1. Iterates all TERM entries in the sentinels list
  2. Checks if the term is present in the source but missing from the restored output
  3. Re-injects the original term text at an approximate position (ratio-based paragraph matching)
  4. Falls back to insertion before the last closing tag if position cannot be determined

This eliminates the class of failures where the model acknowledges the sentinel in its "thinking" but omits it from the output — a behavior observed primarily in Indic scripts with token-dense chunks.
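A simplified version of the recovery pass is sketched below. It implements only the fallback from step 4 (insert before the last closing tag) and omits the ratio-based paragraph matching from step 3; the function name `recover_terms` is hypothetical.

```python
def recover_terms(source, restored, terms):
    """Re-inject protected terms that are in the source but missing from the output."""
    for term in terms:
        if term in source and term not in restored:
            cut = restored.rfind("</")  # start of the last closing tag
            if cut == -1:
                restored = restored + " " + term
            else:
                restored = restored[:cut] + " " + term + restored[cut:]
    return restored
```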

3.2 Processing Flow

Diagram 3.1: Sentinel Preprocessing Flow

flowchart TD
    A[Source HTML Chunk] --> B[Pass 1: Extract translate=no elements]
    B --> C[Pass 2: Wrap plain span text in paired sentinels]
    C --> D[Pass 3: Replace bare numbers with numeric sentinels]
    D --> D2[Pass 4: Replace brand terms with TERM sentinels]
    D2 --> E[Send to Bedrock with sentinel-aware prompt]
    E --> F[Receive translated chunk with sentinels intact]
    F --> G[Deduplicate any model-doubled paired sentinels]
    G --> H[Restore sentinels high-to-low index order]
    H --> H2[Recovery pass: re-inject any dropped TERM sentinels]
    H2 --> I[Run validators against source + restored output]
    I -->|PASS| J[Accept chunk]
    I -->|FAIL| K{Retries remaining?}
    K -->|Yes| L[Retry with strict prompt + temperature jitter]
    L --> E
    K -->|No| M[Raise TranslationError]

    style A fill:#1e3a5f,stroke:#60a5fa,color:#e2e8f0
    style E fill:#92400e,stroke:#fbbf24,color:#e2e8f0
    style J fill:#064e3b,stroke:#10b981,color:#e2e8f0
    style M fill:#450a0a,stroke:#ef4444,color:#e2e8f0
        

The four passes execute in strict order — later passes operate on the output of earlier ones. This means Type C (numeric) sentinels can protect numbers that appear inside Type B (paired) text, and Type D (brand term) sentinels protect terms that appear anywhere in the translatable content, providing defense-in-depth.

3.3 Restoration and Deduplication

After translation, sentinels are restored in reverse index order (high → low) to prevent prefix collisions (__TBP1__ must not match inside __TBP10__).

A deduplication pass runs before restoration to handle a known model behavior: occasionally the model emits a paired sentinel twice consecutively (a bilingual output instinct). The deduplicator collapses __TBO0__text__TBC0__ __TBO0__text__TBC0__ into a single occurrence.
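Both steps can be sketched in a few lines. The names `dedupe_pairs` and `restore` are illustrative, and the sketch restores only single-token sentinel families such as `__TBP{N}__`.

```python
import re

# A paired sentinel span immediately repeated, optionally separated by whitespace.
_DOUBLED = re.compile(r'(__TBO(\d+)__.*?__TBC\2__)\s*\1', re.DOTALL)


def dedupe_pairs(text):
    """Collapse a model-doubled paired sentinel into a single occurrence."""
    return _DOUBLED.sub(r'\1', text)


def restore(text, saved, prefix="TBP"):
    """Swap tokens back to their original content, highest index first."""
    for index in range(len(saved) - 1, -1, -1):
        text = text.replace(f"__{prefix}{index}__", saved[index])
    return text
```

With tokens stored innermost-first, the high-to-low order also expands an outer placeholder before the inner placeholders it contains, so nested content is restored in one loop.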

4. Validator System

Every translated chunk is validated against the source before acceptance. Validators enforce structural integrity and content preservation — if a translation passes all validators, it is guaranteed to be functionally correct (code works, links resolve, diagrams render).

4.1 Validation Checks

Validator | Type | What It Checks | Failure Example
check_protected_terms | Hard | Every protected term in source appears in output | "CloudFront" missing from Japanese output
check_version_numbers | Hard | All version numbers (X.Y.Z) survive translation | "17.7" dropped from PostgreSQL reference
check_preserve_patterns | Hard | URLs, emails, IPs, ARNs, resource IDs, cron expressions, memory sizes | ARN truncated or IP address reformatted
check_tag_counts | Hard | HTML tag counts match for structural tags | Extra <span> added or <code> dropped
check_translate_no_zones | Hard | Content inside translate="no" zones unchanged | Protected code block content altered

Protected term matching: Short uppercase acronyms (≤4 chars like SQS, ECR, S3) use word-boundary matching to avoid false positives where the acronym appears as a substring (e.g., "ECR" inside "SECRET"). Longer terms use plain substring matching.

Implementation (v2.5): The check_tag_counts and check_translate_no_zones validators use character scanning with exact boundary matching — no regex. We control these tags. We know that a tag starts with < and ends with >. The scanner finds complete opening tags by looking for <tagname followed by a boundary character (>, space, tab, newline, or /), then reads to the closing >. This eliminates false positives from partial regex matches and is immune to edge cases where tag names appear as text content (e.g., documenting translate="no" as literal text inside a code tag).
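A character-scanning counter in the spirit of this description might look like the following sketch. The function name `count_open_tags` is hypothetical; only the boundary rule (`<tagname` followed by `>`, space, tab, newline, or `/`) is taken from the text above.

```python
_BOUNDARY = (">", " ", "\t", "\n", "/")


def count_open_tags(html, tag):
    """Count complete opening tags for `tag` without using regex."""
    needle = "<" + tag
    count = 0
    pos = html.find(needle)
    while pos != -1:
        end = pos + len(needle)
        if end < len(html) and html[end] in _BOUNDARY:
            close = html.find(">", end)
            if close == -1:
                break  # unterminated tag: stop scanning
            count += 1
            pos = html.find(needle, close + 1)
        else:
            # "<pre" must not count as "<p": the next character is not a boundary.
            pos = html.find(needle, end)
    return count
```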

4.2 Retry Strategy

When validation fails, the engine retries with two progressive adjustments:

  1. Strict prompt activation — adds an explicit warning: "PREVIOUS ATTEMPT FAILED VALIDATION. Be more careful: every protected term and every version number from the input MUST appear unchanged in the output."
  2. Temperature jitter — increments temperature by 0.1 per retry (0.0 → 0.1 → 0.2 → 0.3, capped at 0.5). A deterministic temp=0 retry produces the same erroneous output; temperature jitter lets the model take a different sampling path.

Maximum retries: 3 (configurable). If all attempts fail, a TranslationError is raised with the chunk index, validator detail, and a preview of the problematic chunk.
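The retry loop can be sketched as below. `translate` and `validate` are caller-supplied stand-ins for the Bedrock call and the validator suite; their signatures here are assumptions, not the worker's real interfaces.

```python
class TranslationError(Exception):
    """Raised when a chunk fails validation on every attempt."""


def retry_temperature(attempt, step=0.1, cap=0.5):
    """Attempt 0 runs at 0.0; each retry adds `step`, never exceeding `cap`."""
    return min(round(attempt * step, 1), cap)


def translate_with_retries(chunk, translate, validate, max_retries=3):
    """Run the first attempt plus up to `max_retries` stricter, warmer retries."""
    detail = ""
    for attempt in range(max_retries + 1):
        output = translate(
            chunk,
            temperature=retry_temperature(attempt),
            strict=attempt > 0,  # strict prompt only after a failed attempt
        )
        ok, detail = validate(chunk, output)
        if ok:
            return output
    raise TranslationError(f"validation failed after {max_retries} retries: {detail}")
```

With the default of 3 retries, the attempts run at temperatures 0.0, 0.1, 0.2, and 0.3, so the 0.5 cap only matters if the retry limit is raised.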

4.3 Hard vs Soft Failures

Validators are classified into two tiers based on what they protect:

Tier | Tags | Behavior | Rationale
Hard (content-critical) | <code>, <pre>, <a> | Retry → reject on failure | Missing code blocks, broken links, or lost pre-formatted content means the translation is functionally broken
Soft (decorative/structural) | <span>, <strong>, <em>, <br> | Log warning, pass through | Missing styling wrappers don't break functionality; the post-translation integrity check repairs them

This tiered approach eliminates the failure mode where a correctly-translated document is rejected because the model dropped a single decorative <span> wrapper during RTL reordering. The content is correct — only the styling wrapper is missing — and the integrity check restores it automatically.
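One way to express the two tiers in code is sketched below. The tag sets come from the table above; the function name `classify_tag_drift` and the choice to ignore tags outside both tiers are assumptions.

```python
HARD_TAGS = {"code", "pre", "a"}
SOFT_TAGS = {"span", "strong", "em", "br"}


def classify_tag_drift(source_counts, output_counts):
    """Split tag-count mismatches into hard failures and soft warnings."""
    hard, soft = [], []
    for tag in sorted(set(source_counts) | set(output_counts)):
        src = source_counts.get(tag, 0)
        out = output_counts.get(tag, 0)
        if src == out:
            continue
        message = f"<{tag}>: {src} -> {out}"
        if tag in HARD_TAGS:
            hard.append(message)  # triggers a retry, then rejection
        elif tag in SOFT_TAGS:
            soft.append(message)  # logged; left to the integrity check
        # Tags in neither tier are ignored in this sketch (an assumption).
    return hard, soft
```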

The ValidationReport aggregates all results and exposes:

4.4 Post-Translation Integrity Check

After translation completes and chunks are reassembled, a full-document integrity check runs before the S3 write. This is the defense-in-depth layer — it repairs structural drift that the per-chunk validator intentionally allows through (soft failures).

Repair Capabilities

Issue | Detection | Repair Action
</br> injection | String scan for invalid closing br tags | Strip all occurrences (never valid HTML)
<br> inside Mermaid blocks | Regex scan within <pre class="mermaid"> | Remove (breaks Mermaid syntax)
Mermaid content corruption | Byte-for-byte comparison with source | Flag as warning (cannot auto-repair content changes)
Missing translate="no" span wrappers | Compare source protected elements to output | Re-wrap bare content with original element tags
Missing <strong>/<em> wrappers | Same pattern as span recovery | Re-wrap bare content

The integrity check only repairs translate="no" elements (where content is byte-for-byte identical between source and output). For translated content that lost its wrapper, the check logs the discrepancy but cannot reliably re-wrap (the content has been translated — matching it to the source wrapper requires semantic understanding).

Design principle: If the translated content is present and correct but the HTML structure is degraded, repair it. Only flag as unrecoverable if content is actually missing or corrupted. The customer sees a clean translation — the repairs happen invisibly.

4.5 Source Document Validation (v2.8)

Before any translation work begins, the source document passes through a validation gate. This catches defects that would cause translation failures or produce broken output — rejecting early saves Bedrock tokens and prevents corrupted translations from reaching S3.

Defect Categories

Category | What It Catches | Auto-Repairable?
STRUCTURAL | Unclosed tags, malformed HTML, nesting violations | Yes (up to 5 unclosed tags)
MERMAID | Empty diagram blocks, missing type declaration, mismatched brackets | No: reject with location
ENCODING | BOM markers, null bytes, mixed encodings | Yes (strip BOM/nulls)
SIZE | Document exceeds 500 KB, excessive nesting depth (>30 levels) | No: reject with size info
CONFLICT | translate="no" on root element (nothing to translate) | No: reject immediately

Validation Flow

  1. Size check — reject if > 500 KB (chunking becomes unreliable at this size)
  2. Encoding check — detect and strip BOM markers, null bytes; flag mixed encodings
  3. Structural HTML check — scan for unclosed tags; auto-repair up to 5 by appending closing tags at the correct nesting level
  4. Mermaid syntax check — validate every <pre class="mermaid"> block has a valid diagram type, balanced brackets, and non-empty content
  5. Conflict check — reject if the root <body> or <html> element has translate="no"
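Three of the five gates (size, encoding, conflict) are simple enough to sketch directly. The structural and Mermaid checks are omitted. The name `precheck`, the tuple-based defect format, and the reading of "500 KB" as 500 × 1024 bytes are assumptions.

```python
import re

MAX_BYTES = 500 * 1024  # assumption: "500 KB" means 500 * 1024 bytes
_ROOT_NO_TRANSLATE = re.compile(
    r'<(?:html|body)\b[^>]*\btranslate="no"', re.IGNORECASE
)


def precheck(raw):
    """Return (cleaned_text_or_None, defects) for a raw source document."""
    if len(raw) > MAX_BYTES:
        return None, [("SIZE", f"Document is {len(raw) // 1024} KB (limit: 500 KB)")]
    # Repairable encoding defects are fixed silently.
    if raw.startswith(b"\xef\xbb\xbf"):
        raw = raw[3:]
    raw = raw.replace(b"\x00", b"")
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return None, [("ENCODING", "Document is not valid UTF-8")]
    if _ROOT_NO_TRANSLATE.search(text):
        return None, [("CONFLICT", 'Root element has translate="no"')]
    return text, []
```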

Rejection vs Repair

The validator follows a strict philosophy: try to fix it silently, reject early if you can't. Repairable issues (unclosed tags, BOM markers) are fixed in-place — the customer never knows. Unrecoverable issues produce an actionable defect report with the exact location, what's wrong, and how to fix it.

ValidationResult:
  valid: false
  rejection_reason: "2 unrecoverable defects found"
  defects:
    - severity: error
      category: MERMAID
      location: "Section 5, line 342"
      description: "Empty Mermaid block — no diagram content"
      suggestion: "Add diagram content or remove the empty <pre class='mermaid'> block"
    - severity: error
      category: SIZE
      location: "Document root"
      description: "Document is 612 KB (limit: 500 KB)"
      suggestion: "Split into multiple documents or remove large embedded assets"

Cost savings: A rejected document costs zero Bedrock tokens. Without source validation, a broken document would fail during translation (after burning tokens on partial chunks), produce a corrupted output, and require manual investigation. Source validation catches these cases in <10ms with zero API calls.

4.6 Diagram Integrity (v2.8)

Mermaid diagrams are code — they must survive translation byte-for-byte. The integrity check (section 4.4) now includes dedicated diagram verification with automatic recovery.

Detection

The integrity check counts Mermaid blocks in the source (<pre class="mermaid">) and compares against the translated output. If any diagrams are missing from the output, the auto-stitch mechanism activates.

Auto-Stitch Recovery

When a diagram is missing from the translated output:

  1. Identify which source diagram is absent (by content matching)
  2. Extract the full <div class="diagram-wrap"> block from source (includes label + pre)
  3. Locate the correct insertion point in the output (same section, same relative position)
  4. Inject the source diagram block verbatim — diagrams don't need translation

The stitched diagram is the English version, which is functionally correct — Mermaid syntax is language-independent. The surrounding prose is already translated, so the reader gets translated explanations with a working diagram.
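The detection half of auto-stitch (finding which source diagrams are absent from the output) could be sketched as follows. The function name `missing_diagrams` is hypothetical, and the sketch matches on the exact `<pre class="mermaid">` opening tag; locating the insertion point is not covered.

```python
import re

_MERMAID = re.compile(r'<pre class="mermaid">(.*?)</pre>', re.DOTALL)


def missing_diagrams(source_html, output_html):
    """Return the source diagram bodies that no longer appear in the output."""
    present = set(_MERMAID.findall(output_html))
    return [body for body in _MERMAID.findall(source_html) if body not in present]
```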

Tag Inventory Integration

The _count_tags function now reports diagram count alongside other structural tags:

Tag Inventory (source → output):
• Trinity-Beast-Performance-Report.html
  IN:  code:75 pre:8 strong:12 em:3 a:6 br:20 diagrams:4
  OUT: code:75 pre:8 strong:12 em:3 a:6 br:20 diagrams:4

If a diagram is lost during translation and auto-stitched back, the final count still matches — the stitch happens before the tag inventory is calculated. A mismatch in the diagrams count after stitching indicates a structural issue that needs manual review.

Result: The Performance Report (75 KB, 4 Mermaid diagrams, 18 sections) translates to French with all 4 diagrams intact — 3 survived translation naturally, 1 was auto-stitched from source. The reader sees no difference.

5. Step Function Orchestration

The tbi-translation-orchestrator Step Function coordinates the entire translation pipeline. As of v3.0, it uses a language-persistent container pattern — one container per language, each processing all documents sequentially.

5.1 Language-Persistent Container Pattern (v3.0)

Diagram 5.1: Step Function State Machine (v3.0)

flowchart TD
    A[UnwrapInput] --> AB[InitJob - tbi-translate-init]
    AB --> B[PerLang Map - Parallel, Unlimited]
    B --> C[tbi-translate-worker container]
    C --> C2[Process ALL docs sequentially]
    C2 -->|All docs done| D[Lang Container Exits]
    D -->|Success| E[Lang Succeeded]
    D -->|Failure| F[RecordLangFailure]
    E --> G{All langs done?}
    F --> G
    G --> H[tbi-translate-deploy - Batch Mode]
    H --> J[tbi-translate-finalize]
    J --> K[Job Complete]

    style A fill:#1e293b,stroke:#334155,color:#e2e8f0
    style AB fill:#1e3a5f,stroke:#60a5fa,color:#e2e8f0
    style B fill:#2e1065,stroke:#a78bfa,color:#e2e8f0
    style C fill:#92400e,stroke:#fbbf24,color:#e2e8f0
    style C2 fill:#92400e,stroke:#fbbf24,color:#e2e8f0
    style H fill:#1e3a5f,stroke:#60a5fa,color:#e2e8f0
    style J fill:#1e3a5f,stroke:#60a5fa,color:#e2e8f0
    style K fill:#064e3b,stroke:#10b981,color:#e2e8f0
        

v3.0 Architecture: Container Count = Language Count

The v3.0 architecture inverts the execution model. Instead of launching N×M containers (one per doc-language pair), it launches M containers (one per language). Each container receives the full list of documents as a JSON array and processes them sequentially before exiting.

Job | Old (v2.x) Containers | New (v3.0) Containers | Reduction
3 docs × 11 langs | 33 | 11 | 67%
30 docs × 11 langs (full library) | 330 | 11 | 97%
1 doc × 11 langs | 11 | 11 | 0% (unchanged)
6 docs × 3 langs | 18 | 3 | 83%
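The figures in the table follow from simple arithmetic: the old model launched docs × langs containers, the new one launches one per language. A short sketch (the function name is illustrative) reproduces them:

```python
def container_counts(docs, langs):
    """Return (old_containers, new_containers, reduction_percent)."""
    old = docs * langs  # v2.x: one container per (doc, lang) pair
    new = langs         # v3.0: one container per language
    reduction = round(100 * (old - new) / old)
    return old, new, reduction
```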

Why language-persistent?

Document Array Passing

The Step Function uses States.JsonToString (an intrinsic function) to serialize the docs array into a string environment variable for the ECS container. The worker's task_runner.py deserializes it on startup and iterates through each document.
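On the worker side, the deserialization step might look like the sketch below. The environment variable name `DOCS_JSON` and the function name `load_docs` are hypothetical; the actual names used by task_runner.py are not given in this document.

```python
import json
import os


def load_docs(env=os.environ, var="DOCS_JSON"):
    """Parse the JSON-encoded document list passed in by the Step Function."""
    docs = json.loads(env[var])
    if not isinstance(docs, list) or not all(isinstance(d, str) for d in docs):
        raise ValueError(f"{var} must be a JSON array of document names")
    return docs
```

Validating the shape up front means a malformed payload fails the container immediately rather than partway through the document loop.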

UnwrapInput State

EventBridge Pipes always wrap SQS records in an array, even with batch size 1. The UnwrapInput Pass state uses InputPath: "$[0]" to extract the single job envelope from the array wrapper.

InitJob State

This state replaces the original Pass state with the tbi-translate-init Lambda. It records the Step Function execution ARN ($$.Execution.Id) back to Valkey via POST /admin/translate/update/{job_id} and transitions the job state from queued → running.

5.2 Error Handling and Recovery

Failure Mode | Handling | Job State
Single language fails after 3 retries | Catch → RecordLangFailure pass state, continue other langs | partial
All languages for a doc fail | Deploy Lambda receives empty succeeded list, skips invalidation | partial
Worker timeout (no response) | ECS task runs to completion with no timeout ceiling; Step Function waits via ecs:runTask.sync | running
Step Function execution exception | Finalize still runs via catch-all; job marked failed | failed
Operator cancels mid-flight | StopExecution API call; job marked cancelled | cancelled
Step Function fails before Finalize | Self-healing sweeper detects orphaned job via execution ARN, marks as failed | failed

Per-lang independence: Failure of one (doc, lang) pair never aborts work on the other 10 languages. This is enforced by the Step Function's Catch on the inner Map iterator — errors are captured as data, not propagated as exceptions.

5.3 EventBridge Pipe Integration

The tbi-translate-pipe connects SQS to the Step Function without a glue Lambda:

This is the AWS-native pattern for SQS-to-Step-Function integration — no code, no cold start, built-in error handling.

5.4 Self-Healing Sweeper

The sweeper runs automatically on every GET /admin/translate/health call (piggybacked) and is also available as a dedicated POST /admin/translate/sweep endpoint.

It scans all jobs in tx:active (the Valkey SET of active job IDs). For each job older than 15 minutes in queued or running state:

All sweep actions are logged to translation_job_events for audit trail.

Result: This eliminates the stuck queue problem permanently — no manual cleanup needed. Jobs that silently fail are automatically detected and marked, keeping the active set accurate and the queue healthy.
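The selection rule for sweep candidates (jobs in queued or running state that are older than 15 minutes) can be sketched as a pure function. Measuring age from `submitted_at` and the dictionary shape of a job are assumptions; what the sweeper then does with each candidate is not shown.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(minutes=15)
ACTIVE_STATES = {"queued", "running"}


def stale_jobs(jobs, now):
    """Return IDs of active jobs old enough to be checked by the sweeper."""
    return [
        job["job_id"]
        for job in jobs
        if job["state"] in ACTIVE_STATES
        and now - job["submitted_at"] > STALE_AFTER
    ]
```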

5.5 Job Phase Transitions

The job state now reflects the exact phase of execution:

Phase | Meaning
queued | Submitted to SQS, waiting for EventBridge Pipe to trigger Step Function
running | InitJob Lambda fired, Step Function execution ARN recorded, worker translating
deploying | All translations complete, deploy Lambda creating CloudFront invalidations
finalizing | Deploy complete, finalize Lambda rebuilding search index and writing final state
succeeded / partial / failed | Terminal states: all sub-tasks complete, email notification sent

This gives real-time visibility into exactly where a job is in the pipeline.

6. Admin API (9 Endpoints)

All endpoints require the X-Admin-Key header. They are served by the LPO server (Go) alongside the existing admin routes.

6.1 Submit Translation Job

POST /admin/translate

Submits a new translation job. Validates inputs, checks cost limits, creates job state in Valkey (synchronous) and Aurora (async goroutine), enqueues to SQS.

// Request
POST /admin/translate
X-Admin-Key: tbcc-admin-...
X-Idempotency-Key: my-unique-key (optional)
Content-Type: application/json

{
  "docs": ["Trinity-Beast-API-Reference.html", "Trinity-Beast-Architecture-Guide.html"],
  "langs": "all",
  "options": {
    "force": false,
    "delta": false,
    "skip_search_rebuild": false,
    "skip_validation": false
  }
}

// Response 200
{
  "status": "✅ [LPO] [us-east-2] [BeastMain] [/admin/translate] [200]",
  "status_code": 200,
  "endpoint": "/admin/translate",
  "cluster_node": "BeastMain",
  "region": "us-east-2",
  "language": "en",
  "timestamp": "2026-05-16T16:42:00Z",
  "data": {
    "job_id": "1747407720-a3f8b2c1d4e5",
    "state": "queued",
    "submitted_at": "2026-05-16T16:42:00Z"
  },
  "error": ""
}
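For scripted submissions, the request above can be assembled with the Python standard library. This sketch only builds the request object; the base URL is a placeholder and the function name `build_submit_request` is hypothetical.

```python
import json
import urllib.request


def build_submit_request(base_url, admin_key, docs, langs="all", idempotency_key=None):
    """Build (but do not send) a POST /admin/translate request."""
    body = {
        "docs": docs,
        "langs": langs,
        "options": {
            "force": False,
            "delta": False,
            "skip_search_rebuild": False,
            "skip_validation": False,
        },
    }
    headers = {"X-Admin-Key": admin_key, "Content-Type": "application/json"}
    if idempotency_key:
        headers["X-Idempotency-Key"] = idempotency_key  # optional header
    return urllib.request.Request(
        base_url + "/admin/translate",
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it; the job_id in the response is what the status endpoint expects.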

Validation rules:

6.2 Monitoring Endpoints

GET /admin/translate/status/{job_id}

Returns the full job state. Aurora is the primary source — state, timestamps, docs, langs, cost, and Step Function ARN are read from translation_jobs. Real-time per-doc/lang progress is overlaid from Valkey (written per-pair by the worker, too frequent for Aurora writes). If Aurora doesn't have the job yet (async insert still pending), falls back to Valkey.

GET /admin/translate/queue

Lists all pending and active jobs (state in queued or running).

GET /admin/translate/history

Returns the last 50 completed jobs from translation_jobs in Aurora, ordered by submission date descending. Includes state, docs, succeeded/failed pair counts, cost, and reason. Falls back to the Valkey tx:history list if Aurora is unavailable.

GET /admin/translate/health

System health overview:

{
  "status": "✅ [LPO] [us-east-2] [BeastMain] [/admin/translate/health] [200]",
  "status_code": 200,
  "endpoint": "/admin/translate/health",
  "cluster_node": "BeastMain",
  "region": "us-east-2",
  "language": "en",
  "timestamp": "2026-05-16T17:30:00Z",
  "data": {
    "queue_depth": 0,
    "active_jobs": 1,
    "last_completed_at": "2026-05-16T17:30:00Z",
    "last_state": "succeeded",
    "daily_spend_usd": "12.40",
    "daily_spend_limit_usd": "600.00",
    "daily_input_tokens": 284150,
    "daily_output_tokens": 312480,
    "daily_token_limit": 50000000,
    "swept_jobs": 0
  },
  "error": ""
}

6.3 Control Endpoints

POST /admin/translate/cancel/{job_id}

Stops the Step Function execution via StopExecution API. Marks job as cancelled. Returns 409 if already in a terminal state.

POST /admin/translate/retry-failed/{job_id}

Creates a new job from the failed (doc, lang) pairs of a completed-with-partial job. Returns 409 if the original is still running.

POST /admin/translate/sweep

Manually triggers the self-healing sweeper. Idempotent — safe to call repeatedly.

// Response 200
{
  "status": "✅ [LPO] [us-east-2] [BeastMain] [/admin/translate/sweep] [200]",
  "status_code": 200,
  "endpoint": "/admin/translate/sweep",
  "cluster_node": "BeastMain",
  "region": "us-east-2",
  "language": "en",
  "timestamp": "2026-05-16T18:00:00Z",
  "data": {
    "swept": 2,
    "checked": 5,
    "results": [
      {
        "job_id": "1747407720-a3f8b2c1d4e5",
        "prior_state": "running",
        "submitted_at": "2026-05-16T16:42:00Z",
        "sfn_status": "FAILED",
        "action": "marked_failed"
      }
    ]
  },
  "error": ""
}

6.4 Worker Callback Endpoints

These endpoints are called by the worker task and finalize Lambdas to update Aurora without needing direct database access (worker and Lambdas are outside the VPC).

POST /admin/translate/update/{job_id}

Updates job state, progress, cost, and timing fields. Called by worker task after each (doc, lang) translation and by finalize Lambda on completion.

POST /admin/translate/event/{job_id}

Records a granular event in the translation_job_events table. Used for audit trail — each doc/lang start, success, failure, retry is logged as a separate event.

Fire-and-forget pattern: Both callback endpoints always return 200 regardless of Aurora write outcome. The translation pipeline must never fail because observability data couldn't be written. Errors are logged but never propagated.

7. Aurora Observability — Source of Truth

Aurora is the authoritative record for all translation job state. Valkey serves one specific role: real-time per-pair progress updates during active execution (written too frequently for Aurora). For everything else — job state, history, cost, audit trail — Aurora is read first.

Design principle: Valkey is the price cache, search indexes, and real-time counters. It is not a job ledger. Aurora is the ledger. When you need to know what was translated, when, at what cost, and with what result — query Aurora.

7.1 translation_jobs Table

One row per job submission. 28 columns covering the full lifecycle. This table is the ground truth for gap analysis, cost reporting, and audit:

Column Group | Fields | Purpose
Identity | id, job_id, idempotency_key | Unique identification and deduplication
State | state, submitted_at, started_at, completed_at | Lifecycle tracking, authoritative terminal state
Input | docs (JSONB), langs (JSONB), options (JSONB) | What was requested
Progress | total_pairs, succeeded_pairs, failed_pairs, progress (JSONB) | Per-doc/lang status map
Cost | bedrock_cost_usd, bedrock_invocations | Spend tracking per job
Execution | step_function_arn, errors (JSONB), elapsed_seconds | Traceability and debugging
Deployment | cloudfront_invalidation_ids, search_index_rebuilt, notification_sent | Post-translation actions
Lineage | retry_of, reason | Retry chain and submission reason
Metadata | submitted_by, created_at, updated_at | Audit trail

Gap analysis query: To find which documents have never been translated, query SELECT DISTINCT jsonb_array_elements_text(docs) FROM translation_jobs ORDER BY 1 and compare against the S3 document list. Aurora is the only reliable source for this — Valkey keys expire and don't persist across cache flushes.

7.2 translation_job_events Table

Granular audit log — one row per significant event in a job's lifecycle. Used by the retry-failed handler as the authoritative source of which (doc, lang) pairs failed:

Column | Type | Example Values
job_id | VARCHAR | 1747407720-a3f8b2c1d4e5
event_type | VARCHAR | lang_started, lang_succeeded, lang_failed, deploy_started, finalize_complete
doc | VARCHAR | Trinity-Beast-API-Reference.html
lang | VARCHAR | ja, ar, es
detail | JSONB | Cost, chunk count, error message, validator report
created_at | TIMESTAMP | Event timestamp
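
As a rough illustration of how a retry handler could derive the failed pairs from these rows, the sketch below keeps the latest outcome per (doc, lang). The function name failed_pairs, the dict-shaped rows, and the "latest outcome wins" rule are assumptions made for this example, not the handler's actual code.

```python
def failed_pairs(events):
    """Return (doc, lang) pairs whose most recent outcome is lang_failed.

    `events` is an iterable of dicts carrying the event_type, doc, and lang
    columns, assumed to be ordered by created_at (oldest first).
    """
    latest = {}
    for event in events:
        if event["event_type"] in ("lang_succeeded", "lang_failed"):
            latest[(event["doc"], event["lang"])] = event["event_type"]
    return sorted(pair for pair, kind in latest.items() if kind == "lang_failed")


# Hypothetical rows: ur fails and then succeeds, ja only fails.
events = [
    {"event_type": "lang_started", "doc": "a.html", "lang": "ja"},
    {"event_type": "lang_failed", "doc": "a.html", "lang": "ja"},
    {"event_type": "lang_succeeded", "doc": "a.html", "lang": "es"},
    {"event_type": "lang_failed", "doc": "a.html", "lang": "ur"},
    {"event_type": "lang_succeeded", "doc": "a.html", "lang": "ur"},
]
print(failed_pairs(events))
```

Only the pair that never recovered is returned, which is the set a retry would need to resubmit.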

7.3 Read/Write Strategy

The translation system uses a deliberate split between Aurora and Valkey based on access pattern:

Data | Primary Store | Reason
Job state (queued/running/succeeded/failed) | Aurora | Authoritative terminal state: never expires, queryable, auditable
Job history (last 50 completed) | Aurora | Permanent record: survives cache flushes, supports gap analysis
Per-pair progress (es: succeeded, ja: running…) | Valkey | Written per pair during execution: too frequent for Aurora writes, only needed during active polling
Daily spend counter | Valkey | Needs atomic INCRBYFLOAT and 24h TTL auto-reset; Aurora is the wrong tool for this
Active job set | Valkey | Fast set membership check on every submit; an Aurora query would add latency to the hot path

Write path

Read path

Do not rely on Valkey for job state. Job hashes in Valkey carry no TTL, yet they can still be flushed, evicted under memory pressure, or left stale if the finalize Lambda's update call was lost. Aurora is the record of what happened. Valkey is the window into what is happening right now.

8. Cost Protection

The translation engine calls Bedrock (Claude Sonnet 4.6) for every chunk of every document in every language. Without guardrails, a single typo in a batch submission could trigger hundreds of expensive API calls.

8.1 Four Protection Layers

Layer | Where | Limit | Behavior on Breach
Per-request limits | Admin API (submit handler) | Max 6 docs, max 12 langs, max 3 active jobs | 400 Bad Request (docs/langs) or queue in SQS (active jobs)
Daily dollar cap | Admin API (submit handler) | $600/day (autoops:bedrock:spend:daily) | 429 Too Many Requests until counter expires
Daily token cap | Admin API (submit handler) | 50M combined tokens/day (autoops:bedrock:tokens:input:daily + autoops:bedrock:tokens:output:daily) | 429 Too Many Requests until counters expire
Per-invocation tracking | Worker task | Increments after every Bedrock call | Source of truth for daily counters
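
The submit-time checks in the table can be sketched as a single decision function. This is only an illustration: the function name check_submission, the return labels, the order of the checks, and the use of >= at each cap are assumptions, and the counters are passed in as plain numbers rather than read from Valkey.

```python
MAX_DOCS = 6
MAX_LANGS = 12
MAX_ACTIVE_JOBS = 3
DAILY_SPEND_CAP_USD = 600.0
DAILY_TOKEN_CAP = 50_000_000  # combined input + output tokens


def check_submission(docs, langs, active_jobs, spend_today,
                     tokens_in_today, tokens_out_today):
    """Return 'reject_400', 'reject_429', 'queue', or 'start' (hypothetical labels)."""
    if len(docs) > MAX_DOCS or len(langs) > MAX_LANGS:
        return "reject_400"  # per-request limits on docs/langs
    if spend_today >= DAILY_SPEND_CAP_USD:
        return "reject_429"  # daily dollar cap
    if tokens_in_today + tokens_out_today >= DAILY_TOKEN_CAP:
        return "reject_429"  # daily token cap
    if active_jobs >= MAX_ACTIVE_JOBS:
        return "queue"       # too many active jobs: held in the queue
    return "start"


print(check_submission(["a.html"], ["es", "fr"], 1, 120.0, 2_000_000, 1_500_000))
```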

8.2 Spend Tracking

Two parallel counters track daily usage — a dollar cap and a token cap. Both live in Valkey with 24-hour TTL auto-reset and are checked on every job submission.

Dollar Cap (autoops:bedrock:spend:daily)

Why $600? A full batch translation of the entire 40-document library × 11 languages costs approximately $726 in raw Bedrock spend at ~$1.65 per doc-language pair (Sonnet 4.6) — but in practice the library is never re-translated all at once. Typical batches are 3 or 6 documents (per the Trinity Beast multiples-of-3 convention) and run well under $200. The $600 cap is a daily safety guardrail with comfortable headroom for several batches plus normal AutoOps overhead (threat analysis, digests, support) in the same 24-hour window.

Token Cap (autoops:bedrock:tokens:input:daily + autoops:bedrock:tokens:output:daily)

Kill switch: Setting autoops:bedrock:kill = "1" in Valkey causes both the submit endpoint and the worker task to refuse all operations. Use this for emergency cost containment.

Pricing formula:

Token rates (stored in Valkey, always current):

Agent | Input (per 1M tokens) | Output (per 1M tokens)
Haiku 3.5 | $1.00 | $5.00
Sonnet 4.6 | $3.00 | $15.00
Opus 4 | $5.00 | $25.00
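
For orientation, raw Bedrock spend follows directly from these per-token rates. The sketch below hard-codes the rates from the table (the engine reads them from Valkey) and uses hypothetical token counts; the function name bedrock_cost_usd is an assumption, and the markup that produces the customer price is deliberately left out because it is not shown here.

```python
# (input, output) USD per 1M tokens, copied from the table above.
RATES_PER_MILLION = {
    "Haiku 3.5": (1.00, 5.00),
    "Sonnet 4.6": (3.00, 15.00),
    "Opus 4": (5.00, 25.00),
}


def bedrock_cost_usd(model, input_tokens, output_tokens):
    """Raw Bedrock spend for one or more invocations, before any markup."""
    input_rate, output_rate = RATES_PER_MILLION[model]
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


# Hypothetical token counts for one document-language pair.
print(f"${bedrock_cost_usd('Sonnet 4.6', 100_000, 90_000):.2f}")
```

With these made-up counts the result lands near the ~$1.65 per-pair figure quoted for Sonnet 4.6; real token counts vary by document and target language.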

Typical costs (Bedrock spend, before markup):

Scenario | Bedrock Cost | Customer Price
1 document × 1 language (Haiku 3.5) | ~$0.12 | ~$0.17
1 document × 1 language (Sonnet 4.6) | ~$1.65 | ~$2.34
1 document × 11 languages (Sonnet 4.6) | ~$18 | ~$25.50
6 documents × 11 languages (1 batch job, Trinity Beast convention) | ~$108 | ~$153
30 documents × 11 languages (Sonnet 4.6) | ~$540 | ~$770

8.3 Infrastructure Integration

Translation engine metrics are exposed through two public interfaces:

Email notification timing: The email notification is the absolute LAST step in the pipeline. It fires only after: translation, deployment, search index rebuild, state update, and history push are ALL complete. The email is a comprehensive report including: job summary, translation results, CloudFront invalidation IDs, search index status, and any Bedrock error details. If Bedrock reports validation failures, the specific error messages and validator details are included in the email.

9. CLI Compatibility

The existing CLI tool (scripts/kcc_helpers/translate_doc.py) continues to work unchanged. A --remote flag routes through the new service instead of running Bedrock locally:

Flag | Behavior | Use Case
--local (current default) | Runs translator engine in-process, calls Bedrock directly from laptop | Development, debugging, single-doc quick fixes
--remote | POSTs to /admin/translate, polls /admin/translate/status/{id} every 5s, streams progress to stdout | Production translations, batch operations

The --remote flag produces identical terminal output to local mode — same progress bars, same chunk counters, same completion summary. The operator's workflow doesn't change; only the execution path does.
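
The remote polling loop can be pictured roughly as follows. This is a sketch, not the CLI's code: poll_until_done, the injected fetch_status callable, and the choice of succeeded/failed as the only terminal states are assumptions. The response shape (data.state) follows the status example in this document.

```python
import time

TERMINAL_STATES = {"succeeded", "failed"}  # assumed terminal states


def poll_until_done(fetch_status, job_id, interval_seconds=5,
                    sleep=time.sleep, max_polls=None):
    """Poll a status endpoint until the job reaches a terminal state.

    fetch_status(job_id) must return the decoded JSON body of
    /admin/translate/status/{id}; it is injected so the HTTP client
    stays out of this sketch.
    """
    polls = 0
    while True:
        state = fetch_status(job_id)["data"]["state"]
        print(f"[{job_id}] state={state}")
        if state in TERMINAL_STATES:
            return state
        polls += 1
        if max_polls is not None and polls >= max_polls:
            raise TimeoutError(f"job {job_id} still {state} after {polls} polls")
        sleep(interval_seconds)


# Demo with canned responses instead of real HTTP calls.
canned = iter([{"data": {"state": s}} for s in ("queued", "running", "succeeded")])
print(poll_until_done(lambda job_id: next(canned), "demo-job", sleep=lambda s: None))
```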

Default flip plan: Start with --local as default to avoid surprising anyone. After 30 days of clean production runs through the service, flip the default to --remote and add --local as the explicit fallback.

10. Configuration Reference — Protected Terms

All translation behavior is driven by a single config file: scripts/translation-config.json. This is the shared source of truth consumed by both the Python engine and the Go admin API.

10.1 Protected Terms (57 entries)

Brand names, product names, AWS services, exchange names, and acronyms that must never be translated or transliterated:

Cross Power Ministries of Pakistan, The Trinity Beast Infrastructure,
The Trinity Beast, Trinity Beast Command Center, Kiro Command Center,
Cory Dean Kalani, Shafiq Bhatti, BeastWebhook, BeastMirror, BeastMain,
BeastLRS, Claude Sonnet 4.6, Bedrock, ElastiCache, EventBridge,
CloudFront, GuardDuty, CloudWatch, CloudTrail, Step Functions,
Crypto.com, Coinbase, Gate.io, Gemini, Kraken, Aurora, Valkey,
Stripe, Kiro, Fargate, PostgreSQL, Lambda, Route 53, AutoOps,
TBCC, CPMP, TBI, KCC, OKX, ECR, ECS, ALB, NLB, WAF, SNS, SQS,
SES, VPC, IAM, S3 ...

Per-Request Protected Terms

In addition to the global protected terms list, you can submit document-specific terms via the protected_terms array in the translation request. This is useful for:

POST /admin/translate
{
  "docs": ["Trinity-Beast-API-Reference.html"],
  "langs": "all",
  "protected_terms": ["MyCustomService", "SpecialEndpoint", "ProjectAlpha"]
}

Per-request terms are merged with the global list for that job only. They do not persist across jobs.

10b. Configuration Reference — Preserve Patterns

10.2 Preserve Patterns

Regex patterns for technical tokens that must survive translation unchanged:

Pattern Name | Matches | Example
url | HTTP/HTTPS URLs | https://api.cpmp-site.org/admin/translate
email | Email addresses | CoryDeanKalani@CPMP-Site.org
memory_size | Number + memory unit | 1770 MB, 32 GB
percentage | Number + % | 98.5%, 62%
cron_expr | Cron expressions | cron(0 11 * * ? *)
ip_address | IPv4 with optional CIDR | 10.0.1.0/24
aws_arn | AWS ARN format | arn:aws:sns:us-east-2:211998422884:tbi-ops-notifications
aws_resource_id | AWS resource identifiers | vpc-03deaddb7083cd59c, sg-050b617f93b2388f6

10c. Configuration Reference — Limits

10.3 Limits

Parameter | Value | Purpose
max_chunk_chars | 6000 | Default maximum characters per chunk (Latin scripts: es, pt, fr, de, it)
max_chunk_chars_by_lang | See below | Per-language overrides for complex scripts
max_retries | 3 | Retry attempts per chunk on validation failure
request_timeout_seconds | 300 | Per-Bedrock-call timeout (5 minutes; large RTL chunks need headroom)
max_output_tokens | 8192 | Maximum tokens in Bedrock response

Per-language chunk size overrides:

Languages | Chunk Size | Rationale
hi, ur, ar | 3000 chars | Devanagari and Arabic scripts expand significantly during translation. Smaller chunks prevent Bedrock timeouts.
ja, zh, ru | 4500 chars | CJK and Cyrillic have moderate expansion. Mid-range chunks balance throughput and reliability.
es, pt, fr, de, it | 6000 chars (default) | Latin scripts translate quickly with minimal expansion.
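
A lookup with a default is enough to resolve the chunk size for a language. The sketch below assumes the two keys sit side by side in one limits object; the exact nesting inside translation-config.json and the helper name chunk_size_for are illustrative assumptions.

```python
# Values copied from the tables above; the dict layout is assumed.
LIMITS = {
    "max_chunk_chars": 6000,
    "max_chunk_chars_by_lang": {
        "hi": 3000, "ur": 3000, "ar": 3000,
        "ja": 4500, "zh": 4500, "ru": 4500,
    },
}


def chunk_size_for(lang, limits=LIMITS):
    """Return the per-language override, falling back to the default."""
    return limits["max_chunk_chars_by_lang"].get(lang, limits["max_chunk_chars"])


for lang in ("ur", "ja", "es"):
    print(lang, chunk_size_for(lang))
```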

11. Operations Guide

11.1 Submitting a Translation Job

Single document, all languages:

curl -s -X POST \
  -H "X-Admin-Key: $ADMIN_KEY" \
  -H "Content-Type: application/json" \
  -d '{"docs":["Trinity-Beast-API-Reference.html"],"langs":"all"}' \
  https://api.cpmp-site.org/admin/translate | jq .

Multiple documents, specific languages:

curl -s -X POST \
  -H "X-Admin-Key: $ADMIN_KEY" \
  -H "Content-Type: application/json" \
  -d '{"docs":["Trinity-Beast-API-Reference.html","Trinity-Beast-Architecture-Guide.html"],"langs":["es","pt","fr"]}' \
  https://api.cpmp-site.org/admin/translate | jq .

With idempotency key (safe to retry):

curl -s -X POST \
  -H "X-Admin-Key: $ADMIN_KEY" \
  -H "X-Idempotency-Key: api-ref-2026-05-16" \
  -H "Content-Type: application/json" \
  -d '{"docs":["Trinity-Beast-API-Reference.html"],"langs":"all"}' \
  https://api.cpmp-site.org/admin/translate | jq .

11.2 Monitoring Progress

# Check job status
curl -s -H "X-Admin-Key: $ADMIN_KEY" \
  https://api.cpmp-site.org/admin/translate/status/{job_id} | jq .

# View queue
curl -s -H "X-Admin-Key: $ADMIN_KEY" \
  https://api.cpmp-site.org/admin/translate/queue | jq .

# System health
curl -s -H "X-Admin-Key: $ADMIN_KEY" \
  https://api.cpmp-site.org/admin/translate/health | jq .

# Recent history
curl -s -H "X-Admin-Key: $ADMIN_KEY" \
  https://api.cpmp-site.org/admin/translate/history | jq .

11.3 Troubleshooting

Symptom | Cause | Resolution
Job stuck in queued | EventBridge Pipe not consuming | Check Pipe status in console; verify IAM role
429 on submit | Daily spend cap hit ($600) | Wait for 24h TTL expiry, or reset manually: SET autoops:bedrock:spend:daily 0
Partial completion | Some languages failed validation | POST /admin/translate/retry-failed/{id}
Worker timeout | Document too large (many chunks) | Check Step Function execution history for the failing chunk index
Cancel returns 404 | Job only in Aurora, not Valkey | Cancel handler falls back to Aurora; ensure latest code is deployed
No email notification | Finalize Lambda error | Check CloudWatch logs for tbi-translate-finalize
Search not updated | Search rebuild timed out | Run bash scripts/kcc.sh build-search manually

Cancel a running job:

curl -s -X POST -H "X-Admin-Key: $ADMIN_KEY" \
  https://api.cpmp-site.org/admin/translate/cancel/{job_id}

This stops the Step Function execution immediately. Documents already translated and deployed remain live. The search index is rebuilt for whatever landed successfully.

12. Regional Failover

The translation engine implements automatic regional failover to maintain availability during Bedrock service disruptions. This was added after a us-east-2 outage during development exposed the single-region weakness — better to discover this before customers were affected.

12.1 Failover Chain

When a Bedrock call fails with a service-level error, the engine automatically retries in the next region:

Priority | Region | Location | Role
1 | us-east-2 | Ohio | Primary: all normal traffic
2 | us-east-1 | N. Virginia | First fallback
3 | us-west-2 | Oregon | Second fallback

The failover is transparent to the caller — the translation completes successfully as long as at least one region is available. A log message records when a fallback region was used.

12.2 Trigger Conditions

Failover is triggered for service-level errors and timeouts that indicate the region is unavailable or overloaded:

Error Type | Meaning | Action
ServiceUnavailableException | Bedrock service is down (503) | Retry same region once, then failover
ThrottlingException | Rate limit or capacity exceeded | Retry same region once, then failover
ModelStreamErrorException | Model streaming failure | Retry same region once, then failover
ReadTimeoutError | Response took longer than 300s | Retry same region once, then failover
ConnectTimeoutError | Could not establish connection within 10s | Retry same region once, then failover

Other errors (validation failures, authentication errors, malformed requests) are not retried — they would fail identically everywhere.

Per-Region Retry with Backoff

Each region gets 2 attempts before the engine moves to the next region. A 5-second backoff between attempts allows transient pressure to clear:

us-east-2 (attempt 1) → timeout → wait 5s →
us-east-2 (attempt 2) → timeout →
us-east-1 (attempt 1) → timeout → wait 5s →
us-east-1 (attempt 2) → timeout →
us-west-2 (attempt 1) → timeout → wait 5s →
us-west-2 (attempt 2) → timeout → FAIL (raise exception)

Total: 6 attempts across 3 regions. In practice, transient spikes clear within 5-10 seconds, so the retry within the same region usually succeeds without needing failover.
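
The retry chain above can be sketched as two nested loops. Everything named here (invoke_with_failover, BedrockCallError, the injected invoke callable) is hypothetical scaffolding for illustration; the real engine catches the SDK's own exception types rather than a wrapper like this.

```python
import time

REGIONS = ["us-east-2", "us-east-1", "us-west-2"]
RETRYABLE = {
    "ServiceUnavailableException", "ThrottlingException",
    "ModelStreamErrorException", "ReadTimeoutError", "ConnectTimeoutError",
}
ATTEMPTS_PER_REGION = 2
BACKOFF_SECONDS = 5


class BedrockCallError(Exception):
    """Hypothetical wrapper that carries the service error type as a string."""

    def __init__(self, error_type):
        super().__init__(error_type)
        self.error_type = error_type


def invoke_with_failover(invoke, sleep=time.sleep):
    """Call invoke(region), retrying per region and then failing over."""
    last_error = None
    for region in REGIONS:
        for attempt in range(1, ATTEMPTS_PER_REGION + 1):
            try:
                return invoke(region)
            except BedrockCallError as err:
                if err.error_type not in RETRYABLE:
                    raise  # would fail identically in every region
                last_error = err
                print(f"{region} attempt {attempt} failed: {err.error_type}")
                if attempt < ATTEMPTS_PER_REGION:
                    sleep(BACKOFF_SECONDS)  # back off only within a region
    raise last_error


def demo_invoke(region):
    # Simulated outage: the primary region throttles, the first fallback works.
    if region == "us-east-2":
        raise BedrockCallError("ThrottlingException")
    return f"translated via {region}"


print(invoke_with_failover(demo_invoke, sleep=lambda seconds: None))
```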

12.3 Cost Impact

Regional failover has negligible cost impact:

Resilience benefit: A complete regional outage no longer blocks translations. The May 2026 us-east-2 outage would have caused a 4-hour translation blackout without this feature. With failover, translations continued uninterrupted via us-east-1.

13. Document Preparation Guide

Proper document preparation ensures clean translations with minimal post-processing. This section covers the conventions that help the translation engine produce accurate results.

13.1 Code Tag Usage

The <code translate="no"> tag tells the translation engine to preserve content exactly as written. Use it correctly to avoid formatting artifacts in translated documents.

When to Use Code Tags

Use <code translate="no"> for technical identifiers that would break if translated:

When NOT to Use Code Tags

Do not wrap pure data values in code tags — they should appear as plain text:

Why this matters: The translation engine's sentinel system protects code-tagged content from translation. If you wrap "32 GB" in code tags, it survives translation — but so does the monospace formatting, which looks wrong in prose. The engine has a post-processor that strips spurious code wrappers from pure numeric values, but it's better to author correctly from the start.

Quick Test

Ask yourself: "If I changed this value, would the system break?" If yes, use code tags. If no (it's just a number or measurement), leave it as plain text.

Content | Would changing it break something? | Use code tags?
tbi-ops-notify | Yes (Lambda name) | ✅ Yes
1770 MB | No (just a memory size) | ❌ No
/admin/translate | Yes (API endpoint) | ✅ Yes
$600 | No (just a dollar amount) | ❌ No
max_retries | Yes (config key) | ✅ Yes
3 retries | No (just a count) | ❌ No

13.2 Protected Terms Submission

For documents with domain-specific terminology not in the global protected terms list, submit additional terms with the translation request:

POST /admin/translate
{
  "docs": ["Customer-Integration-Guide.html"],
  "langs": "all",
  "protected_terms": [
    "CustomerCorp",
    "ProjectPhoenix",
    "DataSync API",
    "IntegrationHub"
  ]
}

These terms are added to the global list for this job only. The engine will:

  1. Wrap each term in <span translate="no"> during preprocessing
  2. Replace with sentinel tokens before sending to Bedrock
  3. Restore the original terms after translation
  4. Validate that all terms survived intact
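
Steps 2 to 4 can be sketched as a protect/restore round trip. The token format __TERM_001__ and the helper names are invented for this example and are not the engine's real sentinel format; the span-wrapping in step 1 is omitted.

```python
def protect_terms(text, terms):
    """Swap each protected term for an opaque token; return text and mapping."""
    mapping = {}
    # Longest terms first so a long term is not split by a shorter one inside it.
    ordered = sorted(set(terms), key=lambda term: (-len(term), term))
    for index, term in enumerate(ordered, start=1):
        if term in text:
            token = f"__TERM_{index:03d}__"
            text = text.replace(term, token)
            mapping[token] = term
    return text, mapping


def restore_terms(text, mapping):
    """Put the original terms back after translation."""
    for token, term in mapping.items():
        text = text.replace(token, term)
    return text


def lost_terms(translated, mapping):
    """Validation: list the terms whose token is missing from the model output."""
    return [term for token, term in mapping.items() if token not in translated]


source = "Connect DataSync API to IntegrationHub."
protected, mapping = protect_terms(source, ["DataSync API", "IntegrationHub"])
print(protected)
print(restore_terms(protected, mapping))
```

Checking for missing tokens before restoration is what lets a validator report exactly which term was altered.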

Best Practices for Protected Terms

13.3 Clarification Workflow

When the translation engine encounters ambiguous content, it may flag it for human review. This happens in the validation phase when:

Flagged content appears in the job status response under the warnings array:

{
  "status": "✅ [LPO] [us-east-2] [BeastMain] [/admin/translate/status/1747407720-a3f8b2c1d4e5] [200]",
  "status_code": 200,
  "endpoint": "/admin/translate/status/1747407720-a3f8b2c1d4e5",
  "cluster_node": "BeastMain",
  "region": "us-east-2",
  "language": "en",
  "timestamp": "2026-05-16T17:45:00Z",
  "data": {
    "job_id": "1747407720-a3f8b2c1d4e5",
    "state": "succeeded",
    "warnings": [
      "chunk 14 (ja): soft failure — protected term 'DataSync' may have been altered",
      "chunk 22 (ar): soft failure — version number format changed from X.Y.Z to X.Y"
    ]
  },
  "error": ""
}

Soft failures don't block the translation — the output is still deployed. Review the warnings and manually verify the flagged sections if needed.

Feedback loop: If you consistently see the same term flagged, add it to the global protected terms list in scripts/translation-config.json. This prevents future warnings and improves translation quality across all documents.

14. Pre-Scan Complexity Analysis

Before translation begins, the engine analyzes each document for complexity factors that may cause validation failures. This pre-scan identifies code-heavy sections and recommends whether to proceed, exercise caution, or split the document.

14.1 Complexity Metrics

The pre-scan calculates a complexity score for each section based on:

Factor | Weight | Why It Matters
Code tags | 1.0 per tag | Each code tag must survive translation intact; more tags means more validation points
Code tags in tables | 1.5 per tag | Tables with code examples are harder: the model tends to merge or drop tags when reordering
Tables | 2.0 per table | Tables with technical content require careful structure preservation
Pre blocks | 0.5 per block | Usually have translate="no"; lower risk but still tracked
Protected spans | 0.3 per span | Handled by sentinel system; low risk
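
Read as a weighted sum, the table translates into a few lines of code. The sketch assumes that code tags inside tables are counted separately from other code tags (at 1.5 instead of 1.0) and that the per-section score is a plain sum; the names are illustrative, and the real pre-scan may combine the factors differently.

```python
WEIGHTS = {
    "code_tags": 1.0,            # code tags outside tables (assumed split)
    "code_tags_in_tables": 1.5,
    "tables": 2.0,
    "pre_blocks": 0.5,
    "protected_spans": 0.3,
}


def complexity_score(counts):
    """Weighted sum of the factor counts for one section."""
    return sum(weight * counts.get(name, 0) for name, weight in WEIGHTS.items())


# Hypothetical section: 10 code tags, 4 more inside 2 tables, 3 pre blocks, 5 spans.
section = {"code_tags": 10, "code_tags_in_tables": 4, "tables": 2,
           "pre_blocks": 3, "protected_spans": 5}
print(round(complexity_score(section), 1))
```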

Section Thresholds

14.2 Recommendations

Based on the analysis, the pre-scan returns one of three recommendations:

Recommendation | Criteria | Action
PROCEED | Score < 20, no high-density sections | Translate normally (low failure risk)
CAUTION | Score < 50, ≤ 2 high-density sections | Proceed but monitor (may need retries)
SPLIT | Score ≥ 50 OR > 3 high-density sections | Consider splitting document before translation
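
The three outcomes can be expressed as a small decision function. Note that the criteria as written do not say what happens with exactly 3 high-density sections and a score under 50; this sketch treats every case that is neither SPLIT nor PROCEED as CAUTION, which is an assumption rather than documented behavior.

```python
def recommend(score, high_density_sections):
    """Map a document score and high-density section count to a recommendation."""
    if score >= 50 or high_density_sections > 3:
        return "SPLIT"
    if score < 20 and high_density_sections == 0:
        return "PROCEED"
    return "CAUTION"  # includes the cases the table leaves unspecified


print(recommend(415.4, 9))
print(recommend(12.0, 0))
print(recommend(35.0, 2))
```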

Pre-Scan Output Example

DOCUMENT TRANSLATION COMPLEXITY ANALYSIS
========================================
Total characters: 81,107
Total sections: 13
Total code tags: 287
Overall complexity score: 415.4
Recommendation: SPLIT

WARNINGS:
  ⚠️  Document has 287 code tags — high validation failure risk
  ⚠️  Section 'step-function' has 51 code tags — consider simplifying
  ⚠️  Section 'observability' has 48 code tags — consider simplifying

HIGH-DENSITY SECTIONS (9):
  • architecture: 11 code tags, score 22.7
  • sentinel-system: 22 code tags, score 34.5
  • step-function: 51 code tags, score 71.1
  ...

SUGGESTED SPLIT: 4 parts
  → Split after 'validators' (After 3 high-density sections)
  → Split after 'observability' (After 3 high-density sections)
  → Split after 'doc-prep' (After 3 high-density sections)

14.3 Document Splitting

When the pre-scan recommends splitting, it suggests natural break points at section boundaries. Options for handling complex documents:

Option 1: Split into Multiple Documents

Create separate HTML files for each part (e.g., Doc-Part1.html, Doc-Part2.html). Each part translates independently with lower failure risk. Link them together with navigation.

Option 2: Simplify High-Density Sections

Reduce code tag density in problematic sections:

Option 3: Translate in Batches

Submit fewer languages per job (e.g., 3 instead of 11). This reduces concurrent load and allows the model more capacity per translation. Retry failed languages individually.

Per-Language Split Thresholds (v2.5)

Complex scripts (Urdu, Arabic, Hindi) struggle with high tag density even when Latin-script languages handle the same chunk fine. The prescan now applies per-language code tag limits — tighter thresholds for scripts where the model is more likely to drop markup:

Language | Script | Max Code Tags per Part
Default (Latin, CJK, Cyrillic) | Latin / Kanji / Cyrillic | 30
Urdu (ur) | Nastaliq | 18
Arabic (ar) | Arabic | 18
Hindi (hi) | Devanagari | 20

Configuration key: max_code_tags_per_part_by_lang in translation-config.json. When the prescan runs for a specific language, it uses that language's threshold to determine split points. A document that translates as one part for Spanish may automatically split into 2-3 parts for Urdu.

Result: The Translation Service document (22 code tags in the Architecture section) previously failed for Urdu on every attempt. With the per-language threshold of 18, the prescan splits Architecture and Observability into separate parts. All 11 languages now translate successfully.

This document is an edge case: The Translation Engine documentation itself has 287 code tags and a complexity score of 415 — it's documentation about a translation engine, so it's packed with code examples. Most documents score under 50.

14.4 Splitting Safety Valve (v2.8)

Even when code tag density is low, a single part that exceeds the model's effective output window will be silently truncated — sections at the end of the part simply disappear from the output. The safety valve enforces a hard character limit per part regardless of prescan recommendations.

The Problem

The Performance Report (75 KB) has 18 sections with moderate code density. The prescan recommended splitting into 3 parts based on code tag thresholds. But Part 1 was 36 KB of prose-heavy content — well under the code tag limit but far beyond the model's output token budget. The model translated the first ~24 KB faithfully, then its output simply stopped. Sections 7-8 (partner-sustained, udp-engine) vanished without any error signal.

The Fix

# Safety valve: max chars per part (prevents model output truncation)
MAX_CHARS_PER_PART = 24000  # ~6000 tokens, well within max_output_tokens

The splitter now enforces a 24 KB ceiling on every part. If a part exceeds this limit after the prescan-based split, it is further subdivided at the nearest section boundary. This is conservative — Latin scripts could handle ~30 KB, but 24 KB is safe for all languages including RTL and CJK where token efficiency is lower.
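
One way to picture the safety valve is a greedy re-pack of each part's sections under the character ceiling. This is a sketch under assumptions: parts are modeled as lists of section strings, enforce_part_limit is a made-up name, and greedy packing stands in for whatever boundary selection the real splitter uses.

```python
MAX_CHARS_PER_PART = 24000


def enforce_part_limit(parts, max_chars=MAX_CHARS_PER_PART):
    """Subdivide oversized parts, cutting only at section boundaries.

    `parts` is a list of parts, each a list of section strings. A single
    section longer than max_chars is kept whole in its own part, because
    this sketch never cuts inside a section.
    """
    result = []
    for sections in parts:
        current, size = [], 0
        for section in sections:
            if current and size + len(section) > max_chars:
                result.append(current)
                current, size = [], 0
            current.append(section)
            size += len(section)
        if current:
            result.append(current)
    return result


# A 36,000-character part made of four sections becomes two parts.
oversized = [["a" * 10000, "b" * 10000, "c" * 10000, "d" * 6000]]
print([sum(len(s) for s in part) for part in enforce_part_limit(oversized)])
```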

Impact

Document | Before (v2.6) | After (v2.8)
Performance Report (75 KB) | 3 parts (Part 1: 36 KB, truncated) | 4 parts (largest: 22 KB, clean)
API Reference (180 KB) | 8 parts (all under 24 KB already) | 8 parts (no change, already safe)
Translation Engine (116 KB) | 11 parts (code-density driven) | 11 parts (no change, code splits dominate)

The safety valve only activates when the prescan's code-tag-based splitting produces oversized parts. For most documents, the code density split already keeps parts well under 24 KB.

Result: Performance Report went from dropping 3 entire sections (silent truncation) to a perfect 18/18 sections, 4/4 diagrams, 20/20 <br/> tags across all 11 languages.

15. Document-Level Preprocessor

The document-level preprocessor is a critical layer that runs before chunking. It extracts complex HTML elements from the entire document, replacing them with simple Unicode placeholders. After translation, the postprocessor restores the original elements. This largely eliminates the "model drops tags" failure mode.

15.1 The Problem

The per-chunk sentinel system (Section 3) works well for most documents, but complex documents with many <code>, <strong>, and <em> tags exposed a fundamental limitation:

Example failure: A chunk with 27 <code translate="no"> tags consistently failed validation with tag count mismatch (27→23) — the model dropped 4 placeholders despite explicit instructions to preserve them.

15.2 The Solution

Extract ALL problematic elements from the entire document before chunking. The model never sees these elements — only simple Unicode placeholders that it cannot confuse with HTML structure.

Key insight: The model cannot corrupt what it never sees. By extracting elements at the document level, each chunk has zero complex tags to worry about. The model translates clean prose with obvious markers.

Before vs After

Pipeline Stage | Before (v2.2) | After (v2.3)
Document received | 290 code tags | 290 code tags
After preprocessing | n/a | 0 code tags (290 placeholders)
Per-chunk sentinels | 20+ placeholders per chunk | 0-2 placeholders per chunk
Model cognitive load | High (complex structure) | Low (clean prose)
Validation failures | Frequent on complex docs | Rare

15.3 Processing Flow

The preprocessor integrates into the translation pipeline as the first step:

Document → PREPROCESS → Chunk → Translate → Reassemble → POSTPROCESS → Output
              ↓                                              ↓
     Extract ALL code/pre/strong/em tags        Restore placeholders
     Replace with ⟦CODE_001⟧, ⟦STRONG_002⟧     with original HTML
     Build manifest mapping                     from manifest

Integration in engine.py

def translate(text, target_lang, mode="html", ...):  # remaining parameters elided
    # Step 1: PREPROCESS - extract elements (document-level)
    simplified_html, manifest = preprocess_for_translation(text)

    # Step 2: CHUNK - split simplified document (zero complex tags now)
    head, chunks, tail = chunker.split_document(simplified_html, lang=target_lang)

    # Step 3: TRANSLATE - each chunk through Bedrock, collected in order
    translated_chunks = []
    for chunk in chunks:
        translated = _translate_chunk(chunk, ...)  # Per-chunk sentinels still run
        translated_chunks.append(translated)

    # Step 4: REASSEMBLE
    reassembled = chunker.reassemble(head, translated_chunks, tail)

    # Step 5: POSTPROCESS - restore placeholders with original elements
    output = postprocess_translation(reassembled, manifest)
    return output

15.4 Element Extraction

The preprocessor extracts elements in order of specificity (most specific first) to handle nesting correctly:

Pass | Elements Extracted | Placeholder Format
1 | <pre translate="no"> blocks | ⟦PRE_001⟧
2 | <code translate="no"> tags | ⟦CODE_001⟧
3 | Other translate="no" elements | ⟦SPAN_001⟧
4 | <strong>, <em>, <b>, <i> tags | ⟦STRONG_001⟧, ⟦EM_001⟧
5 | Numeric patterns (memory sizes, percentages, versions) | ⟦MEM_001⟧, ⟦PCT_001⟧, ⟦VER_001⟧

Placeholder Format

Placeholders use Unicode brackets (⟦ and ⟧) that will never appear in real HTML content:

Nested Element Handling

The preprocessor handles arbitrary nesting depth by processing innermost elements first:

Source:
<span translate="no"><code translate="no">tbi-ops-notify</code> Lambda</span>

Pass 1: Extract inner code tag
<span translate="no">⟦CODE_001⟧ Lambda</span>

Pass 2: Extract outer span
⟦SPAN_002⟧

Model sees: ⟦SPAN_002⟧ (one token, no nesting)

Sibling Placeholder Awareness

When the preprocessor extracts elements from a container (e.g., a table cell), earlier passes leave placeholder text in the parent. Later passes must not be confused by these sibling placeholders — a <code translate="no"> tag in the same table cell as an already-extracted element is still a valid extraction target.

Bug fixed (v2.4): The original _is_inside_placeholder check walked up the DOM tree looking for the ⟦ character in any parent's text. This caused false positives: if a sibling element had been extracted (leaving ⟦CODE_042⟧ in the parent's text), the check incorrectly skipped the remaining <code translate="no"> tags in the same container. Those unextracted tags then overwhelmed the model during complex-script translation (Hindi, Urdu). Fix: the check now always returns false. If an element still exists in the DOM tree, it wasn't extracted and is a valid target.

15.5 Restoration

After translation, the postprocessor restores placeholders in reverse index order (high → low) to prevent prefix collisions:

Translated: ⟦SPAN_002⟧

Restore ⟦SPAN_002⟧:
<span translate="no">⟦CODE_001⟧ Lambda</span>

Restore ⟦CODE_001⟧:
<span translate="no"><code translate="no">tbi-ops-notify</code> Lambda</span>

Perfect reconstruction — model never had to understand nesting.

Manifest Structure

The manifest maps each placeholder to its original HTML, enabling exact restoration:

{
  "⟦CODE_001⟧": {
    "type": "CODE",
    "html": "<code translate=\"no\">tbi-ops-notify</code>",
    "index": 1
  },
  "⟦SPAN_002⟧": {
    "type": "SPAN",
    "html": "<span translate=\"no\">⟦CODE_001⟧ Lambda</span>",
    "index": 2
  }
}
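
The extract-and-restore round trip can be sketched with a regular expression and a manifest shaped like the one above. The real preprocessor works on the DOM and runs several passes; this simplified version handles only <code translate="no"> tags, and the function names are illustrative.

```python
import re

CODE_TAG = re.compile(r'<code translate="no">.*?</code>', re.DOTALL)


def extract_code_tags(html):
    """Replace each protected code tag with a placeholder and record it."""
    manifest = {}

    def to_placeholder(match):
        index = len(manifest) + 1
        key = f"⟦CODE_{index:03d}⟧"
        manifest[key] = {"type": "CODE", "html": match.group(0), "index": index}
        return key

    return CODE_TAG.sub(to_placeholder, html), manifest


def restore(text, manifest):
    """Restore placeholders from the highest index down to the lowest."""
    entries = sorted(manifest.items(), key=lambda item: item[1]["index"], reverse=True)
    for key, entry in entries:
        text = text.replace(key, entry["html"])
    return text


html = ('Call <code translate="no">tbi-ops-notify</code> with '
        '<code translate="no">max_retries</code>.')
simplified, manifest = extract_code_tags(html)
print(simplified)
print(restore(simplified, manifest) == html)
```

Restoring in descending index order also handles the nested case: an outer placeholder is expanded first, which exposes the inner placeholder for the next iteration.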

Result: The Translation Engine document (290 code tags, complexity 423) now translates with 0 retries across all 11 parts. Previously it failed consistently on Part 8 (config section with 27 code tags).

15.6 Numeric Pattern Extraction

Pass 5 extracts numeric patterns from the text after HTML element extraction. This protects bare numbers in prose that weren't already inside code or span tags. The model cannot convert, localize, or drop what it never sees.

Why Numeric Extraction Matters

When translating to complex scripts (Arabic, Hindi, Urdu), the model occasionally:

These transformations break technical accuracy. The numeric extraction pass prevents all of them.

Patterns Extracted

Pattern Type | Regex | Examples | Placeholder
Memory sizes | \d+(?:\.\d+)?\s?(?:GB|MB|KB|TB) | 32 GB, 1770 MB, 256 KB | ⟦MEM_001⟧
Percentages | \d+(?:\.\d+)?% | 98.5%, 62%, 100% | ⟦PCT_001⟧
Version numbers | \d+\.\d+(?:\.\d+)?  | 4.6, 17.7, 2.3.1 | ⟦VER_001⟧

Processing Order

Numeric extraction runs after HTML element extraction (Passes 1-4). This means:

Example: Hindi Translation

Source:
"The Lambda uses 1770 MB of memory and achieves 98.5% uptime."

After Pass 5:
"The Lambda uses ⟦MEM_042⟧ of memory and achieves ⟦PCT_043⟧ uptime."

Model translates prose, placeholders survive intact.

After restoration:
"लैम्ब्डा 1770 MB मेमोरी का उपयोग करता है और 98.5% अपटाइम प्राप्त करता है।"

Technical values preserved exactly — no localization, no conversion.

Result: Translation failures caused by numeric value loss (preserve_memory_size: missing: GB, MB) are now resolved across all 11 languages. Numeric values survive intact regardless of target script.

Placeholder Collision Prevention

The numeric extraction pass includes safeguards to prevent extracting numbers that are part of existing placeholder names (e.g., the "001" in ⟦CODE_001⟧):

Without these guards, the numeric regex would corrupt placeholder names by extracting their index numbers, producing nested placeholders like ⟦CODE___TBN10__⟧ that the model cannot handle.
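
A minimal sketch of Pass 5 with one possible collision guard: the text is split on existing placeholders, and only the segments between them are scanned. The regexes come from the table above; the split-based guard, the shared running index, and the function name extract_numeric are assumptions, since the engine's actual safeguards are not reproduced here.

```python
import re

# Alternation order matters: memory sizes, then percentages, then versions.
NUMERIC = re.compile(
    r"(?P<MEM>\d+(?:\.\d+)?\s?(?:GB|MB|KB|TB))"
    r"|(?P<PCT>\d+(?:\.\d+)?%)"
    r"|(?P<VER>\d+\.\d+(?:\.\d+)?)"
)
PLACEHOLDER = re.compile(r"(⟦[A-Z]+_\d+⟧)")


def extract_numeric(text, manifest, next_index=1):
    """Replace numeric patterns with placeholders, skipping existing placeholders."""

    def to_placeholder(match):
        nonlocal next_index
        kind = match.lastgroup  # "MEM", "PCT", or "VER"
        key = f"⟦{kind}_{next_index:03d}⟧"
        manifest[key] = {"type": kind, "html": match.group(0), "index": next_index}
        next_index += 1
        return key

    # Splitting on a capturing group keeps placeholders at the odd positions,
    # so only the even positions (ordinary text) are rewritten.
    pieces = PLACEHOLDER.split(text)
    for i in range(0, len(pieces), 2):
        pieces[i] = NUMERIC.sub(to_placeholder, pieces[i])
    return "".join(pieces)


manifest = {}
source = "The Lambda uses 1770 MB of memory and achieves 98.5% uptime."
print(extract_numeric(source, manifest, next_index=42))
```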

16. Notification System

The translation engine sends email notifications via the AutoOps notification pipeline (tbi-ops-notify Lambda → SES). Notifications are consolidated across batch jobs and include detailed per-document breakdowns.

16.1 Email Format

Each notification email includes:

Example Notification

Subject: [INFO] Translation Complete: 2 docs × 11 langs — 22/22 pairs SUCCEEDED

Batch Summary:
• Jobs: 2
• Documents: 2
• Languages: 11
• Total Pairs: 22
• Succeeded: 22
• Failed: 0
• Final State: SUCCEEDED
• Total Time: 7m 12s

Documents Translated:
• Trinity-Beast-AutoOps-Translation-Engine.html
  ✓ Succeeded: es, pt, fr, de, ru, hi, ja, zh, ar, ur, it
• Trinity-Beast-Infrastructure-Overview.html
  ✓ Succeeded: es, pt, fr, de, ru, hi, ja, zh, ar, ur, it

Deployment:
• CloudFront Invalidations: 2
• All translated files deployed to S3

Search Index:
• Rebuilt successfully (all 11 languages)

Partial Success Example

Subject: [WARNING] Translation Complete: 1 doc × 11 langs — 10/11 pairs PARTIAL

Documents Translated:
• Complex-Technical-Guide.html
  ✓ Succeeded: es, pt, fr, de, ru, hi, ja, zh, ar, it
  ✗ Failed: ur

Error Details:
• Complex-Technical-Guide.html → ur: chunk 14 failed validation after 3 retries
  check_tag_counts: expected 27 code tags, found 23

16.2 Batch Consolidation

When multiple translation jobs are submitted together (e.g., translating 5 documents), the notification system consolidates them into a single email:

This prevents notification spam when translating multiple documents — you get one comprehensive email covering the entire batch, not 5 separate emails.

16.3 Document Resolver (v2.5)

When the same document appears in multiple jobs within a batch (e.g., initial run fails Urdu, retry succeeds), the notification resolves duplicate entries into a single final-state view:

Without the resolver, a retry job would show the same document twice — once with the failure and once with the fix — making the notification confusing and the counts misleading.

16.4 Tag Inventory (v2.8)

Every notification includes a Tag Inventory section showing source vs output tag counts per document. This lets you detect at a glance if the model is adding or dropping tags. As of v2.8, the inventory also reports Mermaid diagram counts:

Tag Inventory (source → output):
• Trinity-Beast-Translation-Service.html
  IN:  code:22 pre:5 strong:8 em:2 a:4 br:3 diagrams:1
  OUT: code:22 pre:5 strong:8 em:2 a:4 br:3 diagrams:1

If the model has a bad day and adds a <span> that wasn't in the source, or drops code tags, you'll see the mismatch immediately:

  IN:  code:23 pre:5 strong:8 diagrams:2
  OUT: code:20 pre:4 strong:8 diagrams:1    ← 3 code dropped, 1 diagram lost

Tag counts are logged per-language in Aurora (translation_job_events) with tags_in and tags_out fields. The notification shows the first successful language's counts (source tags are identical across all languages since it's the same source document).
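
A tag inventory of this kind can be approximated with Python's standard `html.parser`. The sketch below is illustrative only: it counts every start tag, and it assumes a diagram is any element whose class list contains `mermaid`, which may differ from how the engine detects diagrams. The function names are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts start tags, plus a synthetic 'diagrams' entry."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1
        # Assumption: a diagram is any element whose class list contains "mermaid".
        classes = (dict(attrs).get("class") or "").split()
        if "mermaid" in classes:
            self.counts["diagrams"] += 1


def tag_inventory(html):
    """Return a Counter of tag name -> occurrences for one document."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


def inventory_diff(tags_in, tags_out):
    """Return {tag: out - in} for every tag whose count changed."""
    keys = set(tags_in) | set(tags_out)
    return {k: tags_out[k] - tags_in[k] for k in sorted(keys) if tags_out[k] != tags_in[k]}


source = '<p><code>api_key</code> and <code>id</code><br><pre class="mermaid">flowchart TD</pre></p>'
output = '<p><code>api_key</code> y id<br><span>x</span></p>'
print(inventory_diff(tag_inventory(source), tag_inventory(output)))
```

An empty diff means the IN and OUT lines match; a negative value marks dropped tags and a positive value marks tags the model added.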

Recipient: All translation notifications go to CoryDeanKalani@CPMP-Site.org via the unified AutoOps notification pipeline. The sender is CPMP Mission <No-Reply@CPMP-Site.org>.

17. Delta Translation (Incremental Updates)

Documents change frequently — a new endpoint, a revised architecture, an updated pricing table. Without delta translation, every edit requires re-translating the entire document across all 11 languages. Delta translation solves this by identifying exactly which sections changed and translating only those, reusing cached translations for everything else.

17.1 Concept

The delta translation system leverages two key properties of the document library:

By comparing the current English document against the version that was last translated, the system identifies which sections changed (by content hash) and only sends those to Bedrock. Unchanged sections are pulled directly from the existing translated document. Typical savings: 70–90% on incremental updates.

17.2 S3 Versioning as Diff Source

The website bucket (trinity-beast-website-east2) has versioning enabled. Every aws s3 cp or s3api put-object creates a new version with a unique VersionId. The delta system uses this to:

No separate manifest storage is required — S3 already has the full history. A lightweight metadata file (docs/delta/{doc}.{lang}.json) tracks which VersionId was last translated for each document-language pair.

17.3 Comment Preservation (Sentinel Pass 0)

For delta translation to work, <!-- TBI-CHUNK --> markers must survive the translation round-trip. Previously, Bedrock silently dropped HTML comments during translation. The sentinel system now includes a Pass 0 that protects all HTML comments:

# Pass 0: Before Bedrock sees the chunk
<!-- TBI-CHUNK -->  →  __TBP0__    (sentinel token)
<!-- Section 5 -->  →  __TBP1__    (sentinel token)

# After translation: sentinels restored
__TBP0__  →  <!-- TBI-CHUNK -->
__TBP1__  →  <!-- Section 5 -->

This is implemented as the first pass in _apply_sentinels() in engine.py, before the existing translate="no" element extraction (Pass 1), paired span sentinels (Pass 2), and numeric protection (Pass 3). Comments are treated as Type A (FULL) sentinels — extracted completely and restored verbatim.
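
The extract-and-restore idea behind Pass 0 can be sketched in a few lines. This is not the code in `_apply_sentinels()`; the function names and the regular expressions are assumptions, and only the `__TBPn__` token shape is taken from the example above.

```python
import re

COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
SENTINEL_RE = re.compile(r"__TBP(\d+)__")


def protect_comments(chunk):
    """Replace each HTML comment with a numbered sentinel token."""
    saved = []

    def stash(match):
        saved.append(match.group(0))
        return f"__TBP{len(saved) - 1}__"

    return COMMENT_RE.sub(stash, chunk), saved


def restore_comments(translated, saved):
    """Put the original comments back, verbatim, where the sentinels are."""
    return SENTINEL_RE.sub(lambda m: saved[int(m.group(1))], translated)


chunk = "<!-- TBI-CHUNK --><h2>Overview</h2><!-- Section 5 --><p>Hello</p>"
protected, saved = protect_comments(chunk)
translated = protected.replace("Hello", "Hola")  # stand-in for the Bedrock call
print(restore_comments(translated, saved))
```

Because the comment text never reaches the model, it cannot be dropped or translated; the model only has to carry the opaque token through unchanged.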

17.4 Hash-Based Section Matching

The algorithm is position-independent — sections are matched by content hash, not by index. This means markers can be added, removed, or repositioned between versions without breaking the delta logic.

Diagram 17.1: Delta Translation Flow

flowchart TD
    A[Fetch Current English from S3] --> B[Split by TBI-CHUNK markers]
    B --> C[Hash each section SHA-256]
    D[Fetch Previous English version] --> E[Split by TBI-CHUNK markers]
    E --> F[Hash each section]
    C --> G{Compare hashes}
    F --> G
    G -->|Match found| H[Pull from existing translation]
    G -->|No match| I[Send to Bedrock]
    H --> J[Reassemble with TBI-CHUNK markers]
    I --> J
    J --> K[Deploy to S3 + Save metadata]

    style A fill:#1e3a5f,stroke:#60a5fa,color:#e0e0e0
    style D fill:#1e3a5f,stroke:#60a5fa,color:#e0e0e0
    style H fill:#064e3b,stroke:#10b981,color:#e0e0e0
    style I fill:#7c2d12,stroke:#f97316,color:#e0e0e0
    style K fill:#1e3a5f,stroke:#60a5fa,color:#e0e0e0
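
The flow in Diagram 17.1 can be sketched as follows. The function names are hypothetical, and the sketch assumes that the previous translation splits into the same number of sections as the previous English version, so the two can be paired by position to build the hash-to-translation cache. The engine may pair them differently.

```python
import hashlib

MARKER = "<!-- TBI-CHUNK -->"


def split_sections(html):
    """Split a document into sections at each TBI-CHUNK marker."""
    return html.split(MARKER)


def section_hash(section):
    # Assumption: whitespace at the section edges is not significant.
    return hashlib.sha256(section.strip().encode("utf-8")).hexdigest()


def plan_delta(current_en, previous_en, previous_translated):
    """Return (section, cached_translation_or_None) for each current section."""
    prev_sections = split_sections(previous_en)
    prev_translated = split_sections(previous_translated)
    if len(prev_sections) != len(prev_translated):
        # Markers did not survive translation, so nothing can be reused safely.
        cache = {}
    else:
        cache = {section_hash(s): t for s, t in zip(prev_sections, prev_translated)}
    # Lookup is by hash, not index, so moved or inserted sections still match.
    return [(s, cache.get(section_hash(s))) for s in split_sections(current_en)]


previous_en = "Intro" + MARKER + "Setup" + MARKER + "Pricing"
previous_es = "Introducción" + MARKER + "Configuración" + MARKER + "Precios"
current_en = "Intro" + MARKER + "New endpoint" + MARKER + "Setup" + MARKER + "Pricing v2"

for section, cached in plan_delta(current_en, previous_en, previous_es):
    print(section, "->", cached if cached is not None else "send to Bedrock")
```

In this example "Setup" moves from index 1 to index 2 yet still reuses its cached translation, while the new and the edited sections are the only ones that would be sent for translation.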
        

Marker repositioning example:

17.5 CLI Commands

Four KCC commands support delta translation and chunk management:

Delta Diff (Analysis Only)

# List available S3 versions
bash scripts/kcc.sh delta-diff Trinity-Beast-API-Reference.html --list-versions

# Compare current vs previous version (auto-detects)
bash scripts/kcc.sh delta-diff Trinity-Beast-API-Reference.html

# Compare against a specific version
bash scripts/kcc.sh delta-diff Trinity-Beast-API-Reference.html --version-id ksYxUBZIUB8Roi2KQYje6ig9R7JesL9z

# Show delta for a specific language
bash scripts/kcc.sh delta-diff Trinity-Beast-API-Reference.html --lang ja

Delta Translate (Incremental Translation — Local CLI)

# Dry run — show what would change without calling Bedrock
bash scripts/kcc.sh delta-translate Trinity-Beast-API-Reference.html es --dry-run

# Translate only changed sections for one language
bash scripts/kcc.sh delta-translate Trinity-Beast-API-Reference.html es

# Translate changed sections for all languages
bash scripts/kcc.sh delta-translate Trinity-Beast-API-Reference.html all

# Force full translation (creates fresh baseline)
bash scripts/kcc.sh delta-translate Trinity-Beast-API-Reference.html all --force

Delta via Remote API (options.delta)

The delta option is also available on POST /admin/translate — the worker skips any language pair where the translated file on S3 is already newer than the source document. No local CLI needed.

# Submit a delta job via the remote API — skips up-to-date pairs automatically
curl -s -X POST -H "X-Admin-Key: $ADMIN_KEY" -H "Content-Type: application/json" \
  -d '{"docs":["Trinity-Beast-API-Reference.html"],"langs":"all","options":{"delta":true}}' \
  https://api.cpmp-site.org/admin/translate | jq .
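
The skip rule for `options.delta` amounts to a timestamp comparison per language pair. The sketch below is a hypothetical helper, not the worker's code; it assumes the caller has already collected modification times (for example, the LastModified value S3 reports for each object) and that a missing translation always counts as stale.

```python
from datetime import datetime, timezone


def pairs_needing_translation(source_modified, translated_modified, langs):
    """Return the languages whose translation is missing or not newer than the source."""
    stale = []
    for lang in langs:
        translated_at = translated_modified.get(lang)  # None if no translated file exists
        if translated_at is None or translated_at <= source_modified:
            stale.append(lang)
    return stale


source_modified = datetime(2026, 5, 20, 12, 0, tzinfo=timezone.utc)
translated_modified = {
    "es": datetime(2026, 5, 21, 9, 0, tzinfo=timezone.utc),   # newer than source: skipped
    "ja": datetime(2026, 5, 19, 9, 0, tzinfo=timezone.utc),   # older than source: translated
}
print(pairs_needing_translation(source_modified, translated_modified, ["es", "ja", "ur"]))
```

Here `es` is skipped because its translation is already newer than the source, while `ja` (stale) and `ur` (missing) are queued.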

Delta Validate (Marker Preservation Check)

# Validate TBI-CHUNK markers survived translation for all delta-enabled docs
bash scripts/kcc.sh delta-validate all all

# Validate a specific doc across all languages
bash scripts/kcc.sh delta-validate Trinity-Beast-API-Reference.html all

# Validate a specific doc + language pair
bash scripts/kcc.sh delta-validate Trinity-Beast-API-Reference.html es

Reports pass/fail per doc×lang pair. Exit code 0 if all pass, 1 if any markers were lost. Run after any translation job to confirm Sentinel Pass 0 is working correctly.
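
A marker-preservation check of this kind can be sketched as a count comparison. This is an illustration under the assumption that a pair passes when the translated file contains the same number of TBI-CHUNK markers as the source; the actual `delta-validate` command may apply stricter rules. Both function names are hypothetical.

```python
MARKER = "<!-- TBI-CHUNK -->"


def validate_markers(source_html, translations):
    """Return {lang: passed}; a pair passes when its marker count matches the source."""
    expected = source_html.count(MARKER)
    return {lang: html.count(MARKER) == expected for lang, html in translations.items()}


def exit_code(report):
    """0 if every pair passed, 1 if any markers were lost."""
    return 0 if all(report.values()) else 1


source = "Intro" + MARKER + "Setup" + MARKER + "Pricing"
translations = {
    "es": "Introducción" + MARKER + "Configuración" + MARKER + "Precios",
    "ur": "Intro Setup" + MARKER + "Pricing",  # one marker lost in translation
}
report = validate_markers(source, translations)
print(report, "exit code:", exit_code(report))
```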

Chunk Sizer (Auto-Placement Suggestions)

# Analyze a doc from S3 and suggest TBI-CHUNK marker placement
bash scripts/kcc.sh chunk-size Trinity-Beast-API-Reference.html

# Analyze a local file
bash scripts/kcc.sh chunk-size /path/to/local/doc.html

Scans the document for <section>, <h2>, <h3>, and .category-section boundaries. Reports current chunk sizes (if markers exist), identifies policy violations, and suggests where to insert markers to stay within the 15KB/18KB/12KB policy. Dense sections (high translate="no" density) automatically target the tighter 12KB limit.

17.6 Bootstrap Path

Existing translated documents do not contain <!-- TBI-CHUNK --> markers (they were stripped before the sentinel fix). The bootstrap sequence is:

  1. First run (full cost): Use --force to translate the entire document. The sentinel fix preserves markers in the output. Delta metadata is saved to S3.
  2. Subsequent runs (delta savings): The tool detects that the existing translation has markers, loads the metadata to identify the previously translated English version, and translates only the changed sections.

After the bootstrap run, typical savings on incremental updates:

Change Type | Typical Savings | Example
Single section edit | 85–95% | Fix a typo, update one endpoint
New section added | 70–85% | Add a new feature section
Marker repositioned | 60–75% | Split a large section in two
Major rewrite | 20–40% | Restructure half the document

Cost model: At approximately $1.50 per section-language pair, a 9-section document across 11 languages costs ~$148.50 for a full translation. With delta (2 sections changed), the same update costs ~$33 — a 78% reduction.
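
The arithmetic behind the cost model can be checked directly. The per-pair price is the approximate figure quoted above, and the function name is illustrative.

```python
COST_PER_PAIR = 1.50  # approximate dollars per section-language pair (from the text)


def translation_cost(sections, languages, cost_per_pair=COST_PER_PAIR):
    """Cost of sending `sections` sections to Bedrock for each of `languages` languages."""
    return sections * languages * cost_per_pair


full = translation_cost(9, 11)    # every section, all 11 languages
delta = translation_cost(2, 11)   # only the 2 changed sections
reduction = 1 - delta / full

print(f"full=${full:.2f} delta=${delta:.2f} reduction={reduction:.0%}")
# prints: full=$148.50 delta=$33.00 reduction=78%
```

The saving scales with the share of unchanged sections: 7 of 9 sections reused is 7/9, or about 78%.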

Quick Reference

Item | Value
Model | us.anthropic.claude-sonnet-4-6 (cross-region inference profile)
Failover Regions | us-east-2, us-east-1, us-west-2
Target Languages | 11: es, pt, fr, de, ru, hi, ja, zh, ar, ur, it
Worker Runtime | Python 3.11 (ECS Fargate task, container image)
Deploy/Finalize Runtime | Go (provided.al2023)
Worker Resources | 1 vCPU / 3 GB (Fargate, no timeout ceiling)
Memory (Lambdas) | 1770 MB
Worker Timeout | None (runs to completion)
Finalize Timeout | 180s
Deploy Timeout | 60s
Max Docs per Request | 6
Max Active Jobs | 3
Daily Dollar Cap | $600 (24h TTL auto-reset)
Daily Token Cap | 50M combined tokens (24h TTL auto-reset)
Chunk Size (Latin scripts) | 6000 chars
Chunk Size (CJK + Russian) | 4500 chars (ja, zh, ru)
Chunk Size (Indic + Arabic) | 3000 chars (hi, ur, ar)
Retries per Chunk | 3
Max Part Size | 24 KB (safety valve that prevents model output truncation)
MaxConcurrency (per-language) | 0 (unlimited; all language containers launch simultaneously)
ECR Repository | tbi-translate-worker
SQS Queue | trinity-beast-translation-queue
Step Function | tbi-translation-orchestrator
IAM Role (Worker + Lambdas) | tbi-translate-role
IAM Role (Pipe) | tbi-translate-pipe-role
IAM Role (Step Function) | tbi-translate-orchestrator-role
Valkey Keys | tx:job:{id}, tx:active, tx:history, tx:idempotency:{key}, autoops:bedrock:spend:daily, autoops:bedrock:tokens:input:daily, autoops:bedrock:tokens:output:daily
Aurora Tables | translation_jobs, translation_job_events
Delta Metadata | docs/delta/{doc}.{lang}.json (S3)
Delta CLI | bash scripts/kcc.sh delta-diff, bash scripts/kcc.sh delta-translate, bash scripts/kcc.sh delta-validate, bash scripts/kcc.sh chunk-size
CloudWatch Namespace | TBI/Translation