---
title: "AWS-Native Frontend Operational Excellence"
description: "Complete observability, release, on-call, and ORR strategy for a CloudFront + Lambda@Edge + S3 frontend application."
date: 2026-04-14T00:00:00.000Z
tags: [aws, cloudfront, observability, opex, orr, lambda-edge, rum, synthetics, waf, x-ray]
---

# AWS-Native Frontend Operational Excellence

Comprehensive operational strategy for a **Vite/React SPA** served via **CloudFront + S3** with **Lambda@Edge Cognito auth** (viewer-request token validation on every request).

Team context: ~8 engineers with on-call rotation. Backend is a separate team/service.

## Priority Legend

| Tag | Meaning | Criteria |
|-----|---------|----------|
| `[P0]` | **Critical** | Without this, outages go undetected or recovery is impossible. Direct user impact. |
| `[P1]` | **High** | Significantly improves detection speed, reduces MTTR, or prevents incidents. |
| `[P2]` | **Medium** | Improves operational maturity, catches edge cases, reduces noise. |
| `[P3]` | **Low** | Optimization, deeper analysis, governance polish. Valuable but not urgent. |

## Architecture Overview

```mermaid
graph TB
    subgraph "Edge (CloudFront)"
        WAF[AWS WAF]
        CF[CloudFront Distribution]
        LE_check[check-auth<br/>viewer-request]
        LE_parse[parse-auth<br/>viewer-request]
        LE_refresh[refresh-auth<br/>viewer-request]
        LE_signout[sign-out<br/>viewer-request]
        LE_headers[http-headers<br/>viewer-response]
    end
    subgraph "Origin"
        S3[S3 Bucket<br/>Vite Build Assets]
    end
    subgraph "Auth"
        COG[Cognito User Pool<br/>Hosted UI + Token Endpoint]
    end
    User -->|request| WAF --> CF
    CF --> LE_check -->|authenticated| S3
    LE_check -->|unauthenticated| COG
    COG -->|callback| LE_parse
    LE_check -->|expired token| LE_refresh
    User -->|logout| LE_signout
    S3 -->|response| LE_headers -->|+ security headers| User
```

---

## 1. Observability Stack

### The Full Picture

```mermaid
graph LR
    subgraph "Client-Side"
        RUM[CloudWatch RUM]
        APP[Custom App Metrics]
    end
    subgraph "Synthetic"
        SYN[CloudWatch Synthetics]
        R53[Route 53<br/>Health Checks]
    end
    subgraph "Edge"
        CFM[CloudFront Metrics]
        CFRL[Real-Time Logs]
        IM[Internet Monitor]
        WAFM[WAF Metrics + Logs]
    end
    subgraph "Compute"
        LEL["Lambda@Edge Logs"]
        XRAY[X-Ray Traces]
    end
    subgraph "Origin"
        S3M[S3 Request Metrics]
    end
    subgraph "Governance"
        CFG[AWS Config]
        CT[CloudTrail]
        COST[Cost Anomaly Detection]
    end
    subgraph "Alerting"
        AD[Anomaly Detection]
        CWA[CloudWatch Alarms]
        COMP[Composite Alarms]
        SNS[SNS Topics]
    end
    RUM --> CWA
    APP --> CWA
    SYN --> CWA
    R53 --> CWA
    CFM --> CWA
    CFRL --> CWA
    IM --> CWA
    WAFM --> CWA
    LEL --> CWA
    XRAY --> CWA
    S3M --> CWA
    CFG --> CWA
    CT --> CWA
    COST --> CWA
    AD --> CWA
    CWA --> COMP
    COMP --> SNS
```

### 1.1 CloudWatch RUM (Real User Monitoring) `[P0]`

**What it captures:** Page load performance, Web Vitals (LCP, FID, CLS), unhandled JS errors with stack traces, HTTP errors, session-level correlation.

**Setup:**

- Create an App Monitor in CloudWatch RUM
- Add the RUM web client snippet to the React app's entry point
- Start with **100% session sampling** for full coverage (lower it if cost becomes a concern)
- Enable **session tracking** to correlate errors to specific user journeys
- Send RUM data to a CloudWatch Log Group for retention beyond 30 days

**Key metrics to alarm on:**

- LCP p75 > 2.5s (poor user experience)
- JS error rate > threshold (broken deployment)
- HTTP 4xx/5xx rate from client perspective

**Source maps:** `[P1]`

- Upload Vite-generated source maps to RUM so JS errors show original file/line instead of minified output
- Add source map upload as a CI/CD pipeline step after `vite build`
- Do NOT serve source maps to users (security risk); upload to RUM only

**Cost:** ~$1 per 100K RUM events. Free tier: 1M events/month.

### 1.2 CloudWatch Synthetics (Canary Monitoring) `[P0]`

Synthetic canaries catch outages *before* users report them.
Create canaries for:

| Canary | What it tests | Schedule |
|--------|--------------|----------|
| **Homepage load** | Full page load, assert key DOM elements render | Every 5 min |
| **Auth flow** | Navigate to protected route, verify redirect to Cognito, complete login, verify app loads | Every 10 min |
| **Critical user journeys** | Key workflows in the SPA (navigation, API calls) | Every 10 min |
| **Multi-region probes** | Same homepage canary from 3+ regions | Every 5 min |

- Use **multi-step blueprints** to bundle up to 10 checks per canary
- Alarm on each canary's `SuccessPercent` / `Failed` metrics (the console can create these alarms when the canary is created)
- Enable **automatic retries** to reduce false positives
- Store screenshots and HAR files in S3 for debugging

### 1.3 CloudFront Monitoring `[P0]`

**Standard Logs (free):** `[P1]`

- Delivered to S3, 5-30 min delay, 100% coverage
- No CloudFront charge for the logs themselves; S3 storage is billed as usual
- Use for compliance, historical analysis, Athena queries
- Retain in S3 for as long as compliance requires, using lifecycle policies to tier or expire older logs

**Real-Time Logs (paid):** `[P1]`

- Streamed to Kinesis Data Stream within seconds
- Configurable sampling rate and field selection
- Use for operational alerting and live debugging
- Route: Kinesis Data Stream -> Kinesis Firehose -> S3 + CloudWatch Logs

**CloudFront native metrics to alarm on:**

- `5xxErrorRate` > 1%
- `4xxErrorRate` > 5% (may indicate broken asset paths after deploy)
- `CacheHitRate` drop (origin overload risk)
- `OriginLatency` p99

`CacheHitRate` and `OriginLatency` are only published after enabling additional metrics on the distribution, which is billed separately.

**CloudWatch Internet Monitor:** `[P2]`

- Enable for the CloudFront distribution
- Monitors internet path quality to edge locations
- Surfaces ISP-level and geography-level issues (not your fault, but your users still feel it)
- Useful for distinguishing "our app is broken" vs "Comcast is having issues in Ohio"

### 1.4 Lambda@Edge Logging `[P0]`

**Critical gotcha:** Lambda@Edge executes in the region closest to the viewer. Logs historically went to CloudWatch Logs in *that* region, not your deployment region. This made log aggregation painful.
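To make the gotcha concrete: under the historical behaviour, each region that serves traffic gets its own log group, named with a `us-east-1.` prefix on the function name. The sketch below enumerates where an on-call engineer would have to look. The function name `check-auth` comes from the architecture diagram; the region list and the helper names are illustrative assumptions, not the full set of regions Lambda@Edge can run in.

```typescript
// Lambda@Edge functions are authored in us-east-1, so each replica logs under
// "/aws/lambda/us-east-1.<functionName>" in whichever region served the request.
function edgeLogGroupName(functionName: string): string {
  return `/aws/lambda/us-east-1.${functionName}`;
}

interface LogGroupLocation {
  region: string;
  logGroupName: string;
}

// Illustrative subset only (assumption); consult the CloudFront documentation
// for the regions that actually apply to your traffic.
const CANDIDATE_REGIONS: string[] = [
  "us-east-1",
  "us-west-2",
  "eu-west-1",
  "eu-central-1",
  "ap-southeast-1",
  "ap-northeast-1",
];

// Returns one (region, log group) pair per region to search.
function logGroupLocations(
  functionName: string,
  regions: string[] = CANDIDATE_REGIONS,
): LogGroupLocation[] {
  const logGroupName = edgeLogGroupName(functionName);
  return regions.map((region) => ({ region, logGroupName }));
}
```

Calling `logGroupLocations("check-auth")` yields the same log group name repeated across every candidate region, which is exactly why searching for a single failed request used to mean querying region by region.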
**Solution: Advanced Logging (2025+)**

- Configure a custom CloudWatch Log Group at the function level
- Lambda@Edge now supports specifying log destinations, enabling centralized logging without subscription filter pipelines
- Set this in CDK when defining the Lambda@Edge function

**What to log:**

- Auth failures (invalid/expired tokens, Cognito errors)
- Token refresh events
- Request latency within the function
- Cognito endpoint response times (your auth depends on Cognito availability)

**What to alarm on:**

- Lambda@Edge error rate (the Lambda `Errors` metric; CloudFront also reports 5xx responses caused by function failures)
- Lambda@Edge `Duration` p99 approaching timeout (viewer-request has a 5s limit)
- Lambda@Edge `Throttles` (concurrent execution limits at edge)
- Cognito token endpoint errors (from Lambda@Edge logs)

### 1.5 Dashboards `[P1]`

Create **three CloudWatch dashboards:**

| Dashboard | Audience | Contents |
|-----------|----------|----------|
| **Executive** | Leadership, stakeholders | SLO burn rate, availability %, Web Vitals trends, incident count |
| **Operational** | On-call engineer | All alarms in one view, error rates, latency, cache hit rate, Lambda@Edge health |
| **Debug** | Engineer investigating an issue | RUM session details, Lambda@Edge logs, CloudFront real-time logs, canary screenshots |

### 1.6 X-Ray Tracing `[P2]`

Enable X-Ray on Lambda@Edge functions to trace the auth flow end-to-end:

- Trace `check-auth` → Cognito token validation (identify if latency is JWT parsing vs network call)
- Trace `parse-auth` → Cognito token endpoint (exchange code for tokens)
- Trace `refresh-auth` → Cognito refresh endpoint
- Visualize latency breakdown: Lambda cold start vs execution vs downstream calls
- Correlate traces with RUM sessions for full client-to-edge visibility

**Setup:** Enable active tracing on each Lambda@Edge function in CDK. The X-Ray SDK adds a few milliseconds of overhead.

**Limitations:** X-Ray doesn't trace CloudFront itself, only the Lambda@Edge functions, so there is no trace continuity from viewer request through CloudFront to origin. Confirm support before planning this work: the CloudFront documentation on Lambda@Edge restrictions has listed AWS X-Ray among the unsupported Lambda features. If that still applies, the latency breakdown has to come from timing data in the function logs instead.
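The latency breakdown described above (JWT parsing vs the Cognito network call) can also be captured with structured timing logs, which works whether or not tracing is available. The following is a minimal sketch, not the project's actual logging code: `timed`, `Clock`, and `LogSink` are hypothetical names, and the clock and log sink are injectable only so the helper is easy to test.

```typescript
type Clock = () => number;
type LogSink = (line: string) => void;

// Runs one named step, then emits a single JSON log line with its outcome
// and duration. Errors are re-thrown so the caller's control flow is unchanged.
async function timed<T>(
  step: string,
  fn: () => Promise<T>,
  now: Clock = Date.now,
  log: LogSink = console.log,
): Promise<T> {
  const start = now();
  let outcome: "ok" | "error" = "ok";
  try {
    return await fn();
  } catch (err) {
    outcome = "error";
    throw err;
  } finally {
    // One line per step keeps the record easy to filter and aggregate.
    log(JSON.stringify({ step, outcome, durationMs: now() - start }));
  }
}
```

Inside `check-auth`, the JWT verification and the Cognito call would each be wrapped separately, for example `await timed("cognito-token-exchange", () => exchangeCode(code))`, where `exchangeCode` stands in for whatever function performs the call. Because each record is JSON, a Logs Insights query can then aggregate `durationMs` by `step` to show which part dominates p99.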
### 1.7 WAF Monitoring `[P1]`

**WAF Metrics (CloudWatch):**

- `AllowedRequests` / `BlockedRequests` / `CountedRequests` per rule and WebACL
- Alarm on sudden spike in `BlockedRequests` (possible attack or false-positive blocking real users)
- Alarm on drop in `AllowedRequests` (WAF might be blocking legitimate traffic)

**WAF Logs:**

- Enable full logging to CloudWatch Logs, S3, or Kinesis Firehose
- Log sampled requests at minimum; full logging for investigation capability
- Use CloudWatch Logs Insights to query blocked requests by rule, IP, URI, country

**Key things to monitor:**

- Rate-based rule triggers (DDoS or scraping)
- Managed rule group matches (which rules are firing and on what)
- False positives: legitimate users hitting WAF blocks (auth callbacks, API paths)
- Geographic blocking patterns

### 1.8 S3 Origin Monitoring `[P2]`

Enable **S3 request metrics** (not on by default) for the origin bucket:

- `4xxErrors`: broken asset paths, missing files after deploy
- `5xxErrors`: S3 service issues
- `TotalRequestLatency`: slow origin responses
- `FirstByteLatency`: time to first byte from S3

**Alarm on:**

- `5xxErrors` > 0 sustained (S3 issues are rare but impactful)
- `4xxErrors` spike after deploy (missing assets = broken build pushed)

### 1.9 CloudWatch Anomaly Detection & Logs Insights `[P2]`

**Anomaly Detection:**

- Applies ML-based dynamic baselines instead of static thresholds
- Use for metrics with natural variance: request count, cache hit rate, latency percentiles
- Catches subtle regressions a fixed threshold would miss (e.g., LCP drifts from 1.2s to 1.8s over weeks)
- Set anomaly detection on: CloudFront `Requests`, `CacheHitRate`, RUM `LCP`, Lambda@Edge `Duration`
- Alarm when metric exits the anomaly detection band for > 2 data points

**CloudWatch Logs Insights saved queries:**

| Query | Purpose |
|-------|---------|
| Auth failures by error type | Identify broken auth flow (expired certs, Cognito issues) |
| Lambda@Edge p99 latency over time | Track performance trends |
| Top blocked WAF rules | Tune WAF, identify false positives |
| 4xx by URI path after deploy | Find broken asset references |
| Token refresh rate over time | Detect session duration issues |

### 1.10 Custom Application Metrics `[P3]`

Emit metrics from the React app itself via the RUM web client or CloudWatch PutMetricData:

- **Client-side API latency**: time from frontend to backend API calls (even though backend is a separate team, you feel the latency)
- **Feature usage**: which routes/features are active (informs what to cover with canaries)
- **Client-side errors by category**: network errors vs render errors vs auth errors
- **Time to interactive**: custom measurement beyond standard Web Vitals
- **Bundle load timing**: individual chunk load times (detect CDN/caching issues)

### 1.11 Route 53 Health Checks `[P2]`

If using a custom domain in front of CloudFront:

- Create Route 53 health checks against the CloudFront domain
- Checks run from multiple AWS regions simultaneously
- Alarm when health check fails (catches DNS, certificate, and edge-level issues that synthetics might miss)
- Faster detection than Synthetics canaries (10-second check intervals vs 5-minute canary)
- Use as an input to composite alarms

---

## 2. SLOs and Error Budgets `[P0]`

Even without a formal SLA, define internal SLOs to measure against:

| SLO | Target | Measurement |
|-----|--------|-------------|
| **Availability** | 99.95% (21.9 min/month downtime) | Synthetic canary success rate |
| **LCP (p75)** | < 2.5s | CloudWatch RUM |
| **FID (p75)** | < 100ms | CloudWatch RUM |
| **CLS (p75)** | < 0.1 | CloudWatch RUM |
| **JS Error Rate** | < 0.1% of sessions | CloudWatch RUM |
| **Auth Latency (p99)** | < 1s | Lambda@Edge duration metric |
| **Time to Detect (TTD)** | < 5 min | Synthetic canary interval |
| **Time to Engage (TTE)** | < 15 min | On-call tooling metrics |

**Error budget policy:**

- Track SLO burn rate over 30-day rolling windows
- If error budget is exhausted: freeze feature deployments, focus on reliability
- Use CloudWatch metric math to compute burn rates on dashboards

---

## 3. Release Management

### 3.1 Pipeline Architecture `[P0]`

```mermaid
graph LR
    subgraph "CI"
        PR[Pull Request] --> CHECKS[Lint + Type Check<br/>+ Unit Tests]
        CHECKS --> BUILD[Vite Build]
        BUILD --> INT[Integration Tests<br/>Playwright]
    end
    subgraph "CD - Staging"
        INT --> DEP_STG[CDK Deploy<br/>Staging Stack]
        DEP_STG --> SMOKE[Synthetic Smoke<br/>Tests]
        SMOKE --> BAKE[Bake Period<br/>15 min]
    end
    subgraph "CD - Production"
        BAKE --> CANARY[CloudFront<br/>Continuous Deployment<br/>5% traffic]
        CANARY --> MONITOR[Monitor Alarms<br/>15 min]
        MONITOR -->|healthy| PROMOTE[Promote<br/>Staging → Prod]
        MONITOR -->|alarm| ROLLBACK[Rollback]
    end
```

### 3.2 CloudFront Continuous Deployment `[P1]`

Use CloudFront's native staging distribution for canary deployments:

1. **Staging distribution** mirrors production config
2. **Continuous deployment policy** routes 5% of traffic (weight-based) to staging
3. **Session stickiness** ensures users stay on the same distribution during their session
4. **Monitor** composite alarms (RUM errors, synthetic failures, 5xx rate) for 15 min
5. **Promote** staging to production via `update-distribution-with-staging-config` (no DNS change, no cache loss)
6. **Auto-rollback** if alarms fire during bake period

**Limitations to know:**

- Traffic weight caps at 15%
- CDK only has L1 support (`CfnContinuousDeploymentPolicy`, `CfnDistribution`); no L2 constructs yet
- Staging distribution incurs separate costs during testing

### 3.3 Rollback Strategy `[P0]`

| Scenario | Rollback method | Time to recover |
|----------|----------------|-----------------|
| Bad JS bundle deployed | Re-deploy previous S3 assets + CloudFront invalidation | 2-5 min |
| CloudFront config regression | CloudFront continuous deployment: don't promote, disable policy | < 1 min |
| Lambda@Edge regression | CDK deploy previous Lambda version | 5-15 min (edge propagation) |
| Cognito configuration issue | Restore Cognito settings (separate from frontend deploy) | Varies |

**Note:** Lambda@Edge rollback is slow because edge replicas take time to update globally. This makes canary deployment *especially* important for Lambda@Edge changes.

### 3.4 Cache Invalidation Strategy `[P1]`

- Vite produces content-hashed filenames (`assets/index-a1b2c3.js`); these are immutable and cache-safe
- Only `index.html` and service worker files need short TTLs or invalidation
- After deploy: invalidate `/index.html` and `/` only (not `/*`)
- Use `Cache-Control: no-cache` on `index.html`, `max-age=31536000,immutable` on hashed assets

---

## 4. Infrastructure Resilience

### 4.1 CloudFront Origin Failover `[P1]`

Configure an **origin group** with primary and failover origins:

- **Primary:** S3 bucket in deployment region
- **Failover:** S3 bucket in a second region (cross-region replication enabled)
- CloudFront automatically fails over on 5xx or connection timeout from primary
- Failover is per-request, not a permanent switch, so traffic returns to the primary as soon as it recovers

**Setup in CDK:** Use `OriginGroup` with `fallbackStatusCodes: [500, 502, 503, 504]`.

### 4.2 CloudFront Origin Shield `[P3]`

Adds an additional caching layer between edge locations and the origin:

- Reduces origin load by collapsing duplicate requests from multiple edge locations
- Improves cache hit ratio
- Pick the Origin Shield region closest to your S3 origin
- Particularly valuable during cache invalidation (thundering herd to origin)

**Cost:** Per-request charge on top of standard CloudFront pricing. Worth it at scale.

### 4.3 Content Freshness Monitoring `[P1]`

After every deploy, verify users are actually getting the new `index.html`:

- Add a build-time `<meta name="build-id">` tag with commit SHA or timestamp
- Synthetic canary checks that the build-id matches the expected deployed version
- Alarm if build-id mismatch persists > 10 min after deploy (stale cache, invalidation failure)
- RUM custom attribute with build-id to correlate errors to specific deployments

---

## 5. Operational Governance

### 5.1 AWS Config `[P2]`

Detect infrastructure drift, meaning changes made outside CDK:

| Config Rule | What it catches |
|-------------|----------------|
| `cloudfront-s3-origin-access-control-enabled` | OAC disabled on distribution |
| `cloudfront-associated-with-waf` | WAF removed from distribution |
| `cloudfront-default-root-object-configured` | Default root object removed |
| `s3-bucket-level-public-access-prohibited` | Origin bucket made public |
| `lambda-function-settings-check` | Lambda@Edge runtime/timeout/memory changed |

- Auto-remediation: trigger SSM Automation to revert non-compliant changes
- Alarm on any non-compliant evaluation

### 5.2 CloudTrail `[P2]`

Audit who changed what in the infrastructure:

- Ensure CloudTrail is logging CloudFront, Lambda, S3, WAF, and Cognito API calls
- Create CloudWatch Logs metric filters for sensitive actions:
  - `UpdateDistribution`: CloudFront config change
  - `UpdateWebACL`: WAF rules modified
  - `PublishVersion` / `UpdateFunctionCode`: Lambda@Edge code change
  - `DeleteBucketPolicy`: S3 origin security change
- Alarm on any manual console changes (all changes should flow through CDK/pipeline)

### 5.3 Cost Monitoring `[P3]`

CloudFront + Lambda@Edge + observability stack costs can creep:

| Service | Cost driver | What to watch |
|---------|-------------|---------------|
| CloudFront | Requests + data transfer | Anomaly detection on daily cost |
| Lambda@Edge | Invocations + duration | Scales with every viewer request |
| CloudWatch RUM | Events ingested | Scales with user traffic |
| CloudWatch Synthetics | Canary runs | Fixed cost, scales with canary count |
| CloudFront Real-Time Logs | Kinesis shard hours | Scales with log volume |
| WAF | Requests evaluated | Scales with traffic |
| X-Ray | Traces sampled | Configure sampling rate |

**Setup:**

- Enable **AWS Cost Anomaly Detection** for these services
- Set **AWS Budgets** with threshold alerts (80%, 100%, 120%)
- Review cost dashboard monthly in
on-call review

### 5.4 Athena for Deep Log Analysis `[P3]`

Query CloudFront standard logs in S3 with Athena for analysis beyond real-time dashboards:

- Top requested paths and cache behavior
- Geographic traffic distribution
- User-Agent analysis (browser/device breakdown)
- Error rate by edge location
- Bandwidth usage patterns over time

Create a Glue table over the CloudFront standard log bucket, then run SQL queries on demand.

### 5.5 Performance Budgets (CI) `[P2]`

Catch performance regressions *before* they deploy:

- **Bundle size check:** Fail CI if total JS bundle exceeds threshold (e.g., 500KB gzipped)
- **Chunk count check:** Alert if code splitting produces too many or too few chunks
- **Lighthouse CI:** Run Lighthouse in CI against the built app, fail on score regression
- Track these metrics over time (store in CloudWatch custom metrics or S3)

Vite's `build.chunkSizeWarningLimit` only prints a warning for oversized chunks; it does not fail the build. To enforce the budget, add a CI step that parses the Vite build output and compares it against the budget.

---

## 6. Operational Readiness Review (ORR)

### 6.1 ORR Checklist

Adapted from the AWS Well-Architected ORR framework for this frontend architecture:

#### Dependencies

- [ ] All dependencies documented (Cognito, S3, CloudFront, Lambda@Edge, backend API)
- [ ] Each dependency has a health check or alarm
- [ ] Cognito endpoint availability is monitored from Lambda@Edge logs
- [ ] Backend API dependency is isolated (frontend degrades gracefully if backend is down)
- [ ] Third-party scripts (analytics, etc.)
loaded async and don't block rendering

#### Alarms & Monitoring

- [ ] Every component has at least one alarm (CloudFront, each Lambda@Edge function, S3 origin, RUM, Synthetics)
- [ ] Composite alarms configured to reduce noise
- [ ] Dashboards exist for executive, operational, and debug audiences
- [ ] CloudWatch RUM configured with appropriate sampling
- [ ] Synthetic canaries cover homepage, auth flow, and critical user journeys
- [ ] CloudFront real-time logs enabled and flowing to Kinesis
- [ ] Lambda@Edge logs centralized via advanced logging
- [ ] Internet Monitor enabled for the distribution
- [ ] X-Ray tracing enabled on Lambda@Edge functions
- [ ] WAF metrics and logs monitored, false positives reviewed
- [ ] S3 origin request metrics enabled
- [ ] Anomaly detection configured for key metrics
- [ ] Logs Insights saved queries created for common investigations
- [ ] Route 53 health checks configured (if custom domain)
- [ ] Source maps uploaded to RUM in CI/CD pipeline

#### Runbooks & Response

- [ ] Every SEV1/SEV2 alarm has a runbook
- [ ] Runbooks are tested (run through them in staging)
- [ ] On-call rotation defined with primary + secondary
- [ ] On-call engineers have permissions to deploy, rollback, and invalidate cache

#### Release & Deployment

- [ ] CI/CD pipeline with staging environment
- [ ] CloudFront continuous deployment (canary) configured
- [ ] Automated rollback on alarm during bake period
- [ ] Cache invalidation strategy documented and automated
- [ ] Lambda@Edge deployment tested (edge propagation delay understood)
- [ ] No manual steps in the deployment process
- [ ] Performance budget enforced in CI (bundle size, Lighthouse score)
- [ ] Content freshness verified post-deploy (build-id check)

#### Infrastructure Resilience

- [ ] Origin failover configured (S3 cross-region replication + origin group)
- [ ] Origin Shield enabled
- [ ] Content freshness monitoring in place (build-id meta tag)

#### Failure Modes

- [ ] Cognito outage: users
can't authenticate. What happens to already-authenticated users? (Cookie-based, so existing sessions survive)
- [ ] S3 origin outage: CloudFront serves stale cache. How long? (Depends on TTL config)
- [ ] Lambda@Edge timeout: viewer gets 502. Is there a static fallback?
- [ ] Cache invalidation failure: users see stale `index.html`. Recovery plan?
- [ ] Region-specific edge failure: Internet Monitor alerts, but is there action to take?
- [ ] Certificate expiry: automated renewal via ACM? Alarm on days-to-expiry?

#### Load & Scale

- [ ] Lambda@Edge concurrent execution limits understood (regional limits at edge)
- [ ] CloudFront request rate limits documented
- [ ] Cognito token endpoint rate limits understood
- [ ] Load testing performed against staging (not production CloudFront)
- [ ] Behavior under cache-miss thundering herd documented

#### Security & Compliance

- [ ] Lambda@Edge security headers configured (CSP, HSTS, X-Frame-Options)
- [ ] JWT validation logic reviewed for edge cases (clock skew, algorithm confusion)
- [ ] Cookie attributes correct (HttpOnly, Secure, SameSite)
- [ ] CloudFront access logs retained for compliance period
- [ ] WAF rules reviewed and false positives addressed
- [ ] WAF logging enabled

#### Governance

- [ ] AWS Config rules enabled for CloudFront, S3, Lambda, WAF
- [ ] CloudTrail logging all relevant API calls
- [ ] Metric filters on sensitive CloudTrail events (manual console changes)
- [ ] Cost Anomaly Detection enabled
- [ ] AWS Budgets set with threshold alerts
- [ ] Athena table created over CloudFront standard logs

### 6.2 ORR Review Process

1. **Self-assessment**: Team fills out checklist (above) with evidence links
2. **Peer review**: Another team reviews and challenges assumptions
3. **Failure mode walkthrough**: Team walks through each failure mode live, demonstrating monitoring and response
4. **Game day**: Simulate a failure (e.g., break auth in staging, watch alarms fire, execute runbook)
5. **Sign-off**: ORR approved when all critical items are green

---

## 7. Implementation Priority

Recommended order based on "do it right" while being pragmatic:

### Phase 1: Foundation (Week 1-2)

1. `[P0]` Centralize Lambda@Edge logs (advanced logging config)
2. `[P2]` Enable X-Ray on Lambda@Edge functions
3. `[P0]` Define SLOs and create CloudWatch metric math for burn rates
4. `[P0]` Set up CloudWatch RUM with 100% sampling + source map uploads
5. `[P2]` Enable S3 origin request metrics

### Phase 2: Detection (Week 3-4)

6. `[P0]` Deploy synthetic canaries (homepage, auth flow, critical paths)
7. `[P1]` Enable CloudFront real-time logs via Kinesis
8. `[P2]` Enable Internet Monitor
9. `[P1]` Enable WAF logging, review managed rule matches
10. `[P2]` Set up Route 53 health checks (if custom domain)
11. `[P0]` Create alarms for all components (individual + composite + anomaly detection)
12. `[P1]` Create the operational dashboard

### Phase 3: Resilience (Week 5-6)

13. `[P1]` Configure S3 cross-region replication + CloudFront origin failover
14. `[P3]` Enable Origin Shield
15. `[P1]` Add content freshness monitoring (build-id meta tag + canary check)
16. `[P2]` Add performance budgets to CI pipeline

### Phase 4: Release Safety (Week 7-8)

17. `[P1]` Implement CloudFront continuous deployment (staging distribution)
18. `[P0]` Add automated rollback on alarm during canary bake
19. `[P1]` Automate cache invalidation in the pipeline

### Phase 5: Governance (Week 9-10)

20. `[P2]` Enable AWS Config rules for CloudFront, S3, Lambda, WAF
21. `[P2]` Set up CloudTrail metric filters for manual changes
22. `[P3]` Enable Cost Anomaly Detection + Budgets
23. `[P3]` Create Athena table over CloudFront standard logs
24. `[P2]` Build Logs Insights saved queries for common investigations

### Phase 6: Validation (Week 11-12)

25. `[P0]` Write runbooks for all SEV1/SEV2 alarms
26. `[P0]` Complete ORR checklist
27. `[P1]` Run game day (simulated incident)
28. `[P1]` Tune alarms and WAF rules based on false positive rate
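
The error-budget arithmetic behind the section 2 SLOs (and Phase 1, item 3) can be sketched as follows. This is a minimal illustration under stated assumptions, not the CloudWatch metric math itself: the function names are hypothetical, and availability is assumed to be measured as failed canary runs over total canary runs.

```typescript
const MINUTES_PER_DAY = 24 * 60;

// Allowed downtime, in minutes, for an availability target over a window.
// sloTarget is a fraction, e.g. 0.9995 for 99.95%.
function errorBudgetMinutes(sloTarget: number, windowDays: number): number {
  return (1 - sloTarget) * windowDays * MINUTES_PER_DAY;
}

// Burn rate = observed failure ratio / allowed failure ratio.
// 1.0 spends the budget exactly over the window; 2.0 spends it in half the time.
function burnRate(sloTarget: number, failedRuns: number, totalRuns: number): number {
  if (totalRuns === 0) {
    return 0; // no data yet, so nothing has been burned
  }
  return failedRuns / totalRuns / (1 - sloTarget);
}
```

For a 30-day rolling window, `errorBudgetMinutes(0.9995, 30)` gives about 21.6 minutes; the 21.9 min/month figure in the SLO table corresponds to an average-length month of roughly 30.44 days. One failed canary run in 1,000 against a 99.95% target is a burn rate of about 2, meaning the budget would be gone in half the window if that rate persisted.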