Architecture Review Checklist

Overview

This checklist follows the C4 model (Context, Containers, Components, Code) to review system architecture from high-level to detailed implementation. Use this for architecture reviews, system audits, and ensuring documentation completeness.

Part 1: System-Wide Review

1.1 System Context (Level 1)

Context Diagram

[ ] C4 Context diagram exists and is current (updated within 3 months)
[ ] All external systems and actors are identified
[ ] System boundaries are clearly defined
[ ] Business purpose and value proposition is documented
[ ] System scope is explicit about what’s included/excluded
[ ] Data flows between system and external entities are shown
[ ] Trust boundaries are identified (where data crosses security zones)

External Dependencies

[ ] All third-party services are documented with purpose
[ ] SLAs with external providers are documented
[ ] Integration patterns with external systems are defined
[ ] Fallback strategies exist if external systems fail
[ ] External system rate limits and quotas are known
[ ] Data sharing agreements with external parties exist
[ ] Vendor risk assessments are completed

Users & Actors

[ ] All user types/personas are identified
[ ] User authentication methods are documented
[ ] User authorization levels are defined
[ ] User journey maps exist for critical workflows

1.2 Cross-System Concerns

Security Architecture

[ ] Security model is documented (zero trust, perimeter-based, etc.)
[ ] Authentication strategy is defined system-wide
[ ] Authorization model is consistent across containers
[ ] Encryption strategy covers data at rest and in transit
[ ] Secrets management approach is standardized
[ ] Security incident response plan exists
[ ] Compliance requirements are documented (GDPR, SOC2, etc.)
[ ] Regular security assessments are scheduled

Network Architecture

[ ] Network topology diagram exists
[ ] Network segmentation strategy is documented
[ ] All traffic flows use appropriate encryption (TLS 1.2+)
[ ] Firewall rules follow least-privilege principle
[ ] DDoS protection strategy is in place
[ ] CDN strategy is documented for static content
[ ] DNS configuration and failover is documented

Data Architecture

[ ] Data model/schema documentation exists
[ ] Data flow diagrams show movement between containers
[ ] Data residency requirements are documented
[ ] Data classification scheme exists (public, internal, confidential, etc.)
[ ] Data retention policies are defined
[ ] Data backup strategy covers all critical data
[ ] Data recovery procedures are documented and tested

Operational Architecture

[ ] Deployment architecture is documented (regions, availability zones)
[ ] Disaster recovery strategy is defined (RTO/RPO targets)
[ ] Monitoring and observability strategy is consistent
[ ] Logging strategy is standardized across components
[ ] Alerting and on-call procedures are defined
[ ] Capacity planning is documented
[ ] Cost allocation and monitoring approach exists

Performance & Scalability

[ ] Performance SLAs/SLOs are defined
[ ] Load testing strategy exists with results documented
[ ] Bottlenecks have been identified and addressed
[ ] Horizontal and vertical scaling strategies are defined
[ ] Auto-scaling policies are configured and tested
[ ] Caching strategy is documented (application, CDN, database)

Quality Attributes

[ ] Architecture Decision Records (ADRs) document key decisions
[ ] Quality attribute requirements are documented (performance, security, availability)
[ ] Trade-offs between quality attributes are explicit
[ ] Non-functional requirements are testable

Part 2: Container Level Review (Level 2)

Review each container (applications, databases, file systems, etc.) individually

2.1 Generic Container Checklist

Container: ___________________

Container Documentation

[ ] C4 Container diagram shows this container in system context
[ ] Container purpose and responsibilities are documented
[ ] Technology choices are justified with ADRs
[ ] Container repository README is comprehensive
[ ] API documentation exists (OpenAPI/Swagger for APIs)
[ ] Deployment architecture diagram exists
[ ] Configuration management is documented

Container Boundaries

[ ] Container dependencies (other containers) are identified
[ ] Communication protocols are specified (REST, gRPC, message queue, etc.)
[ ] API contracts/interfaces are versioned
[ ] Integration patterns are documented (sync/async, request/response, pub/sub)
[ ] Container responsibilities don’t overlap with others
[ ] Data owned by this container is clearly defined

Security (Container Level)

[ ] Authentication mechanism is implemented and documented
[ ] Authorization is enforced for all operations
[ ] HTTPS/TLS is enforced for all network communication
[ ] Secrets are stored in secrets manager (not in code/config)
[ ] Security headers are configured (HSTS, CSP, etc.)
[ ] Input validation prevents injection attacks
[ ] Rate limiting prevents abuse
[ ] OWASP Top 10 vulnerabilities addressed

Networking

[ ] Network exposure is appropriate (public/private)
[ ] Firewall rules restrict access to necessary ports only
[ ] Load balancing strategy is configured
[ ] Health check endpoints are implemented
[ ] Service discovery mechanism is configured (if microservices)
[ ] Network policies restrict container-to-container traffic

Data Management

[ ] Database/storage technology choice is documented
[ ] Schema/data model is documented
[ ] Backup procedures are configured and tested
[ ] Data migration strategy exists
[ ] Connection pooling is optimized
[ ] Database credentials use principle of least privilege

Observability

[ ] Structured logging is implemented with correlation IDs
[ ] Metrics are exposed (requests, errors, latency)
[ ] Distributed tracing is configured (if applicable)
[ ] Health check endpoint returns meaningful status
[ ] Readiness probe endpoint exists
[ ] Dashboards exist for container health
[ ] Alerts are configured for critical failures

Resilience

[ ] Graceful degradation strategy exists
[ ] Circuit breakers prevent cascade failures
[ ] Retry logic with exponential backoff is implemented
[ ] Timeout values are configured for external calls
[ ] Bulkhead pattern isolates critical resources
[ ] Fallback mechanisms handle dependency failures

Deployment & Operations

[ ] CI/CD pipeline exists for this container
[ ] Automated testing covers critical paths
[ ] Deployment strategy is documented (blue-green, canary, rolling)
[ ] Rollback procedure is documented and tested
[ ] Infrastructure as Code defines container infrastructure
[ ] Environment-specific configuration is externalized
[ ] Feature flags enable gradual rollout
[ ] Runbooks exist for common operational tasks

Performance

[ ] Performance requirements are defined
[ ] Load testing has been performed
[ ] Resource limits are configured (CPU, memory)
[ ] Auto-scaling is configured if needed
[ ] Caching is implemented where appropriate
[ ] Async processing is used for long-running operations

Dependencies

[ ] All dependencies are documented (libraries, frameworks, services)
[ ] Dependency versions are pinned
[ ] Vulnerable dependencies are tracked and updated
[ ] License compliance is verified
[ ] Unused dependencies are removed

2.2 Frontend Web Application

Container Name: ___________________

Architecture Pattern: ⬜ Server-Side Rendered (SSR) ⬜ Static Site Generation (SSG) ⬜ Client-Side SPA ⬜ Hybrid

Rendering Approach:

Server-Side Rendering (SSR): HTML generated on server for each request. Recommended for GDS compliance, accessibility, and SEO.
Static Site Generation (SSG): HTML pre-generated at build time. Good for content-heavy sites with infrequent changes.
Client-Side SPA: JavaScript renders content in browser. May not meet GDS/accessibility standards without careful implementation.
Hybrid: Combination of approaches (e.g., SSR for public pages, SPA for authenticated areas).

Note: For UK Government services, GDS Service Manual requires progressive enhancement and services that work without JavaScript. Client-side SPAs typically do not meet these requirements unless using SSR/SSG with hydration.

Frontend-Specific Documentation

[ ] Rendering strategy is documented and justified
[ ] Component library/design system is documented
[ ] State management approach is documented
[ ] Routing strategy is documented
[ ] Build and bundling configuration is documented
[ ] Browser compatibility matrix is defined
[ ] Progressive enhancement strategy is documented

User Interface

[ ] Responsive design works across device sizes
[ ] Accessibility standards met (WCAG 2.1 Level AA minimum)
[ ] Keyboard navigation is fully functional
[ ] Screen reader compatibility is tested (JAWS, NVDA, VoiceOver)
[ ] Color contrast meets accessibility requirements
[ ] Loading states and error messages are user-friendly
[ ] Offline capability is implemented (if required)
[ ] Service works without JavaScript enabled (if GDS/progressive enhancement required)

Frontend Security

[ ] Content Security Policy (CSP) is configured
[ ] Subresource Integrity (SRI) for external scripts
[ ] XSS protection through proper escaping/sanitization (server-side and client-side)
[ ] CSRF tokens for state-changing operations
[ ] Sensitive data is not exposed in client-side code
[ ] API keys are not exposed in frontend code
[ ] Authentication tokens stored securely (httpOnly cookies or secure storage)
[ ] Input validation on client side (with server-side validation too)
[ ] Server-side rendering prevents data leakage in initial HTML

Frontend Performance

[ ] Core Web Vitals metrics are monitored (LCP, FID, CLS)
[ ] Initial page load uses server-rendered or pre-generated HTML
[ ] Code splitting reduces initial bundle size (for JavaScript enhancements)
[ ] Lazy loading for routes and components (if applicable)
[ ] Images are optimized and use appropriate formats (WebP, AVIF)
[ ] Critical CSS is inlined or prioritized
[ ] Third-party scripts are loaded asynchronously
[ ] Service worker for caching (if applicable)
[ ] Bundle size is monitored and optimized
[ ] Time to Interactive (TTI) is acceptable on slow connections

Server-Side Rendering (if applicable)

[ ] Server renders complete HTML on first request
[ ] Hydration strategy is documented (if client-side JavaScript added)
[ ] Server-side data fetching is optimized
[ ] Caching strategy for rendered pages
[ ] Streaming rendering used where appropriate
[ ] Error boundaries prevent white screen on errors
[ ] Server has appropriate resource limits (CPU, memory)

Progressive Enhancement

[ ] Core user journey works without JavaScript
[ ] JavaScript enhances experience but isn’t required
[ ] Forms submit via traditional POST (not just AJAX)
[ ] Navigation works without JavaScript router
[ ] Fallbacks exist for JavaScript-dependent features
[ ] NoScript tags provide alternative content where needed

API Integration (from Frontend)

[ ] API error handling with user-friendly messages
[ ] Loading states for async operations
[ ] Request cancellation for unmounted components
[ ] API response caching strategy
[ ] Retry logic for failed requests
[ ] Request deduplication for identical concurrent requests
[ ] CORS configuration documented (if cross-origin)

State Management

[ ] State management pattern is consistent (Redux, MobX, Context, etc.)
[ ] Global vs local state boundaries are clear
[ ] State persistence strategy is documented (if needed)
[ ] Side effects are managed consistently
[ ] Server state and client state are clearly separated

Testing

[ ] Unit tests for business logic and utilities
[ ] Component tests for UI components
[ ] Integration tests for user flows
[ ] E2E tests for critical paths
[ ] Visual regression testing (if applicable)
[ ] Cross-browser testing strategy
[ ] Accessibility testing automated (axe, Lighthouse)
[ ] JavaScript-disabled testing (if progressive enhancement required)

GDS Compliance (if applicable)

[ ] GOV.UK Design System components used
[ ] GDS Service Manual standards followed
[ ] Service Assessment criteria met
[ ] Cookie consent follows GDS patterns
[ ] Error pages follow GDS standards (404, 500, 503)
[ ] Start pages and service journey documented
[ ] Performance budget defined (2-second page load target)

2.3 Backend API Service

Container Name: ___________________

API Type: ⬜ REST ⬜ GraphQL ⬜ gRPC ⬜ Mixed

API-Specific Documentation

[ ] OpenAPI/Swagger specification is current
[ ] API versioning strategy is documented
[ ] Request/response examples are provided
[ ] Error codes and messages are documented
[ ] Rate limiting policies are documented
[ ] Authentication/authorization flows are documented

API Design

[ ] RESTful principles followed (or GraphQL/gRPC standards)
[ ] Consistent naming conventions across endpoints
[ ] HTTP verbs used correctly (GET, POST, PUT, PATCH, DELETE)
[ ] HTTP status codes used appropriately
[ ] Pagination implemented for list endpoints
[ ] Filtering and sorting capabilities exist
[ ] Versioning strategy allows backward compatibility
[ ] Deprecation policy and timeline defined

API Security

[ ] Authentication required for all non-public endpoints
[ ] Authorization checks at resource level
[ ] Input validation on all parameters and body
[ ] SQL injection prevention (parameterized queries/ORM)
[ ] Rate limiting per user/API key
[ ] API abuse detection and blocking
[ ] Sensitive data not logged or exposed in errors
[ ] CORS configured appropriately

Data Access Layer

[ ] Database queries are optimized with proper indexing
[ ] N+1 query problems are avoided
[ ] Database connection pooling is configured
[ ] Transactions are used for multi-step operations
[ ] Database migrations are versioned and tested
[ ] Read replicas used for read-heavy operations (if applicable)
[ ] Caching layer reduces database load

Business Logic

[ ] Business rules are centralized, not scattered
[ ] Domain model is well-defined
[ ] Validation rules are comprehensive
[ ] Side effects are handled explicitly
[ ] Idempotency for state-changing operations
[ ] Event publishing for domain events (if event-driven)

API Performance

[ ] Response time SLAs are defined and monitored
[ ] Slow query detection and alerting
[ ] Request/response compression enabled
[ ] ETags for conditional requests
[ ] Batch operations available for bulk updates
[ ] Async processing for long-running operations
[ ] Background jobs for non-critical tasks

Integration Testing

[ ] Contract tests validate API contracts
[ ] Integration tests cover happy paths and error cases
[ ] Load tests validate performance under stress
[ ] API compatibility tests prevent breaking changes

2.3a Hybrid Web Application + API Container

Container Name: ___________________

Use Case: Single container serving both web pages (HTML) and API endpoints (JSON/XML). Common for public-facing services that need both a user interface and programmatic access.

Examples:

Government service with public web pages AND an API for partners
Documentation site that also exposes data via API
Admin interface bundled with management API
Legacy monolith serving both concerns

Hybrid Container Documentation

[ ] Clear separation between web routes and API routes is documented
[ ] URL structure distinguishes web pages from API (e.g., /api/* for APIs)
[ ] Versioning strategy for APIs doesn’t affect web pages
[ ] Deployment strategy handles both concerns
[ ] Whether web and API share authentication/authorization is documented

Architecture Clarity

[ ] Web page routes are clearly identified (e.g., /, /about, /contact)
[ ] API routes are clearly identified (e.g., /api/v1/, /api/v2/)
[ ] Static asset routes documented (e.g., /assets/, /public/)
[ ] Health check endpoints separated (e.g., /health for web, /api/health for API)
[ ] Middleware/filters distinguish between web and API requests

Content Type Handling

[ ] Content negotiation correctly routes to HTML or JSON responses
[ ] Accept headers properly handled (text/html vs application/json)
[ ] Error responses appropriate for context (HTML error pages vs JSON errors)
[ ] 404 handling differs between web (error page) and API (JSON error)
[ ] CORS only applied to API routes, not web pages

Security Considerations

[ ] CSRF protection enabled for web forms but not API endpoints
[ ] API authentication separate from web session management (if different users)
[ ] Rate limiting configured separately for web vs API traffic
[ ] Security headers appropriate for each (CSP for web, CORS for API)
[ ] Session management doesn’t interfere with API token validation

Web Functionality (within Hybrid Container)

[ ] Server-side rendering of HTML (SSR pattern)
[ ] Template engine configured (Jinja2, EJS, Handlebars, etc.)
[ ] Progressive enhancement principles followed
[ ] Static file serving configured and optimized
[ ] Client-side JavaScript enhancement is optional, not required
[ ] Web forms work without JavaScript

API Functionality (within Hybrid Container)

[ ] JSON/XML serialization properly configured
[ ] API documentation accessible (e.g., /api/docs via Swagger UI)
[ ] API versioning doesn’t conflict with web routes
[ ] API responses don’t accidentally return HTML
[ ] OpenAPI spec generation automated

Performance Separation

[ ] Caching strategies differ between web pages and API
[ ] Static assets cached aggressively (long TTL)
[ ] API responses cached appropriately (shorter TTL, conditional caching)
[ ] CDN configuration handles both static files and dynamic content
[ ] Resource limits consider both web traffic and API calls

Monitoring & Observability

[ ] Metrics separated by traffic type (web vs API)
[ ] Logging distinguishes web requests from API calls
[ ] Dashboards show both web page performance and API performance
[ ] Alerts configured for both web availability and API availability
[ ] Usage analytics track web visitors separately from API consumers

Testing Strategy

[ ] E2E tests for web user journeys
[ ] Contract tests for API endpoints
[ ] Integration tests verify web and API don’t interfere
[ ] Load testing covers realistic mix of web and API traffic
[ ] Accessibility testing for web pages
[ ] API compatibility testing

When to Consider Splitting

Consider separating into distinct containers when:

[ ] Web and API have significantly different scaling needs
[ ] Different teams own web vs API
[ ] API versioning complicates web deployment
[ ] Security requirements differ substantially
[ ] Performance characteristics conflict (e.g., API needs high throughput, web needs low latency)
[ ] Want independent deployment cycles

Advantages of Hybrid Approach

Simpler infrastructure (single deployment)
Shared authentication/authorization logic
Single codebase for related concerns
Easier to maintain consistency
Lower operational overhead

Disadvantages of Hybrid Approach

Harder to scale independently
Deployment risk affects both concerns
Potential for security configuration conflicts
Mixing of concerns can complicate code
Different performance characteristics may conflict

2.4 API Gateway

Container Name: ___________________

Gateway-Specific Documentation

[ ] Routing rules are documented
[ ] Transformation logic is documented
[ ] Rate limiting policies per client/endpoint
[ ] Authentication/authorization flows
[ ] Backend service mapping

Gateway Configuration

[ ] Routes to backend services are defined
[ ] Service discovery is configured (if dynamic)
[ ] Health checks for backend services
[ ] Timeout configuration for upstream services
[ ] Retry and circuit breaker policies
[ ] Request/response transformation rules
[ ] API composition/aggregation logic (if applicable)

Security at Gateway

[ ] SSL/TLS termination configured
[ ] WAF rules protect against common attacks
[ ] API key validation
[ ] JWT validation and claims extraction
[ ] IP whitelisting/blacklisting
[ ] DDoS protection
[ ] OAuth 2.0 integration (if applicable)

Traffic Management

[ ] Rate limiting per client, endpoint, or globally
[ ] Quota management for API consumers
[ ] Traffic splitting for canary deployments
[ ] A/B testing capabilities (if needed)
[ ] Request prioritization or throttling

Observability at Gateway

[ ] Request/response logging with correlation IDs
[ ] API usage metrics per client/endpoint
[ ] Latency tracking for backend services
[ ] Error rate monitoring
[ ] Security event logging (failed auth, rate limit hits)

2.5 Message Broker / Event Bus

Container Name: ___________________

Messaging Documentation

[ ] Event schemas are documented
[ ] Topic/queue naming conventions
[ ] Message flow diagrams
[ ] Producer and consumer mappings
[ ] Event versioning strategy
[ ] Poison message handling

Message Broker Configuration

[ ] Message retention policies defined
[ ] Dead letter queue configured
[ ] Message ordering guarantees documented
[ ] Partitioning strategy (if applicable)
[ ] Replication configuration for high availability
[ ] Access control per topic/queue

Event Design

[ ] Event schemas are versioned
[ ] Events are immutable
[ ] Events contain sufficient context
[ ] Event names are meaningful and consistent
[ ] Event size is optimized
[ ] Backward compatibility is maintained

Reliability

[ ] At-least-once/exactly-once delivery guarantees documented
[ ] Idempotent consumer handling
[ ] Retry logic for failed message processing
[ ] Circuit breakers for downstream dependencies
[ ] Message acknowledgment strategy
[ ] Consumer lag monitoring and alerting

Security

[ ] Authentication for producers and consumers
[ ] Authorization per topic/queue
[ ] Encryption in transit
[ ] Encryption at rest (if required)
[ ] Sensitive data handling in messages

Performance

[ ] Throughput requirements defined
[ ] Consumer scaling strategy
[ ] Batch processing for efficiency
[ ] Message compression (if applicable)
[ ] Consumer group configuration optimized

2.6 Database / Data Store

Container Name: ___________________
Type: ⬜ Relational ⬜ Document ⬜ Key-Value ⬜ Graph ⬜ Time-Series ⬜ Other

Database Documentation

[ ] Entity-Relationship diagram (ERD) or data model
[ ] Schema documentation with descriptions
[ ] Index strategy documented
[ ] Query patterns documented
[ ] Constraints and relationships defined
[ ] Migration history maintained

Schema Design

[ ] Normalization appropriate for use case
[ ] Indexes support common query patterns
[ ] Foreign key constraints enforce referential integrity
[ ] Check constraints validate data quality
[ ] Appropriate data types chosen
[ ] NULL handling is explicit
[ ] Audit fields (created_at, updated_at, etc.) exist

Database Security

[ ] Encryption at rest enabled
[ ] Encryption in transit (TLS) enforced
[ ] Principle of least privilege for database users
[ ] Separate credentials per application/service
[ ] Row-level security (if required)
[ ] Column-level encryption for sensitive data
[ ] Database audit logging enabled
[ ] Regular security patching schedule

Backup & Recovery

[ ] Automated backup schedule configured
[ ] Backup retention policy defined
[ ] Point-in-time recovery capability
[ ] Backup encryption enabled
[ ] Backup restoration tested regularly (last test: _____)
[ ] Cross-region backup for disaster recovery
[ ] RPO and RTO defined and achievable

Performance

[ ] Query performance is monitored
[ ] Slow query log is reviewed regularly
[ ] Execution plans for critical queries are optimized
[ ] Connection pooling configured appropriately
[ ] Database statistics are updated regularly
[ ] Partitioning strategy (if applicable)
[ ] Read replicas for read-heavy workloads
[ ] Query caching configured

Maintenance

[ ] Database upgrade strategy documented
[ ] Migration testing process defined
[ ] Rollback procedures for failed migrations
[ ] Vacuum/analyze jobs scheduled (PostgreSQL)
[ ] Index maintenance procedures
[ ] Storage growth monitoring and alerting
[ ] Archival strategy for old data

2.7 Cache Layer (Redis, Memcached, etc.)

Container Name: ___________________

Cache Documentation

[ ] Cache strategy documented (cache-aside, write-through, etc.)
[ ] Cache key naming conventions
[ ] TTL policies per cache type
[ ] Cache invalidation strategy
[ ] Cache warming procedures

Cache Design

[ ] Cache key design prevents collisions
[ ] Appropriate TTL values set
[ ] Cache stampede prevention (lock, probabilistic early expiration)
[ ] Cache invalidation strategy handles updates
[ ] Cache size limits configured
[ ] Eviction policy appropriate (LRU, LFU, etc.)

Resilience

[ ] Application handles cache unavailability gracefully
[ ] Cache warming on startup (if needed)
[ ] Cache cluster configuration for high availability
[ ] Failover strategy documented
[ ] Monitoring for cache hit/miss ratio
[ ] Alerting for low hit ratios

Security

[ ] Authentication enabled
[ ] Network isolation (not public)
[ ] Encryption in transit (if required)
[ ] No sensitive data in cache keys

2.8 Background Job Processor / Worker

Container Name: ___________________

Worker Documentation

[ ] Job types and purposes documented
[ ] Job queue architecture diagram
[ ] Retry policies per job type
[ ] Job priority levels defined
[ ] Job scheduling strategy

Job Design

[ ] Jobs are idempotent
[ ] Job payloads are minimal (references, not large data)
[ ] Job timeout values configured
[ ] Job retry logic with exponential backoff
[ ] Dead letter queue for permanently failed jobs
[ ] Job status tracking for long-running jobs

Reliability

[ ] Job failure monitoring and alerting
[ ] Stuck job detection
[ ] Job replay capability for failures
[ ] Graceful shutdown handling
[ ] Job ordering guarantees (if required)

Performance

[ ] Worker scaling strategy (manual or auto)
[ ] Queue depth monitoring
[ ] Processing time per job type monitored
[ ] Resource limits per job type
[ ] Batch processing for efficiency

Observability

[ ] Job execution logs with context
[ ] Job metrics (queued, processing, completed, failed)
[ ] Job execution time tracking
[ ] Queue depth metrics and alerting
[ ] Failed job analysis and reporting

2.9 File Storage / Object Storage

Container Name: ___________________

Storage Documentation

[ ] Storage structure and organization documented
[ ] File naming conventions defined
[ ] Access patterns documented
[ ] Lifecycle policies documented
[ ] CDN integration (if applicable)

Storage Design

[ ] Appropriate storage class for access patterns (hot, warm, cold)
[ ] Directory/bucket structure is logical
[ ] File versioning enabled (if required)
[ ] Lifecycle policies archive or delete old files
[ ] Large file upload strategy (multipart, resumable)
[ ] File metadata strategy

Security

[ ] Bucket/container policies restrict access
[ ] Pre-signed URLs for temporary access
[ ] Encryption at rest enabled
[ ] Encryption in transit enforced
[ ] Access logging enabled
[ ] Public access blocked unless intentional
[ ] Cross-Origin Resource Sharing (CORS) configured properly

Performance

[ ] CDN configured for frequently accessed files
[ ] Appropriate caching headers set
[ ] File compression for compressible content
[ ] Transfer acceleration enabled (if available)

Cost Optimization

[ ] Storage class tiers used appropriately
[ ] Lifecycle policies reduce storage costs
[ ] Orphaned files cleaned up regularly
[ ] Storage costs monitored per bucket/container

2.10 Third-Party Integration Service

Container Name: ___________________

Integration Documentation

[ ] Integration architecture diagram
[ ] API documentation from third-party referenced
[ ] Authentication method documented
[ ] Rate limits from provider documented
[ ] Error codes and handling documented
[ ] Webhook configuration (if applicable)

Integration Design

[ ] Abstraction layer isolates third-party API
[ ] Fallback behavior if third-party unavailable
[ ] Idempotency keys used to prevent duplicates
[ ] Request retries with exponential backoff
[ ] Circuit breaker prevents cascade failures
[ ] Timeout values configured appropriately

Reliability

[ ] Third-party service SLA is known
[ ] Monitoring of third-party availability
[ ] Graceful degradation when service unavailable
[ ] Webhook signature verification
[ ] Webhook replay handling (idempotency)

Security

[ ] API keys stored in secrets manager
[ ] Credentials rotated regularly
[ ] Minimal permissions requested from third-party
[ ] Data shared with third-party is minimized
[ ] Data processing agreement in place (if applicable)

Cost Management

[ ] API usage tracked against rate limits
[ ] Cost per API call is monitored
[ ] Budget alerts configured
[ ] Optimization opportunities identified (caching, batching)

Part 3: Component Level Review (Level 3)

Review key components within each container

Component: ___________________

Container: ___________________

Component Documentation

[ ] C4 Component diagram shows this component in container context
[ ] Component purpose and responsibilities are documented
[ ] Component interfaces/APIs are documented
[ ] Design patterns used are identified
[ ] Code location is documented (repository, module, package)

Component Design

[ ] Single Responsibility Principle is followed
[ ] Component has clear, cohesive purpose
[ ] Dependencies on other components are minimal
[ ] Interfaces are well-defined and stable
[ ] Component is testable in isolation
[ ] Appropriate design patterns are used

Component Boundaries

[ ] Interactions with other components are documented
[ ] Data passed between components is specified
[ ] Error handling between components is defined
[ ] Component doesn’t reach across architectural layers

Security (Component Level)

[ ] Input validation is performed on all entry points
[ ] Authorization checks are performed where needed
[ ] Sensitive data is handled securely
[ ] Output encoding prevents injection vulnerabilities
[ ] Security-relevant actions are logged

Data Handling

[ ] Data transformations are documented
[ ] Business logic is separated from data access
[ ] Validation rules are enforced
[ ] Error states are handled appropriately

Testing

[ ] Unit tests exist with adequate coverage (>70%)
[ ] Integration tests cover component interactions
[ ] Test data management strategy exists
[ ] Edge cases and error conditions are tested
[ ] Performance tests exist for critical components

Part 4: Code Level Review (Level 4)

Review implementation quality within components

Code Quality

Code Standards

[ ] Coding standards are documented and followed
[ ] Linting rules are configured and passing
[ ] Code formatting is consistent and automated
[ ] Code review process is followed for all changes
[ ] Technical debt is documented with remediation plans

Code Structure

[ ] Code is organized logically (by feature/domain, not by type)
[ ] Naming conventions are clear and consistent
[ ] Functions/methods are appropriately sized
[ ] Cyclomatic complexity is managed
[ ] Code duplication is minimized (DRY principle)
[ ] SOLID principles are followed where applicable

Error Handling

[ ] Errors are handled at appropriate levels
[ ] Error messages are meaningful for debugging
[ ] Errors don’t expose sensitive information to users
[ ] Exceptions are used for exceptional cases only
[ ] Resource cleanup happens in finally blocks/defer statements

Testing (Code Level)

[ ] Unit tests are fast and isolated
[ ] Tests follow AAA pattern (Arrange, Act, Assert)
[ ] Test names clearly describe what’s being tested
[ ] Mock/stub usage is appropriate
[ ] Tests are maintainable and reliable (not flaky)
[ ] Code coverage metrics are tracked

Documentation (Code Level)

[ ] Complex algorithms are explained with comments
[ ] Public APIs have comprehensive documentation
[ ] TODOs are tracked and addressed
[ ] Code comments explain “why” not “what”
[ ] Examples exist for complex functionality

Dependencies (Code Level)

[ ] Dependencies are appropriate for the problem
[ ] Heavy dependencies are justified
[ ] Dependency injection is used where appropriate
[ ] Circular dependencies don’t exist

Part 5: Documentation Completeness Across Repositories

Repository-Specific Documentation

Main Application Repository

[ ] Comprehensive README with project overview
[ ] Architecture documentation (C4 diagrams, ADRs)
[ ] Getting started guide for new developers
[ ] Development environment setup instructions
[ ] Build and deployment instructions
[ ] Contributing guidelines
[ ] Code of conduct (if open source/large team)
[ ] License file
[ ] CHANGELOG documenting releases
[ ] API documentation (if applicable)

Infrastructure Repository (IaC)

[ ] README explaining infrastructure architecture
[ ] Environment configuration documentation
[ ] Terraform/CloudFormation module documentation
[ ] Network architecture diagrams
[ ] Security group/firewall rule documentation
[ ] Disaster recovery procedures
[ ] Cost optimization notes
[ ] Resource tagging strategy

CI/CD Pipeline Repository

[ ] Pipeline architecture documentation
[ ] Build process documentation
[ ] Deployment strategies explained
[ ] Environment promotion process
[ ] Rollback procedures
[ ] Pipeline troubleshooting guide
[ ] Secrets management approach
[ ] Test automation strategy

Documentation Repository (if separate)

[ ] Architecture Decision Records (ADRs)
[ ] System architecture diagrams (C4 models)
[ ] API documentation and specifications
[ ] Database schema documentation
[ ] Integration guides for external systems
[ ] Security documentation
[ ] Runbooks for operational procedures
[ ] Post-mortem reports
[ ] Troubleshooting guides

Shared Libraries/Packages Repository

[ ] README with library purpose and usage
[ ] API documentation for public interfaces
[ ] Examples demonstrating common use cases
[ ] Versioning and release notes
[ ] Breaking changes documented
[ ] Contributing guidelines
[ ] Testing documentation

Configuration Repository

[ ] Configuration schema documentation
[ ] Environment-specific configurations explained
[ ] Secret management documented
[ ] Configuration change procedures
[ ] Validation rules documented

Microservices Repositories (each service)

[ ] Service purpose and boundaries documented
[ ] API contract (OpenAPI/gRPC spec)
[ ] Dependencies on other services documented
[ ] Local development setup
[ ] Service-specific configuration
[ ] Health check implementation
[ ] Monitoring and alerting configuration
[ ] Service runbook for on-call

Cross-Repository Documentation

Wiki/Confluence/Central Documentation

[ ] System overview and context
[ ] Architecture diagrams accessible to all teams
[ ] Onboarding documentation for new team members
[ ] Development workflow documentation
[ ] Release process documentation
[ ] Incident response procedures
[ ] Contact information and team structure
[ ] Decision log and ADRs
[ ] Meeting notes and design reviews
[ ] Glossary of terms and acronyms

API Documentation Portal

[ ] All APIs are documented in one place
[ ] API versioning strategy is clear
[ ] Authentication/authorization guide
[ ] Code examples in multiple languages
[ ] Postman/Insomnia collections available
[ ] Rate limiting and quotas documented
[ ] Deprecation policy and timeline
[ ] Support and contact information

Operational Documentation

[ ] System dependencies map
[ ] Disaster recovery runbooks
[ ] Incident response playbooks
[ ] On-call procedures and escalation
[ ] Maintenance windows and procedures
[ ] Performance baselines and SLAs
[ ] Capacity planning documentation
[ ] Cost management and budgeting

Security Documentation

[ ] Security architecture overview
[ ] Authentication and authorization model
[ ] Data classification and handling
[ ] Compliance requirements and evidence
[ ] Security incident procedures
[ ] Vulnerability management process
[ ] Access control matrix
[ ] Security testing procedures

Review Summary

System Details

System Name: _____________________
Review Date: _____________________
Reviewer(s): _____________________
System Owner: _____________________
Architecture Maturity: ⬜ Initial ⬜ Developing ⬜ Defined ⬜ Managed ⬜ Optimized

Completeness by C4 Level

Context (Level 1): _____% complete
Container (Level 2): _____% complete
Component (Level 3): _____% complete
Code (Level 4): _____% complete

Documentation Completeness

In-Repository Docs: _____% complete
Central Documentation: _____% complete
Cross-Repository Docs: _____% complete

Risk Assessment by Area

Area	Risk Level	Notes
Security	⬜ Low ⬜ Medium ⬜ High ⬜ Critical
Resilience	⬜ Low ⬜ Medium ⬜ High ⬜ Critical
Performance	⬜ Low ⬜ Medium ⬜ High ⬜ Critical
Observability	⬜ Low ⬜ Medium ⬜ High ⬜ Critical
Documentation	⬜ Low ⬜ Medium ⬜ High ⬜ Critical
Operations	⬜ Low ⬜ Medium ⬜ High ⬜ Critical

Priority Action Items

Critical (Fix immediately)

Item	Owner	Target Date	Status

High (Fix within sprint)

Item	Owner	Target Date	Status

Medium (Plan for next quarter)

Item	Owner	Target Date	Status

Low (Tech debt backlog)

Item	Owner	Target Date	Status

Overall Recommendations

Strategic observations and improvement recommendations

Next Review Date

Scheduled for: _____________________

Sign-off

Architect: _____________________ Date: _____
System Owner: _____________________ Date: _____
Security Review: _____________________ Date: _____

Appendix: Glossary of Terms

A

AAA Pattern (Arrange, Act, Assert) - A testing pattern where tests are structured in three phases: Arrange (setup test data), Act (execute the behavior), Assert (verify the outcome).

Access Control Matrix - A table documenting which users, roles, or services have what level of access to which resources.

ADR (Architecture Decision Record) - A document that captures an important architectural decision along with its context and consequences.

API (Application Programming Interface) - A set of definitions and protocols for building and integrating application software.

API Composition - Combining multiple backend API calls into a single aggregated response, typically at the API gateway layer.

API Contract - A formal agreement defining the structure, behavior, and expectations of an API, including request/response formats.

API Gateway - A server that acts as an API front-end, receiving API requests, enforcing throttling and security policies, and routing requests to appropriate backend services.

APM (Application Performance Monitoring) - Tools and practices for monitoring and managing the performance and availability of software applications.

Async Processing - Executing operations asynchronously, allowing the main program flow to continue without waiting for the operation to complete.

At-Least-Once Delivery - A message delivery guarantee where messages may be delivered one or more times, but never lost.

Auto-Scaling - Automatically adjusting computing resources based on current demand or predefined metrics.

AVIF (AV1 Image File Format) - A modern image format offering better compression than JPEG and PNG.

B

Backend for Frontend (BFF) - An architectural pattern where each user-facing application has its own tailored backend service.

Bastion Host - A server positioned as a gateway between trusted and untrusted networks, providing controlled access.

Batch Processing - Processing multiple items together in a group rather than individually, typically for efficiency.

Blue-Green Deployment - A deployment strategy using two identical production environments, switching traffic between them for zero-downtime deployments.

Bulkhead Pattern - An isolation pattern that partitions resources to prevent failures in one part from cascading to others.

Bus Factor - The minimum number of team members who need to be unavailable before a project stalls due to lack of knowledge.

C

C4 Model - A hierarchical approach to software architecture diagramming with four levels: Context, Containers, Components, and Code.

Cache Stampede - A situation where many requests simultaneously try to regenerate the same cache entry, causing system overload.

Cache-Aside - A caching pattern where the application checks the cache before querying the data source, and populates the cache on a miss.

Canary Deployment - A deployment strategy that gradually rolls out changes to a small subset of users before full deployment.

CDN (Content Delivery Network) - A distributed network of servers that deliver web content based on geographic location of the user.

Circuit Breaker - A design pattern that prevents an application from repeatedly trying to execute an operation likely to fail, allowing it to recover.

CLS (Cumulative Layout Shift) - A Core Web Vital metric measuring visual stability by quantifying unexpected layout shifts.

Component (C4) - Level 3 of the C4 model; the building blocks within a container that work together to deliver functionality.

Connection Pooling - Maintaining a cache of database connections to be reused, reducing the overhead of creating new connections.

Container (C4) - Level 2 of the C4 model; an executable unit like a web application, database, or microservice.

Context (C4) - Level 1 of the C4 model; the highest level view showing the system, its users, and external dependencies.

Core Web Vitals - Google’s metrics for measuring user experience: Largest Contentful Paint (LCP), First Input Delay (FID), and Cumulative Layout Shift (CLS).

CORS (Cross-Origin Resource Sharing) - A security mechanism that allows or restricts resources requested from a different domain.

CSRF (Cross-Site Request Forgery) - An attack that tricks a victim into submitting malicious requests using their authenticated session.

CSP (Content Security Policy) - An HTTP header that helps prevent XSS attacks by specifying which dynamic resources are allowed to load.

Cyclomatic Complexity - A software metric measuring the number of independent paths through a program’s source code.

D

DAST (Dynamic Application Security Testing) - Security testing performed on running applications to find vulnerabilities.

Dead Letter Queue - A queue where messages that can’t be processed successfully are sent for later analysis.

Distributed Tracing - Tracking a request as it flows through multiple services in a distributed system.

DRY (Don’t Repeat Yourself) - A software development principle aimed at reducing repetition of code patterns.

DR (Disaster Recovery) - Processes and tools for recovering IT systems after a catastrophic failure.

E

E2E (End-to-End) Testing - Testing an application’s workflow from beginning to end to ensure it behaves as expected.

Entity-Relationship Diagram (ERD) - A diagram showing entities (tables) and their relationships in a database.

ETL (Extract, Transform, Load) - A process of extracting data from sources, transforming it, and loading it into a destination system.

ETag - An HTTP response header used for cache validation, allowing conditional requests.

Event-Driven Architecture - An architectural pattern where system components communicate through events.

Exactly-Once Delivery - A message delivery guarantee ensuring each message is delivered once and only once.

Exponential Backoff - A retry strategy where wait time increases exponentially between attempts.

External System - Any system, service, or application that exists outside the boundaries of the system being reviewed. This includes third-party APIs, partner systems, legacy systems, SaaS platforms, and any dependencies not owned or directly controlled by your organization.

F

Failover - Automatically switching to a redundant system when the primary system fails.

Feature Flag - A technique allowing features to be enabled or disabled without deploying new code.

FID (First Input Delay) - A Core Web Vital measuring the time from user interaction to browser response.

Foreign Key - A database constraint linking one table to another to maintain referential integrity.

G

GDPR (General Data Protection Regulation) - EU regulation on data protection and privacy.

Graceful Degradation - Maintaining limited functionality when parts of a system fail.

GraphQL - A query language for APIs providing a complete description of data in the API.

gRPC - A high-performance RPC framework using HTTP/2 and Protocol Buffers.

H

HIPAA (Health Insurance Portability and Accountability Act) - US regulation for protecting sensitive patient health information.

Horizontal Scaling - Adding more machines to handle increased load.

HSTS (HTTP Strict Transport Security) - A security header forcing browsers to use HTTPS connections only.

HTTP Status Codes - Standard response codes indicating the result of an HTTP request (200 OK, 404 Not Found, etc.).

I

IaC (Infrastructure as Code) - Managing and provisioning infrastructure through machine-readable definition files.

Idempotency - An operation that produces the same result regardless of how many times it’s executed.

Index (Database) - A data structure improving the speed of data retrieval operations.

Input Validation - Checking that user-supplied data meets expected criteria before processing.

Integration Testing - Testing how different parts of a system work together.

J

JWT (JSON Web Token) - A compact, URL-safe token format for securely transmitting information between parties.

K

Key-Value Store - A database storing data as a collection of key-value pairs.

Kubernetes - An open-source container orchestration platform for automating deployment, scaling, and management.

L

Lazy Loading - Deferring loading of resources until they’re actually needed.

LCP (Largest Contentful Paint) - A Core Web Vital measuring when the largest content element becomes visible.

Least Privilege - Security principle of granting only the minimum access needed to perform a task.

LFU (Least Frequently Used) - A cache eviction policy removing items accessed least frequently.

Lifecycle Policy - Rules defining how long data is retained and when it’s archived or deleted.

Load Balancer - A device distributing network traffic across multiple servers.

Load Testing - Testing system behavior under expected and peak load conditions.

LRU (Least Recently Used) - A cache eviction policy removing items accessed least recently.

M

Master Data Management - Processes ensuring critical business data is consistent across the organization.

Message Broker - Middleware facilitating communication between applications via message passing.

MFA (Multi-Factor Authentication) - Authentication requiring two or more verification factors.

Microservices - An architectural style structuring an application as a collection of loosely coupled services.

Mock/Stub - Test doubles replacing real dependencies with controlled implementations.

Monolith - An application architecture where all components are part of a single deployable unit.

mTLS (Mutual TLS) - Both client and server authenticate each other using certificates.

N

N+1 Query Problem - A performance issue where one query is executed, followed by N additional queries in a loop.

Network ACL (Access Control List) - Rules controlling traffic allowed in and out of network subnets.

Network Segmentation - Dividing a network into smaller segments to improve security and performance.

Normalization - Organizing database tables to reduce redundancy and improve data integrity.

NSG (Network Security Group) - Azure’s firewall for filtering network traffic to resources.

O

OAuth 2.0 - An authorization framework enabling applications to obtain limited access to user accounts.

Observability - The ability to understand internal system states from external outputs (logs, metrics, traces).

OpenAPI/Swagger - A specification for describing REST APIs in a machine-readable format.

ORM (Object-Relational Mapping) - Technique for converting data between incompatible type systems using object-oriented programming.

OWASP Top 10 - A list of the most critical web application security risks.

P

Partitioning - Dividing a database table into smaller, more manageable pieces.

PCI-DSS - Payment Card Industry Data Security Standard for handling credit card information.

Penetration Testing - Simulated cyber attacks to identify security vulnerabilities.

PHI (Protected Health Information) - Health information that can be linked to an individual, protected under HIPAA.

PII (Personally Identifiable Information) - Information that can identify an individual.

Point-in-Time Recovery - Database recovery to any specific point in time within the backup retention period.

Poison Message - A message that causes repeated processing failures, typically moved to a dead letter queue.

Post-Mortem - Analysis conducted after an incident to understand what happened and prevent recurrence.

Pre-Signed URL - A time-limited URL granting temporary access to a private resource.

Principle of Least Privilege - See Least Privilege.

Q

Queue Depth - The number of messages waiting to be processed in a queue.

R

Rate Limiting - Controlling the rate of requests a user or service can make.

RBAC (Role-Based Access Control) - Access control based on user roles within an organization.

Read Replica - A copy of a database used to offload read queries from the primary database.

Recovery Point Objective (RPO) - Maximum acceptable amount of data loss measured in time.

Recovery Time Objective (RTO) - Maximum acceptable time to restore a system after a failure.

Referential Integrity - Database constraint ensuring relationships between tables remain consistent.

REST (Representational State Transfer) - An architectural style for designing networked applications using HTTP.

Retry Logic - Automatically attempting a failed operation again after a delay.

Rollback - Reverting a system to a previous state, typically after a failed deployment.

Rolling Deployment - Gradually replacing instances of the previous version with the new version.

Row-Level Security - Database feature controlling which rows users can access.

RPC (Remote Procedure Call) - Protocol allowing a program to execute procedures on another computer.

Runbook - Documentation of routine procedures and operations for system administration.

S

SAST (Static Application Security Testing) - Security testing analyzing source code for vulnerabilities without executing it.

Secrets Manager - Service for securely storing and managing sensitive information like API keys and passwords.

Security Group - AWS’s virtual firewall controlling inbound and outbound traffic.

Separation of Concerns - Design principle separating a program into distinct sections, each addressing a separate concern.

Service Discovery - Automatically detecting devices and services on a network.

Service Mesh - Infrastructure layer handling service-to-service communication, often providing security, monitoring, and reliability features.

Serverless - Cloud computing model where the cloud provider manages infrastructure, allowing developers to focus on code.

Session Management - Process of keeping track of user activity across multiple requests.

Sidecar Proxy - An auxiliary container deployed alongside an application container to provide supporting features.

SLA (Service Level Agreement) - Contract defining expected service levels between provider and customer.

SLO (Service Level Objective) - Target value or range of values for a service level measured by SLI.

SOC2 (System and Organization Controls 2) - Audit procedure ensuring service providers manage data securely.

SOLID Principles - Five design principles (Single Responsibility, Open-Closed, Liskov Substitution, Interface Segregation, Dependency Inversion) for maintainable software.

SQL Injection - Attack inserting malicious SQL code into application queries.

SRI (Subresource Integrity) - Security feature allowing browsers to verify fetched resources haven’t been manipulated.

SSL/TLS (Secure Sockets Layer/Transport Layer Security) - Cryptographic protocols providing secure communication over networks.

SSO (Single Sign-On) - Authentication scheme allowing users to log in once and access multiple applications.

Structured Logging - Logging in a consistent, machine-readable format (typically JSON).

T

Terraform - Infrastructure as Code tool for building, changing, and versioning infrastructure.

Three-Tier Architecture - Architecture pattern separating presentation, application logic, and data layers.

TLS (Transport Layer Security) - See SSL/TLS.

Token - A piece of data representing authorization to access resources.

Tracing - See Distributed Tracing.

Transaction - A unit of work performed against a database that is treated atomically.

Trust Boundary - The border between trusted and untrusted parts of a system.

TTL (Time To Live) - Duration data remains valid in a cache or similar system.

U

Unit Testing - Testing individual units of code in isolation.

V

Vertical Scaling - Adding more power (CPU, RAM) to an existing machine.

VPC (Virtual Private Cloud) - Isolated section of cloud infrastructure where you can launch resources.

VPN (Virtual Private Network) - Encrypted connection between networks over the internet.

W

WAF (Web Application Firewall) - Security layer protecting web applications from common attacks.

WCAG (Web Content Accessibility Guidelines) - Guidelines for making web content accessible to people with disabilities.

WebP - Modern image format providing superior compression compared to JPEG and PNG.

Webhook - HTTP callbacks triggered by specific events, allowing real-time integration between systems.

Write-Through Cache - Caching pattern where data is written to cache and database simultaneously.

X

XSS (Cross-Site Scripting) - Attack injecting malicious scripts into web pages viewed by other users.

Z

Zero Trust Network Access (ZTNA) - Security model requiring verification for every access request regardless of location.

Zero-Downtime Deployment - Deployment strategy ensuring service remains available during updates.