Architecture Review Checklist
Overview
This checklist follows the C4 model (Context, Containers, Components, Code) to review system architecture from the high-level context down to implementation detail. Use it for architecture reviews, system audits, and checks on documentation completeness.
Part 1: System-Wide Review
1.1 System Context (Level 1)
Context Diagram
- [ ] C4 Context diagram exists and is current (updated within 3 months)
- [ ] All external systems and actors are identified
- [ ] System boundaries are clearly defined
- [ ] Business purpose and value proposition are documented
- [ ] System scope is explicit about what’s included/excluded
- [ ] Data flows between system and external entities are shown
- [ ] Trust boundaries are identified (where data crosses security zones)
External Dependencies
- [ ] All third-party services are documented with purpose
- [ ] SLAs with external providers are documented
- [ ] Integration patterns with external systems are defined
- [ ] Fallback strategies exist if external systems fail
- [ ] External system rate limits and quotas are known
- [ ] Data sharing agreements with external parties exist
- [ ] Vendor risk assessments are completed
Users & Actors
- [ ] All user types/personas are identified
- [ ] User authentication methods are documented
- [ ] User authorization levels are defined
- [ ] User journey maps exist for critical workflows
1.2 Cross-System Concerns
Security Architecture
- [ ] Security model is documented (zero trust, perimeter-based, etc.)
- [ ] Authentication strategy is defined system-wide
- [ ] Authorization model is consistent across containers
- [ ] Encryption strategy covers data at rest and in transit
- [ ] Secrets management approach is standardized
- [ ] Security incident response plan exists
- [ ] Compliance requirements are documented (GDPR, SOC2, etc.)
- [ ] Regular security assessments are scheduled
Network Architecture
- [ ] Network topology diagram exists
- [ ] Network segmentation strategy is documented
- [ ] All traffic flows use appropriate encryption (TLS 1.2+)
- [ ] Firewall rules follow least-privilege principle
- [ ] DDoS protection strategy is in place
- [ ] CDN strategy is documented for static content
- [ ] DNS configuration and failover are documented
Data Architecture
- [ ] Data model/schema documentation exists
- [ ] Data flow diagrams show movement between containers
- [ ] Data residency requirements are documented
- [ ] Data classification scheme exists (public, internal, confidential, etc.)
- [ ] Data retention policies are defined
- [ ] Data backup strategy covers all critical data
- [ ] Data recovery procedures are documented and tested
Operational Architecture
- [ ] Deployment architecture is documented (regions, availability zones)
- [ ] Disaster recovery strategy is defined (RTO/RPO targets)
- [ ] Monitoring and observability strategy is consistent
- [ ] Logging strategy is standardized across components
- [ ] Alerting and on-call procedures are defined
- [ ] Capacity planning is documented
- [ ] Cost allocation and monitoring approach exists
Performance & Scalability
- [ ] Performance SLAs/SLOs are defined
- [ ] Load testing strategy exists with results documented
- [ ] Bottlenecks have been identified and addressed
- [ ] Horizontal and vertical scaling strategies are defined
- [ ] Auto-scaling policies are configured and tested
- [ ] Caching strategy is documented (application, CDN, database)
Quality Attributes
- [ ] Architecture Decision Records (ADRs) document key decisions
- [ ] Quality attribute requirements are documented (performance, security, availability)
- [ ] Trade-offs between quality attributes are explicit
- [ ] Non-functional requirements are testable
Part 2: Container Level Review (Level 2)
Review each container (applications, databases, file systems, etc.) individually
2.1 Generic Container Checklist
Container: ___________________
Container Documentation
- [ ] C4 Container diagram shows this container in system context
- [ ] Container purpose and responsibilities are documented
- [ ] Technology choices are justified with ADRs
- [ ] Container repository README is comprehensive
- [ ] API documentation exists (OpenAPI/Swagger for APIs)
- [ ] Deployment architecture diagram exists
- [ ] Configuration management is documented
Container Boundaries
- [ ] Container dependencies (other containers) are identified
- [ ] Communication protocols are specified (REST, gRPC, message queue, etc.)
- [ ] API contracts/interfaces are versioned
- [ ] Integration patterns are documented (sync/async, request/response, pub/sub)
- [ ] Container responsibilities don’t overlap with others
- [ ] Data owned by this container is clearly defined
Security (Container Level)
- [ ] Authentication mechanism is implemented and documented
- [ ] Authorization is enforced for all operations
- [ ] HTTPS/TLS is enforced for all network communication
- [ ] Secrets are stored in secrets manager (not in code/config)
- [ ] Security headers are configured (HSTS, CSP, etc.)
- [ ] Input validation prevents injection attacks
- [ ] Rate limiting prevents abuse
- [ ] OWASP Top 10 vulnerabilities addressed
Networking
- [ ] Network exposure is appropriate (public/private)
- [ ] Firewall rules restrict access to necessary ports only
- [ ] Load balancing strategy is configured
- [ ] Health check endpoints are implemented
- [ ] Service discovery mechanism is configured (if microservices)
- [ ] Network policies restrict container-to-container traffic
Data Management
- [ ] Database/storage technology choice is documented
- [ ] Schema/data model is documented
- [ ] Backup procedures are configured and tested
- [ ] Data migration strategy exists
- [ ] Connection pooling is optimized
- [ ] Database credentials use principle of least privilege
Observability
- [ ] Structured logging is implemented with correlation IDs
- [ ] Metrics are exposed (requests, errors, latency)
- [ ] Distributed tracing is configured (if applicable)
- [ ] Health check endpoint returns meaningful status
- [ ] Readiness probe endpoint exists
- [ ] Dashboards exist for container health
- [ ] Alerts are configured for critical failures
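As an illustration of the "structured logging with correlation IDs" item above, here is a minimal sketch using only the Python standard library. The names `correlation_id`, `JsonFormatter`, `build_logger`, and `handle_request` are hypothetical, and the JSON field layout is an assumption rather than a required schema.

```python
import contextvars
import json
import logging
import uuid
from typing import Optional

# Holds the correlation ID for the current request or task (hypothetical name).
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "correlation_id": correlation_id.get(),
        }
        return json.dumps(payload)


def build_logger(name: str, stream=None) -> logging.Logger:
    """Create a logger that writes JSON lines to `stream` (stderr if None)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def handle_request(logger: logging.Logger, incoming_id: Optional[str] = None) -> str:
    """Reuse the caller's correlation ID if present, otherwise mint a new one."""
    cid = incoming_id or str(uuid.uuid4())
    token = correlation_id.set(cid)
    try:
        logger.info("request handled")
    finally:
        correlation_id.reset(token)  # do not leak the ID into unrelated work
    return cid
```

In a real service the incoming ID would typically come from a request header agreed between containers, and the same ID would be forwarded on outbound calls so that logs can be joined across containers.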
Resilience
- [ ] Graceful degradation strategy exists
- [ ] Circuit breakers prevent cascade failures
- [ ] Retry logic with exponential backoff is implemented
- [ ] Timeout values are configured for external calls
- [ ] Bulkhead pattern isolates critical resources
- [ ] Fallback mechanisms handle dependency failures
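The "retry logic with exponential backoff" item above can be sketched as follows. This is a minimal example, not a prescribed implementation: the function name, default values, and the choice of "full jitter" (a random wait between zero and the exponential ceiling) are assumptions. Production code would usually retry only on errors known to be transient.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.2,
    max_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation` until it succeeds or `max_attempts` is reached.

    The wait after failed attempt n (counting from 0) is a random value
    between 0 and min(max_delay, base_delay * 2**n).
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:  # narrow this to transient errors in real code
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, ceiling))
    raise RuntimeError("max_attempts must be at least 1")
```

The `sleep` parameter is injectable so that tests can run without real delays. Retries should be paired with a timeout on each call and only applied to operations that are safe to repeat.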
Deployment & Operations
- [ ] CI/CD pipeline exists for this container
- [ ] Automated testing covers critical paths
- [ ] Deployment strategy is documented (blue-green, canary, rolling)
- [ ] Rollback procedure is documented and tested
- [ ] Infrastructure as Code defines container infrastructure
- [ ] Environment-specific configuration is externalized
- [ ] Feature flags enable gradual rollout
- [ ] Runbooks exist for common operational tasks
Performance
- [ ] Performance requirements are defined
- [ ] Load testing has been performed
- [ ] Resource limits are configured (CPU, memory)
- [ ] Auto-scaling is configured if needed
- [ ] Caching is implemented where appropriate
- [ ] Async processing is used for long-running operations
Dependencies
- [ ] All dependencies are documented (libraries, frameworks, services)
- [ ] Dependency versions are pinned
- [ ] Vulnerable dependencies are tracked and updated
- [ ] License compliance is verified
- [ ] Unused dependencies are removed
2.2 Frontend Web Application
Container Name: ___________________
Architecture Pattern: ⬜ Server-Side Rendered (SSR) ⬜ Static Site Generation (SSG) ⬜ Client-Side SPA ⬜ Hybrid
Rendering Approach:
- Server-Side Rendering (SSR): HTML generated on server for each request. Recommended for GDS compliance, accessibility, and SEO.
- Static Site Generation (SSG): HTML pre-generated at build time. Good for content-heavy sites with infrequent changes.
- Client-Side SPA: JavaScript renders content in browser. May not meet GDS/accessibility standards without careful implementation.
- Hybrid: Combination of approaches (e.g., SSR for public pages, SPA for authenticated areas).
Note: For UK Government services, the GDS Service Manual calls for progressive enhancement, so the core service must work without JavaScript. Client-side SPAs typically do not meet this requirement unless they are combined with SSR/SSG and hydration.
Frontend-Specific Documentation
- [ ] Rendering strategy is documented and justified
- [ ] Component library/design system is documented
- [ ] State management approach is documented
- [ ] Routing strategy is documented
- [ ] Build and bundling configuration is documented
- [ ] Browser compatibility matrix is defined
- [ ] Progressive enhancement strategy is documented
User Interface
- [ ] Responsive design works across device sizes
- [ ] Accessibility standards met (WCAG 2.1 Level AA minimum)
- [ ] Keyboard navigation is fully functional
- [ ] Screen reader compatibility is tested (JAWS, NVDA, VoiceOver)
- [ ] Color contrast meets accessibility requirements
- [ ] Loading states and error messages are user-friendly
- [ ] Offline capability is implemented (if required)
- [ ] Service works without JavaScript enabled (if GDS/progressive enhancement required)
Frontend Security
- [ ] Content Security Policy (CSP) is configured
- [ ] Subresource Integrity (SRI) for external scripts
- [ ] XSS protection through proper escaping/sanitization (server-side and client-side)
- [ ] CSRF tokens for state-changing operations
- [ ] Sensitive data is not exposed in client-side code
- [ ] API keys are not exposed in frontend code
- [ ] Authentication tokens are stored securely (HttpOnly cookies or secure storage)
- [ ] Input validation on client side (with server-side validation too)
- [ ] Server-rendered HTML (including any serialized hydration state) does not leak sensitive data
Frontend Performance
- [ ] Core Web Vitals metrics are monitored (LCP, INP, CLS; INP replaced FID in 2024)
- [ ] Initial page load uses server-rendered or pre-generated HTML
- [ ] Code splitting reduces initial bundle size (for JavaScript enhancements)
- [ ] Lazy loading for routes and components (if applicable)
- [ ] Images are optimized and use appropriate formats (WebP, AVIF)
- [ ] Critical CSS is inlined or prioritized
- [ ] Third-party scripts are loaded asynchronously
- [ ] Service worker for caching (if applicable)
- [ ] Bundle size is monitored and optimized
- [ ] Time to Interactive (TTI) is acceptable on slow connections
Server-Side Rendering (if applicable)
- [ ] Server renders complete HTML on first request
- [ ] Hydration strategy is documented (if client-side JavaScript added)
- [ ] Server-side data fetching is optimized
- [ ] Caching strategy for rendered pages
- [ ] Streaming rendering used where appropriate
- [ ] Error boundaries prevent white screen on errors
- [ ] Server has appropriate resource limits (CPU, memory)
Progressive Enhancement
- [ ] Core user journey works without JavaScript
- [ ] JavaScript enhances experience but isn’t required
- [ ] Forms submit via traditional POST (not just AJAX)
- [ ] Navigation works without JavaScript router
- [ ] Fallbacks exist for JavaScript-dependent features
- [ ] <noscript> elements provide alternative content where needed
API Integration (from Frontend)
- [ ] API error handling with user-friendly messages
- [ ] Loading states for async operations
- [ ] Request cancellation for unmounted components
- [ ] API response caching strategy
- [ ] Retry logic for failed requests
- [ ] Request deduplication for identical concurrent requests
- [ ] CORS configuration documented (if cross-origin)
State Management
- [ ] State management pattern is consistent (Redux, MobX, Context, etc.)
- [ ] Global vs local state boundaries are clear
- [ ] State persistence strategy is documented (if needed)
- [ ] Side effects are managed consistently
- [ ] Server state and client state are clearly separated
Testing
- [ ] Unit tests for business logic and utilities
- [ ] Component tests for UI components
- [ ] Integration tests for user flows
- [ ] E2E tests for critical paths
- [ ] Visual regression testing (if applicable)
- [ ] Cross-browser testing strategy
- [ ] Accessibility testing automated (axe, Lighthouse)
- [ ] JavaScript-disabled testing (if progressive enhancement required)
GDS Compliance (if applicable)
- [ ] GOV.UK Design System components used
- [ ] GDS Service Manual standards followed
- [ ] Service Assessment criteria met
- [ ] Cookie consent follows GDS patterns
- [ ] Error pages follow GDS standards (404, 500, 503)
- [ ] Start pages and service journey documented
- [ ] Performance budget defined (2-second page load target)
2.3 Backend API Service
Container Name: ___________________
API Type: ⬜ REST ⬜ GraphQL ⬜ gRPC ⬜ Mixed
API-Specific Documentation
- [ ] OpenAPI/Swagger specification is current
- [ ] API versioning strategy is documented
- [ ] Request/response examples are provided
- [ ] Error codes and messages are documented
- [ ] Rate limiting policies are documented
- [ ] Authentication/authorization flows are documented
API Design
- [ ] RESTful principles followed (or GraphQL/gRPC standards)
- [ ] Consistent naming conventions across endpoints
- [ ] HTTP verbs used correctly (GET, POST, PUT, PATCH, DELETE)
- [ ] HTTP status codes used appropriately
- [ ] Pagination implemented for list endpoints
- [ ] Filtering and sorting capabilities exist
- [ ] Versioning strategy allows backward compatibility
- [ ] Deprecation policy and timeline defined
API Security
- [ ] Authentication required for all non-public endpoints
- [ ] Authorization checks at resource level
- [ ] Input validation on all parameters and body
- [ ] SQL injection prevention (parameterized queries/ORM)
- [ ] Rate limiting per user/API key
- [ ] API abuse detection and blocking
- [ ] Sensitive data not logged or exposed in errors
- [ ] CORS configured appropriately
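To illustrate the "SQL injection prevention (parameterized queries/ORM)" item above, here is a small sketch using Python's built-in `sqlite3` module. The `users` table and the `find_user` helper are hypothetical; the point is that user input is passed as a bound parameter and never concatenated into the SQL string.

```python
import sqlite3


def find_user(conn: sqlite3.Connection, email: str):
    """Look up a user by email using a bound parameter, never string formatting."""
    cursor = conn.execute(
        "SELECT id, email FROM users WHERE email = ?",
        (email,),  # the driver passes this as data, not as SQL text
    )
    return cursor.fetchone()


# In-memory database with a hypothetical schema, for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL)")
conn.execute("INSERT INTO users (email) VALUES (?)", ("alice@example.com",))
```

With this approach a classic injection string such as `' OR '1'='1` is treated as a literal email value and matches no rows.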
Data Access Layer
- [ ] Database queries are optimized with proper indexing
- [ ] N+1 query problems are avoided
- [ ] Database connection pooling is configured
- [ ] Transactions are used for multi-step operations
- [ ] Database migrations are versioned and tested
- [ ] Read replicas used for read-heavy operations (if applicable)
- [ ] Caching layer reduces database load
Business Logic
- [ ] Business rules are centralized, not scattered
- [ ] Domain model is well-defined
- [ ] Validation rules are comprehensive
- [ ] Side effects are handled explicitly
- [ ] Idempotency for state-changing operations
- [ ] Event publishing for domain events (if event-driven)
API Performance
- [ ] Response time SLAs are defined and monitored
- [ ] Slow query detection and alerting
- [ ] Request/response compression enabled
- [ ] ETags for conditional requests
- [ ] Batch operations available for bulk updates
- [ ] Async processing for long-running operations
- [ ] Background jobs for non-critical tasks
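The "ETags for conditional requests" item above can be sketched without any web framework. This is a simplified illustration: `make_etag` and `conditional_get` are hypothetical helpers, and hashing the whole body is only one possible way to derive an ETag (a version number or last-modified stamp is often cheaper).

```python
import hashlib
from typing import Dict, Optional, Tuple


def make_etag(body: bytes) -> str:
    """Derive a quoted ETag value from the response body."""
    return '"' + hashlib.sha256(body).hexdigest() + '"'


def conditional_get(
    body: bytes, if_none_match: Optional[str]
) -> Tuple[int, Dict[str, str], bytes]:
    """Return (status, headers, body) for a GET that honors If-None-Match."""
    etag = make_etag(body)
    headers = {"ETag": etag}
    if if_none_match is not None:
        candidates = [tag.strip() for tag in if_none_match.split(",")]
        # If-None-Match uses weak comparison, so ignore a leading W/ prefix.
        candidates = [tag[2:] if tag.startswith("W/") else tag for tag in candidates]
        if "*" in candidates or etag in candidates:
            return 304, headers, b""  # client copy is current: send no body
    return 200, headers, body
```

A client that cached the first response sends the stored ETag back in `If-None-Match`; an unchanged resource then costs only a short 304 response instead of the full payload.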
Integration Testing
- [ ] Contract tests validate API contracts
- [ ] Integration tests cover happy paths and error cases
- [ ] Load tests validate performance under stress
- [ ] API compatibility tests prevent breaking changes
2.3a Hybrid Web Application + API Container
Container Name: ___________________
Use Case: Single container serving both web pages (HTML) and API endpoints (JSON/XML). Common for public-facing services that need both a user interface and programmatic access.
Examples:
- Government service with public web pages AND an API for partners
- Documentation site that also exposes data via API
- Admin interface bundled with management API
- Legacy monolith serving both concerns
Hybrid Container Documentation
- [ ] Clear separation between web routes and API routes is documented
- [ ] URL structure distinguishes web pages from API (e.g., /api/* for APIs)
- [ ] Versioning strategy for APIs doesn’t affect web pages
- [ ] Deployment strategy handles both concerns
- [ ] Whether web and API share authentication/authorization is documented
Architecture Clarity
- [ ] Web page routes are clearly identified (e.g., /, /about, /contact)
- [ ] API routes are clearly identified (e.g., /api/v1/, /api/v2/)
- [ ] Static asset routes documented (e.g., /assets/, /public/)
- [ ] Health check endpoints separated (e.g., /health for web, /api/health for API)
- [ ] Middleware/filters distinguish between web and API requests
Content Type Handling
- [ ] Content negotiation correctly routes to HTML or JSON responses
- [ ] Accept headers properly handled (text/html vs application/json)
- [ ] Error responses appropriate for context (HTML error pages vs JSON errors)
- [ ] 404 handling differs between web (error page) and API (JSON error)
- [ ] CORS only applied to API routes, not web pages
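The content type items above (HTML error pages for web routes, JSON errors for API routes) can be sketched as a small routing decision. This is a deliberately simplified illustration: the `/api/` prefix follows the URL convention suggested earlier in this section, the helper names are hypothetical, and the `Accept` handling ignores quality values (`q=`), which a real implementation or framework would need to respect.

```python
import json
from typing import Tuple


def wants_json(path: str, accept: str) -> bool:
    """API routes always get JSON; elsewhere, JSON only if HTML is not acceptable."""
    if path.startswith("/api/"):
        return True
    accept = accept.lower()
    return "application/json" in accept and "text/html" not in accept


def not_found(path: str, accept: str = "") -> Tuple[int, str, str]:
    """Return (status, content_type, body) for a missing resource."""
    if wants_json(path, accept):
        body = json.dumps({"error": "not_found", "path": path})
        return 404, "application/json", body
    body = "<h1>Page not found</h1><p>Check the address and try again.</p>"
    return 404, "text/html; charset=utf-8", body
```

Keeping this decision in one place (for example in shared middleware) helps ensure API clients never receive an HTML error page by accident.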
Security Considerations
- [ ] CSRF protection is enabled for web forms and for any API endpoint that authenticates via cookies; only token-authenticated API endpoints are exempt
- [ ] API authentication separate from web session management (if different users)
- [ ] Rate limiting configured separately for web vs API traffic
- [ ] Security headers appropriate for each (CSP for web, CORS for API)
- [ ] Session management doesn’t interfere with API token validation
Web Functionality (within Hybrid Container)
- [ ] Server-side rendering of HTML (SSR pattern)
- [ ] Template engine configured (Jinja2, EJS, Handlebars, etc.)
- [ ] Progressive enhancement principles followed
- [ ] Static file serving configured and optimized
- [ ] Client-side JavaScript enhancement is optional, not required
- [ ] Web forms work without JavaScript
API Functionality (within Hybrid Container)
- [ ] JSON/XML serialization properly configured
- [ ] API documentation accessible (e.g., /api/docs via Swagger UI)
- [ ] API versioning doesn’t conflict with web routes
- [ ] API responses don’t accidentally return HTML
- [ ] OpenAPI spec generation automated
Performance Separation
- [ ] Caching strategies differ between web pages and API
- [ ] Static assets cached aggressively (long TTL)
- [ ] API responses cached appropriately (shorter TTL, conditional caching)
- [ ] CDN configuration handles both static files and dynamic content
- [ ] Resource limits consider both web traffic and API calls
Monitoring & Observability
- [ ] Metrics separated by traffic type (web vs API)
- [ ] Logging distinguishes web requests from API calls
- [ ] Dashboards show both web page performance and API performance
- [ ] Alerts configured for both web availability and API availability
- [ ] Usage analytics track web visitors separately from API consumers
Testing Strategy
- [ ] E2E tests for web user journeys
- [ ] Contract tests for API endpoints
- [ ] Integration tests verify web and API don’t interfere
- [ ] Load testing covers realistic mix of web and API traffic
- [ ] Accessibility testing for web pages
- [ ] API compatibility testing
When to Consider Splitting
Consider separating into distinct containers when:
- [ ] Web and API have significantly different scaling needs
- [ ] Different teams own web vs API
- [ ] API versioning complicates web deployment
- [ ] Security requirements differ substantially
- [ ] Performance characteristics conflict (e.g., API needs high throughput, web needs low latency)
- [ ] Independent deployment cycles are needed
Advantages of Hybrid Approach
- Simpler infrastructure (single deployment)
- Shared authentication/authorization logic
- Single codebase for related concerns
- Easier to maintain consistency
- Lower operational overhead
Disadvantages of Hybrid Approach
- Harder to scale independently
- Deployment risk affects both concerns
- Potential for security configuration conflicts
- Mixing of concerns can complicate code
- Different performance characteristics may conflict
2.4 API Gateway
Container Name: ___________________
Gateway-Specific Documentation
- [ ] Routing rules are documented
- [ ] Transformation logic is documented
- [ ] Rate limiting policies per client/endpoint
- [ ] Authentication/authorization flows
- [ ] Backend service mapping
Gateway Configuration
- [ ] Routes to backend services are defined
- [ ] Service discovery is configured (if dynamic)
- [ ] Health checks for backend services
- [ ] Timeout configuration for upstream services
- [ ] Retry and circuit breaker policies
- [ ] Request/response transformation rules
- [ ] API composition/aggregation logic (if applicable)
Security at Gateway
- [ ] SSL/TLS termination configured
- [ ] WAF rules protect against common attacks
- [ ] API key validation
- [ ] JWT validation and claims extraction
- [ ] IP allowlisting/denylisting
- [ ] DDoS protection
- [ ] OAuth 2.0 integration (if applicable)
Traffic Management
- [ ] Rate limiting per client, endpoint, or globally
- [ ] Quota management for API consumers
- [ ] Traffic splitting for canary deployments
- [ ] A/B testing capabilities (if needed)
- [ ] Request prioritization or throttling
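One common way to implement the "rate limiting per client" item above is a token bucket. The sketch below is an in-process illustration only; the class names and parameters are hypothetical, and a real gateway would normally keep the counters in shared storage so that all gateway instances enforce the same limit.

```python
import time
from typing import Callable, Dict


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._rate = rate
        self._capacity = capacity
        self._clock = clock
        self._tokens = capacity
        self._updated = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self._clock()
        elapsed = now - self._updated
        self._updated = now
        # Refill for the time that has passed, capped at the bucket size.
        self._tokens = min(self._capacity, self._tokens + elapsed * self._rate)
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


class PerClientLimiter:
    """Keep one bucket per client ID (for example, per API key)."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._rate = rate
        self._capacity = capacity
        self._clock = clock
        self._buckets: Dict[str, TokenBucket] = {}

    def allow(self, client_id: str) -> bool:
        bucket = self._buckets.get(client_id)
        if bucket is None:
            bucket = TokenBucket(self._rate, self._capacity, self._clock)
            self._buckets[client_id] = bucket
        return bucket.allow()
```

A rejected request would usually be answered with HTTP 429 and logged as a security event, in line with the observability items for the gateway.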
Observability at Gateway
- [ ] Request/response logging with correlation IDs
- [ ] API usage metrics per client/endpoint
- [ ] Latency tracking for backend services
- [ ] Error rate monitoring
- [ ] Security event logging (failed auth, rate limit hits)
2.5 Message Broker / Event Bus
Container Name: ___________________
Messaging Documentation
- [ ] Event schemas are documented
- [ ] Topic/queue naming conventions
- [ ] Message flow diagrams
- [ ] Producer and consumer mappings
- [ ] Event versioning strategy
- [ ] Poison message handling
Message Broker Configuration
- [ ] Message retention policies defined
- [ ] Dead letter queue configured
- [ ] Message ordering guarantees documented
- [ ] Partitioning strategy (if applicable)
- [ ] Replication configuration for high availability
- [ ] Access control per topic/queue
Event Design
- [ ] Event schemas are versioned
- [ ] Events are immutable
- [ ] Events contain sufficient context
- [ ] Event names are meaningful and consistent
- [ ] Event size is optimized
- [ ] Backward compatibility is maintained
Reliability
- [ ] At-least-once/exactly-once delivery guarantees documented
- [ ] Idempotent consumer handling
- [ ] Retry logic for failed message processing
- [ ] Circuit breakers for downstream dependencies
- [ ] Message acknowledgment strategy
- [ ] Consumer lag monitoring and alerting
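Under at-least-once delivery the same message can arrive more than once, which is why the "idempotent consumer handling" item above matters. The following is a minimal sketch under stated assumptions: each message carries a unique `id` field, and the in-memory set stands in for a durable store of processed IDs. Both are assumptions for illustration.

```python
from typing import Callable, Dict, Set


class IdempotentConsumer:
    """Skip messages whose ID has already been processed."""

    def __init__(self, handler: Callable[[Dict], None]) -> None:
        self._handler = handler
        self._seen: Set[str] = set()  # stand-in for a durable store

    def consume(self, message: Dict) -> bool:
        """Return True if the message was processed, False if it was a duplicate."""
        message_id = message["id"]
        if message_id in self._seen:
            return False
        self._handler(message)
        # Record the ID only after the handler succeeds, so a failure can be retried.
        self._seen.add(message_id)
        return True
```

In a real consumer the "seen" record and the handler's side effects should be committed atomically (for example in the same database transaction); otherwise a crash between the two steps can still cause a duplicate.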
Security
- [ ] Authentication for producers and consumers
- [ ] Authorization per topic/queue
- [ ] Encryption in transit
- [ ] Encryption at rest (if required)
- [ ] Sensitive data handling in messages
Performance
- [ ] Throughput requirements defined
- [ ] Consumer scaling strategy
- [ ] Batch processing for efficiency
- [ ] Message compression (if applicable)
- [ ] Consumer group configuration optimized
2.6 Database / Data Store
Container Name: ___________________
Type: ⬜ Relational ⬜ Document ⬜ Key-Value ⬜ Graph ⬜ Time-Series ⬜ Other
Database Documentation
- [ ] Entity-Relationship diagram (ERD) or data model
- [ ] Schema documentation with descriptions
- [ ] Index strategy documented
- [ ] Query patterns documented
- [ ] Constraints and relationships defined
- [ ] Migration history maintained
Schema Design
- [ ] Normalization appropriate for use case
- [ ] Indexes support common query patterns
- [ ] Foreign key constraints enforce referential integrity
- [ ] Check constraints validate data quality
- [ ] Appropriate data types chosen
- [ ] NULL handling is explicit
- [ ] Audit fields (created_at, updated_at, etc.) exist
Database Security
- [ ] Encryption at rest enabled
- [ ] Encryption in transit (TLS) enforced
- [ ] Principle of least privilege for database users
- [ ] Separate credentials per application/service
- [ ] Row-level security (if required)
- [ ] Column-level encryption for sensitive data
- [ ] Database audit logging enabled
- [ ] Regular security patching schedule
Backup & Recovery
- [ ] Automated backup schedule configured
- [ ] Backup retention policy defined
- [ ] Point-in-time recovery capability
- [ ] Backup encryption enabled
- [ ] Backup restoration tested regularly (last test: _____)
- [ ] Cross-region backup for disaster recovery
- [ ] RPO and RTO defined and achievable
Performance
- [ ] Query performance is monitored
- [ ] Slow query log is reviewed regularly
- [ ] Execution plans for critical queries are optimized
- [ ] Connection pooling configured appropriately
- [ ] Database statistics are updated regularly
- [ ] Partitioning strategy (if applicable)
- [ ] Read replicas for read-heavy workloads
- [ ] Query caching configured
Maintenance
- [ ] Database upgrade strategy documented
- [ ] Migration testing process defined
- [ ] Rollback procedures for failed migrations
- [ ] Vacuum/analyze jobs scheduled (PostgreSQL)
- [ ] Index maintenance procedures
- [ ] Storage growth monitoring and alerting
- [ ] Archival strategy for old data
2.7 Cache Layer (Redis, Memcached, etc.)
Container Name: ___________________
Cache Documentation
- [ ] Cache strategy documented (cache-aside, write-through, etc.)
- [ ] Cache key naming conventions
- [ ] TTL policies per cache type
- [ ] Cache invalidation strategy
- [ ] Cache warming procedures
Cache Design
- [ ] Cache key design prevents collisions
- [ ] Appropriate TTL values set
- [ ] Cache stampede prevention (lock, probabilistic early expiration)
- [ ] Cache invalidation strategy handles updates
- [ ] Cache size limits configured
- [ ] Eviction policy appropriate (LRU, LFU, etc.)
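The "cache stampede prevention (lock, ...)" item above refers to stopping many callers from recomputing the same expired key at once. Here is a minimal in-process sketch of the lock-based variant; `SingleFlightCache` is a hypothetical name, and a distributed cache would need a distributed lock or a similar mechanism instead of `threading.Lock`.

```python
import threading
import time
from typing import Any, Callable, Dict, Optional, Tuple


class SingleFlightCache:
    """Cache-aside store where only one caller per key recomputes a missing value."""

    def __init__(self, ttl_seconds: float) -> None:
        self._ttl = ttl_seconds
        self._values: Dict[str, Tuple[float, Any]] = {}
        self._key_locks: Dict[str, threading.Lock] = {}
        self._guard = threading.Lock()  # protects the lock table

    def _fresh(self, key: str) -> Optional[Tuple[float, Any]]:
        entry = self._values.get(key)
        if entry is not None and entry[0] > time.monotonic():
            return entry
        return None

    def get(self, key: str, loader: Callable[[], Any]) -> Any:
        entry = self._fresh(key)
        if entry is not None:
            return entry[1]
        with self._guard:
            lock = self._key_locks.setdefault(key, threading.Lock())
        with lock:
            # Re-check: another thread may have filled the key while we waited.
            entry = self._fresh(key)
            if entry is not None:
                return entry[1]
            value = loader()
            self._values[key] = (time.monotonic() + self._ttl, value)
            return value
```

With this pattern, concurrent requests for a cold key wait briefly for the first loader call to finish instead of all hitting the database at once.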
Resilience
- [ ] Application handles cache unavailability gracefully
- [ ] Cache warming on startup (if needed)
- [ ] Cache cluster configuration for high availability
- [ ] Failover strategy documented
- [ ] Monitoring for cache hit/miss ratio
- [ ] Alerting for low hit ratios
Security
- [ ] Authentication enabled
- [ ] Network isolation (not public)
- [ ] Encryption in transit (if required)
- [ ] No sensitive data in cache keys
2.8 Background Job Processor / Worker
Container Name: ___________________
Worker Documentation
- [ ] Job types and purposes documented
- [ ] Job queue architecture diagram
- [ ] Retry policies per job type
- [ ] Job priority levels defined
- [ ] Job scheduling strategy
Job Design
- [ ] Jobs are idempotent
- [ ] Job payloads are minimal (references, not large data)
- [ ] Job timeout values configured
- [ ] Job retry logic with exponential backoff
- [ ] Dead letter queue for permanently failed jobs
- [ ] Job status tracking for long-running jobs
Reliability
- [ ] Job failure monitoring and alerting
- [ ] Stuck job detection
- [ ] Job replay capability for failures
- [ ] Graceful shutdown handling
- [ ] Job ordering guarantees (if required)
Performance
- [ ] Worker scaling strategy (manual or auto)
- [ ] Queue depth monitoring
- [ ] Processing time per job type monitored
- [ ] Resource limits per job type
- [ ] Batch processing for efficiency
Observability
- [ ] Job execution logs with context
- [ ] Job metrics (queued, processing, completed, failed)
- [ ] Job execution time tracking
- [ ] Queue depth metrics and alerting
- [ ] Failed job analysis and reporting
2.9 File Storage / Object Storage
Container Name: ___________________
Storage Documentation
- [ ] Storage structure and organization documented
- [ ] File naming conventions defined
- [ ] Access patterns documented
- [ ] Lifecycle policies documented
- [ ] CDN integration (if applicable)
Storage Design
- [ ] Appropriate storage class for access patterns (hot, warm, cold)
- [ ] Directory/bucket structure is logical
- [ ] File versioning enabled (if required)
- [ ] Lifecycle policies archive or delete old files
- [ ] Large file upload strategy (multipart, resumable)
- [ ] File metadata strategy
Security
- [ ] Bucket/container policies restrict access
- [ ] Pre-signed URLs for temporary access
- [ ] Encryption at rest enabled
- [ ] Encryption in transit enforced
- [ ] Access logging enabled
- [ ] Public access blocked unless intentional
- [ ] Cross-Origin Resource Sharing (CORS) configured properly
Performance
- [ ] CDN configured for frequently accessed files
- [ ] Appropriate caching headers set
- [ ] File compression for compressible content
- [ ] Transfer acceleration enabled (if available)
Cost Optimization
- [ ] Storage class tiers used appropriately
- [ ] Lifecycle policies reduce storage costs
- [ ] Orphaned files cleaned up regularly
- [ ] Storage costs monitored per bucket/container
2.10 Third-Party Integration Service
Container Name: ___________________
Integration Documentation
- [ ] Integration architecture diagram
- [ ] Third-party API documentation is referenced
- [ ] Authentication method documented
- [ ] Rate limits from provider documented
- [ ] Error codes and handling documented
- [ ] Webhook configuration (if applicable)
Integration Design
- [ ] Abstraction layer isolates third-party API
- [ ] Fallback behavior is defined for when the third party is unavailable
- [ ] Idempotency keys used to prevent duplicates
- [ ] Request retries with exponential backoff
- [ ] Circuit breaker prevents cascade failures
- [ ] Timeout values configured appropriately
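The "circuit breaker prevents cascade failures" item above can be sketched as a small state machine: closed (calls pass), open (calls are rejected immediately), and half-open (one trial call after a cool-down). The class below is a simplified, single-threaded illustration with hypothetical names and defaults; it is not thread-safe as written.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._threshold = failure_threshold
        self._reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, operation: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_timeout:
                raise CircuitOpenError("circuit is open; call rejected")
            # Cool-down elapsed: half-open, let one trial call through.
        try:
            result = operation()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self._threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Callers would typically catch `CircuitOpenError` and switch to the documented fallback behavior rather than waiting on a dependency that is known to be failing.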
Reliability
- [ ] Third-party service SLA is known
- [ ] Monitoring of third-party availability
- [ ] Graceful degradation when service unavailable
- [ ] Webhook signature verification
- [ ] Webhook replay handling (idempotency)
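For the "webhook signature verification" item above, many providers sign the raw request body with a shared secret. The sketch below assumes a hex-encoded HMAC-SHA256 signature; the exact header name, encoding, and signed content vary by provider, so check the provider's documentation before relying on this shape.

```python
import hashlib
import hmac


def sign(secret: bytes, payload: bytes) -> str:
    """Hex-encoded HMAC-SHA256 of the raw request body."""
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()


def verify_webhook(secret: bytes, payload: bytes, received_signature: str) -> bool:
    """Compare signatures in constant time to avoid timing side channels."""
    expected = sign(secret, payload)
    return hmac.compare_digest(expected.encode(), received_signature.encode())
```

Verification must run against the exact bytes received, before any JSON parsing or re-serialization, because even a whitespace change produces a different signature.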
Security
- [ ] API keys stored in secrets manager
- [ ] Credentials rotated regularly
- [ ] Minimal permissions are requested from the third party
- [ ] Data shared with the third party is minimized
- [ ] Data processing agreement in place (if applicable)
Cost Management
- [ ] API usage tracked against rate limits
- [ ] Cost per API call is monitored
- [ ] Budget alerts configured
- [ ] Optimization opportunities identified (caching, batching)
Part 3: Component Level Review (Level 3)
Review key components within each container
Component: ___________________
Container: ___________________
Component Documentation
- [ ] C4 Component diagram shows this component in container context
- [ ] Component purpose and responsibilities are documented
- [ ] Component interfaces/APIs are documented
- [ ] Design patterns used are identified
- [ ] Code location is documented (repository, module, package)
Component Design
- [ ] Single Responsibility Principle is followed
- [ ] Component has clear, cohesive purpose
- [ ] Dependencies on other components are minimal
- [ ] Interfaces are well-defined and stable
- [ ] Component is testable in isolation
- [ ] Appropriate design patterns are used
Component Boundaries
- [ ] Interactions with other components are documented
- [ ] Data passed between components is specified
- [ ] Error handling between components is defined
- [ ] Component doesn’t reach across architectural layers
Security (Component Level)
- [ ] Input validation is performed on all entry points
- [ ] Authorization checks are performed where needed
- [ ] Sensitive data is handled securely
- [ ] Output encoding prevents injection vulnerabilities
- [ ] Security-relevant actions are logged
Data Handling
- [ ] Data transformations are documented
- [ ] Business logic is separated from data access
- [ ] Validation rules are enforced
- [ ] Error states are handled appropriately
Testing
- [ ] Unit tests exist with adequate coverage (>70%)
- [ ] Integration tests cover component interactions
- [ ] Test data management strategy exists
- [ ] Edge cases and error conditions are tested
- [ ] Performance tests exist for critical components
Part 4: Code Level Review (Level 4)
Review implementation quality within components
Code Quality
Code Standards
- [ ] Coding standards are documented and followed
- [ ] Linting rules are configured and passing
- [ ] Code formatting is consistent and automated
- [ ] Code review process is followed for all changes
- [ ] Technical debt is documented with remediation plans
Code Structure
- [ ] Code is organized logically (by feature/domain, not by type)
- [ ] Naming conventions are clear and consistent
- [ ] Functions/methods are appropriately sized
- [ ] Cyclomatic complexity is managed
- [ ] Code duplication is minimized (DRY principle)
- [ ] SOLID principles are followed where applicable
Error Handling
- [ ] Errors are handled at appropriate levels
- [ ] Error messages are meaningful for debugging
- [ ] Errors don’t expose sensitive information to users
- [ ] Exceptions are used for exceptional cases only
- [ ] Resource cleanup happens in finally blocks/defer statements
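The error handling items above can be sketched in Python as follows. `FakeConnection` and `load_orders` are hypothetical stand-ins, not part of any real library: the detailed error goes to the log, the caller receives a generic message, and cleanup runs in a `finally` block.

```python
import logging

logger = logging.getLogger("orders")


class FakeConnection:
    """Hypothetical stand-in for a real database connection."""

    def __init__(self) -> None:
        self.closed = False

    def query(self, sql: str) -> list:
        # Simulated failure whose message contains internal details.
        raise RuntimeError("connection to db-internal-7:5432 refused")

    def close(self) -> None:
        self.closed = True


def load_orders(conn: FakeConnection) -> dict:
    try:
        rows = conn.query("SELECT * FROM orders")
        return {"ok": True, "rows": rows}
    except RuntimeError:
        # Full detail is logged for debugging, not returned to the user.
        logger.exception("order query failed")
        return {"ok": False, "error": "Could not load orders. Please try again."}
    finally:
        # Runs on both the success and the failure path.
        conn.close()
```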
Testing (Code Level)
- [ ] Unit tests are fast and isolated
- [ ] Tests follow AAA pattern (Arrange, Act, Assert)
- [ ] Test names clearly describe what’s being tested
- [ ] Mock/stub usage is appropriate
- [ ] Tests are maintainable and reliable (not flaky)
- [ ] Code coverage metrics are tracked
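A short sketch of the AAA pattern and descriptive test names referred to above, using the standard `unittest` module. `apply_discount` is a hypothetical function under test, included only so the example is self-contained.

```python
import unittest


def apply_discount(price: float, percent: float) -> float:
    """Hypothetical function under test."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(price * (1 - percent / 100), 2)


class ApplyDiscountTest(unittest.TestCase):
    def test_ten_percent_discount_reduces_price(self):
        # Arrange
        price, percent = 200.0, 10.0
        # Act
        result = apply_discount(price, percent)
        # Assert
        self.assertEqual(result, 180.0)

    def test_percent_above_100_is_rejected(self):
        # Error conditions get their own clearly named test.
        with self.assertRaises(ValueError):
            apply_discount(200.0, 150.0)
```

The tests touch no network, disk, or clock, which keeps them fast and isolated.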
Documentation (Code Level)
- [ ] Complex algorithms are explained with comments
- [ ] Public APIs have comprehensive documentation
- [ ] TODOs are tracked and addressed
- [ ] Code comments explain “why” not “what”
- [ ] Examples exist for complex functionality
Dependencies (Code Level)
- [ ] Dependencies are appropriate for the problem
- [ ] Heavy dependencies are justified
- [ ] Dependency injection is used where appropriate
- [ ] Circular dependencies don’t exist
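One way to read the dependency injection item above is constructor injection, sketched here in Python. `Notifier`, `RecordingNotifier`, and `SignupService` are hypothetical names for illustration only.

```python
from typing import List, Protocol, Tuple


class Notifier(Protocol):
    def send(self, recipient: str, message: str) -> None: ...


class RecordingNotifier:
    """Test double that records messages instead of sending them."""

    def __init__(self) -> None:
        self.sent: List[Tuple[str, str]] = []

    def send(self, recipient: str, message: str) -> None:
        self.sent.append((recipient, message))


class SignupService:
    # The dependency is passed in rather than constructed inside the class.
    def __init__(self, notifier: Notifier) -> None:
        self._notifier = notifier

    def register(self, email: str) -> None:
        self._notifier.send(email, "Welcome!")
```

Because `SignupService` depends only on the `Notifier` interface, a test can pass in `RecordingNotifier` while production code passes in a real sender.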
Part 5: Documentation Completeness Across Repositories
Repository-Specific Documentation
Main Application Repository
- [ ] Comprehensive README with project overview
- [ ] Architecture documentation (C4 diagrams, ADRs)
- [ ] Getting started guide for new developers
- [ ] Development environment setup instructions
- [ ] Build and deployment instructions
- [ ] Contributing guidelines
- [ ] Code of conduct (if open source/large team)
- [ ] License file
- [ ] CHANGELOG documenting releases
- [ ] API documentation (if applicable)
Infrastructure Repository (IaC)
- [ ] README explaining infrastructure architecture
- [ ] Environment configuration documentation
- [ ] Terraform/CloudFormation module documentation
- [ ] Network architecture diagrams
- [ ] Security group/firewall rule documentation
- [ ] Disaster recovery procedures
- [ ] Cost optimization notes
- [ ] Resource tagging strategy
CI/CD Pipeline Repository
- [ ] Pipeline architecture documentation
- [ ] Build process documentation
- [ ] Deployment strategies explained
- [ ] Environment promotion process
- [ ] Rollback procedures
- [ ] Pipeline troubleshooting guide
- [ ] Secrets management approach
- [ ] Test automation strategy
Documentation Repository (if separate)
- [ ] Architecture Decision Records (ADRs)
- [ ] System architecture diagrams (C4 models)
- [ ] API documentation and specifications
- [ ] Database schema documentation
- [ ] Integration guides for external systems
- [ ] Security documentation
- [ ] Runbooks for operational procedures
- [ ] Post-mortem reports
- [ ] Troubleshooting guides
Shared Libraries/Packages Repository
- [ ] README with library purpose and usage
- [ ] API documentation for public interfaces
- [ ] Examples demonstrating common use cases
- [ ] Versioning and release notes
- [ ] Breaking changes documented
- [ ] Contributing guidelines
- [ ] Testing documentation
Configuration Repository
- [ ] Configuration schema documentation
- [ ] Environment-specific configurations explained
- [ ] Secret management documented
- [ ] Configuration change procedures
- [ ] Validation rules documented
Microservices Repositories (each service)
- [ ] Service purpose and boundaries documented
- [ ] API contract (OpenAPI/gRPC spec)
- [ ] Dependencies on other services documented
- [ ] Local development setup
- [ ] Service-specific configuration
- [ ] Health check implementation
- [ ] Monitoring and alerting configuration
- [ ] Service runbook for on-call
Cross-Repository Documentation
Wiki/Confluence/Central Documentation
- [ ] System overview and context
- [ ] Architecture diagrams accessible to all teams
- [ ] Onboarding documentation for new team members
- [ ] Development workflow documentation
- [ ] Release process documentation
- [ ] Incident response procedures
- [ ] Contact information and team structure
- [ ] Decision log and ADRs
- [ ] Meeting notes and design reviews
- [ ] Glossary of terms and acronyms
API Documentation Portal
- [ ] All APIs are documented in one place
- [ ] API versioning strategy is clear
- [ ] Authentication/authorization guide
- [ ] Code examples in multiple languages
- [ ] Postman/Insomnia collections available
- [ ] Rate limiting and quotas documented
- [ ] Deprecation policy and timeline
- [ ] Support and contact information
Operational Documentation
- [ ] System dependencies map
- [ ] Disaster recovery runbooks
- [ ] Incident response playbooks
- [ ] On-call procedures and escalation
- [ ] Maintenance windows and procedures
- [ ] Performance baselines and SLAs
- [ ] Capacity planning documentation
- [ ] Cost management and budgeting
Security Documentation
- [ ] Security architecture overview
- [ ] Authentication and authorization model
- [ ] Data classification and handling
- [ ] Compliance requirements and evidence
- [ ] Security incident procedures
- [ ] Vulnerability management process
- [ ] Access control matrix
- [ ] Security testing procedures
Review Summary
System Details
System Name: _____________________
Review Date: _____________________
Reviewer(s): _____________________
System Owner: _____________________
Architecture Maturity: ⬜ Initial ⬜ Developing ⬜ Defined ⬜ Managed ⬜ Optimized
Completeness by C4 Level
- Context (Level 1): _____% complete
- Container (Level 2): _____% complete
- Component (Level 3): _____% complete
- Code (Level 4): _____% complete
Documentation Completeness
- In-Repository Docs: _____% complete
- Central Documentation: _____% complete
- Cross-Repository Docs: _____% complete
Risk Assessment by Area
| Area | Risk Level | Notes |
|---|---|---|
| Security | ⬜ Low ⬜ Medium ⬜ High ⬜ Critical | |
| Resilience | ⬜ Low ⬜ Medium ⬜ High ⬜ Critical | |
| Performance | ⬜ Low ⬜ Medium ⬜ High ⬜ Critical | |
| Observability | ⬜ Low ⬜ Medium ⬜ High ⬜ Critical | |
| Documentation | ⬜ Low ⬜ Medium ⬜ High ⬜ Critical | |
| Operations | ⬜ Low ⬜ Medium ⬜ High ⬜ Critical | |
Priority Action Items
Critical (Fix immediately)
| Item | Owner | Target Date | Status |
|---|---|---|---|
High (Fix within sprint)
| Item | Owner | Target Date | Status |
|---|---|---|---|
Medium (Plan for next quarter)
| Item | Owner | Target Date | Status |
|---|---|---|---|
Low (Tech debt backlog)
| Item | Owner | Target Date | Status |
|---|---|---|---|
Overall Recommendations
Strategic observations and improvement recommendations
Next Review Date
Scheduled for: _____________________
Sign-off
Architect: _____________________ Date: _____
System Owner: _____________________ Date: _____
Security Review: _____________________ Date: _____
Appendix: Glossary of Terms
A
AAA Pattern (Arrange, Act, Assert) - A testing pattern where tests are structured in three phases: Arrange (setup test data), Act (execute the behavior), Assert (verify the outcome).
Access Control Matrix - A table documenting which users, roles, or services have what level of access to which resources.
ADR (Architecture Decision Record) - A document that captures an important architectural decision along with its context and consequences.
API (Application Programming Interface) - A set of definitions and protocols for building and integrating application software.
API Composition - Combining multiple backend API calls into a single aggregated response, typically at the API gateway layer.
API Contract - A formal agreement defining the structure, behavior, and expectations of an API, including request/response formats.
API Gateway - A server that acts as an API front-end, receiving API requests, enforcing throttling and security policies, and routing requests to appropriate backend services.
APM (Application Performance Monitoring) - Tools and practices for monitoring and managing the performance and availability of software applications.
Async Processing - Executing operations asynchronously, allowing the main program flow to continue without waiting for the operation to complete.
At-Least-Once Delivery - A message delivery guarantee where messages may be delivered one or more times, but never lost.
Auto-Scaling - Automatically adjusting computing resources based on current demand or predefined metrics.
AVIF (AV1 Image File Format) - A modern image format offering better compression than JPEG and PNG.
B
Backend for Frontend (BFF) - An architectural pattern where each user-facing application has its own tailored backend service.
Bastion Host - A server positioned as a gateway between trusted and untrusted networks, providing controlled access.
Batch Processing - Processing multiple items together in a group rather than individually, typically for efficiency.
Blue-Green Deployment - A deployment strategy using two identical production environments, switching traffic between them for zero-downtime deployments.
Bulkhead Pattern - An isolation pattern that partitions resources to prevent failures in one part from cascading to others.
Bus Factor - The minimum number of team members who need to be unavailable before a project stalls due to lack of knowledge.
C
C4 Model - A hierarchical approach to software architecture diagramming with four levels: Context, Containers, Components, and Code.
Cache Stampede - A situation where many requests simultaneously try to regenerate the same cache entry, causing system overload.
Cache-Aside - A caching pattern where the application checks the cache before querying the data source, and populates the cache on a miss.
Canary Deployment - A deployment strategy that gradually rolls out changes to a small subset of users before full deployment.
CDN (Content Delivery Network) - A distributed network of servers that deliver web content based on geographic location of the user.
Circuit Breaker - A design pattern that prevents an application from repeatedly trying to execute an operation likely to fail, allowing it to recover.
CLS (Cumulative Layout Shift) - A Core Web Vital metric measuring visual stability by quantifying unexpected layout shifts.
Component (C4) - Level 3 of the C4 model; the building blocks within a container that work together to deliver functionality.
Connection Pooling - Maintaining a cache of database connections to be reused, reducing the overhead of creating new connections.
Container (C4) - Level 2 of the C4 model; an executable unit like a web application, database, or microservice.
Context (C4) - Level 1 of the C4 model; the highest level view showing the system, its users, and external dependencies.
Core Web Vitals - Google’s metrics for measuring user experience: Largest Contentful Paint (LCP), First Input Delay (FID), and Cumulative Layout Shift (CLS). Interaction to Next Paint (INP) replaced FID as a Core Web Vital in March 2024.
CORS (Cross-Origin Resource Sharing) - A browser mechanism, driven by HTTP headers, that lets a server declare which other origins may access its resources.
CSRF (Cross-Site Request Forgery) - An attack that tricks a victim into submitting malicious requests using their authenticated session.
CSP (Content Security Policy) - An HTTP header that helps prevent XSS attacks by specifying which dynamic resources are allowed to load.
Cyclomatic Complexity - A software metric measuring the number of independent paths through a program’s source code.
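To make the Cache-Aside entry above concrete, here is a minimal in-memory sketch in Python. `CacheAside` and the `load_from_source` callable are hypothetical; a real implementation would typically also handle expiry (TTL) and invalidation.

```python
from typing import Callable, Dict


class CacheAside:
    def __init__(self, load_from_source: Callable[[str], str]) -> None:
        self._load = load_from_source
        self._cache: Dict[str, str] = {}

    def get(self, key: str) -> str:
        if key in self._cache:
            # Cache hit: the data source is not touched.
            return self._cache[key]
        # Cache miss: read from the data source, then populate the cache.
        value = self._load(key)
        self._cache[key] = value
        return value
```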
D
DAST (Dynamic Application Security Testing) - Security testing performed on running applications to find vulnerabilities.
Dead Letter Queue - A queue where messages that can’t be processed successfully are sent for later analysis.
Distributed Tracing - Tracking a request as it flows through multiple services in a distributed system.
DRY (Don’t Repeat Yourself) - A software development principle aimed at reducing repetition of code patterns.
DR (Disaster Recovery) - Processes and tools for recovering IT systems after a catastrophic failure.
E
E2E (End-to-End) Testing - Testing an application’s workflow from beginning to end to ensure it behaves as expected.
Entity-Relationship Diagram (ERD) - A diagram showing entities (tables) and their relationships in a database.
ETL (Extract, Transform, Load) - A process of extracting data from sources, transforming it, and loading it into a destination system.
ETag - An HTTP response header used for cache validation, allowing conditional requests.
Event-Driven Architecture - An architectural pattern where system components communicate through events.
Exactly-Once Delivery - A message delivery guarantee ensuring each message is delivered once and only once.
Exponential Backoff - A retry strategy where wait time increases exponentially between attempts.
External System - Any system, service, or application that exists outside the boundaries of the system being reviewed. This includes third-party APIs, partner systems, legacy systems, SaaS platforms, and any dependencies not owned or directly controlled by your organization.
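The Exponential Backoff entry above can be sketched as follows. The `retry` helper, its parameter names, and the added random jitter are assumptions made for this example rather than a prescribed implementation.

```python
import random
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def backoff_delays(base: float, factor: float, attempts: int) -> List[float]:
    """Delay before each retry: base, base*factor, base*factor^2, ..."""
    return [base * factor ** i for i in range(attempts)]


def retry(
    operation: Callable[[], T],
    max_attempts: int = 4,
    base: float = 0.1,
    factor: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = base * factor ** attempt
            # Jitter spreads out retries from many clients (an assumption here).
            sleep(delay + random.uniform(0, delay))
    raise RuntimeError("max_attempts must be at least 1")
```

The `sleep` parameter is injectable so that tests can record the delays instead of actually waiting.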
F
Failover - Automatically switching to a redundant system when the primary system fails.
Feature Flag - A technique allowing features to be enabled or disabled without deploying new code.
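FID (First Input Delay) - A metric measuring the delay between a user’s first interaction with a page and the moment the browser can begin processing it. Replaced by Interaction to Next Paint (INP) as a Core Web Vital in March 2024.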
Foreign Key - A database constraint linking one table to another to maintain referential integrity.
G
GDPR (General Data Protection Regulation) - EU regulation on data protection and privacy.
Graceful Degradation - Maintaining limited functionality when parts of a system fail.
GraphQL - A query language for APIs in which a typed schema describes the data the API exposes and clients request exactly the fields they need.
gRPC - A high-performance RPC framework using HTTP/2 and Protocol Buffers.
H
HIPAA (Health Insurance Portability and Accountability Act) - US regulation for protecting sensitive patient health information.
Horizontal Scaling - Adding more machines to handle increased load.
HSTS (HTTP Strict Transport Security) - A security header forcing browsers to use HTTPS connections only.
HTTP Status Codes - Standard response codes indicating the result of an HTTP request (200 OK, 404 Not Found, etc.).
I
IaC (Infrastructure as Code) - Managing and provisioning infrastructure through machine-readable definition files.
Idempotency - The property of an operation that produces the same result regardless of how many times it’s executed.
Index (Database) - A data structure improving the speed of data retrieval operations.
Input Validation - Checking that user-supplied data meets expected criteria before processing.
Integration Testing - Testing how different parts of a system work together.
J
JWT (JSON Web Token) - A compact, URL-safe token format for securely transmitting information between parties.
K
Key-Value Store - A database storing data as a collection of key-value pairs.
Kubernetes - An open-source container orchestration platform for automating deployment, scaling, and management.
L
Lazy Loading - Deferring loading of resources until they’re actually needed.
LCP (Largest Contentful Paint) - A Core Web Vital measuring when the largest content element becomes visible.
Least Privilege - Security principle of granting only the minimum access needed to perform a task.
LFU (Least Frequently Used) - A cache eviction policy removing items accessed least frequently.
Lifecycle Policy - Rules defining how long data is retained and when it’s archived or deleted.
Load Balancer - A device distributing network traffic across multiple servers.
Load Testing - Testing system behavior under expected and peak load conditions.
LRU (Least Recently Used) - A cache eviction policy removing items accessed least recently.
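As an illustration of the LRU entry above, here is a small sketch built on `collections.OrderedDict`. The `LRUCache` class is hypothetical and not thread-safe; for memoizing function calls, the standard library already provides `functools.lru_cache`.

```python
from collections import OrderedDict
from typing import Any, Hashable, Optional


class LRUCache:
    def __init__(self, capacity: int) -> None:
        self._capacity = capacity
        self._items: "OrderedDict[Hashable, Any]" = OrderedDict()

    def get(self, key: Hashable) -> Optional[Any]:
        if key not in self._items:
            return None
        # Reading an item marks it as most recently used.
        self._items.move_to_end(key)
        return self._items[key]

    def put(self, key: Hashable, value: Any) -> None:
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self._capacity:
            # Evict the least recently used item (front of the ordering).
            self._items.popitem(last=False)
```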
M
Master Data Management - Processes ensuring critical business data is consistent across the organization.
Message Broker - Middleware facilitating communication between applications via message passing.
MFA (Multi-Factor Authentication) - Authentication requiring two or more verification factors.
Microservices - An architectural style structuring an application as a collection of loosely coupled services.
Mock/Stub - Test doubles replacing real dependencies with controlled implementations.
Monolith - An application architecture where all components are part of a single deployable unit.
mTLS (Mutual TLS) - Both client and server authenticate each other using certificates.
N
N+1 Query Problem - A performance issue where one query is executed, followed by N additional queries in a loop.
Network ACL (Access Control List) - Rules controlling traffic allowed in and out of network subnets.
Network Segmentation - Dividing a network into smaller segments to improve security and performance.
Normalization - Organizing database tables to reduce redundancy and improve data integrity.
NSG (Network Security Group) - Azure’s firewall for filtering network traffic to resources.
O
OAuth 2.0 - An authorization framework enabling applications to obtain limited access to user accounts.
Observability - The ability to understand internal system states from external outputs (logs, metrics, traces).
OpenAPI/Swagger - A specification for describing REST APIs in a machine-readable format.
ORM (Object-Relational Mapping) - Technique for converting data between incompatible type systems using object-oriented programming.
OWASP Top 10 - A list of the most critical web application security risks.
P
Partitioning - Dividing a database table into smaller, more manageable pieces.
PCI-DSS - Payment Card Industry Data Security Standard for handling credit card information.
Penetration Testing - Simulated cyber attacks to identify security vulnerabilities.
PHI (Protected Health Information) - Health information that can be linked to an individual, protected under HIPAA.
PII (Personally Identifiable Information) - Information that can identify an individual.
Point-in-Time Recovery - Database recovery to any specific point in time within the backup retention period.
Poison Message - A message that causes repeated processing failures, typically moved to a dead letter queue.
Post-Mortem - Analysis conducted after an incident to understand what happened and prevent recurrence.
Pre-Signed URL - A time-limited URL granting temporary access to a private resource.
Principle of Least Privilege - See Least Privilege.
Q
Queue Depth - The number of messages waiting to be processed in a queue.
R
Rate Limiting - Controlling the rate of requests a user or service can make.
RBAC (Role-Based Access Control) - Access control based on user roles within an organization.
Read Replica - A copy of a database used to offload read queries from the primary database.
Recovery Point Objective (RPO) - Maximum acceptable amount of data loss measured in time.
Recovery Time Objective (RTO) - Maximum acceptable time to restore a system after a failure.
Referential Integrity - Database constraint ensuring relationships between tables remain consistent.
REST (Representational State Transfer) - An architectural style for designing networked applications using HTTP.
Retry Logic - Automatically attempting a failed operation again after a delay.
Rollback - Reverting a system to a previous state, typically after a failed deployment.
Rolling Deployment - Gradually replacing instances of the previous version with the new version.
Row-Level Security - Database feature controlling which rows users can access.
RPC (Remote Procedure Call) - Protocol allowing a program to execute procedures on another computer.
Runbook - Documentation of routine procedures and operations for system administration.
S
SAST (Static Application Security Testing) - Security testing analyzing source code for vulnerabilities without executing it.
Secrets Manager - Service for securely storing and managing sensitive information like API keys and passwords.
Security Group - AWS’s virtual firewall controlling inbound and outbound traffic.
Separation of Concerns - Design principle separating a program into distinct sections, each addressing a separate concern.
Service Discovery - Automatically detecting devices and services on a network.
Service Mesh - Infrastructure layer handling service-to-service communication, often providing security, monitoring, and reliability features.
Serverless - Cloud computing model where the cloud provider manages infrastructure, allowing developers to focus on code.
Session Management - Process of keeping track of user activity across multiple requests.
Sidecar Proxy - An auxiliary container deployed alongside an application container to provide supporting features.
SLA (Service Level Agreement) - Contract defining expected service levels between provider and customer.
SLO (Service Level Objective) - Target value or range of values for a service level, as measured by a Service Level Indicator (SLI).
SOC2 (System and Organization Controls 2) - An AICPA audit framework for assessing whether service providers manage customer data securely.
SOLID Principles - Five design principles (Single Responsibility, Open-Closed, Liskov Substitution, Interface Segregation, Dependency Inversion) for maintainable software.
SQL Injection - Attack inserting malicious SQL code into application queries.
SRI (Subresource Integrity) - Security feature allowing browsers to verify fetched resources haven’t been manipulated.
SSL/TLS (Secure Sockets Layer/Transport Layer Security) - Cryptographic protocols providing secure communication over networks.
SSO (Single Sign-On) - Authentication scheme allowing users to log in once and access multiple applications.
Structured Logging - Logging in a consistent, machine-readable format (typically JSON).
T
Terraform - Infrastructure as Code tool for building, changing, and versioning infrastructure.
Three-Tier Architecture - Architecture pattern separating presentation, application logic, and data layers.
TLS (Transport Layer Security) - See SSL/TLS.
Token - A piece of data representing authorization to access resources.
Tracing - See Distributed Tracing.
Transaction - A unit of work performed against a database that is treated atomically.
Trust Boundary - The border between trusted and untrusted parts of a system.
TTL (Time To Live) - Duration data remains valid in a cache or similar system.
U
Unit Testing - Testing individual units of code in isolation.
V
Vertical Scaling - Adding more power (CPU, RAM) to an existing machine.
VPC (Virtual Private Cloud) - Isolated section of cloud infrastructure where you can launch resources.
VPN (Virtual Private Network) - Encrypted connection between networks over the internet.
W
WAF (Web Application Firewall) - Security layer protecting web applications from common attacks.
WCAG (Web Content Accessibility Guidelines) - Guidelines for making web content accessible to people with disabilities.
WebP - Modern image format providing superior compression compared to JPEG and PNG.
Webhook - HTTP callbacks triggered by specific events, allowing real-time integration between systems.
Write-Through Cache - Caching pattern where data is written to cache and database simultaneously.
X
XSS (Cross-Site Scripting) - Attack injecting malicious scripts into web pages viewed by other users.
Z
Zero Trust Network Access (ZTNA) - Security model requiring verification for every access request regardless of location.
Zero-Downtime Deployment - Deployment strategy ensuring service remains available during updates.