14. Troubleshooting Guide

Common errors and solutions for MILU2 Infra Main.

Quick Diagnosis Checklist

  1. Check CloudWatch Logs - Always start here
  2. Check Service/Resource Status - Is it running?
  3. Check Security Groups - Is traffic allowed?
  4. Check IAM Permissions - Can it access resources?
  5. Check Recent Changes - Did Terraform just run?

ECS Service Issues

Service Not Starting

Symptoms: runningCount = 0, tasks keep failing

# 1. Check service events
aws ecs describe-services \
  --cluster milu2-test-cluster \
  --services milu2-test-api-service \
  --query 'services[0].events[:10]'

# 2. Check stopped tasks
aws ecs list-tasks --cluster milu2-test-cluster \
  --service-name milu2-test-api-service --desired-status STOPPED

# 3. Describe stopped task
aws ecs describe-tasks \
  --cluster milu2-test-cluster \
  --tasks <task-arn> \
  --query 'tasks[0].{reason:stoppedReason,code:stopCode}'
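
Steps 2 and 3 above can be chained so that task ARNs do not have to be copied by hand. The following is a minimal sketch, not part of the project's tooling; the function name `stopped_task_reasons` is hypothetical. Note that ECS only keeps stopped tasks visible for a limited time after they stop, so an empty result does not prove that nothing failed.

```shell
# Print the stop code and reason for every stopped task ECS still reports
# for one service. Usage: stopped_task_reasons <cluster> <service>
stopped_task_reasons() {
  local cluster="$1" service="$2" arns
  # --output text prints the ARNs on one line, separated by tabs
  arns=$(aws ecs list-tasks --cluster "$cluster" --service-name "$service" \
    --desired-status STOPPED --query 'taskArns' --output text) || return 1
  # An empty value or "None" means no stopped tasks are on record
  if [ -z "$arns" ] || [ "$arns" = "None" ]; then
    echo "no stopped tasks found for $service"
    return 0
  fi
  # $arns is intentionally unquoted: each ARN must be its own argument
  aws ecs describe-tasks --cluster "$cluster" --tasks $arns \
    --query 'tasks[].{task:taskArn,code:stopCode,reason:stoppedReason}' \
    --output table
}

# Example: stopped_task_reasons milu2-test-cluster milu2-test-api-service
```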

Container Crash Loop

# 1. Check container logs
aws logs tail app/ecs/milu2-test-api-php --since 30m

# 2. Check exit codes
aws ecs describe-tasks \
  --cluster milu2-test-cluster \
  --tasks <task-arn> \
  --query 'tasks[0].containers[*].{name:name,exitCode:exitCode,reason:reason}'

Health Check Failing

# 1. Check ALB target health
aws elbv2 describe-target-health \
  --target-group-arn arn:aws:elasticloadbalancing:ap-northeast-1:123456789012:targetgroup/milu2-test-api-blue/xxx

# 2. Test health endpoint from inside the container
make ecs-exec-api-test                    # opens a shell in the API container
curl http://localhost:8082/healthcheck    # run this inside that shell

# 3. Check security groups
aws ec2 describe-security-groups --group-ids sg-xxx
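
When a target group has many targets, it can help to list only the unhealthy ones together with the reason code the load balancer reports. This is a sketch under that assumption; the function name `unhealthy_targets` is hypothetical, and the `--query` expression only filters the output of the same `describe-target-health` call used in step 1.

```shell
# List targets whose state is anything other than "healthy", with the
# reason and description reported by the load balancer.
# Usage: unhealthy_targets <target-group-arn>
unhealthy_targets() {
  local tg_arn="$1"
  aws elbv2 describe-target-health --target-group-arn "$tg_arn" \
    --query "TargetHealthDescriptions[?TargetHealth.State!='healthy'].{id:Target.Id,port:Target.Port,state:TargetHealth.State,reason:TargetHealth.Reason,detail:TargetHealth.Description}" \
    --output table
}
```

Reason codes such as `Target.Timeout` or `Target.ResponseCodeMismatch` usually indicate whether to look at security groups or at the application's health endpoint first.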

Common Errors

Error                        Cause                                      Solution
CannotPullContainerError     ECR image doesn't exist or no permissions  Check ECR image exists, IAM permissions
ResourceInitializationError  VPC endpoints or NAT Gateway issues        Check VPC endpoints, NAT Gateway
OutOfMemoryError             Task doesn't have enough memory            Increase task memory in task definition
HealthCheckFailure           Container not responding to health check   Check container health, ALB target health
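
The table above can be turned into a small lookup that takes a task's `stoppedReason` text and prints the matching first check. This is only an illustrative sketch; the function name `ecs_error_hint` is hypothetical, and it matches on the error names exactly as they are written in the table.

```shell
# Map an ECS stoppedReason string to the suggested first check.
# Usage: ecs_error_hint "<stoppedReason text>"
ecs_error_hint() {
  case "$1" in
    *CannotPullContainerError*)    echo "Check ECR image exists, IAM permissions" ;;
    *ResourceInitializationError*) echo "Check VPC endpoints, NAT Gateway" ;;
    *OutOfMemoryError*)            echo "Increase task memory in task definition" ;;
    *HealthCheckFailure*)          echo "Check container health, ALB target health" ;;
    *)                             echo "No known match: check service events and container logs" ;;
  esac
}

# Example: ecs_error_hint "CannotPullContainerError: pull access denied"
```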

RDS Connection Issues

Cannot Connect to Database

# 1. Verify RDS status
aws rds describe-db-clusters \
  --db-cluster-identifier milu2-test-db \
  --query 'DBClusters[0].Status'

# 2. Check security group rules
aws ec2 describe-security-groups \
  --group-ids <rds-sg-id> \
  --query 'SecurityGroups[0].IpPermissions'

# 3. Verify endpoint
aws rds describe-db-clusters \
  --db-cluster-identifier milu2-test-db \
  --query 'DBClusters[0].Endpoint'

# 4. Test from bastion
ssh -i key.pem ec2-user@<bastion-ip>
mysql -h <endpoint> -P 3307 -u admin -p

Issue                  Solution
Security group blocks  Add ingress rule from app SG to RDS SG
Wrong port             Use 3307 (not default 3306)
Credentials expired    Check Secrets Manager
RDS paused             Wait for auto-resume or modify cluster
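
Before debugging credentials, it is often quicker to confirm that the port is reachable at all, which separates the security group and wrong port cases from the rest. The sketch below assumes `bash` and the coreutils `timeout` command are available on the bastion; the function name `can_connect` is hypothetical.

```shell
# Return 0 if a TCP connection to host:port opens within the timeout.
# Uses bash's /dev/tcp redirection, so no mysql client is needed.
# Usage: can_connect <host> <port> [timeout-seconds]
can_connect() {
  local host="$1" port="$2" secs="${3:-3}"
  timeout "$secs" bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null
}

# Example (run on the bastion):
#   can_connect <endpoint> 3307 && echo "port open" || echo "blocked or wrong port"
```

A failure here points at networking (security group, port, or a paused cluster) rather than at the database user or password.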

DDL Bootstrap Failed

# 1. Check null_resource trigger
terragrunt state show null_resource.db_initializer

# 2. Re-run manually
terragrunt taint null_resource.db_initializer
terragrunt apply -target=null_resource.db_initializer

# 3. Check bastion connectivity
ssh -i key.pem ec2-user@<bastion-ip>
mysql -h <endpoint> -P 3307 -u admin -p < init.sql

Global Accelerator Issues

GA Traffic Not Flowing

# 1. Check accelerator status
aws globalaccelerator describe-accelerator \
  --accelerator-arn <arn>

# 2. Check allow traffic setting
aws globalaccelerator describe-custom-routing-endpoint-group \
  --endpoint-group-arn <arn>

# 3. Re-run allow traffic
terragrunt taint terraform_data.allow_traffic
terragrunt apply

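# 4. Test Lambda port resolver
# (AWS CLI v2 needs --cli-binary-format to accept a raw JSON payload)
aws lambda invoke \
  --function-name milu2-test-ga-port-mapping-resolver \
  --cli-binary-format raw-in-base64-out \
  --payload '{"ec2_private_ip":"10.0.1.100","dest_port":7551}' \
  out.json
cat out.json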

Port Mapping Not Found

# 1. List all port mappings
aws globalaccelerator list-custom-routing-port-mappings \
  --accelerator-arn <arn>

# 2. Check subnet association
aws globalaccelerator describe-custom-routing-endpoint-group \
  --endpoint-group-arn <arn> \
  --query 'EndpointGroup.EndpointDescriptions'

CloudFront Issues

503 Service Unavailable

# 1. Check ALB origin
aws cloudfront get-distribution --id EXXXXXXXXX \
  --query 'Distribution.DistributionConfig.Origins'

# 2. Check VPC origin status
aws cloudfront list-vpc-origins

# 3. Check ALB health
aws elbv2 describe-target-health --target-group-arn <arn>

# 4. Test ALB directly (from bastion)
curl -H "X-Forwarded-App: api" http://<alb-dns>/healthcheck

Cache Not Invalidating

# 1. Check invalidation status
aws cloudfront list-invalidations --distribution-id EXXXXXXXXX

# 2. Create manual invalidation
aws cloudfront create-invalidation \
  --distribution-id EXXXXXXXXX \
  --paths "/*"

# 3. Check cache-invalidator Lambda
aws logs tail /aws/lambda/milu2-test-cache-invalidator --since 1h
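
A manual invalidation returns immediately, while the invalidation itself takes some time to complete. The sketch below combines step 2 with the CLI's built-in waiter so the command only returns once CloudFront reports completion; the function name `invalidate_and_wait` is hypothetical.

```shell
# Create an invalidation and block until CloudFront marks it completed.
# Usage: invalidate_and_wait <distribution-id> [path-pattern]
invalidate_and_wait() {
  local dist_id="$1" paths="${2:-/*}" inv_id
  inv_id=$(aws cloudfront create-invalidation --distribution-id "$dist_id" \
    --paths "$paths" --query 'Invalidation.Id' --output text) || return 1
  echo "created invalidation $inv_id"
  # The waiter polls until the invalidation status is Completed
  aws cloudfront wait invalidation-completed \
    --distribution-id "$dist_id" --id "$inv_id"
}

# Example: invalidate_and_wait EXXXXXXXXX "/*"
```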

Lambda Issues

Lambda Timing Out

# 1. Check function configuration
aws lambda get-function-configuration \
  --function-name milu2-test-image-validator

# 2. Increase timeout (via Terraform)
# In lambda.tf: timeout = 60

# 3. Check VPC connectivity
aws lambda get-function \
  --function-name milu2-test-image-validator \
  --query 'Configuration.VpcConfig'

DLQ Messages Growing

# 1. Check DLQ count
aws sqs get-queue-attributes \
  --queue-url <dlq-url> \
  --attribute-names ApproximateNumberOfMessages

# 2. View sample DLQ message
aws sqs receive-message --queue-url <dlq-url>

# 3. Check Lambda errors
aws logs filter-log-events \
  --log-group-name /aws/lambda/milu2-test-image-validator \
  --filter-pattern "ERROR"

# 4. Replay DLQ messages
aws sqs start-message-move-task \
  --source-arn <dlq-arn> \
  --destination-arn <main-queue-arn>
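
To decide whether a replay is worth running, or to confirm that one has drained the queue, the depth check from step 1 can be wrapped with a threshold. This is a minimal sketch; the function name `dlq_depth_check` and the threshold argument are hypothetical.

```shell
# Print the approximate number of messages in a queue and return non-zero
# when it exceeds the threshold (default 0).
# Usage: dlq_depth_check <queue-url> [threshold]
dlq_depth_check() {
  local queue_url="$1" threshold="${2:-0}" depth
  depth=$(aws sqs get-queue-attributes --queue-url "$queue_url" \
    --attribute-names ApproximateNumberOfMessages \
    --query 'Attributes.ApproximateNumberOfMessages' --output text)
  # Refuse to compare anything that is not a plain non-negative integer
  case "$depth" in
    ''|*[!0-9]*) echo "could not read queue depth" >&2; return 2 ;;
  esac
  echo "DLQ depth: $depth (threshold: $threshold)"
  [ "$depth" -le "$threshold" ]
}

# Example: dlq_depth_check <dlq-url> 0 || echo "DLQ still has messages"
```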

CodeDeploy Issues

Deployment Stuck

# 1. Check deployment status
aws deploy get-deployment --deployment-id d-XXXXXXXXX

# 2. Check deployment targets
aws deploy list-deployment-targets --deployment-id d-XXXXXXXXX

# 3. Continue if waiting for approval
aws deploy continue-deployment \
  --deployment-id d-XXXXXXXXX \
  --deployment-wait-type READY_WAIT

# 4. Stop and rollback
aws deploy stop-deployment \
  --deployment-id d-XXXXXXXXX \
  --auto-rollback-enabled
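
Step 3 only makes sense when the deployment is actually waiting. The following sketch checks the status first and continues only when it is `Ready`; the function name `continue_if_ready` is hypothetical.

```shell
# Continue a CodeDeploy deployment only if it is waiting in the Ready state.
# Usage: continue_if_ready <deployment-id>
continue_if_ready() {
  local id="$1" status
  status=$(aws deploy get-deployment --deployment-id "$id" \
    --query 'deploymentInfo.status' --output text) || return 1
  if [ "$status" = "Ready" ]; then
    aws deploy continue-deployment --deployment-id "$id" \
      --deployment-wait-type READY_WAIT
  else
    echo "deployment $id is $status; not continuing"
  fi
}

# Example: continue_if_ready d-XXXXXXXXX
```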

Rollback Failed

# 1. Check original task definition
aws ecs describe-task-definition \
  --task-definition milu2-test-api:XX

# 2. Manually update service
aws ecs update-service \
  --cluster milu2-test-cluster \
  --service milu2-test-api-service \
  --task-definition milu2-test-api:XX

Terraform/Terragrunt Issues

State Lock Error

# 1. Check lock file
aws s3 ls s3://milu2-test-tfstate/ | grep tflock

# 2. View lock details
aws s3 cp s3://milu2-test-tfstate/core.tfstate.tflock -

# 3. Remove lock (CAREFUL!)
aws s3 rm s3://milu2-test-tfstate/core.tfstate.tflock

Only remove the lock when you're sure no one else is running Terraform!
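
One way to make the warning above harder to skip is to wrap steps 2 and 3 so that the lock is always displayed first and is deleted only on an explicit flag. This is a sketch, not existing project tooling; the function name `remove_tf_lock` and the `--force` argument are hypothetical.

```shell
# Show a state lock file, then delete it only when called with --force.
# Usage: remove_tf_lock <s3-uri-of-lock> [--force]
remove_tf_lock() {
  local lock_uri="$1" confirm="${2:-}"
  echo "Lock contents for $lock_uri:"
  # Stop here if the lock cannot be read (for example, it no longer exists)
  aws s3 cp "$lock_uri" - || return 1
  if [ "$confirm" != "--force" ]; then
    echo "Dry run: lock kept. Re-run with --force once you have confirmed no apply is in progress."
    return 0
  fi
  aws s3 rm "$lock_uri"
}

# Example: remove_tf_lock s3://milu2-test-tfstate/core.tfstate.tflock
```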

Plan Shows Unexpected Changes

# 1. Refresh state (review the detected drift before approving)
terragrunt apply -refresh-only

# 2. Check specific resource
terragrunt state show aws_ecs_service.api

# 3. Import if missing (ECS services import by cluster-name/service-name, not ARN)
terragrunt import aws_ecs_service.api milu2-test-cluster/milu2-test-api-service

Apply Failed Mid-Way

# 1. Check current state
terragrunt state list

# 2. Run targeted apply
terragrunt apply -target=module.app.aws_ecs_service.api

# 3. Full apply after fixing
terragrunt apply

Network Issues

VPC Endpoint Not Working

# 1. Check endpoint status
aws ec2 describe-vpc-endpoints \
  --filters Name=vpc-id,Values=vpc-xxx

# 2. Check security group
aws ec2 describe-security-groups \
  --group-ids <endpoint-sg>

# 3. Check route table
aws ec2 describe-route-tables \
  --filters Name=vpc-id,Values=vpc-xxx

NAT Gateway Issues

# 1. Check NAT status
aws ec2 describe-nat-gateways \
  --filter Name=vpc-id,Values=vpc-xxx

# 2. Check Elastic IP
aws ec2 describe-addresses --allocation-ids eipalloc-xxx

# 3. Check route table
aws ec2 describe-route-tables --route-table-ids rtb-xxx

GitHub Actions Issues

OIDC AssumeRole Failed

# 1. Check OIDC provider
aws iam list-open-id-connect-providers

# 2. Check role trust policy
aws iam get-role --role-name milu2-github-actions-infra \
  --query 'Role.AssumeRolePolicyDocument'

# 3. Verify repo claim matches
# Check "sub" claim format: repo:org/repo:ref:refs/heads/main

# 4. Re-apply OIDC provider
cd tofu/envs/shared/github_provider
terragrunt apply
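
For step 3, a frequent cause of failure is a workflow whose "sub" claim does not match the pattern in the trust policy, for example a pull request run (`repo:org/repo:pull_request`) against a policy that only allows `refs/heads/main`. The sketch below checks a claim against a pattern locally. It assumes the trust policy uses a `StringLike` or `StringEquals` condition with `*` and `?` wildcards, which shell `case` patterns treat the same way; the function name `sub_claim_matches` is hypothetical.

```shell
# Return 0 if a token "sub" claim matches a trust-policy pattern.
# Usage: sub_claim_matches <pattern> <sub-claim>
sub_claim_matches() {
  local pattern="$1" sub="$2"
  case "$sub" in
    # $pattern is intentionally unquoted so * and ? act as wildcards
    $pattern) return 0 ;;
    *)        return 1 ;;
  esac
}

# Example:
#   sub_claim_matches 'repo:org/repo:ref:refs/heads/*' 'repo:org/repo:ref:refs/heads/main'
```

Shell patterns also give `[` a special meaning that IAM does not, so this comparison is only a reliable approximation for patterns without brackets.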

Log Group Reference

Service    Log Group
API PHP    app/ecs/milu2-test-api-php
API Nginx  app/ecs/milu2-test-api-nginx
Web        app/ecs/milu2-test-web-*
Admin      app/ecs/milu2-test-admin-*
Push       app/ecs/milu2-test-push-node
Lambdas    /aws/lambda/milu2-test-*
VPC Flow   network/milu2-test-vpc-flow
WAF        aws-waf-logs-milu2-test-cloudfront-php
RDS        /aws/rds/cluster/milu2-test-db/*
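
Several entries above end in `*`, and `aws logs tail` needs a concrete log group name. The sketch below expands such an entry into the matching log groups by treating everything before the trailing `*` as a prefix; the function name `resolve_log_groups` is hypothetical.

```shell
# Print every log group whose name starts with the given prefix, one per line.
# A trailing "*" (as written in the table) is stripped before the lookup.
# Usage: resolve_log_groups <log-group-or-prefix>
resolve_log_groups() {
  local prefix="${1%\*}"
  aws logs describe-log-groups --log-group-name-prefix "$prefix" \
    --query 'logGroups[].logGroupName' --output text | tr '\t' '\n'
}

# Example: resolve_log_groups 'app/ecs/milu2-test-web-*'
```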

Quick Reference Commands

Operation        Command
Check ECS        aws ecs describe-services --cluster milu2-test-cluster --services <svc>
Scale ECS        aws ecs update-service --cluster <c> --service <s> --desired-count <n>
Restart ECS      aws ecs update-service --cluster <c> --service <s> --force-new-deployment
View Logs        aws logs tail <log-group> --since 1h
Exec Shell       make ecs-exec-api-test
Check RDS        aws rds describe-db-clusters --db-cluster-identifier milu2-test-db
Invalidate CF    aws cloudfront create-invalidation --distribution-id <id> --paths "/*"
Deploy Continue  aws deploy continue-deployment --deployment-id <id>
Deploy Rollback  aws deploy stop-deployment --deployment-id <id> --auto-rollback-enabled