14. Troubleshooting Guide
Common errors and solutions for MILU2 Infra Main.
Quick Diagnosis Checklist
- Check CloudWatch Logs - Always start here
- Check Service/Resource Status - Is it running?
- Check Security Groups - Is traffic allowed?
- Check IAM Permissions - Can it access resources?
- Check Recent Changes - Did Terraform just run?
ECS Service Issues
Service Not Starting
Symptoms: runningCount = 0, tasks keep failing
# 1. Check service events
aws ecs describe-services \
--cluster milu2-test-cluster \
--services milu2-test-api-service \
--query 'services[0].events[:10]'
# 2. Check stopped tasks
aws ecs list-tasks --cluster milu2-test-cluster \
--service-name milu2-test-api-service --desired-status STOPPED
# 3. Describe stopped task
aws ecs describe-tasks \
--cluster milu2-test-cluster \
--tasks <task-arn> \
--query 'tasks[0].{reason:stoppedReason,code:stopCode}'
Container Crash Loop
# 1. Check container logs
aws logs tail app/ecs/milu2-test-api-php --since 30m
# 2. Check exit codes
aws ecs describe-tasks \
--cluster milu2-test-cluster \
--tasks <task-arn> \
--query 'tasks[0].containers[*].{name:name,exitCode:exitCode,reason:reason}'
Health Check Failing
# 1. Check ALB target health
aws elbv2 describe-target-health \
--target-group-arn arn:aws:elasticloadbalancing:ap-northeast-1:123456789012:targetgroup/milu2-test-api-blue/xxx
# 2. Test health endpoint from container
make ecs-exec-api-test
curl http://localhost:8082/healthcheck
# 3. Check security groups
aws ec2 describe-security-groups --group-ids sg-xxx
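When a target group has many targets, it can help to list only the ones that are not healthy. The following is a minimal sketch, not part of the project tooling: the function names `alb_unhealthy_targets` and `alb_check` are hypothetical, and it assumes that an empty result from the CLI means every target is healthy.

```shell
# Hypothetical helper: print only targets whose state is not "healthy",
# with the state and reason code reported by the load balancer.
# Usage: alb_unhealthy_targets <target-group-arn>
alb_unhealthy_targets() {
  aws elbv2 describe-target-health \
    --target-group-arn "$1" \
    --query "TargetHealthDescriptions[?TargetHealth.State!='healthy'].[Target.Id,TargetHealth.State,TargetHealth.Reason]" \
    --output text
}

# Exit 0 if nothing is unhealthy, 1 if at least one target is, 2 if the CLI call failed.
# Assumption: empty output means the filter matched no targets.
alb_check() (
  bad=$(alb_unhealthy_targets "$1") || exit 2
  if [ -n "$bad" ]; then
    echo "Unhealthy targets:"
    echo "$bad"
    exit 1
  fi
  echo "All targets healthy"
)
```

Run it as `alb_check <target-group-arn>`; the reason column (for example `Target.Timeout`) usually indicates whether to look at the container or at the security groups next.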
Common Errors
| Error | Cause | Solution |
|---|---|---|
| CannotPullContainerError | ECR image doesn't exist or no permissions | Check ECR image exists, IAM permissions |
| ResourceInitializationError | VPC endpoints or NAT Gateway issues | Check VPC endpoints, NAT Gateway |
| OutOfMemoryError | Task doesn't have enough memory | Increase task memory in task definition |
| HealthCheckFailure | Container not responding to health check | Check container health, ALB target health |
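The table above can be turned into a small lookup so that a stopped task's reason comes with a suggested first check. This is a sketch under assumptions: `ecs_stop_hint` and `ecs_explain_stop` are hypothetical names, and the `health check` pattern assumes ECS words the failure as "Task failed ELB health checks ..." or similar.

```shell
# Hypothetical helper: map an ECS stoppedReason string to the first
# check suggested in the Common Errors table.
ecs_stop_hint() {
  case "$1" in
    *CannotPullContainerError*)
      echo "Check that the ECR image exists and IAM permissions allow the pull" ;;
    *ResourceInitializationError*)
      echo "Check VPC endpoints and the NAT Gateway" ;;
    *OutOfMemoryError*)
      echo "Increase task memory in the task definition" ;;
    *HealthCheckFailure*|*"health check"*)
      # Assumption: health check failures contain one of these strings.
      echo "Check container health and ALB target health" ;;
    *)
      echo "No known hint; start with the container logs" ;;
  esac
}

# Fetch the stop reason of one task and print it with its hint.
# Usage: ecs_explain_stop <task-arn> [cluster]
ecs_explain_stop() (
  reason=$(aws ecs describe-tasks \
    --cluster "${2:-milu2-test-cluster}" \
    --tasks "$1" \
    --query 'tasks[0].stoppedReason' \
    --output text) || exit 1
  printf '%s\n-> %s\n' "$reason" "$(ecs_stop_hint "$reason")"
)
```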
RDS Connection Issues
Cannot Connect to Database
# 1. Verify RDS status
aws rds describe-db-clusters \
--db-cluster-identifier milu2-test-db \
--query 'DBClusters[0].Status'
# 2. Check security group rules
aws ec2 describe-security-groups \
--group-ids <rds-sg-id> \
--query 'SecurityGroups[0].IpPermissions'
# 3. Verify endpoint
aws rds describe-db-clusters \
--db-cluster-identifier milu2-test-db \
--query 'DBClusters[0].Endpoint'
# 4. Test from bastion
ssh -i key.pem ec2-user@<bastion-ip>
mysql -h <endpoint> -P 3307 -u admin -p
| Issue | Solution |
|---|---|
| Security group blocks | Add ingress rule from app SG to RDS SG |
| Wrong port | Use 3307 (not default 3306) |
| Credentials expired | Check Secrets Manager |
| RDS paused | Wait for auto-resume or modify cluster |
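To avoid the wrong-port mistake, the endpoint and port can be read from the cluster itself instead of being typed by hand. A minimal sketch, assuming the default cluster ID and `admin` user from the commands above; `rds_mysql_cmd` is a hypothetical name.

```shell
# Hypothetical helper: print the mysql command for the cluster's writer
# endpoint, using the port the cluster actually reports.
# Usage: rds_mysql_cmd [cluster-id] [user]
rds_mysql_cmd() (
  cluster=${1:-milu2-test-db}
  user=${2:-admin}
  # With --output text, Endpoint and Port come back tab-separated on one line.
  info=$(aws rds describe-db-clusters \
    --db-cluster-identifier "$cluster" \
    --query 'DBClusters[0].[Endpoint,Port]' \
    --output text) || exit 1
  set -- $info
  endpoint=$1
  port=$2
  if [ "$port" != "3307" ]; then
    echo "warning: cluster reports port $port, this guide expects 3307" >&2
  fi
  echo "mysql -h $endpoint -P $port -u $user -p"
)
```

The function only prints the command; run the printed line from the bastion, since the database is not reachable from outside the VPC.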
DDL Bootstrap Failed
# 1. Check null_resource trigger
terragrunt state show null_resource.db_initializer
# 2. Re-run manually
terragrunt taint null_resource.db_initializer
terragrunt apply -target=null_resource.db_initializer
# 3. Check bastion connectivity
ssh -i key.pem ec2-user@<bastion-ip>
mysql -h <endpoint> -P 3307 -u admin -p < init.sql
Global Accelerator Issues
GA Traffic Not Flowing
# 1. Check accelerator status
aws globalaccelerator describe-accelerator \
--accelerator-arn <arn>
# 2. Check allow traffic setting
aws globalaccelerator describe-custom-routing-endpoint-group \
--endpoint-group-arn <arn>
# 3. Re-run allow traffic
terragrunt taint terraform_data.allow_traffic
terragrunt apply
# 4. Test Lambda port resolver
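# AWS CLI v2 expects base64 payloads unless raw input is requested explicitly
aws lambda invoke \
--function-name milu2-test-ga-port-mapping-resolver \
--cli-binary-format raw-in-base64-out \
--payload '{"ec2_private_ip":"10.0.1.100","dest_port":7551}' \
out.json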
cat out.json
Port Mapping Not Found
# 1. List all port mappings
aws globalaccelerator list-custom-routing-port-mappings \
--accelerator-arn <arn>
# 2. Check subnet association
aws globalaccelerator describe-custom-routing-endpoint-group \
--endpoint-group-arn <arn> \
--query 'EndpointGroup.EndpointDescriptions'
CloudFront Issues
503 Service Unavailable
# 1. Check ALB origin
aws cloudfront get-distribution --id EXXXXXXXXX \
--query 'Distribution.DistributionConfig.Origins'
# 2. Check VPC origin status
aws cloudfront list-vpc-origins
# 3. Check ALB health
aws elbv2 describe-target-health --target-group-arn <arn>
# 4. Test ALB directly (from bastion)
curl -H "X-Forwarded-App: api" http://<alb-dns>/healthcheck
Cache Not Invalidating
# 1. Check invalidation status
aws cloudfront list-invalidations --distribution-id EXXXXXXXXX
# 2. Create manual invalidation
aws cloudfront create-invalidation \
--distribution-id EXXXXXXXXX \
--paths "/*"
# 3. Check cache-invalidator Lambda
aws logs tail /aws/lambda/milu2-test-cache-invalidator --since 1h
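A manual invalidation returns immediately, so it is easy to test the site before the invalidation has finished. The sketch below creates the invalidation and then blocks until CloudFront reports it complete; `cf_invalidate_and_wait` is a hypothetical name and the default path `/*` mirrors step 2.

```shell
# Hypothetical helper: create an invalidation and wait for it to complete.
# Usage: cf_invalidate_and_wait <distribution-id> [path]
cf_invalidate_and_wait() (
  dist=$1
  path=${2:-"/*"}
  id=$(aws cloudfront create-invalidation \
    --distribution-id "$dist" \
    --paths "$path" \
    --query 'Invalidation.Id' \
    --output text) || exit 1
  echo "Created invalidation $id, waiting for completion"
  # The CLI waiter polls the invalidation until its status is Completed.
  aws cloudfront wait invalidation-completed \
    --distribution-id "$dist" \
    --id "$id" || exit 1
  echo "Invalidation $id completed"
)
```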
Lambda Issues
Lambda Timing Out
# 1. Check function configuration
aws lambda get-function-configuration \
--function-name milu2-test-image-validator
# 2. Increase timeout (via Terraform)
# In lambda.tf: timeout = 60
# 3. Check VPC connectivity
aws lambda get-function \
--function-name milu2-test-image-validator \
--query 'Configuration.VpcConfig'
DLQ Messages Growing
# 1. Check DLQ count
aws sqs get-queue-attributes \
--queue-url <dlq-url> \
--attribute-names ApproximateNumberOfMessages
# 2. View sample DLQ message
aws sqs receive-message --queue-url <dlq-url>
# 3. Check Lambda errors
aws logs filter-log-events \
--log-group-name /aws/lambda/milu2-test-image-validator \
--filter-pattern "ERROR"
# 4. Replay DLQ messages
aws sqs start-message-move-task \
--source-arn <dlq-arn> \
--destination-arn <main-queue-arn>
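Steps 1 and 4 can be combined so that a replay is only started when the DLQ actually holds messages. This is a sketch with a hypothetical function name, `dlq_replay`; it does not check whether the Lambda errors from step 3 have been fixed, and replaying before that is likely to send the same messages back to the DLQ.

```shell
# Hypothetical helper: replay DLQ messages only if the queue is not empty.
# Usage: dlq_replay <dlq-url> <dlq-arn> <main-queue-arn>
dlq_replay() (
  count=$(aws sqs get-queue-attributes \
    --queue-url "$1" \
    --attribute-names ApproximateNumberOfMessages \
    --query 'Attributes.ApproximateNumberOfMessages' \
    --output text) || exit 1
  if [ "$count" = "0" ]; then
    echo "DLQ is empty, nothing to replay"
    exit 0
  fi
  echo "Replaying approximately $count message(s) from the DLQ"
  aws sqs start-message-move-task \
    --source-arn "$2" \
    --destination-arn "$3"
)
```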
CodeDeploy Issues
Deployment Stuck
# 1. Check deployment status
aws deploy get-deployment --deployment-id d-XXXXXXXXX
# 2. Check deployment targets
aws deploy list-deployment-targets --deployment-id d-XXXXXXXXX
# 3. Continue if waiting for approval
aws deploy continue-deployment \
--deployment-id d-XXXXXXXXX \
--deployment-wait-type READY_WAIT
# 4. Stop and rollback
aws deploy stop-deployment \
--deployment-id d-XXXXXXXXX \
--auto-rollback-enabled
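Which of steps 3 and 4 applies depends on the deployment status returned in step 1. The following sketch maps each CodeDeploy status to a suggested action; the function names and the action labels (`continue`, `wait`, `inspect`, `done`) are hypothetical and only illustrate the decision.

```shell
# Hypothetical helper: suggest the next action for a CodeDeploy status.
deploy_next_step() {
  case "$1" in
    Ready)                            echo "continue" ;; # waiting for continue-deployment
    Created|Queued|InProgress|Baking) echo "wait" ;;     # still progressing
    Failed|Stopped)                   echo "inspect" ;;  # check targets, consider rollback
    Succeeded)                        echo "done" ;;
    *)                                echo "unknown" ;;
  esac
}

# Fetch the status of a deployment and print it with the suggestion.
# Usage: deploy_advise <deployment-id>
deploy_advise() (
  status=$(aws deploy get-deployment \
    --deployment-id "$1" \
    --query 'deploymentInfo.status' \
    --output text) || exit 1
  echo "$status: $(deploy_next_step "$status")"
)
```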
Rollback Failed
# 1. Check original task definition
aws ecs describe-task-definition \
--task-definition milu2-test-api:XX
# 2. Manually update service
aws ecs update-service \
--cluster milu2-test-cluster \
--service milu2-test-api-service \
--task-definition milu2-test-api:XX
Terraform/Terragrunt Issues
State Lock Error
# 1. Check lock file
aws s3 ls s3://milu2-test-tfstate/ | grep tflock
# 2. View lock details
aws s3 cp s3://milu2-test-tfstate/core.tfstate.tflock -
# 3. Remove lock (CAREFUL!)
aws s3 rm s3://milu2-test-tfstate/core.tfstate.tflock
Only remove the lock when you're sure no one else is running Terraform!
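One way to make the removal harder to run by accident is to wrap it in a function that always shows the lock first and deletes it only when asked explicitly. A minimal sketch: `tf_unlock` and its `--force` argument are hypothetical, and the bucket name is taken from the commands above.

```shell
# Hypothetical helper: show a state lock, and remove it only with --force.
# Usage: tf_unlock <state-key> [--force]     e.g. tf_unlock core.tfstate
tf_unlock() (
  lock="s3://milu2-test-tfstate/$1.tflock"
  echo "Lock contents for $lock:"
  aws s3 cp "$lock" - || { echo "No lock found at $lock"; exit 1; }
  if [ "${2:-}" != "--force" ]; then
    echo "Dry run: re-run with --force to remove the lock"
    exit 0
  fi
  aws s3 rm "$lock"
)
```

Without `--force` the function only prints the lock, which is usually enough to see who holds it and when it was created.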
Plan Shows Unexpected Changes
# 1. Refresh state
terragrunt refresh
# 2. Check specific resource
terragrunt state show aws_ecs_service.api
# 3. Import if missing
terragrunt import aws_ecs_service.api <service-arn>
Apply Failed Mid-Way
# 1. Check current state
terragrunt state list
# 2. Run targeted apply
terragrunt apply -target=module.app.aws_ecs_service.api
# 3. Full apply after fixing
terragrunt apply
Network Issues
VPC Endpoint Not Working
# 1. Check endpoint status
aws ec2 describe-vpc-endpoints \
--filters Name=vpc-id,Values=vpc-xxx
# 2. Check security group
aws ec2 describe-security-groups \
--group-ids <endpoint-sg>
# 3. Check route table
aws ec2 describe-route-tables \
--filters Name=vpc-id,Values=vpc-xxx
NAT Gateway Issues
# 1. Check NAT status
aws ec2 describe-nat-gateways \
--filter Name=vpc-id,Values=vpc-xxx
# 2. Check Elastic IP
aws ec2 describe-addresses --allocation-ids eipalloc-xxx
# 3. Check route table
aws ec2 describe-route-tables --route-table-ids rtb-xxx
GitHub Actions Issues
OIDC AssumeRole Failed
# 1. Check OIDC provider
aws iam list-open-id-connect-providers
# 2. Check role trust policy
aws iam get-role --role-name milu2-github-actions-infra \
--query 'Role.AssumeRolePolicyDocument'
# 3. Verify repo claim matches
# Check "sub" claim format: repo:org/repo:ref:refs/heads/main
# 4. Re-apply OIDC provider
cd tofu/envs/shared/github_provider
terragrunt apply
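Step 3 is the one most often gotten wrong, so it can help to build the expected "sub" value and compare it with the trust policy mechanically. This is a sketch with hypothetical function names; the second function only checks that the policy text contains the `repo:org/repo:` prefix, on the assumption that the policy may use a wildcard after it.

```shell
# Build the "sub" claim GitHub sends for a workflow running on a branch.
# Usage: github_sub_claim <org> <repo> [branch]
github_sub_claim() {
  echo "repo:$1/$2:ref:refs/heads/${3:-main}"
}

# Hypothetical check: does the role's trust policy mention this repository?
# Exit 0 if the prefix is present, 1 if not, 2 if the CLI call failed.
# Usage: trust_policy_mentions_repo <role-name> <org> <repo>
trust_policy_mentions_repo() (
  policy=$(aws iam get-role \
    --role-name "$1" \
    --query 'Role.AssumeRolePolicyDocument' \
    --output json) || exit 2
  printf '%s' "$policy" | grep -qF "repo:$2/$3:"
)
```

For example, `trust_policy_mentions_repo milu2-github-actions-infra <org> <repo>` returns non-zero when the repository was renamed or moved to another organization without the role being updated.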
Log Group Reference
| Service | Log Group |
|---|---|
| API PHP | app/ecs/milu2-test-api-php |
| API Nginx | app/ecs/milu2-test-api-nginx |
| Web | app/ecs/milu2-test-web-* |
| Admin | app/ecs/milu2-test-admin-* |
| Push | app/ecs/milu2-test-push-node |
| Lambdas | /aws/lambda/milu2-test-* |
| VPC Flow | network/milu2-test-vpc-flow |
| WAF | aws-waf-logs-milu2-test-cloudfront-php |
| RDS | /aws/rds/cluster/milu2-test-db/* |
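The table above lends itself to a small lookup so that log groups do not have to be retyped. A sketch only: the short keys (`api-php`, `lambda:<name>`, and so on) and the function names are hypothetical, and the Web, Admin, and RDS rows are left out because their names contain wildcards.

```shell
# Hypothetical helper: resolve a short service key to its log group.
log_group_for() {
  case "$1" in
    api-php)   echo "app/ecs/milu2-test-api-php" ;;
    api-nginx) echo "app/ecs/milu2-test-api-nginx" ;;
    push)      echo "app/ecs/milu2-test-push-node" ;;
    vpc-flow)  echo "network/milu2-test-vpc-flow" ;;
    waf)       echo "aws-waf-logs-milu2-test-cloudfront-php" ;;
    lambda:*)  echo "/aws/lambda/milu2-test-${1#lambda:}" ;;
    *)         echo "unknown service: $1" >&2; return 1 ;;
  esac
}

# Tail the log group for a service key.
# Usage: tail_logs <service-key> [since]     e.g. tail_logs api-php 30m
tail_logs() (
  group=$(log_group_for "$1") || exit 1
  aws logs tail "$group" --since "${2:-1h}"
)
```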
Quick Reference Commands
| Operation | Command |
|---|---|
| Check ECS | aws ecs describe-services --cluster milu2-test-cluster --services <svc> |
| Scale ECS | aws ecs update-service --cluster <c> --service <s> --desired-count <n> |
| Restart ECS | aws ecs update-service --cluster <c> --service <s> --force-new-deployment |
| View Logs | aws logs tail <log-group> --since 1h |
| Exec Shell | make ecs-exec-api-test |
| Check RDS | aws rds describe-db-clusters --db-cluster-identifier milu2-test-db |
| Invalidate CF | aws cloudfront create-invalidation --distribution-id <id> --paths "/*" |
| Deploy Continue | aws deploy continue-deployment --deployment-id <id> |
| Deploy Rollback | aws deploy stop-deployment --deployment-id <id> --auto-rollback-enabled |