Hello, this is Squirrel (다람쥐).
For a few years now, one of my EC2 instances has occasionally died for no apparent reason.
This time I went through the logs step by step.
An instance that dies for no apparent reason
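A while ago I set up a workflow that health-checks the API.
This afternoon an email suddenly landed in my inbox.
It said the health check had failed.
I would never have noticed without that workflow, so I felt a little proud,
and also curious about why on earth it had failed.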
First I tried to SSH into the EC2 instance.
The connection failed.
So I opened the EC2 console.
The status check showed "1/2 checks passed".
It was the same issue that had been bugging me for years.
First, stay calm and stop -> start the instance.
'Reboot instance' does not work here.
You have to press the 'Stop instance' button.
After that, I could connect to the instance normally again.
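The same stop -> start cycle can also be scripted with the AWS CLI. This is a minimal sketch, assuming the AWS CLI is installed and configured with permission to stop and start the instance. The function name `stop_start_instance` is made up for this example.

```shell
# Stop an instance, wait until it is fully stopped, start it again,
# and wait until its status checks pass.
# Argument: the instance ID (for example the one shown in the EC2 console).
stop_start_instance() (
  instance_id="$1"
  [ -n "$instance_id" ] || { echo "usage: stop_start_instance <instance-id>" >&2; exit 2; }

  aws ec2 stop-instances --instance-ids "$instance_id" || exit 1
  aws ec2 wait instance-stopped --instance-ids "$instance_id" || exit 1
  aws ec2 start-instances --instance-ids "$instance_id" || exit 1
  aws ec2 wait instance-status-ok --instance-ids "$instance_id" || exit 1
)
```

It would be called as `stop_start_instance <instance-id>`. Keep in mind that a stop -> start changes the public IP address unless an Elastic IP is attached.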
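Until now this had only happened 2-3 times a year.
I wondered whether logs had piled up and exhausted the memory.
Restarting the instance usually made the problem go away(?), so I just went back to my daily life.
Monitoring showed CPU utilization spiking to 99.9%, but other than that
I could not spot any obvious problem.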
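EC2's default CloudWatch metrics cover CPU but not memory, so the memory hypothesis cannot be checked from the console graphs alone. One way to gather evidence is to sample `/proc/meminfo` on the instance itself. The sketch below is only an illustration; the function name `mem_available_pct` is hypothetical.

```shell
# Print available memory as a whole-number percentage of total memory.
# Reads /proc/meminfo by default; another file can be passed for testing.
mem_available_pct() {
  awk '
    $1 == "MemTotal:"     { total = $2 }
    $1 == "MemAvailable:" { avail = $2 }
    END {
      if (total > 0) printf "%d\n", avail * 100 / total
      else exit 1
    }
  ' "${1:-/proc/meminfo}"
}
```

Calling this from cron every minute and appending the result to a file would leave a trail to look at after the next hang.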
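According to other blogs,
excessive memory use during a deployment
can leave an instance unresponsive and unreachable.
They say that configuring a swap area solves it.
In case this was a memory problem, I set one up.
Official docs: https://aws.amazon.com/ko/premiumsupport/knowledge-center/ec2-memory-swap-file/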
# Allocate the swap file (128 MiB x 16 = 2 GiB)
$ sudo dd if=/dev/zero of=/swapfile bs=128M count=16
16+0 records in
16+0 records out
2147483648 bytes (2.1 GB, 2.0 GiB) copied, 32.3344 s, 66.4 MB/s
# Restrict the swap file to owner read/write
$ sudo chmod 600 /swapfile
# Set up the swap area
$ sudo mkswap /swapfile
Setting up swapspace version 1, size = 2 GiB (2147479552 bytes)
no label, UUID=35d6c9cf-8d17-4d91-a238-de51e77983af
# Enable the swap file
$ sudo swapon /swapfile
# Verify
$ sudo swapon -s
Filename Type Size Used Priority
/swapfile file 2097148 0 -2
# Enable the swap file at boot
$ sudo vi /etc/fstab
# Add this as the last line
/swapfile swap swap defaults 0 0
# Check swap usage
$ sudo free -h
total used free shared buff/cache available
Mem: 949Mi 596Mi 165Mi 1.0Mi 187Mi 186Mi
Swap: 2.0Gi 0B 2.0Gi
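# ---
# To remove the swap file later:
# Disable swap on the swap file first
$ sudo swapoff /swapfile
# (or disable all swap)
$ sudo swapoff -a
# Then delete the swap file and remove its line from /etc/fstab
$ sudo rm /swapfile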
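The `/etc/fstab` step above is done by hand in vi. If the setup is ever scripted, the entry should be added only once. Here is a minimal sketch of an idempotent version; the function name `ensure_swap_fstab_entry` is hypothetical, and it assumes the file already ends with a newline.

```shell
# Append the swap entry to an fstab file only if no /swapfile entry exists yet.
# The target path is a parameter so the function can be tried on a copy first.
ensure_swap_fstab_entry() (
  fstab="${1:-/etc/fstab}"
  entry='/swapfile swap swap defaults 0 0'

  # Look for an existing line whose first field is /swapfile.
  if grep -Eq '^/swapfile[[:space:]]' "$fstab"; then
    echo "swap entry already present in $fstab"
  else
    printf '%s\n' "$entry" >> "$fstab"
    echo "swap entry added to $fstab"
  fi
)
```

Writing to the real `/etc/fstab` needs root, so it is safer to run the function against a copy first and check the result.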
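Still, it is strange, because nobody had logged in to the instance
and there had been no deployment and no traffic.
I strongly suspected that some AWS-side daemon on the system
was misbehaving and doing more network I/O than usual.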
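I opened the system log.
Use Page Up / Page Down to find the time window (timestamps are in UTC).
Or search for it in Vi(m) with '/Aug 2 07:19'. If that finds nothing, try two spaces after the month, since syslog usually pads single-digit days with a space.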
$ vim /var/log/syslog
Aug 2 05:30:33 ip-172-31-0-13 amazon-ssm-agent.amazon-ssm-agent[407]: 2024-08-02 05:30:33 WARN EC2RoleProvider Failed to connect to Systems Manager with instance profile role credentials. Err: retrieved credentials failed to report to ssm. RequestId: 4ef77717-0404-46da-a883-f2ae588ba7f7 Error: AccessDeniedException: User: arn:aws:sts::1234567890:assumed-role/ec2-codedeploy-role/i-1234567890 is not authorized to perform: ssm:UpdateInstanceInformation on resource: arn:aws:ec2:ap-northeast-2:1234567890:instance/i-1234567890 because no identity-based policy allows the ssm:UpdateInstanceInformation action
Aug 2 05:30:33 ip-172-31-0-13 amazon-ssm-agent.amazon-ssm-agent[407]: #011status code: 400, request id: 4ef77717-0404-46da-a883-f2ae588ba7f7
Aug 2 05:30:33 ip-172-31-0-13 amazon-ssm-agent.amazon-ssm-agent[407]: 2024-08-02 05:30:33 ERROR EC2RoleProvider Failed to connect to Systems Manager with SSM role credentials. error calling RequestManagedInstanceRoleToken: AccessDeniedException: Systems Manager's instance management role is not configured for account: 1234567890
Aug 2 05:30:33 ip-172-31-0-13 amazon-ssm-agent.amazon-ssm-agent[407]: #011status code: 400, request id: 859dfb08-be51-4ca6-8652-268a1d0f1c19
Aug 2 05:36:11 ip-172-31-0-13 snapd[413]: storehelpers.go:923: cannot refresh: snap has no updates available: "amazon-ssm-agent", "core18", "core20", "core22", "lxd", "snapd"
Aug 2 05:58:15 ip-172-31-0-13 amazon-ssm-agent.amazon-ssm-agent[407]: 2024-08-02 05:58:15 WARN EC2RoleProvider Failed to connect to Systems Manager with instance profile role credentials. Err: retrieved credentials failed to report to ssm. RequestId: 754ad390-1464-4c21-9397-a6ee41d9426e Error: AccessDeniedException: User: arn:aws:sts::1234567890:assumed-role/ec2-codedeploy-role/i-1234567890 is not authorized to perform: ssm:UpdateInstanceInformation on resource: arn:aws:ec2:ap-northeast-2:1234567890:instance/i-1234567890 because no identity-based policy allows the ssm:UpdateInstanceInformation action
Aug 2 05:58:15 ip-172-31-0-13 amazon-ssm-agent.amazon-ssm-agent[407]: #011status code: 400, request id: 754ad390-1464-4c21-9397-a6ee41d9426e
Aug 2 05:58:16 ip-172-31-0-13 amazon-ssm-agent.amazon-ssm-agent[407]: 2024-08-02 05:58:15 ERROR EC2RoleProvider Failed to connect to Systems Manager with SSM role credentials. error calling RequestManagedInstanceRoleToken: AccessDeniedException: Systems Manager's instance management role is not configured for account: 1234567890
Aug 2 05:58:16 ip-172-31-0-13 amazon-ssm-agent.amazon-ssm-agent[407]: #011status code: 400, request id: 066b66ec-63fc-4ffe-b9d6-a732bbd7ffac
Aug 2 06:17:01 ip-172-31-0-13 CRON[99296]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Aug 2 06:24:17 ip-172-31-0-13 amazon-ssm-agent.amazon-ssm-agent[407]: 2024-08-02 06:24:16 WARN EC2RoleProvider Failed to connect to Systems Manager with instance profile role credentials. Err: retrieved credentials failed to report to ssm. RequestId: bc0f2d74-9ab0-408d-9dec-92a3a1fdaeb6 Error: AccessDeniedException: User: arn:aws:sts::1234567890:assumed-role/ec2-codedeploy-role/i-1234567890 is not authorized to perform: ssm:UpdateInstanceInformation on resource: arn:aws:ec2:ap-northeast-2:1234567890:instance/i-1234567890 because no identity-based policy allows the ssm:UpdateInstanceInformation action
Aug 2 06:24:17 ip-172-31-0-13 amazon-ssm-agent.amazon-ssm-agent[407]: #011status code: 400, request id: bc0f2d74-9ab0-408d-9dec-92a3a1fdaeb6
Aug 2 06:24:17 ip-172-31-0-13 amazon-ssm-agent.amazon-ssm-agent[407]: 2024-08-02 06:24:17 ERROR EC2RoleProvider Failed to connect to Systems Manager with SSM role credentials. error calling RequestManagedInstanceRoleToken: AccessDeniedException: Systems Manager's instance management role is not configured for account: 1234567890
Aug 2 06:24:17 ip-172-31-0-13 amazon-ssm-agent.amazon-ssm-agent[407]: #011status code: 400, request id: de46fb38-0084-478f-9d95-fd4882d61bc9
Aug 2 06:25:01 ip-172-31-0-13 CRON[99303]: (root) CMD (test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily ))
Aug 2 06:33:34 ip-172-31-0-13 systemd[1]: Starting Daily apt upgrade and clean activities...
Aug 2 06:52:08 ip-172-31-0-13 amazon-ssm-agent.amazon-ssm-agent[407]: 2024-08-02 06:52:07 WARN EC2RoleProvider Failed to connect to Systems Manager with instance profile role credentials. Err: retrieved credentials failed to report to ssm. RequestId: 804e19aa-db17-4953-a9a9-b5be78678f47 Error: AccessDeniedException: User: arn:aws:sts::1234567890:assumed-role/ec2-codedeploy-role/i-1234567890 is not authorized to perform: ssm:UpdateInstanceInformation on resource: arn:aws:ec2:ap-northeast-2:1234567890:instance/i-1234567890 because no identity-based policy allows the ssm:UpdateInstanceInformation action
Aug 2 06:52:08 ip-172-31-0-13 amazon-ssm-agent.amazon-ssm-agent[407]: #011status code: 400, request id: 804e19aa-db17-4953-a9a9-b5be78678f47
Aug 2 06:52:12 ip-172-31-0-13 amazon-ssm-agent.amazon-ssm-agent[407]: 2024-08-02 06:52:11 ERROR EC2RoleProvider Failed to connect to Systems Manager with SSM role credentials. error calling RequestManagedInstanceRoleToken: AccessDeniedException: Systems Manager's instance management role is not configured for account: 1234567890
Aug 2 06:52:12 ip-172-31-0-13 amazon-ssm-agent.amazon-ssm-agent[407]: #011status code: 400, request id: c5e8645a-22ad-409a-a5bf-cd900b8858dc
Aug 2 07:17:02 ip-172-31-0-13 CRON[99521]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Aug 2 07:19:16 ip-172-31-0-13 amazon-ssm-agent.amazon-ssm-agent[407]: 2024-08-02 07:19:16 WARN EC2RoleProvider Failed to connect to Systems Manager with instance profile role credentials. Err: retrieved credentials failed to report to ssm. RequestId: ed332665-26e9-4b0b-b188-63beafcee846 Error: AccessDeniedException: User: arn:aws:sts::1234567890:assumed-role/ec2-codedeploy-role/i-1234567890 is not authorized to perform: ssm:UpdateInstanceInformation on resource: arn:aws:ec2:ap-northeast-2:1234567890:instance/i-1234567890 because no identity-based policy allows the ssm:UpdateInstanceInformation action
Aug 2 07:19:16 ip-172-31-0-13 amazon-ssm-agent.amazon-ssm-agent[407]: #011status code: 400, request id: ed332665-26e9-4b0b-b188-63beafcee846
Aug 2 07:19:21 ip-172-31-0-13 amazon-ssm-agent.amazon-ssm-agent[407]: 2024-08-02 07:19:20 ERROR EC2RoleProvider Failed to connect to Systems Manager with SSM role credentials. error calling RequestManagedInstanceRoleToken: AccessDeniedException: Systems Manager's instance management role is not configured for account: 1234567890
Aug 2 07:19:21 ip-172-31-0-13 amazon-ssm-agent.amazon-ssm-agent[407]: #011status code: 400, request id: 5c89ae8f-d1a8-45c8-946f-538ee44a9b2f
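Errors related to the SSM role keep occurring.
They have been constant, and so far they had not caused any real trouble.
However, one of them was logged right around the time the instance hung (16:20 KST, which is 07:20 UTC), so it could be a suspect.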
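Lining up the local time of the hang with the UTC timestamps in syslog is easy to get wrong by hand. The sketch below assumes the 16:20 mentioned above is KST (UTC+9) and that GNU `date` is available, as on Ubuntu. Both function names are made up for this example.

```shell
# Convert a KST (UTC+9) wall-clock time to UTC. Requires GNU date.
# Argument: "YYYY-MM-DD HH:MM" in KST. Prints "YYYY-MM-DD HH:MM:SS" in UTC.
kst_to_utc() {
  date -u -d "$1 +0900" '+%Y-%m-%d %H:%M:%S'
}

# Print the syslog lines of one day whose HH:MM:SS lies within [from, to].
# Arguments: <file> <month> <day> <from> <to>
# Field splitting makes this work whether the day is padded with one or two spaces.
syslog_window() {
  awk -v mon="$2" -v day="$3" -v from="$4" -v to="$5" '
    $1 == mon && $2 == day && $3 >= from && $3 <= to
  ' "$1"
}
```

For this incident, `kst_to_utc '2024-08-02 16:20'` gives the UTC time to look for, and `syslog_window /var/log/syslog Aug 2 07:00:00 07:30:00` would print only the lines around it.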
On AWS Questions, someone reports that their instance shut down with the error above.
They also saw the permission error in their logs,
so I added AWS Systems Manager permissions to the IAM role attached to the EC2 instance.
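The post does not say exactly which permissions were added. One common way to grant `ssm:UpdateInstanceInformation` is to attach the AWS managed policy `AmazonSSMManagedInstanceCore` to the instance role. The sketch below assumes that policy is the one wanted and that the AWS CLI is configured with IAM permissions. The function name is hypothetical.

```shell
# Attach the AWS managed policy AmazonSSMManagedInstanceCore to an IAM role.
# Argument: the role name (in the log above it appears as ec2-codedeploy-role).
attach_ssm_core_policy() (
  role_name="$1"
  [ -n "$role_name" ] || { echo "usage: attach_ssm_core_policy <role-name>" >&2; exit 2; }

  aws iam attach-role-policy \
    --role-name "$role_name" \
    --policy-arn arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
)
```

It would be called as `attach_ssm_core_policy ec2-codedeploy-role`. The same change can be made in the IAM console.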
Then restart the amazon-ssm-agent service.
$ sudo snap restart amazon-ssm-agent
2024-08-02T15:00:33Z INFO Waiting for "snap.amazon-ssm-agent.amazon-ssm-agent.service" to stop.
Restarted.
$ sudo systemctl status snap.amazon-ssm-agent.amazon-ssm-agent.service
● snap.amazon-ssm-agent.amazon-ssm-agent.service - Service for snap application amazon-ssm-agent.amazon-ssm-agent
Loaded: loaded (/etc/systemd/system/snap.amazon-ssm-agent.amazon-ssm-agent.service; enabled; vendor preset: enabled)
Active: active (running) since Fri 2024-08-02 14:25:33 UTC; 34min ago
Main PID: 378 (amazon-ssm-agen)
Tasks: 18 (limit: 1120)
Memory: 78.6M
CPU: 2.525s
CGroup: /system.slice/snap.amazon-ssm-agent.amazon-ssm-agent.service
├─ 378 /snap/amazon-ssm-agent/7993/amazon-ssm-agent
└─1872 /snap/amazon-ssm-agent/7993/ssm-agent-worker
Aug 02 14:25:41 ip-172-31-0-13 amazon-ssm-agent.amazon-ssm-agent[378]: status code: 400, request id: 5e3422fd-85c1-4b34-b168-12d22c84b7e6
Aug 02 14:25:41 ip-172-31-0-13 amazon-ssm-agent.amazon-ssm-agent[378]: 2024-08-02 14:25:39 ERROR [CredentialRefresher] Retrieve credentials produced error: no valid credentials could be retrieved for ec2 identity. Default Host Management E>
Aug 02 14:25:41 ip-172-31-0-13 amazon-ssm-agent.amazon-ssm-agent[378]: status code: 400, request id: 5e3422fd-85c1-4b34-b168-12d22c84b7e6
Aug 02 14:25:41 ip-172-31-0-13 amazon-ssm-agent.amazon-ssm-agent[378]: 2024-08-02 14:25:39 INFO [CredentialRefresher] Sleeping for 27m38s before retrying retrieve credentials
Aug 02 14:53:17 ip-172-31-0-13 amazon-ssm-agent.amazon-ssm-agent[378]: 2024-08-02 14:53:17 INFO EC2RoleProvider Successfully connected with instance profile role credentials
Aug 02 14:53:17 ip-172-31-0-13 amazon-ssm-agent.amazon-ssm-agent[378]: 2024-08-02 14:53:17 INFO [CredentialRefresher] Credentials ready
Aug 02 14:53:18 ip-172-31-0-13 amazon-ssm-agent.amazon-ssm-agent[378]: 2024-08-02 14:53:17 INFO [CredentialRefresher] Next credential rotation will be in 29.999759138800002 minutes
Aug 02 14:53:18 ip-172-31-0-13 amazon-ssm-agent.amazon-ssm-agent[378]: 2024-08-02 14:53:18 INFO [amazon-ssm-agent] [LongRunningWorkerContainer] [WorkerProvider] Worker ssm-agent-worker is not running, starting worker process
Aug 02 14:53:19 ip-172-31-0-13 amazon-ssm-agent.amazon-ssm-agent[378]: 2024-08-02 14:53:18 INFO [amazon-ssm-agent] [LongRunningWorkerContainer] [WorkerProvider] Worker ssm-agent-worker (pid:1872) started
Aug 02 14:53:19 ip-172-31-0-13 amazon-ssm-agent.amazon-ssm-agent[378]: 2024-08-02 14:53:18 INFO [amazon-ssm-agent] [LongRunningWorkerContainer] Monitor long running worker health every 60 seconds
Wrapping up
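I set up a swap file and also added the AWS Systems Manager permissions to the role.
I will have to keep watching to see which of the two was the real culprit.
It is not a satisfying fix, but I did as much as I could.
If you happen to know the cause, please share a link~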