
An EC2 instance that becomes unreachable and dies for no apparent reason

Chipmunks 2024. 8. 3. 00:38

Hi, this is Chipmunk.

For several years now, one of my EC2 instances has occasionally died for no apparent reason.

This time I worked through the logs step by step.

 

The instance that dies without explanation

Not long ago, I set up a workflow that runs health checks against the API.

 

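MOTI #6. API server health checks with GitHub Actions

In the previous post, the API server had been unreachable for a month. [MOTI] Resolving an API server outage caused by an expired certificate: MOTI data had stopped accumulating a month earlier. I had been paying attention to it fairly recently...

itchipmunk.tistory.com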

 

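This afternoon an email arrived out of the blue.

It said the health check had failed.

Had I not built the workflow, I would never have noticed, so I felt a little proud,

and at the same time curious about why on earth it had failed.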

 

First, I tried to SSH into the EC2 instance.

The connection failed.

So I opened the EC2 console.

 

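The status check read "1/2 checks passed", as shown below.

It was the same issue that has been bothering me for years.

(The values in the screenshot were altered and may differ from the real ones.)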
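The same "1/2 checks passed" state can also be read from the command line. The following is a minimal sketch, assuming the AWS CLI is installed and configured; the function name `ec2_status_checks` and the instance ID in the usage line are placeholders, not values from this post.

```shell
# Print the two EC2 status checks for one instance as
# "<system status><TAB><instance status>", for example "ok	impaired".
# Assumption: the AWS CLI is installed and has credentials for the account.
ec2_status_checks() {
  local instance_id="$1"
  aws ec2 describe-instance-status \
    --instance-ids "$instance_id" \
    --include-all-instances \
    --query 'InstanceStatuses[0].[SystemStatus.Status, InstanceStatus.Status]' \
    --output text
}

# Usage (placeholder ID):
#   ec2_status_checks i-0123456789abcdef0
```

If one of the two values is `impaired`, that corresponds to the one failing check in the console.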

 

First, stay calm and stop the instance, then start it again.

'Reboot instance' does not work in this state.

You have to press the 'Stop instance' button.
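The stop and start sequence can be scripted as well. This is a rough sketch, assuming the AWS CLI is configured with permission to stop and start the instance; `ec2_stop_start` is a hypothetical helper name and the instance ID is a placeholder.

```shell
# Stop an instance, wait until it is fully stopped, start it again,
# and wait until it is running. Returns non-zero if any step fails.
# Assumption: the AWS CLI is installed and configured.
ec2_stop_start() {
  local instance_id="$1"
  aws ec2 stop-instances --instance-ids "$instance_id" >/dev/null || return 1
  aws ec2 wait instance-stopped --instance-ids "$instance_id" || return 1
  aws ec2 start-instances --instance-ids "$instance_id" >/dev/null || return 1
  aws ec2 wait instance-running --instance-ids "$instance_id" || return 1
  echo "restarted $instance_id"
}

# Usage (placeholder ID):
#   ec2_stop_start i-0123456789abcdef0
```

Keep in mind that, unlike a reboot, a stop and start assigns a new public IPv4 address unless an Elastic IP is attached.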

 

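After that, I could connect to the instance normally again.

Until now this has happened only two or three times a year.

I used to wonder whether logs had piled up and exhausted the memory.

A restart was usually enough to make the problem go away (?), so I just went back to my day.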

 

Looking at the monitoring graphs, apart from CPU utilization spiking to 99.9%,

I couldn't see any obvious problem.
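The CPU spike can also be pulled from CloudWatch for a specific time window. A minimal sketch, assuming the AWS CLI is configured and the instance uses basic monitoring (5-minute datapoints); `ec2_cpu_max`, the instance ID, and the time range in the usage line are illustrative only.

```shell
# Print "<timestamp><TAB><max CPU %>" for each 5-minute datapoint
# between two ISO 8601 timestamps, sorted by time.
# Assumption: the AWS CLI is installed and configured.
ec2_cpu_max() {
  local instance_id="$1" start="$2" end="$3"
  aws cloudwatch get-metric-statistics \
    --namespace AWS/EC2 \
    --metric-name CPUUtilization \
    --dimensions "Name=InstanceId,Value=${instance_id}" \
    --start-time "$start" \
    --end-time "$end" \
    --period 300 \
    --statistics Maximum \
    --query 'sort_by(Datapoints, &Timestamp)[].[Timestamp, Maximum]' \
    --output text
}

# Usage (placeholder ID and window, times in UTC):
#   ec2_cpu_max i-0123456789abcdef0 2024-08-02T06:00:00Z 2024-08-02T08:00:00Z
```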

 

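According to other blogs,

excessive memory use during a deployment

can make an instance hang and become unreachable.

Apparently, configuring swap space resolves it.

Since this might be a memory problem, I set it up.

Official documentation: https://aws.amazon.com/ko/premiumsupport/knowledge-center/ec2-memory-swap-file/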

# Allocate the swap file (128 MiB x 16 = 2 GiB)
$ sudo dd if=/dev/zero of=/swapfile bs=128M count=16
16+0 records in
16+0 records out
2147483648 bytes (2.1 GB, 2.0 GiB) copied, 32.3344 s, 66.4 MB/s

# Restrict the swap file to owner read/write
$ sudo chmod 600 /swapfile

# Set up the swap area
$ sudo mkswap /swapfile
Setting up swapspace version 1, size = 2 GiB (2147479552 bytes)
no label, UUID=35d6c9cf-8d17-4d91-a238-de51e77983af

# Enable the swap file
$ sudo swapon /swapfile

# Verify
$ sudo swapon -s
Filename				Type		Size		Used		Priority
/swapfile                               file		2097148		0		-2

# Enable the swap file at boot
$ sudo vi /etc/fstab

# Add this as the last line
/swapfile swap swap defaults 0 0

# Check swap usage
$ sudo free -h
               total        used        free      shared  buff/cache   available
Mem:           949Mi       596Mi       165Mi       1.0Mi       187Mi       186Mi
Swap:          2.0Gi          0B       2.0Gi


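# ---

# To undo: disable swap on the swap file first
$ sudo swapoff /swapfile

# Or disable all swap
$ sudo swapoff -a

# Then delete the swap file
# (and remove the /swapfile line from /etc/fstab)
$ sudo rm /swapfile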
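Editing /etc/fstab by hand is easy to get wrong or to repeat by accident. As a sketch under the same assumptions as the steps above (swap file at /swapfile), the entry can be appended only when it is missing; `ensure_swap_fstab_entry` is a hypothetical helper, and it takes the file path as a parameter so it can be tried on a copy first.

```shell
# Append the swap entry to an fstab-style file unless a /swapfile
# entry is already present. Defaults to /etc/fstab (needs root).
ensure_swap_fstab_entry() {
  local fstab="${1:-/etc/fstab}"
  local entry='/swapfile swap swap defaults 0 0'
  if grep -qE '^/swapfile[[:space:]]' "$fstab"; then
    echo "already present in $fstab"
    return 0
  fi
  # If the file does not end with a newline, add one so the new
  # entry does not get glued onto the previous line.
  if [ -s "$fstab" ] && [ -n "$(tail -c 1 "$fstab")" ]; then
    printf '\n' >> "$fstab" || return 1
  fi
  printf '%s\n' "$entry" >> "$fstab" || return 1
  echo "added to $fstab"
}

# Usage: try it on a copy, then run it as root against the real file.
#   cp /etc/fstab /tmp/fstab.test && ensure_swap_fstab_entry /tmp/fstab.test
```

Running it a second time leaves the file unchanged.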

 

Still, nobody had logged in to the instance,

and there had been no deployment and no traffic, so this was odd.

I strongly suspected that some AWS-side daemon,

one doing more network I/O than usual, was misbehaving.

 

I opened the system log.

Use Page Up / Page Down to find the time window (timestamps are in UTC).

Alternatively, in Vi(m), search for the problem time with '/Aug  2 07:19'.

$ vim /var/log/syslog
Aug  2 05:30:33 ip-172-31-0-13 amazon-ssm-agent.amazon-ssm-agent[407]: 2024-08-02 05:30:33 WARN EC2RoleProvider Failed to connect to Systems Manager with instance profile role credentials. Err: retrieved credentials failed to report to ssm. RequestId: 4ef77717-0404-46da-a883-f2ae588ba7f7 Error: AccessDeniedException: User: arn:aws:sts::1234567890:assumed-role/ec2-codedeploy-role/i-1234567890 is not authorized to perform: ssm:UpdateInstanceInformation on resource: arn:aws:ec2:ap-northeast-2:1234567890:instance/i-1234567890 because no identity-based policy allows the ssm:UpdateInstanceInformation action
Aug  2 05:30:33 ip-172-31-0-13 amazon-ssm-agent.amazon-ssm-agent[407]: #011status code: 400, request id: 4ef77717-0404-46da-a883-f2ae588ba7f7
Aug  2 05:30:33 ip-172-31-0-13 amazon-ssm-agent.amazon-ssm-agent[407]: 2024-08-02 05:30:33 ERROR EC2RoleProvider Failed to connect to Systems Manager with SSM role credentials. error calling RequestManagedInstanceRoleToken: AccessDeniedException: Systems Manager's instance management role is not configured for account: 1234567890
Aug  2 05:30:33 ip-172-31-0-13 amazon-ssm-agent.amazon-ssm-agent[407]: #011status code: 400, request id: 859dfb08-be51-4ca6-8652-268a1d0f1c19
Aug  2 05:36:11 ip-172-31-0-13 snapd[413]: storehelpers.go:923: cannot refresh: snap has no updates available: "amazon-ssm-agent", "core18", "core20", "core22", "lxd", "snapd"
Aug  2 05:58:15 ip-172-31-0-13 amazon-ssm-agent.amazon-ssm-agent[407]: 2024-08-02 05:58:15 WARN EC2RoleProvider Failed to connect to Systems Manager with instance profile role credentials. Err: retrieved credentials failed to report to ssm. RequestId: 754ad390-1464-4c21-9397-a6ee41d9426e Error: AccessDeniedException: User: arn:aws:sts::1234567890:assumed-role/ec2-codedeploy-role/i-1234567890 is not authorized to perform: ssm:UpdateInstanceInformation on resource: arn:aws:ec2:ap-northeast-2:1234567890:instance/i-1234567890 because no identity-based policy allows the ssm:UpdateInstanceInformation action
Aug  2 05:58:15 ip-172-31-0-13 amazon-ssm-agent.amazon-ssm-agent[407]: #011status code: 400, request id: 754ad390-1464-4c21-9397-a6ee41d9426e
Aug  2 05:58:16 ip-172-31-0-13 amazon-ssm-agent.amazon-ssm-agent[407]: 2024-08-02 05:58:15 ERROR EC2RoleProvider Failed to connect to Systems Manager with SSM role credentials. error calling RequestManagedInstanceRoleToken: AccessDeniedException: Systems Manager's instance management role is not configured for account: 1234567890
Aug  2 05:58:16 ip-172-31-0-13 amazon-ssm-agent.amazon-ssm-agent[407]: #011status code: 400, request id: 066b66ec-63fc-4ffe-b9d6-a732bbd7ffac
Aug  2 06:17:01 ip-172-31-0-13 CRON[99296]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Aug  2 06:24:17 ip-172-31-0-13 amazon-ssm-agent.amazon-ssm-agent[407]: 2024-08-02 06:24:16 WARN EC2RoleProvider Failed to connect to Systems Manager with instance profile role credentials. Err: retrieved credentials failed to report to ssm. RequestId: bc0f2d74-9ab0-408d-9dec-92a3a1fdaeb6 Error: AccessDeniedException: User: arn:aws:sts::1234567890:assumed-role/ec2-codedeploy-role/i-1234567890 is not authorized to perform: ssm:UpdateInstanceInformation on resource: arn:aws:ec2:ap-northeast-2:1234567890:instance/i-1234567890 because no identity-based policy allows the ssm:UpdateInstanceInformation action
Aug  2 06:24:17 ip-172-31-0-13 amazon-ssm-agent.amazon-ssm-agent[407]: #011status code: 400, request id: bc0f2d74-9ab0-408d-9dec-92a3a1fdaeb6
Aug  2 06:24:17 ip-172-31-0-13 amazon-ssm-agent.amazon-ssm-agent[407]: 2024-08-02 06:24:17 ERROR EC2RoleProvider Failed to connect to Systems Manager with SSM role credentials. error calling RequestManagedInstanceRoleToken: AccessDeniedException: Systems Manager's instance management role is not configured for account: 1234567890
Aug  2 06:24:17 ip-172-31-0-13 amazon-ssm-agent.amazon-ssm-agent[407]: #011status code: 400, request id: de46fb38-0084-478f-9d95-fd4882d61bc9
Aug  2 06:25:01 ip-172-31-0-13 CRON[99303]: (root) CMD (test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily ))
Aug  2 06:33:34 ip-172-31-0-13 systemd[1]: Starting Daily apt upgrade and clean activities...
Aug  2 06:52:08 ip-172-31-0-13 amazon-ssm-agent.amazon-ssm-agent[407]: 2024-08-02 06:52:07 WARN EC2RoleProvider Failed to connect to Systems Manager with instance profile role credentials. Err: retrieved credentials failed to report to ssm. RequestId: 804e19aa-db17-4953-a9a9-b5be78678f47 Error: AccessDeniedException: User: arn:aws:sts::1234567890:assumed-role/ec2-codedeploy-role/i-1234567890 is not authorized to perform: ssm:UpdateInstanceInformation on resource: arn:aws:ec2:ap-northeast-2:1234567890:instance/i-1234567890 because no identity-based policy allows the ssm:UpdateInstanceInformation action
Aug  2 06:52:08 ip-172-31-0-13 amazon-ssm-agent.amazon-ssm-agent[407]: #011status code: 400, request id: 804e19aa-db17-4953-a9a9-b5be78678f47
Aug  2 06:52:12 ip-172-31-0-13 amazon-ssm-agent.amazon-ssm-agent[407]: 2024-08-02 06:52:11 ERROR EC2RoleProvider Failed to connect to Systems Manager with SSM role credentials. error calling RequestManagedInstanceRoleToken: AccessDeniedException: Systems Manager's instance management role is not configured for account: 1234567890
Aug  2 06:52:12 ip-172-31-0-13 amazon-ssm-agent.amazon-ssm-agent[407]: #011status code: 400, request id: c5e8645a-22ad-409a-a5bf-cd900b8858dc
Aug  2 07:17:02 ip-172-31-0-13 CRON[99521]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Aug  2 07:19:16 ip-172-31-0-13 amazon-ssm-agent.amazon-ssm-agent[407]: 2024-08-02 07:19:16 WARN EC2RoleProvider Failed to connect to Systems Manager with instance profile role credentials. Err: retrieved credentials failed to report to ssm. RequestId: ed332665-26e9-4b0b-b188-63beafcee846 Error: AccessDeniedException: User: arn:aws:sts::1234567890:assumed-role/ec2-codedeploy-role/i-1234567890 is not authorized to perform: ssm:UpdateInstanceInformation on resource: arn:aws:ec2:ap-northeast-2:1234567890:instance/i-1234567890 because no identity-based policy allows the ssm:UpdateInstanceInformation action
Aug  2 07:19:16 ip-172-31-0-13 amazon-ssm-agent.amazon-ssm-agent[407]: #011status code: 400, request id: ed332665-26e9-4b0b-b188-63beafcee846
Aug  2 07:19:21 ip-172-31-0-13 amazon-ssm-agent.amazon-ssm-agent[407]: 2024-08-02 07:19:20 ERROR EC2RoleProvider Failed to connect to Systems Manager with SSM role credentials. error calling RequestManagedInstanceRoleToken: AccessDeniedException: Systems Manager's instance management role is not configured for account: 1234567890
Aug  2 07:19:21 ip-172-31-0-13 amazon-ssm-agent.amazon-ssm-agent[407]: #011status code: 400, request id: 5c89ae8f-d1a8-45c8-946f-538ee44a9b2f
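Because the log is in UTC, it can help to convert the local time of the incident into the syslog timestamp prefix and filter on it instead of paging through the file. A small sketch, assuming GNU `date` (as shipped on Ubuntu) and the traditional syslog timestamp format seen above; both function names are made up for illustration.

```shell
# Convert a time with an explicit UTC offset into the "Mon DD HH:MM"
# prefix that syslog uses, expressed in UTC.
# Assumption: GNU date (the -d option is not portable).
syslog_prefix_utc() {
  LC_ALL=C date -u -d "$1" '+%b %e %H:%M'
}

# Print only the lines of a syslog file that start with the prefix.
syslog_at() {
  local prefix="$1" file="${2:-/var/log/syslog}"
  awk -v p="$prefix" 'index($0, p) == 1' "$file"
}

# Usage: lines logged at 16:19 in a UTC+9 time zone.
#   syslog_at "$(syslog_prefix_utc '2024-08-02 16:19 +0900')" /var/log/syslog
```

The `%e` format pads single-digit days with a space, which matches the two spaces in 'Aug  2'.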

 

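Errors related to the SSM role are being logged continuously.

They have been there all along without causing any trouble so far.

However, since one was logged right around the time the instance hung (16:20 KST, i.e. 07:20 UTC), it could be a suspect.

CPU utilization (per-second view)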

 

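On AWS re:Post (AWS Questions), someone reports that their instance stopped working with the error above.

They had also seen the permission error in their logs,

so I added AWS Systems Manager permissions to the IAM role attached to my EC2 instance.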

 

Why my instance stop working?

this is the whole error: amazon-ssm-agent.amazon-ssm-agent[365]: 2024-02-15 18:43:32 ERROR EC2RoleProvider Failed to connect to Systems Manager with SSM role credentials. error calling RequestMana...

repost.aws
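The post does not say exactly which permissions were added. One common way to grant `ssm:UpdateInstanceInformation` and the other agent permissions is to attach the AWS managed policy `AmazonSSMManagedInstanceCore` to the instance role. The sketch below assumes that choice and a configured AWS CLI; the helper names are hypothetical, and the role name in the usage line is the one that appears in the log above.

```shell
# Attach the AWS managed policy for the SSM agent to an IAM role.
# Assumption: AmazonSSMManagedInstanceCore is the policy you want;
# the original post does not state which policy was used.
attach_ssm_core_policy() {
  local role_name="$1"
  aws iam attach-role-policy \
    --role-name "$role_name" \
    --policy-arn arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
}

# List the names of the managed policies attached to the role.
list_role_policies() {
  local role_name="$1"
  aws iam list-attached-role-policies \
    --role-name "$role_name" \
    --query 'AttachedPolicies[].PolicyName' \
    --output text
}

# Usage:
#   attach_ssm_core_policy ec2-codedeploy-role
#   list_role_policies ec2-codedeploy-role
```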

 

 

After that, restart the amazon-ssm-agent service.

$ sudo snap restart amazon-ssm-agent
2024-08-02T15:00:33Z INFO Waiting for "snap.amazon-ssm-agent.amazon-ssm-agent.service" to stop.
Restarted.

$ sudo systemctl status snap.amazon-ssm-agent.amazon-ssm-agent.service
● snap.amazon-ssm-agent.amazon-ssm-agent.service - Service for snap application amazon-ssm-agent.amazon-ssm-agent
     Loaded: loaded (/etc/systemd/system/snap.amazon-ssm-agent.amazon-ssm-agent.service; enabled; vendor preset: enabled)
     Active: active (running) since Fri 2024-08-02 14:25:33 UTC; 34min ago
   Main PID: 378 (amazon-ssm-agen)
      Tasks: 18 (limit: 1120)
     Memory: 78.6M
        CPU: 2.525s
     CGroup: /system.slice/snap.amazon-ssm-agent.amazon-ssm-agent.service
             ├─ 378 /snap/amazon-ssm-agent/7993/amazon-ssm-agent
             └─1872 /snap/amazon-ssm-agent/7993/ssm-agent-worker

Aug 02 14:25:41 ip-172-31-0-13 amazon-ssm-agent.amazon-ssm-agent[378]:         status code: 400, request id: 5e3422fd-85c1-4b34-b168-12d22c84b7e6
Aug 02 14:25:41 ip-172-31-0-13 amazon-ssm-agent.amazon-ssm-agent[378]: 2024-08-02 14:25:39 ERROR [CredentialRefresher] Retrieve credentials produced error: no valid credentials could be retrieved for ec2 identity. Default Host Management E>
Aug 02 14:25:41 ip-172-31-0-13 amazon-ssm-agent.amazon-ssm-agent[378]:         status code: 400, request id: 5e3422fd-85c1-4b34-b168-12d22c84b7e6
Aug 02 14:25:41 ip-172-31-0-13 amazon-ssm-agent.amazon-ssm-agent[378]: 2024-08-02 14:25:39 INFO [CredentialRefresher] Sleeping for 27m38s before retrying retrieve credentials
Aug 02 14:53:17 ip-172-31-0-13 amazon-ssm-agent.amazon-ssm-agent[378]: 2024-08-02 14:53:17 INFO EC2RoleProvider Successfully connected with instance profile role credentials
Aug 02 14:53:17 ip-172-31-0-13 amazon-ssm-agent.amazon-ssm-agent[378]: 2024-08-02 14:53:17 INFO [CredentialRefresher] Credentials ready
Aug 02 14:53:18 ip-172-31-0-13 amazon-ssm-agent.amazon-ssm-agent[378]: 2024-08-02 14:53:17 INFO [CredentialRefresher] Next credential rotation will be in 29.999759138800002 minutes
Aug 02 14:53:18 ip-172-31-0-13 amazon-ssm-agent.amazon-ssm-agent[378]: 2024-08-02 14:53:18 INFO [amazon-ssm-agent] [LongRunningWorkerContainer] [WorkerProvider] Worker ssm-agent-worker is not running, starting worker process
Aug 02 14:53:19 ip-172-31-0-13 amazon-ssm-agent.amazon-ssm-agent[378]: 2024-08-02 14:53:18 INFO [amazon-ssm-agent] [LongRunningWorkerContainer] [WorkerProvider] Worker ssm-agent-worker (pid:1872) started
Aug 02 14:53:19 ip-172-31-0-13 amazon-ssm-agent.amazon-ssm-agent[378]: 2024-08-02 14:53:18 INFO [amazon-ssm-agent] [LongRunningWorkerContainer] Monitor long running worker health every 60 seconds
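To confirm later that the agent is still connecting, the most recent EC2RoleProvider line can be checked automatically rather than read by eye. This is only a sketch that matches on the message text shown in the output above; `ssm_connected` is a hypothetical helper, and the wording of the agent's log messages may differ between versions.

```shell
# Read agent log lines on stdin and succeed (exit 0) only if the
# most recent EC2RoleProvider line reports a successful connection.
ssm_connected() {
  awk '
    /EC2RoleProvider/ { last = $0 }
    END { exit ((last ~ /Successfully connected/) ? 0 : 1) }
  '
}

# Usage, with the unit name from the status output above:
#   journalctl -u snap.amazon-ssm-agent.amazon-ssm-agent.service \
#     --since "1 hour ago" --no-pager | ssm_connected && echo "SSM agent connected"
```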

 

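Wrapping up

I configured a swap file and also added the AWS Systems Manager permissions to the role.

Which of the two was the real culprit remains to be seen.

It's not a fully satisfying fix, but I did what I could.

If anyone knows the cause, please share a link~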