筆記 Debug PostgreSQL WAL 滿ㄌ

前情提要

這是一台架設在 Kubernetes 上,使用 Postgres Operator 部署的單節點資料庫,SRE 大大發現他的硬碟空間滿了,趕快來看一下 由於硬碟滿了 SQL Query 也下不了,於是先幫他加了 2 GB 的 pv。

這是他的 log

2022-01-27 03:09:49,785 INFO: Lock owner: None; I am MY_DB_NAME-0
2022-01-27 03:09:49,786 INFO: starting as a secondary
2022-01-27 03:09:50 UTC [413]: [1-1] 61f20cfe.19d 0     FATAL:  could not write lock file "postmaster.pid": No space left on device
2022-01-27 03:09:50,085 INFO: postmaster pid=413
/var/run/postgresql:5432 - no response
2022-01-27 03:09:50,102 WARNING: Postgresql is not running.
2022-01-27 03:09:50,102 INFO: Lock owner: None; I am MY_DB_NAME-0
2022-01-27 03:09:50,107 INFO: pg_controldata:
  pg_control version number: 1300
  Catalog version number: 202007201
  Database system identifier: 7024396656819753043
  Database cluster state: in archive recovery
  pg_control last modified: Thu Jan 27 02:04:18 2022
  Latest checkpoint location: 10/C800AA10
  Latest checkpoint's REDO location: 10/C800A9D8
  Latest checkpoint's REDO WAL file: 0000000900000010000000C8
  Latest checkpoint's TimeLineID: 9
  Latest checkpoint's PrevTimeLineID: 9
  Latest checkpoint's full_page_writes: on
  Latest checkpoint's NextXID: 0:52538
  Latest checkpoint's NextOID: 50198
  Latest checkpoint's NextMultiXactId: 1
  Latest checkpoint's NextMultiOffset: 0
  Latest checkpoint's oldestXID: 479
  Latest checkpoint's oldestXID's DB: 1
  Latest checkpoint's oldestActiveXID: 52538
  Latest checkpoint's oldestMultiXid: 1
  Latest checkpoint's oldestMulti's DB: 1
  Latest checkpoint's oldestCommitTsXid: 0
  Latest checkpoint's newestCommitTsXid: 0
  Time of latest checkpoint: Thu Jan 27 00:10:32 2022
  Fake LSN counter for unlogged rels: 0/3E8
  Minimum recovery ending location: 10/C9000000
  Min recovery ending loc's timeline: 9
  Backup start location: 0/0
  Backup end location: 0/0
  End-of-backup record required: no
  wal_level setting: replica
  wal_log_hints setting: on
  max_connections setting: 100
  max_worker_processes setting: 8
  max_wal_senders setting: 10
  max_prepared_xacts setting: 0
  max_locks_per_xact setting: 64
  track_commit_timestamp setting: off
  Maximum data alignment: 8
  Database block size: 8192
  Blocks per segment of large relation: 131072
  WAL block size: 8192
  Bytes per WAL segment: 16777216
  Maximum length of identifiers: 64
  Maximum columns in an index: 32
  Maximum size of a TOAST chunk: 1996
  Size of a large-object chunk: 2048
  Date/time type storage: 64-bit integers
  Float8 argument passing: by value
  Data page checksum version: 0
  Mock authentication nonce: da09cd91aff4fc04822b4475aa0944dee4f3a297badcac611f47ff91071988b9

健康檢查時間

總之先 psql 進去資料庫 Query 一些資訊出來看看吧!

pg_database_size 9009 kB 而已呀少少

SELECT pg_size_pretty(pg_database_size('YOUR_DB_NAME'));

退出 psql 直接看一下儲存空間,發現是 WAL log 占用掉了資源。

du -h | sort -h

回到 psql 再來查看一下 WAL 的相關設定

SHOW ALL;
-- 或是
SELECT * FROM pg_settings WHERE name LIKE '%wal%'

注意在 PostgreSQL 13 將 wal_keep_segments 換成了 wal_keep_size,不過基本上還是類似功能的東西

wal_keep_size = wal_keep_segments * wal_segment_size (typically 16MB)

WAL 的 SIZE 設定看上去沒有甚麼問題,不過另外發現了他的同步機制是開啟的,因此只開了單台的 Postgres 沒辦法同步變更到另外一台 Replicate

按照 Postgres Operator Issue 這邊寫的,把 archive_mode 關掉吧!!!

期待有大大幫修這張 Issue,在這之前都要自己記得喔 👍

Ref

https://isdaniel.github.io/postgresql-wal-introduce/

https://kknews.cc/zh-tw/code/g4b4nny.html

https://www.slideshare.net/wenchen3/postgre-sql-wal

https://www.postgresql.fastware.com/blog/how-to-solve-the-problem-if-pg_xlog-is-full

https://github.com/zalando/postgres-operator/issues/1664#issuecomment-961761544