Recently I started to get crashes from my postgres instances:
Code:
Feb 2 03:52:57 LOG: background worker "parallel worker" (PID 79324) was terminated by signal 10: Bus error
Feb 2 14:48:20 LOG: server process (PID 97673) was terminated by signal 11: Segmentation fault
Feb 12 03:38:03 LOG: background worker "parallel worker" (PID 26340) was terminated by signal 10: Bus error
The three crashes happened on three different postgres clusters, which rules out an application-induced issue. The machine has been stable for years, is currently at 14.3-RELEASE, and has seen no significant changes.
The second crash happened right between a ports upgrade and the subsequent reboot, but the ports commit log showed that the upgrade changed only the packing list of postgres (and therefore the *_N FreeBSD port revision, not the binary), so I ruled that out as coincidence. The core dump from the first crash was not retained, but the third one looks like this:
Code:
* thread #1, name = 'postgres', stop reason = signal SIGBUS
* frame #0: 0x0000000829930ac3 libc.so.7`___lldb_unnamed_symbol5890 + 131
frame #1: 0x000000082992da28 libc.so.7`___lldb_unnamed_symbol5865 + 504
frame #2: 0x000000082992e889 libc.so.7`___lldb_unnamed_symbol5871 + 2617
frame #3: 0x000000082990ca84 libc.so.7`___lldb_unnamed_symbol5446 + 644
frame #4: 0x000000082990c6b7 libc.so.7`___lldb_unnamed_symbol5445 + 839
frame #5: 0x0000000829952945 libc.so.7`___lldb_unnamed_symbol6064 + 21
frame #6: 0x0000000829900013 libc.so.7`___lldb_unnamed_symbol5410 + 755
frame #7: 0x00000000009c0577 postgres`AllocSetContextCreateInternal + 199
frame #8: 0x00000000006d588c postgres`ExecAssignExprContext + 108
frame #9: 0x00000000006faab9 postgres`ExecInitSeqScan + 73
frame #10: 0x00000000006cf188 postgres`ExecInitNode + 248
frame #11: 0x00000000006c8440 postgres`standard_ExecutorStart + 1056
AllocSetContextCreateInternal creates a new memory context, i.e. it is part of postgres' own memory allocation layer.
Given this I first suspected a problem with the extra shared memory used for inter-process communication with parallel workers. (Postgres allocates that memory differently: it is attached on demand, and due to ASLR it will not be mapped at the same address in every process, so pointers into it must be recomputed relative to the mapping base.)
Yesterday two more crashes happened:
Code:
Feb 28 04:38:36 LOG: server process (PID 30492) was terminated by signal 10: Bus error
Feb 28 04:39:17 LOG: server process (PID 31023) was terminated by signal 11: Segmentation fault
These again come from two different and independent database clusters. That they happened at almost the same time points to a common cause, probably hardware related.
More interesting are the backtraces:
Code:
* thread #1, name = 'postgres', stop reason = signal SIGBUS
* frame #0: 0x000000082aad3ac3 libc.so.7`___lldb_unnamed_symbol5890 + 131
frame #1: 0x000000082aad0a28 libc.so.7`___lldb_unnamed_symbol5865 + 504
frame #2: 0x000000082aad1889 libc.so.7`___lldb_unnamed_symbol5871 + 2617
frame #3: 0x000000082aaae44d libc.so.7`___lldb_unnamed_symbol5434 + 525
frame #4: 0x000000082aad65f0 libc.so.7`___lldb_unnamed_symbol5920 + 400
frame #5: 0x000000082aaa3177 libc.so.7`___lldb_unnamed_symbol5410 + 1111
frame #6: 0x00000000009c76a7 postgres`GenerationAlloc + 311
frame #7: 0x00000000009c88db postgres`palloc0 + 43
frame #8: 0x000000000050d950 postgres`heap_form_minimal_tuple + 192
frame #9: 0x00000000009d02ae postgres`copytup_heap + 62
frame #10: 0x00000000009d293a postgres`tuplesort_puttupleslot + 58
frame #11: 0x00000000006fb7fc postgres`ExecSort + 316
frame #12: 0x00000000006c880b postgres`standard_ExecutorRun + 299
Code:
* thread #1, name = 'postgres', stop reason = signal SIGSEGV
* frame #0: 0x000000082a5fbac3 libc.so.7`___lldb_unnamed_symbol5890 + 131
frame #1: 0x000000082a5f8a28 libc.so.7`___lldb_unnamed_symbol5865 + 504
frame #2: 0x000000082a5f9889 libc.so.7`___lldb_unnamed_symbol5871 + 2617
frame #3: 0x000000082a5d7a84 libc.so.7`___lldb_unnamed_symbol5446 + 644
frame #4: 0x000000082a5d76b7 libc.so.7`___lldb_unnamed_symbol5445 + 839
frame #5: 0x000000082a61d945 libc.so.7`___lldb_unnamed_symbol6064 + 21
frame #6: 0x000000082a5cb013 libc.so.7`___lldb_unnamed_symbol5410 + 755
frame #7: 0x00000000009c0711 postgres`AllocSetAlloc + 49
frame #8: 0x00000000009c8664 postgres`MemoryContextAllocExtended + 68
frame #9: 0x000000000086ee3a postgres`pgstat_entry_ref_hash_create + 74
frame #10: 0x000000000086e18a postgres`pgstat_get_entry_ref + 1930
frame #11: 0x000000000086aebd postgres`pgstat_prep_pending_entry + 45
frame #12: 0x000000000086c7c9 postgres`pgstat_assoc_relation + 57
Even without debugging symbols one can clearly see that it is the same location in libc all three times (the addresses differ only by the randomized load base; the symbol numbers and offsets in the top three frames match exactly).
The invocation paths are all different, but each of them involves memory allocation.
Now what to make of this?
How likely is it that a machine flaw would always only concern postgres, and always manifest at the very same location in libc?