This is an automated archive.
The original was posted on /r/sysadmin by /u/SilvaHaloOne on 2024-01-24 04:23:54+00:00.
Hi,
We had a SQL server crash in mid-December during operating hours. The databases on it were all related to customer-facing services, and we are still dealing with some fallout from it a month later.
It turns out this server, a Dell PowerEdge R740 with 128 GB of registered ECC DDR4 RAM (8x HMA82GR7CJR8N-XN), was having a bunch of single-bit memory errors and eventually had a memory error it could not recover from. The SupportAssist logs I gathered after the event showed 4 of the DIMMs malfunctioning and unable to self-heal, so Dell dispatched replacement DIMMs. We got them replaced, and the server hasn’t had a full-on crash since then.
However, we are now getting several events a day in Windows telling us that a single-bit memory error was corrected or that a corrected hardware error for Memory error type 13 has occurred. The SupportAssist logs now show that 2 different DIMMs (meaning DIMMs that were not among the 4 replaced) are going through degradation and then self-healing after reboot.
Our Dell support guy says that the issue is addressed and that what is left is normal and within expected parameters. That doesn’t really seem right to me… to me, normal and expected would be one of these every couple of months, and none of the tens of other servers we run have these kinds of errors. However, this server has twice the memory of its closest stablemate, and I also have other fires that I’m trying to put out.
So… do I say “Ok, thank you, I will continue to monitor” and start another service request if and when things change, or do I push more now?
Thank you for the advice!