12-10-2024 10:40 AM
This morning, I discovered my system had mostly locked up. It still responded over the network (ping and listening ports) but was otherwise not operational. After I pressed the reset button, it could not find the root file system on reboot. A power cycle solved that. According to the OS logs, one of the SSD controllers (nvme0) had become inaccessible along with the underlying drive (nda0). My first fear was that another drive was dying on me, as I have had a lot of issues with drives this year. However, the two SSDs are in a ZFS mirror, and I vaguely recall not seeing either of them in the BIOS during the boot. I believe that both SSD controllers went south; nvme0 just happened to be the first one the system noticed.
No S.M.A.R.T. errors (smartctl -a /dev/nvme0 after short self-test):
Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged
Self-test Log (NVMe Log 0x06)
Self-test status: No self-test in progress
Num Test_Description Status Power_on_Hours Failing_LBA NSID Seg SCT Code
0 Short Completed without error 8282 - - - - -
No logged errors stored on the drives (nvmecontrol logpage -p 1 nvme0):
Error Information Log
=====================
No error entries found
No errors prior to the drives falling off the bus (/var/log/messages):
...
Dec 10 03:27:07 workstation kernel: nvme0: Resetting controller due to a timeout.
Dec 10 03:27:08 workstation kernel: nvme0: resetting controller
Dec 10 03:27:28 workstation kernel: nvme0: Waiting for reset to complete
Dec 10 03:27:28 workstation syslogd: last message repeated 147 times
Dec 10 03:27:28 workstation kernel: nvme0: controller ready did not become 0 within 20500 ms
Dec 10 03:27:28 workstation kernel: nvme0: failing outstanding i/o
Dec 10 03:27:28 workstation kernel: nvme0: FLUSH sqid:5 cid:117 nsid:1
Dec 10 03:27:28 workstation kernel: nvme0: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:1 p:0 sqid:5 cid:117 cdw0:0
Dec 10 03:27:28 workstation kernel: nvme0: failing queued i/o
Dec 10 03:27:28 workstation kernel: nvme0: WRITE sqid:10 cid:0 nsid:1 lba:3054061456 len:8
Dec 10 03:27:28 workstation kernel: nvme0: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:0 p:0 sqid:10 cid:0 cdw0:0
Dec 10 03:27:28 workstation kernel: nvme0: failing outstanding i/o
...
Dec 10 03:27:28 workstation kernel: (nda0:nvme0:0:0:1): CAM status: Unknown (0x420)
Dec 10 03:27:28 workstation kernel: (nda0:nvme0:0:0:1): Error 5, Retries exhausted
Dec 10 03:27:28 workstation kernel: nda0: <Samsung SSD 990 PRO 2TB 4B2QJXD7 "serial_number"> s/n "serial_number" detached
...
Dec 10 03:27:28 workstation kernel: (nda0:nvme0:0:0:1): CAM status: Unknown (0x420)
Dec 10 03:27:28 workstation kernel: (nda0:nvme0:0:0:1): Error 6, Periph was invalidated
Dec 10 03:27:28 workstation kernel: (nda0:nvme0:0:0:1): Periph destroyed
...
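For anyone wanting to check their own /var/log/messages for the same pattern, here is a minimal Python sketch that counts the event types seen in the excerpt above. The regular expressions are built only from the messages quoted in this post, and the category labels (timeout_reset, reset_failed, aborted_io, detached) are names I made up for illustration; other driver versions may word these messages differently.

```python
import re
from collections import Counter

# Patterns derived from the kernel messages quoted above.
# The labels on the left are hypothetical names, not anything the kernel emits.
PATTERNS = {
    "timeout_reset": re.compile(r"kernel: (nvme\d+): Resetting controller due to a timeout"),
    "reset_failed": re.compile(r"kernel: (nvme\d+): controller ready did not become 0"),
    "aborted_io": re.compile(r"kernel: (nvme\d+): ABORTED - BY REQUEST"),
    "detached": re.compile(r"kernel: (nda\d+): <.*> s/n .* detached"),
}


def summarize(lines):
    """Count matching events per (device, label) pair."""
    counts = Counter()
    for line in lines:
        for label, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                counts[(match.group(1), label)] += 1
                break  # one label per line is enough
    return counts


# Sample lines copied from the log excerpt in this post.
SAMPLE = [
    "Dec 10 03:27:07 workstation kernel: nvme0: Resetting controller due to a timeout.",
    "Dec 10 03:27:08 workstation kernel: nvme0: resetting controller",
    "Dec 10 03:27:28 workstation kernel: nvme0: controller ready did not become 0 within 20500 ms",
    "Dec 10 03:27:28 workstation kernel: nvme0: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:1 p:0 sqid:5 cid:117 cdw0:0",
    'Dec 10 03:27:28 workstation kernel: nda0: <Samsung SSD 990 PRO 2TB 4B2QJXD7 "serial_number"> s/n "serial_number" detached',
]

for (device, label), count in sorted(summarize(SAMPLE).items()):
    print(device, label, count)
```

To scan a real log, pass an open file instead of SAMPLE, for example `summarize(open("/var/log/messages", errors="replace"))`.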
After the power cycle, everything appears to be working. It could have been a fluke, but I am worried my motherboard was upset last night and will do this again.
Has anyone had any odd issues with their NVMe drives disappearing?
Notes/thoughts:
12-13-2024 08:27 AM
It's a motherboard defect.
After I installed an SSD in the third M.2 slot, slots 2 and 3 are deactivated whenever EXPO is enabled. Only slot 1 stays alive.
Please replace the motherboard and the symptom will disappear.
I discovered it on October 30, 2024 and reported the motherboard defect to ASUS on October 31, 2024, but they ignored it and said that it was an SSD and CPU defect. However, it was a motherboard defect.
12-13-2024 05:13 PM
That is interesting. Perhaps yours is AMD-specific, since you mention EXPO? Did it use to work with your motherboard? Did ASUS replace it under warranty?
It works for me nearly all the time; it was just a single occurrence. I used to have XMP enabled but now just bump the speed to 4800 (JEDEC) due to having 128GB of RAM.
OTOH, mine could be trying to do something similar with Intel/XMP that yours did. It could be a BIOS bug too. I am watching it carefully.
12-13-2024 05:16 PM
4800 is fine. Anything higher than EXPO 4800 causes problems.
12-13-2024 05:27 PM
I did experience those problems with anything above 4800, but I was also having some oddities that I could not pin down with XMP+4800. Non-XMP+4800 seems to help there, and I did not want to spend forever trying to find stable memory settings.
Also, XMP I at 5600 for my memory was definitely wrong. It was setting the MC voltage to just above 1.2V when 5200 had over 1.3V. XMP II at 5600 was saner. Regardless, MemTest86 failed quickly. I actually run a lot of code compilations to test the memory settings because MemTest86 will not detect issues at 4800 for me. Spinning off builds of OpenJDK 8 and 17 plus a few other programs seems to detect things the best.
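For reference, the build-loop idea can be sketched in Python as below. This is only a rough outline, assuming that intermittent failures of a build that normally passes hint at unstable memory settings. The stress helper is a hypothetical name of my own, and the trivial command in the example call is a placeholder for a real build command.

```python
import subprocess
import sys


def stress(cmd, runs):
    """Run cmd `runs` times and return (run_index, exit_code) for each failed run."""
    failures = []
    for i in range(runs):
        # Capture output so repeated runs do not flood the terminal.
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            failures.append((i, result.returncode))
    return failures


# Placeholder command: replace with the actual build invocation you want to repeat.
print(stress([sys.executable, "-c", "pass"], runs=3))
```

An empty list means every run exited cleanly. Running several of these loops in parallel, one per build tree, would be closer to what I described, since that loads more of the RAM at once.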