NVMe controllers disappeared

Hari_Seldon
Level 9

This morning, I discovered my system had mostly locked up.  It responded on the network (ping and listening ports) yet was not operational.  After I pressed the reset button, it could not find the root file system on reboot; a power cycle solved that.  According to the OS logs, one of the SSD controllers (nvme0) had become inaccessible along with the underlying drive (nda0).  My first fear was that another drive was dying on me, as I have had a lot of issues with drives this year.  However, I recalled that the two SSDs are in a ZFS mirror, and I vaguely remembered that the BIOS did not show either drive during boot.  I believe both SSD controllers went south; nvme0 just happened to be the first one the system noticed.

No S.M.A.R.T. errors (smartctl -a /dev/nvme0 after short self-test):
Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged

Self-test Log (NVMe Log 0x06)
Self-test status: No self-test in progress
Num  Test_Description  Status                       Power_on_Hours  Failing_LBA  NSID Seg SCT Code
 0   Short             Completed without error                8282            -     -   -   -    -
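For anyone who wants to reproduce the check, this is roughly the sequence I used.  These are standard smartmontools invocations; substitute your own controller device if it enumerates differently:

  # Kick off a short self-test, then wait for it to complete
  smartctl -t short /dev/nvme0
  # Re-read the full report, including the self-test log shown above
  smartctl -a /dev/nvme0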

No logged errors stored on the drives (nvmecontrol logpage -p 1 nvme0):
Error Information Log
=====================
No error entries found
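The drive-resident health counters can be pulled the same way.  A minimal sketch with FreeBSD's nvmecontrol: log page 1 is the Error Information page quoted above, and page 2 is the standard NVMe SMART/Health Information page:

  # Error Information log (page 1), as quoted above
  nvmecontrol logpage -p 1 nvme0
  # SMART/Health Information log (page 2): temperature, spare capacity, media errors
  nvmecontrol logpage -p 2 nvme0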

No errors prior to the drives falling off the bus (/var/log/messages):
...
Dec 10 03:27:07 workstation kernel: nvme0: Resetting controller due to a timeout.
Dec 10 03:27:08 workstation kernel: nvme0: resetting controller
Dec 10 03:27:28 workstation kernel: nvme0: Waiting for reset to complete
Dec 10 03:27:28 workstation syslogd: last message repeated 147 times
Dec 10 03:27:28 workstation kernel: nvme0: controller ready did not become 0 within 20500 ms
Dec 10 03:27:28 workstation kernel: nvme0: failing outstanding i/o
Dec 10 03:27:28 workstation kernel: nvme0: FLUSH sqid:5 cid:117 nsid:1
Dec 10 03:27:28 workstation kernel: nvme0: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:1 p:0 sqid:5 cid:117 cdw0:0
Dec 10 03:27:28 workstation kernel: nvme0: failing queued i/o
Dec 10 03:27:28 workstation kernel: nvme0: WRITE sqid:10 cid:0 nsid:1 lba:3054061456 len:8
Dec 10 03:27:28 workstation kernel: nvme0: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:0 p:0 sqid:10 cid:0 cdw0:0
Dec 10 03:27:28 workstation kernel: nvme0: failing outstanding i/o
...
Dec 10 03:27:28 workstation kernel: (nda0:nvme0:0:0:1): CAM status: Unknown (0x420)
Dec 10 03:27:28 workstation kernel: (nda0:nvme0:0:0:1): Error 5, Retries exhausted
Dec 10 03:27:28 workstation kernel: nda0: <Samsung SSD 990 PRO 2TB 4B2QJXD7 "serial_number"> s/n "serial_number" detached
...
Dec 10 03:27:28 workstation kernel: (nda0:nvme0:0:0:1): CAM status: Unknown (0x420)
Dec 10 03:27:28 workstation kernel: (nda0:nvme0:0:0:1): Error 6, Periph was invalidated
Dec 10 03:27:28 workstation kernel: (nda0:nvme0:0:0:1): Periph destroyed
...
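Since I am watching for a recurrence, something as simple as the following flags any new controller resets.  Just a sketch; adjust the log path if your syslog layout differs:

  # Flag controller resets or failed resets since the last log rotation
  grep -E 'nvme[0-9]+: (Resetting controller|controller ready did not)' /var/log/messages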

After the power cycle, everything appears to be working.  It could have been a fluke, but I am worried my motherboard was upset last night and will do this again.

Has anyone had any odd issues with their NVMe drives disappearing?

Notes/thoughts:

  • Motherboard:  Z790-F WiFi I (BIOS:  v2703)
  • CPU:  i7-14700K
  • Both Samsung 990 Pro drives are at the latest firmware:  4B2QJXD7
  • Drives are in a ZFS mirror for redundancy, so the system should have booted if only one was present (a quick pool check is sketched after this list).
  • The CPU is undervolted.  I strongly doubt this is the cause, though I cannot rule it out, since the BIOS should have seen the drives after the soft reset.
  • This is the first time I have seen this.  I have been running BIOS v2703 for a couple of months and the system since January.
  • I did not see nvme1 complain in the system logs, but it was missing too after the soft reset.
  • The drive light on the case was flashing continuously as if the drives were very busy.  There are some spinning rust drives too, but I did not hear them at all.
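For the ZFS point above, the quick health check is just zpool status.  A sketch; the pool name zroot is only an assumption for a typical root-on-ZFS install, so substitute your own:

  # -x prints only pools with problems; no output means the mirror is healthy
  zpool status -x
  # Full detail for the root pool, including per-device state and scrub/resilver info
  zpool status zroot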

Inugami0909
Level 7

It's a motherboard defect.  After installing an SSD in the third M.2 slot, activating EXPO deactivates slots 2 and 3; only slot 1 stays alive.  Replace the motherboard and the symptom will disappear.

I discovered this on October 30, 2024 and reported the motherboard defect to ASUS on October 31, 2024, but they ignored it and said it was an SSD and CPU defect.  However, it was a motherboard defect.

That is interesting.  Perhaps yours is AMD-specific, since you mention EXPO?  Did it use to work with your motherboard?  Did ASUS replace it under warranty?

It works for me nearly all the time; it was just a single occurrence.  I used to have XMP enabled but now just bump the speed to 4800 (JEDEC) due to having 128GB of RAM.

OTOH, mine could be doing something similar with Intel/XMP to what yours did with EXPO.  It could be a BIOS bug too.  I am watching it carefully.

Inugami0909
Level 7

4800 is fine.  Anything higher than EXPO 4800 causes problems.

I did experience those problems with anything above 4800, but I was also having some oddities with XMP at 4800 that I could not pin down.  Non-XMP at 4800 seems to help there, and I did not want to spend forever hunting for stable settings by hand-tweaking the memory.

Also, XMP I at 5600 for my memory was definitely wrong.  It was setting the MC voltage to just above 1.2V when 5200 had over 1.3V.  XMP II at 5600 was saner.  Regardless, MemTest86 failed quickly.  I actually run a lot of code compilations to test the memory settings because MemTest86 will not detect issues at 4800 for me.  Spinning off builds of OpenJDK 8 and 17 plus a few other programs seems to detect things the best.
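For reference, the compile-based stress test is nothing fancy.  A rough sketch of the idea; the source path and -j width are placeholders for whatever large tree you have handy:

  # Repeatedly rebuild a large source tree; intermittent build failures
  # under sustained load are a good hint the memory settings are not stable.
  while true; do
    make -C /path/to/large/source clean
    make -C /path/to/large/source -j16 || { echo "build failed: suspect memory"; break; }
  done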