FIXED: ROG STRIX X870-E and 5070 Ti Kernel Dumps / TDR_RECOVERY_CONTEXT not found.

Nelno
Level 7

Hello everyone.

I've been experiencing black screens that turned out to be kernel crashes. The dump files point at the NVIDIA driver and report "TDR_RECOVERY_CONTEXT not found."

I recently found a solution to this issue, so I want to share it here in case anyone else experiences the problem.

BACKGROUND
I recently purchased and built a new PC with a Ryzen 9 9950X3D, ROG STRIX X870-E MB, 128 GB RAM and a Zotac RTX 5070 Ti Solid SFF OC. The machine is watercooled and uses a Thermal Grizzly thermal pad which keeps the CPU temps around 63C under max load, which is phenomenal. I say this to make the point that there's no lingering heat issue causing the problems I have been having.

I had to wait a week or two for the 5070 Ti to arrive. During that time I spent multiple hours playing games on this PC on the onboard graphics, and ran several hours of burn-in tests with OCCT, FurMark and 3DMark. I experienced zero issues. Yes, the onboard graphics are slow, but they sufficed for most 2D games at < 4K resolution.

When the Zotac card arrived I installed it and almost immediately started experiencing black screens. I am a software engineer working on high-performance realtime 3D software so I immediately checked the dump files being produced and they always showed a crash in the NVIDIA drivers with "TDR_RECOVERY_CONTEXT not found."

This first occurred with driver 576.80, then 576.88, the latest 580.88, and even the first 5070 Ti compatible driver, 572.47 (always did a DDU uninstall from safe mode between driver changes).

These issues always occurred when the GPU was not under load. I could play GPU-intensive games for hours, but when I was doing something non-intensive, like writing an email in my browser, the machine would black screen. This was an important clue.

FIXES TRIED

  1. Seeing consistent crashes in the NVIDIA driver, having no other hardware issues detected over two weeks of use, and having confirmed the crashes went away when the card was removed, I suspected an issue with the card and RMA'd it. That took another 2 weeks. During that time, I used the onboard graphics again without a single issue with the PC.
  2. Upon installing the new Zotac Solid SFF OC 5070 Ti, I immediately started getting crashes again.
  3. I tried all of the aforementioned drivers without any luck. Often the machine would black screen within a few minutes of a restart (even after a full power down).
  4. I installed the latest Windows update on advice I received through NVIDIA driver feedback, as they said some users reported this helped. Didn't help at all.
  5. I installed the latest BIOS on my ROG STRIX X870-E. No change whatsoever.
  6. I tried both overclocked and not overclocked CPU. No difference.
  7. After doing some research with Grok, I suspected this could be power related, so I:
    1. Double-checked and reseated all of the cables to motherboard and GPU. No change.
    2. Began logging my GPU 12V input voltage with HWInfo. No voltage problems indicated at all, especially no anomalies associated with the crash.
  8. More research led me to try disabling Windows MPO (Multi-Plane Overlay). No change.
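
For anyone who wants to repeat the voltage check from step 7, here is a minimal sketch of scanning a CSV sensor log for out-of-range 12V samples. It is not what was used above (the readings were just watched in HWInfo). The column name, the sample rows, and the 5% tolerance are all assumptions for illustration, so substitute whatever header your own log actually contains.

```python
import csv
import io


def find_voltage_outliers(csv_text, column, nominal=12.0, tolerance=0.05):
    """Return (row_number, volts) for samples outside nominal +/- tolerance.

    `column` must match the header in your log exactly; the name used in the
    example below is hypothetical.
    """
    low = nominal * (1 - tolerance)
    high = nominal * (1 + tolerance)
    outliers = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=1):
        raw = (row.get(column) or "").strip()
        try:
            volts = float(raw)
        except ValueError:
            continue  # skip blank or non-numeric cells
        if not low <= volts <= high:
            outliers.append((row_number, volts))
    return outliers


# Made-up sample data, not a real measurement from this machine.
sample_log = (
    "Time,GPU 12V Input Voltage [V]\n"
    "10:00:01,12.05\n"
    "10:00:02,11.98\n"
    "10:00:03,11.20\n"
)
print(find_voltage_outliers(sample_log, "GPU 12V Input Voltage [V]"))
```

An empty list means no sample left the window. Lining up the timestamps of any outliers with the crash times is the part that matters.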

THE FIX
By now it was very clear this was only happening when the GPU was not under high load. My next step was to go into the NVIDIA Control Panel and under 3D Settings -> Manage 3D Settings I changed "Power management mode" to "Prefer maximum performance".

This finally had an effect. After black screening multiple times per day, I have not had a crash in 3 days since changing this setting.
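
To confirm the setting actually keeps the card out of its low-power states, the performance state can be sampled with `nvidia-smi`. The sketch below is one way to do that, not part of the original troubleshooting. The query fields `pstate`, `clocks.gr` and `temperature.gpu` are standard `nvidia-smi` fields, but the P8 idle threshold and the sample line are assumptions, since the idle state a given card reports may differ.

```python
import subprocess

QUERY_FIELDS = "pstate,clocks.gr,temperature.gpu"


def read_gpu_state_line():
    """Run nvidia-smi once and return the raw CSV line for the first GPU."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip().splitlines()[0]


def parse_gpu_state(line):
    """Turn a line such as 'P8, 210, 38' into a dict."""
    pstate, clock_mhz, temp_c = [field.strip() for field in line.split(",")]
    return {"pstate": pstate, "clock_mhz": int(clock_mhz), "temp_c": int(temp_c)}


def is_low_power(state, idle_threshold=8):
    """P0 is the highest performance state; larger numbers mean lower power.

    idle_threshold=8 is an assumed cut-off, not a documented constant.
    """
    return int(state["pstate"].lstrip("P")) >= idle_threshold


# Illustrative line only; on a real machine use parse_gpu_state(read_gpu_state_line()).
example = parse_gpu_state("P8, 210, 38")
print(example, "low power:", is_low_power(example))
```

Polling this every few seconds while the desktop is idle should show whether the card still drops to a low-power state with "Prefer maximum performance" selected.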

So it appears there is some issue or incompatibility with 5070 Ti cards in general, or with these Zotac cards specifically, when they go into low power mode. I have no reason to suspect it's the Zotac card, but I don't have another brand of 5070 Ti lying around to try.

The upside is no black screens or any other weirdness. The downside is my video card is running around 36 - 50C when idling or under minimal load. The heat coming off the machine is noticeably more than when the GPU was allowed to go into low power mode.

So, that's it. If you're experiencing black screens and the minidumps show TDR_RECOVERY_CONTEXT as the cause AND you notice this doesn't seem to happen under heavy GPU load, then setting the GPU to "Prefer maximum performance" may fix the issue for you.

There may be other ways to fix this, like changing Windows' PCIe Link State Power Management to "off", or undervolting the GPU, but I haven't tried any of that. I'll keep the PC as is for a couple more days, then revert back to "Normal" for the Power management mode in the NVIDIA Control Panel, at which point I will expect the crashes to start happening again.
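
For reference, the PCIe Link State Power Management setting can also be changed from an elevated command prompt with `powercfg` instead of the Power Options dialog. The sketch below only builds and prints the commands by default. It assumes the `SUB_PCIEXPRESS` and `ASPM` aliases are present on your system (check with `powercfg /aliases`) and that index 0 corresponds to "Off".

```python
import subprocess


def aspm_off_commands(on_battery=False):
    """Build powercfg commands that set Link State Power Management to Off (index 0)."""
    value_flag = "/setdcvalueindex" if on_battery else "/setacvalueindex"
    return [
        ["powercfg", value_flag, "SCHEME_CURRENT", "SUB_PCIEXPRESS", "ASPM", "0"],
        # Re-activate the current scheme so the changed value takes effect.
        ["powercfg", "/setactive", "SCHEME_CURRENT"],
    ]


def run_commands(commands, dry_run=True):
    """Print the commands, or execute them when dry_run is False (needs admin rights)."""
    for command in commands:
        if dry_run:
            print(" ".join(command))
        else:
            subprocess.run(command, check=True)


run_commands(aspm_off_commands())
```

Passing `dry_run=False` applies the change on Windows. Leaving it at the default just shows what would be run.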

If I try any other fixes or it turns out I'm just lucky and this didn't fix it, I'll post more info.

Also, if anyone has any other ideas besides preventing my GPU from down clocking, happy to hear them.

Accepted Solutions

Nelno
Level 7

A few more data points:

  1. PCIe Link State Power Management "off" with "Prefer maximum performance" and PCIe Gen 5 doesn't fix it.
  2. Setting the GPU slot to PCIe Gen 3 fixes it even with "Power management mode" set to "Normal". This allows the GPU to run cooler at idle.
  3. Setting the GPU slot to PCIe Gen 4 also seems to have fixed it, even with "Power management mode" set to "Normal", but I've only been running that config for 1-2 days.

So it appears the combination of PCIe Gen 5 + "Power management mode" "Normal" exhibits the problem.
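
After changing the slot speed in the BIOS, it is worth verifying that the cap took effect. A minimal sketch using `nvidia-smi`'s `pcie.link.gen.current` and `pcie.link.gen.max` fields follows. The assumption here is that `pcie.link.gen.max` reflects the BIOS slot limit on your system. The current generation can read lower than the maximum at idle because the link speed may be reduced when the GPU is not in use, so the maximum is the value to compare. The sample lines are made up.

```python
def parse_link_gen(line):
    """Parse output of:
    nvidia-smi --query-gpu=pcie.link.gen.current,pcie.link.gen.max --format=csv,noheader,nounits
    e.g. '1, 5' -> (1, 5)
    """
    current, maximum = (int(field.strip()) for field in line.split(","))
    return current, maximum


def link_capped_at(line, target_gen):
    """True if the reported maximum link generation is at or below target_gen."""
    _, maximum = parse_link_gen(line)
    return maximum <= target_gen


# Illustrative lines only, not captured from the machine in this thread.
print(link_capped_at("1, 5", 4))  # slot still allows Gen 5
print(link_capped_at("2, 4", 4))  # slot limited to Gen 4
```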
