3 weeks ago
Specs:
ASUS ROG Strix G513 Advantage Edition
AMD Ryzen 9 5980HX
AMD Radeon RX 6800M
2 x 16 GB DDR4 SDRAM
Issue:
The computer will intermittently hard shut down while under heavy load. It does not restart on its own. The System log in Event Viewer shows Event ID 6008 (unexpected shutdown).
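To see exactly when each unexpected shutdown was recorded, the 6008 entries can be pulled out of the System log from a script. The following is a minimal sketch that assumes Windows' built-in `wevtutil` tool is on the PATH; the function names are made up for illustration.

```python
import subprocess

def build_wevtutil_query(event_id=6008, max_events=20):
    """Build a wevtutil command listing the newest System-log events with this ID."""
    xpath = f"*[System[(EventID={event_id})]]"
    return [
        "wevtutil", "qe", "System",  # qe = query events from the System log
        f"/q:{xpath}",               # XPath filter on the event ID
        f"/c:{max_events}",          # maximum number of events to return
        "/rd:true",                  # reverse direction: newest first
        "/f:text",                   # human-readable text output
    ]

def list_unexpected_shutdowns(max_events=20):
    """Run the query (Windows only) and return wevtutil's text output."""
    result = subprocess.run(
        build_wevtutil_query(6008, max_events),
        capture_output=True, text=True, check=True,
    )
    return result.stdout

# Usage on the affected machine:
# print(list_unexpected_shutdowns())
```

Comparing the timestamps with the gaming sessions may show whether the shutdowns cluster around a particular game or time into a session.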
Troubleshooting Steps Taken:
Repasted with Noctua NT-H2 thermal paste and cleaned the interior
Using a cooling pad
Uninstalled Armoury Crate
Used HWiNFO to confirm the system was not overheating or throttling at the time of shutdown
Computer passed OCCT CPU and Power stress tests
Limited power to 45 W for the CPU and 145 W total
Undervolted by -1 in G-Helper and restricted the temperature limit to 85 degrees Celsius
Computer passed Microsoft RAM diagnostics
Factory reset computer
Notes:
I'm not very tech savvy, but I have been trying to resolve this issue for a while. It started about two months ago: the computer was hard shutting off while playing New World, Ark: Survival Ascended, and Diablo 4. These are all pretty heavy games, so I immediately suspected overheating. I started watching the temps and, sure enough, they were rising to 94-96 degrees Celsius (which seemed awfully high to me), but I read that for this AMD CPU that temperature is high but not in the range that triggers a hard shutdown; I read that the manufacturer's shutdown limit is 110 degrees. Additionally, the machine would run at that constant temperature for a random amount of time before shutting down, with no consistent pattern suggesting a particular overheating point. Twice I happened to be watching the temperature in HWiNFO when the shutdown occurred, and both times it was sitting at 94 degrees. Nonetheless, I repasted the CPU and GPU, since the liquid metal had been pushed to the side (I figured thermal paste would be better than half-applied liquid metal). At first I thought that had solved the issue, but within a few days it was back to shutting down intermittently. The GPU stays at around 60-70 degrees under load, so I didn't suspect it.
So, I am now at a loss as to what to try next. I read that a bad power supply could be causing this, but I am confused about why the battery wouldn't take over when the AC adapter is plugged in and the device shows a full charge rate in HWiNFO.
Any suggestions or ideas are welcome. As mentioned, I am not very tech savvy; I have just been following guides and trying to think logically about it.
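Because the shutdown was only caught on screen by luck, it may help to have HWiNFO log its sensors to a CSV file and then look at the last complete readings before power was lost. The sketch below assumes such a log exists; the column headers and values in `sample` are hypothetical, since the real ones depend on the machine and HWiNFO version, and the helper name is invented for illustration.

```python
import csv
import io

def last_readings(csv_text, wanted=("Date", "Time", "CPU", "GPU"), rows=5):
    """Return the last `rows` complete data rows, keeping only the columns
    whose header contains one of the `wanted` substrings."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, None)
    if header is None:
        return []
    keep = [i for i, name in enumerate(header)
            if any(w.lower() in name.lower() for w in wanted)]
    # A hard power-off can cut the final line short, so skip incomplete rows.
    data = [row for row in reader if len(row) >= len(header)]
    return [{header[i]: row[i] for i in keep} for row in data[-rows:]]

# Hypothetical log excerpt; the final line is truncated on purpose.
sample = (
    "Date,Time,CPU Temp [C],GPU Temp [C],Fan [RPM]\n"
    "1.1.2024,20:00:01,93.5,66.0,5200\n"
    "1.1.2024,20:00:03,94.0,67.0,5200\n"
    "1.1.2024,20:00:05,94"
)

for reading in last_readings(sample, rows=2):
    print(reading)
```

For a real log, the file contents would be read into a string first; the file's text encoding is another assumption to check, as it may not be UTF-8.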
3 weeks ago
It is really hard to tell with this type of thing. It could be a driver issue, but it's hard to say since it has been going on for so long. Did you perhaps have a BIOS update around the time this started? Are the CPU temps still that high?
The easiest thing you can do is uninstall all drivers, or at least any you might have installed around that time, and reinstall fresh ones (either a newer version if available, or an older version that used to be stable), then test again.
How long was the stress test you ran? If it was at least 30 minutes to an hour long, then the issue might even be related to the games themselves or how they interact with your config/drivers. So does it happen in other games, or just those 3? Can you try something like Cyberpunk or some other heavy game and see if it also crashes in that?
What looks like a software issue is often hard to diagnose in person, let alone over chat on a forum with limited info.
3 weeks ago
Hello,
I don't believe I had any BIOS updates around when the problem started. I did, however, update the driver with AMD Adrenalin around that time, but I did the update in response to the problem. I checked the driver and I was quite a few versions behind, so I thought updating it might solve the issue. It did not.
The CPU temps are still that high, yes. I started using Ryzen Controller to control the temps more but without that software, the temps for the CPU will average at 93 degrees Celsius.
Does factory resetting uninstall and reinstall all drivers? If so, I did that and the problem persisted. If it doesn't reinstall the drivers, I would be happy to do that, but would you be able to point me to a guide on how to do it properly? I am not very tech savvy but I can usually follow guides well enough.
The stress tests were for 30min each. I did a 1-hour long Cinebench test too and it had no errors but it did run very hot during that (like 96-98).
The issue so far has only happened with those games mentioned plus World of Warcraft once and Last Epoch rarely. Diablo 4 is my most played game and the largest offender. In terms of time, for understanding:
I have played probably around 100hr of Diablo 4 in the last two months and the crash has happened 20+ times during this play.
I played Ark about 30hr and it has happened maybe 4-5 times during that play.
New World about 30hr and happened also about 4-5 times during that play.
WoW for about 30hr and it happened 1 time.
Last Epoch about 50hr and happened 4-5 times during that play.
Satisfactory about 20hr and never.
Factorio about 20hr and never.
AoE4 about 20hr and never (though it now lags rather badly)
I hope this information helps and I greatly appreciate your feedback and support. Does this all seem consistent with a software issue? If it is, it also survived or came back after I factory reset the machine, so it must be software that I use regularly.
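The play-time figures above can be put on a common scale by working out shutdowns per 10 hours of play for each game. This is only a rough sketch: "20+" is taken as 20 (a lower bound) and "4-5" as 4.5, both of which are assumptions, and the hours themselves are estimates.

```python
# (game, hours played, shutdowns) as reported in the post above.
# Assumptions: "20+" is counted as 20 and "4-5" as 4.5.
sessions = [
    ("Diablo 4", 100, 20),
    ("Ark", 30, 4.5),
    ("New World", 30, 4.5),
    ("WoW", 30, 1),
    ("Last Epoch", 50, 4.5),
    ("Satisfactory", 20, 0),
    ("Factorio", 20, 0),
    ("AoE4", 20, 0),
]

def shutdowns_per_10h(hours, shutdowns):
    """Normalise a shutdown count to a rate per 10 hours of play."""
    return 10 * shutdowns / hours

rates = {game: shutdowns_per_10h(hours, count) for game, hours, count in sessions}

# Print the games from most to least affected.
for game, rate in sorted(rates.items(), key=lambda item: item[1], reverse=True):
    print(f"{game:<14} {rate:.1f} shutdowns per 10 h")
```

On these rough numbers Diablo 4 comes out at about 2 per 10 hours, Ark and New World at about 1.5, Last Epoch just under 1, WoW about 0.3, and the remaining three at zero.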
3 weeks ago
Hello,
Did you take any pictures of the motherboard with the heatsink removed? Would be good to see how it was and will help with explaining points of interest.
When the laptop turns off are you able to power on straight away or is there a delay?
A delay usually means a protection circuit was triggered: over-voltage, over-current, excessive power draw, or a thermal limit being reached. You have more than just the CPU and GPU temps to consider. Both are powered by circuitry on the motherboard, and when you removed the heatsink you would have seen a row of square / rectangular parts covered in a paste (maybe blue). These power the CPU and GPU, and they have their own thermal protection that isn't usually measured in software unless the manufacturer provides sensor readings (VR temps).
Did you clean and repaste any of this? (example of area I am talking about below)
The power supply and battery are used in combination when the laptop is under high load. Should the power supply fail, the battery will not cope with the power draw and the laptop might shut down. (I have not tested this; to simulate it you could play one of the games and pull the plug to see what happens.)
What options do you have for reducing the CPU frequencies?
If you limit it to 3.3 GHz and disable all boost clocks, what happens?
3 weeks ago
Hello,
Thank you for your reply. I attached a picture of the CPU and GPU area that I took before repasting. I don't have any other, better pictures, sadly. I did clean and add thermal paste (non-electrically-conductive) to all of the little chips that are off to the side of the GPU and CPU as well.
The laptop turns off straightaway and I can reboot straightaway. As soon as it turns off I can hit the power button and it will boot up with no issues at all.
I have read about the power supply combination but have not tried to unplug it during load to see what happens. I can/will give that a try and see if that causes the problem. Thank you for that suggestion, it is a good idea.
As for reducing the CPU speeds, I have G-Helper and Ryzen Controller. I used Ryzen Controller yesterday to limit the CPU temp to 80 degrees (I believe it also lets me limit clock speed, though I did not do that directly) and then played Diablo for about 2 hours (on medium settings) and did not have a shutdown, though I did see a fairly noticeable performance reduction. I will also need to test this more, as the problem is intermittent and could surface in a day or two with more time.
On the above: in the past I did use G-Helper to disable boost and limit power to 45 W, and it still crashed.
Thank you so much for your help and advice. I really appreciate it!
3 weeks ago
As long as you are sure the paste is making good contact with the VRM (yellow and green bits) and GPU Memory chips it should be ok.
In your image, the part inside the blue square outline looks odd: it seems to sit at the wrong angle and appears to be lifting off the board.
And from what you saw, did anything look baked? For example, was there browning of the thermal paste near the chips, or has the motherboard browned in the areas near the yellow, green and blue markings?
3 weeks ago
Here are photos of the motherboard as it is right now, after I changed the paste. The blue square piece seems to be okay, just a weird camera angle. I didn't see any noticeable browning either.
For the thermal paste, I used Noctua NT-H1 since it was well reviewed.
I downloaded "WhoCrashed" and it confirmed that no dump files are being generated when it shuts down due to the issue. Is that a good thing?
I can try undervolting more, and in general will try to use it more to see if it still shuts down with the temperature restriction in place. I will also try abruptly pulling the power supply to see if I can simulate the issue. I am attending an event today, so it may be a couple of days before I have conclusive results on that. Do you have any other suggestions, or is there information I can provide in the meantime? Again, thank you so much for your help!
3 weeks ago
I would say the temps are too high and an undervolt could maybe help, and it might still be the correct thing to try, but I'm not familiar enough with AMD undervolting to give advice on that. The bigger puzzle is that the system is stable in a 1-hour Cinebench test. Cinebench will load your CPU a lot more than any game ever can, so I can't see how the CPU temps alone would be the issue.
I can still see voltage being an issue, and the undervolt could help, because it might only be a problem when both GPU and CPU are running; maybe in those games the CPU draws too much extra power... But I can't think why it would change out of the blue. The fact that limiting the max temp helps kind of implies that might be it, but you did only run it for 2 hours once, so it could be a coincidence. Either way, attempting an undervolt can't hurt.
As for the picture of the motherboard... is that after you cleaned the memory modules and inductors? They look very dry to me. I mean, there's usually more paste on the heatsink side, but even so, that is very little... Also, what thermal compound did you use to replace it? Just using normal CPU thermal paste there isn't good. The best thing to use is probably Upsiren U6 Pro, or something similar.
Also, do you get a .dmp file after crashes? You should find those in C:\Windows\Minidump. That could give more insight if you have any related to those crashes.
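A quick way to check for dump files and compare their timestamps with the shutdowns is to list the folder from a script. This is a minimal sketch assuming the default Windows minidump location; the function name is invented for illustration, and reading the folder may need an elevated prompt.

```python
from datetime import datetime
from pathlib import Path

def list_dump_files(folder):
    """Return (name, modified time, size in bytes) for each .dmp file, newest first."""
    folder = Path(folder)
    if not folder.is_dir():
        return []  # folder missing: no dumps have been written there
    dumps = sorted(folder.glob("*.dmp"),
                   key=lambda p: p.stat().st_mtime, reverse=True)
    return [(p.name, datetime.fromtimestamp(p.stat().st_mtime), p.stat().st_size)
            for p in dumps]

# Usage on the affected machine (default location, assumed here):
# for name, modified, size in list_dump_files(r"C:\Windows\Minidump"):
#     print(modified, name, size)
```

An empty result after a shutdown would be consistent with the machine losing power before Windows could write a dump, rather than with a blue-screen style crash.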
3 weeks ago
I combined my reply to both you and the other fellow into one post above. Please see above.
3 weeks ago
Just an observation as this needs to be resolved to rule out any form of heat issue.
You need to do a full thermal paste removal on the heatsink, the CPU/GPU (mirror finish required) and the VRM area.
The blue stuff is thermal putty that is too far gone (use new putty).
Your thermal paste on the CPU / GPU is not looking good. The way it has spread away from the center is due to the hot spots on the CPU/GPU. Liquid metal will withstand this for much longer. Your original screenshot also shows the same issue: over the years the heat has caused it to dry out near the center.
One of the reasons liquid metal is used is to prevent, or at least minimise, the above issue. Note that your thermal paste is rated for operating temperatures up to 110 °C, while liquid metal is rated to over 200 °C, and your CPU temps are over 90 °C, so a wise choice would be to use something that can withstand the heat.