Z790 boards / intel 14th gen system instability

drduf
Level 7

First new build in a while after a decade of pre-built systems. Used to think of myself as pretty savvy (built systems regularly late 90s to late 2000s).

Initial build as follows:

TUF Gaming Z790-Plus WiFi (now swapped for a ROG STRIX Z790-E WIFI), Intel Core i7-14700F, 2x32GB Crucial Pro DDR5-5600, PNY RTX 4070 Ti SUPER, MSI MPG A1000G, SanDisk 1TB NVMe drive, in a 5000D Airflow case from Corsair. It's a GPU-centric build for some specialized AI toolchains, so I don't need an unlocked CPU, but that might be the issue(?!?).

Initial install of Ubuntu (24.04 LTS, after a BIOS update, as the CPU was not posting with the factory-shipped BIOS) went pretty well. Did some light gaming on Steam for a few hours, installed Nvidia CUDA + Docker, tested some of my AI toolchain. Shut the system down after 12h to move the computer.

Restarted - passed POST, got the Ubuntu splash screen, then a black screen: monitor went into power-save mode, fans spinning near idle. Couldn't wake the system up. Resetting the computer would randomly either reproduce this or give me a no-boot-drive error past the initial BIOS screen (with no boot partition showing up in the BIOS at the time). A clean install of the OS results in the same behavior.

Windows 11 install consistently halts as above at the step where it downloads and applies Windows updates. Installing without network gets me to a working Windows session, but from there trying to apply Windows updates or install the Nvidia drivers or Armoury Crate reliably reproduces the above behavior, sometimes with a screen flicker right before. Intel Processor Diagnostic Tool fails to reveal any issue. Memtest86+ also passes without finding any error.

Tried reseating the CPU, removing either of the RAM modules, swapping the RTX 4070 to an old Quadro P2200, switching NVMe positions, using a SATA SSD instead of NVMe.

Finally went ahead and exchanged the TUF MB for a ROG STRIX Z790-E WIFI, as it was pretty much the only piece of hardware I hadn't tried changing (and while the Crucial Pro RAM wasn't on the compatibility list for the TUF Z790-Plus, it is for the ROG STRIX). Applied the latest BIOS update. Tried to install Windows 11, still consistently getting the same behavior as above.

I'm at a loss - having a hard time believing the CPU is at fault, but I'm on the verge of asking for a return and swapping for a 14700K instead, or even downgrading to a 12th gen. The fact that the Windows 11 install fails reliably at the exact same moment every time makes me think this is not the power supply failing, but either some instruction set in the CPU that is faulty or some BIOS config that causes an issue under a specific set of circumstances - and seeing the number of random halts reported on this forum with the Z790 boards, I'm thinking it's the latter and not the former. Tried disabling any OC settings I could find and applying Intel defaults, but that hasn't solved the problem.

Any brilliant idea?


drduf
Level 7

Some progress:

Extra steps tried: 

Got hold of a different CPU - 12th gen i3 with onboard graphics. Stripped the build to as barebones as I could have it - 1 NVMe drive, 1 stick of ram, i3 using onboard video. Unplugged aRGB headers, unplugged all fans but the CPU fan from the motherboard (front case fans still running powered by SATA adapter). Even unplugged the front USB modules in case there was a short. 

So we have MOBO, 1 stick of ram, CPU (different from original CPU), CPU fan, power button, nothing else. In fact, short of the power supply and the case, nothing remains of the original computer except a stick of RAM (tried one, then the other - would be surprised to have two failing sticks of RAM that both pass Memtest86+).

Tried to reinstall Ubuntu. Normal live-USB mode doesn't boot, but boots with nomodeset argument (disables frame buffer, DRM, most resolutions, enables only generic display driver). Install proceeds properly. Reboot - splash screen then black as above.

Caught a glimpse of an ACPI error during a prior failed install, so I passed the acpi=off argument in GRUB and voilà! Linux boots and gives me a stable system. We're getting somewhere. It's a seriously handicapped system, however, as acpi=off means only one core is used, the Nvidia drivers can't use IRQs to identify and use the card so only generic drivers are loaded, power management is minimal, etc. But it does strongly suggest there is a setting in the BIOS that is incompatible with ACPI, or some sort of hardware address conflict on the board(s) (two different boards, both Z790 chipsets but different product lines - TUF Gaming Z790-Plus and ROG STRIX Z790-E). I'll be going over the GRUB documentation to find settings less drastic than acpi=off that I could use.
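For anyone retracing these steps, here is a minimal Python sketch (not something the poster ran) for confirming which workaround parameters a given boot actually used. It reads `/proc/cmdline`, which Linux exposes for the running kernel. The prefix list and the helper names `workaround_params` and `read_cmdline` are assumptions made for illustration. The Linux kernel documentation does list narrower options than `acpi=off`, such as `acpi=noirq` and `pci=noacpi`.

```python
import os
from pathlib import Path

# Prefixes of boot parameters commonly tried for this kind of hang.
# The selection is an assumption for illustration, not an exhaustive list.
WORKAROUND_PREFIXES = ("acpi=", "pci=", "noapic", "nolapic", "nomodeset")


def workaround_params(cmdline: str) -> list[str]:
    """Return the tokens of a kernel command line that match WORKAROUND_PREFIXES."""
    return [tok for tok in cmdline.split() if tok.startswith(WORKAROUND_PREFIXES)]


def read_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the running kernel's command line, or return '' if it is unavailable."""
    p = Path(path)
    return p.read_text() if p.exists() else ""


if __name__ == "__main__":
    active = workaround_params(read_cmdline())
    print("Workaround parameters in use:", active or "none")
    # On a boot with acpi=off this would be expected to show a single CPU.
    print("Logical CPUs visible to the OS:", os.cpu_count())
```

Running it once per boot configuration gives a quick record of which parameters were active and how many CPUs the OS could see.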

Any ideas?

Nate152
Moderator

Hello drduf

That is certainly a nice build you have there.

So, after reading your very detailed post, you get the same behavior with two different CPUs. This suggests it's likely not the CPU.

Unless it's something specific with ubuntu, it seems there is some instability as you're able to boot with one core. You say you updated the bios, did you update to bios 2703 or 2801? These have the intel microcode 0x12B to address elevated cpu voltage requests.

Having a look at the QVL for the ROG Strix Z790-F Gaming Wifi, I don't see any crucial memory kits listed. I'm not saying your memory won't work, but it could be something to consider. When testing with one stick of memory, did you have it installed in the recommended slot A2 (2nd slot from the cpu)?

Under Size, select 2x32GB, then click the red search box. 

ROG STRIX Z790-E GAMING WIFI | Gaming motherboards|ROG - Republic of Gamers|ROG Global

Will be a great build once I get it working!

The Crucial kit is indeed in the QVL once you select 14th gen CPUs (part # CP32G56C46U5.C16B2)

I just *might* have solved the issue - as acpi=off seemed to work, I tried a bunch of settings related to IRQ, both in the BIOS and in the boot parameters, to no avail - until replacing acpi=off with acpi=noirq worked. Now firing on all cores, but still unable to properly address my GPU.

As IRQ assignments are managed by the BIOS and CPU, I tried altering this. I had two boards and two CPUs, but the only common element was that both boards were fully updated - except for the first time I used the i7. I had updated that MB using a 12th gen i3, as the board has no flashback capability, and I guess the microcode didn't properly apply that first time but then applied after a full reboot. That reboot involved KEK registration of the Nvidia drivers, so I assume the microcode applied then.

Went to Tweaker's Paradise, applied the 0x104 microcode, and everything booted smoothly - no acpi=off or acpi=noirq. Nvidia drivers are loading properly. Everything seemed to be working. While the 14700F shouldn't be affected by the K-line voltage bug, I didn't want to use that microcode long term, so I tried just rolling back the whole BIOS, figuring I'd roll back until I found one that works.
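Since the fix hinged on which microcode revision was actually loaded, a small Python sketch along these lines can confirm it from within Linux after each BIOS change. It assumes an x86 system where `/proc/cpuinfo` contains a `microcode` line per logical CPU. The helper name `microcode_revisions` is hypothetical, and the sample text is made up for the example.

```python
import re
from pathlib import Path

# Matches lines such as "microcode\t: 0x12b" in /proc/cpuinfo on x86 Linux.
MICROCODE_LINE = re.compile(r"^microcode\s*:\s*(0x[0-9a-fA-F]+)\s*$", re.MULTILINE)


def microcode_revisions(cpuinfo_text: str) -> set[int]:
    """Collect the distinct microcode revisions found in cpuinfo-style text."""
    return {int(value, 16) for value in MICROCODE_LINE.findall(cpuinfo_text)}


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        revisions = microcode_revisions(cpuinfo.read_text())
        # Normally every core reports the same revision, so one value is expected.
        print("Loaded microcode:", sorted(hex(r) for r in revisions) or "not reported")
    else:
        print("/proc/cpuinfo not available on this system")
```

Comparing the reported value against the revision the BIOS release notes mention (0x104 or 0x12B in this thread) shows whether an update really took effect.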

Seems like the 2703 bios with default bios settings works beautifully. Next step will be to try a Windows 11 install, but that is a challenge for another day!

Nate152
Moderator

Awesome, this is sounding better.

To get a full view of how your pc is running with temperatures and voltages, you can install HWinfo, it's a free monitoring program. - Free Download HWiNFO Sofware | Installer & Portable for Windows, DOS

When you get to it, let us know how the windows install went.

Here is my HWinfo, the Vcore is the cpu voltage, be sure to check the Maximum columns.

Click the pic to make it bigger.

vcore.png

Thanks, will certainly take a look! 

ASUS should take a look at that last firmware iteration, it seems like I'm not the only one with stability issues with the December update based on this forum...

On the plus side, having swapped CPU half a dozen times this week I can now say I'm pretty confident in my thermal compound application 😉 Gotta see the positive aspects of it.

Nate152
Moderator

Yes, one of the first things I want to check is CPU temperature and voltage.

My cpu core usage shows 100%, but this is just at idle at the desktop, I didn't run any stress test or my cpu temps would be much higher. I don't want to give any false impression on the cpu temperature.

I do have a custom loop on my i7-12700KF with the EK Quantum Velocity2 water block and a hardware labs SR2 560mm radiator. The water block came with thermal grizzly hydronaut, but I prefer kryonaut.

I did a build log if you care to see some pics. - TT Tower 900 - Republic of Gamers Forum - 906955

drduf
Level 7

Install went as smoothly as one could expect. Applied all updates while installing, and installed EFI on the NVMe which for some reason it didn't want to do previously (likely IRQ conflict with the USB?)

Here's the HWiNFO output, noting that the maximums reflect a CPU stress test performed using CPU-Z. I would have been shocked to have temperature problems - while I'm air cooled, my cooler (Noctua NH-U12S) is still above the specs of the stock cooler, on a CPU that is rated at 65W TDP.

drduf_0-1737473484811.png

 

Nate152
Moderator

Max temp of 84°C with a maximum of 1.385V under a stress test looks OK.

Because the Bus Clock is running at 99.8MHz, this slightly reduces the cpu core clock and memory clock.

It's nothing to worry about, but in the bios on the AI Tweaker page, the BCLK should be set to 100.00MHz. You can bump this up to 100.20MHz to compensate, then check the Bus Clock in HWinfo again. This will give a slight increase to the cpu core clock and memory clock. 
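To put numbers on the BCLK point: the core clock is the bus clock multiplied by the core ratio, so a 0.2% shortfall in BCLK is a 0.2% shortfall in core clock. A minimal Python sketch follows. The ratio of 53 is a hypothetical value chosen for illustration, not read from the HWiNFO screenshot.

```python
def effective_clock_mhz(bclk_mhz: float, ratio: float) -> float:
    """Core clock in MHz as bus clock times multiplier."""
    return bclk_mhz * ratio


def deviation_percent(bclk_mhz: float, nominal_mhz: float = 100.0) -> float:
    """How far the bus clock is from nominal, as a percentage."""
    return (bclk_mhz - nominal_mhz) / nominal_mhz * 100.0


ASSUMED_CORE_RATIO = 53  # hypothetical multiplier, for illustration only

for bclk in (99.8, 100.0, 100.2):
    clock = effective_clock_mhz(bclk, ASSUMED_CORE_RATIO)
    print(f"BCLK {bclk:.1f} MHz x {ASSUMED_CORE_RATIO} = {clock:.1f} MHz "
          f"({deviation_percent(bclk):+.1f}% vs nominal)")
```

With these assumed numbers the difference between 99.8 MHz and 100.0 MHz is about 10.6 MHz of core clock, which is why it is nothing to worry about.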

You can click the little arrow beside Core Temperature to see the temperature of each individual cpu core.

ASUS motherboards usually do a good job with the cpu voltage, but you can always try setting a lower voltage to see if it's stable. Your maximum Vcore shows 1.385v, you could try 1.35v. If stable, you should see lower cpu temps.

Or, if you're happy, you can leave it as is.

core temp.png

Thanks, much appreciated! Fiddled a bit with the power management settings in the BIOS, bringing down the average temp by a few °C. The CPU is unlikely to ever run at 100% on all cores, so I'm not too worried about that one. Max load is very likely to be taken by the tensor cores in the GPU, and the case has ample ventilation (10x 120mm fans, 6 in/4 out). The most intense processing will be batch jobs lasting 10-15 hours at a time, so they will likely run while I'm not sitting in front of it to be bothered by fan noise (although so far I'm pretty impressed by the low noise level of the cheap-ish upHere and Noctua fans). Will run the tuning algorithm at some point to reduce the start-stop behavior of the fans.