Edit: See update at end of post for more details (no solutions, but things to consider.)
As of Linux 6.7 I'm getting hard freezes that require a power cut to reset (SysRq doesn't work.) Happens at both idle and load, anywhere from 5 minutes in to an hour. Running `journalctl --follow` and `dmesg -w` (both as root) reveals nothing at the time of the crash. Kernel version 6.6 continues to be 100% stable.
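One thing I haven't tried yet is netconsole, which mirrors kernel messages to another machine over UDP, so anything printed in the final moments can survive even if it never hits the disk (a sketch; the ports, addresses, interface, and MAC below are placeholders for my LAN):

```
# on a second machine, listen for the UDP stream:
#   nc -u -l 6666      (or: nc -ulp 6666, depending on your netcat)
# on the freezing machine, as root:
modprobe netconsole netconsole=6665@192.168.1.20/enp6s0,6666@192.168.1.10/aa:bb:cc:dd:ee:ff
```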
System:
- Distro/Kernel: Arch Linux 6.7.arch3-1
- CPU: AMD Ryzen 5 2600X
- GPU: AMD RX580 8GB via AMDGPU
- RAM: Some configuration of 16GB at 2667 MT/s.
- WM: SwayWM
I'm unsure how to go about properly reporting a bug if no errors are being generated.
Any advice?
I'm not alone on this apparently (warning, it's reddit.)
Update (01-27-2024)
I've spent the last 5 days or so bisecting the kernel from stable 6.6 to 6.7 while also touching on linux-next and 6.8rc1. I've experienced hangs on every kernel after 6.6, but under different conditions. In some cases SysRq can rescue the system, but other times it's a hard (still errorless) crash. I believe all of these crashes can be blamed on AMDGPU, given that all other user reports (see: reddit thread above) mention having an AMD card.
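Relatedly, it's worth making sure the full set of SysRq functions is enabled before relying on it as a rescue, since many distros restrict the default bitmask (a small sketch, as root):

```
# show the current SysRq bitmask (a restricted mask ignores some keys)
cat /proc/sys/kernel/sysrq
# enable every SysRq function for this boot
sysctl kernel.sysrq=1
# persist across reboots
echo 'kernel.sysrq = 1' > /etc/sysctl.d/99-sysrq.conf
```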
List of similar issues
Note: Some of these are from earlier kernel versions, but they're included since they present the same way.
- https://reddit.com/19e0y0b
- https://reddit.com/196tz6v
- https://bugzilla.kernel.org/show_bug.cgi?id=218413
- https://gitlab.freedesktop.org/drm/amd/-/issues/3124#note_2252559
Patched/Unpatched 6.8rc1 attempts
This bug report is the closest I can find to the issue, and it links off to this report, which includes a patch. The patch in question is for 6.8rc1 and allows the system to stay up longer, but the system frequently "trips," meaning it begins to stumble and halt for tenths of a second whenever accelerated video playback in my browser (qutebrowser) has to contend with load elsewhere on the system (gaming, etc.) Hangs under this patched kernel can be rescued. However, with video acceleration disabled in the browser (or the browser not running at all,) hard crashes still occur. So either there are two new issues in play (one to do with video acceleration, and the original issue described in this post), or the original issue is just manifesting in different ways.
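For reference, trying the patch on top of the rc tree is just the usual routine (a sketch; the patch file name is made up):

```
cd linux
git checkout v6.8-rc1
# preview, then apply the patch attached to the bug report
git apply --stat ~/amdgpu-fix.patch
git apply ~/amdgpu-fix.patch
make olddefconfig
make -j"$(nproc)"
```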
Bisecting 6.6 to 6.7
This process was taking forever because there's no reproducible situation in which the system halts. My method was to build the kernel (8 minutes), then reboot and let the system idle on a Sway session while I'm off doing something else (2 to 4 hours.) If I come back to a hard lock, I mark the version as failed and repeat. This method let me try 2 versions of the kernel a day, not nearly enough to have this fixed quickly or easily. Due to the amount of time it takes to detect a crash, it's also possible to mark a bad commit as good (meaning it didn't crash in 2 hours, but would have in 4 or 5, etc.) I won't be continuing this.
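For reference, the skeleton of the process (a sketch; installing and booting each build depends on your setup):

```
cd linux
git bisect start
git bisect bad v6.7      # first release that hangs
git bisect good v6.6     # last known-good release
# git checks out a midpoint commit; build and test it:
make olddefconfig && make -j"$(nproc)"
# install, reboot, idle in Sway for a few hours, then:
git bisect good          # it survived
# ...or...
git bisect bad           # it locked up
# when done (or giving up), save the trail and restore the tree:
git bisect log > bisect.log
git bisect reset
```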
The state of AMDGPU in general
It seems I'm lucky to have not had these issues before. A little time spent reading issues like this one shows that people have been encountering problems for a while now (pre-kernel 6.7.) There are issues that have been open for years and still appear to affect modern systems and hardware. I'm not passing blame to anyone here, just stating that it's a miracle any of this stuff works at all: the hardware is so complex that even the people who spend huge amounts of time working directly with it can't fully untangle what causes these faults.
I've never done it, but I would try reproducing this in a VM like qemu. I'd be googling at this point, but I think you can debug a kernel crash from there somehow.
That's an interesting idea. I'll have to look into whether it's a viable option first, though.
I did this recently and it was extremely quick to bisect and debug, but I was lucky enough to have a simple repro that worked in the emulator.
I think if I were you I'd try to repro on bleeding edge first. Then if it's still broken, I'd try to get the repro time down as much as possible and automate it. Then I'd either bisect on qemu if possible, or bare metal.
Yeah, the qemu idea was brought up earlier in the thread and it's very interesting. Glad you confirmed you could repro real issues there in the test environment, so it's at least a little likely I'll be able to do the same. Makes sense that it would work and is way better than letting the real system crash and burn. My kernel compile time is pretty short so it shouldn't be too bad to bisect, I'm just not sure how many commits separate my stable kernel from the bugged 6.7. TBH I'm not that familiar with kernel dev., so maybe it's way simpler than that.
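If I try it, I'd guess the invocation is something along these lines (an untested sketch; rootfs.img stands in for whatever root image I end up building):

```
qemu-system-x86_64 \
  -enable-kvm -m 4G -smp 4 \
  -kernel arch/x86/boot/bzImage \
  -append "root=/dev/vda rw console=ttyS0" \
  -drive file=rootfs.img,if=virtio \
  -nographic
```

Though without passing the real GPU through, AMDGPU itself won't load, so this would only catch the bug if it isn't hardware-dependent.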
The one I was able to test on qemu was a reliable failure of memory management syscalls triggered by a certain usage pattern. Unfortunately yours sounds like it's probably hardware dependent. People in that Reddit thread mentioned video decoding, so you could try hammering that.
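Something like an ffmpeg decode loop might make a decent automated repro (sketch; the input file and render node are placeholders):

```
# hammer hardware-accelerated decode in a loop until something breaks
while ffmpeg -hwaccel vaapi -hwaccel_device /dev/dri/renderD128 \
             -i sample.mkv -f null - ; do :; done
```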
The nice thing about bisecting is that it's mostly logarithmic, so doubling the commits should only take one extra step (4,096 commits takes about 12 steps; 8,192 takes 13). I'd be surprised if you had to do more than 10-12 steps.
You may already have a good kernel config, but for this sort of thing I usually use `make localmodconfig`.
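Roughly like this (a sketch; run it on the target machine so the right modules are loaded, or point it at a saved `lsmod` dump):

```
cd linux
# trim the config down to only the currently loaded modules,
# answering any new option prompts with their defaults:
yes "" | make localmodconfig
# or, from a module list captured earlier:
# make LSMOD=/tmp/lsmod.txt localmodconfig
make -j"$(nproc)"
```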
That'll build all the modules that are loaded when you run it, which can cut down on compile time massively.

I'm fresh off ruling out the RAM via memtest. I'll let it do a longer soak overnight to see if anything fails then, but I'm now on to bisecting the kernel from what I believe is the last release of 6.6 (6.6.13) to hopefully find whatever the offending commit is. Been a while since I've had to mess around with manually building the kernel without the aid of linux-tkg, but I'm off to learn it anyway. Thanks for the help!
Good luck! Sounds like you got it under control, but I'm happy to help if you run into trouble. I'm curious what you'll find.