Troubleshooting errorless system hang on kernel 6.7 (AMDGPU)

0x0 · 1 year ago (edited)

@[email protected] · 1 year ago

I did this recently and it was extremely quick to bisect and debug, but I was lucky enough to have a simple repro that worked in the emulator.

I think if I were you I’d try to repro on bleeding edge first. Then if it’s still broken, I’d try to get the repro time down as much as possible and automate it. Then I’d either bisect on qemu if possible, or bare metal.
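For the "automate it" part, here is a minimal sketch of a test script that could be handed to git bisect run. The exit codes are the ones git bisect run documents (0 for good, 125 for skip, most other codes up to 127 for bad). Everything else is an assumption for illustration: the function names are made up, and a hang is detected simply as "the repro command did not finish before the timeout".

```python
import subprocess
import sys

# Exit codes that `git bisect run` interprets.
GOOD, BAD, SKIP = 0, 1, 125


def verdict(build_ok, repro_hung):
    """Map one build-and-test round to a `git bisect run` exit code."""
    if not build_ok:
        # The commit cannot be tested at all, so let bisect pick a neighbour.
        return SKIP
    return BAD if repro_hung else GOOD


def run_repro(cmd, timeout_s):
    """Return True if the repro command timed out or exited non-zero."""
    try:
        result = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Treat "never finished" as the hang we are hunting.
        return True
    return result.returncode != 0
```

A wrapper would end with something like sys.exit(verdict(build_ok, run_repro(cmd, timeout_s))), where cmd is whatever boots the freshly built kernel (in qemu or on a test box) and runs the workload. The shorter the timeout can safely be, the faster each bisect step goes.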

0x0 · 1 year ago

Yeah, the qemu idea was brought up earlier in the thread and it’s very interesting. Glad you confirmed you could repro real issues there in the test environment, so it’s at least a little likely I’ll be able to do the same. Makes sense that it would work and is way better than letting the real system crash and burn. My kernel compile time is pretty short so it shouldn’t be too bad to bisect, I’m just not sure how many commits separate my stable kernel from the bugged 6.7. TBH I’m not that familiar with kernel development, so maybe it’s way simpler than that.

@[email protected] · 1 year ago

The one I was able to test on qemu was a reliable failure of memory management syscalls triggered by a certain usage pattern. Unfortunately yours sounds like it’s probably hardware dependent. People in that Reddit thread mentioned video decoding, so you could try hammering that.

The nice thing about bisecting is that it’s roughly logarithmic, so doubling the number of commits should only add one extra step. I’d be surprised if you had to do more than 10-12 steps.
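To put rough numbers on that, a quick sketch: each step halves the remaining range, so the number of kernels to build is about the base-2 logarithm of the commit count. The commit counts used here are hypothetical, not the real size of the 6.6 to 6.7 range, and git may take a step more or fewer depending on how the history branches.

```python
import math


def bisect_steps(commit_count):
    """Approximate number of test rounds needed to isolate one bad commit."""
    if commit_count < 1:
        raise ValueError("need at least one commit to bisect")
    # Each round halves the candidate range, hence ceil(log2(n)).
    return math.ceil(math.log2(commit_count))


for n in (1000, 2000, 4096):
    print(n, "commits ->", bisect_steps(n), "steps")
```

Going from 1000 to 2000 commits adds a single step, which is the "doubling costs one more build" point.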

You may already have a good kernel config, but for this sort of thing I usually use make localmodconfig. That trims the config down to only the modules that are loaded when you run it, which can cut down on compile time massively.
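Since localmodconfig keys off the currently loaded modules, it is worth checking that everything you need (amdgpu in particular) is loaded before running it, otherwise the driver gets dropped from the build. A small sketch for that check, assuming the usual /proc/modules layout where the module name is the first field on each line; the function names are made up for the example.

```python
def loaded_modules(proc_modules_text):
    """Return sorted module names from text in /proc/modules format."""
    names = []
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields:
            # The first whitespace-separated field is the module name.
            names.append(fields[0])
    return sorted(names)


def read_loaded_modules(path="/proc/modules"):
    """Read and parse the live module list (Linux only)."""
    with open(path) as f:
        return loaded_modules(f.read())
```

On the machine in question, "amdgpu" in read_loaded_modules() should be True before generating the config. Plugging in any devices you care about first has the same effect for their drivers.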

0x0 · 1 year ago

I’m fresh off ruling out the RAM via memtest. I’ll let it do a longer soak overnight to see if anything fails then, but I’m now on to bisecting the kernel from what I believe is the last release of 6.6 (6.6.13) to hopefully whatever the offending commit is. Been a while since I’ve had to mess around with manually building the kernel without the aid of linux-tkg, but I’m off to learn it anyway. Thanks for the help!

@[email protected] · 1 year ago

Good luck! Sounds like you got it under control, but I’m happy to help if you run into trouble. I’m curious what you’ll find.

Update (01-27-2024)

List of similar issues

Patched/Unpatched 6.8rc1 attempts

Bisecting 6.6 to 6.7

The state of AMDGPU in general