Misc Z13/395+ CPU problems.
So I'm finally getting around to playing a game on this thing (woot! Satisfactory factory goodness).
Alas, after about 30 minutes of play (steam + proton), the wayland user session suddenly locked up. But fortunately the keycombo to enter a virtual console worked (and alas - I don't have another machine here with me where I could have sshed).
Here are the dmesg logs I could snarf. Nasty amdgpu errors:
this was with 6.13.5-102.bazzite.fc41.x86_64 (64-bit) kernel
just looking at this log and the 6.13.6 changelog. I bet it is fixed in:
amdgpu/pm/legacy: fix suspend/resume issues
.6 is in testing with mesa 25
thanks! just snarfed it. So far I think 6.13.6 probably fixes it! @antheas
Though 6.13.6-101 does have a regression compared to 6.13.5:
Pressing the sleep button in the gui no longer fully enters sleep (black screen happens, wake happens but can still hear fans spinning). New error message appears in the dmesg output:
[ 137.130226] amd_pmc AMDI000B:00: Last suspend didn't reach deepest state
the failure happens after the virtual console is shut down. I turned on no_console_suspend after seeing this problem, and wherever the new sleep problem is, it comes after "suspend devices" completes (because the screen was dark but the cpu was clearly still kicking)
I can't test it rn Heading to bed
Tomorrow
Fucking Kernel I swear to god
just looking at the changelog that thing probably broke it
heh!
Antheas to the gpu kernel peeps
built the revert
GitHub
Release 6.13.6-102: Revert AMD Sleep patch · hhd-dev/kernel-bazzite
Commit a355d0d24d00d19fa70d6408fc1be34fe8ac79e5 is suspected to be causing sleep issues on the Z13. Revert it.
Full Changelog: 6.13.6-101...6.13.6-102
Worked on my side but it was a dirty build
So tomorrow I'll test this one and hopefully Kyle will drop into testing
There is a chance it was something else
I partially built the kernel
So maybe missing module
btw - alas, after 1 hr of play Satisfactory still crashed using this bazzite:testing branch (kernel 6.13.6-101.bazzite.fc41.x86_64). Wayland locked up. Relevant dmesg attached:
btw I just took a look at the first exception in this latest newcrash file. I bet the root cause is somewhere in panel-self-refresh. The relevant code is young (Nov 2024ish: https://lore.kernel.org/all/[email protected]/T/#m650152eb173c3a0b299c39dd843e92d0903b8b49 ) amdgpu_dm_enable_self_refresh().
I'm going to dig around and see if I can find a runtime flag to turn off this feature and see if the problem goes away.
ok I dug around in the relevant kernel srcs and that exception. I think there's a very high likelihood the problem is in the new panel-replay optimization feature. I'm currently doing a test with "rpm-ostree kargs --append=amdgpu.dcdebugmask=0x400" to mask out just that feature. I also wouldn't be surprised (based on the code comments about what that feature does) if this also fixes the occasional draw artifacts. Also the power savings provided by this feature are probably small.
because the new kernel hasn't landed in testing yet
You can test if this is the problem by adding
amdgpu.dcdebugmask=0x10
to your kargs
this disables panel self refresh
rpm-ostree kargs --append-if-missing=amdgpu.dcdebugmask=0x10
it's a power usage optimization feature
right, I was just mentioning the 6.13.6-101 didn't even fix the original thing I thought it fixed
IMO no need to turn off all of PSR, from looking at the code the error is in the self-refresh path only.
so 0x400 probably better
Yes, but to test whether this is actually the problem it's a good idea to disable it. If your issue is fixed, you know the problem lies there
0x400 turns off a subset of what 0x10 turns off
Sure. If that doesn't work it might still be worthwhile to turn the whole feature off.
My way of testing things is usually turn the whole shit off, see if it works, and then re-enable things one by one
sure, I'm testing 0x400 now, if that doesn't work I'll go to a bigger hammer (with higher costs)
yeah - but i've looked at the code and the error path is definitely in the section guarded by 0x400. Testing now though
alas 0x400 was not sufficient, the exception eventually occurred and made it a bit deeper into amdgpu_dm_enable_self_refresh() but failed later in the function. So I switched to 0x10 (to turn off all of the PSR code). Been running now for 40 min and I think it will be golden. Because the occasional draw artifacts that everyone has seen no longer occur.
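For anyone following the 0x400 vs 0x10 back-and-forth, here is a rough sketch of how the mask decodes into individual bits. The bit names are written from memory of the kernel's DC_DEBUG_MASK enum (amd_shared.h) and should be treated as assumptions, so check them against your kernel source. If they are right, 0x10 (PSR) and 0x400 (panel replay) are separate bits, and masking both would be 0x410. The helper names decode_dcdebugmask and karg_for are made up for this example.

```python
# Bit names as recalled from the kernel's DC_DEBUG_MASK enum
# (drivers/gpu/drm/amd/include/amd_shared.h). Assumption: verify
# against the kernel version you are actually running.
DC_DEBUG_BITS = {
    0x10: "DC_DISABLE_PSR",
    0x200: "DC_DISABLE_PSR_SU",
    0x400: "DC_DISABLE_REPLAY",
}


def decode_dcdebugmask(mask: int) -> list[str]:
    """Return the names of the known bits set in mask, lowest bit first."""
    names = [name for bit, name in sorted(DC_DEBUG_BITS.items()) if mask & bit]
    # Report anything this small table does not know about.
    unknown = mask & ~sum(DC_DEBUG_BITS)
    if unknown:
        names.append(f"unknown bits {unknown:#x}")
    return names


def karg_for(mask: int) -> str:
    """Format the kernel argument, including the amdgpu module prefix."""
    return f"amdgpu.dcdebugmask={mask:#x}"


for mask in (0x400, 0x10, 0x410):
    print(karg_for(mask), "->", decode_dcdebugmask(mask))
```

The string from karg_for() is what would go after --append-if-missing= in the rpm-ostree kargs command mentioned earlier in the thread.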
I bet this problem could occur on any eDP panel that supports PSR. Y'all could turn dcdebugmask=0x10 on for everyone via kargs and I think the cost would be zero for any unit that doesn't have a PSR capable display. From browsing 6.14 commits it looks like AMD geeks are still futzing with this feature, so such a hack will probably be needed only for a little while.
i.e. tasty sounding commits like this:
drm/amd/display: Disable PSR-SU on eDP panels
@antheas
the good news: turning off PSR definitely works around the original exception in this report, and it also fixes the occasional brief draw artifacts we've seen. See comment above about tasty sounding 6.14 commits for the root cause.
the bad news: I just installed the new testing build. Whichever change you backed out to make 6.13.6-102 kernel wasn't the cause of 6.13.6 failing to fully enter sleep. Sleep still doesn't fully enter when on the testing branch. Relevant dmesgs are unchanged:
[ 49.815569] PM: suspend entry (s2idle)
[ 49.841997] Filesystems sync: 0.025 seconds
[ 49.894384] Freezing user space processes
[ 50.888266] Freezing user space processes completed (elapsed 0.993 seconds)
[ 50.888285] OOM killer disabled.
[ 50.888290] Freezing remaining freezable tasks
[ 50.889316] Freezing remaining freezable tasks completed (elapsed 0.001 seconds)
[ 50.904563] queueing ieee80211 work while going to suspend
[ 50.911953] cs35l41-hda i2c-CSC3551:00-cs35l41-hda.0: Asserting Reset
[ 51.048414] usb 3-2: reset high-speed USB device number 2 using xhci_hcd
[ 51.185959] PM: suspend devices took 0.296 seconds
[ 51.187718] ACPI: EC: interrupt blocked
[ 70.547009] amd_pmc AMDI000B:00: Last suspend didn't reach deepest state
[ 70.547561] ACPI: EC: interrupt unblocked
[ 70.746632] [drm] PCIE GART of 512M enabled (table at 0x00000083FFB00000).
[ 70.746668] amdgpu 0000:c4:00.0: amdgpu: SMU is resuming...
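A quick way to compare kernels across a bunch of sleep cycles is to scan saved dmesg output for suspend entries that are followed by the amd_pmc warning. This is only a sketch: it assumes the plain "[ seconds] message" format shown in the log above, and the function name find_shallow_suspends is made up for this example.

```python
import re

# Matches dmesg lines of the form "[   49.815569] message".
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")


def find_shallow_suspends(dmesg_text: str) -> list[tuple[float, float]]:
    """Return (entry_time, warning_time) for each suspend cycle that
    logged the "didn't reach deepest state" warning."""
    results = []
    entry = None
    for line in dmesg_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        timestamp, message = float(match.group(1)), match.group(2)
        if "PM: suspend entry" in message:
            entry = timestamp
        elif "didn't reach deepest state" in message and entry is not None:
            results.append((entry, timestamp))
            entry = None
    return results


SAMPLE = """\
[ 49.815569] PM: suspend entry (s2idle)
[ 51.185959] PM: suspend devices took 0.296 seconds
[ 70.547009] amd_pmc AMDI000B:00: Last suspend didn't reach deepest state
"""

print(find_shallow_suspends(SAMPLE))
```

An empty list for a kernel means none of the recorded cycles logged the warning, which is roughly the 13.5 vs 13.6 comparison being made by hand here.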
Ok so I have to do a full kernel rebuild and try it today
Turns out it was always broken and 13.6 is fine
You probably plugged in a dock or something
hmm - even with no USB accessories attached the behavior is the same on my Flow: 13.6 gives that error message wrt suspend (and fans keep spinning while sleeping). 13.5 is fine.
do you get that "amd_pmc AMDI000B:00: Last suspend didn't reach deepest state" message even on 13.5?
yes
so sleep doesn't fully enter for you on 13.5? (fans stay spinning etc)
fans stay off
but the message says what i said
Fan stays on on .6 again you jinxed it
I'll do more testing tomorrow
weird.
I just tried a bunch of sleep cycles in 13.5 and didn't have that message.
[ 357.699955] PM: suspend entry (s2idle)
[ 357.711232] Filesystems sync: 0.010 seconds
[ 357.738037] Freezing user space processes
[ 359.640330] Freezing user space processes completed (elapsed 1.901 seconds)
[ 359.641204] OOM killer disabled.
[ 359.641763] Freezing remaining freezable tasks
[ 359.644107] Freezing remaining freezable tasks completed (elapsed 0.001 seconds)
[ 359.667529] cs35l41-hda i2c-CSC3551:00-cs35l41-hda.0: Asserting Reset
[ 359.673237] queueing ieee80211 work while going to suspend
[ 359.674305] queueing ieee80211 work while going to suspend
[ 359.814613] usb 3-2: reset high-speed USB device number 2 using xhci_hcd
[ 359.952181] PM: suspend devices took 0.307 seconds
[ 359.953778] ACPI: EC: interrupt blocked
@Kyle Gospo push the .5-103 to stable
.6 is cooked
i set it as latest if akmods need a rebuild
When I compile it locally it works
Fml
ooh! interesting!!!
antheas is following a classic heisenbug
a bug that disappears once you try to debug it
:clueless:
maybe it's some compiler optimization causing issues
that's classic heisenbug
@geeksville new kernel is building, seems like amdxdna needed some fixes. Hopefully in a few hours you can test
cool beans! i'll try it today!
6.13.6-103 (via bazzite:testing) works good! fixes the new sleep problem
https://bodhi.fedoraproject.org/updates/FEDORA-2025-346cf69656 6.13.7 is also out already now, might be worth it to rebase.
This also finally includes the unicode fix with the anaconda installer
(after the GPU reset all was fine again)
I'll queue a .7 in a few hours
this really needs a rename
https://github.com/bazzite-org/kernel-bazzite/releases/tag/6.13.7-104 something for you to play with in a few hours
GitHub
Release 6.13.7-104: Z13 keyboard goodies · bazzite-org/kernel-bazzite
For the Asus ROG Z13:
Fixes the touchpad acting like a mouse during boot
Fixes the keyboard and lightbar light brightness levels and syncs them with the keyboard backlight setting in KDE/GNOME
Fix...
alas, this kernel isn't yet in bazzite-testing but I can check again tomorrow morning.
My z13 wakes up at night on its own
And crashes
Same time, 2:38
interesting! I haven't seen that on mine (6.13.6-103.bazzite.fc41.x86_64). I put it to sleep at night and when I wake in the morning by pressing a key it looks fine.
I think it's .7
My .7
Probably Mario's display patches are undercooked and I should nix them
Although I can't see anything wrong with them
I'm busy with other stuff for a few days so I haven't tried to figure out how bazzite/rpm/fedora build system layers patches on top of the regular kernel tree. But just from scrolling through github, is this okay? i.e. this function fails to release a lock through one of the two possible exit paths.

also that caused me to search for brt_lock (admittedly only in the patch file view on github - so imperfect). Here it is possibly calling unlock on a mutex we have already released?

btw - for lulz I tried running the latest ollama (in podman and with the gpu exposed into the container). It worked good! happily uses the GPU and runs fast (haven't benchmarked yet)
I fixed the issues with the locks. I don't think that's it
Yeah I fixed that
And that too, mutex lock happens after the unregister check
That way when we unregister it does not lock twice
What happens is that the GPU explodes in the log i have
A rail does not come back and then it starts accessing invalid memory and it dies
heh - for lulz I tried using ollama via the very fresh rocm halo support. It mostly worked well but I did just see a GPU reset (which everything except ollama recovered from)
fyi
GitHub
Release 6.13.7-107: Asus Z13 RGB Support · bazzite-org/kernel-bazzite
Adds RGB support to Asus Z13 + stability fixes related to backlight.
Full Changelog: 6.13.7-106...6.13.7-107
@Kyle Gospo drop a build in testing in a few hours
https://github.com/bazzite-org/kernel-bazzite/compare/6.13.7-106...6.13.7-107
the new testing build works well (at least as well as the one that had the prior kernel). The keyboard/clamshell light control works also.
RGB should work too on this one
ooh - is there a helper app I should try to test that?
KDE accent
hmm - cool. it seems like it kinda works? I found the UI in KDE it now has the option to have the keyboard color follow the accent color. And initially the color for my theme was blue and the keyboard light was blue (yay!). But if I change the accent color in the KDE theme to red: The KDE button in the UI for keyboard color changes to red (yay) but the actual LED lights on the keyboard stay blue. So possibly something wonky there.
hopefully nothing crashed
yeah the accent thing is kinda trash
Are you guys also experiencing slower Wi-Fi performance on your z13? Seems capped at around 200 Mbps for me and the Mediatek module is soldered on
yes - the upload speed in particular is really slow on the mt7925. For the time being I'm using a USB wifi dongle. After 6.14 is out (and in fedora/bazzite), if it is not fixed then (and no one else is working on it first), I'm planning on spending some serious effort on debugging it (hopefully the flaw is not in the opaque on-device firmware blob)
btw this person did some interesting crude bisect testing. If their test is correct there was a regression somewhere between 6.13.0 and 6.13.3
though I'm a little skeptical because I don't see anything substantial in "git diff v6.12 v6.13-rc3 -- drivers/net/wireless/mediatek"
didn't check linux-firmware though
though 6.14 has lots of new relevant commits. git diff v6.13-rc7 master drivers/net/wireless/mediatek
sad they chose to solder instead of an m.2 key and save 13 cents
also I presume solder saves money on shock&vibe and related (i.e. users mucking around inside the device and fucking up without getting caught) warranty costs.
so probably more like $2
there's a reason they made a door for the SSD and it wasn't just to be nice
and the % of users who would even bother to swap wifi (even if a M.2 connector) is tiny
but yeah - it would fucking rock for me!
Nice... I'll keep an eye on 6.14... I installed Arch originally... but I went back to Windows when I couldn't improve the wifi performance... I'm planning on picking up a 2TB SSD so I'll give it another shot then
btw @antheas I think I've got ryzenadj updated and I'm correctly writing undervolt values for CPU and GPU. I'll test more tomorrow after building up a little test harness. It is a bit of a race against time because we're going away for about a week on a bike trip - so if I don't have it tested by tomorrow evening I won't be back at it until sometime late next week.
I'm not sure if still relevant, but 6.13.8 came out with a suspend fix on eDP: https://cdn.kernel.org/pub/linux/kernel/v6.x/ChangeLog-6.13.8

This should finally fix lingering wake-suspend issues on eDP displays
also mesa 25.0.2 should be out soon with some important fixes
btw the CPU undervolting works. I've also updated the ryzenadj tool so it can read per core CPU power and voltage.
My system works robustly with a -40mV cpu core voltage offset. I'm going to check this in and start working on the (probably similar) changes needed to undervolt the GPU die.
somewhere at about -80 my CPU (mprimes) stress test starts to fail... but pretty happy with -40 at least
GitHub
Add support for Strix Halo CPUs by geeksville · Pull Request #334 ·...
Hi @FlyGoat,
It was good chatting with you via email. I've made some progress on adding Strix Halo CPU support. I'm attaching this PR but I'll keep adding to it in my wor...
good news: I ran a long stress test and -40 of cpu undervolt was fine on my machine for a few hour test (and makes things run a lot cooler under load). -50 is too much and my cpu stresstest fails. UPDATE: Alas, -40 failed after about 4 hrs of stress testing, I'm going to leave -30 running overnight and stay with that if it does okay. FINAL-UPDATE: -30 ran solid for 8 hrs so I'm calling it good (for my particular laptop).
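The back-off procedure described above (try an offset, stress test for hours, step back if it fails) can be sketched roughly like this. It is an illustration only: is_stable is a hypothetical callback that would apply the offset and run a long stress test on real hardware. Here it is simulated with the outcome reported in this thread (-50 and -40 eventually failed, -30 held), which applies to this one laptop only.

```python
from typing import Callable, Iterable, Optional


def most_aggressive_stable_offset(
    candidates: Iterable[int],
    is_stable: Callable[[int], bool],
) -> Optional[int]:
    """Try offsets from most negative to least negative and return the
    first one that passes the stability check, or None if none pass."""
    for offset in sorted(candidates):  # most negative (most aggressive) first
        if is_stable(offset):
            return offset
    return None


def simulated_check(offset: int) -> bool:
    # Stand-in for "apply the offset, stress test for many hours".
    # Mirrors the result reported above for one particular machine.
    return offset >= -30


print(most_aggressive_stable_offset([-50, -40, -30], simulated_check))
```

On real hardware the check is the expensive part, and as the -40 run shows, a few hours can be too short to call an offset stable.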
I just added a commit to improve the GPU support for halos on ryzenadj. Alas, the cogfx mailbox is at a different message code on this new arch (different from hawkpoint/van gogh) - so no setting GPU undervolt until that new mailbox code is found. I'm going to go away now on my bike trek for about a week and a half, hopefully the windows ghelper folks will figure that out and I can crib from what they find
btw: I think the SMU gets a bit clobbered on wake from sleep, if that's true it will be necessary to rerun the set coall command after wake. When I come back I'll check that and if necessary make some sort of systemctlish thing to do the proper whacking.
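A rough sketch of what such a "re-apply after wake" helper might look like, written as the kind of script systemd calls around sleep (systemd sleep hooks receive "pre" or "post" as their first argument). Several things here are assumptions: that negative curve-optimizer values are encoded as 0x100000 minus the magnitude (the convention I recall from the ryzenadj docs for --set-coall), that the Strix Halo work keeps that encoding, and the -30 value itself, which is just the offset that held up on one laptop. The function names are made up for this example.

```python
from typing import Callable, List

# Assumption: ryzenadj encodes negative curve-optimizer values as
# 0x100000 - n; verify against the ryzenadj version you are using.
CO_BASE = 0x100000


def encode_offset(offset: int) -> int:
    """Encode a signed offset the way --set-coall is assumed to expect."""
    return offset if offset >= 0 else CO_BASE + offset


def build_command(offset: int) -> List[str]:
    """Build the ryzenadj argument list for an all-core offset."""
    return ["ryzenadj", f"--set-coall={encode_offset(offset):#x}"]


def handle_sleep_event(phase: str, offset: int,
                       run: Callable[[List[str]], object]) -> bool:
    """Re-apply the offset only on wake ("post"); do nothing before sleep."""
    if phase != "post":
        return False
    run(build_command(offset))
    return True


# Dry run: print the command instead of executing it.
handle_sleep_event("post", -30, print)
```

In a real hook, run would be something like subprocess.run, and phase would come from sys.argv[1]. Whether the SMU actually loses the setting on wake still needs the check mentioned above.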