Misc Z13/395+ CPU problems.

So I'm finally getting around to playing a game on this thing (woot! Satisfactory factory goodness). Alas, after about 30 minutes of play (Steam + Proton), the Wayland user session suddenly locked up. Fortunately the key combo to enter a virtual console still worked (alas, I don't have another machine here with me that I could have sshed in from). Here are the dmesg logs I could snarf. Nasty amdgpu errors:
71 Replies
geeksville (OP) • 4w ago
This was with the 6.13.5-102.bazzite.fc41.x86_64 (64-bit) kernel. Just looking at this log and the 6.13.6 changelog, I bet it is fixed in: amdgpu/pm/legacy: fix suspend/resume issues
antheas • 4w ago
.6 is in testing with mesa 25
geeksville (OP) • 4w ago
thanks! just snarfed it. So far I think 6.13.6 probably fixes it!
@antheas Though 6.13.6-101 does have a regression compared to 6.13.5: pressing the sleep button in the GUI no longer fully enters sleep (the screen goes black and wake works, but I can still hear the fans spinning). A new error message appears in the dmesg output:
geeksville (OP) • 4w ago
[ 137.130226] amd_pmc AMDI000B:00: Last suspend didn't reach deepest state
The failure happens after the virtual console is shut down. I turned on no_console_suspend after seeing this problem, and wherever the new sleep problem is, it happens after "suspend devices" completes (because the screen was dark but the CPU was clearly still kicking).
antheas • 4w ago
I can't test it rn
Heading to bed
Tomorrow
Fucking kernel, I swear to god
Just looking at the changelog, that thing probably broke it
geeksville (OP) • 4w ago
heh!
CheckYourFax • 4w ago
Antheas to the gpu kernel peeps
antheas • 4w ago
built the revert
antheas • 4w ago
GitHub
Release 6.13.6-102: Revert AMD Sleep patch · hhd-dev/kernel-bazzite
Commit a355d0d24d00d19fa70d6408fc1be34fe8ac79e5 is suspected to be causing sleep issues on the Z13. Revert it. Full Changelog: 6.13.6-101...6.13.6-102
antheas • 4w ago
Worked on my side but it was a dirty build
So tomorrow I'll test this one and hopefully Kyle will drop it into testing
There is a chance it was something else
I partially built the kernel, so maybe a missing module
geeksville (OP) • 4w ago
btw - alas, after 1 hr of play Satisfactory still crashed using this bazzite:testing branch (kernel 6.13.6-101.bazzite.fc41.x86_64). Wayland locked up. Relevant dmesg attached:
geeksville (OP) • 4w ago
btw I just took a look at the first exception in this latest newcrash file. I bet the root cause is somewhere in panel self-refresh. The relevant code, amdgpu_dm_enable_self_refresh(), is young (Nov 2024ish: https://lore.kernel.org/all/[email protected]/T/#m650152eb173c3a0b299c39dd843e92d0903b8b49 ). I'm going to dig around and see if I can find a runtime flag to turn off this feature and see if the problem goes away.
ok, I dug around in the relevant kernel srcs and that exception. I think there is a very high likelihood that the problem is in the new panel-replay optimization feature. I'm currently doing a test with "rpm-ostree kargs --append=amdgpu.dcdebugmask=0x400" to mask out just that feature. I also wouldn't be surprised (based on the code comments about what that feature does) if this also fixes the occasional draw artifacts. Also, the power savings provided by this feature are probably small.
CheckYourFax • 4w ago
because the new kernel hasn't landed in testing yet
You can test if this is the problem by adding amdgpu.dcdebugmask=0x10 to your kargs; this disables panel self refresh:
rpm-ostree kargs --append-if-missing=amdgpu.dcdebugmask=0x10
It's a power usage optimization feature.
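For reference, a minimal Python sketch of how the mask values discussed here can be read as bit flags. The meanings of 0x10 (PSR) and 0x400 (panel replay) are taken from this thread; 0x200 (PSR-SU) is an assumption based on the upstream `DC_DEBUG_MASK` enum and should be checked against your kernel source. The helper names are hypothetical.

```python
# Bit meanings: 0x10 and 0x400 as described in this thread; 0x200 is an
# assumption to verify against drivers/gpu/drm/amd/include/amd_shared.h.
DC_DEBUG_BITS = {
    0x10: "disable PSR (panel self refresh)",
    0x200: "disable PSR-SU (selective update)",
    0x400: "disable panel replay",
}


def describe_mask(mask):
    """Return (list of known flag descriptions, leftover unknown bits)."""
    known = []
    leftover = mask
    for bit, name in sorted(DC_DEBUG_BITS.items()):
        if mask & bit:
            known.append(name)
            leftover &= ~bit
    return known, leftover


def karg(mask):
    """Format the mask as a kernel argument with the module prefix."""
    return f"amdgpu.dcdebugmask={mask:#x}"


if __name__ == "__main__":
    for value in (0x10, 0x400, 0x410):
        print(karg(value), describe_mask(value))
```

Flags are OR-ed together, so a single karg can carry several of them at once.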
geeksville (OP) • 4w ago
right, I was just mentioning that 6.13.6-101 didn't even fix the original thing I thought it fixed 😉
IMO no need to turn off all of PSR; from looking at the code, the error is in the self-refresh path only, so 0x400 is probably better.
CheckYourFax • 4w ago
Yes, but to test whether this is actually the problem it's a good idea to disable it. If your issue is fixed, you know the problem lies there 🙂
geeksville (OP) • 4w ago
0x400 turns off a subset of what 0x10 turns off 😉
CheckYourFax • 4w ago
Sure. If that doesn't work it might still be worthwhile to turn the whole feature off. My way of testing things is usually turn the whole shit off, see if it works, and then re-enable things one by one 😛
geeksville (OP) • 4w ago
sure, I'm testing 0x400 now; if that doesn't work I'll go to a bigger hammer (with higher costs)
yeah - but I've looked at the code and the error path is definitely in the section guarded by 0x400. Testing now though.
Alas, 0x400 was not sufficient: the exception eventually occurred. It made it a bit deeper into amdgpu_dm_enable_self_refresh() but failed later in the function. So I switched to 0x10 (to turn off all of the PSR code). It's been running for 40 min now and I think it will be golden, because the occasional draw artifacts that everyone has seen no longer occur.
I bet this problem could occur on any eDP panel that supports PSR. Y'all could turn on amdgpu.dcdebugmask=0x10 for everyone via kargs, and I think the cost would be zero for any unit that doesn't have a PSR-capable display. From browsing 6.14 commits it looks like the AMD geeks are still futzing with this feature, so such a hack will probably be needed only for a little while. I.e. tasty-sounding commits like this: drm/amd/display: Disable PSR-SU on eDP panels
@antheas the good news: turning off PSR definitely works around the original exception in this report, and it also fixes the occasional brief draw artifacts we've seen. See the comment above about tasty-sounding 6.14 commits for the root cause.
The bad news: I just installed the new testing build. Whichever change you backed out to make the 6.13.6-102 kernel wasn't the cause of 6.13.6 failing to fully enter sleep. Sleep still doesn't fully enter on the testing branch. The relevant dmesgs are unchanged:
[ 49.815569] PM: suspend entry (s2idle)
[ 49.841997] Filesystems sync: 0.025 seconds
[ 49.894384] Freezing user space processes
[ 50.888266] Freezing user space processes completed (elapsed 0.993 seconds)
[ 50.888285] OOM killer disabled.
[ 50.888290] Freezing remaining freezable tasks
[ 50.889316] Freezing remaining freezable tasks completed (elapsed 0.001 seconds)
[ 50.904563] queueing ieee80211 work while going to suspend
[ 50.911953] cs35l41-hda i2c-CSC3551:00-cs35l41-hda.0: Asserting Reset
[ 51.048414] usb 3-2: reset high-speed USB device number 2 using xhci_hcd
[ 51.185959] PM: suspend devices took 0.296 seconds
[ 51.187718] ACPI: EC: interrupt blocked
[ 70.547009] amd_pmc AMDI000B:00: Last suspend didn't reach deepest state
[ 70.547561] ACPI: EC: interrupt unblocked
[ 70.746632] [drm] PCIE GART of 512M enabled (table at 0x00000083FFB00000).
[ 70.746668] amdgpu 0000:c4:00.0: amdgpu: SMU is resuming...
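A rough Python sketch for spotting this failure in a saved `dmesg` dump, so sleep cycles on different kernels can be compared without eyeballing the log. It assumes the bracketed-timestamp format shown above, one message per line; the function name and return shape are made up for illustration.

```python
import re

# Matches lines such as "[ 70.547009] amd_pmc AMDI000B:00: ..."
LINE_RE = re.compile(r"\[\s*(\d+\.\d+)\]\s+(.*)")
FAIL_TEXT = "Last suspend didn't reach deepest state"


def summarize_suspend(dmesg_text):
    """Collect suspend-entry timestamps and deepest-state failures."""
    entries = []
    failures = []
    for line in dmesg_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        stamp, message = float(match.group(1)), match.group(2)
        if message.startswith("PM: suspend entry"):
            entries.append(stamp)
        elif FAIL_TEXT in message:
            failures.append(stamp)
    return {"entries": entries, "failures": failures}


SAMPLE = """\
[ 49.815569] PM: suspend entry (s2idle)
[ 51.187718] ACPI: EC: interrupt blocked
[ 70.547009] amd_pmc AMDI000B:00: Last suspend didn't reach deepest state
[ 70.547561] ACPI: EC: interrupt unblocked
"""

if __name__ == "__main__":
    print(summarize_suspend(SAMPLE))
```

An empty `failures` list for a cycle suggests the amd_pmc warning did not fire, which is the difference reported between 6.13.5 and 6.13.6 later in the thread.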
antheas • 4w ago
Ok so I have to do a full kernel rebuild and try it today
antheas • 4w ago
Turns out it was always broken and 13.6 is fine
You probably plugged in a dock or something
geeksville (OP) • 4w ago
hmm - even with no USB accessories attached, the behavior is the same on my Flow. 13.6 gives that error message wrt suspend (and the fans keep spinning while sleeping). 13.5 is fine.
Do you get that "amd_pmc AMDI000B:00: Last suspend didn't reach deepest state" message even on 13.5?
antheas • 4w ago
yes
geeksville (OP) • 4w ago
so sleep doesn't fully enter for you on 13.5? (fans stay spinning etc)
antheas • 4w ago
fans stay off but the message says what I said
Fan stays on on .6 again
you jinxed it
I'll do more testing tomorrow
geeksville (OP) • 4w ago
weird. I just tried a bunch of sleep cycles in 13.5 and didn't have that message.
[ 357.699955] PM: suspend entry (s2idle)
[ 357.711232] Filesystems sync: 0.010 seconds
[ 357.738037] Freezing user space processes
[ 359.640330] Freezing user space processes completed (elapsed 1.901 seconds)
[ 359.641204] OOM killer disabled.
[ 359.641763] Freezing remaining freezable tasks
[ 359.644107] Freezing remaining freezable tasks completed (elapsed 0.001 seconds)
[ 359.667529] cs35l41-hda i2c-CSC3551:00-cs35l41-hda.0: Asserting Reset
[ 359.673237] queueing ieee80211 work while going to suspend
[ 359.674305] queueing ieee80211 work while going to suspend
[ 359.814613] usb 3-2: reset high-speed USB device number 2 using xhci_hcd
[ 359.952181] PM: suspend devices took 0.307 seconds
[ 359.953778] ACPI: EC: interrupt blocked
antheas • 4w ago
@Kyle Gospo push the .5-103 to stable
.6 is cooked
I set it as latest if akmods need a rebuild
When I compile it locally it works
Fml
geeksville (OP) • 4w ago
ooh! interesting!!!
CheckYourFax • 4w ago
antheas is chasing a classic heisenbug: a bug that disappears once you try to debug it :clueless:
maybe it's some compiler optimization causing issues
that's a classic heisenbug
antheas • 3w ago
@geeksville new kernel is building, seems like amdxdna needed some fixes. Hopefully in a few hours you can test
geeksville (OP) • 3w ago
cool beans! I'll try it today!
6.13.6-103 (via bazzite:testing) works good! It fixes the new sleep problem.
CheckYourFax • 3w ago
https://bodhi.fedoraproject.org/updates/FEDORA-2025-346cf69656
6.13.7 is also out already, so it might be worth rebasing 😛
This also finally includes the Unicode fix for the Anaconda installer
geeksville (OP) • 3w ago
(after the GPU reset all was fine again)
antheas • 3w ago
I'll queue a .7 in a few hours
this really needs a rename
antheas • 3w ago
https://github.com/bazzite-org/kernel-bazzite/releases/tag/6.13.7-104 something for you to play with in a few hours
GitHub
Release 6.13.7-104: Z13 keyboard goodies · bazzite-org/kernel-bazzite
For the Asus ROG Z13: Fixes the touchpad acting like a mouse during boot Fixes the keyboard and lightbar light brightness levels and syncs them with the keyboard backlight setting in KDE/GNOME Fix...
geeksville (OP) • 3w ago
alas, this kernel isn't yet in bazzite-testing but I can check again tomorrow morning.
antheas • 3w ago
My z13 wakes up at night on its own
And crashes
Same time, 2:38
geeksville (OP) • 3w ago
interesting! I haven't seen that on mine (6.13.6-103.bazzite.fc41.x86_64). I put it to sleep at night and when I wake in the morning by pressing a key it looks fine.
antheas • 3w ago
I think it's .7
My .7
Probably Mario's display patches are undercooked and I should nix them
Although I can't see anything wrong with them
geeksville (OP) • 3w ago
I'm busy with other stuff for a few days, so I haven't tried to figure out how the bazzite/rpm/fedora build system layers patches on top of the regular kernel tree. But just from scrolling through GitHub: is this okay? I.e. this function fails to release a lock through one of the two possible exit paths.
[image attachment, no description]
geeksville (OP) • 3w ago
Also, that caused me to search for brt_lock (admittedly only in the patch file view on GitHub, so imperfect). Here, is it possibly calling unlock on a mutex we have already released?
[image attachment, no description]
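The two lock problems raised here (an exit path that skips the unlock, and an unlock of a lock that is already released) are a generic bug class. The Python sketch below only illustrates that pattern; it is not the kernel patch in question, and the function names are hypothetical (only `brt_lock` is borrowed from the thread).

```python
import threading

brt_lock = threading.Lock()  # stand-in for the mutex named in the thread


def set_brightness_buggy(registered):
    """Early return leaves the lock held: the next caller would block."""
    brt_lock.acquire()
    if not registered:
        return False  # BUG: returns without releasing brt_lock
    brt_lock.release()
    return True


def set_brightness_fixed(registered):
    """The context manager releases the lock on every exit path."""
    with brt_lock:
        if not registered:
            return False
        return True


if __name__ == "__main__":
    set_brightness_buggy(False)
    print("held after buggy call:", brt_lock.locked())
    brt_lock.release()
    set_brightness_fixed(False)
    print("held after fixed call:", brt_lock.locked())
```

In Python, releasing an unlocked `threading.Lock` raises `RuntimeError`, which makes the double-unlock case easy to see. In kernel C the equivalent mistake is undefined behaviour for a mutex, which is why it is worth auditing each exit path.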
geeksville (OP) • 3w ago
btw - for lulz I tried running the latest ollama (in podman and with the gpu exposed into the container). It worked good! happily uses the GPU and runs fast (haven't benchmarked yet)
antheas • 3w ago
I fixed the issues with the locks. I don't think that's it
Yeah I fixed that
And that too, mutex lock happens after the unregister check
That way when we unregister it does not lock twice
What happens is that the GPU explodes in the log I have
A rail does not come back and then it starts accessing invalid memory and it dies
geeksville (OP) • 3w ago
heh - for lulz I tried using ollama via the very fresh rocm halo support. It mostly worked well but I did just see a GPU reset (which everything except ollama recovered from)
geeksville (OP) • 3w ago
fyi
antheas • 2w ago
GitHub
Release 6.13.7-107: Asus Z13 RGB Support · bazzite-org/kernel-bazzite
Adds RGB support to Asus Z13 + stability fixes related to backlight. Full Changelog: 6.13.7-106...6.13.7-107
antheas • 2w ago
geeksville (OP) • 2w ago
the new testing build works well (at least as well as the one that had the prior kernel). The keyboard/clamshell light control also works.
antheas • 2w ago
RGB should work too on this one
geeksville (OP) • 2w ago
ooh - is there a helper app I should try to test that?
antheas • 2w ago
KDE accent
geeksville (OP) • 2w ago
hmm - cool. It seems like it kinda works? I found the UI in KDE: it now has the option to have the keyboard color follow the accent color. Initially the color for my theme was blue and the keyboard light was blue (yay!). But if I change the accent color in the KDE theme to red, the KDE button in the UI for keyboard color changes to red (yay) but the actual LEDs on the keyboard stay blue. So possibly something wonky there.
antheas • 2w ago
hopefully nothing crashed
yeah the accent thing is kinda trash
konros • 2w ago
Are you guys also experiencing slower Wi-Fi performance on your z13? Seems capped at around 200 Mbps for me and the Mediatek module is soldered on
geeksville (OP) • 2w ago
yes - the upload speed in particular is really slow on the mt7925. For the time being I'm using a USB wifi dongle. After 6.14 is out (and in fedora/bazzite), if it is still not fixed (and no one else is working on it first), I'm planning on spending some serious effort on debugging it (hopefully the flaw is not in the opaque on-device firmware blob) 😉
btw this person did some interesting crude bisect testing. If their test is correct, there was a regression somewhere between 6.13.0 and 6.13.3
geeksville (OP) • 2w ago
though I'm a little skeptical because I don't see anything substantial in "git diff v6.12 v6.13-rc3 -- drivers/net/wireless/mediatek"
didn't check linux-firmware though
though 6.14 has lots of new relevant commits: git diff v6.13-rc7 master -- drivers/net/wireless/mediatek
CheckYourFax • 2w ago
sad they chose to solder it on instead of using an M.2 slot, just to save 13 cents 😦
geeksville (OP) • 2w ago
also I presume solder saves money on shock&vibe and related warranty costs (i.e. users mucking around inside the device and fucking up without getting caught), so probably more like $2
there's a reason they made a door for the SSD and it wasn't just to be nice 😉
and the % of users who would even bother to swap wifi (even with an M.2 connector) is tiny
but yeah - it would fucking rock for me!
konros • 2w ago
Nice... I'll keep an eye on 6.14... I installed Arch originally... but I went back to Windows when I couldn't improve the wifi performance... I'm planning on picking up a 2TB SSD so I'll give it another shot then
geeksville (OP) • 2w ago
btw @antheas I think I've got ryzenadj updated and I'm correctly writing undervolt values for CPU and GPU. I'll test more tomorrow after building up a little test harness. It is a bit of a race against time because we're going away for about a week on a bike trip - so if I don't have it tested by tomorrow evening I won't be back at it until sometime late next week.
CheckYourFax • 2w ago
I'm not sure if still relevant, but 6.13.8 came out with a suspend fix on eDP: https://cdn.kernel.org/pub/linux/kernel/v6.x/ChangeLog-6.13.8
CheckYourFax • 2w ago
[image attachment, no description]
CheckYourFax • 2w ago
This should finally fix lingering wake-suspend issues on eDP displays
geeksville (OP) • 2w ago
also mesa 25.0.2 should be out soon with some important fixes
btw the CPU undervolting works. I've also updated the ryzenadj tool so it can read per-core CPU power and voltage. My system works robustly with a -40mV CPU core voltage offset. I'm going to check this in and start working on the (probably similar) changes needed to undervolt the GPU die.
yay! 40 mV (30 mV actual at idle - a bit more under load) of CPU undervolting works:
sudo ./ryzenadj --dump-table --set-coall=0x0fffd8 >coal40
core voltage change:
| 0x0BD0 | 0x3F343865 | 0.704 | 0.675 |
| 0x0BD4 | 0x3F33FD8A | 0.703 | 0.675 |
| 0x0BD8 | 0x3F34373A | 0.704 | 0.675 |
| 0x0BDC | 0x3F35B1FB | 0.710 | 0.681 |
| 0x0BE0 | 0x3F351F74 | 0.708 | 0.679 |
| 0x0BE4 | 0x3F34F129 | 0.707 | 0.678 |
| 0x0BE8 | 0x3F34C85D | 0.706 | 0.678 |
| 0x0BEC | 0x3F350183 | 0.707 | 0.679 |
| 0x0BF0 | 0x3F327A87 | 0.697 | 0.668 |
| 0x0BF4 | 0x3F341BA7 | 0.704 | 0.675 |
| 0x0BF8 | 0x3F32F8A9 | 0.699 | 0.671 |
| 0x0BFC | 0x3F33A0E0 | 0.702 | 0.673 |
| 0x0C00 | 0x3F32ED00 | 0.699 | 0.670 |
| 0x0C04 | 0x3F337071 | 0.701 | 0.672 |
| 0x0C08 | 0x3F3376AF | 0.701 | 0.672 |
| 0x0C0C | 0x3F326486 | 0.697 | 0.668 |
somewhere at about -80 my CPU stress test (mprime) starts to fail... but I'm pretty happy with -40 at least
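A small Python sketch for reading the table above. It rests on two assumptions: that the second column is the raw 32-bit table word holding an IEEE-754 float (this does reproduce the third column), and that the third and fourth columns are the core voltage before and after applying the offset. Function names are illustrative only, and only the first four rows are copied in.

```python
import struct


def raw_to_float(raw):
    """Reinterpret a 32-bit table word as an IEEE-754 single float."""
    return struct.unpack("<f", struct.pack("<I", raw))[0]


def mean_drop_mv(rows):
    """Average (before - after) voltage drop in millivolts."""
    drops = [before - after for before, after in rows]
    return 1000.0 * sum(drops) / len(drops)


# First four rows of the table: (before, after) in volts, assumed meaning.
ROWS = [(0.704, 0.675), (0.703, 0.675), (0.704, 0.675), (0.710, 0.681)]

if __name__ == "__main__":
    print(round(raw_to_float(0x3F343865), 3))
    print(round(mean_drop_mv(ROWS), 2), "mV average drop")
```

Under those assumptions the average drop comes out just under 30 mV, in line with the "30 mV actual at idle" figure quoted in the message.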
geeksville (OP) • 2w ago
GitHub
Add support for Strix Halo CPUs by geeksville · Pull Request #334 ·...
Hi @FlyGoat, It was good chatting with you via email. I've made some progress on adding Strix Halo CPU support. I'm attaching this PR but I'll keep adding to it in my wor...
geeksville (OP) • 2w ago
good news: I ran a long stress test and -40 of CPU undervolt was fine on my machine for a test lasting a few hours (and it makes things run a lot cooler under load). -50 is too much and my CPU stress test fails.
UPDATE: Alas, -40 failed after about 4 hrs of stress testing. I'm going to leave -30 running overnight and stay with that if it does okay.
FINAL-UPDATE: -30 ran solid for 8 hrs, so I'm calling it good (for my particular laptop).
I just added a commit to improve the GPU support for Halos in ryzenadj. Alas, the cogfx mailbox is at a different message code on this new arch (different from hawkpoint/van gogh), so no setting the GPU undervolt until that new mailbox code is found. I'm going away now on my bike trek for about a week and a half; hopefully the Windows ghelper folks will figure that out and I can crib from what they find 😉
btw: I think the SMU gets a bit clobbered on wake from sleep. If that's true, it will be necessary to rerun the set coall command after wake. When I come back I'll check that and, if necessary, make some sort of systemctlish thing to do the proper whacking.
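One possible shape for that "systemctlish thing", sketched in Python under stated assumptions. systemd runs executables placed in its system-sleep hook directory with `pre` or `post` as the first argument, so a hook could re-apply the offset on `post`. The ryzenadj path is hypothetical, and the negative-offset encoding (wrap around 0x100000) is inferred from the `--set-coall=0x0fffd8` value used for -40 above. Whether the SMU really loses the setting on wake is still unverified.

```python
import subprocess
import sys

RYZENADJ = "/usr/local/bin/ryzenadj"  # hypothetical install path


def encode_coall(offset):
    """Encode a curve-optimizer offset; negatives wrap around 0x100000.

    Inferred from the thread: -40 was passed as 0x0fffd8.
    """
    if offset >= 0:
        return offset
    return 0x100000 + offset


def resume_command(offset):
    """Build the ryzenadj command line that re-applies the offset."""
    return [RYZENADJ, f"--set-coall={encode_coall(offset):#x}"]


def main(argv):
    # systemd passes "pre" before suspend and "post" after resume.
    if len(argv) >= 2 and argv[1] == "post":
        subprocess.run(resume_command(-30), check=False)


if __name__ == "__main__":
    main(sys.argv)
```

On an image-based system such as Bazzite the hook directory under /usr may be read-only, so a small systemd unit ordered after suspend could be the more practical route; that part is left open here.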
