eGPU crashes randomly and ROG Ally X needs a force reboot
I'm running an RX 6800 XT in an AOOSTAR AG 02 connected to my ROG Ally X.
I've used all-ways-egpu methods 2 and 3 and I'm very happy with the setup and performance, except for some random eGPU crashes. I've tried different games and different game settings with no luck. Whenever the crash happens I have to restart the ROG Ally X. About 10 seconds before I lose the display, games tend to get very slow (like 5-10 frames per second), although the performance overlay keeps saying I'm running at 60+ fps.
The GPU is under 60C and the VRAM usage never goes above 12GB (out of 16). I tried stress testing it and it does just fine so I don't think the GPU is faulty.
I have attached dmesg and journalctl logs. It seems like the Thunderbolt controller thinks the eGPU was briefly disconnected? These are my kargs
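For anyone wanting to check the "brief disconnect" theory in their own logs, here is a minimal sketch that keeps only kernel log lines mentioning the Thunderbolt controller, the GPU driver, or the PCIe port driver. The keyword list is my assumption about what is worth looking at, not something taken from the attached logs, and the script name in the usage note is made up.

```python
import sys
from typing import Iterable, List, Sequence

# Assumed substrings of interest; adjust to whatever your own logs show.
KEYWORDS: Sequence[str] = ("thunderbolt", "amdgpu", "pcieport")


def filter_log_lines(lines: Iterable[str],
                     keywords: Sequence[str] = KEYWORDS) -> List[str]:
    """Return the lines that contain any keyword (case-insensitive)."""
    lowered = tuple(k.lower() for k in keywords)
    return [
        line.rstrip("\n")
        for line in lines
        if any(k in line.lower() for k in lowered)
    ]


if __name__ == "__main__":
    # Reads log text from stdin and prints only the matching lines.
    for match in filter_log_lines(sys.stdin):
        print(match)
```

Something like `journalctl -k -b -1 | python3 filter_egpu_log.py` should show the kernel messages from the boot that crashed, provided the journal is persistent on your install.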
14 Replies
Set the correct clock and memory limits using LACT.
I've got LACT because I set up some custom profiles for fan control (I was thinking 80C+ might be the reason it was crashing).
I'm not sure what the clock speed and memory limits should be. I think those are the defaults (attached screenshot). Should I look at the manufacturer's site to check what the usual values are? Or am I supposed to set them lower than suggested so that it does not crash?

I tried these settings based on the GPU specs, but the device got very slow and I had to force a reboot. I think it doesn't like the min memory clock.
I ended up just lowering my Max GPU Clock from 2444 to 2350. The manufacturer lists the max as 2360. The Min GPU Clock and the Memory ones were left as is.
I've had no luck with the clocks I set above. Would you mind sharing what your suggestion would be? Have you had an issue like this that you resolved by setting clock speeds? From my understanding, the 2444MHz that was set as default is too high, since 2360MHz is the boost clock my card supports, but even when setting it 10 lower, to 2350MHz, I'm having blackouts.
I fixed my eGPU Rx 6800 by simply setting the limits according to the specs...
I see your w limit is wrong.
Did you add thunderbolt.host_reset=0?
Btw check your wattage limits.
Yeah this is the GPU I have https://www.techpowerup.com/gpu-specs/sapphire-nitro-rx-6800-xt.b8324 and it lists 2360MHz as the max. I tried setting it to 2350MHz.
The power usage limit seems to be broken in LACT (it seems to show the APU limits). I can see that the card is pulling way more than that, up to 270W when the GPU is at 100%.
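One way to cross-check what LACT shows is to read the power cap files the amdgpu driver exposes under hwmon. The sketch below assumes the usual `power1_cap`, `power1_cap_default` and `power1_cap_max` files (values in microwatts) and that both the APU and the eGPU register an hwmon device named `amdgpu`, so two entries may be printed. Not every file exists on every device, so missing ones are reported as `None`.

```python
from pathlib import Path
from typing import Dict, Iterator, Optional

# hwmon files assumed to hold power caps in microwatts.
CAP_FILES = ("power1_cap", "power1_cap_default", "power1_cap_max")


def read_power_caps(hwmon_dir: Path) -> Dict[str, Optional[float]]:
    """Read the cap files in one hwmon directory and convert them to watts."""
    caps: Dict[str, Optional[float]] = {}
    for name in CAP_FILES:
        try:
            microwatts = int((hwmon_dir / name).read_text().strip())
            caps[name] = microwatts / 1_000_000
        except (OSError, ValueError):
            caps[name] = None  # file missing or unreadable on this device
    return caps


def find_amdgpu_hwmon(root: Path = Path("/sys/class/hwmon")) -> Iterator[Path]:
    """Yield hwmon directories whose 'name' file says 'amdgpu'."""
    for candidate in sorted(root.glob("hwmon*")):
        try:
            if (candidate / "name").read_text().strip() == "amdgpu":
                yield candidate
        except OSError:
            continue


if __name__ == "__main__":
    for hwmon in find_amdgpu_hwmon():
        print(hwmon, read_power_caps(hwmon))
```

If one entry shows handheld-sized limits and the other shows desktop-card limits, that would at least tell you which device the slider is reading from.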
Yeah, I had to add thunderbolt.host_reset=0 because the all-ways-egpu dev suggested it to help with auto-switching to the eGPU after reboots. Is it okay to leave it like that?
I am not using any parameter that you added after featuremask.
I added pci=nommconf.
I experienced the power limit issue about a year ago and solved it by disabling the option in HHD to write TDP to /sys.
You can read more here
https://universal-blue.discourse.group/t/ayaneo-geek-1s-2s-linux-bazzite-support-is-already-almost-there-lets-add-them-to-the-officially-supported-devices/1046/36
I don't see the host reset suggestion here
https://github.com/ewagner12/all-ways-egpu/wiki/AMD-Performance-Fixes
I'll have a look at pci=nommconf and play around with the kernel args later today when I'm home. Would there be any issue with HHD not being able to write to /sys? I can deactivate it to try, but hopefully HHD will still be able to apply TDP in handheld mode.
After a reboot my external display would not show anything, but sound would still come out of it. I was looking at this similar issue, where the dev suggested the kargs, and adding them made reboots always switch to the external display: https://github.com/ewagner12/all-ways-egpu/issues/42#issuecomment-2764261679
It will. You just won't be able to apply the limit in the Steam performance overlay.
Would you try rerunning the all-ways-egpu setup?
If you used pci=nommconf you need to rerun it.
So far I've removed all the kargs that I had added recently while trying to solve the GPU crashes, and added pci=nommconf. Rebooting does not automatically switch to the external monitor (but sound through eARC does work). I'll try running all-ways-egpu again now and see how it goes.
Methods 2 and 3 were not able to switch to the external monitor after boot.
I'll try adding the host reset back in and see if that fixes it again (I used to have this problem before adding that karg).
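Since kargs are being added and removed a lot here, a quick way to confirm what the running kernel actually booted with is to look at /proc/cmdline. This is a minimal sketch that checks for the two parameters discussed in this thread. The match is exact and case-sensitive, so `PCI=nommconf` would be reported as missing, and a combined form such as `pci=nommconf,<other option>` would not match either.

```python
from pathlib import Path
from typing import Dict, Iterable

# The two kernel arguments discussed in this thread.
WANTED = ("pci=nommconf", "thunderbolt.host_reset=0")


def check_kargs(cmdline: str, wanted: Iterable[str] = WANTED) -> Dict[str, bool]:
    """Report which of the wanted arguments appear as exact tokens."""
    present = set(cmdline.split())
    return {arg: arg in present for arg in wanted}


if __name__ == "__main__":
    cmdline = Path("/proc/cmdline").read_text()
    for arg, found in check_kargs(cmdline).items():
        print(f"{arg}: {'present' if found else 'missing'}")
```

Running it after each reboot makes it easier to be sure which combination a given test session was really using.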
"I experienced the powerlimit issue like one year ago and I solved by removing the option in HHD to write TDP to /sys"
I did this and can now see the slider showing the correct values. Do I set it to the max? The max is 332W, while clicking Default sets it to 289W.
Re-adding thunderbolt.host_reset=0 makes the external monitor the default after a reboot. I have to stick with it.
Ok. Then you should test with and without nommconf.
Played for about an hour and a half with no crashes. This is with nommconf on. Will try again tomorrow. If it does not crash then I’m not removing nommconf or trying without it anymore.
Thanks for the help!
Ok. Report here in case you need something else, close the thread when needed
I've had a chance to play a bit more with my setup. So far I haven't seen any crashes. However, pci=nommconf seemed to make my system freeze upon waking from sleep. After wake it would not display anything on the internal or external display until a force restart. The buttons were also unresponsive, so I could not put it to sleep again. Now sleep/wake are working again. I'll play some more in the upcoming days and see whether the wattage and clock limits have fixed my crashes. Appreciate your help!
I got freezes when waking up today without pci=nommconf, so that might not be the culprit for the freezes. Just in case, I will not be re-adding it until I figure out the freeze. I don't think the LACT wattage/clock changes could have affected that, and I'm seeing some HHD exceptions, so I started another post on that.