eGPU crashes randomly and ROG Ally X needs a forced reboot

I'm running an RX 6800 XT in an AOOSTAR AG 02 connected to my ROG Ally X. I've used all-ways-egpu methods 2 and 3 and I'm very happy with the setup and performance, except for some random eGPU crashes. I've tried different games and different game settings with no luck. Whenever the crash happens I have to restart the ROG Ally X. About 10 seconds before I lose the display, games tend to get very slow (around 5-10 frames per second), although the performance overlay keeps saying I'm running at 60+ fps. The GPU stays under 60C and VRAM usage never goes above 12GB (out of 16). I tried stress testing it and it does just fine, so I don't think the GPU is faulty. I have attached dmesg and journalctl logs. It seems like the Thunderbolt controller thinks the eGPU was disconnected briefly? These are my kargs:
rhgb quiet root=UUID=71350456-f59b-44b7-bd9e-eb986329111c rootflags=subvol=root rw ostree=/ostree/boot.1/default/e91de861bae555263d9c9f2e37f233242b11a8a402626ba6e831b602d6fd650f/0 amdgpu.gttsize=12192 amdgpu.sg_display=0 bluetooth.disable_ertm=1 preempt=full amdgpu.ppfeaturemask=0xfff7ffff amdgpu.runpm=0 thunderbolt.host_reset=0 vt.global_cursor_default=0 pcie_aspm=off dcdebugmask=0x10 amdgpu.gfxoff=0
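A minimal sketch for sanity-checking a karg string like the one above, in Python. It splits the command line into key/value pairs and lists amdgpu-style options that appear without the `amdgpu.` module prefix; as far as I know, module parameters on the kernel command line are normally written as `module.param=value`. The `KNOWN_AMDGPU_OPTS` set and the helper names are illustrative assumptions, not an exhaustive or official list.

```python
# Illustrative, incomplete set of amdgpu option names (an assumption made for
# this sketch, taken from the kargs quoted in this thread).
KNOWN_AMDGPU_OPTS = {
    "gttsize", "sg_display", "ppfeaturemask", "runpm", "dcdebugmask", "gfxoff",
}


def parse_kargs(cmdline):
    """Map each karg to its value; bare flags such as 'quiet' map to None."""
    params = {}
    for token in cmdline.split():  # note: does not handle quoted values with spaces
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def unprefixed_amdgpu_opts(params):
    """Return amdgpu-looking options that lack the 'amdgpu.' prefix."""
    return sorted(key for key in params if key in KNOWN_AMDGPU_OPTS)


# Shortened copy of the kargs from this post; on a live system the string
# could be read from /proc/cmdline instead.
KARGS = (
    "rhgb quiet amdgpu.gttsize=12192 amdgpu.sg_display=0 "
    "amdgpu.ppfeaturemask=0xfff7ffff amdgpu.runpm=0 thunderbolt.host_reset=0 "
    "pcie_aspm=off dcdebugmask=0x10 amdgpu.gfxoff=0"
)

params = parse_kargs(KARGS)
print(unprefixed_amdgpu_opts(params))
```

With the kargs from this post, the only option it flags is `dcdebugmask`, which may be worth double-checking.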
14 Replies
Zetarancio, 7d ago
Set the correct clock and memory limits using LACT.
mindxpert (OP), 7d ago
I've got LACT because I set up some custom profiles for fan control (I was thinking 80C+ might be the reason it was crashing). I'm not sure what the clock speed and memory limits should be. I think those are the defaults (attached screenshot). Should I look at the manufacturer's site to check what the usual values are? Or am I supposed to set them lower than suggested so that it does not crash?
mindxpert (OP), 7d ago
I tried these settings based on the GPU specs, but the device got very slow and I had to force a reboot. I think it doesn't like the min memory clock.
Core Clock: 2000 to 2360MHz
Memory Clock: 2000 to 2000MHz
I ended up just lowering my Max GPU Clock from 2444 to 2350. The manufacturer lists the max as 2360. The Min GPU Clock and the memory clocks were left as is.
Core Clock: 500 to 2350MHz
Memory Clock: 1348 to 2000MHz
I've had no luck with the clocks I set above. Would you mind sharing what your suggestion would be? Have you had an issue like this that you resolved by setting clock speeds? From my understanding, the 2444MHz that was set as default is too high, since 2360MHz is the boost clock my card supports, but even after setting it 10MHz lower, to 2350MHz, I'm still having blackouts.
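For comparing clock ranges like the ones in these posts against the spec values, here is a rough Python sketch. The `SPEC_MAX` numbers are the ones quoted in this thread (2360MHz boost, 2000MHz memory); the function names and the "min equals max" warning are assumptions made for illustration, not something LACT itself reports.

```python
import re

# Spec ceilings as quoted in this thread (treat them as assumptions for other cards).
SPEC_MAX = {"Core Clock": 2360, "Memory Clock": 2000}

RANGE_RE = re.compile(
    r"(?P<name>[\w ]+):\s*(?P<lo>\d+)\s*to\s*(?P<hi>\d+)\s*MHz", re.IGNORECASE
)


def parse_range(line):
    """Parse a line such as 'Core Clock: 500 to 2350Mhz' into (name, lo, hi)."""
    m = RANGE_RE.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognised clock line: {line!r}")
    return m["name"].strip(), int(m["lo"]), int(m["hi"])


def check_range(lo, hi, spec_max):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    if lo > hi:
        problems.append(f"min {lo} MHz is above max {hi} MHz")
    if hi > spec_max:
        problems.append(f"max {hi} MHz exceeds spec {spec_max} MHz")
    if lo == hi:
        problems.append(f"min equals max ({lo} MHz), so the clock is pinned")
    return problems


for line in ["Core Clock: 500 to 2444Mhz", "Memory Clock: 2000 to 2000MHz"]:
    name, lo, hi = parse_range(line)
    print(name, check_range(lo, hi, SPEC_MAX[name]))
```

This only checks the numbers for consistency; it does not say which values are actually stable on a given card.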
Zetarancio, 7d ago
I fixed my eGPU RX 6800 by simply setting the limits according to the specs... I see your W (power) limit is wrong. Did you add thunderbolt.host_reset=0? Btw, check your wattage limits.
mindxpert (OP), 7d ago
Yeah, this is the GPU I have: https://www.techpowerup.com/gpu-specs/sapphire-nitro-rx-6800-xt.b8324 and it lists 2360MHz as the max. I tried setting it to 2350MHz. The power usage limit seems to be broken in LACT (it seems to show the APU limits). I can see that the card is pulling way more than that, up to 270W when the GPU is at 100%. Yeah, I had to add thunderbolt.host_reset=0 because the all-ways-egpu dev suggested it to help with auto-switching to the eGPU after reboots. Is it okay to leave it like that?
Zetarancio, 7d ago
I am not using any of the parameters that you added after ppfeaturemask. I added pci=nommconf. I experienced the power limit issue about a year ago and I solved it by disabling the option in HHD to write TDP to /sys. You can read more here: https://universal-blue.discourse.group/t/ayaneo-geek-1s-2s-linux-bazzite-support-is-already-almost-there-lets-add-them-to-the-officially-supported-devices/1046/36
Universal Blue
Ayaneo Geek 1S/2S Linux/Bazzite support is already almost there, le...
Hello guys. I am still in contact with Ayaneo. Some devices have already updated bios and EC but 1s have not been updated yet. A quick update: -Audio jack fixed -Egpu working -Resume, works with hibernation workaround Great device overall nowadays. Bazzite is at its peak performance. Can’t wait for kernel 6.14 for NTSYNC!
Zetarancio, 7d ago
GitHub
AMD Performance Fixes
Configure eGPU as primary under Linux Wayland desktops - ewagner12/all-ways-egpu
mindxpert (OP), 6d ago
I'll have a look at the pci=nommconf option and play around with the kernel args later today when I'm home. Would there be any issue with HHD not being able to write to /sys? I can deactivate it to try, but hopefully HHD will still be able to apply TDP in handheld mode. After a reboot, my external display would not show anything, although sound would come out of it. I was looking at this similar issue where the dev suggested the kargs, and they made reboots always switch to the external display: https://github.com/ewagner12/all-ways-egpu/issues/42#issuecomment-2764261679
Zetarancio, 6d ago
It will. You just won't be able to apply the limit in the Steam performance overlay. Would you try rerunning the all-ways-egpu setup? If you use pci=nommconf you need to rerun it.
mindxpert (OP), 6d ago
So far I've removed all the kargs that I had added recently while trying to solve the GPU crashes, and added pci=nommconf. Rebooting does not automatically switch to the external monitor (but sound through eARC does). I'll try running all-ways-egpu again now and see how it goes. Methods 2 and 3 were not able to switch to the external monitor after boot.
Apr 23 23:25:16 fedora systemd[1]: Starting all-ways-egpu-boot-vga.service - Configure eGPU as primary using boot_vga under Wayland desktops...
Apr 23 23:25:16 fedora all-ways-egpu-entry.sh[3022]: No eGPU detected, retry 1
Apr 23 23:25:17 fedora all-ways-egpu-entry.sh[3022]: No eGPU detected, retry 2
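The retry lines above can be counted with a short Python sketch like the one below, which may help when comparing boots with and without a given karg. The regular expression is written against the exact message text shown in this log; the function name is made up for the example, and on a live system the lines could come from something like `journalctl -b -u all-ways-egpu-boot-vga.service`.

```python
import re

# Matches the retry message exactly as it appears in the log excerpt above.
RETRY_RE = re.compile(
    r"all-ways-egpu-entry\.sh\[\d+\]: No eGPU detected, retry (\d+)"
)


def max_retry(lines):
    """Return the highest retry number seen, or 0 if no retry lines are present."""
    retries = [int(m.group(1)) for line in lines if (m := RETRY_RE.search(line))]
    return max(retries, default=0)


SAMPLE = [
    "Apr 23 23:25:16 fedora systemd[1]: Starting all-ways-egpu-boot-vga.service - "
    "Configure eGPU as primary using boot_vga under Wayland desktops...",
    "Apr 23 23:25:16 fedora all-ways-egpu-entry.sh[3022]: No eGPU detected, retry 1",
    "Apr 23 23:25:17 fedora all-ways-egpu-entry.sh[3022]: No eGPU detected, retry 2",
]

print(max_retry(SAMPLE))
```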
I'll try adding the host reset karg back in and see if that fixes it again (I used to have this problem before adding that karg).
"I experienced the powerlimit issue like one year ago and I solved by removing the option in hdd to write Tdp to /sys"
I did this and the slider now shows the correct values. Do I set it to the max? The max is 332W, while clicking Default sets it to 289W. Re-adding thunderbolt.host_reset=0 makes the external monitor the default after a reboot, so I have to stick with it.
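To put the two slider values in perspective, here is a small Python sketch of the arithmetic. It assumes the values come from hwmon-style files such as `power1_cap`, which the hwmon sysfs interface reports in microwatts; the file path mentioned in the comment is hypothetical and the helper names are invented for this example.

```python
def microwatts_to_watts(raw):
    """Convert a raw hwmon reading such as '289000000' to watts."""
    return int(raw.strip()) / 1_000_000


def read_cap_watts(path):
    """Read a cap file, e.g. a hypothetical .../hwmon/hwmonN/power1_cap."""
    with open(path) as f:
        return microwatts_to_watts(f.read())


def headroom_percent(default_w, max_w):
    """How far the maximum cap sits above the default cap, in percent."""
    return (max_w / default_w - 1) * 100


# Values from this thread: Default is 289W, the slider maximum is 332W.
print(round(headroom_percent(289, 332), 1))
```

So the maximum is roughly 15% above the default. Whether running at the maximum is advisable is a separate question that this arithmetic does not answer.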
Zetarancio, 6d ago
Ok. Then you should test with and without nommconf.
mindxpert (OP), 6d ago
Played for about an hour and a half with no crashes. This is with nommconf on. Will try again tomorrow. If it does not crash then I’m not removing nommconf or trying without it anymore. Thanks for the help!
Zetarancio, 3d ago
Ok. Report back here in case you need anything else, and close the thread when you're done.
mindxpert (OP), 2d ago
I've got to play a bit more with my setup. So far I haven't seen any crashes. However, pci=nommconf seemed to make my system freeze on waking from sleep: after wake it would not show anything on the internal or external display until a forced restart. Buttons were also unresponsive, so I could not put it to sleep again. With the karg removed, sleep/wake is working again. I'll play some more in the upcoming days and conclude whether the wattage + clock limits have fixed my crashes. Appreciate your help!
I got freezes when waking up today without pci=nommconf, so that might not be the culprit for the freezes. I will not be re-adding it until I figure out the freeze, just in case. I don't think the LACT watt/clock changes could have affected that, and I'm seeing some HHD exceptions, so I started another post on that.
