Bot recurringly not responding at specific times
For a few months now, our bot has been having issues on Sundays.
It begins every Sunday at 04:00 CET and ends every Monday at 04:00 CET. This coincides with the full 24 hours of Sunday in the BRT timezone (which might be relevant given the bot's demographics).
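To make the timezone coincidence concrete: 04:00 CET (UTC+1) is 03:00 UTC, which is 00:00 BRT (UTC-3). The sketch below checks this with java.time. The zone IDs Europe/Berlin and America/Sao_Paulo are assumptions standing in for "CET" and "BRT", and the date is only an example Sunday.

```kotlin
import java.time.DayOfWeek
import java.time.Instant
import java.time.LocalDateTime
import java.time.ZoneId
import java.time.ZonedDateTime

// Assumed zone IDs: "CET" as Europe/Berlin, "BRT" as America/Sao_Paulo.
val CENTRAL_EUROPE: ZoneId = ZoneId.of("Europe/Berlin")
val BRASILIA: ZoneId = ZoneId.of("America/Sao_Paulo")

// Converts a wall-clock time in Central Europe to the same instant in Brasilia.
fun toBrasilia(centralEuropeanTime: LocalDateTime): ZonedDateTime =
    centralEuropeanTime.atZone(CENTRAL_EUROPE).withZoneSameInstant(BRASILIA)

// True when the instant falls on a Sunday in Brasilia local time.
fun isBrasiliaSunday(instant: Instant): Boolean =
    instant.atZone(BRASILIA).dayOfWeek == DayOfWeek.SUNDAY

fun main() {
    // 2025-01-12 is a Sunday, picked only as an example date.
    val start = toBrasilia(LocalDateTime.of(2025, 1, 12, 4, 0))
    val end = toBrasilia(LocalDateTime.of(2025, 1, 13, 4, 0))
    println("$start -> $end")
}
```

During European summer time the offset between the two zones differs by one hour, so it may be worth checking whether the incident window shifts with it.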
We have spent a lot of time investigating it together on the Kordex Discord (warning, huge thread).
We would like to check with Kord whether there is something you can see that we are missing.
Basically, this last Sunday, we grabbed a 10-minute CPU profile while the issue was happening. This morning, I grabbed another 10-minute profile during normal behavior.
Comparing both, we noticed that during the issue the bot spent 25% of its CPU time in the DefaultGatewayEventInterceptor.handle and GuildEventHandler.handle methods.
When doing a 5-minute profile with a heavy extension of our bot disabled, this number grew to 42% of CPU time used by these methods.
During normal behavior, these methods take only 6% of the CPU time.
For a visual, I am attaching what it looked like over the past 2 Sundays. This is how it has looked every Sunday for the past couple of months.
Interaction Latency means how long ago the command was created, measured at the time the bot handled it.
Because of the spike in how long the bot takes to handle commands, a huge percentage of our users are affected: it takes more than 3s to react, so Discord returns "the application did not respond".
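One way such a latency figure could be computed is from the interaction's snowflake ID, since Discord snowflakes embed their creation time in milliseconds. This is a minimal sketch under that assumption, not necessarily how the metric in the graphs is produced; the function names are made up for illustration.

```kotlin
// Discord epoch: the first millisecond of 2015 (UTC).
const val DISCORD_EPOCH_MILLIS = 1420070400000L

// The initial response to an interaction is due within 3 seconds.
const val ACK_DEADLINE_MILLIS = 3_000L

// The upper bits of a snowflake (everything above the low 22 bits) hold
// the milliseconds elapsed since the Discord epoch.
fun snowflakeToEpochMillis(snowflake: Long): Long =
    (snowflake ushr 22) + DISCORD_EPOCH_MILLIS

// Latency as described above: handling time minus creation time.
fun interactionLatencyMillis(interactionId: Long, handledAtMillis: Long): Long =
    handledAtMillis - snowflakeToEpochMillis(interactionId)

// True when the bot picked the interaction up too late to acknowledge it.
fun missesAckDeadline(interactionId: Long, handledAtMillis: Long): Boolean =
    interactionLatencyMillis(interactionId, handledAtMillis) > ACK_DEADLINE_MILLIS

fun main() {
    val exampleId = 175928847299117063L // an example snowflake ID
    val createdAt = snowflakeToEpochMillis(exampleId)
    // Pretend the bot handled the interaction 3.5 seconds after creation.
    println(missesAckDeadline(exampleId, handledAtMillis = createdAt + 3_500))
}
```

A measurement like this compares Discord's clock with the bot host's clock, so any clock skew between the two shows up in the number.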
We have made several improvements to the bot, and there have been some Kordex releases as well, but we are out of ideas about what it could be.
Perhaps some of you might have some more insights or ideas for us to try.
11 Replies
Might be worth noting some of our suspects and what we have changed.
The first suspect was a couple of servers our bot was added to that had crypto bot users in them. Every Sunday, these bots would just go crazy with MemberUpdateEvents. Like, hundreds per minute. We assumed they were spending the whole day updating their own names/profiles with information about cryptocurrency values.
Our bot has a use case where we consume MemberUpdateEvents. Two major changes have happened since this was found:
- we worked with Kordex to improve how we filter events, and we now drop all events coming from bot users before they reach any handler
- we fixed a bug in our handler that let all remaining (real user) MemberUpdateEvents through, even though only those from 1 specific server were wanted
Even with these two improvements, the bot still hangs.
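For anyone following along, the two filters described above amount to something like the sketch below. It uses plain Kotlin with a hypothetical MemberUpdate class and a hypothetical TARGET_GUILD_ID; it is not the Kord or Kord Extensions API, just the shape of the predicate.

```kotlin
// Hypothetical, simplified stand-in for a member update event; not a Kord type.
data class MemberUpdate(val guildId: Long, val userId: Long, val isBot: Boolean)

// Hypothetical ID of the one server whose updates the handler cares about.
const val TARGET_GUILD_ID = 1L

// Drop events from bot users first, then anything outside the target guild.
fun shouldHandle(event: MemberUpdate): Boolean =
    !event.isBot && event.guildId == TARGET_GUILD_ID

fun main() {
    val incoming = listOf(
        MemberUpdate(guildId = 1L, userId = 10L, isBot = false), // kept
        MemberUpdate(guildId = 1L, userId = 11L, isBot = true),  // dropped: bot user
        MemberUpdate(guildId = 2L, userId = 12L, isBot = false), // dropped: other guild
    )
    println(incoming.filter(::shouldHandle))
}
```

A predicate like this only saves handler work. The event has already been received and deserialized by the time it runs, which would be consistent with the gateway-side handle methods still showing up in the profile.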
However, we are now struggling to find event-related inputs that could overwhelm the bot, and are looking for other possible bottlenecks.
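One possible way to look for such inputs is to count incoming events per type per minute and compare a Sunday against a normal day. The sketch below is only an illustration in plain Kotlin: ReceivedEvent is a hypothetical record that the bot would have to log itself, not something Kord provides.

```kotlin
// Hypothetical record of one received event: its type name and arrival time.
data class ReceivedEvent(val type: String, val epochMillis: Long)

// Counts events per (minute bucket, type) so that bursts stand out.
fun countPerMinute(events: List<ReceivedEvent>): Map<Pair<Long, String>, Int> =
    events.groupingBy { (it.epochMillis / 60_000L) to it.type }.eachCount()

// Returns the busiest (minute, type) buckets first.
fun busiest(events: List<ReceivedEvent>, limit: Int): List<Pair<Pair<Long, String>, Int>> =
    countPerMinute(events).toList().sortedByDescending { it.second }.take(limit)

fun main() {
    val sample = listOf(
        ReceivedEvent("PresenceUpdate", 1_000L),
        ReceivedEvent("PresenceUpdate", 2_000L),
        ReceivedEvent("MemberUpdate", 3_000L),
        ReceivedEvent("PresenceUpdate", 61_000L),
    )
    busiest(sample, limit = 2).forEach { (bucket, count) ->
        println("minute=${bucket.first} type=${bucket.second} count=$count")
    }
}
```

If one event type dominates the Sunday buckets but not the weekday ones, that would narrow down which intent or handler to look at next.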
@Tschis have you checked the follow-up on your previous thread?
https://discord.com/channels/556525343595298817/1325146059575660565/1327421428374569010
This is a separate issue. That thread is about our gateway service, and this one is about our actual bot process (they run in separate containers)
Our bot does not suffer the same issue with the EOF exceptions
I have tried filtering out the events that could be spamming us, e.g. MemberUpdateEvents, dropping them directly instead of handling them, but it had no effect
Unfortunately, the way that the Intents work is a bit complicated
I cannot ask for just MemberJoinEvents; I have to request the GuildCreate intent, which comes with more than I want
Not to mention some very weird behavior: the Intent.GuildPresence intent, which should only give me PresenceUpdateEvents, changes how GuildCreate events work, even though those come from a completely different intent
We basically do not need PresenceUpdateEvents, but we request the intent because it makes starting up the bot faster: we need member information from all guilds, and with the presence intent the initial GuildCreate event contains the data for all members
So when the bot starts, we either have to request member data for 16k+ guilds, or we request GuildPresence intent and get all of that directly
But then we get spammed by "undesired" PresenceUpdateEvents
@Tschis Hello, I wasn't available most of the day; let me brainstorm this with you
when you say filtering, could you elaborate on how you are doing it?
they're using Kord Extensions' ability to filter out Kord events from its events flow using a predicate
I read your thread from the beginning and I reduced it to the following: You need presence update to be a one-time event specifically to get the data of all members
Yes, we need the presence intent to get the member data on GuildCreate events when the bot starts up
However, so far we have not concluded that this is the root cause of our issues, as we also have the intent throughout the whole week and the problem only occurs on Sundays
I do think half of your problem is expecting there to be a single, clear root cause
Sometimes you just need to work on optimising things that can obviously be optimised
I expect that because of the conditions under which the problem appears: it lasts exactly 24h, from 04:00 CET on Sunday to 04:00 CET on Monday
The time consistency would also lead me to think that