Unresponsive deployment after some hours
Hello, I've been using Railway to host my Telegram bot for more than a year and I never experienced this sort of issue until I enabled app sleeping a few days ago.
With that option enabled, the bot would just go to sleep without ever being able to wake (but that's fine, given I didn't take into account that a Telegram bot only polls for updates).
So, when I noticed that, I disabled app sleeping and let the bot redeploy with the new configuration, but over the next few days the bot kept becoming unresponsive after a few hours. I tried to fix it by redeploying, but that was always a temporary fix and now it seems to be stuck again (with no issues logged).
Can you please check if there is anything wrong with the configuration or the deployment of my project?
I'd be happy to share any other information needed.
Also, I'd like to have the project back to work as soon as possible but I can keep it broken until the issue is triaged, if needed.
The project is f6d25e17-9bb2-457f-8ce2-e55b5ce1dcd8
The service is 8c4e4e87-7cf3-4ab2-9ec2-b6cc41db7b5b
The deployment is 81cc519b-a7a3-41db-8266-53e08952e935
EDIT: the deploy was triggered yesterday at 2:16 PM CEST (GMT+2) and it was working properly at least until today at 1:15 AM CEST (GMT+2)
28 Replies
Project ID:
f6d25e17-9bb2-457f-8ce2-e55b5ce1dcd8,8c4e4e87-7cf3-4ab2-9ec2-b6cc41db7b5b,81cc519b-a7a3-41db-8266-53e08952e935
So it sounds like it's still going to sleep? does that sound right to you?
the immediate solution would be to deploy your bot into another service and leave the bugged service alone for now
side note, I'd be curious to see how a telegram bot that uses webhooks instead of polling would work with app sleeping
"So it sounds like it's still going to sleep?"
It feels like it, but I'm having a hard time figuring out what could be causing it (especially given there were little to no recent changes to the bot's logic)
"the immediate solution would be to deploy your bot into another service and leave the bugged service alone for now"
Thanks, I hadn't thought about that! I've now deployed the service on my test environment while leaving production alone. As I suspected, I got no errors from Telegram (which should complain when 2 instances of the same bot are running concurrently), so it really feels dead. Thanks as always @Brody, should I ping again in a few days to see if the issue can be looked at thoroughly?
if you dont hear back from me by Tuesday please ping, as i plan on bringing this up to the team, and in that case it would be helpful to leave the suspected bugged service untouched if possible
hey rob, the applicable person would be off until tmr
@Brody I've got bad news (totally on me): I left automatic deploys enabled and a pull request was automerged overnight, so the faulty deployment is now gone.
For the time being you can ignore this issue, I will ping you if it happens again
okay sounds good!
Hi @Brody, sorry to bother you once more but it finally happened again. The stuck deployment is e90da47 (service 8c4e4e87-7cf3-4ab2-9ec2-b6cc41db7b5b).
Unfortunately I can't disable the connection with my main branch because that would require a redeploy, but I will do my best to stop merging code until the issue is looked at (the only risk would be an automerge by renovate during the night)
so the bot is unresponsive?
It wasn't until I deployed it on my test environment (with prod's token). It has been stuck for about 3 and a half hours now
what makes you think your deployment is being put to sleep instead of soft locking or something similar?
Speaking for this deployment only, as I can't really recall the older ones: it was a freshly deployed instance (5 hours old), it wasn't consuming many resources, and I recently limited the concurrency to at most 10 requests at a time
The bot itself has never had issues that made it stop working without any sign; I would expect at least a stack trace, but there was none
You can see in the image the point in time when it stopped working, while processing 10 JPEG -> PNG conversions
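(Note for readers: the thread does not say how the concurrency limit is implemented or what language the bot uses. The following is only a minimal Python sketch, assuming an asyncio-based bot, of a limit of 10 concurrent jobs where every slot acquisition, release, and timeout is logged. `run_limited`, `MAX_CONCURRENT_JOBS` and `JOB_TIMEOUT_SECONDS` are hypothetical names, and the 60 second timeout is an arbitrary assumption. With logs like these, a situation where all 10 slots are taken and none is ever released would be visible.)

```python
import asyncio
import logging

logger = logging.getLogger("bot.jobs")

MAX_CONCURRENT_JOBS = 10      # the limit mentioned in the thread
JOB_TIMEOUT_SECONDS = 60.0    # assumption: pick a value that fits your jobs


async def run_limited(semaphore, job_id, job, timeout=JOB_TIMEOUT_SECONDS):
    """Run `job()` (a zero-argument coroutine function) under the semaphore.

    Returns the job's result, or None if the job timed out.
    """
    logger.debug("job %s is waiting for a free slot", job_id)
    async with semaphore:
        logger.debug("job %s started", job_id)
        try:
            # wait_for cancels the job if it does not finish in time
            return await asyncio.wait_for(job(), timeout)
        except asyncio.TimeoutError:
            logger.error("job %s exceeded %.1fs and was cancelled", job_id, timeout)
            return None
        finally:
            logger.debug("job %s released its slot", job_id)
```

Note that the timeout can only fire if the job actually yields to the event loop. A CPU-bound conversion that runs directly inside a coroutine blocks the whole loop, so in that case the work would have to be moved to a thread or a subprocess for the timeout to have any effect.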
you suspect it got slept at 3:12 pm?
Somewhere after that, I can only say that's the latest log I had proving the bot was online
and is it currently "sleeping" or have you since done a redeploy
It is still sleeping at this moment. I left the prod deployment in place and enabled the test one with the same token, and it doesn't trigger Telegram's error saying that only one instance of a bot can run concurrently
Also, I had never before seen its memory decrease in such a "slow" curve as it did after 3:12 PM (CET).
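(Note for readers: the duplicate-instance check described above relies on the Telegram Bot API answering a `getUpdates` call with `"ok": false` and `error_code` 409 when another instance is polling with the same token. Below is a minimal Python sketch of recognising that response once it has been parsed from JSON. `is_polling_conflict` is a hypothetical helper, and the sample payloads are illustrative, not captured from this bot.)

```python
def is_polling_conflict(payload: dict) -> bool:
    """Return True if a parsed Bot API response reports a getUpdates conflict."""
    return payload.get("ok") is False and payload.get("error_code") == 409


# Illustrative payloads (assumed shape: "ok", "error_code", "description")
conflict = {
    "ok": False,
    "error_code": 409,
    "description": "Conflict: terminated by other getUpdates request",
}
healthy = {"ok": True, "result": []}

print(is_polling_conflict(conflict))  # True
print(is_polling_conflict(healthy))   # False
```

The absence of a 409 only shows that the other instance is no longer polling. It cannot tell a slept container apart from a process that is still running but stuck.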
so heres the thing, if it was sleeping the memory reporting would have frozen
yet memory reporting continued and did differ
now i know i said i planned on bringing this up to the team, but im going to hold off on that for now since this is not looking like a platform issue
That's fine and I get your point of view. I need to find a way to figure it out, because as of now I wouldn't know how to triage it
is that service on the v2 runtime at least?
New builder and runtime v2
then i think the next course of action would be to add some very verbose logging so you can try to determine when and where your app is softlocking, and then hopefully why
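(Note for readers: one possible way to get this kind of logging, sketched in Python under the assumption that the bot has a main polling loop that can call a function on every iteration. A background thread logs a heartbeat and, if the loop has not checked in for a while, dumps the stack of every thread with the standard `faulthandler` module, which shows where the process is stuck. `Watchdog`, `beat` and the thresholds are hypothetical names and values, not something from the thread.)

```python
import faulthandler
import logging
import sys
import threading
import time

logger = logging.getLogger("bot.watchdog")


class Watchdog:
    def __init__(self, stall_after: float = 120.0):
        self.stall_after = stall_after          # seconds without a beat = stalled
        self._last_beat = time.monotonic()
        self._lock = threading.Lock()

    def beat(self) -> None:
        """Call this on every iteration of the polling loop."""
        with self._lock:
            self._last_beat = time.monotonic()

    def seconds_since_beat(self) -> float:
        with self._lock:
            return time.monotonic() - self._last_beat

    def is_stalled(self) -> bool:
        return self.seconds_since_beat() > self.stall_after

    def _run(self, interval: float) -> None:
        while True:
            time.sleep(interval)
            idle = self.seconds_since_beat()
            logger.info("heartbeat: %.0fs since the polling loop last checked in", idle)
            if idle > self.stall_after:
                logger.error("polling loop looks stalled, dumping all thread stacks")
                faulthandler.dump_traceback(file=sys.stderr, all_threads=True)

    def start(self, interval: float = 30.0) -> threading.Thread:
        """Start the monitor in a daemon thread so it never blocks shutdown."""
        thread = threading.Thread(target=self._run, args=(interval,), daemon=True)
        thread.start()
        return thread
```

Calling `beat()` once per polling iteration, rather than once per received update, avoids false alarms when the bot is simply idle. Because the monitor runs in its own thread, it keeps logging even if an asyncio event loop in the main thread is blocked.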
I will see what I can come up with, thanks as always 😄
May I delete the stuck instance? I'm in no rush in that regard
yep!
and if you still think this is railway sleeping your service, catch and log sigterm as thats the signal sent when your container is stopped by railway for any reason
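(Note for readers: a minimal Python sketch of catching and logging SIGTERM as suggested here, using the standard `signal` module. `install_sigterm_logging` and `stop_requested` are hypothetical names. The handler only logs and sets a flag, so the main loop is expected to check the flag and shut down on its own.)

```python
import logging
import signal
import threading

logger = logging.getLogger("bot.lifecycle")

# Set when SIGTERM arrives; the main loop can poll this to exit cleanly.
stop_requested = threading.Event()


def _on_sigterm(signum, frame):
    # Log the signal name so a platform-initiated stop is visible in the logs
    logger.warning("received %s, the container is being stopped",
                   signal.Signals(signum).name)
    stop_requested.set()


def install_sigterm_logging() -> None:
    """Register the handler. Must be called from the main thread."""
    signal.signal(signal.SIGTERM, _on_sigterm)
```

If the bot goes quiet and this message never shows up in the logs, the process was most likely not stopped by the platform and is stuck for some other reason.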
How confident are you that the memory of a sleeping service would stay frozen until it's woken?
unless they have since fixed that (doubt it) then that would still be the case
yep, metrics are not seeded with zeros during sleep so they would appear as frozen if there were metrics to begin with, but this service has been slept for a very long time so there are not enough awake metrics to show either
How long would it take for those services to go to sleep? I don't think that's my case because I guess I would notice it in the UI
10 - 15 minutes maybe
but yeah i think its very safe to say that your deployment was not slept
Perfect, that doesn't match how long my deployment stayed alive