Railway•6mo ago

Unresponsive deployment after some hours

Hello, I've been using Railway to host my Telegram bot for more than a year and I never experienced this sort of issue until I enabled app sleeping a few days ago. With that option enabled, the bot would just go to sleep without ever being able to wake (but that's fine, given I didn't take into account that a Telegram bot only polls for updates).

So, when I noticed that, I disabled app sleeping and allowed the bot to redeploy with the new configuration, but over the next few days the bot kept becoming unresponsive after a few hours. I tried to fix it by redeploying, but that was always a temporary fix, and now it seems to be stuck again (with no issues logged).

Can you please check if there is anything wrong with the configuration or the deployment of my project? I'd be happy to share any other information needed. Also, I'd like to have the project working again as soon as possible, but I can keep it broken until the issue is triaged, if needed.

The project is f6d25e17-9bb2-457f-8ce2-e55b5ce1dcd8
The service is 8c4e4e87-7cf3-4ab2-9ec2-b6cc41db7b5b
The deployment is 81cc519b-a7a3-41db-8266-53e08952e935

EDIT: the deploy was triggered yesterday at 2:16 PM CET (GMT+2) and it was working properly at least until today at 1:15 AM CET (GMT+2)

28 Replies

Percy•6mo ago

Project ID: f6d25e17-9bb2-457f-8ce2-e55b5ce1dcd8,8c4e4e87-7cf3-4ab2-9ec2-b6cc41db7b5b,81cc519b-a7a3-41db-8266-53e08952e935

Brody•6mo ago

So it sounds like it's still going to sleep? does that sound right to you?

the immediate solution would be to deploy your bot into another service and leave the bugged service alone for now

side note, I'd be curious to see how a telegram bot that uses webhooks instead of polling would work with app sleeping

robOP•6mo ago

So it sounds like it's still going to sleep?

It feels like it, but I'm having a hard time figuring out what could be causing it (also given there were little to no changes recently to the bot's logic)

the immediate solution would be to deploy your bot into another service and leave the bugged service alone for now

Thanks, I didn't think about that! I've now deployed the service on my test environment while leaving production alone. As I suspected, I've got no errors from Telegram (which should complain when 2 instances of the same bot are running concurrently), so it really feels dead.

Thanks as always @Brody, should I ping again in a few days to see if the issue can be looked into thoroughly?

Brody•6mo ago

if you dont hear back from me by Tuesday please ping, as i plan on bringing this up to the team, and in that case it would be helpful to leave the suspected bugged service untouched if possible

hey rob, the applicable person would be off until tmr

robOP•6mo ago

@Brody I've got bad news (totally on me): I left automatic deploys enabled and a pull request was automerged during the night, so the faulty deployment is now gone. For the time being you can ignore this issue, I will ping you if it happens again

Brody•6mo ago

okay sounds good!

robOP•5mo ago

Hi @Brody, sorry to bother you once more but it finally happened again. The stuck deployment is e90da47 (service 8c4e4e87-7cf3-4ab2-9ec2-b6cc41db7b5b). Unfortunately I can't disable the connection with my main branch because that would require a redeploy, but I will do my best to stop merging code until the issue is looked at (the only risk would be an automerge by renovate during nighttime)

Brody•5mo ago

so the bot is unresponsive?

robOP•5mo ago

It wasn't until I deployed it on my test environment (with prod's token). It has been stuck for about 3 and a half hours now

Brody•5mo ago

what makes you think your deployment is being put to sleep instead of soft locking or something similar?

robOP•5mo ago

Speaking for this deployment only, as I can't really recall the older ones: it was a freshly deployed instance (5 hours old), it wasn't consuming that many resources, and lately I limited the concurrency to max 10 requests at a time

The bot itself has never had issues that made it stop working without any sign. I would expect at least a stack trace, but I had none

You can see in the image the point in time when it stopped working, while processing 10 JPEG -> PNG conversions

Brody•5mo ago

you suspect it got slept at 3:12 pm?

robOP•5mo ago

Sometime after that. I can only say that's the latest log I had proving the bot was online

Brody•5mo ago

and is it currently "sleeping" or have you since done a redeploy

robOP•5mo ago

It is still sleeping at this moment. I left the prod deployment there and enabled the test one with the same token, and it doesn't result in Telegram's error saying that only one instance of a bot can run concurrently

Also, I never saw its memory decrease in such a "slow" curve until now, as it did after 3:12 PM (CET).

Brody•5mo ago

so heres the thing, if it was sleeping the memory reporting would have frozen, yet memory reporting continued and did differ

now i know i said i planned on bringing this up to the team, but im going to hold off on that for now since this is not looking like a platform issue

robOP•5mo ago

That would be fine and I get your point of view. I need to find a way to figure it out, because as of now I wouldn't know how to triage it

Brody•5mo ago

is that service on the v2 runtime at least?

robOP•5mo ago

New builder and runtime v2

Brody•5mo ago

then i think the next course of action would be to add some very verbose logging so you can try to determine when and where your app is softlocking, and then hopefully why
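One way the verbose logging suggested above could look, as a minimal sketch. The thread never says which language the bot is written in, so Python is an assumption here, and `traced`, `stuck_steps` and `start_heartbeat` are hypothetical helper names, not part of any bot library. The idea is to log the start and end of every step and to have a background thread report steps that have been running suspiciously long.

```python
import logging
import threading
import time
from contextlib import contextmanager

logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s %(levelname)s %(threadName)s %(message)s",
)
log = logging.getLogger("bot")

# Hypothetical registry of steps currently in flight: id -> (name, start time).
_in_flight = {}
_lock = threading.Lock()
_next_id = 0


@contextmanager
def traced(step):
    """Log when a step starts and finishes, and how long it took."""
    global _next_id
    with _lock:
        _next_id += 1
        step_id = _next_id
        _in_flight[step_id] = (step, time.monotonic())
    log.debug("start %s #%d", step, step_id)
    try:
        yield
    finally:
        with _lock:
            _, started = _in_flight.pop(step_id)
        log.debug("end %s #%d after %.3fs", step, step_id, time.monotonic() - started)


def stuck_steps(threshold_s):
    """Return (name, age in seconds) for steps running longer than threshold_s."""
    now = time.monotonic()
    with _lock:
        return [(name, now - t0) for name, t0 in _in_flight.values() if now - t0 > threshold_s]


def start_heartbeat(interval_s=30.0, threshold_s=60.0):
    """Log a heartbeat from a daemon thread, flagging long-running steps."""
    def beat():
        while True:
            stuck = stuck_steps(threshold_s)
            if stuck:
                log.warning("heartbeat: possibly stuck: %s", stuck)
            else:
                log.info("heartbeat: ok")
            time.sleep(interval_s)

    thread = threading.Thread(target=beat, name="heartbeat", daemon=True)
    thread.start()
    return thread
```

Wrapping each conversion in `with traced("jpeg_to_png"):` and calling `start_heartbeat()` once at startup would leave a trail in the logs: a "start" without a matching "end" points at where the app is stuck, and a heartbeat that stops entirely suggests the whole process is stopped or frozen rather than a single task.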

robOP•5mo ago

I will see what I can come up with, thanks as always 😄

May I delete the stuck instance? I'm in no rush in that regard

Brody•5mo ago

yep! and if you still think this is railway sleeping your service, catch and log sigterm as thats the signal sent when your container is stopped by railway for any reason
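A minimal sketch of catching and logging SIGTERM, assuming a Python bot (the thread does not name the language) and a hypothetical handler name `handle_sigterm`. It only uses the standard `signal` and `logging` modules.

```python
import logging
import signal
import sys

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("bot")


def handle_sigterm(signum, frame):
    """Record that the process was asked to stop, then exit cleanly."""
    log.warning("received %s, shutting down", signal.Signals(signum).name)
    sys.exit(0)


# Must be registered from the main thread, before the bot starts polling.
signal.signal(signal.SIGTERM, handle_sigterm)
```

If the container is stopped, the warning line should show up as the last entry in the deployment logs. If the bot goes quiet and that line never appears, the process was most likely not stopped from outside.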

robOP•5mo ago

How confident do you feel saying that the memory of a sleeping service will be frozen until awoken?

Brody•5mo ago

unless they have since fixed that (doubt it) then that would still be the case

Brody•5mo ago

yep, metrics are not seeded with zeros during sleep so they would appear as frozen if there were metrics to begin with, but this service has been slept for a very long time so there are not enough awake metrics to show either

robOP•5mo ago

How long would it take for those services to go to sleep? I don't think that's my case because I guess I would notice it in the UI

Brody•5mo ago

10 - 15 minutes maybe

but yeah i think its very safe to say that your deployment was not slept

robOP•5mo ago

Perfect, that doesn't match how long my deployment stayed alive
