Diagnosing sporadic timeout
What's the best way to diagnose sporadic worker timeouts? Roughly 1 out of every 20 requests times out, and there's no clear pattern to the cause. Re-running the same request afterwards usually works fine, so it doesn't look like an edge-case input breaking things.
Project ID: fe275cf6-71f5-416b-935f-a5acd5e181fb
At a glance, this maps really clearly to traffic volume. More users = more timeout incidents
gunicorn?
Yessir
this article is a great resource for the issue i think you are running into
https://medium.com/building-the-system/gunicorn-3-means-of-concurrency-efbb547674b7
you will also want to upgrade to the dev plan to follow along with that article
but either way, take your time and read through that whole article before you make any changes
Appreciate it.
@Brody digested the article and upgraded to the dev plan. Mind if I run my proposed solution past you?
absolutely, hit me
- My app is I/O bound (e.g. the majority of the bottleneck is waiting for OpenAI + Azure API requests to come back)
- I often have a high number of concurrent users
- Therefore I should increase the number of workers
- Since the formula for workers is (2*CPU)+1 and I now have 8 CPU, the appropriate number of workers is (2*8)+1 = 17
- I can do that by updating my procfile to
web: gunicorn --timeout=600 --workers=17 app:app
i was thinking you could use the gevent worker class
maybe
web: gunicorn --worker-class=gevent --worker-connections=100 --workers=17 app:app
Feeling a bit dumb - but does the fact that I have multiple parallel requests, each taking a long time, imply that I need parallelism rather than concurrency?
workers are for parallelism
Laying it out:
- Each request is strictly linear. And most of it is waiting for APIs to reply
- At any point, I'll have a bunch of requests running (since each can take up to 15 seconds)
- Therefore I want parallelism rather than concurrency
correct
so try my proposed start command
cool
(you will need to add gevent to your requirements.txt file)
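something like this is all requirements.txt needs (the package names are just a guess based on your setup - list whatever your app actually imports):
flask        # assumed web framework, based on the app:app entrypoint
gunicorn
gevent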
also
you talk about having a high number of users
so its starting to sound like you might need to be on the teams plan, not for extra memory or cpu, but because it seems like you are hosting a commercial product on railway
ah - i was not aware the dev plan didn't work for commercial products
dev plan is for hobby use cases
Happy to upgrade. It's arguably somewhere between hobby and a real product, but it does have about $300 worth of paid users.
then that would for sure require you to be on the teams plan
https://docs.railway.app/reference/plans#developer-plan-offering
"The Developer Plan is meant for serious hobbyist use-cases. We do urge companies to upgrade to the Teams plan for a commercial license to use Railway."
you might not be operating a company but you are collecting income generated from an app thats hosted on railway
heres the ceo of railway telling someone the same thing https://discord.com/channels/713503345364697088/1100082877980479528/1100182836079771669
Is that a strict requirement?
Actual project revenue is so low that this makes a meaningful difference in terms of hosting on railway vs an alternative.
its not super strict, but for what you have said, i think jake and the team might prefer you to be on the teams plan
but im just the messenger, i cant enforce it
It's just me - part time, generating < 300/mo and spending most of it on OpenAI bills 😬
like i said, im just the messenger, i will let the team make the final decision
would you like me to tag in a team member to give you the verdict?
Generally the policy is that as long as you're making enough to pay for the teams plan, you should be upgrading. $300 a month with the majority going to OpenAI? There's some room in there for Railway.
The main perk of upgrading is direct support from the Team, which is pretty helpful for larger scale dev and growing your business
And of course higher limits on your services
Hey @DM - the upgrade path is about support expectations. The timeouts you're describing are likely not with Railway. The issue we run into is that users on the Dev plan who are working with customers yell at us for not meeting support SLOs when they're on the wrong tier, and then we can't tell who needs what.
Upgrading lets us know that your application needs a tad more care.
If you are cool with Brody's jokes for your app and understand that the team can't reply to you asap, you are fine on the Dev plan. Else, we urge you to upgrade. Railway isn't the cheapest hosting platform out there, but we hope the time we save you helps shift the calculus in our favor.
my jokes???
You are a funny guy, but $20 per user per month for Angelo jokes is an upgrade.
oh yeah I see what you did now
Upgraded because paying more to you guys beats giving money to Salesforce. And because if it wasn't for Brody, I never would have known what concurrency is.
On a related note - I'm still getting worker timeouts even after implementing the above fix to procfile
are you getting these timeouts while you have users using the service?
Will see if the upgrade to team plan fixes it.
What do you mean?
Yeah, it's all via active usage
is this your current start command?
That's correct.
ohmygod
I installed gevent but didn't add it to requirements.txt
Sorry. Long day. Data analyst here pretending I'm a dev
you added gevent into the start command but didnt add it to your requirements.txt file?
(yup ... fixing now)
Redeployed. Wish me luck.
i wish you luck!
@Brody 1 hour and 100 queries later no errors (woop woop)
im happy to hear that!!!
you have access to 32 vCPUs now, so you could do 64 workers!!
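the start command would then look something like this (same command as before, just with the worker count bumped):
web: gunicorn --worker-class=gevent --worker-connections=100 --workers=64 app:app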
Dumb question - but what does it mean when I see the same line (in logs) being repeated twice? Haven't seen it happen before and wonder what the implication is.
workers are completely separate instances of your app, so it would just be two instances of your app that happened to print the same thing, shouldnt be cause for concern
@Brody any advice on what I could be doing wrong? Still seeing a ton of worker timeouts 😦
I also realized my project might still be under my personal account, rather than the team account.
hey please dont ping the team (angelo)
Ack.
Apologies for breaking protocol
have you redeployed your project since upgrading to the teams plan?
I have.
Deploy looked clean? I can also increase the number of workers - but surprised this is not enough. My traffic isn't that heavy.
then your service should be running with the new increased resource limits, what makes you think it isnt?
I'm not sure. It's listed under me rather than under the team.
drag it into that area
Mostly I'm just freaking out because the app is breaking for customers.
Doesn't seem to work
wdym doesnt work?
dragging
you need to drag from the drag point, the little dots in the top right of the project card
❤️
My CPU metrics are low - so I have a feeling it's still something with my config (rather than the resources)
What's the best way to diagnose deploy?
Wonder if I messed up the config / environment somehow
okay so say if you only had one user, what would the response time for the requests look like?
30 seconds? 90 seconds?
what time range are we looking at
give me best and worst case scenario for when only one single user is using the service
Sorry for the lag.
There's 3 formats of requests:
- 1x OpenAI api call (3-10 seconds)
- 4x OpenAI call (10-25 seconds)
- 1x OpenAI + 1x Azure OCR call (10-20 seconds)
The worst case is one of the API requests times out. My traffic is way under the rate limit, but sometimes they get overloaded.
so from the data you just gave me, the longest response time that you could expect from your app would be 25 seconds?
Assuming neither API I depend on fails.
Generally speaking - when behaving normally they all work in ~20 seconds max.
A number of the timeouts are coming from the simpler (3-10 second) requests
Anecdotally - external APIs were only responsible for 20%ish of timeouts when I was on Heroku
well gunicorn's default request timeout is 30 seconds, so if your app doesn't respond within 30 seconds you see that timeout message - something you are doing is sporadically taking longer than 30 seconds
I'd just like to be upfront in saying this, but this would not be railway's fault
railway has no timeout, these timeouts would be inefficiencies in your code
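(to be clear, the --timeout=600 from your original command would also push gunicorn's timeout past 30 seconds, but that's only a stopgap that hides the symptom rather than fixing the root cause - for reference it would look like this with the gevent setup from earlier:)
web: gunicorn --worker-class=gevent --worker-connections=100 --timeout=600 --workers=64 app:app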
I 100% don't think it's yalls fault.
Transparently taking advantage of your help / expertise to help me diagnose what the issue is. Do you have any pointers on how I would tackle diagnosing the root cause?
add time logging to critical points throughout your request handler functions
log the time it takes for your app to call open-ai, azure, etc, and similarly log the entire time it took to complete all processing
👀
if you don't have telemetry, you will have no hope of narrowing down the code that's causing timeouts
The part that confuses me is that I can easily replicate the same requests that trigger issues - but without the error.
E.g. if I find a request that broke and re-run it, it usually works.
I know, makes it really hard to debug, but what makes it super hard to debug is not knowing what caused the timeout when a timeout does occur
Got it.
So you would suggest logging the times for each critical step?
Any suggestions on the package / tool?
yes! add individual time logging for every external request or heavy processing your app does if applicable
I'm not a python developer, so unfortunately I don't have any good recommendations, but if all you need is a log of time since point A to point B in your code, I think the standard library can do that
Cool.
As a last-ditch effort before going the tracking route - would it be worthwhile to try and up the number of parallel workers?
start_time = time.time()
# external api call
end_time = time.time()
print("time for x api call:", end_time - start_time)
that's some pseudo code
not real code
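in python that rough idea could look something like this using only the standard library (call_openai here is just a placeholder for your real API call):
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("timing")

def call_openai(payload):
    # stand-in for your real OpenAI request
    time.sleep(0.1)
    return "openai result"

def handle_request(payload):
    request_start = time.perf_counter()

    t0 = time.perf_counter()
    result = call_openai(payload)
    log.info("openai call took %.2fs", time.perf_counter() - t0)

    # repeat the same pattern around the azure ocr call and any heavy processing

    log.info("total request time: %.2fs", time.perf_counter() - request_start)
    return result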
yeah - i'll sort it out
no it wouldn't, you now have 32 cores and are running 64 workers, this is now just code issues
got it. last dumb question
What's the best way to validate / x-check that the procfile is correct (and save me from more 'forgot to add to requirements.txt' type errors)
um yeah pythons dependency stuff is quite bad, you just have to make sure to add whatever you use in your project to the txt file
that's pretty much the extent of keeping python packages in check with the txt file unfortunately
there is pyproject, look into that after you get these issues sorted out
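for when you get there, a bare-bones pyproject.toml carrying the same dependencies might look roughly like this (name and version are placeholders):
[project]
name = "my-app"
version = "0.1.0"
dependencies = ["flask", "gunicorn", "gevent"]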
@Brody I'm gonna sound like a drunk conspiracy theorist here for a second, but hear me out.
lol okay
Every time there are errors, it's when the memory goes up.
RAM*
The timing of timeouts maps exactly to the moments memory is increasing.
...but each time I redeploy it resets.
Is there a way to manually assign some minimal level of dedicated RAM?
does memory increase during successful requests?
not that i can tell
those 2 bumps line up exactly with the instances of failed requests
this would only make things worse, it will not solve your problems, please don't go down this path, you need to find and fix the root cause
Do you have any sense on the 'why'?
on why your memory usage increases during timed out requests?
Yup. And why assigning dedicated RAM wouldn't make things any better?
why memory increase? honestly no clue, have never seen your code, there's hundreds of possible reasons
why capping memory not good? because it doesn't actually fix the root issues, so if it doesn't fix the root issue you shouldn't do it. it could lead to full service crashes, making things even worse
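if you later want to see where the memory actually goes during a single request, the standard library's tracemalloc gives a rough picture - just a sketch, not a fix:
import tracemalloc

tracemalloc.start()

# ... run one of the slow request handlers here ...

current, peak = tracemalloc.get_traced_memory()
print(f"current: {current / 1e6:.1f} MB, peak: {peak / 1e6:.1f} MB")
tracemalloc.stop()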
Got it.
Were you on the starter plan before you upgraded?
I went starter -> dev -> team yesterday afternoon
If you haven't redeployed this project since you upgraded it will still be limited by starter limits
Try pushing a new commit
they have
The timeouts also line up with CPU usage rising above 0.
thing is 200mb is the starter plan limit, you sure you have?
im confident its running on the teams plans
fair enough
Basically anytime there's a change in the above charts, it's also the same time the timeouts occur.
Unfortunately this is very likely a code issue
😦
How's your logging?
adam, its 512???
starter is 200, trial is 512
oh, my bad
no prob
i have never used starter or trial plans myself, so im a bit lacking in knowledge there
I don't track component times yet. But anecdotally - re-running failed queries usually works.
I also know via API logs that those are a rare (but non-zero) source of failures
we talked about doing that earlier, you should get on that, you will never be able to efficiently debug this without telemetry / logging
Yup.
alright then you know what your next step is