DM

Diagnosing sporadic timeout

What's the best way to diagnose sporadic worker timeout? Probably 1 out of every 20 requests times out - and there's no clear patterns on the cause. Re-running the same request subsequently usually works fine - so it doesn't look like an edge case breaking issue.
114 Replies
Percy
Percy2y ago
Project ID: N/A
DM
DMOP2y ago
fe275cf6-71f5-416b-935f-a5acd5e181fb
At a glance, this maps really clearly to traffic volume. More users = more timeouts
Brody
Brody2y ago
gunicorn?
DM
DMOP2y ago
Yessir
Brody
Brody2y ago
this article is a great resource for the issue i think you are running into: https://medium.com/building-the-system/gunicorn-3-means-of-concurrency-efbb547674b7
you will also want to upgrade to the dev plan to follow along with that article, but either way, take your time and read through the whole article before you make any changes
DM
DMOP2y ago
Appreciate it. @Brody I digested the article and upgraded to the dev plan. Mind if I run my proposed solution past you?
Brody
Brody2y ago
absolutely, hit me
DM
DMOP2y ago
- My app is I/O bound (e.g. the majority of the bottlenecks are waiting for OpenAI + Azure API requests to come back)
- I often have a high number of concurrent users
- Therefore I should increase the number of workers
- Since the formula for workers is (2*CPU)+1 and I now have 8 CPUs, the appropriate number of workers is 15
- I can do that by updating my Procfile to web: gunicorn --timeout=600 --workers=15 app:app
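(For reference, a minimal sketch of that (2*CPU)+1 heuristic using only the Python standard library; note that cpu_count() reports whatever core count the runtime sees, which may not match the plan's vCPU allotment.)

```python
# Rough sketch of gunicorn's (2 x cores) + 1 worker rule of thumb.
import multiprocessing

suggested_workers = 2 * multiprocessing.cpu_count() + 1
print(f"suggested gunicorn workers: {suggested_workers}")
```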
Brody
Brody2y ago
i was thinking you could use the gevent worker class, maybe
web: gunicorn --worker-class=gevent --worker-connections=100 --workers=15 app:app
DM
DMOP2y ago
Feeling a bit dumb - but would the fact that I have multiple parallel requests, each taking a lot of time, imply that I need parallelism rather than concurrency?
Brody
Brody2y ago
workers are for parallelism
DM
DMOP2y ago
Laying it out:
- Each request is strictly linear, and most of it is waiting for APIs to reply
- At any point, I'll have a bunch of requests running (since each can take up to 15 seconds)
- Therefore I want parallelism rather than concurrency
Brody
Brody2y ago
correct, so try my proposed start command
DM
DMOP2y ago
cool
Brody
Brody2y ago
(you will need to add gevent to your requirements.txt file)
also, you talk about having a high number of users, so it's starting to sound like you might need to be on the teams plan - not for extra memory or cpu, but because it seems like you are hosting a commercial product on railway
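(A sketch of the two files involved, assuming a Flask-style app:app entrypoint and a pip requirements.txt; the exact package list is illustrative, not taken from the actual app.)

Procfile:
```
web: gunicorn --worker-class=gevent --worker-connections=100 --workers=15 app:app
```

requirements.txt:
```
flask
gunicorn
gevent
```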
DM
DMOP2y ago
ah - i was not aware the dev plan didn't work for commercial products
Brody
Brody2y ago
dev plan is for hobby use cases
DM
DMOP2y ago
Happy to upgrade. It's arguably somewhere between hobby and a real product, but it does have about $300 worth of paid users.
Brody
Brody2y ago
then that would for sure require you to be on the teams plan https://docs.railway.app/reference/plans#developer-plan-offering
The Developer Plan is meant for serious hobbyist use-cases. We do urge companies to upgrade to the Teams plan for a commercial license to use Railway.
you might not be operating a company, but you are collecting income generated from an app that's hosted on railway
here's the ceo of railway telling someone the same thing: https://discord.com/channels/713503345364697088/1100082877980479528/1100182836079771669
DM
DMOP2y ago
Is that a strict requirement? Actual project revenue is so low that this makes a meaningful difference in terms of hosting on railway vs an alternative.
Brody
Brody2y ago
it's not super strict, but from what you have said, i think jake and the team might prefer you to be on the teams plan
but i'm just the messenger, i can't enforce it
DM
DMOP2y ago
It's just me - part time, generating < $300/mo and spending most of it on OpenAI bills 😬
Brody
Brody2y ago
like i said, i'm just the messenger, i will let the team make the final decision
would you like me to tag in a team member to give you the verdict?
Adam
Adam2y ago
Generally the policy is: as long as you're making enough to pay for the teams plan, you should be upgrading. $300 a month with the majority going to OpenAI? There's some room in there for Railway.
The main perk of upgrading is direct support from the Team, which is pretty helpful for larger-scale dev and growing your business. And of course higher limits on your services.
angelo
angelo2y ago
Hey @DM - the upgrade path is about a support expectation. The timeouts that you are describing are likely not with Railway. The issue we have here is that a user on the Dev plan who is working with customers yells at us for not meeting support SLOs when they are on the wrong tier - that way, we can't tell who needs what. Upgrading lets us know that your application needs a tad more care.
If you are cool with Brody's jokes for your app and understand that the team can't reply to you asap, you are fine on the Dev plan. Otherwise, we urge you to upgrade. Railway isn't the cheapest hosting platform out there, but we hope the time we save you helps shift the calculus in our favor.
Brody
Brody2y ago
my jokes???
angelo
angelo2y ago
You are a funny guy, but $20 per user per month for Angelo jokes is an upgrade.
Brody
Brody2y ago
oh yeah I see what you did now
DM
DMOP2y ago
Upgraded, because paying more to you guys beats giving money to Salesforce. And because if it wasn't for Brody, I never would have known what concurrency is.
On a related note - I'm still getting worker timeouts even after implementing the above fix to the Procfile
Brody
Brody2y ago
are you getting these timeouts while you have users using the service?
DM
DMOP2y ago
Will see if the upgrade to the team plan fixes it. What do you mean?
DM
DMOP2y ago
Yeah, it's all via active usage
Brody
Brody2y ago
is this your current start command?
DM
DMOP2y ago
That's correct.
ohmygod I installed gevent but didn't add it to requirements.txt
Sorry. Long day. Data analyst here pretending I'm a dev
Brody
Brody2y ago
you added gevent into the start command but didn't add it to your requirements.txt file?
DM
DMOP2y ago
(yup ... fixing now)
Redeployed. Wish me luck.
Brody
Brody2y ago
i wish you luck!
DM
DMOP2y ago
@Brody 1 hour and 100 queries later, no errors (woop woop)
Brody
Brody2y ago
i'm happy to hear that!!!
you have access to 32 vCPUs now, so you could do 64 workers!!
DM
DMOP2y ago
Dumb question - but what does it mean when I see the same line (in logs) being repeated twice? Haven't seen it happen before and wonder what the implication is.
Brody
Brody2y ago
workers are completely separate instances of your app, so it would just be two instances of your app that happened to print the same thing, shouldn't be cause for concern
DM
DMOP2y ago
@Brody any advice on what I could be doing wrong? Still seeing a ton of worker timeouts 😦
DM
DMOP2y ago
I also realized my project might still be under my personal account, rather than the team account.
Brody
Brody2y ago
hey, please don't ping the team (angelo)
DM
DMOP2y ago
Ack. Apologies for breaking protocol
Brody
Brody2y ago
have you redeployed your project since upgrading to the teams plan?
DM
DMOP2y ago
I have. Deploy looked clean. I can also increase the number of workers - but I'm surprised this is not enough. My traffic isn't that heavy.
Brody
Brody2y ago
then your service should be running with the new increased resource limits, what makes you think it isn't?
DM
DMOP2y ago
I'm not sure. It's listed under me vs the project.
Brody
Brody2y ago
drag it into that area
DM
DMOP2y ago
Mostly I'm just freaking out because the app is breaking for customers.
Doesn't seem to work
Brody
Brody2y ago
wdym doesn't work?
DM
DMOP2y ago
dragging
Brody
Brody2y ago
you need to drag from the drag point, the little dots in the top right of the project card
DM
DMOP2y ago
❤️ My CPU metrics are low - so I have a feeling it's still something with my config (rather than the resources)
DM
DMOP2y ago
What's the best way to diagnose the deploy? Wonder if I messed up the config / environment somehow
Brody
Brody2y ago
okay, so say you only had one user - what would the response time for the requests look like? 30 seconds? 90 seconds? what time range are we looking at?
give me best and worst case scenarios for when only one single user is using the service
DM
DMOP2y ago
Sorry for the lag. There are 3 formats of requests:
- 1x OpenAI API call (3-10 seconds)
- 4x OpenAI calls (10-25 seconds)
- 1x OpenAI + 1x Azure OCR call (10-20 seconds)
The worst case is one of the API requests times out. My traffic is way under the rate limit, but sometimes they get overloaded.
Brody
Brody2y ago
so from the data you just gave me, the longest response time that you could expect from your app would be 25 seconds?
DM
DMOP2y ago
Assuming neither API I depend on fails. Generally speaking - when behaving normally they all work in ~20 seconds max.
A number of the timeouts are coming from the simpler (3-10 second) requests.
Anecdotally - external APIs were only responsible for 20%ish of timeouts when I was on Heroku
Brody
Brody2y ago
well, gunicorn's default request timeout is 30 seconds, so if your app doesn't respond in 30 seconds, you see that timeout message - so something you are doing is sporadically taking longer than 30 seconds
I'd just like to be upfront in saying this, but this would not be railway's fault - railway has no timeout, these timeouts would be inefficiencies in your code
DM
DMOP2y ago
I 100% don't think it's yalls fault. Transparently taking advantage of your help / expertise to help me diagnose what the issue is. Do you have any pointers on how I would tackle diagnosing the root cause?
Brody
Brody2y ago
add time logging at critical points throughout your request handler functions
log the time it takes for your app to call openai, azure, etc, and similarly log the entire time it takes to complete all processing
DM
DMOP2y ago
👀
Brody
Brody2y ago
if you don't have telemetry, you will have no hope of narrowing down the code that's causing timeouts
DM
DMOP2y ago
The part that confuses me is that I can easily replicate the same requests that trigger issues - but without error. E.g. if I find a request that broke and re-run it, it usually works.
Brody
Brody2y ago
I know, makes it really hard to debug, but what makes it super hard to debug is not knowing what caused the timeout when a timeout does occur
DM
DMOP2y ago
Got it. So you would suggest logging the times for each critical step? Any suggestions on the package / tool?
Brody
Brody2y ago
yes! add individual time logging for every external request or heavy processing your app does, if applicable
I'm not a python developer, so unfortunately I don't have any good recommendations, but if all you need is a log of the time from point A to point B in your code, I think the standard library can do that
DM
DMOP2y ago
Cool. As a last-ditch effort before going the tracking route - would it be worthwhile to try and up the number of parallel workers?
Brody
Brody2y ago
```
startTime = time.Now()
// external api call
endTime = time.Now()
print("time for x api call: ", endTime.Since(startTime))
```
that's some pseudo code, not real code
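(A minimal Python version of that idea, using only the standard library; the usage comment at the bottom is hypothetical, and call_openai is just a stand-in for whatever external request the handler makes.)

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger(__name__)

@contextmanager
def timed(label):
    # Log how long the wrapped block takes, e.g. an OpenAI or Azure request.
    start = time.perf_counter()
    try:
        yield
    finally:
        logger.info("%s took %.2fs", label, time.perf_counter() - start)

# Hypothetical usage inside a request handler:
# with timed("openai call"):
#     result = call_openai(prompt)  # stand-in for the real API request
```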
DM
DMOP2y ago
yeah - i'll sort it out
Brody
Brody2y ago
no it wouldn't, you now have 32 cores and are running 64 workers, this is now just code issues
DM
DMOP2y ago
got it. last dumb question:
What's the best way to validate / x-check that the Procfile is correct (and save me from more 'forgot to add to requirements.txt' type errors)?
Brody
Brody2y ago
um yeah, python's dependency stuff is quite bad, you just have to make sure to add whatever you use in your project to the txt file
that's pretty much the extent of keeping python packages in check with the txt file, unfortunately
there is pyproject, look into that after you get these issues sorted out
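(For later: a minimal pyproject.toml sketch of the kind Brody is pointing at; the project name and dependency list are placeholders, not taken from the actual app.)

```toml
[project]
name = "my-app"
version = "0.1.0"
dependencies = [
    "flask",
    "gunicorn",
    "gevent",
]
```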
DM
DMOP2y ago
@Brody I'm gonna sound like a drunk conspiracy theorist here for a second, but hear me out.
Brody
Brody2y ago
lol okay
DM
DMOP2y ago
Every time there are errors, it's as the memory goes up.
RAM*
DM
DMOP2y ago
The timing of timeouts maps exactly to the moments memory is increasing. ...but each time I redeploy it resets. Is there a way to manually assign some minimal level of dedicated RAM?
Brody
Brody2y ago
does memory increase during successful requests?
DM
DMOP2y ago
not that i can tell - those 2 bumps line up exactly with the instances of failed requests
Brody
Brody2y ago
this would only make things worse, it will not solve your problems, please don't go down this path, you need to find and fix the root cause
DM
DMOP2y ago
Do you have any sense of the 'why'?
Brody
Brody2y ago
on why your memory usage increases during timed out requests?
DM
DMOP2y ago
Yup. And why assigning dedicated RAM wouldn't make things any better?
Brody
Brody2y ago
why does memory increase? honestly no clue, i have never seen your code, there are hundreds of possible reasons
why is capping memory not good? because it doesn't actually fix the root issue, and if it doesn't fix the root issue you shouldn't do it. it could lead to full service crashes, making things even worse
DM
DMOP2y ago
Got it.
Adam
Adam2y ago
Were you on the starter plan before you upgraded?
DM
DMOP2y ago
I went starter -> dev -> team yesterday afternoon
Adam
Adam2y ago
If you haven't redeployed this project since you upgraded, it will still be limited by the starter limits.
Try pushing a new commit
Brody
Brody2y ago
they have
DM
DMOP2y ago
The timeouts also line up with CPU increasing above 0.
Adam
Adam2y ago
thing is, 200MB is the starter plan limit - you sure you have?
Brody
Brody2y ago
i'm confident it's running on the teams plan
Adam
Adam2y ago
fair enough
DM
DMOP2y ago
Basically anytime there's a change in the above charts, it's also the same time the timeouts occur.
Adam
Adam2y ago
Unfortunately this is very likely a code issue
DM
DMOP2y ago
😦
Adam
Adam2y ago
How's your logging?
Brody
Brody2y ago
adam, it's 512???
Adam
Adam2y ago
starter is 200, trial is 512
Brody
Brody2y ago
oh, my bad
Adam
Adam2y ago
no prob
Brody
Brody2y ago
i have never used the starter or trial plans myself, so i'm a bit lacking in knowledge there
DM
DMOP2y ago
I don't track component times yet. But anecdotally - re-running failed queries usually works. I also know via API logs that those are a rare (but non-zero) source of failures
Brody
Brody2y ago
we talked about doing that earlier, you should get on that, you will never be able to efficiently debug this without telemetry / logging
DM
DMOP2y ago
Yup.
Brody
Brody2y ago
alright then you know what your next step is