Diagnosing sporadic timeout
What's the best way to diagnose sporadic worker timeouts? Roughly 1 out of every 20 requests times out, and there's no clear pattern to the cause. Re-running the same request afterwards usually works fine, so it doesn't look like an edge-case input breaking things.
Project ID: fe275cf6-71f5-416b-935f-a5acd5e181fb
At a glance, this maps really clearly to traffic volume. More users = more timeout incidents
gunicorn?
Yessir
this article is a great resource for the issue i think you are running into
https://medium.com/building-the-system/gunicorn-3-means-of-concurrency-efbb547674b7
you will also want to upgrade to the dev plan to follow along with that article
but either way, take your time and read through that whole article before you make any changes
Appreciate it.
@Brody digested the article and upgraded to the dev plan. Mind if I run my proposed solution past you?
absolutely, hit me
- My app is I/O bound (e.g. the majority of the bottleneck is waiting for OpenAI + Azure API requests to come back)
- I often have a high number of concurrent users
- Therefore I should increase the number of workers
- Since the formula for workers is (2*CPU)+1 and I now have 8 CPU, the appropriate number of workers is (2*8)+1 = 17
- I can do that by updating my procfile to
web: gunicorn --timeout=600 --workers=17 app:app
i was thinking you could use the gevent worker class
maybe
web: gunicorn --worker-class=gevent --worker-connections=100 --workers=17 app:app
Feeling a bit dumb - but does the fact that I have multiple parallel requests, each taking a long time, imply that I need parallelism rather than concurrency?
workers are for parallelism
Laying it out:
- Each request is strictly linear. And most of it is waiting for APIs to reply
- At any point, I'll have a bunch of requests running (since each can take up to 15 seconds)
- Therefore I want parallelism rather than concurrency
correct
so try my proposed start command
cool
(you will need to add gevent to your requirements.txt file)
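something like this is all requirements.txt needs (the package names are just a guess based on your setup - list whatever your app actually imports):
flask        # assumed web framework, based on the app:app entrypoint
gunicorn
gevent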
also
you talk about having a high number of users
so its starting to sound like you might need to be on the teams plan, not for extra memory or cpu, but because it seems like you are hosting a commercial product on railway
ah - i was not aware the dev plan didn't work for commercial products
dev plan is for hobby use cases
Happy to upgrade. It's arguably somewhere between hobby and a real product, but it does have about $300 worth of paid users.
then that would for sure require you to be on the teams plan
https://docs.railway.app/reference/plans#developer-plan-offering
"The Developer Plan is meant for serious hobbyist use-cases. We do urge companies to upgrade to the Teams plan for a commercial license to use Railway."
you might not be operating a company but you are collecting income generated from an app thats hosted on railway
heres the ceo of railway telling someone the same thing https://discord.com/channels/713503345364697088/1100082877980479528/1100182836079771669
Is that a strict requirement?
Actual project revenue is so low that this makes a meaningful difference in terms of hosting on railway vs an alternative.
its not super strict, but for what you have said, i think jake and the team might prefer you to be on the teams plan
but im just the messenger, i cant enforce it
It's just me - part time, generating < 300/mo and spending most of it on OpenAI bills 😬
like i said, im just the messenger, i will let the team make the final decision
would you like me to tag in a team member to give you the verdict?
Generally the policy is that as long as you're making enough to pay for the teams plan, you should be upgrading. $300 a month with the majority going to OpenAI? There's some room in there for Railway.
The main perk of upgrading is direct support from the Team, which is pretty helpful for larger scale dev and growing your business
And of course higher limits on your services
Hey @DM - the upgrade path is about support expectations. The timeouts you're describing are likely not with Railway. The issue we run into is that users on the Dev plan who are working with customers yell at us for not meeting support SLOs when they're on the wrong tier, and then we can't tell who needs what.
Upgrading lets us know that your application needs a tad more care.
If you are cool with Brody's jokes for your app and understand that the team can't reply to you asap, you are fine on the Dev plan. Else, we urge you to upgrade. Railway isn't the cheapest hosting platform out there, but we hope the time we save you helps shift the calculus in our favor.
my jokes???
You are a funny guy, but $20 per user per month for Angelo jokes is an upgrade.
oh yeah I see what you did now
Upgraded because paying more to you guys beats giving money to Salesforce. And because if it wasn't for Brody, I never would have known what concurrency is.
On a related note - I'm still getting worker timeouts even after implementing the above fix to procfile
are you getting these timeouts while you have users using the service?
Will see if the upgrade to team plan fixes it.
What do you mean?
Yeah, it's all via active usage
is this your current start command?
That's correct.
ohmygod
I installed gevent but didn't add it to requirements.txt
Sorry. Long day. Data analyst here pretending I'm a dev
you added gevent into the start command but didnt add it to your requirements.txt file?
(yup ... fixing now)
Redeployed. Wish me luck.
i wish you luck!
@Brody 1 hour and 100 queries later no errors (woop woop)
im happy to hear that!!!
you have access to 32 vCPUs now, so you could do 64 workers!!
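the start command would then look something like this (same command as before, just with the worker count bumped):
web: gunicorn --worker-class=gevent --worker-connections=100 --workers=64 app:app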
Dumb question - but what does it mean when I see the same line (in logs) being repeated twice? Haven't seen it happen before and wonder what the implication is.
workers are completely separate instances of your app, so it would just be two instances of your app that happened to print the same thing, shouldnt be cause for concern
@Brody any advice on what I could be doing wrong? Still seeing a ton of worker timeouts 😦
I also realized my project might still be under my personal account, rather than the team account.
hey please dont ping the team (angelo)
Ack.
Apologies for breaking protocol
have you redeployed your project since upgrading to the teams plan?
I have.
Deploy looked clean? I can also increase the number of workers - but surprised this is not enough. My traffic isn't that heavy.
then your service should be running with the new increased resource limits, what makes you think it isnt?
I'm not sure. It's listed under me rather than under the team.
drag it into that area
Mostly I'm just freaking out because the app is breaking for customers.
Doesn't seem to work
wdym doesnt work?
dragging
you need to drag from the drag point, the little dots in the top right of the project card
❤️
My CPU metrics are low - so I have a feeling it's still something with my config (rather than the resources)
What's the best way to diagnose deploy?
Wonder if I messed up the config / environment somehow
okay so say if you only had one user, what would the response time for the requests look like?
30 seconds? 90 seconds?
what time range are we looking at
give me best and worst case scenario for when only one single user is using the service
Sorry for the lag.
There's 3 formats of requests:
- 1x OpenAI api call (3-10 seconds)
- 4x OpenAI call (10-25 seconds)
- 1x OpenAI + 1x Azure OCR call (10-20 seconds)
The worst case is one of the API requests times out. My traffic is way under the rate limit, but sometimes they get overloaded.
so from the data you just gave me, the longest response time that you could expect from your app would be 25 seconds?
Assuming neither API I depend on fails.
Generally speaking - when behaving normally they all work in ~20 seconds max.
A number of the timeouts are coming from the simpler (3-10 second) requests
Anecdotally - external APIs were only responsible for 20%ish of timeouts when I was on Heroku
well gunicorn's default request timeout is 30 seconds, so if your app doesn't respond within 30 seconds you see that timeout message - something you are doing is sporadically taking longer than 30 seconds
I'd just like to be upfront in saying this, but this would not be railway's fault
railway has no timeout, these timeouts would be inefficiencies in your code
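(to be clear, the --timeout=600 from your original command would also push gunicorn's timeout past 30 seconds, but that's only a stopgap that hides the symptom rather than fixing the root cause - for reference it would look like this with the gevent setup from earlier:)
web: gunicorn --worker-class=gevent --worker-connections=100 --timeout=600 --workers=64 app:app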
I 100% don't think it's yalls fault.
Transparently taking advantage of your help / expertise to help me diagnose what the issue is. Do you have any pointers on how I would tackle diagnosing the root cause?
add time logging to critical points throughout your request handler functions
log the time it takes for your app to call open-ai, azure, etc, and similarly log the entire time it took to complete all processing
👀
if you don't have telemetry, you will have no hope of narrowing down the code that's causing timeouts
The part that confuses me is that I can easily replicate the same requests that trigger issues - but without the error.
E.g. if I find a request that broke and re-run it, it usually works.
I know, makes it really hard to debug, but what makes it super hard to debug is not knowing what caused the timeout when a timeout does occur
Got it.
So you would suggest logging the times for each critical step?
Any suggestions on the package / tool?
yes! add individual time logging for every external request or heavy processing your app does if applicable
I'm not a python developer, so unfortunately I don't have any good recommendations, but if all you need is a log of time since point A to point B in your code, I think the standard library can do that
Cool.
As a last-ditch effort before going the tracking route - would it be worthwhile to try and up the number of parallel workers?
start_time = time.time()
# external api call
end_time = time.time()
print("time for x api call:", end_time - start_time)
that's some pseudo code
not real code
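in python that rough idea could look something like this using only the standard library (call_openai here is just a placeholder for your real API call):
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("timing")

def call_openai(payload):
    # stand-in for your real OpenAI request
    time.sleep(0.1)
    return "openai result"

def handle_request(payload):
    request_start = time.perf_counter()

    t0 = time.perf_counter()
    result = call_openai(payload)
    log.info("openai call took %.2fs", time.perf_counter() - t0)

    # repeat the same pattern around the azure ocr call and any heavy processing

    log.info("total request time: %.2fs", time.perf_counter() - request_start)
    return result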
yeah - i'll sort it out
no it wouldn't, you now have 32 cores and are running 64 workers, this is now just code issues
got it. last dumb question
What's the best way to validate / x-check that the procfile is correct (and save me from more 'forgot to add to requirements.txt' type errors)
um yeah pythons dependency stuff is quite bad, you just have to make sure to add whatever you use in your project to the txt file
that's pretty much the extent of keeping python packages in check with the txt file unfortunately
there is pyproject, look into that after you get these issues sorted out
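for when you get there, a bare-bones pyproject.toml carrying the same dependencies might look roughly like this (name and version are placeholders):
[project]
name = "my-app"
version = "0.1.0"
dependencies = ["flask", "gunicorn", "gevent"]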
@Brody I'm gonna sound like a drunk conspiracy theorist here for a second, but hear me out.
lol okay
Every time there are errors, it's when the memory goes up.
RAM*
The timing of timeouts maps exactly to the moments memory is increasing.
...but each time I redeploy it resets.
Is there a way to manually assign some minimal level of dedicated RAM?
does memory increase during successful requests?
not that i can tell
those 2 bumps line up exactly with the instances of failed requests
this would only make things worse, it will not solve your problems, please don't go down this path, you need to find and fix the root cause
Do you have any sense on the 'why'?
on why your memory usage increases during timed out requests?
Yup. And why assigning dedicated RAM wouldn't make things any better?
why memory increase? honestly no clue, have never seen your code, there's hundreds of possible reasons
why capping memory not good? because it doesn't actually fix the root issues, so if it doesn't fix the root issue you shouldn't do it. it could lead to full service crashes, making things even worse
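if you later want to see where the memory actually goes during a single request, the standard library's tracemalloc gives a rough picture - just a sketch, not a fix:
import tracemalloc

tracemalloc.start()

# ... run one of the slow request handlers here ...

current, peak = tracemalloc.get_traced_memory()
print(f"current: {current / 1e6:.1f} MB, peak: {peak / 1e6:.1f} MB")
tracemalloc.stop()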
Got it.
Were you on the starter plan before you upgraded?
I went starter -> dev -> team yesterday afternoon
If you haven't redeployed this project since you upgraded it will still be limited by starter limits
Try pushing a new commit
they have
The timeouts also line up with CPU usage rising above 0.
thing is 200mb is the starter plan limit, you sure you have?
im confident its running on the teams plans
fair enough
Basically anytime there's a change in the above charts, it's also the same time the timeouts occur.
Unfortunately this is very likely a code issue
😦
How's your logging?
adam, its 512???
starter is 200, trial is 512
oh, my bad
no prob
i have never used starter or trial plans myself, so im a bit lacking in knowledge there
I don't track component times yet. But anecdotally - re-running failed queries usually works.
I also know via API logs that those are a rare (but non-zero) source of failures
we talked about doing that earlier, you should get on that, you will never be able to efficiently debug this without telemetry / logging
Yup.
alright then you know what your next step is