Is my service down?
service id: 44d24596-53c9-4ef8-bbc2-f3da425755c5
service name: shipaid-api
I'm seeing a lot of error logs with ECONNRESET, but the service is running according to my /health endpoint. We didn't change anything recently
Project ID:
44d24596-53c9-4ef8-bbc2-f3da425755c5
what kind of app is this?
it's a Node.js API
more specifically?
sorry, I didn't get it. what exactly do you need to know? tech stack? business context?
tech stack
ok, we're using Node.js with TypeScript and the AdonisJS framework
our API is hosted on Railway and we use Nhost, which is kind of an ecosystem that provides us services such as Postgres and a GraphQL server
we're also using redis hosted with railway
and we use Rollbar for logs and monitoring
we mostly operate by integrating with Shopify stores; we're a Shopify app, so we mostly listen to Shopify webhooks to fire operations in the API. we also have an Admin dashboard built with React that interacts with the API via GraphQL
the application seems to be running, as I also see info logs related to my business activity, but these error logs keep popping up and they seem to be preventing some of my operations from running
do you happen to know what service you're communicating with when you get these errors?
no
does your api accept any post requests?
yes
what do these post requests consist of?
they are mostly shopify webhook listeners
besides some queries/mutations called by my FE Admin dashboard
what's the frequency of these errors?
it seems to be happening all the time since one hour ago
okay I'll ask the team if they have any idea what could cause this
ok, thanks
I found these logs related to Rollbar... but I don't see how they are preventing my jobs from running
looks like you're just having trouble communicating to 3rd party api services, very odd
I've just tested on my local env sending a notification to rollbar and it worked
but it keeps happening on production, is it possible that railway is blocking the communication?
Is there anything else I can check here? the most critical operation of my system is not running and I think it's related to it. We have a job that is queued in redis and it looks like it's not being fired
we're using this package to communicate with redis
https://github.com/Rocketseat/adonis-bull
I honestly have no clue, I have asked the team
hey, is it possible for you to restart the instance of my service?
maybe it will clean what's causing the issue
or, If I push a new deploy it will restart itself?
you can both restart or redeploy it yourself by clicking the button in the 3 dot menu on the deploy
ok
to clear up any possible misconceptions, i dont work for railway
oh ok, thanks for clarifying
hey, there is definitely something going on with Redis
err: ReplyError: ERR Error running script (call to f_5b2b5ebf27a713df55ffe098b317f75d41c716e5): @user_script:94: @user_script: 94: -MISCONF Redis is configured to save RDB snapshots, but it is currently not able to persist on disk. Commands that may modify the data set are disabled, because this instance is configured to report errors during writes if RDB snapshotting fails (stop-writes-on-bgsave-error option). Please check the Redis logs for details about the RDB error.
but I can't see its logs from here
is there another way to see it?
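(while the plugin logs are unavailable, here's a minimal diagnostic sketch you could run from your machine, assuming ioredis is installed and REDIS_URL points at that instance; note CONFIG can be restricted on managed Redis:)
```ts
// check-persistence.ts - inspect the RDB/persistence state behind the MISCONF error
import Redis from "ioredis";

async function main() {
  const redis = new Redis(process.env.REDIS_URL ?? "redis://localhost:6379");

  // INFO persistence reports rdb_last_bgsave_status, rdb_last_save_time, AOF status, etc.
  console.log(await redis.info("persistence"));

  // This flag is what turns a failed background save into "writes disabled" (the MISCONF error)
  try {
    console.log(await redis.call("CONFIG", "GET", "stop-writes-on-bgsave-error"));
  } catch (err) {
    console.log("CONFIG not permitted on this instance:", err);
  }

  await redis.quit();
}

main().catch(console.error);
```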
have you migrated redis?
no
you are still using the old redis plugin?
if I restart database it will erase everything that is currently there?
no
I'm not sure if I'm using this redis plugin tho
what does it do?
you are, you are showing me screenshots specific to the deprecated redis plugin
ok, so this plugin is this UI? to see logs etc...?
should I try restarting the database?
yep thats the deprecated redis plugin
cant hurt
using this old plugin affects the internal use of redis of my application?
no
still not working
even after I restarted
is there another way to see redis logs?
without updating the plugin
there isnt, as the message says
can you connect to redis from your computer?
yes
are you sure your code is using the current credentials for redis?
yes
my credentials are set on railway
okay then ill do this
@matt - error connecting to redis
I'm going through the migration guide and I'm not sure I understood this part. Will the migration stop my API? because it says "connected to the plugin" and not connected to the redis instance itself. I thought the plugin was just this UI, so I don't know what would be an example of a service connected to the plugin
redis is the plugin
its a redis plugin, aka a redis database
ok so it will stop my application
yes
ok, so from my understanding, I don't have to do anything to migrate. Just click the migrate button and wait for a short downtime. Am I correct?
thats the idea
because maybe I can migrate now but I'm just trying to make sure that it won't make things worse
i honestly didn't think of migrating to a v2 database, but now i think this has a decent chance of working.
but one thing to check before you migrate, are you using variable references for your database? https://docs.railway.app/develop/variables#reference-variables
checking
no, looks like they're just placed normally as a variable of my service
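(for context, a minimal sketch of what the app side of that usually looks like; the service only ever reads an env var, and whether Railway fills it in with a pasted value or a reference variable like ${{Redis.REDIS_URL}} per the linked docs is purely a dashboard setting. The service name "Redis" here is an assumption:)
```ts
// redisConnection.ts - the app just reads REDIS_URL; whether Railway sources that value
// from a plain variable or a reference variable doesn't change this code.
import Redis from "ioredis";

const redisUrl = process.env.REDIS_URL;
if (!redisUrl) {
  throw new Error("REDIS_URL is not set for this service");
}

export const redis = new Redis(redisUrl);
```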
oh no, sorry
I think your Redis is running out of memory and crashing on OOM? 🤔
actually we're using
uh yeah that would do it
yeah thats good, but a migration wont solve an OOM error
I thought memory would auto scale
yes, up to your plan limits (32gb on pro)
so just upgrading the plan would solve it?
ive seen the team just increase the account limit before, not sure what they want to do this time though
I've asked internally (I haven't dealt much with Redis)
looking at my usage I see several times where it hits 60GB, so I thought it was just auto scaling and billing me for that. Isn't that the case?
that's total project usage
each service / plugin can use up to what your plan allows
is this the usage of redis only? isn't it higher than 32 tho?
that's redis only, but that's a cumulative value
oh ok
I don't see a quick way to upgrade tho, is there any other alternative?
I've raised it to 40G on your plan. If you restart your Redis, the new limit should kick in; 🤞 that'll prevent it crashing. If it doesn't, we can raise it further
This is not standard on the plan and my override only applies for a week, so before then can you reach out to [email protected] re: this and get on a custom plan?
ok I'll restart
lmk when that's done and I'll check
I think it's done
still seeing the same errors on logs tho
how can I check if the new limit is applied?
check the plugin metrics
if you were hitting 32gb, there's a good chance you'd easily hit 40gb
it's on 30 but it was like this before also
you'll see it in the 1hr view - it's at 39 ish now
oh, yes
I see it now
I'd avoid landing on the Data tab in the UI too btw, it tries to query the redis DB and may be causing memory pressure. It looks stable-ish 🤞
but 39.9 isn't going to hit the limit again very soon?
Redis is a curious beast. It'll eat up as much ram as it can in some cases; only when it needs to page more than it can free does it hit OOM. If you notice, you were stable at 32G for a long while.
might be worth trying out dragonflydb?
whats dragonfly? never heard of it
also, I fear for your egress costs, you are using the public host of the redis database instead of a private host, thus you are subjecting yourself to egress fees on the database
supposed to be marketed as a higher performing drop in replacement for redis, don't know if that means it would use less memory for your usecase, but might be worth a look into?
ok, I'll look into it
as for egress, what should I do in this case?
also if you want to keep it at 40g beyond a week, please contact sales@ 🙂
sorry to bother with all these questions but I don't have context of most of these things, newbie to devops topics
absolutely no worries mate, that's what the help threads are for!
same, happy to help - and we'll also use this to improve the UX so its not as confusing when things like this happen - I'm gonna add a task to see if we can surface that Out-of-Memory alert somewhere for starters
thanks for helping out! regarding the egress... this also can be handled by upgrading the plan?
the same egress fees apply to any plan, to eliminate egress fees on the database you would want to communicate to it over the private network
yeah - you want to upgrade to the new Redis Template (https://docs.railway.app/guides/database-migration-guide) this system uses private networking for service-to-service comms, so no egress fees + also more control over your Database settings/etc...
I think there is still some issue, my jobs are not running
I don't see the errors anymore, but my jobs have a lot of logs and I don't see any of it
are you working with a new database now, or still the plugin?
still the plugin
I didn't migrate yet
whats the memory metrics look like?
very close to 40
okay i have to ask, is 40gb even normal for what youre doing???
thats 40$ in just mem costs
I'm not sure to be honest
I'm pretty sure that none of my jobs are running, normally this is full of job logs
likely still an OOM problem
I don't see errors this time tho. Is there a way to contact anyone from railway quickly? this is becoming pretty critical now, I need to upgrade asap if that's the case
can we raise it a bit more here?
I think I can try 50? lemme see
ok
let me know when I should restart
I can do it live for 50. Looking at it, it doesn't seem to have OOM crashed in the past 2 hours
yes, raise it please, let's try
but I'm also finding it odd because I don't see it breaking either
and also not seeing error logs
are you seeing any issues? I don't see it breaking at all. I'll up it to 45, we can then see if Redis eats up the extra 5G (cos I'm sure it's going to). Being resident at 45G doesn't necessarily mean it'll OOM btw. Redis will have some RAM it can free if it needs more. The OOM hits when you have a query or something that needs more than it has available.
just to give you context normally it looks like this
everything that starts with "[" is a job running and I have it all the time
this is from yesterday
and now it's like this
This is most likely something up with your application - I'd try to debug why your jobs aren't getting processed. Maybe try connecting to the redis externally and see what its doing.
I've raised your memory limit to 50G
I haven't changed anything in my application recently, we're avoiding new deployments because of Black Friday week. And also the same jobs run in my application locally
but I'll try to connect my application to the production redis instance and test it
I connected my local application to prod redis and now hundreds of logs are popping
but I don't see the logs on railway
any clue?
is it possible that all of these are jobs that were waiting on queue while redis was crashing?
and it looks like all of these jobs are running in my machine now
I redeployed my API and it seems to have resolved the issue somehow
thanks for the support!
good old restart
Hi guys, it looks like I'm running into the same issue again!
I had a call with railway last week in which the person told me our Redis RAM would be increased to 64GB for good
but now I'm hitting 50GB and it looks like it's crashing due to OOM
Is there a way to have someone quickly look into this?
sorry to bother again, but would you be able to look into it? This is very urgent
looks like the limit was still 50G on the redis, but your account limit was 64. I've upped the limit on your redis manually to 64. Will follow up with the team why the account limit didn't apply (might be a bug). Do you remember who you spoke to?
thanks! I spoke with Angelo ( [email protected])
I'm finding it very odd that it's getting close to 62 anyway, we haven't deployed any changes recently and our traffic is also not that big
from what I'm monitoring the issue started happening again earlier today, and we had half of the traffic we had last Friday (Black Friday)
any clue on what can be causing this big increase?
What client are you using to connect to the Redis?
(also following up here as opposed to the email as the conversation is already ongoing)
https://github.com/Rocketseat/adonis-bull
My guess would be that something the library or your code is doing is causing a memory leak, because 60 gigs is a LOT. Maybe open an issue on their repo?
It also seems like it hasn't been updated in 2 years. Perhaps try something like BullMQ?
yes, changing this library is on our roadmap, but since nothing has been updated recently I was wondering if there could be something on railway's side causing this. Because we never had this issue before and we were already using this library
Quite confident it isn't us as a lot of users use queues in combination with Redis and are seeing no issues.
I agree with fp here. have you restarted your service lately?
yes, last week when I first ran into this issue and earlier today when we had the issue happening again
can you show your memory metrics at those points? there should be a line on the graph that indicates when you restarted
I’m not seeing any lines, did you restart the service during that time period?
they're still on the deprecated redis plugin, I'm not sure you'd see a line for a restart on those plugins
yes, between 11 am and 12 pm
ahh gotcha, yeah that tracks. Also noticed the cap was at 50GB before the bump up to 64 mentioned earlier in the convo
only potential solution I have is to swap to the new plugin. As this plugin is deprecated there isn’t much we can help you with
yeah I'd recommend a migration too
given that the memory climb isn’t gradual, it’s likely not a memory leak. Seems like your app is just being greedy allocating all the memory available then crying for more
No idea what it could be allocating it for, but it’s likely useless
ok, so updating to the new plugin would help to investigate why it is allocating that much?
not a redis expert, but there are some stats things that might help:
https://redis.io/commands/memory-stats/
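(a minimal sketch of pulling those numbers from a script, assuming ioredis; MEMORY STATS / MEMORY DOCTOR are standard Redis commands, though the exact fields vary by Redis version:)
```ts
// memory-report.ts - ask Redis itself where the memory is going
import Redis from "ioredis";

async function main() {
  const redis = new Redis(process.env.REDIS_URL ?? "redis://localhost:6379");

  // Summary: used_memory_human, maxmemory, mem_fragmentation_ratio, ...
  console.log(await redis.info("memory"));

  // Detailed breakdown (the command from the linked docs)
  console.log(await redis.call("MEMORY", "STATS"));

  // Redis' own advice, e.g. about fragmentation or big keys
  console.log(await redis.call("MEMORY", "DOCTOR"));

  await redis.quit();
}

main().catch(console.error);
```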
yes
nice, I'll look into it
they're now called database services, and they can give you much more control
even switching to dragonfly like I've mentioned before might work out better too
I see, I'll try to update it on next few days then
haha try to migrate before it crashes again
hey, I'm migrating to the new plugin and the migration progress has been stuck on Migrating Data for more than 2 hours
is this normal?
I think there is 7.5 G being migrated
hey, I'm sorry to bother again, but are you able to check if everything is ok with my migration? I think it's taking too long and my app is completely down while the migration is executing
staging had ~half of the data and took ~30mins I think, so I expected prod to take ~1hour
I don't know the migration internals well, but will check internally
thanks!
if you look at the 'Redis Migration' service in your production environment and check the logs, it looks like it finished the transfer. I'm checking why the spinner is still showing. You should be able to update your apps environment variable and re-deploy it (the logs have the instructions)
it looks like you got it working with the new redis after the redeploy? everything ok?
I just redeployed, I'm testing to see if everything is ok
We're going to keep that migration workload for an hour till JR gets in and has a look, we'll get to the bottom of why it got stuck. I'll also apply some credits to your account since the migration should definitely not have caused such long downtime like it did.
ok, thanks for your help. there is also a bunch of PR environments on my account that I'm not able to get rid of. All of them are already closed on my end, so could you please delete them?
those with "pr" at the end
oh wow - yeah def will also look into this
I'm still seeing the Redis Legacy here so please let me know when I can delete it
Re: PR environments,
can you delete them from https://railway.app/project/dd204693-57d8-4d8e-afd2-d01235ff028f/settings/environments
we fixed an issue on that page this morning and the envs should now be visible there
We'll get back to you today on this, I've already credited your account, but we'll make sure to cover any extra charges from those two services sticking around.
ok! thanks
Hey @Matheus Santos,
The migration data transfer completed successfully and was just getting caught on a logging issue (very strange and we will look into it). The migration in the UI shows as errored and the service was redeployed. I believe everything looks good but just letting you know
thanks! we've had a look to see what went wrong, please go ahead and delete them.
done, thanks! everything seems fine now. I'll keep monitoring because we're now at 45GB memory and I still think this is too much, but hopefully we'll stay stable while I investigate what's causing that
now that I've updated I see it seems like we're keeping some data of completed jobs, is there a way to clean up completed jobs from queue?
re: "is there a way to clean up completed jobs from queue", that's more so a question for adonis-bull's docs
I've tried but couldn't find anything related there, that's why I'm checking if there's a way to do it through railway 😅
we're also gonna migrate this library in the next few days, but just trying to free some space
railway just provides the database, nothing fancy is going on haha
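(not a Railway-side thing, but for reference, a minimal sketch using the bull v3 API that adonis-bull wraps underneath; the queue name "orders" and REDIS_URL are assumptions, and how you reach the underlying queue from adonis-bull itself depends on the wrapper:)
```ts
// clean-completed.ts - one-off cleanup script run against the queue's Redis, bull v3 API
import Queue from "bull";

async function main() {
  const queue = new Queue("orders", process.env.REDIS_URL ?? "redis://localhost:6379");

  // Remove completed jobs older than 0 ms, i.e. all of them (same call works for "failed")
  const removed = await queue.clean(0, "completed");
  console.log(`removed ${removed.length} completed jobs`);

  await queue.close();
}

main().catch(console.error);
```
going forward, passing removeOnComplete: true in the job options when enqueueing stops completed jobs from piling up in the first place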
ok, thanks!
I have 7.5G on my prod Redis instance, could that be what's causing RAM usage to be this big?
and I think this isn't being used
I want to clean it but I'm afraid the instance will crash while deleting because it's a lot of data
hello! looks like we ran into the same issue again today. I see some logs from about ~3 hours ago in which it looks like Redis was down for a minute, then it got back up and some of my jobs were running but others were not. I was going to start debugging my API locally pointing to Redis PROD (and all my other services), and as soon as I started my application, all of these jobs that should have run about 3 hours ago started automatically running locally, as if they were "waiting" to enter the queue. Even though I did nothing but start my application
we were not close to the memory limit this time
When I noticed this issue a few minutes ago, I checked how many of my orders were affected and there were ~500, now this number is constantly decreasing automatically because of these jobs that were "waiting" to enter the queue
can someone please look into this?
I know the library we're using is outdated and we're working to get that updated, but it just doesn't make sense to me that this is happening again, because this time we have the updated plugin, we were not close to the memory limit, and locally everything runs smoothly, so I think it's unlikely to be something in my application
the Redis has been up for 36 hours from what i see
If some jobs were running and others were not, that's probably something internal in your app, since Railway can't do anything to cause a partial failure inside your app like that. I'd check if your app is able to correctly recover from broken redis connections, etc... and try to add more logging to see why things fail.
In case you identify that this happens because of redis connection failures, you should implement a retry - if you're not using private networking to talk to your redis, consider switching, that'll give you a cheaper and more performant network path between your app and Redis.
Please see: https://docs.railway.app/reference/private-networking#caveats as private networking becomes available shortly after your app starts, so it may need to retry if it can't connect to the redis initially. Would try this out on a staging env before you switchover production.
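(a minimal sketch of the retry side with ioredis, assuming that's the client underneath bull; REDIS_PRIVATE_URL is an assumption, and family: 0 is only needed for the IPv6 private hostname and requires a reasonably recent ioredis:)
```ts
// redis.ts - a connection that tolerates Redis being briefly unreachable (e.g. right after boot
// on the private network) instead of letting jobs fail outright
import Redis from "ioredis";

export const redis = new Redis(process.env.REDIS_PRIVATE_URL ?? "redis://localhost:6379", {
  // Retry the connection with a capped backoff instead of giving up
  retryStrategy: (attempts) => Math.min(attempts * 200, 2000),
  // Don't fail queued commands after a fixed number of retries while reconnecting
  maxRetriesPerRequest: null,
  // Dual-stack DNS lookup so an IPv6-only private hostname resolves (assumption: newer ioredis)
  family: 0,
});

redis.on("error", (err) => console.error("redis connection error:", err.message));
redis.on("reconnecting", () => console.warn("redis reconnecting..."));
```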
I'll make sure to check that, but if it was something in my app, wouldn't the issue be also happening on my local env?
Have you left your local env running for that long?
I mean it only took a couple of minutes for the jobs "waiting" to complete
then I restarted my application in production and everything got back to normal there
same issue again, from 10 minutes ago...
why is it even going down if we're not hitting the memory limit?
are you using the private network like char suggested?
I didn't have a chance to check that yet, but again: We didn't change anything on our side recently, and it works locally.
i am also confident these issues are not caused by railway
but the fact that it's going down... how can my app be causing that?
char made some great suggestions, please look into implementing what was said
i understand how this can seem like an issue on railway's side, but all they are doing is providing you with a regular redis database, and running your code as-is
I will, for sure... it's just that I haven't had time yet to do it, I'm already out of office today (it's 10pm in my timezone). But in the meantime it would be good if someone could take a deeper look into it, because like I said, even if it's something on our side, we haven't changed anything recently. To be honest we never changed the way we queue jobs into redis, so why would it stop working now?
If it's just running the code as-is, shouldn't it be working then? because I don't find any issues locally... and the fact that the instance is going down... I don't see how my app could be making it crash unless it is an OOM issue again, but the metrics show our usage around ~50GB and our limit is 64G now
as odd as this may sound, code can and will behave differently in different environments, with differing amounts of traffic, etc.
i have seen hundreds of cases where code that worked locally doesn't work or doesn't work correctly on railway, and those cases were no fault of railway.
to be honest, i don't know why you are having issues now, and i think it would be extremely hard for any of us to know what's going on with your code to cause these issues, but unfortunately deep diving into a user's codebase when railway is not at fault is not really what we or the team are here for, sorry.
You could try updating your deps, that would eliminate a lot of potential issues.
adonis-bull is 2 years old and depends on bull@^3.22.1, which is 3 years old and depends on ioredis@^4.22.0, which is 2+ years old. If you use Redis outside of the bull package, you might be using ioredis@>=5.
You could also check the github issues for these packages with your error logs. Issues might come from these deprecated packages, and might have been fixed in the latest versions.
hello, I am updating my dependency to use the latest version of BullMQ. However, I suspect that one of the reasons my instance is taking that much RAM (~60GB) is because we were not removing our completed jobs. When I upgraded my plugin I noticed that we had ~7.5 GB of data and I see the below amount of keys. It says expires=49 but I'm not sure if it will really expire and clean the data. Am I correct in this hypothesis?
If so, I wonder if it would be possible to deploy a fresh new redis prod instance, I think that would be simpler than cleaning my current instance
hey, would you be able to help with that? ⬆️
you can create as many Redis nodes as you like
https://docs.railway.app/guides/redis
we can't really help with application logic
but yeah, that reads as only 49/5468871 keys have expiry set
but please - if it's a platform issue, email support, and if it's help about your application (like this), maybe a new help thread will be better than re-raising this topic.
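(since the BullMQ migration is underway anyway, a minimal sketch of keeping finished jobs from piling up; the queue name "orders" and REDIS_URL are assumptions, and given only 49 of ~5.5M keys carry a TTL, the existing completed-job keys won't expire on their own and would still need an explicit clean-up:)
```ts
// queue.ts - BullMQ queue that trims finished jobs so they don't accumulate in Redis
import { Queue } from "bullmq";
import IORedis from "ioredis";

const connection = new IORedis(process.env.REDIS_URL ?? "redis://localhost:6379", {
  maxRetriesPerRequest: null, // BullMQ expects this
});

export const ordersQueue = new Queue("orders", {
  connection,
  defaultJobOptions: {
    removeOnComplete: 1000, // keep only the last 1000 completed jobs
    removeOnFail: 5000,     // keep more failed jobs around for debugging
  },
});

// One-off cleanup of jobs already sitting in Redis (grace 0 ms, up to 1M jobs per call)
export async function purgeOldJobs() {
  await ordersQueue.clean(0, 1_000_000, "completed");
  await ordersQueue.clean(0, 1_000_000, "failed");
}
```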