Is my service down?
service id: 44d24596-53c9-4ef8-bbc2-f3da425755c5
service name: shipaid-api
I'm seeing a lot of error logs with ECONNRESET, but the service is running according to my /health endpoint. We didn't change anything recently
Project ID:
44d24596-53c9-4ef8-bbc2-f3da425755c5
what kind of app is this?
it's a Node.js API
more specifically?
sorry, I didn't get it. what exactly do you need to know? tech stack? business context?
tech stack
ok, we're using Node.js with TypeScript and the AdonisJS framework
our API is hosted on Railway and we use Nhost, which is kind of an ecosystem that provides us services such as Postgres and a GraphQL server
we're also using redis hosted with railway
and we use Rollbar for logs and monitoring
we mostly operate by integrating with Shopify stores; we're a Shopify app, so we mostly listen to Shopify webhooks to fire operations in the API. we also have an Admin dashboard built with React that interacts with the API via GraphQL
the application seems to be running, as I also see info logs related to my business activity, but these error logs keep popping up and they seem to be preventing some of my operations from running
do you happen to know what service you're communicating with when you get these errors?
no
does your api accept any post requests?
yes
what do these post requests consist of?
they are mostly shopify webhook listeners
besides some queries/mutations called by my FE Admin dashboard
what's the frequency of these errors?
it seems to be happening all the time since one hour ago
okay I'll ask the team if they have any idea what could cause this
ok, thanks
I found these logs related to Rollbar... but I don't see how they are preventing my jobs from running
looks like you're just having trouble communicating to 3rd party api services, very odd
I've just tested on my local env sending a notification to rollbar and it worked
but it keeps happening on production, is it possible that railway is blocking the communication?
Is there anything else I can check here? the most critical operation of my system is not running and I think it's related to it. We have a job that is queued in redis and it looks like it's not being fired
we're using this package to communicate with redis
https://github.com/Rocketseat/adonis-bull
I honestly have no clue, I have asked the team
hey, is it possible for you to restart the instance of my service?
maybe it will clean what's causing the issue
or, If I push a new deploy it will restart itself?
you can both restart or redeploy it yourself by clicking the button in the 3 dot menu on the deploy
ok
to clear up any possible misconceptions, i dont work for railway
oh ok, thanks for clarifying
hey, there is definitely something going on with Redis
err: ReplyError: ERR Error running script (call to f_5b2b5ebf27a713df55ffe098b317f75d41c716e5): @user_script:94: @user_script: 94: -MISCONF Redis is configured to save RDB snapshots, but it is currently not able to persist on disk. Commands that may modify the data set are disabled, because this instance is configured to report errors during writes if RDB snapshotting fails (stop-writes-on-bgsave-error option). Please check the Redis logs for details about the RDB error.
but I can't see its logs from here
is there another way to see it?
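(while the plugin logs are unavailable, here's a minimal diagnostic sketch you could run from your machine, assuming ioredis is installed and REDIS_URL points at that instance; note CONFIG can be restricted on managed Redis:)
```ts
// check-persistence.ts - inspect the RDB/persistence state behind the MISCONF error
import Redis from "ioredis";

async function main() {
  const redis = new Redis(process.env.REDIS_URL ?? "redis://localhost:6379");

  // INFO persistence reports rdb_last_bgsave_status, rdb_last_save_time, AOF status, etc.
  console.log(await redis.info("persistence"));

  // This flag is what turns a failed background save into "writes disabled" (the MISCONF error)
  try {
    console.log(await redis.call("CONFIG", "GET", "stop-writes-on-bgsave-error"));
  } catch (err) {
    console.log("CONFIG not permitted on this instance:", err);
  }

  await redis.quit();
}

main().catch(console.error);
```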
have you migrated redis?
no
you are still using the old redis plugin?
if I restart database it will erase everything that is currently there?
no
I'm not sure if I'm using this redis plugin tho
what does it do?
you are, you are showing me screenshots specific to the deprecated redis plugin
ok, so this plugin is this UI? to see logs etc...?
should I try restarting the database?
yep thats the deprecated redis plugin
cant hurt
using this old plugin affects the internal use of redis of my application?
no
still not working
even after I restarted
is there another way to see redis logs?
without updating the plugin
there isnt, as the message says
can you connect to redis from your computer?
yes
are you sure your code is using the current credentials for redis?
yes
my credentials are set on railway
okay then ill do this
@matt - error connecting to redis
I'm going through the migration guide and I'm not sure I understood this part. Will the migration stop my API? because it says "connected to the plugin" and not connected to the redis instance itself. I thought the plugin was just this UI, so I don't know what would be an example of a service connected to the plugin
redis is the plugin
its a redis plugin, aka a redis database
ok so it will stop my application
yes
ok, so from my understanding, I don't have to do anything to migrate. Just click the migrate button and wait for a short downtime. Am I correct?
thats the idea
because maybe I can migrate now but I'm just trying to make sure that it won't make things worse
i honestly didn't think of migrating to a v2 database, but now i think this has a decent chance of working.
but one thing to check before you migrate, are you using variable references for your database? https://docs.railway.app/develop/variables#reference-variables
checking
no, looks like they're just placed normally as a variable of my service
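(for context, a minimal sketch of what the app side of that usually looks like; the service only ever reads an env var, and whether Railway fills it in with a pasted value or a reference variable like ${{Redis.REDIS_URL}} per the linked docs is purely a dashboard setting. The service name "Redis" here is an assumption:)
```ts
// redisConnection.ts - the app just reads REDIS_URL; whether Railway sources that value
// from a plain variable or a reference variable doesn't change this code.
import Redis from "ioredis";

const redisUrl = process.env.REDIS_URL;
if (!redisUrl) {
  throw new Error("REDIS_URL is not set for this service");
}

export const redis = new Redis(redisUrl);
```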
oh no, sorry
I think your Redis is running out of memory and crashing on OOM? 🤔
actually we're using
uh yeah that would do it
yeah thats good, but a migration wont solve an OOM error
I thought memory would auto scale
yes, up to your plan limits (32gb on pro)
so just upgrading the plan would solve it?
ive seen the team just increase the account limit before, not sure what they want to do this time though
I've asked internally (I haven't dealt much with Redis)
looking at my usage I see several times where it hits 60GB, so I thought it was just auto scaling and billing me for that. Isn't that the case?
that's total project usage
each service / plugin can use up to what your plan allows
is this the usage of redis only? isn't it higher than 32 tho?
that's redis only, but that's a cumulative value
oh ok
I don't see a quick way to upgrade tho, is there any other alternative?
I've raised it to 40G on your plan. If you restart your Redis, the new limit should kick in; 🤞 that'll prevent it crashing. If it doesn't, we can raise it further
This is not standard on the plan and my override only applies for a week, so before then can you reach out to [email protected] re: this and get on a custom plan?
ok I'll restart
lmk when that's done and I'll check
I think it's done
still seeing the same errors on logs tho
how can I check if the new limit is applied?
check the plugin metrics
if you were hitting 32gb, there's a good chance you'd easily hit 40gb
it's on 30 but it was like this before also
you'll see it in the 1hr view - it's at 39 ish now
oh, yes
I see it now
I'd avoid landing on the Data tab in the UI too btw, it tries to query the redis DB and may be causing memory pressure. It looks stable-ish 🤞
but 39.9 isn't going to hit the limit again very soon?
Redis is a curious beast. It'll eat up as much ram as it can in some cases; only when it needs to page more than it can free does it hit OOM. If you notice, you were stable at 32G for a long while.
might be worth trying out dragonflydb?
whats dragonfly? never heard of it
also, I fear for your egress costs, you are using the public host of the redis database instead of a private host, thus you are subjecting yourself to egress fees on the database
supposed to be marketed as a higher performing drop in replacement for redis, don't know if that means it would use less memory for your usecase, but might be worth a look into?
ok, I'll look into it
as for egress, what should I do in this case?
also if you want to keep it at 40g beyond a week, please contact sales@ 🙂
sorry to bother with all these questions but I don't have context of most of these things, newbie to devops topics
absolutely no worries mate, that's what the help threads are for!
same, happy to help - and we'll also use this to improve the UX so its not as confusing when things like this happen - I'm gonna add a task to see if we can surface that Out-of-Memory alert somewhere for starters
thanks for helping out! regarding the egress... this also can be handled by upgrading the plan?
the same egress fees apply to any plan, to eliminate egress fees on the database you would want to communicate to it over the private network
yeah - you want to upgrade to the new Redis Template (https://docs.railway.app/guides/database-migration-guide) this system uses private networking for service-to-service comms, so no egress fees + also more control over your Database settings/etc...
I think there is still some issue, my jobs are not running
I don't see the errors anymore, but my jobs have a lot of logs and I don't see any of it
are you working with a new database now, or still the plugin?
still the plugin
I didn't migrate yet
whats the memory metrics look like?
very close to 40
okay i have to ask, is 40gb even normal for what youre doing???
thats 40$ in just mem costs
I'm not sure to be honest
I'm pretty sure that none of my jobs are running, normally this is full of job logs
likely still an OOM problem
I don't see errors this time tho. Is there a way to contact anyone from railway quickly? this is becoming pretty critical now, I need to upgrade asap if that's the case
can we raise it a bit more here?
I think I can try 50? lemme see
ok
let me know when I should restart
I can do it live for 50. Looking at it, it doesn't seem to have OOM crashed in the past 2 hours
yes, raise it please, let's try
but I'm also finding it odd because I don't see it breaking either
and also not seeing error logs
are you seeing any issues? I don't see it breaking at all. I'll up it to 45, we can then see if Redis eats up the extra 5G (cos I'm sure it's going to). Being resident at 45G doesn't necessarily mean it'll OOM btw. Redis will have some RAM it can free if it needs more. The OOM hits when you have a query or something that needs more than it has available.
just to give you context normally it looks like this
everything that starts with "[" is a job running and I have it all the time
this is from yesterday
and now it's like this
This is most likely something up with your application - I'd try to debug why your jobs aren't getting processed. Maybe try connecting to the redis externally and see what its doing.
I've raised your memory limit to 50G
I haven't changed anything in my application recently, we're avoiding new deployments because of Black Friday week. And also the same jobs run in my application locally
but I'll try to connect my application to the production redis instance and test it
I connected my local application to prod redis and now hundreds of logs are popping
but I don't see the logs on railway
any clue?
is it possible that all of these are jobs that were waiting on queue while redis was crashing?
and it looks like all of these jobs are running in my machine now
I redeployed my API and it seems to have resolved the issue somehow
thanks for the support!
good old restart
Hi guys, it looks like I'm running into the same issue again!
I had a call with railway last week in which the person told me our Redis RAM would be increased to 64GB for good
but now I'm hitting 50GB and it looks like it's crashing due to OOM
Is there a way to have someone quickly look into this?
sorry to bother again, but would you be able to look into it? This is very urgent
looks like the limit was still 50G on the redis, but your account limit was 64. I've upped the limit on your redis manually to 64. Will follow up with the team why the account limit didn't apply (might be a bug). Do you remember who you spoke to?
thanks! I spoke with Angelo ( [email protected])
I'm finding it very odd that it's getting close to 62 anyway, we haven't deployed any changes recently and our traffic is also not that big
from what I'm monitoring the issue started happening again earlier today, and we had half of the traffic we had last Friday (Black Friday)
any clue on what can be causing this big increase?
What client are you using to connect to the Redis?
(also following up here as opposed to the email as the conversation is already ongoing)
https://github.com/Rocketseat/adonis-bull
My guess would be that something the library or your code is doing is causing a memory leak, because 60 gigs is a LOT. Maybe open an issue on their repo?
It also seems like it hasn't been updated in 2 years. Perhaps try something like BullMQ?
yes, changing this library is on our roadmap, but since nothing has been updated recently I was wondering if there could be something on railway's side causing this. Because we never had this issue before and we were already using this library
Quite confident it isn't us as a lot of users use queues in combination with Redis and are seeing no issues.
I agree with fp here. have you restarted your service lately?
yes, last week when I first ran into this issue and earlier today when we had the issue happening again
can you show your memory metrics at those points? there should be a line on the graph that indicates when you restarted
I’m not seeing any lines, did you restart the service during that time period?
they're still on the deprecated redis plugin, I'm not sure you'd see a line for a restart on those plugins
yes, between 11 am and 12 pm
ahh gotcha, yeah that tracks. Also noticed the cap was at 50GB before the bump up to 64 mentioned earlier in the convo
only potential solution I have is to swap to the new plugin. As this plugin is deprecated there isn’t much we can help you with
yeah I'd recommend a migration too
given that the memory climb isn’t gradual, it’s likely not a memory leak. Seems like your app is just being greedy allocating all the memory available then crying for more
No idea what it could be allocating it for, but it’s likely useless
ok, so updating to the new plugin would help to investigate why it is allocating that much?
not a redis expert, but there are some stats things that might help:
https://redis.io/commands/memory-stats/
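(a minimal sketch of pulling those numbers from a script, assuming ioredis; MEMORY STATS / MEMORY DOCTOR are standard Redis commands, though the exact fields vary by Redis version:)
```ts
// memory-report.ts - ask Redis itself where the memory is going
import Redis from "ioredis";

async function main() {
  const redis = new Redis(process.env.REDIS_URL ?? "redis://localhost:6379");

  // Summary: used_memory_human, maxmemory, mem_fragmentation_ratio, ...
  console.log(await redis.info("memory"));

  // Detailed breakdown (the command from the linked docs)
  console.log(await redis.call("MEMORY", "STATS"));

  // Redis' own advice, e.g. about fragmentation or big keys
  console.log(await redis.call("MEMORY", "DOCTOR"));

  await redis.quit();
}

main().catch(console.error);
```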
yes
nice, I'll look into it
they're now called database services, and they can give you much more control
even switching to dragonfly like I've mentioned before might work out better too
I see, I'll try to update it on next few days then
haha try to migrate before it crashes again
hey, I'm migrating to the new plugin and the migration progress has been stuck on Migrating Data for more than 2 hours
is this normal?
I think there is 7.5 G being migrated
hey, I'm sorry to bother again, but are you able to check if everything is ok with my migration? I think it's taking too long and my app is completely down while the migration is executing
staging had ~half of the data and took ~30mins I think, so I expected prod to take ~1hour
I don't know the migration internals well, but will check internally
thanks!
if you look at the 'Redis Migration' service in your production environment and check the logs, it looks like it finished the transfer. I'm checking why the spinner is still showing. You should be able to update your apps environment variable and re-deploy it (the logs have the instructions)
it looks like you got it working with the new redis after the redeploy? everything ok?
I just redeployed, I'm testing to see if everything is ok
We're going to keep that migration workload for an hour till JR gets in and has a look, we'll get to the bottom of why it got stuck. I'll also apply some credits to your account since the migration should definitely not have caused such long downtime like it did.
ok, thanks for your help. there is also a bunch of PR environments on my account that I'm not able to get rid of. All of them are already closed on my end, so could you please delete them?
those with "pr" at the end
oh wow - yeah def will also look into this
I'm still seeing the Redis Legacy here so please let me know when I can delete it
Re: PR environments,
can you delete them from https://railway.app/project/dd204693-57d8-4d8e-afd2-d01235ff028f/settings/environments
we fixed an issue on that page this morning and the envs should now be visible there
We'll get back to you today on this, I've already credited your account, but we'll make sure to cover any extra charges from those two services sticking around.
ok! thanks
Hey @Matheus Santos,
The migration data transfer completed successfully and was just getting caught on a logging issue (very strange and we will look into it). The migration in the UI shows as errored and the service was redeployed. I believe everything looks good but just letting you know
thanks! we've had a look to see what went wrong, please go ahead and delete them.
done, thanks! everything seems fine now. I'll keep monitoring because we're now at 45GB memory and I still think this is too much, but hopefully we'll stay stable while I investigate what's causing that
now that I've updated I see it seems like we're keeping some data of completed jobs, is there a way to clean up completed jobs from queue?
re: "is there a way to clean up completed jobs from queue", that's more so a question for adonis-bull's docs
I've tried but couldn't find anything related there, that's why I'm checking if there's a way to do it through railway 😅
we're also gonna migrate this library in the next few days, but just trying to free some space
railway just provides the database, nothing fancy is going on haha
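(not a Railway-side thing, but for reference, a minimal sketch using the bull v3 API that adonis-bull wraps underneath; the queue name "orders" and REDIS_URL are assumptions, and how you reach the underlying queue from adonis-bull itself depends on the wrapper:)
```ts
// clean-completed.ts - one-off cleanup script run against the queue's Redis, bull v3 API
import Queue from "bull";

async function main() {
  const queue = new Queue("orders", process.env.REDIS_URL ?? "redis://localhost:6379");

  // Remove completed jobs older than 0 ms, i.e. all of them (same call works for "failed")
  const removed = await queue.clean(0, "completed");
  console.log(`removed ${removed.length} completed jobs`);

  await queue.close();
}

main().catch(console.error);
```
going forward, passing removeOnComplete: true in the job options when enqueueing stops completed jobs from piling up in the first place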
ok, thanks!
I have 7.5G on my prod Redis instance, could that be what's causing RAM usage to be this big?
and I think this isn't being used
I want to clean it but I'm afraid the instance will crash while deleting because it's a lot of data
hello! looks like we ran into the same issue again today. I see some logs from about ~3 hours ago in which it looks like Redis was down for a minute, then it got back up and some of my jobs were running but others were not. I was going to start debugging my API locally pointing to Redis PROD (and all my other services), and as soon as I started my application, all of these jobs that should have run about 3 hours ago started automatically running locally, as if they were "waiting" to enter the queue. Even though I did nothing but start my application
we were not close to the memory limit this time
When I noticed this issue a few minutes ago, I checked how many of my orders were affected and there were ~500, now this number is constantly decreasing automatically because of these jobs that were "waiting" to enter the queue
can someone please look into this?
I know the library we're using is outdated and we're working to get that updated, but it just doesn't make sense to me that this is happening again, because this time we have the updated plugin, we were not close to the memory limit, and locally everything runs smoothly, so I think it's unlikely to be something in my application
the Redis has been up for 36 hours from what i see
If some jobs were running and others were not, that's probably something internal in your app, since Railway can't do anything to cause a partial failure inside your app like that. I'd check if your app is able to correctly recover from broken redis connections, etc... and try to add more logging to see why things fail.
In case you identify that this happens because of redis connection failures, you should implement a retry - if you're not using private networking to talk to your redis, consider switching, that'll give you a cheaper and more performant network path between your app and Redis.
Please see: https://docs.railway.app/reference/private-networking#caveats as private networking becomes available shortly after your app starts, so it may need to retry if it can't connect to the redis initially. Would try this out on a staging env before you switchover production.
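(a minimal sketch of the retry side with ioredis, assuming that's the client underneath bull; REDIS_PRIVATE_URL is an assumption, and family: 0 is only needed for the IPv6 private hostname and requires a reasonably recent ioredis:)
```ts
// redis.ts - a connection that tolerates Redis being briefly unreachable (e.g. right after boot
// on the private network) instead of letting jobs fail outright
import Redis from "ioredis";

export const redis = new Redis(process.env.REDIS_PRIVATE_URL ?? "redis://localhost:6379", {
  // Retry the connection with a capped backoff instead of giving up
  retryStrategy: (attempts) => Math.min(attempts * 200, 2000),
  // Don't fail queued commands after a fixed number of retries while reconnecting
  maxRetriesPerRequest: null,
  // Dual-stack DNS lookup so an IPv6-only private hostname resolves (assumption: newer ioredis)
  family: 0,
});

redis.on("error", (err) => console.error("redis connection error:", err.message));
redis.on("reconnecting", () => console.warn("redis reconnecting..."));
```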
I'll make sure to check that, but if it was something in my app, wouldn't the issue be also happening on my local env?
Have you left your local env running for that long?
I mean it only took a couple of minutes for the jobs "waiting" to complete
then I restarted my application in production and everything got back to normal there
same issue again, from 10 minutes ago...
why is it even going down if we're not hitting the memory limit?
are you using the private network like char suggested?
I didn't have a chance to check that yet, but again: We didn't change anything on our side recently, and it works locally.
i am also confident these issues are not caused by railway
but the fact that it's going down... how can my app be causing that?
char made some great suggestions, please look into implementing what was said
i understand how this can seem like an issue on railway's side, but all they are doing is providing you with a regular redis database, and running your code as-is
I will, for sure... it's just that I haven't had time yet to do it, I'm already out of office today (it's 10pm in my timezone). But in the meantime it would be good if someone could take a deeper look into it, because like I said, even if it's something on our side, we haven't changed anything recently. To be honest we never changed the way we queue jobs into redis, so why would it stop working now?
If it's just running the code as-is, shouldn't it be working then? because I don't find any issues locally... and the fact that the instance is going down... I don't see how my app could be making it crash unless it is an OOM issue again, but the metrics show our usage around ~50GB and our limit is 64G now
as odd as this may sound, code can and will behave differently in different environments, with differing amounts of traffic, etc.
i have seen hundreds of cases where code that worked locally doesn't work or doesn't work correctly on railway, and those cases were no fault of railway.
to be honest, i don't know why you are having issues now, and i think it would be extremely hard for any of us to know what's going on with your code to cause these issues, but unfortunately deep diving into a user's codebase when railway is not at fault is not really what we or the team are here for, sorry.
You could try updating your deps, that would eliminate a lot of potential issues.
adonis-bull is 2 years old and depends on bull@^3.22.1, which is 3 years old and depends on ioredis@^4.22.0, which is 2+ years old. If you use Redis outside of the bull package, you might be using ioredis@>=5.
You could also check the github issues for these packages with your error logs. Issues might come from these deprecated packages, and might have been fixed in the latest versions.
hello, I am updating my dependency to use the latest version of BullMQ. However, I suspect that one of the reasons my instance is taking that much RAM (~60GB) is because we were not removing our completed jobs. When I upgraded my plugin I noticed that we had ~7.5 GB of data and I see the below amount of keys. It says expires=49 but I'm not sure if it will really expire and clean the data. Am I correct in this hypothesis?
If so, I wonder if it would be possible to deploy a fresh new redis prod instance, I think that would be simpler than cleaning my current instance
hey, would you be able to help with that? ⬆️
you can create as many Redis nodes as you like
https://docs.railway.app/guides/redis
we can't really help with application logic
but yeah, that reads as only 49/5468871 keys have expiry set
but please - if it's a platform issue, email support, and if it's help about your application (like this), maybe a new help thread will be better than re-raising this topic.
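(since the BullMQ migration is underway anyway, a minimal sketch of keeping finished jobs from piling up; the queue name "orders" and REDIS_URL are assumptions, and given only 49 of ~5.5M keys carry a TTL, the existing completed-job keys won't expire on their own and would still need an explicit clean-up:)
```ts
// queue.ts - BullMQ queue that trims finished jobs so they don't accumulate in Redis
import { Queue } from "bullmq";
import IORedis from "ioredis";

const connection = new IORedis(process.env.REDIS_URL ?? "redis://localhost:6379", {
  maxRetriesPerRequest: null, // BullMQ expects this
});

export const ordersQueue = new Queue("orders", {
  connection,
  defaultJobOptions: {
    removeOnComplete: 1000, // keep only the last 1000 completed jobs
    removeOnFail: 5000,     // keep more failed jobs around for debugging
  },
});

// One-off cleanup of jobs already sitting in Redis (grace 0 ms, up to 1M jobs per call)
export async function purgeOldJobs() {
  await ordersQueue.clean(0, 1_000_000, "completed");
  await ordersQueue.clean(0, 1_000_000, "failed");
}
```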