Deployments keep failing
Deployments have been failing with no changes on our end.
We have the same build running on Netlify and locally, both running successfully. It just fails on Cloudflare Pages. It looks like it's failing to fetch files (something is blocking the IP)?
Our stack is Nuxt 2 with Node 16 generating fully static pages. Url: https://website-poc.pages.dev/ (this has the last successful build, 5 days ago)
We can't deploy any updates to the website at the moment, so it's a critical issue.
Anyone else also experiencing something similar?
38 Replies
cc @JohnDotAwesome this might be of interest
@Jorge when you get back online - I notice that IP address is to AWS - what CMS is it trying to hit? Storyblok?
We started seeing issues with 3 other projects that use Nuxt, Node 16, and make requests to AWS infra starting on October 13th. Did you see any other potential failures longer ago than 5 days ago?
@Jorge just a heads up, we're actively trying to resolve this issue and I'll keep you posted. In the event that you need to perform a deployment, do know that you can still perform deployments via Direct Upload
I'll be continuing to take a look at this situation in the morning
cc @Nevi | Pages Product Manager @natalier
Thanks @JohnDotAwesome. Yes, that IP address is from Storyblok.
Yes, we first started experiencing issues 12 days ago (same October 13th you mentioned), initially just timing out builds with no error msgs:
It kept failing like that on every deploy for many days, until that specific error
Error: connect ETIMEDOUT 18.172.170.120:443
started coming up in the logs on October 16th. It then managed to do 2 or 3 successful deploys on the following days while still failing the majority with that same error msg. The last successful one being that one 5 days ago.
October 13th might be the common date here. I'm unaware of any incidents with AWS or Storyblok around that day. Have you heard anything on what might have triggered the issues?
Could this be related to this issue? https://discord.com/channels/595317990191398933/1164972884863754240/1166111453967814656
That's different, but coincidentally started happening around the same time. We've rolled back the changes that were causing that particular case but these timeouts have persisted
Something we've wanted to try with the other cases is to update node.js. Obviously that's not super ideal, but it's worth a shot
is something internal broken
I'm seeing this same error, too
@budparr That error is probably related to this: https://www.cloudflarestatus.com/incidents/s1hkh315y9s9
I would wait until that is resolved and try again
Thanks!
got it thanks
@JohnDotAwesome thanks. That seems to be an issue with Cloudflare Pages per se though? It's working locally and with Netlify, so it's likely related to the environment/infra the builds are running on (CF). Is there anything we can try to fix it within our build container?
Indeed. We believe there's an issue between Pages CI (which is in Google Cloud) and AWS. Netlify being in AWS does not see this issue. I have found that adding appropriate timeouts and retries to requests solves the issue, but again, that's obviously not ideal.
I'll be sharing more details as I get them + code samples for how I've fixed other projects on Pages. We fully intend to get to the bottom of the networking issues between Pages CI<->AWS
That'd be great, thanks @johndotawesome.
Alrighty. Just coming back to this @Jorge - If you can override how http requests are being made during your generate step, you can use a fetch function similar to the one below using the p-retry library:
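The snippet referred to in this message did not survive in the thread. As a rough stand-in, here is a minimal dependency-free sketch of the same idea (a per-attempt timeout plus retries). It does not use p-retry; `fetchWithRetry`, its parameters, and the default values are hypothetical names chosen for illustration, and `fetchImpl` is assumed to be whatever fetch-compatible function the build already uses and to honor the `signal` option.

```javascript
// Hypothetical helper (not from the thread): retry a fetch-like call,
// aborting each attempt that takes longer than `timeoutMs`.
// Assumes `fetchImpl` honors the standard `signal` option.
async function fetchWithRetry(
  fetchImpl,
  url,
  options = {},
  { retries = 5, timeoutMs = 5000 } = {}
) {
  let lastError;
  // One initial attempt plus up to `retries` further attempts.
  for (let attempt = 1; attempt <= retries + 1; attempt++) {
    const controller = new AbortController();
    const timer = setTimeout(() => controller.abort(), timeoutMs);
    try {
      // Note: the timer only covers the wait until `fetchImpl` resolves.
      return await fetchImpl(url, { ...options, signal: controller.signal });
    } catch (error) {
      lastError = error;
      console.log(`attempt ${attempt} for ${url} failed: ${error.message}`);
    } finally {
      clearTimeout(timer);
    }
  }
  throw lastError;
}
```

A call would look like `fetchWithRetry(fetch, url, { headers })`. A library such as p-retry adds backoff between attempts, which this sketch leaves out to stay short.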
Thanks @JohnDotAwesome. Where do you suggest we add this fetch function? We use Nuxt 2 with the Storyblok module [https://github.com/storyblok/storyblok-nuxt-2] to fetch the data.
looks like they recommend overriding the global fetch
You'd want to set it to something like this:
Perhaps keep the console.logs initially for debugging
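The code that accompanied this suggestion is missing from the thread. The following is only a sketch of what overriding the global fetch with logging could look like; `installLoggingFetch` and its arguments are hypothetical, and whether a given version of the Storyblok module actually picks up `globalThis.fetch` is an assumption that should be checked against that module's documentation.

```javascript
// Hypothetical helper (not from the thread): install a logging wrapper
// around `baseFetch` as the global fetch, and return a restore function.
function installLoggingFetch(baseFetch, log = console.log) {
  const previous = globalThis.fetch;
  globalThis.fetch = async (url, options) => {
    const started = Date.now();
    try {
      const response = await baseFetch(url, options);
      log(`fetch ${url} -> ${response.status} in ${Date.now() - started}ms`);
      return response;
    } catch (error) {
      log(`fetch ${url} failed after ${Date.now() - started}ms: ${error.message}`);
      throw error;
    }
  };
  // Calling the returned function puts back whatever was there before.
  return () => {
    globalThis.fetch = previous;
  };
}
```

Passing a fetch that already has timeouts and retries as `baseFetch` would show in the build log which requests stall and for how long. The log calls can be dropped once the build is stable.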
Hi @JohnDotAwesome thanks for your suggestion. I appreciate the workaround solution you've provided. I attempted to incorporate the custom fetch into that library, but unfortunately, it didn't seem to resolve the issue. I will keep trying.
While this workaround might temporarily address the problem, it's a significant concern that we hope will receive the necessary attention.
In addition, we're encountering timeouts during our deployments on Pages, specifically when trying to download Node. This issue isn't related to our build process but appears to be a problem within the Pages environment.
You can see it attempted to download Node for 6 minutes but eventually timed out and caused the build to fail.
Have there been any recent major changes, work, or updates to Pages? Frankly, in light of the incident we experienced yesterday, it seems like things may be becoming unmanageable. What is the latest status on resolving these issues, and when can we expect to have stability and trust restored?
with respect to the node.js download, unless you need that specific version of node, you can specify 18 via the NODE_VERSION env var or version file to use the pre-installed node.js 18 version
The v2 build image ships with pre-installed major versions of nodejs from 14 through 20
The team is awaiting a response from our cloud provider (GCP) on the timeout issues to Cloudfront. I will update this thread when I know more
w.r.t. stability issues, I can say that this is an issue we are taking very seriously internally. I don't want to say more than that for now; I'll leave that for senior leadership to say in the upcoming IR
Right, thanks @JohnDotAwesome appreciate your support on this.
Regarding the Node issue, yes we do use NODE_VERSION and version file to set our Node version. It has always worked ok. I raised the issue now because it started to hang on that downloading task.
Totally understand. I'm just saying if you specify only the value 18 you won't have that problem again.
We are working internally on Tool Caching (rather than dependency and build output), but that won't be available for a while. In order to solve your problem more immediately, I was suggesting being fuzzy with your node version rather than exact
Right, got it! That helps, thanks. It doesn't look like it's happening anymore, but it won't hurt to specify just the major version (18) if that helps with stability.
Our main concern remains the Storyblok timeouts during deployments (GCP -> Cloudfront issue). We are running an alternative pipeline with Github actions to manually build and upload the artifacts for now. This has been very stable but it's not the ideal scenario for us.
Please keep us updated on the response from GCP about this issue and hopefully we can fix that as soon as possible. Thanks @JohnDotAwesome.
Hi @JohnDotAwesome just touching base to check whether there has been any progress with Pages cloud provider (GCP) on the timeout issues to Cloudfront? Our deploys within Pages are still failing. Thanks
Hey, Jorge. Indeed. Google has shown us that it's actually Cloudfront dropping packets. We did a session with one of their engineers inspecting packets from GCP->Cloudfront and indeed, Cloudfront is the culprit.
This is regional: Cloudfront is presumably only showing this behavior for select points of presence (and in particular, the PoP closest to the Pages Build cluster). We're still trying to find the right way to engage with AWS since we're not a customer
The best advice here still is to upgrade to node v20 which re-uses TLS connections by default
@JohnDotAwesome Just to update this thread: I've upgraded to Node v20 but that hasn't fixed the issue. What did fix it was to upgrade to the latest version of storyblok-js-client and storyblok-nuxt (https://github.com/storyblok/storyblok-js-client & https://github.com/storyblok/storyblok-nuxt). I believe these modules don't use axios anymore and have better ways to handle this fetch issue.
Interesting! The http clients in the previous version may have explicitly not been using keepAlives. I really appreciate the follow-up. This was such a strange debugging saga for us
Hello - I'm getting similar issues trying to connect to the Prismic API using a Nuxt 2 app within Cloudflare Pages. I unfortunately can't move beyond Node 16 due to incompatibilities to do with Nuxt 2 dependencies, and the Prismic Client library hasn't been updated in the same way that Storyblok's has. This is really concerning, as multiple clients are currently unable to update their sites. Is there any progress on getting these Cloudfront connection issues sorted? It really isn't ideal.
A sample error looks as follows:
Perhaps Prismic has a way to plumb http agents through to node-fetch? https://github.com/node-fetch/node-fetch#custom-agent <-- If node-fetch uses an http agent with keepAlive: true set, then it will work
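For illustration, a minimal sketch of the keepAlive agent setup described above, following the custom-agent pattern from the node-fetch README. The names `httpAgent`, `httpsAgent`, and `selectAgent` are hypothetical, and how (or whether) the Prismic client lets an `agent` option through to node-fetch is an assumption not confirmed in this thread.

```javascript
const http = require('http');
const https = require('https');

// Agents that keep sockets open so later requests can re-use the
// existing connection instead of opening a new one each time.
const httpAgent = new http.Agent({ keepAlive: true });
const httpsAgent = new https.Agent({ keepAlive: true });

// node-fetch accepts `agent` as a function of the parsed URL,
// which lets one option cover both protocols.
function selectAgent(parsedURL) {
  return parsedURL.protocol === 'http:' ? httpAgent : httpsAgent;
}

// Usage with node-fetch (not executed here; the URL is a placeholder):
// const fetch = require('node-fetch');
// fetch('https://example.invalid/api', { agent: selectAgent });
```

If the client library builds its own request options internally, the agent has to be injected wherever that library calls node-fetch.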
Thanks for this John. I'll have a look into that!
Has there been any progress working with the AWS side with their packet losses?
Nothing significant to update on right now
Hey @alexh,
i do have the same issue with the same stack. Would be awesome to know if you found a solution for it!
I haven't had time to investigate the suggestion yet but I'll post in here when I do!
Y'all could always use something like https://www.npmjs.com/package/patch-package to modify your node_modules to set the keepAlive option. I know that's far from ideal, but it is an option
Thanks @JohnDotAwesome
@Lion7de I've forked the @prismicio/client library to both use the keepAlive option, as well as retry each request 5 times. It does cause build times to be longer, but it appears to work. https://github.com/studiotreble/prismic-client/tree/v5
You can install it by setting the version of the library in package.json to studiotreble/prismic-client#v5.2.4-custom
Would be great to get the network issues ironed out though.
hrmmmm we've found with the keepAlive working properly, none of the requests end up timing out due to dropped packets. I wonder why your situation is ending up with longer builds. Presumably that's because some requests are still timing out. Do you have control over the time to abort the request?
Yes - keepAlive alone didn't seem to do the trick and was still getting timeouts.
Interesting!
In that case, I'd make sure those requests time out in a reasonable time. In some situations, I saw clients waiting 30s before retrying (or just failing). Most of these requests to your CMS should respond in less than 2s and really even that's pushing it
I actually set it to 30s to be conservative. I could try reducing that
That would be consistent with the longer build times
yeah, I'd bring that down quite a bit. I'd recommend 5s, but tweak as needed
Okay will tweak, thanks!