Deployments keep failing ❌

Deployments have been failing with no changes on our end.
[error] There was an error fetching the page: connect ETIMEDOUT 18.172.170.120:443
18:35:59.870 at TCPConnectWrap.afterConnect [as oncomplete] (node:net:1278:16)
18:35:59.870 [error]
We have the same build running on Netlify and locally, both running successfully. It only fails on Cloudflare Pages. It looks like it's failing to fetch files (something blocking the IP?). Our stack is Nuxt 2 with Node 16, generating fully static pages. URL: https://website-poc.pages.dev/ (this has the last successful build, from 5 days ago). We can't deploy any updates to the website at the moment, so it's a critical issue. Is anyone else experiencing something similar?
James · 14mo ago
cc @JohnDotAwesome this might be of interest
JohnDotAwesome · 14mo ago
👀 @Jorge when you get back online - I notice that IP address belongs to AWS - what CMS is it trying to hit? Storyblok? We started seeing issues with 3 other projects that use Nuxt, Node 16, and make requests to AWS infra starting on October 13th. Did you see any other potential failures longer ago than 5 days ago?
@Jorge just a heads up, we're actively trying to resolve this issue and I'll keep you posted. In the event that you need to perform a deployment, do know that you can still perform deployments via Direct Upload. I'll be continuing to take a look at this situation in the morning. cc @Nevi | Pages Product Manager @natalier
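(For reference, a Direct Upload from a locally built site can be done with Wrangler; a sketch, assuming the static output lands in dist/ and the Pages project is named website-poc:)
# build locally, then push the artifacts straight to Pages
npx nuxt generate
npx wrangler pages deploy dist --project-name=website-poc
# note: older wrangler versions call this subcommand "pages publish"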
Jorge (OP) · 14mo ago
Thanks @JohnDotAwesome. Yes, that IP address is from Storyblok. And yes, we first started experiencing issues 12 days ago (the same October 13th you mentioned), initially with builds just timing out with no error messages:
22:00:38.371 Success: Finished cloning repository files
22:35:42.078 Failed: build exceeded the time limit and was terminated. Refer to https://developers.cloudflare.com/pages/platform/limits/#builds for build limits
It kept failing like that on every deploy for many days, until that specific error (Error: connect ETIMEDOUT 18.172.170.120:443) started coming up in the logs on October 16th. It then managed 2 or 3 successful deploys on the following days while still failing the majority with that same error message, the last successful one being the one from 5 days ago. October 13th might be the common date here. I'm unaware of any incidents with AWS or Storyblok around that day. Have you heard anything on what might have triggered the issues? Could this be related to this issue? https://discord.com/channels/595317990191398933/1164972884863754240/1166111453967814656
JohnDotAwesome · 14mo ago
That's different, but coincidentally started happening around the same time. We've rolled back the changes that were causing that particular case, but these timeouts have persisted. Something we've wanted to try with the other cases is updating Node.js. Obviously that's not super ideal, but it's worth a shot.
kane · 14mo ago
23:06:54.909 ✨ Success! Uploaded 0 files (10 already uploaded) (0.43 sec)
23:06:54.910
23:06:55.201 ✨ Upload complete!
23:06:59.721 Success: Assets published!
23:07:20.058 Error: Failed to publish your Function. Got error: Unknown internal error occurred.
Is something internal broken?
budparr · 14mo ago
I'm seeing this same error, too
Chaika · 14mo ago
@budparr That error is probably related to this: https://www.cloudflarestatus.com/incidents/s1hkh315y9s9. I would wait until that is resolved and try again.
budparr · 14mo ago
Thanks!
kane · 14mo ago
got it thanks
Jorge (OP) · 14mo ago
@JohnDotAwesome thanks. That seems to be an issue with Cloudflare Pages per se, though? It's working locally and on Netlify, so it's likely related to the environment/infra the builds run on (CF). Is there anything we can try to fix it within our build container?
JohnDotAwesome · 14mo ago
Indeed. We believe there's an issue between Pages CI (which is in Google Cloud) and AWS. Netlify, being in AWS, does not see this issue. I have found that adding appropriate timeouts and retries to requests solves the issue, but again, that's obviously not ideal. I'll be sharing more details as I get them, plus code samples for how I've fixed other projects on Pages. We fully intend to get to the bottom of the networking issues between Pages CI <-> AWS.
Jorge (OP) · 13mo ago
That'd be great, thanks @johndotawesome.
JohnDotAwesome · 13mo ago
Alrighty. Just coming back to this @Jorge - If you can override how http requests are being made during your generate step, you can use a fetch function similar to the one below using the p-retry library:
// assumes the p-retry library is installed, e.g. (ESM):
import pRetry from 'p-retry'

async function fetchWithRetries({
  url,
  retries = 5,
  timeout = 3 * 1000, // per-attempt timeout in milliseconds
  shouldRetry = (res) => !res.ok, // by default, retry any non-2xx response
  ...requestInit
}) {
  try {
    return await pRetry(
      async (attemptCount) => {
        const res = await fetch(url, {
          ...requestInit,
          // abort the attempt if it hangs (AbortSignal.timeout needs Node 16.14+)
          signal: AbortSignal.timeout(timeout),
        })

        // throwing makes p-retry schedule another attempt
        if (shouldRetry(res) && attemptCount < retries) {
          throw new ResponseNotOkayError(url, res)
        }

        return res
      },
      { retries, randomize: true } // randomize: jitter the backoff between attempts
    )
  } catch (e) {
    // AbortSignal.timeout rejects with a DOMException (TimeoutError)
    if (e instanceof DOMException) {
      throw new RequestTimedOutError(url, timeout, retries)
    } else {
      throw e
    }
  }
}

class ResponseNotOkayError extends Error {
  constructor(url, res) {
    super(`Request to ${url} was not okay`)
  }
}

class RequestTimedOutError extends Error {
  constructor(url, timeout, retries) {
    super(
      `Request to ${url} timed out ${retries} times each after ${timeout}ms`
    )
  }
}
Jorge (OP) · 13mo ago
Thanks @JohnDotAwesome. Where do you suggest we add this fetch function? We use Nuxt 2 with the Storyblok module [https://github.com/storyblok/storyblok-nuxt-2] to fetch the data.
JohnDotAwesome · 13mo ago
looks like they recommend overriding the global fetch
JohnDotAwesome · 13mo ago
You'd want to set it to something like this:
global.fetch = async (input, init) => {
  console.log('input', input)
  console.log('init', init)

  try {
    // fetch accepts a URL/string or a Request-like object; normalize both
    // into the { url, ...options } shape fetchWithRetries expects
    if (input instanceof URL || typeof input === 'string') {
      return await fetchWithRetries({ url: input, ...init })
    }

    return await fetchWithRetries({ ...input, ...init })
  } catch (e) {
    console.error(e)
    throw e
  }
}
Perhaps keep the console.logs initially for debugging
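In a Nuxt 2 setup, one place this override could live (a sketch, assuming a hypothetical ./fetch-override.js file containing fetchWithRetries and the global.fetch assignment above) is at the top of nuxt.config.js, so it's installed before nuxt generate starts fetching:
// nuxt.config.js (Nuxt 2)
// './fetch-override' is a hypothetical file holding fetchWithRetries
// and the global.fetch override from the snippets above
require('./fetch-override')

module.exports = {
  target: 'static',
  // ...rest of your existing config
}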
Jorge (OP) · 13mo ago
Hi @JohnDotAwesome, thanks for your suggestion. I appreciate the workaround you've provided. I attempted to incorporate the custom fetch into that library, but unfortunately it didn't seem to resolve the issue. I will keep trying. While this workaround might temporarily address the problem, it's a significant concern that we hope will receive the necessary attention. In addition, we're encountering timeouts during our deployments on Pages, specifically when trying to download Node. This issue isn't related to our build process but appears to be a problem within the Pages environment.
08:09:05.339 Installing nodejs 18.18.1
08:09:06.047 Trying to update node-build... ok
08:09:06.302 Downloading node-v18.18.1-linux-x64.tar.gz...
08:09:06.303 -> https://nodejs.org/dist/v18.18.1/node-v18.18.1-linux-x64.tar.gz
08:11:12.875 error: failed to download node-v18.18.1-linux-x64.tar.gz
08:11:12.875 -> https://nodejs.org/dist/v18.18.1/node-v18.18.1-linux-x64.tar.gz
08:15:58.882 error: failed to download node-v18.18.1-linux-x64.tar.gz
08:15:58.882
08:15:58.907 BUILD FAILED (Ubuntu 22.04 using node-build 4.9.122-28-g4fd6e213)
You can see it attempted to download Node for 6 minutes but eventually timed out and caused the build to fail. Have there been any recent major changes, work, or updates to Pages? Frankly, in light of the incident we experienced yesterday, it seems like things may be becoming unmanageable. What is the latest status on resolving these issues, and when can we expect to have stability and trust restored?
JohnDotAwesome · 13mo ago
With respect to the Node.js download: unless you need that specific version of Node, you can specify just 18 via the NODE_VERSION env var or a version file to use the pre-installed Node.js 18. The v2 build image ships with pre-installed major versions of Node.js from 14 through 20.
The team is awaiting a response from our cloud provider (GCP) on the timeout issues to CloudFront. I will update this thread when I know more. W.r.t. stability issues, I can say that this is an issue we are taking very seriously internally. I don't want to say more than that for now; I'll leave that for senior leadership to say in the upcoming IR.
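For example (assuming the standard Pages build configuration), either of these pins just the major version:
# environment variable in the Pages project's build settings
NODE_VERSION=18

# ...or an .nvmrc / .node-version file at the project root containing only
18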
Jorge (OP) · 13mo ago
Right, thanks @JohnDotAwesome, appreciate your support on this. Regarding the Node issue: yes, we do use NODE_VERSION and a version file to set our Node version. It has always worked fine. I raised the issue now because it started to hang on that download step.
JohnDotAwesome · 13mo ago
Totally understand. I'm just saying that if you specify only the value 18, you won't have that problem again. We are working internally on tool caching (rather than just dependency and build-output caching), but that won't be available for a while. To solve your problem more immediately, I was suggesting being fuzzy with your Node version rather than exact.
Jorge (OP) · 13mo ago
Right, got it! That helps, thanks. It doesn't look like it's happening anymore, but it won't hurt to specify just the major version (18) if that helps with stability. 👌 Our main concern remains the Storyblok timeouts during deployments (the GCP -> CloudFront issue). We are running an alternative pipeline with GitHub Actions to manually build and upload the artifacts for now (see the sketch below). This has been very stable, but it's not the ideal scenario for us. Please keep us updated on the response from GCP about this issue and hopefully we can fix it as soon as possible. Thanks @JohnDotAwesome.
Hi @JohnDotAwesome just touching base to check whether there has been any progress with Pages' cloud provider (GCP) on the timeout issues to CloudFront? Our deploys within Pages are still failing. Thanks
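(For anyone replicating that fallback, a minimal sketch of such a workflow, assuming the cloudflare/pages-action and hypothetical project/secret names:)
name: Deploy to Cloudflare Pages
on: [push]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 18
      # Nuxt 2 static build
      - run: npm ci && npm run generate
      # Direct Upload of the generated artifacts to Pages
      - uses: cloudflare/pages-action@v1
        with:
          apiToken: ${{ secrets.CLOUDFLARE_API_TOKEN }}
          accountId: ${{ secrets.CLOUDFLARE_ACCOUNT_ID }}
          projectName: website-poc   # hypothetical project name
          directory: dist            # Nuxt 2 generate output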
JohnDotAwesome · 13mo ago
Hey, Jorge. Indeed. Google has shown us that it's actually CloudFront dropping packets. We did a session with one of their engineers inspecting packets from GCP -> CloudFront and indeed, CloudFront is the culprit. This is regional: CloudFront is presumably only showing this behavior for select points of presence (in particular, the PoP closest to the Pages build cluster). We're still trying to find the right way to engage with AWS since we're not a customer. The best advice here is still to upgrade to Node v20, which re-uses TLS connections by default.
Jorge (OP) · 13mo ago
@JohnDotAwesome Just to update this thread: I've upgraded to Node v20, but that hasn't fixed the issue. What did fix it was upgrading to the latest versions of storyblok-js-client and storyblok-nuxt (https://github.com/storyblok/storyblok-js-client & https://github.com/storyblok/storyblok-nuxt). I believe these modules no longer use axios and have better ways of handling this fetch issue.
JohnDotAwesome · 13mo ago
Interesting! The HTTP clients in the previous version may have explicitly not been using keepAlive. I really appreciate the follow-up. This was such a strange debugging saga for us.
alexh · 12mo ago
Hello - I'm getting similar issues trying to connect to the Prismic API from a Nuxt 2 app on Cloudflare Pages. I unfortunately can't move beyond Node 16 due to incompatibilities with Nuxt 2 dependencies, and the Prismic client library hasn't been updated in the same way that Storyblok's has. This is really concerning, as multiple clients are currently unable to update their sites. Is there any progress on getting these CloudFront connection issues sorted? It really isn't ideal. A sample error looks as follows:
14:15:37.209 ERROR request to https://xxxxx.cdn.prismic.io/api/v2 failed, reason: connect ETIMEDOUT 108.138.94.18:443
14:15:37.209
14:15:37.209 at ClientRequest.<anonymous> (node_modules/node-fetch/lib/index.js:1491:11)
14:15:37.209 at ClientRequest.emit (node:events:390:28)
14:15:37.209 at ClientRequest.emit (node:domain:475:12)
14:15:37.209 at TLSSocket.socketErrorListener (node:_http_client:447:9)
14:15:37.209 at TLSSocket.emit (node:events:390:28)
14:15:37.209 at TLSSocket.emit (node:domain:475:12)
14:15:37.209 at emitErrorNT (node:internal/streams/destroy:157:8)
14:15:37.210 at emitErrorCloseNT (node:internal/streams/destroy:122:3)
14:15:37.210 at processTicksAndRejections (node:internal/process/task_queues:83:21)
JohnDotAwesome · 12mo ago
Perhaps Prismic has a way to plumb http agents through to node-fetch? https://github.com/node-fetch/node-fetch#custom-agent <-- If node-fetch uses an http agent with keepAlive: true set, then it will work
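For illustration, a minimal sketch of that custom-agent approach (assuming node-fetch v2; this is not Prismic's own plumbing, which would need support in their client):
const fetch = require('node-fetch')
const http = require('http')
const https = require('https')

// keepAlive reuses TCP/TLS connections across requests instead of
// opening (and TLS-handshaking) a fresh one per request
const httpAgent = new http.Agent({ keepAlive: true })
const httpsAgent = new https.Agent({ keepAlive: true })

async function fetchWithKeepAlive(url, init = {}) {
  return fetch(url, {
    ...init,
    // node-fetch accepts a function that picks an agent per parsed URL
    agent: (parsedUrl) =>
      parsedUrl.protocol === 'http:' ? httpAgent : httpsAgent,
  })
}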
alexh · 12mo ago
Thanks for this John. I'll have a look into that! Has there been any progress working with the AWS side on their packet losses?
JohnDotAwesome · 12mo ago
Nothing significant to update on right now 😦
Lion7de · 12mo ago
Hey @alexh, I do have the same issue with the same stack. Would be awesome to know if you find a solution for it!
alexh · 12mo ago
I haven’t had time to investigate the suggestion yet but I’ll post in here when I do!
JohnDotAwesome · 12mo ago
Y'all could always use something like https://www.npmjs.com/package/patch-package to modify your node_modules to set the keepAlive option. I know that's far from ideal, but it is an option
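The rough patch-package workflow would be (a sketch; which file to edit depends on the client library's internals):
# 1. edit the installed module in node_modules to set keepAlive on its agent
# 2. capture your edits as a patch file in patches/
npx patch-package @prismicio/client
# 3. re-apply automatically after every install via package.json:
#    "scripts": { "postinstall": "patch-package" }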
alexh · 12mo ago
Thanks @JohnDotAwesome. @Lion7de I've forked the @prismicio/client library both to use the keepAlive option and to retry each request 5 times. It does make build times longer, but it appears to work. https://github.com/studiotreble/prismic-client/tree/v5 You can install it by setting the version of the library in package.json to studiotreble/prismic-client#v5.2.4-custom. Would be great to get the network issues ironed out, though.
JohnDotAwesome · 12mo ago
Hrmmmm, we've found that with keepAlive working properly, none of the requests end up timing out due to dropped packets. I wonder why your situation is ending up with longer builds. Presumably that's because some requests are still timing out. Do you have control over the time before a request is aborted?
alexh · 12mo ago
Yes - keepAlive alone didn't seem to do the trick; I was still getting timeouts.
JohnDotAwesome · 12mo ago
Interesting! In that case, I'd make sure those requests time out in a reasonable time. In some situations, I saw clients waiting 30s before retrying (or just failing). Most of these requests to your CMS should respond in less than 2s, and really even that's pushing it.
alexh · 12mo ago
I actually set it to 30s to be conservative. I could try reducing that. That would be consistent with the longer build times.
JohnDotAwesome · 12mo ago
yeah, I'd bring that down quite a bit. I'd recommend 5s, but tweak as needed
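(With the fetchWithRetries helper from earlier in the thread, that's just the timeout option, e.g.:)
// inside an async function: 5s per attempt, up to 5 jittered retries
const res = await fetchWithRetries({ url, timeout: 5 * 1000, retries: 5 })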
alexh · 12mo ago
Okay will tweak, thanks!