Deployments keep failing ❌

Deployments have been failing with no changes on our end.

[error] There was an error fetching the page: connect ETIMEDOUT 18.172.170.120:443
18:35:59.870      at TCPConnectWrap.afterConnect [as oncomplete] (node:net:1278:16)
18:35:59.870    [error]

We have the same build running on Netlify and locally, and both succeed. It only fails on Cloudflare Pages. It looks like it's failing to fetch files (is something blocking the IP?).

Our stack is Nuxt 2 with Node 16, generating fully static pages. URL: https://website-poc.pages.dev/ (this has the last successful build, from 5 days ago).

We can't deploy any updates to the website at the moment, so it's a critical issue. Is anyone else experiencing something similar?

38 Replies

James•2y ago

cc @JohnDotAwesome this might be of interest

JohnDotAwesome•2y ago

👀 @Jorge when you get back online: I notice that IP address belongs to AWS. What CMS is it trying to hit? Storyblok? We started seeing issues with 3 other projects that use Nuxt, Node 16, and make requests to AWS infra starting on October 13th. Did you see any other potential failures from longer than 5 days ago?

@Jorge just a heads up, we're actively trying to resolve this issue and I'll keep you posted. In the event that you need to perform a deployment, do know that you can still deploy via Direct Upload.

I'll be continuing to take a look at this situation in the morning. cc @Nevi | Pages Product Manager @natalier

JorgeOP•2y ago

Thanks @JohnDotAwesome. Yes, that IP address is from Storyblok. Yes, we first started experiencing issues 12 days ago (same October 13th you mentioned), initially just timing out builds with no error msgs:

22:00:38.371    Success: Finished cloning repository files
22:35:42.078    Failed: build exceeded the time limit and was terminated. Refer to https://developers.cloudflare.com/pages/platform/limits/#builds for build limits

It kept failing like that on every deploy for many days, until the error "Error: connect ETIMEDOUT 18.172.170.120:443" started coming up in the logs on October 16th. It then managed 2 or 3 successful deploys over the following days while still failing most of the time with that same error message. The last successful one was the one 5 days ago.

October 13th might be the common date here. I'm unaware of any incidents with AWS or Storyblok around that day. Have you heard anything about what might have triggered the issues? Could this be related to this issue? https://discord.com/channels/595317990191398933/1164972884863754240/1166111453967814656

JohnDotAwesome•2y ago

That's different, but coincidentally started happening around the same time. We've rolled back the changes that were causing that particular case, but these timeouts have persisted. Something we've wanted to try with the other cases is to update node.js. Obviously that's not super ideal, but it's worth a shot.

kane•2y ago

23:06:54.909    ✨ Success! Uploaded 0 files (10 already uploaded) (0.43 sec)
23:06:54.910    
23:06:55.201    ✨ Upload complete!
23:06:59.721    Success: Assets published!
23:07:20.058    Error: Failed to publish your Function. Got error: Unknown internal error occurred.

is something internal broken

budparr•2y ago

I'm seeing this same error, too

Chaika•2y ago

@budparr That error is probably related to this: https://www.cloudflarestatus.com/incidents/s1hkh315y9s9 I would wait until that is resolved and try again

Cloudflare Dashboard and Cloudflare API service issues

budparr•2y ago

Thanks!

kane•2y ago

got it thanks

JorgeOP•2y ago

@JohnDotAwesome thanks. That seems to be an issue with Cloudflare Pages per se though? It's working locally and with Netlify, so it's likely related to the environment/infra the builds are running on (CF). Is there anything we can try to fix it within our build container?

JohnDotAwesome•2y ago

Indeed. We believe there's an issue between Pages CI (which is in Google Cloud) and AWS. Netlify being in AWS does not see this issue. I have found that adding appropriate timeouts and retries to requests solves the issue, but again, that's obviously not ideal. I'll be sharing more details as I get them + code samples for how I've fixed other projects on Pages. We fully intend to get to the bottom of the networking issues between Pages CI<->AWS

JorgeOP•2y ago

That'd be great, thanks @johndotawesome.

JohnDotAwesome•2y ago

Alrighty. Just coming back to this @Jorge - If you can override how http requests are being made during your generate step, you can use a fetch function similar to the one below using the p-retry library:

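// p-retry is a third-party package (npm i p-retry); recent versions are ESM-only.
import pRetry from 'p-retry'

// Keep a reference to the original fetch so this helper still works if
// global.fetch is later replaced by a wrapper that calls fetchWithRetries.
// Note: the built-in fetch needs Node 18+.
const originalFetch = globalThis.fetch

async function fetchWithRetries({
  url,
  retries = 5,
  timeout = 3 * 1000, // per-attempt timeout in milliseconds
  shouldRetry = (res) => !res.ok,
  ...requestInit
}) {
  try {
    return await pRetry(
      async (attemptCount) => {
        const res = await originalFetch(url, {
          ...requestInit,
          // Abort this attempt if it takes longer than `timeout`
          signal: AbortSignal.timeout(timeout),
        })

        // attemptCount starts at 1 and p-retry makes up to retries + 1 attempts,
        // so throw (to trigger a retry) on every attempt except the last one
        if (shouldRetry(res) && attemptCount <= retries) {
          throw new ResponseNotOkayError(url, res)
        }

        return res
      },
      { retries, randomize: true }
    )
  } catch (e) {
    // A timed-out attempt rejects with a DOMException (name "TimeoutError")
    if (e instanceof DOMException) {
      throw new RequestTimedOutError(url, timeout, retries)
    } else {
      throw e
    }
  }
}

class ResponseNotOkayError extends Error {
  constructor(url, res) {
    super(`Request to ${url} was not okay`)
  }
}

class RequestTimedOutError extends Error {
  constructor(url, timeout, retries) {
    super(
      `Request to ${url} timed out ${retries} times each after ${timeout}ms`
    )
  }
}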

JorgeOP•2y ago

Thanks @JohnDotAwesome. Where do you suggest we add this fetch function? We use Nuxt 2 with the Storyblok module [https://github.com/storyblok/storyblok-nuxt-2] to fetch the data.

JohnDotAwesome•2y ago

looks like they recommend overriding the global fetch

JohnDotAwesome•2y ago

You'd want to set it to something like this:

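// Note: fetchWithRetries must call the original fetch (captured before this
// override), not global.fetch, or every request would recurse into this wrapper.
global.fetch = async (input, init) => {
  console.log('input', input)
  console.log('init', init)

  try {
    if (input instanceof URL || typeof input === 'string') {
      return await fetchWithRetries({ url: input, ...init })
    }

    // input is a Request: its fields are getters, so spreading it copies nothing;
    // pass them through explicitly (the body is not forwarded here)
    return await fetchWithRetries({
      url: input.url,
      method: input.method,
      headers: input.headers,
      ...init,
    })
  } catch (e) {
    console.error(e)
    throw e
  }
}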

Perhaps keep the console.logs initially for debugging

JorgeOP•2y ago

Hi @JohnDotAwesome, thanks for your suggestion. I appreciate the workaround you've provided. I attempted to incorporate the custom fetch into that library, but unfortunately it didn't seem to resolve the issue. I will keep trying. While this workaround might temporarily address the problem, it's a significant concern that we hope will receive the necessary attention.

In addition, we're encountering timeouts during our deployments on Pages, specifically when trying to download Node. This issue isn't related to our build process but appears to be a problem within the Pages environment.

08:09:05.339    Installing nodejs 18.18.1
08:09:06.047    Trying to update node-build... ok
08:09:06.302    Downloading node-v18.18.1-linux-x64.tar.gz...
08:09:06.303    -> https://nodejs.org/dist/v18.18.1/node-v18.18.1-linux-x64.tar.gz
08:11:12.875    error: failed to download node-v18.18.1-linux-x64.tar.gz
08:11:12.875    -> https://nodejs.org/dist/v18.18.1/node-v18.18.1-linux-x64.tar.gz
08:15:58.882    error: failed to download node-v18.18.1-linux-x64.tar.gz
08:15:58.882    
08:15:58.907    BUILD FAILED (Ubuntu 22.04 using node-build 4.9.122-28-g4fd6e213)

You can see it attempted to download Node for 6 minutes but eventually timed out and caused the build to fail. Have there been any recent major changes, work, or updates to Pages? Frankly, in light of the incident we experienced yesterday, it seems like things may be becoming unmanageable. What is the latest status on resolving these issues, and when can we expect to have stability and trust restored?

JohnDotAwesome•2y ago

With respect to the node.js download: unless you need that specific version of node, you can specify 18 via the NODE_VERSION env var or version file to use the pre-installed node.js 18 version. The v2 build image ships with pre-installed major versions of nodejs from 14 through 20.

The team is awaiting a response from our cloud provider (GCP) on the timeout issues to Cloudfront. I will update this thread when I know more.

W.r.t. stability issues, I can say that this is an issue we are taking very seriously internally. I don't want to say more than that for now; I'll leave that for senior leadership to say in the upcoming IR.

JorgeOP•2y ago

Right, thanks @JohnDotAwesome appreciate your support on this. Regarding the Node issue, yes we do use NODE_VERSION and version file to set our Node version. It has always worked ok. I raised the issue now because it started to hang on that downloading task.

JohnDotAwesome•2y ago

Totally understand. I'm just saying if you specify only the value 18 you won't have that problem again. We are working internally on Tool Caching (rather than dependency and build output), but that won't be available for a while. In order to solve your problem more immediately, I was suggesting being fuzzy with your node version rather than exact.

JorgeOP•2y ago

Right, got it! That helps, thanks. It doesn't look like it's happening anymore, but it won't hurt to specify just the major version (18) if that helps with stability. 👌

Our main concern remains the Storyblok timeouts during deployments (the GCP -> Cloudfront issue). For now we are running an alternative pipeline with GitHub Actions to manually build and upload the artifacts. This has been very stable, but it's not the ideal scenario for us. Please keep us updated on the response from GCP so that hopefully we can fix this as soon as possible. Thanks @JohnDotAwesome.

Hi @JohnDotAwesome, just touching base to check whether there has been any progress with the Pages cloud provider (GCP) on the timeout issues to Cloudfront? Our deploys within Pages are still failing. Thanks

JohnDotAwesome•2y ago

Hey, Jorge. Indeed. Google has shown us that it's actually Cloudfront dropping packets. We did a session with one of their engineers inspecting packets from GCP->Cloudfront and indeed, Cloudfront is the culprit. This is regional: Cloudfront is presumably only showing this behavior for select points of presence (in particular, the PoP closest to the Pages Build cluster). We're still trying to find the right way to engage with AWS since we're not a customer.

The best advice here still is to upgrade to node v20, which re-uses TLS connections by default.
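
As a quick way to check the connection re-use point above on a given runtime, here is a minimal sketch (not from the thread). It assumes the HTTP client goes through Node's built-in http/https modules, as axios and node-fetch do; `defaultAgentReusesConnections` is a hypothetical helper name.

```javascript
const https = require('https')

// Hypothetical helper: reports whether the default HTTPS agent was created
// with keepAlive enabled (Node.js turned this on by default in v19).
function defaultAgentReusesConnections() {
  return https.globalAgent.options.keepAlive === true
}

console.log(
  `Node ${process.version}: keepAlive by default = ${defaultAgentReusesConnections()}`
)
```

Running this inside the build (for example at the top of the build script) shows whether the runtime itself enables keep-alive; a library that creates its own agent can still override it.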

JorgeOP•2y ago

@JohnDotAwesome Just to update this thread: I've upgraded to Node v20 but that hasn't fixed the issue. What did fix it was to upgrade to latest version of storyblok-js-client and storyblok-nuxt (https://github.com/storyblok/storyblok-js-client & https://github.com/storyblok/storyblok-nuxt). I believe these modules don't use axios anymore and have better ways to handle this fetch issue.

JohnDotAwesome•2y ago

Interesting! The http clients in the previous version may have explicitly not been using keepAlives. I really appreciate the follow-up. This was such a strange debugging saga for us.

alexh•17mo ago

Hello - I'm getting similar issues trying to connect to the Prismic API using a Nuxt 2 app within Cloudflare Pages. I unfortunately can't move beyond Node 16 due to incompatibilities with Nuxt 2 dependencies, and the Prismic client library hasn't been updated in the same way that Storyblok's has. This is really concerning, as multiple clients are currently unable to update their sites. Is there any progress on getting these Cloudfront connection issues sorted? It really isn't ideal. A sample error looks as follows:

14:15:37.209 ERROR request to https://xxxxx.cdn.prismic.io/api/v2 failed, reason: connect ETIMEDOUT 108.138.94.18:443
14:15:37.209
14:15:37.209 at ClientRequest.<anonymous> (node_modules/node-fetch/lib/index.js:1491:11)
14:15:37.209 at ClientRequest.emit (node:events:390:28)
14:15:37.209 at ClientRequest.emit (node:domain:475:12)
14:15:37.209 at TLSSocket.socketErrorListener (node:_http_client:447:9)
14:15:37.209 at TLSSocket.emit (node:events:390:28)
14:15:37.209 at TLSSocket.emit (node:domain:475:12)
14:15:37.209 at emitErrorNT (node:internal/streams/destroy:157:8)
14:15:37.210 at emitErrorCloseNT (node:internal/streams/destroy:122:3)
14:15:37.210 at processTicksAndRejections (node:internal/process/task_queues:83:21)

JohnDotAwesome•17mo ago

Perhaps Prismic has a way to plumb http agents through to node-fetch? https://github.com/node-fetch/node-fetch#custom-agent <-- If node-fetch uses an http agent with keepAlive: true set, then it will work
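
A minimal sketch of the keep-alive agent idea above, assuming node-fetch v2 (which accepts an `agent` option that is either an agent or a function returning one) and a client that lets you pass fetch options through. `selectAgent` and the example URL are hypothetical names, and the node-fetch call is shown only as a comment.

```javascript
const http = require('http')
const https = require('https')

// Agents that keep sockets open so later requests re-use the connection
const httpAgent = new http.Agent({ keepAlive: true })
const httpsAgent = new https.Agent({ keepAlive: true })

// Hypothetical helper: pick the agent that matches the URL's protocol
function selectAgent(parsedUrl) {
  return parsedUrl.protocol === 'http:' ? httpAgent : httpsAgent
}

// Usage with node-fetch v2 (not executed here):
//   const fetch = require('node-fetch')
//   const res = await fetch('https://example.cdn.prismic.io/api/v2', { agent: selectAgent })
```

Whether the CMS client exposes a way to supply this option depends on the library; if it does not, the agent has to be set where the library calls node-fetch.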

alexh•17mo ago

Thanks for this John. I’ll have a look into that! Has there been any progress working with the AWS side with their packet losses?

JohnDotAwesome•17mo ago

Nothing significant to update on right now 😦

Lion7de•17mo ago

Hey @alexh, I have the same issue with the same stack. Would be awesome to know if you found a solution for it!

alexh•17mo ago

I haven’t had time to investigate the suggestion yet but I’ll post in here when I do!

JohnDotAwesome•17mo ago

Y'all could always use something like https://www.npmjs.com/package/patch-package to modify your node_modules to set the keepAlive option. I know that's far from ideal, but it is an option

alexh•17mo ago

Thanks @JohnDotAwesome @Lion7de I've forked the @prismicio/client library to both use the keepAlive option, as well as retry each request 5 times. It does cause build times to be longer, but it appears to work. https://github.com/studiotreble/prismic-client/tree/v5 You can install it by setting the version of the library in package.json to studiotreble/prismic-client#v5.2.4-custom Would be great to get the network issues ironed out though.

JohnDotAwesome•17mo ago

hrmmmm we've found with the keepAlive working properly, none of the requests end up timing out due to dropped packets. I wonder why your situation is ending up with longer builds. Presumably that's because some requests are still timing out. Do you have control over the time to abort the request?

alexh•17mo ago

Yes - keepAlive alone didn't seem to do the trick and I was still getting timeouts.

JohnDotAwesome•17mo ago

Interesting! In that case, I'd make sure those requests timeout in a reasonable time. In some situations, I saw clients waiting 30s before retrying (or just failing). Most of these requests to your CMS should respond in less than 2s and really even that's pushing it

alexh•17mo ago

I actually set it to 30s to be conservative. I could try reducing that. That would be consistent with the longer build times.

JohnDotAwesome•17mo ago

yeah, I'd bring that down quite a bit. I'd recommend 5s, but tweak as needed
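
For Node 16, where the built-in fetch is not available, a short per-attempt timeout with retries can be sketched as below. This is an illustration only, not the code from the fork mentioned above: `withTimeout` and `retryWithTimeout` are hypothetical helpers, and the 5s default simply mirrors the recommendation in this thread.

```javascript
// Hypothetical helper: run task(signal) and give up after `ms` milliseconds.
// The signal lets the task cancel its own work (node-fetch v2 accepts `signal`).
function withTimeout(task, ms) {
  const controller = new AbortController()
  let timer
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => {
      controller.abort()
      reject(new Error(`timed out after ${ms}ms`))
    }, ms)
  })
  const work = Promise.resolve().then(() => task(controller.signal))
  return Promise.race([work, timeout]).finally(() => clearTimeout(timer))
}

// Hypothetical helper: try the task up to `attempts` times, each attempt
// with its own timeout, and rethrow the last error if all attempts fail.
async function retryWithTimeout(task, { attempts = 5, timeoutMs = 5000 } = {}) {
  let lastError
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await withTimeout(task, timeoutMs)
    } catch (err) {
      lastError = err
    }
  }
  throw lastError
}

// Usage with node-fetch v2 (not executed here):
//   const fetch = require('node-fetch')
//   const res = await retryWithTimeout((signal) => fetch(url, { signal }))
```

With a 30s timeout, five failed attempts can add up to 150s per request; at 5s the worst case drops to 25s, which is consistent with the build-time difference discussed above.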

alexh•17mo ago

Okay will tweak, thanks!
