Pages build times have gone from 20 mins to 35+ mins
Last week we started seeing our Pages build step time increase by about 15 minutes with no code changes to warrant such a jump. This results in the project exceeding the Cloudflare Pages build time limit and failing. I've run tests locally, on Vercel, and on Jenkins as a comparison to validate that claim.
Vercel production build time: 24:38
Local production build time: 22:12
Jenkins production build time: 21:28
Kind of at a loss here, as we are unsure how to resolve this issue. This is becoming very worrisome for my team and me, as we have been unable to deploy through Cloudflare Pages for a week now.
Are you using build caching on your pages project?
I don't believe we are for that project specifically. I think we only enabled it for a different project, but didn't see much of a difference. But I think that may be due to the fact that we are using Nuxt.js as our framework; if I recall correctly I don't think there is support for Nuxt just yet. On the docs it looks like Next, Gatsby, and Astro.
It should still do node_modules caching but yeah I misread Nuxt as Next. Are there certain steps that are taking longer?
It seems like it's the static build when running
nuxt generate
during the Cloudflare Pages build step that started increasing in duration. Here is a deployment ID if that helps at all: 53207dd6-24fb-437b-8190-c9772a48afc7
@elijah y'all use Nacelle, right?
Yes sir! We are beginning to think it is a network issue between Cloudflare and Nacelle; just got out of a couple meetings discussing this topic
This isn't the first Nacelle/Nuxt project I've looked at for being slow today. Something funky is going on there. Another project I looked at had maaaaany timeouts going to Nacelle. Not all build outputs are reporting that
Yep, that's what's been making this so difficult for us to identify. We have run so many different tests and ruled out quite a few options already.
Does Nacelle have an IP allowlist option?
I can ask and get you an answer for that. Would it be possible to do a traceroute so we can know where that timeout is happening?
They already whitelisted all the public IPs on their end, didn't seem to make any impact though
To clarify - Are y'all using direct upload as well? Using wrangler to publish your assets not from Pages CI?
On wrangler we are having no issues come up and we are uploading fine, we are only seeing the issue from Pages CI
Okay cool. Just double checking that I was looking at the right thing 👍
All good, I really appreciate you looking into this. I'll be available if you have anymore questions or need anything else.
This shouldn't make a difference on the timeouts to Nacelle, but you really ought to update your Build System to v2
That's the plan, and we have on our other projects. We just paused any upgrades and toggling caching until we get this issue resolved.
It might help out. The version of Ubuntu is updated. Perhaps something in the networking stack on that older version of Ubuntu is triggering a WAF rule on Nacelle's end. It's a long shot, though
When you say they've allowlisted all public IPs on their end, do they mean Cloudflare colo IPs? Pages Builds unfortunately are not included there
Pages Build IPs are not fixed. They'll be coming from Google Cloud's ASN
I'll talk with my team tomorrow and bring this up in the conversation.
All IPv4 and IPv6 here: https://www.cloudflare.com/ips/
Ah yeah - those IP ranges are edge IPs, not Pages Build IPs
AFAIK, static IPs for Pages Builds are not something we can do, but like I said earlier, Google Cloud's ASN will be where they're coming from
That makes sense. I'll let Nacelle know in case they haven't been told this already.
Thank you very much for the info man, greatly appreciated.
One more question - Do you have an idea when exactly this started occurring for you on Pages?
This started happening either the night of 10/13/23 or early hours of 10/14/23
We are also having trouble accessing our pages deployments in that project; the request that pulls in the paginated deployments 504s when requesting 15 items per page. I could maybe use a script that's been working for us that pulls in 10 deployments per page and get you an exact time if you need that. Just lmk and I'll do some digging
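For reference, the script is basically just walking the deployments endpoint with a smaller page size; something like this (a sketch with placeholder account/project values, and assuming the standard page/per_page query params rather than our exact code):
```js
// Sketch with placeholders, not the exact script: pages through the Pages
// deployments listing endpoint 10 at a time instead of the larger page size
// the dash requests. CF_ACCOUNT_ID, CF_API_TOKEN, and the project name are
// stand-ins for the real values.
const fetch = require('node-fetch')

const ACCOUNT_ID = process.env.CF_ACCOUNT_ID
const API_TOKEN = process.env.CF_API_TOKEN
const PROJECT = 'your-pages-project'

async function listDeployments(page = 1) {
  const url =
    `https://api.cloudflare.com/client/v4/accounts/${ACCOUNT_ID}` +
    `/pages/projects/${PROJECT}/deployments?page=${page}&per_page=10`
  const res = await fetch(url, {
    headers: { Authorization: `Bearer ${API_TOKEN}` },
  })
  const { result } = await res.json()
  return result // each deployment includes created_on, so we can pin down timing
}
```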
This is a silly issue on our end. It's getting fixed very soon
The deployments requests taking forever that is
Update here - We spoke with Nacelle. I'm still leaning towards this being a WAF rule on their end, but I'm going to keep trying to debug from the Pages end. They're going to try an Allow rule just for Pages builds. We'll see if that helps
That's amazing, thank you!
We are running some tests with them right now, keeping them in the loop on results
Truly appreciate your help digging into these issues.
🙏 Likewise! Also, looks like the deployments listing request duration fix won't be out until Monday, unfortunately
I kicked this deployment off via the create deployment API:
2ce23058-84e6-48d0-a571-4a195e6d9f98
in case the logs help you guys at all with the build issue
Monday is fine, I'll probably be the only one tinkering on our side over the weekend and I got myself situated well with the API to where I can get whatever info from the dash I need
One more Deployment ID from Rhone's side: 9affa82b-c3f8-4e4b-8154-559f8f4c1669
Both of these were done after WAF changes on Nacelle's end.
I'm looking to revert y'all to our old build cluster over the next 2-3 hours. I appreciate your patience
🙏 Thank you very much, I really appreciate all your time and help. If you need me for anything on our end I'll be available.
@elijah when you get a chance, could you try more builds? I've got y'all reverted now
You may find that build initialization is a little bit longer at first, but that will clear up after a little bit
Yeah, I'll kick one off right now, sorry for lag. Meetings upon meetings 🤣
np I appreciate it!
b15c0b09-5786-4fce-9cd2-40269f0a3310
Here is the deployment ID, I'll keep you posted on how the build goes. Should be about 20 or so mins
@elijah looks like the build timed out still :/
Ah yeah, looks to have timed out again. You have any recommendations for things we can do/try on our end?
Also had another question regarding the 504s when listing deployments: did that go out yet today, or is it still in the queue?
Let me check on that listing latency issue
That has not gone yet, but I will try and move it along today
I feel like I asked before, but have you tried the v2 image?
Much appreciated, are there any feeds I could watch so I can know when this goes out? Apologies for the possibly stupid question
We have tried it on another repo, but not this one yet. I'll see what I can do on my end about giving that a go
Lemme try reverting some changes in the v1 build system that were made. They were completely unrelated to what we see here, but that would be the only thing remaining
sounds good to me, lmk if/when you want me to kick off another test build. Again, I really appreciate your efforts, you've been nothing but helpful through this issue.
Alrighty. I've got you opted into a reverted v1 image at the moment. Mind giving it another go? As usual with things like this, your initialization time will be slow at first (it might take up to 3min in some cases)
If this does fix the issue, I will be very surprised but also very happy 😂
On it. Deployment ID:
215301bb-b1f7-43ca-b71b-ea4efffcf29a
Initialization time seemed to be pretty quick (7s).
Haha, worth a shot! Thank you very much 🙏
No luck :(, exceeded time limit again.
I see you're giving v2 a go as well. Just a heads up, the build image reversion never took effect because... I'm an idiot lol I did not press the "enable" toggle for you heh
So before I force you into the old version of v1, maybe let's get v2 working for you
Looks like you're having an issue with a private npm package install
I'm having issues getting the thumbs up on switching to v2 😭
alright. lemme force you into that v1 reversion for real this time
Alrighty. Sorry for the back and forth. Your builds should now be reverted to the old v1 image (regardless of what build system your project is selected to use)
Can I ask what's preventing the thumbs-up?
I'll try and kick off another build.
I think it's just that, with the current issues, having to bump the node version is causing some discomfort about moving forward with that right now. On top of that we are supposed to be in a code freeze at the moment.
One thing I am debating though is just cutting a project on my personal CF account and attempting with Build System 2. I just met with the dev who was testing it out and running into those npm issues, which seem to just be 404ing with a specific private package we maintain
FWIW, your node version will stay the same. Ubuntu will be updated. The default node version (that is, if you don't specify one) will update to 18, but you'll still be on 16 since it's expressed as an env var
Aaaah, that sounds promising. I'll bring that up with the wider audience. 🙏
Do you know by chance why the npm install would be having issues installing that private package with the same npm auth key in V1 vs V2?
Lemme see if I can repro
Test build failed with time limit exceeded:
c87b3cc9-d6a6-4731-96fa-eabb02e30e80
Just got a personal project cloned and setup
I just removed that package and where it's used, going to see if I can get the app built 🤞
Hey jumping in the convo as we are in the same boat here. Was the conclusion that the cluster reversion didn't help? Support mentioned our container getting rolled back but I never heard back if it happened for us or not.
Hey Ben, unfortunately we saw no luck with the cluster reversion.
No luck with project clone either, still exceeding build time.
I'm also still seeing the requests timeout on the v2 build system.
Have you been able to get ahold of a network trace for the nacelle requests?
Not really sure how I would pull that out during the build process but happy to try out any suggestions you have. I can catch the request failure and log it, but the caught error just displays the requested resource as below:
Error: [Network] request to https://storefront.api.nacelle.com/graphql/v1/spaces/97747e15-8457-4e3e-b865-f1597f46d1e7?operationName=Navigation&variables=%7B%7D&extensions=%7B%22persistedQuery%22%3A%7B%22sha256Hash%22%3A%228021fb9d1efefe1d02156fb84cf359c35157a97535bac55a55b02a2c0e507a90%22%2C%22version%22%3A1%7D%7D failed, reason: connect ETIMEDOUT 13.224.14.92:443
The actual request is handled in the SDK so I think further logging might have to take place in the SDK's code.
I think you might have to request a network trace from Cloudflare.
I really thought the cluster reversion would solve this since the timing sounded about right for the cluster upgrade and when we both started seeing these issues.
Hey @Ben Harker @elijah - we're escalating this. Some other colleagues and I will be continuing to debug today. Thank you for your help so far
@Ben Harker do you have a
.pages.dev
domain you can share with me for debugging?
rhone-nacelle and rhone-staging. We have a workaround for feature branches but the master builds on rhone-nacelle reflect the issue.
Yeah sorry just realized you're the Ben Harker! I met you the other day
@elijah can we get you to retry a deployment again? My reversion of your build image was faulty. I've just corrected it. @Ben Harker I will put Rhone in the same bucket
On it!
Deployment ID:
107398f9-c8f6-4571-98ff-77fb6b7d7657
@Ben Harker looks like the latest deploy worked for you
47419d61-c48a-44e2-9ebc-9d7560949daa
Still, quite a bit longer than I'd have expected
@JohnDotAwesome that preview build has a workaround in place. Let me ping Ben on Slack to attempt a rebuild on their master branch
(This is Spencer from Nacelle btw)
Yep I cancel the master builds when I see the issue. The feature branches that are working have retry logic in place and that is why they take longer. Master builds haven't added that logic so we can still test for the issue there. The last build was 1ad142928cc72bfc1ad38ad1f85b35adba32b6c4 which was this morning and you'll see the errors in there before I cancelled the build.
👋
Can we try a build that doesn't have the retries? We've forced your project into an older build image just in case that was it
Just read your earlier message, let me try a rebuild now
If this doesn't work, then I'm unclear on what the next steps are. October 13th between 10am-1pm is definitely the inflection point for y'all, but we just don't see this on other APIs. We're going to keep digging for other possible changes, but at this point you're rolled back on everything
Seems to persist. de8c04d8-bac2-485c-9a5e-396bad758974
@Nacelle this is a silly question, but were there any changes made to your infra on October 13th?
exceeded time limit
I understand that you've tested on Vercel and Netlify without issue, but it's still possible a change made could only affect certain ingress
Yeah, just saw 😢 - This is sort of good news because if that did fix the problem I would have no idea why
@Ben Harker your test script you sent me, did you run that on Pages?
No just locally
I see. I tried a similar script in Pages and did not see the error either
I'm chatting with the team now - Y'all tried to lower the request concurrency and still saw the same thing?
I wonder if it's possible for us to fork a repo of y'all's? We'd love to be able to make changes more quickly
Concurrency didn't seem to help, cranking the interval to 1000 ms seemed to help, but then builds timed out. Are you able to trigger builds in the rhone-nacelle env? We have switched prod to a different project while we debug so you wouldn't hurt anything in there.
I could probably add someone in github as well if you want to spool off some config tests in a branch.
Are you able to trigger builds in the rhone-nacelle env?
I am not. Most folks will only have READONLY access to your CF resources. This seems like the way. We could trigger builds via preview branch commits @Nevi | Pages Product Manager
@Ben Harker have you had any luck with preview branches deploying through pages?
I was able to get one successful build in pages, but it was a preview branch and using a script that limits the number of pages we are building.
@JohnDotAwesome If you can provide a github user in the support ticket I can get them added. Just want to have that documented there.
Hey @Nacelle - any chance that the APIs that we're seeing timeouts on are running on Node.js 18 or 20?
Just asked a Nacelle eng: The API is designed for compatibility with Node 18+ first but most merchants are still on 16
@JohnDotAwesome for context I took Rhone's build (Node 16) and upgraded them to Node 18 when I first started testing this. That didn't resolve the issue for us unfortunately
I ask because node 18 and 20 received a security update fixing the HTTP/2 Rapid Reset issue. It released on the same day issues started for y'all
I thought that perhaps some node 16 clients would maybe exhibit some sort of problem connecting to node 18/20 servers. We have another project - not using Nacelle - that's having the same problem on node 16 + Nuxt
They're occasionally seeing timeouts connecting to Storyblok, also hosted on AWS
We also see these timeouts exclusively to AWS so we're thinking that it may also be related to AWS's mitigation of the HTTP/2 Rapid Reset Attack https://aws.amazon.com/security/security-bulletins/AWS-2023-011/
Hey @Ben Harker I believe I've managed to get Rhone preview builds working with timeouts+retries in very reasonable build times. Will share an update later today or tomorrow morning
Thanks, looks similar to the patch we are using on preview builds, but it looks like this times out after 3 sec instead of the default 30 in the other client, which would help build times a lot.
I definitely recommend a lower timeout + more retries over a higher timeout and fewer or no retries. Generally those APIs don't take long to respond 🙂
@JohnDotAwesome I've been testing out a proxy endpoint that's hosted in AWS (
node:lts-alpine
), this endpoint is only being used at build time during the nuxt generate step for reaching out to Nacelle. Some of the request times are a bit longer than we'd like, but with that being said I have yet to see a failed build during my testing so far. Hoping this can at least give us a decent workaround for now so we can get back to using Pages again.
@elijah That's interesting! You may want to try the same setup I provided to Rhone, with retries + timeouts
I can give that a go today. What changes did you make to get those previews working?
Using the
p-retry
package, I provided a new fetch function to Nacelle's SDK. The implementation looks something like this:
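(a sketch of the idea rather than the exact snippet; assumes node-fetch v2 and p-retry v4, and that your SDK version accepts a custom fetch via a fetchClient option)
```js
// Sketch, not the exact code from this thread. Assumes node-fetch v2 and
// p-retry v4 (both CommonJS); the `fetchClient` option name may differ
// depending on your Nacelle SDK version.
const fetch = require('node-fetch')
const pRetry = require('p-retry')

const TIMEOUT_MS = 3000 // fail fast instead of waiting out the default 30s
const RETRIES = 5

// Wrap node-fetch with an AbortController-based timeout
function fetchWithTimeout(url, init = {}) {
  const controller = new AbortController()
  const timer = setTimeout(() => controller.abort(), TIMEOUT_MS)
  return fetch(url, { ...init, signal: controller.signal }).finally(() =>
    clearTimeout(timer)
  )
}

// Retry the whole request whenever it rejects (aborts, ETIMEDOUT, etc.)
function retryingFetch(url, init) {
  return pRetry(() => fetchWithTimeout(url, init), { retries: RETRIES })
}

module.exports = { retryingFetch }
// Then hand it to the SDK, e.g. (hypothetical wiring):
// createNacelleClient({ ...credentials, fetchClient: retryingFetch })
```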
With Ben's setup, they have a way to pass fetchClient
to Nacelle
@elijah if you want, you can add my github username jrf0110
to your repo temporarily and I can take a look. Just let me know
Much appreciated! I can get you set up in the test project I've been working off of. I'll get you that invite, do you want me to invite you to the Pages project as well?
It's under a different account, since that one has been seeing issues with the dashboard/API still.
No worries on the Pages project. I'll just assist you code-wise and I can read your Pages resources via admin tools for debugging if necessary
Sounds good to me, much appreciated!
Hey folks, I just wanted to follow-up with an update. We've been able to reproduce the problem outside of Pages Builds, but inside Google Cloud when connecting to AWS, specifically with node-fetch. We confirmed that this does not happen when calling out to AWS from Digital Ocean. It also doesn't happen when calling from GCP to other providers. We're working with Google Cloud to get the issue resolved.
However, in the meantime, we do hope that you try out the timeout+retries solution I've provided. I think longterm, it's best to have that in place anyway. Let me know if you have any questions
@elijah I'm trying out code changes on your test project now
Sounds good, if you need me for anything I'll definitely be available. I just changed a key in the preview env for that project so it's using the standard endpoint now instead of the proxy endpoint I was testing out
If you don't mind canceling deployment
c646b157-5f80-4541-883b-635c8f9c212b
for me, that'd be awesome. Pushed up not great code
Pushed up more code and it's waiting to build
Canceled, looks like the next one is running
Mind if I ask you an off topic pages question?
Of course
I'll ping you in the other thread I opened up since it wasn't related to this issue specifically
Thanks for the updates John, glad we were able to get things reproduced and isolated.
Heads up, I'm not sure if you pinged me, but I didn't get a notification for it
Was getting late in the day, just pinged you
One more thing here @elijah @Ben Harker @Nacelle - we found that in our tests, upgrading to node20 solves the issue due to http clients using the
keepAlive
option by default, thereby re-using TLS connections. Y'all's builds are making thousands of requests (in Rhone's case, close to 10k), re-negotiating TLS connections each time. Upgrading to node20 means connections will get re-used.
So that's another solution
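For a sense of what that changes under the hood, here's a rough sketch (illustration only, not code from anyone's build): node 20's default agents keep sockets alive, which is roughly the same as doing this explicitly with node-fetch on node 16/18.
```js
// Sketch only: what keep-alive-by-default buys you on node 20, done explicitly
// for node-fetch on older node versions.
const https = require('https')
const fetch = require('node-fetch')

// Re-use TCP/TLS connections across requests instead of re-negotiating each time
const keepAliveAgent = new https.Agent({ keepAlive: true })

function fetchWithKeepAlive(url, init = {}) {
  return fetch(url, { ...init, agent: keepAliveAgent })
}

module.exports = { fetchWithKeepAlive }
```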
I'm going to also have that info posted on the support tickets
Where would these upgrades need to take place? In just the client, server, or both?
This would be when the build takes place. So locally, you'd want to install/use node20 and then
rm -rf node_modules && rm package-lock.json && npm install
to regenerate your package-lock with versions of modules that are node20 compatible. You'll also want to set NODE_VERSION: 20
as an env var on your pages project
Noted! Thank you for the info
Going to test that out in the
bb-test
project
The overall build succeeded and deployed just fine, found some errors but they could be related to something else. Going to dig into that now.
I will test this tomorrow, but also wanted to note that our builds are going through right now without any change. I will check again in the morning but am curious if any other changes took place between the respective parties.
I also did some discovery that should reduce at least a few k of those requests as several are duplicates (info that's the same for every page) and Nuxt re-requests them by default. Moving those requests into a build module and passing the needed info in the route payload avoids that repetition.
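The shape of that change is roughly this, as a sketch (it assumes Nuxt 2's generate.routes payload mechanism rather than our exact module; fetchNavigation and fetchProductPaths are hypothetical stand-ins for the shared Nacelle calls):
```js
// nuxt.config.js — sketch only. fetchNavigation/fetchProductPaths stand in for
// the real Nacelle calls (stubbed here) that currently run once per generated page.
async function fetchNavigation() {
  // real implementation would call Nacelle a single time
  return []
}

async function fetchProductPaths() {
  // real implementation would return every route to generate
  return ['/products/example']
}

export default {
  generate: {
    async routes() {
      // Fetch the shared data once, up front
      const navigation = await fetchNavigation()
      const paths = await fetchProductPaths()
      // Attach it to each route via the payload so pages don't re-request it
      return paths.map(route => ({ route, payload: { navigation } }))
    },
  },
}

// In a page's asyncData, prefer the payload when generating:
// asyncData({ payload }) {
//   if (payload) return { navigation: payload.navigation }
//   // fall back to fetching directly in dev / non-generate contexts
// }
```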
Update as today I am again seeing the issue, so assuming there wasn't a fix here.
@JohnDotAwesome attempting to do the Node20 upgrade but I'm getting some errors from node-gyp. Doing a global install seemed to help locally but I'm still seeing the errors on the Cloudflare builds. I've tried both v1 and v2 build versions and am seeing the same issue. Any suggestions? My builds are in the rhone-staging project if you want to look.
Just double checking you've updated your package-lock.json after upgrading to node20?
yep wiped node_modules and package-lock
and reinstalled
As far as a permanent fix, we're still waiting on GCP unfortunately 😦
This is the same repo you shared with me, right? Mind if I give it a go?
I was actually testing in a different project, but you should be fine to test in the one you were in. I can update the env vars for you just let me know the values you want to test
Yeah, can you set
NODE_VERSION: 20
in preview env for me?
Updated. Sorry just saw this message
kk pushing up
ah jaykay npm wasn't able to resolve all deps
Mmm I'm rhone-nacelle not boll-and-branch fyi
@JohnDotAwesome wanted to thank you for all the help you've provided, I upgraded to Node 20 and builds have been going very well. We are back on pages 🎉
is there any way to stay in the loop on updates for this? Or are you going to update this thread once a permanent fix goes live?
@JohnDotAwesome we had a bad package (fibers). I've removed and am running a build now. From what I can tell we weren't using it.
Build completed in 8 min 👍
Ah shoot sorry. Got my wires crossed!
I'm going to update this thread when we hear back from Google. As you might expect, you can't exactly get an engineer on Discord over there :p
Haha, that's one of the perks of being on Cloudflare. Appreciate the future updates 🙏