profiling timings in next-on-pages (CF workers)
we've been adopting next-on-pages recently and it's been working really, really great (kudos to Dario/James/any other contributors I might be missing!)
(on a somewhat similar note to the messages above) I did want to know if either of you had any methods to reliably profile timings in Cloudflare workers – we were surprised to see that our KV reads seemed to be reporting XXX ms in lab conditions, even on cache hits (where in a plain worker we see XX ms). Then we realized it wasn't KV but rather the first I/O we had (and with the CF Workers security measures, we understand we could be measuring things before that I/O that just wait for the I/O to tick?)
1. Is my understanding correct? e.g. `await`s like https://github.com/cloudflare/next-on-pages/blob/main/packages/next-on-pages/templates/_worker.js/index.ts#L27 would not update timings and would just bubble up to our first actual I/O fetch/KV get?
2. Do you have any internal way you've been measuring perf (e.g. is promisifying ALS adding any delays, etc.)?
I'd love to know if there's any way I can take a look and see if I can contribute anything back (or confirm it's just something on our end, like importing non-trivial application code)
Thanks so much!

Thanks for the kind words @juanferreras. I'm glad you're finding next-on-pages useful! ❤️
yes you're right – in our runtime, timers like `Date.now()` and `performance.now()` are effectively disabled for security reasons (Spectre-style timing attacks, I think?) and there is no real way to make them work unfortunately
we have some tools to enable them in special internal cases, but even those are very limited – I don't think any external usage is ever allowed
and yeah generally the time in our runtime is incremented only after I/O operations (e.g. fetches, timeouts, etc...)
so they can't really be relied on
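to make that concrete, here's a sketch of why the bracketed timings mislead (the `handler` and KV stub below are hypothetical illustrations, not next-on-pages code – in the Workers runtime the clock is frozen during synchronous execution and only advances at I/O boundaries):

```javascript
// Illustrative sketch only – `handler` and the KV stub are hypothetical,
// not next-on-pages code.
function heavyCpuWork() {
  // Stand-in for bundle parsing / synchronous app code.
  let x = 0;
  for (let i = 0; i < 1e6; i++) x += i;
  return x;
}

async function handler(kv) {
  const t0 = Date.now();
  heavyCpuWork(); // in Workers, the clock does NOT advance here
  const t1 = Date.now();
  const value = await kv.get("key"); // first real I/O: the clock ticks here
  const t2 = Date.now();
  // In the Workers runtime t1 - t0 would read ~0, while t2 - t1 absorbs
  // both the I/O latency AND any time spent before the first I/O – which
  // is why a fast KV read can appear to take XXX ms. (Run locally under
  // Node, t1 - t0 shows real CPU time instead.)
  return { sync: t1 - t0, io: t2 - t1, value };
}

// Local demo with a stubbed KV namespace that responds after ~30 ms.
const stubKv = {
  get: (key) =>
    new Promise((resolve) => setTimeout(() => resolve("cached:" + key), 30)),
};

handler(stubKv).then((timings) => console.log(timings));
```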
the only way for you to measure how the app is doing on workers is, I'm afraid, to test that externally
like by having a script or something that sends fetches and sees how much they took

for sure! yeah that makes a lot of sense – I've discovered this using WPT. https://www.webpagetest.org/result/240321_BiDcK6_HVH/?test=240321_BiDcK6_HVH&medianMetric=TTFB
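a minimal sketch of such an external measurement script (the `timeIt` helper below is my own, nothing official): it runs an async operation N times and reports min/median wall-clock time. In practice you'd pass `() => fetch(url)`; here a `setTimeout` stub keeps the example self-contained.

```javascript
// Hypothetical external timing harness – measures an async operation from
// the caller's side, where performance.now() ticks normally.
async function timeIt(fn, runs = 5) {
  const samples = [];
  for (let i = 0; i < runs; i++) {
    const start = performance.now();
    await fn();
    samples.push(performance.now() - start);
  }
  samples.sort((a, b) => a - b);
  return {
    samples,
    min: samples[0],
    median: samples[Math.floor(samples.length / 2)],
  };
}

// In practice: await timeIt(() => fetch("https://your-app.pages.dev/"), 10)
// Self-contained demo: a fake ~50 ms request.
const fakeRequest = () => new Promise((resolve) => setTimeout(resolve, 50));

timeIt(fakeRequest, 5).then(({ min, median }) =>
  console.log(`min ${min.toFixed(1)} ms, median ${median.toFixed(1)} ms`)
);
```

(min is usually more telling than median for spotting cold starts vs warm hits.)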
the codebase is the one in the screenshot (but it's a real project, so very likely large bundles being imported). If you check the URL yourself I think you'll see what looks like a KV cache (e.g. a few times XXX ms, then YY ms only). But given this should cache for up to a day, it makes me think there's something else actually delaying the first I/O
that WebPageTest link did 3 runs from a north-east US location (and it's a repeat view, so in theory COLOs in the north-east would have already cached the KV value) – regardless, there was 200-250ms of waiting between the two `Date.now()` calls
in general it makes me think there could be up to ~200ms I can shave off from next-on-pages processing the request – but all the awaits report 0ms, which makes it hard to optimize for. Also, interestingly, I personally do get XXms timings after a few requests (but not as sticky as if this were a KV cache hit either)
I think I'll see if I can patch-package a few bits and get some insights, and/or come up with a minimal reproduction to confirm it isn't the `await import(usercode)`
lol adding that above before my code makes KV act as expected (10-20ms reads) – obviously same/worse TTFB. https://www.webpagetest.org/result/240321_AiDc85_HYA/
this is going to be either extremely fun to understand or extremely embarrassing 😂
----
hm after experimenting more, this feels pretty much like some `Script startup time` – based on the gzipped output of `tar -czf _worker.tar.gz _worker.js/` (not sure this is really representative), the first I/O operation will carry a penalty of:
- `_worker.tar.gz`: 500 kB => 50-60ms
- `_worker.tar.gz`: 1,100 kB => 100-120ms
- `_worker.tar.gz`: 1,500 kB => ~200ms
(not really a penalty but more to say, it'd tick timers by an excess of...)
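for reference, the size check above can be reproduced with something like this (the build-output directory name is an assumption – yours may live elsewhere; the dummy directory just keeps the snippet self-contained):

```shell
# Approximate the deployed bundle size by gzipping the build output.
# A real next-on-pages build would produce a _worker.js/ directory;
# here we create a dummy one so the snippet runs standalone.
mkdir -p _worker.js
echo "export default {};" > _worker.js/index.js

tar -czf _worker.tar.gz _worker.js/
# Print the gzipped size – roughly what was correlated with the
# first-I/O penalty above (500 kB => 50-60ms, etc.)
du -h _worker.tar.gz
```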
then it makes sense that the difference I was seeing wasn't really KV cache hit/miss, but rather worker cold starts or not (which for >1MB are more significant)
it's also why I wasn't able to get a reproduction easily. On the bright side, the actual code in next-on-pages isn't introducing any extra delays – it seemed super efficient 👏
(Edit: I ended up creating a simple repro and reporting it https://github.com/cloudflare/next-on-pages/issues/722)