profiling timings in next-on-pages (CF workers)

we've been adopting next-on-pages recently and it's been working really, really great (kudos to Dario/James/any other contributors I might be missing!) (on a somewhat similar note to the messages above) I did want to know if either of you had any methods to reliably profile timings in Cloudflare workers – we were surprised to see that:
const t0 = Date.now();
const data = env.KV.get(key);
const t1 = Date.now();
const t0 = Date.now();
const data = env.KV.get(key);
const t1 = Date.now();
seemed to be reporting XXX ms in lab conditions even on cache hits (where in a plain worker we see XX ms_. Then we realized it wasn't KV but rather the first IO we had (and with the CF Workers security measures, we understand we could be measuring things before that I/O that just wait for the I/O to tick?)
4 Replies
juanferreras
juanferrerasOP10mo ago
1. Is my understanding correct? e.g. awaits like https://github.com/cloudflare/next-on-pages/blob/main/packages/next-on-pages/templates/_worker.js/index.ts#L27 would not update timings and just bubble up to our first actual I/O fetch/kv get? 2. Do you have any internal way you've been measuring perf (e.g. is promisfying ALS adding any delays, etc)? I think if there's any way I can take a look and see if I can contribute anything back (or if it might just be something on our end like importing non-trivial application code) Thanks so much!
No description
Dario
Dario10mo ago
Thanks for the kind words @juanferreras. I'm glad you're finding next-on-pages useful! ❤️ yes you're right in our runtime profiling like Date.now and performance.now are disabled for security reasons (specter and heartbleed I think?) and there is no real way to make them work unfortunately we have some tools to enabled them on special internal cases but even those are very limited, I don't think any external usage is ever allowed and yeah generally the time in our runtime is incremented only after I/O operations (e.g. fetches, timeouts, etc...) so they can't really be relied on the only way for you to measure how the app is doing on workers is, I'm afraid to test that externally like by having a script or something that sends fetches and sees how much they took
juanferreras
juanferrerasOP10mo ago
for sure! yeah that makes a lot of sense – I've discovered this using WPT. https://www.webpagetest.org/result/240321_BiDcK6_HVH/?test=240321_BiDcK6_HVH&medianMetric=TTFB the codebase is the screenshot (but it's a real project so very likely large bundles being imported). If you check the URL yourself I think you'll see what seems like KV cache (e.g. a few times XXX ms, then YY ms only). But given this should cache for up to a day, makes me think there's something else actually delaying the first I/O
No description
juanferreras
juanferrerasOP10mo ago
that WebPageTest link did 3 runs from a north-east US location (and it's a repeat so in theory COLOs in north-east would have already cached the KV value) – regardless there was 200-250ms of waiting between the two Date.now() in general makes me think there could be up to ~200ms I can shave off from next-on-pages processing the request - but all the awaits return 0ms which makes it hard to optimize for. Also it's interesting that personally I do get XXms timings after a few requests (but not as sticky as if this was a KV cache hit either) I think I'll see if I can patch-package a few bits and get any insights and/or come up with a minimal reproduction to confirm it isn't the await import(usercode)
const tA = Date.now();
await Promise.race([
new Promise((resolve) => setTimeout(resolve, 300))
]);
const tB = Date.now();
const tA = Date.now();
await Promise.race([
new Promise((resolve) => setTimeout(resolve, 300))
]);
const tB = Date.now();
lol adding that above before my code makes KV act as expected (10-20ms reads) – obviously same/worse TTFB. https://www.webpagetest.org/result/240321_AiDc85_HYA/ this is going to be either extremely fun to understand or extremely embarrassing 😂 ---- hm after experimenting more this feels pretty much like some Script startup time – based on the gzipped tar -czf _worker.tar.gz _worker.js/ (not sure this is really representative), the first I/O operation will carry a penalty of: - worker.tar.gz: 500 kB => 50-60ms - worker.tar.gz: 1,100 kB => 100-120ms - worker.tar.gz: 1,500 kB => ~200ms (not really a penalty but more to say, it'd tick timers by an excess of...) then it makes sense that the difference I was seeing wasn't really KV cache hit/miss, but rather worker cold starts or not (which for >1MB are more significant) also why I wasn't able to get a reproducion easily. On brighter news, the actual code on next-on-pages isn't introducing any extra delays, seemed super efficient 👏 (Edit: I ended up creating a simple repro and reporting it https://github.com/cloudflare/next-on-pages/issues/722)

Did you find this page helpful?