Tail Workers & Outgoing Request Rate Limits

this channel works!
sam · 2mo ago
right on, thanks!! i'll describe the situation here so we don't spam the top level.

Our tail worker is set up to forward logs and metrics to our telemetry platform, Datadog. Initially, we posted these directly to the Datadog API but faced rate limits from Datadog -- with response bodies that lined up with their expected API shapes. We switched to a new strategy: simply enqueue these logs onto Google Pub/Sub, to be ingested inside our k8s cluster. As a rollout plan, we tried falling back from the rate-limited Datadog API calls to then enqueue onto Google Pub/Sub.

But then almost all of the requests to the publish operation also got immediately rate-limited: 429 status code, no response headers, and no response body. This was particularly strange because the same credentials, queue, and publish endpoint do not face any rate limiting inside our primary non-tail worker (or locally). Messages get published there at at least 3x the rate of this new tail worker use case, and never get a 429 response. Additionally, the fact that there was no response body suggests Google's endpoint likely didn't serve the request at all, since they have a JSON response shape that would have been in the body. It didn't reject the request every time, but it did the vast majority of the time; a small handful would creep through every 20 minutes or so as successful. So when the request was allowed to happen, it succeeded.

To make things even more odd, I tried reversing the flow to enqueue first (without ever hitting Datadog's API directly, avoiding the fallback logic), and that actually worked right away! Seemingly, the occurrence of the Datadog 429 directly caused the fallback POST to Google to fail, but when avoiding the first DD call, it succeeds.
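For concreteness, a minimal TypeScript sketch of the fallback flow described above (the env bindings, token handling, and exact endpoint choices are assumptions for illustration, not the actual code):

```ts
// Sketch of the original fallback flow. Types like TraceItem and ExecutionContext
// come from @cloudflare/workers-types; the Env bindings here are hypothetical.
interface Env {
  DD_API_KEY: string;    // hypothetical Datadog API key binding
  PUBSUB_TOKEN: string;  // hypothetical pre-minted OAuth access token
  PUBSUB_TOPIC: string;  // e.g. "projects/<project>/topics/<topic>"
}

export default {
  async tail(events: TraceItem[], env: Env, ctx: ExecutionContext) {
    const payload = JSON.stringify(events);

    // 1) Post directly to Datadog's logs intake.
    const dd = await fetch("https://http-intake.logs.datadoghq.com/api/v2/logs", {
      method: "POST",
      headers: { "DD-API-KEY": env.DD_API_KEY, "Content-Type": "application/json" },
      body: payload,
    });
    if (dd.ok) return;

    // 2) If Datadog rate-limits us, fall back to enqueueing onto Pub/Sub.
    if (dd.status === 429) {
      await fetch(`https://pubsub.googleapis.com/v1/${env.PUBSUB_TOPIC}:publish`, {
        method: "POST",
        headers: {
          Authorization: `Bearer ${env.PUBSUB_TOKEN}`,
          "Content-Type": "application/json",
        },
        // Pub/Sub expects message data to be base64-encoded.
        body: JSON.stringify({ messages: [{ data: btoa(payload) }] }),
      });
    }
  },
};
```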
sam · 2mo ago
Now we always enqueue to the telemetry pipeline without hitting Datadog directly. It still hits rate limits consistently, but at least the majority of the ~600-800 rps of messages are being enqueued.
(screenshot attached)
sam · 2mo ago
Given what we've observed, this issue seems specific to the tail worker, as other operations using the same credentials and endpoints do not face these limits in our normal worker. Could you help us understand why this might be happening, and whether there's something specific in the tail worker runtime causing these rate limits?

I have seen this doc: https://developers.cloudflare.com/workers/platform/limits/#request, which is what prompted me to start down this line of thinking, since there are scenarios where Cloudflare would be injecting these 429s artificially. Unfortunately, I did not see any of the mentioned events logged in Security > Events. But given that the act of switching which endpoint we tried to hit first (DD vs Google) changed the observed 429 behavior, it seems there must be some rules/limits that I'm not yet understanding.

apologies for the wall of text haha(: just wanted to make sure the full picture was clear, given how odd some of this feels
rohin (OP) · 2mo ago
Who is sending you the 429? Cloudflare? Google? And in the screenshot you shared, what service is responding with "Failed publish message fetch request" -- Datadog?
sam · 2mo ago
I am led to believe it's Cloudflare that's forcing these 429s through some sort of burst rate limit / anti-abuse limit, and the requests never actually make it to Google, but I'm not 100% certain
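One rough way to tell the two cases apart, assuming (as described above) that the suspicious 429s arrive with no headers and an empty body while a genuine Datadog or Pub/Sub 429 carries a JSON error body. This is a heuristic sketch inferred from the observed behavior, not a documented signal:

```ts
// Heuristic only: classify a 429 by whether it carries any headers or body at all.
async function classify429(res: Response): Promise<"upstream" | "possibly-injected"> {
  if (res.status !== 429) throw new Error("expected a 429 response");
  const body = await res.text();
  const hasHeaders = [...res.headers.keys()].length > 0;
  // Upstream services (Datadog, Pub/Sub) return JSON error bodies when they
  // rate-limit; a completely bare 429 suggests the subrequest never left the platform.
  return body.length > 0 || hasHeaders ? "upstream" : "possibly-injected";
}
```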
sam · 2mo ago
The actual request is (intended to be) going to Google. This request is sent from our normal API worker at upwards of ~2k tps without issue, and is getting rate-limited only inside our tail worker, at significantly lower throughput. (Initially, every request to Google was rate-limited, when the Google call was a fallback for the Datadog call. But that changed when we switched the ordering, which shouldn't happen with independent services like that.)

I'm mainly just trying to figure out if there is some sort of anti-abuse/DDoS/burst limit in place on tail workers that is not shared with normal workers, as everything else in our setup is the same, from the creds to the URL and code used.
(screenshot attached)
sam · 2mo ago
hmm interesting. that is good to know regardless! though, I'm not sure that would necessarily explain how the act of avoiding initial requests to Datadog would cause the requests to Google to stop being rate limited 🤔
Walshy · 2mo ago
We have https://developers.cloudflare.com/workers/platform/limits/#burst-rate
If there are a lot of outgoing requests, this may be firing. You should be able to check your zone WAF events - if you see rule ID "worker"... nah, "subrequests"... then that'll be it.
If it is that, let me know your zone ID and I'll raise it.
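If the burst limit on outgoing subrequests is what's firing, one way to reduce the burst, sketched here as an assumption rather than the actual fix, is to collapse each tail invocation into a single Pub/Sub publish call instead of one subrequest per event (Pub/Sub accepts up to 1,000 messages per publish request):

```ts
// Sketch: batch all events from one tail invocation into a single publish subrequest.
// Helper name and parameters are hypothetical.
async function publishBatch(
  events: unknown[],
  topic: string,  // e.g. "projects/<project>/topics/<topic>"
  token: string   // OAuth access token
): Promise<Response> {
  const messages = events.map((e) => ({
    data: btoa(JSON.stringify(e)), // Pub/Sub message payloads are base64-encoded
  }));
  return fetch(`https://pubsub.googleapis.com/v1/${topic}:publish`, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${token}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ messages }),
  });
}
```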
sam · 2mo ago
ah awesome! my org's account id is 056879e63aa83db17aadc76220f52953

out of curiosity, does this burst limit apply equally to both normal workers & tail workers? or does it vary somehow depending on steady-state throughput for each deployment? (iirc tail handlers were still in beta, so idk if they're nerfed in some way for now)

thanks again for taking the time here y'all ❤️
Walshy · 2mo ago
It's a bit complex to explain, but it sounds like that is what's happening. What's the zone though? I need to lift it on the specific zone.
sam · 2mo ago
oh my b, i thought zone was synonymous with org -- one sec, i'll go look again

i think this would be it: 8615a0f8e442523371429211d0a5120e (found by going to "Account Home" > openrouter.ai domain > "Zone ID" on the sidebar)

and in case it makes a difference: our primary worker has routes assigned to it under that openrouter.ai domain, however the tail worker is a different worker that has no fetch handler and no routes/bindings to the openrouter.ai domain. i'll go beef up my understanding of zones in the meantime
Walshy · 2mo ago
Lifted the rate limit - let me know if you continue to see the issue
sam · 2mo ago
thanks! it appears the issue is continuing unfortunately, but i'll check back in a few min. my only thought was: maybe it's because this tail worker is not in that zone? would it be in the workers.dev zone if it doesn't have a fetch handler/any routes attached?
Walshy · 2mo ago
It would be, but we lifted it for that as well. Definitely not GCP rate limiting you?
sam · 2mo ago
best i can tell, no they're not -- all the GCP quota dashboards for this operation show us at like <0.1% of the maximum throughput. the tail worker is only doing ~500 rps to that endpoint, while our prod worker on the openrouter.ai domain does 2-3x that without seeing any 429s

edit: and yeah, just to confirm, adding a fetch handler and assigning a route on our zone to the tail worker did not help, as you expected. i let it sit for 15-20m, but have reverted that now
sam · 2mo ago
requests vs subrequests for the tail worker (the taper i assume is just ingestion delay for your analytics engine, not concerned about that). at a glance, this appears to at least highlight the symptoms
(two screenshots attached)
sam · 2mo ago
zooming in on one version, the spikes & troughs definitely seem to alternate, as if some quota is reset every couple of minutes
(two screenshots attached)
sam · 2mo ago
(also just to be super clear, i don't expect y'all to respond right away to these. it's a community support forum after all ((: ❤️ just posting stuff as i was investigating, not to instigate a response)
Walshy · 2mo ago
Apologies for the late reply - there should be nothing limiting you with the burst limit removed. If you can log the response body, it'd be interesting to see.
As for analytics, it definitely shouldn't taper over such a period. And zooming in, I also wouldn't expect that behaviour.
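A minimal sketch of capturing those response details from inside the worker, assuming the publish response can be handed to a helper like the hypothetical one below:

```ts
// Illustrative only: log the status, headers, and body of a failed publish response.
async function logFailedPublish(res: Response): Promise<void> {
  if (res.ok) return;
  const body = await res.text(); // observed to be empty for the suspicious 429s
  console.log(
    JSON.stringify({
      status: res.status,
      headers: Object.fromEntries(res.headers.entries()),
      body,
    })
  );
}
```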
sam · 2mo ago
np! yeah, the response body is empty unfortunately - no body or headers

> As for analytics, it definitely shouldn't taper over such a period

yeah, the 2h taper must have been ingestion delay on cloudflare's end, since it did not line up with our own request counts, and eventually caught up to reality. today i do not see the same taper on the 6h view -- but the spiky rate-limiting pattern for requests vs subrequests in that view still remains.

is there a particular experiment you think might be helpful for us to run to give you more information? maybe switching to have the tail worker POST to our production worker, and enqueue to GCP there? or avoiding tail workers altogether, just doing our telemetry pushing from inside the main worker, in a ctx.waitUntil?
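A sketch of that last option, pushing telemetry from the main worker inside ctx.waitUntil rather than from a tail worker (the env bindings and payload shape here are hypothetical stand-ins):

```ts
// Sketch: skip the tail worker and enqueue telemetry from the main fetch handler,
// deferring the publish with ctx.waitUntil so it doesn't block the response.
interface Env {
  PUBSUB_TOKEN: string;  // hypothetical OAuth access token
  PUBSUB_TOPIC: string;  // e.g. "projects/<project>/topics/<topic>"
}

export default {
  async fetch(request: Request, env: Env, ctx: ExecutionContext): Promise<Response> {
    const response = new Response("ok"); // stand-in for the real request handling

    // The worker stays alive until this promise settles, without delaying the client.
    ctx.waitUntil(
      fetch(`https://pubsub.googleapis.com/v1/${env.PUBSUB_TOPIC}:publish`, {
        method: "POST",
        headers: {
          Authorization: `Bearer ${env.PUBSUB_TOKEN}`,
          "Content-Type": "application/json",
        },
        body: JSON.stringify({
          messages: [{ data: btoa(JSON.stringify({ url: request.url })) }],
        }),
      })
    );

    return response;
  },
};
```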
