right on, thanks!! I'll describe the situation here so we don't spam the top level
Our tail worker is set up to forward logs and metrics to our telemetry platform, Datadog.
Initially, we POSTed these directly to the Datadog API but hit rate limits from Datadog, with response bodies that lined up with their expected API shapes.
We then started a new strategy: simply enqueue these logs onto Google Pub/Sub, to be ingested inside our k8s cluster. As a rollout plan, we tried falling back from the rate-limited Datadog API calls to enqueueing onto Google Pub/Sub. But then almost all of the requests to the publish operation also got immediately rate-limited: 429 status code, no response headers, and no response body.
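For reference, the original fallback path looked roughly like this (a simplified sketch; the env bindings, auth handling, and endpoint details are stand-ins for our real setup):

```ts
// Rough TypeScript sketch of the original tail worker flow: POST to Datadog
// first, fall back to a Pub/Sub publish on failure. Env bindings, token
// handling, and exact endpoints are placeholders, not our literal code.
interface Env {
  DD_API_KEY: string;
  GCP_PROJECT: string;
  PUBSUB_TOPIC: string;
  GCP_ACCESS_TOKEN: string; // however the OAuth token actually gets minted
}

export default {
  async tail(events: unknown[], env: Env): Promise<void> {
    const payload = JSON.stringify(events);

    // 1) Try Datadog's logs intake directly.
    const dd = await fetch("https://http-intake.logs.datadoghq.com/api/v2/logs", {
      method: "POST",
      headers: { "Content-Type": "application/json", "DD-API-KEY": env.DD_API_KEY },
      body: payload,
    });
    if (dd.ok) return;

    // 2) On a 429 (or any failure), fall back to publishing onto Google Pub/Sub.
    //    This is the call that also came back 429, with no headers and no body.
    const topic = `projects/${env.GCP_PROJECT}/topics/${env.PUBSUB_TOPIC}`;
    const pub = await fetch(`https://pubsub.googleapis.com/v1/${topic}:publish`, {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${env.GCP_ACCESS_TOKEN}`,
      },
      // base64-encode the payload (the real code handles UTF-8 properly)
      body: JSON.stringify({ messages: [{ data: btoa(payload) }] }),
    });
    if (!pub.ok) {
      console.error(`Failed publish message fetch request: ${pub.status}`);
    }
  },
};
```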
This was particularly strange because the same credentials, queue, and publish endpoint face no rate-limiting at all inside our primary non-tail worker (or locally). Messages are published there at at least 3x the rate of this new tail worker use case and never get a 429. Additionally, the lack of a response body suggests Google's endpoint likely never served these requests, since their API has a JSON response shape that would otherwise have appeared in the body.
It didn't reject every request, but it did reject the vast majority; a small handful would creep through successfully every 20 minutes or so. So when a request was actually allowed to happen, it succeeded.
To make things even odder, I tried reversing the flow to enqueue first (without ever hitting Datadog's API directly, skipping the fallback logic), and that worked right away! Seemingly, the occurrence of the Datadog 429 directly caused the fallback POST to Google to fail, but when we avoid that first DD call, it succeeds.
Now we always enqueue to the telemetry pipeline without hitting Datadog directly. It still hits rate limits consistently, but at least the majority of the ~600-800 rps of messages are being enqueued.
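So the flow today is just that second leg on its own, with the Datadog call removed entirely (same stand-in bindings as the sketch above, minus the Datadog key):

```ts
// Current flow: enqueue straight onto Pub/Sub from the tail handler,
// never touching Datadog first. Env is the same stand-in interface as above.
export default {
  async tail(events: unknown[], env: Env): Promise<void> {
    const topic = `projects/${env.GCP_PROJECT}/topics/${env.PUBSUB_TOPIC}`;
    const pub = await fetch(`https://pubsub.googleapis.com/v1/${topic}:publish`, {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${env.GCP_ACCESS_TOKEN}`,
      },
      body: JSON.stringify({ messages: [{ data: btoa(JSON.stringify(events)) }] }),
    });
    // Still sees periodic 429s, but most of the ~600-800 rps gets enqueued.
    if (!pub.ok) console.error(`Failed publish message fetch request: ${pub.status}`);
  },
};
```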
Given what we've observed, this issue seems specific to the tail worker as other operations using the same credentials and endpoints do not face these limits in our normal worker.
Could you help us understand why this might be happening and if there's something specific in the tail worker runtime causing these rate limits?
I have seen this doc: https://developers.cloudflare.com/workers/platform/limits/#request, which is what prompted me to start this line of thinking, since there are scenarios where Cloudflare would be injecting these 429s artificially.
Unfortunately, I did not see any of the mentioned events logged in Security > Events. But given that switching which endpoint we try to hit first (DD vs. Google) changed the observed 429 behavior, it seems there must be some rules/limits that I'm not yet understanding.
---
apologies for the wall of text haha(: just wanted to make sure the full picture was clear, given how odd some of this feels
Who is sending you the 429? Cloudflare? Google?
And in the screenshot you shared, what service is responding with "Failed publish message fetch request" -- Datadog?
I'm led to believe it's Cloudflare that's forcing these 429s through some sort of burst rate limit / anti-abuse limit, and the requests never actually make it to Google, but I'm not 100% certain.
The actual request is (intended to be) going to Google. The same request is sent from our normal API worker at upwards of ~2k tps without issue, and is getting rate-limited only inside our tail worker, at significantly lower throughput.
(Initially, every request to Google was rate-limited when the Google call was a fallback for the Datadog call. That changed when we switched the ordering, which shouldn't happen with independent services like that.)
I'm mainly just trying to figure out whether there is some sort of anti-abuse/DDoS/burst limit in place on Tail Workers that isn't shared with normal Workers, since everything else in our setup is the same, from the credentials to the URL and code used.