Random drops / 3 sec response time
This issue has been going on for weeks now: at random intervals, the response time for pages goes from around 0.15 seconds to 3 seconds, causing huge slowdowns and drops in traffic. My own nginx request logs show exactly the same as Cloudflare's panel. The issue appears to be at L7 on CF's side.
I have spent days debugging, and in short this is what I found out.
1) It happens to all my sites across Cloudflare, across all datacenters in the Netherlands.
2) I tried various things such as disabling tiered caching and moving firewall rules, with no change. There are also no changes in the WAF logs, and no DDoS or anything similar going on.
3) Bypassing Cloudflare and going directly to the server solves all issues during the dips.
4) This is not an issue with origin.
5) I have made a ticket #3193135 but it's been EIGHT DAYS with NO RESPONSE.
68 Replies
You're using CloudFlare as a proxy, they have a set number of IPs that act as "shields" for your domain. It's possible one is being hit and/or not operating as fast due to high traffic or other issues. 3 seconds is not long if the average response is faster than that.
Your direct host could serve it in fewer milliseconds for sure, but understand you're going through a proxy that's designed to firewall and protect you, and some nodes Cloudflare is sending your way could have added latency for many reasons.
An average pageload going from 0.08 seconds to 3.22 seconds, causing traffic to drop by about 50% for as long as it's going on, and this happening up to 3 times a day, is a very big deal
Well how is the server looking? High CPU? High Network?
read OP
Yes
I mean during these slow-downs, is the 3s lasting long?
As in, does this 3s latency for full loads last for minutes->hours?
Yes, it's like 5 to 15 min at a time where pageloads are 3 seconds
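For anyone trying to pin down episodes like these from probe data, here is a rough sketch of grouping slow samples into start/end windows so they can be lined up against other graphs. The function name, the 1 second threshold and the 30 second gap are all assumptions made up for illustration, not anything from Cloudflare or from the setup described in this thread.

```python
from typing import Iterable, List, Tuple


def slow_episodes(
    samples: Iterable[Tuple[float, float]],
    threshold: float = 1.0,   # assumed: anything above 1 s counts as slow
    max_gap: float = 30.0,    # assumed: a 30 s hole in the data splits an episode
) -> List[Tuple[float, float]]:
    """Group consecutive slow samples into (start, end) episodes.

    `samples` is an iterable of (unix_time, response_seconds), sorted by time.
    """
    episodes: List[Tuple[float, float]] = []
    start = last = None
    for ts, seconds in samples:
        if seconds > threshold:
            if start is None:
                start = ts                      # first slow sample opens an episode
            elif ts - last > max_gap:
                episodes.append((start, last))  # too long since last slow sample
                start = ts
            last = ts
        elif start is not None:
            episodes.append((start, last))      # a fast sample closes the episode
            start = last = None
    if start is not None:
        episodes.append((start, last))          # data ended while still slow
    return episodes
```

Feeding it timestamps and timings from a probe that runs every few seconds gives a short list of windows that can be compared with the timestamps on other graphs.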
During these stretches are you monitoring your traffic, cpu and all sorts to ensure it's no CPU overhead?
It's not origin
Have you logged it with htop and other metrics?
Origin may be getting smoked if it's all cloudflare entries
A single cloudflare entry can be delayed, but by how you describe it,
it sounds like it's 3s on ALL the cloudflare nodes users come through
I have smokeping, uptime monitoring, constant curl requests going both direct to server and through cloudflare
Do you know the results? Could you post them?
Stuff like this:
No, because it will expose my ips and infrastructure
Well there's mine, it's public facing anyways.
You want to determine if the origin server is having issues if, say, all the users CF is pointing to your network
are having 3s delays.
A single CF pointing user could have that 3s delay, but if all users are getting it, your server is busy processing or something...
I'd say monitor your server if you got root access, run htop and maybe CBM.
I think you misunderstand the scale here, i've got 14 servers, most of them running 64 core epycs, and it's happening to every datacenter i run at in my country at the same time. The drop causes all cpu and network traffic to decrease as it's being choked by cloudflare. Like i said in OP, bypassing CF shows no increase in pageload times
MTR for traces
Well damn if it's that saucy, you may have a cleaner idea on what's happening.
this isn't a home site, if you look at the OP the number of requests is in the billions
Oh I can dig it.
So is the 3sec drop for every cloudflare IP?
Have you logged/analysed that?
but i also have smaller sites with CF, free plan on those and they also have the exact same issue at the same time. Even servers that don't even run in the same datacenter
No needing to flex here, I'm helpful regardless.
Yes, i'm running a curl loop with a 5 second delay. One to CF > my site and one directly to my site, and only the one through cloudflare is having issues
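A minimal sketch of that kind of side-by-side probe, written in Python rather than as the curl loop actually used here. The labels `via_cf` and `direct`, the function names and the 1 second threshold are placeholders invented for this example, and the URLs you pass in are whatever your own proxied and direct endpoints happen to be.

```python
import time
import urllib.request
from typing import Callable, Dict


def timed_get(url: str, timeout: float = 10.0) -> float:
    """Fetch `url` completely and return the elapsed wall-clock seconds."""
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        resp.read()
    return time.monotonic() - start


def probe_once(
    targets: Dict[str, str],
    fetch: Callable[[str], float] = timed_get,
) -> Dict[str, float]:
    """Time one request per target; network failures are recorded as infinity."""
    results: Dict[str, float] = {}
    for label, url in targets.items():
        try:
            results[label] = fetch(url)
        except OSError:  # covers URLError, HTTPError and socket timeouts
            results[label] = float("inf")
    return results


def proxy_only_slow(sample: Dict[str, float], threshold: float = 1.0) -> bool:
    """True when the proxied path is slow while the direct path is not."""
    return sample["via_cf"] > threshold and sample["direct"] <= threshold


def run(targets: Dict[str, str], rounds: int, interval: float = 5.0) -> None:
    """Probe every `interval` seconds and print one line per round."""
    for _ in range(rounds):
        sample = probe_once(targets)
        flag = "PROXY-ONLY SLOW" if proxy_only_slow(sample) else ""
        print(time.strftime("%Y-%m-%d %H:%M:%S"), sample, flag)
        time.sleep(interval)
```

Calling `run({"via_cf": <proxied URL>, "direct": <direct URL>}, rounds=720)` would cover roughly an hour at the 5 second interval mentioned above. Lines flagged `PROXY-ONLY SLOW` are the ones where only the path through the proxy degraded.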
Try logging the results when you're getting these 3s delays, if it's some cloudflare IPs that's fine, if it's every cloudflare connection you may need to worry.
That doesn't help, you want a more dynamic approach.
I can't log per cloudflare ip to see which nodes at CF have the issues, i don't have that data or a way to route traffic that way
No but you know if it's all users having 3s delay.
If it's a single or few nodes delayed that's AOK!
If all nodes 3s delay you got a server problem and need to aduit.
audit*
I could probe from outside my country to see, that might take a different route, but that's as far as i can take it. But i have already debugged the issue to the point where i can 100% say that it's not an origin issue anymore
Well of course you never want it to be an origin issue or everyone that's using that server goes offline...
That's why I say if it's 3s for every CF client, then you got server issues
if it's the odd connection that's fine.
You need to confirm with audits, run htop and track the server when it's at load and see if CPU is cranking 100% and check CBM for if network is cranked.
Well CF could say that, for example, the server has issues, or the datacenter isps have issues, and that this is why it fails, but none of that is the case. Because the connection doesn't flat out fail but just takes really long, there is also nothing in cf error analytics or any logs on origin that tells me anything
CF is not jesus/god now.
They won't turn coke into rum.
If origin has issues, it would show in the logs. Or munin would report high cpu, or smokeping (that runs from a different datacenter) would report high latency
You need to do better audits of your origin server if you have root access and compare to these outages, and if the outage affects all users
but in case primary server fails, or blows up it would automatically switchover to failover systems
CF will reflect but their proxy is designed to take the load off your server by caching and serving tons of data for you,
How do you know though
I haven't seen a result, I have these anytime I'm having outages or delays.
how do i know what?
You have no auditing to prove anything, just your word.
You puzzle us.
I'm still unsure if everyone has 3s delays or if it's specific users(nodes).
No audit/analytics it's kind of a who knows situation.
With crazy services no monitoring?
I literally can not test it, because it's happening somewhere on cloudflare's servers and CF analytics do not report per-node issues
You can test trace-routes.
Your server is a sending point, if it's on maintenance mode it can send pings towards every safe point to tell you what's up.
What are you talking about, everything is monitored. CPU, mem, interrupts, sql queries, disk stats etc
test any IP to determine, AKA get you host compensation.
So when there was outage, did it affect all users, or a range?
.
there are timestamps on all CF analytics data, because the drops are visible on all of them
Traffic drop 50%
and it's happening up to 3 times a day, at random intervals
with the add of
You're just fluffing me right?
Yeah I think you need to do a bit more auditing and analytic work with your servers to know better, this topic won't help you.
please stop trolling
Afraid I've been trolled, anyways good luck I hope you sort it out.
This problem exists in my country too, it is not a problem related to your country. Some cf-ray ids can cause this problem. If you have 100 users on your site, this problem occurs for 10 or 20 people, but the other people do not have this problem because they access it from different ray-id. I've been trying to explain this for days, but everyone was saying that the ISPs of the country I live in were blocking cloudflare servers. No such thing. There is a problem with Cloudflare Ray ID and loading times are increasing. Cloudflare should take this as an issue and work to resolve it. There is no troll here, my friend, this is completely real.
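One way to check whether slowness is concentrated on particular edge locations is to group response times by the data center code at the end of the `cf-ray` response header (the part after the last dash, for example `AMS`). This is only a sketch: the helper names are made up for this example, and a probe from a single vantage point will normally land on just one or two locations, so it cannot prove much on its own.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, Iterable, List, Tuple


def colo_from_ray(cf_ray: str) -> str:
    """Return the data center code after the last '-' in a cf-ray value."""
    _ray_id, sep, colo = cf_ray.strip().rpartition("-")
    return colo.upper() if sep and colo else "unknown"


def median_by_colo(samples: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Map each data center code to the median response time seen through it.

    `samples` is an iterable of (cf_ray_header_value, response_seconds).
    """
    buckets: Dict[str, List[float]] = defaultdict(list)
    for cf_ray, seconds in samples:
        buckets[colo_from_ray(cf_ray)].append(seconds)
    return {colo: median(values) for colo, values in buckets.items()}
```

If the median for one location sits at several seconds while others stay low, that points at specific edge locations rather than at every request.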
The reason I said you were trolling is because of your condescending tone. The only reason I am able to debug accurately when and how it's happening is because I DO have analytics, provided by my munin and monit stats. They keep track of every single thing happening on all servers, including making graphs of web requests/bandwidth/sql requests etc etc. Telling me I should use 'htop' to check my servers, when clearly it's only happening for 15 min per day, is not a viable solution. I talked to some of my IT people and they told me that they don't think you're intentionally trolling, it's just a lack of knowledge. Knowing a little bit, but just not enough, can be really dangerous. I think in the end this was a good learning experience for both of us: you knowing that not everybody hosts their site from their garage, and me for making the mistake of not checking the level of technical knowledge available for this type of issue. I was just really desperate, because of how long this had been going on.
I'm happy to report though that cloudflare fixed it. I will paste the response @rootalien, in case this is also related to your issue
Well, this was an interesting read
Thank you very much for your message, where did you read this?
this was sent to me on my ticket
Can you send me the link of your ticket?
you can't see it, it's not public
hmm ok how about a screenshot?
It's not that I don't believe you, this problem was really a headache, I was on vacation for a few days and couldn't deal with the problem. Even though I used cloudflare workers, I was having problems.
And unfortunately no one took me seriously here
i know how you feel..... as you most likely read this topic :facepalm:
ahahah
I hope the problem has been solved, I will observe this when I return from holiday on Friday and let you know here.
yea i hope this was the cause of your issue too
I hope so, thank you very much.
i can't express in words how frustrating this issue was. You'd think that with multiple sites in multiple datacenters all having the issue at the same time, it'd be easier to get away from blaming origin, but it turns out that wasn't true. Then i upgraded my account to business in the hope i could get faster support, as this was an issue that had been going on for over a month. But then a billing issue prevented my account from being upgraded, even though i paid. Actually it's still not upgraded to this day as the billing issue is STILL ongoing. So i'm stuck in a loop of not being able to get support because PRO support takes forever and my account not being business because of the billing issue.
it's truly a catch-22
I use workers, maybe I pay 100 dollars a month, but I was still limited. As far as I know, Workers is a rival system to cloudfront.net and azureedge.net. However, as we saw, ToS was coming and uploads were increasing significantly. I have said many times that this increases when there are Champions League matches.
yea i don't touch workers, i just use my own hardware for any processing related things
unfortunately the problem persists. I think cloudflare only solved it for your website.
Still an issue?
Yes, when the demand increases, CF loading times reach 20/40 seconds.
cf-cache-status: HIT Unfortunately, the files are like this.
Cloudflare said they did this to reduce abuse, but it was also happening on non-abusive sites. Even though he said he fixed this, unfortunately the situation is still the same.
I even saw a 429 request error recently. It's ridiculous.
all resolved for me, no more gaps or timeouts
nice