Does Cloudflare offer a database solution for Analytics type data?
I'll have a stream of events, potentially TB's / month. Does cloudflare offer any NoSQL/SQL solution that could support saving this amount of data? There's also the need to do some basic querying on it later.
34 Replies
Workers Analytics Engine (beta) · Cloudflare Analytics docs
Workers Analytics Engine is a new way to quickly get analytics about anything, using Cloudflare Workers. Whether you are developing a new software …
Though it does sample at some point
Yea that's pretty much the closest option, based on Clickhouse, but a few things to keep in mind:
- Limited to 91 days retention
- Currently in beta and free at the moment (pricing yet to be announced)
What do you mean by "sample"?
Ah, i got it, it's a way of not recording every single event
Seems a little heavy handed. i already have all of that written, i just need somewhere to push the data
it's not quite as simple as a set amount and then sampled, there's a document here explaining it all if you're interested: https://docs.google.com/document/d/17iThVQ3wr40JlRJ_8Y0qQr2YP1S8Cbn3FujQ5WYLG8E/edit#heading=h.9hfgsp5qfkgp
Google Docs
PUBLIC: Understanding Sampling with Workers Analytics Engine
Understanding Sampling with Workers Analytics Engine Workers Analytics Engine allows you to write an effectively unbounded amount of data and query it back very quickly, at a price that is very low (or free). In order to allow users to write an unbounded amount of data at a bounded cost, we use w...
OK wow, checking out ClickHouse now... there is definitely way more beyond NoSQL/SQL when it comes to this stuff. I need to do some learning
I'm using postgres right now, but it's just incredibly slow to query @ 1 million records (with proper indexing)
Analytics Engine probably isn't what you want then, it's more for use alongside Workers and such. You can only input 25 entries per http requests, not made to have a bunch of stuff shoved into it at once
Look into TimescaleDB then maybe, postgres extension for that kind of data
I have events grouped by "session_id", so i figured a NoSQL solution would be better, since i can just keep the events in the session document
Yeah, Clickhouse is very powerful. It is what powers Cloudflare's Analytics, both for your domains, and for Cloudflare Radar
The thing is events for the same session_id could be separated by years... so I'm not sure if i can take advantage of timescaledb
I'm guessing it gets it's performance boost by having well defined time bounds?
NoSQL databases are not meant for analytics. NoSQL databases are generally very OLTP. You want on OLAP database (clickhouse for example).
Ok excellent. I knew i was missing something. Thanks all
I know Clickhouse would let you partition by session_id or other keys, not sure about for timescaledb
Clickhouse is it's own DB? what are they running underneath, SQL?
own db
The main table engine you use with Clickhouse is MergeTree: https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree (or the various types of it), it stores files in parts. Clickhouse is also a columnar database rather then row based
MergeTree | ClickHouse Docs
The MergeTree engine and other engines of this family (*MergeTree) are the most commonly used and most robust ClickHouse table engines.
"SQL" is not really a database fwiw its just a query language. You do use SQL for clickhouse
ah of course
yea Clickhouse supports postgres wire protocol as well, http interface and a few other ones
This is wild. I think i'll need to set up multiple partition keys, i've got bucket_id for different tenants, visitor_id for different people, and session_id for different sessions
all within the "events" table, which i'd like to grow to infinity
One of the big advantages of Clickhouse is Materialized Views, you can create a view that aggregates by time/tenant/etc, the magic is unlike normal views which are put together when you query them, Materialized Views update on rows being inserted onto the table they are made on (like a trigger could), so cheap to query
ok woah, so they are like indexes... but views?
i wonder what happens if you modify a row that was already materialized, does the entire Materialized View get rebuilt?
eh I think it's easier to just think of them as Views but you do more work on on insert instead of on query
Clickhouse isn't very modification or deletion friendly. If possible, avoid doing that entirely
that makes sense
events should be append only anyways
so i'm good with that
this seems like just what i needed, so glad
Yea you can't do updates or deletes normally in Clickhouse either (although iirc you can enable that now if you wanted to), they're called mutations https://clickhouse.com/docs/en/sql-reference/statements/alter#mutations
ALTER | ClickHouse Docs
Most ALTER TABLE queries modify table settings or data:
One of the things that makes Clickhouse fast is that when you definite a partition key (let's say, by month), it defines how the data is stored into parts (single files) on the disk. When you insert data, it's super quick because it's a new part. In the background Clickhouse merges all of these parts (well, any that can be merged because they're in the same partition key). Deleting/updating/mutating data thus requires rewriting the part and is expensive
So there would be no need to have separate tables per tenant_id, i can simply define a partition key and let Clickhouse do it's magic?
I think that is generally not recommended as you really want to limit those part counts, and instead maybe use materialized views for that
https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/custom-partitioning-key
In most cases you do not need a partition key, and in most other cases you do not need a partition key more granular than by months. You should never use too granular of partitioning. Don't partition your data by client identifiers or names. Instead, make a client identifier or name the first column in the ORDER BY expression.
oh wow, so good
There's a lot of content out there about Clickhouse, this is a good watch:
https://www.youtube.com/watch?v=6WICfakG84c
I explored it a fair bit a while ago and I'm sure I've forgotten quite a few things by now lol
Altinity
YouTube
Secrets of ClickHouse Query Performance
Here's the link to the updated webinar: https://youtu.be/1TGGCIr6dMY
#ClickHouse is famous for speed. That said, you can almost always make it faster! In this webinar (September 2019), Robert Hodges and Altinity Engineering Team use examples to teach you how to deduce what queries are actually doing by reading the system log and system tables. ...
how have i not heard of this? Is this fairly new product? Cloudflare is using this for all of their analytics?
Maybe because it is one of those things where it is a comparatively niche product? Might also be because it was originally built by Yandex
It's been around for a bit. Cloudflare uses it for all the major analytics (dns, http, radar, etc), they have a bunch of blog posts about it, ex:
https://blog.cloudflare.com/http-analytics-for-6m-requests-per-second-using-clickhouse/
The Cloudflare Blog
HTTP Analytics for 6M requests per second using ClickHouse
One of our large scale data infrastructure challenges here at Cloudflare is around providing HTTP traffic analytics to our customers. HTTP Analytics is available to all our customers via two options:
Excellent, time to dive in. Thank you!