creating a historical database for graph generation
Anyone have opinions / thoughts / experience on the best way to store database history? Like if I was storing the values of a stock ticker and wanted to generate a graph of stock price over time (not my actual use case but similar idea).
Seems like there are a lot of wrong ways to do it.
I'll be accessing the DB with an expressJS server and I haven't chosen a database stack yet
Sounds like a time series database.
Basically you store the information per timestamp, then aggregate it when needed. Depending on the storage needs of your project you could combine certain datapoints into a more condensed entry.
As an example, say you’re recording the temperature outside your house for reasons. You start with an hourly reading. Then two weeks later you condense/combine that data to a daily average (freeing up the hourly recordings). Two months later you combine the daily data into a weekly average.
That reduces your database from 720 hourly entries in a month to 4 weekly entries.
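If it helps, here's a rough sketch of that kind of rollup with node-postgres (since you mentioned Express). The readings_hourly/readings_daily tables and their columns are made up for illustration, and readings_daily is assumed to have a unique constraint on day:
```ts
// Hypothetical rollup: average hourly readings older than two weeks into one
// daily row, then delete the raw hourly rows that were condensed.
// All table/column names here are invented examples.
import { Pool } from "pg";

const pool = new Pool(); // connection details come from PG* env vars

async function rollUpHourlyToDaily(): Promise<void> {
  const client = await pool.connect();
  try {
    await client.query("BEGIN");

    // Condense hourly readings older than 14 days into daily averages.
    await client.query(`
      INSERT INTO readings_daily (day, avg_temp)
      SELECT date_trunc('day', ts) AS day, avg(temp)
      FROM readings_hourly
      WHERE ts < now() - interval '14 days'
      GROUP BY 1
      ON CONFLICT (day) DO UPDATE SET avg_temp = EXCLUDED.avg_temp
    `);

    // Free up the hourly rows that were just summarized.
    await client.query(`
      DELETE FROM readings_hourly
      WHERE ts < now() - interval '14 days'
    `);

    await client.query("COMMIT");
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  } finally {
    client.release();
  }
}
```
You'd run something like that on a schedule (cron, a worker, whatever); the weekly rollup is the same idea one level up.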
And while there are many TSDBs to choose from, any database can be used. The timestamp is your key and the info you're recording is a value. SQL, k/v pair, document, etc. Unless you're actually recording stock prices for a stock trader you can use whatever DB you're most comfortable with
Here's the thing: I can't really compress / average old data because it is still important... But for the most part, individual rows will be exactly the same every hour. Do you know of tech that handles that really nicely, or is this something I need to build out myself?
there will be a 2 week period where rows will update hourly or possibly even every 30 minutes
and then the rest of the time everything will update on a day/week scale
Depending on the data being stored you’ll probably be ok for a while with any database engine.
Is the old data needed immediately or just for historical records? If it's not needed right away you can archive it; most DB engines have a way to "store" old data as a zip file or something that can be kept elsewhere and pulled back when needed.
It is kind of needed for making graphs. I'll probably just start with Postgres and a table for the historical data, but do some shenanigans so that if there's already an entry with the exact same value, I add/update a "validTo" column instead of making new rows to store the same data
That will slow down data uploading a little bit bc I'll have to query the database on upload, and it might slow down data access because I'll have to parse and generate nicely formatted data in the API layer
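Roughly what I'm picturing on the Express side with pg; seat_history and its columns are just placeholder names, not a final schema:
```ts
// Sketch of the "read before write" idea: if the newest row for this class
// already has the same value, just push its valid_to forward; otherwise
// insert a fresh row. Table and column names are invented for illustration.
import { Pool } from "pg";

const pool = new Pool(); // connection details come from PG* env vars

async function recordSeats(classId: string, seats: number): Promise<void> {
  const { rows } = await pool.query(
    `SELECT seats, valid_from FROM seat_history
     WHERE class_id = $1
     ORDER BY valid_from DESC
     LIMIT 1`,
    [classId]
  );

  if (rows.length > 0 && rows[0].seats === seats) {
    // Unchanged since the last scrape: extend the existing row.
    await pool.query(
      `UPDATE seat_history SET valid_to = now()
       WHERE class_id = $1 AND valid_from = $2`,
      [classId, rows[0].valid_from]
    );
  } else {
    // Value changed (or first scrape): open a new span.
    await pool.query(
      `INSERT INTO seat_history (class_id, seats, valid_from, valid_to)
       VALUES ($1, $2, now(), now())`,
      [classId, seats]
    );
  }
}
```
That extra SELECT per scrape is the slowdown I mentioned.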
to give "a while" some perspective: MariaDB on relatively modest hardware has zero issues with tens of millions of rows in a table. Individual rows are found effectively instantly still, and with the right indexes even aggregate queries are still very fast
Honestly, without knowing more of what exactly you're trying to store in the DB the only real advice I can give is: just store the damn data. If/when the DB starts to slow down or get full then worry about optimizing or finding "better" ways to do things.
As Jochem pointed out, you're gonna be good for a long while with GB of data with most DBs out there.
you can have a "cold storage" database
if you use mariadb/mysql, you can have a table with quick data - data that you WILL use this month or so
then, have a table with pre-calculated historical data
then, you can either store the raw historical data in MongoDB or in an archive table
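Something like this split, sketched against Postgres since that's what you're leaning towards (the same idea works in MariaDB); the stock-ticker tables and columns below are just an example borrowed from your original question, not a real schema:
```ts
// One reading of the "hot table + pre-computed table + archive" split:
// three tables, with a periodic job rolling data out of the hot one.
// Every name here is an assumed example.
import { Pool } from "pg";

const pool = new Pool(); // connection details come from PG* env vars

async function createTables(): Promise<void> {
  await pool.query(`
    -- recent rows you actively query ("this month or so")
    CREATE TABLE IF NOT EXISTS prices_hot (
      ticker text        NOT NULL,
      ts     timestamptz NOT NULL,
      price  numeric     NOT NULL,
      PRIMARY KEY (ticker, ts)
    );

    -- pre-calculated summaries used for the older parts of the graph
    CREATE TABLE IF NOT EXISTS prices_daily (
      ticker    text    NOT NULL,
      day       date    NOT NULL,
      avg_price numeric NOT NULL,
      PRIMARY KEY (ticker, day)
    );

    -- raw history kept around "just in case", rarely read
    CREATE TABLE IF NOT EXISTS prices_archive (
      LIKE prices_hot INCLUDING ALL
    );
  `);
}
```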
The historical data will need to be accessed often unfortunately
define "often"
once a month?
and ALL OF IT???
i very strongly doubt that
all of this is premature optimization btw
just build it and worry about performance if you hit the 50M record mark
i mean, if you hit 50 million, you WON'T read 50 million everytime
he's talking about graphs
you DON'T need to read 50 million records to make one
you pre-compute most of it ahead of time, while moving the data to "cold storage"
I'm just saying premature optimization is wasted time
in my opinion, this smells like bad architecture or bad planning
oh, yes, premature optimization is the root of all evil
you may not hit the number of records where performance becomes an issue, and performance bottlenecks are incredibly hard to predict. Just write your software with some quick wins in place, but don't worry about changing your architecture until you need to
it's wasted time every single time
that's a very good point
we also don't know how much data he has to process now and for how many people
there's article after article by tech startups about how they completely failed to predict where their bottlenecks would be
that's true
but we need more context
but the answer is probably "don't worry about it"
Basically it's a site for monitoring class seats during course registration. Data will update often (every 30 minutes or so) for roughly 2000 classes over the period of a week, and then rarely update over the remainder of the year. After the year is over, the historical data will still be displayed on the site in an interactive graph
I'm thinking about starting with postgres, and then by doing a read before every write, I can avoid adding a new row for every scrape when the data updates are infrequent and that should solve the data size issue
2000 classes * 48 updates a day * 365 sounds like a lot of data and I was hoping that an out of the box solution could just work, but maybe this is a fairly unique application?
you're not going to be showing that much data all the time
just pre-compute what you will show
and shove the actual data somewhere else
are you showing aggregate data over classes, or just temporal data over 2000 individual classes?
Also, how realistic is that number of 2000 classes in real world use?
realistic according to the api
temporal data I think, may aggregate later but for now just the graphs
then you have nothing to worry about
put an index on the classid column and you're talking about 17k rows a year per class
yeah, 17k a year is nearly nothing
and you can partition by school year as well, if you want
wdym
which will make the database even faster
you're not going to record 48 entry points for 365 days
if you're just displaying temporal data over a single class, then each class's fetch will only need to access 17k records a year (48*365)
you're going to record 16 points every 200 days
that's a non-issue, even with tens of millions of rows to access
(the 17k rows)
just make sure it's indexed properly, which is probably already going to be automatic because you'll have a foreign key on the class, right?
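For example, with the same made-up seat_history table as before. One caveat: in Postgres a foreign key does not automatically create an index on the referencing column (MySQL/MariaDB does), so add one yourself; the school-year partitioning is optional and the 2024 dates are just an example:
```ts
// Sketch of the indexing / partitioning advice against the hypothetical
// seat_history table: range-partitioned by scrape time, indexed by class
// so a per-class graph query stays fast.
import { Pool } from "pg";

const pool = new Pool(); // connection details come from PG* env vars

async function createHistoryTables(): Promise<void> {
  await pool.query(`
    CREATE TABLE IF NOT EXISTS classes (
      id   text PRIMARY KEY,
      name text NOT NULL
    );

    -- Partitioned by the start of each recorded span, one partition per school year.
    CREATE TABLE IF NOT EXISTS seat_history (
      class_id   text        NOT NULL REFERENCES classes (id),
      seats      integer     NOT NULL,
      valid_from timestamptz NOT NULL,
      valid_to   timestamptz NOT NULL
    ) PARTITION BY RANGE (valid_from);

    CREATE TABLE IF NOT EXISTS seat_history_2024
      PARTITION OF seat_history
      FOR VALUES FROM ('2024-08-01') TO ('2025-08-01');

    -- The foreign key alone doesn't index class_id in Postgres; do it explicitly
    -- so fetching one class's history doesn't scan the whole table.
    CREATE INDEX IF NOT EXISTS seat_history_class_idx
      ON seat_history (class_id, valid_from);
  `);
}
```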
I still need to store 17k records for each class now
so?
the performance issues are on aggregating the data
that's baby's first steps in database world
postgres can store hundreds of millions of rows and not break a sweat
The storage of 35 million rows :Sadge:
that is just for one college
🤷
Ideally this works well and I scale for 10s and 100s of colleges
if you want to store raw data, store somewhere else
are you building this for a known, existing customer base, or just starting out hoping for a customer base?
keyword "existing"
its just a project for fun :)
Not trying to sell anything
then really, really REALLY don't worry about it
then get it to work first
later on, you deal with the supposed 35 million rows
I'll do the optimization I mentioned so I'm not storing millions of rows and don't have to deal with that much data
🤦‍♂️
dude, get something working first
do a quick google on "premature optimization" and read the first five results
you don't know what you will need yet
maybe my amazing idea, the one almost all companies use, won't be such a good idea for you
and something more custom will be better for you
maybe something even less complicated
but for now, you know absolutely nothing about what you will need in the future
that is, to scale to, for example, 10 schools
it isn't super custom, it's literally just not writing a new row if the data hasn't changed since the last scrape
so I'm not storing millions of the exact same row in my db
that's not exactly a solution if you need historically accurate data
also, if you need to process it later on, for some reason, you won't have it
but still, just get something to work
in fact, while working on it, you may have a better idea
wdym
well, if you need the data you didn't save, in 5 months, what will you do?
it only skips saving if the data is unchanged
so the previous data point can be used
and I can increment a validTo date column
basically, it does the thing the browser console does, just have a little (127) at the end of a row to show it's repeated 127 times
Honestly, I would recommend you don't do that. It would complicate your code a lot
it's still premature optimization
disk space is cheap, indexed queries are fast, and I can guarantee that you do not understand database optimization (because very few people outside of RDBMS devs really do)
yeah, but how will you handle spans of time that are ridiculously large, like 3 days of always the same data?
update the validTo date
to the current date
i see it as a very brittle thing that will just overcomplicate everything
this would make your code a nightmare
just dump the data somewhere, like a little dumb piece of code would do
read, save
read, save
read, save
that's it
the processing can come later
as someone who has worked with a bunch of large datasets in mysql/mariadb and postgres: just leave it to the database and worry about it when it breaks
nah, it'll be fine because I'll write some code to create a datapoint at the validFrom and at the validTo dates
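Something like this on the API side, with the same placeholder seat_history table as before:
```ts
// Sketch of expanding compressed (seats, valid_from, valid_to) rows back into
// chart points: one point at each end of the span, as described above.
// Table and column names are still the invented ones from the earlier sketches.
import { Pool } from "pg";

const pool = new Pool(); // connection details come from PG* env vars

interface Point {
  ts: Date;
  seats: number;
}

async function seatGraph(classId: string): Promise<Point[]> {
  const { rows } = await pool.query(
    `SELECT seats, valid_from, valid_to
     FROM seat_history
     WHERE class_id = $1
     ORDER BY valid_from`,
    [classId]
  );

  const points: Point[] = [];
  for (const row of rows) {
    // Start of the span.
    points.push({ ts: row.valid_from, seats: row.seats });
    // End of the span, only if it lasted longer than a single scrape.
    if (row.valid_to > row.valid_from) {
      points.push({ ts: row.valid_to, seats: row.seats });
    }
  }
  return points;
}
```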
you're overcomplicating with no upsides
I can't stop you, but my professional opinion as someone who has actually worked on datasets with millions of rows is that the database devs are smarter than either of us