creating a historical database for graph generation

Anyone have opinions / thoughts / experience on the best way to store database history? Like if I was storing the values of a stock ticker and wanted to generate a graph of stock price over time (not my actual use case, but a similar idea). Seems like there are a lot of wrong ways to do it. I'll be accessing the DB with an expressJS server and I haven't chosen a database stack yet.
59 Replies
13eck
13eck4mo ago
Sounds like a time series database. Basically you store the information per time, then aggregate it when needed. Depending on the storage needs of your project you could combine certain datapoints into a more condensed entry. As an example, say you're recording the temperature outside your house for reasons. You start with an hourly reading. Two weeks later you condense/combine that data into a daily average (freeing up the hourly recordings). Two months later you combine the daily data into a weekly average. That reduces your database from 720 hourly entries in a month to 4 weekly entries.
And while there are many TSDBs to choose from, any database can be used. The timestamp is your key and the info you're recording is the value: SQL, k/v pair, document, etc. Unless you're actually recording stock prices for a stock trader, you can use whatever db you're most comfortable with.
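A minimal sketch of the condense-and-aggregate idea 13eck describes, assuming Postgres with the node-postgres (pg) client; the readings_hourly and readings_daily tables and their columns are hypothetical:

```ts
import { Pool } from "pg";

const pool = new Pool(); // connection settings come from the usual PG* env vars

// Roll hourly readings older than two weeks up into daily averages,
// then drop the raw hourly rows they came from.
async function condenseOldReadings(): Promise<void> {
  const client = await pool.connect();
  try {
    await client.query("BEGIN");
    await client.query(`
      INSERT INTO readings_daily (day, avg_value)
      SELECT date_trunc('day', recorded_at) AS day, avg(value)
      FROM readings_hourly
      WHERE recorded_at < now() - interval '14 days'
      GROUP BY 1
    `);
    await client.query(`
      DELETE FROM readings_hourly
      WHERE recorded_at < now() - interval '14 days'
    `);
    await client.query("COMMIT");
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  } finally {
    client.release();
  }
}
```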
Shane
Shane4mo ago
Here's the thing: I can't really compress / average old data because it is still important... But for the most part, individual rows will be exactly the same every hour. Do you know of tech that handles that really nicely, or is this something I need to build out myself? There will be a 2 week period where rows will update hourly or possibly even every 30 minutes, and then the rest of the time everything will update on a day/week scale.
13eck
13eck4mo ago
Depending on the data being stored you'll probably be OK for a while with any database engine. Is the old data needed immediately or just for historical records? If it's not needed right away you can archive it: most DB engines have a way to "store" old data as a zip file or something that can be kept elsewhere and pulled back in when needed.
Shane
Shane4mo ago
It is kind of needed for making graphs. I'll probably just start with postgres and make a table for the historical data, but do some shenanigans so that if there is already a database entry with the exact same value, I'll add/update a "validTo" column instead of making new rows that store the same data. That will slow down data uploading a little bit bc I'll have to query the database on upload, and it might slow down data access because I'll have to parse and generate nicely formatted data in the API layer.
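A rough sketch of that read-before-write idea, using the stock-ticker framing from the original question; it assumes Postgres with the pg client and a hypothetical price_history table (ticker_id, value, valid_from, valid_to):

```ts
import { Pool } from "pg";

const pool = new Pool();

// If the latest row already holds this exact value, just stretch its
// valid_to forward; otherwise open a new row. Illustrative only.
async function recordValue(tickerId: number, value: number): Promise<void> {
  const now = new Date();
  const latest = await pool.query(
    `SELECT id, value FROM price_history
     WHERE ticker_id = $1
     ORDER BY valid_from DESC
     LIMIT 1`,
    [tickerId]
  );

  if (latest.rows.length > 0 && Number(latest.rows[0].value) === value) {
    await pool.query(`UPDATE price_history SET valid_to = $1 WHERE id = $2`, [
      now,
      latest.rows[0].id,
    ]);
  } else {
    await pool.query(
      `INSERT INTO price_history (ticker_id, value, valid_from, valid_to)
       VALUES ($1, $2, $3, $3)`,
      [tickerId, value, now]
    );
  }
}
```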
Jochem
Jochem4mo ago
to give "a while" some perspective: MariaDB on relatively modest hardware has zero issues with tens of millions of rows in a table. Individual rows are found effectively instantly still, and with the right indexes even aggregate queries are still very fast
13eck
13eck4mo ago
Honestly, without knowing more of what exactly you're trying to store in the DB the only real advice I can give is: just store the damn data. If/when the DB starts to slow down or get full, then worry about optimizing or finding "better" ways to do things. As Jochem pointed out, you're gonna be good for a long while with gigabytes of data in most DBs out there.
ἔρως
ἔρως4mo ago
you can have a "cold storage" database if you use mariadb/mysql: a table with quick data - data that you WILL use this month or so - then a table with pre-calculated historical data. then you can either store the raw historical data in MongoDB or in an archive table
Shane
Shane4mo ago
The historical data will need to be accessed often unfortunately
ἔρως
ἔρως4mo ago
define "often". once a month? and ALL OF IT??? i very strongly doubt that
Jochem
Jochem4mo ago
all of this is premature optimization btw. just build it and worry about performance if you hit the 50M record mark
ἔρως
ἔρως4mo ago
i mean, if you hit 50 million, you WON'T read 50 million every time. he's talking about graphs - you DON'T need to read 50 million records to make one. you pre-compute most of it ahead of time, while moving the data to "cold storage"
Jochem
Jochem4mo ago
I'm just saying premature optimization is wasted time
ἔρως
ἔρως4mo ago
in my opinion, this stinks of bad architecture or bad planning. oh, and yes, premature optimization is the root of all evil
Jochem
Jochem4mo ago
you may never hit the number of records where performance becomes an issue, and performance bottlenecks are incredibly hard to predict. Just write your software with some quick-win stuff in place, but don't worry about changing your architecture until you need to. it's wasted time every single time
ἔρως
ἔρως4mo ago
that's a very good point. we also don't know how much data he has to process now, and for how many people
Jochem
Jochem4mo ago
there's article after article by tech startups about how they completely failed to predict where their bottlenecks would be
ἔρως
ἔρως4mo ago
that's true, but we need more context. still, the answer is probably "don't worry about it"
Shane
Shane4mo ago
Basically it's a site for monitoring class seats during course registration. Data will update often - every 30 minutes or so - for roughly 2000 classes over the period of a week, and then rarely update over the remainder of the year. After the year is over, the historical data will still be displayed on the site in an interactive graph.
I'm thinking about starting with postgres, and then by doing a read before every write I can avoid adding a new row for every scrape when the data updates are infrequent, and that should solve the data size issue. 2000 classes * 48 updates a day * 365 days sounds like a lot of data, and I was hoping that an out-of-the-box solution could just work, but maybe this is a fairly unique application?
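For scale: 2000 classes × 48 scrapes a day × 365 days is about 35 million rows a year in the worst case. A bare-bones sketch of the "just store every scrape" layout, assuming Postgres with the pg client; every name here is made up:

```ts
import { Pool } from "pg";

const pool = new Pool();

// One row per class per scrape, nothing clever. Names are illustrative.
async function createSchema(): Promise<void> {
  await pool.query(`
    CREATE TABLE IF NOT EXISTS classes (
      id   serial PRIMARY KEY,
      term text NOT NULL,
      code text NOT NULL
    )
  `);
  await pool.query(`
    CREATE TABLE IF NOT EXISTS seat_history (
      id         bigserial PRIMARY KEY,
      class_id   integer NOT NULL REFERENCES classes(id),
      seats_open integer NOT NULL,
      scraped_at timestamptz NOT NULL DEFAULT now()
    )
  `);
  // The graph query only ever reads one class's rows in time order,
  // so this composite index is what keeps it fast.
  await pool.query(`
    CREATE INDEX IF NOT EXISTS seat_history_class_time_idx
    ON seat_history (class_id, scraped_at)
  `);
}
```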
ἔρως
ἔρως4mo ago
you're not going to be showing that much data all the time. just pre-compute what you will show and shove the actual data somewhere else
Jochem
Jochem4mo ago
are you showing aggregate data over classes, or just temporal data over 2000 individual classes? Also, how realistic is that number of 2000 classes in real world use?
Shane
Shane4mo ago
realistic, according to the api. temporal data I think - may aggregate later, but for now just the graphs
Jochem
Jochem4mo ago
then you have nothing to worry about. put an index on the classid column and you're talking about 17k rows a year per class
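The per-class fetch Jochem is describing might look like this, reusing the hypothetical seat_history table and its (class_id, scraped_at) index from the sketch above; with that index the query only touches the one class's ~17k rows per year:

```ts
import { Pool } from "pg";

const pool = new Pool();

// Fetch one class's full history for its graph. With the index on
// (class_id, scraped_at) this only touches that class's rows,
// no matter how large the table grows.
async function seatHistoryForClass(classId: number) {
  const result = await pool.query(
    `SELECT scraped_at, seats_open
     FROM seat_history
     WHERE class_id = $1
     ORDER BY scraped_at`,
    [classId]
  );
  return result.rows as { scraped_at: Date; seats_open: number }[];
}
```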
ἔρως
ἔρως4mo ago
yeah, 17k a year is nearly nothing. and you can partition by school year as well, if you want
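If partitioning ever becomes worth it, Postgres declarative range partitioning is one way to do what ἔρως suggests. This is an alternative layout of the same hypothetical seat_history table (you would pick one or the other), with made-up school-year boundaries:

```ts
import { Pool } from "pg";

const pool = new Pool();

// Range-partition the history table by scrape date, one partition per
// school year; Postgres skips partitions a query can't touch.
async function createPartitionedHistory(): Promise<void> {
  await pool.query(`
    CREATE TABLE IF NOT EXISTS seat_history (
      class_id   integer NOT NULL,
      seats_open integer NOT NULL,
      scraped_at timestamptz NOT NULL
    ) PARTITION BY RANGE (scraped_at)
  `);
  await pool.query(`
    CREATE TABLE IF NOT EXISTS seat_history_2024
    PARTITION OF seat_history
    FOR VALUES FROM ('2024-08-01') TO ('2025-08-01')
  `);
  await pool.query(`
    CREATE TABLE IF NOT EXISTS seat_history_2025
    PARTITION OF seat_history
    FOR VALUES FROM ('2025-08-01') TO ('2026-08-01')
  `);
}
```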
Shane
Shane4mo ago
wdym
ἔρως
ἔρως4mo ago
which will make the database even faster. you're not going to record 48 entry points for 365 days
Jochem
Jochem4mo ago
if you're just displaying temporal data over a single class, then each class's fetch will only need to access 17k records a year (48*365)
ἔρως
ἔρως4mo ago
you're going to record more like 16 points over 200 days
Jochem
Jochem4mo ago
that's a non-issue - even with tens of millions of rows in the table, accessing the 17k rows is fine. just make sure it's indexed properly, which is probably already going to be automatic because you'll have a foreign key on the class, right?
Shane
Shane4mo ago
I still need to store 17k records for each class now
Jochem
Jochem4mo ago
so? the performance issues are in aggregating the data
ἔρως
ἔρως4mo ago
that's baby's first steps in database world
Jochem
Jochem4mo ago
postgres can store hundreds of millions of rows and not break a sweat
Shane
Shane4mo ago
Storing 35 million rows a year :Sadge: and that is just for one college
Jochem
Jochem4mo ago
🤷
Shane
Shane4mo ago
Ideally this works well and I scale to 10s and 100s of colleges
ἔρως
ἔρως4mo ago
if you want to store the raw data, store it somewhere else
Jochem
Jochem4mo ago
are you building this for a known, existing customer base, or just starting out hoping for a customer base? keyword "existing"
Shane
Shane4mo ago
it's just a project for fun :) Not trying to sell anything
Jochem
Jochem4mo ago
then really, really REALLY don't worry about it
ἔρως
ἔρως4mo ago
then get it to work first. later on, you deal with the supposed 35 million rows
Shane
Shane4mo ago
I'll do the optimization I mentioned so I'm not storing millions of rows and don't have to deal with that much data
ἔρως
ἔρως4mo ago
🤦‍♂️ dude, get something working first
Jochem
Jochem4mo ago
do a quick google on "premature optimization" and read the first five results
ἔρως
ἔρως4mo ago
you don't know what you will need yet. maybe my amazing idea that almost all companies use won't be such a good idea, and something more custom will be better for you - maybe something even less complicated. but for now, you know absolutely nothing about what you will need in the future to scale to, for example, 10 schools
Shane
Shane4mo ago
it isn't super custom. it's literally just not writing a new row if the data hasn't changed since the last scrape, so I'm not storing millions of copies of the exact same row in my db
ἔρως
ἔρως4mo ago
that's not exactly a solution if you need historically accurate data. also, if you need to process it later on, for some reason, you won't have it. but still, just get something to work. in fact, while working on it, you may have a better idea
Shane
Shane4mo ago
wdym
ἔρως
ἔρως4mo ago
well, if in 5 months you need the data you didn't save, what will you do?
Shane
Shane4mo ago
it only skips saving if the data is unchanged - in that case the previous data point can be used and I just bump its validTo date column
Jochem
Jochem4mo ago
basically, it does the thing the browser console does: just have a little (127) at the end of a row to show it's repeated 127 times. Honestly, I would recommend you don't do that. It would complicate your code a lot, and it's still premature optimization. disk space is cheap, indexed queries are fast, and I can guarantee that you do not understand database optimization (because very few people outside of RDBMS devs really do)
ἔρως
ἔρως4mo ago
yeah, but how will you handle spans of time that are ridiculously large, like 3 days of always the same data?
Shane
Shane4mo ago
update the validTo date to the current date
ἔρως
ἔρως4mo ago
i see it as a very brittle thing that will just overcomplicate everything
Jochem
Jochem4mo ago
this would make your code a nightmare
ἔρως
ἔρως4mo ago
just dump the data somewhere, like a little dumb piece of code would do. read, save. read, save. read, save. that's it. the processing can come later
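The "read, save" loop ἔρως is arguing for is roughly this much code. fetchSeatCounts is a stand-in for whatever the registration API actually returns, and the table matches the earlier hypothetical schema:

```ts
import { Pool } from "pg";

const pool = new Pool();

// Stand-in for the real scraper: one seat count per class per run.
declare function fetchSeatCounts(): Promise<
  { classId: number; seatsOpen: number }[]
>;

// Read, save. No dedup, no validTo bookkeeping: one row per class per scrape.
async function scrapeOnce(): Promise<void> {
  const counts = await fetchSeatCounts();
  for (const { classId, seatsOpen } of counts) {
    await pool.query(
      `INSERT INTO seat_history (class_id, seats_open, scraped_at)
       VALUES ($1, $2, now())`,
      [classId, seatsOpen]
    );
  }
}
```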
Jochem
Jochem4mo ago
as someone who has worked with a bunch of large datasets in mysql/mariadb and postgres: just leave it to the database and worry about it when it breaks
Shane
Shane4mo ago
nah, it'll be fine, because I'll write some code to create a datapoint at the validFrom and at the validTo dates
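For completeness, the expansion step Shane describes could look something like this sketch: each compressed row (same value from valid_from to valid_to) becomes two graph points so the flat stretch still shows up. The row shape here is assumed, not taken from any real schema:

```ts
type CompressedRow = { seats_open: number; valid_from: Date; valid_to: Date };
type GraphPoint = { t: Date; seats: number };

// Each compressed "same value from valid_from to valid_to" row becomes
// two points, so the graph still draws the flat stretch.
function expandForGraph(rows: CompressedRow[]): GraphPoint[] {
  const points: GraphPoint[] = [];
  for (const row of rows) {
    points.push({ t: row.valid_from, seats: row.seats_open });
    if (row.valid_to.getTime() !== row.valid_from.getTime()) {
      points.push({ t: row.valid_to, seats: row.seats_open });
    }
  }
  return points;
}
```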
ἔρως
ἔρως4mo ago
you're overcomplicating with no upsides
Jochem
Jochem4mo ago
I can't stop you, but my professional opinion as someone who has actually worked on datasets with millions of rows is that the database devs are smarter than either of us