creating a historical database for graph generation
Anyone have opinions / thoughts / experience on the best way to store database history? Like if I was storing the values of a stock ticker and wanted to generate a graph of stock price over time (not my actual use case but similar idea).
Seems like there are a lot of wrong ways to do it.
I'll be accessing the DB with an expressJS server and I haven't chosen a database stack yet
Sounds like a time series database.
Basically you store the information per timestamp, then aggregate it when needed. Depending on the storage needs of your project you could combine certain datapoints into a more condensed entry.
As an example, say you’re recording the temperature outside your house for reasons. You start with an hourly reading. Then two weeks later you condense/combine that data to a daily average (freeing up the hourly recordings). Two months later you combine the daily data into a weekly average.
That reduces your database from 720 hourly entries in a month to 4 weekly entries.
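If it helps, here's a rough sketch of that kind of rollup with node-postgres (since you mentioned Express). The readings_hourly/readings_daily tables and their columns are made up for illustration, and readings_daily is assumed to have a unique constraint on day:
```ts
// Hypothetical rollup: average hourly readings older than two weeks into one
// daily row, then delete the raw hourly rows that were condensed.
// All table/column names here are invented examples.
import { Pool } from "pg";

const pool = new Pool(); // connection details come from PG* env vars

async function rollUpHourlyToDaily(): Promise<void> {
  const client = await pool.connect();
  try {
    await client.query("BEGIN");

    // Condense hourly readings older than 14 days into daily averages.
    await client.query(`
      INSERT INTO readings_daily (day, avg_temp)
      SELECT date_trunc('day', ts) AS day, avg(temp)
      FROM readings_hourly
      WHERE ts < now() - interval '14 days'
      GROUP BY 1
      ON CONFLICT (day) DO UPDATE SET avg_temp = EXCLUDED.avg_temp
    `);

    // Free up the hourly rows that were just summarized.
    await client.query(`
      DELETE FROM readings_hourly
      WHERE ts < now() - interval '14 days'
    `);

    await client.query("COMMIT");
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  } finally {
    client.release();
  }
}
```
You'd run something like that on a schedule (cron, a worker, whatever); the weekly rollup is the same idea one level up.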
And while there are many TSDBs to choose from, any database can be used. The timestamp is your key and the info you're recording is a value. SQL, k/v pair, document, etc. Unless you're actually recording stock prices for a stock trader you can use whatever DB you're most comfortable with
Here's the thing: I can't really compress / average old data because it is still important... But for the most part, individual rows will be exactly the same every hour. Do you know of tech that handles that really nicely, or is this something I need to build out myself?
there will be a 2 week period where rows will update hourly or possibly even every 30 minutes
and then the rest of the time everything will update on a day/week scale
Depending on the data being stored you’ll probably be ok for a while with any database engine.
Is the old data needed immediately or just for historical records? If it's not needed right away you can archive it; most DB engines have a way to "store" old data as a zip file or something that can be kept elsewhere and pulled back when needed.
It is kind of needed for making graphs. I'll probably just start with Postgres and a table for the historical data, but do some shenanigans so that if there's already an entry with the exact same value, I add/update a "validTo" column instead of making new rows to store the same data
That will slow down data uploading a little bit bc I'll have to query the database on upload, and it might slow down data access because I'll have to parse and generate nicely formatted data in the API layer
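Roughly what I'm picturing on the Express side with pg; seat_history and its columns are just placeholder names, not a final schema:
```ts
// Sketch of the "read before write" idea: if the newest row for this class
// already has the same value, just push its valid_to forward; otherwise
// insert a fresh row. Table and column names are invented for illustration.
import { Pool } from "pg";

const pool = new Pool(); // connection details come from PG* env vars

async function recordSeats(classId: string, seats: number): Promise<void> {
  const { rows } = await pool.query(
    `SELECT seats, valid_from FROM seat_history
     WHERE class_id = $1
     ORDER BY valid_from DESC
     LIMIT 1`,
    [classId]
  );

  if (rows.length > 0 && rows[0].seats === seats) {
    // Unchanged since the last scrape: extend the existing row.
    await pool.query(
      `UPDATE seat_history SET valid_to = now()
       WHERE class_id = $1 AND valid_from = $2`,
      [classId, rows[0].valid_from]
    );
  } else {
    // Value changed (or first scrape): open a new span.
    await pool.query(
      `INSERT INTO seat_history (class_id, seats, valid_from, valid_to)
       VALUES ($1, $2, now(), now())`,
      [classId, seats]
    );
  }
}
```
That extra SELECT per scrape is the slowdown I mentioned.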
to give "a while" some perspective: MariaDB on relatively modest hardware has zero issues with tens of millions of rows in a table. Individual rows are found effectively instantly still, and with the right indexes even aggregate queries are still very fast
Honestly, without knowing more of what exactly you're trying to store in the DB the only real advice I can give is: just store the damn data. If/when the DB starts to slow down or get full then worry about optimizing or finding "better" ways to do things.
As Jochem pointed out, you're gonna be good for a long while with GB of data with most DBs out there.
you can have a "cold storage" database
if you use mariadb/mysql, you can have a table with quick data - data that you WILL use this month or so
then, have a table with pre-calculated historical data
then, you can either store the raw historical data in MongoDB or in an archive table
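Something like this split, sketched against Postgres since that's what you're leaning towards (the same idea works in MariaDB); the stock-ticker tables and columns below are just an example borrowed from your original question, not a real schema:
```ts
// One reading of the "hot table + pre-computed table + archive" split:
// three tables, with a periodic job rolling data out of the hot one.
// Every name here is an assumed example.
import { Pool } from "pg";

const pool = new Pool(); // connection details come from PG* env vars

async function createTables(): Promise<void> {
  await pool.query(`
    -- recent rows you actively query ("this month or so")
    CREATE TABLE IF NOT EXISTS prices_hot (
      ticker text        NOT NULL,
      ts     timestamptz NOT NULL,
      price  numeric     NOT NULL,
      PRIMARY KEY (ticker, ts)
    );

    -- pre-calculated summaries used for the older parts of the graph
    CREATE TABLE IF NOT EXISTS prices_daily (
      ticker    text    NOT NULL,
      day       date    NOT NULL,
      avg_price numeric NOT NULL,
      PRIMARY KEY (ticker, day)
    );

    -- raw history kept around "just in case", rarely read
    CREATE TABLE IF NOT EXISTS prices_archive (
      LIKE prices_hot INCLUDING ALL
    );
  `);
}
```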
The historical data will need to be accessed often unfortunately
define "often"
once a month?
and ALL OF IT???
i very strongly doubt that
all of this is premature optimization btw
just build it and worry about performance if you hit the 50M record mark
i mean, if you hit 50 million, you WON'T read 50 million everytime
he's talking about graphs
you DON'T need to read 50 million records to make one
you pre-compute most of it ahead of time, while moving the data to "cold storage"
I'm just saying premature optimization is wasted time
in my opinion, this smells like bad architecture or bad planning
oh, yes, premature optimization is the root of all evil
you may not hit the number of records where performance becomes an issue, and performance bottlenecks are incredibly hard to predict. Just write your software with some quick wins in place, but don't worry about changing your architecture until you need to
it's wasted time every single time
that's a very good point
we also don't know how much data he has to process now and for how many people
there's article after article by tech startups about how they completely failed to predict where their bottlenecks would be
that's true
but we need more context
but the answer is probably "don't worry about it"
Basically it's a site for monitoring class seats during course registration. Data will update often (every 30 minutes or so) for roughly 2000 classes over the period of a week, and then rarely update over the remainder of the year. After the year is over, the historical data will still be displayed on the site in an interactive graph
I'm thinking about starting with postgres, and then by doing a read before every write, I can avoid adding a new row for every scrape when the data updates are infrequent and that should solve the data size issue
2000 classes * 48 updates a day * 365 sounds like a lot of data and I was hoping that an out of the box solution could just work, but maybe this is a fairly unique application?
you're not going to be showing that much data all the time
just pre-compute what you will show
and shove the actual data somewhere else
are you showing aggregate data over classes, or just temporal data over 2000 individual classes?
Also, how realistic is that number of 2000 classes in real world use?
realistic according to the api
temporal data I think, may aggregate later but for now just the graphs
then you have nothing to worry about
put an index on the classid column and you're talking about 17k rows a year per class
yeah, 17k a year is nearly nothing
and you can partition by school year as well, if you want
wdym
which will make the database even faster
you're not going to record 48 entry points for 365 days
if you're just displaying temporal data over a single class, then each class's fetch will only need to access 17k records a year (48*365)
you're going to record 16 points every 200 days
that's a non-issue, even with tens of millions of rows to access
(the 17k rows)
just make sure it's indexed properly, which is probably already going to be automatic because you'll have a foreign key on the class, right?
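For example, with the same made-up seat_history table as before. One caveat: in Postgres a foreign key does not automatically create an index on the referencing column (MySQL/MariaDB does), so add one yourself; the school-year partitioning is optional and the 2024 dates are just an example:
```ts
// Sketch of the indexing / partitioning advice against the hypothetical
// seat_history table: range-partitioned by scrape time, indexed by class
// so a per-class graph query stays fast.
import { Pool } from "pg";

const pool = new Pool(); // connection details come from PG* env vars

async function createHistoryTables(): Promise<void> {
  await pool.query(`
    CREATE TABLE IF NOT EXISTS classes (
      id   text PRIMARY KEY,
      name text NOT NULL
    );

    -- Partitioned by the start of each recorded span, one partition per school year.
    CREATE TABLE IF NOT EXISTS seat_history (
      class_id   text        NOT NULL REFERENCES classes (id),
      seats      integer     NOT NULL,
      valid_from timestamptz NOT NULL,
      valid_to   timestamptz NOT NULL
    ) PARTITION BY RANGE (valid_from);

    CREATE TABLE IF NOT EXISTS seat_history_2024
      PARTITION OF seat_history
      FOR VALUES FROM ('2024-08-01') TO ('2025-08-01');

    -- The foreign key alone doesn't index class_id in Postgres; do it explicitly
    -- so fetching one class's history doesn't scan the whole table.
    CREATE INDEX IF NOT EXISTS seat_history_class_idx
      ON seat_history (class_id, valid_from);
  `);
}
```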
I still need to store 17k records for each class now
so?
the performance issues are on aggregating the data
that's baby's first steps in database world
postgres can store hundreds of millions of rows and not break a sweat
The storage of 35 million rows :Sadge:
that is just for one college
🤷
Ideally this works well and I scale for 10s and 100s of colleges
if you want to store raw data, store somewhere else
are you building this for a known, existing customer base, or just starting out hoping for a customer base?
keyword "existing"
its just a project for fun :)
Not trying to sell anything
then really, really REALLY don't worry about it
then get it to work first
later on, you deal with the supposed 35 million rows
I'll do the optimization I mentioned so I'm not storing millions of rows and don't have to deal with that much data
🤦‍♂️
dude, get something working first
do a quick google on "premature optimization" and read the first five results
you don't know what you will need yet
maybe my amazing idea, the one almost all companies use, won't be such a good idea for you
and something more custom will be better for you
maybe something even less complicated
but for now, you know absolutely nothing about what you will need in the future
that is, to scale to, for example, 10 schools
it isn't super custom, it's literally just not writing a new row if the data hasn't changed since the last scrape
so I'm not storing millions of the exact same row in my db
that's not exactly a solution if you need historically accurate data
also, if you need to process it later on, for some reason, you won't have it
but still, just get something to work
in fact, while working on it, you may have a better idea
wdym
well, if you need the data you didn't save, in 5 months, what will you do?
it only skips saving if the data is unchanged
so the previous data point can be used
and I can increment a validTo date column
basically, it does the thing the browser console does, just have a little (127) at the end of a row to show it's repeated 127 times
Honestly, I would recommend you don't do that. It would complicate your code a lot
it's still premature optimization
disk space is cheap, indexed queries are fast, and I can guarantee that you do not understand database optimization (because very few people outside of RDBMS devs really do)
yeah, but how will you handle spans of time that are ridiculously large, like 3 days of always the same data?
update the validTo date
to the current date
i see it as a very brittle thing that will just overcomplicate everything
this would make your code a nightmare
just dump the data somewhere, like a little dumb piece of code would do
read, save
read, save
read, save
that's it
the processing can come later
as someone who has worked with a bunch of large datasets in mysql/mariadb and postgres: just leave it to the database and worry about it when it breaks
nah, it'll be fine because I'll write some code to create a datapoint at the validFrom and at the validTo dates
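Something like this on the API side, with the same placeholder seat_history table as before:
```ts
// Sketch of expanding compressed (seats, valid_from, valid_to) rows back into
// chart points: one point at each end of the span, as described above.
// Table and column names are still the invented ones from the earlier sketches.
import { Pool } from "pg";

const pool = new Pool(); // connection details come from PG* env vars

interface Point {
  ts: Date;
  seats: number;
}

async function seatGraph(classId: string): Promise<Point[]> {
  const { rows } = await pool.query(
    `SELECT seats, valid_from, valid_to
     FROM seat_history
     WHERE class_id = $1
     ORDER BY valid_from`,
    [classId]
  );

  const points: Point[] = [];
  for (const row of rows) {
    // Start of the span.
    points.push({ ts: row.valid_from, seats: row.seats });
    // End of the span, only if it lasted longer than a single scrape.
    if (row.valid_to > row.valid_from) {
      points.push({ ts: row.valid_to, seats: row.seats });
    }
  }
  return points;
}
```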
you're overcomplicating with no upsides
I can't stop you, but my professional opinion as someone who has actually worked on datasets with millions of rows is that the database devs are smarter than either of us