How does database sharding work
I understand the concept of load balancing where for example Service.myApi redirects the request to a server like s1, s2, etc.myApi or even media servers that balance the media into mulitple servers like media.myapi -> m3.myApi
But in the case of the media server the data where a media file is stored, is stored in the database so it's just a matter of asking the database for such data
But then how can I know what database is targeted when there is db.myapi which would have db1.myApi and db2.myApi since I can't just ask the database for data without knowing where it's stored
38 Replies
The "sharded" databases I've worked with all rely on the concept of a partition key
all items with the same partition key are guaranteed to be on the same shard, and you must know the partition key to be able to query the database
What exactly is this "partition key"
essentially just a column like any other, except it must have a known value
My most recent example would be for a car manufacturer, where we used a sharded database to store all the cars they ever produced
we used the market the car was produced for as the partitionkey there
that was using AWS DynamoDB, for the record
analysing your access patterns and knowing what data you will have is super important when working with document databases or something like this
Let's say I want to load a bunch of messages from a database
I'd make an API call to my server which then fetches the data from the database to return it
Now if I understand your approach correctly I need a collection of where things are but how will I know that if that requires a database call too? Because that approach sounds like I'm just cloning a huge record of everything just less detailed onto every server which would destroy the porpuse of having multiple servers (in terms of storage)
What database engine are we talking here? Implementing shard on your own in something like MSSQL is quite different from doing it in a database that has it from scratch
If we assume you are doing it on your own in a relational engine like MSSQL or Postgres, you would have a "master" database that keeps track of what other databases contain what, on an overarching level. Something like "Customer 1246234 has its data in
database-6
"
you should not split/shard/partition data willy nilly, if they belong together, keep them togetherOur options are PSQL or MariaDB
But I tend to go with MariaDB cause I find it more flexible to use and query to
so relational dbs
not something you would normally partition
My plan is that each "community" has its own database with tables and stuff, on the user's account there is a record of all communities they are in
But that doesn't help that messages might end up with a huge size
Okay, so you're doing multi-tenancy
My biggest concern is storage really
We have dedicated servers for media and stuff but messages are :SCgetoutofmyhead:
¯\_(ツ)_/¯
No good ideas from me, I've never worked at quadrillion-items scale
a single relational DB with a single table can easily handle billions of messages
as long as you have indexes on the things you will query on
Things like message history will be fetched by getting the last few messages along with a token that I can query to load the previous messages
thats easy enough, if you have some kind of incremental id so the database can easily "order" the items, thats just a
SELECT * FROM messages limit 10;
Also what do you think of the idea that the database manager for the backend software is a seperate package the backend uses
This way we can keep the database connector stuff as a seperate "service" for easier maintaining
im not sure what that means
Uh
Basically the code that handles database queries is a NuGet package the backend calls
are you doing manual ADO.NET queries in 2024? 😛
I have no clue what ADO.NET is :SCgetoutofmyhead:
SqlConnection
and friendsAh
Yes.
I'd uh... recommend not doing that.
What's a more modern approach
EF Core, or Linq2Db
I make the SQL queries and then map the result to objects using Dapper.
Right
Which of those do you personally recommend I should take a look at primarily
Dapper is good, and a few years ago that was the best way to get good performance
EF Core probably
its the defacto standard for all .NET development
Oki
the nice thing with it is that you no longer need to write sql
Can you give me an example of how a SELECT query would look
you just do
var messages = await _dbContext.Messages.Where(x => x.SenderId == myUserId).TakeAsync(50);
Ah
you work with a type called
IQueryable<T>
that represents the query before it being sent to the database
so you can gradually build up a query with more and more filtering or mapping etc
when you are done and want to issue the command, you run a finalizer like ToListAsync
or SingleAsync
etc
and then the query runs and you get a List<T>
or T
back
depending on what finalizer you ran1. Does T in this case represent the object I want to map
2. Will this also be safe from SQL injection and such
1: T would be your domain model.
2: yes.
:soPortuguese:
@Pobiega question
As I said before every community gets it's own database
Now if I want to query all users for example I'd do something like
SELECT * FROM myCommunityDb.myUsersTable;
Is it possible to query like that on EF Coreyes, but its not trivial. look up "multi tenancy in EF Core"
Oki
Thanks a lot!
I'll take a look and come back some time later