C#•8mo ago
OptoCloud

Filesystem packer slows down after 30k files

After the filesystem packer has hashed all 255k files, the DB operations start to slow down the entire application. The DB writes get to 30k files before the TAR writer catches up and slows down to the DB writers' speed, and from there it takes hours, maybe days, to finish... Any way I can speed this up? https://github.com/OptoCloud/OptoPacker Current status:
OptoCloud
OptoCloudOP•8mo ago
The TAR writer has to wait for the DB job before doing its thing, because if a BLOB with a matching hash has already been written to the TAR file then there is no use writing it again. Application workflow:
Discover all files, respecting gitignore files along the way

Register all directories traversed to the database and build a directory graph

Then asynchronously using IAsyncEnumerator, with batching and other stuff, for every file discovered:
Hash the file; the hash is used to ensure the contents of identical files are only written to the TAR file once. A unique entry of file contents is referred to as a BLOB

Check the database to see if the BLOB has already been registered; if not, write the blob entry to the database

Register the file record in the database with a relation to the blob record, and set its relation to the directory hierarchy

**Only if** the blob record was inserted into the database, write the file to the tar file with a filename that is the hash of its content (the blob hash)
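The per-file steps above can be sketched as follows. This is a minimal, hypothetical model (not the actual OptoPacker code): the database blob table is stood in for by a `HashSet`, and the TAR writer by a list of entry names, so only the dedup logic is shown.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

// Hypothetical stand-ins: blobTable models the DB blob table,
// tarEntries models entries emitted to the TAR stream.
var blobTable = new HashSet<string>();
var tarEntries = new List<string>();

void PackFile(string virtualPath, byte[] contents)
{
    // Hash the contents; the hash identifies the BLOB.
    string hash = Convert.ToHexString(SHA256.HashData(contents));

    // "Insert if missing" check against the blob table.
    bool inserted = blobTable.Add(hash);

    // The file record + directory relation would be registered here regardless.

    // Only if the blob was newly inserted, write it to the TAR,
    // named by its content hash.
    if (inserted)
        tarEntries.Add(hash);
}

PackFile("a/readme.txt", Encoding.UTF8.GetBytes("hello"));
PackFile("b/copy.txt",   Encoding.UTF8.GetBytes("hello")); // duplicate content
PackFile("c/other.txt",  Encoding.UTF8.GetBytes("world"));

Console.WriteLine(tarEntries.Count); // 2 — the duplicate blob is written once
```

The key property is that the TAR write is conditional on the blob insert succeeding, which is why the TAR writer ends up serialized behind the DB step.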
Jimmacle
Jimmacle•8mo ago
Dumping a ton of binary blobs into a database (especially SQLite) sounds slow. Do you really need the file contents in there?
OptoCloud
OptoCloudOP•8mo ago
I'm not dumping the binary blobs into the database, only their hashes; the blobs I'm writing to a TAR stream that's streamed into 7zip. Sorry for the miswrite, I tried to explain it better now. I updated my code a bit more to try to maybe optimize it? Idk how much more I can do. Bump? @Jimmacle 🤔
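One way to stop the TAR writer from waiting on DB round-trips (a sketch, not the actual OptoPacker code; all names are hypothetical) is to make the "seen before?" decision in memory and defer the DB inserts to a separate batching consumer via `System.Threading.Channels`:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Channels;
using System.Threading.Tasks;

// In-memory dedup: the TAR-write decision never touches the DB.
var seen = new ConcurrentDictionary<string, bool>();

// DB work is queued here and drained by a separate consumer.
var dbQueue = Channel.CreateUnbounded<string>();

async Task ProduceAsync(string[] hashes)
{
    foreach (var h in hashes)
    {
        bool firstTime = seen.TryAdd(h, true); // no DB round-trip
        if (firstTime)
        {
            // TAR write would happen here, immediately.
        }
        await dbQueue.Writer.WriteAsync(h);    // DB insert deferred
    }
    dbQueue.Writer.Complete();
}

async Task<int> ConsumeAsync()
{
    int rows = 0;
    await foreach (var h in dbQueue.Reader.ReadAllAsync())
        rows++;                                // batch INSERT would go here
    return rows;
}

var producer = ProduceAsync(new[] { "aa", "bb", "aa" });
var consumer = ConsumeAsync();
await producer;
int rows = await consumer;
Console.WriteLine(rows); // 3 rows queued for the DB; "aa" tarred once
```

The trade-off is that the in-memory set must hold every hash (for 255k files of 32-byte hashes that's small), and the DB only needs to be consistent eventually, not before each TAR write.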