Periodic data deletion in JanusGraph
Hi folks,
I'd like to know the best practices for deleting data older than a certain number of days, to keep the graph size within limits.
Does JanusGraph provide any standard SOP or tooling for daily data deletion jobs?
I know it provides TTL for static vertices, but we don't have static vertices.
6 Replies
Hi, JanusGraph doesn't have such tools out of the box so you'll probably have to implement some logic by yourself.
We have also implemented logic like this at my workplace. Our solution was to add a mixed index on a modification date that is present on each vertex and edge. Then we implemented a Kubernetes CronJob which uses the index to fetch vertices and edges older than the configured retention time and then deletes them.
You need to be careful with concurrent modifications of course. Otherwise, you'll risk getting ghost vertices: https://docs.janusgraph.org/common-questions/#ghost-vertices
As a side note, if you use Cassandra you should also run a compaction process from time to time, because deletes produce tombstones, and too many tombstones may slow down Cassandra.
Thanks @florianhockmann. Just curious: why did you use the mixed index? Is it because you run range queries on the node creation/event timestamp? Also, given that the number of nodes/edges to be deleted could be very large, did you use some kind of batching or a single query to delete everything? And how did it impact reads and writes on the instance?
Thanks @porunov, we are using Bigtable. Any specific guidance for this?
Yes, exactly, because of the range queries. Yep, we're creating batches. We currently split the interval to be deleted into batches of max 100 vertices / edges.
We use dedicated JanusGraph Server instances only for these delete jobs, so the impact on the performance of other applications accessing JanusGraph comes only via the backends.
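For anyone who wants to see the batching idea in code, here is a minimal sketch in Python. It assumes `g` is a Gremlin traversal source such as the one gremlinpython provides, and that the IDs to delete have already been collected somehow. The function names `batches` and `drop_vertices_in_batches` are made up for this illustration, and the batch size of 100 simply mirrors the number mentioned above. This is not the actual job code, just one way it could look.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def batches(ids: Iterable, size: int = 100) -> Iterator[List]:
    """Yield consecutive lists of at most `size` IDs."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(ids)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def drop_vertices_in_batches(g, vertex_ids: Iterable, size: int = 100) -> int:
    """Drop vertices batch by batch and return the number of IDs submitted.

    `g` is assumed to be a Gremlin traversal source (e.g. from gremlinpython).
    """
    submitted = 0
    for batch in batches(vertex_ids, size):
        # One traversal per batch; dropping a vertex also removes its incident edges.
        g.V(*batch).drop().iterate()
        submitted += len(batch)
    return submitted
```

Keeping each traversal small limits how much work a single transaction does, which should make retries cheaper if a batch fails.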
@Florian Hockmann the deletion process is not documented well. Could you share your cronjob query to delete a particular vertex or edge?
Deleting a particular vertex is really just
g.V([vertexID]).drop().iterate()
This will also delete all incident edges. If you want to delete an individual edge, then you can just use
g.E([edgeID]).drop().iterate()
More interesting is how you determine which vertices / edges you want to delete. I originally wrote here that we used a mixed index on the modification date, but we have since moved away from that solution because the load on ES became too high.
Our current solution consists of a dedicated retention DB based on Timescale where we store the modification date together with the vertex ID. A cronjob then regularly fetches 10-minute intervals and sends these vertex IDs to a cleanup service, which then drops them in JanusGraph.
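To make the interval-based flow a bit more concrete, here is a rough Python sketch. It splits a time range into 10-minute windows and, for each window, fetches the matching vertex IDs and hands them to a drop function. `fetch_ids` and `drop_ids` are hypothetical callables standing in for the retention DB query and the cleanup service call. Their names and signatures are assumptions for this example, not the real implementation.

```python
from datetime import datetime, timedelta
from typing import Callable, Iterable, Iterator, List, Tuple


def windows(
    start: datetime, end: datetime, step: timedelta = timedelta(minutes=10)
) -> Iterator[Tuple[datetime, datetime]]:
    """Yield half-open [lo, hi) windows that together cover [start, end)."""
    if step <= timedelta(0):
        raise ValueError("step must be positive")
    lo = start
    while lo < end:
        hi = min(lo + step, end)
        yield lo, hi
        lo = hi


def run_retention(
    start: datetime,
    end: datetime,
    fetch_ids: Callable[[datetime, datetime], Iterable],
    drop_ids: Callable[[List], None],
) -> int:
    """Process [start, end) window by window; return how many IDs were sent.

    fetch_ids(lo, hi) is a hypothetical lookup of vertex IDs whose
    modification date falls in [lo, hi); drop_ids(ids) is a hypothetical
    call that deletes those vertices in JanusGraph.
    """
    total = 0
    for lo, hi in windows(start, end):
        ids = list(fetch_ids(lo, hi))
        if ids:
            drop_ids(ids)
            total += len(ids)
    return total
```

Half-open windows mean an ID whose timestamp sits exactly on a boundary is picked up by one window only, so it is not sent for deletion twice.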