Concurrent updates during a REINDEX
Hello. I was reading the JanusGraph documentation on reindexing (https://docs.janusgraph.org/schema/index-management/index-reindexing/), which says:
"JanusGraph can begin writing incremental index updates right after an index is defined. However, before the index is complete and usable, JanusGraph must also take a one-time read pass over all existing graph elements associated with the newly indexed schema type(s). Once this reindexing job has completed, the index is fully populated and ready to be used. The index must then be enabled to be used during query processing."
This made me wonder how JanusGraph handles incremental updates happening concurrently with a REINDEX.
For instance, if we consider a slow reindexing process (e.g. done through the ManagementSystem interface) that can take several hours, how are concurrent additions/updates/deletions of vertices/edges/properties handled?
Won't they just be overwritten by the reindexing process that could be based on old data?
Any thoughts / pointers to code would be greatly appreciated. Thanks!
In theory, there's a small chance that this could happen, yes. I imagine the problem would be smaller with the ManagementSystem interface because, IIRC, it doesn't cache much data while it's doing the reindexing. In other words, it keeps pulling data from storage and reindexing it as it goes. There's nothing like "creating a snapshot and then doing the reindexing".
Thanks Boxuan. I'm trying to link that to the code.
IIUC this is because keys are iterated and SliceQueries are built/emitted on the fly as the job is making progress. Is that right?
Also, as a follow-up, my understanding of the code is that a REINDEX does not clear the existing index data and will only reindex what's currently in the backend's edgestore. Is that correct? I imagine for such cases the work in https://github.com/JanusGraph/janusgraph/issues/1099 could help.
"IIUC this is because keys are iterated and SliceQueries are built/emitted on the fly as the job is making progress. Is that right?"
Yeah, that's what I remembered. But I could be wrong... I haven't touched that piece of code for a long time now.
"my understanding of the code is that a REINDEX does not clear the existing index data and will only reindex what's currently in the backend's edgestore"
I just quickly looked at the code and I feel you are right, but I vaguely remember JanusGraph does clear existing index data. A simple experiment would be the best.
Yes, that's right. JanusGraph doesn't clear stale index records during the REINDEX operation.
I guess the job could be split into 2 parts:
1. Remove stale index records.
2. Add missing index records.
At this point, JanusGraph only does the second part during a reindexing operation.
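To make the two parts concrete, here is a toy, in-memory sketch in plain Java. It is not JanusGraph code: the class `ReindexSketch`, its two methods, and the map-based "edgestore" and "index" are all hypothetical names invented for illustration. It only models the behaviour described above, where a reindex pass adds records for what is currently in the store and leaves stale records alone.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model only (hypothetical names, not the JanusGraph API).
// "store": vertex id -> current value of the indexed property.
// "index": property value -> ids of vertices recorded under that value.
public class ReindexSketch {

    // Part 2: add index records for everything currently in the store.
    // Existing index records are never touched, so stale ones survive.
    static void addMissingRecords(Map<Long, String> store, Map<String, Set<Long>> index) {
        for (Map.Entry<Long, String> e : store.entrySet()) {
            index.computeIfAbsent(e.getValue(), k -> new TreeSet<>()).add(e.getKey());
        }
    }

    // Part 1: scan the index and drop records that no longer match the store
    // (vertex deleted, or its property value changed). Returns the number removed.
    static int removeStaleRecords(Map<Long, String> store, Map<String, Set<Long>> index) {
        int removed = 0;
        Iterator<Map.Entry<String, Set<Long>>> entries = index.entrySet().iterator();
        while (entries.hasNext()) {
            Map.Entry<String, Set<Long>> entry = entries.next();
            Iterator<Long> ids = entry.getValue().iterator();
            while (ids.hasNext()) {
                Long id = ids.next();
                // store.get(id) is null for deleted vertices, so equals() is false.
                if (!entry.getKey().equals(store.get(id))) {
                    ids.remove();
                    removed++;
                }
            }
            if (entry.getValue().isEmpty()) {
                entries.remove();
            }
        }
        return removed;
    }

    public static void main(String[] args) {
        Map<Long, String> store = new HashMap<>();
        store.put(1L, "alice");
        store.put(2L, "bob");

        Map<String, Set<Long>> index = new HashMap<>();
        // Stale record: vertex 3 no longer exists in the store.
        index.computeIfAbsent("carol", k -> new TreeSet<>()).add(3L);

        addMissingRecords(store, index);
        System.out.println(index.containsKey("carol")); // true: stale record survives

        removeStaleRecords(store, index);
        System.out.println(index.containsKey("carol")); // false: stale record removed
    }
}
```

In this model, a lookup for "carol" after `addMissingRecords` alone still returns vertex 3, which is the ghost-vertex symptom; only the separate cleanup pass removes it.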
Thank you both!
wait if that's the case then index lookups will return a lot of stale data?
Yes. That's how it works right now. Thus, you either need to remove stale index records yourself (with StaleIndexRecordUtil) or migrate your data to a new keyspace. It would certainly be great to add a tool that scans all the index records and removes any stale ones.
Oh I see
Sorry, I mistakenly mixed up OLAP reindexing with OLTP updates.
OLTP updates will delete the stale index records, while OLAP reindexing doesn't.
But that's fine from a functionality perspective, right? If one needs a reindexing, the old index will not be used anyway. It's just stale data that takes up storage space, but won't do any harm to data correctness.