JanusGraph • 9mo ago
cdegroc

Concurrent updates during a REINDEX

πŸ‘‹πŸ» Hello. I was reading JanusGraph documentation on reindexing (https://docs.janusgraph.org/schema/index-management/index-reindexing/) which says
JanusGraph can begin writing incremental index updates right after an index is defined. However, before the index is complete and usable, JanusGraph must also take a one-time read pass over all existing graph elements associated with the newly indexed schema type(s). Once this reindexing job has completed, the index is fully populated and ready to be used. The index must then be enabled to be used during query processing.
This made me wonder how JanusGraph handles incremental updates that happen concurrently with a REINDEX. For instance, if we consider a slow reindexing process (e.g. one run through the ManagementSystem interface) that can take several hours, how are concurrent additions/updates/deletions of vertices/edges/properties handled? Won't they just be overwritten by the reindexing process, which could be working from old data? Any thoughts or pointers to code would be greatly appreciated. Thanks!
9 Replies
Bo • 9mo ago
In theory, yes, there's a small chance that this could happen. I imagine the problem would be smaller with the ManagementSystem interface because, IIRC, it doesn't cache much data while it's doing the reindexing. In other words, it keeps pulling data from storage and reindexing it as it goes. There's nothing like "creating a snapshot and then doing the reindexing".
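To make the "small chance" concrete, here is a toy model of the interleaving being discussed. It is not JanusGraph code: `ReindexRace`, its maps, and its methods are hypothetical stand-ins for the edgestore, an index, an OLTP update, and a reindex job whose read and write steps are split apart so that an update can land between them.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model only; none of these names exist in JanusGraph.
class ReindexRace {
    final Map<Long, String> store = new HashMap<>();      // vertex id -> current value
    final Map<String, Set<Long>> index = new HashMap<>(); // value -> vertex ids

    // OLTP-style update: change the value and fix the index in the same step.
    void update(long id, String newValue) {
        String old = store.put(id, newValue);
        if (old != null && index.containsKey(old)) {
            index.get(old).remove(id);
        }
        index.computeIfAbsent(newValue, k -> new HashSet<>()).add(id);
    }

    // The reindex job, split in two so another write can be interleaved.
    String reindexRead(long id) {
        return store.get(id);
    }

    void reindexWrite(long id, String valueSeen) {
        index.computeIfAbsent(valueSeen, k -> new HashSet<>()).add(id);
    }

    boolean indexed(String value, long id) {
        Set<Long> ids = index.get(value);
        return ids != null && ids.contains(id);
    }

    public static void main(String[] args) {
        ReindexRace r = new ReindexRace();
        r.update(1L, "alice");
        String seen = r.reindexRead(1L); // job reads the old value
        r.update(1L, "bob");             // concurrent update lands
        r.reindexWrite(1L, seen);        // job writes an entry for what it read
        System.out.println(r.indexed("alice", 1L)); // stale entry is back
        System.out.println(r.indexed("bob", 1L));   // current entry is intact
    }
}
```

In this model the current entry is never lost; the damage is an extra stale entry for the old value. The window is only the gap between the job's read and its write for a given key, which fits the point that reading on the fly keeps the exposure small compared with working from a long-lived snapshot.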
cdegroc • 8mo ago
Thanks Boxuan. I'm trying to link that to the code. IIUC this is because keys are iterated and SliceQueries are built/emitted on the fly as the job is making progress. Is that right?
cdegroc • 8mo ago
Also, as a follow-up, my understanding of the code is that a REINDEX does not clear the existing index data and will only reindex what's currently in the backend's edgestore. Is that correct? I imagine for such cases the work in https://github.com/JanusGraph/janusgraph/issues/1099 could help.
GitHub
Not able to delete ghost vertices caused by stale index · Issue #10...
Over the period of time we are observing some ghost vertices on our production setup where if we do gremlin query for having msid =18038893 shows ghost vertex. gremlin> g.V().has('msid',...
Bo • 8mo ago
IIUC this is because keys are iterated and SliceQueries are built/emitted on the fly as the job is making progress. Is that right?
Yeah that's what I remembered. But I could be wrong... haven't touched that piece of code for a long time now.
my understanding of the code is that a REINDEX does not clear the existing index data and will only reindex what's currently in the backend's edgestore
I just quickly looked at the code and I feel you are right, but I vaguely remember JanusGraph does clear existing index data. A simple experiment would be the best. I just don't want to believe that JanusGraph could have such a notable bug 😆
porunov • 8mo ago
Yes, that's right. JanusGraph doesn't clear stale index records during the REINDEX operation. I guess the job could be split into 2 parts: 1. Remove stale index records. 2. Add missing index records. At this point JanusGraph only does the second part during a reindexing operation.
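The two parts described above can be sketched with plain Java collections. This is a toy model, not JanusGraph's implementation: `ReindexModel`, `addMissing`, and `removeStale` are hypothetical names, `store` stands in for the edgestore, and `index` stands in for a composite index.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;

// Toy model only; none of these names exist in JanusGraph.
class ReindexModel {
    final Map<Long, String> store = new HashMap<>();      // vertex id -> current value
    final Map<String, Set<Long>> index = new HashMap<>(); // value -> vertex ids

    // Part 2: add an index record for everything currently in the store.
    void addMissing() {
        for (Map.Entry<Long, String> e : store.entrySet()) {
            index.computeIfAbsent(e.getValue(), k -> new HashSet<>()).add(e.getKey());
        }
    }

    // Part 1: drop index records that the store no longer backs.
    void removeStale() {
        Iterator<Map.Entry<String, Set<Long>>> it = index.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Set<Long>> e = it.next();
            e.getValue().removeIf(id -> !e.getKey().equals(store.get(id)));
            if (e.getValue().isEmpty()) {
                it.remove();
            }
        }
    }

    Set<Long> lookup(String value) {
        return index.getOrDefault(value, Collections.emptySet());
    }

    public static void main(String[] args) {
        ReindexModel m = new ReindexModel();
        m.store.put(1L, "bob");
        // A stale record: the index still maps "alice" to vertex 1.
        m.index.computeIfAbsent("alice", k -> new HashSet<>()).add(1L);

        m.addMissing();
        System.out.println(m.lookup("alice")); // stale record survives an add-only pass
        m.removeStale();
        System.out.println(m.lookup("alice")); // gone after the cleanup pass
    }
}
```

An add-only pass leaves the index a superset of what the store justifies, which is why a separate cleanup step is needed to get rid of stale records.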
cdegroc • 8mo ago
Thank you both! 🙇🏻
Bo • 8mo ago
Wait, if that's the case, then index lookups will return a lot of stale data?
porunov • 8mo ago
Yes. That's how it works right now. Thus, you either need to remove stale index records yourself (with StaleIndexRecordUtil) or migrate your data to a new keyspace. It would certainly be great to add a tool that scans all the index records and removes any stale ones.
Bo • 8mo ago
Oh, I see. Sorry, I mistakenly mixed up OLAP reindexing with OLTP updates. OLTP updates will delete the stale index records, while OLAP reindexing doesn't, but that's fine from a functionality perspective, right? If one needs a reindexing, the old index will not be used anyway. It's just stale data that takes up storage space, but it won't do any harm to data correctness.