JanusGraph•14mo ago

Concurrent updates during a REINDEX

👋🏻 Hello. I was reading the JanusGraph documentation on reindexing (https://docs.janusgraph.org/schema/index-management/index-reindexing/), which says:

> JanusGraph can begin writing incremental index updates right after an index is defined. However, before the index is complete and usable, JanusGraph must also take a one-time read pass over all existing graph elements associated with the newly indexed schema type(s). Once this reindexing job has completed, the index is fully populated and ready to be used. The index must then be enabled to be used during query processing.

This made me wonder how JanusGraph handles incremental updates that happen concurrently with a REINDEX. For instance, if we consider a slow reindexing process (e.g. done through the ManagementSystem interface) that can take several hours, how are concurrent additions/updates/deletions of vertices/edges/properties handled? Won't they just be overwritten by the reindexing process, which could be based on old data? Any thoughts / pointers to code would be greatly appreciated. Thanks!

9 Replies

Bo•14mo ago

In theory, there's a small chance that this could happen, yes. I imagine the problem would be smaller with the ManagementSystem interface because, IIRC, it doesn't cache much data while it's doing the reindexing. In other words, it keeps pulling data from storage and reindexing it. There's nothing like "creating a snapshot and then doing the reindexing".
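To make the "small chance" concrete, here is a toy Python model of a scan that has no snapshot and reads each element only when it reaches it. This is not JanusGraph code: `oltp_update`, `reindex`, `lookup`, `demo`, and the dict-based store and index are all hypothetical stand-ins, and the model assumes (as discussed later in the thread) that incremental writes remove their own old index entry. It only illustrates where the race window sits: between the scan's read of an element and its index write.

```python
from collections import defaultdict


def oltp_update(store, index, vid, new_value):
    """Incremental (OLTP-style) write: drop the old index entry, add the new one."""
    old_value = store.get(vid)
    if old_value is not None:
        index[old_value].discard(vid)
    store[vid] = new_value
    index[new_value].add(vid)


def reindex(store, index, between_read_and_write=None):
    """Additive scan with no snapshot: each value is read when its key is reached."""
    for vid in list(store):
        value = store[vid]                   # read from the live store
        if between_read_and_write:
            between_read_and_write(vid)      # hook simulating a concurrent writer
        index[value].add(vid)                # add only; nothing is removed


def lookup(index, value):
    return sorted(index[value])


def demo(write_during_vid):
    """Rename vertex 2 from 'bob' to 'bobby' while the scan is processing write_during_vid."""
    store = {1: "alice", 2: "bob"}
    index = defaultdict(set)

    def writer(vid):
        if vid == write_during_vid:
            oltp_update(store, index, 2, "bobby")

    reindex(store, index, writer)
    return lookup(index, "bob"), lookup(index, "bobby")


# Write lands before the scan reaches vertex 2: the scan reads the fresh value.
print(demo(1))  # ([], [2])
# Write lands between the scan's read and its index write for vertex 2:
# the old 'bob' entry is written back and stays behind as a stale record.
print(demo(2))  # ([2], [2])
```

In this model the stale entry appears only when the write falls inside the short read-then-write window for that one element, which is why reading on the fly keeps the exposure smaller than scanning from a long-lived snapshot would.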

cdegrocOP•14mo ago

Thanks Boxuan. I'm trying to link that to the code. IIUC this is because keys are iterated and SliceQueries are built/emitted on the fly as the job is making progress. Is that right?

cdegrocOP•14mo ago

Also, as a follow-up, my understanding of the code is that a REINDEX does not clear the existing index data and will only reindex what's currently in the backend's edgestore. Is that correct? I imagine for such cases the work in https://github.com/JanusGraph/janusgraph/issues/1099 could help.

GitHub

Not able to delete ghost vertices caused by stale index · Issue #10...

Over the period of time we are observing some ghost vertices on our production setup where if we do gremlin query for having msid =18038893 shows ghost vertex. gremlin> g.V().has('msid',...

Bo•14mo ago

> IIUC this is because keys are iterated and SliceQueries are built/emitted on the fly as the job is making progress. Is that right?

Yeah, that's what I remember. But I could be wrong... haven't touched that piece of code for a long time now.

> my understanding of the code is that a REINDEX does not clear the existing index data and will only reindex what's currently in the backend's edgestore

I just quickly looked at the code and I feel you are right, but I vaguely remember JanusGraph does clear existing index data. A simple experiment would be the best. ~~I just don't want to believe that JanusGraph could have such a notable bug 😆~~

porunov•14mo ago

Yes, that’s right. JanusGraph doesn’t clear stale index records during the REINDEX operation. I guess the job could be split into 2 parts: 1. Remove stale index records. 2. Add missing index records. At this point, JanusGraph only does the second part during a reindexing operation.
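The two-part split described above can be sketched with a toy Python model. This is an illustration only, not JanusGraph's implementation: `add_missing`, `remove_stale`, and the dict-based store and index are hypothetical names. It shows that running only the "add missing" part leaves a record for a deleted vertex in the index, and that a separate cleanup pass is what removes it.

```python
def add_missing(store, index):
    """Part 2: add an index record for every element currently in the store."""
    for vid, value in store.items():
        index.setdefault(value, set()).add(vid)


def remove_stale(store, index):
    """Part 1: drop index records that no longer match what the store holds."""
    for value in list(index):
        index[value] = {vid for vid in index[value] if store.get(vid) == value}
        if not index[value]:
            del index[value]


def snapshot(index):
    """Return a printable copy of the index with sorted id lists."""
    return {value: sorted(vids) for value, vids in index.items()}


store = {1: "alice", 3: "carol"}
index = {"alice": {1}, "bob": {2}}  # vertex 2 was deleted; its record is stale

add_missing(store, index)           # what the thread says REINDEX does today
after_add = snapshot(index)
print(after_add)      # {'alice': [1], 'bob': [2], 'carol': [3]}

remove_stale(store, index)          # the cleanup pass that is not part of REINDEX
after_cleanup = snapshot(index)
print(after_cleanup)  # {'alice': [1], 'carol': [3]}
```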

cdegrocOP•14mo ago

Thank you both! 🙇🏻

Bo•14mo ago

Wait, if that's the case, then index lookups will return a lot of stale data?

porunov•14mo ago

Yes. That's how it works right now. Thus, you either need to remove stale index records yourself (with StaleIndexRecordUtil) or migrate your data to a new keyspace. It would certainly be great to add a tool that scans all the index records and removes any stale ones.

Bo•14mo ago

Oh, I see. Sorry, I mistakenly mixed up OLAP reindexing with OLTP updates. OLTP updates will delete stale index records, while OLAP reindexing doesn't, but that's fine from a functionality perspective, right? If one needs a reindex, the old index will not be used anyway. It's just stale data that takes up storage space, but it won't do any harm to data correctness.
