Can `CqlInputFormat` do predicate pushdowns/query based prefilters?
Hi! First of all, thank you all for your work on JanusGraph.
In my use case, I have a medium-large graph, ~3TB currently, might be 1-2 orders of magnitude bigger later. The data in it is generally clustered in a time-based fashion, e.g. newer vertices are mostly connected to other newer vertices (a timestamp is stored as a vertex property).
I am writing an OLAP pipeline with Spark where JanusGraph, backed by Cassandra, is the source, and I use Tinkerpop's `hadoop-gremlin` to build vertex programs and run OLAP Gremlin queries. Per my understanding, in this setup the only point of contact with JanusGraph is through the `CqlInputFormat`, and the server itself is not involved at all. Is that correct?
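For anyone following along, a setup like the one described above usually comes down to a handful of `HadoopGraph` properties. Below is a minimal sketch that assembles them with plain `java.util.Properties`. The key names follow the example CQL read configuration in the JanusGraph documentation as best I recall, so please double-check them against your version. The class name `OlapReadConfig`, the host, and the keyspace are placeholders of mine, not anything from JanusGraph itself.

```java
import java.util.Properties;

/** Sketch only: assembles hadoop-gremlin properties for reading JanusGraph through CqlInputFormat. */
final class OlapReadConfig {

    // cassandraHost and keyspace are placeholders supplied by the caller (assumptions, not defaults).
    static Properties build(String cassandraHost, String keyspace) {
        Properties p = new Properties();

        // TinkerPop side: HadoopGraph pulls vertices through the configured InputFormat.
        p.setProperty("gremlin.graph", "org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph");
        p.setProperty("gremlin.hadoop.graphReader", "org.janusgraph.hadoop.formats.cql.CqlInputFormat");
        p.setProperty("gremlin.hadoop.graphWriter", "org.apache.hadoop.mapreduce.lib.output.NullOutputFormat");
        p.setProperty("gremlin.hadoop.inputLocation", "none");
        p.setProperty("gremlin.hadoop.outputLocation", "output");

        // JanusGraph side: storage settings handed to the input format (no JanusGraph server here).
        p.setProperty("janusgraphmr.ioformat.conf.storage.backend", "cql");
        p.setProperty("janusgraphmr.ioformat.conf.storage.hostname", cassandraHost);
        p.setProperty("janusgraphmr.ioformat.conf.storage.cql.keyspace", keyspace);

        // Cassandra side: partitioner used when computing input splits.
        p.setProperty("cassandra.input.partitioner.class", "org.apache.cassandra.dht.Murmur3Partitioner");

        // Spark side: local master for laptop-sized tests; Kryo registrator shipped with JanusGraph.
        p.setProperty("spark.master", "local[*]");
        p.setProperty("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
        p.setProperty("spark.kryo.registrator", "org.janusgraph.hadoop.serialize.JanusGraphKryoRegistrator");
        return p;
    }

    public static void main(String[] args) throws Exception {
        // Print the properties in .properties format so they can be saved to a file.
        build("localhost", "janusgraph").store(System.out, "hadoop-gremlin read config (sketch)");
    }
}
```

The printed output can be saved as a `.properties` file and opened as a `HadoopGraph`. Note that nothing in it expresses a predicate on a vertex property, which is the gap this question is about.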
A very common operation that I'm going to have to do, based on the above clustering assumption, is pre-filtering vertices by a timestamp range before running my logic on the subgraph. As an example, I would like to, say, download the last couple days' worth of vertices on my laptop for running some tests. Per my understanding, currently this would entail unconditionally loading the entire dataset in the Spark cluster's memory every time. Is that correct? Is there an alternative?
I have looked into `CqlInputFormat`'s code and I noticed that you can add `WHERE` clauses, but it looks like there are caveats to that and I could not understand how to map a (simple) predicate on a vertex property to a CQL clause. I was considering rolling my own input format class once I grokked how to run CQL queries directly. I'm not super familiar with JanusGraph's codebase, nor am I really a Java expert, but I'm willing to get my hands dirty -- could I please ask for a bird's eye view explanation of how graph data is mapped into the backend, or even just pointers on how to navigate the codebase pertaining to that? Or do you have other suggestions that could point me in the right direction?
Thank you! 🙌
cc @criminosis
6 Replies
@porunov / @Bo any ideas or any recommendations on who may know? We weren't sure if this was more meaningful to ask here or on the Tinkerpop side.
👋🏻 Hey! We've used `CqlInputFormat` to dump entire graphs.
I agree with your analysis. I think the `WHERE` clause you're referring to is https://github.com/JanusGraph/janusgraph/blob/v1.0/cassandra-hadoop-util/src/main/java/org/apache/cassandra/hadoop/cql3/CqlConfigHelper.java#L61.
I believe this is a CQL configuration option and not a JanusGraph one. Since JanusGraph encodes rows in its own binary format, I doubt this type of filtering would work well (happy to be wrong though!).
Moreover, Hadoop's `CqlInputFormat` is an old class, which IIRC was deprecated in favor of https://github.com/datastax/spark-cassandra-connector. That new project could be a better long-term solution.
That looks great, I wasn't aware of that. Though I don't see any way to make it play with Tinkerpop? It needs an `InputFormat` implementation somewhere for the `graphReader` property, and `spark-cassandra-connector` has no references to that.
Yes, I think a joint effort w/ the TinkerPop community would be needed.
@johndisandonato probably good to move this to the Tinkerpop discord for that side of things