shivam.choudhary
JanusGraph
Created by shivam.choudhary on 12/14/2023 in #questions
OLAP using Spark cluster taking much more time than expected
Hi All, We have set up a Spark cluster to run OLAP queries on JanusGraph with Bigtable as the storage backend. Details:
Backend: *Bigtable*
Vertices: *~4 Billion*
Data in backend: *~3.6 TB*
Spark workers: *2 workers, each with 6 CPUs and 25 GB RAM*
Spark executors: *6 executors per worker, each with 1 CPU and 4 GB RAM*
Now I'm trying to count all the vertices with the label 'ticket', which we know number on the order of ~100k. The query fired to do that is as follows:
graph = GraphFactory.open("conf/hadoop-graph/read-hbase-cluster.properties")
g = graph.traversal().withComputer(org.apache.tinkerpop.gremlin.spark.process.computer.SparkGraphComputer)
g.withComputer(Computer.compute().vertices(hasLabel('ticket'))).V().count()
The query has been running for the past 36 hours and is still not complete. Looking at the average throughput (>50 MB/sec) at which data is being read, it should have read the ~3.6 TB of data by now. Is it possible to use indexes while running the OLAP query so that only the relevant subgraph is loaded into the Spark RDDs (currently it is scanning the full graph)?
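For reference, this is roughly how I would expect a graph filter to be attached so that only 'ticket' vertices are pulled into the RDDs; an untested sketch based on the TinkerPop graph-filter docs, not what we currently run:
// sketch: build the SparkGraphComputer with a vertex filter up front,
// so the input format (if it is GraphFilterAware) can skip non-matching vertices
graph = GraphFactory.open("conf/hadoop-graph/read-hbase-cluster.properties")
g = graph.traversal().withComputer(
        Computer.compute(org.apache.tinkerpop.gremlin.spark.process.computer.SparkGraphComputer).
                 vertices(hasLabel('ticket')))
g.V().count()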
2 replies
Apache TinkerPop
Created by shivam.choudhary on 12/10/2023 in #questions
Implementing Graph Filter for Sub-Graph Loading in Spark Cluster with JanusGraph
Hello, I'm currently using JanusGraph 0.6.4 with Bigtable as the storage backend and encountering difficulties when attempting to run OLAP queries on my graph via SparkGraphComputer. The graph is quite large, containing billions of vertices, and I'm only able to execute queries on significantly smaller graphs. My queries are run through the Gremlin console, and the problem appears to be related to loading the graph into the Spark RDD. I'm interested in applying a filter so that only vertices and edges with specific labels are loaded before the query runs. I've noticed that creating a vertex program and using it as described in the TinkerPop documentation (https://tinkerpop.apache.org/docs/current/reference/#graph-filter) loads only the specified subgraph into the Spark RDDs:
graph.compute().
    vertices(hasLabel("person")).
    vertexProperties(__.properties("name")).
    edges(bothE("knows")).
    program(PageRankVertexProgram...)
Is it possible to implement this filtering directly through the Gremlin console? I've attempted to use g.V().limit(1), but without success. I suspect this is because the entire graph is being loaded into the RDD for this query as well. Here's the code I used:
graph = GraphFactory.open("conf/hadoop-graph/read-hbase-cluster.properties")
hg = graph.traversal().withComputer(org.apache.tinkerpop.gremlin.spark.process.computer.SparkGraphComputer)
hg.V().limit(1)
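What I'm hoping for is something along these lines; an untested sketch, assuming the vertices/edges filter can be passed through a Computer when building the traversal source (label names are just examples):
graph = GraphFactory.open("conf/hadoop-graph/read-hbase-cluster.properties")
// sketch: attach the graph filter to the Computer instead of writing a vertex program
hg = graph.traversal().withComputer(
         Computer.compute(org.apache.tinkerpop.gremlin.spark.process.computer.SparkGraphComputer).
                  vertices(hasLabel('person')).
                  edges(bothE('knows')))
hg.V().limit(1)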
Any insights or suggestions would be greatly appreciated. Thank you.
4 replies
JanusGraph
Created by shivam.choudhary on 8/14/2023 in #questions
Impact of ID Pool Initialisation on Query Performance
Greetings everyone, we're currently operating a JanusGraph setup with cluster.max-partitions set to 1024 and ids.num-partition set to 10. Our primary goal is to ensure high availability for the cluster instances. However, we've noticed that initialization of the ID pool is causing disruptions during server restarts. The root cause seems to be that ID pool threads are initialised for each partition until every partition has an ID pool. After a server restart, the ID pool is initialised lazily, driven by write operations, and this process has been negatively impacting query execution performance. To mitigate this, we're exploring an eager initialisation approach for the ID pool threads. Is there a way to achieve it? Thank you for your attention and assistance in addressing this matter.
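In the meantime we're considering tuning the id-allocation options so each instance reserves larger blocks less often after a restart; a sketch of the knobs we're looking at (values are illustrative, and this is only a mitigation, not true eager initialisation):
# illustrative values only; larger blocks mean fewer allocations on the write path after a restart
ids.block-size=500000
# controls how early a fresh block is acquired asynchronously, before the current one runs dry
ids.renew-percentage=0.3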
5 replies
Apache TinkerPop
Created by shivam.choudhary on 7/31/2023 in #questions
User-Agent Metric Not Exposed in Gremlin Server - Need Help Troubleshooting
Hey everyone, I've been working with Gremlin and noticed that we can pass the User-Agent in requests to the Gremlin server. According to the documentation (reference: https://tinkerpop.apache.org/docs/current/reference/#metrics), the server should maintain a metric called user-agent.*, which counts the number of connection requests from clients providing a specific user agent. We have already implemented sending the User-Agent in our HTTP requests to the Gremlin server, but the metric mentioned in the documentation doesn't seem to be exposed or working as expected. Has anyone encountered a similar issue? Do we need to enable the metric in some way, or could there be something else causing the problem? Any help or insights on this matter would be greatly appreciated. Thanks!
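For context, our gremlin-server.yaml metrics section looks roughly like the trimmed sketch below (reporter names as in the stock config; this is what we assumed was sufficient, so please correct us if the user-agent metric needs something more):
metrics: {
  jmxReporter: {enabled: true},
  slf4jReporter: {enabled: true, interval: 180000}}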
9 replies
JanusGraph
Created by shivam.choudhary on 7/30/2023 in #questions
JanusGraph Instance startup failure due to id block allocation
No description
2 replies
JanusGraph
Created by shivam.choudhary on 7/27/2023 in #questions
Data Storage with TTL
Hi Everyone, We have a requirement to store around ~80 million records daily in our graph storage, with a TTL of 90 days (roughly 7 billion records retained at any time). The issue is that a TTL can be set on static vertices only, and we don't want to make these vertices static because that would prevent further updates to them (correct me if I'm wrong). Please suggest a way we can store this data and still have a TTL. We are using the Bigtable backend, so would it be possible to simply set a GC policy (90 days) on the Bigtable column families?
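For reference, the static-vertex TTL we're trying to avoid looks roughly like this (sketch; the 'event' label name is just illustrative):
// sketch: a TTL can only be set on a static vertex label, which blocks later updates to those vertices
mgmt = graph.openManagement()
event = mgmt.makeVertexLabel('event').setStatic().make()
mgmt.setTTL(event, Duration.ofDays(90))   // requires java.time.Duration
mgmt.commit()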
4 replies
Apache TinkerPop
Created by shivam.choudhary on 7/18/2023 in #questions
ReadOnlyStrategy for remote script execution to make a read-only server instance
Hi all, I am setting up a read-only cluster of Gremlin Server. I have configured the initialization script like this: globals << [g : traversal().withEmbedded(graph).withStrategies(ReferenceElementStrategy)] Now when I use g and send a write request to the Gremlin Server, I get the proper exception and cannot add data. The issue I'm facing is that the graph instance is still accessible directly, so a request like graph.traversal().addV() works in place of g.addV(). Is there a way I can restrict this and make the server reject write requests entirely? TIA.
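For context, what I'd like is for every traversal source to be read-only out of the box; the globals line with ReadOnlyStrategy added would look roughly like this (sketch, which still wouldn't stop someone spawning a fresh traversal from graph directly):
// sketch: add ReadOnlyStrategy alongside ReferenceElementStrategy in the init script
globals << [g : traversal().withEmbedded(graph).
                            withStrategies(ReadOnlyStrategy.instance(), ReferenceElementStrategy)]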
13 replies
JanusGraph
Created by shivam.choudhary on 6/27/2023 in #questions
JanusGraph metrics data having value 0 for most metrics
I have a JanusGraph server with metrics enabled along with JMX metrics enabled. The issue is that all the metrics starting with org_janusgraph have a constant value of 0. I can see all the metrics with the org_apache_tinkerpop prefix, but not the JanusGraph ones. Can someone please suggest what I'm missing, or how I can enable them?
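For reference, the graph-level settings we have are roughly the following (sketch; key names from the JanusGraph metrics options, which is why we expected the org_janusgraph metrics to be populated):
# basic metrics switch plus the JMX reporter
metrics.enabled=true
metrics.jmx.enabled=true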
11 replies
Apache TinkerPop
Created by shivam.choudhary on 6/27/2023 in #questions
[parameterized queries] Increased time in query evaluation when gremlin server starts/restarts
Hi folks, we're observing high latency in query evaluation time whenever the JanusGraph server starts or restarts, and the degradation lasts for at least 5 minutes.
* I'm using parameterized queries for the JanusGraph server requests, so I expect some increased latency when the server starts/restarts, but this degradation does not go away for at least 5 minutes and evaluation latency goes from around 300 ms to 5,000 ms.
* JanusGraph is deployed in a Kubernetes cluster with 20 pods, so every time I redeploy the cluster this issue arises, resulting in timeouts on the client side.
I wanted to know if there is a way to add all the parameterized queries to the cache up front, so that whenever a started/restarted JanusGraph pod is ready to serve requests, all the parameterized queries are already cached.
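One thing we're considering is a warm-up step that replays each parameterized script once against a pod before it is marked ready; a rough sketch using the Gremlin driver (the host, port, and the ticketId/tid binding are placeholders, not our real schema):
// sketch: replay each parameterized script once so it lands in the server-side script cache
cluster = Cluster.build('janusgraph-pod.example.internal').port(8182).create()
client  = cluster.connect()
client.submit("g.V().has('ticketId', tid).limit(1)", [tid: 'warmup']).all().get()
client.close()
cluster.close()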
24 replies
JanusGraph
Created by shivam.choudhary on 5/31/2023 in #questions
Reading the writes on Instance A done by Instance B
We are facing an issue with a JanusGraph cluster of around 20 instances. If we write a vertex on instance A and try to read it from another instance, it takes a few seconds for the write to become visible on the other instances. I have checked that storage.batch-loading is set to false.
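For completeness, the instance-local database cache is what we suspect; a sketch of the settings we're looking at (values are what we'd try, not what is currently set):
# the per-instance database cache can serve stale reads for up to db-cache-time ms
cache.db-cache=false
# alternatively keep the cache but bound how stale a read can be (value illustrative)
cache.db-cache-time=10000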
5 replies
Apache TinkerPop
Created by shivam.choudhary on 3/10/2023 in #questions
Verifying the count of ingested vertices and edges after bulk loading in JanusGraph
I have bulk loaded around 600k vertices and 800k edges into my JanusGraph cluster backed by Bigtable. I want to verify the number of vertices with a given label 'A' using a Gremlin query, but I'm getting an evaluation timeout error. The evaluation timeout is set to 5 minutes. The Gremlin query used is: g.V().hasLabel('A').count() Can anyone help me with how I can verify the count of vertices and edges loaded into the graph? Thanks.
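The only workaround I can think of is raising the timeout for this one traversal; a sketch (the 10-minute value is arbitrary), assuming the per-request evaluationTimeout override is honoured by the server:
// sketch: override the evaluation timeout for just this count, 10 minutes in ms
g.with('evaluationTimeout', 600000L).V().hasLabel('A').count()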
4 replies