Hands-On Experience: Import Data to Nebula Graph with Spark

This article is written by Liu Jiahao, an engineer on the big data team at IntSig Information Co., Ltd (IntSig). He has been playing around with NebulaGraph and is one of our proud GitHub contributors. This post shares his experience importing data into NebulaGraph with Spark.

Why NebulaGraph?

As our graph-related business grows more complex, performance bottlenecks have shown up in some popular graph databases. For example, a single machine has difficulty scaling to larger graphs. In terms of performance, the native graph storage of Neo4j has advantages that are hard to replace; in my survey, JanusGraph, Dgraph, and other graph databases are not comparable to Neo4j in this regard. JanusGraph performs very well in OLAP and supports OLTP to some extent, but that is no longer a real advantage, because technologies such as GraphFrames are sufficient for OLAP requirements. Moreover, since Spark 3.0 started to support Cypher, I found that, compared with the OLTP requirements of graphs, the OLAP requirements can be satisfied by ever more technologies. Therefore, NebulaGraph undoubtedly stands out as a breakthrough among the otherwise inefficient distributed OLTP graph databases.

I did a lot of research and deployed several graph databases. After testing the OLTP efficiency of JanusGraph and finding that it could not meet my online business requirements, I stopped requiring equal or near-equal performance in both OLAP and OLTP from a single graph database. Then it occurred to me that the architecture of NebulaGraph has all the features our graph requirements call for:

  • Distributed: NebulaGraph adopts a shared-nothing distributed storage architecture.
  • High-speed OLTP, with performance comparable to Neo4j: in NebulaGraph, storage-layer queries map directly to physical addresses, which can be regarded as native graph storage.
  • Highly available services, meaning the database keeps providing stable services without human intervention: the services stay available in case of partial failures, and a snapshot feature is available.
  • Guaranteed scalability: NebulaGraph supports linear scaling and, being open source, supports custom development.

It seemed that the architecture of NebulaGraph would meet our actual requirements in the production environment, so I conducted research, deployment, and testing of NebulaGraph. For deployment and performance testing, you can find detailed information on the NebulaGraph official website and in technical blogs; see Meituan's benchmarking and Tencent Cloud's performance test. This article mainly describes my understanding of and experience with NebulaGraph after using a Spark application to import data into it.

Test Environment

  1. A Nebula Graph cluster, composed of:
    1. 3 servers with 32 cores each (actually limited to 16 cores)
    2. 400 GB RAM (actually configured to 100 GB)
    3. SSDs
    4. Version: Nebula Graph 1.0.0 (the test was done quite early)
  2. Network: 10-Gigabit networking
  3. Graph size: billions of vertices (with few properties) and tens of billions of edges (directed, with no properties or only a weight)
  4. Spark cluster: Spark 2.1.0

In this test, the Spark import job used about 2 TB of memory in total, calculated as (3 × 30 executors + 1 driver) × 25 GB = 2,275 GB.

Use Spark to Batch Import Data

Procedure

  1. Package sst.generator, which Spark requires to generate SST files.
  2. Configure the Nebula Graph cluster, start it, and then create a schema.
  3. Modify the Spark configuration file (config.conf). For more information, see the Spark configuration file.
  4. Check that no conflicting packages exist in the Spark cluster.
  5. After Spark starts, use the configuration file and sst.generator to import the data (see the submit sketch after this list).
  6. Verify the data.
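For reference, here is a minimal submit sketch. The jar name, paths, and entry class are assumptions based on the Nebula Graph 1.0-era Spark Writer and may differ in your build:

```bash
# Hedged sketch: jar name, entry class, and paths are assumptions
# from the Nebula Graph 1.0-era Spark Writer; adjust to your build.
spark-submit \
  --master yarn \
  --class com.vesoft.nebula.tools.generator.v2.SparkClientGenerator \
  sst.generator-1.0.0-beta.jar \
  -c ./config.conf
```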

Some Tips to Follow

Here are some tips for using the Spark application:

  1. I recommend that you create an index before data import.

Batch import can be executed only for offline graphs. In NebulaGraph, you can create an index while the graph service is online or offline, but an index must be rebuilt while the service is offline. To prevent possible problems during the REBUILD INDEX process, I recommend that you create the index before inserting data, even though this may slow down the batch import of vertices.
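A minimal nGQL sketch of this order of operations. The tag and index names are hypothetical, and the OFFLINE keyword follows Nebula Graph 1.x syntax:

```ngql
# Hypothetical tag and index names; Nebula Graph 1.x syntax.
CREATE TAG player(name string);
CREATE TAG INDEX player_name_index ON player(name);
# ... run the batch import here ...
REBUILD TAG INDEX player_name_index OFFLINE;
```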

  2. I recommend that you use int type values as the vertex IDs.

You can generate such values with an algorithm like Snowflake. If the vertex ID is not of the int type, you can set policy: "uuid" in the vertex or edge type configuration in the configuration file.
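For illustration, here is a minimal Scala sketch of Snowflake-style ID generation. The bit widths and custom epoch are illustrative assumptions, not the importer's actual implementation:

```scala
// Hedged sketch: a Snowflake-style 64-bit ID packs timestamp, worker ID,
// and a per-millisecond sequence. Bit widths and epoch are assumptions.
class SnowflakeId(workerId: Long) {
  private val epoch      = 1546300800000L // 2019-01-01 UTC, arbitrary custom epoch
  private val workerBits = 10L
  private val seqBits    = 12L
  private var lastTs     = -1L
  private var seq        = 0L

  def nextId(): Long = synchronized {
    var ts = System.currentTimeMillis()
    if (ts == lastTs) {
      seq = (seq + 1) & ((1L << seqBits) - 1) // wrap within the same millisecond
      if (seq == 0) {                         // sequence exhausted: spin to next ms
        while (ts <= lastTs) ts = System.currentTimeMillis()
      }
    } else {
      seq = 0
    }
    lastTs = ts
    ((ts - epoch) << (workerBits + seqBits)) | (workerId << seqBits) | seq
  }
}
```

In a Spark job, each executor would need a distinct workerId to keep the generated IDs globally unique.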

  3. If your Spark is deployed in standalone mode, you can ignore the conflicting-package problem, which may not occur in that mode. If your Spark runs on a cluster, the problem may occur, because some packages in sst.generator may conflict with packages in the Spark cluster. To resolve the conflicts, you can shade or rename the conflicting packages, as sketched below.
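One way to do the shading, assuming the project builds with sbt and the sbt-assembly plugin; the relocated package (Guava) is only a typical example, not necessarily the one that conflicts here:

```scala
// build.sbt sketch: relocate a conflicting package with sbt-assembly.
// Guava is only an example; substitute the packages that actually clash
// between sst.generator and your Spark cluster.
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("com.google.common.**" -> "shaded.com.google.common.@1").inAll
)
```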

  4. Tune Spark. You can adjust the parameters to meet your business requirements, reducing memory consumption as much as possible to save resources and increasing parallelism to speed up the import.
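As a starting point, these are the standard Spark knobs I mean. The values below are placeholders that echo the (3 × 30 executors + 1 driver) × 25 GB sizing used in this test, not recommendations:

```bash
# Standard Spark tuning knobs; the values are placeholders.
spark-submit \
  --conf spark.executor.instances=90 \
  --conf spark.executor.cores=4 \
  --conf spark.executor.memory=25g \
  --conf spark.default.parallelism=1000 \
  --class com.vesoft.nebula.tools.generator.v2.SparkClientGenerator \
  sst.generator-1.0.0-beta.jar \
  -c ./config.conf
```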

Data Import Performance

With the indexes built in advance, it took about 20 hours to batch import billions of vertices (with few properties) and tens of billions of edges (directed, with no properties or only a weight).

Lessons Learned While Submitting GitHub PRs to NebulaGraph

I was using the Spark application with an earlier version of NebulaGraph, so problems were inevitable. During the process, I made some modifications to SparkClientGenerator.scala.

  1. In the early stage, when I was using Spark Writer (now Exchange) to import data into NebulaGraph, some columns could not be aligned correctly. By reading the source code, I found a bug in SparkClientGenerator.scala: the configuration file was read instead of the Parquet or JSON files. I fixed the bug and submitted PR#2187, my first PR to NebulaGraph. I was very happy when it was approved.

  2. Later, when I used SparkClientGenerator with the uuid() or hash() function, I found that duplicate double quotation marks were introduced, so the batch import could not be completed.

The extra double quotation marks were introduced during data type conversion. I found that a function called extraIndexValue could convert user-defined values from a non-string type to the String type. I thought some users might want to convert non-string indexes (for example, array values) to uuid or hash, so I changed some source code and submitted a new PR.

Unfortunately, this new PR had a rough ride. I made several commits but still could not get approval. After communicating with @darionyaphet, a developer at NebulaGraph, I learned that I had changed the format of the source data. In his opinion, when users import data in an unsupported format, the correct response is to throw an exception rather than silently convert the format. That makes sense: I had focused so much on my own business scenario, and on just getting the code to run, that I did not consider the general purpose of the tool.

In response, I submitted another PR, PR#2258, which was approved and merged. I learned a lot from this PR.
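The reviewer's point boils down to a fail-fast pattern. Here is a hedged Scala sketch of the idea; it is not the actual Exchange code:

```scala
// Sketch of the review feedback, not the actual Exchange code:
// reject unsupported index-value types instead of silently converting them.
def indexValue(value: Any): String = value match {
  case s: String => s
  case i: Int    => i.toString
  case l: Long   => l.toString
  case other =>
    throw new IllegalArgumentException(
      s"Unsupported index value type: ${other.getClass.getName}")
}
```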

  3. Later, I found that in nebula-python the thrift package conflicted with fbthrift. I thought about shading the conflicting packages and submitting another PR, but given the scale of the modifications, I gave up and raised an issue to NebulaGraph instead, which was fixed recently.

Word from NebulaGraph: You are welcome to submit PRs to NebulaGraph on GitHub. Here are some issues for your reference: https://github.com/vesoft-inc/nebula/issues

Summary

Before I started using NebulaGraph, I had evaluated JanusGraph thoroughly. After the comparison, I was impressed by NebulaGraph's relatively few hidden bugs and its active community. After the test, NebulaGraph proved itself with its efficiency and became my first choice for a distributed graph system.

The NebulaGraph forum, groups, and GitHub projects are very active. Users get replies quickly, and PRs are reviewed efficiently. I think this is the most important factor behind the rapid, strong growth of this graph database. I hope to keep witnessing NebulaGraph's growth and to contribute to the NebulaGraph ecosystem!