
Validating Import Performance of Nebula Importer

duspring
July 1, 2021

Machine Specifications for Testing

| Host Name | OS | CPU Architecture | CPU Cores | Memory | Disk |
| --------- | -- | ---------------- | --------- | ------ | ---- |
| hadoop10 | CentOS 7.6 | x86_64 | 32 | 128 GB | 1.8 TB |
| hadoop11 | CentOS 7.6 | x86_64 | 32 | 64 GB | 1 TB |
| hadoop12 | CentOS 7.6 | x86_64 | 16 | 64 GB | 1 TB |

Environment of Nebula Graph Cluster

  • Operating System: CentOS 7.5 +
  • Software required by the Nebula Graph cluster, including gcc 7.1.0+, cmake 3.5.0, glibc 2.12+, and other necessary dependencies, which can be installed with yum:
yum update
yum install -y make \
m4 \
git \
wget \
unzip \
xz \
readline-devel \
ncurses-devel \
zlib-devel \
gcc \
gcc-c++ \
cmake \
gettext \
curl \
redhat-lsb-core
  • Nebula Graph version: V2.0.0
  • Back-end storage: Three nodes, RocksDB

| Process \ Host Name | hadoop10 | hadoop11 | hadoop12 |
| ------------------- | -------- | -------- | -------- |
| # of metad processes | 1 | 1 | 1 |
| # of storaged processes | 1 | 1 | 1 |
| # of graphd processes | 1 | 1 | 1 |

Preparing Data and Introducing Data Format

| # of Vertices / File Size | # of Edges / File Size | # of Vertices and Edges / File Size |
| ------------------------- | ---------------------- | ----------------------------------- |
| 74,314,635 / 4.6 GB | 139,951,301 / 6.6 GB | 214,265,936 / 11.2 GB |

More details about the data:

  • edge.csv: 139,951,301 records in total, 6.6 GB
  • vertex.csv: 74,314,635 records in total, 4.6 GB
  • 214,265,936 vertices and edges in total, 11.2 GB
[root@hadoop10 datas]# wc -l edge.csv
139951301 edge.csv
[root@hadoop10 datas]# head -10 vertex.csv
-201035082963479683,实体
-1779678833482502384,值
4646408208538057683,胶饴
-1861609733419239066,别名: 饴糖、畅糖、畅、软糖。
-2047289935702608120,词条
5842706712819643509,词条(拼音:cí tiáo)也叫词目,是辞书学用语,指收列的词语及其释文。
-3063129772935425027,文化
-2484942249444426630,红色食品
-3877061284769534378,红色食品是指食品为红色、橙红色或棕红色的食品。
-3402450096279275143,否
[root@hadoop10 datas]# wc -l vertex.csv
74314635 vertex.csv
[root@hadoop10 datas]# head -10 edge.csv
-201035082963479683,-1779678833482502384,属性
4646408208538057683,-1861609733419239066,描述
-2047289935702608120,5842706712819643509,描述
-2047289935702608120,-3063129772935425027,标签
-2484942249444426630,-3877061284769534378,描述
-2484942249444426630,-2484942249444426630,中文名
-2484942249444426630,-3402450096279275143,是否含防腐剂
-2484942249444426630,4786182067583989997,主要食用功效
-2484942249444426630,-8978611301755314833,适宜人群
-2484942249444426630,-382812815618074210,用途
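The checks above can be bundled into a small script. This is a sketch, assuming the CSV files sit in the current directory and follow the layouts shown above, two fields per vertex line (VID, name) and three per edge line (source VID, destination VID, name); names containing ASCII commas would be flagged too:

```shell
#!/bin/sh
# Sanity-check the CSV files before import: count records and flag lines
# whose comma-separated field count differs from the expected layout.
check_csv() {
  file=$1
  expected_fields=$2
  lines=$(wc -l < "$file" | tr -d ' ')
  echo "$file: $lines lines"
  # Show up to 5 lines that do not have the expected number of fields.
  awk -F',' -v n="$expected_fields" 'NF != n { print FNR ": " $0 }' "$file" | head -5
}

# Run the check when the files are present (paths are assumptions).
[ -f vertex.csv ] && check_csv vertex.csv 2   # VID, name
[ -f edge.csv ] && check_csv edge.csv 3       # source VID, destination VID, name
```

Running this before the import avoids discovering malformed rows only in the importer's error log.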

Validating Solution

Solution: Using Nebula Importer to import data in batch.

Edit a YAML file for importing data.
version: v1rc1
description: example
clientSettings:
  concurrency: 10 # number of graph clients
  channelBufferSize: 128
  space: test
  connection:
    user: user
    password: password
    address: 191.168.7.10:9669,191.168.7.11:9669,191.168.7.12:9669
logPath: ./err/test.log
files:
  - path: ./vertex.csv
    failDataPath: ./err/vertex.csv
    batchSize: 100
    type: csv
    csv:
      withHeader: false
      withLabel: false
    schema:
      type: vertex
      vertex:
        tags:
          - name: entity
            props:
              - name: name
                type: string
  - path: ./edge.csv
    failDataPath: ./err/edge.csv
    batchSize: 100
    type: csv
    csv:
      withHeader: false
      withLabel: false
    schema:
      type: edge
      edge:
        name: relation
        withRanking: false
        props:
          - name: name
            type: string
Create schema

On Nebula Console, create a graph space, and then tags and edge types in the graph space.

# 1. Create a graph space.
(admin@nebula) [(none)]> create space test2(vid_type = FIXED_STRING(64));
# 2. Switch to the specified graph space.
(admin@nebula) [(none)]> use test2;
# 3. Create a tag.
(admin@nebula) [test2]> create tag entity(name string);
# 4. Create an edge type.
(admin@nebula) [test2]> create edge relation(name string);
# 5. View the definition of the tag.
(admin@nebula) [test2]> describe tag entity;
+--------+----------+-------+---------+
| Field | Type | Null | Default |
+--------+----------+-------+---------+
| "name" | "string" | "YES" | |
+--------+----------+-------+---------+
Got 1 rows (time spent 703/1002 us)
# 6. View the definition of the edge type.
(admin@nebula) [test2]> describe edge relation;
+--------+----------+-------+---------+
| Field | Type | Null | Default |
+--------+----------+-------+---------+
| "name" | "string" | "YES" | |
+--------+----------+-------+---------+
Got 1 rows (time spent 703/1041 us)
Compile

Compile Nebula Importer and run shell commands.

# Compile Nebula Importer.
make build
# Run the shell command where a YAML configuration file is specified.
/opt/software/nebulagraph/nebula-importer/nebula-importer --config /opt/software/datas/rdf-import2.yaml
View the output:
# View part of logs.
2021/04/19 19:05:55 [INFO] statsmgr.go:61: Tick: Time(2400.00s), Finished(210207018), Failed(0), Latency AVG(32441us), Batches Req AVG(33824us), Rows AVG(87586.25/s)
2021/04/19 19:06:00 [INFO] statsmgr.go:61: Tick: Time(2405.00s), Finished(210541418), Failed(0), Latency AVG(32461us), Batches Req AVG(33844us), Rows AVG(87543.20/s)
2021/04/19 19:06:05 [INFO] statsmgr.go:61: Tick: Time(2410.00s), Finished(210901218), Failed(0), Latency AVG(32475us), Batches Req AVG(33857us), Rows AVG(87510.88/s)
2021/04/19 19:06:10 [INFO] statsmgr.go:61: Tick: Time(2415.00s), Finished(211270318), Failed(0), Latency AVG(32486us), Batches Req AVG(33869us), Rows AVG(87482.50/s)
2021/04/19 19:06:15 [INFO] statsmgr.go:61: Tick: Time(2420.00s), Finished(211685318), Failed(0), Latency AVG(32490us), Batches Req AVG(33873us), Rows AVG(87473.27/s)
2021/04/19 19:06:20 [INFO] statsmgr.go:61: Tick: Time(2425.00s), Finished(211959718), Failed(0), Latency AVG(32517us), Batches Req AVG(33900us), Rows AVG(87406.07/s)
2021/04/19 19:06:25 [INFO] statsmgr.go:61: Tick: Time(2430.00s), Finished(212220818), Failed(0), Latency AVG(32545us), Batches Req AVG(33928us), Rows AVG(87333.67/s)
2021/04/19 19:06:30 [INFO] statsmgr.go:61: Tick: Time(2435.00s), Finished(212433518), Failed(0), Latency AVG(32579us), Batches Req AVG(33963us), Rows AVG(87241.69/s)
2021/04/19 19:06:35 [INFO] statsmgr.go:61: Tick: Time(2440.00s), Finished(212780818), Failed(0), Latency AVG(32593us), Batches Req AVG(33977us), Rows AVG(87205.25/s)
2021/04/19 19:06:40 [INFO] statsmgr.go:61: Tick: Time(2445.01s), Finished(213240518), Failed(0), Latency AVG(32589us), Batches Req AVG(33973us), Rows AVG(87214.69/s)
2021/04/19 19:06:40 [INFO] reader.go:180: Total lines of file(/opt/software/datas/edge.csv) is: 139951301, error lines: 0
2021/04/19 19:06:42 [INFO] statsmgr.go:61: Done(/opt/software/datas/edge.csv): Time(2446.70s), Finished(213307919), Failed(0), Latency AVG(32585us), Batches Req AVG(33968us), Rows AVG(87181.95/s)
2021/04/19 19:06:42 Finish import data, consume time: 2447.20s
2021/04/19 19:06:43 --- END OF NEBULA IMPORTER ---
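After the import finishes, the imported totals can be verified on Nebula Console. A minimal sketch, assuming Nebula Graph 2.0's statistics job:

# Collect statistics for the current graph space.
(admin@nebula) [test2]> submit job stats;
# Show the numbers of vertices and edges once the job has finished.
(admin@nebula) [test2]> show stats;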

Pay special attention to the statistics of the results:

Time(2446.70s), Finished(213307919), Failed(0), Latency AVG(32585us), Batches Req AVG(33968us), Rows AVG(87181.95/s)
2021/04/19 19:06:42 Finish import data, consume time: 2447.20s
2021/04/19 19:06:43 --- END OF NEBULA IMPORTER ---
Resource Requirements

The import places high demands on the machine specifications, including the number of CPU cores, memory size, and disk size.

  • hadoop10: [screenshots of CPU and memory usage]
  • hadoop11: [screenshots of CPU and memory usage]
  • hadoop12: [screenshots of CPU and memory usage]

Recommendations on the machine specifications:

  1. Comparing the memory consumption of the three machines shows that importing more than 200 million records consumes a large amount of memory, so we recommend configuring as much memory as possible.
  2. For the information about the CPU cores and disk size, see the documentation: https://docs.nebula-graph.io.
nGQL Statements Test

The native graph query language of Nebula Graph is nGQL, which is compatible with OpenCypher. For now, nGQL does not support traversing all vertices and edges without an index; for example, MATCH (v) RETURN v is not supported yet. Make sure that at least one index is available for a MATCH statement to use. If related vertices, edges, or properties already exist when you create an index, you must rebuild the index after creating it for it to take effect.
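As an illustration of the index requirement, an index on the entity tag created earlier could be created and rebuilt like this (the index name and the indexed string length are assumptions):

# Create an index on the name property of the entity tag.
(admin@nebula) [test2]> create tag index entity_name_index on entity(name(64));
# Rebuild the index so that data imported before the index existed is covered.
(admin@nebula) [test2]> rebuild tag index entity_name_index;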

To test whether nGQL is compatible with OpenCypher, run:

# Test OpenCypher statements.
# Import an nGQL file.
./nebula-console -addr 191.168.7.10 -port 9669 -u user -p password -t 120 -f /opt/software/datas/basketballplayer-2.X.ngql
[screenshots of the execution results]

Conclusion

This test validated the performance of importing a large amount of data into a three-node Nebula Graph cluster. The batch writing performance of Nebula Importer can meet the performance requirements of production scenarios. However, when data is imported from CSV files, the files must be stored in HDFS, and a YAML configuration file is needed to specify the configuration of the tags and edge types for the tool to process.

Would like to know more about Nebula Graph? Join the Slack channel!
