
Validating Import Performance of Nebula Importer

duspring
July 1, 2021

Machine Specifications for Testing

| Host Name | OS | CPU Architecture | CPU Cores | Memory | Disk |
| --------- | -- | ---------------- | --------- | ------ | ---- |
| hadoop10 | CentOS 7.6 | x86_64 | 32 | 128 GB | 1.8 TB |
| hadoop11 | CentOS 7.6 | x86_64 | 32 | 64 GB | 1 TB |
| hadoop12 | CentOS 7.6 | x86_64 | 16 | 64 GB | 1 TB |

Environment of Nebula Graph Cluster

  • Operating System: CentOS 7.5 +
  • Software required by the Nebula Graph cluster, including gcc 7.1.0+, cmake 3.5.0, glibc 2.12+, and other necessary dependencies, which can be installed with yum:
yum update
yum install -y make \
m4 \
git \
wget \
unzip \
xz \
readline-devel \
ncurses-devel \
zlib-devel \
gcc \
gcc-c++ \
cmake \
gettext \
curl \
redhat-lsb-core
  • Nebula Graph version: V2.0.0
  • Back-end storage: Three nodes, RocksDB

| Process \ Host Name | hadoop10 | hadoop11 | hadoop12 |
| ------------------- | -------- | -------- | -------- |
| # of metad processes | 1 | 1 | 1 |
| # of storaged processes | 1 | 1 | 1 |
| # of graphd processes | 1 | 1 | 1 |

Preparing Data and Introducing Data Format

| # of Vertices / File Size | # of Edges / File Size | # of Vertices and Edges / File Size |
| ------------------------- | ---------------------- | ----------------------------------- |
| 74,314,635 / 4.6 GB | 139,951,301 / 6.6 GB | 214,265,936 / 11.2 GB |

More details about the data:

  • edge.csv: 139,951,301 records in total, 6.6 GB
  • vertex.csv: 74,314,635 records in total, 4.6 GB
  • 214,265,936 vertices and edges in total, 11.2 GB
[root@hadoop10 datas]# wc -l edge.csv
139951301 edge.csv
[root@hadoop10 datas]# head -10 vertex.csv
-201035082963479683,实体
-1779678833482502384,值
4646408208538057683,胶饴
-1861609733419239066,别名: 饴糖、畅糖、畅、软糖。
-2047289935702608120,词条
5842706712819643509,词条(拼音:cí tiáo)也叫词目,是辞书学用语,指收列的词语及其释文。
-3063129772935425027,文化
-2484942249444426630,红色食品
-3877061284769534378,红色食品是指食品为红色、橙红色或棕红色的食品。
-3402450096279275143,否
[root@hadoop10 datas]# wc -l vertex.csv
74314635 vertex.csv
[root@hadoop10 datas]# head -10 edge.csv
-201035082963479683,-1779678833482502384,属性
4646408208538057683,-1861609733419239066,描述
-2047289935702608120,5842706712819643509,描述
-2047289935702608120,-3063129772935425027,标签
-2484942249444426630,-3877061284769534378,描述
-2484942249444426630,-2484942249444426630,中文名
-2484942249444426630,-3402450096279275143,是否含防腐剂
-2484942249444426630,4786182067583989997,主要食用功效
-2484942249444426630,-8978611301755314833,适宜人群
-2484942249444426630,-382812815618074210,用途
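The checks above can be bundled into a small script. This is a sketch, assuming the CSV files sit in the current directory and follow the layouts shown above, two fields per vertex line (VID, name) and three per edge line (source VID, destination VID, name); names containing ASCII commas would be flagged too:

```shell
#!/bin/sh
# Sanity-check the CSV files before import: count records and flag lines
# whose comma-separated field count differs from the expected layout.
check_csv() {
  file=$1
  expected_fields=$2
  lines=$(wc -l < "$file" | tr -d ' ')
  echo "$file: $lines lines"
  # Show up to 5 lines that do not have the expected number of fields.
  awk -F',' -v n="$expected_fields" 'NF != n { print FNR ": " $0 }' "$file" | head -5
}

# Run the check when the files are present (paths are assumptions).
[ -f vertex.csv ] && check_csv vertex.csv 2   # VID, name
[ -f edge.csv ] && check_csv edge.csv 3       # source VID, destination VID, name
```

Running this before the import avoids discovering malformed rows only in the importer's error log.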

Validating Solution

Solution: Using Nebula Importer to import data in batch.

Edit a YAML file for importing data.
version: v1rc1
description: example
clientSettings:
  concurrency: 10 # number of graph clients
  channelBufferSize: 128
  space: test
  connection:
    user: user
    password: password
    address: 191.168.7.10:9669,191.168.7.11:9669,191.168.7.12:9669
logPath: ./err/test.log
files:
  - path: ./vertex.csv
    failDataPath: ./err/vertex.csv
    batchSize: 100
    type: csv
    csv:
      withHeader: false
      withLabel: false
    schema:
      type: vertex
      vertex:
        tags:
          - name: entity
            props:
              - name: name
                type: string
  - path: ./edge.csv
    failDataPath: ./err/edge.csv
    batchSize: 100
    type: csv
    csv:
      withHeader: false
      withLabel: false
    schema:
      type: edge
      edge:
        name: relation
        withRanking: false
        props:
          - name: name
            type: string
Create schema

On Nebula Console, create a graph space, and then tags and edge types in the graph space.

# 1. Create a graph space.
(admin@nebula) [(none)]> create space test2(vid_type = FIXED_STRING(64));
# 2. Switch to the specified graph space.
(admin@nebula) [(none)]> use test2;
# 3. Create a tag.
(admin@nebula) [test2]> create tag entity(name string);
# 4. Create an edge type.
(admin@nebula) [test2]> create edge relation(name string);
# 5. View the definition of the tag.
(admin@nebula) [test2]> describe tag entity;
+--------+----------+-------+---------+
| Field | Type | Null | Default |
+--------+----------+-------+---------+
| "name" | "string" | "YES" | |
+--------+----------+-------+---------+
Got 1 rows (time spent 703/1002 us)
# 6. View the definition of the edge type.
(admin@nebula) [test2]> describe edge relation;
+--------+----------+-------+---------+
| Field | Type | Null | Default |
+--------+----------+-------+---------+
| "name" | "string" | "YES" | |
+--------+----------+-------+---------+
Got 1 rows (time spent 703/1041 us)
Compile

Compile Nebula Importer and run shell commands.

# Compile Nebula Importer.
make build
# Run the shell command where a YAML configuration file is specified.
/opt/software/nebulagraph/nebula-importer/nebula-importer --config /opt/software/datas/rdf-import2.yaml
View the output:
# View part of logs.
2021/04/19 19:05:55 [INFO] statsmgr.go:61: Tick: Time(2400.00s), Finished(210207018), Failed(0), Latency AVG(32441us), Batches Req AVG(33824us), Rows AVG(87586.25/s)
2021/04/19 19:06:00 [INFO] statsmgr.go:61: Tick: Time(2405.00s), Finished(210541418), Failed(0), Latency AVG(32461us), Batches Req AVG(33844us), Rows AVG(87543.20/s)
2021/04/19 19:06:05 [INFO] statsmgr.go:61: Tick: Time(2410.00s), Finished(210901218), Failed(0), Latency AVG(32475us), Batches Req AVG(33857us), Rows AVG(87510.88/s)
2021/04/19 19:06:10 [INFO] statsmgr.go:61: Tick: Time(2415.00s), Finished(211270318), Failed(0), Latency AVG(32486us), Batches Req AVG(33869us), Rows AVG(87482.50/s)
2021/04/19 19:06:15 [INFO] statsmgr.go:61: Tick: Time(2420.00s), Finished(211685318), Failed(0), Latency AVG(32490us), Batches Req AVG(33873us), Rows AVG(87473.27/s)
2021/04/19 19:06:20 [INFO] statsmgr.go:61: Tick: Time(2425.00s), Finished(211959718), Failed(0), Latency AVG(32517us), Batches Req AVG(33900us), Rows AVG(87406.07/s)
2021/04/19 19:06:25 [INFO] statsmgr.go:61: Tick: Time(2430.00s), Finished(212220818), Failed(0), Latency AVG(32545us), Batches Req AVG(33928us), Rows AVG(87333.67/s)
2021/04/19 19:06:30 [INFO] statsmgr.go:61: Tick: Time(2435.00s), Finished(212433518), Failed(0), Latency AVG(32579us), Batches Req AVG(33963us), Rows AVG(87241.69/s)
2021/04/19 19:06:35 [INFO] statsmgr.go:61: Tick: Time(2440.00s), Finished(212780818), Failed(0), Latency AVG(32593us), Batches Req AVG(33977us), Rows AVG(87205.25/s)
2021/04/19 19:06:40 [INFO] statsmgr.go:61: Tick: Time(2445.01s), Finished(213240518), Failed(0), Latency AVG(32589us), Batches Req AVG(33973us), Rows AVG(87214.69/s)
2021/04/19 19:06:40 [INFO] reader.go:180: Total lines of file(/opt/software/datas/edge.csv) is: 139951301, error lines: 0
2021/04/19 19:06:42 [INFO] statsmgr.go:61: Done(/opt/software/datas/edge.csv): Time(2446.70s), Finished(213307919), Failed(0), Latency AVG(32585us), Batches Req AVG(33968us), Rows AVG(87181.95/s)
2021/04/19 19:06:42 Finish import data, consume time: 2447.20s
2021/04/19 19:06:43 --- END OF NEBULA IMPORTER ---
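After the import finishes, the imported totals can be verified on Nebula Console. A minimal sketch, assuming Nebula Graph 2.0's statistics job:

# Collect statistics for the current graph space.
(admin@nebula) [test2]> submit job stats;
# Show the numbers of vertices and edges once the job has finished.
(admin@nebula) [test2]> show stats;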

Pay special attention to the statistics of the results:

Time(2446.70s), Finished(213307919), Failed(0), Latency AVG(32585us), Batches Req AVG(33968us), Rows AVG(87181.95/s)
2021/04/19 19:06:42 Finish import data, consume time: 2447.20s
2021/04/19 19:06:43 --- END OF NEBULA IMPORTER ---
Resource Requirements

The import places high demands on the machine specifications, including the number of CPU cores, memory size, and disk size.

  • hadoop10: [screenshots of CPU and memory usage]
  • hadoop11: [screenshots of CPU and memory usage]
  • hadoop12: [screenshots of CPU and memory usage]

Recommendations on the machine specifications:

  1. Comparing the memory consumption of the three machines shows that importing more than 200 million records consumes a large amount of memory, so we recommend configuring as much memory as possible.
  2. For the information about the CPU cores and disk size, see the documentation: https://docs.nebula-graph.io.
nGQL Statements Test

The native graph query language of Nebula Graph is nGQL, which is compatible with OpenCypher. For now, nGQL does not support traversing all vertices and edges without an index; for example, MATCH (v) RETURN v is not supported yet. Make sure that at least one index is available for a MATCH statement to use. If related vertices, edges, or properties already exist when you create an index, you must rebuild the index after creating it for it to take effect.
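As an illustration of the index requirement, an index on the entity tag created earlier could be created and rebuilt like this (the index name and the indexed string length are assumptions):

# Create an index on the name property of the entity tag.
(admin@nebula) [test2]> create tag index entity_name_index on entity(name(64));
# Rebuild the index so that data imported before the index existed is covered.
(admin@nebula) [test2]> rebuild tag index entity_name_index;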

To test whether nGQL is compatible with OpenCypher, run:

# Test OpenCypher statements.
# Import an nGQL file.
./nebula-console -addr 191.168.7.10 -port 9669 -u user -p password -t 120 -f /opt/software/datas/basketballplayer-2.X.ngql
[screenshots of the execution results]

Conclusion

This test validated the performance of importing a large amount of data into a three-node Nebula Graph cluster. The batch writing performance of Nebula Importer can meet the performance requirements of production scenarios. However, when data is imported from CSV files, the files must be stored in HDFS, and a YAML configuration file is needed to specify the configuration of the tags and edge types for the tool to process.

Would like to know more about Nebula Graph? Join the Slack channel!
