Analyzing Relationships in Game of Thrones With NetworkX, Gephi, and NebulaGraph (Part Two)

Analyzing Relationships in Game of Thrones With NetworkX, Gephi, and NebulaGraph

In the last post[1], we showed the character relationship for the _Game of Thrones_ by using NetworkX and Gephi. In this post, we will show you how to access data in NebulaGraph by using NetworkX.

NetworkX

NetworkX[2] is a modeling tool for the graph theory and complex networks written by Python. With its rich, easy-to-use built-in graphs and analysis algorithms, it's easy to perform complex network analysis and simulation modeling.

In NetworkX, a graph (network) is a collection of nodes together with a collection of edges. Attributes are often associated with nodes and/or edges and are optional. Nodes represent data. Edges are uniquely determined by two nodes, representing the relationship between the two nodes. Nodes and edges can also have many attributes so that more information can be stored.

The following four basic graph types are provided in NetworkX:

Graph: undirected graph.
DiGraph: directed graph.
MultiGraph: A flexible graph class that allows multiple undirected edges between pairs of nodes.
MultiDiGraph: A directed version of a MultiGraph.

Create an undirected graph in NetworkX:

import networkx as nx
G = nx.Graph()

Add nodes:

G.add_node(1)
G.add_nodes_from([2,3,4])
G.add_node(2,name='Tom',age=23)

Add edges:

G.add_edge(2,3)
G.add_edges_from([(1,2),(1,3)])
g.add_edge(1, 2, start_year=1996, end_year=2019)

In the last post (part one of this series), we have displayed the community detection algorithm Girvan-Newman provided by NetworkX.

Graph Database NebulaGraph

NetworkX usually uses local files as the data source, which is totally okay for static network researches. But when the graph network changes a lot, for example, some central nodes are deleted or important network topology changes are introduced, it is a little troublesome to generate, load, and analyze the new static files. At this time, we'd better persist the entire change process in a database, and load subgraphs or whole graphs directly from the database for real-time analysis. This post uses NebulaGraph [3] as the graph database for storing the graph data.

NebulaGraph provides two methods to get graph structures:

Query to return a subgraph.
Scan the underlying storage to get the whole graph.

The first method is suitable for obtaining certain nodes and edges that meet the requirements through fine filtering and pruning in a large-scale graph network. Whereas the second method is more suitable for the whole graph analysis. It is usually used in the early stage of the project when conducting heuristic exploration. When you further understand the graph, you can use the first method for fine pruning analysis.

By reading the code of the python client for NebulaGraph, we know that the nebula-python/nebula/ngStorage/StorageClient.py and the nebula-python/nebula/ngMeta/MetaClient.py file are the APIs that interact with the underlying storage. These files contain a series of interfaces to scan nodes, scan edges, read columns, and so on. The following two interfaces can be used to scan all nodes and edges:

def scanVertex(self, space, returnCols, allCols, limit, startTime, endTime)
def scanEdge(self, space, returnCols, allCols, limit, startTime, endTime)

Initialize a client and a ScanEdgeProcessor. The ScanEdgeProcessor decodes the edge data read from NebulaGraph:

metaClient = MetaClient([('192.168.8.16', 45500)])
metaClient.connect()
storageClient = StorageClient(metaClient)
scanEdgeProcessor = ScanEdgeProcessor(metaClient)

Initialize the parameters for the scanEdge interface:

spaceName = 'nba' # The name of the graph space to be read.
returnCols = {} # The edges (or nodes) and their property columns to be returned.
returnCols['serve'] = ['start_year', 'end_year']
returnCols['follow'] = ['degree']
allCols = False # Return all the property columns or not. When set to False, only the specified property columns in the returnCols are returned. When set to True, all property columns are returned.
limit = 100 # Maximum data entry number returned.
startTime = 0
endTime = sys.maxsize

Call the scanPartEdge interface to return the iterator for the canEdgeResponse object:

scanEdgeResponseIterator = storageClient.scanEdge(spaceName, returnCols, allCols, limit, startTime, endTime)

Keep reading the data in the ScanEdgeResponse object which the iterator points to till all the data are read:

while scanEdgeResponseIterator.hasNext():
    scanEdgeResponse = scanEdgeResponseIterator.next()
    if scanEdgeResponse is None:
        print("Error occurs while scaning edge")
        break
    processEdge(space, scanEdgeResponse)

In the preceding snippet, the process_edge is a custom function for processing the scanned edge data. This function first uses the scan_edge_processor to decode the data in the scan_edge_response. The decoded data can either be directly printed out or processed for other purposes such as importing these data into the calculation framework NetWorkX.

Process data. Here we import all the scanned data into Graph G in the NetworkX:

def processEdge(space, scanEdgeResponse):
    result = scanEdgeProcessor.process(space, scanEdgeResponse)
    # Get the corresponding rows by edge_name
    for edgeName, edgeRows in result.rows.items():
        for row in edgeRows:
            srcId = row.defaultProperties[0].getValue()
            dstId = row.defaultProperties[2].getValue()
            print('%d -> %d' % (srcId, dstId))
            props = {}
            for prop in row.properties:
                propName = prop.getName()
                propValue = prop.getValue()
                props[propName] = propValue
            G.add_edges_from([(srcId, dstId, props)]) # Add edges to Graph G in the NetworkX

Nodes scanning is similar to the preceding process.

In addition, for some of the distributed graph calculation frameworks [4], NebulaGraph provides batch reading and storing based on the partitions. We will show you in the later post.

Applying Graph Analysis in NetworkX

When importing all the nodes and edges into the NetworkX according to the preceding process, we can do some basic graph analysis and calculations:

Draw a plot:

nx.draw(G, with_labels=True, font_weight='bold')
import matplotlib.pyplot as plt
plt.show()
plt.savefig('./test.png')

The plot you just drew looks like following:

Graph of the Game of Thrones Plot

Print out all the nodes and edges in the plot:

print('nodes: ', list(G.nodes))
print('edges: ', list(G.edges))

Results printed:

nodes:  [109, 119, 129, 139, 149, 209, 219, 229, 108, 118, 128, 138, 148, 208, 218, 228, 107, 117, 127, 137, 147, 207, 217, 227, 106, 116, 126, 136, 146, 206, 216, 226, 101, 111, 121, 131, 141, 201, 211, 221, 100, 110, 120, 130, 140, 150, 200, 210, 220, 102, 112, 122, 132, 142, 202, 212, 222, 103, 113, 123, 133, 143, 203, 213, 223, 104, 114, 124, 134, 144, 204, 214, 224, 105, 115, 125, 135, 145, 205, 215, 225]
edges:  [(109, 100), (109, 125), (109, 204), (109, 219), (109, 222), (119, 200), (119, 205), (119, 113), (129, 116), (129, 121), (129, 128), (129, 216), (129, 221), (129, 229), (129, 137), (139, 138), (139, 212), (139, 218), (149, 130), (149, 219), (209, 123), (219, 130), (219, 112), (219, 104), (229, 147), (229, 116), (229, 141), (229, 144), (108, 100), (108, 101), (108, 204), (108, 206), (108, 214), (108, 215), (108, 222), (118, 120), (118, 131), (118, 205), (118, 113), (128, 116), (128, 121), (128, 201), (128, 202), (128, 205), (128, 223), (138, 115), (138, 204), (138, 210), (138, 212), (138, 221), (138, 225), (148, 127), (148, 136), (148, 137), (148, 214), (148, 223), (148, 227), (148, 213), (208, 127), (208, 103), (208, 104), (208, 124), (218, 127), (218, 110), (218, 103), (218, 104), (218, 114), (218, 105), (228, 146), (228, 145), (107, 100), (107, 204), (107, 217), (107, 224), (117, 200), (117, 136), (117, 142), (127, 114), (127, 212), (127, 213), (127, 214), (127, 222), (127, 226), (127, 227), (137, 136), (137, 213), (137, 150), (147, 136), (147, 214), (147, 223), (207, 121), (207, 140), (207, 122), (207, 134), (217, 126), (217, 141), (217, 124), (217, 144), (106, 204), (106, 212), (106, 113), (116, 141), (116, 126), (116, 210), (116, 216), (116, 121), (116, 113), (116, 105), (126, 216), (136, 210), (136, 213), (136, 214), (146, 202), (146, 210), (146, 215), (146, 222), (146, 226), (206, 123), (216, 144), (216, 105), (226, 140), (226, 112), (226, 114), (226, 144), (101, 100), (101, 102), (101, 125), (101, 204), (101, 215), (101, 113), (101, 104), (111, 200), (111, 204), (111, 215), (111, 220), (121, 202), (121, 215), (121, 113), (121, 134), (131, 205), (131, 220), (141, 124), (141, 205), (141, 225), (201, 145), (211, 124), (221, 104), (221, 124), (100, 125), (100, 204), (100, 102), (100, 113), (100, 104), (100, 144), (100, 105), (110, 204), (110, 220), (120, 150), (120, 202), (120, 205), (120, 113), (140, 114), (140, 214), (140, 224), (150, 143), (150, 213), (200, 142), (200, 104), (200, 145), (210, 124), (210, 144), (210, 115), (210, 145), (102, 203), (102, 204), (102, 103), (102, 135), (112, 204), (122, 213), (122, 223), (132, 225), (202, 133), (202, 114), (212, 103), (222, 104), (103, 204), (103, 114), (113, 104), (113, 105), (113, 125), (113, 204), (133, 114), (133, 144), (143, 213), (143, 223), (203, 135), (213, 124), (213, 145), (104, 105), (104, 204), (104, 215), (114, 115), (114, 204), (134, 224), (144, 145), (144, 214), (204, 105), (204, 125)]

One of the common application is to calculate the shortest path between two nodes:

p1 = nx.shortest_path(G, source=114, target=211)
print('The shortest path between Node 114 and Node 211: ', p1)

Results printed:

The shortest path between Node 114 and Node 211:  [114, 127, 208, 124, 211]

Calculate the pagerank value for each node in the graph to see their influences:

print(nx.pagerank(G))

Results printed:

{109: 0.011507076520104863, 119: 0.007835838669313514, 129: 0.015304593799331218, 139: 0.007772926737873626, 149: 0.0073896601012629825, 209: 0.0065558926178649985, 219: 0.014100908598251508, 229: 0.011454115940170253, 108: 0.01645334474680034, 118: 0.01010598371500564, 128: 0.01594717876199238, 138: 0.01671097227127263, 148: 0.015898676579503977, 208: 0.009437234075904938, 218: 0.0153795416919104, 228: 0.005900393773635255, 107: 0.009745182763645681, 117: 0.008716335675518244, 127: 0.021565565312365507, 137: 0.011642680498867146, 147: 0.009721031073465738, 207: 0.01040504770909835, 217: 0.012054472529765329, 227: 0.005615576255373405, 106: 0.007371191843767635, 116: 0.020955704443679106, 126: 0.007589432032220849, 136: 0.015987209357117116, 146: 0.013922108926721374, 206: 0.008554794629575304, 216: 0.011219193251536395, 226: 0.013613173390725904, 101: 0.016680863106330837, 111: 0.010121524312495604, 121: 0.017545503989576015, 131: 0.008531567756846938, 141: 0.014598319866130227, 201: 0.0058643663430632525, 211: 0.003936285336338021, 221: 0.009587911774927793, 100: 0.02243017302167168, 110: 0.007928429795381916, 120: 0.011875669801396205, 130: 0.0073896601012629825, 140: 0.01205992633948699, 150: 0.010045605782606326, 200: 0.015289870550944322, 210: 0.017716629501785937, 220: 0.008666577509181518, 102: 0.014865431161046641, 112: 0.007931095811770324, 122: 0.008087439927630492, 132: 0.004659566123187912, 142: 0.006487446038191551, 202: 0.013579313206377282, 212: 0.01190888044566142, 222: 0.011376739416933006, 103: 0.013438110749144392, 113: 0.02458154500563397, 123: 0.01104978432213578, 133: 0.00743370900670294, 143: 0.008011123394996112, 203: 0.006883198710237787, 213: 0.020392557117890422, 223: 0.012345866520333572, 104: 0.024902235588979776, 114: 0.019369722463816744, 124: 0.017165705442951484, 134: 0.008284361176173354, 144: 0.019363506469972095, 204: 0.03507634139024834, 214: 0.015500649025348538, 224: 0.008320315540621754, 105: 0.01439975542831122, 115: 0.007592722237637133, 125: 0.010808523955754608, 135: 0.006883198710237788, 145: 0.014654713389044883, 205: 0.014660118545887803, 215: 0.01337467974572934, 225: 0.009909720748343093}

In addition, similar to the first post in the series, you can access to Gephi [5] to make the visualization look better.

Refer [6] for the code in this post.

References

[1] Analyzing Relationships in Game of Thrones With NetworkX, Gelphi, and NebulaGraph (Part One)

[2] NetworkX

[3] NebulaGraph GitHub Repository

[4] GraphX

[5] Gephi

[6] Code PR