
Introducing a metadata management system that uses Nebula Graph as the graph database

Wey Gu
June 14, 2022
Do I have to create my own graph model and everything to set up a metadata management and data lineage system? Thanks to many great open-source projects, the answer is: No!  

Today, I would like to share my opinionated reference data infrastructure stack, built from some of the best open-source projects, covering modern ETL, dashboards, metadata governance, and data lineage management.

Metadata Governance system

A metadata governance system provides a single view of where and how all the data is formatted, generated, transformed, consumed, presented, and owned.

Metadata governance is like a catalog of all the data warehouses, databases, tables, dashboards, ETL jobs, etc. With a metadata governance system in place, people don't have to ask redundant questions like "Hi everyone, could I change the schema of this table?" or "Hey, does anyone know how I can find the raw data of table-view-foo-bar?" This is why we need a metadata governance system in a mature data stack that deals with a relatively large scale of data and a large team (or one that is growing).

The other term, data lineage, is one specific type of metadata and should be managed whenever possible to ensure a chain of trust in a data-driven team. Data lineage reveals the life cycle of data: it aims to show the complete data flow, from start to end.

The reference solution

Motivation

Both metadata and data lineage naturally fit the graph model. Relationship-oriented queries in data lineage, for instance "find all n-depth data lineage of a given component (i.e. a table)", are essentially FIND ALL PATH queries in a graph database.
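For instance, in Nebula Graph's query language nGQL, such a lookup could be sketched roughly like this (a sketch only: the table ID and the HAS_DOWNSTREAM edge type here are illustrative, and the actual graph model used by Amundsen shows up later in this post):

# 1..3-hop downstream lineage of a given table vertex (illustrative schema)
MATCH p=(t:Table)-[:HAS_DOWNSTREAM*1..3]->(x)
  WHERE id(t) == "warehouse.public.orders"
RETURN p;

# or, as an explicit path query between two known tables
FIND ALL PATH FROM "warehouse.public.orders" TO "warehouse.public.orders_agg" OVER HAS_DOWNSTREAM YIELD path AS p;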

This also explains an observation of mine as an open-source software (OSS) contributor to Nebula Graph, a distributed graph database: judging from the queries and graph modeling they discuss, a lot of teams that already leverage Nebula Graph in their tech stacks are setting up data lineage systems on their own, from scratch.

A metadata governance system needs some of the following components:

  • Metadata extractor: The extractor is how metadata is pulled from, or pushed by, the different parts of the data stack, such as databases, data warehouses, dashboards, ETL pipelines, and applications.
  • Metadata storage: The storage part could be either a database or even large JSON manifest files.
  • Metadata catalog: This could be a system providing API and/or a GUI interface to read/write the metadata and data lineage.

In the Nebula Graph community, I have seen many graph database users building in-house data lineage systems. It is a pity to see this work not being standardized or jointly contributed, because most of it boils down to parsing metadata from well-known big-data projects and storing it in a graph database.

So I came to create this opinionated reference data infrastructure stack with some of the best open-source projects put together. Hopefully, those who were going to define and iterate their own graph modeling on Nebula Graph and create in-house metadata and data lineage extraction pipelines can benefit from this project and get a relatively polished, beautifully designed metadata governance system out of the box, with a fully evolved graph model.

To make the reference project self-contained and runnable, I tried to include more layers of the data infrastructure stack than just the purely metadata-related ones. Hopefully, it will also help new data engineers who would like to try things out and see how far open source has pushed the modern data lab.

This is a diagram of all the components in this reference data stack, where I see most of them as metadata sources:

diagram-of-ref-project

The data stack

First, let's introduce the components.

The database and data warehouse

For processing and consuming raw and intermediate data, we need one or more databases and/or warehouses.

It can be any DB/DW including Hive, Apache Delta, TiDB, Cassandra, MySQL, or Postgres. In this project, we simply choose one of the most popular ones: PostgreSQL. And our reference lab comes with the first service:

✅ - Data warehouse: PostgreSQL

DataOps

We should have some sort of DataOps setup to enable pipelines and environments to be repeatable, testable, and version-controlled.

Here, we used Meltano created by GitLab.

Meltano is a DataOps platform that just works, connecting Singer as the Extract and Load (EL) layer and dbt as the Transform (T) layer in an elegant way. It also connects to other data infrastructure utilities such as Apache Superset and Apache Airflow.

Now, we have our DataOps ready.

✅ - GitOps: Meltano

ETL

And under the hood, we will E(extract) and L(load) data from many different data sources to data targets by leveraging Singer together with Meltano, and do the T(transformation) with dbt.

✅ - EL: Singer

✅ - T: dbt

Data visualization

How about creating dashboards, charts, and tables to get insights from all the data?

Apache Superset is one of the greatest visualization platforms we can choose from. Now let's just add it to our stack!

✅ - Dashboard: Apache Superset

Job orchestration

In most cases, our DataOps jobs grow to a scale where they run long enough, and depend on each other enough, that they need to be orchestrated, and here comes Apache Airflow.

✅ - DAG: Apache Airflow

Metadata governance

With more components and data being introduced into a data infrastructure, there will be a massive amount of metadata across the lifecycles of databases, tables, schemas, dashboards, DAGs, and applications, together with their administrators and owning teams. All of it needs to be collectively managed, connected, and discovered.

Linux Foundation Amundsen is one of the best projects solving this problem.

✅ - Data Discovery: Linux Foundation Amundsen

With a graph database as the source of truth to accelerate multi-hop queries, together with Elasticsearch as the full-text search engine, Amundsen indexes all the metadata and its lineage smoothly and beautifully.

By default, Neo4j is used as the graph database. However, I will be using Nebula Graph instead in this project because I am more familiar with it.

✅ - Full-text Search: Elasticsearch

✅ - Graph Database: Nebula Graph

Now, with the components in our stack being revealed, let's have them assembled.

Environment bootstrap, component overview

The reference runnable project is open-source and you can find it here:

  • https://github.com/wey-gu/data-lineage-ref-solution

I will try my best to make things clean and isolated. It's assumed that you are running on a UNIX-like system with an internet connection and Docker Compose installed.

 Please refer to here to install Docker and Docker Compose before moving forward.

I am running it on Ubuntu 20.04 LTS X86_64, but there shouldn't be any issues with other distros or versions of Linux.

Run a data warehouse/ database

First, let's install Postgres as our data warehouse.

This one-liner creates a Postgres instance running in the background with Docker, which will be cleaned up when stopped (--rm).

class="bash language-bash"docker run --rm --name postgres \
   -e POSTGRES_PASSWORD=lineage_ref \
   -e POSTGRES_USER=lineage_ref \
   -e POSTGRES_DB=warehouse -d \
   -p 5432:5432 postgres

Then we could verify it with Postgres CLI or GUI clients.

Hint: You could use the VS Code extension SQLTools to quickly connect to multiple RDBMS (MariaDB, Postgres, etc.) or even non-SQL DBMS like Cassandra in a GUI fashion.
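For a quick command-line sanity check, something like this should work (a sketch; it assumes a local psql client, otherwise the copy bundled in the container can be used):

# list databases using the credentials set above
PGPASSWORD=lineage_ref psql -h localhost -p 5432 -U lineage_ref -d warehouse -c '\l'

# or, without a local client, use psql inside the container
docker exec -it postgres psql -U lineage_ref -d warehouse -c '\l'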

Setup DataOps toolchain for ETL

Then, let's get Meltano with Singer and dbt installed.

Meltano helps us manage ETL utilities (as plugins) and all of their configurations (the pipelines). This meta-information sits in Meltano's configuration files and its system database: the configurations are file-based (and can be managed with Git), and by default, the system database is SQLite.
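For example, if you would rather not use SQLite for the system database, Meltano can be pointed at another database through its database_uri setting; a minimal sketch, assuming the MELTANO_DATABASE_URI environment variable and reusing the Postgres instance from above:

# optional: keep Meltano's system database in Postgres instead of the default SQLite
export MELTANO_DATABASE_URI="postgresql://lineage_ref:lineage_ref@localhost:5432/warehouse"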

Install Meltano

The workflow with Meltano is to initialize a Meltano project and then add E, L, and T to the configuration files. Initializing a project only requires a CLI call: meltano init yourprojectname. To do that, we can install Meltano either with Python's package manager pip or via a Docker image:

  • Install Meltano with pip in a python virtual env:
class="bash language-bash"mkdir .venv
# example in a debian flavor Linux distro
sudo apt-get install python3-dev python3-pip python3-venv python3-wheel -y
python3 -m venv .venv/meltano
source .venv/meltano/bin/activate
python3 -m pip install wheel
python3 -m pip install meltano

# init a project
mkdir meltano_projects && cd meltano_projects
# replace <yourprojectname> with your own one
touch .env
meltano init <yourprojectname>
  • Install Meltano via Docker
class="bash language-bash"docker pull meltano/meltano:latest
docker run --rm meltano/meltano --version

# init a project
mkdir meltano_projects && cd meltano_projects

# replace <yourprojectname> with your own one
touch .env
docker run --rm -v "$(pwd)":/projects \
            -w /projects --env-file .env \
            meltano/meltano init <yourprojectname>

Apart from meltano init, there are a couple of other commands, like meltano elt to perform EL(T) runs and meltano invoke <plugin> to call a plugin's commands; always check the cheat sheet for a quick reference.
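For instance, wiring up plugins and running them typically looks like this (the tap and target here are just placeholders for whichever plugins your project needs):

# add an extractor and a loader as Meltano plugins
meltano add extractor tap-csv
meltano add loader target-postgres

# run an EL pipeline with those plugins
meltano run tap-csv target-postgres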

The Meltano UI

Meltano also comes with a web-based UI. To start it, just run:

class="bash language-bash"meltano ui

It listens on http://localhost:5000.

For Docker, just run the container with port 5000 exposed. We don't append ui at the end here because the container's default command is already meltano ui.

class="bash language-bash"docker run -v "$(pwd)":/project \
            -w /project \
            -p 5000:5000 \
            meltano/meltano
Example Meltano projects

While writing this article, I noticed that Pat Nadolny had created great examples on an example dataset for Meltano with dbt (and with Airflow and Superset, too!). We will not recreate the examples; instead, we will use Pat's great ones.

 Note that Andrew Stewart had created another one with a slightly older version of configuration files.

You could follow here to run a pipeline of:

  • tap-csv (Singer), extracting data from CSV files
  • target-postgres (Singer), loading data into Postgres
  • dbt, transforming the data into aggregated tables or views
 You should omit the step of running the local Postgres with docker as we had already created one. Make sure to change the Postgres user and password in your .env file.  

And it basically goes like this (with Meltano installed as above):

class="bash language-bash"git clone https://github.com/pnadolny13/meltano_example_implementations.git
cd meltano_example_implementations/meltano_projects/singer_dbt_jaffle/
meltano install
touch .env
echo PG_PASSWORD="lineage_ref" >> .env
echo PG_USERNAME="lineage_ref" >> .env
# Extract and Load(with Singer)
meltano run tap-csv target-postgres

# Transform (with dbt)
meltano run dbt:run
# Generate dbt docs
meltano invoke dbt docs generate
# Serve generated dbt docs
meltano invoke dbt docs serve
# Then visit http://localhost:8080

Now, assuming you have finished trying out singer_dbt_jaffle following its README.md, we can connect to the Postgres instance to see the loaded and transformed data reflected as follows; the screenshot is from the SQLTools extension of VS Code:
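If you prefer a CLI check to the GUI, a rough equivalent would be (a sketch; the exact schema where the customers/orders models land depends on the dbt profile of the example project):

# list the loaded/transformed tables and peek into one of the models
PGPASSWORD=lineage_ref psql -h localhost -U lineage_ref -d warehouse -c '\dt'
PGPASSWORD=lineage_ref psql -h localhost -U lineage_ref -d warehouse -c 'SELECT * FROM customers LIMIT 5;'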

Setup a BI Platform for Dashboard

Now, we have the data in data warehouses, with ETL toolchains to pipe different data sources into it. How could those data be consumed?

BI tools like the dashboard could be one way to help us get insights from the data.

With Apache Superset, dashboards and charts based on those data sources can be created and managed smoothly and beautifully.

The focus of this project is not on Apache Superset itself; thus, we simply reuse the examples that Pat Nadolny created with Superset as a Meltano utility.

Bootstrap Meltano and Superset

Create a python venv with Meltano installed:

class="bash language-bash"mkdir .venv
python3 -m venv .venv/meltano
source .venv/meltano/bin/activate
python3 -m pip install wheel
python3 -m pip install meltano

Following Pat's guide, with tiny modifications:

  • Clone the repo, enter the jaffle_superset project
class="bash language-bash"git clone https://github.com/pnadolny13/meltano_example_implementations.git
cd meltano_example_implementations/meltano_projects/jaffle_superset/
  • Modify the meltano configuration files to let Superset connect to the Postgres we created:
class="bash language-bash"vim meltano_projects/jaffle_superset/meltano.yml

In my example, I changed the hostname from host.docker.internal to the IP address of my current host, as shown in the diff below. However, if you are running it on a macOS machine, it should be fine to leave it as it is. The diff before and after the change would be:

class="diff language-diff"--- a/meltano_projects/jaffle_superset/meltano.yml
+++ b/meltano_projects/jaffle_superset/meltano.yml
@@ -71,7 +71,7 @@ plugins:
              A list of database driver dependencies can be found here https://superset.apache.org/docs/databases/installing-database-drivers
    config:
      database_name: my_postgres
-      sqlalchemy_uri: postgresql+psycopg2://${PG_USERNAME}:${PG_PASSWORD}@host.docker.internal:${PG_PORT}/${PG_DATABASE}
+      sqlalchemy_uri: postgresql+psycopg2://${PG_USERNAME}:${PG_PASSWORD}@10.1.1.168:${PG_PORT}/${PG_DATABASE}
      tables:
      - model.my_meltano_project.customers
      - model.my_meltano_project.orders
  • Add Postgres credential to .env file:
class="bash language-bash"echo PG_USERNAME=lineage_ref >> .env
echo PG_PASSWORD=lineage_ref >> .env
  • Install the Meltano project and run the ETL pipeline:
meltano install
meltano run tap-csv target-postgres dbt:run
  • Start Superset. Please note that ui here is not a Meltano command but a user-defined action in the configuration file.
meltano invoke superset:ui
  • In another terminal, run the defined command load_datasources
meltano invoke superset:load_datasources
  • Access Superset in a web browser via http://localhost:8088/

We should now see Superset Web Interface:

Create a Dashboard!

Let's try to create a Dashboard on the ETL data in Postgres defined in this Meltano project:

  • Click + DASHBOARD, fill in a dashboard name, click SAVE, then click + CREATE A NEW CHART
  • In the new chart view, we should select a chart type and DATASET. Here, I selected the orders table as the data source and Pie Chart as the chart type:
  • After clicking CREATE NEW CHART, we are in the chart definition view, where I selected status as DIMENSIONS and COUNT(amount) as METRIC. Thus, we get a pie chart of the order status distribution.
  • Click SAVE; it will ask which dashboard this chart should be added to. After that's selected, click SAVE & GO TO DASHBOARD.
  • Then, in the dashboard, we could see all charts there. You could see that I added another chart showing customer order count distribution, too:
  • We could set the refresh interval, or download the dashboard as you wish by clicking the ··· button.

Quite cool, eh? For now, we have a simple but typical data stack, like any hobby data lab, with everything open source!

Imagine we have 100 datasets in CSV, 200 tables in a data warehouse, and a couple of data engineers running different projects that consume and generate data across different applications, dashboards, and databases. When someone wants to discover some of those tables, datasets, dashboards, and the pipelines running across them, and then even modify some of them, it proves to be quite costly in both communication and engineering.

Here comes the main part of our reference project: Metadata Discovery.

Metadata discovery

Now we are stepping into deploying Amundsen with Nebula Graph and Elasticsearch.

 Note: For the time being, the PR adding Nebula Graph as an Amundsen backend is not yet merged; I am working with the Amundsen team to make it happen.

With Amundsen, we can have all the metadata of the whole data stack discovered and managed in one place. Amundsen has mainly two parts: the Data builder for metadata ingestion, and the catalog services (Metadata, Search, and Frontend) that serve it.

We will be leveraging the Data builder to pull metadata from different sources and persist it into the backend storage of the Metadata service and the backend storage of the Search service; then we can search, discover, and manage the metadata from the Frontend service or through the API of the Metadata service.

Deploy Amundsen
Metadata service

We are going to deploy a cluster of Amundsen with its docker-compose file. Since the Nebula Graph backend support is not yet merged, we are referring to my fork.

First, let's clone the repo with all sub-modules:

class="bash language-bash"git clone -b amundsen_nebula_graph --recursive git@github.com:wey-gu/amundsen.git
cd amundsen

Then, start all catalog services and their backend storage:

class="bash language-bash"docker-compose -f docker-amundsen-nebula.yml up
 You could add -d to put the containers running in the background:
class="bash language-bash"docker-compose -f docker-amundsen-nebula.yml up -d

And this will stop the cluster:

class="bash language-bash"docker-compose -f docker-amundsen-nebula.yml stop

This will remove the cluster:

class="bash language-bash"docker-compose -f docker-amundsen-nebula.yml down

Since this docker-compose file is meant for developers to play with and hack Amundsen easily, rather than for production deployment, it builds images from the codebase, which will take some time the first time.

After it's deployed, please hold on a second before we load some dummy data into its storage with the Data builder.

Data builder

The Amundsen Data builder is just like Meltano, but for the ETL of metadata into the backend storage of the Metadata service and the Search service: Nebula Graph and Elasticsearch. The Data builder is only a Python module, and the ETL job can either run as a script or be orchestrated with a DAG platform like Apache Airflow.

First, install the Amundsen Data builder:

class="bash language-bash"cd databuilder
python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install wheel
python3 -m pip install -r requirements.txt
python3 setup.py install

Let's call this sample Data builder ETL script to have some dummy data filled in.

class="bash language-bash"python3 example/scripts/sample_data_loader_nebula.py
Verify Amundsen

Before accessing Amundsen, we need to create a test user:

class="bash language-bash"# run a container with curl attached to amundsenfrontend
docker run -it --rm --net container:amundsenfrontend nicolaka/netshoot

# Create a user with id test_user_id
curl -X PUT -v http://amundsenmetadata:5002/user \
   -H "Content-Type: application/json" \
   --data \
   '{"user_id":"test_user_id","first_name":"test","last_name":"user", "email":"test_user_id@mail.com"}'

exit

Then we can visit the UI at http://localhost:5000 and try searching for test; it should return some results.

You can then click around and explore the dummy metadata loaded into Amundsen by sample_data_loader_nebula.py on your own.

Additionally, you can access the graph database with Nebula Studio (http://localhost:7001).

 Note: in Nebula Studio, the default fields to log in will be:

  • Hosts: graphd:9669
  • User: root
  • Password: nebula

This diagram shows some more details on the components of Amundsen:

class="asciiarmor language-asciiarmor"       ┌────────────────────────┐ ┌────────────────────────────────────────┐
      │ Frontend:5000          │ │ Metadata Sources                       │
      ├────────────────────────┤ │ ┌────────┐ ┌─────────┐ ┌─────────────┐ │
      │ Metaservice:5001       │ │ │        │ │         │ │             │ │
      │ ┌──────────────┐       │ │ │ Foo DB │ │ Bar App │ │ X Dashboard │ │
 ┌────┼─┤ Nebula Proxy │       │ │ │        │ │         │ │             │ │
 │    │ └──────────────┘       │ │ │        │ │         │ │             │ │
 │    ├────────────────────────┤ │ └────────┘ └─────┬───┘ └─────────────┘ │
┌─┼────┤ Search service:5002    │ │                  │                     │
│ │    └────────────────────────┘ └──────────────────┼─────────────────────┘
│ │    ┌─────────────────────────────────────────────┼───────────────────────┐
│ │    │                                             │                       │
│ │    │ Databuilder     ┌───────────────────────────┘                       │
│ │    │                 │                                                   │
│ │    │ ┌───────────────▼────────────────┐ ┌──────────────────────────────┐ │
│ │ ┌──┼─► Extractor of Sources           ├─► nebula_search_data_extractor │ │
│ │ │  │ └───────────────┬────────────────┘ └──────────────┬───────────────┘ │
│ │ │  │ ┌───────────────▼────────────────┐ ┌──────────────▼───────────────┐ │
│ │ │  │ │ Loader filesystem_csv_nebula   │ │ Loader Elastic FS loader     │ │
│ │ │  │ └───────────────┬────────────────┘ └──────────────┬───────────────┘ │
│ │ │  │ ┌───────────────▼────────────────┐ ┌──────────────▼───────────────┐ │
│ │ │  │ │ Publisher nebula_csv_publisher │ │ Publisher Elasticsearch      │ │
│ │ │  │ └───────────────┬────────────────┘ └──────────────┬───────────────┘ │
│ │ │  └─────────────────┼─────────────────────────────────┼─────────────────┘
│ │ └────────────────┐   │                                 │
│ │    ┌─────────────┼───►─────────────────────────┐ ┌─────▼─────┐
│ │    │ Nebula Graph│   │                         │ │           │
│ └────┼─────┬───────┴───┼───────────┐     ┌─────┐ │ │           │
│      │     │           │           │     │MetaD│ │ │           │
│      │ ┌───▼──┐    ┌───▼──┐    ┌───▼──┐  └─────┘ │ │           │
│ ┌────┼─►GraphD│    │GraphD│    │GraphD│          │ │           │
│ │    │ └──────┘    └──────┘    └──────┘  ┌─────┐ │ │           │
│ │    │ :9669                             │MetaD│ │ │  Elastic  │
│ │    │ ┌────────┐ ┌────────┐ ┌────────┐  └─────┘ │ │  Search   │
│ │    │ │        │ │        │ │        │          │ │  Cluster  │
│ │    │ │StorageD│ │StorageD│ │StorageD│  ┌─────┐ │ │  :9200    │
│ │    │ │        │ │        │ │        │  │MetaD│ │ │           │
│ │    │ └────────┘ └────────┘ └────────┘  └─────┘ │ │           │
│ │    ├───────────────────────────────────────────┤ │           │
│ └────┤ Nebula Studio:7001                        │ │           │
│      └───────────────────────────────────────────┘ └─────▲─────┘
└──────────────────────────────────────────────────────────┘

Connecting the dots, Metadata Discovery

With the basic environment being set up, let's put everything together.

Remember we had ELT'ed some data into PostgreSQL, as shown above?

How could we let Amundsen discover metadata regarding those data and ETL?

Extracting Postgres metadata

We start with the data source first: Postgres.

We install the Postgres Client for python3:

class="bash language-bash"sudo apt-get install libpq-dev
pip3 install Psycopg2
Execution of Postgres metadata ETL

Run a script to parse Postgres Metadata:

class="bash language-bash"export CREDENTIALS_POSTGRES_USER=lineage_ref
export CREDENTIALS_POSTGRES_PASSWORD=lineage_ref
export CREDENTIALS_POSTGRES_DATABASE=warehouse

python3 example/scripts/sample_postgres_loader_nebula.py

If you look into the code of the sample script for loading Postgres metadata to Nebula, the main lines are quite straightforward:

class="python language-python"# part 1: PostgressMetadata --> CSV --> Nebula Graph
job = DefaultJob(
     conf=job_config,
     task=DefaultTask(
         extractor=PostgresMetadataExtractor(),
         loader=FsNebulaCSVLoader()),
     publisher=NebulaCsvPublisher())

...
# part 2: Metadata stored in NebulaGraph --> Elasticsearch
extractor = NebulaSearchDataExtractor()
task = SearchMetadatatoElasticasearchTask(extractor=extractor)

job = DefaultJob(conf=job_config, task=task)

The first job loads data along the path: PostgresMetadata --> CSV --> Nebula Graph

  • PostgresMetadataExtractor was used to extract/pull metadata from Postgres; refer here for its documentation.
  • FsNebulaCSVLoader was used to stage the extracted data as intermediate CSV files
  • NebulaCsvPublisher was used to publish the metadata to Nebula Graph in the form of CSV

The second job loads along the path: metadata stored in NebulaGraph --> Elasticsearch

  • NebulaSearchDataExtractor was used to fetch metadata stored in Nebula Graph
  • SearchMetadatatoElasticasearchTask was used to make metadata indexed with Elasticsearch.
Note: in production, we could trigger those jobs either from scripts or with an orchestration platform like Apache Airflow, as sketched below.
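As a rough illustration, wrapping the script above in an Airflow DAG could look like the following sketch (the DAG id, schedule, and repository path are hypothetical, not part of the reference repo):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="amundsen_postgres_metadata",  # hypothetical DAG id
    start_date=datetime(2022, 6, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # reuse the same Data builder script we just ran by hand
    extract_postgres_metadata = BashOperator(
        task_id="extract_postgres_metadata",
        bash_command=(
            "cd /path/to/amundsen/databuilder && "  # hypothetical path
            "python3 example/scripts/sample_postgres_loader_nebula.py"
        ),
        env={
            "CREDENTIALS_POSTGRES_USER": "lineage_ref",
            "CREDENTIALS_POSTGRES_PASSWORD": "lineage_ref",
            "CREDENTIALS_POSTGRES_DATABASE": "warehouse",
        },
    )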
Verify the Postgres Extraction

Search for payments or directly visit http://localhost:5000/table_detail/warehouse/postgres/public/payments, and you will see the metadata from our Postgres, like this:

Then, metadata management actions like adding tags, owners, and descriptions can be done easily, as shown in the above screen capture, too.

Extracting dbt metadata

Actually, we could also pull metadata from dbt itself.

The Amundsen DbtExtractor will parse the catalog.json or manifest.json file to load metadata into the Amundsen storage (Nebula Graph and Elasticsearch).

In the Meltano chapter above, we already generated those files with meltano invoke dbt docs generate, and output like the following tells us where the catalog.json file lives:

class="log language-log"14:23:15  Done.
14:23:15  Building catalog
14:23:15  Catalog written to /home/ubuntu/ref-data-lineage/meltano_example_implementations/meltano_projects/singer_dbt_jaffle/.meltano/transformers/dbt/target/catalog.json
Execution of dbt metadata ETL

There is an example script that comes with sample dbt output files:

The sample dbt files:

class="bash language-bash"$ ls -l example/sample_data/dbt/
total 184
-rw-rw-r-- 1 w w   5320 May 15 07:17 catalog.json
-rw-rw-r-- 1 w w 177163 May 15 07:17 manifest.json

We could load this sample dbt manifest with:

class="bash language-bash"python3 example/scripts/sample_dbt_loader_nebula.py

From these lines of Python code, we can tell the process is:

class="python language-python"# part 1: Dbt manifest --> CSV --> Nebula Graph
job = DefaultJob(
     conf=job_config,
     task=DefaultTask(
         extractor=DbtExtractor(),
         loader=FsNebulaCSVLoader()),
     publisher=NebulaCsvPublisher())

...
# part 2: Metadata stored in NebulaGraph --> Elasticsearch
extractor = NebulaSearchDataExtractor()
task = SearchMetadatatoElasticasearchTask(extractor=extractor)

job = DefaultJob(conf=job_config, task=task)

The only difference from the Postgres metadata ETL is extractor=DbtExtractor(), which comes with the following configurations to obtain the information below regarding dbt projects:

  • database_name
  • catalog_json
  • manifest_json
class="python language-python"job_config = ConfigFactory.from_dict({
 'extractor.dbt.database_name': database_name,
 'extractor.dbt.catalog_json': catalog_file_loc,  # File
 'extractor.dbt.manifest_json': json.dumps(manifest_data),  # JSON Dumped objecy
 'extractor.dbt.source_url': source_url})
Verify the dbt Extraction

Search dbt_demo or visit http://localhost:5000/table_detail/dbt_demo/snowflake/public/raw_inventory_value to see:

 Tips: we could optionally enable debug logging to see what has been sent to Elasticsearch and Nebula Graph!
- logging.basicConfig(level=logging.INFO)
+ logging.basicConfig(level=logging.DEBUG)

Or, alternatively, explore the imported data in Nebula Studio:

First, click "Start with Vertices", fill in the vertex id: snowflake://dbt_demo.public/fact_warehouse_inventory

Then, we can see the vertex shown as a pink dot. Let's modify the Expand options with:

  • Direction: Bidirect
  • Steps: Single with 3

And double-click the vertex (dot); it will expand 3 steps in both directions:

From this graph view, insights from the metadata are extremely easy to explore, right?

 Tip: you may want to click the 👁 icon to select which properties to show, which is what I did before capturing the screenshot above.

And what we see in Nebula Studio echoes the data model of the Amundsen metadata service, too:

Finally, remember that we leveraged dbt to transform some data in Meltano, and that the generated catalog file lives at .meltano/transformers/dbt/target/catalog.json; you can try creating a Data builder job to import it.

Extracting Superset metadata

Dashboards, charts, and their relationships with tables can be extracted by the Amundsen Data builder. As we have already set up a Superset dashboard, let's try ingesting its metadata.

Execution of Superset metadata ETL

The sample superset script will fetch data from Superset and load metadata into Nebula Graph and Elasticsearch.

class="python language-python"python3 sample_superset_data_loader_nebula.py

If we set the logging level to DEBUG, we could actually see lines like:

class="python language-python"# fetching metadata from superset
DEBUG:urllib3.connectionpool:http://localhost:8088 "POST /api/v1/security/login HTTP/1.1" 200 280
INFO:databuilder.task.task:Running a task
DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): localhost:8088
DEBUG:urllib3.connectionpool:http://localhost:8088 "GET /api/v1/dashboard?q=(page_size:20,page:0,order_direction:desc) HTTP/1.1" 308 374
DEBUG:urllib3.connectionpool:http://localhost:8088 "GET /api/v1/dashboard/?q=(page_size:20,page:0,order_direction:desc) HTTP/1.1" 200 1058
...

# insert Dashboard

DEBUG:databuilder.publisher.nebula_csv_publisher:Query: INSERT VERTEX `Dashboard` (`dashboard_url`, `name`, published_tag, publisher_last_updated_epoch_ms) VALUES  "superset_dashboard://my_cluster.1/3":("http://localhost:8088/superset/dashboard/3/","my_dashboard","unique_tag",timestamp());
...

# insert a DASHBOARD_WITH_TABLE relationship/edge

INFO:databuilder.publisher.nebula_csv_publisher:Importing data in edge files: ['/tmp/amundsen/dashboard/relationships/Dashboard_Table_DASHBOARD_WITH_TABLE.csv']
DEBUG:databuilder.publisher.nebula_csv_publisher:Query:
INSERT edge `DASHBOARD_WITH_TABLE` (`END_LABEL`, `START_LABEL`, published_tag, publisher_last_updated_epoch_ms) VALUES "superset_dashboard://my_cluster.1/3"->"postgresql+psycopg2://my_cluster.warehouse/orders":("Table","Dashboard","unique_tag", timestamp()), "superset_dashboard://my_cluster.1/3"->"postgresql+psycopg2://my_cluster.warehouse/customers":("Table","Dashboard","unique_tag", timestamp());
Verify the Superset Dashboard Extraction

By searching for it in Amundsen, we can now see the dashboard info. And we can verify it from Nebula Studio, too.

 Note, see also the Dashboard's model in Amundsen from the dashboard ingestion guide:  
dashboard_graph_modeling

Preview data with Superset

Superset can be used to preview table data like this. The corresponding documentation can be referred to here; the /superset/sql_json/ API will be called by the Amundsen frontend.

Enable Data lineage

By default, data lineage is not enabled. We can enable it as follows:

  1. Go to the Amundsen repo; that's also where we ran the docker-compose -f docker-amundsen-nebula.yml up command
cd amundsen
  2. Modify the frontend JS configuration:
class="diff language-diff"--- a/frontend/amundsen_application/static/js/config/config-default.ts
+++ b/frontend/amundsen_application/static/js/config/config-default.ts
  tableLineage: {
-    inAppListEnabled: false,
-    inAppPageEnabled: false,
+    inAppListEnabled: true,
+    inAppPageEnabled: true,
    externalEnabled: false,
    iconPath: 'PATH_TO_ICON',
    isBeta: false,
  3. Now let's run the build again so that the frontend Docker image will be rebuilt:
docker-compose -f docker-amundsen-nebula.yml build

Then, rerun the up -d to ensure the frontend container is recreated with a new configuration:

class="bash language-bash"docker-compose -f docker-amundsen-nebula.yml up -d

We could see something like this:

class="bash language-bash"$ docker-compose -f docker-amundsen-nebula.yml up -d
...
Recreating amundsenfrontend           ... done

After that, we can visit http://localhost:5000/lineage/table/gold/hive/test_schema/test_table1 to see the lineage shown as:

We can click Downstream (if there is any) to see the downstream resources of this table:

Or click Lineage to see the graph:

There are APIs for lineage queries, too. Here is an example to query that with cURL, where we leverage the netshoot container as we did before for user creation.

class="bash language-bash"docker run -it --rm --net container:amundsenfrontend nicolaka/netshoot

curl "http://amundsenmetadata:5002/table/snowflake://dbt_demo.public/raw_inventory_value/lineage?depth=3&direction=both"

The above API call queries lineage in both the upstream and downstream directions, with a depth of 3, for the table snowflake://dbt_demo.public/raw_inventory_value.

And the result should be like:

class="json language-json"{
 "depth": 3,
 "downstream_entities": [
   {
     "level": 2,
     "usage": 0,
     "key": "snowflake://dbt_demo.public/fact_daily_expenses",
     "parent": "snowflake://dbt_demo.public/fact_warehouse_inventory",
     "badges": [],
     "source": "snowflake"
   },
   {
     "level": 1,
     "usage": 0,
     "key": "snowflake://dbt_demo.public/fact_warehouse_inventory",
     "parent": "snowflake://dbt_demo.public/raw_inventory_value",
     "badges": [],
     "source": "snowflake"
   }
 ],
 "key": "snowflake://dbt_demo.public/raw_inventory_value",
 "direction": "both",
 "upstream_entities": []
}

In fact, this lineage data was extracted and loaded during our DbtExtractor execution, where extractor.dbt.{DbtExtractor.EXTRACT_LINEAGE} defaults to True; thus, lineage metadata was created and loaded into Amundsen.
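If you want to set that flag explicitly (or switch it off), it goes into the same job config as the other dbt keys; a sketch, reusing the config pattern from the sample script above (paths and values here are placeholders):

import json

from pyhocon import ConfigFactory
from databuilder.extractor.dbt_extractor import DbtExtractor

database_name = 'dbt_demo'                                   # placeholder
catalog_file_loc = 'example/sample_data/dbt/catalog.json'    # placeholder path
with open('example/sample_data/dbt/manifest.json') as f:     # placeholder path
    manifest_data = json.load(f)

job_config = ConfigFactory.from_dict({
    'extractor.dbt.database_name': database_name,
    'extractor.dbt.catalog_json': catalog_file_loc,             # file path
    'extractor.dbt.manifest_json': json.dumps(manifest_data),   # JSON-dumped object
    # lineage extraction defaults to True; set it explicitly here
    f'extractor.dbt.{DbtExtractor.EXTRACT_LINEAGE}': True,
})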

Get lineage in Nebula Graph

Two of the advantages of using a graph database as the metadata storage are:

  • The graph query language itself is a flexible DSL for the lineage API. For example, this query does the equivalent of the Amundsen metadata API call for fetching lineage:
class="cypher language-cypher"MATCH p=(t:Table) -[:HAS_UPSTREAM|:HAS_DOWNSTREAM *1..3]->(x)
WHERE id(t) == "snowflake://dbt_demo.public/raw_inventory_value" RETURN p
  • We can even run such queries in Nebula Graph Studio's console and click View Subgraphs to have the result rendered in a graph view (see the sketch below).
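A console query for that can be sketched as follows (assuming Nebula Graph 3.x GET SUBGRAPH syntax and the same vertex ID as before):

GET SUBGRAPH WITH PROP 3 STEPS FROM "snowflake://dbt_demo.public/raw_inventory_value"
  BOTH HAS_UPSTREAM, HAS_DOWNSTREAM
  YIELD VERTICES AS nodes, EDGES AS relationships;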
Extract Data Lineage
Dbt

As mentioned above, DbtExtractor will extract table level lineage, together with other information defined in the dbt ETL pipeline.

Open Lineage

The other lineage extractor available out of the box in Amundsen is OpenLineageTableLineageExtractor.

Open Lineage is an open framework for collecting lineage data from different sources in one place; it can output lineage information as JSON files to be extracted by OpenLineageTableLineageExtractor:

class="python language-python"dict_config = {
   # ...
   f'extractor.openlineage_tablelineage.{OpenLineageTableLineageExtractor.CLUSTER_NAME}': 'datalab',
   f'extractor.openlineage_tablelineage.{OpenLineageTableLineageExtractor.OL_DATASET_NAMESPACE_OVERRIDE}': 'hive_table',
   f'extractor.openlineage_tablelineage.{OpenLineageTableLineageExtractor.TABLE_LINEAGE_FILE_LOCATION}': 'input_dir/openlineage_nd.json',
}
...

task = DefaultTask(
   extractor=OpenLineageTableLineageExtractor(),
   loader=FsNebulaCSVLoader())

Recap

The whole idea of Metadata Governance/Discovery is to:

  • Put all components in the stack as Metadata Sources (from any DB or DW to dbt, Airflow, Openlineage, Superset, etc.)
  • Run metadata ETL with Databuilder (as a script, or DAG) to store and index with Nebula Graph (or other Graph Database) and Elasticsearch
  • Consume, manage, and discover metadata from Frontend UI (with Superset for preview) or API
  • Have more possibilities, flexibility, and insights on Nebula Graph from queries and the UI

Upstream Projects

All projects used in this reference project are listed below in lexicographic order.

  • Amundsen
  • Apache Airflow
  • Apache Superset
  • dbt
  • Elasticsearch
  • meltano
  • Nebula Graph
  • Open Lineage
  • singer