Building a Knowledge Graph with Neo4j and Python: A Hands-On Tutorial
Introduction
In our previous post, we explored the theory behind Knowledge Graphs (KGs). Now, it’s time to get our hands dirty!
In this tutorial, we will build a Knowledge Graph of the African Tech Ecosystem from scratch. We’ll model startups, founders, investors, and their relationships using Neo4j, a widely used graph database, and control it all with Python.
Goal
By the end of this tutorial, you will have a running Neo4j database populated with realistic sample data, and you’ll be able to query it for questions like “Which investors are driving the fintech boom in Lagos?”
Prerequisites
Before we begin, ensure you have:
- Python 3.8+ installed.
- Neo4j Database: You can use Neo4j Desktop (local) or Neo4j AuraDB (free cloud tier).
- Python Libraries:
pip install neo4j pandas
Step 1: Designing the Ontology
A Knowledge Graph starts with a schema (ontology). We want to capture the relationships in the tech ecosystem.
Entities (Nodes):
- Startup
- Founder
- Investor
- Location (City/Country)
- Sector (e.g., Fintech, Healthtech)
Relationships (Edges):
- (:Founder)-[:FOUNDED]->(:Startup)
- (:Investor)-[:INVESTED_IN]->(:Startup)
- (:Startup)-[:LOCATED_IN]->(:Location)
- (:Startup)-[:OPERATES_IN]->(:Sector)
graph TD
F((Founder)) -->|FOUNDED| S((Startup))
I((Investor)) -->|INVESTED_IN| S
S -->|LOCATED_IN| L((Location))
S -->|OPERATES_IN| SEC((Sector))
style F fill:#ff6b6b
style S fill:#ffd93d
style I fill:#6bcf7f
style L fill:#95e1d3
style SEC fill:#a8d8ea
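One way to keep the ontology close to the code is to write it down as plain Python data and check candidate edges against it before loading. This is a minimal sketch, not part of Neo4j or of the loader used later; `SCHEMA` and `is_valid_edge` are hypothetical names chosen for illustration.

```python
from typing import Dict, Tuple

# Allowed relationship types, mapped to (source label, target label).
# Mirrors the ontology above; SCHEMA is an illustrative name only.
SCHEMA: Dict[str, Tuple[str, str]] = {
    "FOUNDED": ("Founder", "Startup"),
    "INVESTED_IN": ("Investor", "Startup"),
    "LOCATED_IN": ("Startup", "Location"),
    "OPERATES_IN": ("Startup", "Sector"),
}


def is_valid_edge(source_label: str, rel_type: str, target_label: str) -> bool:
    """Return True if (source)-[rel_type]->(target) matches the schema."""
    return SCHEMA.get(rel_type) == (source_label, target_label)


print(is_valid_edge("Founder", "FOUNDED", "Startup"))  # True
print(is_valid_edge("Startup", "FOUNDED", "Founder"))  # False: wrong direction
```

Neo4j itself does not enforce which labels a relationship type may connect, so a check like this would have to run in your own loading code.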
Step 2: Connecting to Neo4j
Let’s write a Python class to handle our database connection.
from neo4j import GraphDatabase


class AfricanTechGraph:
    def __init__(self, uri, user, password):
        self.driver = GraphDatabase.driver(uri, auth=(user, password))

    def close(self):
        self.driver.close()

    def run_query(self, query, parameters=None):
        with self.driver.session() as session:
            result = session.run(query, parameters)
            return [record.data() for record in result]


# Usage (update with your credentials)
# uri = "bolt://localhost:7687"
# kg = AfricanTechGraph(uri, "neo4j", "password")
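Rather than hard-coding credentials, you may prefer to read them from environment variables. The sketch below assumes the variable names `NEO4J_URI`, `NEO4J_USER`, and `NEO4J_PASSWORD` and a hypothetical helper `load_neo4j_config`; the defaults simply repeat the values from the usage comment above.

```python
import os
from typing import Dict, Mapping, Optional


def load_neo4j_config(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Read connection settings from environment variables.

    NEO4J_URI, NEO4J_USER and NEO4J_PASSWORD are assumed variable names;
    the fallbacks match the local defaults used in this tutorial.
    """
    env = os.environ if env is None else env
    return {
        "uri": env.get("NEO4J_URI", "bolt://localhost:7687"),
        "user": env.get("NEO4J_USER", "neo4j"),
        "password": env.get("NEO4J_PASSWORD", "password"),
    }


config = load_neo4j_config()
print(config["uri"])
# kg = AfricanTechGraph(config["uri"], config["user"], config["password"])
```

If you use AuraDB, the URI is the connection string shown for your instance; set it in the environment instead of editing the code.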
Step 3: Preparing the Data
We’ll use a list of dictionaries to represent our “raw” data. In a real project, this would come from scraping or CSV files.
startups_data = [
    {
        "name": "Paystack",
        "sector": "Fintech",
        "location": "Lagos",
        "founders": ["Shola Akinlade", "Ezra Olubi"],
        "investors": ["Stripe", "Y Combinator", "Tencent"]
    },
    {
        "name": "Flutterwave",
        "sector": "Fintech",
        "location": "Lagos",
        "founders": ["Olugbenga Agboola", "Iyinoluwa Aboyeji"],
        "investors": ["Tiger Global", "Avenir Growth"]
    },
    {
        "name": "Twiga Foods",
        "sector": "AgriTech",
        "location": "Nairobi",
        "founders": ["Peter Njonjo", "Grant Brooke"],
        "investors": ["Creadev", "TLcom Capital"]
    },
    {
        "name": "M-KOPA",
        "sector": "CleanTech",
        "location": "Nairobi",
        "founders": ["Jesse Moore"],
        "investors": ["Sumitomo Corporation", "Standard Bank"]
    }
]
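If your raw data lives in a CSV file instead, it can be converted into the same list-of-dictionaries shape. The sketch below assumes a hypothetical layout with one row per startup and semicolon-separated founders and investors; `SAMPLE_CSV`, `split_names`, and `rows_from_csv` are illustrative names. It reads from a string so it runs on its own.

```python
import csv
import io
from typing import Dict, List

# Assumed CSV layout: one row per startup, with founders and investors
# stored as semicolon-separated strings. This layout is hypothetical.
SAMPLE_CSV = """name,sector,location,founders,investors
Paystack,Fintech,Lagos,Shola Akinlade;Ezra Olubi,Stripe;Y Combinator;Tencent
M-KOPA,CleanTech,Nairobi,Jesse Moore,Sumitomo Corporation;Standard Bank
"""


def split_names(cell: str) -> List[str]:
    """Split a semicolon-separated cell into a clean list of names."""
    return [name.strip() for name in cell.split(";") if name.strip()]


def rows_from_csv(text: str) -> List[Dict[str, object]]:
    """Convert CSV text into the same structure as startups_data."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {
            "name": row["name"].strip(),
            "sector": row["sector"].strip(),
            "location": row["location"].strip(),
            "founders": split_names(row["founders"]),
            "investors": split_names(row["investors"]),
        }
        for row in reader
    ]


startups_from_csv = rows_from_csv(SAMPLE_CSV)
print(startups_from_csv[0]["founders"])  # ['Shola Akinlade', 'Ezra Olubi']
```

For a file on disk, pass the file's contents to `rows_from_csv`, or adapt the function to take an open file handle.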
Step 4: Loading Data into Neo4j
We need to be careful not to create duplicates. We’ll use the Cypher MERGE clause, which acts like “Create if not exists”.
First, let’s define uniqueness constraints (run this once in your Neo4j Browser or via Python):
CREATE CONSTRAINT FOR (s:Startup) REQUIRE s.name IS UNIQUE;
CREATE CONSTRAINT FOR (f:Founder) REQUIRE f.name IS UNIQUE;
CREATE CONSTRAINT FOR (i:Investor) REQUIRE i.name IS UNIQUE;
CREATE CONSTRAINT FOR (l:Location) REQUIRE l.name IS UNIQUE;
CREATE CONSTRAINT FOR (sec:Sector) REQUIRE sec.name IS UNIQUE;
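If you create the constraints from Python, note that each statement should be sent in its own call rather than as one semicolon-separated string. The sketch below builds the statements and passes them one at a time to any callable. `constraint_statements` and `create_constraints` are hypothetical helpers, and `IF NOT EXISTS` is added on the assumption that your Neo4j version supports it alongside the `REQUIRE` syntax used above, so that re-running is harmless.

```python
from typing import Callable, Iterable, List

LABELS = ["Startup", "Founder", "Investor", "Location", "Sector"]


def constraint_statements(labels: Iterable[str]) -> List[str]:
    """Build one uniqueness-constraint statement per node label."""
    return [
        f"CREATE CONSTRAINT IF NOT EXISTS FOR (n:{label}) REQUIRE n.name IS UNIQUE"
        for label in labels
    ]


def create_constraints(run: Callable[[str], object],
                       labels: Iterable[str] = LABELS) -> int:
    """Send each statement separately through `run`; return how many were sent."""
    statements = constraint_statements(labels)
    for statement in statements:
        run(statement)
    return len(statements)


# Dry run: print the statements instead of sending them to a database.
create_constraints(print)
# With the class from Step 2: create_constraints(kg.run_query)
```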
Now, the loading function:
def load_data(kg, data):
    query = """
    UNWIND $startups AS row

    // Create Nodes
    MERGE (s:Startup {name: row.name})
    MERGE (sec:Sector {name: row.sector})
    MERGE (l:Location {name: row.location})

    // Create Relationships
    MERGE (s)-[:OPERATES_IN]->(sec)
    MERGE (s)-[:LOCATED_IN]->(l)

    // Handle Founders
    FOREACH (founder_name IN row.founders |
        MERGE (f:Founder {name: founder_name})
        MERGE (f)-[:FOUNDED]->(s)
    )

    // Handle Investors
    FOREACH (investor_name IN row.investors |
        MERGE (i:Investor {name: investor_name})
        MERGE (i)-[:INVESTED_IN]->(s)
    )
    """
    kg.run_query(query, parameters={"startups": data})
    print(f"Successfully loaded {len(data)} startups into the graph.")


# Execute loading
# load_data(kg, startups_data)
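The loader sends every startup in a single UNWIND call, which is fine for four rows. For larger inputs, one option is to split the list into batches and call `load_data` once per batch. This is a minimal sketch; `batched` is a hypothetical helper and the default batch size is an arbitrary assumption, not a Neo4j recommendation.

```python
from typing import Dict, Iterator, List


def batched(rows: List[Dict], batch_size: int = 1000) -> Iterator[List[Dict]]:
    """Yield consecutive slices of `rows` with at most `batch_size` items each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


# Placeholder rows stand in for a larger dataset.
rows = [{"name": f"Startup {i}"} for i in range(5)]
for batch in batched(rows, batch_size=2):
    print(len(batch))
    # load_data(kg, batch)  # one UNWIND query per batch
```

Because the query uses MERGE throughout, loading the same batch twice should not create duplicate nodes or relationships.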
Step 5: Analyzing the Graph
Now that our graph is populated, we can ask interesting questions using Cypher.
Query 1: Find all Fintech startups in Lagos
query_fintech_lagos = """
MATCH (s:Startup)-[:OPERATES_IN]->(:Sector {name: 'Fintech'})
MATCH (s)-[:LOCATED_IN]->(:Location {name: 'Lagos'})
RETURN s.name as Startup
"""
# print(kg.run_query(query_fintech_lagos))
# Output: [{'Startup': 'Paystack'}, {'Startup': 'Flutterwave'}]
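The same pattern can be reused for any sector and city by passing the values as query parameters instead of writing them into the Cypher text. A minimal sketch, assuming a hypothetical helper `startups_query`; the parameter names `$sector` and `$city` are our own choice.

```python
from typing import Dict, Tuple

# Same pattern as Query 1, with $sector and $city as query parameters.
STARTUPS_BY_SECTOR_AND_CITY = """
MATCH (s:Startup)-[:OPERATES_IN]->(:Sector {name: $sector})
MATCH (s)-[:LOCATED_IN]->(:Location {name: $city})
RETURN s.name AS Startup
"""


def startups_query(sector: str, city: str) -> Tuple[str, Dict[str, str]]:
    """Return the query text and its parameter dict, ready for run_query."""
    return STARTUPS_BY_SECTOR_AND_CITY, {"sector": sector, "city": city}


query, params = startups_query("AgriTech", "Nairobi")
print(params)
# print(kg.run_query(query, params))
```

Parameters also avoid building Cypher by string concatenation, which matters once the values come from user input.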
Query 2: Who are the “Super Investors”?
Find investors who have invested in more than one startup in our dataset.
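query_super_investors = """
MATCH (i:Investor)-[:INVESTED_IN]->(s:Startup)
WITH i, count(s) AS portfolio_size
WHERE portfolio_size > 1
RETURN i.name, portfolio_size
ORDER BY portfolio_size DESC
"""
# print(kg.run_query(query_super_investors))
# Note: in the Step 3 sample data every investor backs exactly one startup,
# so this returns [] until you add startups with overlapping investors.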
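To sanity-check what this query should return, the same aggregation can be reproduced in plain Python on the raw data. The sketch below uses a trimmed copy of the Step 3 data and a hypothetical helper `super_investors`; it mirrors the Cypher logic (count startups per investor, keep those above a threshold) rather than calling Neo4j.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Trimmed copy of the Step 3 data: only the fields this check needs.
portfolio_data = [
    {"name": "Paystack", "investors": ["Stripe", "Y Combinator", "Tencent"]},
    {"name": "Flutterwave", "investors": ["Tiger Global", "Avenir Growth"]},
    {"name": "Twiga Foods", "investors": ["Creadev", "TLcom Capital"]},
    {"name": "M-KOPA", "investors": ["Sumitomo Corporation", "Standard Bank"]},
]


def super_investors(startups: List[Dict], min_size: int = 2) -> List[Tuple[str, int]]:
    """Count startups per investor and keep those with at least `min_size`."""
    counts = Counter(
        investor
        for startup in startups
        for investor in set(startup["investors"])  # count each startup once
    )
    # most_common() sorts by count, largest first, like ORDER BY ... DESC.
    return [(name, size) for name, size in counts.most_common() if size >= min_size]


print(super_investors(portfolio_data))  # [] : every investor appears once
```

An empty result here is expected for the sample data; the query becomes interesting once the same investor appears under more than one startup.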
Query 3: Shortest Path
How is Stripe connected to Flutterwave?
query_path = """
MATCH path = shortestPath(
    (i:Investor {name: 'Stripe'})-[*]-(s:Startup {name: 'Flutterwave'})
)
RETURN [n IN nodes(path) | n.name] AS connection_path
"""
# print(kg.run_query(query_path))
# Possible output: ['Stripe', 'Paystack', 'Lagos', 'Flutterwave']
# (They are connected via the shared Lagos location. An equally short path
# runs through the shared 'Fintech' sector, so either one may be returned.)
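To see why the path has three hops, here is a breadth-first search over a small in-memory copy of the relevant edges. This is only an illustration of the shortest-path idea on an undirected graph, not how Neo4j implements `shortestPath`; the edge list and the `shortest_path` helper are assumptions made for this sketch.

```python
from collections import deque
from typing import Dict, List, Optional, Set

# Undirected edges for the part of the graph around Paystack and Flutterwave.
edges = [
    ("Stripe", "Paystack"),
    ("Paystack", "Lagos"),
    ("Paystack", "Fintech"),
    ("Flutterwave", "Lagos"),
    ("Flutterwave", "Fintech"),
    ("Tiger Global", "Flutterwave"),
]

neighbours: Dict[str, Set[str]] = {}
for a, b in edges:
    neighbours.setdefault(a, set()).add(b)
    neighbours.setdefault(b, set()).add(a)


def shortest_path(start: str, goal: str) -> Optional[List[str]]:
    """Breadth-first search; returns one shortest path or None if unreachable."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        # sorted() only makes the tie-break between equal-length paths stable.
        for node in sorted(neighbours.get(path[-1], ())):
            if node not in seen:
                seen.add(node)
                queue.append(path + [node])
    return None


path = shortest_path("Stripe", "Flutterwave")
print(len(path) - 1)  # 3 hops
```

Both the Lagos route and the Fintech route have three hops, which is why the Cypher output above is labelled as only one possible result.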
Step 6: Visualizing the Graph
While Python is great for logic, graphs are visual.
- Neo4j Browser: Run MATCH (n) RETURN n to see your full graph interactively.
- Neovis.js: For embedding graphs in web apps.
- yFiles for Jupyter: For visualization directly in notebooks.
Here is a conceptual visualization of what we built:
graph LR
subgraph Lagos Ecosystem
P[Paystack] --- L[Lagos]
F[Flutterwave] --- L
end
subgraph Investors
YC[Y Combinator] --> P
T[Tiger Global] --> F
end
subgraph Founders
SA[Shola Akinlade] --> P
OA[Olugbenga Agboola] --> F
end
style P fill:#ffd93d
style F fill:#ffd93d
style L fill:#95e1d3
Conclusion
We’ve successfully built a functional Knowledge Graph of the African Tech Ecosystem! We moved from raw dictionaries to a structured, queryable graph database.
What’s Next? In the next post, we will explore Graph Algorithms. We’ll use algorithms like PageRank to find the most influential nodes and Louvain Modularity to detect communities within our graph.
Challenge
Try adding a new entity type, University, and a relationship (:Founder)-[:STUDIED_AT]->(:University). Can you find which universities produce the most founders?
Graph the future!
