Build Your First Graph
This tutorial covers creating a knowledge graph from scratch, exploring it with SQL queries, and exporting it to files. You will learn the core workflow used in graph processing pipelines.
Note: If running from a source checkout, use `uv run koza` instead of `koza`. If installed via pip, use `koza` directly.
Overview
- Create sample KGX node and edge files
- Join files into a DuckDB database
- Explore your graph with SQL queries
- Generate statistics and reports
- Export to different formats
Prerequisites
Before starting, ensure you have:
- Koza installed, via one of:
  - uv: `uvx koza` (run directly), `uv add koza`, or `uv pip install koza`
  - poetry: `poetry add koza`
  - pip: `pip install koza`
- DuckDB CLI (optional but recommended): install from duckdb.org or `pip install duckdb`
- Basic command line familiarity
Verify your installation:
koza --version
You should see the Koza version number printed.
Sample Data
We will create a small knowledge graph about genes and diseases. The graph will have:
- 5 nodes: 3 genes and 2 diseases
- 5 edges: Gene-disease associations
This is a tiny example, but the same commands work identically on graphs with millions of nodes and edges.
Understanding KGX Format
KGX (Knowledge Graph Exchange) is a standard format for biomedical knowledge graphs. It uses:
- Nodes file: Contains entities (genes, diseases, phenotypes, etc.)
- Edges file: Contains relationships between entities
Both are tab-separated files with specific columns. The minimum required columns are:
- Nodes: `id`, `category`, `name`
- Edges: `id`, `subject`, `predicate`, `object`
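Under that minimum, a valid KGX pair can be as small as the following sketch (the file names `min_nodes.tsv` and `min_edges.tsv` are made up for this illustration):

```shell
# Write a minimal nodes file and a minimal edges file containing only
# the required KGX columns (tab-separated; file names are illustrative).
printf 'id\tcategory\tname\nHGNC:1100\tbiolink:Gene\tBRCA1\n' > min_nodes.tsv
printf 'id\tsubject\tpredicate\tobject\nuuid:1\tHGNC:1100\tbiolink:interacts_with\tHGNC:1101\n' > min_edges.tsv

# Print each header column on its own line to confirm the layout
head -1 min_nodes.tsv | tr '\t' '\n'
```

Real KGX files usually carry extra columns (`description`, `provided_by`, and so on), as the sample files in Step 1 do.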
Step 1: Create Sample Files
Let us create our sample data files. You can copy-paste these commands into your terminal, or create the files manually in a text editor.
Create a working directory
mkdir -p kgx-tutorial
cd kgx-tutorial
Create the nodes file
Create a file named `sample_nodes.tsv` with the following content (columns are separated by tabs; make sure your editor or shell preserves them when copying):
cat > sample_nodes.tsv << 'EOF'
id category name description provided_by
HGNC:1100 biolink:Gene BRCA1 BRCA1 DNA repair associated infores:hgnc
HGNC:1101 biolink:Gene BRCA2 BRCA2 DNA repair associated infores:hgnc
HGNC:7881 biolink:Gene NOTCH1 notch receptor 1 infores:hgnc
MONDO:0007254 biolink:Disease breast cancer A malignant neoplasm of the breast infores:mondo
MONDO:0005070 biolink:Disease leukemia Cancer of blood-forming tissues infores:mondo
EOF
Create the edges file
Create a file named `sample_edges.tsv` with the following content (again, tab-separated):
cat > sample_edges.tsv << 'EOF'
id subject predicate object primary_knowledge_source provided_by
uuid:1 HGNC:1100 biolink:gene_associated_with_condition MONDO:0007254 infores:clinvar infores:clinvar
uuid:2 HGNC:1101 biolink:gene_associated_with_condition MONDO:0007254 infores:clinvar infores:clinvar
uuid:3 HGNC:7881 biolink:gene_associated_with_condition MONDO:0005070 infores:clinvar infores:clinvar
uuid:4 HGNC:1100 biolink:interacts_with HGNC:1101 infores:string infores:string
uuid:5 HGNC:1100 biolink:interacts_with HGNC:7881 infores:string infores:string
EOF
Verify the files
Check that your files look correct:
head sample_nodes.tsv
head sample_edges.tsv
You should see the header row followed by data rows for each file.
Step 2: Join Into a Database
Now we will combine these files into a DuckDB database. DuckDB is an embedded analytical database that supports SQL queries on your graph data.
Run the join command
koza join \
--nodes sample_nodes.tsv \
--edges sample_edges.tsv \
--output my_graph.duckdb
You should see output similar to:
Join operation completed successfully!
The command creates my_graph.duckdb containing two tables:
- `nodes` - All your node records
- `edges` - All your edge records
What just happened?
The koza join command:
- Read both input files
- Detected the TSV format automatically
- Inferred column types from the data
- Created a DuckDB database with optimized storage
- Loaded all records into the `nodes` and `edges` tables
This same process works with:
- Mixed file formats (TSV, JSONL, Parquet)
- Multiple input files per table
- Compressed files (.gz, .bz2)
- Files with different column schemas (missing columns are filled with NULL)
Step 3: Explore with SQL
DuckDB allows you to query your graph using standard SQL. Let us explore our data.
Count nodes and edges
duckdb my_graph.duckdb "SELECT COUNT(*) AS node_count FROM nodes"
Output:
┌────────────┐
│ node_count │
│   int64    │
├────────────┤
│          5 │
└────────────┘
duckdb my_graph.duckdb "SELECT COUNT(*) AS edge_count FROM edges"
Output:
┌────────────┐
│ edge_count │
│   int64    │
├────────────┤
│          5 │
└────────────┘
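If the DuckDB CLI is not installed, the same counts can be sanity-checked with plain coreutils. This sketch uses a small stand-in file (`demo_nodes.tsv`, two data rows) so it runs anywhere; point the same pipe at your `sample_nodes.tsv` from Step 1 to check the real data (expect 5):

```shell
# Stand-in for the nodes file created in Step 1 (two data rows here)
printf 'id\tcategory\tname\nHGNC:1100\tbiolink:Gene\tBRCA1\nHGNC:1101\tbiolink:Gene\tBRCA2\n' > demo_nodes.tsv

# Count data rows, skipping the header line
tail -n +2 demo_nodes.tsv | wc -l
```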
View the database schema
See what columns are available:
duckdb my_graph.duckdb "DESCRIBE nodes"
Output:
┌─────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│ column_name │ column_type │  null   │   key   │ default │  extra  │
│   varchar   │   varchar   │ varchar │ varchar │ varchar │ varchar │
├─────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ id          │ VARCHAR     │ YES     │         │         │         │
│ category    │ VARCHAR     │ YES     │         │         │         │
│ name        │ VARCHAR     │ YES     │         │         │         │
│ description │ VARCHAR     │ YES     │         │         │         │
│ provided_by │ VARCHAR     │ YES     │         │         │         │
└─────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┘
List all categories
See what types of entities are in your graph:
duckdb my_graph.duckdb "SELECT DISTINCT category FROM nodes"
Output:
┌─────────────────┐
│    category     │
│     varchar     │
├─────────────────┤
│ biolink:Gene    │
│ biolink:Disease │
└─────────────────┘
Count nodes by category
duckdb my_graph.duckdb "SELECT category, COUNT(*) AS count FROM nodes GROUP BY category"
Output:
┌─────────────────┬───────┐
│    category     │ count │
│     varchar     │ int64 │
├─────────────────┼───────┤
│ biolink:Gene    │     3 │
│ biolink:Disease │     2 │
└─────────────────┴───────┘
Find specific nodes
Search for nodes by name:
duckdb my_graph.duckdb "SELECT id, name FROM nodes WHERE name LIKE '%BRCA%'"
Output:
┌───────────┬─────────┐
│    id     │  name   │
│  varchar  │ varchar │
├───────────┼─────────┤
│ HGNC:1100 │ BRCA1   │
│ HGNC:1101 │ BRCA2   │
└───────────┴─────────┘
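Without SQL, a rough equivalent of the `LIKE '%BRCA%'` filter is a plain `grep` over the TSV. The stand-in file below mirrors the Step 1 genes:

```shell
# Stand-in nodes file (id and name columns, mirroring Step 1)
printf 'id\tname\nHGNC:1100\tBRCA1\nHGNC:1101\tBRCA2\nHGNC:7881\tNOTCH1\n' > demo_nodes.tsv

# Rough equivalent of WHERE name LIKE '%BRCA%'
grep 'BRCA' demo_nodes.tsv
```

Note that `grep` matches anywhere in the row, not just the `name` column, so this is a quick check rather than an exact substitute.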
Explore edge relationships
List all predicate types:
duckdb my_graph.duckdb "SELECT predicate, COUNT(*) AS count FROM edges GROUP BY predicate"
Output:
┌────────────────────────────────────────┬───────┐
│               predicate                │ count │
│                varchar                 │ int64 │
├────────────────────────────────────────┼───────┤
│ biolink:gene_associated_with_condition │     3 │
│ biolink:interacts_with                 │     2 │
└────────────────────────────────────────┴───────┘
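The same `GROUP BY` tally can be approximated in the shell by cutting out the predicate column and counting. The stand-in file below uses three edges across two predicates:

```shell
# Stand-in edges file: three edges, two distinct predicates
printf 'id\tsubject\tpredicate\tobject\nuuid:1\tA\tbiolink:interacts_with\tB\nuuid:2\tA\tbiolink:interacts_with\tC\nuuid:3\tB\tbiolink:related_to\tC\n' > demo_edges.tsv

# Equivalent of GROUP BY predicate: take column 3, skip the header, tally
tail -n +2 demo_edges.tsv | cut -f3 | sort | uniq -c
```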
Find edges for a specific node
What diseases are associated with BRCA1?
duckdb my_graph.duckdb "
SELECT e.predicate, n.name AS disease_name
FROM edges e
JOIN nodes n ON e.object = n.id
WHERE e.subject = 'HGNC:1100'
AND e.predicate = 'biolink:gene_associated_with_condition'
"
Output:
┌────────────────────────────────────────┬───────────────┐
│               predicate                │ disease_name  │
│                varchar                 │    varchar    │
├────────────────────────────────────────┼───────────────┤
│ biolink:gene_associated_with_condition │ breast cancer │
└────────────────────────────────────────┴───────────────┘
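One query the tutorial does not show is node degree (how many edges touch each node). It can be written in SQL over the `edges` table, or sketched in the shell by listing every subject and object and tallying, as below with a small stand-in file:

```shell
# Stand-in edges file: node A participates in two edges, B and C in one each
printf 'id\tsubject\tpredicate\tobject\nuuid:1\tA\tp\tB\nuuid:2\tA\tp\tC\n' > demo_edges.tsv

# Degree = number of times a node id appears as subject or object,
# sorted with the most connected node first
tail -n +2 demo_edges.tsv | awk -F'\t' '{print $2; print $4}' | sort | uniq -c | sort -rn
```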
Step 4: Generate Statistics
Koza provides built-in commands for generating reports about your graph. These are especially useful for quality control when working with larger datasets.
Generate graph statistics
koza report graph-stats --database my_graph.duckdb --output graph_stats.yaml
This creates a YAML file with comprehensive statistics:
cat graph_stats.yaml
The report includes:
- Total node and edge counts
- Counts by category and predicate
- Namespace distributions
- Data source breakdowns
Generate a QC report
The QC (Quality Control) report provides more detailed analysis:
koza report qc --database my_graph.duckdb --output qc_report.yaml
This report helps identify potential data quality issues like:
- Missing required fields
- Orphan nodes (nodes not connected to any edges)
- Invalid identifiers
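These checks can also be reproduced by hand. For example, orphan nodes are node ids that never appear as an edge subject or object; the sketch below diffs the two id sets on a stand-in graph where node `D` has no edges:

```shell
# Stand-in files: node D appears in the nodes table but in no edge
printf 'id\tcategory\tname\nA\tbiolink:Gene\tx\nD\tbiolink:Gene\ty\n' > demo_nodes.tsv
printf 'id\tsubject\tpredicate\tobject\nuuid:1\tA\tp\tA\n' > demo_edges.tsv

# All node ids vs. all ids referenced by edges (sorted, deduplicated)
tail -n +2 demo_nodes.tsv | cut -f1 | sort -u > node_ids.txt
tail -n +2 demo_edges.tsv | awk -F'\t' '{print $2; print $4}' | sort -u > edge_ids.txt

# Lines only in node_ids.txt are orphan nodes
comm -23 node_ids.txt edge_ids.txt
```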
View report summary
You can also run reports without saving to a file to see output in the terminal:
koza report graph-stats --database my_graph.duckdb
Step 5: Export to Files
After working with your graph in the database, you may want to export it back to files. The split command can export your entire graph or create subsets based on field values.
Export all nodes and edges
First, let us export the complete graph. The simplest approach is to split on a field where all records have the same value, or to use a field that groups naturally.
Export nodes by category:
koza split sample_nodes.tsv category --output-dir ./export
Output filenames are generated as {input}_{field_value}_{type}.tsv. Since our input is sample_nodes.tsv and we're splitting by category, this produces:
./export/
sample_biolink_Gene_nodes.tsv # nodes where category = biolink:Gene
sample_biolink_Disease_nodes.tsv # nodes where category = biolink:Disease
(The : in biolink:Gene becomes _ in the filename.)
Note: When data has been loaded through `koza transform`, the `provided_by` field is typically overwritten with the ingest name. If you want to split by data source after transformation, use a different field or ensure your data preserves the original source information in another column.
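To make the naming scheme concrete, here is a rough shell sketch of a per-value split on a stand-in file (it drops headers and other details that `koza split` handles for you; `demo_export` is a made-up directory name):

```shell
# Stand-in nodes file with two category values
printf 'id\tcategory\tname\nA\tbiolink:Gene\tx\nB\tbiolink:Disease\ty\n' > demo_nodes.tsv
mkdir -p demo_export

# One output file per category value, with ':' mapped to '_' in the filename
tail -n +2 demo_nodes.tsv | awk -F'\t' '{
  f = $2; gsub(":", "_", f)
  print > ("demo_export/sample_" f "_nodes.tsv")
}'

# Lists one file per category value
ls demo_export
```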
Convert to different formats
You can convert between formats during export. Convert to Parquet (a columnar format ideal for analytics):
koza split sample_nodes.tsv category \
--output-dir ./parquet_export \
--format parquet
This produces files like sample_biolink_Gene_nodes.parquet.
Or convert to JSONL (JSON Lines, useful for streaming):
koza split sample_edges.tsv predicate \
--output-dir ./jsonl_export \
--format jsonl
This produces one file per predicate value, such as sample_biolink_interacts_with_edges.jsonl and sample_biolink_gene_associated_with_condition_edges.jsonl.
Check exported files
Verify the exports look correct:
# Check Parquet files
ls ./parquet_export/
# Read a Parquet file with DuckDB
duckdb -c "SELECT * FROM read_parquet('./parquet_export/sample_biolink_Gene_nodes.parquet')"
# Check JSONL files
ls ./jsonl_export/
head ./jsonl_export/*.jsonl
Summary
This tutorial covered the core graph operations workflow. Here is what you accomplished:
1. Created KGX files - You made sample node and edge files in the standard KGX TSV format
2. Joined files into a database - The `koza join` command combined your files into a DuckDB database
3. Explored with SQL - You queried your graph to:
   - Count nodes and edges
   - List categories and predicates
   - Find specific entities
   - Traverse relationships
4. Generated reports - You used `koza report` to create statistics and quality control reports
5. Exported data - You used `koza split` to export subsets in different formats (TSV, Parquet, JSONL)
Key Commands Summary
| Command | Purpose |
|---|---|
| `koza join` | Combine KGX files into a DuckDB database |
| `koza report graph-stats` | Generate graph statistics |
| `koza report qc` | Generate quality control report |
| `koza split` | Export/split graph by field values |
Next Steps
Now that you understand the basics, explore more advanced capabilities:
Continue Learning
- Complete Merge Workflow - Learn to combine data from multiple sources, normalize identifiers with SSSOM mappings, and clean your graph in one pipeline
How-to Guides
- Join Files - Advanced joining with glob patterns, mixed formats, and schema reporting
- Split Graphs - More splitting options including prefix removal and multivalued fields
- Generate Reports - All available report types and customization options
- Normalize IDs - Use SSSOM mappings to harmonize identifiers
- Clean Graphs - Remove duplicates and dangling edges
Reference
- CLI Reference - Complete documentation for all commands and options
Cleanup
When you are done experimenting, you can remove the tutorial files:
cd ..
rm -rf kgx-tutorial
Or keep them around to continue exploring.
These patterns scale to handle graphs with millions of nodes and edges.