Build Your First Graph

This tutorial covers creating a knowledge graph from scratch, exploring it with SQL queries, and exporting it to files. You will learn the core workflow used in graph processing pipelines.

Note: If running from a source checkout, use uv run koza instead of koza. If installed via pip, use koza directly.

Overview

  • Create sample KGX node and edge files
  • Join files into a DuckDB database
  • Explore your graph with SQL queries
  • Generate statistics and reports
  • Export to different formats

Prerequisites

Before starting, ensure you have:

  • Koza installed:
    • uv: uvx koza (run directly), uv add koza, or uv pip install koza
    • poetry: poetry add koza
    • pip: pip install koza
  • DuckDB CLI (optional but recommended): Install from duckdb.org or pip install duckdb
  • Basic command line familiarity

Verify your installation:

koza --version

You should see the Koza version number printed.

Sample Data

We will create a small knowledge graph about genes and diseases. The graph will have:

  • 5 nodes: 3 genes and 2 diseases
  • 5 edges: 3 gene-disease associations and 2 gene-gene interactions

This is a tiny example, but the same commands work identically on graphs with millions of nodes and edges.

Understanding KGX Format

KGX (Knowledge Graph Exchange) is a standard format for biomedical knowledge graphs. It uses:

  • Nodes file: Contains entities (genes, diseases, phenotypes, etc.)
  • Edges file: Contains relationships between entities

Both are tab-separated files with specific columns. The minimum required columns (sketched just after this list) are:

  • Nodes: id, category, name
  • Edges: id, subject, predicate, object
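
As a minimal sketch (the minimal_*.tsv file names are just for illustration), these headers can be written with printf, which makes the tab separators explicit:

printf 'id\tcategory\tname\n' > minimal_nodes.tsv
printf 'id\tsubject\tpredicate\tobject\n' > minimal_edges.tsv

We will use the same printf technique to build the full sample files in Step 1.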

Step 1: Create Sample Files

Let us create our sample data files. You can copy-paste these commands into your terminal, or create the files manually in a text editor (just make sure the columns are separated by real tab characters).

Create a working directory

mkdir -p kgx-tutorial
cd kgx-tutorial

Create the nodes file

Create a file named sample_nodes.tsv. The printf commands below write one row per line, with \t producing the literal tab separators:

printf 'id\tcategory\tname\tdescription\tprovided_by\n' > sample_nodes.tsv
printf 'HGNC:1100\tbiolink:Gene\tBRCA1\tBRCA1 DNA repair associated\tinfores:hgnc\n' >> sample_nodes.tsv
printf 'HGNC:1101\tbiolink:Gene\tBRCA2\tBRCA2 DNA repair associated\tinfores:hgnc\n' >> sample_nodes.tsv
printf 'HGNC:7881\tbiolink:Gene\tNOTCH1\tnotch receptor 1\tinfores:hgnc\n' >> sample_nodes.tsv
printf 'MONDO:0007254\tbiolink:Disease\tbreast cancer\tA malignant neoplasm of the breast\tinfores:mondo\n' >> sample_nodes.tsv
printf 'MONDO:0005070\tbiolink:Disease\tleukemia\tCancer of blood-forming tissues\tinfores:mondo\n' >> sample_nodes.tsv

Create the edges file

Create a file named sample_edges.tsv the same way:

printf 'id\tsubject\tpredicate\tobject\tprimary_knowledge_source\tprovided_by\n' > sample_edges.tsv
printf 'uuid:1\tHGNC:1100\tbiolink:gene_associated_with_condition\tMONDO:0007254\tinfores:clinvar\tinfores:clinvar\n' >> sample_edges.tsv
printf 'uuid:2\tHGNC:1101\tbiolink:gene_associated_with_condition\tMONDO:0007254\tinfores:clinvar\tinfores:clinvar\n' >> sample_edges.tsv
printf 'uuid:3\tHGNC:7881\tbiolink:gene_associated_with_condition\tMONDO:0005070\tinfores:clinvar\tinfores:clinvar\n' >> sample_edges.tsv
printf 'uuid:4\tHGNC:1100\tbiolink:interacts_with\tHGNC:1101\tinfores:string\tinfores:string\n' >> sample_edges.tsv
printf 'uuid:5\tHGNC:1100\tbiolink:interacts_with\tHGNC:7881\tinfores:string\tinfores:string\n' >> sample_edges.tsv

Verify the files

Check that your files look correct:

head sample_nodes.tsv
head sample_edges.tsv

You should see the header row followed by data rows for each file.
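
Because KGX files must be strictly tab-separated, a quick awk check (plain shell, nothing Koza-specific) confirms every row has the same number of columns:

awk -F'\t' '{print NF}' sample_nodes.tsv | sort -u
awk -F'\t' '{print NF}' sample_edges.tsv | sort -u

Each command should print a single number (5 for the nodes file, 6 for the edges file); more than one number means a row has missing or extra tabs.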

Step 2: Join Into a Database

Now we will combine these files into a DuckDB database. DuckDB is an embedded analytical database that supports SQL queries on your graph data.

Run the join command

koza join \
  --nodes sample_nodes.tsv \
  --edges sample_edges.tsv \
  --output my_graph.duckdb

You should see output similar to:

Join operation completed successfully!

The command creates my_graph.duckdb containing two tables (verified just below):

  • nodes - All your node records
  • edges - All your edge records
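
You can confirm both tables exist with DuckDB's SHOW TABLES statement:

duckdb my_graph.duckdb "SHOW TABLES"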

What just happened?

The koza join command:

  1. Read both input files
  2. Detected the TSV format automatically
  3. Inferred column types from the data
  4. Created a DuckDB database with optimized storage
  5. Loaded all records into nodes and edges tables

This same process works with the following (see the sketch after this list):

  • Mixed file formats (TSV, JSONL, Parquet)
  • Multiple input files per table
  • Compressed files (.gz, .bz2)
  • Files with different column schemas (missing columns are filled with NULL)
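
The exact syntax for multiple inputs and glob patterns is covered in the Join Files guide; as a rough sketch, assuming --nodes and --edges can each take more than one file and treating the file names as placeholders:

koza join \
  --nodes source_a_nodes.tsv source_b_nodes.jsonl.gz \
  --edges source_a_edges.tsv source_b_edges.parquet \
  --output combined.duckdb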

Step 3: Explore with SQL

DuckDB allows you to query your graph using standard SQL. Let us explore our data.

Count nodes and edges

duckdb my_graph.duckdb "SELECT COUNT(*) AS node_count FROM nodes"

Output:

┌────────────┐
│ node_count │
│   int64    │
├────────────┤
│          5 │
└────────────┘

duckdb my_graph.duckdb "SELECT COUNT(*) AS edge_count FROM edges"

Output:

┌────────────┐
│ edge_count │
│   int64    │
├────────────┤
│          5 │
└────────────┘

View the database schema

See what columns are available:

duckdb my_graph.duckdb "DESCRIBE nodes"

Output:

┌─────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│ column_name │ column_type │  null   │   key   │ default │  extra  │
│   varchar   │   varchar   │ varchar │ varchar │ varchar │ varchar │
├─────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ id          │ VARCHAR     │ YES     │         │         │         │
│ category    │ VARCHAR     │ YES     │         │         │         │
│ name        │ VARCHAR     │ YES     │         │         │         │
│ description │ VARCHAR     │ YES     │         │         │         │
│ provided_by │ VARCHAR     │ YES     │         │         │         │
└─────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┘
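
The edges table can be inspected the same way:

duckdb my_graph.duckdb "DESCRIBE edges"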

List all categories

See what types of entities are in your graph:

duckdb my_graph.duckdb "SELECT DISTINCT category FROM nodes"

Output:

┌─────────────────┐
│    category     │
│     varchar     │
├─────────────────┤
│ biolink:Gene    │
│ biolink:Disease │
└─────────────────┘

Count nodes by category

duckdb my_graph.duckdb "SELECT category, COUNT(*) AS count FROM nodes GROUP BY category"

Output:

┌─────────────────┬───────┐
│    category     │ count │
│     varchar     │ int64 │
├─────────────────┼───────┤
│ biolink:Gene    │     3 │
│ biolink:Disease │     2 │
└─────────────────┴───────┘

Find specific nodes

Search for nodes by name:

duckdb my_graph.duckdb "SELECT id, name FROM nodes WHERE name LIKE '%BRCA%'"

Output:

┌───────────┬─────────┐
│    id     │  name   │
│  varchar  │ varchar │
├───────────┼─────────┤
│ HGNC:1100 │ BRCA1   │
│ HGNC:1101 │ BRCA2   │
└───────────┴─────────┘

Explore edge relationships

List all predicate types:

duckdb my_graph.duckdb "SELECT predicate, COUNT(*) AS count FROM edges GROUP BY predicate"

Output:

┌─────────────────────────────────────────┬───────┐
│                predicate                │ count │
│                 varchar                 │ int64 │
├─────────────────────────────────────────┼───────┤
│ biolink:gene_associated_with_condition  │     3 │
│ biolink:interacts_with                  │     2 │
└─────────────────────────────────────────┴───────┘

Find edges for a specific node

What diseases are associated with BRCA1?

duckdb my_graph.duckdb "
SELECT e.predicate, n.name AS disease_name
FROM edges e
JOIN nodes n ON e.object = n.id
WHERE e.subject = 'HGNC:1100'
  AND e.predicate = 'biolink:gene_associated_with_condition'
"

Output:

┌────────────────────────────────────────┬───────────────┐
│               predicate                │ disease_name  │
│                varchar                 │    varchar    │
├────────────────────────────────────────┼───────────────┤
│ biolink:gene_associated_with_condition │ breast cancer │
└────────────────────────────────────────┴───────────────┘
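
Graph-style questions translate directly into SQL. As one more example (standard SQL, nothing Koza-specific), this query ranks nodes by degree, i.e. how many edges touch them:

duckdb my_graph.duckdb "
SELECT n.name, COUNT(*) AS degree
FROM edges e
JOIN nodes n ON n.id = e.subject OR n.id = e.object
GROUP BY n.name
ORDER BY degree DESC
"

In this sample graph, BRCA1 should top the list with a degree of 3.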

Step 4: Generate Statistics

Koza provides built-in commands for generating reports about your graph. These are especially useful for quality control when working with larger datasets.

Generate graph statistics

koza report graph-stats --database my_graph.duckdb --output graph_stats.yaml

This creates a YAML file with comprehensive statistics:

cat graph_stats.yaml

The report includes:

  • Total node and edge counts
  • Counts by category and predicate
  • Namespace distributions
  • Data source breakdowns

Generate a QC report

The QC (Quality Control) report provides more detailed analysis:

koza report qc --database my_graph.duckdb --output qc_report.yaml

This report helps identify potential data quality issues like the following (one is spot-checked below):

  • Missing required fields
  • Orphan nodes (nodes not connected to any edges)
  • Invalid identifiers
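
You can spot-check the orphan-node condition yourself with plain SQL; this is an independent check, not necessarily how koza report computes it:

duckdb my_graph.duckdb "
SELECT id, name FROM nodes
WHERE id NOT IN (SELECT subject FROM edges)
  AND id NOT IN (SELECT object FROM edges)
"

Every node in our sample graph participates in at least one edge, so this query should return no rows.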

View report summary

You can also run reports without saving to a file to see output in the terminal:

koza report graph-stats --database my_graph.duckdb

Step 5: Export to Files

After working with your graph in the database, you may want to export it back to files. The split command can export your entire graph or create subsets based on field values.

Export all nodes and edges

First, let us export the complete graph. The simplest approach is to split on a field where all records have the same value, or to use a field that groups naturally.

Export nodes by category:

koza split sample_nodes.tsv category --output-dir ./export

Output filenames follow the pattern {input}_{field_value}_{type}.tsv, where {input} is the input file name with its _nodes/_edges suffix and extension stripped. Since our input is sample_nodes.tsv and we are splitting by category, this produces:

./export/
  sample_biolink_Gene_nodes.tsv      # nodes where category = biolink:Gene
  sample_biolink_Disease_nodes.tsv   # nodes where category = biolink:Disease

(The : in biolink:Gene becomes _ in the filename.)

Note: When data has been loaded through koza transform, the provided_by field is typically overwritten with the ingest name. If you want to split by data source after transformation, use a different field or ensure your data preserves the original source information in another column.
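
For edges, primary_knowledge_source usually preserves the original source even after transformation, making it a natural field for a per-source split. Reusing the same command shape as above:

koza split sample_edges.tsv primary_knowledge_source --output-dir ./by_source

With our sample data this should yield one edges file per source, named along the lines of sample_infores_clinvar_edges.tsv and sample_infores_string_edges.tsv.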

Convert to different formats

You can convert between formats during export. Convert to Parquet (a columnar format ideal for analytics):

koza split sample_nodes.tsv category \
  --output-dir ./parquet_export \
  --format parquet

This produces files like sample_biolink_Gene_nodes.parquet.

Or convert to JSONL (JSON Lines, useful for streaming):

koza split sample_edges.tsv predicate \
  --output-dir ./jsonl_export \
  --format jsonl

This produces one file per predicate value, e.g. sample_biolink_interacts_with_edges.jsonl and sample_biolink_gene_associated_with_condition_edges.jsonl.

Check exported files

Verify the exports look correct:

# Check Parquet files
ls ./parquet_export/

# Read a Parquet file with DuckDB
duckdb -c "SELECT * FROM read_parquet('./parquet_export/sample_Gene_nodes.parquet')"

# Check JSONL files
ls ./jsonl_export/
head ./jsonl_export/*.jsonl
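
DuckDB can also read the JSONL exports directly (read_json_auto is a built-in DuckDB function), which makes for a quick structural sanity check:

duckdb -c "SELECT subject, predicate, object FROM read_json_auto('./jsonl_export/*.jsonl')"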

Summary

This tutorial covered the core graph operations workflow. Here is what you accomplished:

  1. Created KGX files - You made sample node and edge files in the standard KGX TSV format

  2. Joined files into a database - The koza join command combined your files into a DuckDB database

  3. Explored with SQL - You queried your graph to:
     • Count nodes and edges
     • List categories and predicates
     • Find specific entities
     • Traverse relationships

  4. Generated reports - You used koza report to create statistics and quality control reports

  5. Exported data - You used koza split to export subsets in different formats (TSV, Parquet, JSONL)

Key Commands Summary

Command                   Purpose
koza join                 Combine KGX files into a DuckDB database
koza report graph-stats   Generate graph statistics
koza report qc            Generate quality control report
koza split                Export/split graph by field values

Next Steps

Now that you understand the basics, explore more advanced capabilities:

Continue Learning

  • Complete Merge Workflow - Learn to combine data from multiple sources, normalize identifiers with SSSOM mappings, and clean your graph in one pipeline

How-to Guides

  • Join Files - Advanced joining with glob patterns, mixed formats, and schema reporting
  • Split Graphs - More splitting options including prefix removal and multivalued fields
  • Generate Reports - All available report types and customization options
  • Normalize IDs - Use SSSOM mappings to harmonize identifiers
  • Clean Graphs - Remove duplicates and dangling edges

Reference

  • CLI Reference - Complete documentation for all commands and options

Cleanup

When you are done experimenting, you can remove the tutorial files:

cd ..
rm -rf kgx-tutorial

Or keep them around to continue exploring.


These patterns scale to handle graphs with millions of nodes and edges.