Complete Merge Workflow
This tutorial covers the full graph merge pipeline. It demonstrates combining data from multiple sources with different identifier schemes.
Note: If running from a source checkout, use `uv run koza` instead of `koza`. If installed via pip, use `koza` directly.
Overview
- Prepare input files from multiple sources
- Create SSSOM mapping files for identifier harmonization
- Run the complete merge pipeline
- Customize pipeline steps for your use case
- Troubleshoot common issues
Prerequisites
- Completed the "Build Your First Graph" tutorial
- Understanding of KGX format (nodes and edges as TSV/JSONL/Parquet)
- Basic familiarity with identifier namespaces (e.g., HGNC, OMIM, MONDO)
Scenario
You are building a knowledge graph that integrates data from three sources:
- HGNC genes: Gene nodes with HGNC identifiers
- OMIM diseases: Disease nodes with OMIM identifiers
- Gene-disease associations: Edges connecting genes to diseases, but using MONDO identifiers for diseases
The challenge: Your edges reference MONDO disease IDs, but your disease nodes use OMIM IDs. Without normalization, these edges would be "dangling" (referencing nodes that do not exist in your graph).
The solution: Use SSSOM mappings to normalize OMIM identifiers to MONDO, allowing edges to connect properly to your disease nodes.
Sample Data
Create a working directory and add these sample files.
Source 1: Gene Nodes (genes_nodes.tsv)
id name category provided_by
HGNC:1100 BRCA1 biolink:Gene hgnc
HGNC:1101 BRCA2 biolink:Gene hgnc
HGNC:7989 TP53 biolink:Gene hgnc
HGNC:3689 CFTR biolink:Gene hgnc
Source 2: Disease Nodes (diseases_nodes.tsv)
id name category provided_by
OMIM:114480 Breast Cancer biolink:Disease omim
OMIM:219700 Cystic Fibrosis biolink:Disease omim
OMIM:151623 Li-Fraumeni Syndrome biolink:Disease omim
Source 3: Gene-Disease Edges (associations_edges.tsv)
Note: These edges use MONDO identifiers for diseases (not OMIM).
id subject predicate object primary_knowledge_source provided_by
uuid:1 HGNC:1100 biolink:gene_associated_with_condition MONDO:0007254 infores:monarch associations
uuid:2 HGNC:1101 biolink:gene_associated_with_condition MONDO:0007254 infores:monarch associations
uuid:3 HGNC:7989 biolink:gene_associated_with_condition MONDO:0007903 infores:monarch associations
uuid:4 HGNC:3689 biolink:gene_associated_with_condition MONDO:0009061 infores:monarch associations
Mapping File (mondo_omim.sssom.tsv)
This SSSOM file maps OMIM identifiers (object_id) to their equivalent MONDO identifiers (subject_id).
#curie_map:
# MONDO: http://purl.obolibrary.org/obo/MONDO_
# OMIM: https://omim.org/entry/
#mapping_set_id: https://example.org/mappings/mondo-omim
subject_id predicate_id object_id mapping_justification
MONDO:0007254 skos:exactMatch OMIM:114480 semapv:ManualMappingCuration
MONDO:0009061 skos:exactMatch OMIM:219700 semapv:ManualMappingCuration
MONDO:0007903 skos:exactMatch OMIM:151623 semapv:ManualMappingCuration
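Tab characters are easily lost when sample rows are copied from a rendered page. As a convenience, the sketch below writes the four sample files into the current directory with explicit tab separators. The rows are the ones listed above; the use of `printf` is just one way to do this and is not something Koza requires.

```shell
# Write the tutorial's sample files with real tab separators.
# printf reuses its format string, so each group of arguments becomes one row.

printf '%s\t%s\t%s\t%s\n' \
  id name category provided_by \
  HGNC:1100 BRCA1 biolink:Gene hgnc \
  HGNC:1101 BRCA2 biolink:Gene hgnc \
  HGNC:7989 TP53 biolink:Gene hgnc \
  HGNC:3689 CFTR biolink:Gene hgnc \
  > genes_nodes.tsv

printf '%s\t%s\t%s\t%s\n' \
  id name category provided_by \
  OMIM:114480 'Breast Cancer' biolink:Disease omim \
  OMIM:219700 'Cystic Fibrosis' biolink:Disease omim \
  OMIM:151623 'Li-Fraumeni Syndrome' biolink:Disease omim \
  > diseases_nodes.tsv

printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
  id subject predicate object primary_knowledge_source provided_by \
  uuid:1 HGNC:1100 biolink:gene_associated_with_condition MONDO:0007254 infores:monarch associations \
  uuid:2 HGNC:1101 biolink:gene_associated_with_condition MONDO:0007254 infores:monarch associations \
  uuid:3 HGNC:7989 biolink:gene_associated_with_condition MONDO:0007903 infores:monarch associations \
  uuid:4 HGNC:3689 biolink:gene_associated_with_condition MONDO:0009061 infores:monarch associations \
  > associations_edges.tsv

{
  # SSSOM metadata header: plain lines, each starting with '#'
  printf '%s\n' \
    '#curie_map:' \
    '# MONDO: http://purl.obolibrary.org/obo/MONDO_' \
    '# OMIM: https://omim.org/entry/' \
    '#mapping_set_id: https://example.org/mappings/mondo-omim'
  # Tab-separated mapping rows
  printf '%s\t%s\t%s\t%s\n' \
    subject_id predicate_id object_id mapping_justification \
    MONDO:0007254 skos:exactMatch OMIM:114480 semapv:ManualMappingCuration \
    MONDO:0009061 skos:exactMatch OMIM:219700 semapv:ManualMappingCuration \
    MONDO:0007903 skos:exactMatch OMIM:151623 semapv:ManualMappingCuration
} > mondo_omim.sssom.tsv
```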
Step 1: Prepare Input Files
First, let's understand what each file contributes and verify they exist.
Examine the Input Files
# Check file contents
head -5 genes_nodes.tsv
head -5 diseases_nodes.tsv
head -5 associations_edges.tsv
head -10 mondo_omim.sssom.tsv
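Beyond eyeballing the first rows, it can help to confirm that each file's header carries the columns the later steps rely on. The following is a minimal sketch; `has_columns` is a hypothetical helper (not part of Koza), and the column names checked in the usage comments are simply the ones from the sample files.

```shell
# has_columns FILE COLUMN...
# Succeeds if every named column appears in the tab-separated header of FILE.
has_columns() {
  file=$1
  shift
  header=$(head -n 1 "$file")
  for col in "$@"; do
    # Split the header into one column name per line and look for an exact match
    printf '%s\n' "$header" | tr '\t' '\n' | grep -qxF -- "$col" || {
      echo "missing column '$col' in $file" >&2
      return 1
    }
  done
}

# Usage with the tutorial files:
#   has_columns genes_nodes.tsv id category
#   has_columns associations_edges.tsv subject predicate object
```

A failure here usually means the tabs were converted to spaces when the file was created.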
Understanding the ID Mismatch Problem
Look at the edge file - it references MONDO identifiers:
# See what disease IDs the edges reference
cut -f4 associations_edges.tsv | tail -n +2 | sort -u
Output:
MONDO:0007254
MONDO:0007903
MONDO:0009061
But the disease nodes use OMIM identifiers:
# See what IDs the disease nodes have
cut -f1 diseases_nodes.tsv | tail -n +2
Output:
OMIM:114480
OMIM:219700
OMIM:151623
Without normalization, all four edges would be dangling because MONDO:0007254 does not match OMIM:114480, even though they represent the same disease.
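The same comparison can be automated before running any merge. The sketch below lists identifiers that an edge file references but that no node file defines; `missing_refs` is a hypothetical helper, and it assumes tab-separated files with a header row and node IDs in the first column.

```shell
# missing_refs EDGES COLUMN NODES...
# Prints IDs found in the given 1-based column of EDGES that do not appear
# in column 1 of any NODES file.
missing_refs() {
  edges=$1
  col=$2
  shift 2
  refs=$(mktemp)
  ids=$(mktemp)
  # Unique IDs referenced by the edges (header row skipped)
  tail -n +2 "$edges" | cut -f "$col" | sort -u > "$refs"
  # Unique IDs defined across all node files
  for f in "$@"; do tail -n +2 "$f" | cut -f 1; done | sort -u > "$ids"
  # Lines present only in the first list
  comm -23 "$refs" "$ids"
  rm -f "$refs" "$ids"
}

# Usage with the tutorial files (column 4 is "object"):
#   missing_refs associations_edges.tsv 4 genes_nodes.tsv diseases_nodes.tsv
```

For the sample data this prints the three MONDO identifiers, and checking column 2 (subject) prints nothing because every HGNC ID has a gene node.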
Step 2: Create SSSOM Mappings
The SSSOM (Simple Standard for Sharing Ontological Mappings) file defines how identifiers map to each other.
SSSOM File Structure
An SSSOM file has two parts:
- Header (optional): YAML metadata starting with `#`
- Data: Tab-separated mappings
#curie_map:
# MONDO: http://purl.obolibrary.org/obo/MONDO_
# OMIM: https://omim.org/entry/
#mapping_set_id: https://example.org/mappings/mondo-omim
subject_id predicate_id object_id mapping_justification
MONDO:0007254 skos:exactMatch OMIM:114480 semapv:ManualMappingCuration
Key Columns for Normalization
| Column | Purpose | Direction |
|---|---|---|
| subject_id | Target identifier (normalize TO this) | Output |
| object_id | Source identifier (normalize FROM this) | Input |
| predicate_id | Relationship type | Metadata |
| mapping_justification | How mapping was determined | Metadata |
During normalization, Koza looks up each edge's subject and object in the object_id column of the mapping file and, where a match exists, replaces it with the corresponding subject_id value.
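The direction of the lookup (object_id in, subject_id out) can be illustrated with a few lines of awk. This is only a sketch of the substitution and not Koza's implementation: `normalize_column` is a hypothetical helper, and it assumes the column order of the sample mapping file (subject_id first, object_id third).

```shell
# normalize_column SSSOM_FILE TSV_FILE COLUMN
# Rewrites the given 1-based column of TSV_FILE, replacing any value found
# as an SSSOM object_id with the matching subject_id. Prints to stdout.
normalize_column() {
  awk -F'\t' -v OFS='\t' -v col="$3" '
    FNR == NR {                        # first file: the SSSOM mappings
      if ($0 ~ /^#/) next              # skip the metadata header
      if ($1 == "subject_id") next     # skip the column header
      map[$3] = $1                     # object_id -> subject_id
      next
    }
    FNR > 1 && ($col in map) { $col = map[$col] }   # keep the TSV header as-is
    { print }
  ' "$1" "$2"
}

# Usage: rewrite the "object" column (4) of an edge file
#   normalize_column mondo_omim.sssom.tsv some_edges.tsv 4
```

Values without a mapping pass through unchanged, which is why unmapped identifiers can still end up as dangling references.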
Finding SSSOM Mappings
For real projects, you can obtain SSSOM mappings from:
- Mondo disease ontology mappings
- OBO Foundry ontologies
- SSSOM-Py tools for creating your own
- Biomedical identifier registries
Step 3: Run the Merge Pipeline
Now run the complete merge pipeline with a single command.
Basic Merge Command
koza merge \
-n genes_nodes.tsv \
-n diseases_nodes.tsv \
-e associations_edges.tsv \
-m mondo_omim.sssom.tsv \
-o merged_graph.duckdb
Understanding the Command Options
| Option | Purpose |
|---|---|
| -n, --nodes | Node files to load (use -n multiple times for multiple files) |
| -e, --edges | Edge files to load (use -e multiple times for multiple files) |
| -m, --mappings | SSSOM mapping files for normalization |
| -o, --output | Output DuckDB database file |
Expected Output
Starting merge pipeline...
Pipeline: join -> normalize -> prune
Output database: merged_graph.duckdb
Step 1: Join - Loading input files...
Join completed: 3 files | 7 nodes | 4 edges
Step 2: Normalize - Applying SSSOM mappings...
Normalize completed: 1 mapping files | 3 edge references normalized
Step 3: Prune - Cleaning graph structure...
Prune completed: 0 dangling edges moved | 0 singleton nodes handled
Merge pipeline completed successfully!
Step 4: Understand the Output
Let's examine what happened at each pipeline step.
Pipeline Steps Explained
- Join: Loaded all node and edge files into a unified database
- Normalize: Applied SSSOM mappings to convert OMIM IDs to MONDO IDs in edge references
- Prune: Removed dangling edges and handled singleton nodes
Verify Normalization Worked
Check that edge references were normalized. Normalization updates the subject and object columns in the edges table, preserving the original values in original_subject and original_object:
duckdb merged_graph.duckdb -c "
SELECT subject, object, original_subject, original_object
FROM edges
WHERE original_object IS NOT NULL
"
Expected output:
subject | object | original_subject | original_object
----------+---------------+------------------+----------------
HGNC:1100 | MONDO:0007254 | NULL | OMIM:114480
HGNC:1101 | MONDO:0007254 | NULL | OMIM:114480
HGNC:7989 | MONDO:0007903 | NULL | OMIM:151623
HGNC:3689 | MONDO:0009061 | NULL | OMIM:219700
The original_subject and original_object columns preserve the original identifiers for provenance. Note that in this example, the edge objects were normalized from OMIM to MONDO IDs, so original_object contains the original OMIM ID while object now contains the MONDO equivalent.
Verify Edges Connect Properly
duckdb merged_graph.duckdb -c "
SELECT
e.subject,
n1.name as subject_name,
e.object,
n2.name as object_name
FROM edges e
JOIN nodes n1 ON e.subject = n1.id
JOIN nodes n2 ON e.object = n2.id
"
All edges should now join successfully to nodes.
Check for Dangling Edges
duckdb merged_graph.duckdb -c "
SELECT COUNT(*) as dangling_count
FROM edges e
WHERE NOT EXISTS (SELECT 1 FROM nodes n WHERE n.id = e.subject)
OR NOT EXISTS (SELECT 1 FROM nodes n WHERE n.id = e.object)
"
Expected output: 0 (no dangling edges)
Step 5: Customize the Pipeline
The merge pipeline is flexible and can be customized for different needs.
Skip Normalization
If you do not need identifier normalization (e.g., IDs already match):
koza merge \
-n "*.nodes.tsv" \
-e "*.edges.tsv" \
-o graph.duckdb \
--skip-normalize
Skip Pruning
Keep all edges and nodes, even if some are dangling or disconnected:
koza merge \
-n "*.nodes.tsv" \
-e "*.edges.tsv" \
-m "*.sssom.tsv" \
-o graph.duckdb \
--skip-prune
Handle Singleton Nodes
By default, singleton nodes (nodes with no edges) are kept. To move them to a separate table:
koza merge \
-n "*.nodes.tsv" \
-e "*.edges.tsv" \
-m "*.sssom.tsv" \
-o graph.duckdb \
--remove-singletons
Export Final Data
Export the merged graph to files for use in other tools:
koza merge \
-n "*.nodes.tsv" \
-e "*.edges.tsv" \
-m "*.sssom.tsv" \
-o graph.duckdb \
--export \
--export-dir ./output/ \
--graph-name my_knowledge_graph
This creates my_knowledge_graph_nodes.tsv and my_knowledge_graph_edges.tsv.
Create a Compressed Archive
For distribution, create a compressed tar archive:
koza merge \
-n "*.nodes.tsv" \
-e "*.edges.tsv" \
-m "*.sssom.tsv" \
-o graph.duckdb \
--export \
--export-dir ./output/ \
--archive \
--compress \
--graph-name monarch_kg
This creates monarch_kg.tar.gz containing both node and edge files.
Auto-Discovery Mode
For directories with standard naming conventions, use auto-discovery:
koza merge \
--input-dir ./data/ \
--mappings-dir ./sssom/ \
-o merged.duckdb
This automatically finds files matching patterns like *_nodes.tsv and *.sssom.tsv.
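Before relying on auto-discovery, it can be useful to preview which files those patterns would pick up. The sketch below uses `find` with the two patterns named above plus `*_edges.tsv`, which is an assumption made by analogy; Koza's actual discovery rules may cover more formats or names. `preview_discovery` is a hypothetical helper.

```shell
# preview_discovery INPUT_DIR [MAPPINGS_DIR]
# Lists files directly inside the directories that match the naming patterns.
# MAPPINGS_DIR defaults to INPUT_DIR.
preview_discovery() {
  input_dir=$1
  mappings_dir=${2:-$1}
  echo "node files:"
  find "$input_dir" -maxdepth 1 -type f -name '*_nodes.tsv' | sort
  echo "edge files:"      # pattern assumed, not taken from the Koza docs
  find "$input_dir" -maxdepth 1 -type f -name '*_edges.tsv' | sort
  echo "mapping files:"
  find "$mappings_dir" -maxdepth 1 -type f -name '*.sssom.tsv' | sort
}

# Usage:
#   preview_discovery ./data/ ./sssom/
```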
Troubleshooting
Error: No Mapping Files Found
Error: Must specify --mappings-dir, --mappings, or --skip-normalize
Solution: Either provide SSSOM mappings or skip normalization:
# Option 1: Provide mappings
koza merge -n "*.nodes.tsv" -e "*.edges.tsv" \
-m mondo.sssom.tsv -o graph.duckdb
# Option 2: Skip normalization
koza merge -n "*.nodes.tsv" -e "*.edges.tsv" \
--skip-normalize -o graph.duckdb
Many Dangling Edges After Merge
If you see many dangling edges in the output, your mappings may be incomplete.
Diagnose:
# Check which edge objects are not in nodes
duckdb merged_graph.duckdb -c "
SELECT DISTINCT e.object
FROM edges e
WHERE NOT EXISTS (SELECT 1 FROM nodes n WHERE n.id = e.object)
LIMIT 20
"
Solutions:
- Add more mappings to cover missing identifiers
- Add the missing nodes from another source
- Use `--skip-prune` to keep dangling edges for later resolution
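To decide which of these solutions applies, it helps to know which namespaces the unresolved identifiers belong to. A minimal sketch: `prefix_counts` is a hypothetical helper that reads one CURIE per line on standard input and counts them by prefix.

```shell
# prefix_counts
# Reads CURIEs (PREFIX:LOCAL_ID) on stdin, prints "prefix<TAB>count" per prefix.
# Lines without a colon, such as a column header, are ignored.
prefix_counts() {
  awk -F: 'NF > 1 { n[$1]++ } END { for (p in n) print p "\t" n[p] }' | sort
}

# Usage: summarize the object column of an edge file
#   cut -f4 associations_edges.tsv | prefix_counts
```

A prefix with a high count and no mapping file covering it is a good candidate for an additional SSSOM source.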
Duplicate Mappings Warning
Found 234 duplicate mappings (one object_id mapped to multiple subject_ids).
Keeping only one mapping per object_id.
This occurs when your SSSOM file contains one-to-many mappings (one source ID maps to multiple targets).
Solution: Order your mapping files with preferred mappings first, or curate your SSSOM files to have deterministic mappings:
# Preferred mappings file first
koza merge \
-n "*.nodes.tsv" \
-e "*.edges.tsv" \
-m preferred_mappings.sssom.tsv \
-m secondary_mappings.sssom.tsv \
-o graph.duckdb
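To see which identifiers trigger this warning before merging, the mapping files can be scanned for object_id values that point at more than one subject_id. This is a sketch under the same column-order assumption as the sample mapping file (subject_id first, object_id third); `ambiguous_mappings` is a hypothetical helper, not a Koza command.

```shell
# ambiguous_mappings SSSOM_FILE...
# Prints each object_id that maps to more than one distinct subject_id,
# followed by the number of distinct subject_ids.
ambiguous_mappings() {
  cat "$@" |
    awk -F'\t' '
      /^#/ { next }                          # metadata header
      $1 == "subject_id" { next }            # column header
      !seen[$3 SUBSEP $1]++ { count[$3]++ }  # count each (object, subject) pair once
      END { for (id in count) if (count[id] > 1) print id "\t" count[id] }
    ' | sort
}

# Usage:
#   ambiguous_mappings preferred_mappings.sssom.tsv secondary_mappings.sssom.tsv
```

Exact duplicate rows are not reported, since they do not make the mapping ambiguous.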
Join Failed - No Files Loaded
Error: Join operation failed - no files were loaded
Causes:
- File paths are incorrect
- Glob patterns do not match any files
- Files are in an unsupported format
Solution: Verify files exist and use correct patterns:
# Check files exist
ls -la *.nodes.tsv *.edges.tsv
# Use explicit paths instead of patterns
koza merge \
-n genes_nodes.tsv \
-n diseases_nodes.tsv \
-e associations_edges.tsv \
-m mondo_omim.sssom.tsv \
-o graph.duckdb
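A small pre-flight check can catch the first two causes before the pipeline starts. The following sketch reports paths that do not exist or are empty; `check_inputs` is a hypothetical helper and does not validate file format.

```shell
# check_inputs FILE...
# Reports missing or empty files on stderr; returns non-zero if any were found.
check_inputs() {
  status=0
  for f in "$@"; do
    if [ ! -f "$f" ]; then
      echo "not found: $f" >&2
      status=1
    elif [ ! -s "$f" ]; then
      echo "empty: $f" >&2
      status=1
    fi
  done
  return "$status"
}

# Usage:
#   check_inputs genes_nodes.tsv diseases_nodes.tsv associations_edges.tsv mondo_omim.sssom.tsv
```

When an unquoted glob matches nothing, most shells pass the pattern through literally, so a message such as "not found: *.nodes.tsv" points to a pattern problem rather than a missing file.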
Out of Memory for Large Graphs
DuckDB handles most memory management automatically, even for very large graphs. If you still run out of memory:
- Use a persistent database (always use `--output`)
- Process files in batches using `koza append`
- Ensure sufficient disk space for temporary files
Summary
This tutorial covered:
- Preparing KGX files from multiple sources with different ID schemes
- Creating SSSOM mapping files to harmonize identifiers
- Running the complete merge pipeline (join, deduplicate, normalize, prune)
- Customizing the pipeline by skipping steps or changing options
- Verifying merge results using DuckDB queries
- Troubleshooting common issues with mappings and dangling edges
Next Steps
Related documentation:
- How to Normalize IDs - Deep dive into SSSOM mappings
- How to Join Files - Advanced file joining options
- How to Clean Graphs - Graph pruning strategies
- How to Generate Reports - QC and statistics
- CLI Reference - Complete command documentation