How to Normalize IDs
Goal
Apply SSSOM mappings to harmonize identifiers in edge references. This allows you to consolidate equivalent identifiers from different sources (e.g., mapping OMIM disease IDs to MONDO) so that edges reference consistent node identifiers.
Prerequisites
- A DuckDB database with edges (created via koza join, koza merge, or koza append)
- One or more SSSOM mapping files that define the identifier mappings you want to apply
Understanding SSSOM
SSSOM (Simple Standard for Sharing Ontological Mappings) is a standard format for representing mappings between ontology terms and other identifiers.
SSSOM File Format
An SSSOM TSV file starts with an optional YAML metadata header (lines starting with #), followed by tab-separated data:
#curie_map:
# MONDO: http://purl.obolibrary.org/obo/MONDO_
# OMIM: https://omim.org/entry/
# HP: http://purl.obolibrary.org/obo/HP_
#mapping_set_id: https://example.org/mappings/mondo-omim
subject_id predicate_id object_id mapping_justification
MONDO:0005148 skos:exactMatch OMIM:222100 semapv:ManualMappingCuration
MONDO:0007455 skos:exactMatch OMIM:114500 semapv:ManualMappingCuration
MONDO:0008199 skos:exactMatch OMIM:176000 semapv:ManualMappingCuration
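If you want to inspect just the data rows of a file like this, skipping the comment header, a quick shell check works (the file name here is the example mapping file used later in this guide):
# Peek at the first few mapping rows, skipping the YAML comment header
grep -v '^#' mondo-omim.sssom.tsv | head -n 5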
How Koza Uses SSSOM for Normalization
Koza makes a simplifying assumption when applying SSSOM mappings: it maps FROM the object_id TO the subject_id. This is a practical simplification for identifier normalization and does not reflect the full semantic meaning of SSSOM mapping files (where subject/object semantics depend on the predicate).
The relevant columns for normalization are:
- subject_id: The target identifier (what Koza normalizes TO)
- object_id: The source identifier (what Koza normalizes FROM)
- predicate_id: The mapping relationship (e.g., skos:exactMatch) - used for filtering but not for determining direction
- mapping_justification: How the mapping was created (optional but recommended)
Important: Normalization changes edge references (the subject and object columns in the edges table), not the node IDs themselves. If an edge references an identifier that appears in the SSSOM object_id column, that reference is updated to the corresponding subject_id.
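In database terms, the rewrite direction can be sketched roughly as the SQL below. This is only an illustrative sketch, not Koza's actual implementation, and it assumes the SSSOM rows have been loaded into a hypothetical table named mappings with subject_id and object_id columns:
-- Illustrative sketch only (not Koza's actual implementation):
-- rewrite edge object references FROM mappings.object_id TO mappings.subject_id
UPDATE edges
SET object = m.subject_id
FROM mappings AS m
WHERE edges.object = m.object_id;
-- The subject column is rewritten the same way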
Basic Normalization
Apply a single SSSOM file to normalize identifiers in your database.
Step 1: Verify Your Database
First, check the current state of your edges:
# Count edges and check identifier patterns
duckdb graph.duckdb -c "
SELECT COUNT(*) as edge_count FROM edges;
"
# See sample identifiers
duckdb graph.duckdb -c "
SELECT DISTINCT subject FROM edges LIMIT 10;
"
Step 2: Apply the Mapping
koza normalize graph.duckdb \
--mappings mondo-omim.sssom.tsv
Example Output
Loading SSSOM mappings...
Loaded 45,678 unique mappings from 1 file(s)
Normalizing edge references...
Normalized 12,345 edge subject/object references
Normalization completed successfully
Multiple Mapping Files
When you have mappings from multiple sources, you can apply them all at once.
Using Multiple Files
koza normalize graph.duckdb \
-m mondo-omim.sssom.tsv \
-m mondo-orphanet.sssom.tsv \
-m hp-mp.sssom.tsv
Using a Mappings Directory
If all your SSSOM files are in one directory:
koza normalize graph.duckdb \
--mappings-dir ./mappings/
This loads all .sssom.tsv files from the specified directory.
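If you want to confirm which files the directory option will pick up, and roughly how much each contributes, a quick shell check is enough. This assumes your files follow the .sssom.tsv naming shown above:
# List the mapping files and count their non-comment lines
# (each count includes the column header row)
for f in ./mappings/*.sssom.tsv; do
  echo "$f: $(grep -cv '^#' "$f") lines"
done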
Order of Application
When using multiple mapping files:
- Files are processed in the order specified (or alphabetically for --mappings-dir)
- All mappings are loaded into a single mappings table before normalization
- If the same object_id appears in multiple files, the first occurrence is kept
Original Value Preservation
Normalization preserves the original identifier values so you can trace back to the source data.
How It Works
When an edge's subject or object is normalized:
- The new (normalized) identifier is written to subject or object
- The original identifier is stored in original_subject or original_object
Example
Before normalization:
| subject | object | predicate |
|---|---|---|
| HGNC:1234 | OMIM:222100 | biolink:gene_associated_with_condition |
After normalization (with OMIM to MONDO mapping):
| subject | object | predicate | original_subject | original_object |
|---|---|---|---|---|
| HGNC:1234 | MONDO:0005148 | biolink:gene_associated_with_condition | NULL | OMIM:222100 |
Note: original_subject is NULL because HGNC:1234 was not in the mappings and was not changed.
Querying Original Values
You can find all normalized edges:
SELECT subject, object, original_subject, original_object
FROM edges
WHERE original_subject IS NOT NULL
OR original_object IS NOT NULL;
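To summarize what changed at the namespace level (for example, OMIM to MONDO), you can group the normalized object references by their original and new prefixes. A possible query, using DuckDB's split_part:
-- Summarize object normalizations by CURIE prefix (e.g., OMIM -> MONDO)
SELECT
  split_part(original_object, ':', 1) AS from_prefix,
  split_part(object, ':', 1) AS to_prefix,
  COUNT(*) AS n
FROM edges
WHERE original_object IS NOT NULL
GROUP BY from_prefix, to_prefix
ORDER BY n DESC;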
Duplicate Mapping Handling
SSSOM files sometimes contain one-to-many mappings where a single object_id maps to multiple subject_id values. Koza handles this to prevent edge duplication.
The Problem
If your SSSOM file contains:
subject_id predicate_id object_id mapping_justification
MONDO:0005148 skos:exactMatch OMIM:222100 semapv:ManualMappingCuration
MONDO:0005149 skos:closeMatch OMIM:222100 semapv:LexicalMatching
Without deduplication, an edge referencing OMIM:222100 could become two edges.
How Koza Handles It
Koza keeps only one mapping per object_id:
- Mappings are ordered by source file, then by subject_id
- The first mapping for each object_id is kept
- Subsequent duplicates are discarded with a warning
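If you want to spot one-to-many mappings in a file before running normalization, a simple shell check over the TSV is enough. This assumes object_id is the third column, as in the example files above; adjust the column index if your file orders columns differently:
# Strip the YAML comment header, take the object_id column, and list values
# that appear on more than one mapping row
grep -v '^#' mondo-omim.sssom.tsv | cut -f3 | sort | uniq -d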
Warning Message
When duplicates are detected:
Loading SSSOM mappings...
Loaded 45,678 unique mappings from 2 file(s)
Found 234 duplicate mappings (one object_id mapped to multiple subject_ids).
Keeping only one mapping per object_id.
Best Practice
To control which mapping is used, ensure your preferred mappings come first:
- Order the --mappings arguments with preferred files first (see the example below)
- Or curate your SSSOM files to have one mapping per identifier
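For example, if you have both a curated and an automatically generated mapping file (hypothetical file names below), list the curated one first so its mappings win for any shared object_id:
# Hypothetical file names: the curated file listed first takes precedence
koza normalize graph.duckdb \
  -m mondo-omim-curated.sssom.tsv \
  -m mondo-omim-lexical.sssom.tsv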
Verification
After normalization, verify the results.
Check Normalization Statistics
# Count how many edges were normalized
duckdb graph.duckdb -c "
SELECT
COUNT(*) as total_edges,
COUNT(original_subject) as normalized_subjects,
COUNT(original_object) as normalized_objects
FROM edges;
"
Compare Before and After
Save identifiers before normalization for comparison:
# Before normalization - export unique identifiers
duckdb graph.duckdb -c "
SELECT DISTINCT object FROM edges WHERE object LIKE 'OMIM:%'
" > before_omim_ids.txt
# Run normalization
koza normalize graph.duckdb -m mondo-omim.sssom.tsv
# After normalization - check OMIM IDs are gone
duckdb graph.duckdb -c "
SELECT DISTINCT object FROM edges WHERE object LIKE 'OMIM:%'
" > after_omim_ids.txt
# Compare
diff before_omim_ids.txt after_omim_ids.txt
Verify Mapping Coverage
Check which identifiers were not mapped:
-- Find object identifiers that match a pattern but were not normalized
SELECT DISTINCT object
FROM edges
WHERE object LIKE 'OMIM:%'
AND original_object IS NULL
LIMIT 20;
This shows OMIM IDs that remain in the object column (not normalized), likely because they were not in your mapping file.
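To turn this into a single coverage number, you can count the distinct identifiers that were left unmapped (OMIM is just an example prefix here):
-- Count distinct OMIM identifiers that were not normalized
SELECT COUNT(DISTINCT object) AS unmapped_omim_ids
FROM edges
WHERE object LIKE 'OMIM:%'
  AND original_object IS NULL;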
Generate a Report
Use koza report to get overall statistics after normalization:
koza report qc -d graph.duckdb -o post_normalize_qc.yaml
Variations
Using Normalize in the Merge Pipeline
For new graphs, koza merge includes normalization as part of its pipeline:
koza merge \
--nodes "*.nodes.*" \
--edges "*.edges.*" \
--mappings "mappings/*.sssom.tsv" \
--output merged_graph.duckdb
This runs: join -> deduplicate -> normalize -> prune
See the merge command reference for details.
Skip Normalization in Merge
If you want to run merge without normalization:
koza merge \
--nodes "*.nodes.*" \
--edges "*.edges.*" \
--output merged_graph.duckdb \
--skip-normalize
Normalize After Other Operations
You can run normalize at any point after your database has edges:
# First, join your files
koza join --nodes "*.nodes.*" --edges "*.edges.*" --output graph.duckdb
# Later, normalize with new mappings
koza normalize graph.duckdb -m new_mappings.sssom.tsv
See Also
- CLI Reference: normalize - Full command options
- Merge Pipeline - Complete merge workflow including normalization
- SSSOM Specification - Official SSSOM documentation