Data Integrity

Overview

Koza graph operations are non-destructive. When problems are detected in graph data (duplicate entries, dangling edges, or isolated nodes), the problematic records are moved to archive tables rather than deleted. These archived records remain accessible for analysis, debugging, and recovery.

"Problem" data often contains useful information:

  • Duplicates may reveal integration issues between data sources
  • Dangling edges may indicate missing node files or ID mismatches
  • Singletons may represent valid entities that lack relationships in the current dataset

Preserving this data enables quality control analysis without irreversible data loss.

Move, Don't Delete

The core principle: move problem data to archive tables rather than deleting it.

Debugging and QC: When edges reference non-existent nodes, the archived edges show which nodes are missing and from which source files. This helps identify upstream data issues.

Recovery: If data is archived incorrectly (due to misconfiguration or upstream bugs), it can be recovered from archive tables without re-running the entire pipeline.

Audit trail: Archive tables document what was removed and when. This supports reproducibility and explains changes between pipeline runs.

No data loss: Cleaning operations preserve all original data. You can inspect what was removed and verify that the cleaning logic behaved correctly.

Archive Tables

Koza creates several archive tables during graph operations:

dangling_edges

Contains edges where the subject or object ID does not exist in the nodes table.

-- Example: View dangling edges
SELECT * FROM dangling_edges LIMIT 10;

-- Find which nodes are missing (as subjects or as objects of dangling edges)
SELECT DISTINCT e.subject as missing_node
FROM dangling_edges e
LEFT JOIN nodes n ON e.subject = n.id
WHERE n.id IS NULL
UNION
SELECT DISTINCT e.object as missing_node
FROM dangling_edges e
LEFT JOIN nodes n ON e.object = n.id
WHERE n.id IS NULL;

Dangling edges typically indicate:

  • Node files that failed to load
  • ID mismatches between node and edge files
  • Normalization that mapped to non-existent canonical IDs

duplicate_nodes

Contains all rows that had duplicate IDs in the nodes table. When multiple rows share the same id, all are copied here. Only the first occurrence (ordered by file_source) is kept in the main nodes table.

-- Example: View duplicate nodes
SELECT id, COUNT(*) as occurrence_count
FROM duplicate_nodes
GROUP BY id
ORDER BY occurrence_count DESC;

-- Compare duplicates for a specific ID
SELECT * FROM duplicate_nodes WHERE id = 'HGNC:1234';
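To see which occurrence was actually retained, compare the row that remains in the main table against its archived copies. The queries below are a sketch: they assume the archive shares the main table's columns and use a hypothetical ID.

-- Compare the retained row against its archived copies (ID is hypothetical)
SELECT id, category, file_source FROM nodes WHERE id = 'HGNC:1234';
SELECT id, category, file_source FROM duplicate_nodes WHERE id = 'HGNC:1234';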

Duplicate nodes may indicate:

  • The same entity appearing in multiple source files
  • Version conflicts between data sources
  • Intentional overlaps that need resolution

duplicate_edges

Contains all rows that had duplicate id values in the edges table. As with nodes, every duplicate row is archived and only the first occurrence is retained in the main edges table.

-- Example: View duplicate edges by source
SELECT file_source, COUNT(*) as duplicates
FROM duplicate_edges
GROUP BY file_source
ORDER BY duplicates DESC;
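Because only the first occurrence is retained, a quick sanity check is that no id appears more than once in the main edges table after deduplication:

-- Sanity check: should return zero rows after deduplication
SELECT id, COUNT(*) as copies
FROM edges
GROUP BY id
HAVING COUNT(*) > 1;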

singleton_nodes

Contains nodes that do not appear as subject or object in any edge. This table is only populated when you explicitly request singleton removal with --remove-singletons.

-- Example: Analyze singleton nodes by category
SELECT category, COUNT(*) as count
FROM singleton_nodes
GROUP BY category
ORDER BY count DESC;

Singleton nodes may represent:

  • Valid entities that simply lack relationships in your dataset
  • Nodes from incomplete data sources
  • Orphaned entries from failed edge loading
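If you suspect a node was archived as a singleton incorrectly, a quick spot check is to count its edges directly (the ID below is hypothetical):

-- Spot check: an archived singleton should have no edges referencing it
SELECT COUNT(*) as edge_count
FROM edges
WHERE subject = 'HGNC:1234' OR object = 'HGNC:1234';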

Provenance Tracking

Koza tracks the source of each record through provenance columns. This enables source-aware deduplication and QC analysis.

file_source Column

When loading files, Koza adds a file_source column containing the source identifier:

# Source name is derived from filename by default
koza join --nodes gene_nodes.tsv --edges gene_edges.tsv -d graph.duckdb

# Or specify explicitly
koza join --nodes gene_nodes.tsv:gene_source --edges gene_edges.tsv:gene_source -d graph.duckdb
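After loading, you can confirm which source labels were assigned by inspecting the column directly:

-- List the source labels present in the loaded tables
SELECT DISTINCT file_source FROM nodes;
SELECT DISTINCT file_source FROM edges;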

The file_source column provides:

  • Record-to-file traceability
  • Deterministic ordering during deduplication
  • Source-specific QC reports

provided_by Column

The provided_by column is a standard KGX provenance field. It may already exist in your source data. If present, Koza preserves it. If absent, Koza can generate it from the source name.
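If both columns are present, you can check how they relate and whether any rows lack provenance. This is a sketch; adjust the table and column names to match your data:

-- How do file_source and provided_by relate, and is anything missing provenance?
SELECT file_source, provided_by, COUNT(*) as row_count
FROM nodes
GROUP BY file_source, provided_by
ORDER BY row_count DESC;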

Ordering During Deduplication

When duplicates are found, Koza keeps the "first" occurrence based on ordering by provenance columns:

  1. file_source (preferred) - Added by Koza during loading
  2. provided_by (fallback) - Standard KGX provenance column
  3. Constant (last resort) - Arbitrary but deterministic ordering

This means you can control which version of a duplicate is kept by ordering your input files appropriately or by setting source names that sort in your preferred order.
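Conceptually, this is the same as ranking the rows within each id by the provenance column and keeping rank one. The window query below sketches that logic; it is not necessarily the exact statement Koza executes:

-- Sketch: keep only the first row per id, ordered by file_source
SELECT * EXCLUDE (rn)
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY file_source) as rn
    FROM nodes
) ranked
WHERE rn = 1;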

Original Value Preservation

When normalizing identifiers using SSSOM mappings, Koza preserves the original values. This allows tracing back to the source data.

original_subject and original_object Columns

After normalization, edges gain two new columns:

  • original_subject: The subject ID before normalization (NULL if unchanged)
  • original_object: The object ID before normalization (NULL if unchanged)

-- Example: Find edges that were normalized
SELECT subject, original_subject, object, original_object
FROM edges
WHERE original_subject IS NOT NULL
   OR original_object IS NOT NULL;

-- Compare before and after
SELECT
    original_subject as before,
    subject as after,
    COUNT(*) as edge_count
FROM edges
WHERE original_subject IS NOT NULL
GROUP BY original_subject, subject
ORDER BY edge_count DESC;

Benefits of Preserving Originals

  • Debugging: Check what IDs an edge had before normalization
  • Validation: Compare normalized IDs against expected mappings
  • Reversibility: Reconstruct the original graph if needed (see the sketch below)
  • Provenance: Full audit trail of transformations applied to each record
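For example, the reversibility point above can be expressed as a single query that falls back to the normalized value when no original was recorded (a sketch; the predicate column name follows the usual KGX convention):

-- Reconstruct pre-normalization edge endpoints
SELECT
    COALESCE(original_subject, subject) as subject,
    predicate,
    COALESCE(original_object, object) as object
FROM edges;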

Recovery via SQL

Archive tables are standard DuckDB tables. You can query them directly and recover data if needed.

Query Archive Tables

-- Connect to the database
-- duckdb graph.duckdb

-- View all archive tables
SHOW TABLES;

-- Count records in each archive
SELECT 'dangling_edges' as table_name, COUNT(*) as count FROM dangling_edges
UNION ALL
SELECT 'duplicate_nodes', COUNT(*) FROM duplicate_nodes
UNION ALL
SELECT 'duplicate_edges', COUNT(*) FROM duplicate_edges
UNION ALL
SELECT 'singleton_nodes', COUNT(*) FROM singleton_nodes;

Re-insert Recovered Data

If you determine that archived data should be restored:

-- Restore specific dangling edges (perhaps after adding missing nodes)
INSERT INTO edges
SELECT * FROM dangling_edges
WHERE file_source = 'my_source';

-- Restore singleton nodes
INSERT INTO nodes
SELECT * FROM singleton_nodes
WHERE category = 'biolink:Gene';
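Depending on your workflow, you may also want to remove the restored rows from the archive so later QC counts are not inflated:

-- Optional: drop restored rows from the archive so QC counts stay accurate
DELETE FROM dangling_edges
WHERE file_source = 'my_source';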

Analyze Patterns in Problem Data

-- Find common patterns in dangling edges
SELECT
    file_source,
    COUNT(*) as dangling_count,
    COUNT(DISTINCT subject) as unique_subjects,
    COUNT(DISTINCT object) as unique_objects
FROM dangling_edges
GROUP BY file_source
ORDER BY dangling_count DESC;

-- Identify which source files contribute most duplicates
SELECT
    file_source,
    COUNT(DISTINCT id) as duplicate_ids,
    COUNT(*) as total_duplicate_rows
FROM duplicate_nodes
GROUP BY file_source
ORDER BY duplicate_ids DESC;

Why This Matters

Non-destructive data operations support production knowledge graph workflows:

QC Analysis

Archive tables support quality control analysis:

-- Generate a QC summary
SELECT
    (SELECT COUNT(*) FROM nodes) as active_nodes,
    (SELECT COUNT(*) FROM edges) as active_edges,
    (SELECT COUNT(*) FROM dangling_edges) as dangling_edges,
    (SELECT COUNT(*) FROM duplicate_nodes) as duplicate_node_rows,
    (SELECT COUNT(*) FROM singleton_nodes) as singleton_nodes;

Debugging Data Issues

When something looks wrong in your graph, archive tables help answer:

  • Why are certain edges missing? (Check dangling_edges)
  • Why is this node's data different from what was expected? (Check duplicate_nodes)
  • Why are some entities disconnected? (Check singleton_nodes)

Compliance and Auditing

For regulated environments or reproducible science:

  • Full provenance: Every record can be traced to its source file
  • Complete audit trail: Archive tables document all removals
  • Reproducibility: Re-running with the same inputs produces the same outputs
  • Transparency: QC reports based on archive tables explain exactly what was cleaned

Safe Iteration

Non-destructive operations allow iteration on pipelines:

  • Try aggressive cleaning, inspect results, adjust parameters
  • Compare archive table contents between runs (see the sketch below)
  • Recover from mistakes without re-running upstream processing
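Because the archives are ordinary DuckDB tables, comparing runs is a matter of attaching a second database. The sketch below assumes a previous run was saved as previous_run.duckdb (a hypothetical path):

-- Compare dangling-edge counts between the current and a previous run
ATTACH 'previous_run.duckdb' AS prev (READ_ONLY);

SELECT 'current' as run, COUNT(*) as dangling FROM dangling_edges
UNION ALL
SELECT 'previous', COUNT(*) FROM prev.dangling_edges;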

This approach ensures that Koza graph operations preserve data while allowing cleaning and transformation.