How to Join Files

Goal

Combine multiple KGX files into a unified DuckDB database. The join operation automatically handles schema harmonization across files with different column structures and supports multiple input formats.

Prerequisites

Koza installed (uvx koza, uv add koza, poetry add koza, or pip install koza)
KGX files (nodes and/or edges in TSV, JSONL, or Parquet format)

Basic Join

The simplest case combines a node file and an edge file into a single database.

koza join \
  --nodes genes.tsv \
  --edges gene_interactions.tsv \
  --output gene_graph.duckdb

This creates gene_graph.duckdb containing two tables: nodes and edges.

Joining Multiple Files

You can specify multiple files for each table type by repeating the options:

koza join \
  -n genes.tsv -n proteins.tsv -n pathways.tsv \
  -e gene_protein.tsv -e protein_pathway.tsv \
  --output combined_graph.duckdb

Mixed Formats

The join operation handles files in different formats. DuckDB automatically detects formats based on file extensions.

koza join \
  -n genes.tsv -n proteins.jsonl -n pathways.parquet \
  -e interactions.tsv -e associations.jsonl.gz \
  --output mixed_format_graph.duckdb

Supported formats:

TSV: Tab-separated values (.tsv, .tsv.gz)
JSONL: JSON Lines (.jsonl, .jsonl.gz, .jsonl.bz2)
Parquet: Apache Parquet (.parquet)

Compressed files (.gz, .bz2) are automatically decompressed during processing.

Using Glob Patterns

For directories with many files following naming conventions, use wildcard patterns:

# Join all node and edge files
koza join \
  --nodes *.nodes.tsv \
  --edges *.edges.tsv \
  --output graph.duckdb

More flexible patterns:

# Match multiple extensions
koza join \
  --nodes *.nodes.* \
  --edges *.edges.* \
  --output graph.duckdb

# Match files in subdirectories
koza join \
  --nodes data/**/nodes.tsv \
  --edges data/**/edges.tsv \
  --output graph.duckdb

Schema Reporting

Use the --schema-report flag to generate a detailed YAML report analyzing the schema across all input files:

koza join \
  -n genes.tsv -n proteins.jsonl \
  -e interactions.tsv \
  --output graph.duckdb \
  --schema-report

This creates graph_schema_report.yaml containing:

File analysis: Format detection, column counts, record counts per file
Schema summary: Unique columns across all files
Column details: Data types, null percentages, example values
Schema harmonization: How differences between files were resolved

Schema harmonization handles:

Missing columns: Filled with NULL values
Extra columns: Preserved in the final schema
Type conflicts: Resolved using DuckDB's type inference
Multi-valued fields: Detected and converted to arrays where appropriate

Source Attribution

The join operation automatically tracks which file each record came from using the file_source column. This enables provenance tracking and later splitting by source. No additional flags are needed - source attribution happens automatically during the join operation.

Persistent vs In-Memory

Persistent Database (Recommended)

Use --output to create a persistent DuckDB file:

koza join \
  --nodes genes.tsv \
  --edges interactions.tsv \
  --output graph.duckdb

Use persistent databases when:

Processing large datasets
Running multiple subsequent operations (prune, split, normalize)
Sharing results with others
Preserving work for later analysis

In-Memory Database

Omit --output for an in-memory database (useful for quick analysis or piping to other operations):

koza join \
  --nodes genes.tsv \
  --edges interactions.tsv

In-Memory Limitations

In-memory databases are lost when the process exits. Always use --output for data you want to keep.

Progress Tracking

Progress tracking is enabled by default. Use --quiet to suppress all output:

koza join \
  --nodes "*.nodes.*" \
  --edges "*.edges.*" \
  --output graph.duckdb \
  --quiet

Use -p or --progress to explicitly control progress bars (enabled by default).

Verification

After joining, verify the operation succeeded using SQL queries or the report commands.

Quick Verification with DuckDB CLI

duckdb graph.duckdb "SELECT COUNT(*) AS node_count FROM nodes"
duckdb graph.duckdb "SELECT COUNT(*) AS edge_count FROM edges"

Check Schema

duckdb graph.duckdb "DESCRIBE nodes"
duckdb graph.duckdb "DESCRIBE edges"

Generate QC Report

koza report qc \
  -d graph.duckdb \
  -o qc_report.yaml

View Sample Records

duckdb graph.duckdb "SELECT * FROM nodes LIMIT 5"
duckdb graph.duckdb "SELECT * FROM edges LIMIT 5"

Verify Source Attribution

Check file source tracking:

duckdb graph.duckdb "SELECT file_source, COUNT(*) FROM nodes GROUP BY file_source"
duckdb graph.duckdb "SELECT file_source, COUNT(*) FROM edges GROUP BY file_source"

Complete Example

A typical workflow combining multiple options:

koza join \
  -n data/hgnc_genes.tsv -n data/uniprot_proteins.jsonl -n data/reactome_pathways.parquet \
  -e data/gene_protein_interactions.tsv -e data/protein_pathway_associations.jsonl \
  --output knowledge_graph.duckdb \
  --schema-report

Expected output:

Join completed successfully
  Files processed: 5 (5 successful)
  Records loaded: 125,340 nodes, 298,567 edges
  Schema harmonization: 3 missing columns filled, 2 extra preserved
  Database created: knowledge_graph.duckdb (4.2 MB)
  Total time: 12.4s

Next Steps

After joining files, you might want to:

Clean the graph to remove dangling edges
Normalize identifiers using SSSOM mappings
Split the graph by source or category
Generate reports for quality control