Graph Operations

Graph operations provide tools for building, transforming, and analyzing knowledge graphs in KGX format. Built on DuckDB, these operations cover the full lifecycle, from raw data files to a merged graph database.

Quick Start

Combine multiple KGX files into a single graph:

koza join \
  -n source1_nodes.tsv -n source2_nodes.tsv \
  -e source1_edges.tsv -e source2_edges.tsv \
  -o my_graph.duckdb

When to Use Each Operation

flowchart TD
    A[KGX Files] --> B{Multiple sources?}
    B -->|Yes| C[join]
    B -->|No| D{Need full pipeline?}
    C --> D
    D -->|Yes| E[merge]
    D -->|No| F{Need ID harmonization?}
    E --> G[Done]
    F -->|Yes| H[normalize]
    F -->|No| I{Dangling edges?}
    H --> I
    I -->|Yes| J[prune]
    I -->|No| K{Need subsets?}
    J --> K
    K -->|Yes| L[split]
    K -->|No| G
    L --> G

Operation	Use When You Need To...
join	Combine multiple KGX files into one database
merge	Run the complete pipeline: join → normalize → prune
split	Extract subsets by field value (e.g., by source)
normalize	Apply SSSOM mappings to harmonize edge identifiers
prune	Remove dangling edges and optionally singleton nodes
append	Add new data to an existing database
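To make the prune row concrete, here is a minimal pure-Python sketch of what "dangling edges" and "singleton nodes" mean. It is an illustration of the concept only, not how koza implements it (the real operation runs in DuckDB). The function name `prune_sketch`, the `remove_singletons` parameter, and the `EX:` identifiers are hypothetical; the `id`, `subject`, `predicate`, and `object` fields follow the KGX convention.

```python
def prune_sketch(nodes, edges, remove_singletons=False):
    """Illustrative only: separate kept records from records to archive.

    A dangling edge has a subject or object that is not a known node id.
    A singleton node is a node that no kept edge refers to.
    """
    node_ids = {node["id"] for node in nodes}

    kept_edges, dangling_edges = [], []
    for edge in edges:
        if edge["subject"] in node_ids and edge["object"] in node_ids:
            kept_edges.append(edge)
        else:
            # Set aside rather than discarded, mirroring the archive idea.
            dangling_edges.append(edge)

    kept_nodes, singleton_nodes = list(nodes), []
    if remove_singletons:
        connected = {e["subject"] for e in kept_edges} | {e["object"] for e in kept_edges}
        kept_nodes = [n for n in nodes if n["id"] in connected]
        singleton_nodes = [n for n in nodes if n["id"] not in connected]

    return kept_nodes, kept_edges, dangling_edges, singleton_nodes


# Hypothetical sample data: EX:3 has no edges, and EX:99 is not a known node.
sample_nodes = [{"id": "EX:1"}, {"id": "EX:2"}, {"id": "EX:3"}]
sample_edges = [
    {"subject": "EX:1", "predicate": "biolink:related_to", "object": "EX:2"},
    {"subject": "EX:1", "predicate": "biolink:related_to", "object": "EX:99"},
]

kept_nodes, kept_edges, dangling, singletons = prune_sketch(
    sample_nodes, sample_edges, remove_singletons=True
)
```

In this sample, the edge pointing at `EX:99` is classified as dangling and `EX:3` as a singleton; both are returned separately instead of being dropped, which is the same idea as moving them to archive tables.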

Documentation Sections

Tutorials

Step-by-step lessons for learning graph operations from scratch.

How-to Guides

Practical recipes for specific tasks and common workflows.

Reference

Technical documentation for CLI commands, Python API, and configuration.

Explanation

Explanations of background concepts and architectural decisions.

Key Features

Supported formats: TSV, JSONL, and Parquet files
Schema harmonization: Handles different column sets across input files
Archive behavior: Problematic records are moved to archive tables rather than deleted
Provenance tracking: Records source attribution for all records
SQL access: Graphs can be queried with DuckDB SQL
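
As a rough illustration of the schema harmonization feature listed above, the following sketch combines row sets that have different columns by taking the union of all column names and filling missing values with `None`. This is an assumption about the general idea, not a description of koza's actual implementation; the function name `harmonize_sketch` and the sample records are hypothetical.

```python
def harmonize_sketch(tables):
    """Illustrative only: merge row sets with differing columns.

    `tables` is a list of row lists, where each row is a dict.
    Returns the unified column order and the rows padded with None.
    """
    columns = []
    for rows in tables:
        for row in rows:
            for key in row:
                if key not in columns:
                    # Preserve first-seen order so output is predictable.
                    columns.append(key)

    harmonized = [
        {column: row.get(column) for column in columns}
        for rows in tables
        for row in rows
    ]
    return columns, harmonized


# Hypothetical node records from two sources with different column sets.
source1_nodes = [{"id": "EX:1", "category": "biolink:Gene", "name": "gene one"}]
source2_nodes = [{"id": "EX:2", "category": "biolink:Disease", "in_taxon": "NCBITaxon:9606"}]

columns, rows = harmonize_sketch([source1_nodes, source2_nodes])
```

Here `columns` contains every column seen in either source, and each output row carries all of them, with `None` standing in where a source did not supply a value.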