
Graph Operations

Graph operations provide tools for building, transforming, and analyzing knowledge graphs in KGX format. Built on DuckDB, these operations handle the lifecycle from raw data files to merged graphs.

Quick Start

Combine multiple KGX files into a single graph:

koza join \
  -n source1_nodes.tsv -n source2_nodes.tsv \
  -e source1_edges.tsv -e source2_edges.tsv \
  -o my_graph.duckdb

When to Use Each Operation

flowchart TD
    A[KGX Files] --> B{Multiple sources?}
    B -->|Yes| C[join]
    B -->|No| D{Need full pipeline?}
    C --> D
    D -->|Yes| E[merge]
    D -->|No| F{Need ID harmonization?}
    E --> G[Done]
    F -->|Yes| H[normalize]
    F -->|No| I{Dangling edges?}
    H --> I
    I -->|Yes| J[prune]
    I -->|No| K{Need subsets?}
    J --> K
    K -->|Yes| L[split]
    K -->|No| G
    L --> G

| Operation | Use When You Need To... |
|-----------|-------------------------|
| join | Combine multiple KGX files into one database |
| merge | Run the complete pipeline: join → normalize → prune |
| split | Extract subsets by field value (e.g., by source) |
| normalize | Apply SSSOM mappings to harmonize edge identifiers |
| prune | Remove dangling edges and optionally singleton nodes |
| append | Add new data to an existing database |
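
To make the normalize and prune rows concrete, here is a minimal conceptual sketch in plain Python. It is not Koza's implementation, which runs on DuckDB. The function names, the flat `{old_id: canonical_id}` mapping (a simplification of an SSSOM mapping set), and the sample identifiers are all assumptions made for illustration. The only KGX facts relied on are that nodes carry an `id` and edges carry `subject`, `predicate`, and `object`.

```python
from typing import Dict, List, Tuple

Row = Dict[str, str]


def normalize_edges(edges: List[Row], mappings: Dict[str, str]) -> List[Row]:
    """Rewrite edge subject/object IDs using an {old_id: canonical_id} map.

    Hypothetical helper: IDs without a mapping are left unchanged.
    """
    return [
        {
            **edge,
            "subject": mappings.get(edge["subject"], edge["subject"]),
            "object": mappings.get(edge["object"], edge["object"]),
        }
        for edge in edges
    ]


def prune(
    nodes: List[Row], edges: List[Row], remove_singletons: bool = False
) -> Tuple[List[Row], List[Row], List[Row], List[Row]]:
    """Separate dangling edges (and optionally singleton nodes) from the graph.

    Returns (kept_nodes, kept_edges, dangling_edges, singleton_nodes).
    Removed rows are returned rather than discarded, mirroring the idea of
    archiving problem data instead of deleting it.
    """
    node_ids = {node["id"] for node in nodes}

    kept_edges, dangling = [], []
    for edge in edges:
        # An edge is dangling if either endpoint is missing from the node set.
        if edge["subject"] in node_ids and edge["object"] in node_ids:
            kept_edges.append(edge)
        else:
            dangling.append(edge)

    # Nodes touched by at least one surviving edge.
    connected = {e["subject"] for e in kept_edges} | {e["object"] for e in kept_edges}

    kept_nodes, singletons = [], []
    for node in nodes:
        if remove_singletons and node["id"] not in connected:
            singletons.append(node)
        else:
            kept_nodes.append(node)

    return kept_nodes, kept_edges, dangling, singletons


# Made-up sample data for illustration only.
nodes = [{"id": "GENE:1"}, {"id": "DISEASE:1"}, {"id": "PHENO:1"}]
edges = [
    {"subject": "GENE:1", "predicate": "biolink:related_to", "object": "OLD:1"},
    {"subject": "GENE:1", "predicate": "biolink:related_to", "object": "MISSING:9"},
]
mappings = {"OLD:1": "DISEASE:1"}

normalized = normalize_edges(edges, mappings)
kept_nodes, kept_edges, dangling, singletons = prune(
    nodes, normalized, remove_singletons=True
)
```

In this sample, normalizing first lets the `OLD:1` edge resolve to an existing node, so only the `MISSING:9` edge ends up dangling. That ordering is why the merge pipeline runs normalize before prune.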

Documentation Sections

Tutorials

Step-by-step lessons for learning graph operations from scratch.

How-to Guides

Practical recipes for specific tasks and common workflows.

Reference

Technical documentation for CLI commands, Python API, and configuration.

Explanation

Background concepts and architectural decisions explained.

Key Features

  • Supported formats: TSV, JSONL, and Parquet files
  • Schema harmonization: Handles different column sets across input files
  • Archive behavior: Problem data is moved to archive tables, not deleted
  • Provenance tracking: Records source attribution for all records
  • SQL access: Graphs can be queried with DuckDB SQL
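
The schema harmonization and provenance features above can be pictured with a small plain-Python sketch. It only illustrates the idea, assuming each input file has already been read into a list of row dictionaries. The `harmonize` function and the `source_file` column name are hypothetical and are not taken from Koza's API or schema.

```python
from typing import Dict, List, Optional, Tuple

Row = Dict[str, Optional[str]]


def harmonize(sources: Dict[str, List[Row]]) -> Tuple[List[str], List[Row]]:
    """Combine rows from several sources that have different column sets.

    The output schema is the union of all columns, in first-seen order.
    Columns missing from a given row are filled with None, and each row is
    tagged with the name of the source it came from.
    """
    # Collect the union of column names across every source.
    columns: List[str] = []
    for rows in sources.values():
        for row in rows:
            for col in row:
                if col not in columns:
                    columns.append(col)

    combined: List[Row] = []
    for source_name, rows in sources.items():
        for row in rows:
            full: Row = {col: row.get(col) for col in columns}
            full["source_file"] = source_name  # hypothetical provenance column
            combined.append(full)

    return columns + ["source_file"], combined


# Made-up sample rows standing in for two KGX node files.
sources = {
    "source1_nodes.tsv": [{"id": "A:1", "name": "alpha"}],
    "source2_nodes.tsv": [{"id": "B:2", "category": "biolink:Gene"}],
}

columns, rows = harmonize(sources)
```

Here `columns` is the union `id`, `name`, `category` plus the provenance column, and each row carries `None` for any column its source file did not provide.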