
koza

Usage:

$ koza [OPTIONS] COMMAND [ARGS]...

Options:

  • --version
  • --install-completion: Install completion for the current shell.
  • --show-completion: Show completion for the current shell, to copy it or customize the installation.
  • --help: Show this message and exit.

Commands:

  • transform: Transform a source file
  • join: Join multiple KGX files into a unified DuckDB database
  • split: Split a KGX file by specified fields with format conversion support
  • prune: Prune graph by removing dangling edges and handling singleton nodes
  • append: Append new KGX files to an existing graph database
  • normalize: Apply SSSOM mappings to normalize edge subject/object references
  • merge: Complete merge pipeline: join → normalize → prune
  • report: Generate comprehensive reports for KGX graph databases
  • node-report: Generate tabular node report with GROUP BY ALL categorical columns
  • edge-report: Generate tabular edge report with denormalized node info
  • node-examples: Generate sample rows per node type.
  • edge-examples: Generate sample rows per edge type.

koza transform

Transform a source file

Usage:

$ koza transform [OPTIONS] CONFIGURATION_YAML

Arguments:

  • CONFIGURATION_YAML: Configuration YAML file [required]

Options:

  • -i, --input-file TEXT: Override input files
  • -o, --output-dir TEXT: Path to output directory [default: ./output]
  • -f, --output-format [tsv|jsonl|kgx|passthrough]: Output format [default: tsv]
  • -n, --limit INTEGER: Number of rows to process (if skipped, processes entire source file) [default: 0]
  • -p, --progress: Display progress of transform
  • -q, --quiet: Disable log output
  • --help: Show this message and exit.
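A typical invocation passes the ingest's configuration YAML and, optionally, the output settings documented above. A minimal sketch (the file name my_ingest.yaml is a placeholder, not a file shipped with koza):

# Transform a source, writing JSONL output to ./output
koza transform my_ingest.yaml -f jsonl

# Process only the first 1000 rows and display progress (useful for testing)
koza transform my_ingest.yaml -n 1000 -p -o ./test_output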

koza join

Join multiple KGX files into a unified DuckDB database

Examples:

# Auto-discover files in directory
koza join --input-dir tmp/ -o graph.duckdb

# Use glob patterns
koza join -n "tmp/*_nodes.tsv" -e "tmp/*_edges.tsv" -o graph.duckdb

# Mix directory discovery with additional files
koza join --input-dir tmp/ -n extra_nodes.tsv -o graph.duckdb

# Multiple individual files
koza join -n file1.tsv -n file2.tsv -e edges.tsv -o graph.duckdb

Usage:

$ koza join [OPTIONS]

Options:

  • -n, --nodes TEXT: Node files or glob patterns (can specify multiple)
  • -e, --edges TEXT: Edge files or glob patterns (can specify multiple)
  • -d, --input-dir TEXT: Directory to auto-discover KGX files
  • -o, --output TEXT: Path to output database file (default: in-memory)
  • -f, --format [tsv|jsonl|parquet]: Output format for any exported files [default: tsv]
  • --schema-report: Generate schema compliance report
  • -q, --quiet: Suppress output
  • -p, --progress: Show progress bars [default: True]
  • --help: Show this message and exit.
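The output of a join is an ordinary DuckDB database, so it can be inspected directly with the DuckDB CLI. A minimal sketch, assuming the joined data lands in tables named nodes and edges (verify the actual table names in your database):

# Count joined records (illustrative table names)
duckdb graph.duckdb -c "SELECT count(*) FROM nodes"
duckdb graph.duckdb -c "SELECT count(*) FROM edges"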

koza split

Split a KGX file by specified fields with format conversion support

Usage:

$ koza split [OPTIONS] FILE FIELDS

Arguments:

  • FILE: Path to the KGX file to split [required]
  • FIELDS: Comma-separated list of fields to split on [required]

Options:

  • -o, --output-dir TEXT: Output directory for split files [default: ./output]
  • -f, --format [tsv|jsonl|parquet]: Output format (default: preserve input format)
  • --remove-prefixes: Remove prefixes from values in filenames
  • -q, --quiet: Suppress output
  • -p, --progress: Show progress bars [default: True]
  • --help: Show this message and exit.

koza prune

Prune graph by removing dangling edges and handling singleton nodes
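Here, a dangling edge is an edge whose subject or object identifier has no matching node, and a singleton node is a node that participates in no edge. A conceptual sketch of those checks as DuckDB queries (table and column names follow the usual KGX layout and are illustrative, not koza's internal implementation):

# Dangling edges: subject or object missing from the node table
duckdb graph.duckdb -c "SELECT count(*) FROM edges e LEFT JOIN nodes ns ON e.subject = ns.id LEFT JOIN nodes nobj ON e.object = nobj.id WHERE ns.id IS NULL OR nobj.id IS NULL"

# Singleton nodes: never referenced as subject or object of any edge
duckdb graph.duckdb -c "SELECT count(*) FROM nodes n WHERE NOT EXISTS (SELECT 1 FROM edges e WHERE e.subject = n.id OR e.object = n.id)"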

Examples:

# Keep singleton nodes, move dangling edges
koza prune graph.duckdb

# Remove singleton nodes to separate table
koza prune graph.duckdb --remove-singletons

# Experimental: filter small components
koza prune graph.duckdb --min-component-size 10

Usage:

$ koza prune [OPTIONS] DATABASE

Arguments:

  • DATABASE: Path to the DuckDB database file to prune [required]

Options:

  • --keep-singletons: Keep singleton nodes in main table
  • --remove-singletons: Move singleton nodes to separate table
  • --min-component-size INTEGER: Minimum connected component size (experimental)
  • -q, --quiet: Suppress output
  • -p, --progress: Show progress bars [default: True]
  • --help: Show this message and exit.

koza append

Append new KGX files to an existing graph database

Examples:

# Append specific files to existing database
koza append graph.duckdb -n new_nodes.tsv -e new_edges.tsv

# Auto-discover files in directory and append
koza append graph.duckdb --input-dir new_data/

# Append with deduplication and schema reporting
koza append graph.duckdb -n "*.tsv" --deduplicate --schema-report

Usage:

$ koza append [OPTIONS] DATABASE

Arguments:

  • DATABASE: Path to existing DuckDB database file [required]

Options:

  • -n, --nodes TEXT: Node files or glob patterns (can specify multiple)
  • -e, --edges TEXT: Edge files or glob patterns (can specify multiple)
  • -d, --input-dir TEXT: Directory to auto-discover KGX files
  • --deduplicate: Remove duplicates during append
  • --schema-report: Generate schema compliance report
  • -q, --quiet: Suppress output
  • -p, --progress: Show progress bars [default: True]
  • --help: Show this message and exit.

koza normalize

Apply SSSOM mappings to normalize edge subject/object references

This operation loads SSSOM mapping files and applies them to rewrite edge subject and object identifiers to their canonical/equivalent forms. Node identifiers themselves are not changed - only edge references are normalized.
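SSSOM mapping files are plain TSV with standard columns; edge subject and object values that match a mapped identifier are rewritten to the canonical/equivalent identifier given by the mapping. A minimal illustrative mapping file (core columns only, tab-separated; the BRCA1 row is just an example of an exact cross-namespace match):

subject_id	predicate_id	object_id	mapping_justification
HGNC:1100	skos:exactMatch	NCBIGene:672	semapv:ManualMappingCuration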

Examples:

# Apply specific mapping files
koza normalize graph.duckdb -m gene_mappings.sssom.tsv -m mondo.sssom.tsv

# Auto-discover SSSOM files in directory
koza normalize graph.duckdb --mappings-dir ./sssom/

# Apply mappings with glob pattern
koza normalize graph.duckdb -m "*.sssom.tsv"

Usage:

$ koza normalize [OPTIONS] DATABASE

Arguments:

  • DATABASE: Path to existing DuckDB database file [required]

Options:

  • -m, --mappings TEXT: SSSOM mapping files or glob patterns (can specify multiple)
  • -d, --mappings-dir TEXT: Directory containing SSSOM mapping files
  • -q, --quiet: Suppress output
  • -p, --progress: Show progress bars [default: True]
  • --help: Show this message and exit.

koza merge

Complete merge pipeline: join → normalize → prune

This composite operation orchestrates the full graph processing pipeline:

  1. Join: Load and combine multiple KGX files into a unified database
  2. Normalize: Apply SSSOM mappings to edge subject/object references
  3. Prune: Remove dangling edges and handle singleton nodes

The pipeline can be customized by skipping steps or configuring options.
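The composite command is roughly equivalent to running the individual commands in sequence; a sketch using only the flags documented for each step (paths are placeholders):

# Roughly: join, then normalize, then prune
koza join --input-dir ./data/ -o graph.duckdb
koza normalize graph.duckdb --mappings-dir ./sssom/
koza prune graph.duckdb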

Examples:

# Full pipeline with auto-discovery
koza merge --input-dir ./data/ --mappings-dir ./sssom/ -o clean_graph.duckdb

# Specific files with export
koza merge -n nodes.tsv -e edges.tsv -m mappings.sssom.tsv --export --export-dir ./output/

# Skip normalization, only join and prune
koza merge -n "*.tsv" -e "*.tsv" --skip-normalize -o graph.duckdb

# Custom singleton handling
koza merge --input-dir ./data/ -m "*.sssom.tsv" --remove-singletons

Usage:

$ koza merge [OPTIONS]

Options:

  • -n, --nodes TEXT: Node files or glob patterns (can specify multiple)
  • -e, --edges TEXT: Edge files or glob patterns (can specify multiple)
  • -m, --mappings TEXT: SSSOM mapping files or glob patterns (can specify multiple)
  • -d, --input-dir TEXT: Directory to auto-discover KGX files
  • --mappings-dir TEXT: Directory containing SSSOM mapping files
  • -o, --output TEXT: Path to output database file (default: temporary)
  • --export: Export final clean data to files
  • --export-dir TEXT: Directory for exported files (required if --export)
  • -f, --format [tsv|jsonl|parquet]: Output format for exported files [default: tsv]
  • --archive: Export as archive (tar) instead of loose files
  • --compress: Compress archive as tar.gz (requires --archive)
  • --graph-name TEXT: Name for graph files in archive (default: merged_graph)
  • --skip-normalize: Skip normalization step
  • --skip-prune: Skip pruning step
  • --keep-singletons: Keep singleton nodes (default) [default: True]
  • --remove-singletons: Move singleton nodes to separate table
  • -q, --quiet: Suppress output
  • -p, --progress: Show progress bars [default: True]
  • --help: Show this message and exit.
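The export options above can be combined to produce a single compressed archive of the merged graph; for example (paths and the graph name are placeholders):

# Merge, then export the clean graph as a gzipped tar archive
koza merge --input-dir ./data/ --mappings-dir ./sssom/ -o graph.duckdb \
    --export --export-dir ./output/ --archive --compress --graph-name my_graph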

koza report

Generate comprehensive reports for KGX graph databases.

Available report types:

• qc: Quality control analysis by data source

• graph-stats: Comprehensive graph statistics (similar to merged_graph_stats.yaml)

• schema: Database schema analysis and biolink compliance

Examples:

# Generate QC report
koza report qc -d merged.duckdb -o qc_report.yaml

# Generate graph statistics
koza report graph-stats -d merged.duckdb -o graph_stats.yaml

# Generate schema report
koza report schema -d merged.duckdb -o schema_report.yaml

# Quick QC analysis (console output only)
koza report qc -d merged.duckdb

Usage:

$ koza report [OPTIONS] REPORT_TYPE

Arguments:

  • REPORT_TYPE: Type of report to generate: qc, graph-stats, or schema [required]

Options:

  • -d, --database TEXT: Path to DuckDB database file [required]
  • -o, --output TEXT: Path to output report file (YAML format)
  • -q, --quiet: Suppress progress output
  • --help: Show this message and exit.

koza node-report

Generate tabular node report with GROUP BY ALL categorical columns.

Outputs count of nodes grouped by categorical columns (namespace, category, etc.).
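Conceptually the report is a GROUP BY ALL aggregation over the node table. A rough sketch of the equivalent DuckDB query (table and column names are illustrative; the actual report may derive extra columns such as namespace from the node id):

# Roughly what the node report computes (illustrative)
duckdb merged.duckdb -c "SELECT category, provided_by, count(*) AS count FROM nodes GROUP BY ALL ORDER BY count DESC"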

Examples:

# From database
koza node-report -d merged.duckdb -o node_report.tsv

# From file
koza node-report -f nodes.tsv -o node_report.parquet --format parquet

# Custom columns
koza node-report -d merged.duckdb -o report.tsv -c namespace -c category -c provided_by

Usage:

$ koza node-report [OPTIONS]

Options:

  • -d, --database TEXT: Path to DuckDB database file
  • -f, --file TEXT: Path to node file (TSV, JSONL, or Parquet)
  • -o, --output TEXT: Path to output report file
  • --format [tsv|parquet|jsonl]: Output format [default: tsv]
  • -c, --column TEXT: Categorical columns to group by (can specify multiple)
  • -q, --quiet: Suppress progress output
  • --help: Show this message and exit.

koza edge-report

Generate tabular edge report with denormalized node info.

Joins edges to nodes to get subject_category, object_category, etc., then outputs count of edges grouped by categorical columns.
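Conceptually this is an edge-to-node join on subject and object, followed by a grouped count. A rough sketch of the equivalent DuckDB query (illustrative table and column names, not koza's internal SQL):

# Roughly what the edge report computes (illustrative)
duckdb merged.duckdb -c "SELECT ns.category AS subject_category, e.predicate, nobj.category AS object_category, count(*) AS count FROM edges e JOIN nodes ns ON e.subject = ns.id JOIN nodes nobj ON e.object = nobj.id GROUP BY ALL ORDER BY count DESC"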

Examples:

# From database
koza edge-report -d merged.duckdb -o edge_report.tsv

# From files
koza edge-report -n nodes.tsv -e edges.tsv -o edge_report.parquet --format parquet

# Custom columns
koza edge-report -d merged.duckdb -o report.tsv \
    -c subject_category -c predicate -c object_category -c primary_knowledge_source

Usage:

$ koza edge-report [OPTIONS]

Options:

  • -d, --database TEXT: Path to DuckDB database file
  • -n, --nodes TEXT: Path to node file (for denormalization)
  • -e, --edges TEXT: Path to edge file (TSV, JSONL, or Parquet)
  • -o, --output TEXT: Path to output report file
  • --format [tsv|parquet|jsonl]: Output format [default: tsv]
  • -c, --column TEXT: Categorical columns to group by (can specify multiple)
  • -q, --quiet: Suppress progress output
  • --help: Show this message and exit.

koza node-examples

Generate sample rows per node type.

Samples N example rows for each distinct value in the type column (default: category).
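Conceptually this is a per-group row sample, which can be expressed with a window function. A rough DuckDB sketch of the idea (illustrative table and column names, not koza's internal query):

# Up to 5 example rows per category (illustrative)
duckdb merged.duckdb -c "SELECT * FROM (SELECT *, row_number() OVER (PARTITION BY category ORDER BY id) AS rn FROM nodes) AS t WHERE rn <= 5"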

Examples:

# From database (5 examples per category)
koza node-examples -d merged.duckdb -o node_examples.tsv

# From file with 10 examples per type
koza node-examples -f nodes.tsv -o examples.tsv -n 10

# Group by different column
koza node-examples -d merged.duckdb -o examples.tsv -t provided_by

Usage:

$ koza node-examples [OPTIONS]

Options:

  • -d, --database TEXT: Path to DuckDB database file
  • -f, --file TEXT: Path to node file (TSV, JSONL, or Parquet)
  • -o, --output TEXT: Path to output examples file
  • --format [tsv|parquet|jsonl]: Output format [default: tsv]
  • -n, --sample-size INTEGER: Number of examples per type [default: 5]
  • -t, --type-column TEXT: Column to partition examples by [default: category]
  • -q, --quiet: Suppress progress output
  • --help: Show this message and exit.

koza edge-examples

Generate sample rows per edge type.

Samples N example rows for each distinct combination of type columns (default: subject_category, predicate, object_category).

Examples:

# From database (5 examples per edge type)
koza edge-examples -d merged.duckdb -o edge_examples.tsv

# From files with 10 examples
koza edge-examples -n nodes.tsv -e edges.tsv -o examples.tsv -s 10

# Custom type columns
koza edge-examples -d merged.duckdb -o examples.tsv -t predicate -t primary_knowledge_source

Usage:

$ koza edge-examples [OPTIONS]

Options:

  • -d, --database TEXT: Path to DuckDB database file
  • -n, --nodes TEXT: Path to node file (for denormalization)
  • -e, --edges TEXT: Path to edge file (TSV, JSONL, or Parquet)
  • -o, --output TEXT: Path to output examples file
  • --format [tsv|parquet|jsonl]: Output format [default: tsv]
  • -s, --sample-size INTEGER: Number of examples per type [default: 5]
  • -t, --type-column TEXT: Columns to partition examples by (can specify multiple)
  • -q, --quiet: Suppress progress output
  • --help: Show this message and exit.