koza
Usage:
$ koza [OPTIONS] COMMAND [ARGS]...
Options:
--version
--install-completion: Install completion for the current shell.
--show-completion: Show completion for the current shell, to copy it or customize the installation.
--help: Show this message and exit.
Commands:
transform: Transform a source file
join: Join multiple KGX files into a unified DuckDB database
split: Split a KGX file by specified fields with format conversion support
prune: Prune graph by removing dangling edges and handling singleton nodes
append: Append new KGX files to an existing graph database
normalize: Apply SSSOM mappings to normalize edge subject/object references
merge: Complete merge pipeline: join → normalize → prune
report: Generate comprehensive reports for KGX graph databases
node-report: Generate tabular node report with GROUP BY ALL categorical columns
edge-report: Generate tabular edge report with denormalized node info
node-examples: Generate sample rows per node type.
edge-examples: Generate sample rows per edge type.
koza transform
Transform a source file
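Example invocations (illustrative sketches; my_ingest.yaml and the row limit are placeholder values, using only the options documented below):
# Transform with defaults (TSV output written to ./output)
koza transform my_ingest.yaml
# Write JSON Lines to a custom directory and stop after the first 1000 rows
koza transform my_ingest.yaml -o ./out -f jsonl -n 1000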
Usage:
$ koza transform [OPTIONS] CONFIGURATION_YAML
Arguments:
CONFIGURATION_YAML: Configuration YAML file [required]
Options:
-i, --input-file TEXT: Override input files
-o, --output-dir TEXT: Path to output directory [default: ./output]
-f, --output-format [tsv|jsonl|kgx|passthrough]: Output format [default: tsv]
-n, --limit INTEGER: Number of rows to process (if skipped, processes entire source file) [default: 0]
-p, --progress: Display progress of transform
-q, --quiet: Disable log output
--help: Show this message and exit.
koza join
Join multiple KGX files into a unified DuckDB database
Examples:
# Auto-discover files in directory
koza join --input-dir tmp/ -o graph.duckdb
# Use glob patterns
koza join -n "tmp/*_nodes.tsv" -e "tmp/*_edges.tsv" -o graph.duckdb
# Mix directory discovery with additional files
koza join --input-dir tmp/ -n extra_nodes.tsv -o graph.duckdb
# Multiple individual files
koza join -n file1.tsv -n file2.tsv -e edges.tsv -o graph.duckdb
Usage:
$ koza join [OPTIONS]
Options:
-n, --nodes TEXT: Node files or glob patterns (can specify multiple)
-e, --edges TEXT: Edge files or glob patterns (can specify multiple)
-d, --input-dir TEXT: Directory to auto-discover KGX files
-o, --output TEXT: Path to output database file (default: in-memory)
-f, --format [tsv|jsonl|parquet]: Output format for any exported files [default: tsv]
--schema-report: Generate schema compliance report
-q, --quiet: Suppress output
-p, --progress: Show progress bars [default: True]
--help: Show this message and exit.
koza split
Split a KGX file by specified fields with format conversion support
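Example invocations (illustrative; file names and field choices are placeholders, using only the options documented below):
# Split an edge file by primary knowledge source, preserving the input format
koza split edges.tsv primary_knowledge_source
# Split a node file by two fields, converting to Parquet and removing prefixes from the filenames
koza split nodes.tsv category,provided_by -o ./split -f parquet --remove-prefixes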
Usage:
$ koza split [OPTIONS] FILE FIELDS
Arguments:
FILE: Path to the KGX file to split [required]
FIELDS: Comma-separated list of fields to split on [required]
Options:
-o, --output-dir TEXT: Output directory for split files [default: ./output]
-f, --format [tsv|jsonl|parquet]: Output format (default: preserve input format)
--remove-prefixes: Remove prefixes from values in filenames
-q, --quiet: Suppress output
-p, --progress: Show progress bars [default: True]
--help: Show this message and exit.
koza prune
Prune graph by removing dangling edges and handling singleton nodes
Examples:
# Keep singleton nodes, move dangling edges
koza prune graph.duckdb
# Remove singleton nodes to separate table
koza prune graph.duckdb --remove-singletons
# Experimental: filter small components
koza prune graph.duckdb --min-component-size 10
Usage:
$ koza prune [OPTIONS] DATABASE
Arguments:
DATABASE: Path to the DuckDB database file to prune [required]
Options:
--keep-singletons: Keep singleton nodes in main table
--remove-singletons: Move singleton nodes to separate table
--min-component-size INTEGER: Minimum connected component size (experimental)
-q, --quiet: Suppress output
-p, --progress: Show progress bars [default: True]
--help: Show this message and exit.
koza append
Append new KGX files to an existing graph database
Examples:
# Append specific files to existing database
koza append graph.duckdb -n new_nodes.tsv -e new_edges.tsv
# Auto-discover files in directory and append
koza append graph.duckdb --input-dir new_data/
# Append with deduplication and schema reporting
koza append graph.duckdb -n "*.tsv" --deduplicate --schema-report
Usage:
$ koza append [OPTIONS] DATABASE
Arguments:
DATABASE: Path to existing DuckDB database file [required]
Options:
-n, --nodes TEXT: Node files or glob patterns (can specify multiple)
-e, --edges TEXT: Edge files or glob patterns (can specify multiple)
-d, --input-dir TEXT: Directory to auto-discover KGX files
--deduplicate: Remove duplicates during append
--schema-report: Generate schema compliance report
-q, --quiet: Suppress output
-p, --progress: Show progress bars [default: True]
--help: Show this message and exit.
koza normalize
Apply SSSOM mappings to normalize edge subject/object references
This operation loads SSSOM mapping files and applies them to rewrite edge subject and object identifiers to their canonical/equivalent forms. Node identifiers themselves are not changed - only edge references are normalized.
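For orientation, an SSSOM mapping file is a TSV whose core columns are subject_id, predicate_id, object_id, and mapping_justification. The row below is purely illustrative (the identifiers are examples, and which side of a mapping koza treats as the canonical form is not spelled out here):
subject_id   predicate_id      object_id      mapping_justification
HGNC:1100    skos:exactMatch   NCBIGene:672   semapv:ManualMappingCuration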
Examples:
# Apply specific mapping files
koza normalize graph.duckdb -m gene_mappings.sssom.tsv -m mondo.sssom.tsv
# Auto-discover SSSOM files in directory
koza normalize graph.duckdb --mappings-dir ./sssom/
# Apply mappings with glob pattern
koza normalize graph.duckdb -m "*.sssom.tsv"
Usage:
$ koza normalize [OPTIONS] DATABASE
Arguments:
DATABASE: Path to existing DuckDB database file [required]
Options:
-m, --mappings TEXT: SSSOM mapping files or glob patterns (can specify multiple)
-d, --mappings-dir TEXT: Directory containing SSSOM mapping files
-q, --quiet: Suppress output
-p, --progress: Show progress bars [default: True]
--help: Show this message and exit.
koza merge
Complete merge pipeline: join → normalize → prune
This composite operation orchestrates the full graph processing pipeline:
1. Join: Load and combine multiple KGX files into a unified database
2. Normalize: Apply SSSOM mappings to edge subject/object references
3. Prune: Remove dangling edges and handle singleton nodes
The pipeline can be customized by skipping steps or configuring options.
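Conceptually, the composite command corresponds to running the individual commands in sequence (a sketch; file and database names are placeholders):
# 1. Join source files into a single database
koza join --input-dir ./data/ -o graph.duckdb
# 2. Normalize edge subject/object references using SSSOM mappings
koza normalize graph.duckdb --mappings-dir ./sssom/
# 3. Prune dangling edges and handle singleton nodes
koza prune graph.duckdb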
Examples:
# Full pipeline with auto-discovery
koza merge --input-dir ./data/ --mappings-dir ./sssom/ -o clean_graph.duckdb
# Specific files with export
koza merge -n nodes.tsv -e edges.tsv -m mappings.sssom.tsv --export --export-dir ./output/
# Skip normalization, only join and prune
koza merge -n "*.tsv" -e "*.tsv" --skip-normalize -o graph.duckdb
# Custom singleton handling
koza merge --input-dir ./data/ -m "*.sssom.tsv" --remove-singletons
Usage:
$ koza merge [OPTIONS]
Options:
-n, --nodes TEXT: Node files or glob patterns (can specify multiple)
-e, --edges TEXT: Edge files or glob patterns (can specify multiple)
-m, --mappings TEXT: SSSOM mapping files or glob patterns (can specify multiple)
-d, --input-dir TEXT: Directory to auto-discover KGX files
--mappings-dir TEXT: Directory containing SSSOM mapping files
-o, --output TEXT: Path to output database file (default: temporary)
--export: Export final clean data to files
--export-dir TEXT: Directory for exported files (required if --export)
-f, --format [tsv|jsonl|parquet]: Output format for exported files [default: tsv]
--archive: Export as archive (tar) instead of loose files
--compress: Compress archive as tar.gz (requires --archive)
--graph-name TEXT: Name for graph files in archive (default: merged_graph)
--skip-normalize: Skip normalization step
--skip-prune: Skip pruning step
--keep-singletons: Keep singleton nodes (default) [default: True]
--remove-singletons: Move singleton nodes to separate table
-q, --quiet: Suppress output
-p, --progress: Show progress bars [default: True]
--help: Show this message and exit.
koza report
Generate comprehensive reports for KGX graph databases.
Available report types:
• qc: Quality control analysis by data source
• graph-stats: Comprehensive graph statistics (similar to merged_graph_stats.yaml)
• schema: Database schema analysis and biolink compliance
Examples:
# Generate QC report
koza report qc -d merged.duckdb -o qc_report.yaml
# Generate graph statistics
koza report graph-stats -d merged.duckdb -o graph_stats.yaml
# Generate schema report
koza report schema -d merged.duckdb -o schema_report.yaml
# Quick QC analysis (console output only)
koza report qc -d merged.duckdb
Usage:
$ koza report [OPTIONS] REPORT_TYPE
Arguments:
REPORT_TYPE: Type of report to generate: qc, graph-stats, or schema [required]
Options:
-d, --database TEXT: Path to DuckDB database file [required]
-o, --output TEXT: Path to output report file (YAML format)
-q, --quiet: Suppress progress output
--help: Show this message and exit.
koza node-report
Generate tabular node report with GROUP BY ALL categorical columns.
Outputs count of nodes grouped by categorical columns (namespace, category, etc.).
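The output is one row per distinct combination of the grouping columns plus a count, along these lines (column naming and values here are illustrative, not actual output):
namespace   category          count
HGNC        biolink:Gene      43210
MONDO       biolink:Disease   25000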
Examples:
# From database
koza node-report -d merged.duckdb -o node_report.tsv
# From file
koza node-report -f nodes.tsv -o node_report.parquet --format parquet
# Custom columns
koza node-report -d merged.duckdb -o report.tsv -c namespace -c category -c provided_by
Usage:
$ koza node-report [OPTIONS]
Options:
-d, --database TEXT: Path to DuckDB database file
-f, --file TEXT: Path to node file (TSV, JSONL, or Parquet)
-o, --output TEXT: Path to output report file
--format [tsv|parquet|jsonl]: Output format [default: tsv]
-c, --column TEXT: Categorical columns to group by (can specify multiple)
-q, --quiet: Suppress progress output
--help: Show this message and exit.
koza edge-report
Generate tabular edge report with denormalized node info.
Joins edges to nodes to get subject_category, object_category, etc., then outputs count of edges grouped by categorical columns.
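The output follows the same pattern as the node report, with denormalized columns such as subject_category and object_category in the grouping (illustrative sketch, not actual output):
subject_category   predicate                object_category   count
biolink:Gene       biolink:interacts_with   biolink:Gene      120000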
Examples:
# From database
koza edge-report -d merged.duckdb -o edge_report.tsv
# From files
koza edge-report -n nodes.tsv -e edges.tsv -o edge_report.parquet --format parquet
# Custom columns
koza edge-report -d merged.duckdb -o report.tsv \
-c subject_category -c predicate -c object_category -c primary_knowledge_source
Usage:
$ koza edge-report [OPTIONS]
Options:
-d, --database TEXT: Path to DuckDB database file
-n, --nodes TEXT: Path to node file (for denormalization)
-e, --edges TEXT: Path to edge file (TSV, JSONL, or Parquet)
-o, --output TEXT: Path to output report file
--format [tsv|parquet|jsonl]: Output format [default: tsv]
-c, --column TEXT: Categorical columns to group by (can specify multiple)
-q, --quiet: Suppress progress output
--help: Show this message and exit.
koza node-examples
Generate sample rows per node type.
Samples N example rows for each distinct value in the type column (default: category).
Examples:
# From database (5 examples per category)
koza node-examples -d merged.duckdb -o node_examples.tsv
# From file with 10 examples per type
koza node-examples -f nodes.tsv -o examples.tsv -n 10
# Group by different column
koza node-examples -d merged.duckdb -o examples.tsv -t provided_by
Usage:
$ koza node-examples [OPTIONS]
Options:
-d, --database TEXT: Path to DuckDB database file
-f, --file TEXT: Path to node file (TSV, JSONL, or Parquet)
-o, --output TEXT: Path to output examples file
--format [tsv|parquet|jsonl]: Output format [default: tsv]
-n, --sample-size INTEGER: Number of examples per type [default: 5]
-t, --type-column TEXT: Column to partition examples by [default: category]
-q, --quiet: Suppress progress output
--help: Show this message and exit.
koza edge-examples
Generate sample rows per edge type.
Samples N example rows for each distinct combination of type columns (default: subject_category, predicate, object_category).
Examples:
# From database (5 examples per edge type)
koza edge-examples -d merged.duckdb -o edge_examples.tsv
# From files with 10 examples
koza edge-examples -n nodes.tsv -e edges.tsv -o examples.tsv -s 10
# Custom type columns
koza edge-examples -d merged.duckdb -o examples.tsv -t predicate -t primary_knowledge_source
Usage:
$ koza edge-examples [OPTIONS]
Options:
-d, --database TEXT: Path to DuckDB database file
-n, --nodes TEXT: Path to node file (for denormalization)
-e, --edges TEXT: Path to edge file (TSV, JSONL, or Parquet)
-o, --output TEXT: Path to output examples file
--format [tsv|parquet|jsonl]: Output format [default: tsv]
-s, --sample-size INTEGER: Number of examples per type [default: 5]
-t, --type-column TEXT: Columns to partition examples by (can specify multiple)
-q, --quiet: Suppress progress output
--help: Show this message and exit.