CLI Reference
Complete documentation for all graph operation CLI commands.
koza join
Combine multiple KGX files into a unified DuckDB database with automatic schema harmonization.
Synopsis
koza join [OPTIONS]
Description
The join command loads multiple KGX files (TSV, JSONL, or Parquet) into a single DuckDB database. It automatically handles schema differences between files, filling missing columns with NULL values and preserving extra columns. Supports glob patterns for file discovery and automatic file detection from directories.
Options
| Option | Short | Type | Default | Description |
|---|---|---|---|---|
| --nodes | -n | List[str] | None | Node files or glob patterns (can specify multiple) |
| --edges | -e | List[str] | None | Edge files or glob patterns (can specify multiple) |
| --input-dir | -d | Path | None | Directory to auto-discover KGX files |
| --output | -o | str | None | Path to output database file (default: in-memory) |
| --format | -f | KGXFormat | tsv | Output format for any exported files |
| --schema-report | | bool | False | Generate schema compliance report |
| --progress | -p | bool | True | Show progress bars |
| --quiet | -q | bool | False | Suppress output |
Examples
# Auto-discover files in directory
koza join --input-dir ./data/ -o graph.duckdb
# Use glob patterns for node and edge files
koza join -n "data/*_nodes.tsv" -e "data/*_edges.tsv" -o graph.duckdb
# Mix directory discovery with additional files
koza join --input-dir ./data/ -n extra_nodes.tsv -o graph.duckdb
# Multiple individual files from different formats
koza join -n genes.tsv -n proteins.jsonl -e interactions.parquet -o graph.duckdb
# Generate schema compliance report
koza join -n "*.nodes.*" -e "*.edges.*" -o graph.duckdb --schema-report
Output
- Database file: DuckDB database with nodes and edges tables
- Schema report: {database}_schema_report.yaml (if --schema-report enabled)
- CLI summary: File counts, record counts, and schema harmonization details
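Under the hood, schema harmonization amounts to taking the union of all columns seen across the input files and padding missing values with NULL. A minimal Python sketch of that idea (the row-dict representation and the harmonize helper are illustrative, not Koza's actual internals):

```python
def harmonize(tables):
    """Combine row dicts from several files, padding missing columns with None."""
    # Union of every column seen across all input tables, in first-seen order
    columns = []
    for rows in tables:
        for row in rows:
            for col in row:
                if col not in columns:
                    columns.append(col)
    # Emit each row with the full column set; absent values become None (NULL)
    return [{col: row.get(col) for col in columns} for rows in tables for row in rows]

file_a = [{"id": "HGNC:1", "name": "BRCA1"}]          # has a name column
file_b = [{"id": "MONDO:7", "synonym": "a disease"}]  # has a synonym column instead
combined = harmonize([file_a, file_b])
```

Each row in `combined` now carries all three columns (id, name, synonym), with None standing in for the columns its source file lacked.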
See also: How to Join KGX Files
koza split
Split a KGX file by specified fields with format conversion support.
Synopsis
koza split FILE FIELDS [OPTIONS]
Description
The split command extracts subsets of data from a KGX file, creating separate output files for each unique value (or combination of values) in the specified fields. Supports format conversion during split operations.
Arguments
| Argument | Type | Description |
|---|---|---|
| FILE | str | Path to the KGX file to split (required) |
| FIELDS | str | Comma-separated list of fields to split on (required) |
Options
| Option | Short | Type | Default | Description |
|---|---|---|---|---|
| --output-dir | -o | Path | ./output | Output directory for split files |
| --format | -f | KGXFormat | None | Output format (default: preserve input format) |
| --remove-prefixes | | bool | False | Remove prefixes from values in filenames |
| --progress | -p | bool | True | Show progress bars |
| --quiet | -q | bool | False | Suppress output |
Examples
# Split nodes by category
koza split nodes.tsv category -o ./split_output
# Split edges by predicate and convert to Parquet
koza split edges.tsv predicate -o ./parquet_output -f parquet
# Split by multiple fields (creates files per combination)
koza split nodes.tsv namespace,category -o ./split_output
# Remove CURIE prefixes from output filenames
koza split nodes.tsv category --remove-prefixes -o ./clean_output
# Split with progress tracking
koza split large_nodes.tsv provided_by -o ./split_output -p
Output
For each unique value (or combination) in the split fields, creates:
- {value}_nodes.{format} or {value}_edges.{format} depending on input file type
When splitting on array-type fields (e.g., category), records may appear in multiple output files if they have multiple values in that field.
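The splitting logic can be pictured as bucketing rows by field value, with array-valued fields fanning a row out into one bucket per value. A rough sketch, assuming pipe-delimited multivalued fields as in KGX TSV (not Koza's actual implementation):

```python
from collections import defaultdict

def split_rows(rows, field):
    """Bucket rows by each value of `field`; '|'-delimited values fan out."""
    buckets = defaultdict(list)
    for row in rows:
        for value in str(row.get(field, "")).split("|"):
            if value:
                buckets[value].append(row)  # one output file per bucket key
    return buckets

rows = [
    {"id": "HGNC:1", "category": "biolink:Gene"},
    {"id": "X:1", "category": "biolink:Gene|biolink:Protein"},  # lands in two buckets
]
buckets = split_rows(rows, "category")
```

Here X:1 appears in both the biolink:Gene and biolink:Protein buckets, which is exactly why records can show up in multiple output files.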
See also: How to Split a Graph
koza merge
Run the complete merge pipeline: join, deduplicate, normalize, and prune.
Synopsis
koza merge [OPTIONS]
Description
The merge command orchestrates a complete graph processing pipeline in sequence:
- Join: Load and combine multiple KGX files into a unified database
- Deduplicate: Remove duplicate nodes and edges by ID
- Normalize: Apply SSSOM mappings to edge subject/object references
- Prune: Remove dangling edges and handle singleton nodes
This command runs the complete pipeline for creating a production-ready knowledge graph from multiple sources.
Options
| Option | Short | Type | Default | Description |
|---|---|---|---|---|
| --nodes | -n | List[str] | None | Node files or glob patterns (can specify multiple) |
| --edges | -e | List[str] | None | Edge files or glob patterns (can specify multiple) |
| --mappings | -m | List[str] | None | SSSOM mapping files or glob patterns |
| --input-dir | -d | Path | None | Directory to auto-discover KGX files |
| --mappings-dir | | Path | None | Directory containing SSSOM mapping files |
| --output | -o | str | None | Path to output database file (default: temporary) |
| --export | | bool | False | Export final clean data to files |
| --export-dir | | Path | None | Directory for exported files (required if --export) |
| --format | -f | KGXFormat | tsv | Output format for exported files |
| --archive | | bool | False | Export as archive (tar) instead of loose files |
| --compress | | bool | False | Compress archive as tar.gz (requires --archive) |
| --graph-name | | str | merged_graph | Name for graph files in archive |
| --skip-normalize | | bool | False | Skip normalization step |
| --skip-prune | | bool | False | Skip pruning step |
| --keep-singletons | | bool | True | Keep singleton nodes (default) |
| --remove-singletons | | bool | False | Move singleton nodes to separate table |
| --progress | -p | bool | True | Show progress bars |
| --quiet | -q | bool | False | Suppress output |
Examples
# Full pipeline with auto-discovery
koza merge --input-dir ./data/ --mappings-dir ./sssom/ -o clean_graph.duckdb
# Specific files with export to Parquet
koza merge -n nodes.tsv -e edges.tsv -m mappings.sssom.tsv \
--export --export-dir ./output/ -f parquet
# Skip normalization (no SSSOM mappings needed)
koza merge -n "*.nodes.tsv" -e "*.edges.tsv" --skip-normalize -o graph.duckdb
# Create compressed archive for distribution
koza merge --input-dir ./data/ -m "*.sssom.tsv" \
--export --export-dir ./dist/ --archive --compress --graph-name my_kg
# Custom singleton handling
koza merge --input-dir ./data/ -m "*.sssom.tsv" --remove-singletons -o graph.duckdb
Output
- Database file: DuckDB database with cleaned nodes and edges tables
- Archive tables: duplicate_nodes, duplicate_edges, dangling_edges, singleton_nodes (if applicable)
- Exported files: KGX files in specified format (if --export enabled)
- CLI summary: Progress and statistics for each pipeline step
See also: How to Join KGX Files, How to Normalize Identifiers, How to Clean a Graph
koza normalize
Apply SSSOM mappings to normalize edge subject/object references.
Synopsis
koza normalize DATABASE [OPTIONS]
Description
The normalize command loads SSSOM (Simple Standard for Sharing Ontological Mappings) files and applies them to rewrite edge subject and object identifiers to their canonical/equivalent forms. Node identifiers themselves are not changed - only edge references are normalized. Original values are preserved in original_subject and original_object columns.
Arguments
| Argument | Type | Description |
|---|---|---|
| DATABASE | str | Path to existing DuckDB database file (required) |
Options
| Option | Short | Type | Default | Description |
|---|---|---|---|---|
| --mappings | -m | List[str] | None | SSSOM mapping files or glob patterns (can specify multiple) |
| --mappings-dir | -d | Path | None | Directory containing SSSOM mapping files |
| --progress | -p | bool | True | Show progress bars |
| --quiet | -q | bool | False | Suppress output |
Examples
# Apply specific mapping files
koza normalize graph.duckdb -m gene_mappings.sssom.tsv -m mondo.sssom.tsv
# Auto-discover SSSOM files in directory
koza normalize graph.duckdb --mappings-dir ./sssom/
# Apply mappings with glob pattern
koza normalize graph.duckdb -m "mappings/*.sssom.tsv"
# Quiet operation for automation
koza normalize graph.duckdb -m mappings.sssom.tsv -q
Output
- Modified edges table: subject and object columns updated with mapped identifiers
- Preservation columns: original_subject and original_object store pre-normalization values
- CLI summary: Count of loaded mappings and normalized references
Note: When one object_id maps to multiple subject_id values in SSSOM files, only the first mapping is kept to prevent edge duplication.
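Conceptually, normalization is a dictionary lookup over edge endpoints, with first-mapping-wins when an object_id appears more than once. A simplified sketch of that behavior (dict-based rows for illustration only):

```python
def build_mapping(sssom_rows):
    """object_id -> subject_id; the first mapping for each object_id wins."""
    mapping = {}
    for row in sssom_rows:
        mapping.setdefault(row["object_id"], row["subject_id"])
    return mapping

def normalize_edges(edges, mapping):
    """Rewrite subject/object, preserving originals in original_* columns."""
    out = []
    for edge in edges:
        e = dict(edge)
        e["original_subject"] = e["subject"]
        e["original_object"] = e["object"]
        e["subject"] = mapping.get(e["subject"], e["subject"])
        e["object"] = mapping.get(e["object"], e["object"])
        out.append(e)
    return out

mapping = build_mapping([
    {"subject_id": "MONDO:1", "object_id": "OMIM:9"},
    {"subject_id": "MONDO:2", "object_id": "OMIM:9"},  # ignored: OMIM:9 already mapped
])
edges = normalize_edges(
    [{"subject": "HGNC:1", "predicate": "biolink:related_to", "object": "OMIM:9"}],
    mapping,
)
```

The second OMIM:9 mapping is dropped by setdefault, so the edge is rewritten once rather than duplicated, and the unmapped subject passes through unchanged.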
See also: How to Normalize Identifiers
koza deduplicate
Remove duplicate nodes and edges by ID.
Synopsis
The deduplicate operation is included in the merge pipeline but is not exposed as a standalone CLI command. Use koza merge with appropriate options, or koza append --deduplicate for incremental deduplication.
Description
Deduplication identifies nodes and edges with duplicate IDs, archives all duplicates to separate tables (duplicate_nodes, duplicate_edges), and keeps only the first occurrence in the main tables. Order is determined by file_source or provided_by fields.
Usage via Merge
# Merge includes deduplication by default
koza merge -n "*.nodes.tsv" -e "*.edges.tsv" --skip-normalize -o graph.duckdb
Usage via Append
# Deduplicate during append operation
koza append graph.duckdb -n new_nodes.tsv --deduplicate
Archive Tables
After deduplication, inspect removed duplicates:
-- View duplicate nodes
SELECT * FROM duplicate_nodes LIMIT 10;
-- Count duplicates by source
SELECT file_source, COUNT(*) FROM duplicate_edges GROUP BY file_source;
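The keep-first rule itself can be sketched as a single pass that archives every repeat of an ID (rows are assumed already ordered by file_source/provided_by; this is an illustration, not Koza's internals):

```python
def deduplicate(rows):
    """Keep the first row per id; archive the rest as duplicates."""
    seen, kept, duplicates = set(), [], []
    for row in rows:
        if row["id"] in seen:
            duplicates.append(row)  # would land in duplicate_nodes / duplicate_edges
        else:
            seen.add(row["id"])
            kept.append(row)
    return kept, duplicates

kept, dups = deduplicate([
    {"id": "HGNC:1", "file_source": "a.tsv"},
    {"id": "HGNC:1", "file_source": "b.tsv"},  # duplicate id, later source
    {"id": "HGNC:2", "file_source": "b.tsv"},
])
```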
See also: How to Perform Incremental Updates
koza prune
Prune graph by removing dangling edges and handling singleton nodes.
Synopsis
koza prune DATABASE [OPTIONS]
Description
The prune command cleans up graph integrity issues by identifying and moving dangling edges (edges pointing to non-existent nodes) to a separate table. It can also optionally move singleton nodes (nodes with no edges) to a separate table. Data is never deleted - only moved to archive tables for preservation.
Arguments
| Argument | Type | Description |
|---|---|---|
| DATABASE | str | Path to the DuckDB database file to prune (required) |
Options
| Option | Short | Type | Default | Description |
|---|---|---|---|---|
| --keep-singletons | | bool | True* | Keep singleton nodes in main table |
| --remove-singletons | | bool | False | Move singleton nodes to separate table |
| --min-component-size | | int | None | Minimum connected component size (experimental) |
| --progress | -p | bool | True | Show progress bars |
| --quiet | -q | bool | False | Suppress output |
*Default behavior: if neither --keep-singletons nor --remove-singletons is specified, singletons are kept.
Examples
# Keep singleton nodes, move dangling edges (default)
koza prune graph.duckdb
# Explicitly keep singletons
koza prune graph.duckdb --keep-singletons
# Remove singleton nodes to separate table
koza prune graph.duckdb --remove-singletons
# Experimental: filter small components
koza prune graph.duckdb --min-component-size 10
# Quiet operation for automation
koza prune graph.duckdb --keep-singletons -q
Output
Creates archive tables for data preservation:
- dangling_edges: Edges pointing to non-existent nodes
- singleton_nodes: Isolated nodes (if --remove-singletons)
- CLI summary: Counts of edges/nodes moved, integrity statistics
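The integrity checks reduce to two set operations: edge endpoints missing from the node ID set mark an edge as dangling, and node IDs never touched by a kept edge mark a node as a singleton. Roughly (a sketch over dict rows, not the actual SQL Koza runs):

```python
def prune(nodes, edges):
    """Partition edges into kept vs dangling; flag nodes with no kept edges."""
    node_ids = {n["id"] for n in nodes}
    kept, dangling = [], []
    for e in edges:
        ok = e["subject"] in node_ids and e["object"] in node_ids
        (kept if ok else dangling).append(e)  # dangling edges are archived, not deleted
    referenced = {e["subject"] for e in kept} | {e["object"] for e in kept}
    singletons = [n for n in nodes if n["id"] not in referenced]
    return kept, dangling, singletons

nodes = [{"id": "A"}, {"id": "B"}, {"id": "C"}]
edges = [{"subject": "A", "object": "B"},
         {"subject": "A", "object": "Z"}]  # Z does not exist -> dangling
kept, dangling, singletons = prune(nodes, edges)
```

C has no surviving edge, so it is the singleton that --remove-singletons would move aside.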
See also: How to Clean a Graph
koza append
Append new KGX files to an existing graph database.
Synopsis
koza append DATABASE [OPTIONS]
Description
The append command adds new data to an existing DuckDB database with automatic schema evolution. New columns in appended files are automatically added to existing tables, with existing records receiving NULL values for new columns. Optional deduplication removes exact duplicates after appending.
Arguments
| Argument | Type | Description |
|---|---|---|
| DATABASE | str | Path to existing DuckDB database file (required) |
Options
| Option | Short | Type | Default | Description |
|---|---|---|---|---|
| --nodes | -n | List[str] | None | Node files or glob patterns (can specify multiple) |
| --edges | -e | List[str] | None | Edge files or glob patterns (can specify multiple) |
| --input-dir | -d | Path | None | Directory to auto-discover KGX files |
| --deduplicate | | bool | False | Remove duplicates during append |
| --schema-report | | bool | False | Generate schema compliance report |
| --progress | -p | bool | True | Show progress bars |
| --quiet | -q | bool | False | Suppress output |
Examples
# Append specific files to existing database
koza append graph.duckdb -n new_nodes.tsv -e new_edges.tsv
# Auto-discover files in directory and append
koza append graph.duckdb --input-dir ./new_data/
# Append with deduplication and schema reporting
koza append graph.duckdb -n "*.tsv" --deduplicate --schema-report
# Multiple files with glob patterns
koza append graph.duckdb -n "batch2/*.nodes.*" -e "batch2/*.edges.*"
# Quiet append for automation
koza append graph.duckdb -n corrections.tsv -q
Output
- Schema changes: Reports new columns added and their sources
- Record counts: Before/after record counts for nodes and edges
- Duplicate statistics: Number of duplicates removed (if --deduplicate)
- Schema report: Detailed analysis (if --schema-report)
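Schema evolution during append mirrors join-time harmonization: columns new to the incoming batch are added to the table, pre-existing rows get NULL for them, and the new columns are reported. A toy sketch (dict rows and this helper are illustrative assumptions):

```python
def append_rows(table, new_rows):
    """Append new_rows; report columns added by the batch, backfilling with None."""
    existing = []
    for row in table:
        for col in row:
            if col not in existing:
                existing.append(col)
    added = []  # columns the incoming batch introduces
    for row in new_rows:
        for col in row:
            if col not in existing and col not in added:
                added.append(col)
    columns = existing + added
    merged = [{col: row.get(col) for col in columns} for row in table + new_rows]
    return merged, added

table = [{"id": "HGNC:1", "name": "BRCA1"}]
merged, added = append_rows(table, [{"id": "HGNC:2", "xref": "OMIM:113705"}])
```

The batch introduces xref, so `added` holds the schema change while the original row is backfilled with None for it.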
See also: How to Perform Incremental Updates
koza report
Generate comprehensive reports for KGX graph databases.
Synopsis
koza report REPORT_TYPE --database DATABASE [OPTIONS]
Description
The report command generates various analysis reports for graph databases. Three report types are available: QC (quality control), graph-stats (comprehensive statistics), and schema (database schema analysis).
Arguments
| Argument | Type | Description |
|---|---|---|
| REPORT_TYPE | str | Type of report: qc, graph-stats, or schema (required) |
Options
| Option | Short | Type | Default | Description |
|---|---|---|---|---|
| --database | -d | Path | Required | Path to DuckDB database file |
| --output | -o | Path | None | Path to output report file (YAML format) |
| --quiet | -q | bool | False | Suppress progress output |
Report Types
qc - Quality Control Report
Generates quality control analysis grouped by data source, including node/edge counts, category distributions, and potential issues.
koza report qc -d merged.duckdb -o qc_report.yaml
graph-stats - Graph Statistics Report
Generates comprehensive graph statistics similar to merged_graph_stats.yaml, including total counts, degree distributions, and connectivity metrics.
koza report graph-stats -d merged.duckdb -o graph_stats.yaml
schema - Schema Report
Analyzes database schema and biolink compliance, reporting column types, coverage, and potential schema issues.
koza report schema -d merged.duckdb -o schema_report.yaml
Examples
# Generate QC report with output file
koza report qc -d merged.duckdb -o qc_report.yaml
# Generate graph statistics
koza report graph-stats -d merged.duckdb -o graph_stats.yaml
# Generate schema report
koza report schema -d merged.duckdb -o schema_report.yaml
# Quick QC analysis (console output only)
koza report qc -d merged.duckdb
# Quiet operation for scripts
koza report graph-stats -d merged.duckdb -o stats.yaml -q
Output
All reports are generated in YAML format and include:
- Timestamp: When the report was generated
- Database info: Source database path and size
- Statistics: Type-specific metrics and analysis
See also: How to Generate Reports
koza node-report
Generate tabular node reports with categorical column grouping.
Synopsis
koza node-report [OPTIONS]
Description
The node-report command generates tabular reports showing node counts grouped by categorical columns (namespace, category, provided_by, etc.). Can read from either a DuckDB database or directly from a node file.
Options
| Option | Short | Type | Default | Description |
|---|---|---|---|---|
| --database | -d | Path | None | Path to DuckDB database file |
| --file | -f | Path | None | Path to node file (TSV, JSONL, or Parquet) |
| --output | -o | Path | None | Path to output report file |
| --format | | TabularReportFormat | tsv | Output format: tsv, jsonl, or parquet |
| --column | -c | List[str] | None | Categorical columns to group by (can specify multiple) |
| --quiet | -q | bool | False | Suppress progress output |
Note: Must specify either --database or --file.
Examples
# From database with default columns
koza node-report -d merged.duckdb -o node_report.tsv
# From file with Parquet output
koza node-report -f nodes.tsv -o node_report.parquet --format parquet
# Custom categorical columns
koza node-report -d merged.duckdb -o report.tsv -c namespace -c category -c provided_by
# JSONL output format
koza node-report -d merged.duckdb -o node_report.jsonl --format jsonl
Output
Tabular report with columns for each categorical field plus a count column. Default categorical columns include namespace, category, and provided_by when present.
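The grouping itself is a plain count over tuples of the chosen categorical columns, along these lines (the node_report helper is a sketch, not Koza's implementation):

```python
from collections import Counter

def node_report(nodes, columns=("namespace", "category")):
    """Count nodes per combination of categorical column values."""
    counts = Counter(tuple(n.get(c) for c in columns) for n in nodes)
    # One report row per combination, plus a count column, most frequent first
    return [dict(zip(columns, key), count=n) for key, n in counts.most_common()]

nodes = [
    {"namespace": "HGNC", "category": "biolink:Gene"},
    {"namespace": "HGNC", "category": "biolink:Gene"},
    {"namespace": "MONDO", "category": "biolink:Disease"},
]
report = node_report(nodes)
```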
See also: How to Generate Reports
koza edge-report
Generate tabular edge reports with denormalized node information.
Synopsis
koza edge-report [OPTIONS]
Description
The edge-report command generates tabular reports showing edge counts grouped by categorical columns. When node information is available, it joins edges to nodes to include subject_category and object_category in the grouping.
Options
| Option | Short | Type | Default | Description |
|---|---|---|---|---|
| --database | -d | Path | None | Path to DuckDB database file |
| --nodes | -n | Path | None | Path to node file (for denormalization) |
| --edges | -e | Path | None | Path to edge file (TSV, JSONL, or Parquet) |
| --output | -o | Path | None | Path to output report file |
| --format | | TabularReportFormat | tsv | Output format: tsv, jsonl, or parquet |
| --column | -c | List[str] | None | Categorical columns to group by (can specify multiple) |
| --quiet | -q | bool | False | Suppress progress output |
Note: Must specify either --database or --edges.
Examples
# From database with default columns
koza edge-report -d merged.duckdb -o edge_report.tsv
# From files with Parquet output
koza edge-report -n nodes.tsv -e edges.tsv -o edge_report.parquet --format parquet
# Custom categorical columns
koza edge-report -d merged.duckdb -o report.tsv \
-c subject_category -c predicate -c object_category -c primary_knowledge_source
# Edge report without node denormalization
koza edge-report -e edges.tsv -o edge_report.tsv
Output
Tabular report with columns for each categorical field plus a count column. When node information is available, includes subject_category and object_category derived from joining to the nodes table.
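The denormalization step is essentially a lookup from edge endpoints into a node-id to category map before grouping, along these lines (a sketch over dict rows, not the SQL join Koza performs):

```python
def denormalize(edges, nodes):
    """Attach subject_category/object_category by joining endpoints to nodes."""
    category = {n["id"]: n.get("category") for n in nodes}
    return [
        dict(e,
             subject_category=category.get(e["subject"]),   # None if node unknown
             object_category=category.get(e["object"]))
        for e in edges
    ]

nodes = [{"id": "HGNC:1", "category": "biolink:Gene"},
         {"id": "MONDO:7", "category": "biolink:Disease"}]
edges = [{"subject": "HGNC:1", "predicate": "biolink:related_to", "object": "MONDO:7"}]
rows = denormalize(edges, nodes)
```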
See also: How to Generate Reports
koza node-examples
Generate sample rows per node type.
Synopsis
koza node-examples [OPTIONS]
Description
The node-examples command samples N example rows for each distinct value in a type column (default: category). The output can be used for documentation, debugging, and data exploration.
Options
| Option | Short | Type | Default | Description |
|---|---|---|---|---|
| --database | -d | Path | None | Path to DuckDB database file |
| --file | -f | Path | None | Path to node file (TSV, JSONL, or Parquet) |
| --output | -o | Path | None | Path to output examples file |
| --format | | TabularReportFormat | tsv | Output format: tsv, jsonl, or parquet |
| --sample-size | -n | int | 5 | Number of examples per type |
| --type-column | -t | str | category | Column to partition examples by |
| --quiet | -q | bool | False | Suppress progress output |
Note: Must specify either --database or --file.
Examples
# From database (5 examples per category)
koza node-examples -d merged.duckdb -o node_examples.tsv
# From file with 10 examples per type
koza node-examples -f nodes.tsv -o examples.tsv -n 10
# Group by different column
koza node-examples -d merged.duckdb -o examples.tsv -t provided_by
# Output as Parquet
koza node-examples -d merged.duckdb -o examples.parquet --format parquet
# More examples per category
koza node-examples -d merged.duckdb -o examples.tsv -n 20
Output
Tabular file containing N sample rows for each unique value in the type column. All node columns are preserved in the output.
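Per-type sampling amounts to keeping at most N rows per distinct value of the type column. A sketch of that shape (this version keeps the first N rows per type; the real command may sample differently):

```python
from collections import defaultdict

def examples_per_type(rows, type_column="category", sample_size=5):
    """Keep at most `sample_size` rows for each value of `type_column`."""
    taken = defaultdict(int)
    out = []
    for row in rows:
        key = row.get(type_column)
        if taken[key] < sample_size:
            taken[key] += 1
            out.append(row)  # full row preserved, all columns intact
    return out

rows = [{"id": f"X:{i}", "category": "biolink:Gene"} for i in range(10)]
sample = examples_per_type(rows, sample_size=3)
```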
See also: How to Generate Reports
koza edge-examples
Generate sample rows per edge type.
Synopsis
koza edge-examples [OPTIONS]
Description
The edge-examples command samples N example rows for each distinct combination of type columns (default: subject_category, predicate, object_category). When node information is available, it joins edges to nodes for category information.
Options
| Option | Short | Type | Default | Description |
|---|---|---|---|---|
| --database | -d | Path | None | Path to DuckDB database file |
| --nodes | -n | Path | None | Path to node file (for denormalization) |
| --edges | -e | Path | None | Path to edge file (TSV, JSONL, or Parquet) |
| --output | -o | Path | None | Path to output examples file |
| --format | | TabularReportFormat | tsv | Output format: tsv, jsonl, or parquet |
| --sample-size | -s | int | 5 | Number of examples per type |
| --type-column | -t | List[str] | None | Columns to partition examples by (can specify multiple) |
| --quiet | -q | bool | False | Suppress progress output |
Note: Must specify either --database or --edges.
Examples
# From database (5 examples per edge type)
koza edge-examples -d merged.duckdb -o edge_examples.tsv
# From files with 10 examples
koza edge-examples -n nodes.tsv -e edges.tsv -o examples.tsv -s 10
# Custom type columns
koza edge-examples -d merged.duckdb -o examples.tsv -t predicate -t primary_knowledge_source
# Output as Parquet
koza edge-examples -d merged.duckdb -o examples.parquet --format parquet
# More examples per edge type
koza edge-examples -d merged.duckdb -o examples.tsv -s 20
Output
Tabular file containing N sample rows for each unique combination of type columns. When node information is available, includes subject_category and object_category columns.
See also: How to Generate Reports
Common Patterns
File Specification Formats
All commands that accept file lists support multiple specification formats:
Glob Patterns
# Match all node files
--nodes "*.nodes.*"
# Match specific formats
--nodes "*.tsv" --edges "*.jsonl"
# Match files in subdirectories
--nodes "data/*_nodes.tsv"
Multiple Files
# Specify multiple files individually
--nodes genes.tsv --nodes proteins.tsv --edges interactions.tsv
# Or as a list
--nodes genes.tsv proteins.jsonl pathways.parquet
Progress and Output Control
Progress Indicators
- --progress / -p: Display progress bars (enabled by default for most commands)
- --quiet / -q: Suppress all non-error output
Output Formats
Supported formats for tabular reports and exports:
- TSV: Tab-separated values (KGX standard)
- JSONL: JSON Lines format
- Parquet: Columnar format for analytics
Exit Codes
| Code | Meaning |
|---|---|
| 0 | Success |
| 1 | General error (file not found, permission denied, etc.) |
| 2 | Invalid arguments or configuration |
| 130 | Interrupted by user (Ctrl+C) |
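In automation, these codes let a wrapper distinguish a bad invocation (don't retry) from a general failure. A hedged Python sketch, using sys.executable as a stand-in for the koza binary so the example is self-contained:

```python
import subprocess
import sys

def run_step(cmd):
    """Run a pipeline step and classify the result by exit code."""
    code = subprocess.run(cmd).returncode
    if code == 0:
        return "ok"
    if code == 2:
        return "bad-arguments"  # fix the invocation; retrying won't help
    if code == 130:
        return "interrupted"    # user hit Ctrl+C
    return "error"              # general failure: file missing, permissions, ...

# Stand-in command that exits with code 2, as koza would on invalid arguments
status = run_step([sys.executable, "-c", "raise SystemExit(2)"])
```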
Getting Help
# General help
koza --help
# Command-specific help
koza join --help
koza split --help
koza merge --help
koza normalize --help
koza prune --help
koza append --help
koza report --help
koza node-report --help
koza edge-report --help
koza node-examples --help
koza edge-examples --help
# Version information
koza --version