Skip to content

Configuration Reference

This reference documents all Pydantic configuration models used in Koza graph operations.


Enums

KGXFormat

Supported KGX file formats.

Value Description
TSV Tab-separated values format
JSONL JSON Lines format
PARQUET Apache Parquet format

KGXFileType

KGX file types.

Value Description
NODES Node file
EDGES Edge file

TabularReportFormat

Supported formats for tabular reports.

Value Description
TSV Tab-separated values format
JSONL JSON Lines format
PARQUET Apache Parquet format

File Handling

FileSpec

Specification for a KGX file with automatic format detection and source attribution.

Fields

Field Type Default Description
path Path required Path to the file
source_name str \| None file stem Source attribution name (defaults to filename stem)
format KGXFormat \| None auto-detect File format (TSV, JSONL, PARQUET)
file_type KGXFileType \| None auto-detect File type (NODES, EDGES)

Validation Rules

  • Format Detection: If format is not provided, it is auto-detected from the file extension:
    • .tsv, .txt -> TSV
    • .jsonl, .json -> JSONL
    • .parquet -> PARQUET
    • Compressed files (.gz, .bz2, .xz) are handled by stripping the compression suffix first
  • File Type Detection: If file_type is not provided, it is auto-detected from the filename:
    • Files containing _nodes. or starting with nodes. -> NODES
    • Files containing _edges. or starting with edges. -> EDGES
  • Source Name: If not provided, defaults to the stem of the file path

Example

from koza.model.graph_operations import FileSpec

# Full specification
file_spec = FileSpec(
    path=Path("data/monarch_nodes.tsv"),
    source_name="monarch",
    format=KGXFormat.TSV,
    file_type=KGXFileType.NODES
)

# With auto-detection (recommended)
file_spec = FileSpec(path=Path("data/monarch_nodes.tsv"))
# Automatically detects: format=TSV, file_type=NODES, source_name="monarch_nodes"

Operation Configurations

JoinConfig

Configuration for the join operation, which loads multiple KGX files into a unified database.

Fields

Field Type Default Description
node_files list[FileSpec] [] List of node files to join
edge_files list[FileSpec] [] List of edge files to join
output_database Path \| None None Path for output DuckDB database
schema_reporting bool True Enable schema analysis reporting
preserve_duplicates bool False Keep duplicate records during join
generate_provided_by bool True Add provided_by column from filename
database_path Path \| None None Database path (inherited from base)
output_format KGXFormat TSV Output format for exports
quiet bool False Suppress progress output
show_progress bool True Show progress indicators

Validation Rules

  • If output_database is provided, it automatically sets database_path for compatibility

Example

from koza.model.graph_operations import JoinConfig, FileSpec

config = JoinConfig(
    node_files=[
        FileSpec(path=Path("data/source1_nodes.tsv")),
        FileSpec(path=Path("data/source2_nodes.tsv")),
    ],
    edge_files=[
        FileSpec(path=Path("data/source1_edges.tsv")),
        FileSpec(path=Path("data/source2_edges.tsv")),
    ],
    output_database=Path("merged.duckdb"),
    generate_provided_by=True
)

JoinResult

Result of the join operation.

Fields

Field Type Default Description
files_loaded list[FileLoadResult] required Details for each loaded file
final_stats DatabaseStats required Final database statistics
schema_report dict[str, Any] \| None None Schema analysis report
total_time_seconds float required Total operation time
database_path Path \| None None Path to the created database

SplitConfig

Configuration for the split operation, which splits a KGX file based on field values.

Fields

Field Type Default Description
input_file FileSpec required Input file to split
split_fields list[str] required Fields to split on
output_directory Path ./output Directory for output files
remove_prefixes bool False Remove prefixes from split values
output_format KGXFormat \| None None Output format (None = preserve original)
database_path Path \| None None Database path (inherited from base)
quiet bool False Suppress progress output
show_progress bool True Show progress indicators

Example

from koza.model.graph_operations import SplitConfig, FileSpec

config = SplitConfig(
    input_file=FileSpec(path=Path("data/all_nodes.tsv")),
    split_fields=["category", "provided_by"],
    output_directory=Path("split_output"),
    remove_prefixes=True
)

SplitResult

Result of the split operation.

Fields

Field Type Default Description
input_file FileSpec required The input file that was split
output_files list[Path] required Paths to created output files
total_records_split int required Total records processed
split_values list[dict[str, str]] required Unique value combinations found
total_time_seconds float required Total operation time

MergeConfig

Configuration for the merge operation, which is a composite pipeline combining join, deduplicate, normalize, and prune operations.

Fields

Field Type Default Description
Input Files
node_files list[FileSpec] [] List of node files to merge
edge_files list[FileSpec] [] List of edge files to merge
mapping_files list[FileSpec] [] SSSOM mapping files for normalization
Pipeline Options
skip_deduplicate bool False Skip deduplication step
skip_normalize bool False Skip normalization step
skip_prune bool False Skip pruning step
generate_provided_by bool True Add provided_by column from filename
continue_on_pipeline_step_error bool True Continue on non-critical errors
Prune Options
keep_singletons bool True Preserve isolated nodes
remove_singletons bool False Move singletons to separate table
Output Options
output_database Path \| None None Path for output database (None = temporary)
output_format KGXFormat TSV Format for exported files
export_final bool False Export final clean data to files
export_directory Path \| None None Directory for exported files
archive bool False Export as archive instead of loose files
compress bool False Compress archive as tar.gz
graph_name str \| None None Name for graph files in archive
General Options
quiet bool False Suppress progress output
show_progress bool True Show progress indicators
schema_reporting bool True Enable schema analysis reporting

Validation Rules

  • Files Required: At least one node or edge file must be provided
  • Normalize Requirements: If skip_normalize is False, mapping files must be provided
  • Singleton Options: Cannot set both keep_singletons and remove_singletons to True
  • Export Requirements: If export_final is True, export_directory must be provided
  • Archive Requirements: compress requires archive to be enabled

Example

from koza.model.graph_operations import MergeConfig, FileSpec

config = MergeConfig(
    node_files=[
        FileSpec(path=Path("data/source1_nodes.tsv")),
        FileSpec(path=Path("data/source2_nodes.tsv")),
    ],
    edge_files=[
        FileSpec(path=Path("data/source1_edges.tsv")),
        FileSpec(path=Path("data/source2_edges.tsv")),
    ],
    mapping_files=[
        FileSpec(path=Path("mappings/disease_mappings.sssom.tsv")),
    ],
    output_database=Path("merged.duckdb"),
    export_final=True,
    export_directory=Path("output"),
    archive=True,
    compress=True,
    graph_name="my-knowledge-graph"
)

MergeResult

Result of the merge operation.

Fields

Field Type Default Description
success bool required Whether the operation succeeded
join_result JoinResult \| None None Result from join step
deduplicate_result DeduplicateResult \| None None Result from deduplicate step
normalize_result NormalizeResult \| None None Result from normalize step
prune_result PruneResult \| None None Result from prune step
operations_completed list[str] [] Names of completed operations
operations_skipped list[str] [] Names of skipped operations
final_stats DatabaseStats \| None None Final database statistics
database_path Path \| None None Path to the database
exported_files list[Path] [] Paths to exported files
total_time_seconds float required Total operation time
summary OperationSummary required Summary for CLI output
errors list[str] [] Error messages
warnings list[str] [] Warning messages

NormalizeConfig

Configuration for the normalize operation, which applies SSSOM mappings to normalize identifiers in edges.

Fields

Field Type Default Description
database_path Path required Path to the DuckDB database
mapping_files list[FileSpec] [] SSSOM mapping files
quiet bool False Suppress progress output
show_progress bool True Show progress indicators

Validation Rules

  • Database Exists: The database file must exist
  • Mapping Files Required: At least one SSSOM mapping file must be provided

Example

from koza.model.graph_operations import NormalizeConfig, FileSpec

config = NormalizeConfig(
    database_path=Path("merged.duckdb"),
    mapping_files=[
        FileSpec(path=Path("mappings/disease_mappings.sssom.tsv")),
        FileSpec(path=Path("mappings/gene_mappings.sssom.tsv")),
    ]
)

NormalizeResult

Result of the normalize operation.

Fields

Field Type Default Description
success bool required Whether the operation succeeded
mappings_loaded list[FileLoadResult] required Details for each loaded mapping file
edges_normalized int required Number of edges normalized
final_stats DatabaseStats \| None None Final database statistics
total_time_seconds float required Total operation time
summary OperationSummary required Summary for CLI output
errors list[str] [] Error messages
warnings list[str] [] Warning messages

DeduplicateConfig

Configuration for the deduplicate operation, which removes duplicate nodes and edges.

Fields

Field Type Default Description
database_path Path required Path to the DuckDB database
deduplicate_nodes bool True Deduplicate nodes table
deduplicate_edges bool True Deduplicate edges table
quiet bool False Suppress progress output
show_progress bool True Show progress indicators

Validation Rules

  • Database Exists: The database file must exist

Example

from koza.model.graph_operations import DeduplicateConfig

config = DeduplicateConfig(
    database_path=Path("merged.duckdb"),
    deduplicate_nodes=True,
    deduplicate_edges=True
)

DeduplicateResult

Result of the deduplicate operation.

Fields

Field Type Default Description
success bool required Whether the operation succeeded
duplicate_nodes_found int 0 Number of duplicate nodes found
duplicate_nodes_removed int 0 Rows removed from nodes table
duplicate_edges_found int 0 Number of duplicate edges found
duplicate_edges_removed int 0 Rows removed from edges table
final_stats DatabaseStats \| None None Final database statistics
total_time_seconds float required Total operation time
summary OperationSummary required Summary for CLI output
errors list[str] [] Error messages
warnings list[str] [] Warning messages

PruneConfig

Configuration for the prune operation, which handles dangling edges and singleton nodes.

Fields

Field Type Default Description
database_path Path required Path to the DuckDB database
keep_singletons bool True Preserve isolated nodes
remove_singletons bool False Move singletons to separate table
min_component_size int \| None None Minimum connected component size
quiet bool False Suppress progress output
show_progress bool True Show progress indicators
output_format KGXFormat \| None None Format for any exported files

Validation Rules

  • Database Exists: The database file must exist
  • Singleton Options: Cannot set both keep_singletons and remove_singletons to True

Example

from koza.model.graph_operations import PruneConfig

config = PruneConfig(
    database_path=Path("merged.duckdb"),
    keep_singletons=False,
    remove_singletons=True
)

PruneResult

Result of the prune operation.

Fields

Field Type Default Description
database_path Path required Path to the database
dangling_edges_moved int required Number of dangling edges moved
singleton_nodes_moved int required Number of singleton nodes moved
singleton_nodes_kept int required Number of singleton nodes kept
final_stats DatabaseStats required Final database statistics
dangling_edges_by_source dict[str, int] {} Dangling edges grouped by source
missing_nodes_by_source dict[str, int] {} Missing nodes grouped by source
total_time_seconds float required Total operation time
success bool required Whether the operation succeeded
errors list[str] [] Error messages

AppendConfig

Configuration for the append operation, which adds new files to an existing database.

Fields

Field Type Default Description
database_path Path required Path to the existing DuckDB database
node_files list[FileSpec] [] Node files to append
edge_files list[FileSpec] [] Edge files to append
deduplicate bool False Run deduplication after append
quiet bool False Suppress progress output
show_progress bool True Show progress indicators
schema_reporting bool False Enable schema analysis reporting

Validation Rules

  • Database Exists: The database file must exist
  • Files Required: At least one node or edge file must be provided

Example

from koza.model.graph_operations import AppendConfig, FileSpec

config = AppendConfig(
    database_path=Path("merged.duckdb"),
    node_files=[
        FileSpec(path=Path("data/new_nodes.tsv")),
    ],
    edge_files=[
        FileSpec(path=Path("data/new_edges.tsv")),
    ],
    deduplicate=True
)

AppendResult

Result of the append operation.

Fields

Field Type Default Description
database_path Path required Path to the database
files_loaded list[FileLoadResult] required Details for each loaded file
records_added int required Number of records added
new_columns_added int required Number of new columns added
schema_changes list[str] [] Description of schema changes
final_stats DatabaseStats required Final database statistics
schema_report dict[str, Any] \| None None Schema analysis report
duplicates_handled int 0 Number of duplicates handled
total_time_seconds float required Total operation time

Report Configurations

QCReportConfig

Configuration for QC (Quality Control) report generation.

Fields

Field Type Default Description
database_path Path required Path to the DuckDB database
output_file Path \| None None Path for output YAML file
group_by str "provided_by" Column to group statistics by
quiet bool False Suppress progress output

Example

from koza.model.graph_operations import QCReportConfig

config = QCReportConfig(
    database_path=Path("merged.duckdb"),
    output_file=Path("qc_report.yaml"),
    group_by="provided_by"
)

GraphStatsConfig

Configuration for graph statistics report generation.

Fields

Field Type Default Description
database_path Path required Path to the DuckDB database
output_file Path \| None None Path for output YAML file
quiet bool False Suppress progress output

Example

from koza.model.graph_operations import GraphStatsConfig

config = GraphStatsConfig(
    database_path=Path("merged.duckdb"),
    output_file=Path("graph_stats.yaml")
)

SchemaReportConfig

Configuration for schema analysis report generation.

Fields

Field Type Default Description
database_path Path required Path to the DuckDB database
output_file Path \| None None Path for output YAML file
include_biolink_compliance bool True Include Biolink model compliance analysis
quiet bool False Suppress progress output

Example

from koza.model.graph_operations import SchemaReportConfig

config = SchemaReportConfig(
    database_path=Path("merged.duckdb"),
    output_file=Path("schema_report.yaml"),
    include_biolink_compliance=True
)

NodeReportConfig

Configuration for tabular node report generation.

Fields

Field Type Default Description
database_path Path \| None None Path to the DuckDB database
node_file FileSpec \| None None Node file to load (alternative to database)
output_file Path \| None None Path for output file
output_format TabularReportFormat TSV Output format
categorical_columns list[str] ["namespace", "category", "in_taxon", "provided_by"] Columns to group by
quiet bool False Suppress progress output

Validation Rules

  • Input Required: Either database_path or node_file must be provided

Example

from koza.model.graph_operations import NodeReportConfig

config = NodeReportConfig(
    database_path=Path("merged.duckdb"),
    output_file=Path("node_report.tsv"),
    categorical_columns=["category", "provided_by"]
)

EdgeReportConfig

Configuration for tabular edge report generation.

Fields

Field Type Default Description
database_path Path \| None None Path to the DuckDB database
node_file FileSpec \| None None Node file to load (for category enrichment)
edge_file FileSpec \| None None Edge file to load (alternative to database)
output_file Path \| None None Path for output file
output_format TabularReportFormat TSV Output format
categorical_columns list[str] see below Columns to group by
quiet bool False Suppress progress output

Default categorical columns:

  • subject_category
  • subject_namespace
  • predicate
  • object_category
  • object_namespace
  • primary_knowledge_source
  • aggregator_knowledge_source
  • knowledge_level
  • agent_type
  • provided_by

Validation Rules

  • Input Required: Either database_path or edge_file must be provided

Example

from koza.model.graph_operations import EdgeReportConfig

config = EdgeReportConfig(
    database_path=Path("merged.duckdb"),
    output_file=Path("edge_report.tsv"),
    categorical_columns=["predicate", "subject_category", "object_category"]
)

NodeExamplesConfig

Configuration for node examples generation.

Fields

Field Type Default Description
database_path Path \| None None Path to the DuckDB database
node_file FileSpec \| None None Node file to load (alternative to database)
output_file Path \| None None Path for output file
output_format TabularReportFormat TSV Output format
sample_size int 5 Number of examples per type
type_column str "category" Column defining the type for grouping
quiet bool False Suppress progress output

Validation Rules

  • Input Required: Either database_path or node_file must be provided

Example

from koza.model.graph_operations import NodeExamplesConfig

config = NodeExamplesConfig(
    database_path=Path("merged.duckdb"),
    output_file=Path("node_examples.tsv"),
    sample_size=10,
    type_column="category"
)

EdgeExamplesConfig

Configuration for edge examples generation.

Fields

Field Type Default Description
database_path Path \| None None Path to the DuckDB database
node_file FileSpec \| None None Node file to load (for category enrichment)
edge_file FileSpec \| None None Edge file to load (alternative to database)
output_file Path \| None None Path for output file
output_format TabularReportFormat TSV Output format
sample_size int 5 Number of examples per type
type_columns list[str] ["subject_category", "predicate", "object_category"] Columns defining the type for grouping
quiet bool False Suppress progress output

Validation Rules

  • Input Required: Either database_path or edge_file must be provided

Example

from koza.model.graph_operations import EdgeExamplesConfig

config = EdgeExamplesConfig(
    database_path=Path("merged.duckdb"),
    output_file=Path("edge_examples.tsv"),
    sample_size=10,
    type_columns=["predicate", "primary_knowledge_source"]
)

Supporting Models

DatabaseStats

Database statistics model used in operation results.

Field Type Default Description
nodes int 0 Total node count
edges int 0 Total edge count
dangling_edges int 0 Edges with missing subject/object nodes
duplicate_nodes int 0 Duplicate node count
singleton_nodes int 0 Nodes with no edges
database_size_mb float \| None None Database size in megabytes

FileLoadResult

Result of loading a single file.

Field Type Default Description
file_spec FileSpec required The file specification
records_loaded int required Number of records loaded
detected_format KGXFormat required Detected file format
load_time_seconds float required Time to load the file
errors list[str] [] Any errors during loading
temp_table_name str \| None None Temp table name for schema analysis

OperationSummary

Summary statistics for CLI output.

Field Type Default Description
operation str required Operation name
success bool required Whether operation succeeded
message str required Summary message
stats DatabaseStats \| None None Database statistics
files_processed int 0 Number of files processed
total_time_seconds float 0.0 Total operation time
warnings list[str] [] Warning messages
errors list[str] [] Error messages