Python API Reference
This reference documentation is auto-generated from the source code docstrings using mkdocstrings. Each function includes parameter descriptions, return types, and usage examples.
Core Operations
These are the primary graph transformation functions. Each operation takes a configuration object and returns a result object with statistics and output information.
join_graphs
Combine multiple KGX files into a single DuckDB database.
join_graphs
Join multiple KGX files into a unified DuckDB database.
This operation loads node and edge files from various formats (TSV, JSONL, Parquet) into a single DuckDB database, combining them using UNION ALL BY NAME to handle schema differences across files. Each file's records are tagged with a source identifier for provenance tracking.
The join process:

1. Creates or opens a DuckDB database (in-memory if no path specified)
2. Loads each node file into a temporary table with format auto-detection
3. Loads each edge file into a temporary table with format auto-detection
4. Optionally generates a schema report analyzing column types and values
5. Combines all temporary tables into final `nodes` and `edges` tables
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `config` | `JoinConfig` | JoinConfig containing: `node_files` (list of FileSpec objects for node files); `edge_files` (list of FileSpec objects for edge files); `database_path` (optional path for a persistent database, `None` for in-memory); `schema_reporting` (whether to generate a schema analysis report); `generate_provided_by` (whether to add a `provided_by` column from source names); `quiet` (suppress console output); `show_progress` (display progress bars during loading) | *required* |
Returns:
| Type | Description |
|---|---|
| `JoinResult` | JoinResult containing: `files_loaded` (list of FileLoadResult with per-file statistics); `final_stats` (DatabaseStats with node/edge counts and database size); `schema_report` (optional schema analysis if `schema_reporting` is enabled); `total_time_seconds` (operation duration); `database_path` (path to the created database) |
Raises:
| Type | Description |
|---|---|
| `Exception` | If file loading fails or a database operation errors |
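A minimal usage sketch. The `kg_ops` import path is a placeholder for the actual package name, and `JoinConfig` is assumed to accept its documented fields as keyword arguments:

```python
# The kg_ops module name is a placeholder; adjust the import to the actual package.
from kg_ops import JoinConfig, join_graphs, prepare_file_specs_from_paths

# Build FileSpec lists from glob patterns; formats are auto-detected from extensions.
node_specs, edge_specs = prepare_file_specs_from_paths(
    node_paths=["data/*_nodes.tsv"],
    edge_paths=["data/*_edges.tsv"],
)

config = JoinConfig(
    node_files=node_specs,
    edge_files=edge_specs,
    database_path="graph.duckdb",  # None would build an in-memory database
    schema_reporting=True,
)
result = join_graphs(config)
print(result.final_stats, result.total_time_seconds)
```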
split_graph
Split a graph database into separate files based on column values.
split_graph
Split a KGX file into multiple output files based on field values.
This operation partitions a single KGX file (nodes or edges) into separate files based on the unique values of one or more specified fields. Supports format conversion between TSV, JSONL, and Parquet during the split.
The split process:

1. Loads the input file into an in-memory DuckDB database
2. Identifies all unique value combinations for the specified split fields
3. For each unique combination, exports matching records to a separate file
4. Output filenames are generated from the split field values

Handles multivalued fields (arrays) by using `list_contains()` for filtering, allowing records to appear in multiple output files if they contain multiple values in the split field.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `config` | `SplitConfig` | SplitConfig containing: `input_file` (FileSpec for the input KGX file); `split_fields` (list of column names to split on, e.g. `["provided_by"]`); `output_directory` (path where split files will be written); `output_format` (target format: TSV, JSONL, or Parquet; defaults to the input format); `remove_prefixes` (strip CURIE prefixes from values in output filenames); `quiet` (suppress console output); `show_progress` (display a progress bar during splitting) | *required* |
Returns:
| Type | Description |
|---|---|
| `SplitResult` | SplitResult containing: `input_file` (the original input FileSpec); `output_files` (list of Path objects for created files); `total_records_split` (total number of records processed); `split_values` (list of dicts showing the field value combinations); `total_time_seconds` (operation duration) |
Raises:
| Type | Description |
|---|---|
| `FileNotFoundError` | If the input file does not exist |
| `Exception` | If loading or export operations fail |
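A usage sketch; `kg_ops` is a placeholder import, and the input FileSpec is built via the path helper (with an empty edge list) purely to avoid guessing the FileSpec constructor:

```python
# The kg_ops module name is a placeholder; adjust the import to the actual package.
from kg_ops import SplitConfig, split_graph, prepare_file_specs_from_paths

node_specs, _ = prepare_file_specs_from_paths(
    node_paths=["merged_nodes.tsv"], edge_paths=[]
)

config = SplitConfig(
    input_file=node_specs[0],
    split_fields=["provided_by"],
    output_directory="split_output",
)
result = split_graph(config)
print(f"{result.total_records_split} records split into {len(result.output_files)} files")
```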
merge_graphs
End-to-end pipeline combining join, deduplicate, normalize, and prune operations.
merge_graphs
Execute the complete merge pipeline: join → deduplicate → normalize → prune.
This composite operation orchestrates multiple graph operations in sequence to create a clean, normalized, and validated graph database from multiple source files. It's the recommended way to build a production-ready knowledge graph from raw KGX files and SSSOM mappings.
The pipeline steps:

1. Join: Load all node/edge files into a unified DuckDB database
2. Deduplicate: Remove duplicate nodes/edges by ID (optional, skip with `skip_deduplicate`)
3. Normalize: Apply SSSOM mappings to harmonize identifiers (optional, skip with `skip_normalize`)
4. Prune: Handle dangling edges and singleton nodes (optional, skip with `skip_prune`)
5. Export: Optionally export the final graph to TSV/JSONL/Parquet files or archive
Each step can be skipped via config flags. If a step fails, the pipeline can either abort (default) or continue with remaining steps (continue_on_pipeline_step_error).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `config` | `MergeConfig` | MergeConfig containing: `node_files` (list of FileSpec objects for node files); `edge_files` (list of FileSpec objects for edge files); `mapping_files` (list of FileSpec objects for SSSOM mapping files); `output_database` (path for the output database, a temp database if not specified); `skip_deduplicate` (skip the deduplication step); `skip_normalize` (skip the normalization step, also skipped if no mapping files); `skip_prune` (skip the pruning step); `keep_singletons`/`remove_singletons` (singleton node handling in the prune step); `export_final` (whether to export the final graph to files); `export_directory` (where to write exported files); `output_format` (format for exported files: TSV, JSONL, or Parquet); `archive` (create a tar archive instead of loose files); `compress` (gzip-compress the archive); `graph_name` (name prefix for exported files); `continue_on_pipeline_step_error` (continue the pipeline if a step fails); `schema_reporting` (generate a schema analysis report); `quiet` (suppress console output); `show_progress` (display progress bars) | *required* |
Returns:
| Type | Description |
|---|---|
| `MergeResult` | MergeResult containing: `success` (whether the full pipeline completed successfully); `join_result`/`deduplicate_result`/`normalize_result`/`prune_result` (per-step results); `operations_completed` (list of steps that completed successfully); `operations_skipped` (list of steps that were skipped); `final_stats` (DatabaseStats with final node/edge counts); `database_path` (path to the output database, `None` if a temp database was used); `exported_files` (list of paths to exported files); `total_time_seconds` (total pipeline duration); `summary` (OperationSummary with overall status); `errors` (list of error messages from failed steps); `warnings` (list of warning messages) |
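A usage sketch of the full pipeline; `kg_ops` is a placeholder import, `MergeConfig` is assumed to accept its documented fields as keyword arguments, and the FileSpec lists are built with the helpers documented under Helper Functions below:

```python
from pathlib import Path

# The kg_ops module name is a placeholder; adjust the import to the actual package.
from kg_ops import (
    MergeConfig,
    merge_graphs,
    prepare_file_specs_from_paths,
    prepare_mapping_file_specs_from_paths,
)

node_specs, edge_specs = prepare_file_specs_from_paths(
    node_paths=["sources/*_nodes.tsv"], edge_paths=["sources/*_edges.tsv"]
)
mapping_specs = prepare_mapping_file_specs_from_paths([Path("mappings.sssom.tsv")])

config = MergeConfig(
    node_files=node_specs,
    edge_files=edge_specs,
    mapping_files=mapping_specs,
    output_database=Path("merged.duckdb"),
    export_final=True,
    export_directory=Path("exports"),
)
result = merge_graphs(config)
print("completed:", result.operations_completed, "skipped:", result.operations_skipped)
if not result.success:
    print(result.errors)
```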
normalize_graph
Apply SSSOM mappings to normalize identifiers in a graph.
normalize_graph
Apply SSSOM mappings to normalize node identifiers in edge references.
This operation uses SSSOM (Simple Standard for Sharing Ontological Mappings) files to replace node identifiers in the edges table with their canonical equivalents. This is useful for harmonizing identifiers from different sources to a common namespace.
The normalization process:

1. Loads SSSOM mapping files (TSV format with YAML header)
2. Creates a mappings table, deduplicating by `object_id` to prevent edge duplication
3. Updates edge subject/object columns using the mappings (`object_id` -> `subject_id`)
4. Preserves original identifiers in `original_subject`/`original_object` columns
Note: Only edge references are normalized. Node IDs in the nodes table are not modified; use the mappings to update node IDs separately if needed.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `config` | `NormalizeConfig` | NormalizeConfig containing: `database_path` (path to the DuckDB database to normalize); `mapping_files` (list of FileSpec objects for SSSOM mapping files); `quiet` (suppress console output); `show_progress` (display progress bars during loading) | *required* |
Returns:
| Type | Description |
|---|---|
| `NormalizeResult` | NormalizeResult containing: `success` (whether the operation completed successfully); `mappings_loaded` (list of FileLoadResult with per-file statistics); `edges_normalized` (count of edge references that were updated); `final_stats` (DatabaseStats with node/edge counts); `total_time_seconds` (operation duration); `summary` (OperationSummary with status and messages); `errors` (list of error messages, if any); `warnings` (list of warnings, e.g. duplicate mappings found) |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If no nodes/edges tables exist or no mapping files load |
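A usage sketch; `kg_ops` is a placeholder import and the keyword arguments mirror the documented config fields:

```python
from pathlib import Path

# The kg_ops module name is a placeholder; adjust the import to the actual package.
from kg_ops import (
    NormalizeConfig,
    normalize_graph,
    prepare_mapping_file_specs_from_paths,
)

mapping_specs = prepare_mapping_file_specs_from_paths([Path("hp_mesh.sssom.tsv")])

config = NormalizeConfig(database_path="merged.duckdb", mapping_files=mapping_specs)
result = normalize_graph(config)
print(f"{result.edges_normalized} edge references rewritten")
for warning in result.warnings:
    print("warning:", warning)
```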
deduplicate_graph
Remove duplicate nodes and edges from a graph database.
deduplicate_graph
Deduplicate nodes and edges in a graph database.
This operation:

1. Identifies nodes/edges with duplicate IDs
2. Copies ALL duplicate rows to `duplicate_nodes`/`duplicate_edges` tables
3. Keeps only the first occurrence in the main tables (ordered by `file_source`)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `config` | `DeduplicateConfig` | DeduplicateConfig with database path and options | *required* |
Returns:
| Type | Description |
|---|---|
| `DeduplicateResult` | DeduplicateResult with deduplication statistics |
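A usage sketch; `kg_ops` is a placeholder import, and the `database_path` field name is assumed to match the other config objects documented on this page:

```python
# The kg_ops module name is a placeholder; adjust the import to the actual package.
from kg_ops import DeduplicateConfig, deduplicate_graph

# Duplicate rows are preserved in duplicate_nodes/duplicate_edges for inspection.
result = deduplicate_graph(DeduplicateConfig(database_path="merged.duckdb"))
print(result)
```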
prune_graph
Remove dangling edges and optionally singleton nodes from a graph.
prune_graph
Clean up graph integrity issues by handling dangling edges and singleton nodes.
This operation identifies and handles two common graph quality issues:

- Dangling edges: Edges where subject or object IDs don't exist in the nodes table
- Singleton nodes: Nodes that don't appear as subject or object in any edge

The prune process:

1. Identifies dangling edges and moves them to a `dangling_edges` table
2. Based on config, either keeps or moves singleton nodes to a `singleton_nodes` table
3. Optionally filters by minimum connected component size (not yet implemented)
Dangling edges and singleton nodes are preserved in separate tables for QC analysis rather than being deleted, allowing investigation of data issues.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `config` | `PruneConfig` | PruneConfig containing: `database_path` (path to the DuckDB database to prune); `keep_singletons` (if True, preserve singleton nodes in the main table; the default); `remove_singletons` (if True, move singleton nodes to the `singleton_nodes` table); `min_component_size` (minimum connected component size, not yet implemented); `quiet` (suppress console output); `show_progress` (display progress during operations) | *required* |
Returns:
| Type | Description |
|---|---|
| `PruneResult` | PruneResult containing: `database_path` (path to the pruned database); `dangling_edges_moved` (count of edges moved to the `dangling_edges` table); `singleton_nodes_moved` (count of nodes moved to the `singleton_nodes` table); `singleton_nodes_kept` (count of singleton nodes preserved in the main table); `final_stats` (DatabaseStats with final node/edge counts); `dangling_edges_by_source` (breakdown of dangling edges by `file_source`); `missing_nodes_by_source` (count of missing node IDs by source); `total_time_seconds` (operation duration); `success` (whether the operation completed successfully) |
Raises:
| Type | Description |
|---|---|
| `Exception` | If database operations fail |
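A usage sketch; `kg_ops` is a placeholder import and the keyword arguments mirror the documented config fields:

```python
# The kg_ops module name is a placeholder; adjust the import to the actual package.
from kg_ops import PruneConfig, prune_graph

config = PruneConfig(
    database_path="merged.duckdb",
    remove_singletons=True,  # move singleton nodes to the singleton_nodes table
)
result = prune_graph(config)
print(result.dangling_edges_moved, "dangling edges,", result.singleton_nodes_moved, "singletons moved")
```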
append_graphs
Append additional KGX files to an existing graph database.
append_graphs
Append new KGX files to an existing DuckDB database with schema evolution.
This operation adds records from new KGX files to an existing database, automatically handling schema differences. New columns in the input files are added to the existing tables, allowing incremental updates to a graph without re-processing all source files.
The append process:

1. Connects to an existing DuckDB database
2. Records the initial schema and statistics
3. For each new file, loads into a temp table and compares schema
4. Adds any new columns to the main table (schema evolution)
5. Inserts records using UNION ALL BY NAME for schema compatibility
6. Optionally deduplicates after appending
7. Generates a schema report if requested
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `config` | `AppendConfig` | AppendConfig containing: `database_path` (path to the existing DuckDB database); `node_files` (list of FileSpec objects for new node files); `edge_files` (list of FileSpec objects for new edge files); `deduplicate` (whether to remove duplicates after appending); `schema_reporting` (whether to generate a schema analysis report); `quiet` (suppress console output); `show_progress` (display progress bars during loading) | *required* |
Returns:
| Type | Description |
|---|---|
| `AppendResult` | AppendResult containing: `database_path` (path to the updated database); `files_loaded` (list of FileLoadResult with per-file statistics); `records_added` (net change in record count, nodes + edges); `new_columns_added` (count of new columns added via schema evolution); `schema_changes` (list of descriptions of schema changes); `final_stats` (DatabaseStats with updated counts); `schema_report` (optional schema analysis if enabled); `duplicates_handled` (count of duplicates removed, if deduplication is enabled); `total_time_seconds` (operation duration) |
Raises:
| Type | Description |
|---|---|
| `Exception` | If the database connection fails or a file operation errors |
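A usage sketch for incrementally extending an existing database; `kg_ops` is a placeholder import:

```python
# The kg_ops module name is a placeholder; adjust the import to the actual package.
from kg_ops import AppendConfig, append_graphs, prepare_file_specs_from_paths

node_specs, edge_specs = prepare_file_specs_from_paths(
    node_paths=["new_source_nodes.tsv"], edge_paths=["new_source_edges.tsv"]
)

config = AppendConfig(
    database_path="merged.duckdb",
    node_files=node_specs,
    edge_files=edge_specs,
    deduplicate=True,
)
result = append_graphs(config)
print(result.records_added, "records added,", result.new_columns_added, "new columns")
```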
Reporting Functions
These functions generate various reports and statistics about graph databases.
generate_qc_report
Generate a quality control report.
generate_qc_report
Generate a comprehensive quality control report for a graph database.
This operation analyzes a graph database and produces a detailed QC report including node/edge counts by source, duplicate detection, dangling edge analysis, and singleton node counts. The report can be grouped by different columns (e.g., provided_by, file_source).
The QC report includes:

- Summary: Total nodes, edges, duplicates, dangling edges, singletons
- Nodes by source: Count and category breakdown per source
- Edges by source: Count and predicate breakdown per source
- Dangling edges by source: Edges pointing to non-existent nodes
- Duplicate analysis: Nodes/edges with duplicate IDs
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `config` | `QCReportConfig` | QCReportConfig containing: `database_path` (path to the DuckDB database to analyze); `output_file` (optional path to write the YAML report); `group_by` (column to group statistics by; default `"provided_by"`); `quiet` (suppress console output) | *required* |
Returns:
| Type | Description |
|---|---|
| `QCReportResult` | QCReportResult containing: `qc_report` (QCReport with all analysis data); `output_file` (path where the report was written, if specified); `total_time_seconds` (report generation duration) |
Raises:
| Type | Description |
|---|---|
| `FileNotFoundError` | If the database does not exist |
| `Exception` | If database analysis fails |
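A usage sketch; `kg_ops` is a placeholder import and the keyword arguments mirror the documented config fields:

```python
# The kg_ops module name is a placeholder; adjust the import to the actual package.
from kg_ops import QCReportConfig, generate_qc_report

config = QCReportConfig(
    database_path="merged.duckdb",
    output_file="qc_report.yaml",
    group_by="provided_by",
)
result = generate_qc_report(config)
print("report written to", result.output_file)
```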
generate_graph_stats
Generate graph statistics.
generate_graph_stats
Generate comprehensive statistical analysis of a graph database.
This operation produces detailed statistics about a graph's structure, including node category distributions, edge predicate distributions, degree statistics, and biolink model compliance analysis.
The statistics report includes:

- Node statistics: Total count, unique categories, category distribution
- Edge statistics: Total count, unique predicates, predicate distribution
- Predicate details: Subject/object category pairs for each predicate
- Biolink compliance: Validation against biolink model categories/predicates
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `config` | `GraphStatsConfig` | GraphStatsConfig containing: `database_path` (path to the DuckDB database to analyze); `output_file` (optional path to write the YAML report); `quiet` (suppress console output) | *required* |
Returns:
| Type | Description |
|---|---|
| `GraphStatsResult` | GraphStatsResult containing: `stats_report` (GraphStatsReport with all statistics); `output_file` (path where the report was written, if specified); `total_time_seconds` (report generation duration) |
Raises:
| Type | Description |
|---|---|
| `FileNotFoundError` | If the database does not exist |
| `Exception` | If database analysis fails |
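A usage sketch; `kg_ops` is a placeholder import:

```python
# The kg_ops module name is a placeholder; adjust the import to the actual package.
from kg_ops import GraphStatsConfig, generate_graph_stats

result = generate_graph_stats(
    GraphStatsConfig(database_path="merged.duckdb", output_file="graph_stats.yaml")
)
print("stats written to", result.output_file)
```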
generate_schema_compliance_report
Generate a schema analysis and biolink compliance report.
generate_schema_compliance_report
Generate a schema analysis and biolink compliance report.
This operation analyzes the schema (columns and data types) of the nodes and edges tables, comparing them against expected biolink model properties and identifying any non-standard or missing columns.
The schema report includes:

- Table schemas: Column names and DuckDB data types for nodes/edges
- Biolink compliance: Which columns match biolink model slots
- Non-standard columns: Columns not in the biolink model
- Data type analysis: Column type distributions and potential issues
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `config` | `SchemaReportConfig` | SchemaReportConfig containing: `database_path` (path to the DuckDB database to analyze); `output_file` (optional path to write the YAML report); `quiet` (suppress console output) | *required* |
Returns:
| Type | Description |
|---|---|
| `SchemaReportResult` | SchemaReportResult containing: `schema_report` (SchemaAnalysisReport with all analysis); `output_file` (path where the report was written, if specified); `total_time_seconds` (report generation duration) |
Raises:
| Type | Description |
|---|---|
| `FileNotFoundError` | If the database does not exist |
| `Exception` | If schema analysis fails |
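A usage sketch; `kg_ops` is a placeholder import:

```python
# The kg_ops module name is a placeholder; adjust the import to the actual package.
from kg_ops import SchemaReportConfig, generate_schema_compliance_report

result = generate_schema_compliance_report(
    SchemaReportConfig(database_path="merged.duckdb", output_file="schema_report.yaml")
)
print("schema report written to", result.output_file)
```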
generate_node_report
Generate a tabular node report grouped by categorical columns.
generate_node_report
Generate a tabular node report grouped by categorical columns.
This operation creates a summary report of nodes grouped by specified categorical columns (e.g., category, provided_by, namespace), with counts for each unique combination. Useful for understanding the distribution and composition of nodes in a graph.
Can operate on either an existing DuckDB database or load nodes directly from a KGX file into an in-memory database.
Special handling: - "namespace" column: Extracts the CURIE prefix from the id column
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `config` | `NodeReportConfig` | NodeReportConfig containing: `database_path` (path to an existing database, or None to load from file); `node_file` (FileSpec for direct file loading if no `database_path`); `categorical_columns` (list of columns to group by, e.g. `["category", "provided_by"]`); `output_file` (path to write the report: TSV, CSV, or Parquet); `output_format` (format for the output file); `quiet` (suppress console output) | *required* |
Returns:
| Type | Description |
|---|---|
| `NodeReportResult` | NodeReportResult containing: `output_file` (path where the report was written); `total_rows` (number of unique combinations in the report); `total_time_seconds` (report generation duration) |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If no valid columns are found for the report |
| `Exception` | If database operations fail |
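A usage sketch against an existing database; `kg_ops` is a placeholder import:

```python
# The kg_ops module name is a placeholder; adjust the import to the actual package.
from kg_ops import NodeReportConfig, generate_node_report

config = NodeReportConfig(
    database_path="merged.duckdb",
    categorical_columns=["category", "provided_by", "namespace"],
    output_file="node_report.tsv",
)
result = generate_node_report(config)
print(f"{result.total_rows} rows written to {result.output_file}")
```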
generate_edge_report
Generate a tabular edge report with denormalized node information.
generate_edge_report
Generate a tabular edge report with denormalized node information.
This operation creates a summary report of edges grouped by specified categorical columns, with counts for each unique combination. The report can include denormalized node information (subject_category, object_category) by joining edges with the nodes table.
Creates a 'denormalized_edges' view that joins edges with nodes to provide subject and object category information alongside edge predicates.
Special handling: - "subject_namespace" / "object_namespace": Extract CURIE prefixes from subject/object
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `config` | `EdgeReportConfig` | EdgeReportConfig containing: `database_path` (path to an existing database, or None to load from files); `node_file` (FileSpec for the node file if no `database_path`); `edge_file` (FileSpec for the edge file if no `database_path`); `categorical_columns` (list of columns to group by, e.g. `["predicate", "subject_category", "object_category"]`); `output_file` (path to write the report: TSV, CSV, or Parquet); `output_format` (format for the output file); `quiet` (suppress console output) | *required* |
Returns:
| Type | Description |
|---|---|
| `EdgeReportResult` | EdgeReportResult containing: `output_file` (path where the report was written); `total_rows` (number of unique combinations in the report); `total_time_seconds` (report generation duration) |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If no valid columns are found for the report |
| `Exception` | If database operations fail |
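A usage sketch; `kg_ops` is a placeholder import:

```python
# The kg_ops module name is a placeholder; adjust the import to the actual package.
from kg_ops import EdgeReportConfig, generate_edge_report

config = EdgeReportConfig(
    database_path="merged.duckdb",
    categorical_columns=["predicate", "subject_category", "object_category"],
    output_file="edge_report.tsv",
)
result = generate_edge_report(config)
print(f"{result.total_rows} rows written to {result.output_file}")
```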
generate_node_examples
Generate sample nodes for each node type in the graph.
generate_node_examples
Generate sample nodes for each node type (category) in the graph.
This operation extracts N representative examples for each unique value of the specified type column (typically "category"). Useful for data exploration, QC review, and documentation purposes.
Uses DuckDB window functions to efficiently sample N rows per type: `ROW_NUMBER() OVER (PARTITION BY type_column ORDER BY id) <= sample_size`.
Can operate on either an existing DuckDB database or load nodes directly from a KGX file into an in-memory database.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `config` | `NodeExamplesConfig` | NodeExamplesConfig containing: `database_path` (path to an existing database, or None to load from file); `node_file` (FileSpec for direct file loading if no `database_path`); `type_column` (column to partition by; default `"category"`); `sample_size` (number of examples per type; default 5); `output_file` (path to write the examples: TSV, CSV, or Parquet); `output_format` (format for the output file); `quiet` (suppress console output) | *required* |
Returns:
| Type | Description |
|---|---|
| `NodeExamplesResult` | NodeExamplesResult containing: `output_file` (path where the examples were written); `types_sampled` (number of unique types found); `total_examples` (total number of example rows written); `total_time_seconds` (report generation duration) |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If `type_column` is not found in the nodes table |
| `Exception` | If database operations fail |
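A usage sketch; `kg_ops` is a placeholder import:

```python
# The kg_ops module name is a placeholder; adjust the import to the actual package.
from kg_ops import NodeExamplesConfig, generate_node_examples

config = NodeExamplesConfig(
    database_path="merged.duckdb",
    type_column="category",
    sample_size=5,
    output_file="node_examples.tsv",
)
result = generate_node_examples(config)
print(result.types_sampled, "types,", result.total_examples, "example rows")
```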
generate_edge_examples
Generate sample edges for each edge type pattern in the graph.
generate_edge_examples
Generate sample edges for each edge type pattern in the graph.
This operation extracts N representative examples for each unique combination of (subject_category, predicate, object_category). Useful for data exploration, QC review, and understanding the relationship patterns in a knowledge graph.
Creates a denormalized view joining edges with nodes to include subject and object category information, then uses DuckDB window functions to efficiently sample N rows per edge type pattern.
Can operate on either an existing DuckDB database or load nodes and edges directly from KGX files into an in-memory database.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `config` | `EdgeExamplesConfig` | EdgeExamplesConfig containing: `database_path` (path to an existing database, or None to load from files); `node_file` (FileSpec for the node file if no `database_path`); `edge_file` (FileSpec for the edge file if no `database_path`); `sample_size` (number of examples per edge type; default 5); `output_file` (path to write the examples: TSV, CSV, or Parquet); `output_format` (format for the output file); `quiet` (suppress console output) | *required* |
Returns:
| Type | Description |
|---|---|
| `EdgeExamplesResult` | EdgeExamplesResult containing: `output_file` (path where the examples were written); `types_sampled` (number of unique edge type patterns found); `total_examples` (total number of example rows written); `total_time_seconds` (report generation duration) |
Raises:
| Type | Description |
|---|---|
| `Exception` | If database operations fail |
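A usage sketch; `kg_ops` is a placeholder import:

```python
# The kg_ops module name is a placeholder; adjust the import to the actual package.
from kg_ops import EdgeExamplesConfig, generate_edge_examples

config = EdgeExamplesConfig(
    database_path="merged.duckdb",
    sample_size=3,
    output_file="edge_examples.tsv",
)
result = generate_edge_examples(config)
print(result.types_sampled, "edge patterns,", result.total_examples, "example rows")
```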
Helper Functions
These utility functions simplify common tasks like converting file paths to FileSpec objects.
prepare_file_specs_from_paths
Convert file paths to FileSpec objects with format auto-detection.
prepare_file_specs_from_paths
Convert file paths to FileSpec objects with format auto-detection.
This CLI helper expands glob patterns and creates FileSpec objects for each matched file. The file format (TSV, JSONL, Parquet) is auto-detected from the file extension. Each file's stem is used as its source_name for provenance tracking.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `node_paths` | `list[str]` | List of node file paths or glob patterns (e.g., `"data/*.tsv"`) | *required* |
| `edge_paths` | `list[str]` | List of edge file paths or glob patterns (e.g., `"data/*_edges.jsonl"`) | *required* |
Returns:
| Type | Description |
|---|---|
| `tuple[list[FileSpec], list[FileSpec]]` | Tuple of (node_file_specs, edge_file_specs) with FileSpec objects configured for the appropriate file type (NODES or EDGES) |
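A usage sketch; `kg_ops` is a placeholder import:

```python
# The kg_ops module name is a placeholder; adjust the import to the actual package.
from kg_ops import prepare_file_specs_from_paths

node_specs, edge_specs = prepare_file_specs_from_paths(
    node_paths=["data/*_nodes.tsv", "extra_nodes.jsonl"],
    edge_paths=["data/*_edges.tsv"],
)
print(len(node_specs), "node specs,", len(edge_specs), "edge specs")
```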
prepare_merge_config_from_paths
Create a MergeConfig from file paths with automatic FileSpec generation.
prepare_merge_config_from_paths
Create a MergeConfig from file paths with automatic FileSpec generation.
This CLI helper converts Path objects to FileSpec objects and assembles a complete MergeConfig. File formats are auto-detected from extensions, and file stems are used as source names for provenance tracking.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
node_files
|
list[Path]
|
List of Path objects for node KGX files |
required |
edge_files
|
list[Path]
|
List of Path objects for edge KGX files |
required |
mapping_files
|
list[Path]
|
List of Path objects for SSSOM mapping files |
required |
output_database
|
Path | None
|
Optional path for persistent output database |
None
|
skip_normalize
|
bool
|
If True, skip the normalization step |
False
|
skip_prune
|
bool
|
If True, skip the pruning step |
False
|
**kwargs
|
Any
|
Additional MergeConfig parameters (e.g., quiet, show_progress, export_final, export_directory, archive, compress, graph_name) |
{}
|
Returns:
| Type | Description |
|---|---|
| `MergeConfig` | Fully configured MergeConfig ready for `merge_graphs()` |
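A usage sketch showing the helper as a shortcut into the merge pipeline; `kg_ops` is a placeholder import:

```python
from pathlib import Path

# The kg_ops module name is a placeholder; adjust the import to the actual package.
from kg_ops import merge_graphs, prepare_merge_config_from_paths

config = prepare_merge_config_from_paths(
    node_files=[Path("a_nodes.tsv"), Path("b_nodes.tsv")],
    edge_files=[Path("a_edges.tsv"), Path("b_edges.tsv")],
    mapping_files=[Path("mappings.sssom.tsv")],
    output_database=Path("merged.duckdb"),
    export_final=True,  # forwarded to MergeConfig via **kwargs
    export_directory=Path("exports"),
)
result = merge_graphs(config)
```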
prepare_mapping_file_specs_from_paths
Convert SSSOM mapping file paths to FileSpec objects.
prepare_mapping_file_specs_from_paths
Convert a list of SSSOM mapping file paths to FileSpec objects.
This CLI helper creates FileSpec objects for SSSOM mapping files, which are always in TSV format. Each file's stem is used as its source_name for tracking which mappings came from which file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `mapping_paths` | `list[Path]` | List of Path objects pointing to SSSOM mapping files | *required* |
| `source_name` | `str \| None` | Optional source name to apply to all files (overrides per-file names) | `None` |
Returns:
| Type | Description |
|---|---|
| `list[FileSpec]` | List of FileSpec objects configured for SSSOM mapping files |
Raises:
| Type | Description |
|---|---|
| `FileNotFoundError` | If any mapping file does not exist |
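A usage sketch; `kg_ops` is a placeholder import:

```python
from pathlib import Path

# The kg_ops module name is a placeholder; adjust the import to the actual package.
from kg_ops import prepare_mapping_file_specs_from_paths

specs = prepare_mapping_file_specs_from_paths(
    [Path("hp_mesh.sssom.tsv"), Path("mondo_doid.sssom.tsv")],
    source_name="sssom_mappings",  # optional; overrides per-file source names
)
```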