Python API Reference
This reference documentation is auto-generated from the source code docstrings using mkdocstrings. Each function includes parameter descriptions, return types, and usage examples.
Core Operations
These are the primary graph transformation functions. Each operation takes a configuration object and returns a result object with statistics and output information.
join_graphs
Combine multiple KGX files into a single DuckDB database.
join_graphs
Join multiple KGX files into a unified DuckDB database.
This operation loads node and edge files from various formats (TSV, JSONL, Parquet) into a single DuckDB database, combining them using UNION ALL BY NAME to handle schema differences across files. Each file's records are tagged with a source identifier for provenance tracking.
The join process:

1. Creates or opens a DuckDB database (in-memory if no path specified)
2. Loads each node file into a temporary table with format auto-detection
3. Loads each edge file into a temporary table with format auto-detection
4. Optionally generates a schema report analyzing column types and values
5. Combines all temporary tables into final `nodes` and `edges` tables
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `config` | `JoinConfig` | JoinConfig containing: `node_files` (list of FileSpec objects for node files); `edge_files` (list of FileSpec objects for edge files); `database_path` (optional path for a persistent database, `None` for in-memory); `schema_reporting` (whether to generate a schema analysis report); `generate_provided_by` (whether to add a `provided_by` column from source names); `quiet` (suppress console output); `show_progress` (display progress bars during loading) | *required* |
Returns:
| Type | Description |
|---|---|
| `JoinResult` | JoinResult containing: `files_loaded` (list of FileLoadResult with per-file statistics); `final_stats` (DatabaseStats with node/edge counts and database size); `schema_report` (optional schema analysis if `schema_reporting` is enabled); `total_time_seconds` (operation duration); `database_path` (path to the created database) |
Raises:
| Type | Description |
|---|---|
| `Exception` | If file loading fails or a database operation errors |
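A minimal usage sketch. The `kg_ops` import path is a placeholder for the actual package name, and `JoinConfig` is assumed to accept its documented fields as keyword arguments:

```python
# The kg_ops module name is a placeholder; adjust the import to the actual package.
from kg_ops import JoinConfig, join_graphs, prepare_file_specs_from_paths

# Build FileSpec lists from glob patterns; formats are auto-detected from extensions.
node_specs, edge_specs = prepare_file_specs_from_paths(
    node_paths=["data/*_nodes.tsv"],
    edge_paths=["data/*_edges.tsv"],
)

config = JoinConfig(
    node_files=node_specs,
    edge_files=edge_specs,
    database_path="graph.duckdb",  # None would build an in-memory database
    schema_reporting=True,
)
result = join_graphs(config)
print(result.final_stats, result.total_time_seconds)
```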
split_graph
Split a graph database into separate files based on column values.
split_graph
Split a KGX file into multiple output files based on field values.
This operation partitions a single KGX file (nodes or edges) into separate files based on the unique values of one or more specified fields. Supports format conversion between TSV, JSONL, and Parquet during the split.
The split process:

1. Loads the input file into an in-memory DuckDB database
2. Identifies all unique value combinations for the specified split fields
3. For each unique combination, exports matching records to a separate file
4. Output filenames are generated from the split field values

Handles multivalued fields (arrays) by using `list_contains()` for filtering, allowing records to appear in multiple output files if they contain multiple values in the split field.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `config` | `SplitConfig` | SplitConfig containing: `input_file` (FileSpec for the input KGX file); `split_fields` (list of column names to split on, e.g. `["provided_by"]`); `output_directory` (path where split files will be written); `output_format` (target format: TSV, JSONL, or Parquet; defaults to the input format); `remove_prefixes` (strip CURIE prefixes from values in output filenames); `quiet` (suppress console output); `show_progress` (display a progress bar during splitting) | *required* |
Returns:
| Type | Description |
|---|---|
| `SplitResult` | SplitResult containing: `input_file` (the original input FileSpec); `output_files` (list of Path objects for created files); `total_records_split` (total number of records processed); `split_values` (list of dicts showing the field value combinations); `total_time_seconds` (operation duration) |
Raises:
| Type | Description |
|---|---|
| `FileNotFoundError` | If the input file does not exist |
| `Exception` | If loading or export operations fail |
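A usage sketch; `kg_ops` is a placeholder import, and the input FileSpec is built via the path helper (with an empty edge list) purely to avoid guessing the FileSpec constructor:

```python
# The kg_ops module name is a placeholder; adjust the import to the actual package.
from kg_ops import SplitConfig, split_graph, prepare_file_specs_from_paths

node_specs, _ = prepare_file_specs_from_paths(
    node_paths=["merged_nodes.tsv"], edge_paths=[]
)

config = SplitConfig(
    input_file=node_specs[0],
    split_fields=["provided_by"],
    output_directory="split_output",
)
result = split_graph(config)
print(f"{result.total_records_split} records split into {len(result.output_files)} files")
```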
merge_graphs
End-to-end pipeline combining join, deduplicate, normalize, and prune operations.
merge_graphs
Execute the complete merge pipeline: join → deduplicate → normalize → prune.
This composite operation orchestrates multiple graph operations in sequence to create a clean, normalized, and validated graph database from multiple source files. It's the recommended way to build a production-ready knowledge graph from raw KGX files and SSSOM mappings.
The pipeline steps:

1. Join: Load all node/edge files into a unified DuckDB database
2. Deduplicate: Remove duplicate nodes/edges by ID (optional, skip with `skip_deduplicate`)
3. Normalize: Apply SSSOM mappings to harmonize identifiers (optional, skip with `skip_normalize`)
4. Prune: Handle dangling edges and singleton nodes (optional, skip with `skip_prune`)
5. Export: Optionally export the final graph to TSV/JSONL/Parquet files or archive
Each step can be skipped via config flags. If a step fails, the pipeline can either abort (default) or continue with remaining steps (continue_on_pipeline_step_error).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `config` | `MergeConfig` | MergeConfig containing: `node_files` (list of FileSpec objects for node files); `edge_files` (list of FileSpec objects for edge files); `mapping_files` (list of FileSpec objects for SSSOM mapping files); `output_database` (path for the output database, a temp database if not specified); `skip_deduplicate` (skip the deduplication step); `skip_normalize` (skip the normalization step, also skipped if no mapping files); `skip_prune` (skip the pruning step); `keep_singletons`/`remove_singletons` (singleton node handling in the prune step); `export_final` (whether to export the final graph to files); `export_directory` (where to write exported files); `output_format` (format for exported files: TSV, JSONL, or Parquet); `archive` (create a tar archive instead of loose files); `compress` (gzip-compress the archive); `graph_name` (name prefix for exported files); `continue_on_pipeline_step_error` (continue the pipeline if a step fails); `schema_reporting` (generate a schema analysis report); `quiet` (suppress console output); `show_progress` (display progress bars) | *required* |
Returns:
| Type | Description |
|---|---|
| `MergeResult` | MergeResult containing: `success` (whether the full pipeline completed successfully); `join_result`/`deduplicate_result`/`normalize_result`/`prune_result` (per-step results); `operations_completed` (list of steps that completed successfully); `operations_skipped` (list of steps that were skipped); `final_stats` (DatabaseStats with final node/edge counts); `database_path` (path to the output database, `None` if a temp database was used); `exported_files` (list of paths to exported files); `total_time_seconds` (total pipeline duration); `summary` (OperationSummary with overall status); `errors` (list of error messages from failed steps); `warnings` (list of warning messages) |
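A usage sketch of the full pipeline; `kg_ops` is a placeholder import, `MergeConfig` is assumed to accept its documented fields as keyword arguments, and the FileSpec lists are built with the helpers documented under Helper Functions below:

```python
from pathlib import Path

# The kg_ops module name is a placeholder; adjust the import to the actual package.
from kg_ops import (
    MergeConfig,
    merge_graphs,
    prepare_file_specs_from_paths,
    prepare_mapping_file_specs_from_paths,
)

node_specs, edge_specs = prepare_file_specs_from_paths(
    node_paths=["sources/*_nodes.tsv"], edge_paths=["sources/*_edges.tsv"]
)
mapping_specs = prepare_mapping_file_specs_from_paths([Path("mappings.sssom.tsv")])

config = MergeConfig(
    node_files=node_specs,
    edge_files=edge_specs,
    mapping_files=mapping_specs,
    output_database=Path("merged.duckdb"),
    export_final=True,
    export_directory=Path("exports"),
)
result = merge_graphs(config)
print("completed:", result.operations_completed, "skipped:", result.operations_skipped)
if not result.success:
    print(result.errors)
```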
normalize_graph
Apply SSSOM mappings to normalize identifiers in a graph.
normalize_graph
Apply SSSOM mappings to normalize node identifiers in edge references.
This operation uses SSSOM (Simple Standard for Sharing Ontological Mappings) files to replace node identifiers in the edges table with their canonical equivalents. This is useful for harmonizing identifiers from different sources to a common namespace.
The normalization process:

1. Loads SSSOM mapping files (TSV format with YAML header)
2. Creates a mappings table, deduplicating by `object_id` to prevent edge duplication
3. Updates edge subject/object columns using the mappings (`object_id` -> `subject_id`)
4. Preserves original identifiers in `original_subject`/`original_object` columns
Note: Only edge references are normalized. Node IDs in the nodes table are not modified; use the mappings to update node IDs separately if needed.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `config` | `NormalizeConfig` | NormalizeConfig containing: `database_path` (path to the DuckDB database to normalize); `mapping_files` (list of FileSpec objects for SSSOM mapping files); `quiet` (suppress console output); `show_progress` (display progress bars during loading) | *required* |
Returns:
| Type | Description |
|---|---|
| `NormalizeResult` | NormalizeResult containing: `success` (whether the operation completed successfully); `mappings_loaded` (list of FileLoadResult with per-file statistics); `edges_normalized` (count of edge references that were updated); `final_stats` (DatabaseStats with node/edge counts); `total_time_seconds` (operation duration); `summary` (OperationSummary with status and messages); `errors` (list of error messages, if any); `warnings` (list of warnings, e.g. duplicate mappings found) |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If no nodes/edges tables exist or no mapping files load |
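A usage sketch; `kg_ops` is a placeholder import and the keyword arguments mirror the documented config fields:

```python
from pathlib import Path

# The kg_ops module name is a placeholder; adjust the import to the actual package.
from kg_ops import (
    NormalizeConfig,
    normalize_graph,
    prepare_mapping_file_specs_from_paths,
)

mapping_specs = prepare_mapping_file_specs_from_paths([Path("hp_mesh.sssom.tsv")])

config = NormalizeConfig(database_path="merged.duckdb", mapping_files=mapping_specs)
result = normalize_graph(config)
print(f"{result.edges_normalized} edge references rewritten")
for warning in result.warnings:
    print("warning:", warning)
```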
deduplicate_graph
Remove duplicate nodes and edges from a graph database.
deduplicate_graph
Deduplicate nodes and edges in a graph database.
This operation:

1. Identifies nodes/edges with duplicate IDs
2. Copies ALL duplicate rows to `duplicate_nodes`/`duplicate_edges` tables
3. Keeps only the first occurrence in the main tables (ordered by `file_source`)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `config` | `DeduplicateConfig` | DeduplicateConfig with database path and options | *required* |
Returns:
| Type | Description |
|---|---|
| `DeduplicateResult` | DeduplicateResult with deduplication statistics |
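A usage sketch; `kg_ops` is a placeholder import, and the `database_path` field name is assumed to match the other config objects documented on this page:

```python
# The kg_ops module name is a placeholder; adjust the import to the actual package.
from kg_ops import DeduplicateConfig, deduplicate_graph

# Duplicate rows are preserved in duplicate_nodes/duplicate_edges for inspection.
result = deduplicate_graph(DeduplicateConfig(database_path="merged.duckdb"))
print(result)
```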
prune_graph
Remove dangling edges and optionally singleton nodes from a graph.
prune_graph
Clean up graph integrity issues by handling dangling edges and singleton nodes.
This operation identifies and handles two common graph quality issues:

- Dangling edges: Edges where subject or object IDs don't exist in the nodes table
- Singleton nodes: Nodes that don't appear as subject or object in any edge

The prune process:

1. Identifies dangling edges and moves them to a `dangling_edges` table
2. Based on config, either keeps or moves singleton nodes to a `singleton_nodes` table
3. Optionally filters by minimum connected component size (not yet implemented)
Dangling edges and singleton nodes are preserved in separate tables for QC analysis rather than being deleted, allowing investigation of data issues.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `config` | `PruneConfig` | PruneConfig containing: `database_path` (path to the DuckDB database to prune); `keep_singletons` (if True, preserve singleton nodes in the main table; the default); `remove_singletons` (if True, move singleton nodes to the `singleton_nodes` table); `min_component_size` (minimum connected component size, not yet implemented); `quiet` (suppress console output); `show_progress` (display progress during operations) | *required* |
Returns:
| Type | Description |
|---|---|
| `PruneResult` | PruneResult containing: `database_path` (path to the pruned database); `dangling_edges_moved` (count of edges moved to the `dangling_edges` table); `singleton_nodes_moved` (count of nodes moved to the `singleton_nodes` table); `singleton_nodes_kept` (count of singleton nodes preserved in the main table); `final_stats` (DatabaseStats with final node/edge counts); `dangling_edges_by_source` (breakdown of dangling edges by `file_source`); `missing_nodes_by_source` (count of missing node IDs by source); `total_time_seconds` (operation duration); `success` (whether the operation completed successfully) |
Raises:
| Type | Description |
|---|---|
| `Exception` | If database operations fail |
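A usage sketch; `kg_ops` is a placeholder import and the keyword arguments mirror the documented config fields:

```python
# The kg_ops module name is a placeholder; adjust the import to the actual package.
from kg_ops import PruneConfig, prune_graph

config = PruneConfig(
    database_path="merged.duckdb",
    remove_singletons=True,  # move singleton nodes to the singleton_nodes table
)
result = prune_graph(config)
print(result.dangling_edges_moved, "dangling edges,", result.singleton_nodes_moved, "singletons moved")
```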
append_graphs
Append additional KGX files to an existing graph database.
append_graphs
Append new KGX files to an existing DuckDB database with schema evolution.
This operation adds records from new KGX files to an existing database, automatically handling schema differences. New columns in the input files are added to the existing tables, allowing incremental updates to a graph without re-processing all source files.
The append process:

1. Connects to an existing DuckDB database
2. Records the initial schema and statistics
3. For each new file, loads into a temp table and compares schema
4. Adds any new columns to the main table (schema evolution)
5. Inserts records using UNION ALL BY NAME for schema compatibility
6. Optionally deduplicates after appending
7. Generates a schema report if requested
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `config` | `AppendConfig` | AppendConfig containing: `database_path` (path to the existing DuckDB database); `node_files` (list of FileSpec objects for new node files); `edge_files` (list of FileSpec objects for new edge files); `deduplicate` (whether to remove duplicates after appending); `schema_reporting` (whether to generate a schema analysis report); `quiet` (suppress console output); `show_progress` (display progress bars during loading) | *required* |
Returns:
| Type | Description |
|---|---|
| `AppendResult` | AppendResult containing: `database_path` (path to the updated database); `files_loaded` (list of FileLoadResult with per-file statistics); `records_added` (net change in record count, nodes + edges); `new_columns_added` (count of new columns added via schema evolution); `schema_changes` (list of descriptions of schema changes); `final_stats` (DatabaseStats with updated counts); `schema_report` (optional schema analysis if enabled); `duplicates_handled` (count of duplicates removed, if deduplication is enabled); `total_time_seconds` (operation duration) |
Raises:
| Type | Description |
|---|---|
| `Exception` | If the database connection fails or a file operation errors |
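A usage sketch for incrementally extending an existing database; `kg_ops` is a placeholder import:

```python
# The kg_ops module name is a placeholder; adjust the import to the actual package.
from kg_ops import AppendConfig, append_graphs, prepare_file_specs_from_paths

node_specs, edge_specs = prepare_file_specs_from_paths(
    node_paths=["new_source_nodes.tsv"], edge_paths=["new_source_edges.tsv"]
)

config = AppendConfig(
    database_path="merged.duckdb",
    node_files=node_specs,
    edge_files=edge_specs,
    deduplicate=True,
)
result = append_graphs(config)
print(result.records_added, "records added,", result.new_columns_added, "new columns")
```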
Reporting Functions
These functions generate various reports and statistics about graph databases.
generate_qc_report
Generate a quality control report.
generate_qc_report
Generate a comprehensive quality control report for a graph database.
This operation analyzes a graph database and produces a detailed QC report including node/edge counts by source, duplicate detection, dangling edge analysis, and singleton node counts. The report can be grouped by different columns (e.g., provided_by, file_source).
The QC report includes:

- Summary: Total nodes, edges, duplicates, dangling edges, singletons
- Nodes by source: Count and category breakdown per source
- Edges by source: Count and predicate breakdown per source
- Dangling edges by source: Edges pointing to non-existent nodes
- Duplicate analysis: Nodes/edges with duplicate IDs
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `config` | `QCReportConfig` | QCReportConfig containing: `database_path` (path to the DuckDB database to analyze); `output_file` (optional path to write the YAML report); `group_by` (column to group statistics by; default `"provided_by"`); `quiet` (suppress console output) | *required* |
Returns:
| Type | Description |
|---|---|
| `QCReportResult` | QCReportResult containing: `qc_report` (QCReport with all analysis data); `output_file` (path where the report was written, if specified); `total_time_seconds` (report generation duration) |
Raises:
| Type | Description |
|---|---|
| `FileNotFoundError` | If the database does not exist |
| `Exception` | If database analysis fails |
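A usage sketch; `kg_ops` is a placeholder import and the keyword arguments mirror the documented config fields:

```python
# The kg_ops module name is a placeholder; adjust the import to the actual package.
from kg_ops import QCReportConfig, generate_qc_report

config = QCReportConfig(
    database_path="merged.duckdb",
    output_file="qc_report.yaml",
    group_by="provided_by",
)
result = generate_qc_report(config)
print("report written to", result.output_file)
```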
generate_graph_stats
Generate graph statistics.
generate_graph_stats
Generate comprehensive statistical analysis of a graph database.
This operation produces detailed statistics about a graph's structure, including node category distributions, edge predicate distributions, degree statistics, and biolink model compliance analysis.
The statistics report includes:

- Node statistics: Total count, unique categories, category distribution
- Edge statistics: Total count, unique predicates, predicate distribution
- Predicate details: Subject/object category pairs for each predicate
- Biolink compliance: Validation against biolink model categories/predicates
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `config` | `GraphStatsConfig` | GraphStatsConfig containing: `database_path` (path to the DuckDB database to analyze); `output_file` (optional path to write the YAML report); `quiet` (suppress console output) | *required* |
Returns:
| Type | Description |
|---|---|
| `GraphStatsResult` | GraphStatsResult containing: `stats_report` (GraphStatsReport with all statistics); `output_file` (path where the report was written, if specified); `total_time_seconds` (report generation duration) |
Raises:
| Type | Description |
|---|---|
| `FileNotFoundError` | If the database does not exist |
| `Exception` | If database analysis fails |
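A usage sketch; `kg_ops` is a placeholder import:

```python
# The kg_ops module name is a placeholder; adjust the import to the actual package.
from kg_ops import GraphStatsConfig, generate_graph_stats

result = generate_graph_stats(
    GraphStatsConfig(database_path="merged.duckdb", output_file="graph_stats.yaml")
)
print("stats written to", result.output_file)
```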
generate_schema_compliance_report
Generate a schema analysis and biolink compliance report.
generate_schema_compliance_report
Generate a schema analysis and biolink compliance report.
This operation analyzes the schema (columns and data types) of the nodes and edges tables, comparing them against expected biolink model properties and identifying any non-standard or missing columns.
The schema report includes:

- Table schemas: Column names and DuckDB data types for nodes/edges
- Biolink compliance: Which columns match biolink model slots
- Non-standard columns: Columns not in the biolink model
- Data type analysis: Column type distributions and potential issues
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `config` | `SchemaReportConfig` | SchemaReportConfig containing: `database_path` (path to the DuckDB database to analyze); `output_file` (optional path to write the YAML report); `quiet` (suppress console output) | *required* |
Returns:
| Type | Description |
|---|---|
| `SchemaReportResult` | SchemaReportResult containing: `schema_report` (SchemaAnalysisReport with all analysis); `output_file` (path where the report was written, if specified); `total_time_seconds` (report generation duration) |
Raises:
| Type | Description |
|---|---|
| `FileNotFoundError` | If the database does not exist |
| `Exception` | If schema analysis fails |
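A usage sketch; `kg_ops` is a placeholder import:

```python
# The kg_ops module name is a placeholder; adjust the import to the actual package.
from kg_ops import SchemaReportConfig, generate_schema_compliance_report

result = generate_schema_compliance_report(
    SchemaReportConfig(database_path="merged.duckdb", output_file="schema_report.yaml")
)
print("schema report written to", result.output_file)
```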
generate_node_report
Generate a tabular node report grouped by categorical columns.
generate_node_report
Generate a tabular node report grouped by categorical columns.
This operation creates a summary report of nodes grouped by specified categorical columns (e.g., category, provided_by, namespace), with counts for each unique combination. Useful for understanding the distribution and composition of nodes in a graph.
Can operate on either an existing DuckDB database or load nodes directly from a KGX file into an in-memory database.
Special handling: - "namespace" column: Extracts the CURIE prefix from the id column
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `config` | `NodeReportConfig` | NodeReportConfig containing: `database_path` (path to an existing database, or None to load from file); `node_file` (FileSpec for direct file loading if no `database_path`); `categorical_columns` (list of columns to group by, e.g. `["category", "provided_by"]`); `output_file` (path to write the report: TSV, CSV, or Parquet); `output_format` (format for the output file); `quiet` (suppress console output) | *required* |
Returns:
| Type | Description |
|---|---|
| `NodeReportResult` | NodeReportResult containing: `output_file` (path where the report was written); `total_rows` (number of unique combinations in the report); `total_time_seconds` (report generation duration) |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If no valid columns are found for the report |
| `Exception` | If database operations fail |
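A usage sketch against an existing database; `kg_ops` is a placeholder import:

```python
# The kg_ops module name is a placeholder; adjust the import to the actual package.
from kg_ops import NodeReportConfig, generate_node_report

config = NodeReportConfig(
    database_path="merged.duckdb",
    categorical_columns=["category", "provided_by", "namespace"],
    output_file="node_report.tsv",
)
result = generate_node_report(config)
print(f"{result.total_rows} rows written to {result.output_file}")
```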
generate_edge_report
Generate a tabular edge report with denormalized node information.
generate_edge_report
Generate a tabular edge report with denormalized node information.
This operation creates a summary report of edges grouped by specified categorical columns, with counts for each unique combination. The report can include denormalized node information (subject_category, object_category) by joining edges with the nodes table.
Creates a 'denormalized_edges' view that joins edges with nodes to provide subject and object category information alongside edge predicates.
Special handling: - "subject_namespace" / "object_namespace": Extract CURIE prefixes from subject/object
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `config` | `EdgeReportConfig` | EdgeReportConfig containing: `database_path` (path to an existing database, or None to load from files); `node_file` (FileSpec for the node file if no `database_path`); `edge_file` (FileSpec for the edge file if no `database_path`); `categorical_columns` (list of columns to group by, e.g. `["predicate", "subject_category", "object_category"]`); `output_file` (path to write the report: TSV, CSV, or Parquet); `output_format` (format for the output file); `quiet` (suppress console output) | *required* |
Returns:
| Type | Description |
|---|---|
| `EdgeReportResult` | EdgeReportResult containing: `output_file` (path where the report was written); `total_rows` (number of unique combinations in the report); `total_time_seconds` (report generation duration) |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If no valid columns are found for the report |
| `Exception` | If database operations fail |
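A usage sketch; `kg_ops` is a placeholder import:

```python
# The kg_ops module name is a placeholder; adjust the import to the actual package.
from kg_ops import EdgeReportConfig, generate_edge_report

config = EdgeReportConfig(
    database_path="merged.duckdb",
    categorical_columns=["predicate", "subject_category", "object_category"],
    output_file="edge_report.tsv",
)
result = generate_edge_report(config)
print(f"{result.total_rows} rows written to {result.output_file}")
```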
generate_node_examples
Generate sample nodes for each node type in the graph.
generate_node_examples
Generate sample nodes for each node type (category) in the graph.
This operation extracts N representative examples for each unique value of the specified type column (typically "category"). Useful for data exploration, QC review, and documentation purposes.
Uses DuckDB window functions to efficiently sample N rows per type: `ROW_NUMBER() OVER (PARTITION BY type_column ORDER BY id) <= sample_size`.
Can operate on either an existing DuckDB database or load nodes directly from a KGX file into an in-memory database.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `config` | `NodeExamplesConfig` | NodeExamplesConfig containing: `database_path` (path to an existing database, or None to load from file); `node_file` (FileSpec for direct file loading if no `database_path`); `type_column` (column to partition by; default `"category"`); `sample_size` (number of examples per type; default 5); `output_file` (path to write the examples: TSV, CSV, or Parquet); `output_format` (format for the output file); `quiet` (suppress console output) | *required* |
Returns:
| Type | Description |
|---|---|
| `NodeExamplesResult` | NodeExamplesResult containing: `output_file` (path where the examples were written); `types_sampled` (number of unique types found); `total_examples` (total number of example rows written); `total_time_seconds` (report generation duration) |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If `type_column` is not found in the nodes table |
| `Exception` | If database operations fail |
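A usage sketch; `kg_ops` is a placeholder import:

```python
# The kg_ops module name is a placeholder; adjust the import to the actual package.
from kg_ops import NodeExamplesConfig, generate_node_examples

config = NodeExamplesConfig(
    database_path="merged.duckdb",
    type_column="category",
    sample_size=5,
    output_file="node_examples.tsv",
)
result = generate_node_examples(config)
print(result.types_sampled, "types,", result.total_examples, "example rows")
```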
generate_edge_examples
Generate sample edges for each edge type pattern in the graph.
generate_edge_examples
Generate sample edges for each edge type pattern in the graph.
This operation extracts N representative examples for each unique combination of (subject_category, predicate, object_category). Useful for data exploration, QC review, and understanding the relationship patterns in a knowledge graph.
Creates a denormalized view joining edges with nodes to include subject and object category information, then uses DuckDB window functions to efficiently sample N rows per edge type pattern.
Can operate on either an existing DuckDB database or load nodes and edges directly from KGX files into an in-memory database.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `config` | `EdgeExamplesConfig` | EdgeExamplesConfig containing: `database_path` (path to an existing database, or None to load from files); `node_file` (FileSpec for the node file if no `database_path`); `edge_file` (FileSpec for the edge file if no `database_path`); `sample_size` (number of examples per edge type; default 5); `output_file` (path to write the examples: TSV, CSV, or Parquet); `output_format` (format for the output file); `quiet` (suppress console output) | *required* |
Returns:
| Type | Description |
|---|---|
| `EdgeExamplesResult` | EdgeExamplesResult containing: `output_file` (path where the examples were written); `types_sampled` (number of unique edge type patterns found); `total_examples` (total number of example rows written); `total_time_seconds` (report generation duration) |
Raises:
| Type | Description |
|---|---|
| `Exception` | If database operations fail |
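A usage sketch; `kg_ops` is a placeholder import:

```python
# The kg_ops module name is a placeholder; adjust the import to the actual package.
from kg_ops import EdgeExamplesConfig, generate_edge_examples

config = EdgeExamplesConfig(
    database_path="merged.duckdb",
    sample_size=3,
    output_file="edge_examples.tsv",
)
result = generate_edge_examples(config)
print(result.types_sampled, "edge patterns,", result.total_examples, "example rows")
```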
Helper Functions
These utility functions simplify common tasks like converting file paths to FileSpec objects.
prepare_file_specs_from_paths
Convert file paths to FileSpec objects with format auto-detection.
prepare_file_specs_from_paths
Convert file paths to FileSpec objects with format auto-detection.
This CLI helper expands glob patterns and creates FileSpec objects for each matched file. The file format (TSV, JSONL, Parquet) is auto-detected from the file extension. Each file's stem is used as its source_name for provenance tracking.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `node_paths` | `list[str]` | List of node file paths or glob patterns (e.g., `"data/*.tsv"`) | *required* |
| `edge_paths` | `list[str]` | List of edge file paths or glob patterns (e.g., `"data/*_edges.jsonl"`) | *required* |
Returns:
| Type | Description |
|---|---|
| `tuple[list[FileSpec], list[FileSpec]]` | Tuple of (node_file_specs, edge_file_specs) with FileSpec objects configured for the appropriate file type (NODES or EDGES) |
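A usage sketch; `kg_ops` is a placeholder import:

```python
# The kg_ops module name is a placeholder; adjust the import to the actual package.
from kg_ops import prepare_file_specs_from_paths

node_specs, edge_specs = prepare_file_specs_from_paths(
    node_paths=["data/*_nodes.tsv", "extra_nodes.jsonl"],
    edge_paths=["data/*_edges.tsv"],
)
print(len(node_specs), "node specs,", len(edge_specs), "edge specs")
```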
prepare_merge_config_from_paths
Create a MergeConfig from file paths with automatic FileSpec generation.
prepare_merge_config_from_paths
Create a MergeConfig from file paths with automatic FileSpec generation.
This CLI helper converts Path objects to FileSpec objects and assembles a complete MergeConfig. File formats are auto-detected from extensions, and file stems are used as source names for provenance tracking.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
node_files
|
list[Path]
|
List of Path objects for node KGX files |
required |
edge_files
|
list[Path]
|
List of Path objects for edge KGX files |
required |
mapping_files
|
list[Path]
|
List of Path objects for SSSOM mapping files |
required |
output_database
|
Path | None
|
Optional path for persistent output database |
None
|
skip_normalize
|
bool
|
If True, skip the normalization step |
False
|
skip_prune
|
bool
|
If True, skip the pruning step |
False
|
**kwargs
|
Any
|
Additional MergeConfig parameters (e.g., quiet, show_progress, export_final, export_directory, archive, compress, graph_name) |
{}
|
Returns:
| Type | Description |
|---|---|
| `MergeConfig` | Fully configured MergeConfig ready for `merge_graphs()` |
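A usage sketch showing the helper as a shortcut into the merge pipeline; `kg_ops` is a placeholder import:

```python
from pathlib import Path

# The kg_ops module name is a placeholder; adjust the import to the actual package.
from kg_ops import merge_graphs, prepare_merge_config_from_paths

config = prepare_merge_config_from_paths(
    node_files=[Path("a_nodes.tsv"), Path("b_nodes.tsv")],
    edge_files=[Path("a_edges.tsv"), Path("b_edges.tsv")],
    mapping_files=[Path("mappings.sssom.tsv")],
    output_database=Path("merged.duckdb"),
    export_final=True,  # forwarded to MergeConfig via **kwargs
    export_directory=Path("exports"),
)
result = merge_graphs(config)
```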
prepare_mapping_file_specs_from_paths
Convert SSSOM mapping file paths to FileSpec objects.
prepare_mapping_file_specs_from_paths
Convert a list of SSSOM mapping file paths to FileSpec objects.
This CLI helper creates FileSpec objects for SSSOM mapping files, which are always in TSV format. Each file's stem is used as its source_name for tracking which mappings came from which file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `mapping_paths` | `list[Path]` | List of Path objects pointing to SSSOM mapping files | *required* |
| `source_name` | `str \| None` | Optional source name to apply to all files (overrides per-file names) | `None` |
Returns:
| Type | Description |
|---|---|
| `list[FileSpec]` | List of FileSpec objects configured for SSSOM mapping files |
Raises:
| Type | Description |
|---|---|
| `FileNotFoundError` | If any mapping file does not exist |
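A usage sketch; `kg_ops` is a placeholder import:

```python
from pathlib import Path

# The kg_ops module name is a placeholder; adjust the import to the actual package.
from kg_ops import prepare_mapping_file_specs_from_paths

specs = prepare_mapping_file_specs_from_paths(
    [Path("hp_mesh.sssom.tsv"), Path("mondo_doid.sssom.tsv")],
    source_name="sssom_mappings",  # optional; overrides per-file source names
)
```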