Explanation
This section covers concepts, architecture, and design decisions behind graph operations.
Topics
Architecture
Describes how graph operations are structured:
- Why DuckDB? - Columnar storage and SQL for graph processing
- In-memory vs persistent - Different operational modes (see the sketch after this list)
- Processing pipeline - Data flow through operations
- GraphDatabase context manager - Connection handling and transactions
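
Both the operational modes and the context manager ultimately rest on how a DuckDB connection is opened and closed. The sketch below shows that underlying behavior with plain `duckdb` connections; it is not the project's `GraphDatabase` class, and `graph.duckdb` is a placeholder file name.

```python
import duckdb

# In-memory mode: tables live in RAM and disappear when the connection
# closes. Suited to one-off merges and throwaway QC queries.
mem = duckdb.connect(":memory:")
mem.execute("CREATE TABLE nodes (id VARCHAR, category VARCHAR)")
mem.close()

# Persistent mode: tables are written to a database file and survive
# across sessions, so the merged graph can be queried again later.
# Using the connection as a context manager guarantees it is closed
# cleanly when the block exits. (graph.duckdb is a placeholder path.)
with duckdb.connect("graph.duckdb") as con:
    con.execute("CREATE TABLE IF NOT EXISTS nodes (id VARCHAR, category VARCHAR)")
```

The persistent mode is what lets a merged graph be queried in later sessions without re-running the pipeline.
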
Schema Handling
Covers how graph operations manage different schemas:
- The schema challenge - Why different sources have different columns
- UNION ALL BY NAME - DuckDB's schema harmonization approach (see the sketch after this list)
- Auto-detection - How formats and types are inferred
- Schema evolution - Adding columns during append operations
- NULL handling - Treatment of missing values
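
As a rough sketch of what schema harmonization looks like at the SQL level, the snippet below combines two node files whose columns only partly overlap. The file names and the `nodes` table are placeholders; the point is the `UNION ALL BY NAME` clause, which aligns columns by header name and fills the gaps with NULL.

```python
import duckdb

con = duckdb.connect()

# Two node files that share some columns (e.g. id, category) but not
# others. BY NAME matches columns by header name and pads missing ones
# with NULL, where a positional UNION ALL would fail on a column-count
# mismatch or silently misalign columns that differ only in order.
con.execute("""
    CREATE TABLE nodes AS
    SELECT * FROM read_csv('source_a_nodes.tsv', delim='\t', header=true)
    UNION ALL BY NAME
    SELECT * FROM read_csv('source_b_nodes.tsv', delim='\t', header=true)
""")

print(con.table("nodes").columns)  # the union of both files' columns
```
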
Data Integrity
Explains the non-destructive approach to data quality:
- Philosophy: move, don't delete - Why problem data is archived
- Archive tables - Where problematic data is stored (see the sketch after this list)
- Provenance tracking - How source attribution works
- Recovery - Retrieving data from archives
- Use cases - QC and debugging scenarios
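
A minimal sketch of the move-don't-delete idea, assuming KGX-style `nodes` and `edges` tables; the archive table name and reason label are invented for this example. Edges whose subject is unknown are copied to the archive, annotated, and only then removed from the working table.

```python
import duckdb

con = duckdb.connect("graph.duckdb")  # placeholder path

# Copy dangling edges into an archive table, tagged with a reason,
# before removing them from the working edge table -- nothing is
# silently discarded, so the rows remain available for QC or recovery.
con.execute("""
    CREATE TABLE IF NOT EXISTS archived_edges AS
    SELECT e.*, 'dangling subject' AS archive_reason
    FROM edges e
    LEFT JOIN nodes n ON e.subject = n.id
    WHERE n.id IS NULL
""")
con.execute("""
    DELETE FROM edges
    WHERE subject NOT IN (SELECT id FROM nodes)
""")
```
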
Biolink Compliance
Describes integration with the Biolink model:
- What is Biolink? - The knowledge graph standard
- Required fields - Minimum columns for valid KGX
- Multivalued fields - Arrays in node/edge properties
- Compliance checking - How validation works (see the sketch after this list)
- Common issues - Frequent compliance problems and fixes
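
Compliance checking is its own operation in the tool; the snippet below is only a hand-rolled illustration of the kind of query it implies, assuming `nodes` and `edges` tables with standard KGX columns (`id` and `category` for nodes; `subject`, `predicate`, `object` for edges).

```python
import duckdb

con = duckdb.connect("graph.duckdb")  # placeholder path

# Nodes need `id` and `category`; edges need `subject`, `predicate`,
# and `object`. Count the rows that fall short of that minimum.
nodes_missing_category = con.execute("""
    SELECT count(*) FROM nodes
    WHERE category IS NULL OR trim(category) = ''
""").fetchone()[0]

incomplete_edges = con.execute("""
    SELECT count(*) FROM edges
    WHERE subject IS NULL OR predicate IS NULL OR object IS NULL
""").fetchone()[0]

print(f"nodes without a category: {nodes_missing_category}")
print(f"edges missing subject/predicate/object: {incomplete_edges}")
```
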
Key Concepts
KGX Format
Knowledge Graph Exchange (KGX) is a standard format for representing knowledge graphs as node and edge tables. Graph operations work with KGX files in TSV, JSONL, or Parquet format.
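
To make that concrete, each serialization can be read with a stock DuckDB reader; the file names below are placeholders.

```python
import duckdb

con = duckdb.connect()

# The same KGX node table can arrive in any of the three
# serializations; each has a built-in DuckDB reader.
tsv_nodes = con.sql("SELECT * FROM read_csv('nodes.tsv', delim='\t', header=true)")
jsonl_nodes = con.sql("SELECT * FROM read_json('nodes.jsonl', format='newline_delimited')")
parquet_nodes = con.sql("SELECT * FROM read_parquet('nodes.parquet')")

# Each result is an ordinary relation that can be queried, joined,
# or written back out in another format.
print(tsv_nodes.columns)
```
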
DuckDB
DuckDB is an embedded analytical database with:
- Columnar processing
- SQL query interface
- In-memory and persistent modes
- Data compression
SSSOM
Simple Standard for Sharing Ontological Mappings (SSSOM) is a format for representing identifier mappings. Graph operations use SSSOM files to normalize identifiers during the merge process.
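
A rough sketch of what that normalization can look like, assuming an SSSOM TSV whose '#'-prefixed metadata block has already been stripped, a placeholder `mappings.sssom.tsv` path, and the convention of rewriting edge subjects to the mapped identifier; the project's actual merge logic may differ.

```python
import duckdb

con = duckdb.connect("graph.duckdb")  # placeholder path

# Load the mapping columns from the SSSOM TSV (metadata block assumed
# to be absent or already removed).
con.execute("""
    CREATE TABLE mappings AS
    SELECT subject_id, predicate_id, object_id
    FROM read_csv('mappings.sssom.tsv', delim='\t', header=true)
""")

# Rewrite edge subjects to the mapped identifier wherever an
# exact-match mapping exists; unmapped identifiers stay as they are.
con.execute("""
    UPDATE edges
    SET subject = m.object_id
    FROM mappings AS m
    WHERE edges.subject = m.subject_id
      AND m.predicate_id = 'skos:exactMatch'
""")
```
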
Design Principles
- Non-destructive: Problem data is moved to archive tables, not deleted
- Provenance: All records track their source file
- Flexibility: Operations work with any valid KGX files
- Performance: DuckDB handles processing of large graphs
- SQL access: Graphs can be queried directly
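
The last principle is easy to see in practice: with a persisted graph and standard KGX edge columns, an ad-hoc question such as which predicates dominate the graph is a single query (table and file names below are assumptions).

```python
import duckdb

# Open the persisted graph read-only and ask an ad-hoc QC question:
# how many edges does each predicate contribute?
with duckdb.connect("graph.duckdb", read_only=True) as con:  # placeholder path
    con.sql("""
        SELECT predicate, count(*) AS n_edges
        FROM edges
        GROUP BY predicate
        ORDER BY n_edges DESC
    """).show()
```
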