Schema Handling
Overview
When building knowledge graphs from multiple sources, handling different schemas is a common challenge. Each data source may use different columns, different naming conventions, and different data types. Graph operations use DuckDB's schema harmonization features to combine these heterogeneous files into a unified graph.
This document explains how schema handling works, from format detection through schema evolution, so you can understand what happens when your files have different columns.
The Schema Challenge
Why Different KGX Files Have Different Columns
Knowledge graph data comes from many different sources, and each source may have different characteristics:
Different data sources: A gene annotation database exports different fields than a disease ontology. One might include taxon and chromosome, while another includes definition and synonyms.
Different Biolink slots used: The Biolink model defines many optional slots for nodes and edges. A protein interaction dataset might use has_evidence and supporting_publications, while a chemical database uses has_chemical_formula and molecular_weight.
Optional vs required fields: While KGX defines core fields like id, name, category for nodes and subject, predicate, object for edges, most other fields are optional. Different sources populate different subsets.
Custom extensions: Some data sources include custom columns beyond the Biolink model, such as source-specific identifiers, confidence scores, or annotation metadata.
Example: Schema Variation in Practice
Consider combining three node files:
# gene_nodes.tsv (5 columns)
id name category taxon chromosome
# disease_nodes.tsv (4 columns)
id name category definition
# chemical_nodes.tsv (6 columns)
id name category formula smiles molecular_weight
When these files are combined, the result must include all columns from all sources, with appropriate handling for missing values.
UNION ALL BY NAME
DuckDB's Schema Harmonization Strategy
Graph operations use DuckDB's UNION ALL BY NAME feature to combine files with different schemas. This is the key mechanism that makes heterogeneous file handling possible.
How It Works
Columns matched by name, not position: Unlike a traditional UNION ALL, which matches columns by position and requires every input to have the same number of columns, UNION ALL BY NAME matches columns by their names. This means files can have their columns in any order.
Missing columns filled with NULL: When a file lacks a column that exists in another file, DuckDB fills those cells with NULL. No data is lost.
All columns from all sources preserved: The final table includes every column from every source file. No manual schema mapping is required.
Example
-- DuckDB automatically handles schema differences
CREATE TABLE nodes AS
SELECT * FROM temp_genes
UNION ALL BY NAME
SELECT * FROM temp_diseases
UNION ALL BY NAME
SELECT * FROM temp_chemicals
Result schema: id, name, category, taxon, chromosome, definition, formula, smiles, molecular_weight
For genes: definition, formula, smiles, molecular_weight will be NULL
For diseases: taxon, chromosome, formula, smiles, molecular_weight will be NULL
For chemicals: taxon, chromosome, definition will be NULL
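The same harmonization can also be applied at load time: DuckDB's file readers accept a list of files plus a union_by_name option that does the name-based matching while reading. A minimal Python sketch, assuming the three example files above exist in the working directory:

import duckdb

con = duckdb.connect()
# union_by_name matches columns across files by name; missing cells become NULL
con.sql("""
    CREATE TABLE nodes AS
    SELECT * FROM read_csv(
        ['gene_nodes.tsv', 'disease_nodes.tsv', 'chemical_nodes.tsv'],
        union_by_name = true,
        all_varchar = true
    )
""")
print(con.sql("DESCRIBE nodes"))  # all nine harmonized columns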
Format Auto-Detection
How Format Is Detected
Graph operations detect file formats automatically. The FileSpec model handles this through field validators.
File Extension Detection
The format is detected from the file extension:
| Extension | Format |
|---|---|
| .tsv, .txt | TSV (tab-separated values) |
| .jsonl, .json | JSONL (JSON Lines) |
| .parquet | Parquet |
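The detection itself is a simple extension lookup. A minimal sketch of that mapping, using a hypothetical detect_format helper (the actual FileSpec validator may differ):

from pathlib import Path

# Hypothetical mapping from file extension to KGX serialization format
EXTENSION_FORMATS = {
    ".tsv": "tsv",
    ".txt": "tsv",
    ".jsonl": "jsonl",
    ".json": "jsonl",
    ".parquet": "parquet",
}

def detect_format(path: str) -> str:
    """Return the format implied by the file extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in EXTENSION_FORMATS:
        raise ValueError(f"Unrecognized file extension: {suffix}")
    return EXTENSION_FORMATS[suffix]

print(detect_format("gene_nodes.tsv"))  # tsv
print(detect_format("edges.jsonl"))     # jsonl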
Compression Detection
The system handles compressed files by first stripping the compression extension, then detecting the underlying format:
| File | Detected Format |
|---|---|
| nodes.tsv.gz | TSV (compressed) |
| edges.jsonl.bz2 | JSONL (compressed) |
| data.parquet.xz | Parquet (compressed) |
DuckDB reads compressed files without manual decompression.
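A minimal sketch of both steps, assuming a hypothetical strip_compression helper and a gzipped TSV file named nodes.tsv.gz:

import duckdb
from pathlib import Path

# Assumed set of compression extensions handled by the stripping step
COMPRESSION_SUFFIXES = {".gz", ".bz2", ".xz"}

def strip_compression(path: str) -> str:
    """Drop a trailing compression extension so the format extension is visible."""
    p = Path(path)
    return str(p.with_suffix("")) if p.suffix.lower() in COMPRESSION_SUFFIXES else path

print(strip_compression("nodes.tsv.gz"))  # nodes.tsv -> detected as TSV

# DuckDB decompresses on the fly when reading
con = duckdb.connect()
con.sql("SELECT count(*) FROM read_csv('nodes.tsv.gz', all_varchar = true)")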
File Type Detection
The system also auto-detects whether a file contains nodes or edges based on filename patterns:
- Files containing _nodes. or starting with nodes. are detected as node files
- Files containing _edges. or starting with edges. are detected as edge files
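A minimal sketch of that filename check, using a hypothetical detect_file_type helper:

from pathlib import Path

def detect_file_type(path: str):
    """Guess whether a KGX file holds nodes or edges from its filename."""
    name = Path(path).name.lower()
    if "_nodes." in name or name.startswith("nodes."):
        return "nodes"
    if "_edges." in name or name.startswith("edges."):
        return "edges"
    return None  # ambiguous; the caller must specify the type explicitly

print(detect_file_type("gene_nodes.tsv"))  # nodes
print(detect_file_type("edges.jsonl.gz"))  # edges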
Type Inference
How Column Types Are Determined
DuckDB performs type inference when reading files. The behavior varies by format:
TSV files: By default, graph operations read TSV files with all_varchar=true, treating all columns as text. This avoids data loss from type mismatches and handles mixed-type columns consistently.
JSONL files: DuckDB infers types from the JSON structure. Arrays become VARCHAR[] (array of strings), numbers become appropriate numeric types, and strings remain VARCHAR. The option convert_strings_to_integers=false prevents UUID-like strings from being incorrectly parsed as integers.
Parquet files: Types are preserved exactly as stored in the Parquet schema, since Parquet is a typed format.
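The following Python sketch shows the per-format reader options described above, assuming example files gene_nodes.tsv, edges.jsonl, and nodes.parquet:

import duckdb

con = duckdb.connect()

# TSV: read every column as text to avoid lossy type guesses
con.sql("SELECT * FROM read_csv('gene_nodes.tsv', all_varchar = true)")

# JSONL: infer types from the JSON structure, but keep ID-like strings as strings
con.sql("""
    SELECT * FROM read_json('edges.jsonl',
                            format = 'newline_delimited',
                            convert_strings_to_integers = false)
""")

# Parquet: column types come straight from the file's own schema
con.sql("SELECT * FROM read_parquet('nodes.parquet')")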
Array Detection for Multivalued Fields
Graph operations identify multivalued fields (like synonym, xref, publications) and convert them appropriately:
- Schema-based detection: The system consults the Biolink model schema to determine which fields are defined as multivalued
- Fallback list: When the schema is unavailable, a hardcoded list of common KGX multivalued fields is used
- Transformation: For TSV files, pipe-delimited values (value1|value2|value3) are converted to arrays (see the sketch after the constants below)
# Fields that are always treated as single-valued (overrides schema)
FORCE_SINGLE_VALUED_FIELDS = {"category", "in_taxon", "type"}
# Fallback multivalued fields when schema unavailable
KGX_MULTIVALUED_FIELDS_FALLBACK = {
"xref", "synonym", "publications", "provided_by",
"qualifiers", "knowledge_source", "has_evidence", ...
}
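For the transformation step, pipe-delimited TSV values can be split into DuckDB lists with string_split. A minimal sketch, assuming a hypothetical nodes file with a pipe-delimited synonym column:

import duckdb

con = duckdb.connect()
# Split multivalued cells into lists; keep NULL as NULL rather than an empty list
con.sql("""
    SELECT
        id,
        CASE WHEN synonym IS NULL THEN NULL
             ELSE string_split(synonym, '|')
        END AS synonym
    FROM read_csv('nodes_with_synonyms.tsv', all_varchar = true)
""")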
Schema Evolution with Append
How New Columns Are Added
The append operation supports schema evolution, allowing you to add files with new columns to an existing database.
The Process
- Load new file into temp table: The new file is loaded with its full schema
- Compare schemas: The temp table schema is compared to the existing table
- ALTER TABLE ADD COLUMN: Any columns in the new file that do not exist in the target table are added via ALTER TABLE
- Insert with UNION ALL BY NAME: Data is inserted using a schema-compatible union (see the sketch after the code below)
# Schema evolution in append operation
new_columns = set(file_columns.keys()) - set(existing_schema.keys())
for col_name in new_columns:
col_type = file_columns[col_name]
db.conn.execute(f"ALTER TABLE {table_type} ADD COLUMN {col_name} {col_type}")
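The final insert step can use DuckDB's INSERT INTO ... BY NAME, which matches columns between the temp table and the (now widened) target table and leaves any missing columns NULL. A self-contained sketch, assuming an existing nodes table in graph.duckdb and a hypothetical new_nodes.tsv:

import duckdb

con = duckdb.connect("graph.duckdb")  # assumed database with an existing nodes table
con.execute("""
    CREATE TEMP TABLE temp_new_file AS
    SELECT * FROM read_csv('new_nodes.tsv', all_varchar = true)
""")
# BY NAME matches columns by name; target columns absent from the temp table stay NULL
con.execute("INSERT INTO nodes BY NAME SELECT * FROM temp_new_file")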
NULL Backfill for Existing Rows
When a new column is added to an existing table, all existing rows receive NULL for that column. DuckDB does this automatically.
Example
# Original database has nodes with: id, name, category
# New file has: id, name, category, taxon, description
# After append:
# - taxon and description columns are added
# - Original nodes have NULL for taxon and description
# - New nodes have all five fields populated
NULL Handling
What Happens with Missing Values
NULL values occur during schema harmonization. Here is how they flow through operations:
During join: Missing columns are filled with NULL via UNION ALL BY NAME
During append: New columns in existing rows are NULL; missing columns in new rows are NULL
During export: NULL values are preserved in the output format:
- TSV: Empty cell (no value between tabs)
- JSONL: Field omitted from the JSON object
- Parquet: Native NULL representation
In queries: Use IS NULL / IS NOT NULL to filter, or COALESCE() to provide defaults
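A short query sketch for working with those NULLs, assuming a database file graph.duckdb containing the combined nodes table:

import duckdb

con = duckdb.connect("graph.duckdb")
# Filter on missing values and substitute a display default
con.sql("""
    SELECT id, name, COALESCE(taxon, 'unknown') AS taxon
    FROM nodes
    WHERE definition IS NULL
""")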
NULLs vs Empty Arrays
For multivalued fields, there is a distinction:
- NULL: The field has no value (not provided)
- [] (empty array): The field was provided but contains no values
- ['value']: The field contains one or more values
Graph operations preserve this distinction when possible, but the TSV format cannot distinguish between NULL and an empty string.
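When a multivalued field is stored as a list column (for example in Parquet, or after array conversion), the distinction can be checked directly. A minimal sketch, assuming a synonym list column in the nodes table:

import duckdb

con = duckdb.connect("graph.duckdb")  # assumed database
con.sql("""
    SELECT
        count(*) FILTER (WHERE synonym IS NULL)  AS not_provided,
        count(*) FILTER (WHERE len(synonym) = 0) AS provided_but_empty,
        count(*) FILTER (WHERE len(synonym) > 0) AS has_values
    FROM nodes
""")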
Best Practices
Tips for Maintaining Clean Schemas
Use consistent column naming: Follow Biolink naming conventions (snake_case) across all your source files.
Document your extensions: If you add custom columns beyond Biolink, document what they contain.
Validate before joining: Use the schema report feature (--schema-reporting) to analyze schemas before combining files.
Consider output format: If schema consistency matters for downstream tools, consider exporting to Parquet, which preserves exact column types.
Handle multivalued fields consistently: Use pipe-delimited format (value1|value2) in TSV files for multivalued fields.
Review NULL patterns: After joining, query for NULL patterns to understand data coverage:
-- Find columns with high NULL rates
SELECT
'taxon' as column_name,
COUNT(*) as total,
COUNT(taxon) as non_null,
ROUND(100.0 * COUNT(taxon) / COUNT(*), 1) as coverage_pct
FROM nodes
Use provided_by for provenance: Enable generate_provided_by (default) to track which source each record came from. This allows investigation of schema differences by source.
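For example, grouping coverage counts by source makes per-source schema differences visible. A query sketch, assuming provided_by is stored as a plain string column (if it is stored as a list, unnest it first):

import duckdb

con = duckdb.connect("graph.duckdb")  # assumed database
con.sql("""
    SELECT
        provided_by,
        count(*)          AS total,
        count(taxon)      AS has_taxon,
        count(definition) AS has_definition
    FROM nodes
    GROUP BY provided_by
    ORDER BY total DESC
""")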