How to Do Incremental Updates
Goal
Add new data to an existing DuckDB database with automatic schema evolution. The append operation lets you grow your knowledge graph incrementally instead of rebuilding from scratch, and it handles schema differences between old and new data automatically.
Prerequisites
- An existing DuckDB database (created via koza join, koza merge, or a previous koza append)
- New KGX files to append (nodes and/or edges in TSV, JSONL, or Parquet format)
- Koza installed and available in your PATH
Basic Append
The simplest append adds new node and edge files to an existing database.
koza append \
--database existing_graph.duckdb \
--nodes new_genes.tsv \
--edges new_interactions.tsv
You can append multiple files at once:
koza append \
--database existing_graph.duckdb \
--nodes new_genes.tsv updated_pathways.jsonl additional_proteins.parquet \
--edges new_interactions.tsv new_associations.jsonl
Mixed Format Support
Like other Koza operations, append handles mixed formats:
koza append \
--database graph.duckdb \
--nodes genes.tsv proteins.jsonl pathways.parquet \
--edges interactions.tsv.gz associations.jsonl.bz2
Schema Evolution
When new files contain columns that do not exist in the database, append automatically adds these columns to the schema.
koza append \
--database graph.duckdb \
--nodes genes_with_new_fields.tsv \
--edges interactions_with_confidence.tsv \
--schema-report
How Schema Evolution Works
- New columns detected: Append compares incoming file schemas with existing table schemas
- Columns added: New columns are added to the database tables using ALTER TABLE
- Backward compatibility: Existing rows get NULL values for the new columns
- Type safety: DuckDB handles type inference and conversion automatically
Example Output
Append completed successfully
Files processed: 2 (2 successful)
Records added: 15,234
Schema evolution: 2 new columns added
- Added 1 new column to nodes: custom_score
- Added 1 new column to edges: confidence_value
Database growth:
- Nodes: 125,340 -> 138,574 (+13,234)
- Edges: 298,567 -> 300,567 (+2,000)
Total time: 8.2s
Deduplication During Append
When appending data that may overlap with existing records, use the --deduplicate flag to remove duplicates:
koza append \
--database graph.duckdb \
--nodes updated_genes.tsv \
--edges updated_interactions.tsv \
--deduplicate
How Deduplication Works
- Append first: New data is added to the tables
- Identify duplicates: Records with the same ID are identified
- Keep first occurrence: The first occurrence (by load order) is kept
- Archive duplicates: Duplicate records are moved to duplicate_nodes and duplicate_edges tables
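A minimal sketch of this keep-first, archive-the-rest pattern, again using Python's stdlib sqlite3 so it runs anywhere (DuckDB accepts equivalent window-function SQL; the append tool's actual implementation may differ):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE nodes (id TEXT, name TEXT)")
# Load order matters: the first occurrence of each id wins.
con.executemany(
    "INSERT INTO nodes VALUES (?, ?)",
    [("GENE:1", "BRCA1"), ("GENE:2", "TP53"), ("GENE:1", "BRCA1-dup")],
)

# Archive every occurrence after the first instead of deleting it.
con.execute("""
    CREATE TABLE duplicate_nodes AS
    SELECT id, name FROM (
        SELECT id, name,
               ROW_NUMBER() OVER (PARTITION BY id ORDER BY rowid) AS rn
        FROM nodes
    ) WHERE rn > 1
""")
con.execute("""
    DELETE FROM nodes WHERE rowid NOT IN (
        SELECT MIN(rowid) FROM nodes GROUP BY id
    )
""")
kept = con.execute("SELECT id, name FROM nodes ORDER BY id").fetchall()
archived = con.execute("SELECT id, name FROM duplicate_nodes").fetchall()
print(kept)      # first occurrences kept in nodes
print(archived)  # later duplicates moved to duplicate_nodes, not deleted
```

Because duplicates are copied into the archive table before removal, no record is ever lost; you can audit or restore them later.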
Example with Deduplication
koza append \
--database graph.duckdb \
--nodes genes_v2.tsv \
--deduplicate \
--show-progress
Output:
Append completed successfully
Files processed: 1 (1 successful)
Records added: 5,234
Duplicates removed: 45
Database: graph.duckdb (4.8 MB)
Total time: 4.1s
Deduplication Preserves Data
Duplicate records are moved to archive tables (duplicate_nodes, duplicate_edges), not deleted. You can inspect these tables to understand what was deduplicated.
Tracking Schema Changes
Use the --schema-report flag to generate a detailed report of schema changes:
koza append \
--database graph.duckdb \
--nodes new_data.tsv \
--schema-report
This creates a YAML file (e.g., graph_schema_report_append.yaml) containing:
- File analysis: Format detection, column counts, record counts per file
- Schema changes: List of new columns added to each table
- Column details: Data types, null percentages, example values
Inspecting Schema Changes via SQL
After appending, you can also check the schema directly:
# View current table schemas
duckdb graph.duckdb "DESCRIBE nodes"
duckdb graph.duckdb "DESCRIBE edges"
# Check which columns have NULL values (potentially new columns)
duckdb graph.duckdb "SELECT COUNT(*) - COUNT(custom_score) as nulls FROM nodes"
When to Use Append vs Join
Choose the right operation based on your use case:
| Scenario | Use Append | Use Join |
|---|---|---|
| Adding new data to existing database | Yes | No |
| Preserving existing database state | Yes | No (overwrites) |
| Incremental daily/weekly updates | Yes | No |
| Full rebuild from all sources | No | Yes |
| Starting fresh with new data | No | Yes |
| Schema evolution needed | Yes (automatic) | Yes (automatic) |
| Combining many files initially | No | Yes |
Append Use Cases
- Daily data ingestion: Add new records from a data pipeline each day
- Incremental updates: Add corrections or updates without full rebuild
- Data augmentation: Add new sources to an existing graph
- Schema extension: Add new properties to existing entities
Join Use Cases
- Initial graph creation: Combine multiple source files into a new database
- Full rebuild: Replace existing data with a fresh build from sources
- Format conversion: Load data into DuckDB for the first time
Complete Pipeline Comparison
Incremental approach (append):
# Initial build
koza join --nodes *.nodes.* --edges *.edges.* --output graph.duckdb
# Later: add new data
koza append --database graph.duckdb --nodes new_data.tsv --deduplicate
koza append --database graph.duckdb --nodes more_data.tsv --deduplicate
Full rebuild approach (merge):
# Rebuild everything each time
koza merge \
--nodes *.nodes.* new_data.tsv more_data.tsv \
--edges *.edges.* \
--output graph.duckdb
Verification
After appending, verify the operation succeeded.
Check Record Counts
# Compare before and after counts
duckdb graph.duckdb "SELECT COUNT(*) AS node_count FROM nodes"
duckdb graph.duckdb "SELECT COUNT(*) AS edge_count FROM edges"
Verify New Data Is Present
# Check for records from the new source
duckdb graph.duckdb "SELECT COUNT(*) FROM nodes WHERE file_source LIKE '%new_data%'"
# Or by provided_by if set
duckdb graph.duckdb "SELECT provided_by, COUNT(*) FROM nodes GROUP BY provided_by ORDER BY COUNT(*) DESC"
Check Schema Evolution
# View the current schema
duckdb graph.duckdb "DESCRIBE nodes"
# Check per-column null percentages (recently added columns will be mostly NULL)
duckdb graph.duckdb "
SELECT column_name, null_percentage
FROM (SUMMARIZE nodes)
ORDER BY null_percentage DESC
"
Verify Deduplication (if used)
# Check duplicate archive tables
duckdb graph.duckdb "SELECT COUNT(*) as duplicate_nodes FROM duplicate_nodes"
duckdb graph.duckdb "SELECT COUNT(*) as duplicate_edges FROM duplicate_edges"
# View sample duplicates
duckdb graph.duckdb "SELECT * FROM duplicate_nodes LIMIT 5"
Generate QC Report
For comprehensive verification:
koza report qc --database graph.duckdb --output qc_report.yaml
Complete Example
A typical incremental update workflow:
# Step 1: Initial graph build
koza merge \
--nodes source_a.nodes.tsv source_b.nodes.tsv \
--edges source_a.edges.tsv source_b.edges.tsv \
--output knowledge_graph.duckdb
# Step 2: Check initial counts
duckdb knowledge_graph.duckdb "SELECT COUNT(*) FROM nodes"
# Returns: 125340
# Step 3: Append new data with deduplication and schema tracking
koza append \
--database knowledge_graph.duckdb \
--nodes new_source_c.nodes.tsv \
--edges new_source_c.edges.tsv \
--deduplicate \
--schema-report \
--show-progress
# Step 4: Verify the update
duckdb knowledge_graph.duckdb "SELECT COUNT(*) FROM nodes"
# Returns: 138574
# Step 5: Generate updated QC report
koza report qc --database knowledge_graph.duckdb --output qc_report.yaml
See Also
- CLI Reference: koza append
- How to Join Files - Creating the initial database
- How to Clean Graphs - Deduplication and pruning operations
- Schema Handling - Schema evolution and harmonization