How to Generate Reports
Goal
Generate quality control reports, statistics, and data summaries from your knowledge graph databases. Koza provides multiple reporting commands for different analysis needs: QC reports for data quality assessment, graph statistics for structural analysis, schema compliance for biolink validation, and tabular reports for detailed breakdowns.
Prerequisites
- A DuckDB database created by koza join, koza merge, or koza append
- Alternatively, KGX files (TSV, JSONL, or Parquet) for file-based reports
QC Report
The koza report qc command generates a quality control report with node/edge statistics grouped by data source.
Basic Usage
koza report qc -d graph.duckdb -o qc_report.yaml
Console-Only Analysis
For quick QC analysis without saving to a file:
koza report qc -d graph.duckdb
Output Format
The QC report is saved as YAML with the following structure:
summary:
  total_nodes: 125340
  total_edges: 298567
  dangling_edges: 156
  duplicate_nodes: 0
  singleton_nodes: 23
nodes:
  - provided_by: gene_source
    category: biolink:Gene
    count: 50000
  - provided_by: disease_source
    category: biolink:Disease
    count: 25340
edges:
  - provided_by: interaction_source
    predicate: biolink:interacts_with
    count: 150000
  - provided_by: association_source
    predicate: biolink:associated_with
    count: 148567
The report includes:
- Total entity counts and data distribution
- Potential integrity issues (dangling edges, duplicates)
- Breakdown by source for multi-source graphs
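The summary block lends itself to an automated check, for example as a gate in CI. The following is a minimal Python sketch and not part of Koza. It assumes the report has already been parsed into a dict with the structure shown above (for instance with a YAML library), and the function name `find_integrity_issues` is hypothetical.

```python
from typing import Any


def find_integrity_issues(report: dict[str, Any]) -> list[str]:
    """Return warnings for non-zero integrity counters in a parsed QC report.

    Assumes `report` mirrors the YAML structure shown above; missing keys
    are treated as zero.
    """
    summary = report.get("summary", {})
    issues = []
    if summary.get("dangling_edges", 0) > 0:
        issues.append(f"{summary['dangling_edges']} dangling edges")
    if summary.get("duplicate_nodes", 0) > 0:
        issues.append(f"{summary['duplicate_nodes']} duplicate nodes")
    return issues


# Values copied from the example report above.
example_report = {
    "summary": {
        "total_nodes": 125340,
        "total_edges": 298567,
        "dangling_edges": 156,
        "duplicate_nodes": 0,
        "singleton_nodes": 23,
    }
}

for issue in find_integrity_issues(example_report):
    print("WARNING:", issue)
```

Returning a list rather than raising lets the caller decide whether any issue should fail a pipeline or only be logged.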
Graph Statistics
The koza report graph-stats command generates graph statistics similar to the merged_graph_stats.yaml output from cat-merge.
Basic Usage
koza report graph-stats -d graph.duckdb -o graph_stats.yaml
Output Format
graph_name: Graph Statistics
node_stats:
  total_nodes: 125340
  count_by_category:
    biolink:Gene:
      count: 50000
      provided_by:
        gene_source: 50000
    biolink:Disease:
      count: 25340
      provided_by:
        disease_source: 25340
  count_by_id_prefixes:
    HGNC: 45000
    NCBIGene: 5000
    MONDO: 15340
    OMIM: 10000
  node_categories:
    - biolink:Gene
    - biolink:Disease
    - biolink:Phenotype
  node_id_prefixes:
    - HGNC
    - NCBIGene
    - MONDO
    - OMIM
  provided_by:
    - gene_source
    - disease_source
edge_stats:
  total_edges: 298567
  count_by_predicates:
    biolink:interacts_with:
      count: 150000
      provided_by:
        interaction_source: 150000
    biolink:associated_with:
      count: 148567
      provided_by:
        association_source: 148567
  predicates:
    - biolink:interacts_with
    - biolink:associated_with
  provided_by:
    - interaction_source
    - association_source
Statistics generated include:
- Node counts by category and ID prefix
- Edge counts by predicate
- Attribution breakdown by provided_by source
- Complete enumeration of categories, predicates, and prefixes
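The enumerated categories and predicates make it easy to confirm that a graph contains what you expect. The sketch below is illustrative Python, not a Koza API. It assumes the statistics have been parsed into a dict, that `node_categories` is nested under `node_stats`, and that `predicates` is nested under `edge_stats`. The function name `missing_from_stats` is hypothetical.

```python
def missing_from_stats(
    stats: dict,
    expected_categories: set[str],
    expected_predicates: set[str],
) -> dict[str, set[str]]:
    """Report expected categories and predicates absent from parsed graph stats."""
    # Assumed nesting: node_stats.node_categories and edge_stats.predicates.
    found_categories = set(stats.get("node_stats", {}).get("node_categories", []))
    found_predicates = set(stats.get("edge_stats", {}).get("predicates", []))
    return {
        "categories": expected_categories - found_categories,
        "predicates": expected_predicates - found_predicates,
    }


# Lists copied from the example output above.
example_stats = {
    "node_stats": {
        "node_categories": ["biolink:Gene", "biolink:Disease", "biolink:Phenotype"],
    },
    "edge_stats": {
        "predicates": ["biolink:interacts_with", "biolink:associated_with"],
    },
}

missing = missing_from_stats(
    example_stats,
    expected_categories={"biolink:Gene", "biolink:Protein"},
    expected_predicates={"biolink:interacts_with"},
)
print(missing)
```

Here `biolink:Protein` is reported as missing because it does not appear in the example's `node_categories` list.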
Schema Compliance
The koza report schema command analyzes database schema and checks biolink model compliance.
Basic Usage
koza report schema -d graph.duckdb -o schema_report.yaml
Output Format
metadata:
  operation: schema_analysis
  generated_at: '2024-01-15 10:30:45'
  report_version: '1.0'
schema_analysis:
  summary:
    nodes:
      file_count: 4
      unique_columns: 23
      all_columns:
        - id
        - category
        - name
        - description
        - xref
        - provided_by
    edges:
      file_count: 2
      unique_columns: 18
      all_columns:
        - id
        - subject
        - predicate
        - object
        - category
        - primary_knowledge_source
  files:
    - filename: genes_nodes.tsv
      table_type: nodes
      column_count: 12
      columns:
        - id
        - category
        - name
        - symbol
The schema report includes:
- Column coverage across source files
- Schema harmonization applied during join
- Biolink-compliant vs. extension columns
- Data type consistency
Node Reports
The koza node-report command generates tabular reports with node counts grouped by categorical columns.
From Database
koza node-report -d graph.duckdb -o node_report.tsv
From File
koza node-report -f nodes.tsv -o node_report.tsv
Custom Columns
Specify which categorical columns to group by:
koza node-report -d graph.duckdb -o report.tsv \
-c namespace -c category -c provided_by
Output Example
namespace category provided_by count
HGNC biolink:Gene gene_source 45000
NCBIGene biolink:Gene gene_source 5000
MONDO biolink:Disease disease_source 15340
OMIM biolink:Disease disease_source 10000
The default grouping columns include namespace, category, and provided_by when present in the data.
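Because the report is a plain table with a `count` column, it is straightforward to roll it up further. The following Python sketch uses only the standard library and assumes a tab-separated file with a header row like the one shown above. The helper name `count_by_column` is hypothetical.

```python
import csv
import io
from collections import Counter


def count_by_column(tsv_text: str, column: str) -> Counter:
    """Sum the `count` column of a tab-separated report, grouped by `column`."""
    totals = Counter()
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        totals[row[column]] += int(row["count"])
    return totals


# Rows copied from the output example above, written with explicit tabs.
example_tsv = (
    "namespace\tcategory\tprovided_by\tcount\n"
    "HGNC\tbiolink:Gene\tgene_source\t45000\n"
    "NCBIGene\tbiolink:Gene\tgene_source\t5000\n"
    "MONDO\tbiolink:Disease\tdisease_source\t15340\n"
    "OMIM\tbiolink:Disease\tdisease_source\t10000\n"
)

for category, total in count_by_column(example_tsv, "category").items():
    print(category, total)
```

To process a real report, read the file contents (for example with `pathlib.Path("node_report.tsv").read_text()`) and pass them to the same function.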
Edge Reports
The koza edge-report command generates tabular reports with edge counts, including denormalized node information.
From Database
koza edge-report -d graph.duckdb -o edge_report.tsv
From Files
When working with separate node and edge files, provide both for denormalization:
koza edge-report -n nodes.tsv -e edges.tsv -o edge_report.tsv
Custom Columns
koza edge-report -d graph.duckdb -o report.tsv \
-c subject_category -c predicate -c object_category -c primary_knowledge_source
Output Example
subject_category predicate object_category primary_knowledge_source count
biolink:Gene biolink:interacts_with biolink:Gene string_db 85000
biolink:Gene biolink:interacts_with biolink:Protein biogrid 65000
biolink:Disease biolink:associated_with biolink:Phenotype hpo 98567
biolink:Gene biolink:associated_with biolink:Disease disgenet 50000
The edge report automatically joins edges to nodes to derive subject_category and object_category from the referenced nodes.
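The join described above can be pictured with a small Python sketch. This is an illustration of the idea only, not Koza's implementation (which operates on DuckDB tables). The function name `count_edges_by_pattern` and the sample rows are hypothetical.

```python
from collections import Counter


def count_edges_by_pattern(nodes: list[dict], edges: list[dict]) -> Counter:
    """Count edges per (subject_category, predicate, object_category)."""
    # Build a lookup from node id to category, the equivalent of the join.
    category_by_id = {node["id"]: node["category"] for node in nodes}
    counts = Counter()
    for edge in edges:
        key = (
            category_by_id.get(edge["subject"]),  # None if the node is missing
            edge["predicate"],
            category_by_id.get(edge["object"]),
        )
        counts[key] += 1
    return counts


# Illustrative rows; not taken from a real graph.
nodes = [
    {"id": "HGNC:1234", "category": "biolink:Gene"},
    {"id": "HGNC:5678", "category": "biolink:Gene"},
    {"id": "MONDO:0005148", "category": "biolink:Disease"},
]
edges = [
    {"subject": "HGNC:1234", "predicate": "biolink:interacts_with", "object": "HGNC:5678"},
    {"subject": "HGNC:5678", "predicate": "biolink:interacts_with", "object": "HGNC:1234"},
    {"subject": "HGNC:1234", "predicate": "biolink:associated_with", "object": "MONDO:0005148"},
]

pattern_counts = count_edges_by_pattern(nodes, edges)
for (subject_category, predicate, object_category), count in pattern_counts.items():
    print(subject_category, predicate, object_category, count)
```

In this sketch an edge whose subject or object is absent from the node list gets a category of `None`. How Koza reports such dangling edges is not specified here.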
Node Examples
The koza node-examples command extracts sample nodes for each category or other grouping column.
Basic Usage
koza node-examples -d graph.duckdb -o node_examples.tsv
Custom Sample Size
koza node-examples -d graph.duckdb -o examples.tsv -n 10
Group by Different Column
koza node-examples -d graph.duckdb -o examples.tsv -t provided_by
From File
koza node-examples -f nodes.tsv -o examples.tsv -n 5
Output Example
The output contains N sample rows, where N is the sample size, for each distinct value in the type column:
id name category provided_by ...
HGNC:1234 BRCA1 biolink:Gene gene_source ...
HGNC:5678 TP53 biolink:Gene gene_source ...
MONDO:0005148 diabetes mellitus biolink:Disease disease_source ...
MONDO:0004975 Alzheimer disease biolink:Disease disease_source ...
The output can be used for:
- Data validation and spot-checking
- Documentation and examples
- Investigating data quality issues
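The grouping behaviour can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Koza's implementation: it keeps the first `n` rows seen for each group, whereas Koza may select its samples differently. The function name `sample_per_group` is hypothetical.

```python
from collections import defaultdict


def sample_per_group(rows: list[dict], type_column: str, n: int) -> list[dict]:
    """Keep at most `n` rows for each distinct value of `type_column`."""
    kept = defaultdict(list)
    for row in rows:
        group = kept[row[type_column]]
        if len(group) < n:
            group.append(row)
    # Flatten the groups, preserving the order in which groups were first seen.
    return [row for group in kept.values() for row in group]


# Rows adapted from the output example above.
rows = [
    {"id": "HGNC:1234", "name": "BRCA1", "category": "biolink:Gene"},
    {"id": "HGNC:5678", "name": "TP53", "category": "biolink:Gene"},
    {"id": "MONDO:0005148", "name": "diabetes mellitus", "category": "biolink:Disease"},
    {"id": "MONDO:0004975", "name": "Alzheimer disease", "category": "biolink:Disease"},
]

for row in sample_per_group(rows, type_column="category", n=1):
    print(row["id"], row["category"])
```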
Edge Examples
The koza edge-examples command extracts sample edges for each predicate pattern or custom grouping.
Basic Usage
koza edge-examples -d graph.duckdb -o edge_examples.tsv
Custom Sample Size
koza edge-examples -d graph.duckdb -o examples.tsv -s 10
Custom Type Columns
By default, edges are grouped by subject_category, predicate, and object_category. Customize this:
koza edge-examples -d graph.duckdb -o examples.tsv \
-t predicate -t primary_knowledge_source
From Files
koza edge-examples -n nodes.tsv -e edges.tsv -o examples.tsv -s 5
Output Example
subject predicate object subject_name object_name primary_knowledge_source ...
HGNC:1234 biolink:interacts_with HGNC:5678 BRCA1 TP53 string_db ...
HGNC:9999 biolink:interacts_with HGNC:8888 MYC MAX biogrid ...
MONDO:0005148 biolink:associated_with HP:0001943 diabetes Hypoglycemia hpo ...
Edge examples show:
- Subject/object relationships
- Predicate usage patterns
- Knowledge source attribution
Output Formats
All tabular reports (node-report, edge-report, node-examples, edge-examples) support multiple output formats.
TSV (Default)
koza node-report -d graph.duckdb -o report.tsv --format tsv
Parquet
For large reports or downstream analytics:
koza node-report -d graph.duckdb -o report.parquet --format parquet
CSV
koza node-report -d graph.duckdb -o report.csv --format csv
JSON
koza node-report -d graph.duckdb -o report.json --format json
Common Options
All report commands support these options:
| Option | Description |
|---|---|
| -d, --database | Path to DuckDB database file |
| -o, --output | Path to output file |
| -q, --quiet | Suppress progress output |
Workflow Examples
Complete QC Pipeline
Generate all reports for a merged graph:
# Generate all report types
koza report qc -d merged.duckdb -o qc_report.yaml
koza report graph-stats -d merged.duckdb -o graph_stats.yaml
koza report schema -d merged.duckdb -o schema_report.yaml
# Generate tabular breakdowns
koza node-report -d merged.duckdb -o node_report.tsv
koza edge-report -d merged.duckdb -o edge_report.tsv
# Extract examples for documentation
koza node-examples -d merged.duckdb -o node_examples.tsv -n 3
koza edge-examples -d merged.duckdb -o edge_examples.tsv -s 3
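If you would rather drive the same pipeline from Python, a thin wrapper around `subprocess` is enough. The sketch below only reuses the commands and flags shown above. The function names `build_report_commands` and `run_reports`, and the idea of writing everything into one output directory, are assumptions made for illustration.

```python
import subprocess
from pathlib import Path


def build_report_commands(database: str, out_dir: str, samples: int = 3) -> list[list[str]]:
    """Return the argument lists for every report command in the pipeline."""
    out = Path(out_dir)
    return [
        ["koza", "report", "qc", "-d", database, "-o", str(out / "qc_report.yaml")],
        ["koza", "report", "graph-stats", "-d", database, "-o", str(out / "graph_stats.yaml")],
        ["koza", "report", "schema", "-d", database, "-o", str(out / "schema_report.yaml")],
        ["koza", "node-report", "-d", database, "-o", str(out / "node_report.tsv")],
        ["koza", "edge-report", "-d", database, "-o", str(out / "edge_report.tsv")],
        ["koza", "node-examples", "-d", database,
         "-o", str(out / "node_examples.tsv"), "-n", str(samples)],
        ["koza", "edge-examples", "-d", database,
         "-o", str(out / "edge_examples.tsv"), "-s", str(samples)],
    ]


def run_reports(database: str, out_dir: str) -> None:
    """Run each report in order, stopping at the first failure."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for command in build_report_commands(database, out_dir):
        # check=True raises CalledProcessError if koza exits with a non-zero status.
        subprocess.run(command, check=True)
```

Calling `run_reports("merged.duckdb", "reports")` would then produce all seven outputs under `reports/`, provided `koza` is on the `PATH`.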
Post-Merge Validation
After running koza merge, validate the results:
# Check for data quality issues
koza report qc -d merged.duckdb -o post_merge_qc.yaml
# Verify expected categories and predicates
koza report graph-stats -d merged.duckdb -o post_merge_stats.yaml
# Spot-check examples
koza edge-examples -d merged.duckdb -o edge_samples.tsv -s 5
See Also
- CLI Reference - Complete command documentation
- Configuration Reference - Report configuration options
- Data Integrity - Understanding archive tables in reports