How to Generate Reports

Goal

Generate quality control reports, statistics, and data summaries from your knowledge graph databases. Koza provides a separate reporting command for each analysis need: QC reports for data quality assessment, graph statistics for structural analysis, schema compliance for Biolink validation, and tabular reports for detailed breakdowns.

Prerequisites

A DuckDB database created by koza join, koza merge, or koza append
Alternatively, KGX files (TSV, JSONL, or Parquet) for file-based reports

QC Report

The koza report qc command generates a quality control report with node/edge statistics grouped by data source.

Basic Usage

koza report qc -d graph.duckdb -o qc_report.yaml

Console-Only Analysis

For a quick QC analysis that prints to the console without saving a file, omit the -o option:

koza report qc -d graph.duckdb

Output Format

The QC report is saved as YAML with the following structure:

summary:
  total_nodes: 125340
  total_edges: 298567
  dangling_edges: 156
  duplicate_nodes: 0
  singleton_nodes: 23
nodes:
  - provided_by: gene_source
    category: biolink:Gene
    count: 50000
  - provided_by: disease_source
    category: biolink:Disease
    count: 25340
edges:
  - provided_by: interaction_source
    predicate: biolink:interacts_with
    count: 150000
  - provided_by: association_source
    predicate: biolink:associated_with
    count: 148567

The report includes:

Total entity counts and data distribution
Potential integrity issues (dangling edges, duplicates)
Breakdown by source for multi-source graphs
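
One way to act on the summary block is a small check that stops a pipeline when the integrity counters exceed a tolerance. The Python sketch below works on an already-parsed report and uses only the summary keys shown above. The function name qc_problems and the thresholds are hypothetical, and loading the YAML with PyYAML is an assumption about your environment, not something the QC command requires.

```python
# Hypothetical helper: flag integrity issues in a parsed QC report.
# Keys mirror the example summary above; the thresholds are illustrative.

def qc_problems(report, max_dangling=0, max_duplicates=0):
    """Return a list of human-readable problems found in report['summary']."""
    summary = report.get("summary", {})
    problems = []
    dangling = summary.get("dangling_edges", 0)
    if dangling > max_dangling:
        problems.append(f"{dangling} dangling edges (allowed: {max_dangling})")
    duplicates = summary.get("duplicate_nodes", 0)
    if duplicates > max_duplicates:
        problems.append(f"{duplicates} duplicate nodes (allowed: {max_duplicates})")
    return problems


# In practice the dict would be loaded from the report file, for example:
#   import yaml
#   report = yaml.safe_load(open("qc_report.yaml"))
# (assumes PyYAML is installed). Here the example summary is inlined.
example_report = {
    "summary": {
        "total_nodes": 125340,
        "total_edges": 298567,
        "dangling_edges": 156,
        "duplicate_nodes": 0,
        "singleton_nodes": 23,
    }
}

for problem in qc_problems(example_report):
    print(problem)
```

Raising max_dangling lets a pipeline tolerate a known number of unresolved references while still failing on regressions.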

Graph Statistics

The koza report graph-stats command generates graph statistics similar to the merged_graph_stats.yaml output from cat-merge.

Basic Usage

koza report graph-stats -d graph.duckdb -o graph_stats.yaml

Output Format

graph_name: Graph Statistics
node_stats:
  total_nodes: 125340
  count_by_category:
    biolink:Gene:
      count: 50000
      provided_by:
        gene_source: 50000
    biolink:Disease:
      count: 25340
      provided_by:
        disease_source: 25340
  count_by_id_prefixes:
    HGNC: 45000
    NCBIGene: 5000
    MONDO: 15340
    OMIM: 10000
  node_categories:
    - biolink:Gene
    - biolink:Disease
    - biolink:Phenotype
  node_id_prefixes:
    - HGNC
    - NCBIGene
    - MONDO
    - OMIM
  provided_by:
    - gene_source
    - disease_source
edge_stats:
  total_edges: 298567
  count_by_predicates:
    biolink:interacts_with:
      count: 150000
      provided_by:
        interaction_source: 150000
    biolink:associated_with:
      count: 148567
      provided_by:
        association_source: 148567
  predicates:
    - biolink:interacts_with
    - biolink:associated_with
  provided_by:
    - interaction_source
    - association_source

Statistics generated include:

Node counts by category and ID prefix
Edge counts by predicate
Attribution breakdown by provided_by source
Complete enumeration of categories, predicates, and prefixes
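
The enumerated categories and predicates make it straightforward to confirm that a graph contains what you expect. The following Python sketch compares a parsed graph-stats report against expected terms. It relies only on the node_stats.node_categories and edge_stats.predicates keys shown above; the function name missing_terms and the expected lists are hypothetical.

```python
# Hypothetical helper: report expected categories/predicates that are absent
# from a parsed graph-stats report (structure as in the example above).

def missing_terms(stats, expected_categories=(), expected_predicates=()):
    found_categories = set(stats.get("node_stats", {}).get("node_categories", []))
    found_predicates = set(stats.get("edge_stats", {}).get("predicates", []))
    return {
        "categories": sorted(set(expected_categories) - found_categories),
        "predicates": sorted(set(expected_predicates) - found_predicates),
    }


# Inlined excerpt of the example report; normally loaded from graph_stats.yaml.
example_stats = {
    "node_stats": {
        "node_categories": ["biolink:Gene", "biolink:Disease", "biolink:Phenotype"],
    },
    "edge_stats": {
        "predicates": ["biolink:interacts_with", "biolink:associated_with"],
    },
}

result = missing_terms(
    example_stats,
    expected_categories=["biolink:Gene", "biolink:ChemicalEntity"],
    expected_predicates=["biolink:interacts_with"],
)
print(result)
```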

Schema Compliance

The koza report schema command analyzes the database schema and checks Biolink Model compliance.

Basic Usage

koza report schema -d graph.duckdb -o schema_report.yaml

Output Format

metadata:
  operation: schema_analysis
  generated_at: '2024-01-15 10:30:45'
  report_version: '1.0'
schema_analysis:
  summary:
    nodes:
      file_count: 4
      unique_columns: 23
      all_columns:
        - id
        - category
        - name
        - description
        - xref
        - provided_by
    edges:
      file_count: 2
      unique_columns: 18
      all_columns:
        - id
        - subject
        - predicate
        - object
        - category
        - primary_knowledge_source
  files:
    - filename: genes_nodes.tsv
      table_type: nodes
      column_count: 12
      columns:
        - id
        - category
        - name
        - symbol

The schema report includes:

Column coverage across source files
Schema harmonization applied during join
Biolink-compliant vs. extension columns
Data type consistency
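
Column coverage can also be inspected programmatically. The Python sketch below lists, for each source file, the columns from the per-table summary that the file does not provide. It assumes the report has been parsed into a dict with the schema_analysis.summary and schema_analysis.files layout shown above; the function name missing_columns is hypothetical.

```python
# Hypothetical helper: for each file in a parsed schema report, list the
# summary columns (for that file's table type) that the file lacks.

def missing_columns(report):
    analysis = report.get("schema_analysis", {})
    summary = analysis.get("summary", {})
    gaps = {}
    for entry in analysis.get("files", []):
        all_columns = summary.get(entry["table_type"], {}).get("all_columns", [])
        present = set(entry.get("columns", []))
        gaps[entry["filename"]] = [c for c in all_columns if c not in present]
    return gaps


# Inlined excerpt of the example report; normally loaded from schema_report.yaml.
example_report = {
    "schema_analysis": {
        "summary": {
            "nodes": {
                "all_columns": [
                    "id", "category", "name", "description", "xref", "provided_by",
                ],
            },
        },
        "files": [
            {
                "filename": "genes_nodes.tsv",
                "table_type": "nodes",
                "columns": ["id", "category", "name", "symbol"],
            },
        ],
    }
}

print(missing_columns(example_report))
```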

Node Reports

The koza node-report command generates tabular reports with node counts grouped by categorical columns.

From Database

koza node-report -d graph.duckdb -o node_report.tsv

From File

koza node-report -f nodes.tsv -o node_report.tsv

Custom Columns

Specify which categorical columns to group by:

koza node-report -d graph.duckdb -o report.tsv \
    -c namespace -c category -c provided_by

Output Example

namespace  category         provided_by     count
HGNC       biolink:Gene     gene_source     45000
NCBIGene   biolink:Gene     gene_source     5000
MONDO      biolink:Disease  disease_source  15340
OMIM       biolink:Disease  disease_source  10000

The default grouping columns include namespace, category, and provided_by when present in the data.
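
Because the report is a plain table with a count column, it is easy to roll up further. The Python sketch below sums counts by a single column using only the standard library. It assumes the TSV has a header row and a count column as in the example output; the function name totals_by is hypothetical.

```python
import csv
import io
from collections import Counter

# Hypothetical helper: sum the `count` column of a node report per value
# of one grouping column.

def totals_by(tsv_text, column):
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    totals = Counter()
    for row in reader:
        totals[row[column]] += int(row["count"])
    return dict(totals)


# Inlined copy of the example output; normally read from node_report.tsv.
sample = (
    "namespace\tcategory\tprovided_by\tcount\n"
    "HGNC\tbiolink:Gene\tgene_source\t45000\n"
    "NCBIGene\tbiolink:Gene\tgene_source\t5000\n"
    "MONDO\tbiolink:Disease\tdisease_source\t15340\n"
    "OMIM\tbiolink:Disease\tdisease_source\t10000\n"
)

print(totals_by(sample, "category"))
```

To process a file on disk, pass its contents, for example totals_by(open("node_report.tsv").read(), "category").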

Edge Reports

The koza edge-report command generates tabular reports with edge counts, including denormalized node information.

From Database

koza edge-report -d graph.duckdb -o edge_report.tsv

From Files

When working with separate node and edge files, provide both for denormalization:

koza edge-report -n nodes.tsv -e edges.tsv -o edge_report.tsv

Custom Columns

koza edge-report -d graph.duckdb -o report.tsv \
    -c subject_category -c predicate -c object_category -c primary_knowledge_source

Output Example

subject_category  predicate                object_category    primary_knowledge_source  count
biolink:Gene      biolink:interacts_with   biolink:Gene       string_db                 85000
biolink:Gene      biolink:interacts_with   biolink:Protein    biogrid                   65000
biolink:Disease   biolink:associated_with  biolink:Phenotype  hpo                       98567
biolink:Gene      biolink:associated_with  biolink:Disease    disgenet                  50000

The edge report automatically joins edges to nodes to derive subject_category and object_category from the referenced nodes.
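
To make the denormalization concrete, here is a minimal Python sketch of the idea: look up each edge's subject and object in the node table, then count edges per (subject_category, predicate, object_category) pattern. It illustrates the concept only and is not Koza's implementation; the function name count_edge_patterns and the sample records are hypothetical.

```python
from collections import Counter

# Illustration of the join described above, not Koza's actual code.

def count_edge_patterns(nodes, edges):
    # Map node id -> category so edges can be annotated by lookup.
    category_of = {node["id"]: node["category"] for node in nodes}
    counts = Counter()
    for edge in edges:
        pattern = (
            category_of.get(edge["subject"]),  # None if the node is missing
            edge["predicate"],
            category_of.get(edge["object"]),
        )
        counts[pattern] += 1
    return counts


nodes = [
    {"id": "HGNC:1234", "category": "biolink:Gene"},
    {"id": "HGNC:5678", "category": "biolink:Gene"},
    {"id": "MONDO:0005148", "category": "biolink:Disease"},
]
edges = [
    {"subject": "HGNC:1234", "predicate": "biolink:interacts_with", "object": "HGNC:5678"},
    {"subject": "HGNC:1234", "predicate": "biolink:associated_with", "object": "MONDO:0005148"},
    {"subject": "HGNC:5678", "predicate": "biolink:associated_with", "object": "MONDO:0005148"},
    # Dangling edge: the object id has no matching node.
    {"subject": "HGNC:5678", "predicate": "biolink:interacts_with", "object": "HGNC:0000"},
]

patterns = count_edge_patterns(nodes, edges)
for pattern, count in patterns.items():
    print(pattern, count)
```

In this sketch a dangling edge shows up with a missing category, which is one reason to run the QC report before interpreting edge patterns.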

Node Examples

The koza node-examples command extracts sample nodes for each category or other grouping column.

Basic Usage

koza node-examples -d graph.duckdb -o node_examples.tsv

Custom Sample Size

koza node-examples -d graph.duckdb -o examples.tsv -n 10

Group by Different Column

koza node-examples -d graph.duckdb -o examples.tsv -t provided_by

From File

koza node-examples -f nodes.tsv -o examples.tsv -n 5

Output Example

The output contains up to N sample rows for each distinct value in the type column:

id             name               category         provided_by     ...
HGNC:1234      BRCA1              biolink:Gene     gene_source     ...
HGNC:5678      TP53               biolink:Gene     gene_source     ...
MONDO:0005148  diabetes mellitus  biolink:Disease  disease_source  ...
MONDO:0004975  Alzheimer disease  biolink:Disease  disease_source  ...

The output can be used for:

Data validation and spot-checking
Documentation and examples
Investigating data quality issues
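
The grouping behaviour can be pictured as "keep at most N rows per distinct value of the type column". The Python sketch below does this by taking the first N rows seen for each group. How Koza actually selects its samples (first rows, random rows, or otherwise) is not stated here, so treat the selection rule as an assumption; the function name sample_per_group is hypothetical.

```python
from collections import defaultdict

# Illustration only: keep the first `n` rows for each value of `type_column`.
# The real command may choose its samples differently.

def sample_per_group(rows, type_column, n):
    taken = defaultdict(int)
    samples = []
    for row in rows:
        key = row.get(type_column)
        if taken[key] < n:
            taken[key] += 1
            samples.append(row)
    return samples


rows = [
    {"id": "HGNC:1234", "name": "BRCA1", "category": "biolink:Gene"},
    {"id": "HGNC:5678", "name": "TP53", "category": "biolink:Gene"},
    {"id": "MONDO:0005148", "name": "diabetes mellitus", "category": "biolink:Disease"},
    {"id": "MONDO:0004975", "name": "Alzheimer disease", "category": "biolink:Disease"},
]

for row in sample_per_group(rows, "category", 1):
    print(row["id"], row["category"])
```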

Edge Examples

The koza edge-examples command extracts sample edges for each predicate pattern or custom grouping.

Basic Usage

koza edge-examples -d graph.duckdb -o edge_examples.tsv

Custom Sample Size

koza edge-examples -d graph.duckdb -o examples.tsv -s 10

Custom Type Columns

By default, edges are grouped by subject_category, predicate, and object_category. To group by other columns, pass -t once per column:

koza edge-examples -d graph.duckdb -o examples.tsv \
    -t predicate -t primary_knowledge_source

From Files

koza edge-examples -n nodes.tsv -e edges.tsv -o examples.tsv -s 5

Output Example

subject        predicate                object      subject_name  object_name   primary_knowledge_source  ...
HGNC:1234      biolink:interacts_with   HGNC:5678   BRCA1         TP53          string_db                 ...
HGNC:9999      biolink:interacts_with   HGNC:8888   MYC           MAX           biogrid                   ...
MONDO:0005148  biolink:associated_with  HP:0001943  diabetes      Hypoglycemia  hpo                       ...

Edge examples show:

Subject/object relationships
Predicate usage patterns
Knowledge source attribution

Output Formats

All tabular reports (node-report, edge-report, node-examples, edge-examples) support multiple output formats.

TSV (Default)

koza node-report -d graph.duckdb -o report.tsv --format tsv

Parquet

For large reports or downstream analytics:

koza node-report -d graph.duckdb -o report.parquet --format parquet

CSV

koza node-report -d graph.duckdb -o report.csv --format csv

JSON

koza node-report -d graph.duckdb -o report.json --format json

Common Options

All report commands support these options:

Option	Description
`-d, --database`	Path to DuckDB database file
`-o, --output`	Path to output file
`-q, --quiet`	Suppress progress output

Workflow Examples

Complete QC Pipeline

Generate all reports for a merged graph:

# Generate all report types
koza report qc -d merged.duckdb -o qc_report.yaml
koza report graph-stats -d merged.duckdb -o graph_stats.yaml
koza report schema -d merged.duckdb -o schema_report.yaml

# Generate tabular breakdowns
koza node-report -d merged.duckdb -o node_report.tsv
koza edge-report -d merged.duckdb -o edge_report.tsv

# Extract examples for documentation
koza node-examples -d merged.duckdb -o node_examples.tsv -n 3
koza edge-examples -d merged.duckdb -o edge_examples.tsv -s 3
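
If the pipeline is driven from a script, the same commands can be assembled as argument lists. The Python sketch below only builds and prints the commands listed above, using the flags exactly as they appear in this guide. The function name report_commands and the out_dir parameter are hypothetical, and actually executing the commands (for example with subprocess.run) assumes koza is installed and on your PATH.

```python
import shlex

# Hypothetical helper: build the argv lists for the full reporting pipeline.
# Commands and flags are copied from the workflow above.

def report_commands(database, out_dir="", samples=3):
    n = str(samples)
    return [
        ["koza", "report", "qc", "-d", database, "-o", f"{out_dir}qc_report.yaml"],
        ["koza", "report", "graph-stats", "-d", database, "-o", f"{out_dir}graph_stats.yaml"],
        ["koza", "report", "schema", "-d", database, "-o", f"{out_dir}schema_report.yaml"],
        ["koza", "node-report", "-d", database, "-o", f"{out_dir}node_report.tsv"],
        ["koza", "edge-report", "-d", database, "-o", f"{out_dir}edge_report.tsv"],
        ["koza", "node-examples", "-d", database, "-o", f"{out_dir}node_examples.tsv", "-n", n],
        ["koza", "edge-examples", "-d", database, "-o", f"{out_dir}edge_examples.tsv", "-s", n],
    ]


commands = report_commands("merged.duckdb")
for argv in commands:
    # To execute instead of print: subprocess.run(argv, check=True)
    print(shlex.join(argv))
```

Building argument lists rather than shell strings avoids quoting problems when database or output paths contain spaces.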

Post-Merge Validation

After running koza merge, validate the results:

# Check for data quality issues
koza report qc -d merged.duckdb -o post_merge_qc.yaml

# Verify expected categories and predicates
koza report graph-stats -d merged.duckdb -o post_merge_stats.yaml

# Spot-check examples
koza edge-examples -d merged.duckdb -o edge_samples.tsv -s 5
