Biolink Compliance
Understanding how graph operations work with the Biolink Model and KGX format.
Overview
Knowledge graphs in the biomedical domain use standardized data models for interoperability and consistent querying. The Biolink Model defines the vocabulary and structure for representing biological and medical knowledge. Graph operations in Koza work with Biolink-compliant data in the KGX (Knowledge Graph Exchange) format.
What is Biolink?
The Biolink Model is a high-level data model for representing biological and biomedical knowledge graphs. Key aspects include:
- Standard vocabulary: Defines consistent terms for node types (categories) and edge types (predicates) across different data sources
- Node categories: Hierarchical classification of entities like
biolink:Gene,biolink:Disease,biolink:ChemicalEntity - Edge predicates: Standardized relationship types like
biolink:treats,biolink:interacts_with,biolink:associated_with - Slot definitions: Properties that can be attached to nodes and edges, such as
name,description,in_taxon - Maintained by Monarch Initiative: The model evolves based on community needs and is actively maintained
The Biolink Model uses a LinkML schema definition. This schema specifies field definitions, including whether fields are multivalued, required, or optional. Tools can read this schema programmatically.
KGX Format
KGX (Knowledge Graph Exchange) is the file format that implements the Biolink Model for data exchange. It uses a tabular representation for large-scale data processing.
Node Files
Node files contain entity definitions with columns including:
| Column | Description |
|---|---|
id |
Unique identifier (CURIE format, e.g., HGNC:1234) |
category |
Biolink category (e.g., biolink:Gene) |
name |
Human-readable name |
description |
Detailed description |
provided_by |
Data source attribution |
Edge Files
Edge files define relationships between nodes:
| Column | Description |
|---|---|
subject |
Source node ID (CURIE) |
predicate |
Relationship type (e.g., biolink:interacts_with) |
object |
Target node ID (CURIE) |
id |
Unique edge identifier |
category |
Edge category (e.g., biolink:Association) |
primary_knowledge_source |
Original data source |
File Naming Convention
KGX files follow a naming convention that graph operations use for automatic detection:
*_nodes.tsvornodes.tsv- Node files*_edges.tsvoredges.tsv- Edge files
Supported formats include TSV, JSONL, and Parquet.
Required Fields
For valid KGX data, the following fields are required or strongly recommended:
Nodes
| Field | Requirement | Notes |
|---|---|---|
id |
Required | Must be a valid CURIE (e.g., HGNC:1234, HP:0001234) |
category |
Recommended | Biolink category; defaults to biolink:NamedThing if missing |
name |
Recommended | Human-readable label for the entity |
Edges
| Field | Requirement | Notes |
|---|---|---|
subject |
Required | CURIE of the source node |
predicate |
Required | Biolink predicate for the relationship |
object |
Required | CURIE of the target node |
id |
Recommended | Unique identifier for the edge |
primary_knowledge_source |
Recommended | Attribution for TRAPI compliance |
Common Fields
Beyond the required fields, these columns appear frequently in KGX data:
Node Properties
name: Human-readable labeldescription: Longer textual descriptionprovided_by: Data source attribution (used for grouping in QC reports)in_taxon: Taxonomic context for biological entities (e.g.,NCBITaxon:9606for human)in_taxon_label: Human-readable taxon namexref: Cross-references to other databasessynonym: Alternative names
Edge Properties
category: Edge category (typicallybiolink:Associationor a subclass)negated: Boolean indicating negation of the relationshipknowledge_level: TRAPI knowledge level (e.g.,knowledge_assertion,logical_entailment)agent_type: TRAPI agent type (e.g.,manual_agent,automated_agent)aggregator_knowledge_source: Intermediate data aggregatorspublications: Supporting literature references
Multivalued Fields
Some Biolink fields can contain multiple values. Graph operations handle these as arrays.
Common Multivalued Fields
According to the Biolink Model schema, these fields are defined as multivalued:
xref/xrefs: Cross-references to external databasessynonym/synonyms: Alternative names for entitiespublications: Supporting literature citationsprovided_by: Multiple data sources contributing to a recordqualifiers: Edge qualifiers for nuanced relationshipsknowledge_source: Knowledge attribution chainaggregator_knowledge_source: Multiple aggregating sources
Array Handling in Graph Operations
When loading KGX files, graph operations:
- Detect array columns using the Biolink Model schema (via LinkML)
- Parse pipe-delimited values in TSV files (e.g.,
PMID:123|PMID:456) - Store as native arrays in DuckDB
- Preserve array structure when exporting back to KGX format
# Example: How arrays appear in different formats
# TSV: xref column contains "UniProtKB:P12345|ENSEMBL:ENSG00000139618"
# DuckDB: xref column stored as ['UniProtKB:P12345', 'ENSEMBL:ENSG00000139618']
# JSONL: "xref": ["UniProtKB:P12345", "ENSEMBL:ENSG00000139618"]
Configuration for Multivalued Fields
Some fields that are technically multivalued in Biolink are treated as single-valued in graph operations:
category: While nodes can have multiple categories, operations typically use the most specific onein_taxon: Entities usually have a single primary taxontype: Similar to category, treated as single-valued
This behavior can be customized through schema configuration.
Compliance Checking
Graph operations include tools to verify Biolink compliance of your data.
Schema Report Command
Generate a schema analysis report using:
koza report schema --database my_graph.duckdb --output schema_report.yaml
This produces a YAML report containing:
metadata:
operation: schema
generated_at: '2024-01-15 10:30:00'
report_version: '1.0'
schema_analysis:
summary:
nodes:
file_count: 5
unique_columns: 12
all_columns:
- id
- category
- name
- in_taxon
# ... more columns
edges:
file_count: 5
unique_columns: 15
all_columns:
- subject
- predicate
- object
# ... more columns
tables:
nodes:
columns:
- name: id
type: VARCHAR
- name: category
type: VARCHAR
column_count: 12
record_count: 50000
edges:
columns:
- name: subject
type: VARCHAR
- name: predicate
type: VARCHAR
- name: object
type: VARCHAR
column_count: 15
record_count: 100000
biolink_compliance:
status: compliant
compliance_percentage: 95.5
missing_fields: []
extension_fields:
- custom_score
- source_version
What Compliance Checking Validates
The schema report analyzes:
- Required fields present: Ensures
idfor nodes andsubject/predicate/objectfor edges - Column data types: Validates appropriate types (VARCHAR for IDs, arrays for multivalued fields)
- Biolink slot coverage: Identifies which columns map to standard Biolink slots
- Extension fields: Lists custom columns not defined in the Biolink Model
- Schema consistency: Detects variations in column structure across source files
Using with Join Operations
Schema reports are automatically generated during join operations when schema_reporting=True:
koza join --nodes data/*_nodes.tsv --edges data/*_edges.tsv \
--output merged.duckdb
# Produces: merged_schema_report.yaml
Common Compliance Issues
Missing Required Fields
Problem: Edges missing subject, predicate, or object columns.
Error: Required edge columns missing: ['predicate']
Solution: Ensure your source data includes all required columns before joining.
Invalid Predicates
Problem: Using predicates not defined in the Biolink Model.
# Problematic:
predicate: custom:related_to
# Correct:
predicate: biolink:related_to
Solution: Map custom predicates to standard Biolink predicates, or use biolink:related_to as a generic fallback.
Non-Standard Categories
Problem: Node categories that don't exist in Biolink.
# Problematic:
category: MyDatabase:ProteinEntity
# Correct:
category: biolink:Protein
Solution: Map source categories to appropriate Biolink categories. Use the category hierarchy to find the most specific valid category.
Malformed CURIEs
Problem: IDs that don't follow CURIE format.
# Problematic:
id: 12345
id: http://example.org/entity/12345
# Correct:
id: EXAMPLE:12345
Solution: Ensure all IDs follow the prefix:local_id format. Use normalization to standardize IDs.
Missing Provenance
Problem: Data without source attribution lacks data lineage tracking.
# Missing provenance:
- id: HGNC:1234
category: biolink:Gene
name: BRCA1
# With provenance:
- id: HGNC:1234
category: biolink:Gene
name: BRCA1
provided_by: infores:hgnc
Solution: Use the --generate-provided-by flag during join/merge operations to automatically add provenance from filenames.
Schema Mismatches Across Sources
Problem: Different source files have different column structures.
schema_analysis:
# File A has 10 columns, File B has 15 columns
# This causes NULL values in merged data
Solution: The join operation creates a unified schema automatically. Review the schema report to see which columns come from which sources.
Best Practices
- Validate early: Run schema reports on source files before large merge operations
- Use standard prefixes: Stick to well-known CURIE prefixes (HGNC, HP, MONDO, etc.)
- Include provenance: Always populate
provided_byorprimary_knowledge_source - Document extensions: If using custom columns, document their purpose and expected values
- Regular compliance checks: Integrate schema validation into your data pipeline
Related Documentation
- Schema Handling - How graph operations manage schema evolution
- Data Integrity - Ensuring data quality during operations
- Generate Reports How-To - Practical guide to report generation