# 02 Databases and Links

## Quick Answer
EVd3x integrates 17 canonical analysis Parquet tables; runtime and auxiliary caches are not counted. The tables are built by explicit ingestion pipelines under `data_sources/`. Core linkage keys are `miRBase_ID`, `Gene_ID`, `Protein_ID`, and `Disease_ID`. Source-level provenance and schema snapshots are published in `static/docs/data_source_inventory.json`.

## What this does
This section documents exactly which external databases are used, what fields are retained, how identifiers are linked, and where each source appears in EVd3x analyses.

## Inputs
- Source ingestion scripts under `data_sources/01..10`
- Curated parquet outputs under `sample_databases/*.parquet`
- Source inventory artifact: `static/docs/data_source_inventory.json`

## Outputs
- Canonical linked entities (`miRNAs.parquet`, `genes.parquet`, `proteins.parquet`)
- Analysis tables for targets, expression, pathways, communication, disease, EV evidence
- Traceable source labels (`Source_Database`, `Pathway_Source`, `Source`)

## How calculated
Pipeline scripts standardize raw source files, map aliases to canonical IDs, and emit normalized CSV/parquet tables. EVd3x backend endpoints then read projected columns and join on canonical keys. Examples:
- miRNA targets: `miRNA_ID` -> `Gene_ID`
- expression: `Gene_ID`/`miRBase_ID` -> localization/system bins
- pathways: `Gene_ID` -> `Pathway_Name` and `Pathway_Source`
- disease: `Gene_ID` or `miRBase_ID` -> canonical `Disease_ID`
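
The join pattern above can be illustrated with a minimal sketch. It assumes toy in-memory rows in place of the projected Parquet columns; the identifiers, pathway names, and the helper `join_on_gene` are hypothetical and are not taken from the EVd3x backend.

```python
from collections import defaultdict

# Toy rows standing in for projected columns of two tables (all values invented).
target_rows = [
    {"miRNA_ID": "MIR_A", "Gene_ID": "GENE_1"},
    {"miRNA_ID": "MIR_A", "Gene_ID": "GENE_2"},
    {"miRNA_ID": "MIR_B", "Gene_ID": "GENE_3"},  # no pathway row, so dropped by the join
]
pathway_rows = [
    {"Gene_ID": "GENE_1", "Pathway_Name": "Pathway X", "Pathway_Source": "Reactome"},
    {"Gene_ID": "GENE_1", "Pathway_Name": "Pathway Y", "Pathway_Source": "KEGG"},
    {"Gene_ID": "GENE_2", "Pathway_Name": "Pathway X", "Pathway_Source": "Reactome"},
]


def join_on_gene(targets, pathways):
    """Inner-join target rows to pathway rows on the canonical Gene_ID key."""
    # Index pathway rows by Gene_ID so each target row needs one lookup.
    by_gene = defaultdict(list)
    for pathway in pathways:
        by_gene[pathway["Gene_ID"]].append(pathway)

    joined = []
    for target in targets:
        for pathway in by_gene.get(target["Gene_ID"], []):
            joined.append({
                **target,
                "Pathway_Name": pathway["Pathway_Name"],
                "Pathway_Source": pathway["Pathway_Source"],
            })
    return joined


linked = join_on_gene(target_rows, pathway_rows)
for row in linked:
    print(row["miRNA_ID"], row["Gene_ID"], row["Pathway_Name"], row["Pathway_Source"])
```

Keeping `Pathway_Source` on every joined row is what lets a downstream filter restrict results to a single source family.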

## What to download
For provenance-grade reporting, export:
- `99_meta/export_manifest.csv` from analysis export bundle
- Full CSV modules (`01_network` to `08_disease_analysis`)
- Source inventory artifact `static/docs/data_source_inventory.json`

## Known limits
Some upstream databases change their schemas and naming conventions over time. EVd3x preserves canonical IDs and source tags, but drift in source-native labels can still affect display labels and coverage depth.

## Runtime database distinction

The hosted analysis runtime reads the 17 canonical Parquet tables in `sample_databases/` plus auxiliary caches. The SQLite files named `evd3x_knowledge_graph*.db` are legacy or converted graph snapshots with a smaller table layout. They are useful for archival graph inspection, but they are not the canonical substrate for the app documentation, manuscript counts, or supplementary TSV exports.

## Source-by-Source Provenance

### Identity and Mapping Layer

#### Ensembl BioMart
- Link: https://www.ensembl.org/biomart/martview
- Raw inputs: `01_lookup_tables/ensembl_master_list.csv`
- Retained fields: Ensembl gene IDs, gene symbols, names, UniProt cross-reference
- Used by: canonical `genes.parquet`, `proteins.parquet`, global ID joins

#### miRBase
- Link: https://www.mirbase.org/
- Raw inputs: `01_lookup_tables/miRNA.xlsx`, `01_lookup_tables/miRNAs.csv`
- Retained fields: `miRBase_ID`, `miRNA_Name`
- Used by: query parsing, miRNA-target joins, disease joins, annotation cache keying

#### UniProt
- Link: https://www.uniprot.org/
- Raw inputs: `01_lookup_tables/uniprot_human_data.tsv`
- Retained fields: accession and protein/gene naming fields needed for protein mapping
- Used by: `proteins.parquet`, molecule summary enrichment, LR source scripts

#### STRING
- Link: https://string-db.org/
- Raw inputs: `01_lookup_tables/string_annotations.txt`, parquet interaction table
- Retained fields: `protein1`, `protein2`, `combined_score` plus evidence channels
- Used by: PPI expansion endpoint and protein neighborhood exploration

### Expression and Localization Layer

#### Human Protein Atlas (HPA)
- Link: https://www.proteinatlas.org/
- Raw inputs: `02_hpa_expression/normal_tissue.tsv`, `02_hpa_expression/rna_consensus.tsv`, `02_hpa_expression/rna_single_cell_type.tsv`
- Retained fields: protein/RNA expression levels, tissue and cell-type context
- Used by: collective expression, cell specificity scoring, node localization context

#### RNALocate
- Link: http://www.rnalocate.org/
- Raw inputs: `04_localization_mapping/rnalocate_experimental*.txt`, `rnalocate_predicted*.txt`
- Retained fields: localization labels and scores, experimental vs predicted evidence type
- Used by: gene and miRNA localization/expression context in Node and Collective tabs

#### mirMine
- Link: https://guanfiles.dcmb.med.umich.edu/mirmine/
- Raw inputs: `04_localization_mapping/mirmine.xlsx`
- Retained fields: miRNA expression records mapped to EVd3x localization schema
- Used by: `miRNA_expression.parquet`

#### miRNA-atlas
- Link: https://ccb-web.cs.uni-saarland.de/tissueatlas/
- Raw inputs: `04_localization_mapping/mirna_atlas.csv`
- Retained fields: miRNA expression values with tissue/cell context
- Used by: `miRNA_expression.parquet`

### miRNA Target Layer

#### miRTarBase
- Link: https://mirtarbase.cuhk.edu.cn/
- Raw inputs: `03_miRNA_targets/mirtarbase_human.csv`
- Retained fields: experimentally supported interactions and experiment labels
- Used by: high-confidence target foundation and `miRNA_targets_scored.parquet`

#### TarBase
- Link: https://dianalab.e-ce.uth.gr/tarbasev9
- Raw inputs: `03_miRNA_targets/tarbase_human.tsv`
- Retained fields: additional experimental evidence and confidence hints
- Used by: merged interaction evidence in scored target set

#### TargetScan
- Link: https://www.targetscan.org/
- Raw inputs: `03_miRNA_targets/targetscan_data.txt`
- Retained fields: predicted target interactions
- Used by: enrichment of prediction-only edges in scored targets

Target rows are integrated into `miRNA_targets_scored.parquet` with `miRNA_ID`, `Gene_ID`, `Confidence_Score`, and `Source_Database`. Downstream analyses aggregate targets by mRNA using support count, mean confidence, and maximum confidence. Shared targets are useful for ranking follow-up hypotheses, but they do not validate repression or EV-mediated delivery.
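
As a rough illustration of that aggregation, the sketch below groups toy target rows by `Gene_ID` and computes the three summary values. It assumes that "support count" means the number of distinct miRNAs targeting a gene; that reading, the helper `aggregate_targets`, the ranking order, and all row values are assumptions for illustration, not the documented EVd3x implementation.

```python
from collections import defaultdict
from statistics import mean

# Toy rows shaped like miRNA_targets_scored.parquet (all values invented).
rows = [
    {"miRNA_ID": "MIR_A", "Gene_ID": "GENE_1", "Confidence_Score": 0.75, "Source_Database": "miRTarBase"},
    {"miRNA_ID": "MIR_B", "Gene_ID": "GENE_1", "Confidence_Score": 0.25, "Source_Database": "TargetScan"},
    {"miRNA_ID": "MIR_A", "Gene_ID": "GENE_2", "Confidence_Score": 0.5, "Source_Database": "TarBase"},
]


def aggregate_targets(target_rows):
    """Summarise target rows per gene: support count, mean and max confidence."""
    scores = defaultdict(list)
    mirnas = defaultdict(set)
    for row in target_rows:
        scores[row["Gene_ID"]].append(row["Confidence_Score"])
        mirnas[row["Gene_ID"]].add(row["miRNA_ID"])

    return {
        gene: {
            # Assumption: support is counted as distinct miRNAs per gene.
            "support_count": len(mirnas[gene]),
            "mean_confidence": mean(values),
            "max_confidence": max(values),
        }
        for gene, values in scores.items()
    }


summary = aggregate_targets(rows)

# Rank shared targets: more supporting miRNAs first, then higher maximum confidence.
ranked = sorted(
    summary.items(),
    key=lambda item: (-item[1]["support_count"], -item[1]["max_confidence"]),
)
for gene, stats in ranked:
    print(gene, stats)
```

The ranking only orders candidates for follow-up; a high support count here is a property of the source databases, not experimental confirmation.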

### EV Evidence Layer

#### ExoCarta
- Link: http://www.exocarta.org/
- Raw inputs: `05_EV_databases/exocarta_*`
- Retained fields: molecule IDs and supporting publication mapping
- Used by: `ev_evidence.parquet` (`Source_Database` includes ExoCarta)

#### Vesiclepedia
- Link: http://microvesicles.org/
- Raw inputs: `05_EV_databases/vesiclepedia_*`
- Retained fields: EV molecule reports, sample and method context
- Used by: `ev_evidence.parquet` (largest source share in current snapshot)

#### SVAtlas
- Link: https://ngdc.cncb.ac.cn/svatlas/
- Raw inputs: `05_EV_databases/svatlas_data/*`
- Retained fields: project and marker evidence records
- Used by: `ev_evidence.parquet` as additional EV source family

#### EV-Track
- Link: https://evtrack.org/
- Raw inputs: `05_EV_databases/evtrack_data.xlsx`
- Retained fields: publication metadata (`EV_Track_ID`, methods, score)
- Used by: `publication_details.parquet` and EV evidence context

### Pathway Layer

#### Reactome
- Link: https://reactome.org/
- Raw inputs: `06_pathway_enrichment/ensembl_to_reactome.txt`
- Retained fields: gene-pathway links, pathway names
- Used by: pathway enrichment and pathway category graphs

#### KEGG
- Link: https://www.genome.jp/kegg/
- Raw inputs: `06_pathway_enrichment/ensembl_to_ncbi.txt`, plus KEGG REST mapping performed in the pipeline script
- Retained fields: KEGG pathway membership and names
- Used by: pathway enrichment source filter

#### Gene Ontology (GO)
- Link: https://geneontology.org/
- Raw inputs: `06_pathway_enrichment/ensembl_to_go.txt`
- Retained fields: `GO biological_process` pathway-like terms
- Used by: pathway enrichment source filter and category views

#### WikiPathways
- Link: https://www.wikipathways.org/
- Raw inputs: `06_pathway_enrichment/wikipathways_homo_sapiens.txt`
- Retained fields: pathway IDs/names with gene membership
- Used by: pathway enrichment source filter and category views

### Cell Communication and Specificity Layer

#### CellPhoneDB
- Link: https://www.cellphonedb.org/
- Raw inputs: `07_cell_communication/interaction_input.csv`, `protein_input.csv`, `complex_input.csv`
- Retained fields: ligand/receptor interaction templates and identifiers
- Used by: LR table generation workflows

#### OmniPath Intercell and linked LR resources
- Link: https://omnipathdb.org/
- Raw inputs: generated by `download_omnipath_full.py`
- Retained fields: interaction directionality, stimulation/inhibition flags, multi-source provenance
- Used by: `ligand_receptor_pairs_full.parquet`

#### CellMarker
- Link: http://bio-bigdata.hrbmu.edu.cn/CellMarker/
- Raw inputs: `08_cell_specificity/Cell_marker_Human.xlsx`
- Retained fields: marker support for gene-cell relationships
- Used by: cell specificity source blending

#### PanglaoDB
- Link: https://panglaodb.se/
- Raw inputs: `08_cell_specificity/PanglaoDB_Human.tsv`
- Retained fields: human marker/cell-type evidence
- Used by: cell specificity source blending

### Disease and Annotation Layer

#### DisGeNET
- Link: https://www.disgenet.org/
- Raw inputs: API-backed scripts in `09_gene_diseases/process_disgenet*.py`
- Retained fields: `Disease_ID`, names, score, association type, publication metadata
- Used by: disease analysis, node disease panels, grouped disease exports

#### RNAcentral
- Link: https://rnacentral.org/
- Raw inputs: runtime API fetch cached in `mirna_annotation_cache.parquet`
- Retained fields: RNAcentral ID, sequence, status, fetch timestamp
- Used by: miRNA summary annotation and exomotif checks

## Dataset Inventory Snapshot

| Dataset | Rows | Key Columns | Primary Sources |
|---|---:|---|---|
| `miRNA_targets_scored.parquet` | 2,639,746 | `miRNA_ID`, `Gene_ID`, `Confidence_Score` | miRTarBase, TarBase, TargetScan |
| `gene_expression.parquet` | 2,479,776 | `Gene_ID`, `Localization_System`, `Source_Database` | HPA, RNALocate |
| `cell_specificity_unified.parquet` | 1,587,444 | `Gene_ID`, `Cell_Type`, `System`, `Source` | HPA, CellMarker, PanglaoDB |
| `gene_disease_associations.parquet` | 501,157 | `Gene_ID`, `Disease_ID`, `Score` | DisGeNET |
| `gene_pathways.parquet` | 404,758 | `Gene_ID`, `Pathway_Name`, `Pathway_Source` | Reactome, KEGG, GO, WikiPathways |
| `miRNA_expression.parquet` | 291,427 | `miRBase_ID`, `Localization_System`, `Source_Database` | miRNA-atlas, mirMine, RNALocate |
| `ev_evidence.parquet` | 234,090 | `Molecule_Standard_ID`, `Molecule_Type`, `Source_Database` | Vesiclepedia, ExoCarta, SVAtlas |
| `ligand_receptor_pairs_full.parquet` | 25,779 | `ligand_gene_symbol`, `receptor_gene_symbol`, `sources` | CellPhoneDB + OmniPath-linked resources |

Full per-table schema and script lineage are in `static/docs/data_source_inventory.json`. Runtime or auxiliary caches, including `mirna_annotation_cache.parquet`, `gene_description_cache.parquet`, and `protein_description_cache.parquet`, are not counted in the 17 canonical analysis tables.
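
When counting tables locally, caches need to be separated from canonical tables. The sketch below assumes that cache files can be recognised by a `_cache` suffix on the file stem, which holds for the three caches named above but is not a documented rule; the helper `split_canonical_and_caches` is hypothetical.

```python
from pathlib import Path


def split_canonical_and_caches(filenames):
    """Partition Parquet file names into canonical tables and caches.

    Assumption: a cache is any file whose stem ends with "_cache".
    """
    canonical, caches = [], []
    for name in filenames:
        if Path(name).stem.endswith("_cache"):
            caches.append(name)
        else:
            canonical.append(name)
    return sorted(canonical), sorted(caches)


# File names taken from this page; a real listing could come from
# Path("sample_databases").glob("*.parquet").
example_files = [
    "miRNA_targets_scored.parquet",
    "gene_pathways.parquet",
    "mirna_annotation_cache.parquet",
    "gene_description_cache.parquet",
    "protein_description_cache.parquet",
]

canonical, caches = split_canonical_and_caches(example_files)
print("canonical:", canonical)
print("caches:", caches)
```

For an authoritative list, `static/docs/data_source_inventory.json` remains the reference rather than a file-name heuristic.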

## Known caveats by source
- DisGeNET: source-specific evidence models can produce mixed association types and uneven disease coverage.
- RNALocate: combines experimental and predicted signals; interpretation should respect `Evidence_Type`.
- TargetScan: predicted interactions increase recall but may include lower-confidence edges.
- GO pathway terms: broad term granularity can inflate category density.
- EV evidence databases: reporting bias by molecule type and study methodology is expected.
