Bases: BaseAnnotationCombiner

Combines annotations from SRA-based sources, mapping accession IDs to GSM.

Run-level (xxR) IDs are first resolved to experiment-level (xxX) IDs via src_sra_runs; both xxR and xxX IDs are then mapped to GEO sample IDs (GSM) via src_geo_samples in the OmicIDX DuckDB database.
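
As a rough illustration of the two-step resolution, the sketch below uses plain dictionaries in place of the src_sra_runs and src_geo_samples lookups. It reflects one reading of the description: run IDs are lifted to experiment level, and the experiment ID is then looked up. The function name, the dictionary names, and the accessions are hypothetical placeholders, not part of the package.

```python
def resolve_to_gsm(accession, run_to_experiment, experiment_to_gsm):
    """Return the GSM for an SRA accession, or None if it cannot be mapped."""
    # Run-level accessions (SRR/ERR/DRR) have "R" as their third character.
    if accession[2:3] == "R":
        accession = run_to_experiment.get(accession)
        if accession is None:
            return None
    # Experiment-level accessions (SRX/ERX/DRX) are looked up directly.
    return experiment_to_gsm.get(accession)


# Made-up placeholder accessions, for illustration only.
run_to_experiment = {"SRR0000001": "SRX0000001"}   # stands in for src_sra_runs
experiment_to_gsm = {"SRX0000001": "GSM0000001"}   # stands in for src_geo_samples

gsm_from_run = resolve_to_gsm("SRR0000001", run_to_experiment, experiment_to_gsm)
gsm_from_experiment = resolve_to_gsm("SRX0000001", run_to_experiment, experiment_to_gsm)
unmapped = resolve_to_gsm("SRR0000002", run_to_experiment, experiment_to_gsm)
```

In this sketch an accession that fails at either step yields None, which corresponds to an ID that cannot be mapped to a GSM.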

Example:

>>> combiner = SraCombiner()
>>> combiner.combine().clean().save(SRA_COMBINED_BSON)

combine(db_path=OMICIDX_DB, overrides=None)

Load and combine all SRA source parquets, mapping IDs to GSM.

Sources whose parquet file does not exist are skipped with a warning. Within each source, rows whose accession ID cannot be mapped to a GSM are dropped and counted.
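
The skip-and-drop behaviour described above might look roughly like the following. This is a minimal sketch, not the actual implementation: combine_sources, load_rows, to_gsm, and the "id" row key are all hypothetical names, and rows are modelled as plain dictionaries rather than parquet-backed tables.

```python
import warnings
from pathlib import Path


def combine_sources(sources, load_rows, to_gsm):
    """Combine rows from every existing source, keyed by GSM.

    sources   : dict mapping source name -> file path (assumed shape)
    load_rows : hypothetical callable returning an iterable of dict rows
    to_gsm    : hypothetical callable mapping an accession to a GSM or None
    """
    combined = []
    dropped = {}  # per-source count of rows that could not be mapped
    for name, path in sources.items():
        if not Path(path).exists():
            # Missing files are skipped with a warning, not treated as errors.
            warnings.warn(f"Skipping source {name!r}: {path} not found")
            continue
        dropped[name] = 0
        for row in load_rows(path):
            gsm = to_gsm(row["id"])
            if gsm is None:
                dropped[name] += 1
                continue
            combined.append({**row, "id": gsm})
    return combined, dropped
```

Keeping the dropped count per source, as this sketch does, makes it easier to see which source contributes most of the unmappable accessions.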

Parameters:
  • db_path (Path, default: OMICIDX_DB) –

    Path to the OmicIDX DuckDB database file. Defaults to the package-wide OMICIDX_DB constant.

  • overrides (dict[str, Path] | None, default: None) –

    Per-source path overrides. Keys are source names from SRA_SOURCES; values replace the default path for that source.
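
One way the overrides argument could be applied is sketched below, assuming SRA_SOURCES behaves like a mapping from source name to default path (the source does not confirm its structure). The helper name and the source names are hypothetical, and ignoring override keys that do not name a known source is a choice made for this sketch only.

```python
from pathlib import Path


def resolve_source_paths(defaults, overrides=None):
    """Return the path to use for each source, preferring overrides."""
    overrides = overrides or {}
    # Only names present in the defaults are kept; unknown keys are ignored here.
    return {name: Path(overrides.get(name, path)) for name, path in defaults.items()}


# Placeholder source names and paths, for illustration only.
default_paths = {
    "source_a": Path("data/source_a.parquet"),
    "source_b": Path("data/source_b.parquet"),
}
paths = resolve_source_paths(default_paths, {"source_b": Path("/tmp/source_b_fixed.parquet")})
```

Here source_a keeps its default path while source_b uses the override.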

Returns:
  • SraCombiner

    self, for chaining.