Bases: BaseAnnotationCombiner

Merges GEO and SRA combined annotation BSONs into a single sample-level DB.

Example:

>>> combiner = SampleCombiner()
>>> combiner.combine().clean().save(SAMPLE_COMBINED_BSON)

combine(geo_bson=GEO_COMBINED_BSON, sra_bson=SRA_COMBINED_BSON, db_path=OMICIDX_DB)

Load and merge the GEO and SRA combined BSONs, then enrich accession IDs.

For each GSM present in either source, annotation-type dicts are merged by source name. The accession_ids dicts are unioned, with the SRA entry taking precedence when keys conflict. Accession IDs are then enriched from OmicIDX with series, platform, srx, and srp fields.
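The merge rule described above can be illustrated with a minimal sketch. This is not the library's implementation: the helper name `merge_sample_records` and the record layout (each annotation type mapping source names to values, plus an `accession_ids` dict) are assumptions made for illustration, and the sample values are placeholders.

```python
def merge_sample_records(geo: dict, sra: dict) -> dict:
    """Merge two {gsm: record} mappings; SRA wins on accession_ids conflicts.

    Hypothetical helper: the record layout is an assumption, not the real schema.
    """
    merged = {}
    # Visit every GSM that appears in either source.
    for gsm in geo.keys() | sra.keys():
        geo_rec = geo.get(gsm, {})
        sra_rec = sra.get(gsm, {})
        record = {}
        # Annotation-type dicts: combine the per-source entries for each type.
        for ann_type in (geo_rec.keys() | sra_rec.keys()) - {"accession_ids"}:
            record[ann_type] = {
                **geo_rec.get(ann_type, {}),
                **sra_rec.get(ann_type, {}),
            }
        # accession_ids: union of both dicts; SRA is unpacked last, so it
        # overrides GEO wherever the same key appears in both.
        record["accession_ids"] = {
            **geo_rec.get("accession_ids", {}),
            **sra_rec.get("accession_ids", {}),
        }
        merged[gsm] = record
    return merged


# Placeholder data, invented for this example.
geo = {
    "GSM1": {
        "tissue": {"geo_source": "liver"},
        "accession_ids": {"gsm": "GSM1", "srx": "SRX_OLD"},
    },
}
sra = {
    "GSM1": {
        "tissue": {"sra_source": "hepatic"},
        "accession_ids": {"srx": "SRX_NEW"},
    },
    "GSM2": {
        "sex": {"sra_source": "female"},
        "accession_ids": {"srx": "SRX_OTHER"},
    },
}
merged = merge_sample_records(geo, sra)
```

In this sketch, GSM1 keeps both tissue entries side by side under their source names, while its conflicting srx key resolves to the SRA value.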

Parameters:
  • geo_bson (Path, default: GEO_COMBINED_BSON) –

    Path to the GEO combined BSON file. Defaults to PROCESSED_DIR/geo_combined.bson.

  • sra_bson (Path, default: SRA_COMBINED_BSON) –

    Path to the SRA combined BSON file. Defaults to PROCESSED_DIR/sra_combined.bson.

  • db_path (Path, default: OMICIDX_DB) –

    Path to the OmicIDX DuckDB database file. Defaults to the package-wide OMICIDX_DB constant.

Returns:
  • SampleCombiner

    self, for chaining.

enrich_annotations(all_gsm, db_path)

Enrich sample annotations with organism, series IDs, platform IDs, and cross-references to SRA. The annotations are updated in place.

Parameters:
  • all_gsm (set[str]) –

    Set of all sample IDs (GSMs) to enrich.

  • db_path (Path) –

    Path to the OmicIDX DuckDB database file.

Returns:
  • SampleCombiner

    self, for chaining.
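
As a rough illustration of the in-place update and the chaining return value, the sketch below uses a toy class rather than the real SampleCombiner. The class name `EnrichmentSketch`, the `lookup` argument (a plain dict standing in for whatever rows an OmicIDX query would return), and all ID values are hypothetical.

```python
class EnrichmentSketch:
    """Toy stand-in for the combiner; not the real SampleCombiner."""

    def __init__(self, annotations: dict):
        self.annotations = annotations

    def enrich_annotations(self, all_gsm: set, lookup: dict) -> "EnrichmentSketch":
        # `lookup` is a hypothetical {gsm: fields} mapping; the real method
        # takes a db_path and reads from the OmicIDX DuckDB database instead.
        for gsm in all_gsm:
            record = self.annotations.setdefault(gsm, {})
            ids = record.setdefault("accession_ids", {})
            # Mutate the existing dict so the update is visible to anyone
            # holding a reference to the annotations.
            ids.update(lookup.get(gsm, {}))
        return self  # returning self is what enables method chaining


# Placeholder IDs, invented for this example.
annotations = {"GSM1": {"accession_ids": {"gsm": "GSM1"}}}
lookup = {
    "GSM1": {
        "series": ["GSE000"],
        "platform": "GPL000",
        "srx": "SRX000",
        "srp": "SRP000",
    }
}
sketch = EnrichmentSketch(annotations).enrich_annotations({"GSM1"}, lookup)
```

Because the method mutates the existing dicts and returns self, the original `annotations` object carries the new fields and further calls can be chained onto the result.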