Bases: BaseProcessor

Processor for Sirota et al. 2011 annotations.

Explodes comma-separated GSM lists from each GDS row into individual sample records, joins with manually-curated UMLS → MONDO and UMLS → UBERON mapping files, and filters to system-level ontology descendants.
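The explode-and-join step described above can be sketched with pandas as follows. This is a minimal illustration rather than the processor's actual implementation: the column names (`gds_id`, `gsm_list`, `umls_id`), the sample rows, and the mapping table are all hypothetical placeholders, and the ontology-descendant filter is omitted.

```python
import pandas as pd

# Hypothetical input: one row per GDS with a comma-separated GSM list.
gds = pd.DataFrame(
    {
        "gds_id": ["GDS1", "GDS2"],
        "gsm_list": ["GSM10, GSM11", "GSM20"],
        "umls_id": ["C0001", "C0002"],
    }
)

# Hypothetical curated UMLS -> MONDO mapping (placeholder identifiers).
umls_to_mondo = pd.DataFrame(
    {
        "umls_id": ["C0001"],
        "term_id": ["MONDO:0000001"],
    }
)

samples = (
    gds.assign(sample_id=gds["gsm_list"].str.split(","))  # string -> list of GSM IDs
    .explode("sample_id")  # one row per GSM
    .assign(sample_id=lambda df: df["sample_id"].str.strip())  # drop stray spaces
    .merge(umls_to_mondo, on="umls_id", how="inner")  # keep only mapped terms
    .reset_index(drop=True)
)

print(samples[["sample_id", "term_id"]])
```

In this sketch the inner join drops GDS rows whose UMLS concept has no curated mapping, so only the two samples from the first row remain.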

process(output_dir, **kwargs)

Process Sirota 2011 CSV into standardized sample annotations.

Parameters:
  • output_dir (Path) –

    Directory where the processed parquet file will be written.

  • **kwargs (Any, default: {}) –

    Supported keyword: input_path (Path | str), which overrides the default input path (SIROTA_2011_CSV from config).

Returns:
  • DataFrame

    Standardized annotations with columns sample_id, annotation_type, term_id, term_label, and ecode.
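As a rough illustration of the documented output schema, a caller could check a returned frame for the five columns listed above. The helper name `missing_columns` and the example row are assumptions made for this sketch. They are not part of the processor's API, and the values are placeholders rather than real annotations.

```python
import pandas as pd

# Columns named in the Returns description above.
EXPECTED_COLUMNS = ["sample_id", "annotation_type", "term_id", "term_label", "ecode"]


def missing_columns(df: pd.DataFrame) -> list[str]:
    """Return the documented columns that are absent from df (hypothetical helper)."""
    return [col for col in EXPECTED_COLUMNS if col not in df.columns]


# Placeholder row shaped like the documented output; values are illustrative only.
example = pd.DataFrame(
    {
        "sample_id": ["GSM10"],
        "annotation_type": ["placeholder_type"],
        "term_id": ["MONDO:0000001"],
        "term_label": ["placeholder label"],
        "ecode": ["placeholder_ecode"],
    }
)

print(missing_columns(example))
print(missing_columns(example.drop(columns=["ecode"])))
```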

validate(data)

Validate processed Sirota 2011 data.

Parameters:
  • data (DataFrame) –

    Processed annotations DataFrame to validate.

Returns:
  • bool

    True if validation passes.

Raises: