Bases: BaseProcessor

Processor for Johnson 2023 manually curated annotations.

Processes two datasets:
  • Microarray (GPL570): GEO samples with MESH-formatted disease and tissue
  • RNA-seq (refine.bio): SRA samples with DOID disease and free-text tissue

All annotations are expert-curated (ecode='expert').

process(output_dir=PROCESSED_DIR, **kwargs)

Process Johnson 2023 datasets into standardized annotations.

Parameters:
  • output_dir (Path, default: PROCESSED_DIR ) –

    Directory where the processed parquet files will be written. Defaults to data/processed.

  • **kwargs (Any, default: {} ) –

    microarray_input_path (Path) - override microarray input file
    rnaseq_input_path (Path) - override RNA-seq input file

Returns:
  • DataFrame

    Standardized annotations with columns sample_id, annotation_type, term_id, term_label, and ecode.
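As a rough illustration of the standardized schema described above, the sketch below checks a few rows for the five documented columns and for ecode='expert'. It uses plain dict records (the shape pandas produces with DataFrame.to_dict("records")) so that it runs without dependencies. The helper name check_records, the sample identifiers, the term IDs, and the annotation_type values "disease" and "tissue" are assumptions made for this example, not part of the processor's API.

```python
# Columns documented in the Returns section of process().
EXPECTED_COLUMNS = ("sample_id", "annotation_type", "term_id", "term_label", "ecode")


def check_records(records):
    """Hypothetical helper: return a list of problems found in row dicts."""
    problems = []
    for i, row in enumerate(records):
        missing = [c for c in EXPECTED_COLUMNS if c not in row]
        if missing:
            problems.append(f"row {i}: missing columns {missing}")
        elif row["ecode"] != "expert":
            # The class description states that every annotation is expert-curated.
            problems.append(f"row {i}: ecode is {row['ecode']!r}, expected 'expert'")
    return problems


# Illustrative rows only; identifiers, term IDs, and labels are placeholders.
rows = [
    {
        "sample_id": "GSM0000001",
        "annotation_type": "disease",  # assumed value
        "term_id": "DOID:0000000",
        "term_label": "example disease",
        "ecode": "expert",
    },
    {
        "sample_id": "SRR0000001",
        "annotation_type": "tissue",  # assumed value
        "term_id": "EXAMPLE:0000000",
        "term_label": "example tissue",
        "ecode": "expert",
    },
]

print(check_records(rows))  # prints [] because both rows match the schema
```

A check like this is only a stand-in for whatever the processor's own validation enforces; it covers the column names and the ecode value stated in this reference and nothing more.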

validate(data)

Validate that processed Johnson 2023 data meets requirements.

Parameters:
  • data (DataFrame) –

    Processed annotations DataFrame to validate.

Returns:
  • bool

    True if validation passes.

Raises: