Base class for data source processors.

Defines the interface that all data source processors must implement to ensure consistency across the pipeline.

ProcessorError

Bases: Exception

Exception raised when a processor encounters an error.

ValidationError

Bases: Exception

Exception raised when processor validation fails.

BaseProcessor

Bases: ABC

Abstract base class for all data source processors.

All data source processors must inherit from this class and implement the required methods. This ensures a consistent interface across all processors in the pipeline.

Attributes:
  • source_name (str) –

    Unique identifier for this data source (e.g., "gemma", "ale")

  • version (str) –

    Version string for this processor

  • description (str) –

    Human-readable description of the data source

  • logger (Logger) –

    Logger instance for this processor

__init__()

Initialize the base processor.

process(output_dir, **kwargs) abstractmethod

Process raw data into standardized annotation format.

The output DataFrame must have the following columns:

  • accession: str - Sample or study identifier (GSM, GSE, SRR, etc.)
  • attribute: str - Type of annotation (tissue, disease, cell_type, sex, age)
  • term_id: str - Ontology term ID (e.g., MONDO:0004994, UBERON:0000948)
  • term_name: str - Human-readable term label
  • ecode: str - Evidence code (expert, semi, crowd, automated)
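As a rough illustration of this schema, the sketch below builds a two-row annotations DataFrame with pandas. The accessions are made-up placeholders and the term labels are illustrative (they should be checked against the ontology); only the column names, attribute types, term IDs, and evidence codes come from the description above.

```python
import pandas as pd

# Column names taken from the schema described above.
REQUIRED_COLUMNS = ["accession", "attribute", "term_id", "term_name", "ecode"]

# Placeholder rows for illustration only; the accessions are hypothetical.
records = [
    {
        "accession": "GSM0000001",
        "attribute": "tissue",
        "term_id": "UBERON:0000948",
        "term_name": "heart",
        "ecode": "expert",
    },
    {
        "accession": "GSM0000002",
        "attribute": "disease",
        "term_id": "MONDO:0004994",
        "term_name": "cardiomyopathy",
        "ecode": "automated",
    },
]

# Passing `columns` fixes the column order regardless of dict key order.
annotations = pd.DataFrame(records, columns=REQUIRED_COLUMNS)
print(annotations.columns.tolist())
```

Each row carries one annotation for one accession, so a sample with both a tissue and a disease annotation would appear on two rows.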

Parameters:
  • output_dir (Path) –

    Directory to write processed output

  • **kwargs (Any, default: {} ) –

    Additional processor-specific arguments

Returns:
  • DataFrame

    Standardized annotations DataFrame

Raises:

validate(data) abstractmethod

Validate that processed data meets requirements.

Checks that the DataFrame has the required columns and that values are in the expected format.
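One way such a check could look is sketched below. This is not the pipeline's actual implementation: `check_annotations` is a hypothetical helper, the `ValidationError` class is a local stand-in, and raising on failure (rather than returning False) is an assumption. The allowed evidence codes are the ones listed for the `ecode` column in `process`.

```python
import pandas as pd

REQUIRED_COLUMNS = ("accession", "attribute", "term_id", "term_name", "ecode")
ALLOWED_ECODES = {"expert", "semi", "crowd", "automated"}


class ValidationError(Exception):
    """Local stand-in for the pipeline's ValidationError."""


def check_annotations(data: pd.DataFrame) -> bool:
    """Hypothetical helper: return True, or raise ValidationError on failure."""
    # Structural check: every required column must be present.
    missing = [col for col in REQUIRED_COLUMNS if col not in data.columns]
    if missing:
        raise ValidationError(f"Missing required columns: {missing}")

    # Value check: evidence codes must come from the allowed set.
    # Missing values (NaN) are not in the set, so they are flagged too.
    bad_mask = ~data["ecode"].isin(ALLOWED_ECODES)
    if bad_mask.any():
        bad_values = data.loc[bad_mask, "ecode"].unique().tolist()
        raise ValidationError(f"Unexpected evidence codes: {bad_values}")

    return True


# Usage with a single placeholder row (the accession is hypothetical).
good = pd.DataFrame(
    [
        {
            "accession": "GSM0000001",
            "attribute": "tissue",
            "term_id": "UBERON:0000948",
            "term_name": "heart",
            "ecode": "expert",
        }
    ]
)
print(check_annotations(good))  # True
```

A real implementation would likely also check the `attribute` values and the format of `term_id`; those checks are omitted here to keep the sketch short.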

Parameters:
  • data (DataFrame) –

    Processed annotations DataFrame to validate

Returns:
  • bool

    True if validation passes

Raises:

cleanup(temp_dir)

Clean up temporary files after processing.

Parameters:
  • temp_dir (Path) –

    Directory containing temporary files to clean
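The implementation of `cleanup` is not shown here. A minimal sketch of one plausible approach, using only the standard library, is given below; `remove_temp_dir` is a hypothetical helper name, and ignoring a directory that does not exist is an assumption.

```python
import shutil
import tempfile
from pathlib import Path


def remove_temp_dir(temp_dir: Path) -> None:
    """Hypothetical helper: delete temp_dir and everything under it, if present."""
    if temp_dir.is_dir():
        shutil.rmtree(temp_dir)


# Usage: create a scratch directory, write a file into it, then clean up.
scratch = Path(tempfile.mkdtemp(prefix="processor_"))
(scratch / "partial.tsv").write_text("accession\tattribute\n")
remove_temp_dir(scratch)
print(scratch.exists())  # False
```

Guarding with `is_dir()` makes the helper safe to call twice, which is convenient when cleanup runs in a `finally` block.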

run(output_dir=PROCESSED_DIR, validate_output=True, **kwargs)

Run the complete processor workflow: process the data, then validate the output if validate_output is True.
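The process-then-validate flow can be sketched with a local stand-in class. `SketchProcessor` and `ToyProcessor` below are hypothetical and only mirror the documented method signatures; the real `BaseProcessor.run` may also log, create directories, or handle errors differently. The sketch takes `output_dir` explicitly because the value of `PROCESSED_DIR` is not known here.

```python
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path

import pandas as pd


class SketchProcessor(ABC):
    """Local stand-in mirroring the documented interface; not the real BaseProcessor."""

    @abstractmethod
    def process(self, output_dir: Path, **kwargs) -> pd.DataFrame:
        """Produce the standardized annotations DataFrame."""

    @abstractmethod
    def validate(self, data: pd.DataFrame) -> bool:
        """Return True if the data meets requirements."""

    def run(self, output_dir: Path, validate_output: bool = True, **kwargs) -> pd.DataFrame:
        # Step 1: produce the standardized annotations.
        data = self.process(output_dir, **kwargs)
        # Step 2: validate before returning, unless the caller opted out.
        if validate_output:
            self.validate(data)
        return data


class ToyProcessor(SketchProcessor):
    """Hypothetical subclass that returns hard-coded placeholder data."""

    def process(self, output_dir, **kwargs):
        # A real processor would read raw data and write results to output_dir.
        return pd.DataFrame(
            [
                {
                    "accession": "GSM0000001",
                    "attribute": "tissue",
                    "term_id": "UBERON:0000948",
                    "term_name": "heart",
                    "ecode": "expert",
                }
            ]
        )

    def validate(self, data):
        required = {"accession", "attribute", "term_id", "term_name", "ecode"}
        if not required.issubset(data.columns):
            raise ValueError("missing required columns")
        return True


result = ToyProcessor().run(Path(tempfile.gettempdir()))
print(len(result))  # 1
```

Because `process` and `validate` are abstract, instantiating `SketchProcessor` directly raises `TypeError`; only subclasses that implement both methods can be run.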

Parameters:
  • output_dir (Path, default: PROCESSED_DIR ) –

    Directory for outputs

  • validate_output (bool, default: True ) –

    Whether to validate processed data

  • **kwargs (Any, default: {} ) –

    Additional processor-specific arguments

Returns:
  • DataFrame

    Processed and validated annotations

Raises:

__repr__()

Return a string representation of the processor.