Bases: BaseProcessor

Processor for DiSignAtlas disease and tissue annotations.

DiSignAtlas distributes gene-expression disease signatures as a GMT file. Each line encodes one dataset, and its description field is a pipe-delimited metadata string that carries GSM sample IDs (split into control and case groups), disease terms, tissue, platform, and organism fields.

The processor expands every dataset into one row per GSM sample and produces two annotation types:

  • disease — for every sample (controls receive a synthetic Control / C0000000 term).
  • tissue — only for samples whose tissue field is present and not the literal string "None".
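The expansion rule above can be sketched in plain Python. This is an illustration only, not the processor's actual code: the function name `expand_dataset` and the record keys (`control_gsms`, `case_gsms`, `disease_id`, `disease_label`, `tissue`) are hypothetical, the reading of "Control / C0000000" as label and ID is an assumption, and the `ecode` column is left out because its values are not described here.

```python
from typing import Iterator

# Assumed reading of the synthetic control term: label "Control", ID "C0000000".
CONTROL_LABEL = "Control"
CONTROL_ID = "C0000000"


def expand_dataset(record: dict) -> Iterator[dict]:
    """Yield one disease row per GSM sample, plus a tissue row when valid.

    `record` is a hypothetical, already-parsed dataset with the keys
    control_gsms, case_gsms, disease_id, disease_label, and tissue.
    """
    tissue = record.get("tissue")
    # Tissue annotations are skipped when the field is missing, empty,
    # or the literal string "None".
    has_tissue = bool(tissue) and tissue != "None"

    groups = (
        (record.get("control_gsms", []), CONTROL_ID, CONTROL_LABEL),
        (record.get("case_gsms", []), record["disease_id"], record["disease_label"]),
    )
    for gsms, term_id, term_label in groups:
        for gsm in gsms:
            yield {
                "sample_id": gsm,
                "annotation_type": "disease",
                "term_id": term_id,
                "term_label": term_label,
            }
            if has_tissue:
                yield {
                    "sample_id": gsm,
                    "annotation_type": "tissue",
                    # Placeholder: how tissue term IDs are assigned is not
                    # documented in this section.
                    "term_id": None,
                    "term_label": tissue,
                }
```

Under these assumptions, a dataset with one control and two case samples yields three disease rows, and three more tissue rows only if the tissue field is valid.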

process(output_dir, **kwargs)

Process the DiSignAtlas GMT file into standardized sample annotations.

Reads the GMT file, parses the pipe-delimited description field, expands GSM sample IDs for both control and case groups, and emits one disease annotation per sample plus one tissue annotation for samples with a valid tissue value.

Parameters:
  • output_dir (Path) –

    Directory where the processed parquet file will be written.

  • **kwargs (Any, default: {}) –

    input_path (Path | str) — override the default GMT input path.

Returns:
  • DataFrame

    Standardized annotations with columns sample_id, annotation_type, term_id, term_label, and ecode.
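As a rough illustration of the first parsing step, the sketch below splits one GMT line into its parts. It relies on the general GMT layout (tab-separated: set name, description, then gene symbols) and on the pipe delimiter described above. The helper name `split_gmt_line` is hypothetical, and because the order of the metadata fields is not specified here, the fields are returned as a raw list instead of being named.

```python
from typing import List, Tuple


def split_gmt_line(line: str) -> Tuple[str, List[str], List[str]]:
    """Split a GMT line into (set name, description fields, genes).

    Raises ValueError if the line has fewer than two tab-separated columns.
    """
    # GMT rows are tab-separated: name, description, then zero or more genes.
    name, description, *genes = line.rstrip("\n").split("\t")
    # The description is pipe-delimited; field order is not assumed here.
    fields = description.split("|")
    return name, fields, genes
```

Mapping the raw fields to control IDs, case IDs, disease, tissue, platform, and organism would then depend on the actual field order in the DiSignAtlas file.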

validate(data)

Validate that processed DiSignAtlas data meets minimum requirements.

Parameters:
  • data (DataFrame) –

    Processed annotations DataFrame to validate.

Returns:
  • bool

    True if validation passes.

Raises: