Bases: BaseProcessor
Processor for DiSignAtlas disease and tissue annotations.
DiSignAtlas distributes gene-expression disease signatures as a GMT file. Each line encodes a dataset with a pipe-delimited metadata string that carries GSM sample IDs (split into control and case groups), disease terms, tissue, platform, and organism fields.
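As a rough illustration of the file layout described above, the sketch below splits one GMT line into its parts. It relies on the general GMT convention (tab-separated columns: set name, description, then gene symbols) and on the description column holding the pipe-delimited metadata. The `GmtRecord` type, the `parse_gmt_line` helper, and the example line are hypothetical; the actual order and meaning of the metadata fields are not specified here, so the fields are returned as an uninterpreted list.

```python
from typing import NamedTuple


class GmtRecord(NamedTuple):
    """Hypothetical container for one parsed GMT line."""
    name: str
    metadata_fields: list[str]  # raw pipe-delimited fields, not interpreted
    genes: list[str]


def parse_gmt_line(line: str) -> GmtRecord:
    """Split one GMT line into name, pipe-delimited metadata, and genes."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < 2:
        raise ValueError("a GMT line needs at least a name and a description")
    name, description, *genes = columns
    # The description column carries the pipe-delimited metadata string.
    metadata_fields = [field.strip() for field in description.split("|")]
    # Drop empty gene cells left by trailing tabs.
    return GmtRecord(name, metadata_fields, [g for g in genes if g])


# Made-up line for illustration only; the real field order may differ.
example = "DSA00001\tGSM1;GSM2|GSM3|Asthma|Lung|GPL570|Homo sapiens\tTP53\tEGFR"
record = parse_gmt_line(example)
```

Keeping the metadata as a plain list avoids guessing which position holds which field; a real parser would map positions to names once the layout is confirmed against the file.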
The processor expands every dataset into one row per GSM sample and produces two annotation types:
- disease: for every sample (controls receive a synthetic Control / C0000000 term).
- tissue: only for samples whose tissue field is present and not the literal string "None".
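The expansion rules above can be sketched as follows. This is a minimal illustration under stated assumptions, not the processor's actual code: the `expand_dataset` function, its arguments, the row keys (`sample_id`, `annotation_type`, `term`, `term_id`), and the placeholder disease ID are all hypothetical. Only the synthetic `Control` / `C0000000` term and the `"None"` tissue rule come from the description; treating an empty tissue string as "not present" is an additional assumption.

```python
from typing import Dict, List, Optional

# Synthetic term assigned to control samples, as described above.
CONTROL_TERM = "Control"
CONTROL_ID = "C0000000"


def expand_dataset(
    control_gsms: List[str],
    case_gsms: List[str],
    disease_term: str,
    disease_id: str,
    tissue: Optional[str],
) -> List[Dict[str, Optional[str]]]:
    """Expand one dataset into one annotation row per GSM sample."""
    rows: List[Dict[str, Optional[str]]] = []

    # Disease annotations: every sample gets one; controls get the synthetic term.
    for gsm in control_gsms:
        rows.append({"sample_id": gsm, "annotation_type": "disease",
                     "term": CONTROL_TERM, "term_id": CONTROL_ID})
    for gsm in case_gsms:
        rows.append({"sample_id": gsm, "annotation_type": "disease",
                     "term": disease_term, "term_id": disease_id})

    # Tissue annotations: skipped when the field is missing, empty (assumed),
    # or the literal string "None".
    if tissue is not None and tissue != "" and tissue != "None":
        for gsm in [*control_gsms, *case_gsms]:
            rows.append({"sample_id": gsm, "annotation_type": "tissue",
                         "term": tissue, "term_id": None})
    return rows


# "C1234567" is a placeholder ID, not a real disease identifier.
with_tissue = expand_dataset(["GSM1"], ["GSM2", "GSM3"], "Asthma", "C1234567", "Lung")
without_tissue = expand_dataset(["GSM1"], ["GSM2", "GSM3"], "Asthma", "C1234567", "None")
```

With a valid tissue, each sample yields two rows (one disease, one tissue); with tissue `"None"`, each sample yields only its disease row.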
process(output_dir, **kwargs)
Process the DiSignAtlas GMT file into standardized sample annotations.
Reads the GMT file, parses the pipe-delimited description field, expands GSM sample IDs for both control and case groups, and emits one disease annotation per sample plus one tissue annotation for samples with a valid tissue value.
validate(data)
Validate that processed DiSignAtlas data meets minimum requirements.