Schema

Configuration schemas for the metahq-build pipeline.

Defines Pydantic models for validating pipeline configuration.

`SampleAnnotationEntry`

Bases: BaseModel

Configuration for a sample-level entry from a source in MetaHQ.

Attributes:

`id` (`str`) – The ID of an annotation (e.g., UBERON:0000948, F, elderly_adult).
`ecode` (`str`) – The annotation's evidence code (e.g., expert-curated, crowd-sourced).
`value` (`str | None`) – The value of an annotation. Default is None.

`validate_id(v)` `classmethod`

Ensure an entry ID is a valid sex, age, tissue, or disease entry.

`validate_ecode(v)` `classmethod`

Ensure evidence codes were entered correctly.

`SampleTissueAnnotationEntry`

Bases: SampleAnnotationEntry

Configuration for sample-level tissue annotation entries.

Attributes:

`id` (`str`) – The ID of an annotation (e.g., UBERON:0000948, UBERON:0000955).
`ecode` (`str`) – The annotation's evidence code (e.g., expert-curated, crowd-sourced).
`value` (`str | None`) – The value of an annotation. Default is None.

`validate_id(v)` `classmethod`

Ensure an entry ID is a UBERON or CL annotation.

`SampleDiseaseAnnotationEntry`

Bases: SampleAnnotationEntry

Configuration for sample-level disease annotation entries.

Attributes:

`id` (`str`) – The ID of an annotation (e.g., MONDO:0004994, MONDO:0004992).
`ecode` (`str`) – The annotation's evidence code (e.g., expert-curated, crowd-sourced).
`value` (`str | None`) – The value of an annotation. Default is None.

`validate_id(v)` `classmethod`

Ensure an entry ID is a MONDO annotation.
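The ontology-specific ID checks for tissue and disease entries can be pictured as a prefix test. The following is a minimal sketch, not the MetaHQ implementation: the function name `has_ontology_prefix`, the regular expression, and the assumption that IDs take the form `PREFIX:digits` are illustrative only.

```python
import re

# Assumed ID shape: an ontology prefix, a colon, then digits (e.g., UBERON:0000948).
CURIE = re.compile(r"^(?P<prefix>[A-Za-z]+):(?P<local>\d+)$")

# Prefixes named in the documentation above.
TISSUE_PREFIXES = {"UBERON", "CL"}
DISEASE_PREFIXES = {"MONDO"}


def has_ontology_prefix(term_id: str, allowed: set) -> bool:
    """Return True if term_id looks like PREFIX:digits with an allowed prefix."""
    match = CURIE.match(term_id)
    return match is not None and match.group("prefix") in allowed


print(has_ontology_prefix("UBERON:0000948", TISSUE_PREFIXES))
print(has_ontology_prefix("MONDO:0004994", TISSUE_PREFIXES))
```

In a Pydantic model, a check like this would typically live inside the `validate_id` class method and raise `ValueError` on failure.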

`SampleSexAnnotationEntry`

Bases: SampleAnnotationEntry

Configuration for sample-level sex annotation entries.

Attributes:

`id` (`str`) – The ID of an annotation (e.g., M or F).
`ecode` (`str`) – The annotation's evidence code (e.g., expert-curated, crowd-sourced).
`value` (`str | None`) – The value of an annotation. Default is None.

`validate_id(v)` `classmethod`

Ensure an entry ID is a valid sex annotation.

`SampleAgeAnnotationEntry`

Bases: SampleAnnotationEntry

Configuration for sample-level age annotation entries.

Attributes:

`id` (`str`) – The ID of an annotation (e.g., adult, elderly_adult).
`ecode` (`str`) – The annotation's evidence code (e.g., expert-curated, crowd-sourced).
`value` (`str | None`) – The value of an annotation. Default is None.

`validate_id(v)` `classmethod`

Ensure an entry ID is a valid age group annotation.

`SeriesAnnotationEntry`

Bases: BaseModel

Configuration for a series-level entry from a source in MetaHQ.

Attributes:

`id` (`str`) – The ID of an annotation (e.g., UBERON:0000948, F, elderly_adult).
`ecode` (`str`) – The annotation's evidence code (e.g., expert-curated, crowd-sourced).
`value` (`str | None`) – The value of an annotation. Default is None.

`validate_id(v)` `classmethod`

Ensure an entry ID is a valid sex, age, tissue, or disease entry.

`validate_ecode(v)` `classmethod`

Ensure evidence codes were entered correctly.

`SeriesTissueAnnotationEntry`

Bases: SeriesAnnotationEntry

Configuration for series-level tissue annotation entries.

Attributes:

`id` (`str`) – The ID of an annotation (e.g., UBERON:0000948, UBERON:0000955).
`ecode` (`str`) – The annotation's evidence code (e.g., expert-curated, crowd-sourced).
`value` (`str | None`) – The value of an annotation. Default is None.

`validate_id(v)` `classmethod`

Ensure an entry ID is a UBERON or CL annotation.

`SeriesDiseaseAnnotationEntry`

Bases: SeriesAnnotationEntry

Configuration for series-level disease annotation entries.

Attributes:

`id` (`str`) – The ID of an annotation (e.g., MONDO:0004994, MONDO:0004992, or MONDO:0004994|MONDO:0004992).
`ecode` (`str`) – The annotation's evidence code (e.g., expert-curated, crowd-sourced).
`value` (`str | None`) – The value of an annotation. Default is None.

`validate_id(v)` `classmethod`

Ensure an entry ID has MONDO annotations.

`SeriesSexAnnotationEntry`

Bases: SeriesAnnotationEntry

Configuration for series-level sex annotation entries.

Attributes:

`id` (`str`) – The ID of an annotation (e.g., M, F, or M|F).
`ecode` (`str`) – The annotation's evidence code (e.g., expert-curated, crowd-sourced).
`value` (`str | None`) – The value of an annotation. Default is None.

`validate_id(v)` `classmethod`

Ensure an entry ID has valid sex ID annotations.

`SeriesAgeAnnotationEntry`

Bases: SeriesAnnotationEntry

Configuration for series-level age annotation entries.

Attributes:

`id` (`str`) – The ID of an annotation (e.g., adult, elderly_adult, or adult|elderly_adult).
`ecode` (`str`) – The annotation's evidence code (e.g., expert-curated, crowd-sourced).
`value` (`str | None`) – The value of an annotation. Default is None.

`validate_id(v)` `classmethod`

Ensure an entry ID has valid age group ID annotations.
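The series-level examples above (such as `M|F`) suggest that one series ID may combine several values with a pipe. Assuming that reading is right, a validator could split on `|` and check each part separately. This is a hedged sketch: the helper names and the allowed-value sets are hypothetical, not taken from MetaHQ.

```python
def split_series_id(raw: str) -> list:
    """Split a pipe-separated series-level ID into its stripped parts."""
    return [part.strip() for part in raw.split("|")]


def invalid_parts(raw: str, allowed: set) -> list:
    """Return the parts of raw that are empty or not in the allowed set."""
    return [part for part in split_series_id(raw) if part not in allowed]


def validate_series_id(raw: str, allowed: set) -> str:
    """Raise ValueError if any part of a series-level ID is not allowed."""
    bad = invalid_parts(raw, allowed)
    if bad:
        raise ValueError(f"Invalid annotation ID part(s) in {raw!r}: {bad}")
    return raw


# Hypothetical allowed values, based only on the examples in this page.
SEX_IDS = {"M", "F"}
print(validate_series_id("M|F", SEX_IDS))
```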

`SampleAccessionIDs`

Bases: BaseModel

Configuration for accession IDs for a sample entry.

Attributes:

`sample` (`str`) – Sample ID starting with GSM.
`series` (`str`) – Series ID starting with GSE.
`platform` (`str`) – Platform ID starting with GPL.
`srx` (`str | None`) – An SRA experiment ID.
`srs` (`str | None`) – An SRA sample ID.
`srp` (`str | None`) – An SRA project ID.

`validate_sample_prefix(v)` `classmethod`

Ensure all sample IDs start with GSM.

`validate_series_prefix(v)` `classmethod`

Ensure all series IDs start with GSE.

`validate_platform_prefix(v)` `classmethod`

Ensure all platform IDs start with GPL.

`validate_xxx_prefix(v)` `classmethod`

Ensure all SRX IDs start with SRX, ERX, or DRX.

`validate_xxs_prefix(v)` `classmethod`

Ensure all SRS IDs start with SRS, ERS, or DRS.

`validate_xxp_prefix(v)` `classmethod`

Ensure all SRP IDs start with SRP, ERP, or DRP.
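The six prefix validators above share one pattern: look up the prefixes allowed for a field and test the value against them. A minimal sketch of that pattern follows. The table and function names are hypothetical, and treating `None` as valid is an assumption that only makes sense for the optional SRA fields.

```python
from typing import Optional

# Prefixes as documented above; the dict layout itself is an assumption.
ACCESSION_PREFIXES = {
    "sample": ("GSM",),
    "series": ("GSE",),
    "platform": ("GPL",),
    "srx": ("SRX", "ERX", "DRX"),
    "srs": ("SRS", "ERS", "DRS"),
    "srp": ("SRP", "ERP", "DRP"),
}


def has_valid_prefix(field: str, value: Optional[str]) -> bool:
    """Return True if value is None or starts with a prefix allowed for field."""
    if value is None:
        return True
    # str.startswith accepts a tuple and matches any of its members.
    return value.startswith(ACCESSION_PREFIXES[field])


def check_prefix(field: str, value: Optional[str]) -> Optional[str]:
    """Return value unchanged, or raise ValueError if its prefix is wrong."""
    if not has_valid_prefix(field, value):
        allowed = ", ".join(ACCESSION_PREFIXES[field])
        raise ValueError(f"{field} ID {value!r} must start with one of: {allowed}")
    return value


print(check_prefix("srx", "ERX000001"))
```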

`SeriesAccessionIDs`

Bases: BaseModel

Configuration for accession IDs for a series entry.

Attributes:

`series` (`str`) – Series ID starting with GSE.
`platform` (`str`) – Platform ID starting with GPL.
`srp` (`str | None`) – An SRA project ID.

`validate_series_prefix(v)` `classmethod`

Ensure all series IDs start with GSE.

`validate_platform_prefix(v)` `classmethod`

Ensure all platform IDs start with GPL.

`validate_xxp_prefix(v)` `classmethod`

Ensure all SRP IDs start with SRP, ERP, or DRP.

`SampleEntry`

Bases: BaseModel

Configuration for a sample entry in the database.

Attributes:

`accession_ids` (`SampleAccessionIDs`) – Accession IDs required for a sample entry.
`organism` (`str`) – The lowercase genus and species of an organism.
`tissue` (`SourceAnnotations | None`) – Tissue annotations across sources.
`disease` (`SourceAnnotations | None`) – Disease annotations across sources.
`sex` (`SourceAnnotations | None`) – Sex annotations across sources.
`age` (`SourceAnnotations | None`) – Age group annotations across sources.

`verify_organism(v)` `classmethod`

Check that the organism for a particular entry is valid.

`SeriesEntry`

Bases: BaseModel

Configuration for a series entry in the database.

Attributes:

`accession_ids` (`SeriesAccessionIDs`) – Accession IDs required for a series entry.
`organism` (`str`) – The lowercase genus and species of an organism.
`tissue` (`SourceAnnotations | None`) – Tissue annotations across sources.
`disease` (`SourceAnnotations | None`) – Disease annotations across sources.
`sex` (`SourceAnnotations | None`) – Sex annotations across sources.
`age` (`SourceAnnotations | None`) – Age group annotations across sources.

`verify_organism(v)` `classmethod`

Check that the organism for a particular entry is valid.

`ProcessorConfig`

Bases: BaseModel

Configuration for a single data source processor.

Attributes:

`enabled` (`bool`) – Whether this processor should run.
`download` (`bool`) – Whether to download raw data.
`max_records` (`int | None`) – Maximum number of records to process (None for all).
`custom_params` (`dict`) – Processor-specific parameters.

`OntologyConfig`

Bases: BaseModel

Configuration for ontology processing.

Attributes:

`name` (`str`) – Name of the ontology (e.g., "mondo", "uberon").
`download` (`bool`) – Whether to download the ontology file.
`url` (`str | None`) – Custom URL for ontology download.
`extract_relations` (`bool`) – Whether to extract ancestor/descendant relations.

`PipelineStageConfig`

Bases: BaseModel

Configuration for a pipeline stage.

Attributes:

`skip` (`bool`) – Whether to skip this stage.
`use_checkpoint` (`bool`) – Whether to use a checkpoint if one is available.
`timeout` (`int | None`) – Timeout in seconds (None for no timeout).

`ParallelConfig`

Bases: BaseModel

Configuration for parallel processing.

Attributes:

`num_workers` (`int`) – Number of parallel workers.
`chunk_size` (`int`) – Number of items per chunk.
`use_multiprocessing` (`bool`) – Use multiprocessing instead of threading.
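To illustrate how `num_workers` and `chunk_size` might interact, here is a small sketch that splits work into chunks and maps them over a thread pool. It is an assumption about usage, not the pipeline's code: `ParallelSettings`, `chunked`, and `process_in_chunks` are hypothetical names, the default values are made up, and only the threading case is shown (a process pool would need a picklable worker function instead of a lambda).

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class ParallelSettings:
    """Hypothetical stand-in for ParallelConfig; defaults are illustrative."""
    num_workers: int = 2
    chunk_size: int = 3


def chunked(items: list, size: int) -> list:
    """Split items into consecutive lists of at most size elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def process_in_chunks(items: list, func, settings: ParallelSettings) -> list:
    """Apply func to every item, one chunk per task, preserving input order."""
    chunks = chunked(items, settings.chunk_size)
    with ThreadPoolExecutor(max_workers=settings.num_workers) as pool:
        # Executor.map yields results in the order the chunks were submitted.
        results = pool.map(lambda chunk: [func(x) for x in chunk], chunks)
        return [value for chunk in results for value in chunk]


print(process_in_chunks(list(range(7)), lambda x: x * x, ParallelSettings()))
```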

`ValidationConfig`

Bases: BaseModel

Configuration for data validation.

Attributes:

`strict` (`bool`) – Strict validation mode (fail on any error).
`warn_only` (`bool`) – Only warn on validation failures.
`check_ontology_coverage` (`bool`) – Validate ontology term coverage.
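The documentation does not say how `strict` and `warn_only` interact when both are set. The sketch below shows one plausible policy, in which `strict` takes precedence. The class `ValidationSettings`, the function `report_failures`, and the default values are hypothetical.

```python
import warnings
from dataclasses import dataclass


@dataclass
class ValidationSettings:
    """Hypothetical stand-in for ValidationConfig; defaults are illustrative."""
    strict: bool = False
    warn_only: bool = True


def report_failures(failures: list, settings: ValidationSettings) -> int:
    """Handle validation failures and return how many there were.

    Assumed policy: strict raises on any failure; otherwise warn_only
    emits one warning per failure; otherwise failures are only counted.
    """
    if not failures:
        return 0
    if settings.strict:
        raise ValueError(f"{len(failures)} validation failure(s): " + "; ".join(failures))
    if settings.warn_only:
        for message in failures:
            warnings.warn(message)
    return len(failures)


print(report_failures([], ValidationSettings(strict=True)))
```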

`PipelineConfig`

Bases: BaseModel

Main pipeline configuration.

Attributes:

`data_dir` (`Path`) – Root directory for data files.
`output_dir` (`Path`) – Output directory for the built database.
`temp_dir` (`Path`) – Temporary directory for intermediate files.
`checkpoint_dir` (`Path`) – Directory for pipeline checkpoints.
`log_dir` (`Path`) – Directory for log files.
`processors` (`dict[str, ProcessorConfig]`) – Configuration for each data source processor.
`ontologies` (`list[OntologyConfig]`) – Ontologies to process.
`stages` (`dict[str, PipelineStageConfig]`) – Configuration for pipeline stages.
`parallel` (`ParallelConfig`) – Parallel processing configuration.
`validation` (`ValidationConfig`) – Validation configuration.
`clean_temp` (`bool`) – Clean temporary files after completion.
`verbose` (`bool`) – Enable verbose output.

`FileEntry`

Bases: BaseModel

A source → destination file mapping in the data package structure.

`DataPackageConfig`

Bases: BaseModel

Full configuration for the MetaHQ setup pipeline, driven by metahq_build.yaml.

Attributes:

`data_dir` (`Path`) – Root directory for all input data.
`output_dir` (`Path`) – Directory where data packages are written.
`package_name` (`str`) – Name of the data package.
`overwrite` (`bool`) – Overwrite an existing package with the same name.
`omicidx_path` (`Path`) – Path to the OmicIDX DuckDB database.
`temp_dir` (`Path`) – Temporary directory for intermediate files.
`checkpoint_dir` (`Path`) – Directory for pipeline checkpoint state.
`log_dir` (`Path`) – Directory for log files.
`validation` (`ValidationConfig`) – Validation settings.
`processors` (`dict[str, ProcessorConfig]`) – Per-source processor settings.
`stages` (`dict[str, PipelineStageConfig]`) – Per-stage settings.
`structure` (`list[FileEntry]`) – Data package file mapping.
`clean_temp` (`bool`) – Remove temp files after completion.
`verbose` (`bool`) – Enable verbose output.

`data_package_path` `property`

Return the full path to the data package directory.

`md5_path` `property`

Return the full path to the data package's MD5 checksum file.

`from_yaml(file)` `classmethod`

Load and validate config from metahq_build.yaml.

Flattens the keys under `params` into the top level and resolves `{output_dir}/{package_name}` placeholders in `structure` destinations.
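The two steps described for `from_yaml` can be sketched on a plain dictionary, as if the YAML had already been parsed. This is an illustration under assumptions, not the actual loader: the function names are hypothetical, and the `source` and `destination` key names inside each `structure` entry are guesses based on the `FileEntry` description.

```python
from typing import Any, Dict


def flatten_params(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Lift the keys under 'params' to the top level of the config."""
    flat = {key: value for key, value in raw.items() if key != "params"}
    flat.update(raw.get("params", {}))
    return flat


def resolve_destinations(config: Dict[str, Any]) -> Dict[str, Any]:
    """Fill {output_dir} and {package_name} in each structure destination."""
    values = {
        "output_dir": config["output_dir"],
        "package_name": config["package_name"],
    }
    resolved = dict(config)
    resolved["structure"] = [
        {**entry, "destination": entry["destination"].format(**values)}
        for entry in config.get("structure", [])
    ]
    return resolved


# Hypothetical parsed YAML content.
raw = {
    "params": {"output_dir": "out", "package_name": "pkg"},
    "structure": [
        {"source": "a.txt", "destination": "{output_dir}/{package_name}/a.txt"}
    ],
}
config = resolve_destinations(flatten_params(raw))
print(config["structure"][0]["destination"])
```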

`get_processor_config(name)`

Return config for a named processor, falling back to defaults.

`get_stage_config(name)`

Return config for a named pipeline stage, falling back to defaults.
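"Falling back to defaults" in both getters can be read as a dictionary lookup that returns a default-constructed config when the name is missing. The sketch below shows that reading for processors. `ProcessorSettings` is a hypothetical stand-in for `ProcessorConfig`, its default values are invented for illustration, and the processor name `"geo"` is made up.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ProcessorSettings:
    """Hypothetical stand-in for ProcessorConfig; defaults are illustrative."""
    enabled: bool = True
    download: bool = True
    max_records: Optional[int] = None
    custom_params: dict = field(default_factory=dict)


def get_processor_settings(
    processors: Dict[str, ProcessorSettings], name: str
) -> ProcessorSettings:
    """Return the settings registered under name, or a default instance."""
    return processors.get(name, ProcessorSettings())


configured = {"geo": ProcessorSettings(max_records=100)}
print(get_processor_settings(configured, "geo").max_records)
print(get_processor_settings(configured, "unknown").max_records)
```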

`verify_source_files()`

Ensure every source file listed in the structure exists.

Logs all missing files before exiting so the user can fix them all at once.
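The "collect everything, then exit" behavior described above can be sketched as follows. The function names are hypothetical, and exiting with `SystemExit(1)` is an assumption about how the real method terminates.

```python
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def find_missing(sources: list) -> list:
    """Return every path in sources that does not exist on disk."""
    return [path for path in sources if not path.exists()]


def verify_sources(sources: list) -> None:
    """Log each missing file, then exit once if any were missing."""
    missing = find_missing(sources)
    for path in missing:
        logger.error("Missing source file: %s", path)
    if missing:
        # Exit only after all problems are logged, so none are hidden.
        raise SystemExit(1)


# The current directory exists, so this call returns without exiting.
verify_sources([Path(".")])
```

Checking the whole list before exiting avoids a fix-one, rerun, fix-the-next cycle when several files are missing.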

`create_directories()`

Create all required pipeline directories.
