Configuration schemas for the metahq-build pipeline.

Defines Pydantic models for validating pipeline configuration.

SampleAnnotationEntry

Bases: BaseModel

Configuration for a sample-level entry from a source in MetaHQ.

Attributes:
  • id (str) –

    The ID of an annotation (e.g., UBERON:0000948, F, elderly_adult)

  • ecode (str) –

    The annotation's evidence code (e.g., expert-curated, crowd-sourced)

  • value (str | None) –

    The value of an annotation. Default is None.

validate_id(v) classmethod

Ensure an entry ID is a valid sex, age, tissue, or disease entry.

validate_ecode(v) classmethod

Ensure ecodes were entered correctly.
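
The two validators above can be pictured with a plain-Python sketch. The vocabularies below (SEX_IDS, AGE_IDS, ONTOLOGY_PREFIXES, ECODES) and the function names are assumptions built only from the examples in this reference; the real lists may be longer, and in the model these checks would presumably run inside Pydantic validators rather than free functions.

```python
import re

# Hypothetical vocabularies, taken only from the examples in this reference.
SEX_IDS = {"M", "F"}
AGE_IDS = {"adult", "elderly_adult"}
ONTOLOGY_PREFIXES = ("UBERON", "CL", "MONDO")
ECODES = {"expert-curated", "crowd-sourced"}

# Assumption: an ontology term looks like PREFIX:digits.
_TERM = re.compile(r"^(%s):\d+$" % "|".join(ONTOLOGY_PREFIXES))


def check_id(value: str) -> str:
    """Return value unchanged if it is a sex, age, tissue, or disease ID."""
    if value in SEX_IDS or value in AGE_IDS or _TERM.match(value):
        return value
    raise ValueError(f"unrecognized annotation ID: {value!r}")


def check_ecode(value: str) -> str:
    """Return value unchanged if it is a known evidence code."""
    if value not in ECODES:
        raise ValueError(f"unrecognized evidence code: {value!r}")
    return value
```

Returning the value on success mirrors the usual validator convention of handing the (possibly normalized) value back to the model.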

SampleTissueAnnotationEntry

Bases: SampleAnnotationEntry

Configuration for sample-level tissue annotation entries.

Attributes:
  • id (str) –

    The ID of an annotation (e.g., UBERON:0000948, UBERON:0000955)

  • ecode (str) –

    The annotation's evidence code (e.g., expert-curated, crowd-sourced)

  • value (str | None) –

    The value of an annotation. Default is None.

validate_id(v) classmethod

Ensure an entry ID has UBERON or CL annotations.

SampleDiseaseAnnotationEntry

Bases: SampleAnnotationEntry

Configuration for sample-level disease annotation entries.

Attributes:
  • id (str) –

    The ID of an annotation (e.g., MONDO:0004994, MONDO:0004992)

  • ecode (str) –

    The annotation's evidence code (e.g., expert-curated, crowd-sourced)

  • value (str | None) –

    The value of an annotation. Default is None.

validate_id(v) classmethod

Ensure an entry ID has MONDO annotations.

SampleSexAnnotationEntry

Bases: SampleAnnotationEntry

Configuration for sample-level sex annotation entries.

Attributes:
  • id (str) –

    The ID of an annotation (e.g., M or F)

  • ecode (str) –

    The annotation's evidence code (e.g., expert-curated, crowd-sourced)

  • value (str | None) –

    The value of an annotation. Default is None.

validate_id(v) classmethod

Ensure an entry ID has valid sex ID annotations.

SampleAgeAnnotationEntry

Bases: SampleAnnotationEntry

Configuration for sample-level age annotation entries.

Attributes:
  • id (str) –

    The ID of an annotation (e.g., adult, elderly_adult)

  • ecode (str) –

    The annotation's evidence code (e.g., expert-curated, crowd-sourced)

  • value (str | None) –

    The value of an annotation. Default is None.

validate_id(v) classmethod

Ensure an entry ID has valid age group ID annotations.

SeriesAnnotationEntry

Bases: BaseModel

Configuration for a series-level entry from a source in MetaHQ.

Attributes:
  • id (str) –

    The ID of an annotation (e.g., UBERON:0000948, F, elderly_adult)

  • ecode (str) –

    The annotation's evidence code (e.g., expert-curated, crowd-sourced)

  • value (str | None) –

    The value of an annotation. Default is None.

validate_id(v) classmethod

Ensure an entry ID is a valid sex, age, tissue, or disease entry.

validate_ecode(v) classmethod

Ensure ecodes were entered correctly.

SeriesTissueAnnotationEntry

Bases: SeriesAnnotationEntry

Configuration for series-level tissue annotation entries.

Attributes:
  • id (str) –

    The ID of an annotation (e.g., UBERON:0000948, UBERON:0000955)

  • ecode (str) –

    The annotation's evidence code (e.g., expert-curated, crowd-sourced)

  • value (str | None) –

    The value of an annotation. Default is None.

validate_id(v) classmethod

Ensure an entry ID has UBERON or CL annotations.

SeriesDiseaseAnnotationEntry

Bases: SeriesAnnotationEntry

Configuration for series-level disease annotation entries.

Attributes:
  • id (str) –

    The ID of an annotation (e.g., MONDO:0004994, MONDO:0004992, or MONDO:0004994|MONDO:0004992)

  • ecode (str) –

    The annotation's evidence code (e.g., expert-curated, crowd-sourced)

  • value (str | None) –

    The value of an annotation. Default is None.

validate_id(v) classmethod

Ensure an entry ID has MONDO annotations.
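
Series-level IDs may combine several terms with a pipe, as in the MONDO:0004994|MONDO:0004992 example above. A minimal sketch of checking such a value follows; the function name and the choice to split on "|" and test each part are assumptions inferred from that example, not the confirmed behavior of validate_id.

```python
def split_series_id(value: str, prefix: str = "MONDO") -> list[str]:
    """Split a pipe-delimited ID and check that every term has the prefix."""
    terms = value.split("|")
    for term in terms:
        # Each term is expected to look like "MONDO:0004994".
        if not term.startswith(prefix + ":"):
            raise ValueError(f"{term!r} is not a {prefix} term")
    return terms
```

A single term passes through as a one-element list, so sample-style values such as MONDO:0004994 are handled by the same path.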

SeriesSexAnnotationEntry

Bases: SeriesAnnotationEntry

Configuration for series-level sex annotation entries.

Attributes:
  • id (str) –

    The ID of an annotation (e.g., M, F, or M|F)

  • ecode (str) –

    The annotation's evidence code (e.g., expert-curated, crowd-sourced)

  • value (str | None) –

    The value of an annotation. Default is None.

validate_id(v) classmethod

Ensure an entry ID has valid sex ID annotations.

SeriesAgeAnnotationEntry

Bases: SeriesAnnotationEntry

Configuration for series-level age annotation entries.

Attributes:
  • id (str) –

    The ID of an annotation (e.g., adult, elderly_adult, or adult|elderly_adult)

  • ecode (str) –

    The annotation's evidence code (e.g., expert-curated, crowd-sourced)

  • value (str | None) –

    The value of an annotation. Default is None.

validate_id(v) classmethod

Ensure an entry ID has valid age group ID annotations.

SampleAccessionIDs

Bases: BaseModel

Configuration for accession IDs for a sample entry.

Attributes:
  • sample (str) –

    Sample ID starting with GSM.

  • series (str) –

    Series ID starting with GSE.

  • platform (str) –

    Platform ID starting with GPL.

  • srx (str | None) –

    An SRA experiment ID.

  • srs (str | None) –

    An SRA sample ID.

  • srp (str | None) –

    An SRA project ID.

validate_sample_prefix(v) classmethod

Ensure all sample IDs start with GSM.

validate_series_prefix(v) classmethod

Ensure all series IDs start with GSE.

validate_platform_prefix(v) classmethod

Ensure all platform IDs start with GPL.

validate_xxx_prefix(v) classmethod

Ensure all SRX IDs start with SRX, ERX, or DRX.

validate_xxs_prefix(v) classmethod

Ensure all SRS IDs start with SRS, ERS, or DRS.

validate_xxp_prefix(v) classmethod

Ensure all SRP IDs start with SRP, ERP, or DRP.
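
The prefix rules listed for SampleAccessionIDs can be summarized in one table-driven check. This is a plain-Python sketch over a dict, not the Pydantic model itself; check_accessions and the two lookup tables are hypothetical names, while the prefixes come from the validator descriptions above.

```python
# Required fields and their accepted prefixes.
REQUIRED_PREFIXES = {
    "sample": ("GSM",),
    "series": ("GSE",),
    "platform": ("GPL",),
}
# Optional SRA fields; None is allowed and skips the check.
OPTIONAL_PREFIXES = {
    "srx": ("SRX", "ERX", "DRX"),
    "srs": ("SRS", "ERS", "DRS"),
    "srp": ("SRP", "ERP", "DRP"),
}


def check_accessions(ids: dict) -> dict:
    """Raise ValueError if any accession ID has an unexpected prefix."""
    for name, prefixes in REQUIRED_PREFIXES.items():
        value = ids[name]  # a missing required field raises KeyError
        if not value.startswith(prefixes):
            raise ValueError(f"{name}={value!r} must start with {prefixes}")
    for name, prefixes in OPTIONAL_PREFIXES.items():
        value = ids.get(name)
        if value is not None and not value.startswith(prefixes):
            raise ValueError(f"{name}={value!r} must start with {prefixes}")
    return ids
```

str.startswith accepts a tuple, which keeps the three-way SRA checks to a single call.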

SeriesAccessionIDs

Bases: BaseModel

Configuration for accession IDs for a series entry.

Attributes:
  • series (str) –

    Series ID starting with GSE.

  • platform (str) –

    Platform ID starting with GPL.

  • srp (str | None) –

    An SRA project ID.

validate_series_prefix(v) classmethod

Ensure all series IDs start with GSE.

validate_platform_prefix(v) classmethod

Ensure all platform IDs start with GPL.

validate_xxp_prefix(v) classmethod

Ensure all SRP IDs start with SRP, ERP, or DRP.

SampleEntry

Bases: BaseModel

Configuration for a sample entry in the database.

Attributes:
  • accession_ids (SampleAccessionIDs) –

    Accession IDs required for a sample entry.

  • organism (str) –

    The lowercase genus and species of an organism.

  • tissue (SourceAnnotations | None) –

    Tissue annotations across sources.

  • disease (SourceAnnotations | None) –

    Disease annotations across sources.

  • sex (SourceAnnotations | None) –

    Sex annotations across sources.

  • age (SourceAnnotations | None) –

    Age group annotations across sources.

verify_organism(v) classmethod

Check that the organism for a particular entry is valid.

SeriesEntry

Bases: BaseModel

Configuration for a series entry in the database.

Attributes:
  • accession_ids (SeriesAccessionIDs) –

    Accession IDs required for a series entry.

  • organism (str) –

    The lowercase genus and species of an organism.

  • tissue (SourceAnnotations | None) –

    Tissue annotations across sources.

  • disease (SourceAnnotations | None) –

    Disease annotations across sources.

  • sex (SourceAnnotations | None) –

    Sex annotations across sources.

  • age (SourceAnnotations | None) –

    Age group annotations across sources.

verify_organism(v) classmethod

Check that the organism for a particular entry is valid.

ProcessorConfig

Bases: BaseModel

Configuration for a single data source processor.

Attributes:
  • enabled (bool) –

    Whether this processor should run

  • download (bool) –

    Whether to download raw data

  • max_records (int | None) –

    Maximum number of records to process (None for all)

  • custom_params (dict) –

    Processor-specific parameters

OntologyConfig

Bases: BaseModel

Configuration for ontology processing.

Attributes:
  • name (str) –

    Name of the ontology (e.g., "mondo", "uberon")

  • download (bool) –

    Whether to download the ontology file

  • url (str | None) –

    Custom URL for ontology download

  • extract_relations (bool) –

    Whether to extract ancestor/descendant relations

PipelineStageConfig

Bases: BaseModel

Configuration for a pipeline stage.

Attributes:
  • skip (bool) –

    Whether to skip this stage

  • use_checkpoint (bool) –

    Whether to use checkpoint if available

  • timeout (int | None) –

    Timeout in seconds (None for no timeout)

ParallelConfig

Bases: BaseModel

Configuration for parallel processing.

Attributes:
  • num_workers (int) –

    Number of parallel workers

  • chunk_size (int) –

    Number of items per chunk

  • use_multiprocessing (bool) –

    Use multiprocessing vs threading
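
This reference does not say how the pipeline consumes these three fields. One plausible reading, sketched with the standard library only, is that use_multiprocessing selects the executor type, num_workers sizes it, and chunk_size batches the work; make_executor and chunked are hypothetical helpers.

```python
from concurrent.futures import Executor, ProcessPoolExecutor, ThreadPoolExecutor
from itertools import islice


def make_executor(num_workers: int, use_multiprocessing: bool) -> Executor:
    """Pick a process pool or a thread pool with the requested worker count."""
    pool = ProcessPoolExecutor if use_multiprocessing else ThreadPoolExecutor
    return pool(max_workers=num_workers)


def chunked(items, chunk_size: int):
    """Yield consecutive lists of at most chunk_size items."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk
```

For example, `with make_executor(2, False) as ex: ex.map(sum, chunked(range(5), 2))` would sum each chunk on a thread pool. A process pool additionally requires that the mapped function and its arguments be picklable.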

ValidationConfig

Bases: BaseModel

Configuration for data validation.

Attributes:
  • strict (bool) –

    Strict validation mode (fail on any error)

  • warn_only (bool) –

    Only warn on validation failures

  • check_ontology_coverage (bool) –

    Validate ontology term coverage

PipelineConfig

Bases: BaseModel

Main pipeline configuration.

Attributes:
  • data_dir (Path) –

    Root directory for data files

  • output_dir (Path) –

    Output directory for built database

  • temp_dir (Path) –

    Temporary directory for intermediate files

  • checkpoint_dir (Path) –

    Directory for pipeline checkpoints

  • log_dir (Path) –

    Directory for log files

  • processors (dict[str, ProcessorConfig]) –

    Configuration for each data source processor

  • ontologies (list[OntologyConfig]) –

    Ontologies to process

  • stages (dict[str, PipelineStageConfig]) –

    Configuration for pipeline stages

  • parallel (ParallelConfig) –

    Parallel processing configuration

  • validation (ValidationConfig) –

    Validation configuration

  • clean_temp (bool) –

    Clean temporary files after completion

  • verbose (bool) –

    Enable verbose output

FileEntry

Bases: BaseModel

A source → destination file mapping in the data package structure.

DataPackageConfig

Bases: BaseModel

Full configuration for the MetaHQ setup pipeline, driven by metahq_build.yaml.

Attributes:
  • data_dir (Path) –

    Root directory for all input data.

  • output_dir (Path) –

    Directory where data packages are written.

  • package_name (str) –

    Name of the data package.

  • overwrite (bool) –

    Overwrite an existing package with the same name.

  • omicidx_path (Path) –

    Path to the OmicIDX DuckDB database.

  • temp_dir (Path) –

    Temporary directory for intermediate files.

  • checkpoint_dir (Path) –

    Directory for pipeline checkpoint state.

  • log_dir (Path) –

    Directory for log files.

  • validation (ValidationConfig) –

    Validation settings.

  • processors (dict[str, ProcessorConfig]) –

    Per-source processor settings.

  • stages (dict[str, PipelineStageConfig]) –

    Per-stage settings.

  • structure (list[FileEntry]) –

    Data package file mapping.

  • clean_temp (bool) –

    Remove temp files after completion.

  • verbose (bool) –

    Enable verbose output.

data_package_path property

Return the full path to the data package directory.

md5_path property

Return the full path to the data package's MD5 checksum file.

from_yaml(file) classmethod

Load and validate config from metahq_build.yaml.

Flattens params keys into the top level and resolves {output_dir}/{package_name} placeholders in structure destinations.
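
The flattening and placeholder steps can be sketched on an already-parsed mapping. The function name and the "source" and "destination" keys on each structure entry are assumptions, since FileEntry's fields are not listed here; the YAML loading step itself is omitted.

```python
def flatten_and_resolve(raw: dict) -> dict:
    """Lift `params` keys to the top level and fill destination placeholders."""
    # Copy everything except the nested params block, then merge it in.
    config = {key: value for key, value in raw.items() if key != "params"}
    config.update(raw.get("params", {}))

    placeholders = {
        "{output_dir}": str(config["output_dir"]),
        "{package_name}": str(config["package_name"]),
    }
    resolved = []
    for entry in config.get("structure", []):
        destination = entry["destination"]  # hypothetical field name
        for token, value in placeholders.items():
            destination = destination.replace(token, value)
        resolved.append({**entry, "destination": destination})
    config["structure"] = resolved
    return config
```

Plain str.replace is used instead of str.format so that any other braces in a path are left untouched.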

get_processor_config(name)

Return config for a named processor, falling back to defaults.

get_stage_config(name)

Return config for a named pipeline stage, falling back to defaults.
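
"Falling back to defaults" for both lookups can be read as a dictionary lookup that returns a default-constructed config when the name is absent. The sketch below uses a dataclass stand-in for ProcessorConfig; the default values shown are placeholders, not the model's documented defaults.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ProcessorSettings:
    """Stand-in for ProcessorConfig; default values here are assumptions."""
    enabled: bool = True
    download: bool = True
    max_records: Optional[int] = None
    custom_params: dict = field(default_factory=dict)


def get_processor_settings(processors: dict, name: str) -> ProcessorSettings:
    """Return the named entry, or a default-constructed one if it is missing."""
    return processors.get(name, ProcessorSettings())
```

The same pattern would apply to stage lookups, with a stage config in place of the processor config.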

verify_source_files()

Ensure every source file listed in the structure exists.

Logs all missing files before exiting so the user can fix them all at once.
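
The collect-then-exit behavior described above might look like the following sketch: every missing path is logged before the process exits, instead of stopping at the first one. The helper names and the exit code of 1 are assumptions.

```python
import logging
import sys
from pathlib import Path

logger = logging.getLogger(__name__)


def find_missing(sources) -> list[Path]:
    """Return every source path that does not exist on disk."""
    return [Path(source) for source in sources if not Path(source).exists()]


def verify_sources(sources) -> None:
    """Log all missing sources, then exit if there were any."""
    missing = find_missing(sources)
    for path in missing:
        logger.error("Missing source file: %s", path)
    if missing:
        sys.exit(1)
```

Separating the lookup from the exit keeps the check testable without terminating the interpreter.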

create_directories()

Create all required pipeline directories.
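
Directory creation of this kind is typically idempotent. A minimal sketch with pathlib follows; which directories count as "required" is an assumption (plausibly the output, temp, checkpoint, and log directories from the attributes above), and create_dirs is a hypothetical helper.

```python
from pathlib import Path


def create_dirs(directories) -> None:
    """Create each directory, including parents; existing ones are left alone."""
    for directory in directories:
        Path(directory).mkdir(parents=True, exist_ok=True)
```

With exist_ok=True, rerunning the pipeline against an existing layout does not raise.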