Base class for building a combined annotation dict from processor outputs.

Subclasses implement combine() to load source data and call add_source() once per source. The clean() then save() workflow is defined here and shared across all combiners.

Attributes:
  • anno (dict[str, Any]) –

    The combined annotation dictionary, keyed by accession ID.

add_source(source_name, data)

Add annotations from a standard-schema DataFrame.

Rows are grouped by (COL_ACCESSION, COL_ATTRIBUTE). Multiple term IDs and labels for the same group are joined with DELIMITER. The ecode of the first row in the group is used (processors produce a single ecode per source).

Parameters:
  • source_name (str) –

    Name of the data source, used as the key in the nested dict (e.g., "ale", "gemma").

  • data (DataFrame) –

    Standard-schema DataFrame with columns COL_ACCESSION, COL_ATTRIBUTE, COL_TERM_ID, COL_TERM_NAME, COL_ECODE.
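
The grouping rule described for add_source() can be sketched with pandas as below. This is a minimal illustration, not the library's implementation: the column-name values, the DELIMITER value, the entry keys "id" and "value", the nesting order accession → attribute → source, and the function name add_source_sketch are all assumptions made for the example.

```python
import pandas as pd

# Assumed values; the real constants are defined by the library.
COL_ACCESSION = "accession"
COL_ATTRIBUTE = "attribute"
COL_TERM_ID = "term_id"
COL_TERM_NAME = "term_name"
COL_ECODE = "ecode"
DELIMITER = "|"


def add_source_sketch(anno, source_name, data):
    """Group rows by (accession, attribute) and merge them into anno.

    Hypothetical stand-in for add_source(); the nested layout is assumed.
    """
    grouped = data.groupby([COL_ACCESSION, COL_ATTRIBUTE], sort=False)
    for (accession, attribute), rows in grouped:
        entry = {
            # Multiple term IDs and labels in one group are joined.
            "id": DELIMITER.join(rows[COL_TERM_ID].astype(str)),
            "value": DELIMITER.join(rows[COL_TERM_NAME].astype(str)),
            # Only the first row's ecode is kept for the group.
            "ecode": rows[COL_ECODE].iloc[0],
        }
        anno.setdefault(accession, {}).setdefault(attribute, {})[source_name] = entry
    return anno


# Placeholder data; the term IDs, labels, and ecode are made up.
data = pd.DataFrame(
    {
        COL_ACCESSION: ["GSM1", "GSM1", "GSM2"],
        COL_ATTRIBUTE: ["tissue", "tissue", "disease"],
        COL_TERM_ID: ["TERM:1", "TERM:2", "TERM:3"],
        COL_TERM_NAME: ["label one", "label two", "label three"],
        COL_ECODE: ["ECODE_A", "ECODE_A", "ECODE_A"],
    }
)
anno = add_source_sketch({}, "ale", data)
```

Here the two "tissue" rows for GSM1 collapse into a single entry under the "ale" source, with their term IDs and labels joined by the delimiter.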

clean(specific=False, uberon_relations=UBERON_RELATIONS, mondo_relations=MONDO_RELATIONS, uberon_systems=UBERON_SYSTEMS, mondo_systems=MONDO_SYSTEMS)

Remove empty and undesired annotation entries.

Drops source entries where every value is in UNDESIRED or where the only key remaining after filtering is ecode. Drops entries that have no substantive annotations after cleaning.

Parameters:
  • specific (bool, default: False ) –

    If True, removes general annotations and finds the most specific annotation across all sources.

Returns:
  • BaseAnnotationCombiner

    self, for chaining.
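
The dropping rules described for clean() can be sketched as follows. This is only an illustration under stated assumptions: the contents of UNDESIRED, the nesting order accession → attribute → source, the entry keys, and the function name clean_sketch are hypothetical. It does not cover the specific option or the UBERON and MONDO parameters.

```python
# Assumed contents; the real UNDESIRED set is defined by the library.
UNDESIRED = {"", "NA", "none"}


def clean_sketch(anno):
    """Return a copy of anno without empty or ecode-only source entries."""
    cleaned = {}
    for accession, attributes in anno.items():
        kept_attributes = {}
        for attribute, sources in attributes.items():
            kept_sources = {}
            for source_name, entry in sources.items():
                # Filter out undesired values first.
                filtered = {k: v for k, v in entry.items() if v not in UNDESIRED}
                # Drop the source if nothing is left or only the ecode is left.
                if filtered and set(filtered) != {"ecode"}:
                    kept_sources[source_name] = filtered
            if kept_sources:
                kept_attributes[attribute] = kept_sources
        # Drop accessions with no substantive annotations left.
        if kept_attributes:
            cleaned[accession] = kept_attributes
    return cleaned


# Placeholder data; IDs, labels, and ecodes are made up.
example = {
    "GSM1": {
        "tissue": {
            "ale": {"id": "TERM:1", "value": "label one", "ecode": "ECODE_A"},
            "gemma": {"id": "NA", "value": "", "ecode": "ECODE_A"},
        }
    },
    "GSM2": {"disease": {"ale": {"id": "NA", "value": "NA", "ecode": "NA"}}},
}
cleaned = clean_sketch(example)
```

In this example the "gemma" entry is dropped because only its ecode survives filtering, and GSM2 disappears entirely because every value is undesired.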

save(output_path)

Save the combined annotation dict to a BSON file.

Parameters:
  • output_path (Path) –

    Destination file path (parent directories are created if needed).