Annotations

Bases: BaseCuration

Class to store and mutate annotations of samples to various attributes such as tissues, diseases, sexes, ages, etc.

Attributes:
  • data (DataFrame) –

    Polars DataFrame with index and group ID columns, plus one column per attribute entity (e.g. male or female, tissues, diseases, etc.) holding that entity's annotation for each index.

  • disease (bool) –

    Indicates if the annotations are disease-based. Used to account for control samples when converting annotations to labels.

  • index_col (str) –

    Name of the column of data that contains the index IDs.

  • group_cols (tuple[str, ...]) –

    Names of columns of data that contain an ID for each index indicating if it belongs to a particular group (e.g. dataset, sex, platform, etc.).

  • collapsed (bool) –

    Indicates if the annotations have already been collapsed.

entities property

Returns term names of the Annotations frame.

Examples:

>>> anno = pl.DataFrame(
        {
            "series": ["GSE1", "GSE1", "GSE2"],
            "sample": ["GSM1", "GSM2", "GSM3"],
            "UBERON:0000948": [1, 0, 0],
            "UBERON:0002349": [1, 1, 0],
            "UBERON:0002113": [0, 0, 0],
            "UBERON:0000955": [0, 0, 1],
        }
    )
>>> anno = Annotations.from_df(anno, index_col="sample", group_cols=["series"])
>>> anno.entities
['UBERON:0000955', 'UBERON:0002349', 'UBERON:0000948', 'UBERON:0002113']

groups property

Returns the groups column of the Annotations curation.

Examples:

>>> anno = pl.DataFrame(
        {
            "series": ["GSE1", "GSE1", "GSE2"],
            "sample": ["GSM1", "GSM2", "GSM3"],
            "UBERON:0000948": [1, 0, 0],
            "UBERON:0002349": [1, 1, 0],
            "UBERON:0002113": [0, 0, 0],
            "UBERON:0000955": [0, 0, 1],
        }
    )
>>> anno = Annotations.from_df(anno, index_col="sample", group_cols=["series"])
>>> anno.groups
['GSE1', 'GSE1', 'GSE2']

ids property

Return the IDs dataframe.

Examples:

>>> anno = pl.DataFrame(
        {
            "series": ["GSE1", "GSE1", "GSE2"],
            "sample": ["GSM1", "GSM2", "GSM3"],
            "UBERON:0000948": [1, 0, 0],
            "UBERON:0002349": [1, 1, 0],
            "UBERON:0002113": [0, 0, 0],
            "UBERON:0000955": [0, 0, 1],
        }
    )
>>> anno = Annotations.from_df(anno, index_col="sample", group_cols=["series"])
>>> anno.ids
┌────────┬────────┐
│ sample ┆ series │
│ ---    ┆ ---    │
│ str    ┆ str    │
╞════════╪════════╡
│ GSM1   ┆ GSE1   │
│ GSM2   ┆ GSE1   │
│ GSM3   ┆ GSE2   │
└────────┴────────┘

index property

Return the index column as a list.

Examples:

>>> anno = pl.DataFrame(
        {
            "series": ["GSE1", "GSE1", "GSE2"],
            "sample": ["GSM1", "GSM2", "GSM3"],
            "UBERON:0000948": [1, 0, 0],
            "UBERON:0002349": [1, 1, 0],
            "UBERON:0002113": [0, 0, 0],
            "UBERON:0000955": [0, 0, 1],
        }
    )
>>> anno = Annotations.from_df(anno, index_col="sample", group_cols=["series"])
>>> anno.index
['GSM1', 'GSM2', 'GSM3']

n_indices property

Returns number of indices.

Examples:

>>> anno = pl.DataFrame(
        {
            "series": ["GSE1", "GSE1", "GSE2"],
            "sample": ["GSM1", "GSM2", "GSM3"],
            "UBERON:0000948": [1, 0, 0],
            "UBERON:0002349": [1, 1, 0],
            "UBERON:0002113": [0, 0, 0],
            "UBERON:0000955": [0, 0, 1],
        }
    )
>>> anno = Annotations.from_df(anno, index_col="sample", group_cols=["series"])
>>> anno.n_indices
3

n_entities property

Returns number of entities.

Examples:

>>> anno = pl.DataFrame(
        {
            "series": ["GSE1", "GSE1", "GSE2"],
            "sample": ["GSM1", "GSM2", "GSM3"],
            "UBERON:0000948": [1, 0, 0],
            "UBERON:0002349": [1, 1, 0],
            "UBERON:0002113": [0, 0, 0],
            "UBERON:0000955": [0, 0, 1],
        }
    )
>>> anno = Annotations.from_df(anno, index_col="sample", group_cols=["series"])
>>> anno.n_entities
4

unique_groups property

Returns unique groups.

Examples:

>>> anno = pl.DataFrame(
        {
            "series": ["GSE1", "GSE1", "GSE2"],
            "sample": ["GSM1", "GSM2", "GSM3"],
            "UBERON:0000948": [1, 0, 0],
            "UBERON:0002349": [1, 1, 0],
            "UBERON:0002113": [0, 0, 0],
            "UBERON:0000955": [0, 0, 1],
        }
    )
>>> anno = Annotations.from_df(anno, index_col="sample", group_cols=["series"])
>>> anno.unique_groups
['GSE2', 'GSE1']

add_ids(new)

Append new group ID columns to the IDs of an Annotations object. The new IDs must have a matching index.

Parameters:
  • new (DataFrame) –

    A DataFrame of additional IDs to join with the current index column of data. It must contain the same index column as the original data.

Returns:
  • Annotations

    A new Annotations object including the new ID columns.
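
The matching-index requirement can be illustrated with a minimal pure-Python sketch. It does not use MetaHQ or polars: `join_ids`, the row dictionaries, and the `platform` values are all hypothetical stand-ins, and raising an error on a mismatched index is an assumption, not documented behavior of `add_ids`.

```python
# Hypothetical stand-ins for the IDs frame and the `new` frame (lists of row dicts).
ids = [
    {"sample": "GSM1", "series": "GSE1"},
    {"sample": "GSM2", "series": "GSE1"},
    {"sample": "GSM3", "series": "GSE2"},
]
# The "platform" values are illustrative only.
new = [
    {"sample": "GSM3", "platform": "P2"},
    {"sample": "GSM1", "platform": "P1"},
    {"sample": "GSM2", "platform": "P1"},
]


def join_ids(ids, new, index_col):
    """Attach the columns of `new` to `ids`, matching rows on `index_col`."""
    lookup = {row[index_col]: row for row in new}
    missing = [row[index_col] for row in ids if row[index_col] not in lookup]
    if missing:
        # Assumed behavior: a mismatched index is treated as an error here.
        raise ValueError(f"no new IDs for: {missing}")
    # Row order of `ids` is preserved; `new` may arrive in any order.
    return [{**row, **lookup[row[index_col]]} for row in ids]


joined = join_ids(ids, new, index_col="sample")
print(joined[0])
```

Each joined row keeps its original group columns and gains the new one, which is the shape the returned Annotations object is described as having.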

collapse(on, inplace=True)

Collapses annotations on the specified grouping column.

Parameters:
  • on (str) –

    The column to collapse on. This should be one of the columns in group_cols.

  • inplace (bool, default: True ) –

    If True, updates this object and returns self. Otherwise, returns new object.
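
The aggregation rule used when collapsing is not stated here. As one plausible reading, the sketch below assumes a group is annotated to a term if any of its indices is, i.e. an element-wise maximum within each group. `collapse_by_max` and the row dictionaries are hypothetical and do not use MetaHQ itself.

```python
# Hypothetical annotation rows: one dict per sample.
rows = [
    {"series": "GSE1", "sample": "GSM1", "UBERON:0000948": 1, "UBERON:0000955": 0},
    {"series": "GSE1", "sample": "GSM2", "UBERON:0000948": 0, "UBERON:0000955": 0},
    {"series": "GSE2", "sample": "GSM3", "UBERON:0000948": 0, "UBERON:0000955": 1},
]
TERMS = ["UBERON:0000948", "UBERON:0000955"]


def collapse_by_max(rows, on, terms):
    """Assumed rule: a group is annotated to a term if any member is."""
    collapsed = {}
    for row in rows:
        # Start each group at 0 for every term, then take the running maximum.
        group = collapsed.setdefault(row[on], {term: 0 for term in terms})
        for term in terms:
            group[term] = max(group[term], row[term])
    return collapsed


by_series = collapse_by_max(rows, on="series", terms=TERMS)
print(by_series)
```

Under this assumption the three samples reduce to one row per series, with `on` taking the role of the new index.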

drop(*args, **kwargs)

Wrapper for polars drop. Drops any of the term columns. ID columns are not dropped through this method.

filter(condition)

Filter both data and ids simultaneously using a mask.

Parameters:
  • condition (Expr) –

    Polars expression for filtering columns.

Examples:

>>> import polars as pl
>>> from metahq_core.curations.annotations import Annotations
>>> anno = pl.DataFrame(
        {
            'sample': ['GSM1', 'GSM2', 'GSM3'],
            'series': ['GSE1', 'GSE1', 'GSE2'],
            'UBERON:0000948': [1, 0, 0],
            'UBERON:0002113': [0, 1, 0],
            'UBERON:0000955': [0, 0, 1],
        }
    )
>>> anno = Annotations.from_df(anno, index_col="sample", group_cols=["series"])
>>> anno.filter(pl.col("UBERON:0000948") == 1)
┌────────┬────────┬────────────────┬────────────────┬────────────────┐
│ sample ┆ series ┆ UBERON:0000948 ┆ UBERON:0002113 ┆ UBERON:0000955 │
│ ---    ┆ ---    ┆ ---            ┆ ---            ┆ ---            │
│ str    ┆ str    ┆ i32            ┆ i32            ┆ i32            │
╞════════╪════════╪════════════════╪════════════════╪════════════════╡
│ GSM1   ┆ GSE1   ┆ 1              ┆ 0              ┆ 0              │
└────────┴────────┴────────────────┴────────────────┴────────────────┘

head(*args, **kwargs)

Wrapper for polars head function.

save(outfile, fmt, attribute, level, citation_config, metadata=None)

Save the annotations curation.

Parameters:
  • outfile (str | Path) –

    Path to the output file (e.g. outfile.json).

  • fmt (Literal['json', 'parquet', 'csv', 'tsv']) –

    File format to save to.

  • attribute (str) –

    A supported MetaHQ annotated attribute.

  • level (str) –

    An index level supported by MetaHQ.

  • citation_config (CitationConfig) –

    Parameters for saving citations.

  • metadata (bool, default: None ) –

    If True, will add index titles to each entry.

Examples:

If `metadata` is None, only the index column is saved
alongside the annotations.

>>> import polars as pl
>>> from metahq_core.curations.annotations import Annotations
>>> from metahq_core.export.references import CitationConfig
>>> config = CitationConfig(
        '1.0.1', 'tissue', 'sample', 'human', 'expert', 'rnaseq', 'annotate', '2026-04-20'
    )
>>> anno = pl.DataFrame(
        {
            'sample': ['GSM1', 'GSM2', 'GSM3'],
            'series': ['GSE1', 'GSE1', 'GSE2'],
            'UBERON:0000948': [1, 0, 0],
            'UBERON:0002113': [0, 1, 0],
            'UBERON:0000955': [0, 0, 1],
        }
    )
>>> anno = Annotations.from_df(anno, index_col='sample', group_cols=['series'])
>>> anno.save(
        '/path/to/out.parquet', fmt='parquet', attribute='tissue', level='sample',
        citation_config=config,
    )

sort_columns()

Sorts term columns.

Examples:

>>> import polars as pl
>>> from metahq_core.curations.annotations import Annotations
>>> anno = pl.DataFrame(
        {
            'sample': ['GSM1', 'GSM2', 'GSM3'],
            'series': ['GSE1', 'GSE1', 'GSE2'],
            'UBERON:0000948': [1, 0, 0],
            'UBERON:0002113': [0, 1, 0],
            'UBERON:0000955': [0, 0, 1],
        }
    )
>>> anno = Annotations.from_df(anno, index_col="sample", group_cols=["series"])
>>> anno.sort_columns()
┌────────┬────────┬────────────────┬────────────────┬────────────────┐
│ series ┆ sample ┆ UBERON:0000948 ┆ UBERON:0000955 ┆ UBERON:0002113 │
│ ---    ┆ ---    ┆ ---            ┆ ---            ┆ ---            │
│ str    ┆ str    ┆ i32            ┆ i32            ┆ i32            │
╞════════╪════════╪════════════════╪════════════════╪════════════════╡
│ GSE1   ┆ GSM1   ┆ 1              ┆ 0              ┆ 0              │
│ GSE1   ┆ GSM2   ┆ 0              ┆ 0              ┆ 1              │
│ GSE2   ┆ GSM3   ┆ 0              ┆ 1              ┆ 0              │
└────────┴────────┴────────────────┴────────────────┴────────────────┘

propagate(to_terms, ontology, mode, control_col='MONDO:0000000')

Convert annotations to propagated labels.

Assigns propagated labels to terms given their annotations.

Parameters:
  • to_terms (list[str]) –

    Array of terms to generate labels for, or "union"/"all".

  • ontology (str) –

    The name of an ontology to reference for annotation propagation.

  • mode (Literal[0, 1]) –

    Mode of propagation.

    If mode is 0, this will propagate any positive annotations
    from any descendants of the to_terms up to the to_terms.
    
    If mode is 1, this will convert annotations to -1, 0, +1 labels
    where, for a particular term, if an index is annotated to that term or
    any of its descendants, it receives a +1 label. If it is annotated to an
    ancestor of that term, it receives a 0 (unsure) label. If it is not annotated
    to an ancestor or a descendant of that term, it receives a -1 label.
    Any indices annotated to the control column are assigned a label of 2 for any
    terms that other indices within the same group are positively labeled to.
    
  • control_col (str, default: 'MONDO:0000000' ) –

    Column name for control annotations.

Returns:
  • Labels | Annotations

    A Labels curation object with propagated -1, 0, +1 labels (and 2 if controls are present). Any entries in index_col that have a 0 annotation/label across all entity columns are dropped.

Examples:

With `mode=0`:

>>> anno = pl.DataFrame(
        {
            "series": ["GSE1", "GSE1", "GSE2"],
            "sample": ["GSM1", "GSM2", "GSM3"],
            "UBERON:0000948": [1, 0, 0],
            "UBERON:0002349": [1, 1, 0],
            "UBERON:0002113": [0, 0, 0],
            "UBERON:0000955": [0, 0, 1],
        }
    )
>>> anno = Annotations.from_df(anno, index_col="sample", group_cols=["series"])
>>> anno.propagate(to_terms=["UBERON:0000948"], ontology="uberon", mode=0)
┌────────┬────────┬────────────────┐
│ sample ┆ series ┆ UBERON:0000948 │
│ ---    ┆ ---    ┆ ---            │
│ str    ┆ str    ┆ i32            │
╞════════╪════════╪════════════════╡
│ GSM1   ┆ GSE1   ┆ 1              │
│ GSM2   ┆ GSE1   ┆ 1              │
└────────┴────────┴────────────────┘

With `mode=1`:

>>> anno.propagate(to_terms=["UBERON:0000948"], ontology="uberon", mode=1)
┌────────┬────────┬────────────────┐
│ sample ┆ series ┆ UBERON:0000948 │
│ ---    ┆ ---    ┆ ---            │
│ str    ┆ str    ┆ i32            │
╞════════╪════════╪════════════════╡
│ GSM1   ┆ GSE1   ┆ 1              │
│ GSM2   ┆ GSE1   ┆ 1              │
│ GSM3   ┆ GSE2   ┆ -1             │
└────────┴────────┴────────────────┘
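
The labeling rules for both modes can be traced on a toy ontology. The sketch below is pure Python and independent of MetaHQ: the `PARENTS` mapping, the term names, and the helper functions are hypothetical, and the control label of 2 and the dropping of all-zero rows are left out.

```python
# Toy ontology (hypothetical, not taken from UBERON): term -> direct parents.
PARENTS = {
    "organ": [],
    "heart": ["organ"],
    "myocardium": ["heart"],
    "brain": ["organ"],
}


def ancestors(term):
    """Return all strict ancestors of a term in the toy ontology."""
    seen = set()
    stack = list(PARENTS[term])
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(PARENTS[parent])
    return seen


def descendants(term):
    """Return all strict descendants of a term in the toy ontology."""
    return {t for t in PARENTS if term in ancestors(t)}


def propagate_up(annotated, term):
    """Mode-0 style: 1 if annotated to the term or any descendant, else 0."""
    return int(bool(annotated & ({term} | descendants(term))))


def label(annotated, term):
    """Mode-1 style: +1 for term/descendant, 0 for ancestor, -1 otherwise."""
    if annotated & ({term} | descendants(term)):
        return 1
    if annotated & ancestors(term):
        return 0
    return -1


# Direct annotations per sample (hypothetical).
samples = {
    "GSM1": {"heart", "myocardium"},
    "GSM2": {"myocardium"},
    "GSM3": {"brain"},
    "GSM4": {"organ"},
}
labels = {sample: label(terms, "heart") for sample, terms in samples.items()}
print(labels)
```

Here `GSM2` is positive for `heart` only through its descendant annotation, `GSM4` is unsure because it is annotated to an ancestor, and `GSM3` is negative because `brain` is neither an ancestor nor a descendant of `heart`.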

select(*args, **kwargs)

Select annotation columns while maintaining ids.

slice(offset, length=None)

Slice both data and ids simultaneously using polars slice.

Parameters:
  • offset (int) –

    Index position to begin the slice.

  • length (int | None, default: None ) –

    Number of indices past offset to slice out.

Returns:
  • Annotations

    Sliced Annotations object as a subset of the original Annotations.

from_df(df, index_col, sources_col, group_cols, **kwargs) classmethod

Creates an Annotations object from a combined DataFrame.

Parameters:
  • df (DataFrame) –

    Polars DataFrame with index and group ID columns, plus one column per attribute entity (e.g. male or female, tissues, diseases, etc.) holding that entity's annotation for each index.

  • index_col (str) –

    Name of the column of data that contains the index IDs.

  • group_cols (tuple[str, ...]) –

    Names of columns of data that contain an ID for each index indicating if it belongs to a particular group (e.g. dataset, sex, platform, etc.).

Returns:
  • Annotations

    An Annotations object constructed from df.

Examples:

>>> from metahq_core.curations.annotations import Annotations
>>> anno = pl.DataFrame(
        {
            "series": ["GSE1", "GSE1", "GSE2"],
            "sample": ["GSM1", "GSM2", "GSM3"],
            "UBERON:0000948": [1, 0, 0],
            "UBERON:0002349": [1, 1, 0],
            "UBERON:0002113": [0, 0, 0],
            "UBERON:0000955": [0, 0, 1],
        }
    )
>>> anno = Annotations.from_df(anno, index_col="sample", group_cols=["series"])
>>> anno
┌────────┬────────┬────────────────┬────────────────┬────────────────┬────────────────┐
│ sample ┆ series ┆ UBERON:0000948 ┆ UBERON:0002349 ┆ UBERON:0002113 ┆ UBERON:0000955 │
│ ---    ┆ ---    ┆ ---            ┆ ---            ┆ ---            ┆ ---            │
│ str    ┆ str    ┆ i64            ┆ i64            ┆ i64            ┆ i64            │
╞════════╪════════╪════════════════╪════════════════╪════════════════╪════════════════╡
│ GSM1   ┆ GSE1   ┆ 1              ┆ 1              ┆ 0              ┆ 0              │
│ GSM2   ┆ GSE1   ┆ 0              ┆ 1              ┆ 0              ┆ 0              │
│ GSM3   ┆ GSE2   ┆ 0              ┆ 0              ┆ 0              ┆ 1              │
└────────┴────────┴────────────────┴────────────────┴────────────────┴────────────────┘