Annotations¶

Bases: BaseCuration

Class to store and mutate annotations of samples to various attributes like tissues, diseases, sexes, ages, etc.

Attributes:

data (DataFrame) –

Polars DataFrame with index and group ID columns, plus one column per attribute entity holding the annotation of each index (e.g. male or female, tissues, diseases, etc.).
disease (bool) –

Indicates if the annotations are disease-based. Used to account for control samples when converting annotations to labels.
index_col (str) –

Name of the column of data that contains the index IDs.
group_cols (tuple[str, ...]) –

Names of columns of data that contain an ID for each index indicating if it belongs to a particular group (e.g. dataset, sex, platform, etc.).
collapsed (bool) –

Indicates if the annotations have already been collapsed.

`entities` `property` ¶

Returns term names of the Annotations frame.

Examples:

>>> anno = pl.DataFrame(
        {
            "series": ["GSE1", "GSE1", "GSE2"],
            "sample": ["GSM1", "GSM2", "GSM3"],
            "UBERON:0000948": [1, 0, 0],
            "UBERON:0002349": [1, 1, 0],
            "UBERON:0002113": [0, 0, 0],
            "UBERON:0000955": [0, 0, 1],
        }
    )
>>> anno = Annotations.from_df(anno, index_col="sample", group_cols=["series"])
>>> anno.entities
['UBERON:0000955', 'UBERON:0002349', 'UBERON:0000948', 'UBERON:0002113']

`groups` `property` ¶

Returns the groups column of the Annotations curation.

Examples:

>>> anno = pl.DataFrame(
        {
            "series": ["GSE1", "GSE1", "GSE2"],
            "sample": ["GSM1", "GSM2", "GSM3"],
            "UBERON:0000948": [1, 0, 0],
            "UBERON:0002349": [1, 1, 0],
            "UBERON:0002113": [0, 0, 0],
            "UBERON:0000955": [0, 0, 1],
        }
    )
>>> anno = Annotations.from_df(anno, index_col="sample", group_cols=["series"])
>>> anno.groups
['GSE1', 'GSE1', 'GSE2']

`ids` `property` ¶

Return the IDs dataframe.

Examples:

>>> anno = pl.DataFrame(
        {
            "series": ["GSE1", "GSE1", "GSE2"],
            "sample": ["GSM1", "GSM2", "GSM3"],
            "UBERON:0000948": [1, 0, 0],
            "UBERON:0002349": [1, 1, 0],
            "UBERON:0002113": [0, 0, 0],
            "UBERON:0000955": [0, 0, 1],
        }
    )
>>> anno = Annotations.from_df(anno, index_col="sample", group_cols=["series"])
>>> anno.ids
┌────────┬────────┐
│ sample ┆ series │
│ ---    ┆ ---    │
│ str    ┆ str    │
╞════════╪════════╡
│ GSM1   ┆ GSE1   │
│ GSM2   ┆ GSE1   │
│ GSM3   ┆ GSE2   │
└────────┴────────┘

`index` `property` ¶

Return the index column as a list.

Examples:

>>> anno = pl.DataFrame(
        {
            "series": ["GSE1", "GSE1", "GSE2"],
            "sample": ["GSM1", "GSM2", "GSM3"],
            "UBERON:0000948": [1, 0, 0],
            "UBERON:0002349": [1, 1, 0],
            "UBERON:0002113": [0, 0, 0],
            "UBERON:0000955": [0, 0, 1],
        }
    )
>>> anno = Annotations.from_df(anno, index_col="sample", group_cols=["series"])
>>> anno.index
['GSM1', 'GSM2', 'GSM3']

`n_indices` `property` ¶

Returns number of indices.

Examples:

>>> anno = pl.DataFrame(
        {
            "series": ["GSE1", "GSE1", "GSE2"],
            "sample": ["GSM1", "GSM2", "GSM3"],
            "UBERON:0000948": [1, 0, 0],
            "UBERON:0002349": [1, 1, 0],
            "UBERON:0002113": [0, 0, 0],
            "UBERON:0000955": [0, 0, 1],
        }
    )
>>> anno = Annotations.from_df(anno, index_col="sample", group_cols=["series"])
>>> anno.n_indices
3

`n_entities` `property` ¶

Returns number of entities.

Examples:

>>> anno = pl.DataFrame(
        {
            "series": ["GSE1", "GSE1", "GSE2"],
            "sample": ["GSM1", "GSM2", "GSM3"],
            "UBERON:0000948": [1, 0, 0],
            "UBERON:0002349": [1, 1, 0],
            "UBERON:0002113": [0, 0, 0],
            "UBERON:0000955": [0, 0, 1],
        }
    )
>>> anno = Annotations.from_df(anno, index_col="sample", group_cols=["series"])
>>> anno.n_entities
4

`unique_groups` `property` ¶

Returns unique groups.

Examples:

>>> anno = pl.DataFrame(
        {
            "series": ["GSE1", "GSE1", "GSE2"],
            "sample": ["GSM1", "GSM2", "GSM3"],
            "UBERON:0000948": [1, 0, 0],
            "UBERON:0002349": [1, 1, 0],
            "UBERON:0002113": [0, 0, 0],
            "UBERON:0000955": [0, 0, 1],
        }
    )
>>> anno = Annotations.from_df(anno, index_col="sample", group_cols=["series"])
>>> anno.unique_groups
['GSE2', 'GSE1']

`add_ids(new)` ¶

Append new group ID columns to the IDs of an Annotations object. The new IDs must have a matching index.

Parameters:	`new` (`DataFrame`) – A DataFrame of additional IDs to join with the current index column of `data`. Must have the same index column as the original `data`.

Returns:	`Annotations` – A new Annotations object including the new ID columns.
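
As a rough illustration of the join described above, the following plain-Python sketch attaches a new ID column to existing IDs by matching on the index column. It does not call MetaHQ or polars. The helper `add_id_columns`, the `platform` column, and the `GPL` values are hypothetical, and raising `ValueError` on a mismatched index is an assumption, since the behavior of `add_ids` in that case is not documented here.

```python
# Existing IDs: one row per index, with one group column.
ids = [
    {"sample": "GSM1", "series": "GSE1"},
    {"sample": "GSM2", "series": "GSE1"},
    {"sample": "GSM3", "series": "GSE2"},
]

# New IDs keyed by the same index column (row order may differ).
new = [
    {"sample": "GSM3", "platform": "GPL2"},
    {"sample": "GSM1", "platform": "GPL1"},
    {"sample": "GSM2", "platform": "GPL1"},
]


def add_id_columns(ids, new, index_col):
    """Join new ID columns onto ids by index_col and return new rows."""
    lookup = {row[index_col]: row for row in new}
    # Assumption: a mismatched index is treated as an error.
    if set(lookup) != {row[index_col] for row in ids}:
        raise ValueError("new IDs must cover the same index values")
    # Build fresh dicts so the original ids are left untouched.
    return [{**row, **lookup[row[index_col]]} for row in ids]


joined = add_id_columns(ids, new, "sample")
for row in joined:
    print(row)
```

The sketch returns new rows instead of mutating `ids`, mirroring the documented behavior of returning a new Annotations object.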

`collapse(on, inplace=True)` ¶

Collapses annotations on the specified grouping column.

Parameters:

on (str) –

The column to collapse on. This should be one of the columns in `group_cols`.
inplace (bool, default: True ) –

If True, updates this object and returns self. Otherwise, returns a new object.

`drop(*args, **kwargs)` ¶

Wrapper for polars drop. Drops any of the term columns. ID columns are not dropped through this method.

`filter(condition)` ¶

Filter both data and ids simultaneously using a mask.

Parameters:	`condition` (`Expr`) – Polars expression for filtering columns.

Examples:

>>> import polars as pl
>>> from metahq_core.curations.annotations import Annotations
>>> anno = {
        'sample': ['GSM1', 'GSM2', 'GSM3'],
        'series': ['GSE1', 'GSE1', 'GSE2'],
        'UBERON:0000948': [1, 0, 0],
        'UBERON:0002113': [0, 1, 0],
        'UBERON:0000955': [0, 0, 1],
    }
>>> anno = Annotations.from_df(anno, index_col="sample", group_cols=["series"])
>>> anno.filter(pl.col("UBERON:0000948") == 1)
┌────────┬────────┬────────────────┬────────────────┬────────────────┐
│ sample ┆ series ┆ UBERON:0000948 ┆ UBERON:0002113 ┆ UBERON:0000955 │
│ ---    ┆ ---    ┆ ---            ┆ ---            ┆ ---            │
│ str    ┆ str    ┆ i32            ┆ i32            ┆ i32            │
╞════════╪════════╪════════════════╪════════════════╪════════════════╡
│ GSM1   ┆ GSE1   ┆ 1              ┆ 0              ┆ 0              │
└────────┴────────┴────────────────┴────────────────┴────────────────┘

`head(*args, **kwargs)` ¶

Wrapper for polars head function.

`save(outfile, fmt, attribute, level, citation_config, metadata=None)` ¶

Save the annotations curation.

Parameters:

outfile (str | Path) –

Path to the output file.
fmt (Literal['json', 'parquet', 'csv', 'tsv']) –

File format to save to.
attribute (str) –

A supported MetaHQ annotated attribute.
level (str) –

An index level supported by MetaHQ.
citation_config (CitationConfig) –

Parameters for saving citations.
metadata (bool, default: None ) –

If True, will add index titles to each entry.

Examples:

If `metadata` is None, only the index column and the
annotations are saved.

>>> from metahq_core.curations.annotations import Annotations
>>> from metahq_core.export.references import CitationConfig
>>> config = CitationConfig(
        '1.0.1', 'tissue', 'sample', 'human', 'expert', 'rnaseq', 'annotate', '2026-04-20'
    )
>>> anno = {
        'sample': ['GSM1', 'GSM2', 'GSM3'],
        'series': ['GSE1', 'GSE1', 'GSE2'],
        'UBERON:0000948': [1, 0, 0],
        'UBERON:0002113': [0, 1, 0],
        'UBERON:0000955': [0, 0, 1],
    }
>>> anno = Annotations.from_df(anno, index_col='sample', group_cols=['series'])
>>> anno.save(
        '/path/to/out.parquet', fmt='parquet', attribute='tissue', level='sample',
        citation_config=config,
    )

`sort_columns()` ¶

Sorts term columns.

Examples:

>>> from metahq_core.curations.annotations import Annotations
>>> anno = {
        'sample': ['GSM1', 'GSM2', 'GSM3'],
        'series': ['GSE1', 'GSE1', 'GSE2'],
        'UBERON:0000948': [1, 0, 0],
        'UBERON:0002113': [0, 1, 0],
        'UBERON:0000955': [0, 0, 1],
    }
>>> anno = Annotations.from_df(anno, index_col="sample", group_cols=["series"])
>>> anno.sort_columns()
┌────────┬────────┬────────────────┬────────────────┬────────────────┐
│ series ┆ sample ┆ UBERON:0000948 ┆ UBERON:0000955 ┆ UBERON:0002113 │
│ ---    ┆ ---    ┆ ---            ┆ ---            ┆ ---            │
│ str    ┆ str    ┆ i32            ┆ i32            ┆ i32            │
╞════════╪════════╪════════════════╪════════════════╪════════════════╡
│ GSE1   ┆ GSM1   ┆ 1              ┆ 0              ┆ 0              │
│ GSE1   ┆ GSM2   ┆ 0              ┆ 0              ┆ 1              │
│ GSE2   ┆ GSM3   ┆ 0              ┆ 1              ┆ 0              │
└────────┴────────┴────────────────┴────────────────┴────────────────┘

`propagate(to_terms, ontology, mode, control_col='MONDO:0000000')` ¶

Convert annotations to propagated labels.

Assigns propagated labels to terms given their annotations.

Parameters:

to_terms (list[str]) –

Array of terms to generate labels for, or "union"/"all".
ontology (str) –

The name of an ontology to reference for annotation propagation.

mode (Literal[0, 1]) –

Mode of propagation.

If mode is 0, this will propagate any positive annotations
from any descendants of the to_terms up to the to_terms.

If mode is 1, this will convert annotations to -1, 0, +1 labels
where for a particular term, if an index is annotated to that term or
any of its descendants, it receives a +1 label. If it is annotated to an
ancestor of that term, it receives a 0 (unsure) label. If it is not annotated
to an ancestor or a descendant of that term, it receives a -1 label.
Any indices annotated to the control column are assigned a label of 2 for any
terms that other indices within the same group are positively labeled to.

control_col (str, default: 'MONDO:0000000' ) –

Column name for control annotations.

Returns:	`Labels \| Annotations` – A Labels curation object with propagated -1, 0, +1 labels (and 2 if controls are present). Any entries in `index_col` that have a 0 annotation/label across all entity columns are dropped.
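
The two propagation modes described above can be sketched in plain Python on a toy ontology. This is only an illustration of the labeling rules, not the MetaHQ implementation: the `DESCENDANTS` mapping, the term names, and the helper functions are all hypothetical, and the control-sample rule (label 2) is left out.

```python
# Hypothetical toy ontology: each term maps to the set of all its descendants.
DESCENDANTS = {
    "organ": {"heart", "myocardium", "brain"},
    "heart": {"myocardium"},
    "myocardium": set(),
    "brain": set(),
}


def ancestors_of(term):
    """All terms that list `term` among their descendants."""
    return {t for t, desc in DESCENDANTS.items() if term in desc}


def label_mode0(annotated, term):
    """Mode 0: 1 if annotated to the term or any descendant, else 0."""
    return 1 if annotated & ({term} | DESCENDANTS[term]) else 0


def label_mode1(annotated, term):
    """Mode 1: +1 for term/descendant, 0 for ancestor, -1 otherwise."""
    if annotated & ({term} | DESCENDANTS[term]):
        return 1
    if annotated & ancestors_of(term):
        return 0  # annotated only to a broader term: unsure
    return -1


# Each sample maps to the set of terms it is directly annotated to.
samples = {
    "GSM1": {"heart"},
    "GSM2": {"myocardium"},
    "GSM3": {"brain"},
    "GSM4": {"organ"},
}

mode0 = {s: label_mode0(a, "heart") for s, a in samples.items()}
mode1 = {s: label_mode1(a, "heart") for s, a in samples.items()}
print(mode0)  # {'GSM1': 1, 'GSM2': 1, 'GSM3': 0, 'GSM4': 0}
print(mode1)  # {'GSM1': 1, 'GSM2': 1, 'GSM3': -1, 'GSM4': 0}
```

In this sketch `GSM4` is annotated only to an ancestor of `heart`, so it gets the unsure label 0 in mode 1, while `GSM3` is unrelated and gets -1.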

Examples:

With `mode=0`:

>>> anno = pl.DataFrame(
        {
            "series": ["GSE1", "GSE1", "GSE2"],
            "sample": ["GSM1", "GSM2", "GSM3"],
            "UBERON:0000948": [1, 0, 0],
            "UBERON:0002349": [1, 1, 0],
            "UBERON:0002113": [0, 0, 0],
            "UBERON:0000955": [0, 0, 1],
        }
    )
>>> anno = Annotations.from_df(anno, index_col="sample", group_cols=["series"])
>>> anno.propagate(to_terms=["UBERON:0000948"], ontology="uberon", mode=0)
┌────────┬────────┬────────────────┐
│ sample ┆ series ┆ UBERON:0000948 │
│ ---    ┆ ---    ┆ ---            │
│ str    ┆ str    ┆ i32            │
╞════════╪════════╪════════════════╡
│ GSM1   ┆ GSE1   ┆ 1              │
│ GSM2   ┆ GSE1   ┆ 1              │
└────────┴────────┴────────────────┘

With `mode=1`:

>>> anno.propagate(to_terms=["UBERON:0000948"], ontology="uberon", mode=1)
┌────────┬────────┬────────────────┐
│ sample ┆ series ┆ UBERON:0000948 │
│ ---    ┆ ---    ┆ ---            │
│ str    ┆ str    ┆ i32            │
╞════════╪════════╪════════════════╡
│ GSM1   ┆ GSE1   ┆ 1              │
│ GSM2   ┆ GSE1   ┆ 1              │
│ GSM3   ┆ GSE2   ┆ -1             │
└────────┴────────┴────────────────┘

`select(*args, **kwargs)` ¶

Select annotation columns while maintaining ids.

`slice(offset, length=None)` ¶

Slice both data and ids simultaneously using polars slice.

Parameters:

offset (int) –

Index position to begin the slice.
length (int | None, default: None ) –

Number of indices past `offset` to slice out.

Returns:	`Annotations` – Sliced Annotations object as a subset of the original Annotations.
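
The idea of slicing data and IDs together can be pictured with two parallel lists. This plain-Python sketch is only a conceptual analogy: `slice_both` is a hypothetical helper, it assumes a non-negative `offset`, and it does not reproduce how polars handles negative offsets.

```python
# Parallel "tables": annotation rows and their matching index IDs.
data = [[1, 0], [0, 1], [0, 0], [1, 1]]
ids = ["GSM1", "GSM2", "GSM3", "GSM4"]


def slice_both(data, ids, offset, length=None):
    """Apply the same offset/length window to both lists.

    Assumes offset >= 0. With length=None, everything from
    offset to the end is kept.
    """
    end = None if length is None else offset + length
    return data[offset:end], ids[offset:end]


sub_data, sub_ids = slice_both(data, ids, offset=1, length=2)
rest_data, rest_ids = slice_both(data, ids, offset=2)
print(sub_ids, sub_data)
print(rest_ids, rest_data)
```

Applying one window to both lists keeps every remaining row aligned with its ID, which is the property the method description emphasizes.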

`from_df(df, index_col, sources_col, group_cols, **kwargs)` `classmethod` ¶

Creates an Annotations object from a combined DataFrame.

Parameters:

df (DataFrame) –

Polars DataFrame with index and group ID columns, plus one column per attribute entity holding the annotation of each index (e.g. male or female, tissues, diseases, etc.).
index_col (str) –

Name of the column of data that contains the index IDs.
group_cols (tuple[str, ...]) –

Names of columns of data that contain an ID for each index indicating if it belongs to a particular group (e.g. dataset, sex, platform, etc.).

Returns:	`Annotations` – An Annotations object constructed from `df`.

Examples:

>>> import polars as pl
>>> from metahq_core.curations.annotations import Annotations
>>> anno = pl.DataFrame(
        {
            "series": ["GSE1", "GSE1", "GSE2"],
            "sample": ["GSM1", "GSM2", "GSM3"],
            "UBERON:0000948": [1, 0, 0],
            "UBERON:0002349": [1, 1, 0],
            "UBERON:0002113": [0, 0, 0],
            "UBERON:0000955": [0, 0, 1],
        }
    )
>>> anno = Annotations.from_df(anno, index_col="sample", group_cols=["series"])
>>> anno
┌────────┬────────┬────────────────┬────────────────┬────────────────┬────────────────┐
│ sample ┆ series ┆ UBERON:0000948 ┆ UBERON:0002349 ┆ UBERON:0002113 ┆ UBERON:0000955 │
│ ---    ┆ ---    ┆ ---            ┆ ---            ┆ ---            ┆ ---            │
│ str    ┆ str    ┆ i64            ┆ i64            ┆ i64            ┆ i64            │
╞════════╪════════╪════════════════╪════════════════╪════════════════╪════════════════╡
│ GSM1   ┆ GSE1   ┆ 1              ┆ 1              ┆ 0              ┆ 0              │
│ GSM2   ┆ GSE1   ┆ 0              ┆ 1              ┆ 0              ┆ 0              │
│ GSM3   ┆ GSE2   ┆ 0              ┆ 0              ┆ 0              ┆ 1              │
└────────┴────────┴────────────────┴────────────────┴────────────────┴────────────────┘
