Skip to content

Aggregation

Recommendation

We recommend using the sopa.aggregate function below, which is a wrapper for all types of aggregation. Internally, it uses aggregate_channels, count_transcripts, and/or aggregate_bins, which are also documented below if needed.

sopa.aggregate(sdata, aggregate_genes=None, aggregate_channels=True, image_key=None, points_key=None, gene_column=None, shapes_key=None, bins_key=None, expand_radius_ratio=None, min_transcripts=0, min_intensity_ratio=0.1, key_added='table')

Aggregate gene counts and/or channel intensities over a SpatialData object to create an AnnData table (saved in sdata["table"]).

Info

The main arguments are sdata, aggregate_genes, and aggregate_channels. The rest of the arguments are optional and will be inferred from the data if not provided.

  • If channels are aggregated and not genes, then sdata['table'].X will contain the mean channel intensities per cell.
  • If genes are aggregated and not channels, then sdata['table'].X will contain the gene counts per cell.
  • If both genes and channels are aggregated, then sdata['table'].X will contain the gene counts per cell and sdata['table'].obsm['intensities'] will contain the mean channel intensities per cell.

Parameters:

Name Type Description Default
sdata SpatialData

A SpatialData object

required
aggregate_genes bool | None

Whether to aggregate gene counts. If None, it will be inferred from the data.

None
aggregate_channels bool

Whether to aggregate channel intensities inside cells.

True
image_key str | None

Key of sdata with the image channels to be averaged. By default, uses the segmentation image.

None
points_key str | None

Key of sdata with the points dataframe representing the transcripts.

None
gene_column str | None

Key of sdata[points_key] with the gene names.

None
shapes_key str | None

Key of sdata with the shapes corresponding to the cells boundaries.

None
bins_key str | None

Key of sdata with the table corresponding to the bin-by-gene table of gene counts (e.g., for Visium HD data).

None
expand_radius_ratio float | None

Ratio to expand the cells polygons for channels averaging. For instance, a ratio of 0.5 expands the shape radius by 50%. If None (default), use 1 if we aggregate bins data, and 0 otherwise.

None
min_transcripts int

Min number of transcripts to keep a cell.

0
min_intensity_ratio float

Min ratio of the 90th quantile of the mean channel intensity to keep a cell.

0.1
key_added str | None

Key to save the table in sdata.tables. If None, it will be f"{shapes_key}_table".

'table'
Source code in sopa/aggregation/aggregation.py
def aggregate(
    sdata: SpatialData,
    aggregate_genes: bool | None = None,
    aggregate_channels: bool = True,
    image_key: str | None = None,
    points_key: str | None = None,
    gene_column: str | None = None,
    shapes_key: str | None = None,
    bins_key: str | None = None,
    expand_radius_ratio: float | None = None,
    min_transcripts: int = 0,
    min_intensity_ratio: float = 0.1,
    key_added: str | None = "table",
):
    """Aggregate gene counts and/or channel intensities over a `SpatialData` object to create an `AnnData` table (saved in `sdata["table"]`).

    !!! info
        The main arguments are `sdata`, `aggregate_genes`, and `aggregate_channels`. The rest of the arguments are optional and will be inferred from the data if not provided.

        - If channels are aggregated and not genes, then `sdata['table'].X` will contain the mean channel intensities per cell.
        - If genes are aggregated and not channels, then `sdata['table'].X` will contain the gene counts per cell.
        - If both genes and channels are aggregated, then `sdata['table'].X` will contain the gene counts per cell and `sdata['table'].obsm['intensities']` will contain the mean channel intensities per cell.

    Args:
        sdata: A `SpatialData` object
        aggregate_genes: Whether to aggregate gene counts. If None, it will be inferred from the data.
        aggregate_channels: Whether to aggregate channel intensities inside cells.
        image_key: Key of `sdata` with the image channels to be averaged. By default, uses the segmentation image.
        points_key: Key of `sdata` with the points dataframe representing the transcripts.
        gene_column: Key of `sdata[points_key]` with the gene names.
        shapes_key: Key of `sdata` with the shapes corresponding to the cells boundaries.
        bins_key: Key of `sdata` with the table corresponding to the bin-by-gene table of gene counts (e.g., for Visium HD data).
        expand_radius_ratio: Ratio to expand the cells polygons for channels averaging. For instance, a ratio of 0.5 expands the shape radius by 50%. If `None` (default), use 1 if we aggregate bins data, and 0 otherwise.
        min_transcripts: Min number of transcripts to keep a cell.
        min_intensity_ratio: Min ratio of the 90th quantile of the mean channel intensity to keep a cell.
        key_added: Key to save the table in `sdata.tables`. If `None`, it will be `f"{shapes_key}_table"`.
    """
    if points_key is None:
        bins_key = bins_key or sdata.attrs.get(SopaAttrs.BINS_TABLE)

    points_key = None
    if (bins_key is None) and (aggregate_genes or (aggregate_genes is None and sdata.points)):
        points_key, _ = get_spatial_element(
            sdata.points, key=points_key or sdata.attrs.get(SopaAttrs.TRANSCRIPTS), return_key=True
        )

    aggr = Aggregator(sdata, image_key=image_key, shapes_key=shapes_key, bins_key=bins_key, points_key=points_key)

    if key_added is None:
        key_added = f"{aggr.shapes_key}_{SopaKeys.TABLE}"
        log.info(f"key_added is None, saving the table as '{key_added}' by default.")

    aggr.compute_table(
        aggregate_genes=aggregate_genes,
        aggregate_channels=aggregate_channels,
        expand_radius_ratio=expand_radius_ratio,
        min_transcripts=min_transcripts,
        min_intensity_ratio=min_intensity_ratio,
        gene_column=gene_column,
        key_added=key_added,
    )

sopa.aggregation.aggregate_channels(sdata, image_key=None, shapes_key=None, expand_radius_ratio=0, mode='average')

Aggregate the channel intensities per cell (either "average", or take the "min" / "max").

Parameters:

Name Type Description Default
sdata SpatialData

A SpatialData object

required
image_key str | None

Key of sdata containing the image. If only one images element, this does not have to be provided.

None
shapes_key str | None

Key of sdata containing the cell boundaries. If only one shapes element, this does not have to be provided.

None
expand_radius_ratio float

Cells polygons will be expanded by expand_radius_ratio * mean_radius. This help better aggregate boundary stainings.

0
mode str

Aggregation mode. One of "average", "min", "max". By default, average the intensity inside the cell mask.

'average'

Returns:

Type Description
ndarray

A numpy ndarray of shape (n_cells, n_channels)

Source code in sopa/aggregation/channels.py
def aggregate_channels(
    sdata: SpatialData,
    image_key: str | None = None,
    shapes_key: str | None = None,
    expand_radius_ratio: float = 0,
    mode: str = "average",
) -> np.ndarray:
    """Aggregate the channel intensities per cell (either `"average"`, or take the `"min"` / `"max"`).

    Args:
        sdata: A `SpatialData` object
        image_key: Key of `sdata` containing the image. If only one `images` element, this does not have to be provided.
        shapes_key: Key of `sdata` containing the cell boundaries. If only one `shapes` element, this does not have to be provided.
        expand_radius_ratio: Cells polygons will be expanded by `expand_radius_ratio * mean_radius`. This help better aggregate boundary stainings.
        mode: Aggregation mode. One of `"average"`, `"min"`, `"max"`. By default, average the intensity inside the cell mask.

    Returns:
        A numpy `ndarray` of shape `(n_cells, n_channels)`
    """
    assert mode in AVAILABLE_MODES, f"Invalid {mode=}. Available modes are {AVAILABLE_MODES}"

    image = get_spatial_image(sdata, image_key)

    geo_df = get_boundaries(sdata, key=shapes_key)
    geo_df = to_intrinsic(sdata, geo_df, image)
    geo_df = expand_radius(geo_df, expand_radius_ratio)

    return _aggregate_channels_aligned(image, geo_df, mode)

sopa.aggregation.count_transcripts(sdata, gene_column=None, shapes_key=None, points_key=None, geo_df=None)

Counts transcripts per cell.

Parameters:

Name Type Description Default
sdata SpatialData

A SpatialData object

required
gene_column str | None

Column of the transcript dataframe containing the gene names

None
shapes_key str | None

Key of sdata containing the cell boundaries. If only one shapes element, this does not have to be provided.

None
points_key str | None

Key of sdata containing the transcripts. If only one points element, this does not have to be provided.

None
geo_df GeoDataFrame | None

If the cell boundaries are not yet in sdata, a GeoDataFrame can be directly provided for cell boundaries

None

Returns:

Type Description
AnnData

An AnnData object of shape (n_cells, n_genes) with the counts per cell

Source code in sopa/aggregation/transcripts.py
def count_transcripts(
    sdata: SpatialData,
    gene_column: str | None = None,
    shapes_key: str | None = None,
    points_key: str | None = None,
    geo_df: gpd.GeoDataFrame | None = None,
) -> AnnData:
    """Counts transcripts per cell.

    Args:
        sdata: A `SpatialData` object
        gene_column: Column of the transcript dataframe containing the gene names
        shapes_key: Key of `sdata` containing the cell boundaries. If only one `shapes` element, this does not have to be provided.
        points_key: Key of `sdata` containing the transcripts. If only one `points` element, this does not have to be provided.
        geo_df: If the cell boundaries are not yet in `sdata`, a `GeoDataFrame` can be directly provided for cell boundaries

    Returns:
        An `AnnData` object of shape `(n_cells, n_genes)` with the counts per cell
    """
    points_key, points = get_spatial_element(
        sdata.points, key=points_key or sdata.attrs.get(SopaAttrs.TRANSCRIPTS), return_key=True
    )

    if geo_df is None:
        geo_df = get_boundaries(sdata, key=shapes_key)
        geo_df = to_intrinsic(sdata, geo_df, points_key)

    gene_column = gene_column or get_feature_key(points, raise_error=True)

    log.info(f"Aggregating transcripts over {len(geo_df)} cells")
    return _count_transcripts_aligned(geo_df, points, gene_column)

sopa.aggregation.aggregate_bins(sdata, shapes_key, bins_key, expand_radius_ratio=0)

Aggregate bins (for instance, from Visium HD data) into cells.

Parameters:

Name Type Description Default
sdata SpatialData

The SpatialData object

required
shapes_key str

Key of the shapes containing the cell boundaries

required
bins_key str

Key of the table containing the bin-by-gene counts

required
expand_radius_ratio float

Cells polygons will be expanded by expand_radius_ratio * mean_radius. This help better aggregate bins from the cytoplasm.

0

Returns:

Type Description
AnnData

An AnnData object of shape with the cell-by-gene count matrix

Source code in sopa/aggregation/bins.py
def aggregate_bins(
    sdata: SpatialData,
    shapes_key: str,
    bins_key: str,
    expand_radius_ratio: float = 0,
) -> AnnData:
    """Aggregate bins (for instance, from Visium HD data) into cells.

    Args:
        sdata: The `SpatialData` object
        shapes_key: Key of the shapes containing the cell boundaries
        bins_key: Key of the table containing the bin-by-gene counts
        expand_radius_ratio: Cells polygons will be expanded by `expand_radius_ratio * mean_radius`. This help better aggregate bins from the cytoplasm.

    Returns:
        An `AnnData` object of shape with the cell-by-gene count matrix
    """
    bins_table: AnnData = sdata.tables[bins_key]

    bins_shapes_key = sdata.get_annotated_regions(bins_table)
    bins_shapes_key = bins_shapes_key[0] if isinstance(bins_shapes_key, list) else bins_shapes_key
    bins = sdata.shapes[bins_shapes_key].loc[sdata.get_instance_key_column(bins_table).values]
    bins = gpd.GeoDataFrame(geometry=bins.centroid.values)  # bins as points

    cells = to_intrinsic(sdata, shapes_key, bins_shapes_key).reset_index(drop=True)
    cells = expand_radius(cells, expand_radius_ratio)

    bin_within_cell = gpd.sjoin(bins, cells)

    indices_matrix = coo_matrix(
        (np.full(len(bin_within_cell), 1), (bin_within_cell["index_right"], bin_within_cell.index)),
        shape=(len(cells), len(bins)),
    )

    adata = AnnData(indices_matrix @ bins_table.X, obs=cells[[]], var=bins_table.var)
    adata.obsm["spatial"] = np.stack([cells.centroid.x, cells.centroid.y], axis=1)
    return adata