Skip to content

Aggregation

Recommendation

We recommend using the sopa.aggregate function below, which is a wrapper for all types of aggregation. Internally, it uses aggregate_channels, count_transcripts, and/or aggregate_bins, which are also documented below if needed.

sopa.aggregate(sdata, aggregate_genes=None, aggregate_channels=True, image_key=None, points_key=None, gene_column=None, shapes_key=None, bins_key=None, expand_radius_ratio=None, min_transcripts=0, min_intensity_ratio=0.1, key_added='table')

Aggregate gene counts and/or channel intensities over a SpatialData object to create an AnnData table (saved in sdata["table"]).

Info

The main arguments are sdata, aggregate_genes, and aggregate_channels. The rest of the arguments are optional and will be inferred from the data if not provided.

  • If channels are aggregated and not genes, then sdata['table'].X will contain the mean channel intensities per cell.
  • If genes are aggregated and not channels, then sdata['table'].X will contain the gene counts per cell.
  • If both genes and channels are aggregated, then sdata['table'].X will contain the gene counts per cell and sdata['table'].obsm['intensities'] will contain the mean channel intensities per cell.

Parameters:

Name Type Description Default
sdata SpatialData

A SpatialData object

required
aggregate_genes bool | None

Whether to aggregate gene counts. If None, it will be inferred from the data.

None
aggregate_channels bool

Whether to aggregate channel intensities inside cells.

True
image_key str | None

Key of sdata with the image channels to be averaged. By default, uses the segmentation image.

None
points_key str | None

Key of sdata with the points dataframe representing the transcripts.

None
gene_column str | None

Key of sdata[points_key] with the gene names.

None
shapes_key str | None

Key of sdata with the shapes corresponding to the cells boundaries.

None
bins_key str | None

Key of sdata with the table corresponding to the bin-by-gene table of gene counts (e.g., for Visium HD data).

None
expand_radius_ratio float | None

Ratio to expand the cells polygons for channels averaging. For instance, a ratio of 0.5 expands the shape radius by 50%. If None (default), use 1 if we aggregate bins data, and 0 otherwise.

None
min_transcripts int

Min number of transcripts to keep a cell.

0
min_intensity_ratio float

Min ratio of the 90th quantile of the mean channel intensity to keep a cell.

0.1
key_added str | None

Key to save the table in sdata.tables. If None, it will be f"{shapes_key}_table".

'table'
Source code in sopa/aggregation/aggregation.py
def aggregate(
    sdata: SpatialData,
    aggregate_genes: bool | None = None,
    aggregate_channels: bool = True,
    image_key: str | None = None,
    points_key: str | None = None,
    gene_column: str | None = None,
    shapes_key: str | None = None,
    bins_key: str | None = None,
    expand_radius_ratio: float | None = None,
    min_transcripts: int = 0,
    min_intensity_ratio: float = 0.1,
    key_added: str | None = "table",
):
    """Aggregate gene counts and/or channel intensities over a `SpatialData` object to create an `AnnData` table (saved in `sdata["table"]`).

    !!! info
        The main arguments are `sdata`, `aggregate_genes`, and `aggregate_channels`. The rest of the arguments are optional and will be inferred from the data if not provided.

        - If channels are aggregated and not genes, then `sdata['table'].X` will contain the mean channel intensities per cell.
        - If genes are aggregated and not channels, then `sdata['table'].X` will contain the gene counts per cell.
        - If both genes and channels are aggregated, then `sdata['table'].X` will contain the gene counts per cell and `sdata['table'].obsm['intensities']` will contain the mean channel intensities per cell.

    Args:
        sdata: A `SpatialData` object
        aggregate_genes: Whether to aggregate gene counts. If None, it will be inferred from the data.
        aggregate_channels: Whether to aggregate channel intensities inside cells.
        image_key: Key of `sdata` with the image channels to be averaged. By default, uses the segmentation image.
        points_key: Key of `sdata` with the points dataframe representing the transcripts.
        gene_column: Key of `sdata[points_key]` with the gene names.
        shapes_key: Key of `sdata` with the shapes corresponding to the cells boundaries.
        bins_key: Key of `sdata` with the table corresponding to the bin-by-gene table of gene counts (e.g., for Visium HD data).
        expand_radius_ratio: Ratio to expand the cells polygons for channels averaging. For instance, a ratio of 0.5 expands the shape radius by 50%. If `None` (default), use 1 if we aggregate bins data, and 0 otherwise.
        min_transcripts: Min number of transcripts to keep a cell.
        min_intensity_ratio: Min ratio of the 90th quantile of the mean channel intensity to keep a cell.
        key_added: Key to save the table in `sdata.tables`. If `None`, it will be `f"{shapes_key}_table"`.
    """
    assert points_key is None or bins_key is None, "Provide either `points_key` or `bins_key`, not both."

    if points_key is None:
        bins_key = bins_key or sdata.attrs.get(SopaAttrs.BINS_TABLE)

    if (bins_key is None) and (aggregate_genes or (aggregate_genes is None and sdata.points)):
        points_key, _ = get_spatial_element(
            sdata.points, key=points_key or sdata.attrs.get(SopaAttrs.TRANSCRIPTS), return_key=True
        )

    aggr = Aggregator(sdata, image_key=image_key, shapes_key=shapes_key, bins_key=bins_key, points_key=points_key)

    if key_added is None:
        key_added = f"{aggr.shapes_key}_{SopaKeys.TABLE}"
        log.info(f"key_added is None, saving the table as '{key_added}' by default.")

    aggr.compute_table(
        aggregate_genes=aggregate_genes,
        aggregate_channels=aggregate_channels,
        expand_radius_ratio=expand_radius_ratio,
        min_transcripts=min_transcripts,
        min_intensity_ratio=min_intensity_ratio,
        gene_column=gene_column,
        key_added=key_added,
    )

sopa.aggregation.aggregate_channels(sdata, image_key=None, shapes_key=None, expand_radius_ratio=0, mode='average')

Aggregate the channel intensities per cell (either "average", or take the "min" / "max").

Parameters:

Name Type Description Default
sdata SpatialData

A SpatialData object

required
image_key str | None

Key of sdata containing the image. If only one images element, this does not have to be provided.

None
shapes_key str | None

Key of sdata containing the cell boundaries. If only one shapes element, this does not have to be provided.

None
expand_radius_ratio float

Cells polygons will be expanded by expand_radius_ratio * mean_radius. This help better aggregate boundary stainings.

0
mode str

Aggregation mode. One of "average", "min", "max". By default, average the intensity inside the cell mask.

'average'

Returns:

Type Description
ndarray

A numpy ndarray of shape (n_cells, n_channels)

Source code in sopa/aggregation/channels.py
def aggregate_channels(
    sdata: SpatialData,
    image_key: str | None = None,
    shapes_key: str | None = None,
    expand_radius_ratio: float = 0,
    mode: str = "average",
) -> np.ndarray:
    """Aggregate the channel intensities per cell (either `"average"`, or take the `"min"` / `"max"`).

    Args:
        sdata: A `SpatialData` object
        image_key: Key of `sdata` containing the image. If only one `images` element, this does not have to be provided.
        shapes_key: Key of `sdata` containing the cell boundaries. If only one `shapes` element, this does not have to be provided.
        expand_radius_ratio: Cells polygons will be expanded by `expand_radius_ratio * mean_radius`. This help better aggregate boundary stainings.
        mode: Aggregation mode. One of `"average"`, `"min"`, `"max"`. By default, average the intensity inside the cell mask.

    Returns:
        A numpy `ndarray` of shape `(n_cells, n_channels)`
    """
    assert mode in AVAILABLE_MODES, f"Invalid {mode=}. Available modes are {AVAILABLE_MODES}"

    image = get_spatial_image(sdata, image_key)

    geo_df = get_boundaries(sdata, key=shapes_key)
    geo_df = to_intrinsic(sdata, geo_df, image)
    geo_df = expand_radius(geo_df, expand_radius_ratio)

    return _aggregate_channels_aligned(image, geo_df, mode)

sopa.aggregation.count_transcripts(sdata, gene_column=None, shapes_key=None, points_key=None, geo_df=None)

Counts transcripts per cell.

Parameters:

Name Type Description Default
sdata SpatialData

A SpatialData object

required
gene_column str | None

Column of the transcript dataframe containing the gene names

None
shapes_key str | None

Key of sdata containing the cell boundaries. If only one shapes element, this does not have to be provided.

None
points_key str | None

Key of sdata containing the transcripts. If only one points element, this does not have to be provided.

None
geo_df GeoDataFrame | None

If the cell boundaries are not yet in sdata, a GeoDataFrame can be directly provided for cell boundaries

None

Returns:

Type Description
AnnData

An AnnData object of shape (n_cells, n_genes) with the counts per cell

Source code in sopa/aggregation/transcripts.py
def count_transcripts(
    sdata: SpatialData,
    gene_column: str | None = None,
    shapes_key: str | None = None,
    points_key: str | None = None,
    geo_df: gpd.GeoDataFrame | None = None,
) -> AnnData:
    """Counts transcripts per cell.

    Args:
        sdata: A `SpatialData` object
        gene_column: Column of the transcript dataframe containing the gene names
        shapes_key: Key of `sdata` containing the cell boundaries. If only one `shapes` element, this does not have to be provided.
        points_key: Key of `sdata` containing the transcripts. If only one `points` element, this does not have to be provided.
        geo_df: If the cell boundaries are not yet in `sdata`, a `GeoDataFrame` can be directly provided for cell boundaries

    Returns:
        An `AnnData` object of shape `(n_cells, n_genes)` with the counts per cell
    """
    points_key, points = get_spatial_element(
        sdata.points, key=points_key or sdata.attrs.get(SopaAttrs.TRANSCRIPTS), return_key=True
    )

    if geo_df is None:
        geo_df = get_boundaries(sdata, key=shapes_key)
        geo_df = to_intrinsic(sdata, geo_df, points_key)

    gene_column = gene_column or get_feature_key(points, raise_error=True)

    log.info(f"Aggregating transcripts over {len(geo_df)} cells")
    return _count_transcripts_aligned(geo_df, points, gene_column)

sopa.aggregation.aggregate_bins(sdata, shapes_key, bins_key, expand_radius_ratio=0)

Aggregate bins (for instance, from Visium HD data) into cells.

Parameters:

Name Type Description Default
sdata SpatialData

The SpatialData object

required
shapes_key str

Key of the shapes containing the cell boundaries

required
bins_key str

Key of the table containing the bin-by-gene counts

required
expand_radius_ratio float

Cells polygons will be expanded by expand_radius_ratio * mean_radius. This help better aggregate bins from the cytoplasm.

0

Returns:

Type Description
AnnData

An AnnData object of shape with the cell-by-gene count matrix

Source code in sopa/aggregation/bins.py
def aggregate_bins(
    sdata: SpatialData,
    shapes_key: str,
    bins_key: str,
    expand_radius_ratio: float = 0,
) -> AnnData:
    """Aggregate bins (for instance, from Visium HD data) into cells.

    Args:
        sdata: The `SpatialData` object
        shapes_key: Key of the shapes containing the cell boundaries
        bins_key: Key of the table containing the bin-by-gene counts
        expand_radius_ratio: Cells polygons will be expanded by `expand_radius_ratio * mean_radius`. This help better aggregate bins from the cytoplasm.

    Returns:
        An `AnnData` object of shape with the cell-by-gene count matrix
    """
    bins_table: AnnData = sdata.tables[bins_key]

    bins_shapes_key = sdata.get_annotated_regions(bins_table)
    bins_shapes_key = bins_shapes_key[0] if isinstance(bins_shapes_key, list) else bins_shapes_key
    bins = sdata.shapes[bins_shapes_key].loc[sdata.get_instance_key_column(bins_table).values]
    bins = gpd.GeoDataFrame(geometry=bins.centroid.values)  # bins as points

    cells = to_intrinsic(sdata, shapes_key, bins_shapes_key).reset_index(drop=True)
    cells = expand_radius(cells, expand_radius_ratio)

    bin_within_cell = gpd.sjoin(bins, cells)

    indices_matrix = coo_matrix(
        (np.full(len(bin_within_cell), 1), (bin_within_cell["index_right"], bin_within_cell.index)),
        shape=(len(cells), len(bins)),
    )

    adata = AnnData(indices_matrix @ bins_table.X, obs=cells[[]], var=bins_table.var)
    adata.obsm["spatial"] = np.stack([cells.centroid.x, cells.centroid.y], axis=1)
    return adata

sopa.overlay_segmentation(sdata, shapes_key, gene_column=None, area_ratio_threshold=0.25, image_key=None, table_key=SopaKeys.TABLE)

Overlay a segmentation on top of an existing segmentation

Parameters:

Name Type Description Default
sdata SpatialData

A SpatialData object

required
shapes_key str

The key of the new shapes to be added

required
gene_column str | None

Key of the points dataframe containing the genes names

None
area_ratio_threshold float

Threshold between 0 and 1. For each original cell overlapping with a new cell, we compute the overlap-area/cell-area, if above the threshold the cell is removed.

0.25
image_key str | None

Optional key of the original image

None
table_key str

Key of the table to be overlayed

TABLE
Source code in sopa/aggregation/overlay.py
def overlay_segmentation(
    sdata: SpatialData,
    shapes_key: str,
    gene_column: str | None = None,
    area_ratio_threshold: float = 0.25,
    image_key: str | None = None,
    table_key: str = SopaKeys.TABLE,
):
    """Overlay a segmentation on top of an existing segmentation

    Args:
        sdata: A `SpatialData` object
        shapes_key: The key of the new shapes to be added
        gene_column: Key of the points dataframe containing the genes names
        area_ratio_threshold: Threshold between 0 and 1. For each original cell overlapping with a new cell, we compute the overlap-area/cell-area, if above the threshold the cell is removed.
        image_key: Optional key of the original image
        table_key: Key of the table to be overlayed
    """
    aggregate_genes, aggregate_channels = False, False

    assert table_key in sdata.tables, f"No table with name '{table_key}' found in the SpatialData object"

    old_table: AnnData = sdata.tables[table_key]

    assert SopaKeys.UNS_KEY in old_table.uns, "It seems the table was not aggregated using `sopa.aggregate`"

    sopa_attrs = old_table.uns[SopaKeys.UNS_KEY]

    aggregate_genes = sopa_attrs[SopaKeys.UNS_HAS_TRANSCRIPTS]
    aggregate_channels = sopa_attrs[SopaKeys.UNS_HAS_INTENSITIES]

    if aggregate_genes and gene_column is None:
        points = get_spatial_element(sdata.points, key=sdata.attrs.get(SopaAttrs.TRANSCRIPTS))
        gene_column = get_feature_key(points, raise_error=True)

    aggregator = Aggregator(sdata, image_key=image_key, shapes_key=shapes_key)
    aggregator.sdata.tables[f"{SopaKeys.OLD_TABLE_PREFFIX}{table_key}"] = old_table
    del aggregator.sdata.tables[table_key]

    old_shapes_key = old_table.uns["spatialdata_attrs"]["region"]
    instance_key = old_table.uns["spatialdata_attrs"]["instance_key"]

    if isinstance(old_shapes_key, list):
        assert len(old_shapes_key) == 1, "Can't overlap segmentation on multi-region SpatialData object"
        old_shapes_key = old_shapes_key[0]

    old_geo_df = aggregator.sdata[old_shapes_key]
    geo_df = to_intrinsic(aggregator.sdata, aggregator.geo_df, old_geo_df)

    geo_df.index.name = None
    gdf_join = gpd.sjoin(old_geo_df, geo_df)
    gdf_join["geometry_right"] = gdf_join["index_right"].map(lambda i: geo_df.geometry.iloc[i])
    gdf_join["overlap_ratio"] = gdf_join.apply(_overlap_area_ratio, axis=1)
    gdf_join: gpd.GeoDataFrame = gdf_join[gdf_join.overlap_ratio >= area_ratio_threshold]

    table_crop = old_table[~np.isin(old_table.obs[instance_key], gdf_join.index)].copy()
    table_crop.obs[SopaKeys.CELL_OVERLAY_KEY] = False

    aggregator.compute_table(
        aggregate_channels=aggregate_channels,
        aggregate_genes=aggregate_genes,
        gene_column=gene_column,
        key_added=table_key,
    )
    aggregator.table.obs[SopaKeys.CELL_OVERLAY_KEY] = True

    aggregator.table = anndata.concat(
        [table_crop, aggregator.table],
        uns_merge="first",
        join="outer",
    )

    aggregator.shapes_key = f"{old_shapes_key}_overlay_{aggregator.shapes_key}"

    geo_df_cropped = old_geo_df.loc[~old_geo_df.index.isin(gdf_join.index)]
    aggregator.geo_df = pd.concat([geo_df_cropped, geo_df], join="outer", axis=0)
    aggregator.geo_df.attrs = old_geo_df.attrs

    aggregator.add_standardized_table(table_key)