Aggregation

Recommendation

We recommend using the sopa.aggregate function below, which is a wrapper for all types of aggregation. Internally, it uses aggregate_channels, count_transcripts, and/or aggregate_bins, which are also documented below if needed.

`sopa.aggregate(sdata, aggregate_genes=None, aggregate_channels=True, image_key=None, points_key=None, gene_column=None, shapes_key=None, bins_key=None, expand_radius_ratio=None, min_transcripts=0, min_intensity_ratio=0.1, no_overlap=False, key_added='table')`

Aggregate gene counts and/or channel intensities over a SpatialData object to create an AnnData table (saved in sdata["table"]).

Info

The main arguments are sdata, aggregate_genes, and aggregate_channels. The rest of the arguments are optional and will be inferred from the data if not provided.

If channels are aggregated and not genes, then sdata['table'].X will contain the mean channel intensities per cell.
If genes are aggregated and not channels, then sdata['table'].X will contain the gene counts per cell.
If both genes and channels are aggregated, then sdata['table'].X will contain the gene counts per cell and sdata['table'].obsm['intensities'] will contain the mean channel intensities per cell.

Parameters:

Name	Type	Description	Default
`sdata`	`SpatialData`	A `SpatialData` object	required
`aggregate_genes`	`bool \| None`	Whether to aggregate gene counts. If None, it will be inferred from the data.	`None`
`aggregate_channels`	`bool`	Whether to aggregate channel intensities inside cells.	`True`
`image_key`	`str \| None`	Key of `sdata` with the image channels to be averaged. By default, uses the segmentation image.	`None`
`points_key`	`str \| None`	Key of `sdata` with the points dataframe representing the transcripts. Inferred by default.	`None`
`gene_column`	`str \| None`	Key of `sdata[points_key]` with the gene names. Inferred by default.	`None`
`shapes_key`	`str \| None`	Key of `sdata` with the shapes corresponding to the cells boundaries. Inferred by default.	`None`
`bins_key`	`str \| None`	Key of `sdata` with the table corresponding to the bin-by-gene table of gene counts (e.g., for Visium HD data). Inferred by default.	`None`
`expand_radius_ratio`	`float \| None`	Ratio to expand the cells polygons for channels averaging. For instance, a ratio of 0.5 expands the shape radius by 50%. If `None` (default), use 1 if we aggregate bins data, and 0 otherwise.	`None`
`min_transcripts`	`int`	Min number of transcripts to keep a cell.	`0`
`min_intensity_ratio`	`float`	Min ratio of the 90th quantile of the mean channel intensity to keep a cell.	`0.1`
`no_overlap`	`bool`	Experimental feature: If `True`, the (expanded) cells will not overlap for channels and bins aggregation.	`False`
`key_added`	`str \| None`	Key to save the table in `sdata.tables`. If `None`, it will be `f"{shapes_key}_table"`.	`'table'`

Source code in sopa/aggregation/aggregation.py

def aggregate(
    sdata: SpatialData,
    aggregate_genes: bool | None = None,
    aggregate_channels: bool = True,
    image_key: str | None = None,
    points_key: str | None = None,
    gene_column: str | None = None,
    shapes_key: str | None = None,
    bins_key: str | None = None,
    expand_radius_ratio: float | None = None,
    min_transcripts: int = 0,
    min_intensity_ratio: float = 0.1,
    no_overlap: bool = False,
    key_added: str | None = "table",
):
    """Aggregate gene counts and/or channel intensities over a `SpatialData` object to create an `AnnData` table (saved in `sdata["table"]`).

    !!! info
        The main arguments are `sdata`, `aggregate_genes`, and `aggregate_channels`. The rest of the arguments are optional and will be inferred from the data if not provided.

        - If channels are aggregated and not genes, then `sdata['table'].X` will contain the mean channel intensities per cell.
        - If genes are aggregated and not channels, then `sdata['table'].X` will contain the gene counts per cell.
        - If both genes and channels are aggregated, then `sdata['table'].X` will contain the gene counts per cell and `sdata['table'].obsm['intensities']` will contain the mean channel intensities per cell.

    Args:
        sdata: A `SpatialData` object
        aggregate_genes: Whether to aggregate gene counts. If None, it will be inferred from the data.
        aggregate_channels: Whether to aggregate channel intensities inside cells.
        image_key: Key of `sdata` with the image channels to be averaged. By default, uses the segmentation image.
        points_key: Key of `sdata` with the points dataframe representing the transcripts. Inferred by default.
        gene_column: Key of `sdata[points_key]` with the gene names. Inferred by default.
        shapes_key: Key of `sdata` with the shapes corresponding to the cells boundaries. Inferred by default.
        bins_key: Key of `sdata` with the table corresponding to the bin-by-gene table of gene counts (e.g., for Visium HD data). Inferred by default.
        expand_radius_ratio: Ratio to expand the cells polygons for channels averaging. For instance, a ratio of 0.5 expands the shape radius by 50%. If `None` (default), use 1 if we aggregate bins data, and 0 otherwise.
        min_transcripts: Min number of transcripts to keep a cell.
        min_intensity_ratio: Min ratio of the 90th quantile of the mean channel intensity to keep a cell.
        no_overlap: *Experimental feature*: If `True`, the (expanded) cells will not overlap for channels and bins aggregation.
        key_added: Key to save the table in `sdata.tables`. If `None`, it will be `f"{shapes_key}_table"`.
    """
    assert points_key is None or bins_key is None, "Provide either `points_key` or `bins_key`, not both."

    if points_key is None:
        bins_key = bins_key or sdata.attrs.get(SopaAttrs.BINS_TABLE)

    if (bins_key is None) and (aggregate_genes or (aggregate_genes is None and sdata.points)):
        assert sdata.points, (
            "No points in the SpatialData object. You must have points, or set the `bins_key` argument (for VisiumHD-like data)."
        )

        points_key, _ = get_spatial_element(
            sdata.points, key=points_key or sdata.attrs.get(SopaAttrs.TRANSCRIPTS), return_key=True
        )

    aggr = Aggregator(sdata, image_key=image_key, shapes_key=shapes_key, bins_key=bins_key, points_key=points_key)

    if key_added is None:
        key_added = f"{aggr.shapes_key}_{SopaKeys.TABLE}"
        log.info(f"key_added is None, saving the table as '{key_added}' by default.")

    aggr.compute_table(
        aggregate_genes=aggregate_genes,
        aggregate_channels=aggregate_channels,
        expand_radius_ratio=expand_radius_ratio,
        min_transcripts=min_transcripts,
        min_intensity_ratio=min_intensity_ratio,
        gene_column=gene_column,
        no_overlap=no_overlap,
        key_added=key_added,
    )

`sopa.aggregation.aggregate_channels(sdata, image_key=None, shapes_key=None, expand_radius_ratio=0, mode='average', no_overlap=False)`

Aggregate the channel intensities per cell (either "average", or take the "min" / "max").

Parameters:

Name	Type	Description	Default
`sdata`	`SpatialData`	A `SpatialData` object	required
`image_key`	`str \| None`	Key of `sdata` containing the image. If only one `images` element, this does not have to be provided.	`None`
`shapes_key`	`str \| None`	Key of `sdata` containing the cell boundaries. If only one `shapes` element, this does not have to be provided.	`None`
`expand_radius_ratio`	`float`	Cells polygons will be expanded by `expand_radius_ratio * mean_radius`. This help better aggregate boundary stainings.	`0`
`mode`	`str`	Aggregation mode. One of `"average"`, `"min"`, `"max"`. By default, average the intensity inside the cell mask.	`'average'`
`no_overlap`	`bool`	If `True`, the (expanded) cells will not overlap.	`False`

Returns:

Type	Description
`ndarray`	A numpy `ndarray` of shape `(n_cells, n_channels)`

Source code in sopa/aggregation/channels.py

def aggregate_channels(
    sdata: SpatialData,
    image_key: str | None = None,
    shapes_key: str | None = None,
    expand_radius_ratio: float = 0,
    mode: str = "average",
    no_overlap: bool = False,
) -> np.ndarray:
    """Aggregate the channel intensities per cell (either `"average"`, or take the `"min"` / `"max"`).

    Args:
        sdata: A `SpatialData` object
        image_key: Key of `sdata` containing the image. If only one `images` element, this does not have to be provided.
        shapes_key: Key of `sdata` containing the cell boundaries. If only one `shapes` element, this does not have to be provided.
        expand_radius_ratio: Cells polygons will be expanded by `expand_radius_ratio * mean_radius`. This help better aggregate boundary stainings.
        mode: Aggregation mode. One of `"average"`, `"min"`, `"max"`. By default, average the intensity inside the cell mask.
        no_overlap: If `True`, the (expanded) cells will not overlap.

    Returns:
        A numpy `ndarray` of shape `(n_cells, n_channels)`
    """
    assert mode in AVAILABLE_MODES, f"Invalid {mode=}. Available modes are {AVAILABLE_MODES}"

    image = get_spatial_image(sdata, image_key)

    geo_df = get_boundaries(sdata, key=shapes_key)
    geo_df = to_intrinsic(sdata, geo_df, image)
    geo_df = expand_radius(geo_df, expand_radius_ratio, no_overlap=no_overlap)

    return _aggregate_channels_aligned(image, geo_df, mode)

`sopa.aggregation.count_transcripts(sdata, gene_column=None, shapes_key=None, points_key=None, geo_df=None)`

Counts transcripts per cell.

Parameters:

Name	Type	Description	Default
`sdata`	`SpatialData`	A `SpatialData` object	required
`gene_column`	`str \| None`	Column of the transcript dataframe containing the gene names	`None`
`shapes_key`	`str \| None`	Key of `sdata` containing the cell boundaries. If only one `shapes` element, this does not have to be provided.	`None`
`points_key`	`str \| None`	Key of `sdata` containing the transcripts. If only one `points` element, this does not have to be provided.	`None`
`geo_df`	`GeoDataFrame \| None`	If the cell boundaries are not yet in `sdata`, a `GeoDataFrame` can be directly provided for cell boundaries	`None`

Returns:

Type	Description
`AnnData`	An `AnnData` object of shape `(n_cells, n_genes)` with the counts per cell

Source code in sopa/aggregation/transcripts.py

def count_transcripts(
    sdata: SpatialData,
    gene_column: str | None = None,
    shapes_key: str | None = None,
    points_key: str | None = None,
    geo_df: gpd.GeoDataFrame | None = None,
) -> AnnData:
    """Counts transcripts per cell.

    Args:
        sdata: A `SpatialData` object
        gene_column: Column of the transcript dataframe containing the gene names
        shapes_key: Key of `sdata` containing the cell boundaries. If only one `shapes` element, this does not have to be provided.
        points_key: Key of `sdata` containing the transcripts. If only one `points` element, this does not have to be provided.
        geo_df: If the cell boundaries are not yet in `sdata`, a `GeoDataFrame` can be directly provided for cell boundaries

    Returns:
        An `AnnData` object of shape `(n_cells, n_genes)` with the counts per cell
    """
    points_key, points = get_spatial_element(
        sdata.points, key=points_key or sdata.attrs.get(SopaAttrs.TRANSCRIPTS), return_key=True
    )

    if geo_df is None:
        geo_df = get_boundaries(sdata, key=shapes_key)
        geo_df = to_intrinsic(sdata, geo_df, points_key)

    gene_column = gene_column or get_feature_key(points, raise_error=True)

    log.info(f"Aggregating transcripts over {len(geo_df)} cells")
    return _count_transcripts_aligned(geo_df, points, gene_column)

`sopa.aggregation.aggregate_bins(sdata, shapes_key, bins_key, expand_radius_ratio=0, no_overlap=False)`

Aggregate bins (for instance, from Visium HD data) into cells.

Parameters:

Name	Type	Description	Default
`sdata`	`SpatialData`	The `SpatialData` object	required
`shapes_key`	`str`	Key of the shapes containing the cell boundaries	required
`bins_key`	`str`	Key of the table containing the bin-by-gene counts	required
`expand_radius_ratio`	`float`	Cells polygons will be expanded by `expand_radius_ratio * mean_radius`. This help better aggregate bins from the cytoplasm.	`0`
`no_overlap`	`bool`	If `True`, bins belonging to multiples cells with be assigned to only one, based on transcript-profile proximity.	`False`

Returns:

Type	Description
`AnnData`	An `AnnData` object of shape with the cell-by-gene count matrix

Source code in sopa/aggregation/bins.py

def aggregate_bins(
    sdata: SpatialData,
    shapes_key: str,
    bins_key: str,
    expand_radius_ratio: float = 0,
    no_overlap: bool = False,
) -> AnnData:
    """Aggregate bins (for instance, from Visium HD data) into cells.

    Args:
        sdata: The `SpatialData` object
        shapes_key: Key of the shapes containing the cell boundaries
        bins_key: Key of the table containing the bin-by-gene counts
        expand_radius_ratio: Cells polygons will be expanded by `expand_radius_ratio * mean_radius`. This help better aggregate bins from the cytoplasm.
        no_overlap: If `True`, bins belonging to multiples cells with be assigned to only one, based on transcript-profile proximity.

    Returns:
        An `AnnData` object of shape with the cell-by-gene count matrix
    """
    bins_table: AnnData = sdata.tables[bins_key]

    bins_shapes_key = sdata.get_annotated_regions(bins_table)
    bins_shapes_key = bins_shapes_key[0] if isinstance(bins_shapes_key, list) else bins_shapes_key
    bins = sdata.shapes[bins_shapes_key].loc[sdata.get_instance_key_column(bins_table).values]
    bins = gpd.GeoDataFrame(geometry=bins.centroid.values)  # bins as points

    cells = to_intrinsic(sdata, shapes_key, bins_shapes_key).reset_index(drop=True)
    cells = expand_radius(cells, expand_radius_ratio, no_overlap=False)

    bin_within_cell = gpd.sjoin(bins, cells)

    indices_matrix = coo_matrix(
        (np.full(len(bin_within_cell), 1), (bin_within_cell["index_right"], bin_within_cell.index)),
        shape=(len(cells), len(bins)),
    )

    if no_overlap:
        log.warning("Unique bin mapping is currently experimental. Any feedback on GitHub is welcome.")
        indices_matrix = _get_unique_bins_mapping(indices_matrix, bins_table)

    adata = AnnData(indices_matrix @ bins_table.X, obs=cells[[]], var=bins_table.var)
    adata.obsm["spatial"] = np.stack([cells.centroid.x, cells.centroid.y], axis=1)
    return adata

`sopa.overlay_segmentation(sdata, shapes_key, gene_column=None, area_ratio_threshold=0.25, image_key=None, table_key=SopaKeys.TABLE)`

Overlay a segmentation on top of an existing segmentation

Parameters:

Name	Type	Description	Default
`sdata`	`SpatialData`	A `SpatialData` object	required
`shapes_key`	`str`	The key of the new shapes to be added	required
`gene_column`	`str \| None`	Key of the points dataframe containing the genes names	`None`
`area_ratio_threshold`	`float`	Threshold between 0 and 1. For each original cell overlapping with a new cell, we compute the overlap-area/cell-area, if above the threshold the cell is removed.	`0.25`
`image_key`	`str \| None`	Optional key of the original image	`None`
`table_key`	`str`	Key of the table to be overlayed	`TABLE`

Source code in sopa/aggregation/overlay.py

def overlay_segmentation(
    sdata: SpatialData,
    shapes_key: str,
    gene_column: str | None = None,
    area_ratio_threshold: float = 0.25,
    image_key: str | None = None,
    table_key: str = SopaKeys.TABLE,
):
    """Overlay a segmentation on top of an existing segmentation

    Args:
        sdata: A `SpatialData` object
        shapes_key: The key of the new shapes to be added
        gene_column: Key of the points dataframe containing the genes names
        area_ratio_threshold: Threshold between 0 and 1. For each original cell overlapping with a new cell, we compute the overlap-area/cell-area, if above the threshold the cell is removed.
        image_key: Optional key of the original image
        table_key: Key of the table to be overlayed
    """
    aggregate_genes, aggregate_channels = False, False

    assert table_key in sdata.tables, f"No table with name '{table_key}' found in the SpatialData object"

    old_table: AnnData = sdata.tables[table_key]

    assert SopaKeys.UNS_KEY in old_table.uns, "It seems the table was not aggregated using `sopa.aggregate`"

    sopa_attrs = old_table.uns[SopaKeys.UNS_KEY]

    aggregate_genes = sopa_attrs[SopaKeys.UNS_HAS_TRANSCRIPTS]
    aggregate_channels = sopa_attrs[SopaKeys.UNS_HAS_INTENSITIES]

    if aggregate_genes and gene_column is None:
        points = get_spatial_element(sdata.points, key=sdata.attrs.get(SopaAttrs.TRANSCRIPTS))
        gene_column = get_feature_key(points, raise_error=True)

    aggregator = Aggregator(sdata, image_key=image_key, shapes_key=shapes_key)
    aggregator.sdata.tables[f"{SopaKeys.OLD_TABLE_PREFFIX}{table_key}"] = old_table
    del aggregator.sdata.tables[table_key]

    old_shapes_key = old_table.uns["spatialdata_attrs"]["region"]
    instance_key = old_table.uns["spatialdata_attrs"]["instance_key"]

    if isinstance(old_shapes_key, list):
        assert len(old_shapes_key) == 1, "Can't overlap segmentation on multi-region SpatialData object"
        old_shapes_key = old_shapes_key[0]

    old_geo_df = aggregator.sdata[old_shapes_key]
    geo_df = to_intrinsic(aggregator.sdata, aggregator.geo_df, old_geo_df)

    geo_df.index.name = None
    gdf_join = gpd.sjoin(old_geo_df, geo_df)
    gdf_join["geometry_right"] = gdf_join["index_right"].map(lambda i: geo_df.geometry.iloc[i])
    gdf_join["overlap_ratio"] = gdf_join.apply(_overlap_area_ratio, axis=1)
    gdf_join: gpd.GeoDataFrame = gdf_join[gdf_join.overlap_ratio >= area_ratio_threshold]

    table_crop = old_table[~np.isin(old_table.obs[instance_key], gdf_join.index)].copy()
    table_crop.obs[SopaKeys.CELL_OVERLAY_KEY] = False

    aggregator.compute_table(
        aggregate_channels=aggregate_channels,
        aggregate_genes=aggregate_genes,
        gene_column=gene_column,
        key_added=table_key,
    )
    aggregator.table.obs[SopaKeys.CELL_OVERLAY_KEY] = True

    aggregator.table = anndata.concat(
        [table_crop, aggregator.table],
        uns_merge="first",
        join="outer",
    )

    aggregator.shapes_key = f"{old_shapes_key}_overlay_{aggregator.shapes_key}"

    geo_df_cropped = old_geo_df.loc[~old_geo_df.index.isin(gdf_join.index)]
    aggregator.geo_df = pd.concat([geo_df_cropped, geo_df], join="outer", axis=0)
    aggregator.geo_df.attrs = old_geo_df.attrs

    aggregator.add_standardized_table(table_key)