CLI usage
Here, we provide a minimal example of command line usage. For more details and to learn about other optional arguments, refer to the full CLI documentation.
Tip
The Snakemake pipeline is recommended to get started with Sopa. Using the CLI is advised if you want more flexibility, but you'll need to parallelize the segmentation yourself, as detailed below.
Save the SpatialData
object
For this tutorial, we use a generated dataset. You can expect a total runtime of a few minutes.
The command below will generate and save it on disk (you can change the path tuto.zarr
to save it somewhere else). If you want to load your own data: choose the right panel below. For more information, refer to this FAQ describing which data input you need, or see the sopa read
documentation.
Info
It has created a .zarr
directory which stores a SpatialData
object corresponding to your data sample. You can choose the location of the .zarr
directory using the --sdata-path
command line argument.
(Optional) ROI selection
Sometimes, your slide may contain a region with low-quality data, and we want to run the analysis only on the good-quality region. For this, we can interactively select a region of interest (ROI), and Sopa will only run on the selected ROI.
Run the following command line and follow the instructions displayed in the console:
When interactive mode is unavailable, the ROI selection will be performed in three steps.
-
On the machine where the data is stored, save a light view of the original image (here, it will create a file called
image.zarr.zip
): -
Download the
image.zip
file locally (or on a machine with interactive mode), and select the ROI. Here, it will create a file calledroi.zarr.zip
: -
Upload the
roi.zarr.zip
file, and save it inside theSpatialData
object:
Run segmentation
Option 1: Cellpose
First, generate the bounding boxes of the patches on which Cellpose will be run. Here, the patches have a width and height of 1500 pixels and an overlap of 50 pixels. We advise bigger sizes for real datasets (see our default parameters in one of our config files). On the toy dataset, this will generate 4 patches.
Now, we can run Cellpose on each individual patch. Execute the following command line on all patch-index
(i.e., 0
, 1
, 2
, and 3
) to run Cellpose using DAPI only (you can add an additional channel, for instance, --channels DAPI --channels PolyT
):
Tip
Manually running the commands below can involve using many consecutive commands, so we recommend automatizing it. For instance, this can be done using Snakemake or Nextflow. This will help you parallelize it since you can run each task on separate jobs or using multithreading. You can also see how we do it in our Snakefile. If you prefer using the already existing pipeline instead of the CLI, you can read our Snakemake pipeline tutorial.
To automatically get the number of patches, you can either open the tuto.zarr/.sopa_cache/patches_file_image
file, or compute len(sdata['sopa_patches'])
in Python.
Note
In the above commands, the --diameter
and --min-area
parameters are specific to the data type we work on. For your own data, consider using the default parameters from one of our config files. Here, min-area
is in pixels^2.
At this stage, you executed 4 times Cellpose (once per patch). Now, we need to resolve the conflict, i.e. where boundaries are overlapping due to segmentation on multiple patches:
Option 2: Baysor
Baysor needs a config to be executed. You can find official config examples here.
Note
You can also reuse the Baysor parameter we have defined for each machine, as in our Snakemake config files. Note that, our Snakemake config is a .yaml
file, but the Baysor config should still be a .toml
file.
For this tutorial, we will use the config below. Save this in a config.toml
file.
[data]
force_2d = true
min_molecules_per_cell = 10
x = "x"
y = "y"
z = "z"
gene = "genes"
min_molecules_per_gene = 0
min_molecules_per_segment = 3
confidence_nn_id = 6
[segmentation]
scale = 30 # typical cell radius
scale_std = "25%" # cell radius standard deviation
prior_segmentation_confidence = 0
estimate_scale_from_centers = false
n_clusters = 4
iters = 500
n_cells_init = 0
nuclei_genes = ""
cyto_genes = ""
new_component_weight = 0.2
new_component_fraction = 0.3
Then, we generate the bounding boxes of the patches on which Baysor will be run. Here, the patches have a width and height of 1200 microns and an overlap of 50 microns. We advise bigger sizes for real datasets (see our default parameters in one of our config files). On the toy dataset, this will generate 4 patches.
# config.toml is the Baysor config file you generated above
sopa patchify baysor tuto.zarr --config-path config.toml --patch-width-microns 1200 --patch-overlap-microns 50
Now, we can run Baysor on each individual patch. Execute the following command lines to run Baysor on each patch (i.e., 0
, 1
, 2
, and 3
).
Tip
Manually running the commands below can involve using many consecutive commands, so we recommend automatizing it. For instance, this can be done using Snakemake or Nextflow. This will help you parallelize it since you can run each task on separate jobs or using multithreading. You can also see how we do it in the Sopa Snakemake pipeline.
To automatically get the number of patches, you can open the tuto.zarr/.sopa_cache/patches_file_baysor
file. This lists the names of the directories inside tuto.zarr/.sopa_cache/baysor
related to each patch. If you selected an ROI, the excluded patches are effectively not in the patches_file_baysor
file.
At this stage, you executed 4 times Baysor (once per patch). Now, we need to resolve the conflict, i.e. where boundaries are overlapping due to segmentation on multiple patches:
Aggregation
This mandatory step turns the data into an AnnData
object. We can count the transcript inside each cell, and/or average each channel intensity inside each cell boundary.
Info
The --gene-column
option below tells which column contains the gene names inside the transcript dataframe. If you don't know it, you can look to our configs to find the right gene-column
corresponding to your machine.
If using Baysor
Baysor already counts the transcripts inside each cell to create a cell-by-gene table. So you'll always have this table, and there is no need to use the --gene-column
argument. If you don't want to average the intensities, you will still need to run sopa aggregate tuto.zarr
before continuing.
Annotation
If desired, cell-type annotation can be run. Currently, we support Tangram for transcript-based annotation and a simple scoring approach for channel-based annotation (called channel z-score).
For now, our fluorescence-based annotation is very simple. We provide a dictionary where a channel is associated with a population. Then, each cell is associated with the cell type whose corresponding channel is the brightest (according to a certain Z-score). In this tutorial example, we can annotate Tumoral cells, T cells, and B cells:
sopa annotate fluorescence tuto.zarr --marker-cell-dict '{"CK": "Tumoral cell", "CD3": "T cell", "CD20": "B cell"}'
More complex annotation
If you have a large number of channels, it may be preferable to run clustering on your data, for instance, using `Leiden clustering. Then, you can annotate each cluster manually by plotting a heatmap of all channels expressions per cluster.
Tangram is a transcript-based annotation that uses an annotated single-cell reference. Let's suppose your reference AnnData
object is stored in a file called adata_reference.h5ad
(preferably, keep raw counts), and the cell type is in adata.obs["cell_type"]
. Then, you can annotate your spatial data as follows:
Pipeline report
You can optionally create an HTML report of the pipeline run (in the example below, we save it under report.html
). It contains some quality controls for your data.
Visualization (Xenium Explorer)
The Xenium Explorer is a software developed by 10X Genomics for visualizing spatial data, and it can be downloaded freely here. Sopa allows the conversion to the Xenium Explorer, whatever the type of spatial data you worked on. It will create some files under a new tuto.explorer
directory:
If you have downloaded the Xenium Explorer, you can now open the results in the explorer: open tuto.explorer/experiment.xenium
(if using a Unix operating system), or double-click on the latter file.
Time efficiency
Creating the image needed by the Xenium Explorer can be time-consuming. Therefore, we recommend performing one run for the image generation (below) and another to save the transcripts/boundaries/observations.
# this can be done directly after saving the raw data in a .zarr directory
sopa explorer write tuto.zarr --mode "+i" --no-save-h5ad
After running everything with Sopa, you can finally save all the other Xenium Explorer input (e.g. boundaries and cell categories):
# this should be done after aggregation and an eventual annotation
sopa explorer write tuto.zarr --mode "-i" --gene-column genes
Geometric and spatial statistics
All functions to compute geometric and spatial statistics are detailed in the sopa.spatial
API. You can also read this tutorial.
Further analyses
- If you are familiar with the
spatialdata
library, you can directly use thetuto.zarr
directory, corresponding to aSpatialData
object: - You can use Squidpy which operates on both the
SpatialData
object or theAnnData
object, or use other tools of thescverse
ecosystem such asScanpy
. - You can also use the file
tuto.explorer/adata.h5ad
if you prefer theAnnData
object instead of the fullSpatialData
object.