Process Visium HD raw data
Key points
1) Look at the json file, find 'microns_per_pixel', and later set mu_scale=1/microns_per_pixel for your analysis.
2) Combine information from multiple raw files and create a single input file sorted by one axis, either X or Y.
3) The coordinates in the input files are in "pixel" units, but keep track of the min and max coordinate values of the X and Y axes in micrometers; they will be needed to create the final pixel-level visualization.
4) Since Visium HD has a resolution of \(2\mu m\), set --plot_um_per_pixel 2 when visualizing the final pixel output (in ficture plot_pixel_full).
Alternative: Using the spatula convert-sge command
The spatula convert-sge tool offers a convenient way to convert Visium HD raw data to FICTURE input format. You may want to use it instead of following the manual steps below.
Details
Visium HD output contains a sparse count matrix and a separate parquet file defining pixels' spatial locations.
Locate Visium HD outputs
The barcode file (tissue_positions.parquet) looks like
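For orientation, the table contains one row per barcode with its array position and full-resolution pixel coordinates. A minimal sketch of peeking at it with pyarrow (a third-party package, assumed installed; the path is illustrative):

```python
# Columns expected in tissue_positions.parquet (Space Ranger / Visium HD):
EXPECTED_COLUMNS = [
    "barcode", "in_tissue", "array_row", "array_col",
    "pxl_row_in_fullres", "pxl_col_in_fullres",
]

def peek_positions(path, n=5):
    """Print the first n rows of the barcode-position table."""
    import pyarrow.parquet as pq  # third-party: pip install pyarrow
    table = pq.read_table(path, columns=EXPECTED_COLUMNS)
    print(table.slice(0, n).to_pydict())

# Example (hypothetical path):
# peek_positions("square_002um/spatial/tissue_positions.parquet")
```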
pxl_row_in_fullres and pxl_col_in_fullres are in units of "pixel"; we need to look at the scalefactors_json.json file (it should be in the same folder as tissue_positions.parquet) to get the pixel-to-micrometer ratio. In our example the json file looks like
Find microns_per_pixel and later set mu_scale=1/microns_per_pixel for your analysis.
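As a sketch, reading the scale factor and computing mu_scale can be done in a few lines (the function name is illustrative, not part of FICTURE):

```python
import json

def mu_scale_from_json(path):
    """Read scalefactors_json.json and return mu_scale = 1 / microns_per_pixel."""
    with open(path) as f:
        scale = json.load(f)
    return 1.0 / scale["microns_per_pixel"]

# Example (hypothetical path):
# mu_scale = mu_scale_from_json("square_002um/spatial/scalefactors_json.json")
```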
The matrix directory looks like
We need to annotate the sparse matrix matrix.mtx.gz with barcode locations from tissue_positions.parquet, by getting the barcode ID from barcodes.tsv.gz and then looking up its spatial coordinates. Unfortunately, the barcodes in tissue_positions.parquet and in barcodes.tsv.gz are stored in different orders, and the parquet file contains a single row group (why??) in the few public datasets we've inspected, making this process uglier. Although one could read all four files fully into memory and match them, the following is a slower alternative that avoids doing so.
The requirements for the merged file are:
1) It contains the columns X, Y, gene, and Count.
2) It is sorted along one axis. The output from the following commands is sorted along the Y axis, so later you would set major_axis=Y.
Given the current data formats, we first match barcodes' integer indices in the matrix with their spatial locations, then annotate the spatial locations and gene IDs to the sparse count matrix.
You may need to install a tool to read parquet files; one option is pip install parquet-tools. The following command takes ~8.5 min for the public Visium HD mouse brain dataset.
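The logic of this matching step can be sketched in plain Python (the function and data layout here are illustrative, not FICTURE's actual script): walk the barcodes in barcodes.tsv.gz order, look up each barcode's pixel coordinates, convert them to micrometers, and track the coordinate range that later goes into coordinate_minmax.tsv.

```python
def annotate_barcodes(barcodes, positions_px, microns_per_pixel):
    """Match each barcode (in barcodes.tsv.gz order) to micrometer coordinates.

    barcodes:     list of barcode strings, in matrix index order
    positions_px: dict barcode -> (x_pixel, y_pixel) from tissue_positions.parquet
    Returns (rows, minmax): rows of (1-based index, X_um, Y_um), plus the
    X/Y range needed for coordinate_minmax.tsv.
    """
    rows = []
    xmin = ymin = float("inf")
    xmax = ymax = float("-inf")
    for idx, bc in enumerate(barcodes, start=1):  # 1-based, as in .mtx files
        px_x, px_y = positions_px[bc]
        x_um, y_um = px_x * microns_per_pixel, px_y * microns_per_pixel
        xmin, xmax = min(xmin, x_um), max(xmax, x_um)
        ymin, ymax = min(ymin, y_um), max(ymax, y_um)
        rows.append((idx, x_um, y_um))
    minmax = {"xmin": xmin, "xmax": xmax, "ymin": ymin, "ymax": ymax}
    return rows, minmax
```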
The coordinate range in ${opath}/coordinate_minmax.tsv is in micrometers; for the mouse brain data it looks like
Merge coordinate and gene count information
(You might want to either delete or zip the intermediate files tissue_positions.raw.csv and barcodes.tsv.)
The sorting by coordinates can take some time for large data; you could check the intermediate results first to see if they make sense.
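The merge itself amounts to joining the sparse-matrix triplets with gene names and micrometer coordinates, then sorting by Y. A minimal sketch (assuming the usual 10x triplet layout of gene index, barcode index, count; names are illustrative):

```python
def merge_counts(mtx_triplets, features, coords_um):
    """Annotate sparse-matrix triplets with gene names and coordinates,
    sorted along the Y axis (so that major_axis=Y downstream).

    mtx_triplets: list of (gene_idx, barcode_idx, count), 1-based as in .mtx
    features:     list of gene names, in features.tsv.gz order
    coords_um:    dict barcode_idx -> (X, Y) in micrometers
    Returns rows of (X, Y, gene, Count).
    """
    merged = []
    for gene_idx, bc_idx, count in mtx_triplets:
        x, y = coords_um[bc_idx]
        merged.append((x, y, features[gene_idx - 1], count))
    merged.sort(key=lambda r: r[1])  # sort by Y
    return merged
```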
Output looks like
(One column holds the barcode's integer index in matrix.mtx.gz (the row number in barcodes.tsv.gz); it will be ignored in analysis.)