spatula subset-sge

Summary

spatula subset-sge creates a subset of a spatial gene expression (SGE) matrix, restricted to a spatial boundary given either by coordinate arguments or by an input geojson file.

Here is a summary of the inputs and outputs:

  • Input: Takes a spatial gene expression (SGE) matrix created by dge2sge, and a spatial boundary specified either by coordinate arguments or by an input geojson file.
  • Output: Produces a subset of the SGE matrix based on the specified spatial boundary.

A typical use case is as follows:

spatula subset-sge --sge  /path/to/input/sge/dir/ \
                   --out  /path/to/output/sge/dir/ \
                   --json /path/to/geojson/file.geojson

See below for a more detailed usage description.

Required options

  • --sge : The path to the input SGE directory that contains the barcodes.tsv.gz, features.tsv.gz, and matrix.mtx.gz files generated by dge2sge.
  • --out : The path to the output SGE directory that will contain a subset of the original SGE files. See Expected Output for more details.
  • In addition, either --json or --xmin/--xmax/--ymin/--ymax are expected. (See Additional Options for more details)

Additional Options

  • --json : The path to the input geojson file that contains the boundary of the spatial coordinates. The boundary is defined by the geometry field in the geojson file and is expected to be a set of polygons. Coordinates in the geojson file are in microns (um), so the --px-per-um parameter must be set to match the scale of the SGE matrix. Apart from this unit difference, the polygons are expected to be in the same coordinate system as the spatial coordinates in the SGE matrix.
  • --json-x-offset : The x-coordinate offset to be added to the boundary defined in the geojson file. This is useful when the boundary is defined in a shifted coordinate system from the spatial coordinates in the SGE matrix.
  • --json-y-offset : The y-coordinate offset to be added to the boundary defined in the geojson file. This is useful when the boundary is defined in a shifted coordinate system from the spatial coordinates in the SGE matrix.
  • --px-per-um : Number of pixel units in the SGE matrix per micron (um). This is used to convert the boundary defined in the geojson file to the coordinate system of the SGE matrix.
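As an illustration of the kind of input --json expects, the following minimal sketch writes a geojson file containing a single rectangular polygon in micron units. It assumes that a standard GeoJSON FeatureCollection of Polygon features is accepted; the documentation above only states that the boundary comes from the geometry field and is a set of polygons, so treat the exact layout as an assumption. The function names rectangle_geojson and write_geojson are hypothetical helpers, not part of spatula.

```python
import json


def rectangle_geojson(xmin_um, ymin_um, xmax_um, ymax_um):
    """Build a GeoJSON FeatureCollection holding one rectangle (micron units).

    Assumption: spatula reads Polygon geometries from a FeatureCollection.
    """
    ring = [
        [xmin_um, ymin_um],
        [xmax_um, ymin_um],
        [xmax_um, ymax_um],
        [xmin_um, ymax_um],
        [xmin_um, ymin_um],  # GeoJSON rings repeat the first vertex to close
    ]
    feature = {
        "type": "Feature",
        "properties": {},
        "geometry": {"type": "Polygon", "coordinates": [ring]},
    }
    return {"type": "FeatureCollection", "features": [feature]}


def write_geojson(path, collection):
    """Write the collection to `path` as JSON text."""
    with open(path, "w") as handle:
        json.dump(collection, handle)
```

The resulting file could then be passed with --json, together with a --px-per-um value appropriate for the platform that produced the SGE matrix.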

Other Options

  • --xmin : The minimum value of the x coordinate to crop the SGE matrix.
  • --xmax : The maximum value of the x coordinate to crop the SGE matrix.
  • --ymin : The minimum value of the y coordinate to crop the SGE matrix.
  • --ymax : The maximum value of the y coordinate to crop the SGE matrix.
  • --out-minmax-fixed : Do not update the range of x/y coordinates of the subset SGE matrix. By default, the barcodes.minmax.tsv file will be generated based on the coordinates of the cropped SGE matrix. With this option turned on, the barcodes.minmax.tsv will be identical to the original SGE matrix.
  • --whitelist : A file listing the barcodes to keep. This is useful when external software is used to precisely select the barcodes for the subset. The whitelist file is expected to be a tsv file with the barcode sequences in the first column.
  • --bcd : The name of the barcode file in the SGE matrix. By default, it is barcodes.tsv.gz.
  • --ftr : The name of the feature file in the SGE matrix. By default, it is features.tsv.gz.
  • --mtx : The name of the matrix file in the SGE matrix. By default, it is matrix.mtx.gz.
  • --minmax : The name of tsv file specifying the rectangular boundary of the SGE matrix. By default, it is barcodes.minmax.tsv.
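For the --whitelist option, a whitelist can be produced by any external tool. The sketch below is one hypothetical way to do so: it selects barcodes that fall inside a circle and writes them one per line. It assumes the input barcodes.tsv.gz is tab-separated and follows the column layout described under Expected Output (barcode in column 1, x in column 6, y in column 7), and that a plain-text, uncompressed whitelist is acceptable. The helper names are illustrative only.

```python
import gzip


def barcodes_in_circle(rows, cx, cy, radius):
    """Yield barcode sequences whose (x, y) lies within `radius` of (cx, cy).

    `rows` is any iterable of tab-separated lines in the assumed
    barcodes.tsv.gz layout (barcode, ..., x in column 6, y in column 7).
    """
    radius_sq = radius * radius
    for row in rows:
        fields = row.rstrip("\n").split("\t")
        barcode, x, y = fields[0], float(fields[5]), float(fields[6])
        if (x - cx) ** 2 + (y - cy) ** 2 <= radius_sq:
            yield barcode


def write_whitelist(barcode_path, whitelist_path, cx, cy, radius):
    """Read a gzipped barcode file and write the selected barcodes."""
    with gzip.open(barcode_path, "rt") as src, open(whitelist_path, "w") as dst:
        for barcode in barcodes_in_circle(src, cx, cy, radius):
            dst.write(barcode + "\n")
```

The circle centre and radius here are in the coordinate units of the SGE matrix, not microns.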

Expected Output

In the output directory [outdir], the following files will be created.

  • [outdir]/barcodes.tsv.gz contains the list of barcodes in the subsetted SGE matrix. Each line contains the following information:
    1. Barcode sequence
    2. Increasing index of the barcode (1-based), which is not necessarily contiguous.
    3. Sequential and contiguous index of the barcode (1-based). This should match the integer IDs of barcodes in the matrix.mtx.gz file.
    4. Lane of spatial coordinate
    5. Tile of spatial coordinate
    6. X-coordinate of spatial coordinate
    7. Y-coordinate of spatial coordinate
    8. Comma-separated counts of observations in each matrix.mtx.gz file. The order of the counts should match the order of --mtx options.
  • [outdir]/features.tsv.gz contains the list of genes in the subsetted SGE matrix. Each line contains the following information:
    1. Gene ID (unique identifier)
    2. Gene name
    3. Sequential and contiguous index of the gene (1-based). This should match the integer IDs of genes in the matrix.mtx.gz file.
    4. Comma-separated counts of observations in each matrix.mtx.gz file. The order of the counts should match the order of --mtx options.
  • [outdir]/matrix.mtx.gz contains the subsetted spatial expression matrix in SGE format. After three header lines following the Matrix Market exchange format, each line of the matrix file contains the following information, separated by spaces.
    1. barcode index (1-based)
    2. gene index (1-based)
    3. (multiple space-separated entries) counts of observations, in the order of input files in --mtx options.
  • [outdir]/barcodes.minmax.tsv contains four columns (xmin, xmax, ymin, and ymax) giving the range of the spatial coordinates of the subsetted SGE matrix.
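To make the barcode file layout concrete, here is a minimal sketch that parses lines of the output barcodes.tsv.gz into named fields and computes the observed coordinate range, which is what barcodes.minmax.tsv should reflect when --out-minmax-fixed is off. It assumes tab-separated columns in the order listed above and numeric coordinates; BarcodeRecord, parse_barcode_line, and coordinate_range are hypothetical names introduced for illustration.

```python
from typing import Iterable, List, NamedTuple, Tuple


class BarcodeRecord(NamedTuple):
    barcode: str
    original_index: int    # increasing, not necessarily contiguous
    contiguous_index: int  # matches barcode IDs in matrix.mtx.gz
    lane: str
    tile: str
    x: float
    y: float
    counts: List[int]      # one count per matrix column


def parse_barcode_line(line: str) -> BarcodeRecord:
    """Split one tab-separated line into the eight documented fields."""
    f = line.rstrip("\n").split("\t")
    return BarcodeRecord(
        barcode=f[0],
        original_index=int(f[1]),
        contiguous_index=int(f[2]),
        lane=f[3],
        tile=f[4],
        x=float(f[5]),
        y=float(f[6]),
        counts=[int(c) for c in f[7].split(",")],
    )


def coordinate_range(
    records: Iterable[BarcodeRecord],
) -> Tuple[float, float, float, float]:
    """Return (xmin, xmax, ymin, ymax) over the given records."""
    records = list(records)
    xs = [r.x for r in records]
    ys = [r.y for r in records]
    return min(xs), max(xs), min(ys), max(ys)
```

In practice the lines would come from gzip.open("[outdir]/barcodes.tsv.gz", "rt"), and the result can be compared against the values in barcodes.minmax.tsv as a sanity check.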

Full Usage

The full usage of spatula subset-sge can be viewed with the --help option:

$ ./spatula subset-sge --help
[./spatula subset-sge] -- Subset Spatial SGE based on bounding box

 Copyright (c) 2022-2024 by Hyun Min Kang
 Licensed under the Apache License v2.0 http://www.apache.org/licenses/

Detailed instructions of parameters are available. Ones with "[]" are in effect:

Available Options:

== Input options ==
   --sge              [STR: ]             : Spatial gene expression directory
   --bcd              [STR: barcodes.tsv.gz] : Barcode file path (e.g. barcodes.tsv.gz)
   --ftr              [STR: features.tsv.gz] : Feature file path (e.g. feature.tsv.gz)
   --mtx              [STR: matrix.mtx.gz] : Matrix file path (e.g. matrix.mtx.gz)
   --minmax           [STR: barcodes.minmax.tsv] : Boundary file path (e.g. barcodes.minmax.tsv)

== Filter options ==
   --xmin             [INT: 0]            : Minimum x coordinate
   --xmax             [INT: 2147483647]   : Maximum x coordinate
   --ymin             [INT: 0]            : Minimum y coordinate
   --ymax             [INT: 2147483647]   : Maximum y coordinate
   --out-minmax-fixed [FLG: OFF]          : Do not update output minmax coordinates based on the observed points
   --json             [STR: ]             : Geojson file containing multiple polygons
   --json-x-offset    [FLT: 0.00]         : X-offset to add to the geojson boundary
   --json-y-offset    [FLT: 0.00]         : Y-offset to add to the geojson boundary
   --whitelist        [STR: ]             : Barcode whitelist file path
   --px-per-um        [FLT: 1000.00]      : Pixels/um scale (default: 1000, 26.67 for HiSeq2500, 28.75 for NovaSeq 6000)

== Output Options ==
   --out              [STR: ]             : Output directory


NOTES:
When --help was included in the argument. The program prints the help message but do not actually run