Rule align
¶
Purpose¶
The purpose of the align
rule is to map the reads from 2nd-seq FASTQ files to the reference genome, focusing on a per-chip approach. For a chip associated with multiple pairs of 2nd-seq FASTQ files, NovaScope executes the align
rule once utilizing all file pairs.
Specifically, the process involves: 1) combining all FASTQ files from 2nd-seq that are related to this chip, 2) discarding any 2nd-seq reads from that do not possess a spatial barcode sequence (HDMI) identified in 1st-seq, 3) mapping 2nd-seq reads to the reference genome utilizing STARsolo.
Input Files¶
-
2nd-seq FASTQ file All pairs of 2nd-seq FASTQ files that are associated to the given chip are designed as input.
-
Matched Spatial Barcode Files Matched Spatial barcode files for all pairs of 2nd-seq FASTQ files, which are produced by Rule
smatch
.
Output Files¶
The rule generates the following output in the specified directory path:
1 |
|
(1) A Binary Alignment Map (BAM) File¶
Description: A Binary Alignment Map (BAM) file contains the aligned reads, sorted by genomic coordinates. The BAM file is accompanied by a BAM index (BAI) file.
File Naming Convention:
- The BAM file:
sttoolsAligned.sortedByCoord.out.bam
- The BAI file:
sttoolsAligned.sortedByCoord.out.bam.bai
File Format: For detailed information on the file formats for BAM and BAI files, please refer to the Format Specification provided by Samtools.
(2) Alignment summary metrics¶
Description: A table file containing metrics such as the total number of input reads, average length of input reads, and summary statistics for unique, multi-mapping, unmapped, and chimeric reads.
File Naming Convention:
sttoolsLog.final.out
File Format:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 |
|
(3) Digital Gene Expression Matrices (DGEs)¶
Description: A digital gene expression matrix (DGE) is generated for each genomic feature, including Gene, GeneFull, splice junctions (SJ), and Velocyto. The DGE for Gene counts reads match the gene transcript while the DGE for GeneFull counts all reads overlapping the exons and introns of the gene.
File Naming Convention: For each genomic feature, a DGE, which is composed of barcodes.tsv.gz
, features.tsv.gz
, and matrix.mtx.gz
, is stored in a directory named after the genomic feature.
File Format:
-
barcodes.tsv.gz
: A single-column file with Unix line endings and no header, where each row lists a barcode.1 2 3
AAAAAAAATAGTTCTGCTAGCTGGTAAGCT AAAAAAAGTGATCAGAGGTGATATTATGCT AAAAAAAGTTCGCACTATACGAACAGGGAT
-
features.tsv.gz
: Each row includes the following three columns without header: feature ID (column 1), feature name (column 2), and type of feature (column 3).1 2 3
ENSMUSG00000100764 Gm29155 Gene Expression ENSMUSG00000100635 Gm29157 Gene Expression ENSMUSG00000100480 Gm29156 Gene Expression
-
matrix.tsv.gz
: A compressed sparse matrix file format storing non-zero gene expression values across spatial locations or barcodes in spatial transcriptomics data.1 2 3 4 5 6
%%MatrixMarket matrix coordinate integer general % 33989 17641021 17801209 9677 1 1 20305 2 1 23800 2 1
Header
: Initial lines form the header, declaring the matrix's adherence to the Market Matrix (MTX) format, outlining its traits. This section may include comments (lines beginning with%
) for extra metadata, all marked by a “%”.Dimensions
: Following the header, the first line details the matrix dimensions: the count of rows (features), columns (barcodes), and non-zero entries.Data Entries
: Post-dimensions, subsequent lines enumerate non-zero entries in triplet form: row index (feature index), column index (barcode index), and value (expression level).
Output Guidelines¶
It is suggested to review the summary metrics to confirm the total read count, the percentage of reads aligned to genomes and genes, library saturation, the count of aligned spatial barcodes, and the count of unique transcripts.
Parameters¶
The following parameter in the job configuration file will be applied in this rule.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 |
|
-
Reformat FASTQ Paramaters
Parameters in
smatch
and three parameters inalign
(includinglen_sbcd
,len_umi
, andlen_r2
) are used to pass values to thereformat-fastqs
function in spatula. Below, for each parameter, the corresponding parameter in spatula, description, and the default value in NovaScope are provided.Parameter spatula
parameterDescription Default Value skip_sbcd
--skip-sbcd
The number of initial bases to omit from the read. 1 match_len
--match-len
The length of the spatial barcode to be considered as a perfectmatch. 27 len_sbcd
--len_sbcd
The length of the spatial barcode sequence to be copied in Read 1 30 len_umi
--len_umi
The length of the UMI sequence (randomer) to be copied from Read 2 (beginning) to Read 1 (after spatial barcode) 9 len_r2
--len_r2
The length of Read 2 sequences to be trimmed 101 skip_sbcd
: This is useful if the 1st-seq spatial barcode lacks sufficient bases. When absent in the job configuration file, NovaScope determinesskip_sbcd
following theformat
infastq2sbcd
: 1 for DraI31 and 0 for DraI32.
-
Alignment Paramaters
Four parameters in
align
(includinglen_sbcd
,len_umi
,min_match_len
, andmin_match_frac
) are used to pass values to STARsolo. Below, for each parameter, the corresponding parameter in STARsolo, description, and the default value in NovaScope are provided.Parameter STARsolo
parameterDescription Default Value len_sbcd
--soloCBlen
The cell barcode length 30 --soloUMIstart
Defined as len_sbcd + 1
, this indicates UMI sequence (randomer) start base.31 len_umi
--soloUMIlen
The length of UMI sequence (randomer) start base. 9 min_match_len
--outFilterMatchNmin
An alignment is only output if the count of matched bases >= this value. 30 min_match_frac
--outFilterMatchNminOverLread
Similar to min_match_len
, normalized to the read length0.66 -
The
exist_action
ParameterThe
exist_action
parameter withinalign
provides two choices for handling existing intermediate or output files: "skip
" tells NovaScope to bypass these files, whereas "overwrite
" instructs NovaScope to replace them. -
The
resource
ParameterThe
resource
parameters, specific to HPC users, determine the partitions, CPU count, and memory allocation for the alignment process. Details for theresource
parameters inalign
are provided in theupstream
parameters in Job Configuration.assign_type
: two available options for how NovaScope allocates resources for alignment. The options include"stdin"
(recommended) and"filesize"
. Details for each option are provided in the blocks below.-
Option
Advantages:stdin
- Directly allocates resources as specified in the
stdin
field, bypassing calculations for precision in resource management. - Enables customization of resources for different datasets in the job configuration file, allowing for optimization of costs based on file size. Disadvantages:
- Requires users to specify resources for each job unless default settings (partition name, threads, memory) fit the computing environment. An example is provided in the template.
- Directly allocates resources as specified in the
-
Option
filesize
Advantages:
- Automatically allocates resources based on the total size of input 2nd-seq FASTQ files and specified computing resources in the environment configuration file.
- Once computing resources are specified in the environment file, they automatically apply to all jobs, simplifying the setup. Disadvantages:
- Requires computing time to calculate the total size of input files, potentially delaying the start of data processing.
The resource allocation strategy is as follows:
Total File Size (GB) Memory Allocated for Alignment (GB) Under 200 70 200 to 400 140 Over 400 330
Dependencies¶
Rule align
requires the matched spatial barcode files from Rule smatch
generates. Hence, if the input files are not available, align
relies on the successful completion of smatch
for proper operation. See an overview of the rule dependencies in the Workflow Structure.
Code Snippet¶
The code for this rule is provided in a04_align.smk