Skip to content

Rule align

Purpose

The purpose of the align rule is to map the reads from 2nd-seq FASTQ files to the reference genome, focusing on a per-chip approach. For a chip associated with multiple pairs of 2nd-seq FASTQ files, NovaScope executes the align rule once utilizing all file pairs.

Specifically, the process involves: 1) combining all FASTQ files from 2nd-seq that are related to this chip, 2) discarding any 2nd-seq reads from that do not possess a spatial barcode sequence (HDMI) identified in 1st-seq, 3) mapping 2nd-seq reads to the reference genome utilizing STARsolo.

Input Files

  • 2nd-seq FASTQ file All pairs of 2nd-seq FASTQ files that are associated to the given chip are designed as input.

  • Matched Spatial Barcode Files Matched Spatial barcode files for all pairs of 2nd-seq FASTQ files, which are produced by Rule smatch.

Output Files

The rule generates the following output in the specified directory path:

1
<output_directory>/align/<flowcell_id>/<chip_id>/<run_id>/bam

(1) A Binary Alignment Map (BAM) File

Description: A Binary Alignment Map (BAM) file contains the aligned reads, sorted by genomic coordinates. The BAM file is accompanied by a BAM index (BAI) file.

File Naming Convention:

  • The BAM file: sttoolsAligned.sortedByCoord.out.bam
  • The BAI file: sttoolsAligned.sortedByCoord.out.bam.bai

File Format: For detailed information on the file formats for BAM and BAI files, please refer to the Format Specification provided by Samtools.

(2) Alignment summary metrics

Description: A table file containing metrics such as the total number of input reads, average length of input reads, and summary statistics for unique, multi-mapping, unmapped, and chimeric reads.

File Naming Convention: sttoolsLog.final.out

File Format:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
                                 Started job on |       Apr 01 18:32:36
                             Started mapping on |       Apr 01 18:33:47
                                    Finished on |       Apr 01 18:51:55
       Mapping speed, Million of reads per hour |       275.83

                          Number of input reads |       83363295
                      Average input read length |       68
                                    UNIQUE READS:
                   Uniquely mapped reads number |       44728786
                        Uniquely mapped reads % |       53.66%
                          Average mapped length |       86.51
                       Number of splices: Total |       4221113
            Number of splices: Annotated (sjdb) |       4108506
                       Number of splices: GT/AG |       4174922
                       Number of splices: GC/AG |       13282
                       Number of splices: AT/AC |       581
               Number of splices: Non-canonical |       32328
                      Mismatch rate per base, % |       0.92%
                         Deletion rate per base |       0.03%
                        Deletion average length |       1.56
                        Insertion rate per base |       0.01%
                       Insertion average length |       1.20
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |       9853625
             % of reads mapped to multiple loci |       11.82%
        Number of reads mapped to too many loci |       1330676
             % of reads mapped to too many loci |       1.60%
                                  UNMAPPED READS:
  Number of reads unmapped: too many mismatches |       0
       % of reads unmapped: too many mismatches |       0.00%
            Number of reads unmapped: too short |       4764703
                 % of reads unmapped: too short |       5.72%
                Number of reads unmapped: other |       22685505
                     % of reads unmapped: other |       27.21%
                                  CHIMERIC READS:
                       Number of chimeric reads |       0
                            % of chimeric reads |       0.00%

(3) Digital Gene Expression Matrices (DGEs)

Description: A digital gene expression matrix (DGE) is generated for each genomic feature, including Gene, GeneFull, splice junctions (SJ), and Velocyto. The DGE for Gene counts reads match the gene transcript while the DGE for GeneFull counts all reads overlapping the exons and introns of the gene.

File Naming Convention: For each genomic feature, a DGE, which is composed of barcodes.tsv.gz, features.tsv.gz, and matrix.mtx.gz, is stored in a directory named after the genomic feature.

File Format:

  • barcodes.tsv.gz: A single-column file with Unix line endings and no header, where each row lists a barcode.

    1
    2
    3
    AAAAAAAATAGTTCTGCTAGCTGGTAAGCT
    AAAAAAAGTGATCAGAGGTGATATTATGCT
    AAAAAAAGTTCGCACTATACGAACAGGGAT
    

  • features.tsv.gz: Each row includes the following three columns without header: feature ID (column 1), feature name (column 2), and type of feature (column 3).

    1
    2
    3
    ENSMUSG00000100764  Gm29155 Gene Expression
    ENSMUSG00000100635  Gm29157 Gene Expression
    ENSMUSG00000100480  Gm29156 Gene Expression
    

  • matrix.tsv.gz: A compressed sparse matrix file format storing non-zero gene expression values across spatial locations or barcodes in spatial transcriptomics data.

    1
    2
    3
    4
    5
    6
    %%MatrixMarket matrix coordinate integer general
    %
    33989 17641021 17801209
    9677 1 1
    20305 2 1
    23800 2 1
    

    • Header: Initial lines form the header, declaring the matrix's adherence to the Market Matrix (MTX) format, outlining its traits. This section may include comments (lines beginning with %) for extra metadata, all marked by a “%”.
    • Dimensions: Following the header, the first line details the matrix dimensions: the count of rows (features), columns (barcodes), and non-zero entries.
    • Data Entries: Post-dimensions, subsequent lines enumerate non-zero entries in triplet form: row index (feature index), column index (barcode index), and value (expression level).

Output Guidelines

It is suggested to review the summary metrics to confirm the total read count, the percentage of reads aligned to genomes and genes, library saturation, the count of aligned spatial barcodes, and the count of unique transcripts.

Parameters

The following parameter in the job configuration file will be applied in this rule.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
upstream:
  smatch:                   
    skip_sbcd: 1            
    match_len: 27           
  align:                    
     min_match_len: 30      
     min_match_frac: 0.66   
     len_sbcd: 30            
     len_umi: 9              
     len_r2: 101             
     exist_action: overwrite
     resource:               
       assign_type: stdin
       stdin:
         partition: standard
         threads: 10
         memory: 70000m
  • Reformat FASTQ Paramaters

    Parameters in smatch and three parameters in align (including len_sbcd, len_umi, and len_r2) are used to pass values to the reformat-fastqs function in spatula. Below, for each parameter, the corresponding parameter in spatula, description, and the default value in NovaScope are provided.

    Parameter spatula parameter Description Default Value
    skip_sbcd --skip-sbcd The number of initial bases to omit from the read. 1
    match_len --match-len The length of the spatial barcode to be considered as a perfectmatch. 27
    len_sbcd --len_sbcd The length of the spatial barcode sequence to be copied in Read 1 30
    len_umi --len_umi The length of the UMI sequence (randomer) to be copied from Read 2 (beginning) to Read 1 (after spatial barcode) 9
    len_r2 --len_r2 The length of Read 2 sequences to be trimmed 101
    • skip_sbcd: This is useful if the 1st-seq spatial barcode lacks sufficient bases. When absent in the job configuration file, NovaScope determines skip_sbcd following the format in fastq2sbcd: 1 for DraI31 and 0 for DraI32.
  • Alignment Paramaters

    Four parameters in align (including len_sbcd, len_umi, min_match_len, and min_match_frac) are used to pass values to STARsolo. Below, for each parameter, the corresponding parameter in STARsolo, description, and the default value in NovaScope are provided.

    Parameter STARsolo parameter Description Default Value
    len_sbcd --soloCBlen The cell barcode length 30
    --soloUMIstart Defined as len_sbcd + 1, this indicates UMI sequence (randomer) start base. 31
    len_umi --soloUMIlen The length of UMI sequence (randomer) start base. 9
    min_match_len --outFilterMatchNmin An alignment is only output if the count of matched bases >= this value. 30
    min_match_frac --outFilterMatchNminOverLread Similar to min_match_len, normalized to the read length 0.66
  • The exist_action Parameter

    The exist_action parameter within align provides two choices for handling existing intermediate or output files: "skip" tells NovaScope to bypass these files, whereas "overwrite" instructs NovaScope to replace them.

  • The resource Parameter

    The resource parameters, specific to HPC users, determine the partitions, CPU count, and memory allocation for the alignment process. Details for the resource parameters in align are provided in the upstream parameters in Job Configuration.

    • assign_type: two available options for how NovaScope allocates resources for alignment. The options include "stdin" (recommended) and "filesize". Details for each option are provided in the blocks below.
    • Option stdin Advantages:
      • Directly allocates resources as specified in the stdin field, bypassing calculations for precision in resource management.
      • Enables customization of resources for different datasets in the job configuration file, allowing for optimization of costs based on file size. Disadvantages:
      • Requires users to specify resources for each job unless default settings (partition name, threads, memory) fit the computing environment. An example is provided in the template.
    • Option filesize

      Advantages:

      • Automatically allocates resources based on the total size of input 2nd-seq FASTQ files and specified computing resources in the environment configuration file.
      • Once computing resources are specified in the environment file, they automatically apply to all jobs, simplifying the setup. Disadvantages:
      • Requires computing time to calculate the total size of input files, potentially delaying the start of data processing.

      The resource allocation strategy is as follows:

      Total File Size (GB) Memory Allocated for Alignment (GB)
      Under 200 70
      200 to 400 140
      Over 400 330

Dependencies

Rule align requires the matched spatial barcode files from Rule smatch generates. Hence, if the input files are not available, align relies on the successful completion of smatch for proper operation. See an overview of the rule dependencies in the Workflow Structure.

Code Snippet

The code for this rule is provided in a04_align.smk