Setting Up an Environment YAML File¶

NovaScope requires a YAML file to configure the environment. This file (e.g., config_env.yaml) specifies paths to tools, reference databases, and the Python environment.

Below is a brief description of all the items in the YAML file.

Tip

To create your own config_env.yaml file for the environment setup, you may copy from our example.

Please replace the placeholders with your specific input variables.

Tools¶

Any tool left undefined here is looked up on the system path automatically, using the default command name shown in the comments below.

tools:
  spatula: /path/to/spatula/bin/spatula                     ## Default: "spatula"
  samtools: /path/to/samtools/samtools                      ## Default: "samtools"
  star: /path/to/STAR_2_7_11b/bin/Linux_x86_64_static/STAR  ## Default: "STAR"

samtools

On High-Performance Computing (HPC) systems where samtools is available as a module, you can load it through envmodules (see Environment Modules) instead of defining its path here.

(Optional) Environment Modules¶

Info

Only applicable to HPC environments. For local executions, remove this section from config_env.yaml.

HPC users can use the envmodules section to load the required software tools as modules. If a tool is not listed in the envmodules section, the pipeline assumes it is installed system-wide.

Tip

The version information is required.

envmodules:
  python: "python/<version_information>"
  gcc: "gcc/<version_information>"
  gdal: "gdal/<version_information>"
  imagemagick: "imagemagick/<version_information>"
  # snakemake: "snakemake/<version_information>"
  # samtools: "Bioinformatics && samtools"

python

If your Python environment was set up using a Python accessed through a module, specify the same Python module in the envmodules section to maintain the environment. If using a local Python installation (not through module load), DO NOT INCLUDE any Python module here.

samtools

Using envmodules to load samtools can be an alternative to specifying its path in tools.

The example above covers the case where samtools is part of the Bioinformatics module system, so the Bioinformatics module must be loaded before samtools. In this case, list every module that needs to be loaded, in the correct order, joined by &&.
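As a minimal sketch of this alternative, the fragment below drops samtools from tools and loads it through envmodules instead. It reuses only the placeholders and module names from the examples above; the module names available on your cluster may differ.

```yaml
tools:
  spatula: /path/to/spatula/bin/spatula
  star: /path/to/STAR_2_7_11b/bin/Linux_x86_64_static/STAR
  # samtools is intentionally not defined here; it is loaded as a module below

envmodules:
  python: "python/<version_information>"
  # Bioinformatics is loaded first, then samtools
  samtools: "Bioinformatics && samtools"
```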

Reference Databases¶

Tip

Ensure that the reference files match the species of your input data.

Define all necessary reference databases for the input species in the ref field.

ref:
  align:
    mouse: "/path/to/refdata-gex-GRCm39-2024-A/star_2.7_11b"
    human: "/path/to/refdata-gex-GRCh38-2024-A/star_2.7_11b"
    #...
  genelists:
    mouse: "/path/to/ref_gene_list_directory_for_mouse"
    human: "/path/to/ref_gene_list_directory_for_human"
    #...
  #geneinfo:                                        ## (optional) skip if the users prefer to use precompiled files
    #mouse: "/path/to/ref_gene_info_file_for_mouse"
    #human: "/path/to/ref_gene_info_file_for_human"
    #...

(1) Reference Genome Index for Alignment¶

Specify the alignment reference genome index in the align field. Reference genome indices can be accessed via the cellranger download page. Users may also generate their own genome index, with detailed instructions for building a STAR index provided in the Requirements section.

(2) (Optional) Reference Gene List Files for Spatial Expression Visualization¶

Tip

By default, NovaScope requires reference gene list files for visualizing spatial expression patterns. If these files are unavailable, users can disable this feature by setting action in draw_sge to False.

The genelists field should specify the directory containing species-specific gene lists, which are essential for visualizing spatial expression in Rule sdge_visual. Each file in this directory must be named <gene_group>.genes.tsv (e.g., MT.genes.tsv) and list gene names line by line.

NovaScope provides precompiled gene lists for mouse (mm39) and human (hg38). If not specified, these defaults will be used. Users may also supply custom gene lists or disable the visualization of gene sets.

(3) (Optional) Reference Gene Information for Gene Filtering¶

A gene information file is needed only if additional functionalities that rely on gene filtering are used. The geneinfo field should point to the gene information file used for that filtering. By default, NovaScope uses precompiled files for mouse (mm39), human (hg38), and chick (g6a).

Users need to specify a gene information file in the geneinfo field only if:

The dataset is from a species for which no precompiled file is provided.
The dataset version differs from the precompiled files.
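For the second case, a sketch of the ref section with geneinfo enabled for mouse might look like the following. The paths are the same placeholders used in the example above and must be replaced with your own files.

```yaml
ref:
  align:
    mouse: "/path/to/refdata-gex-GRCm39-2024-A/star_2.7_11b"
  genelists:
    mouse: "/path/to/ref_gene_list_directory_for_mouse"
  geneinfo:
    # Overrides the precompiled mouse file with a user-supplied one
    mouse: "/path/to/ref_gene_info_file_for_mouse"
```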

Python Environment¶

Specify the path to the Python virtual environment by modifying the following line:

pyenv: "/path/to/python/virtual/env"
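Putting the required sections together, a config_env.yaml for a local (non-HPC) run could be sketched as below. It assumes a mouse dataset and uses only the placeholder paths shown earlier. The envmodules section is left out because it applies only to HPC environments, and geneinfo is left out so that the precompiled file is used.

```yaml
tools:
  spatula: /path/to/spatula/bin/spatula
  samtools: /path/to/samtools/samtools
  star: /path/to/STAR_2_7_11b/bin/Linux_x86_64_static/STAR

ref:
  align:
    mouse: "/path/to/refdata-gex-GRCm39-2024-A/star_2.7_11b"
  genelists:
    mouse: "/path/to/ref_gene_list_directory_for_mouse"

pyenv: "/path/to/python/virtual/env"
```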

(Optional) Computing Capabilities¶

Info

Only applicable to HPC environments and when the filesize resource allocation method is applied.

NovaScope offers two resource allocation methods for alignment:

stdin: Manually define resources in the job configuration file.

filesize: Automatically allocate resources based on the input file size and the available computational resources. When using this option, the available resources must be described in available_nodes (see the example below):

available_nodes:
  - partition: standard     # partition name
    max_n_cpus: 20          # the maximum number of CPUs per node
    mem_per_cpu: 7g         # the memory allocation per CPU 
  - partition: largemem
    max_n_cpus: 10
    mem_per_cpu: 25g

For details on activating stdin or filesize and understanding the filesize strategy, see the Job Configuration page.
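To get a feel for what the example values imply, the annotated copy below works out the largest memory request each partition could serve. This is only an illustration and assumes that a job's memory is its CPU count multiplied by mem_per_cpu; the actual filesize strategy is described on the Job Configuration page.

```yaml
available_nodes:
  - partition: standard
    max_n_cpus: 20
    mem_per_cpu: 7g         # at most 20 CPUs x 7g = 140g per node (assumed)
  - partition: largemem
    max_n_cpus: 10
    mem_per_cpu: 25g        # at most 10 CPUs x 25g = 250g per node (assumed)
```

Under that assumption, the largemem partition offers more memory per node despite having fewer CPUs.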