Setting Up an Environment YAML File¶
NovaScope requires a YAML file to configure the environment. This file (e.g., config_env.yaml) specifies paths to tools, reference databases, and the Python environment.
Below is a brief description of all the items in the YAML file.
Tip
To create your own config_env.yaml file for the environment setup, you may copy from our example.
Please replace the placeholders with your specific input variables.
Tools¶
If a tool's path is not defined here, the pipeline automatically looks for the tool in the system path.
samtools
Users in High-Performance Computing (HPC) environments where samtools is installed as a module can load it through envmodules (see Environment Modules) instead of defining its path here.
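The fallback described in this section, where a tool without a configured path is looked up on the system path, can be pictured with a minimal Python sketch. This is an illustration only, not NovaScope's own code; the function name resolve_tool is hypothetical.

```python
import shutil
from typing import Optional


def resolve_tool(name: str, configured_path: Optional[str] = None) -> Optional[str]:
    """Return the configured path if one is given, else search the system PATH.

    Hypothetical helper for illustration; NovaScope's real lookup may differ.
    """
    if configured_path:
        return configured_path
    # shutil.which returns None when no executable with this name is on PATH.
    return shutil.which(name)
```

Calling resolve_tool("samtools") returns the full path of the samtools executable found on PATH, or None if it is not installed, which is a quick way to check whether a tool needs an explicit entry.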
(Optional) Environment Modules¶
Info
Only applicable to HPC environments. For local executions, remove this section from config_env.yaml.
HPC users can use the envmodules section to load the required software tools as modules. If a tool is not listed in the envmodules section, the pipeline assumes it is installed system-wide.
Tip
Version information is required for each module.
python
If your Python environment was set up with a Python accessed through a module, specify the same Python module in the envmodules section to keep the environment consistent. If you use a local Python installation (not loaded through module load), DO NOT INCLUDE any Python module here.
samtools
Using envmodules to load samtools can be an alternative to specifying its path in tools.
The given example is designed for cases where samtools is part of the Bioinformatics module system, which requires loading the Bioinformatics module before loading samtools. In this case, list all modules that must be loaded, in the correct order, joined by &&.
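To make the && convention concrete, here is a small Python sketch of how such an entry could be expanded into module load commands. How NovaScope itself processes the string is not shown in this page, so treat this as an assumption; the function name module_load_command and the version string used in the usage note are hypothetical.

```python
def module_load_command(spec: str) -> str:
    """Expand 'A && B/<version>' into 'module load A && module load B/<version>'.

    Hypothetical helper: the order of modules in the input is preserved,
    so prerequisite modules must be listed first.
    """
    modules = [part.strip() for part in spec.split("&&") if part.strip()]
    return " && ".join(f"module load {module}" for module in modules)
```

For example, module_load_command("Bioinformatics && samtools/1.0") yields a command that loads Bioinformatics first and samtools second, which is why the order in the entry matters.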
Reference Databases¶
Tip
Ensure that the reference files match the species of your input data.
Define all necessary reference databases for the input species in the ref field.
(1) Reference Genome Index for Alignment¶
Specify the alignment reference genome index in the align field. Reference genome indices can be downloaded from the cellranger download page. Users may also generate their own genome index; detailed instructions for building a STAR index are provided in the Requirements section.
(2) (Optional) Reference Gene List Files for Spatial Expression Visualization¶
Tip
By default, NovaScope requires reference gene list files for visualizing spatial expression patterns. If these files are unavailable, users can disable this feature by setting action in draw_sge to False.
The genelists field should specify the directory containing species-specific gene lists, which are essential for visualizing spatial expression in Rule sdge_visual. Each file in this directory must be named <gene_group>.genes.tsv (e.g., MT.genes.tsv) and list gene names line by line.
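When preparing a custom gene list directory, it can help to check that the files follow the <gene_group>.genes.tsv naming rule described above. The Python sketch below reads such a directory into a dictionary; it is an illustrative check under that naming assumption, not part of NovaScope, and load_gene_lists is a hypothetical name.

```python
from pathlib import Path
from typing import Dict, List

SUFFIX = ".genes.tsv"


def load_gene_lists(directory: str) -> Dict[str, List[str]]:
    """Map each gene group to its gene names, one name per line in the file.

    Only files named <gene_group>.genes.tsv are read; other files are skipped.
    """
    gene_lists: Dict[str, List[str]] = {}
    for path in sorted(Path(directory).glob(f"*{SUFFIX}")):
        group = path.name[: -len(SUFFIX)]  # e.g. "MT" from "MT.genes.tsv"
        with path.open() as handle:
            # Drop surrounding whitespace and ignore blank lines.
            gene_lists[group] = [line.strip() for line in handle if line.strip()]
    return gene_lists
```

An empty result for a directory you expected to contain gene lists usually points to a naming mismatch, such as a missing .genes.tsv suffix.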
NovaScope provides precompiled gene lists for mouse (mm39) and human (hg38). If not specified, these defaults will be used. Users may also supply custom gene lists or disable the visualization of gene sets.
(3) (Optional) Reference Gene Information for Gene Filtering¶
A gene information file is needed only if additional functionalities, such as gene filtering, are used. The geneinfo field should point to the gene information file used for gene filtering. By default, NovaScope uses precompiled files for mouse (mm39), human (hg38), and chick (g6a).
Users need to specify a gene information file in the geneinfo field only if:
- The dataset is from a species not covered by the precompiled files.
- The dataset version differs from the precompiled files.
Python Environment¶
Specify the path to your Python virtual environment in config_env.yaml.
(Optional) Computing Capabilities¶
Info
Only applicable to HPC environments and when the filesize resource allocation method is applied.
NovaScope offers two resource allocation methods for alignment:
- stdin: Manually define resources in the job configuration file.
- filesize: Automatically allocate resources based on input file size and available computational resources, which must be specified in available_nodes when using this option (see an example below):

    available_nodes:
      - partition: standard    # partition name
        max_n_cpus: 20         # the maximum number of CPUs per node
        mem_per_cpu: 7g        # the memory allocation per CPU
      - partition: largemem
        max_n_cpus: 10
        mem_per_cpu: 25g
For details on activating stdin or filesize and understanding the filesize strategy, see the Job Configuration page.
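As a quick sanity check on an available_nodes entry, the total memory a node can offer is the number of CPUs multiplied by the memory per CPU. The Python sketch below works this out for the two partitions in the example above. It only illustrates the arithmetic and assumes memory values written as a whole number followed by g; it does not reproduce NovaScope's filesize allocation strategy, and the function names are hypothetical.

```python
import re


def parse_mem_gb(value: str) -> int:
    """Parse a memory string such as '7g' into a whole number of gigabytes."""
    match = re.fullmatch(r"(\d+)g", value.strip().lower())
    if match is None:
        raise ValueError(f"unsupported memory value: {value!r}")
    return int(match.group(1))


def node_memory_gb(node: dict) -> int:
    """Maximum memory per node: max_n_cpus multiplied by mem_per_cpu."""
    return node["max_n_cpus"] * parse_mem_gb(node["mem_per_cpu"])


# The two partitions from the available_nodes example, written as Python dicts.
available_nodes = [
    {"partition": "standard", "max_n_cpus": 20, "mem_per_cpu": "7g"},
    {"partition": "largemem", "max_n_cpus": 10, "mem_per_cpu": "25g"},
]

for node in available_nodes:
    print(node["partition"], node_memory_gb(node), "GB")
```

With these values the standard partition offers up to 140 GB per node (20 x 7) and largemem up to 250 GB (10 x 25), so largemem provides more memory per node despite having fewer CPUs.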