Configuring a NovaScope Run¶
Overview¶
Once you have installed NovaScope and downloaded the input data, the next step is to configure a NovaScope run. This mainly involves preparing the input job configuration file (in YAML, config_job.yaml) for the run.
Preparing Input Config Files¶
Info
The pipeline requires a config_job.yaml file in the working directory. This file specifies all input files, output files, and parameters. The working directory is indicated by -d or --directory when executing NovaScope.
For convenience, we provide separate example config_job.yaml files for the Minimal Test Run Dataset, Shallow Liver Section Dataset, and Deep Liver Section Dataset test runs.
Each item specified in config_job.yaml is described in detail below.
A Template of the Config File¶
Below is a template of the config_job.yaml file.
Mandatory fields are marked as "REQUIRED FIELD".
Detailed Description of Individual Fields¶
Input¶
Tip
NovaScope supports using relative paths in the job configuration file, which should be relative to the working directory. If a relative path is found, NovaScope automatically obtains its real path and uses it in the process.
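The path handling described in this tip can be pictured with a minimal Python sketch. It assumes that resolving a path means joining it to the working directory and calling os.path.realpath. The function name resolve_input_path is hypothetical and is not part of NovaScope, whose actual implementation may differ.

```python
import os


def resolve_input_path(path, working_dir):
    """Return the real path for a config entry (hypothetical helper).

    Assumption: relative paths are interpreted against the working
    directory passed via -d/--directory; absolute paths are only
    normalized and have their symlinks resolved.
    """
    if os.path.isabs(path):
        return os.path.realpath(path)
    return os.path.realpath(os.path.join(working_dir, path))
```

For example, with a working directory of /data/run1, the entry input/reads_R1.fastq.gz would resolve to a path under the real location of /data/run1.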
- seq1st:
    - id: The id is used to organize the 1st-seq FASTQ files. Make sure the id for 1st-seq is unique within the corresponding flowcell.
    - layout: A spatial barcode (sbcd) layout file that describes the layout of tiles in a chip, in the format shown below. If absent, NovaScope automatically looks for the sbcd layout within the NovaScope repository at info/assets/layout_per_tile_basis, using the section chip ID for reference.

            lane  tile  row  col  rowshift  colshift
            3     2556  1    1    0         0
            3     2456  2    1    0         0.1715

        - lane: Lane IDs;
        - tile: Tile IDs;
        - row & col: The layout position;
        - rowshift & colshift: The gap information.
- seq2nd: All FASTQ pairs associated with the input section chip must be provided under seq2nd.

    How to generate seq2nd_pair_id?

    If an ID is not specified, NovaScope automatically generates one using the format <flowcell_id>.<chip_id>.<randomer>, where randomer is the last 5 digits of the md5 hash of the real path of the read 1 FASTQ file from the 2nd-seq.
- run_id: Only needed if alignment is required to generate the requested output. It is used as an identifier for alignment and Spatial Digital Gene Expression matrices (SGEs) to differentiate between input 2nd-seq FASTQ files. This is particularly useful when generating SGEs using the same 1st-seq files but different 2nd-seq files. If not provided, NovaScope generates it based on the flowcell ID, chip ID, and all input 2nd-seq read 1 FASTQ files.

    How to generate run_id?

    NovaScope automatically generates run_id in the format <flowcell_id>-<chip_id>-<species>-<randomer>. The randomer is created by sorting all input seq2nd_pair_id values, concatenating them into a single long string, and then computing the md5 hash of this string. The last 5 digits of this hash are used as the randomer.
- unit_id: Only needed if the reformat feature is required to generate the requested output. It acts as an identifier for SGEs that are prepared for reformatting. This identifier is especially useful when users wish to manually modify an SGE outside of NovaScope and then reformat both the original and modified SGEs. The unit_id ensures a clear distinction between the original and modified datasets.

    How to generate unit_id?

    If unit_id is not specified and reformatting is requested, it defaults to <run_id>-default, indicating that no manual preprocessing has occurred.

    Users who prefer to reformat manually modified SGEs should define their own unit_id. We recommend incorporating run_id into the unit_id to maintain a clear trace of the dataset lineage.
- histology: NovaScope allows multiple input histology files for alignment. However, the magnification and type of each histology file serve as identifiers, so ensure that no two input histology files share the same magnification and type. Currently, historef supports the following types:
    - "hne": Hematoxylin and Eosin (H&E) stained histology images;
    - "dapi": DAPI or 4',6-diamidino-2-phenylindole stained histology images;
    - "fl": Fluorescence stained histology images.
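The automatic ID rules described above for seq2nd_pair_id, run_id, and unit_id can be sketched in Python as follows. This is an illustration under stated assumptions, not NovaScope's own code. It assumes that "the last 5 digits" means the last 5 characters of the hexadecimal md5 digest, that strings are UTF-8 encoded before hashing, and that the sorted IDs are concatenated without a separator. All function names are hypothetical.

```python
import hashlib
import os


def md5_tail(text, n=5):
    """Last n characters of the hex md5 digest of text (assumed UTF-8)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()[-n:]


def make_seq2nd_pair_id(flowcell_id, chip_id, fastq_r1):
    """<flowcell_id>.<chip_id>.<randomer>, hashing the real path of read 1."""
    randomer = md5_tail(os.path.realpath(fastq_r1))
    return f"{flowcell_id}.{chip_id}.{randomer}"


def make_run_id(flowcell_id, chip_id, species, seq2nd_pair_ids):
    """<flowcell_id>-<chip_id>-<species>-<randomer>.

    The randomer hashes the sorted pair IDs joined into one string
    (assumption: no separator between IDs).
    """
    randomer = md5_tail("".join(sorted(seq2nd_pair_ids)))
    return f"{flowcell_id}-{chip_id}-{species}-{randomer}"


def default_unit_id(run_id):
    """Default used when unit_id is not specified: <run_id>-default."""
    return f"{run_id}-default"
```

Because the pair IDs are sorted before hashing, the order in which FASTQ pairs are listed under seq2nd does not change the resulting run_id in this sketch.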
Output¶
The output directory is used to organize the input files and store the output files. Please see the output directory structure here.
Requests¶
The pipeline interprets the requested output files via request and determines the execution flow.
Info
The request parameter should indicate the final output required. All intermediate files contributing to the final output are generated automatically, following the dependencies between rules.
Below are the options with their final output files and links to detailed output information. For more insight into the execution flow, please consult the execution flow by request alongside the rulegraph.
Option | Main/Final Output Files | Details |
---|---|---|
sbcd-per-flowcell | Spatial barcode map (per-tile basis) and Manifest file for a flowcell | fastq2sbcd |
sbcd-per-chip | Spatial barcode map for a section chip, Image of spatial barcode distribution | sbcd2chip |
smatch-per-chip | File with matched spatial barcodes, Image of matched barcode spatial distribution | smatch |
align-per-run | Binary Alignment Map (BAM) file, Digital gene expression matrix (DGE) for genomic features | align |
sge-per-run | Spatial digital gene expression matrix (SGE), Spatial distribution images for transcripts | dge2sdge and sdge_visual |
hist-per-run | Geotiff files for coordinate transformation between SGE and histology image, and a resized one | historef |
transcript-per-unit | SGE in the FICTURE-compatible format | sdgeAR_reformat |
segment-per-unit | Hexagon-based SGE in the 10x Genomics format | sdgeAR_segment |
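As a quick way to catch typos in request before launching a run, the option names in the table can be checked with a small Python sketch. It assumes that request is given as a list of option strings. The names REQUEST_TO_RULES and check_requests are hypothetical and not part of NovaScope. The mapping only restates the "Details" column and does not model the full dependency graph between rules.

```python
# Request option -> rule page(s) listed in the "Details" column above.
REQUEST_TO_RULES = {
    "sbcd-per-flowcell": ["fastq2sbcd"],
    "sbcd-per-chip": ["sbcd2chip"],
    "smatch-per-chip": ["smatch"],
    "align-per-run": ["align"],
    "sge-per-run": ["dge2sdge", "sdge_visual"],
    "hist-per-run": ["historef"],
    "transcript-per-unit": ["sdgeAR_reformat"],
    "segment-per-unit": ["sdgeAR_segment"],
}


def check_requests(requests):
    """Return the rule pages for each requested option.

    Raises ValueError if any option is not one of the documented names.
    """
    unknown = [r for r in requests if r not in REQUEST_TO_RULES]
    if unknown:
        raise ValueError(f"Unknown request option(s): {', '.join(unknown)}")
    return {r: REQUEST_TO_RULES[r] for r in requests}
```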
Upstream & Downstream¶
Parameter details for the upstream and downstream fields are outlined in the NovaScope Walkthrough, under the specific rule pages to which they apply.
align
- resource: Only applicable for HPC users.
    - assign_type: Two options are available for how NovaScope allocates resources for alignment: "stdin" (recommended) and "filesize". Details for each option are provided in the blocks below.
Option stdin
Advantages:
- Directly allocates the resources specified in the stdin field, with no file-size calculation, giving precise control over resource management.
- Enables customization of resources for different datasets in the job configuration file, allowing costs to be optimized based on file size.

Disadvantages:
- Requires users to specify resources for each job unless the default settings (partition name, threads, memory) fit the computing environment. An example is provided in the template.

Option filesize
Advantages:
- Automatically allocates resources based on the total size of the input 2nd-seq FASTQ files and the computing resources specified in the environment configuration file.
- Once computing resources are specified in the environment file, they automatically apply to all jobs, simplifying the setup.

Disadvantages:
- Requires computing time to calculate the total size of input files, potentially delaying the start of data processing.
The resource allocation strategy is as follows:
Total File Size (GB) | Memory Allocated for Alignment (GB) |
---|---|
Under 200 | 70 |
200 to 400 | 140 |
Over 400 | 330 |
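The allocation table can be read as a simple threshold function. The Python sketch below is illustrative only. It assumes that the "200 to 400" band includes both endpoints and that the input is the total size of the 2nd-seq FASTQ files in GB. The function name alignment_memory_gb is hypothetical and not part of NovaScope.

```python
def alignment_memory_gb(total_fastq_size_gb):
    """Memory (GB) for alignment under the "filesize" option.

    Assumption: 200 and 400 GB both fall in the middle band.
    """
    if total_fastq_size_gb < 0:
        raise ValueError("total file size must be non-negative")
    if total_fastq_size_gb < 200:
        return 70
    if total_fastq_size_gb <= 400:
        return 140
    return 330
```

Under these assumptions, 150 GB of input FASTQ files maps to 70 GB of memory, and 450 GB maps to 330 GB.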
-