Configuring a NovaScope Run
After installing NovaScope and downloading the input data, prepare a job configuration file to specify inputs, outputs, and parameters.
Job Configuration File Specifications
The job configuration file must adhere to the following guidelines:
- Naming convention: `config_job.yaml`.
- Location: Ensure the `config_job.yaml` file is placed in the working directory. The working directory should be specified to NovaScope using the `-d` or `--directory` option.
- Fields: The `config_job.yaml` file must include the following fields: `input`, `output`, `request`, and `env_yml`. Additional fields can be included as per the user's requirements.
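The field requirement above can be checked before launching a run. The snippet below is a minimal sketch, assuming the configuration has already been parsed into a Python dictionary; the helper name `missing_required_fields` and the placeholder values are hypothetical and not part of NovaScope.

```python
# Required top-level fields named in the guidelines above.
REQUIRED_FIELDS = ("input", "output", "request", "env_yml")


def missing_required_fields(config):
    """Return the required top-level fields absent from a parsed config mapping."""
    return [field for field in REQUIRED_FIELDS if field not in config]


# Hypothetical parsed contents of a config_job.yaml; all values are placeholders.
example_config = {
    "input": {"seq1st": {"id": "PLACEHOLDER_ID"}},
    "output": "PLACEHOLDER_OUTPUT_DIR",
    "request": ["sge-per-run"],
}

print(missing_required_fields(example_config))  # ['env_yml']
```

A check like this only confirms that the fields are present; it does not validate their contents.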
Prepare the Job Configuration File
Prepare your job configuration file following the template below.
Example job configuration files
For convenience, we provide separate example `config_job.yaml` files for the Minimal Test Run, Shallow Liver Test Run, and Deep Liver Test Run.
Main Fields
Parameters
Input
Relative Path
NovaScope supports using relative paths in the job configuration file, which should be relative to the working directory. If a relative path is found, NovaScope automatically obtains its real path and uses it in the process.
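The relative-path behavior described above can be illustrated as follows. This is a minimal sketch, not NovaScope's own code: the helper name `resolve_path` is hypothetical, and it assumes that "real path" means the result of `os.path.realpath` (symlinks resolved, path made absolute).

```python
import os


def resolve_path(path, working_dir):
    """Resolve `path` against `working_dir` if it is relative, then return its real path."""
    if not os.path.isabs(path):
        # Relative paths are interpreted relative to the working directory.
        path = os.path.join(working_dir, path)
    # realpath normalizes the path and resolves any symbolic links.
    return os.path.realpath(path)


# Hypothetical relative path, resolved against the current directory.
print(resolve_path("input/reads_R1.fastq.gz", os.getcwd()))
```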
- `seq1st`:
    - `id`: The `id` will be used to organize the 1st-seq FASTQ files. Make sure the `id` field for 1st-seq in the corresponding flow cell is unique.
    - `layout` & `layout_shift`: These parameters identify the spatial barcode (sbcd) layout file, which provides the layout of tiles in a chip.
        - To use a specific layout file, set `layout` to the path of the desired layout file.
        - To use pre-built layout files in NovaScope, leave `layout` empty and define the `layout_shift`. The `layout_shift` defaults to `"tobe"`. NovaScope will then automatically identify the layout file from info/assets/layout_per_tile_basis, using `layout_shift` and `chip` as reference. NovaScope supports two types of layout, `"tobe"` and `"tebo"`, which shift tiles in different ways (see sbcd_layout.md#purpose).

        Spatial layout examination
        If the `layout_shift` is unclear or if misaligned fiducial marks are observed in the "sbcd" image, perform a spatial layout examination by setting `request` to `sbcdlo-per-flowcell`.

        The spatial barcode (sbcd) layout format

        lane | tile | row | col | rowshift | colshift |
        ---|---|---|---|---|---|
        3 | 2556 | 1 | 1 | 0 | 0 |
        3 | 2456 | 2 | 1 | 0 | 0.1715 |

        - `lane`: Lane IDs
        - `tile`: Tile IDs
        - `row` & `col`: The layout position
        - `rowshift` & `colshift`: The gap information

- `seq2nd`: This field requires all FASTQ pairs associated with the input section chip to be provided under `seq2nd`.

    How to generate `seq2nd_pair_id`?
    If an ID is not specified, NovaScope will automatically generate one using the format `<flowcell_id>.<chip_id>.<randomer>`, where `randomer` is the last 5 digits of the md5 hash of the real path of the read 1 FASTQ file from the 2nd-seq.

- `run_id`: Only needed if alignment is required to generate the requested output. It is used as an identifier for alignment and Spatial Digital Gene Expression matrices (SGEs) to differentiate between input 2nd-seq FASTQ files. This is particularly useful when generating SGEs using the same 1st-seq files but different 2nd-seq files. If not provided, NovaScope will generate it based on the flow cell ID, chip ID, and all input 2nd-seq read 1 FASTQ files.

    How to generate `run_id`?
    NovaScope automatically generates `run_id` in the format `<flowcell_id>-<chip_id>-<species>-<randomer>`. The `randomer` is created by sorting all input `seq2nd_pair_id` values, concatenating them into a single long string, and then computing the md5 hash of this string. The last 5 digits of this hash are used as the `randomer`.

- `unit_id`: Only needed if the reformat feature is required to generate the requested output. It acts as an identifier for SGEs that are prepared for reformatting. This identifier is especially useful when users wish to manually modify SGEs outside of NovaScope and then proceed to reformat both the original and modified SGEs. The `unit_id` ensures a clear distinction between the original and modified datasets.

    How to generate `unit_id`?
    If `unit_id` is not specified and reformatting is requested, it will default to `<run_id>-default`, indicating that no manual preprocessing has occurred.

    Users who prefer to reformat manually modified SGEs should define their own `unit_id`. We recommend incorporating `run_id` into the `unit_id` to maintain a clear trace of the dataset lineage.

- `histology`: NovaScope allows multiple input histology files for alignment. However, it is important to note that the magnification and type of each histology file serve as identifiers. Ensure that no two input histology files share the same magnification and type. Currently, historef supports the following types:
    - `"hne"`: Hematoxylin and Eosin (H&E) stained histology images;
    - `"dapi"`: DAPI or 4',6-diamidino-2-phenylindole stained histology images;
    - `"fl"`: Fluorescence stained histology images.
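The ID conventions described for `seq2nd_pair_id`, `run_id`, and `unit_id` can be sketched in Python as follows. This illustrates the documented formats and is not NovaScope's implementation: the function names are hypothetical, and the sketch assumes that "digits" refers to hexadecimal characters of the md5 digest, that strings are hashed as UTF-8, and that the sorted IDs are concatenated without a separator.

```python
import hashlib
import os


def md5_suffix(text, length=5):
    """Return the last `length` hex characters of the md5 hash of `text`."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()[-length:]


def make_seq2nd_pair_id(flowcell_id, chip_id, read1_fastq):
    # randomer: derived from the md5 hash of the real path of the read 1 FASTQ file.
    randomer = md5_suffix(os.path.realpath(read1_fastq))
    return f"{flowcell_id}.{chip_id}.{randomer}"


def make_run_id(flowcell_id, chip_id, species, seq2nd_pair_ids):
    # randomer: sort the pair IDs, concatenate them, then hash the result.
    # Assumption: the IDs are joined with no separator between them.
    randomer = md5_suffix("".join(sorted(seq2nd_pair_ids)))
    return f"{flowcell_id}-{chip_id}-{species}-{randomer}"


def make_unit_id(run_id, unit_id=None):
    # Documented default: <run_id>-default when no unit_id is supplied.
    return unit_id if unit_id else f"{run_id}-default"


# Placeholder identifiers and file names, for illustration only.
pair_ids = [
    make_seq2nd_pair_id("FLOWCELL", "CHIP", "reads_a_R1.fastq.gz"),
    make_seq2nd_pair_id("FLOWCELL", "CHIP", "reads_b_R1.fastq.gz"),
]
run_id = make_run_id("FLOWCELL", "CHIP", "SPECIES", pair_ids)
print(run_id)
print(make_unit_id(run_id))
```

Because the pair IDs are sorted before hashing, the generated `run_id` in this sketch does not depend on the order in which the 2nd-seq FASTQ pairs are listed.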
Output
The output directory is used to organize the input files and to store the output files. For its layout, see the output directory structure documentation.
Request
The pipeline uses the `request` field to identify the requested output files and, from them, to determine the execution flow. The `request` field accepts multiple desired outputs.
Info
The `request` field should indicate the final output required. All intermediate files that the final output depends on are generated automatically, following the dependencies between rules.
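The idea of requesting only a final output and letting prerequisites follow can be sketched with a small dependency walk. Everything in this example is illustrative: the dependency map is an assumption made for the sketch (a simple chain over the main request options), and the real dependencies are defined by NovaScope's rules, not by this code.

```python
# Hypothetical dependency map, for illustration only: each request lists the
# requests whose outputs it is assumed to need.
ASSUMED_DEPENDENCIES = {
    "sbcd-per-flowcell": [],
    "sbcd-per-chip": ["sbcd-per-flowcell"],
    "smatch-per-chip": ["sbcd-per-chip"],
    "align-per-run": ["smatch-per-chip"],
    "sge-per-run": ["align-per-run"],
}


def steps_for(requests, dependencies=ASSUMED_DEPENDENCIES):
    """Return every step needed for `requests`, with prerequisites listed first."""
    ordered = []

    def visit(step):
        if step in ordered:
            return  # already scheduled
        for prerequisite in dependencies[step]:
            visit(prerequisite)
        ordered.append(step)

    for request in requests:
        visit(request)
    return ordered


print(steps_for(["smatch-per-chip"]))
# ['sbcd-per-flowcell', 'sbcd-per-chip', 'smatch-per-chip']
```

Under these assumed dependencies, requesting a late step pulls in all earlier ones, which is why only the final output needs to be listed.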
Main Request
Below are request options for NovaScope's main functionalities, alongside their final output and links to detailed output information.
Option | Final Output Files | Details |
---|---|---|
`sbcd-per-flowcell` | Spatial barcode maps for a flowcell on a per-tile basis, and a manifest file of summary statistics for each tile. | fastq2sbcd |
`sbcd-per-chip` | A spatial barcode map for a chip, and an image of the spatial barcode distribution. | sbcd2chip |
`smatch-per-chip` | A TSV file of spatial barcodes matched to the 2nd-seq reads, and an image of the matched spatial barcode distribution. | smatch |
`align-per-run` | A Binary Alignment Map file with summary metrics, and a digital gene expression matrix for genomic features. | align |
`sge-per-run` | An SGE matrix with a coordinate metadata file, an image showing the distributions of all, matched, and aligned spatial barcodes, and images of specific gene expressions. | dge2sdge and sdge_visual |
Plus Request
The options below are only for executing the additional functionalities. Please make sure you have installed the additional requirements properly.
Option | Final Output Files | Details |
---|---|---|
`histology-per-run` | GeoTIFF files for coordinate transformation between the SGE matrix and the histology image. | historef |
`sbcdlo-per-flowcell` | Two spatial barcode maps, each aligning tile pairs using one spatial map layout. | sbcd_layout |
`transcript-per-unit` | An SGE matrix in the TSV format that is compatible with FICTURE. | sdgeAR_reformat |
`filterftr-per-unit` | A feature file for genes that pass gene-based filtering, formatted as a TSV file that contains detailed information about each gene. | sdgeAR_featurefilter |
`filterpoly-per-unit` | An SGE matrix, a coordinate metadata file, a feature file, and a boundary JSON file, all reflecting the SGE matrix that passed the polygon-based density filtering. | sdgeAR_polygonfilter |
`segment-10x-per-unit` | A hexagon-indexed SGE matrix in the 10x Genomics format. | sdgeAR_segment_10x |
`segment-ficture-per-unit` | A hexagon-indexed SGE matrix in the FICTURE-compatible TSV format. | sdgeAR_segment_ficture |
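Since the `request` field accepts several options, it can help to check a list of requests against the option names in the two tables above before running. The snippet below is a minimal sketch; the helper name `classify_requests` is hypothetical, and the option names are copied from the tables.

```python
# Option names copied from the Main Request and Plus Request tables.
MAIN_REQUESTS = {
    "sbcd-per-flowcell", "sbcd-per-chip", "smatch-per-chip",
    "align-per-run", "sge-per-run",
}
PLUS_REQUESTS = {
    "histology-per-run", "sbcdlo-per-flowcell", "transcript-per-unit",
    "filterftr-per-unit", "filterpoly-per-unit",
    "segment-10x-per-unit", "segment-ficture-per-unit",
}


def classify_requests(requests):
    """Split requested options into main, plus, and unrecognized groups."""
    result = {"main": [], "plus": [], "unknown": []}
    for option in requests:
        if option in MAIN_REQUESTS:
            result["main"].append(option)
        elif option in PLUS_REQUESTS:
            result["plus"].append(option)
        else:
            result["unknown"].append(option)
    return result


print(classify_requests(["sge-per-run", "histology-per-run", "sge-per-runs"]))
# {'main': ['sge-per-run'], 'plus': ['histology-per-run'], 'unknown': ['sge-per-runs']}
```

Any option reported under `plus` relies on the additional requirements mentioned above, and anything under `unknown` is most likely a typo.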