FASTQs

Overview

FASTQ is a common format for storing sequencing reads together with their base quality scores. Stereo-seq uses paired-end (PE) sequencing: read 1 carries the Coordinate ID (CID) and Molecular ID (MID), while read 2 carries the captured RNA sequence. In multi-sample sequencing, an additional sequence (the sample barcode) is added to identify each sample. Because sequencing data can contain errors, a filtration step removes reads with a low-quality MID (an MID that contains an N base, or that has two or more bases with a quality value lower than 10); both read 1 and its paired read 2 are removed. The CID and MID are then appended to the read ID of each read 2 that passes the filter. Finally, the single-strand sequences in read 2, which contain the RNA data, are written to a file in FASTQ format as the original sequencing data, without the sample barcode.
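
The MID filtering rule described above can be sketched as a small check. This is only an illustration, not the SAW implementation: the function name `mid_passes_filter` is hypothetical, the quality string is assumed to be Phred+33 encoded, and the caller is assumed to have already cut the MID bases and their quality characters out of read 1 (the MID position within read 1 is not specified here).

```python
def mid_passes_filter(mid_seq: str, mid_qual: str,
                      min_quality: int = 10,
                      phred_offset: int = 33) -> bool:
    """Return True if a read with this MID would be kept.

    Assumptions (not taken from SAW itself): qualities are Phred+33
    encoded, and mid_seq / mid_qual hold only the MID portion of read 1.
    """
    if len(mid_seq) != len(mid_qual):
        raise ValueError("sequence and quality strings must have equal length")

    # Rule 1: an MID containing an N base is discarded.
    if "N" in mid_seq.upper():
        return False

    # Rule 2: an MID with two or more bases below the quality
    # threshold is discarded, so at most one such base is tolerated.
    low_quality_bases = sum(
        1 for ch in mid_qual if ord(ch) - phred_offset < min_quality
    )
    return low_quality_bases < 2


# Example: '*' decodes to quality 9 under Phred+33, which is below 10.
keep_one_low = mid_passes_filter("ACGTACGTAC", "F*FFFFFFFF")
drop_two_low = mid_passes_filter("ACGTACGTAC", "F**FFFFFFF")
```

When the check fails, both read 1 and its paired read 2 would be dropped, as described above.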

Quality records

Q40 FASTQ and Q4 FASTQ are two ways of recording base quality in the original sequencing data. Q40 describes the quality of each sequenced base using 41 distinct quality values. Q4 is a similar system that uses only 4 quality values.
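
To see which quality values a given FASTQ actually uses, the quality line can be decoded into numeric scores. The sketch below assumes the common Phred+33 encoding (each character's ASCII code minus 33); the helper names are hypothetical and the code does not decide by itself whether a file is Q40 or Q4.

```python
from typing import List


def decode_qualities(quality_line: str, phred_offset: int = 33) -> List[int]:
    """Convert a FASTQ quality line into numeric scores (Phred+33 assumed)."""
    return [ord(ch) - phred_offset for ch in quality_line]


def distinct_quality_values(quality_line: str, phred_offset: int = 33) -> List[int]:
    """Return the sorted set of quality scores that occur in the line."""
    return sorted(set(decode_qualities(quality_line, phred_offset)))


# Example with a short quality string made of the characters 'F', '8' and ','.
values = distinct_quality_values("FFF8F,F")
print(values)  # [11, 23, 37]
```

A file whose reads only ever show a handful of distinct values is consistent with a reduced quality scheme, while a wide spread of values is consistent with a 41-value scheme.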

Storage types

Paired-FASTQ and grouped-FASTQ are two optional output formats for the original sequencing data.

Note that the quality-record method of a FASTQ is independent of its storage type.

Paired FASTQs consist of a pair of read files: read 1 holds the CID and MID information, and read 2 holds the captured RNA sequencing data. An example of a paired FASTQ:

# read 1
@E100026571L1C001R00300000000/1
TGTCCAACGGAGACGGCTCCGACAAGGCACTGGCA
+
>DG;<BGH=>*EFE8*G/3E@2:F0-GBGG188F<

# read 2
@E100026571L1C001R00300000000/2
GTCTCACCATACTTTTACAAAGTTATTTCAACCCAAATCACAATTTAAGAATTATTTGTTCTACCTATGCCACACTTTAAATAAATGTCTATTAAAACCA
+
-GFEECG?ECBFF<=@A@<E@><;FGCF=>=E53FEF5>FGF@,0ADE9CEAG2GBE@HF3EA<CE;G2F@=G8=?@G9FBGE.EG6G2;974E*D9DE9

Grouped FASTQ is an output format that uses only one kind of read file, with the dataset split into a group of 16 or 64 parts. Each read ID in the file starts with "@" and includes the read name followed by the encoded CID and MID information. The sequence line contains the captured RNA sequencing data. Storage space is greatly reduced because of the combined output format and the smaller number of quality values. An example of a grouped FASTQ:

@FP300000513L1C002R00400000218 CE242DF29A57 97D26
GTGTAGTGAACCCCATGGTAGTTTTCTGATTGTTGTTAAAAAAAATGACTTAACATATTACATGGACACTCAATAAAAATGTTTTATTTCCTGTTGAAAA
+
FFFFFFFFFFFF8F8FFFFFFFFFFFFF8FFFFFFFFF8FF8FFF8FFFFFFF,FFFFFFFFFFF8FFFFFF8F8F,F8FFFFFF,FFFFFFFFFF,FFF
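
The read ID line of a grouped FASTQ can be split into its parts as follows. This is a minimal sketch based only on the example record above: it assumes the three fields are separated by whitespace and appear in the order read name, encoded CID, encoded MID. The names `GroupedReadId` and `parse_grouped_read_id` are hypothetical, and the encoded values are returned as-is without decoding.

```python
from typing import NamedTuple


class GroupedReadId(NamedTuple):
    read_name: str
    encoded_cid: str  # assumed to be the second field
    encoded_mid: str  # assumed to be the third field


def parse_grouped_read_id(header: str) -> GroupedReadId:
    """Split a grouped-FASTQ read ID line into name, encoded CID and encoded MID."""
    if not header.startswith("@"):
        raise ValueError("a FASTQ read ID line must start with '@'")
    fields = header[1:].split()
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields, found {len(fields)}")
    return GroupedReadId(*fields)


read_id = parse_grouped_read_id("@FP300000513L1C002R00400000218 CE242DF29A57 97D26")
print(read_id.read_name)    # FP300000513L1C002R00400000218
print(read_id.encoded_cid)  # CE242DF29A57
print(read_id.encoded_mid)  # 97D26
```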

Expected name prefixes

The raw data generated by the sequencing platform falls into two categories according to storage type, and each category has its own file naming rules. --fastqs takes one or more directory paths, and all FASTQs under those directories are used as input to the SAW count run, so check the contents of your input directories carefully.

If your FASTQs are stored in multiple file directories, --fastqs accepts values like:

--fastqs=/path/to/directory1,/path/to/directory2,...

Note that all FASTQ files under these directories will be loaded for analysis.
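
Since every FASTQ in the given directories is loaded, it can help to build the --fastqs value programmatically and to list what would be picked up before starting a run. The sketch below is a convenience script, not part of SAW: both function names are hypothetical, only files directly inside each directory are listed (whether SAW also descends into subdirectories is not stated here), and the .fq.gz suffix is taken from the naming schemes on this page.

```python
from pathlib import Path
from typing import Iterable, List


def build_fastqs_option(directories: Iterable[str]) -> str:
    """Join directory paths into a comma-separated --fastqs value."""
    dirs = [str(d) for d in directories]
    if not dirs:
        raise ValueError("at least one directory is required")
    if any("," in d for d in dirs):
        # A comma inside a path would be indistinguishable from the separator.
        raise ValueError("directory paths must not contain commas")
    return "--fastqs=" + ",".join(dirs)


def list_fastq_files(directories: Iterable[str], suffix: str = ".fq.gz") -> List[Path]:
    """List files with the given suffix located directly in each directory."""
    found: List[Path] = []
    for d in directories:
        found.extend(
            p for p in Path(d).iterdir()
            if p.is_file() and p.name.endswith(suffix)
        )
    return sorted(found)


option = build_fastqs_option(["/path/to/directory1", "/path/to/directory2"])
print(option)  # --fastqs=/path/to/directory1,/path/to/directory2
```

Reviewing the output of `list_fastq_files` is a quick way to spot stray FASTQs that do not belong to the sample being analyzed.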

For paired FASTQs, the file prefix indicates the sequencing slide, lane number and read index. A standard paired-FASTQ file follows the naming scheme: <slide>_<lane_number>_<read_index>.fq.gz. Your paired-FASTQ file hierarchy looks similar to this:

/saw/datasets/paired_fastqs
                ├── TestFlowcell01_L01_read_1.fq.gz
                ├── TestFlowcell01_L01_read_2.fq.gz
                ├── TestFlowcell01_L03_read_1.fq.gz
                ├── TestFlowcell01_L03_read_2.fq.gz
                └── ...
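
A quick way to confirm that every lane has both read files is to parse the file names against the naming scheme. The sketch below is an assumption-laden helper, not a SAW feature: the regular expression is inferred from the example tree above (lane written as L followed by digits, read index written as read_1 or read_2), and `find_unpaired` is a hypothetical name.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# Pattern inferred from the example names, e.g. TestFlowcell01_L01_read_1.fq.gz
PAIRED_NAME = re.compile(
    r"^(?P<slide>.+)_(?P<lane>L\d+)_read_(?P<read>[12])\.fq\.gz$"
)


def find_unpaired(file_names: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (slide, lane) keys for which read 1 or read 2 is missing."""
    reads_seen: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
    for name in file_names:
        match = PAIRED_NAME.match(name)
        if match:  # names that do not follow the scheme are ignored
            reads_seen[(match["slide"], match["lane"])].add(match["read"])
    return sorted(key for key, reads in reads_seen.items() if reads != {"1", "2"})


names = [
    "TestFlowcell01_L01_read_1.fq.gz",
    "TestFlowcell01_L01_read_2.fq.gz",
    "TestFlowcell01_L03_read_1.fq.gz",
]
print(find_unpaired(names))  # [('TestFlowcell01', 'L03')]
```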

For grouped FASTQs, the file prefix indicates the sequencing slide, lane number, sample barcode and split index. A standard grouped-FASTQ file follows the naming scheme: <slide>_<lane_number>_<sample_barcode>_<split_index>.fq.gz.

Because the data is split for storage, grouped FASTQs must be used as complete sets of 16 or 64 files.

Your grouped-FASTQ file hierarchy looks similar to this:

/saw/datasets/grouped_fastqs
               ├── TestFlowcell02_L01_25_1.fq.gz
               ├── TestFlowcell02_L01_25_2.fq.gz
               ├── TestFlowcell02_L01_25_3.fq.gz
               ├── TestFlowcell02_L01_25_4.fq.gz
               ├── TestFlowcell02_L01_25_5.fq.gz
               ├── TestFlowcell02_L01_25_6.fq.gz
               ├── TestFlowcell02_L01_25_7.fq.gz
               ├── TestFlowcell02_L01_25_8.fq.gz
               ├── TestFlowcell02_L01_25_9.fq.gz
               ├── TestFlowcell02_L01_25_10.fq.gz
               ├── TestFlowcell02_L01_25_11.fq.gz
               ├── TestFlowcell02_L01_25_12.fq.gz
               ├── TestFlowcell02_L01_25_13.fq.gz
               ├── TestFlowcell02_L01_25_14.fq.gz
               ├── TestFlowcell02_L01_25_15.fq.gz
               ├── TestFlowcell02_L01_25_16.fq.gz
               └── ...
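
Because grouped FASTQs only work as complete sets, a completeness check on the file names can catch a missing part early. The following sketch rests on assumptions drawn from the example tree above: the lane is written as L followed by digits, the sample barcode is numeric, and split indices run from 1 to 16 or from 1 to 64. The function name `incomplete_groups` is hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# Pattern inferred from the example names, e.g. TestFlowcell02_L01_25_1.fq.gz
GROUPED_NAME = re.compile(
    r"^(?P<slide>.+)_(?P<lane>L\d+)_(?P<barcode>\d+)_(?P<split>\d+)\.fq\.gz$"
)


def incomplete_groups(file_names: Iterable[str]) -> List[Tuple[str, str, str]]:
    """Return (slide, lane, barcode) keys whose split files do not form
    a full set numbered 1..16 or 1..64."""
    splits: Dict[Tuple[str, str, str], Set[int]] = defaultdict(set)
    for name in file_names:
        match = GROUPED_NAME.match(name)
        if match:  # names that do not follow the scheme are ignored
            key = (match["slide"], match["lane"], match["barcode"])
            splits[key].add(int(match["split"]))

    complete_sets = (set(range(1, 17)), set(range(1, 65)))
    return sorted(key for key, found in splits.items() if found not in complete_sets)


full_set = [f"TestFlowcell02_L01_25_{i}.fq.gz" for i in range(1, 17)]
print(incomplete_groups(full_set))       # []
print(incomplete_groups(full_set[:-1]))  # [('TestFlowcell02', 'L01', '25')]
```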

Paired FASTQs and grouped FASTQs cannot be mixed in a single SAW count run.
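
A pre-flight check for accidental mixing could classify each file name by storage type. This is a rough sketch rather than anything SAW provides: the two patterns are inferred from the example file names on this page (including the assumption that the sample barcode is numeric), and `storage_types` and `is_mixed` are hypothetical helper names.

```python
import re
from typing import Iterable, Set

# Patterns inferred from the example names on this page.
PAIRED = re.compile(r"^.+_L\d+_read_[12]\.fq\.gz$")
GROUPED = re.compile(r"^.+_L\d+_\d+_\d+\.fq\.gz$")


def storage_types(file_names: Iterable[str]) -> Set[str]:
    """Return the set of storage types ('paired', 'grouped') found among the names."""
    kinds: Set[str] = set()
    for name in file_names:
        if PAIRED.match(name):
            kinds.add("paired")
        elif GROUPED.match(name):
            kinds.add("grouped")
    return kinds


def is_mixed(file_names: Iterable[str]) -> bool:
    """True if both paired and grouped FASTQ names are present."""
    return len(storage_types(file_names)) > 1


print(is_mixed([
    "TestFlowcell01_L01_read_1.fq.gz",
    "TestFlowcell02_L01_25_1.fq.gz",
]))  # True
```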

Last updated 9 months ago
