Preparation of reference

SAW provides two complementary pipelines for preparing genome reference and annotation files: SAW makeRef and SAW checkGTF. Before running SAW count, build one or more reference index files.

SAW makeRef

SAW makeRef is a complementary tool that builds the reference index files needed by the SAW count pipeline.

Because makeRef builds index files for three bioinformatics tools (STAR, Bowtie2, and Kraken2), --mode selects which one to build for.

Transcriptome

For STAR

Annotation (GTF/GFF) and genome (FASTA) files are needed to build the index files used for read alignment during a SAW count run.

STAR

cd /saw/datasets/reference

saw makeRef \
    --mode=STAR \
    --fasta=/path/to/FASTA1 \
    --gtf=/path/to/GTF/or/GFF \
    --genome=./transcriptome

Pass a folder name that does not yet exist to the --genome parameter.

After the command finishes, a standard, SAW-compatible directory structure is generated automatically. The output folder, named according to --genome, contains all input FASTAs in ./transcriptome/fasta, the annotation file in ./transcriptome/genes, and STAR index files optimized by STOmics Tech in ./transcriptome/STAR.

/saw/datasets/reference/transcriptome
├── fasta
│    └── genome.fa
├── genes
│    └── genes.gtf
└── STAR
      ├── chrLength.txt
      ├── chrNameLength.txt
      ├── chrName.txt
      ├── chrStart.txt
      ├── exonGeTrInfo.tab
      ├── exonInfo.tab
      ├── FMindex
      ├── geneInfo.tab
      ├── Genome
      ├── genomeParameters.txt
      ├── SA
      ├── SAindex
      ├── SAindexAux
      ├── sjdbInfo.txt
      ├── sjdbList.fromGTF.out.tab
      ├── sjdbList.out.tab
      └── transcriptInfo.tab
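
Before passing a freshly built folder to SAW count, it can help to confirm that it has the layout shown above. The following is a minimal shell sketch and not part of SAW: the function name check_star_reference and the choice of which files to test are assumptions made for illustration.

```shell
# Hypothetical helper (not part of SAW): check that a folder built with
# `saw makeRef --mode=STAR` contains a few representative files from the
# documented tree. Extend the list if you want a stricter check.
check_star_reference() {
    local ref_dir="$1"
    local missing=0
    local f
    for f in fasta/genome.fa genes/genes.gtf STAR/Genome STAR/SA STAR/SAindex; do
        if [ ! -e "${ref_dir}/${f}" ]; then
            echo "missing: ${ref_dir}/${f}" >&2
            missing=1
        fi
    done
    # Returns 0 when every file is present, 1 otherwise.
    return "${missing}"
}

# Example:
#   check_star_reference /saw/datasets/reference/transcriptome
```

The function only tests for the presence of files; it does not validate their contents.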

With rRNA

If you plan to remove rRNA fragments during SAW analysis, use --rRNA-fasta to supply the rRNA sequences separately; they will be added to the --fasta input.

Key steps of the processing:

Step 1: because the rRNA fragments in --rRNA-fasta are short and highly repetitive, the pipeline first removes redundant sequences.

Step 2: append the rRNA sequences to the --fasta file(s), adding the suffix '_rRNA' to each chromosome name (for example, '1_rRNA') to distinguish rRNA sequences from the base genome.

Step 3: build index files using the genome integrated with de-duplicated rRNA information.
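
To make Steps 1 and 2 concrete, here is a small shell sketch of the idea. It is an illustration only: saw makeRef performs these steps itself, and its internal de-duplication may work differently. The function name dedup_and_tag_rrna and the rule "two records are redundant when their sequences are identical" are assumptions.

```shell
# Illustrative only (not SAW code).
# Step 1: drop FASTA records whose sequence has already been seen.
# Step 2: append "_rRNA" to the name of each remaining record.
# Sequences are written on a single line, in upper case.
dedup_and_tag_rrna() {
    awk '
        function emit() {
            if (name != "" && !(seq in seen)) {
                seen[seq] = 1
                print ">" name "_rRNA"
                print seq
            }
        }
        /^>/ { emit(); name = substr($1, 2); seq = ""; next }
             { seq = seq toupper($0) }
        END  { emit() }
    ' "$1"
}

# Example (hypothetical file names):
#   dedup_and_tag_rrna rRNA.fa > rRNA.tagged.fa
#   cat genome.fa rRNA.tagged.fa > genome_with_rRNA.fa
```

Only the first word of each header line is kept as the sequence name, so a header such as ">1 some description" becomes ">1_rRNA".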

STAR with rRNA

cd /saw/datasets/reference

saw makeRef \
    --mode=STAR \
    --fasta=/path/to/FASTA \
    --gtf=/path/to/GTF/or/GFF \
    --rRNA-fasta=/path/to/rRNA/FASTA \
    --genome=./transcriptome_with_rRNA

The output has the same structure as in the previous example.

/saw/datasets/reference/transcriptome_with_rRNA         
├── fasta
│     └── genome.fa
├── genes
│     └── genes.gtf
└── STAR
      ├── chrLength.txt
      ├── chrNameLength.txt
      ├── chrName.txt
      ├── chrStart.txt
      ├── exonGeTrInfo.tab
      ├── exonInfo.tab
      ├── FMindex
      ├── geneInfo.tab
      ├── Genome
      ├── genomeParameters.txt
      ├── SA
      ├── SAindex
      ├── SAindexAux
      ├── sjdbInfo.txt
      ├── sjdbList.fromGTF.out.tab
      ├── sjdbList.out.tab
      └── transcriptInfo.tab

Simple use

Because the output directory follows a standard layout, you can pass it directly to --reference in SAW count:

...
--reference=/saw/datasets/reference/transcriptome
or 
--reference=/saw/datasets/reference/transcriptome_with_rRNA

Microorganism

Microorganism analysis is now supported in SAW count for FFPE tissue samples. If your FFPE analysis focuses on microbes, use --microorganism-detect and --ref-libraries together when running SAW count.

Before starting the pipeline, build the index files for each tool: STAR for host transcriptome alignment, Bowtie2 for host read removal, and Kraken2 for taxonomic classification of microbes.

For Bowtie2

In SAW count, microorganism analysis requires removing host sequences from the unmapped reads. Bowtie2 is used for this host-removal step.

Bowtie2

#Scenario 1
cd /saw/datasets/reference

saw makeRef \
    --mode=Bowtie2 \
    --fasta=/path/to/host/FASTA1,/path/to/host/FASTA2,... \
    --basename=mouse_genome_rRNA \
    --genome=./Bowtie2

After the command finishes, the output directory contains the following files:

/saw/datasets/reference/Bowtie2
├── mouse_genome_rRNA.fa  ##host FASTA
├── mouse_genome_rRNA.1.bt2  ##Bowtie2 index files, suffixed with .bt2
├── mouse_genome_rRNA.2.bt2
├── mouse_genome_rRNA.3.bt2
├── mouse_genome_rRNA.4.bt2
├── mouse_genome_rRNA.rev.1.bt2
└── mouse_genome_rRNA.rev.2.bt2
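
A quick way to confirm that the index is complete is to test for the six .bt2 files listed above. This is a minimal sketch, not part of SAW; the function name check_bowtie2_index is hypothetical. Note that bowtie2-build writes files with the .bt2l suffix for very large genomes, which this sketch does not handle.

```shell
# Hypothetical helper (not part of SAW): check that all six small-index
# files exist for a given folder and basename.
check_bowtie2_index() {
    local index_dir="$1"
    local base="$2"
    local missing=0
    local suffix
    for suffix in 1.bt2 2.bt2 3.bt2 4.bt2 rev.1.bt2 rev.2.bt2; do
        if [ ! -e "${index_dir}/${base}.${suffix}" ]; then
            echo "missing: ${index_dir}/${base}.${suffix}" >&2
            missing=1
        fi
    done
    # Returns 0 when all six files are present, 1 otherwise.
    return "${missing}"
}

# Example:
#   check_bowtie2_index /saw/datasets/reference/Bowtie2 mouse_genome_rRNA
```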

SAW makeRef exposes only the essential parameters of the bowtie2-build indexer, which is enough for basic microorganism analysis in SAW count.

There are two ways to access the full functionality of Bowtie2:

Use the original Bowtie2 software directly.
Use --params-csv to pass more complex parameters of the original Bowtie2.

Bowtie2

#Scenario 2
cd /saw/datasets/reference

saw makeRef \
    --mode=Bowtie2 \
    --params-csv=/path/to/parameter/setting/Bowtie2_build.csv

More about the parameter-setting CSV:

Bowtie2_build.csv

Parameter,Value
,/path/to/host/FASTA1,/path/to/host/FASTA2,...
,<basename>

For Kraken2

Kraken2 is designed for the taxonomic classification of metagenomic sequences. In SAW count, microorganism analysis uses Kraken2 to quickly and accurately identify the microorganisms present in environmental samples or complex microbial communities. Download the databases from the Kraken2 database website.

#Scenario 1 
cd /saw/datasets/reference

##Step 1 (optional): add the FASTAs needed for a custom database
saw makeRef \
    --mode=Kraken2 \
    --fasta=/path/to/host/FASTA1,/path/to/host/FASTA2,... \
    --database=/path/to/Kraken2/database

##Step 2 build
saw makeRef \
    --mode=Kraken2 \
    --database=/path/to/Kraken2/database

There is no need to set --genome for Kraken2 index files; modifications and additions happen inside the database folder.

Before building a custom database (Step 2), place a ./taxonomy/ folder under the database folder. The taxonomy data can be obtained from NCBI Taxonomy.

After the commands finish, the output directory contains the following files:

/saw/datasets/reference/Kraken2_db1
├── hash.k2d  ##Contains the minimizer to taxon mappings
├── opts.k2d  ##Contains information about the options used to build the database
├── taxo.k2d  ##Contains taxonomy information used to build the database
├── inspect.txt
├── seqid2taxid.map
├── database100mers.kmer_distrib
├── database150mers.kmer_distrib
├── database200mers.kmer_distrib
├── database250mers.kmer_distrib
├── database300mers.kmer_distrib
├── database50mers.kmer_distrib
└── database75mers.kmer_distrib

SAW makeRef exposes only the essential parameters of Kraken2, which is enough for basic microorganism analysis in SAW count.

There are two ways to access the full functionality of Kraken2:

Use the original Kraken2 software directly.
Use --params-csv to pass more complex parameters of the original Kraken2.

#Scenario 2 
cd /saw/datasets/reference

saw makeRef \
    --mode=Kraken2 \
    --params-csv=/path/to/parameter/setting/Kraken2_build.csv

More about the parameter-setting CSV:

Kraken2_build.csv

Parameter,Value
--add-to-library,/path/to/fasta
--db,/path/to/db

Reference libraries

After building the index files for STAR, Bowtie2, and Kraken2, create a CSV for --ref-libraries that lists all the references needed for microorganism analysis.

Reference,Type
/saw/datasets/reference/transcriptome,STAR
/saw/datasets/reference/Bowtie2,Bowtie2
/saw/datasets/reference/Kraken2_db1,Kraken2

--ref-libraries is not compatible with --reference.
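
The CSV can be written by hand, or generated with a few lines of shell. The sketch below is not part of SAW: the function name write_ref_libraries and the output file name are assumptions, while the header and the Type values follow the example above.

```shell
# Hypothetical helper (not part of SAW): write a --ref-libraries CSV
# from three existing index folders.
write_ref_libraries() {
    local out_csv="$1"
    local star_dir="$2"
    local bowtie2_dir="$3"
    local kraken2_dir="$4"
    local d
    # Refuse to write the CSV if any index folder is missing.
    for d in "$star_dir" "$bowtie2_dir" "$kraken2_dir"; do
        if [ ! -d "$d" ]; then
            echo "not a directory: $d" >&2
            return 1
        fi
    done
    {
        printf 'Reference,Type\n'
        printf '%s,STAR\n' "$star_dir"
        printf '%s,Bowtie2\n' "$bowtie2_dir"
        printf '%s,Kraken2\n' "$kraken2_dir"
    } > "$out_csv"
}

# Example:
#   write_ref_libraries ref_libraries.csv \
#       /saw/datasets/reference/transcriptome \
#       /saw/datasets/reference/Bowtie2 \
#       /saw/datasets/reference/Kraken2_db1
```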

SAW checkGTF

SAW count accepts annotation files in the standard format and verifies them automatically before read alignment. In addition to the usual format check, SAW checkGTF can extract specific annotations.

SAW accepts annotation files with the suffixes gtf/gtf.gz, gff/gff.gz, and gff3/gff3.gz.

If the file has any of the following formatting issues, which are common errors in annotation files, SAW checkGTF corrects them so that the file can be used properly.

Issue: In the seventh column, which indicates the sense and antisense strands, the "-" and "_" symbols are mistakenly mixed.
Solution: Check each row of the annotation file and correct the erroneous symbol "_" to "-".

Issue: Any of "transcript_id", "transcript_name", "gene_id", or "gene_name" is missing in the GTF.
Solution: For each row, use the existing ID and name information to fill in the missing items.

Issue: Some gene or transcript rows are absent in the GTF.
Solution: Using the attributes of the exon rows (gene and transcript IDs and names), add the missing gene and transcript rows to the file.

Issue: Some mRNA rows lack parent information in the GFF.
Solution: Use the parent information of the preceding neighboring record to fill in the missing value.
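
To illustrate the first correction (a "_" in the strand column replaced with "-"), here is a small awk-based sketch. It is not SAW's implementation, and the function name fix_strand_column is hypothetical; it only shows what the fix amounts to on a tab-separated annotation file.

```shell
# Illustrative only (not SAW code): replace a stray "_" in column 7
# (strand) with "-" and leave every other line unchanged.
fix_strand_column() {
    awk 'BEGIN { FS = OFS = "\t" }
         /^#/ { print; next }          # keep comment and header lines as they are
         NF >= 7 && $7 == "_" { $7 = "-" }
         { print }' "$1"
}

# Example (hypothetical file names):
#   fix_strand_column input.gtf > fixed.gtf
```

The other corrections in the list need information from neighboring rows and are better left to saw checkGTF itself.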

A simple check runs as:

saw checkGTF \
    --input-gtf=/path/to/input/GTF/or/GFF \
    --output-gtf=/path/to/output/GTF/or/GFF

If you want to extract specific annotations, like gene_biotype:protein_coding or gene_biotype:lincRNA, run as:

saw checkGTF \
    --input-gtf=/path/to/input/GTF/or/GFF \
    --attribute=key:value \
    --output-gtf=/path/to/output/GTF/or/GFF

When --attribute is set, SAW checkGTF extracts the matching annotation records but does not perform a format check.
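
As a rough picture of what a key:value extraction does, the sketch below keeps only the GTF rows whose attribute column contains a given key and value. It is an assumption-laden illustration, not SAW's matching logic: the function name filter_gtf_by_attribute is hypothetical, it handles only GTF-style attributes written as key "value", and it uses a plain substring match.

```shell
# Illustrative only (not SAW code): keep comment lines and the rows whose
# ninth column contains the text: key "value"
filter_gtf_by_attribute() {
    local gtf="$1"
    local key="$2"
    local value="$3"
    awk -F '\t' -v pat="${key} \"${value}\"" '
        /^#/ { print; next }
        index($9, pat) > 0 { print }
    ' "$gtf"
}

# Example (hypothetical file names):
#   filter_gtf_by_attribute genes.gtf gene_biotype protein_coding > coding.gtf
```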

Last updated 6 months ago
