Build index files

SAW makeRef

makeRef is a companion tool that builds the reference index files needed by the SAW count pipeline.

Because makeRef builds indexes for three bioinformatics tools (STAR, Bowtie2, and Kraken2), the --mode parameter selects which one is run.

STOmics R&D provides pre-built reference genome index files that can be used directly. Download the references you need from the Download Center.

Transcriptome

For STAR

Genome (FASTA) and annotation (GTF/GFF) files are needed to build the index files used for read alignment during a SAW count run. Development teams from STOmics and Intel restructured the index files to make read searching and alignment more efficient.

Build transcriptome reference indexes for STAR alignment.

STAR

cd /saw/datasets/reference

saw makeRef \
    --mode=STAR \
    --fasta=/path/to/FASTA \
    --gtf=/path/to/GTF/or/GFF \
    --genome=./transcriptome

After the command finishes, a standard, SAW-compatible directory structure is generated automatically. The output folder, named according to --genome, includes:

all input FASTAs in ./transcriptome/fasta,
a checked annotation file in ./transcriptome/genes,
STAR index files optimized by STOmics Tech in ./transcriptome/STAR.

From SAW 8.2 onward, checkGTF runs automatically when makeRef builds STAR index files. This ensures that the output GTF/GFF under ./transcriptome/genes has been format-checked and meets the read-annotation requirements of SAW count. If there is an issue with the input GTF/GFF file, modify it according to the checkGTF processing log.

/saw/datasets/reference/transcriptome
├── fasta
│     └── genome.fa
├── genes
│     ├── checkGTF_YYYYMMDD_HHMMSS.log
│     └── genes.gtf
└── STAR
      ├── chrLength.txt
      ├── chrNameLength.txt
      ├── chrName.txt
      ├── chrStart.txt
      ├── exonGeTrInfo.tab
      ├── exonInfo.tab
      ├── FMindex
      ├── geneInfo.tab
      ├── Genome
      ├── genomeParameters.txt
      ├── SA
      ├── SAindex
      ├── SAindexAux
      ├── sjdbInfo.txt
      ├── sjdbList.fromGTF.out.tab
      ├── sjdbList.out.tab
      └── transcriptInfo.tab
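
A quick sanity check of the generated folder can be scripted. The sketch below is not part of SAW: the helper name missing_reference_parts and the choice of which entries count as required are assumptions based only on the directory tree above.

```python
from pathlib import Path

# Entries expected under the --genome folder, taken from the tree above.
# Treating exactly these as "required" is an assumption of this sketch.
EXPECTED = [
    "fasta/genome.fa",
    "genes/genes.gtf",
    "STAR/Genome",
    "STAR/SA",
    "STAR/SAindex",
]


def missing_reference_parts(reference_dir):
    """Return the expected entries that are absent from reference_dir."""
    root = Path(reference_dir)
    return [rel for rel in EXPECTED if not (root / rel).exists()]


if __name__ == "__main__":
    # Hypothetical location; replace it with your own --genome folder.
    missing = missing_reference_parts("/saw/datasets/reference/transcriptome")
    print("missing:", missing if missing else "none")
```

An empty result only means that the listed files exist; it does not validate their contents.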

Note that the STOmics R&D team has extended several parameters for usability and analysis:

--fasta accepts one or more FASTA genome files and merges all input files.
--rRNA-fasta accepts rRNA sequences and adds them to the base genome given by --fasta.
--gtf accepts a GTF/GFF annotation file and calls the checkGTF module to check the file format.
--genome is required to construct a SAW-compatible reference folder. Give it the name of a folder that does not yet exist.

These four parameters must be passed as SAW makeRef options on the command line, not through --params-config.

With rRNA

If you plan to remove rRNA fragments during SAW analysis, use --rRNA-fasta to pass the rRNA sequences separately. They are added to the --fasta genome after redundancy removal.

Key steps of the processing:

Step 1: because the rRNA fragments given by --rRNA-fasta are short and highly repetitive, the pipeline first removes redundant sequences.

Step 2: add the rRNA sequences to the --fasta file(s), appending the suffix '_rRNA' to the chromosome name (for example '1_rRNA') to distinguish them from the base genome.

Step 3: build index files using the genome integrated with de-duplicated rRNA information.
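
The three steps can be illustrated with a small sketch. This is not SAW's implementation: the document does not say how redundancy is removed, so the code assumes the simplest rule (drop rRNA records whose sequence is identical, ignoring case), and all function names are hypothetical.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (name, sequence) tuples."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            # Keep only the first word of the header as the sequence name.
            fields = line[1:].split()
            name, chunks = (fields[0] if fields else ""), []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def dedup_and_tag_rrna(rrna_records):
    """Steps 1 and 2: drop duplicate sequences, then add the '_rRNA' suffix."""
    seen, tagged = set(), []
    for name, seq in rrna_records:
        key = seq.upper()  # assumption: exact, case-insensitive duplicates only
        if key in seen:
            continue
        seen.add(key)
        tagged.append((name + "_rRNA", seq))
    return tagged


def merge(genome_records, rrna_records):
    """Input for step 3: genome records followed by the tagged rRNA records."""
    return list(genome_records) + dedup_and_tag_rrna(rrna_records)


if __name__ == "__main__":
    genome = parse_fasta(">1\nACGTACGT\n")
    rrna = parse_fasta(">1\nGGCC\n>2\nggcc\n>3\nTTAA\n")
    for name, seq in merge(genome, rrna):
        print(name, seq)
```

In the toy input, records 1 and 2 of the rRNA file carry the same sequence, so only the first survives and is renamed 1_rRNA, which keeps it distinct from chromosome 1 of the base genome.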

STAR with rRNA

cd /saw/datasets/reference

saw makeRef \
    --mode=STAR \
    --fasta=/path/to/FASTA \
    --rRNA-fasta=/path/to/rRNA/FASTA \
    --gtf=/path/to/GTF/or/GFF \
    --genome=./transcriptome_with_rRNA

The output has the same structure as in the previous example.

/saw/datasets/reference/transcriptome_with_rRNA         
├── fasta
│     └── genome.fa
├── genes
│     ├── checkGTF_YYYYMMDD_HHMMSS.log
│     └── genes.gtf
└── STAR
      ├── chrLength.txt
      ├── chrNameLength.txt
      ├── chrName.txt
      ├── chrStart.txt
      ├── exonGeTrInfo.tab
      ├── exonInfo.tab
      ├── FMindex
      ├── geneInfo.tab
      ├── Genome
      ├── genomeParameters.txt
      ├── SA
      ├── SAindex
      ├── SAindexAux
      ├── sjdbInfo.txt
      ├── sjdbList.fromGTF.out.tab
      ├── sjdbList.out.tab
      └── transcriptInfo.tab

Special settings

If you are working with unusual genome datasets, such as exceptionally large genomes, the default settings may cause the task to fail. The defaults of makeRef may also be insufficient for downstream analysis when the genome contains small genomic fragments or long intronic regions.

In these cases, use --params-config for more detailed parameter adjustments. Enter the original STAR arguments as a plain string.

For instance, a genome containing a very large number of chromosomes/scaffolds (e.g., more than 5,000) may exhaust memory while the reference is being built. To reduce RAM consumption, set --genomeChrBinNbits = min(18, log2[max(GenomeLength/NumberOfReferences, ReadLength)]).

For example, if the reference genome has a size of 14 GB and contains 90,000 chromosomes or scaffolds, calculate the value of --genomeChrBinNbits using the formula above. In this scenario, it is advisable to set the parameter value to 17.
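
The arithmetic behind that value can be reproduced with a short sketch. The function name, the default read length of 100, and the choice to round down to an integer are assumptions; rounding down is at least consistent with the recommended value of 17.

```python
import math


def genome_chr_bin_nbits(genome_length, n_references, read_length=100):
    """min(18, log2(max(GenomeLength / NumberOfReferences, ReadLength))), rounded down."""
    per_reference = max(genome_length / n_references, read_length)
    return min(18, math.floor(math.log2(per_reference)))


if __name__ == "__main__":
    # 14e9 bases / 90,000 references is about 155,556 bases per reference,
    # and log2(155,556) is about 17.2, which rounds down to 17.
    print(genome_chr_bin_nbits(14_000_000_000, 90_000))  # prints 17
```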

For more information on index building parameters for specific genomes, please refer to the STAR User Manual.

with --params-config

cd /saw/datasets/reference

saw makeRef \
    --mode=STAR \
    --fasta=/path/to/FASTA \
    --gtf=/path/to/GTF/or/GFF \
    --genome=./transcriptome \
    --params-config='--genomeChrBinNbits=17 --runThreadN=24'

Simple use

Because the output directory already follows the SAW-compatible layout, pass it directly to --reference when running SAW count:

saw count \
...
--reference=/saw/datasets/reference/transcriptome

or

saw count \
...
--reference=/saw/datasets/reference/transcriptome_with_rRNA

Microorganism

SAW count now supports microorganism analysis for FFPE tissue samples. If you are interested in the microbes in your FFPE data, use --microorganism-detect and --ref-libraries together when running SAW count.

Before starting the pipeline, build the index files for each tool: STAR for host transcriptome alignment, Bowtie2 for host-read removal, and Kraken2 for taxonomic classification of microbes.

For Bowtie2

In SAW count, microorganism analysis requires removing the host information from the unmapped reads. Bowtie2 plays an important role in the removal.

Bowtie2

# Scenario 1
cd /saw/datasets/reference

saw makeRef \
    --mode=Bowtie2 \
    --fasta=/path/to/host/FASTA1 \
    --basename=mouse_genome_rRNA \
    --genome=./Bowtie2

After the command finishes, the output directory contains the following files:

/saw/datasets/reference/Bowtie2
├── mouse_genome_rRNA.fa  ##host FASTA
├── mouse_genome_rRNA.1.bt2  ##Bowtie2 index files, suffixed with .bt2
├── mouse_genome_rRNA.2.bt2
├── mouse_genome_rRNA.3.bt2
├── mouse_genome_rRNA.4.bt2
├── mouse_genome_rRNA.rev.1.bt2
└── mouse_genome_rRNA.rev.2.bt2

SAW makeRef provides simple and essential parameters from the bowtie2-build indexer, for basic microorganism analysis in SAW count.

There are three ways to access the full functionality of Bowtie2:

Use the original Bowtie2 software.
Use --params-config for complex parameters of the original software. Enter the original Bowtie2 arguments as a plain string.
Use --params-csv to supply the complex parameters of the original software in a CSV file.

Bowtie2

# Scenario 2
cd /saw/datasets/reference

saw makeRef \
    --mode=Bowtie2 \
    --params-config="/path/to/host/FASTA <base_name>"

Bowtie2

# Scenario 3
cd /saw/datasets/reference

saw makeRef \
    --mode=Bowtie2 \
    --params-csv=/path/to/parameter/setting/Bowtie2_build.csv

The parameter-setting CSV looks like this:

Bowtie2_build.csv

Parameter,Value
,/path/to/host/FASTA
,<basename>
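
Comparing Scenario 2 with Scenario 3 suggests that rows with an empty Parameter column carry positional arguments. The sketch below turns such a CSV into the equivalent plain argument string. How SAW actually parses the file is not described here, so this mapping is an assumption and csv_to_params_string is a hypothetical helper.

```python
import csv
import io


def csv_to_params_string(csv_text):
    """Join Parameter/Value rows into one argument string.

    Rows with an empty Parameter are treated as positional arguments
    (an assumption of this sketch, not documented SAW behavior).
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    parts = []
    for row in reader:
        param = (row.get("Parameter") or "").strip()
        value = (row.get("Value") or "").strip()
        if param:
            parts.append(param)
        if value:
            parts.append(value)
    return " ".join(parts)


if __name__ == "__main__":
    bowtie2_csv = "Parameter,Value\n,/path/to/host/FASTA\n,<basename>\n"
    print(csv_to_params_string(bowtie2_csv))
```

For the file above, the result has the same shape as the string passed to --params-config in Scenario 2.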

For Kraken2

Kraken2 is designed for the taxonomic classification of metagenomic sequences. In SAW count, microorganism analysis uses Kraken2 to quickly and accurately identify the microorganisms present in environmental samples or complex microbial communities. Download the databases from the Kraken2 database website.

#Scenario 1 
cd /saw/datasets/reference

## Step 1 (optional): add the FASTA files needed for a custom database
saw makeRef \
    --mode=Kraken2 \
    --fasta=/path/to/host/FASTA \
    --database=/path/to/Kraken2/database

## Step 2: build the database
saw makeRef \
    --mode=Kraken2 \
    --database=/path/to/Kraken2/database

There is no need to pass --genome for Kraken2 index files, because all modifications and additions happen inside the database folder.

Before building a custom database (Step 2), place a ./taxonomy/ directory under the database folder. The taxonomy data can be obtained from NCBI/Taxonomy.

After the commands finish, the output directory contains the following files:

/saw/datasets/reference/Kraken2_db1
├── hash.k2d  ##Contains the minimizer to taxon mappings
├── opts.k2d  ##Contains information about the options used to build the database
├── taxo.k2d  ##Contains taxonomy information used to build the database
├── inspect.txt
├── seqid2taxid.map
├── database100mers.kmer_distrib
├── database150mers.kmer_distrib
├── database200mers.kmer_distrib
├── database250mers.kmer_distrib
├── database300mers.kmer_distrib
├── database50mers.kmer_distrib
└── database75mers.kmer_distrib

SAW makeRef provides simple and essential parameters, for basic microorganism analysis in SAW count.

There are three ways to access the full functionality of Kraken2:

Use the original Kraken2.
Use --params-config for complex parameters of the original software. Enter the original Kraken2 arguments as a plain string.
Use --params-csv to supply the complex parameters of the original Kraken2 in a CSV file.

#Scenario 2 
cd /saw/datasets/reference

saw makeRef \
    --mode=Kraken2 \
    --params-csv=/path/to/parameter/setting/Kraken2_build.csv

The parameter-setting CSV looks like this:

Kraken2_build.csv

Parameter,Value
--add-to-library,/path/to/FASTA
--db,/path/to/db

Reference libraries

After the index files for STAR, Bowtie2, and Kraken2 have been built, create a CSV file for --ref-libraries that lists all references needed for microorganism analysis.

Reference,Type
/saw/datasets/reference/transcriptome,STAR
/saw/datasets/reference/Bowtie2,Bowtie2
/saw/datasets/reference/Kraken2_db1,Kraken2

--ref-libraries is not compatible with --reference.
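
The CSV can be written by hand or generated. The following sketch produces the file shown above. The helper name, the output file name ref_libraries.csv, and the check that Type is one of STAR, Bowtie2, or Kraken2 are assumptions drawn from the example rows, not a documented SAW interface.

```python
import csv

# Types that appear in the example above; treating this as the full set is an assumption.
ALLOWED_TYPES = {"STAR", "Bowtie2", "Kraken2"}


def write_ref_libraries(rows, out_path):
    """Write (reference_path, type) pairs as a Reference,Type CSV."""
    rows = list(rows)
    for path, ref_type in rows:
        if ref_type not in ALLOWED_TYPES:
            raise ValueError(f"unknown reference type {ref_type!r} for {path}")
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle, lineterminator="\n")
        writer.writerow(["Reference", "Type"])
        writer.writerows(rows)


if __name__ == "__main__":
    write_ref_libraries(
        [
            ("/saw/datasets/reference/transcriptome", "STAR"),
            ("/saw/datasets/reference/Bowtie2", "Bowtie2"),
            ("/saw/datasets/reference/Kraken2_db1", "Kraken2"),
        ],
        "ref_libraries.csv",  # hypothetical output name
    )
```

The resulting file is the one to give to --ref-libraries, together with --microorganism-detect, when running SAW count.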


