Check annotation files

SAW checkGTF

SAW count accepts annotation files in the standard format and verifies them automatically before read alignment. In addition to the usual format check, checkGTF can also extract specific annotations.

SAW accepts annotation files with the suffixes gtf/gtf.gz, gff/gff.gz, and gff3/gff3.gz.
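
As a quick pre-flight check before handing a file to SAW, the suffix list above can be tested with a small shell function. This is only a sketch: `is_supported_annotation` is a hypothetical helper name, not a SAW command, and it looks at the file name only, not the content.

```shell
# Hypothetical helper (not part of SAW): succeed only when the file name
# ends in one of the suffixes listed above.
is_supported_annotation() {
    case "$1" in
        *.gtf|*.gtf.gz|*.gff|*.gff.gz|*.gff3|*.gff3.gz) return 0 ;;
        *) return 1 ;;
    esac
}

# Example use with made-up file names.
for f in genes.gtf.gz genes.bed; do
    if is_supported_annotation "$f"; then
        echo "$f: suffix accepted"
    else
        echo "$f: suffix not accepted"
    fi
done
```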

If the file has any of the following formatting issues, which are common errors in annotation files, SAW checkGTF will fix some of them to ensure the file can be used properly.

*Note that, by default, the program assumes the input annotation file is correctly sorted, and it identifies gene annotation information in gene block format.

| Issue | Solution |
| --- | --- |
| In the seventh column, which indicates the sense and antisense strands, the "-" and "_" symbols are mistakenly mixed. | Check each row of the annotation file and correct the erroneous "_" to "-". |
| A gene block lacks a gene ID. | Discard the entire gene block and issue a warning. |
| A gene row is missing but the transcript row has information. | Discard the entire gene block and issue a warning. |
| A gene block lacks a gene name. | Use the gene ID to fill in the missing gene name. |
| A gene row exists but some information is missing from its child rows. | Child rows inherit the missing information from their gene row. |
| A transcript ID is missing. | Fill it with the parent gene ID suffixed with a sequential number (e.g., XXX.1). |
| A row contains multiple attributes with the same name. | Keep only the last <attribute:value> of the duplicated entries for subsequent annotation. |
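
To make a few of these rules concrete, the sketch below mimics three of them with plain `awk` on GTF-style attributes (`gene_id "..."`). It is an illustration under stated assumptions, not SAW's implementation: the function names `fix_strand`, `fill_gene_name`, and `check_gene_blocks` are hypothetical, the input is assumed to be uncompressed and tab-separated, and the block check only reports rows instead of discarding them.

```shell
# Illustrative helpers only; they are NOT SAW's implementation.
# Each reads GTF text from stdin (or file arguments) and writes to stdout.

# Strand column (7): replace a mistaken "_" with "-".
fix_strand() {
    awk 'BEGIN { FS = OFS = "\t" }
         !/^#/ && $7 == "_" { $7 = "-" }
         { print }' "$@"
}

# Rows that have a gene_id but no gene_name: append gene_name = gene_id.
fill_gene_name() {
    awk 'BEGIN { FS = OFS = "\t" }
         /^#/ || NF < 9 { print; next }
         $9 ~ /gene_id "[^"]+"/ && $9 !~ /gene_name "/ {
             id = $9
             sub(/.*gene_id "/, "", id)   # drop everything up to the value
             sub(/".*/, "", id)           # drop everything after the value
             sub(/;?[ ]*$/, "", $9)       # strip the trailing semicolon
             $9 = $9 "; gene_name \"" id "\";"
         }
         { print }' "$@"
}

# Report non-gene rows whose gene_id differs from the preceding gene row.
# Exits with status 1 if any such row is found.
check_gene_blocks() {
    awk -F '\t' '
        /^#/ || NF < 9 { next }
        {
            id = ""
            if (match($9, /gene_id "[^"]*"/)) {
                # skip the 9-character prefix and the closing quote
                id = substr($9, RSTART + 9, RLENGTH - 10)
            }
        }
        $3 == "gene" { current = id; next }
        id != current {
            printf "line %d: %s row outside its gene block (gene_id=%s)\n", NR, $3, id
            bad = 1
        }
        END { exit (bad ? 1 : 0) }
    ' "$@"
}

# Demo on a single made-up gene row with a wrong strand symbol and no gene_name.
printf 'chr1\tsrc\tgene\t1\t100\t.\t_\t.\tgene_id "G1";\n' | fix_strand | fill_gene_name
```

In the demo, the row comes back with "-" in column 7 and `gene_name "G1"` appended to the attributes. `check_gene_blocks` relies on the same assumption stated in the note above: rows are sorted so that each gene row precedes its child rows.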

A simple check can be run as follows:

saw checkGTF \
    --input-gtf=/path/to/input/GTF/or/GFF \
    --output-gtf=/path/to/output/GTF/or/GFF
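
Because some gene blocks may be discarded during the check, it can be useful to compare how many rows of each feature type the input and output files contain. The following is a minimal sketch using standard tools only; `feature_counts` is a hypothetical helper, not a SAW subcommand, and it assumes an uncompressed, tab-separated file.

```shell
# Hypothetical helper, not part of SAW: count rows per feature type
# (column 3) in an uncompressed GTF/GFF file, skipping comment lines.
feature_counts() {
    awk -F '\t' '!/^#/ && NF >= 3 { n[$3]++ }
                 END { for (f in n) print f "\t" n[f] }' "$1" | LC_ALL=C sort
}

# Tiny demo on a throwaway file. With real data, call feature_counts on the
# files given to --input-gtf and --output-gtf and compare the two listings.
demo=$(mktemp)
printf '#comment\nchr1\tsrc\tgene\t1\t9\nchr1\tsrc\texon\t1\t4\nchr1\tsrc\texon\t6\t9\n' > "$demo"
feature_counts "$demo"
rm -f "$demo"
```

A drop in the `gene` count between input and output would be consistent with blocks having been discarded, and the warnings issued by checkGTF should indicate which ones.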

If you want to extract specific annotations, like gene_biotype:protein_coding or gene_biotype:lincRNA, run as:
