Matrices
Gene expression file (GEF) is a data management and storage format designed to support multidimensional datasets and high computational efficiency. The Stereo-seq analysis workflow generates bin GEF and cellbin GEF files. The bin GEF format is a hierarchically structured data model that stores one or more gene expression matrices, each combined at a different bin size. The cellbin GEF format stores expression information for each cell. Each GEF container organizes a collection of spatial gene expression matrices and includes two primary data objects: Group and Dataset. A Dataset is a multidimensional array of data elements. A Group is analogous to a file system directory: it organizes datasets and other groups in a hierarchy.
Gene expression matrix (GEM) stores spatial gene expression data. SAW generates multiple gene expression matrix files in the workflow. The basic format requires six columns, with a header row that gives the column names: gene ID, gene name, x coordinate, y coordinate, MID count, and exon count. A cellbin GEM adds a seventh column for the cell ID. In the expression matrix for the maximum-area enclosing rectangle region, several annotation rows starting with "#" precede the column header row. The header field names and field types are described in the table.
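The GEM layout described above (optional "#" annotation rows, one header row, then data rows) can be read with a few lines of Python. This is a minimal sketch, not SAW's own reader: the tab delimiter, the column names (geneID, geneName, x, y, MIDCount, ExonCount, CellID), and the sample annotation are assumptions made for illustration, since the text above only describes what each column means.

```python
import io

# Assumed names for the numeric columns; the source only states their meaning.
INT_COLUMNS = ("x", "y", "MIDCount", "ExonCount")


def read_gem(handle, delimiter="\t"):
    """Parse a GEM-like text stream into (annotations, records)."""
    annotations = []  # text of rows that start with "#"
    header = None     # column names taken from the first non-annotation row
    records = []      # one dict per data row
    for raw in handle:
        line = raw.rstrip("\r\n")
        if not line:
            continue
        if line.startswith("#"):
            annotations.append(line[1:].strip())
            continue
        fields = line.split(delimiter)
        if header is None:
            header = fields
            continue
        row = dict(zip(header, fields))
        for key in INT_COLUMNS:
            if key in row:
                row[key] = int(row[key])
        records.append(row)
    return annotations, records


# Illustrative input only; real files carry their own annotations and IDs.
sample = io.StringIO(
    "#ExampleKey=ExampleValue\n"
    "geneID\tgeneName\tx\ty\tMIDCount\tExonCount\n"
    "G0001\tGeneA\t10\t20\t3\t2\n"
    "G0002\tGeneB\t11\t20\t1\t1\n"
)
annotations, records = read_gem(sample)
```

Because rows are keyed by the header, the same function also accepts a cellbin GEM: the seventh column simply appears as one more key in each record.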
File types
The feature expression matrices generated by SAW pipelines are mainly of two types, bin GEF and cellbin GEF. They can be identified by the file extension:
.gef
The feature expression matrix file in HDF5 format for visualization. It contains the MID count for each gene at each spot. A spot is a fixed-size square binning unit within which expression values are accumulated. By default, a visualization .gef includes spot sizes of bin 1, 5, 10, 20, 50, 100, 150, 200.
.cellbin.gef
Only available when cell segmentation was performed based on a microscopy image.
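The spot definition given for .gef above (a fixed-size square in which expression is accumulated) can be illustrated by aggregating bin1 records into larger bins. The sketch below is an assumption-laden illustration, not SAW's implementation: it assumes bins are aligned to the coordinate origin and that each bin is labelled by its lower corner, and the function and record layout are hypothetical.

```python
from collections import defaultdict


def aggregate_to_bin(records, bin_size):
    """Sum bin1 MID counts into bin_size x bin_size squares.

    records: iterable of (gene, x, y, mid_count) tuples at bin1 resolution.
    Returns a dict keyed by (gene, bin_x, bin_y), where bin_x and bin_y are
    the coordinates of the square's origin-side corner (an assumption).
    """
    if bin_size < 1:
        raise ValueError("bin_size must be a positive integer")
    totals = defaultdict(int)
    for gene, x, y, mid_count in records:
        # Integer division snaps each coordinate to the start of its square.
        bin_x = (x // bin_size) * bin_size
        bin_y = (y // bin_size) * bin_size
        totals[(gene, bin_x, bin_y)] += mid_count
    return dict(totals)


# Illustrative bin1 records: (gene, x, y, MID count).
bin1 = [
    ("GeneA", 0, 0, 2),
    ("GeneA", 3, 4, 1),
    ("GeneA", 5, 0, 4),
    ("GeneB", 9, 9, 1),
]
bin5 = aggregate_to_bin(bin1, 5)
```

Here the first two GeneA records fall in the same 5 x 5 square and their counts are summed, while the third lands in the neighbouring square. With bin_size set to 1 the records are returned unchanged apart from being keyed by position.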
Common matrices
Common output files of SAW count and SAW realign are listed:
<SN>.raw.gef
Feature expression matrix covering the complete chip region. It contains only bin1 expression counts.
<SN>.gef
Feature expression matrix. It is also a visualization GEF that includes expression counts for bin1, 5, 10, 20, 50, 100, 150, 200.
<SN>.tissue.gef
Feature expression matrix under the tissue coverage region. It is also a visualization GEF that includes expression counts for bin1, 5, 10, 20, 50, 100, 150, 200.
<SN>.cellbin.gef
Cellbin feature expression matrix that records information for each cell individually, including the centroid coordinates, boundary coordinates, gene expression, and cell area.
<SN>.adjusted.cellbin.gef
Cellbin expression matrix with expanded cell borders, based on <SN>_<stainType>_mask_edm_dis_<distance>.tif.
Microorganism
If you run SAW count on Stereo-seq N FFPE data with --microorganism-detect set, the resulting microorganism spatial expression matrices are saved in /outs/feature_expression/microorganism.
Output files are listed as:
<SN>.microorganism.raw.gef
Feature expression matrix of microorganisms covering the complete chip region. It contains only bin1 expression counts.
<SN>.microorganism.gef
Feature expression matrix of microorganisms. It is also a visualization GEF that includes expression counts for bin1, 5, 10, 20, 50, 100, 150, 200.
<SN>.host_microorganism.raw.gef
Feature expression matrix of microorganisms and the host covering the complete chip region. It contains only bin1 expression counts.
<SN>.host_microorganism.gef
Feature expression matrix of microorganisms and the host. It is also a visualization GEF that includes expression counts for bin1, 5, 10, 20, 50, 100, 150, 200.
<SN>.microorganism.<classification>.gem
Feature expression matrix of a specific classification of microbes.
Classifications of microorganisms include phylum, class, order, family, genus, and species.
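The naming conventions listed in this section can be collected into a small lookup that maps a file name to the kind of matrix it holds. This is a hypothetical helper, not part of SAW; the function name, the short descriptions, and the sample SN value are illustrative, and it assumes the file names follow the patterns above exactly.

```python
# Classification levels named in the text above.
LEVELS = ("phylum", "class", "order", "family", "genus", "species")

# Longer suffixes come first so that ".raw.gef" is not mistaken for ".gef".
SUFFIX_KINDS = [
    (".host_microorganism.raw.gef", "host and microorganism, complete chip, bin1 only"),
    (".host_microorganism.gef", "host and microorganism, visualization bins"),
    (".microorganism.raw.gef", "microorganism, complete chip, bin1 only"),
    (".microorganism.gef", "microorganism, visualization bins"),
    (".adjusted.cellbin.gef", "cellbin with expanded cell borders"),
    (".cellbin.gef", "cellbin"),
    (".tissue.gef", "tissue coverage region, visualization bins"),
    (".raw.gef", "complete chip, bin1 only"),
    (".gef", "visualization bins"),
]


def describe_matrix(filename):
    """Return (sn, description) for a recognized file name, else None."""
    for level in LEVELS:
        suffix = f".microorganism.{level}.gem"
        if filename.endswith(suffix):
            return filename[: -len(suffix)], f"microorganism GEM at {level} level"
    for suffix, description in SUFFIX_KINDS:
        if filename.endswith(suffix):
            return filename[: -len(suffix)], description
    return None


# "SN001" is a made-up serial number used only for this example.
examples = [
    "SN001.raw.gef",
    "SN001.adjusted.cellbin.gef",
    "SN001.host_microorganism.gef",
    "SN001.microorganism.genus.gem",
]
described = {name: describe_matrix(name) for name in examples}
```

Matching on the longest suffix first keeps the more specific names (for example .adjusted.cellbin.gef) from being reported as a plain .gef.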