Matrices
Gene expression file (GEF) is a data management and storage format designed to support multidimensional datasets and high computational efficiency. The Stereo-seq analysis workflow generates bin GEF and cellbin GEF files. The bin GEF format is a hierarchically structured data model that stores one or more binned gene expression matrices at various bin sizes. The cellbin GEF format stores expression information for each cell. Each GEF container organizes a collection of spatial gene expression matrices using two primary data objects: Group and Dataset. A Dataset is a multidimensional array of data elements. A Group is analogous to a file system directory: it organizes datasets and other groups in a hierarchy.
Gene expression matrix (GEM) stores spatial gene expression data. SAW generates multiple gene expression matrix files in the workflow. The basic format requires six columns and a header row that gives the column names. The six columns are gene ID, gene name, x coordinate, y coordinate, MID count, and exon count. In a cellbin GEM, a seventh column holds the cell ID. In the expression matrix for the maximum enclosing rectangle region, several annotation rows starting with "#" precede the column-name row. The header field names and field types are described in the table.
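The GEM layout described above can be read with a few lines of standard-library Python. This is a minimal sketch, not SAW's own reader: it assumes a tab-separated text file, and the header names used here (geneID, geneName, x, y, MIDCount, ExonCount, CellID) as well as the sample annotation row are illustrative assumptions. Check them against the header of your own file.

```python
import csv
import io

# Columns converted to integers when present. The names are assumptions;
# adjust them to match the header row of your GEM file.
INT_COLUMNS = ("x", "y", "MIDCount", "ExonCount", "CellID")

# Hypothetical sample: one placeholder annotation row, a header row, two records.
SAMPLE_GEM = (
    "#ExampleAnnotation=placeholder\n"
    "geneID\tgeneName\tx\ty\tMIDCount\tExonCount\n"
    "G0001\tGeneA\t10\t20\t3\t2\n"
    "G0002\tGeneB\t11\t20\t1\t1\n"
)


def read_gem(handle):
    """Return (annotations, records) from a tab-separated GEM text stream."""
    annotations = []
    data_lines = []
    for line in handle:
        if line.startswith("#"):
            # Annotation rows come before the column-name row.
            annotations.append(line[1:].rstrip("\n"))
        elif line.strip():
            data_lines.append(line)
    records = []
    # The first non-annotation line is treated as the header row.
    for row in csv.DictReader(data_lines, delimiter="\t"):
        for key in INT_COLUMNS:
            if key in row:
                row[key] = int(row[key])
        records.append(row)
    return annotations, records


annotations, records = read_gem(io.StringIO(SAMPLE_GEM))
print(annotations)
print(records[0])
```

For a compressed .gem.gz file, the same function should work on a handle opened with gzip.open(path, "rt").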
File types
The feature expression matrices generated by SAW pipelines are mainly of two types, bin GEF and cellbin GEF. They can be identified by their file extension:
.gef
The feature expression matrix file in HDF5 format for visualization. It contains the MID count for each gene of each spot. A spot is a binning unit: a fixed-size square within which expression values are accumulated. By default, a visualization .gef includes spot sizes of bin 1, 5, 10, 20, 50, 100, 150, and 200.

.cellbin.gef
The cellbin feature expression matrix file in HDF5 format. It contains the spatial location and area of each cell, the MID count for each gene of each cell, and the cluster the cell belongs to. In .cellbin.gef, the cell is the smallest data unit.

Only available when cell segmentation was performed on a microscopy image.
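The idea of a spot as a fixed-size square that accumulates expression values can be sketched as follows. This is an illustration of binning in general, not SAW's implementation: the record layout (gene, x, y, MID count) and the choice to label each spot by the coordinate of its lower corner are assumptions made for the example.

```python
from collections import defaultdict


def aggregate_to_bin(records, bin_size):
    """Sum bin1 MID counts into square spots with side length `bin_size`.

    `records` is an iterable of (gene, x, y, mid_count) tuples.
    Each spot is keyed by (gene, spot_x, spot_y), where spot_x and spot_y
    are the coordinates rounded down to a multiple of `bin_size`
    (an assumed labeling convention).
    """
    binned = defaultdict(int)
    for gene, x, y, mid_count in records:
        spot_x = (x // bin_size) * bin_size
        spot_y = (y // bin_size) * bin_size
        binned[(gene, spot_x, spot_y)] += mid_count
    return dict(binned)


# Hypothetical bin1 records: (gene, x, y, MID count).
bin1_records = [
    ("GeneA", 0, 0, 1),
    ("GeneA", 4, 4, 2),
    ("GeneA", 5, 0, 3),
    ("GeneB", 1, 1, 1),
]
print(aggregate_to_bin(bin1_records, 5))
```

With bin_size set to 1 the function returns the bin1 counts unchanged, apart from merging duplicate coordinates.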
Transcriptome
Common output files of SAW count and SAW realign are listed:
<SN>.raw.gef
Feature expression matrix containing all information over the complete chip region. It has only bin1 expression counts.
<SN>.gef
Feature expression matrix. It is also a visualization GEF that includes expression counts for bin1, 5, 10, 20, 50, 100, 150, 200.
<SN>.tissue.gef
Feature expression matrix under the tissue coverage region. It is also a visualization GEF that includes expression counts for bin1, 5, 10, 20, 50, 100, 150, 200.
<SN>.cellbin.gef
Cellbin feature expression matrix records the information of cells individually, including the centroid coordinate, boundary coordinates, expression of genes, and cell area.
<SN>.adjusted.cellbin.gef
Cellbin expression matrix with expanded cell borders, based on <SN>_<stainType>_mask_edm_dis_<distance>.tif.
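Because the output files above share the .gef ending, a script that sorts them needs to test the longer suffixes first. The helper below is a hypothetical convenience function, not part of SAW; the descriptions are shortened from the list above, and "SN1" stands in for a real serial number.

```python
# Suffixes ordered from most to least specific so that
# ".adjusted.cellbin.gef" is not mistaken for ".cellbin.gef" or ".gef".
GEF_SUFFIXES = [
    (".adjusted.cellbin.gef", "cellbin matrix with expanded cell borders"),
    (".cellbin.gef", "cellbin matrix"),
    (".tissue.gef", "visualization matrix under the tissue region"),
    (".raw.gef", "bin1 matrix over the complete chip region"),
    (".gef", "visualization matrix"),
]


def describe_gef(filename):
    """Return a short description for a GEF filename, or None if unrecognized."""
    for suffix, description in GEF_SUFFIXES:
        if filename.endswith(suffix):
            return description
    return None


for name in ("SN1.raw.gef", "SN1.gef", "SN1.adjusted.cellbin.gef"):
    print(name, "->", describe_gef(name))
```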
Tissue statistics
<SN>.tissue.gef is usually generated from a <SN>.raw.gef and a tissue segmentation image.
The tissuecut.stat file is located under /STEREO_ANALYSIS_WORKFLOW_PROCESSING/EXPRESSION_MATRIX and records statistics for the detected tissue area:
Tissue area in square nanometers
The physical tissue area of the sample slice, in square nanometers.
Contour area in pixel
The area of the tissue region on the tissue segmentation image, in pixels.
Number of DNB under tissue
The number of detected DNBs with RNA capture under the tissue region.
% of DNB under tissue
The proportion of detected DNBs with RNA capture under the tissue region relative to the total counts across the entire chip.
Total gene type under tissue
The total number of annotated gene types under the tissue region.
MID count under tissue
MID counts under the tissue region.
% of MID under tissue
The proportion of MID counts under the tissue region relative to the total counts across the entire chip.
Number of reads under tissue
The number of sequencing reads under the tissue region.
% of reads under tissue
The proportion of sequencing reads under the tissue region relative to the total counts across the entire chip.
Mean reads per spot (binN)
Mean reads of each binN spot under the tissue region.
Median reads per spot (binN)
Median reads of each binN spot under the tissue region.
Mean gene type per spot (binN)
Mean gene type of each binN spot under the tissue region.
Median gene type per spot (binN)
Median gene type of each binN spot under the tissue region.
Mean MID per spot (binN)
Mean MID count of each binN spot under the tissue region.
Median MID per spot (binN)
Median MID count of each binN spot under the tissue region.
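The per-spot statistics above can be approximated from bin1 records as a cross-check. This is a rough sketch under stated assumptions: the records are already restricted to the tissue region, each record is a (gene, x, y, MID count) tuple, and only spots containing at least one record are counted. SAW's exact definitions may differ, so small discrepancies with tissuecut.stat are possible.

```python
from collections import defaultdict
from statistics import mean, median


def per_spot_stats(records, bin_size):
    """Compute mean/median MID count and gene type count per binN spot.

    `records` must be non-empty; statistics.mean raises on empty input.
    """
    mid_per_spot = defaultdict(int)
    genes_per_spot = defaultdict(set)
    for gene, x, y, mid_count in records:
        spot = (x // bin_size, y // bin_size)
        mid_per_spot[spot] += mid_count
        genes_per_spot[spot].add(gene)
    mids = list(mid_per_spot.values())
    gene_counts = [len(genes) for genes in genes_per_spot.values()]
    return {
        "mean_mid": mean(mids),
        "median_mid": median(mids),
        "mean_gene_type": mean(gene_counts),
        "median_gene_type": median(gene_counts),
    }


# Hypothetical records under tissue: (gene, x, y, MID count).
tissue_records = [
    ("GeneA", 0, 0, 2),
    ("GeneB", 1, 1, 4),
    ("GeneA", 10, 0, 1),
    ("GeneA", 20, 20, 3),
]
print(per_spot_stats(tissue_records, 10))
```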
Microorganism
If you run SAW count on Stereo-seq N FFPE data with the --microorganism-detect option set, the microorganism spatial expression matrices are saved in /outs/feature_expression/microorganism.
The output files are:
<SN>.microorganism.raw.gef
Feature expression matrix of microorganisms containing all information over the complete chip region. It has only bin1 expression counts.
<SN>.microorganism.gef
Feature expression matrix of microorganisms. It is also a visualization GEF that includes expression counts for bin1, 5, 10, 20, 50, 100, 150, 200.
<SN>.host_microorganism.raw.gef
Feature expression matrix of microorganisms and the host containing all information over the complete chip region. It has only bin1 expression counts.
<SN>.host_microorganism.gef
Feature expression matrix of microorganisms and the host. It is also a visualization GEF that includes expression counts for bin1, 5, 10, 20, 50, 100, 150, 200.
<SN>.microorganism.<classification>.gem
Feature expression matrix of a specific classification of microbes.
Classifications of microorganisms include phylum, class, order, family, genus, and species.
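Given the naming pattern and the six classification levels above, the expected per-classification GEM filenames can be enumerated as shown below. The helper name and the serial number "SN1" are hypothetical; only the pattern <SN>.microorganism.<classification>.gem and the level names come from the list above.

```python
# Classification levels as listed in this section, from broadest to narrowest.
CLASSIFICATIONS = ("phylum", "class", "order", "family", "genus", "species")


def microorganism_gem_names(sn):
    """Return the expected per-classification GEM filenames for serial number `sn`."""
    return [f"{sn}.microorganism.{level}.gem" for level in CLASSIFICATIONS]


for name in microorganism_gem_names("SN1"):
    print(name)
```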
Microbe classification information
After microbe classification via Kraken2, two files appear under /STEREO_ANALYSIS_WORKFLOW/MICROOGANISM/ANALYSIS: seq_complete_info.txt and seq_complete_info_dedup.txt. The difference between them is that the latter has undergone deduplication of microbe alignments. Each row records the alignment result of one read, primarily including the read ID, spatial coordinate, MID count, taxonomic ID, scientific name, detailed biological classification, and read count.
Proteome
If you run SAW count for a Stereo-CITE T FF analysis, the spatial protein expression matrices are saved in /outs/feature_expression.
The output files are:
<SN>.protein.raw.gef
Feature expression matrix containing all information over the complete chip region. It has only bin1 expression counts.
<SN>.protein.gef
Feature expression matrix. It is also a visualization GEF that includes expression counts for bin1, 5, 10, 20, 50, 100, 150, 200.
<SN>.protein.tissue.gef
Feature expression matrix under the tissue coverage region. It is also a visualization GEF that includes expression counts for bin1, 5, 10, 20, 50, 100, 150, 200.
<SN>.protein.cellbin.gef
Cellbin feature expression matrix records the information of cells individually, including the centroid coordinate, boundary coordinates, expression of genes, and cell area.
<SN>.protein.adjusted.cellbin.gef
Cellbin expression matrix with expanded cell borders, based on <SN>_<stainType>_mask_edm_dis_<distance>.tif.
<SN>.protein.tissue.rmbg.gem.gz
Feature expression matrix from automatic protein background removal. It shows bin1 expression counts.
Tissue statistics
<SN>.protein.tissue.gef is usually generated from a <SN>.protein.raw.gef and a tissue segmentation image.
The protein.tissuecut.stat file is located under /STEREO_ANALYSIS_WORKFLOW_PROCESSING/EXPRESSION_MATRIX and records statistics for the detected tissue area:
Tissue area in square nanometers
The physical tissue area of the sample slice, in square nanometers.
Contour area in pixel
The area of the tissue region on the tissue segmentation image, in pixels.
Number of DNB under tissue
The number of detected DNBs with ADT capture under the tissue region.
% of DNB under tissue
The proportion of detected DNBs with ADT capture under the tissue region relative to the total counts across the entire chip.
Total protein type under tissue
The total number of annotated protein types under the tissue region.
MID count under tissue
MID counts under the tissue region.
% of MID under tissue
The proportion of MID counts under the tissue region relative to the total counts across the entire chip.
Number of reads under tissue
The number of sequencing reads under the tissue region.
% of reads under tissue
The proportion of sequencing reads under the tissue region relative to the total counts across the entire chip.
Mean reads per spot (binN)
Mean reads of each binN spot under the tissue region.
Median reads per spot (binN)
Median reads of each binN spot under the tissue region.
Mean protein type per spot (binN)
Mean protein type of each binN spot under the tissue region.
Median protein type per spot (binN)
Median protein type of each binN spot under the tissue region.
Mean MID per spot (binN)
Mean MID count of each binN spot under the tissue region.
Median MID per spot (binN)
Median MID count of each binN spot under the tissue region.