Expression matrix format
Gene Expression File (GEF)
Gene Expression File (GEF) is a data management and storage format designed to support multidimensional datasets and high computational efficiency. The Stereo-seq analysis workflow generates bin GEF and cellbin GEF files. The bin GEF format is a hierarchically structured data model that stores one or more gene expression matrices, each aggregated at a particular bin size. The cellbin GEF format stores expression information for each cell. Each GEF container organizes a collection of spatial gene expression matrices and includes two primary data objects: Group and Dataset. A Dataset is a multidimensional array of data elements. A Group is analogous to a file system directory and organizes datasets and other groups in a hierarchy.
Bin GEF
The first level of a bin GEF includes four group objects: "geneExp" (required), "wholeExp" (optional), "wholeExpExon" (optional), and "stat" (optional). Group "geneExp" contains groups of gene spatial expression data at one or multiple bin sizes. Group "wholeExp" contains datasets that record the expression level and gene type count of each coordinate at one or multiple bin sizes. Group "wholeExpExon" contains datasets that record the exon expression level of each coordinate at one or multiple bin sizes. Group "stat" stores the gene ID, gene name, total MID count, and spatial pattern enrichment score of each gene. The file "Attributes" record the GEF format version, the software version, and the omics information. The "Attributes" of each dataset record the key metrics of that dataset. See the tables below for details.
Attributes
File Attributes
DataType
Example
Description
version
uint32
2
Gene expression file format version.
geftool_ver
uint32[3]
1,11,12
Geftool version. It can be used as an individual tool to manipulate GEF files.
omics
S32
b'Transcriptomics'
Omics name.
gef_area
float32
4.4410855E10
Tissue or labeled tissue area in square nanometers.
bin_type
S32
b'bin'
Bin type of the GEF file.
sn
S32
b'SS200000135TL_D1'
Stereo-seq chip serial number (SN).
/geneExp/binN/expression: Dataset "expression" is a 1D array that stores the coordinates and MID counts of each gene at bin size N, aggregated by gene name.
Dataset Attributes
DataType
Example (bin1)
Description
minX
int32
59820
Minimum x coordinate in bin N.
minY
int32
102086
Minimum y coordinate in bin N.
maxX
int32
73040
Maximum x coordinate in bin N.
maxY
int32
120539
Maximum y coordinate in bin N.
maxExp
uint32
28
Maximum MID count in a spot when the bin size is N. Data type for "maxExp" is dynamically changed for each sample.
resolution
uint32
500
Physical pitch (nm) between neighbor spots.
Dataset DataType:compound
DataType
Example (bin1)
Description
x
int32
71032
x coordinate in bin N.
y
int32
103180
y coordinate in bin N.
count
uint8/uint16/uint32
1
MID count at (x, y) when bin size is N. Data type for "count" is consistent with "maxExp" in the "Attributes."
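The link between "maxExp" and the storage type of "count" can be illustrated with a small sketch. It assumes that the narrowest unsigned integer type able to hold "maxExp" is the one chosen; the text above only states that the two types are consistent, and the helper name `count_dtype_for` is hypothetical.

```python
def count_dtype_for(max_exp: int) -> str:
    """Return the narrowest unsigned type name that can hold max_exp.

    Assumption (not stated in the format description): the writer picks the
    smallest of uint8 / uint16 / uint32 that fits the maximum MID count.
    """
    if max_exp < 0:
        raise ValueError("MID counts cannot be negative")
    for name, bits in (("uint8", 8), ("uint16", 16), ("uint32", 32)):
        if max_exp < 2 ** bits:
            return name
    raise ValueError("value does not fit in uint32")


# maxExp = 28 is the bin1 example value from the attribute table.
print(count_dtype_for(28))
```

Under this assumption, a reader of the file would inspect the stored type rather than hard-code one, since it can differ between samples.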
[optional] /geneExp/binN/exon: Dataset "exon" is a 1D array that stores the exon expression of each gene at bin size N, aggregated by gene name.
Dataset Attributes
DataType
Example (bin1)
Description
maxExon
int32
21
Max exon expression in binN.
Dataset DataType:1D array
DataType
Example (bin1)
Description
count
uint8/uint16/uint32
0
Exon expression in binN at coordinate (x, y). The index is the same as the index in the "expression" dataset. The data type of "count" is chosen dynamically for each sample.
/geneExp/binN/gene: Dataset "gene" is a 1D array that stores the gene names, the starting row indexes in dataset "expression", and the row counts.
Dataset DataType:compound
DataType
Example (bin1)
Description
geneID
S64
b'ENSMUSG00000000001'
Gene ID.
geneName
S64
b'Gm16045'
Gene name.
offset
uint32
21
The starting row index in dataset "expression" for the gene. In this example, the expression data for gene "Gm16045" starts at row 21 of dataset "expression".
count
uint32
2
Row count. In this example, the expression data for gene "Gm16045" is recorded in rows 21 and 22 (2 rows) of dataset "expression".
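The way "offset" and "count" index into "expression" can be sketched with plain Python lists standing in for the two datasets. The gene name "GeneA", the second "Gm16045" row, and the helper `rows_for_gene` are made up for illustration; only offset 21 and count 2 follow the table example.

```python
# Toy stand-ins for the "gene" and "expression" datasets.
genes = [
    {"geneName": "GeneA", "offset": 0, "count": 21},
    {"geneName": "Gm16045", "offset": 21, "count": 2},
]

# Each expression row is (x, y, count). Rows 0-20 belong to GeneA,
# rows 21-22 belong to Gm16045.
expression = [(0, i, 1) for i in range(21)]
expression += [(71032, 103180, 1), (71040, 103185, 3)]


def rows_for_gene(name, genes, expression):
    """Return the slice of expression rows that belongs to one gene."""
    for gene in genes:
        if gene["geneName"] == name:
            start = gene["offset"]
            return expression[start:start + gene["count"]]
    raise KeyError(name)


print(rows_for_gene("Gm16045", genes, expression))
```

Because the rows of each gene are contiguous, a lookup needs only one slice and no scan of the whole "expression" array.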
[optional] /wholeExp/binN: Dataset "binN" is a 2D array (matrix) that stores the MID count and gene type count at each spot.
Dataset Attributes
DataType
Example (bin1)
Description
number
uint64
22879557
Number of non-zero spots in the dense matrix.
minX
int32
59820
Minimum x coordinate in bin N.
lenX
int32
13221
Length of x.
minY
int32
102086
Minimum y coordinate in bin N.
lenY
int32
18454
Length of y.
maxMID
uint32
2155
Maximum MID count in a spot.
maxGene
uint32
846
Maximum gene type count in a spot.
resolution
uint32
500
Pitch (nm) between neighbor spots.
Dataset DataType: 2D array (X × Y), compound
DataType
Example (bin1)
Description
MIDcount
uint8/uint16/uint32
1
MID count in the spot. The spot coordinate can be identified from the row and column index of the 2D matrix plus the "minX" and "minY" specified in the attributes. Data type for "MIDcount" is dynamically changed for each sample.
genecount
uint16
1
Gene type count in the spot. The spot coordinate can be identified from the "Attributes" and the indexes of the 2D array.
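The coordinate lookup described above can be sketched for bin 1 as follows. The sketch assumes that the first matrix index runs along x and the second along y, which the "X × Y" shape suggests but the text does not state outright; the helper `spot_coordinate` is hypothetical. The attribute values are the bin1 examples from the tables.

```python
# Example attribute values taken from the bin1 tables above.
MIN_X, MAX_X = 59820, 73040
MIN_Y, MAX_Y = 102086, 120539

# The matrix extent follows from the coordinate range.
len_x = MAX_X - MIN_X + 1
len_y = MAX_Y - MIN_Y + 1


def spot_coordinate(i, j, min_x=MIN_X, min_y=MIN_Y):
    """Map matrix indices (i, j) to a bin 1 spot coordinate (x, y).

    Assumption: i indexes the x axis and j indexes the y axis.
    """
    return min_x + i, min_y + j


print(len_x, len_y)
print(spot_coordinate(11212, 1094))
```

The computed extents agree with the "lenX" and "lenY" example values, which is a quick consistency check when reading a file.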
[optional] /wholeExpExon/binN: Dataset "binN" in the "/wholeExpExon/" group is a 2D array (matrix) that stores the exon expression count at each spot.
Dataset Attributes
DataType
Example (bin1)
Description
maxExon
uint32
21
Maximum exon expression count in a spot when the bin size is N.
Dataset DataType: 2D array
DataType
Example (bin1)
Description
MIDcount
uint8/uint16/uint32
0
MID count in the spot. The spot coordinate can be identified from the row and column index of the 2D matrix plus the "minX" and "minY" specified in the attributes. Data type for "MIDcount" is dynamically changed for each sample.
[optional] /stat/gene: Dataset "gene" is a 1D array that stores the MID count and spatial pattern enrichment score (E10) of each gene. The array is sorted by MID count in descending order.
Dataset Attributes
DataType
Example
Description
maxE10
float32
65.53
Maximum E10 score.
minE10
float32
0.
Minimum E10 score.
cutoff
float32
0.1
Threshold for filtering the spots used to compute E10. In this example, 0.1 means that the spots whose MID count is in the top 10% are used to calculate the spatial enrichment score.
Dataset DataType:compound
DataType
Example
Description
geneID
S64
b'ENSMUSG00000000001'
Gene ID.
geneName
S64
b'Ptgds'
Gene name.
MIDcount
uint32
229502
MID count for the gene.
E10
float32
65.53
The spatial pattern enrichment score (E10) for the gene.
Cell Bin GEF
The first layer of Cell Bin GEF contains one required group "cellBin" and multiple optional datasets. The second layer "codedCellBlock" is optional and stores precomputed data used for rendering in StereoMap. The file "Attributes" record the GEF format version, the software version, and the omics information. The "Attributes" of each dataset record the key metrics of that dataset. See the tables below for more details.
Attributes
File Attributes
DataType
Example
Description
geftool_ver
uint32[3]
0,7,11
geftool version. It can be used as an individual tool to manipulate GEF files.
offsetX
int32
0
Minimum x coordinate in bin 1.
offsetY
int32
0
Minimum y coordinate in bin 1.
omics
S32
b'Transcriptomics'
Omics name.
resolution
uint32
500
Pitch (nm) between neighbor spots.
version
uint32
2
Gene expression file format version.
bin_type
S32
CellBin
Bin type of the GEF file.
sn
S32
b'SS200000135TL_D1'
Stereo-seq chip serial number (SN).
/cellBin/cell: Dataset "cell" is a 1D array that stores basic information about each cell and the indices into the expression data.
Dataset Attributes
DataType
Example
Description
averageArea
float32
494.666
Average cell area in pixels.
averageDnbCount
float32
194.299
Average number of mRNA-captured DNBs in a cell.
averageExpCount
float32
541.715
Average MID count in cell.
averageGeneCount
float32
310.157
Average gene count in cell.
maxArea
uint16
1925
Maximum cell area in pixels.
maxDnbCount
uint16
883
Maximum number of mRNA-captured DNBs in a cell.
maxExpCount
uint16
3018
Maximum MID count in cell.
maxGeneCount
uint16
1415
Maximum gene count in cell.
maxX
int32
17658
Maximum x coordinate of the cell’s center of mass.
maxY
int32
19422
Maximum y coordinate of the cell’s center of mass.
medianArea
float32
474.
Median cell area in pixels.
medianDnbCount
float32
183.
Median number of mRNA-captured DNBs in a cell.
medianExpCount
float32
491.
Median MID count in cell.
medianGeneCount
float32
289.
Median gene count in cell.
minArea
uint16
2
Minimum cell area in pixels.
minDnbCount
uint16
0
Minimum number of mRNA-captured DNBs in a cell.
minExpCount
uint16
0
Minimum MID count in cell.
minGeneCount
uint16
0
Minimum gene count in cell.
minX
int32
2933
Minimum x coordinate of the cell’s center of mass.
minY
int32
5568
Minimum y coordinate of the cell’s center of mass.
Dataset DataType:compound
DataType
Example
Description
id
uint32
10
Cell ID index, starting from 0. In the example, 10 refers to the cell whose ID is 10.
x
int32
541
The x coordinate of the cell’s center of mass. In the example, the x coordinate of the center of mass of cell 10 is 541.
y
int32
190
The y coordinate of the cell’s center of mass. In the example, the y coordinate of the center of mass of cell 10 is 190.
offset
uint32
494
The start row index of the cell in the "/cellBin/cellExp" dataset. In the example, the gene ID indices and MID counts of cell 10 start at row 494 of the "/cellBin/cellExp" dataset.
geneCount
uint16
100
Gene count in the cell. In the example, 100 means that 100 rows of "/cellBin/cellExp", from row 494 to row 593, contain the gene ID indices and MID counts for cell 10 of the "/cellBin/cell" dataset.
expCount
uint16
500
Cell MID count.
dnbCount
uint16
200
mRNA-captured DNBs of the cell.
area
uint16
474
Cell area in pixels.
cellTypeID
uint32
0
Cell type ID.
clusterID
uint32
20
Cell cluster ID.
/cellBin/cellBorder: Dataset "cellBorder" is a 3D array that stores the list of points of the bounding polygon of each cell.
Dataset Attributes
DataType
Example
Description
maxX
int32
16127
Maximum x coordinate of the bounding box of the cell.
maxY
int32
16663
Maximum y coordinate of the bounding box of the cell.
minX
int32
11129
Minimum x coordinate of the bounding box of the cell.
minY
int32
12784
Minimum y coordinate of the bounding box of the cell.
Dataset DataType:3D array
DataType
Example
Description
32*(int16,int16)
[[-17,-11],[-15,-5]…[32767,32767]]
A list of 32 coordinates recording the offsets of the cell’s bounding points relative to the cell’s center of mass, which is taken as (0,0). The real coordinate of the cell’s center of mass (x, y) can be obtained from the "cell" dataset using the cellID.
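Turning the stored offsets back into absolute polygon points can be sketched as below. Two things here are assumptions rather than statements from the format description: that (32767, 32767), the int16 maximum seen at the end of the example list, marks unused slots, and that absolute points are obtained by adding the center of mass. The helper `border_to_absolute` is hypothetical, and the third offset is made up.

```python
# Assumed padding marker for unused slots (int16 maximum in both fields).
PAD = (32767, 32767)


def border_to_absolute(center, offsets):
    """Convert relative border offsets to absolute (x, y) points.

    center  -- (x, y) of the cell's center of mass from the "cell" dataset
    offsets -- the 32 (dx, dy) pairs stored for that cell in "cellBorder"
    """
    cx, cy = center
    return [(cx + dx, cy + dy) for dx, dy in offsets if (dx, dy) != PAD]


# Three real points followed by padding up to 32 entries.
offsets = [(-17, -11), (-15, -5), (4, 9)] + [PAD] * 29
polygon = border_to_absolute((541, 190), offsets)
print(polygon)
```

Storing offsets instead of absolute points keeps every value within int16 range regardless of where the cell sits on the chip.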
/cellBin/cellExp: Dataset "cellExp" is a 1D array that stores the expression information of each cell.
Dataset Attributes
DataType
Example
Description
maxCount
uint16
336
Maximum MID count of a gene in a cell.
Dataset DataType:compound
DataType
Example
Description
geneID
uint32
1610
Gene ID of a gene detected in the cell. The ID is an index into the "gene" dataset. In the example, 1610 refers to the item at index 1610 of the "gene" dataset, where the name of the gene can be looked up.
count
uint16
3
MID count for the gene. In the example, assume this is the 0th item in the "cellExp" dataset; from the "offset" and "geneCount" records in the "cell" dataset, this item belongs to the cell whose cellID is 0. The MID count of the gene (geneID=1610) in that cell (cellID=0) is therefore 3.
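The chain from "cell" to "cellExp" to "gene" can be sketched with small in-memory stand-ins for the three datasets. All values and the helper `expression_of_cell` are made up for illustration; only the slicing rule (rows offset to offset + geneCount - 1) follows the tables.

```python
# Toy stand-ins for the datasets under /cellBin.
gene_names = ["GeneA", "GeneB", "GeneC"]   # position = geneID used in cellExp
cells = [
    {"id": 0, "offset": 0, "geneCount": 2},
    {"id": 1, "offset": 2, "geneCount": 1},
]
cell_exp = [(2, 3), (0, 1), (1, 5)]        # rows of (geneID, count)


def expression_of_cell(cell_id):
    """Return {gene name: MID count} for one cell."""
    cell = cells[cell_id]
    start = cell["offset"]
    rows = cell_exp[start:start + cell["geneCount"]]
    return {gene_names[gene_id]: count for gene_id, count in rows}


print(expression_of_cell(0))
print(expression_of_cell(1))

# Row range for the table example: offset 494 with geneCount 100.
first_row, last_row = 494, 494 + 100 - 1
print(first_row, last_row)
```

The last two lines reproduce the row range quoted in the "cell" table: 100 rows starting at row 494 end at row 593.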
[optional] /cellBin/cellExon: Dataset "cellExon" is a 1D array that stores the exon information for each cell.
Dataset Attributes
DataType
Example
Description
maxExon
uint16
5793
Maximum exon count of a gene in all cells.
minExon
uint16
0
Minimum exon count of a gene in all cells.
Dataset DataType:1D array
DataType
Example
Description
uint16
16
Exon count in a cell. The index of the array is the same as the cellID in the "cell" dataset.
[optional] /cellBin/cellExpExon: Dataset "cellExpExon" is a 1D array that stores exon expression information for each cell.
Dataset Attributes
DataType
Example
Description
maxExon
uint16
336
Maximum exon count of a gene in a cell.
Dataset DataType:1D array
DataType
Example
Description
uint16
3
Exon count (MID) for the gene. The index is the same as in the "cellExp" dataset. In the example, assume this is the 0th item in the "cellExpExon" dataset; because the index matches the "cellExp" dataset, the "offset" and "geneCount" records in the "cell" dataset show that this item belongs to the cell whose cellID is 0. The exon count (MID) of the gene (geneID=1610) in that cell (cellID=0) is therefore 3.
/cellBin/cellTypeList: Dataset "cellTypeList" is a 1D array that stores the cell type of each cell.
Dataset DataType:1D array
DataType
Example
Description
S32
b'default'
Cell type, "default" stands for undefined cell type.
/cellBin/gene: Dataset "gene" is a 1D array that stores, for each gene, summary expression information and the indices into the "geneExp" dataset.
Dataset Attributes
DataType
Example
Description
maxCellCount
uint32
5718
Maximum number of cells in which a gene is detected.
maxExpCount
uint32
55361
Maximum MID count of a gene.
minCellCount
uint32
1
Minimum number of cells in which a gene is detected.
minExpCount
uint32
1
Minimum MID count of a gene.
Dataset DataType:compound
DataType
Example
Description
geneID
S32
b'ENSMUSG00000000001'
Gene ID.
geneName
S32
b'AC149090.1'
Gene name.
offset
uint32
0
The start row index of the gene in the "/cellBin/geneExp" dataset. In the example, 0 means that the cellIDs and MID counts of "AC149090.1" are recorded starting from the 0th item of the "/cellBin/geneExp" dataset.
cellcount
uint32
60
Number of cells in which the gene is detected. In the example, 60 means that the 0th to the 59th items record the information of gene "AC149090.1".
expCount
uint32
100
Sum of MID counts for the gene. In the example, the total MID count of "AC149090.1" is 100.
maxMIDcount
uint16
4
Maximum MID count of a gene in a cell. In this case, the maximum MID count of gene "AC149090.1" in a cell is 4.
/cellBin/geneExp: Dataset "geneExp" is a 1D array that stores the cell and expression information of each gene.
Dataset Attributes
DataType
Example
Description
maxCount
uint16
10
Maximum MID count of a gene.
Dataset DataType:compound
DataType
Example
Description
cellID
uint32
1247
ID of a cell that contains the gene. The gene is identified through the "offset" and "cellcount" records of the "gene" dataset. In the example, assuming this is the 0th item in the "geneExp" dataset, 1247 shows that gene "AC149090.1" appears in the cell whose cellID is 1247.
count
uint16
3
The MID count of the gene in the cell given by "cellID". In the example, the MID count of gene "AC149090.1" in the cell with cellID=1247 is 3.
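The "cell"/"cellExp" pair and the "gene"/"geneExp" pair appear to describe the same cell-by-gene counts in two orders, once grouped by cell and once grouped by gene. The sketch below derives a gene-major index from a cell-major one to show how the fields relate. It is an illustration under that assumption, not the procedure the software uses, and `build_gene_index` and the toy values are hypothetical.

```python
from collections import defaultdict

# Cell-major toy data: one (offset, geneCount) pair per cell,
# and rows of (geneID, count) in "cellExp" order.
cell_offsets = [(0, 2), (2, 1)]
cell_exp = [(2, 3), (0, 1), (1, 5)]


def build_gene_index(cell_offsets, cell_exp, n_genes):
    """Regroup cell-major rows by gene.

    Returns (gene_rows, gene_exp): gene_rows mimics the offset / cellcount /
    expCount / maxMIDcount fields, gene_exp holds (cellID, count) rows.
    """
    per_gene = defaultdict(list)
    for cell_id, (offset, gene_count) in enumerate(cell_offsets):
        for gene_id, count in cell_exp[offset:offset + gene_count]:
            per_gene[gene_id].append((cell_id, count))

    gene_rows, gene_exp = [], []
    for gene_id in range(n_genes):
        entries = per_gene.get(gene_id, [])
        gene_rows.append({
            "offset": len(gene_exp),
            "cellcount": len(entries),
            "expCount": sum(count for _, count in entries),
            "maxMIDcount": max((count for _, count in entries), default=0),
        })
        gene_exp.extend(entries)
    return gene_rows, gene_exp


gene_rows, gene_exp = build_gene_index(cell_offsets, cell_exp, n_genes=3)
print(gene_rows)
print(gene_exp)
```

Keeping both orders in the file means that "all genes of a cell" and "all cells of a gene" are each a single contiguous slice.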
[optional] /cellBin/geneExon: Dataset "geneExon" is a 1D array that stores the exon expression information of each gene.
Dataset Attributes
DataType
Example
Description
maxExon
uint32
55361
Maximum exon count of a gene.
minExon
uint32
0
Minimum exon count of a gene.
Dataset DataType:1D array
DataType
Example
Description
uint32
97
Total exon count of a gene. The index of the "geneExon" dataset is the same as that of the "gene" dataset. In the example, assuming this is the 0th item in the "geneExon" dataset and gene "AC149090.1" is the 0th item in the "gene" dataset, the exon count of gene "AC149090.1" is 97.
[optional] /cellBin/geneExpExon: Dataset "geneExpExon" is a 1D array that stores, for each gene, the exon expression information in each cell.
Dataset Attributes
DataType
Example
Description
maxExon
uint16
336
Maximum exon expression of a gene in a cell.
Dataset DataType:1D array
DataType
Example
Description
uint16
3
Exon count of a gene in a cell. The index of the "geneExpExon" dataset is the same as that of the "geneExp" dataset. In the example, assume this is the 0th item in the "geneExpExon" dataset; because the index matches the "geneExp" dataset, the "offset" and "cellcount" records in the "gene" dataset show that this item belongs to gene "AC149090.1". The value 3 means that the exon count of gene "AC149090.1" in cell 1247 is 3.
/cellBin/blockIndex: Dataset "blockIndex" is a 1D array that stores the matrix block partition information.
Dataset DataType:1D array
DataType
Example
Description
uint32
0
Index from which the cell count of each partition block is derived. The cell count of block i is cnt = blockIndex[i+1] - blockIndex[i].
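The formula cnt = blockIndex[i+1] - blockIndex[i] can be applied to the whole array at once, as in this minimal sketch. The index values and the helper `cells_per_block` are made up for illustration.

```python
def cells_per_block(block_index):
    """Return the cell count of every block from consecutive differences."""
    return [
        block_index[i + 1] - block_index[i]
        for i in range(len(block_index) - 1)
    ]


# Made-up index values: three blocks holding 3, 0, and 7 cells.
block_index = [0, 3, 3, 10]
print(cells_per_block(block_index))
```

An array of n + 1 index values therefore describes n blocks, and an empty block shows up as two equal neighboring values.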
/cellBin/blockSize: Dataset "blockSize" is a 1D array that stores the block size of the partition.
Dataset DataType:1D array
DataType
Example
Description
uint32
256, 256, 104, 104
4-element array. The 4 items represent the block length in x-axis, block length in y-axis, block count in x-axis, and block count in y-axis, respectively.
[optional] /codedCellBlock: Group "codedCellBlock" stores precomputed data for rendering in StereoMap.
Group Attributes
DataType
Example
Description
info
string
{"@type": "neuroglancer_annotations_v1", ...}
Metadata of encoded precomputed data in JSON.
[optional] /codedCellBlock/L0/0_1: Dataset "0_1" is an example chunk of encoded precomputed data, including id, geometry, and so on.
Dataset DataType:Bytes
DataType
Example
Description
H5T_OPAQUE
1F 8B 08 00 ...
Bytecode of the chunk.
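The example bytes 1F 8B 08 match the standard gzip header (magic number 1F 8B, compression method 08 for DEFLATE), which suggests that chunks are gzip-compressed. That is an inference from the example, not something the format description states, so the sketch below checks the magic number before decompressing. The helper `decode_chunk` is hypothetical.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def decode_chunk(raw: bytes) -> bytes:
    """Decompress a chunk if it looks like gzip, otherwise return it as is.

    Assumption: chunks that start with the gzip magic number are
    gzip-compressed.
    """
    if raw[:2] == GZIP_MAGIC:
        return gzip.decompress(raw)
    return raw


# Round trip with a locally compressed payload (not real GEF data).
sample = gzip.compress(b"chunk payload")
print(sample[:3].hex(" "))
print(decode_chunk(sample))
```

What the decompressed bytes contain is described by the "info" attribute of the group and is outside the scope of this sketch.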
Gene Expression Matrix (GEM)
Gene expression matrix (GEM) stores gene spatial expression data. SAW generates multiple gene expression matrix files during the workflow. The basic format has six columns, preceded by a header row that gives the column names: gene ID, gene name, x coordinate, y coordinate, MID count, and exon count. A cellbin GEM has a seventh column for the cell ID. The expression matrix for the maximum enclosing rectangle region additionally starts with several annotation rows, each beginning with "#", placed before the column-name row. The header field names and field types are described in the table.
#FileFormat
string
GEMv0.2
Gene expression matrix file format version.
#SortedBy
string
None
Gene expression matrix sorting strategy. Valid values: "geneID", "x", "y", "MIDCount", "None".
#BinType
string
Bin
Bin type of the GEM file.
#BinSize
string
1
Bin size. (See "Bin" in section 1.3, Terminologies and Concepts.)
#Omics
string
Transcriptomics
Omics name.
#Stereo-seqChip
string
SS200000135TL_D1
Stereo-seq Chip T serial number.
#OffsetX
uint32
1
X coordinate of the origin before calibration.
#OffsetY
uint32
1
Y coordinate of the origin before calibration.
geneID
string
ENSMUSG00000000001
Gene ID
geneName
string
Gnai3
Gene name.
x
uint32
16809
X coordinate of the spot.
y
uint32
8546
Y coordinate of the spot.
MIDCount
uint32
1
Number of MIDs at (x, y) for the gene in the corresponding row.
ExonCount
uint32
0
[Optional] Exon count at (x, y) for the gene in the corresponding row.
CellID
uint32
55892
[Optional] CellID for (x, y).
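Reading a GEM file with the fields listed above can be sketched as follows. The sketch assumes tab-separated columns and annotation rows of the form "#Key=Value"; neither the delimiter nor the annotation syntax is spelled out in the table, so both should be checked against a real file. The function `parse_gem` and the sample text are hypothetical.

```python
import csv
import io

INT_FIELDS = ("x", "y", "MIDCount", "ExonCount", "CellID")


def parse_gem(text):
    """Split GEM text into (annotations, rows).

    Assumptions: annotation rows look like "#Key=Value" and data columns
    are separated by tabs.
    """
    annotations = {}
    body = []
    for line in text.splitlines():
        if line.startswith("#"):
            key, _, value = line[1:].partition("=")
            annotations[key] = value
        elif line.strip():
            body.append(line)

    rows = []
    reader = csv.DictReader(io.StringIO("\n".join(body)), delimiter="\t")
    for record in reader:
        # Convert numeric columns that are present; optional ones may be absent.
        for field in INT_FIELDS:
            if record.get(field) is not None:
                record[field] = int(record[field])
        rows.append(record)
    return annotations, rows


sample = (
    "#FileFormat=GEMv0.2\n"
    "#BinSize=1\n"
    "geneID\tgeneName\tx\ty\tMIDCount\tExonCount\n"
    "ENSMUSG00000000001\tGnai3\t16809\t8546\t1\t0\n"
)
annotations, rows = parse_gem(sample)
print(annotations)
print(rows[0])
```

Because "ExonCount" and "CellID" are optional, the parser converts only the numeric columns it actually finds in the header row.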
An example of bin GEM:
An example of cellbin GEM: