Expression matrix format

Gene Expression File (GEF)

Gene expression file (GEF) is a data management and storage format designed to support multidimensional datasets and high computational efficiency. The Stereo-seq analysis workflow generates bin GEF and cellbin GEF files. The bin GEF format is a hierarchically structured data model that stores one or more combined gene expression matrices in various bin sizes. The cellbin GEF format stores expression information for each cell. Each GEF container organizes a collection of spatial gene expression matrices and includes two primary data objects: Group and Dataset. A dataset is a multidimensional array of data elements. A group is analogous to a file system directory and organizes datasets and other groups in a hierarchy.

Bin GEF

The first level of a bin GEF includes four group objects: "geneExp" (required), "wholeExp" (optional), "wholeExpExon" (optional), and "stat" (optional). Group "geneExp" contains groups of gene spatial expression data in one or multiple bin sizes. Group "wholeExp" contains datasets that record the expression level and gene type count at each coordinate in one or multiple bin sizes. Group "wholeExpExon" contains datasets that record the exon expression level at each coordinate in one or multiple bin sizes. Group "stat" saves the gene ID, gene name, total MID count, and spatial pattern enrichment score of each gene. The file "Attributes" record the GEF format version, the software version, and omics information. The "Attributes" of each dataset record the key metrics of that dataset. See the tables below for details.

Item
DataType
Example
Description

Attributes

File Attributes

DataType

Example

Description

version

uint32

2

Gene expression file format version.

geftool_ver

uint32[3]

1,11,12

Version of geftool, which can also be used as a standalone tool to manipulate GEF files.

omics

S32

b'Transcriptomics'

Omics name.

gef_area

float32

4.4410855E10

Tissue or labeled tissue area in square nanometers.

bin_type

S32

b'bin'

Bin type of the GEF file.

sn

S32

b'SS200000135TL_D1'

Stereo-seq chip serial number (SN).

/geneExp/binN/expression: Dataset "expression" is a 1D array that stores the coordinates and MID counts of each gene at bin size N, aggregated by gene name.

Dataset Attributes

DataType

Example (bin1)

Description

minX

int32

59820

Minimum x coordinate in bin N.

minY

int32

102086

Minimum y coordinate in bin N.

maxX

int32

73040

Maximum x coordinate in bin N.

maxY

int32

120539

Maximum y coordinate in bin N.

maxExp

uint32

28

Maximum MID count in a spot when the bin size is N. The data type of "maxExp" is chosen dynamically for each sample.

resolution

uint32

500

Physical pitch (nm) between neighboring spots.

Dataset DataType:compound

DataType

Example (bin1)

Description

x

int32

71032

x coordinate in bin N.

y

int32

103180

y coordinate in bin N.

count

uint8/uint16/uint32

1

MID count at (x, y) when bin size is N. Data type for "count" is consistent with "maxExp" in the "Attributes."
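The "count" field is stored as uint8, uint16, or uint32 depending on the sample. The source does not state the exact selection rule, so the following is only a minimal sketch that assumes the narrowest unsigned type able to hold "maxExp" is used. The function name `smallest_uint_type` is hypothetical.

```python
def smallest_uint_type(max_value: int) -> str:
    """Return the narrowest of uint8/uint16/uint32 that can hold max_value.

    Assumption: the writer picks the narrowest type that fits "maxExp".
    The GEF documentation only says the type varies per sample.
    """
    if max_value < 0:
        raise ValueError("MID counts are non-negative")
    for name, bits in (("uint8", 8), ("uint16", 16), ("uint32", 32)):
        if max_value < 2 ** bits:
            return name
    raise ValueError("value does not fit in uint32")


# With the example attribute maxExp = 28, a one-byte count would be enough.
chosen = smallest_uint_type(28)
print(chosen)
```

When reading a file, it is safer to take the type from the dataset itself than to recompute it this way.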

[optional] /geneExp/binN/exon: Dataset "exon" is a 1D array that stores the exon expression of each gene at bin size N, aggregated by gene name.

Dataset Attributes

DataType

Example (bin1)

Description

maxExon

int32

21

Max exon expression in binN.

Dataset DataType:1D array

DataType

Example (bin1)

Description

count

uint8/uint16/uint32

0

Exon expression in binN at coordinate (x, y). The index is the same as the index in the "expression" dataset. The data type of "count" is chosen dynamically for each sample.

/geneExp/binN/gene: Dataset "gene" is a 1D array that stores the gene names, the starting row indexes in dataset "expression", and the row counts.

Dataset DataType:compound

DataType

Example (bin1)

Description

geneID

S64

b'ENSMUSG00000000001'

Gene ID.

geneName

S64

b'Gm16045'

Gene name.

offset

uint32

21

The starting row index in dataset "expression" for the gene. In this example, the expression data for gene "Gm16045" starts at row 21 in the dataset "expression."

count

uint32

2

Row count. In this example, expression data for gene "Gm16045" is recorded in rows 21 and 22 (2 rows) in the dataset "expression."
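The "offset" and "count" fields work as a slice into the "expression" dataset. A minimal sketch of that lookup follows. It uses plain Python lists as stand-ins for the compound HDF5 records; the second expression row and the helper name `rows_for_gene` are made up for illustration.

```python
# Toy stand-ins for the two datasets. Rows 0..20 belong to other genes.
expression = [(0, 0, 1)] * 21 + [
    (71032, 103180, 1),  # row 21: (x, y, count), taken from the example
    (71040, 103200, 2),  # row 22: invented values for illustration
]
genes = [{"geneName": "Gm16045", "offset": 21, "count": 2}]


def rows_for_gene(gene_name, genes, expression):
    """Return the (x, y, count) rows of one gene using offset and count."""
    for gene in genes:
        if gene["geneName"] == gene_name:
            start = gene["offset"]
            return expression[start:start + gene["count"]]
    raise KeyError(gene_name)


gm16045_rows = rows_for_gene("Gm16045", genes, expression)
print(gm16045_rows)
```

Because rows are grouped by gene, one contiguous slice is enough and no per-row gene label needs to be stored.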

[optional] /wholeExp/binN: Dataset "binN" is a 2D array (matrix) that stores the MID count and gene type count at each spot.

Dataset Attributes

DataType

Example (bin1)

Description

number

uint64

22879557

Number of non-zero spots in the dense matrix.

minX

int32

59820

Minimum x coordinate in bin N.

lenX

int32

13221

Length of x.

minY

int32

102086

Minimum y coordinate in bin N.

lenY

int32

18454

Length of y.

maxMID

uint32

2155

Maximum MID count in a spot.

maxGene

uint32

846

Maximum gene type count in a spot.

resolution

uint32

500

Pitch (nm) between neighboring spots.

Dataset DataType: 2D array (X×Y), compound

DataType

Example (bin1)

Description

MIDcount

uint8/uint16/uint32

1

MID count in the spot. The spot coordinate can be identified from the row and column index of the 2D matrix plus the "minX" and "minY" specified in the attributes. Data type for "MIDcount" is dynamically changed for each sample.

genecount

uint16

1

Gene count in the spot. The spot coordinate can be identified from "Attributes" and the indexes of the 2D array.
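The relation between matrix indices and spot coordinates can be sketched as follows for bin 1. The sketch assumes the first array index runs along x and the second along y, as the "X×Y" shape suggests. How the index scales for larger bin sizes is not stated here, so it is left out. The function names are hypothetical.

```python
def spot_coordinate(i, j, min_x, min_y):
    """Map matrix indices (i along x, j along y) to a bin 1 spot coordinate."""
    return min_x + i, min_y + j


def matrix_index(x, y, min_x, min_y, len_x, len_y):
    """Inverse mapping, with a bounds check against lenX and lenY."""
    i, j = x - min_x, y - min_y
    if not (0 <= i < len_x and 0 <= j < len_y):
        raise IndexError(f"({x}, {y}) lies outside the matrix")
    return i, j


# Attribute values taken from the bin 1 example table above.
attrs = {"minX": 59820, "minY": 102086, "lenX": 13221, "lenY": 18454}
last_spot = spot_coordinate(
    attrs["lenX"] - 1, attrs["lenY"] - 1, attrs["minX"], attrs["minY"]
)
print(last_spot)  # (73040, 120539)
```

The last spot computed this way equals the "maxX" and "maxY" example values listed for the bin 1 "expression" dataset, which is consistent with the index-plus-minimum rule.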

[optional] /wholeExpExon/binN: Dataset "binN" in the "/wholeExpExon/" group is a 2D array (matrix) that stores the exon expression count at each spot.

Dataset Attributes

DataType

Example (bin1)

Description

maxExon

uint32

21

Maximum exon expression count in a spot when the bin size is N.

Dataset DataType: 2D array

DataType

Example (bin1)

Description

MIDcount

uint8/uint16/uint32

0

MID count in the spot. The spot coordinate can be identified from the row and column index of the 2D matrix plus the "minX" and "minY" specified in the attributes. Data type for "MIDcount" is dynamically changed for each sample.

[optional] /stat/gene: Dataset "gene" is a 1D array that stores the MID count and spatial pattern enrichment score (E10) of each gene. The array is sorted by MID count in descending order.

Dataset Attributes

DataType

Example

Description

maxE10

float32

65.53

Maximum E10 score.

minE10

float32

0.

Minimum E10 score.

cutoff

float32

0.1

Threshold for filtering the spots used to compute E10. In this example, 0.1 means that the spots whose MID count is in the top 10% are used for calculating the spatial enrichment score.

Dataset DataType:compound

DataType

Example

Description

geneID

S64

b'ENSMUSG00000000001'

Gene ID.

geneName

S64

b'Ptgds'

Gene name.

MIDcount

uint32

229502

MID count for the gene.

E10

float32

65.53

The spatial pattern enrichment score (E10) for the gene.

Cell Bin GEF

The first layer of a Cell Bin GEF contains one required group, "cellBin", and multiple optional datasets. The second layer, "codedCellBlock", is optional and stores precomputed data used for rendering in StereoMap. The file "Attributes" record the GEF format version, the software version, and omics information. The "Attributes" of each dataset record the key metrics of that dataset. See the tables below for details.

Item
Data type
Example
Description

Attributes

File Attributes

DataType

Example

Description

geftool_ver

uint32[3]

0,7,11

Version of geftool, which can also be used as a standalone tool to manipulate GEF files.

offsetX

int32

0

Minimum x coordinate in bin 1.

offsetY

int32

0

Minimum y coordinate in bin 1.

omics

S32

b'Transcriptomics'

Omics name.

resolution

uint32

500

Pitch (nm) between neighboring spots.

version

uint32

2

Gene expression file format version.

bin_type

S32

CellBin

Bin type of the GEF file.

sn

S32

b'SS200000135TL_D1'

Stereo-seq chip serial number (SN).

/cellBin/cell: Dataset "cell" is a 1D array that stores basic information about each cell and the indices of its expression records.

Dataset Attributes

DataType

Example

Description

averageArea

float32

494.666

Average area for cells in pixel.

averageDnbCount

float32

194.299

Average number of mRNA-captured DNBs in a cell.

averageExpCount

float32

541.715

Average MID count in cell.

averageGeneCount

float32

310.157

Average gene count in cell.

maxArea

uint16

1925

Maximum area for cells in pixel.

maxDnbCount

uint16

883

Maximum number of mRNA-captured DNBs in a cell.

maxExpCount

uint16

3018

Maximum MID count in cell.

maxGeneCount

uint16

1415

Maximum gene count in cell.

maxX

int32

17658

Maximum x coordinate of the cell’s center of mass.

maxY

int32

19422

Maximum y coordinate of the cell’s center of mass.

medianArea

float32

474.

Median area for cells in pixel.

medianDnbCount

float32

183.

Median number of mRNA-captured DNBs in a cell.

medianExpCount

float32

491.

Median MID count in cell.

medianGeneCount

float32

289.

Median gene count in cell.

minArea

uint16

2

Minimum area for cells in pixel.

minDnbCount

uint16

0

Minimum number of mRNA-captured DNBs in a cell.

minExpCount

uint16

0

Minimum MID count in cell.

minGeneCount

uint16

0

Minimum gene count in cell.

minX

int32

2933

Minimum x coordinate of the cell’s center of mass.

minY

int32

5568

Minimum y coordinate of the cell’s center of mass.

Dataset DataType:compound

DataType

Example

Description

id

uint32

10

Cell ID index, starting from 0. In the example, 10 represents the 10th cell (counting from 0) in the dataset.

x

int32

541

The x coordinate of the cell’s center of mass. In the example, the x coordinate of the 10th cell’s center of mass is 541.

y

int32

190

The y coordinate of the cell’s center of mass. In the example, the y coordinate of the 10th cell’s center of mass is 190.

offset

uint32

494

The start row index of the cell in the "/cellBin/cellExp" dataset. In the example, the gene ID indices and MID counts of the 10th cell start at row 494 of the "/cellBin/cellExp" dataset.

geneCount

uint16

100

Gene count in the cell. In the example, 100 means that 100 rows in "/cellBin/cellExp", from row 494 to row 593, contain the gene ID indices and MID counts of the genes for the 10th cell in the "/cellBin/cell" dataset.

expCount

uint16

500

Cell MID count.

dnbCount

uint16

200

mRNA-captured DNBs of the cell.

area

uint16

474

Cell area in pixel.

cellTypeID

uint32

0

Cell type ID.

clusterID

uint32

20

Cell cluster ID.

/cellBin/cellBorder: Dataset "cellBorder" is a 3D array that stores the lists of points forming the bounding polygon of each cell.

Dataset Attributes

DataType

Example

Description

maxX

int32

16127

Maximum x coordinate of the bounding box of the cell.

maxY

int32

16663

Maximum y coordinate of the bounding box of the cell.

minX

int32

11129

Minimum x coordinate of the bounding box of the cell.

minY

int32

12784

Minimum y coordinate of the bounding box of the cell.

Dataset DataType:3D array

DataType

Example

Description

32*(int16,int16)

[[-17,-11],[-15,-5]…[32767,32767]]

A list of 32 coordinates recording the differences between cell bounding points and the cell’s center of mass (0,0). The real coordinate of cell’s center of mass (x, y) can be obtained from "cell" dataset using cellID.
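Because border points are stored as offsets from the center of mass, absolute coordinates are obtained by adding the cell's (x, y) from the "cell" dataset. A minimal sketch follows. It assumes that (32767, 32767) entries, as seen at the end of the example, are padding for polygons with fewer than 32 points; that reading is an assumption, not something the table states. The function name is hypothetical.

```python
# Assumed padding marker: int16 maximum in both components.
PAD = (32767, 32767)


def absolute_border(center, relative_points):
    """Shift relative border points by the cell's center of mass."""
    cx, cy = center
    return [
        (cx + dx, cy + dy)
        for dx, dy in relative_points
        if (dx, dy) != PAD  # skip assumed padding entries
    ]


# Center (541, 190) is the example cell; offsets come from the example row.
border = absolute_border((541, 190), [(-17, -11), (-15, -5), (32767, 32767)])
print(border)  # [(524, 179), (526, 185)]
```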

/cellBin/cellExp: Dataset "cellExp" is a 1D array that stores the expression information of each cell.

Dataset Attributes

DataType

Example

Description

maxCount

uint16

336

Maximum MID count of a gene in a cell.

Dataset DataType:compound

DataType

Example

Description

geneID

uint32

1610

Gene IDs of the genes detected in the cell. The ID is an index into the "gene" dataset. In the example, 1610 refers to the 1610th item in the "gene" dataset, where the name of the gene can be looked up.

count

uint16

3

MID count for the gene. In the example, assume this is the 0th item in the "cellExp" dataset. From the "offset" and "geneCount" records in the "cell" dataset, we know that the 0th item in "cellExp" belongs to the cell whose cellID=0, so the MID count for the gene (geneID=1610) in the cell (cellID=0) is 3.
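The link between the "cell" and "cellExp" datasets can be sketched as a slice: a cell's "offset" and "geneCount" select its rows in "cellExp", and each row's geneID indexes the "gene" dataset. The data below is a toy stand-in with invented gene names; only the slicing logic mirrors the description above.

```python
# Toy "cellExp" rows: (geneID index, MID count).
cell_exp = [(2, 3), (0, 1), (1, 5), (2, 2)]
# Toy "gene" dataset; names are placeholders.
gene_names = ["geneA", "geneB", "geneC"]
# Toy "cell" records: cell 0 owns rows 0..1, cell 1 owns rows 2..3.
cells = [
    {"id": 0, "offset": 0, "geneCount": 2},
    {"id": 1, "offset": 2, "geneCount": 2},
]


def expression_of_cell(cell, cell_exp, gene_names):
    """Return {gene name: MID count} for one cell record."""
    start = cell["offset"]
    rows = cell_exp[start:start + cell["geneCount"]]
    return {gene_names[gene_id]: count for gene_id, count in rows}


cell0 = expression_of_cell(cells[0], cell_exp, gene_names)
print(cell0)  # {'geneC': 3, 'geneA': 1}

# Row range for the documented example: offset 494, geneCount 100.
last_row = 494 + 100 - 1
print(last_row)  # 593
```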

[optional] /cellBin/cellExon: Dataset "cellExon" is a 1D array that stores the exon information for each cell.

Dataset Attributes

DataType

Example

Description

maxExon

uint16

5793

Maximum exon count of a gene in all cells.

minExon

uint16

0

Minimum exon count of a gene in all cells.

Dataset DataType:1D array

DataType

Example

Description

uint16

16

Exon count in a cell. The index of the array is the same as the cellID in the "cell" dataset.

[optional] /cellBin/cellExpExon: Dataset "cellExpExon" is a 1D array that stores exon expression information for each cell.

Dataset Attributes

DataType

Example

Description

maxExon

uint16

336

Maximum exon count of a gene in a cell.

Dataset DataType:1D array

DataType

Example

Description

uint16

3

Exon count (MID) for the gene. The index is the same as in the "cellExp" dataset. In the example, assume this is the 0th item in the "cellExpExon" dataset. Since the index matches the "cellExp" dataset, the "offset" and "geneCount" records in the "cell" dataset show that the 0th item in "cellExpExon" belongs to the cell whose cellID=0, so the exon count (MID) for the gene (geneID=1610) in the cell (cellID=0) is 3.

/cellBin/cellTypeList: Dataset "cellTypeList" is a 1D array that stores the cell type of each cell.

Dataset DataType:1D array

DataType

Example

Description

S32

b'default'

Cell type, "default" stands for undefined cell type.

/cellBin/gene: Dataset "gene" is a 1D array that stores, for each gene, summary expression information and the indices of its cell records.

Dataset Attributes

DataType

Example

Description

maxCellCount

uint32

5718

Maximum number of cells in which a gene is detected.

maxExpCount

uint32

55361

Maximum MID count of a gene.

minCellCount

uint32

1

Minimum number of cells in which a gene is detected.

minExpCount

uint32

1

Minimum MID count of a gene.

Dataset DataType:compound

DataType

Example

Description

geneID

S32

b'ENSMUSG00000000001'

Gene ID.

geneName

S32

b'AC149090.1'

Gene name.

offset

uint32

0

The start row index of the gene in the "/cellBin/geneExp" dataset. In the example, 0 means that the cellIDs and MID counts of "AC149090.1" are recorded starting from the 0th item in the "/cellBin/geneExp" dataset.

cellcount

uint32

60

Number of cells in which the gene is detected. In the example, 60 means that the 0th through the 59th items record the information of gene "AC149090.1".

expCount

uint32

100

Sum of MID counts for the gene. In the example, the total MID count of "AC149090.1" is 100.

maxMIDcount

uint16

4

Maximum MID count of a gene in a cell. In this case, the maximum MID count of gene "AC149090.1" in a cell is 4.

/cellBin/geneExp: Dataset "geneExp" is a 1D array that stores cell and expression information of each gene.

Dataset Attributes

DataType

Example

Description

maxCount

uint16

10

Maximum MID count of a gene.

Dataset DataType:compound

DataType

Example

Description

cellID

uint32

1247

ID of a cell in which the gene is detected. The gene is the one whose record in the "gene" dataset points to this row. In the example (assuming the 0th item in the "geneExp" dataset), 1247 shows that the gene "AC149090.1" appears in the cell whose cellID is 1247.

count

uint16

3

The MID count of the gene in that cell, where the gene is the one whose record in the "gene" dataset points to this row. In the example, the MID count of gene "AC149090.1" in the cell (cellID=1247) is 3.
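Since each "gene" record carries both a slice into "geneExp" ("offset", "cellcount") and summaries of that slice ("expCount", "maxMIDcount"), the two can be cross-checked. The sketch below does this on invented toy values; the function name and all records are hypothetical.

```python
# Toy "geneExp" rows: (cellID, MID count).
gene_exp = [(1247, 3), (5, 2), (9, 1), (5, 4)]
# Toy "gene" records pointing into gene_exp.
gene_records = [
    {"geneName": "geneA", "offset": 0, "cellcount": 3,
     "expCount": 6, "maxMIDcount": 3},
    {"geneName": "geneB", "offset": 3, "cellcount": 1,
     "expCount": 4, "maxMIDcount": 4},
]


def check_gene_record(record, gene_exp):
    """Return True if the summaries agree with the rows the record selects."""
    start = record["offset"]
    rows = gene_exp[start:start + record["cellcount"]]
    counts = [count for _cell_id, count in rows]
    return (
        len(rows) == record["cellcount"]
        and sum(counts) == record["expCount"]
        and max(counts) == record["maxMIDcount"]
    )


all_consistent = all(check_gene_record(r, gene_exp) for r in gene_records)
print(all_consistent)
```

A check like this can catch truncated or misaligned datasets before any downstream analysis.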

[optional] /cellBin/geneExon: Dataset "geneExon" is a 1D array that stores the exon expression information of each gene.

Dataset Attributes

DataType

Example

Description

maxExon

uint32

55361

Maximum exon count of a gene.

minExon

uint32

0

Minimum exon count of a gene.

Dataset DataType:1D array

DataType

Example

Description

uint32

97

Total exon count of a gene. The index of the "geneExon" dataset is the same as that of the "gene" dataset. In the example, assuming this is the 0th item in the "geneExon" dataset and gene "AC149090.1" is the 0th item in the "gene" dataset, the exon count of gene "AC149090.1" is 97.

[optional] /cellBin/geneExpExon: Dataset "geneExpExon" is a 1D array that stores, for each gene, the exon expression information in each cell.

Dataset Attributes

DataType

Example

Description

maxExon

uint16

336

Maximum exon expression of a gene in a cell.

Dataset DataType:1D array

DataType

Example

Description

uint16

3

Exon count of a gene in a cell. The index of the "geneExpExon" dataset is the same as that of the "geneExp" dataset. In the example, assume this is the 0th item in the "geneExpExon" dataset. Since the index matches the "geneExp" dataset, the "offset" and "cellCount" records in the "gene" dataset show that the 0th item in "geneExpExon" belongs to the gene "AC149090.1", so the exon count of gene "AC149090.1" in cell 1247 is 3.

/cellBin/blockIndex: Dataset "blockIndex" is a 1D array that stores the matrix block partition information.

Dataset DataType:1D array

DataType

Example

Description

uint32

0

Cell count in each partition block: cnt = blockIndex[i+1] - blockIndex[i].
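The formula above implies that the array holds cumulative start positions, so the per-block cell counts are the differences between consecutive entries. A minimal sketch with invented values, assuming the array has one more entry than there are blocks:

```python
def cells_per_block(block_index):
    """Apply cnt = blockIndex[i+1] - blockIndex[i] to every block."""
    return [b - a for a, b in zip(block_index, block_index[1:])]


# Invented cumulative index for three blocks; the middle block is empty.
toy_block_index = [0, 3, 3, 10]
block_counts = cells_per_block(toy_block_index)
print(block_counts)  # [3, 0, 7]
```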

/cellBin/blockSize: Dataset "blockSize" is a 1D array that stores the block size of the partition.

Dataset DataType:1D array

DataType

Example

Description

uint32

256, 256, 104, 104

4-element array. The 4 items represent the block length in x-axis, block length in y-axis, block count in x-axis, and block count in y-axis, respectively.

[optional] /codedCellBlock: Group "codedCellBlock" stores precomputed data for rendering in StereoMap.

Group Attributes

DataType

Example

Description

info

string

{"@type": "neuroglancer_annotations_v1", ...}

Metadata of encoded precomputed data in JSON.

[optional] /codedCellBlock/L0/0_1: Dataset "0_1" is an example chunk of encoded precomputed data, including id, geometry, and so on.

Dataset DataType:Bytes

DataType

Example

Description

H5T_OPAQUE

1F 8B 08 00 ...

Bytecode of the chunk.

Gene Expression Matrix (GEM)

Gene expression matrix (GEM) stores gene spatial expression data. SAW generates multiple gene expression matrix files in the workflow. The basic format requires six columns with a header row that shows the column names. The six columns are gene ID, gene name, x coordinate, y coordinate, MID count, and exon count. In a cellbin GEM, a seventh column holds the cell ID. The expression matrix for the maximum-area enclosing rectangle region has several annotation rows starting with "#" before the column header row. The header field names and field types are described in the table.

Fields
Data Type
Example
Description

#FileFormat

string

GEMv0.2

Gene expression matrix file format version.

#SortedBy

string

None

Gene expression matrix sorting strategy. Valid values: "geneID", "x", "y", "MIDCount", "None".

#BinType

string

Bin

Bin type of the GEM file.

#BinSize

string

1

Bin size (see 1.3 Terminologies and Concepts, Bin).

#Omics

string

Transcriptomics

Omics name.

#Stereo-seqChip

string

SS200000135TL_D1

Stereo-seq Chip T serial number.

#OffsetX

uint32

1

X coordinate of the origin before calibration.

#OffsetY

uint32

1

Y coordinate of the origin before calibration.

geneID

string

ENSMUSG00000000001

Gene ID

geneName

string

Gnai3

Gene name.

x

uint32

16809

X coordinate of the spot.

y

uint32

8546

Y coordinate of the spot.

MIDCount

uint32

1

Number of MIDs at (x, y) for the gene in the corresponding row.

ExonCount

uint32

0

[Optional] Exon count at (x, y) for the gene in the corresponding row.

CellID

uint32

55892

[Optional] CellID for (x, y).
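The layout described by this table can be read with a few lines of code. The sketch below is not the SAW reader. It assumes that columns are tab-separated and that annotation rows have the form "#Key=Value"; neither detail is stated in the table. The function name `parse_gem` is hypothetical, and the sample uses the example values from the table.

```python
import io

# Columns that hold integers according to the table above.
INT_COLUMNS = ("x", "y", "MIDCount", "ExonCount", "CellID")


def parse_gem(handle):
    """Split a GEM text stream into annotations, column names, and rows."""
    meta, columns, rows = {}, None, []
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            # Assumed annotation syntax: "#Key=Value".
            key, _, value = line[1:].partition("=")
            meta[key] = value
            continue
        fields = line.split("\t")  # assumed delimiter
        if columns is None:
            columns = fields  # first non-annotation row names the columns
            continue
        record = dict(zip(columns, fields))
        for name in INT_COLUMNS:
            if name in record:
                record[name] = int(record[name])
        rows.append(record)
    return meta, columns, rows


sample = "\n".join([
    "#FileFormat=GEMv0.2",
    "#BinSize=1",
    "\t".join(["geneID", "geneName", "x", "y", "MIDCount", "ExonCount"]),
    "\t".join(["ENSMUSG00000000001", "Gnai3", "16809", "8546", "1", "0"]),
]) + "\n"

meta, columns, rows = parse_gem(io.StringIO(sample))
print(meta["FileFormat"], rows[0]["geneName"], rows[0]["x"], rows[0]["y"])
```

Converting only the columns that are present keeps the same function usable for bin GEM and cellbin GEM files, where "ExonCount" and "CellID" are optional.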

An example of bin GEM:

An example of cellbin GEM:
