# Matrices

## [Gene Expression File (GEF)](/saw-user-manual-v8.2/advanced/expression-matrix-format.md#gene-expression-file-gef)

Gene expression file (GEF) is a data management and storage format designed to support multidimensional datasets and high computational efficiency. The Stereo-seq analysis workflow generates bin GEF and cellbin GEF files. The bin GEF format is a hierarchically structured data model that stores one or more gene expression matrices at various bin sizes. The cellbin GEF format stores expression information for each cell. Each GEF container organizes a collection of spatial gene expression matrices and includes two primary data objects: Group and Dataset. A Dataset is a multidimensional array of data elements. A Group is analogous to a file system directory and organizes datasets and other groups hierarchically.

## [Gene Expression Matrix (GEM)](/saw-user-manual-v8.2/advanced/expression-matrix-format.md#gene-expression-matrix-gem)

Gene expression matrix (GEM) stores spatial gene expression data. SAW generates multiple GEM files in the workflow. The basic format requires six columns, preceded by a header row with the column names: gene ID, gene name, x coordinate, y coordinate, MID count, and exon count. A cellbin GEM adds a seventh column for the cell ID. The expression matrix for the maximum enclosing rectangle region additionally carries several annotation rows, each starting with "#", before the column-name row. The header field names and field types are described in the table in the linked format reference.
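The layout described above (optional `#` annotation rows, one column-name row, then data rows) can be read with a few lines of Python. This is a minimal sketch, not SAW's own reader. It assumes the file is tab-delimited, and the column names `geneID`, `geneName`, `x`, `y`, `MIDCount`, and `ExonCount` as well as the sample annotation line are hypothetical placeholders; check them against the header of your own file.

```python
import csv
import io

# Assumed names for the numeric columns; adjust to match the real header row.
NUMERIC_COLUMNS = ("x", "y", "MIDCount", "ExonCount")


def read_gem(handle):
    """Parse a GEM text stream into (annotations, rows).

    Lines starting with '#' are collected as annotations; the first
    remaining line is treated as the column-name row.
    """
    annotations = []
    data_lines = []
    for line in handle:
        if line.startswith("#"):
            annotations.append(line[1:].strip())
        elif line.strip():
            data_lines.append(line)

    rows = []
    for record in csv.DictReader(data_lines, delimiter="\t"):
        # Coordinates and counts become integers; IDs and names stay strings.
        for key in NUMERIC_COLUMNS:
            record[key] = int(record[key])
        rows.append(record)
    return annotations, rows


# Hypothetical sample content, for illustration only.
SAMPLE_GEM = (
    "#ExampleAnnotation=demo\n"
    "geneID\tgeneName\tx\ty\tMIDCount\tExonCount\n"
    "G0001\tGeneA\t10\t20\t3\t2\n"
    "G0002\tGeneB\t11\t20\t1\t1\n"
)

annotations, rows = read_gem(io.StringIO(SAMPLE_GEM))
```

A cellbin GEM parses the same way; its seventh column is kept as a string under whatever name the header row gives it.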

## File types

The feature expression matrices generated by SAW pipelines come in two main types, bin GEF and cellbin GEF, which can be identified by file extension:

<table><thead><tr><th width="208">File extension</th><th>Description</th></tr></thead><tbody><tr><td><code>.gef</code></td><td><p>The feature expression matrix file in HDF5 format for visualization. It contains the MID count for each gene of each spot. A spot is a fixed-size square binning unit within which expression values are accumulated. By default, a visualization <code>.gef</code> includes spot sizes of bin 1, 5, 10, 20, 50, 100, 150, 200.</p><p><img src="/files/5XOlm8pKGTy36ZJuS384" alt=""></p></td></tr><tr><td><code>.cellbin.gef</code></td><td><p>The cellbin feature expression matrix file in HDF5 format. It contains the spatial location and area of each cell, the MID count for each gene of each cell, and the cluster the cell belongs to. In <code>.cellbin.gef</code>, the cell is the smallest data unit.<br><img src="/files/IXcalsBYa22HDNxyC3GH" alt=""></p><p><br><em>Only available when the cell segmentation was done based on a microscopy image.</em></p></td></tr></tbody></table>
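To make the idea of a spot concrete, the sketch below accumulates bin1 records into square spots of a chosen bin size. It illustrates the binning concept only and is not how SAW builds a `.gef`. The record layout `(gene, x, y, mid_count)` and the choice to label each spot by the coordinate of its origin corner are assumptions made for this example.

```python
from collections import defaultdict


def aggregate_to_bin(records, bin_size):
    """Sum bin1 MID counts into bin_size x bin_size square spots.

    records: iterable of (gene, x, y, mid_count) tuples at bin1 resolution.
    Returns a dict keyed by (gene, spot_x, spot_y).
    """
    binned = defaultdict(int)
    for gene, x, y, mid_count in records:
        # Snap each coordinate to the origin of its enclosing spot
        # (an assumed labelling convention for this sketch).
        spot_x = (x // bin_size) * bin_size
        spot_y = (y // bin_size) * bin_size
        binned[(gene, spot_x, spot_y)] += mid_count
    return dict(binned)


# Synthetic bin1 records, for illustration only.
bin1_records = [
    ("GeneA", 0, 0, 2),
    ("GeneA", 3, 4, 1),
    ("GeneA", 5, 0, 4),
    ("GeneB", 9, 9, 1),
]

bin5 = aggregate_to_bin(bin1_records, 5)
```

With a bin size of 5, the first two `GeneA` records fall into the same spot and their counts are added together, while the third lands in the neighbouring spot.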

## Transcriptome

Common output files of `SAW count` and `SAW realign` are listed:

<table><thead><tr><th width="185">File</th><th>Description</th></tr></thead><tbody><tr><td><code>&#x3C;SN>.raw.gef</code></td><td>Feature expression matrix includes the whole information over a complete chip region. It only has bin1 expression counts. </td></tr><tr><td><code>&#x3C;SN>.gef</code></td><td>Feature expression matrix. It is also a visualization GEF that includes expression counts for bin1, 5, 10, 20, 50, 100, 150, 200.</td></tr><tr><td><code>&#x3C;SN>.tissue.gef</code></td><td>Feature expression matrix under the tissue coverage region. It is also a visualization GEF that includes expression counts for bin1, 5, 10, 20, 50, 100, 150, 200.</td></tr><tr><td><code>&#x3C;SN>.cellbin.gef</code></td><td>Cellbin feature expression matrix records the information of cells individually, including the centroid coordinate, boundary coordinates, expression of genes, and cell area.</td></tr><tr><td><code>&#x3C;SN>.adjusted.cellbin.gef</code></td><td>Cellbin expression matrix with cell border expanding, based on <code>&#x3C;SN>_&#x3C;stainType>_mask_edm_dis_&#x3C;distance>.tif</code>.</td></tr></tbody></table>
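When scripting over an output directory, the suffixes in the table above can be used to tell the matrix files apart. The helper below is a hypothetical convenience function, not part of SAW. It checks the longest suffixes first so that, for example, `.adjusted.cellbin.gef` is not mistaken for a plain `.gef`. The descriptions are paraphrased from the table and the file names are made up.

```python
# Ordered from most specific to least specific suffix.
GEF_SUFFIXES = [
    (".adjusted.cellbin.gef", "cellbin matrix with expanded cell borders"),
    (".cellbin.gef", "cellbin matrix"),
    (".tissue.gef", "visualization matrix under the tissue region"),
    (".raw.gef", "bin1 matrix over the complete chip region"),
    (".gef", "visualization matrix with multiple bin sizes"),
]


def describe_gef(filename):
    """Return a short description for a GEF file name, or None if unknown."""
    for suffix, description in GEF_SUFFIXES:
        if filename.endswith(suffix):
            return description
    return None


# "SN_DEMO" is a placeholder serial number, not a real chip.
for name in ("SN_DEMO.raw.gef", "SN_DEMO.cellbin.gef", "SN_DEMO.txt"):
    print(name, "->", describe_gef(name))
```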

### Tissue statistics

{% hint style="success" %}
`<SN>.tissue.gef` is usually generated from a `<SN>.raw.gef` and a tissue segmentation image.
{% endhint %}

The `tissuecut.stat` file is located under `/STEREO_ANALYSIS_WORKFLOW_PROCESSING/EXPRESSION_MATRIX` and records statistics **under the detected tissue area**:

| Metric                           | Description                                                                                                                   |
| -------------------------------- | ----------------------------------------------------------------------------------------------------------------------------- |
| Tissue area in square nanometers | The physical tissue area of the sample slice, in square nanometers.                                                           |
| Contour area in pixel            | The area of the tissue region on the tissue segmentation image, in pixels.                                                    |
| Number of DNB under tissue       | The number of detected DNBs with RNA capture under the tissue region.                                                         |
| % of DNB under tissue            | The proportion of detected DNBs with RNA capture under the tissue region relative to the total counts across the entire chip. |
| Total gene type under tissue     | The total number of annotated gene types under the tissue region.                                                             |
| MID count under tissue           | MID counts under the tissue region.                                                                                           |
| % of MID under tissue            | The proportion of MID counts under the tissue region relative to the total counts across the entire chip.                     |
| Number of reads under tissue     | The number of sequencing reads under the tissue region.                                                                       |
| % of reads under tissue          | The proportion of sequencing reads under the tissue region relative to the total counts across the entire chip.               |
| Mean reads per spot (binN)       | Mean reads of each binN spot under the tissue region.                                                                         |
| Median reads per spot (binN)     | Median reads of each binN spot under the tissue region.                                                                       |
| Mean gene type per spot (binN)   | Mean gene type of each binN spot under the tissue region.                                                                     |
| Median gene type per spot (binN) | Median gene type of each binN spot under the tissue region.                                                                   |
| Mean MID per spot (binN)         | Mean MID count of each binN spot under the tissue region.                                                                     |
| Median MID per spot (binN)       | Median MID count of each binN spot under the tissue region.                                                                   |
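The per-spot and percentage metrics in the table can be illustrated with a small calculation. This is a sketch on synthetic numbers and is not the code that produces `tissuecut.stat`. It assumes a mapping from spot coordinates to MID counts for the whole chip and a set of spot coordinates lying under the tissue, and it only averages over the spots present in that mapping. How SAW treats empty spots is not specified here.

```python
from statistics import mean, median


def tissue_mid_stats(spot_mid_counts, tissue_spots):
    """Summarize MID counts under the tissue region.

    spot_mid_counts: {(x, y): MID count} for binN spots over the whole chip.
    tissue_spots: set of (x, y) spot coordinates under the tissue.
    Assumes at least one tissue spot and a non-zero chip total.
    """
    under_tissue = [
        count for spot, count in spot_mid_counts.items() if spot in tissue_spots
    ]
    total_chip = sum(spot_mid_counts.values())
    total_tissue = sum(under_tissue)
    return {
        "mid_under_tissue": total_tissue,
        "pct_mid_under_tissue": 100.0 * total_tissue / total_chip,
        "mean_mid_per_spot": mean(under_tissue),
        "median_mid_per_spot": median(under_tissue),
    }


# Synthetic bin50 spots, for illustration only.
spots = {(0, 0): 10, (50, 0): 30, (0, 50): 20, (50, 50): 40}
tissue = {(0, 0), (50, 0), (0, 50)}

stats = tissue_mid_stats(spots, tissue)
```

The same pattern applies to the read and gene-type metrics by swapping the per-spot quantity.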

## Microorganism

If you run `SAW count` on Stereo-seq N FFPE data with `--microorganism-detect` set, the spatial expression matrices of microorganisms are saved in `/outs/feature_expression/microorganism`.

The output files are:

<table><thead><tr><th width="266">File</th><th>Description</th></tr></thead><tbody><tr><td><code>&#x3C;SN>.microorganism.raw.gef</code></td><td>Feature expression matrix of microorganisms includes the whole information over a complete chip region. It only has bin1 expression counts. </td></tr><tr><td><code>&#x3C;SN>.microorganism.gef</code></td><td>Feature expression matrix of microorganisms. It is also a visualization GEF that includes expression counts for bin1, 5, 10, 20, 50, 100, 150, 200.</td></tr><tr><td><code>&#x3C;SN>.host_microorganism.raw.gef</code></td><td>Feature expression matrix of microorganisms and the host includes the whole information over a complete chip region. It only has bin1 expression counts. </td></tr><tr><td><code>&#x3C;SN>.host_microorganism.gef</code></td><td>Feature expression matrix of microorganisms and the host. It is also a visualization GEF that includes expression counts for bin1, 5, 10, 20, 50, 100, 150, 200.</td></tr><tr><td><code>&#x3C;SN>.microorganism.&#x3C;classification>.gem</code></td><td><p>Feature expression matrix of a specific classification of microbes. </p><p>Classifications of microorganisms include phylum, class, order, family, genus, and species.</p></td></tr></tbody></table>

### Microbe classification information

After microbe classification via Kraken2, two files appear under `/STEREO_ANALYSIS_WORKFLOW_PROCESSING/MICROORGANISM/ANALYSIS`: `seq_complete_info.txt` and `seq_complete_info_dedup.txt`. The latter has additionally undergone deduplication of the microbe alignments. Each row records the alignment result for one read, primarily including the read ID, spatial coordinates, MID count, taxonomic ID, scientific name, detailed biological classification, and read count.

```sh
$ head ./STEREO_ANALYSIS_WORKFLOW_PROCESSING/MICROORGANISM/ANALYSIS/seq_complete_info.txt
seq	x	y	umi	taxid	Scientific_Name	kindom	phylum	class	order	family	genus	species	count
V350264949L2C001R01701342667	11662	10937	ABB	77643	Mycobacterium_tuberculosis_complex	k__Bacteria	p__Actinobacteria	c__Actinomycetia	o__Corynebacteriales	f__Mycobacteriaceae	g__Mycobacterium		1
V350264949L2C001R01701347399	2742	13877	2B7	1783272	Terrabacteria_group	k__Bacteria							1
V350264949L2C001R01900083155	18561	10644	C9E	5338	Agaricales	k__Fungi	p__Basidiomycota	c__Agaricomycetes	o__Agaricales				1
V350264949L2C001R01900086639	3861	14913	810	1760	Actinomycetia	k__Bacteria	p__Actinobacteria	c__Actinomycetia					1
V350264949L2C001R01800770541	4264	17661	C20	1783272	Terrabacteria_group	k__Bacteria							1
V350264949L2C001R01900181247	4396	16735	6EC	1762	Mycobacteriaceae	k__Bacteria	p__Actinobacteria	c__Actinomycetia	o__Corynebacteriales	f__Mycobacteriaceae			1
V350264949L2C001R01900245227	18830	15878	D88	2	Bacteria	k__Bacteria							1
V350264949L2C001R01900248762	18671	15840	77E	2	Bacteria	k__Bacteria							1
V350264949L2C001R01900262154	15242	9034	4D0	1224	Proteobacteria	k__Bacteria	p__Proteobacteria						1
```
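A file with this layout can be summarized per taxonomic rank using Python's standard library. The sketch below assumes the file is tab-delimited with the header shown in the listing above (including the `kindom` spelling), and it groups empty rank fields under an `unclassified` label chosen for this example. The three sample rows are copied from that listing.

```python
import csv
import io
from collections import Counter

HEADER = [
    "seq", "x", "y", "umi", "taxid", "Scientific_Name", "kindom",
    "phylum", "class", "order", "family", "genus", "species", "count",
]

# Rows taken from the listing above; empty strings are unassigned ranks.
SAMPLE_ROWS = [
    ["V350264949L2C001R01701342667", "11662", "10937", "ABB", "77643",
     "Mycobacterium_tuberculosis_complex", "k__Bacteria", "p__Actinobacteria",
     "c__Actinomycetia", "o__Corynebacteriales", "f__Mycobacteriaceae",
     "g__Mycobacterium", "", "1"],
    ["V350264949L2C001R01900245227", "18830", "15878", "D88", "2",
     "Bacteria", "k__Bacteria", "", "", "", "", "", "", "1"],
    ["V350264949L2C001R01900262154", "15242", "9034", "4D0", "1224",
     "Proteobacteria", "k__Bacteria", "p__Proteobacteria",
     "", "", "", "", "", "1"],
]
sample_text = "\n".join("\t".join(fields) for fields in [HEADER] + SAMPLE_ROWS) + "\n"


def count_by_rank(handle, rank):
    """Sum the `count` column per value of the given rank column."""
    totals = Counter()
    for row in csv.DictReader(handle, delimiter="\t"):
        # Reads not resolved to this rank have an empty field.
        label = (row.get(rank) or "").strip() or "unclassified"
        totals[label] += int(row["count"])
    return totals


phylum_counts = count_by_rank(io.StringIO(sample_text), "phylum")
```

For a real run, pass an open file handle for `seq_complete_info.txt` or `seq_complete_info_dedup.txt` instead of the in-memory sample.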

## Proteome

If you run `SAW count` for a Stereo-CITE T FF analysis, the spatial protein expression matrices are saved in `/outs/feature_expression`.

The output files are:

<table><thead><tr><th width="185">File</th><th>Description</th></tr></thead><tbody><tr><td><code>&#x3C;SN>.protein.raw.gef</code></td><td>Feature expression matrix includes the whole information over a complete chip region. It only has bin1 expression counts. </td></tr><tr><td><code>&#x3C;SN>.protein.gef</code></td><td>Feature expression matrix. It is also a visualization GEF that includes expression counts for bin1, 5, 10, 20, 50, 100, 150, 200.</td></tr><tr><td><code>&#x3C;SN>.protein.tissue.gef</code></td><td>Feature expression matrix under the tissue coverage region. It is also a visualization GEF that includes expression counts for bin1, 5, 10, 20, 50, 100, 150, 200.</td></tr><tr><td><code>&#x3C;SN>.protein.cellbin.gef</code></td><td>Cellbin feature expression matrix records the information of cells individually, including the centroid coordinate, boundary coordinates, expression of genes, and cell area.</td></tr><tr><td><code>&#x3C;SN>.protein.adjusted.cellbin.gef</code></td><td>Cellbin expression matrix with cell border expanding, based on <code>&#x3C;SN>_&#x3C;stainType>_mask_edm_dis_&#x3C;distance>.tif</code>.</td></tr><tr><td><code>&#x3C;SN>.protein.tissue.rmbg.gem.gz</code></td><td>Feature expression matrix from automatic protein background removal.  It shows bin1 expression counts. </td></tr></tbody></table>

### Tissue statistics

{% hint style="success" %}
`<SN>.protein.tissue.gef` is usually generated from a `<SN>.protein.raw.gef` and a tissue segmentation image.
{% endhint %}

The `protein.tissuecut.stat` file is located under `/STEREO_ANALYSIS_WORKFLOW_PROCESSING/EXPRESSION_MATRIX` and records statistics **under the detected tissue area**:

| Metric                              | Description                                                                                                                   |
| ----------------------------------- | ----------------------------------------------------------------------------------------------------------------------------- |
| Tissue area in square nanometers    | The physical tissue area of the sample slice, in square nanometers.                                                           |
| Contour area in pixel               | The area of the tissue region on the tissue segmentation image, in pixels.                                                    |
| Number of DNB under tissue          | The number of detected DNBs with ADT capture under the tissue region.                                                         |
| % of DNB under tissue               | The proportion of detected DNBs with ADT capture under the tissue region relative to the total counts across the entire chip. |
| Total protein type under tissue     | The total number of annotated protein types under the tissue region.                                                          |
| MID count under tissue              | MID counts under the tissue region.                                                                                           |
| % of MID under tissue               | The proportion of MID counts under the tissue region relative to the total counts across the entire chip.                     |
| Number of reads under tissue        | The number of sequencing reads under the tissue region.                                                                       |
| % of reads under tissue             | The proportion of sequencing reads under the tissue region relative to the total counts across the entire chip.               |
| Mean reads per spot (binN)          | Mean reads of each binN spot under the tissue region.                                                                         |
| Median reads per spot (binN)        | Median reads of each binN spot under the tissue region.                                                                       |
| Mean protein type per spot (binN)   | Mean protein type of each binN spot under the tissue region.                                                                  |
| Median protein type per spot (binN) | Median protein type of each binN spot under the tissue region.                                                                |
| Mean MID per spot (binN)            | Mean MID count of each binN spot under the tissue region.                                                                     |
| Median MID per spot (binN)          | Median MID count of each binN spot under the tissue region.                                                                   |
