Spectre - simple discovery workflow

This protocol was constructed using Spectre version:

## [1] '0.4.1'

Links:

Back to Spectre home page: https://immunedynamics.io/spectre
Report an issue: https://github.com/ImmuneDynamics/Spectre/issues
Get assistance: https://github.com/ImmuneDynamics/Spectre/discussions

Introduction

Overview

Here we provide a worked example of a ‘simple’ discovery analysis workflow, where the entire process (data prep, clustering, dimensionality reduction, cluster annotation, plotting, summary data, and statistical analysis) is contained within a single script. The analytical workflow is described in our pre-print (Ashhurst TM, Marsh-Wakefield F, Putri GH et al., 2020).

The ‘simple’ workflow is most suitable for fast analysis of small datasets. For larger or more complex datasets, or datasets with multiple batches, we recommend the general discovery workflow, where the data preparation, batch alignment, clustering/dimensionality reduction, and quantitative analysis are separated into separate scripts. The demo dataset used for this worked example are cells extracted from mock- or virally-infected mouse brains, measured by flow cytometry.

Strategy

The ‘simple’ and ‘general’ discovery workflows are designed to facilitate the analysis of large and complex cytometry datasets using the Spectre R package. We’ve tested up to 30 million cells in a single analysis session so far. The workflow is designed to get around the cell number limitations of tSNE/UMAP. The analysis starts with clustering with FlowSOM – which is fast and scales well to large datasets. The clustered data is then downsampled, and dimensionality reduction is performed with tSNE/UMAP. This allows for visualisation of the data, and the clusters present in the dataset. Once the possible cell types in the datasets have been explored, the clusters can be labelled with the appropriate cellular identities.

Finally, we can use the clusters/populations to generate summary statistics (expression levels, frequencies, total counts etc), which allows us to create graphs and heatmaps, facilitating statistical analysis.

Batch alignment

The ‘simple’ discovery workflow does not include any batch alignment steps. If batch correction needs to be applied, we recommend using the general discovery workflow.

Citation

If you use Spectre in your work, please consider citing Ashhurst TM, Marsh-Wakefield F, Putri GH et al. (2020). bioRxiv. 2020.10.22.349563. To continue providing open-source tools such as Spectre, it helps us if we can demonstrate that our efforts are contributing to analysis efforts in the community. Please also consider citing the authors of the individual packages or tools (e.g. CytoNorm, FlowSOM, tSNE, UMAP, etc) that are critical elements of your analysis work. We have provided some generic text that you can use for your methods section with each protocol and on the ‘about’ page.

Sample methods blurb

Here is a sample methods blurb for this workflow. You may need to adapt this text to reflect any changes made in your analysis.

Computational analysis of data was performed using the Spectre R package (Ashhurst et al., 2020), with instructions and source code provided at https://github.com/ImmuneDynamics/spectre. Samples were initially prepared in FlowJo, and the population of interest was exported as raw value CSV files. Arcsinh transformation was performed on the data in R using a co-factor of 15 to redistribute the data on a linear scale and compress low end values near zero. The dataset was then merged into a single data.table, with keywords denoting the sample, group, and other factors added to each row (cell). The FlowSOM algorithm (Van Gassen et al., 2015) was then run on the merged dataset to cluster the data, where every cell is assigned to a specific cluster and metacluster. Subsequently, the data was downsampled and analysed by the dimensionality reduction algorithm Uniform Manifold Approximation and Projection (UMAP) (McInnes, Healy, Melville, 2018) for cellular visualisation.

Software and R script preparation

Software: for instructions downloading R, RStudio, and Spectre, please see this section on the home page.

Analysis script: Please visit https://github.com/ImmuneDynamics/Spectre, and download the repository:

You can then find the ‘simple discovery workflow’ script here:

Create a folder for your experiment, and place the script in that folder.

Data folder

Create a folder within your experiment folder called ‘data’, and place the exported files there.

Setup some sample metadata and place it in a folder called ‘metadata’ or similar

Create a CSV file (using Microsoft Excel or similar) – we have called the file ‘sample.details’ here. The first column should be called ‘Filename’ or similar, and should contain the name of one file per row. On a Mac, you can copy the files and ‘paste’ them into excel – it will copy the name of the file, including extensions (“.csv”, “.fcs”, etc). You can then add as many additional columns as you like, and these can be called whatever you like (e.g. “Sample” could be “SampleName”, or “Sample_Name” etc).

“Sample” is a recommended column, as this can be a more simplified name for each sample
“Group” is extremely useful for most analyses
“Batch” is helpful if you have prepared, stained, or run samples in multiple batches. If only a single batch is used, we still recommend entering a ‘Batch’ column with all rows containing ‘1’.
“Cells per sample” is a useful column to add if you intend to generate absolute counts of each population per sample during the generation of summary data, but is not required otherwise.

1. Load packages and set directories

Open the analysis script in RStudio and open the simple discovery workflow script.

############################################################################
#### 1. Load packages, and set working directory
############################################################################

Load the Spectre and other required libraries

Running library(Spectre) will load the Spectre package. We can then use package.check() to see if the standard dependency packages are installed, and package.load() to load those packages.

### Load libraries

    library(Spectre)
    Spectre::package.check()    # Check that all required packages are installed
    Spectre::package.load()     # Load required packages

Set ‘PrimaryDirectory’

Initially, we will set the location of the script as ‘PrimaryDirectory’. We’ll use this as a sort of ‘home page’ for where our analysis is going to be performed – including where to find our input data, metadata, and where our output data will go.

### Set PrimaryDirectory
    dirname(rstudioapi::getActiveDocumentContext()$path)            # Finds the directory where this script is located
    setwd(dirname(rstudioapi::getActiveDocumentContext()$path))     # Sets the working directory to where the script is located
    getwd()
    PrimaryDirectory <- getwd()

Set ‘InputDirectory’

Next we need to set the location of the ‘data’ folder – where our samples for analysis are stored. In this example they are stored in a sub-folder called ‘data’.

### Set 'input' directory
    setwd(PrimaryDirectory)
    setwd("data/")
    InputDirectory <- getwd()
    setwd(PrimaryDirectory)

Set ‘MetaDirectory’

We need to set the location of the ‘metadata’ folder. This is where we can store a CSV file that contains any relevant metadata that we want to embed in our samples. In this example, it is located in a sub-folder called ‘metadata’.

### Set 'metadata' directory
    setwd(PrimaryDirectory)
    setwd("metadata/")
    MetaDirectory <- getwd()
    setwd(PrimaryDirectory)

Create ‘OutputDirectory’

We need to create a folder where our output data can go once our analysis is finished. In this example we will call this ‘Output_Spectre’.

### Create output directory
    dir.create("Output_Spectre", showWarnings = FALSE)
    setwd("Output_Spectre")
    OutputDirectory <- getwd()
    setwd(PrimaryDirectory)

Data preparation

summary(cars)

##      speed           dist
##  Min.   : 4.0   Min.   :  2.00
##  1st Qu.:12.0   1st Qu.: 26.00
##  Median :15.0   Median : 36.00
##  Mean   :15.4   Mean   : 42.98
##  3rd Qu.:19.0   3rd Qu.: 56.00
##  Max.   :25.0   Max.   :120.00

dat <- Spectre::demo.clustered
dat

##                  FileName      NK11        CD3     CD45       Ly6G    CD11b
##      1:   CNS_Mock_01.csv   42.3719  40.098700  6885.08  -344.7830 14787.30
##      2:   CNS_Mock_01.csv   42.9586 119.014000  1780.29  -429.6650  5665.73
##      3:   CNS_Mock_01.csv   59.2366 206.238000 10248.30 -1603.8400 19894.30
##      4:   CNS_Mock_01.csv  364.9480  -0.233878  3740.04  -815.9800  9509.43
##      5:   CNS_Mock_01.csv  440.2470  40.035200  9191.38    40.5055  5745.82
##     ---
## 169000: CNS_WNV_D7_06.csv  910.8890  72.856100 31466.20  -316.5570 28467.80
## 169001: CNS_WNV_D7_06.csv  -10.2642  64.188700 45188.00  -540.5140 22734.00
## 169002: CNS_WNV_D7_06.csv -184.2910  -9.445650 11842.60   -97.9383 17237.00
## 169003: CNS_WNV_D7_06.csv  248.3860 229.986000 32288.20  -681.1630 19255.80
## 169004: CNS_WNV_D7_06.csv  738.9810  95.470300 46185.10 -1004.6000 22957.80
##              B220      CD8a       Ly6C       CD4  NK11_asinh    CD3_asinh
##      1:  -40.2399   83.7175   958.7000  711.0720  0.04235923  0.040087962
##      2:   86.6673   34.7219   448.2590  307.2720  0.04294540  0.118734817
##      3:  427.8310  285.8800  1008.8300  707.0940  0.05920201  0.204803270
##      4:  182.4200  333.6050   440.0710  249.7840  0.35729716 -0.000233878
##      5: -211.6940  149.2200    87.4815  867.5700  0.42713953  0.040024513
##     ---
## 169000:   -7.7972 -271.8040 12023.7000 1103.0500  0.81693878  0.072791800
## 169001:  202.4110 -936.4920  4188.3300  315.9400 -0.01026402  0.064144703
## 169002:  123.4760 -219.9320  8923.4000 -453.4640 -0.18326344 -0.009445510
## 169003: -656.0540 -201.5880 10365.7000   61.6765  0.24590035  0.228005328
## 169004: -661.6280   72.3356  9704.4700  -31.8532  0.68430866  0.095325863
##         CD45_asinh  Ly6G_asinh CD11b_asinh   B220_asinh  CD8a_asinh Ly6C_asinh
##      1:   2.627736 -0.33829345    3.388057 -0.040229048  0.08362002  0.8518665
##      2:   1.340828 -0.41743573    2.435282  0.086559169  0.03471493  0.4344615
##      3:   3.022631 -1.25101677    3.684212  0.415750122  0.28212257  0.8876036
##      4:   2.029655 -0.74509796    2.948184  0.181423123  0.32770787  0.4269784
##      5:   2.914359  0.04049443    2.449108 -0.210143906  0.14867171  0.0873703
##     ---
## 169000:   4.142314 -0.31149515    4.042229 -0.007797121 -0.26856390  3.1817517
## 169001:   4.504101 -0.51715205    3.817492  0.201053740 -0.83574631  2.1394053
## 169002:   3.166628 -0.09778240    3.541046  0.123164374 -0.21819650  2.8849492
## 169003:   4.168089 -0.63716643    3.651633 -0.616293228 -0.20024703  3.0339681
## 169004:   4.525922 -0.88462254    3.827279 -0.620947819  0.07227267  2.9683779
##           CD4_asinh     Sample Group Batch FlowSOM_cluster FlowSOM_metacluster
##      1:  0.66171351 01_Mock_01  Mock     A              23                   2
##      2:  0.30263135 01_Mock_01  Mock     A              55                   2
##      3:  0.65846851 01_Mock_01  Mock     A              64                   2
##      4:  0.24725691 01_Mock_01  Mock     A              53                   2
##      5:  0.78456678 01_Mock_01  Mock     A             110                   4
##     ---
## 169000:  0.95239703  12_WNV_06   WNV     A              72                   3
## 169001:  0.31090687  12_WNV_06   WNV     A              46                   3
## 169002: -0.43920651  12_WNV_06   WNV     A             133                   3
## 169003:  0.06163746  12_WNV_06   WNV     A             133                   3
## 169004: -0.03184782  12_WNV_06   WNV     A             103                   3
##                Population     UMAP_X    UMAP_Y
##      1:         Microglia -2.3603757  6.201213
##      2:         Microglia  2.7505242  7.119595
##      3:         Microglia -2.9486033  4.012670
##      4:         Microglia  0.6482904  6.481466
##      5:          NK cells -2.3941295  6.975885
##     ---
## 169000: Infil Macrophages -2.9640724 -5.058265
## 169001: Infil Macrophages -1.2644785 -3.555824
## 169002: Infil Macrophages -2.3592682 -2.429467
## 169003: Infil Macrophages -1.9531062 -4.049705
## 169004: Infil Macrophages -0.7404098 -4.686928

## Non-numeric values detected in col.axis -- using col.type = 'factor'

## Check your working directory for a new .png called 'Multi plot.png'

Including Plots

You can also embed plots, for example:

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.