Spectre - simple discovery workflow
Thomas Ashhurst, Felix Marsh-Wakefield, Givanna Putri
14/03/2021
This protocol was constructed using Spectre version:
## [1] '0.4.1'
Links:
- Back to Spectre home page: https://immunedynamics.io/spectre
- Report an issue: https://github.com/ImmuneDynamics/Spectre/issues
- Get assistance: https://github.com/ImmuneDynamics/Spectre/discussions
Introduction
Overview
Here we provide a worked example of a ‘simple’ discovery analysis workflow, where the entire process (data prep, clustering, dimensionality reduction, cluster annotation, plotting, summary data, and statistical analysis) is contained within a single script. The analytical workflow is described in our pre-print (Ashhurst TM, Marsh-Wakefield F, Putri GH et al., 2020).
The ‘simple’ workflow is most suitable for fast analysis of small datasets. For larger or more complex datasets, or datasets with multiple batches, we recommend the general discovery workflow, where data preparation, batch alignment, clustering/dimensionality reduction, and quantitative analysis are split into separate scripts. The demo dataset used for this worked example consists of cells extracted from mock- or virally-infected mouse brains, measured by flow cytometry.
Strategy
The ‘simple’ and ‘general’ discovery workflows are designed to facilitate the analysis of large and complex cytometry datasets using the Spectre R package. We’ve tested up to 30 million cells in a single analysis session so far. The workflow is designed to get around the cell number limitations of tSNE/UMAP. The analysis starts with clustering with FlowSOM – which is fast and scales well to large datasets. The clustered data is then downsampled, and dimensionality reduction is performed with tSNE/UMAP. This allows for visualisation of the data, and the clusters present in the dataset. Once the possible cell types in the datasets have been explored, the clusters can be labelled with the appropriate cellular identities.
Finally, we can use the clusters/populations to generate summary statistics (expression levels, frequencies, total counts etc), which allows us to create graphs and heatmaps, facilitating statistical analysis.
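As a rough illustration of the order of operations described above, here is a minimal base R sketch on simulated data. It is not the Spectre implementation: k-means stands in for FlowSOM, PCA stands in for tSNE/UMAP, and the marker names, cell numbers, and cluster count are all made up for the example.

```r
set.seed(42)  # reproducible simulated data, clustering, and subsampling

# Simulated stand-in for a merged dataset: 6,000 "cells", 3 markers, 2 groups
n <- 6000
dat <- data.frame(
  Group   = rep(c("Mock", "WNV"), each = n / 2),
  MarkerA = c(rnorm(n / 2, mean = 0), rnorm(n / 2, mean = 2)),
  MarkerB = rnorm(n, mean = 1),
  MarkerC = c(rnorm(n / 2, mean = 3), rnorm(n / 2, mean = 0))
)
marker_cols <- c("MarkerA", "MarkerB", "MarkerC")

# Step 1: cluster ALL cells (k-means here as a stand-in for FlowSOM)
dat$cluster <- kmeans(dat[, marker_cols], centers = 4,
                      nstart = 5, iter.max = 50)$cluster

# Step 2: downsample to an equal number of cells per group
cells_per_group <- 500
keep <- unlist(lapply(split(seq_len(nrow(dat)), dat$Group),
                      function(idx) sample(idx, cells_per_group)))
dat_sub <- dat[keep, ]

# Step 3: dimensionality reduction on the subsample only
# (PCA here as a stand-in for tSNE/UMAP)
pca <- prcomp(dat_sub[, marker_cols], scale. = TRUE)
dat_sub$DR_X <- pca$x[, 1]
dat_sub$DR_Y <- pca$x[, 2]

# Step 4: summary statistics from the full clustered data
# (proportion of each group's cells falling in each cluster)
cluster_freq <- prop.table(table(dat$Group, dat$cluster), margin = 1)
print(round(cluster_freq, 3))
```

The point of the ordering is that the slow step (dimensionality reduction) only ever sees the subsample, while the cluster assignments and summary statistics still come from every cell.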
Batch alignment
The ‘simple’ discovery workflow does not include any batch alignment steps. If batch correction needs to be applied, we recommend using the general discovery workflow.
Citation
If you use Spectre in your work, please consider citing Ashhurst TM, Marsh-Wakefield F, Putri GH et al. (2020). bioRxiv. 2020.10.22.349563. To continue providing open-source tools such as Spectre, it helps us if we can demonstrate that our efforts are contributing to analysis efforts in the community. Please also consider citing the authors of the individual packages or tools (e.g. CytoNorm, FlowSOM, tSNE, UMAP, etc) that are critical elements of your analysis work. We have provided some generic text that you can use for your methods section with each protocol and on the ‘about’ page.
Sample methods blurb
Here is a sample methods blurb for this workflow. You may need to adapt this text to reflect any changes made in your analysis.
Computational analysis of data was performed using the Spectre R package (Ashhurst et al., 2020), with instructions and source code provided at https://github.com/ImmuneDynamics/spectre. Samples were initially prepared in FlowJo, and the population of interest was exported as raw value CSV files. Arcsinh transformation was performed on the data in R using a co-factor of 15 to redistribute the data on a linear scale and compress low end values near zero. The dataset was then merged into a single data.table, with keywords denoting the sample, group, and other factors added to each row (cell). The FlowSOM algorithm (Van Gassen et al., 2015) was then run on the merged dataset to cluster the data, where every cell is assigned to a specific cluster and metacluster. Subsequently, the data was downsampled and analysed by the dimensionality reduction algorithm Uniform Manifold Approximation and Projection (UMAP) (McInnes, Healy, Melville, 2018) for cellular visualisation.
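For readers unfamiliar with the transformation named in the blurb, the arithmetic can be sketched in base R as follows. The helper name `arcsinh_transform` and the raw values are hypothetical and only illustrate the formula asinh(x / co-factor); this is not the function used by the workflow script.

```r
# Arcsinh transformation with a co-factor, using only base R
arcsinh_transform <- function(x, cofactor = 15) {
  asinh(x / cofactor)
}

raw <- c(-50, 0, 15, 150, 15000)  # made-up raw values
transformed <- arcsinh_transform(raw, cofactor = 15)
print(round(transformed, 3))
```

Values close to zero stay approximately linear, while large values are compressed roughly logarithmically. Negative values remain defined, which is one reason arcsinh is often preferred over a plain log for cytometry data.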
Software and R script preparation
Software: for instructions on downloading R, RStudio, and Spectre, please see this section on the home page.
Analysis script: Please visit https://github.com/ImmuneDynamics/Spectre, and download the repository:
You can then find the ‘simple discovery workflow’ script here:
Create a folder for your experiment, and place the script in that folder.
Data folder
Create a folder within your experiment folder called ‘data’, and place the exported files there.
Set up some sample metadata and place it in a folder called ‘metadata’ or similar
Create a CSV file (using Microsoft Excel or similar); we have called the file ‘sample.details’ here. The first column should be called ‘Filename’ or similar, and should contain the name of one file per row. On a Mac, you can copy the files and ‘paste’ them into Excel, which will paste the name of each file, including extensions (“.csv”, “.fcs”, etc). You can then add as many additional columns as you like, and these can be called whatever you like (e.g. “Sample” could be “SampleName”, or “Sample_Name” etc).
- “Sample” is a recommended column, as this can be a simpler name for each sample
- “Group” is extremely useful for most analyses
- “Batch” is helpful if you have prepared, stained, or run samples in multiple batches. If only a single batch is used, we still recommend entering a ‘Batch’ column with all rows containing ‘1’.
- “Cells per sample” is a useful column to add if you intend to generate absolute counts of each population per sample during the generation of summary data, but is not required otherwise.
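If you prefer to build the metadata file in R rather than a spreadsheet, something like the following sketch could work. The two file names, sample names, groups, and batch are taken from the demo dataset; the cell counts are made up, and the file is written to a temporary folder here purely so the example has no side effects (in practice you would write it into your ‘metadata’ folder).

```r
# Hypothetical metadata for two of the demo files; adapt to your own samples
sample.details <- data.frame(
  Filename = c("CNS_Mock_01.csv", "CNS_WNV_D7_06.csv"),
  Sample = c("01_Mock_01", "12_WNV_06"),
  Group = c("Mock", "WNV"),
  Batch = c("A", "A"),
  `Cells per sample` = c(120000L, 250000L),  # made-up counts
  check.names = FALSE  # keep the spaces in the column name
)

# Written to a temporary folder for this sketch only
meta_file <- file.path(tempdir(), "sample.details.csv")
write.csv(sample.details, meta_file, row.names = FALSE)

# Read the file back and check that each file name appears only once
meta_check <- read.csv(meta_file, check.names = FALSE)
print(meta_check)
stopifnot(!anyDuplicated(meta_check$Filename))
```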
1. Load packages and set directories
Open the simple discovery workflow script in RStudio.
############################################################################
#### 1. Load packages, and set working directory
############################################################################
Load the Spectre and other required libraries
Running library(Spectre) will load the Spectre package. We can then use package.check() to see if the standard dependency packages are installed, and package.load() to load those packages.
### Load libraries
library(Spectre)
Spectre::package.check() # Check that all required packages are installed
Spectre::package.load() # Load required packages
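If you are unsure whether Spectre is installed, a defensive variant of the loading step could look like the sketch below. It uses base R's `requireNamespace()` to test for the package first; the variable name `spectre_installed` is just for illustration.

```r
# Check for Spectre before trying to load it
spectre_installed <- requireNamespace("Spectre", quietly = TRUE)

if (spectre_installed) {
  library(Spectre)
  Spectre::package.check()  # Check that all required packages are installed
  Spectre::package.load()   # Load required packages
} else {
  message("Spectre is not installed; see the installation instructions on the Spectre home page.")
}
```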
Set ‘PrimaryDirectory’
Initially, we will set the location of the script as ‘PrimaryDirectory’. We’ll use this as a sort of ‘home page’ for where our analysis is going to be performed – including where to find our input data, metadata, and where our output data will go.
### Set PrimaryDirectory
dirname(rstudioapi::getActiveDocumentContext()$path) # Finds the directory where this script is located
setwd(dirname(rstudioapi::getActiveDocumentContext()$path)) # Sets the working directory to where the script is located
getwd()
PrimaryDirectory <- getwd()
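Note that `rstudioapi::getActiveDocumentContext()` only works inside RStudio. If the script might also be run elsewhere (for example with `Rscript`), one possible fallback, sketched here as an assumption rather than part of the original workflow, is to keep the current working directory when RStudio is not available:

```r
# Use the script location when running inside RStudio;
# otherwise keep the current working directory
if (requireNamespace("rstudioapi", quietly = TRUE) && rstudioapi::isAvailable()) {
  script_path <- rstudioapi::getActiveDocumentContext()$path
  if (nzchar(script_path)) {  # the path is empty for an unsaved document
    setwd(dirname(script_path))
  }
}
PrimaryDirectory <- getwd()
PrimaryDirectory
```

Outside RStudio, you would then need to start R from the experiment folder (or call `setwd()` yourself) before running the script.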
Set ‘InputDirectory’
Next we need to set the location of the ‘data’ folder – where our samples for analysis are stored. In this example they are stored in a sub-folder called ‘data’.
### Set 'input' directory
setwd(PrimaryDirectory)
setwd("data/")
InputDirectory <- getwd()
setwd(PrimaryDirectory)
Set ‘MetaDirectory’
We need to set the location of the ‘metadata’ folder. This is where we can store a CSV file that contains any relevant metadata that we want to embed in our samples. In this example, it is located in a sub-folder called ‘metadata’.
### Set 'metadata' directory
setwd(PrimaryDirectory)
setwd("metadata/")
MetaDirectory <- getwd()
setwd(PrimaryDirectory)
Create ‘OutputDirectory’
We need to create a folder where our output data can go once our analysis is finished. In this example we will call this ‘Output_Spectre’.
### Create output directory
dir.create("Output_Spectre", showWarnings = FALSE)
setwd("Output_Spectre")
OutputDirectory <- getwd()
setwd(PrimaryDirectory)
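The same three directories can also be defined without changing the working directory back and forth, by building the paths with `file.path()`. This is an alternative sketch, not the approach used in the workflow script. Here `demo_root` is a hypothetical temporary folder standing in for your experiment folder; in a real analysis you would use `PrimaryDirectory` in its place and the ‘data’ and ‘metadata’ folders would already exist.

```r
# A temporary folder stands in for the experiment folder (hypothetical)
demo_root <- file.path(tempdir(), "my_experiment")
dir.create(file.path(demo_root, "data"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(demo_root, "metadata"), showWarnings = FALSE)

# Build each path from the root instead of calling setwd() repeatedly
InputDirectory  <- file.path(demo_root, "data")
MetaDirectory   <- file.path(demo_root, "metadata")
OutputDirectory <- file.path(demo_root, "Output_Spectre")
dir.create(OutputDirectory, showWarnings = FALSE)

# Fail early if any folder is missing
stopifnot(dir.exists(InputDirectory),
          dir.exists(MetaDirectory),
          dir.exists(OutputDirectory))
```

Checking the folders up front gives a clear error at the start of the script, rather than a failure later when files are read or written.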
Data preparation
dat <- Spectre::demo.clustered
dat
## FileName NK11 CD3 CD45 Ly6G CD11b
## 1: CNS_Mock_01.csv 42.3719 40.098700 6885.08 -344.7830 14787.30
## 2: CNS_Mock_01.csv 42.9586 119.014000 1780.29 -429.6650 5665.73
## 3: CNS_Mock_01.csv 59.2366 206.238000 10248.30 -1603.8400 19894.30
## 4: CNS_Mock_01.csv 364.9480 -0.233878 3740.04 -815.9800 9509.43
## 5: CNS_Mock_01.csv 440.2470 40.035200 9191.38 40.5055 5745.82
## ---
## 169000: CNS_WNV_D7_06.csv 910.8890 72.856100 31466.20 -316.5570 28467.80
## 169001: CNS_WNV_D7_06.csv -10.2642 64.188700 45188.00 -540.5140 22734.00
## 169002: CNS_WNV_D7_06.csv -184.2910 -9.445650 11842.60 -97.9383 17237.00
## 169003: CNS_WNV_D7_06.csv 248.3860 229.986000 32288.20 -681.1630 19255.80
## 169004: CNS_WNV_D7_06.csv 738.9810 95.470300 46185.10 -1004.6000 22957.80
## B220 CD8a Ly6C CD4 NK11_asinh CD3_asinh
## 1: -40.2399 83.7175 958.7000 711.0720 0.04235923 0.040087962
## 2: 86.6673 34.7219 448.2590 307.2720 0.04294540 0.118734817
## 3: 427.8310 285.8800 1008.8300 707.0940 0.05920201 0.204803270
## 4: 182.4200 333.6050 440.0710 249.7840 0.35729716 -0.000233878
## 5: -211.6940 149.2200 87.4815 867.5700 0.42713953 0.040024513
## ---
## 169000: -7.7972 -271.8040 12023.7000 1103.0500 0.81693878 0.072791800
## 169001: 202.4110 -936.4920 4188.3300 315.9400 -0.01026402 0.064144703
## 169002: 123.4760 -219.9320 8923.4000 -453.4640 -0.18326344 -0.009445510
## 169003: -656.0540 -201.5880 10365.7000 61.6765 0.24590035 0.228005328
## 169004: -661.6280 72.3356 9704.4700 -31.8532 0.68430866 0.095325863
## CD45_asinh Ly6G_asinh CD11b_asinh B220_asinh CD8a_asinh Ly6C_asinh
## 1: 2.627736 -0.33829345 3.388057 -0.040229048 0.08362002 0.8518665
## 2: 1.340828 -0.41743573 2.435282 0.086559169 0.03471493 0.4344615
## 3: 3.022631 -1.25101677 3.684212 0.415750122 0.28212257 0.8876036
## 4: 2.029655 -0.74509796 2.948184 0.181423123 0.32770787 0.4269784
## 5: 2.914359 0.04049443 2.449108 -0.210143906 0.14867171 0.0873703
## ---
## 169000: 4.142314 -0.31149515 4.042229 -0.007797121 -0.26856390 3.1817517
## 169001: 4.504101 -0.51715205 3.817492 0.201053740 -0.83574631 2.1394053
## 169002: 3.166628 -0.09778240 3.541046 0.123164374 -0.21819650 2.8849492
## 169003: 4.168089 -0.63716643 3.651633 -0.616293228 -0.20024703 3.0339681
## 169004: 4.525922 -0.88462254 3.827279 -0.620947819 0.07227267 2.9683779
## CD4_asinh Sample Group Batch FlowSOM_cluster FlowSOM_metacluster
## 1: 0.66171351 01_Mock_01 Mock A 23 2
## 2: 0.30263135 01_Mock_01 Mock A 55 2
## 3: 0.65846851 01_Mock_01 Mock A 64 2
## 4: 0.24725691 01_Mock_01 Mock A 53 2
## 5: 0.78456678 01_Mock_01 Mock A 110 4
## ---
## 169000: 0.95239703 12_WNV_06 WNV A 72 3
## 169001: 0.31090687 12_WNV_06 WNV A 46 3
## 169002: -0.43920651 12_WNV_06 WNV A 133 3
## 169003: 0.06163746 12_WNV_06 WNV A 133 3
## 169004: -0.03184782 12_WNV_06 WNV A 103 3
## Population UMAP_X UMAP_Y
## 1: Microglia -2.3603757 6.201213
## 2: Microglia 2.7505242 7.119595
## 3: Microglia -2.9486033 4.012670
## 4: Microglia 0.6482904 6.481466
## 5: NK cells -2.3941295 6.975885
## ---
## 169000: Infil Macrophages -2.9640724 -5.058265
## 169001: Infil Macrophages -1.2644785 -3.555824
## 169002: Infil Macrophages -2.3592682 -2.429467
## 169003: Infil Macrophages -1.9531062 -4.049705
## 169004: Infil Macrophages -0.7404098 -4.686928
## Non-numeric values detected in col.axis -- using col.type = 'factor'
## Check your working directory for a new .png called 'Multi plot.png'
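A quick way to get a feel for a clustered table like the one printed above is to count cells per sample and population. The sketch below uses a small stand-in table containing only the ten rows visible in the printed output (the name `dat_preview` is hypothetical); on the full `dat` object, the same `table()` call on `dat$Sample` and `dat$Population` should apply.

```r
# Stand-in table holding only the ten rows visible in the printed output
dat_preview <- data.frame(
  Sample = rep(c("01_Mock_01", "12_WNV_06"), each = 5),
  Group = rep(c("Mock", "WNV"), each = 5),
  Population = c(rep("Microglia", 4), "NK cells", rep("Infil Macrophages", 5))
)

# Cells per sample and population, then each population as a proportion of its sample
counts <- table(dat_preview$Sample, dat_preview$Population)
props <- prop.table(counts, margin = 1)
print(counts)
print(round(props, 2))
```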