In this workflow, we provide an example on how to run and make sense of a principal component analysis (PCA). PCA is capable of reducing the number of dimensions (i.e. parameters) with minimal effect on the variation of the given dataset. This function will run a PCA calculation (extremely fast) and generate plots (takes time). For individuals (such as samples or patients), a PCA can group them based on their similarities. A PCA is also capable of ranking variables/parameters (such as markers or cell counts) based on their contribution to the variability across a dataset in an extremely fast manner. In cytometry, this can be useful to identify marker(s) that can be used to differentiate between subset(s) of cells. This workflow is suitable for data with variables (such as markers or genes) and individuals (such as samples, patients or cells).
In this example, we will be using demo data that have undergone FlowSOM clustering. We will then calculate the level of each metacluster (as a proportion of total cells) for each mouse (individual). We will experiment with the different options available in
The PCA in
run.pca is done using
stats::prcomp. For plots, the package
factoextra was used. The colour scheme options are from the packages
More information on using a PCA and plot generation can be found here.
Here, we will simply load up the Spectre package and the demo dataset which has previously been analysed and annotated using our simple discovery workflow.
library(Spectre) library(data.table) cell.dat <- demo.clustered head(cell.dat)
We will first calculate the level of each
"FlowSOM_metacluster" column as a proportion for each individual,
# Select column to calculate proportions as.matrix(colnames(cell.dat)) #best not to run this if you've got lots of columns...
## [,1] ## [1,] "FileName" ## [2,] "NK11" ## [3,] "CD3" ## [4,] "CD45" ## [5,] "Ly6G" ## [6,] "CD11b" ## [7,] "B220" ## [8,] "CD8a" ## [9,] "Ly6C" ## [10,] "CD4" ## [11,] "NK11_asinh" ## [12,] "CD3_asinh" ## [13,] "CD45_asinh" ## [14,] "Ly6G_asinh" ## [15,] "CD11b_asinh" ## [16,] "B220_asinh" ## [17,] "CD8a_asinh" ## [18,] "Ly6C_asinh" ## [19,] "CD4_asinh" ## [20,] "Sample" ## [21,] "Group" ## [22,] "Batch" ## [23,] "FlowSOM_cluster" ## [24,] "FlowSOM_metacluster" ## [25,] "Population" ## [26,] "UMAP_X" ## [27,] "UMAP_Y"
prop.col.by <- "FlowSOM_metacluster" # Calculate proportions for each metacluster (adds extra column 'prop') cell.dat[, sum := .N, by = FileName][, prop := .N, by = c("FileName", prop.col.by)][, prop := prop/sum][, sum := NULL] # Select unique rows (i.e. remove duplicates) cell.dat <- unique(cell.dat, by = c("FileName", prop.col.by)) # Convert prop column to percentage cell.dat[["prop"]] <- 100*cell.dat[["prop"]] # Check proportional data make sense sum(cell.dat[Sample == "01_Mock_01"][["prop"]])
##  100
##should = 100## # Select columns of interest col.interest <- c("FileName", prop.col.by, "Group", "prop") # Select columns cell.dat.prop <- cell.dat[, ..col.interest] # Reorder rows based on prop.col.by (useful for later rearrangement) cell.dat.prop <- dplyr::arrange(cell.dat.prop, group_by = cell.dat.prop[[prop.col.by]]) # Change cluster name (based on prop.col.by) ifelse(grepl("metacluster", prop.col.by), cell.dat.prop[[prop.col.by]] <- paste0("mc", cell.dat.prop[[prop.col.by]]), cell.dat.prop[[prop.col.by]] <- paste0("c", cell.dat.prop[[prop.col.by]]))
##  "mc1"
# Rearrange data to be based on proportion cell.dat.prop <- tidyr::pivot_wider(cell.dat.prop, names_from = prop.col.by, values_from = prop) #will add "NA" when values are missing, which happens when % == 0 (which is great!) # Replace "NA" with "0" cell.dat.prop[is.na(cell.dat.prop)] <- 0 # Make data.table (useful later on) cell.dat.prop <- data.table::as.data.table(cell.dat.prop) head(cell.dat.prop)
This table contains the level of each metacluster as a proportion of total cells for each individual. It includes each mouse (
"FileName"), the group (
"Group"), and each metacluster as a separate column. We will use this as it represents the overall immune repertoire for each mouse.
We need to make sure the data are compatible with a PCA, as there are some things that will be problematic if not correct. In particular, we need to make sure there are no
NA. Each metacluster (variable) must have variance, otherwise the PCA will not work. We will also remove any metacluster that that has the same value for each mouse.
# Remove samples that have NA present cell.dat.prop <- na.omit(cell.dat.prop) # Remove unwanted group (e.g. "Batch") # cell.dat.prop <- cell.dat.prop[!mclust.data.na$GroupName == "Batch"] # Remove columns with constant values (particularly common for markers that aren't expressed on a cluster) #https://stackoverflow.com/questions/15068981/removal-of-constant-columns-in-r #https://stackoverflow.com/questions/48253732/remove-constant-columns-with-or-without-nas/48253983 cell.dat.prop <- cell.dat.prop[ , lapply(.SD, function(v) if(data.table::uniqueN(v, na.rm = TRUE) > 1) v)] head(cell.dat.prop)
We then need to select columns that will be used to run the PCA. In short, all the columns with the metacluster data.
# Select columns for PCA as.matrix(colnames(cell.dat.prop))
## [,1] ## [1,] "FileName" ## [2,] "Group" ## [3,] "mc1" ## [4,] "mc2" ## [5,] "mc3" ## [6,] "mc4" ## [7,] "mc5" ## [8,] "mc6" ## [9,] "mc7"
use.cols <- colnames(cell.dat.prop)[3:9] use.cols
##  "mc1" "mc2" "mc3" "mc4" "mc5" "mc6" "mc7"
# Define columns by removing those not of interest # use.cols <- colnames(cell.dat.prop)[-grep("FileName|Group", colnames(cell.dat.prop))]
We may it! Now we can run the PCA. There are lots of options available, so we will start with the basics.
# The essential parameters to define are `dat` and `use.cols`, and there is no default. We recommend `scale = TRUE` (which is the default), so all metaclusters are considered equal (as opposed to larger metacluster % skewing the results). # We will be running `pca.lite`, which will add PCA coordinates to our dataset with `add.pca.col = TRUE`. By default, up to 50 principal components (PC) with be added. If more (or less) are desired, this number can be altered with `pca.col.no`. cell.dat.pca <- Spectre::run.pca(dat = cell.dat.prop, use.cols = use.cols, scale = TRUE, #this is the default add.pca.col = TRUE, pca.col.no = 50, #this is the default pca.lite = TRUE ) head(cell.dat.pca)
We ran a PCA! Well done. Extra columns have been added for each of the calculated PC (Dim.1, Dim.2, …). The number of columns calculated is the lesser number of metaclusers (variables) or mice (individuals). As there can be a huge range of PC (particularly if dealing with single-cell RNA sequencing data), up to 50 PC will be added to our data (as defined by
pca.col.no). In our case, there are 7 metaclusters and 12 mice, so 7 PC were calculated.
The data generated can be used to manually create PCA plots. Otherwise continue reading and you can let
run.pca do it for you!
We can automatically create PCA plots that will be saved into our working directory. By default, results are saved wherever this RMD file is saved.
# Check where plots will be saved getwd() # Create folder for files to be saved dir.create("Output_PCA_1") setwd("Output_PCA_1") # Create plots. This includes a scree, variability, contribution and PCA plot. # Note this will be calculating the PCA again, but because it's such a fast calculation it still doesn't take much time. Spectre::run.pca(dat = cell.dat.prop, use.cols = use.cols, scale = TRUE, add.pca.col = FALSE, #this is the default pca.col.no = 50, pca.lite = FALSE, scree.plot = TRUE, #creates scree plot comp.no = 2, #number of components to be used (will be discussed more later) variable.contribution = FALSE, plot.individuals = TRUE, plot.ind.label = c("point", "text"), pointsize.ind = 1.5, row.names = "FileName", plot.variables = TRUE, var.numb = 20, #maximum number of variables to be plotted path = getwd() #this is the default )
Results! You can have a quick look here at what you created.
Plots should now appear in the output folder. Check them out! Don’t worry, I’ll wait…
Let’s unpack the figures created here. “PCA plot-individuals.pdf” contains a PCA plot. Each dot represents a single mouse (individual). Using the option
plot.ind.label, we have included the name of each mouse based on the column
"FileName". The x-axis is PC1 (or “Dim.1”), and will summarise the largest amount of variance (in this case 70.1 %). The y-axis is PC2, summarising the second largest amount of variance (15.9 %).
“PCA plot-variables.pdf” shows the metaclusters (variables) that contribute most to the variance. The directions reflects how these metaclusters relate to each other. Metaclusters closer together positively correlate. Metaclusters pointing in the opposite direction negatively correlate. Metaclusters perpendicular to each other have no correlation. The size (and colour) of the arrow indicates the level of contribution that particular metacluster makes to the variance.
“PCA plot-combined.pdf” is a combination of both the PCA and variable plot.
“Scree plot.pdf” is a scree plot. This plot shows how much each PC (dimension) contributes to the overall variance. We will discuss this more in a bit…
We have also created two CSV files. “PCA-individuals.csv” is a printed file of the input data combined with the PCA coordinates (for each of the PC). “PCA-variables.csv” contains the coordinates of the metaclusters.
Great work! Let’s have a look at a few more options we can play around with. We will have at differentiating mice on the PCA plot based on the group they belong to.
# Check where plots will be saved getwd() # Create folder for files to be saved dir.create("Output_PCA_2") setwd("Output_PCA_2") Spectre::run.pca(dat = cell.dat.prop, use.cols = use.cols, scale = TRUE, add.pca.col = FALSE, pca.col.no = 50, pca.lite = FALSE, scree.plot = FALSE, comp.no = 2, variable.contribution = FALSE, plot.individuals = TRUE, plot.ind.label = "point", pointsize.ind = 1.5, # row.names = "FileName", plot.ind.group = TRUE, #need to state 'group.ind' group.ind = "Group", #need 'plot.ind.group = TRUE' colour.group = "viridis", pointsize.group = 3, ellipse.type = "confidence", ellipse.level = 0.95, #95 % confidence interval mean.point = TRUE, #large symbol representing middle of ellipse randomise.order = TRUE, #if there are multiple individuals, this makes them added to the plots in a random order (so single cells from one group don't cover up another) order.seed = 42, plot.variables = FALSE, plot.combined = FALSE, # var.numb = 20, path = getwd() )
Results again! You can have a quick look here at what you created.
More plots! Look at you go. The new one to look at is “PCA plot-ind-groups.pdf”. The colours and symbols represent the group each mouse belongs to. We can see there is clear separation between these groups. This suggests the overall immune cell repertoire of individual mice differ based on the group they belong. Furthermore, the difference between groups is divided by the x-axis alone (i.e., the PC that contributes most to the difference).
Now we’ll spend some time with the screen plot and variables. As mentioned, the scree plot shows the variance of each PC. Ideally, we try and find the smallest number of PC that summarise the greatest level of variance. This is commonly referred to as the “elbow point”. In the first scree plot created, there was a large drop after PC1. In this case, PC2 is the elbow point. This is not always the case, but sticking with two components is useful for visualising the data. (We will discuss what this selection can mean in the next set of plots we’ll be creating.) We can use
comp.no to change the number of components to be used for calculations (default
comp.no = 2). If
comp.no = NULL, a scree plot will be created and the selection can be made so we can select the elbow point.
Hopefully this will start making sense after these next plots.
# Check where plots will be saved getwd() # Create folder for files to be saved dir.create("Output_PCA_3") setwd("Output_PCA_3") Spectre::run.pca(dat = cell.dat.prop, use.cols = use.cols, scale = TRUE, add.pca.col = FALSE, pca.col.no = 50, pca.lite = FALSE, scree.plot = TRUE, comp.no = 3, #changing from the default = 2 variable.contribution = TRUE, plot.individuals = FALSE, # plot.ind.label = "point", # pointsize.ind = 1.5, # row.names = "FileName", # plot.ind.group = TRUE, # group.ind = "Group", # colour.group = "viridis", # pointsize.group = 3, # ellipse.type = "confidence", # ellipse.level = 0.95, # mean.point = TRUE, # randomise.order = TRUE, order.seed = 42, plot.variables = TRUE, plot.combined = FALSE, var.numb = 20, path = getwd() )
More results again! You can have a quick look here at what you created.
Even more plots! The important plot here to consider is “PCA plot-contribution.pdf”. This plot is showing the contribution of each metacluster to the overall variance in the first three PC. (This is where the
comp.no becomes important.) The metaclusters are ordered based on their level of contribution to the overall variance. In this case “mc7” is the largest contributor. The red dashed line is the average contribution if all metaclusters were equal. As we’re only dealing with seven metaclusters, the difference is not that great.
There were also three other new files created. “PCA-eigenvalues.csv” contains the calculations for the variables (including eigenvalues). These results were then used to calculate the contributions in the previous plot, with the values saved in “PCA-eig-contrib_3-dim.csv”. These values are based on the selected number of components with
comp.no. You can experiment to see how this will change the results. “PCA-contributions.csv” contains the contributions of each metacluster for each individual PC.
And that’s it! There’s a lot here, and a lot more you can do. More information on using a PCA and plot generation can be found here.
A common question to consider is whether you can do a statistical analysis to compare between groups. Although we don’t have anything implemented here, I recommend checking out the
vegan package to run a “PERMANOVA” (permutational multivariate analysis of variance), using the
adonis function. This website may be of help.
Good luck, have fun, and keep being awesome!