In this workflow, we provide an example on how to run and make sense of a principal component analysis (PCA). PCA is capable of reducing the number of dimensions (i.e. parameters) with minimal effect on the variation of the given dataset. This function will run a PCA calculation (extremely fast) and generate plots (takes time). For individuals (such as samples or patients), a PCA can group them based on their similarities. A PCA is also capable of ranking variables/parameters (such as markers or cell counts) based on their contribution to the variability across a dataset in an extremely fast manner. In cytometry, this can be useful to identify marker(s) that can be used to differentiate between subset(s) of cells. This workflow is suitable for data with variables (such as markers or genes) and individuals (such as samples, patients or cells).

In this example, we will be using demo data that have undergone FlowSOM clustering. We will then calculate the level of each metacluster (as a proportion of total cells) for each mouse (individual). We will experiment with the different options available in run.pca.

The PCA in run.pca is done using stats::prcomp. For plots, the package factoextra was used. The colour scheme options are from the packages RColorBrewer and viridis.

More information on using a PCA and plot generation can be found here.

Analysis setup

Here, we will simply load up the Spectre package and the demo dataset which has previously been analysed and annotated using our simple discovery workflow.

library(Spectre)
library(data.table)

cell.dat <- demo.clustered

head(cell.dat)

Calculate proportions

We will first calculate the level of each "FlowSOM_metacluster" column as a proportion for each individual, "FileName".

# Select column to calculate proportions
as.matrix(colnames(cell.dat)) #best not to run this if you've got lots of columns...
##       [,1]
##  [1,] "FileName"
##  [2,] "NK11"
##  [3,] "CD3"
##  [4,] "CD45"
##  [5,] "Ly6G"
##  [6,] "CD11b"
##  [7,] "B220"
##  [8,] "CD8a"
##  [9,] "Ly6C"
## [10,] "CD4"
## [11,] "NK11_asinh"
## [12,] "CD3_asinh"
## [13,] "CD45_asinh"
## [14,] "Ly6G_asinh"
## [15,] "CD11b_asinh"
## [16,] "B220_asinh"
## [17,] "CD8a_asinh"
## [18,] "Ly6C_asinh"
## [19,] "CD4_asinh"
## [20,] "Sample"
## [21,] "Group"
## [22,] "Batch"
## [23,] "FlowSOM_cluster"
## [24,] "FlowSOM_metacluster"
## [25,] "Population"
## [26,] "UMAP_X"
## [27,] "UMAP_Y"
prop.col.by <- "FlowSOM_metacluster"

# Calculate proportions for each metacluster (adds extra column 'prop')
cell.dat[, sum := .N, by = FileName][, prop := .N, by = c("FileName", prop.col.by)][, prop := prop/sum][, sum := NULL]

# Select unique rows (i.e. remove duplicates)
cell.dat <- unique(cell.dat, by = c("FileName", prop.col.by))

# Convert prop column to percentage
cell.dat[["prop"]] <- 100*cell.dat[["prop"]]

# Check proportional data make sense
sum(cell.dat[Sample == "01_Mock_01"][["prop"]])
## [1] 100
##should = 100##

# Select columns of interest
col.interest <- c("FileName", prop.col.by, "Group", "prop")

# Select columns
cell.dat.prop <- cell.dat[, ..col.interest]

# Reorder rows based on prop.col.by (useful for later rearrangement)
cell.dat.prop <- dplyr::arrange(cell.dat.prop, group_by = cell.dat.prop[[prop.col.by]])

# Change cluster name (based on prop.col.by)
ifelse(grepl("metacluster", prop.col.by),
cell.dat.prop[[prop.col.by]] <- paste0("mc", cell.dat.prop[[prop.col.by]]),
cell.dat.prop[[prop.col.by]] <- paste0("c", cell.dat.prop[[prop.col.by]]))
## [1] "mc1"
# Rearrange data to be based on proportion
cell.dat.prop <- tidyr::pivot_wider(cell.dat.prop, names_from = prop.col.by, values_from = prop)
#will add "NA" when values are missing, which happens when % == 0 (which is great!)

# Replace "NA" with "0"
cell.dat.prop[is.na(cell.dat.prop)] <- 0

# Make data.table (useful later on)
cell.dat.prop <- data.table::as.data.table(cell.dat.prop)

head(cell.dat.prop)

This table contains the level of each metacluster as a proportion of total cells for each individual. It includes each mouse ("FileName"), the group ("Group"), and each metacluster as a separate column. We will use this as it represents the overall immune repertoire for each mouse.

Prepare data for the PCA

We need to make sure the data are compatible with a PCA, as there are some things that will be problematic if not correct. In particular, we need to make sure there are no NA. Each metacluster (variable) must have variance, otherwise the PCA will not work. We will also remove any metacluster that that has the same value for each mouse.

# Remove samples that have NA present
cell.dat.prop <- na.omit(cell.dat.prop)

# Remove unwanted group (e.g. "Batch")
# cell.dat.prop <- cell.dat.prop[!mclust.data.na$GroupName == "Batch"]

# Remove columns with constant values (particularly common for markers that aren't expressed on a cluster)
#https://stackoverflow.com/questions/15068981/removal-of-constant-columns-in-r
#https://stackoverflow.com/questions/48253732/remove-constant-columns-with-or-without-nas/48253983
cell.dat.prop <- cell.dat.prop[ , lapply(.SD, function(v) if(data.table::uniqueN(v, na.rm = TRUE) > 1) v)]

head(cell.dat.prop)

We then need to select columns that will be used to run the PCA. In short, all the columns with the metacluster data.

# Select columns for PCA
as.matrix(colnames(cell.dat.prop))
##       [,1]
##  [1,] "FileName"
##  [2,] "Group"
##  [3,] "mc1"
##  [4,] "mc2"
##  [5,] "mc3"
##  [6,] "mc4"
##  [7,] "mc5"
##  [8,] "mc6"
##  [9,] "mc7"
use.cols <- colnames(cell.dat.prop)[3:9]
use.cols
## [1] "mc1" "mc2" "mc3" "mc4" "mc5" "mc6" "mc7"
# Define columns by removing those not of interest
# use.cols <- colnames(cell.dat.prop)[-grep("FileName|Group", colnames(cell.dat.prop))]

Run PCA!

We may it! Now we can run the PCA. There are lots of options available, so we will start with the basics.

# The essential parameters to define are `dat` and `use.cols`, and there is no default. We recommend `scale = TRUE` (which is the default), so all metaclusters are considered equal (as opposed to larger metacluster % skewing the results).
# We will be running `pca.lite`, which will add PCA coordinates to our dataset with `add.pca.col = TRUE`. By default, up to 50 principal components (PC) with be added. If more (or less) are desired, this number can be altered with `pca.col.no`.
cell.dat.pca <- Spectre::run.pca(dat = cell.dat.prop,
                                 use.cols = use.cols,
                                 scale = TRUE, #this is the default
                                 add.pca.col = TRUE,
                                 pca.col.no = 50, #this is the default
                                 pca.lite = TRUE
                                 )

head(cell.dat.pca)

We ran a PCA! Well done. Extra columns have been added for each of the calculated PC (Dim.1, Dim.2, …). The number of columns calculated is the lesser number of metaclusers (variables) or mice (individuals). As there can be a huge range of PC (particularly if dealing with single-cell RNA sequencing data), up to 50 PC will be added to our data (as defined by pca.col.no). In our case, there are 7 metaclusters and 12 mice, so 7 PC were calculated.

The data generated can be used to manually create PCA plots. Otherwise continue reading and you can let run.pca do it for you!

Generate PCA plots 1

We can automatically create PCA plots that will be saved into our working directory. By default, results are saved wherever this RMD file is saved.

# Check where plots will be saved
getwd()

# Create folder for files to be saved
dir.create("Output_PCA_1")
setwd("Output_PCA_1")

# Create plots. This includes a scree, variability, contribution and PCA plot.
# Note this will be calculating the PCA again, but because it's such a fast calculation it still doesn't take much time.

Spectre::run.pca(dat = cell.dat.prop,
                 use.cols = use.cols,
                 scale = TRUE,
                 add.pca.col = FALSE, #this is the default
                 pca.col.no = 50,
                 pca.lite = FALSE,
                 scree.plot = TRUE, #creates scree plot
                 comp.no = 2, #number of components to be used (will be discussed more later)
                 variable.contribution = FALSE,
                 plot.individuals = TRUE,
                 plot.ind.label = c("point", "text"),
                 pointsize.ind = 1.5,
                 row.names = "FileName",
                 plot.variables = TRUE,
                 var.numb = 20, #maximum number of variables to be plotted
                 path = getwd() #this is the default
                 )

Results 1

Results! You can have a quick look here at what you created.

PCA plot-individuals

PCA plot-variables

PCA plot-combined

Scree plot

Tables-individuals

Tables-variables

Plots should now appear in the output folder. Check them out! Don’t worry, I’ll wait…

Let’s unpack the figures created here. “PCA plot-individuals.pdf” contains a PCA plot. Each dot represents a single mouse (individual). Using the option plot.ind.label, we have included the name of each mouse based on the column "FileName". The x-axis is PC1 (or “Dim.1”), and will summarise the largest amount of variance (in this case 70.1 %). The y-axis is PC2, summarising the second largest amount of variance (15.9 %).

“PCA plot-variables.pdf” shows the metaclusters (variables) that contribute most to the variance. The directions reflects how these metaclusters relate to each other. Metaclusters closer together positively correlate. Metaclusters pointing in the opposite direction negatively correlate. Metaclusters perpendicular to each other have no correlation. The size (and colour) of the arrow indicates the level of contribution that particular metacluster makes to the variance.

“PCA plot-combined.pdf” is a combination of both the PCA and variable plot.

“Scree plot.pdf” is a scree plot. This plot shows how much each PC (dimension) contributes to the overall variance. We will discuss this more in a bit…

We have also created two CSV files. “PCA-individuals.csv” is a printed file of the input data combined with the PCA coordinates (for each of the PC). “PCA-variables.csv” contains the coordinates of the metaclusters.

Generate PCA plots 2

Great work! Let’s have a look at a few more options we can play around with. We will have at differentiating mice on the PCA plot based on the group they belong to.

# Check where plots will be saved
getwd()

# Create folder for files to be saved
dir.create("Output_PCA_2")
setwd("Output_PCA_2")

Spectre::run.pca(dat = cell.dat.prop,
                 use.cols = use.cols,
                 scale = TRUE,
                 add.pca.col = FALSE,
                 pca.col.no = 50,
                 pca.lite = FALSE,
                 scree.plot = FALSE,
                 comp.no = 2,
                 variable.contribution = FALSE,
                 plot.individuals = TRUE,
                 plot.ind.label = "point",
                 pointsize.ind = 1.5,
                 # row.names = "FileName",
                 plot.ind.group = TRUE, #need to state 'group.ind'
                 group.ind = "Group", #need 'plot.ind.group = TRUE'
                 colour.group = "viridis",
                 pointsize.group = 3,
                 ellipse.type = "confidence",
                 ellipse.level = 0.95, #95 % confidence interval
                 mean.point = TRUE, #large symbol representing middle of ellipse
                 randomise.order = TRUE, #if there are multiple individuals, this makes them added to the plots in a random order (so single cells from one group don't cover up another)
                 order.seed = 42,
                 plot.variables = FALSE,
                 plot.combined = FALSE,
                 # var.numb = 20,
                 path = getwd()
                 )

Results 2

Results again! You can have a quick look here at what you created.

PCA plot-groups

PCA plot-individuals

Tables-individuals

More plots! Look at you go. The new one to look at is “PCA plot-ind-groups.pdf”. The colours and symbols represent the group each mouse belongs to. We can see there is clear separation between these groups. This suggests the overall immune cell repertoire of individual mice differ based on the group they belong. Furthermore, the difference between groups is divided by the x-axis alone (i.e., the PC that contributes most to the difference).

Generate PCA plots 3

Now we’ll spend some time with the screen plot and variables. As mentioned, the scree plot shows the variance of each PC. Ideally, we try and find the smallest number of PC that summarise the greatest level of variance. This is commonly referred to as the “elbow point”. In the first scree plot created, there was a large drop after PC1. In this case, PC2 is the elbow point. This is not always the case, but sticking with two components is useful for visualising the data. (We will discuss what this selection can mean in the next set of plots we’ll be creating.) We can use comp.no to change the number of components to be used for calculations (default comp.no = 2). If comp.no = NULL, a scree plot will be created and the selection can be made so we can select the elbow point.

Hopefully this will start making sense after these next plots.

# Check where plots will be saved
getwd()

# Create folder for files to be saved
dir.create("Output_PCA_3")
setwd("Output_PCA_3")

Spectre::run.pca(dat = cell.dat.prop,
                 use.cols = use.cols,
                 scale = TRUE,
                 add.pca.col = FALSE,
                 pca.col.no = 50,
                 pca.lite = FALSE,
                 scree.plot = TRUE,
                 comp.no = 3, #changing from the default = 2
                 variable.contribution = TRUE,
                 plot.individuals = FALSE,
                 # plot.ind.label = "point",
                 # pointsize.ind = 1.5,
                 # row.names = "FileName",
                 # plot.ind.group = TRUE,
                 # group.ind = "Group",
                 # colour.group = "viridis",
                 # pointsize.group = 3,
                 # ellipse.type = "confidence",
                 # ellipse.level = 0.95,
                 # mean.point = TRUE,
                 # randomise.order = TRUE,
                 order.seed = 42,
                 plot.variables = TRUE,
                 plot.combined = FALSE,
                 var.numb = 20,
                 path = getwd()
                 )

Results 3

More results again! You can have a quick look here at what you created.

PCA plot-contribution

PCA plot-variables

Scree plot

Tables-eig_contrib_3-dim

Tables-eigenvalues

Tables-contributions

Tables-variables

Even more plots! The important plot here to consider is “PCA plot-contribution.pdf”. This plot is showing the contribution of each metacluster to the overall variance in the first three PC. (This is where the comp.no becomes important.) The metaclusters are ordered based on their level of contribution to the overall variance. In this case “mc7” is the largest contributor. The red dashed line is the average contribution if all metaclusters were equal. As we’re only dealing with seven metaclusters, the difference is not that great.

There were also three other new files created. “PCA-eigenvalues.csv” contains the calculations for the variables (including eigenvalues). These results were then used to calculate the contributions in the previous plot, with the values saved in “PCA-eig-contrib_3-dim.csv”. These values are based on the selected number of components with comp.no. You can experiment to see how this will change the results. “PCA-contributions.csv” contains the contributions of each metacluster for each individual PC.

And that’s it! There’s a lot here, and a lot more you can do. More information on using a PCA and plot generation can be found here.

More comments

A common question to consider is whether you can do a statistical analysis to compare between groups. Although we don’t have anything implemented here, I recommend checking out the vegan package to run a “PERMANOVA” (permutational multivariate analysis of variance), using the adonis function. This website may be of help.

Good luck, have fun, and keep being awesome!