Runs a principal component analysis (PCA), a dimensionality reduction algorithm. A PCA reduces the number of dimensions (i.e. parameters) while retaining as much of the variation in the given dataset as possible. This function runs the PCA calculation (fast) and generates plots (slower).
For individuals (such as samples or patients), a PCA can group them based on their similarities. A PCA can also rank variables/parameters (such as markers or cell counts) by their contribution to the variability across a dataset. In cytometry, this can be useful for identifying marker(s) that differentiate between subset(s) of cells.
Uses the base R package "stats" for PCA, "factoextra" for PCA and scree plots, "data.table" for saving .csv files, "ggplot2" for saving plots, "gtools" for rearranging data order, and "RColorBrewer" and "viridis" for colour schemes. More information on PCA plots can be found at http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/118-principal-component-analysis-in-r-prcomp-vs-princomp/.
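As a rough illustration of the calculation described above, the sketch below runs a PCA with stats::prcomp on the built-in iris dataset, which stands in for a cytometry table here. It is not the internal code of run.pca; the variable names (var.explained, pc.coords) are chosen for illustration only.

```r
# Minimal sketch using base R only; iris is a stand-in dataset (assumption)
dat <- iris
use.cols <- 1:4  # numeric columns to include in the PCA

# scale. = TRUE rescales each variable to unit variance before the PCA
pca <- stats::prcomp(dat[, use.cols], center = TRUE, scale. = TRUE)

# Proportion of variance explained by each component
# (the quantity a scree plot displays)
var.explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var.explained, 3))

# Coordinates of each individual (row) on the first two components
pc.coords <- pca$x[, 1:2]
head(pc.coords)
```

The proportions are returned in decreasing order, so the first few components usually carry most of the variation.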
Usage
run.pca(dat, use.cols, scale = TRUE, add.pca.col = FALSE,
pca.col.no = 50, pca.lite = FALSE, scree.plot = TRUE, comp.no = 2,
variable.contribution = TRUE, plot.individuals = TRUE,
plot.ind.label = "point", pointsize.ind = 1.5, row.names = NULL,
plot.ind.group = FALSE, group.ind = NULL, colour.group = "viridis",
pointsize.group = 1.5, ellipse.type = "confidence",
ellipse.level = 0.95, mean.point = TRUE,
randomise.order = TRUE, order.seed = 42,
plot.variables = TRUE, colour.var = "solid",
plot.combined = TRUE, repel = FALSE, var.numb = 20, path = getwd())
Arguments
- dat
NO DEFAULT. data.frame.
- use.cols
NO DEFAULT. Vector of numbers, reflecting the columns to use for dimensionality reduction (you may want to exclude parameters such as "Time" or "Sample").
- scale
DEFAULT = TRUE. A logical value indicating whether the variables should be scaled to have unit variance before the analysis takes place.
- add.pca.col
DEFAULT = FALSE. Option to add PC coordinates to input data.
- pca.col.no
DEFAULT = 50. Number of PCs to be added to input data.
- pca.lite
DEFAULT = FALSE. Will stop running the function after PCA coordinates have been added to input data (assuming add.pca.col = TRUE).
- scree.plot
DEFAULT = TRUE. Option to create scree plots. Will save generated scree plot. Note this will require the input of an elbow point during run if comp.no = NULL.
- comp.no
DEFAULT = 2. Select number of components to be saved. If NULL, user will be asked during run to select number based on scree plot.
- variable.contribution
DEFAULT = TRUE. Option to create plot showing the contribution of each variable. Horizontal red line represents the average variable contribution if all variables contributed equally. Requires scree.plot = TRUE.
- plot.individuals
DEFAULT = TRUE. Option to create PCA plots on individuals (samples/patients).
- plot.ind.label
DEFAULT = "point". Option to add text to PCA plots on individuals as an extra identifier. Use c("point", "text") to include both text and point.
- pointsize.ind
DEFAULT = 1.5. Numeric. Size of dots of individuals on PCA plot.
- row.names
DEFAULT = NULL. Column (as character) that defines individuals. Will be used to place name on plot.individuals.
- plot.ind.group
DEFAULT = FALSE. Option to group individuals with ellipses (which by default show the 95 % confidence interval). Must specify column that groups individuals with group.ind.
- group.ind
DEFAULT = NULL. Column (as character) that defines groups of individuals. Works with plot.ind.group which must be set to TRUE.
- colour.group
DEFAULT = "viridis". Colour scheme for each group. Options include "jet", "spectral", "viridis", "inferno", "magma".
- pointsize.group
DEFAULT = 1.5. Numeric. Size of shapes of group individuals on PCA plot.
- ellipse.type
DEFAULT = "confidence". Set type of ellipse. Options include "confidence", "convex", "concentration", "t", "norm", "euclid". See factoextra::fviz for more information.
- ellipse.level
DEFAULT = 0.95. Size of ellipses. By default 95 % (0.95).
- mean.point
DEFAULT = TRUE. Option to plot the mean on PCA plot with different groups.
- randomise.order
DEFAULT = TRUE. Option to randomise plotting order of individuals to control for overlap.
- order.seed
DEFAULT = 42. Set the seed for randomising plotting order of individuals.
- plot.variables
DEFAULT = TRUE. Option to create PCA plots on variables (markers/cell counts).
- colour.var
DEFAULT = "solid". Colour scheme for PCA plot with variables. Options include "solid", "jet", "spectral", "viridis", "inferno", "magma", "BuPu". Note some colours are pale and may not appear clearly on plot.
- plot.combined
DEFAULT = TRUE. Option to create a combined PCA plot with both individuals and variables.
- repel
DEFAULT = FALSE. Option to avoid overlapping text in PCA plots. Can greatly increase plot time if there is a large number of samples.
- var.numb
DEFAULT = 20. Top number of variables to be plotted. Note the greater the number, the longer plots will take.
- path
DEFAULT = getwd(). The location to save plots. By default, will save to current working directory. Can be overridden.
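To make the variable.contribution, add.pca.col and pca.col.no arguments more concrete, here is a hedged sketch in base R. It assumes the common definition of contribution to a single component (squared loadings expressed as a percentage), so the reference value for equal contribution is 100 divided by the number of variables. The iris dataset and all object names are illustrative, and this is not necessarily how run.pca computes these values internally.

```r
# Stand-in data and PCA (assumption: iris replaces a real cytometry table)
dat <- iris
use.cols <- 1:4
pca <- stats::prcomp(dat[, use.cols], center = TRUE, scale. = TRUE)

# Contribution (%) of each variable to each component: squared loadings.
# Loadings are unit-length per component, so each column sums to 100.
contrib <- pca$rotation^2 * 100

# Expected contribution if all variables contributed equally
equal.contrib <- 100 / length(use.cols)

# Variables whose contribution to PC1 exceeds the reference value
pc1.contrib <- sort(contrib[, "PC1"], decreasing = TRUE)
print(pc1.contrib[pc1.contrib > equal.contrib])

# Append the first n.pcs PC coordinates to the input data
# (similar in spirit to add.pca.col = TRUE with pca.col.no = n.pcs)
n.pcs <- 2
dat.with.pcs <- cbind(dat, pca$x[, seq_len(n.pcs), drop = FALSE])
head(dat.with.pcs)
```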
Author
Felix Marsh-Wakefield, felix.marsh-wakefield@sydney.edu.au
Examples
# Set directory to save files. By default it will save files at getwd()
# Load the demonstration dataset
dat <- Spectre::demo.clustered
# Run PCA on demonstration dataset
Spectre::run.pca(dat = Spectre::demo.clustered,
use.cols = c(11:19),
repel = TRUE
)
#> Warning: Ignoring empty aesthetic: `width`.
#> Saving 6.67 x 6.67 in image
#> Saving 6.67 x 6.67 in image
#> Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
#> ℹ Please use tidy evaluation idioms with `aes()`.
#> ℹ See also `vignette("ggplot2-in-packages")` for more information.
#> ℹ The deprecated feature was likely used in the factoextra package.
#> Please report the issue at <https://github.com/kassambara/factoextra/issues>.
#> Saving 6.67 x 6.67 in image
#> Saving 6.67 x 6.67 in image
# Compare between groups
if (FALSE) { # \dontrun{
Spectre::run.pca(dat = Spectre::demo.clustered,
use.cols = c(11:19),
comp.no = NULL,
plot.ind.label = c("point", "text"), #individual cells will be labelled as numbers
plot.ind.group = TRUE,
group.ind = "Group",
mean.point = FALSE,
randomise.order = TRUE
)
} # }
# When prompted, type in "5" and press Enter to continue the function
# (this selects the elbow point based off the scree plot)
## Possible issues ##
# Remove any NA present
na.omit(dat)
#> Index: <Sample>
#> FileName NK11 CD3 CD45 Ly6G CD11b
#> <char> <num> <num> <num> <num> <num>
#> 1: CNS_Mock_01.csv 42.3719 40.098700 6885.08 -344.7830 14787.30
#> 2: CNS_Mock_01.csv 42.9586 119.014000 1780.29 -429.6650 5665.73
#> 3: CNS_Mock_01.csv 59.2366 206.238000 10248.30 -1603.8400 19894.30
#> 4: CNS_Mock_01.csv 364.9480 -0.233878 3740.04 -815.9800 9509.43
#> 5: CNS_Mock_01.csv 440.2470 40.035200 9191.38 40.5055 5745.82
#> ---
#> 169000: CNS_WNV_D7_06.csv 910.8890 72.856100 31466.20 -316.5570 28467.80
#> 169001: CNS_WNV_D7_06.csv -10.2642 64.188700 45188.00 -540.5140 22734.00
#> 169002: CNS_WNV_D7_06.csv -184.2910 -9.445650 11842.60 -97.9383 17237.00
#> 169003: CNS_WNV_D7_06.csv 248.3860 229.986000 32288.20 -681.1630 19255.80
#> 169004: CNS_WNV_D7_06.csv 738.9810 95.470300 46185.10 -1004.6000 22957.80
#> B220 CD8a Ly6C CD4 NK11_asinh CD3_asinh
#> <num> <num> <num> <num> <num> <num>
#> 1: -40.2399 83.7175 958.7000 711.0720 0.04235923 0.040087962
#> 2: 86.6673 34.7219 448.2590 307.2720 0.04294540 0.118734817
#> 3: 427.8310 285.8800 1008.8300 707.0940 0.05920201 0.204803270
#> 4: 182.4200 333.6050 440.0710 249.7840 0.35729716 -0.000233878
#> 5: -211.6940 149.2200 87.4815 867.5700 0.42713953 0.040024513
#> ---
#> 169000: -7.7972 -271.8040 12023.7000 1103.0500 0.81693878 0.072791800
#> 169001: 202.4110 -936.4920 4188.3300 315.9400 -0.01026402 0.064144703
#> 169002: 123.4760 -219.9320 8923.4000 -453.4640 -0.18326344 -0.009445510
#> 169003: -656.0540 -201.5880 10365.7000 61.6765 0.24590035 0.228005328
#> 169004: -661.6280 72.3356 9704.4700 -31.8532 0.68430866 0.095325863
#> CD45_asinh Ly6G_asinh CD11b_asinh B220_asinh CD8a_asinh Ly6C_asinh
#> <num> <num> <num> <num> <num> <num>
#> 1: 2.627736 -0.33829345 3.388057 -0.040229048 0.08362002 0.8518665
#> 2: 1.340828 -0.41743573 2.435282 0.086559169 0.03471493 0.4344615
#> 3: 3.022631 -1.25101677 3.684212 0.415750122 0.28212257 0.8876036
#> 4: 2.029655 -0.74509796 2.948184 0.181423123 0.32770787 0.4269784
#> 5: 2.914359 0.04049443 2.449108 -0.210143906 0.14867171 0.0873703
#> ---
#> 169000: 4.142314 -0.31149515 4.042229 -0.007797121 -0.26856390 3.1817517
#> 169001: 4.504101 -0.51715205 3.817492 0.201053740 -0.83574631 2.1394053
#> 169002: 3.166628 -0.09778240 3.541046 0.123164374 -0.21819650 2.8849492
#> 169003: 4.168089 -0.63716643 3.651633 -0.616293228 -0.20024703 3.0339681
#> 169004: 4.525922 -0.88462254 3.827279 -0.620947819 0.07227267 2.9683779
#> CD4_asinh Sample Group Batch FlowSOM_cluster
#> <num> <char> <char> <char> <num>
#> 1: 0.66171351 01_Mock_01 Mock A 23
#> 2: 0.30263135 01_Mock_01 Mock A 55
#> 3: 0.65846851 01_Mock_01 Mock A 64
#> 4: 0.24725691 01_Mock_01 Mock A 53
#> 5: 0.78456678 01_Mock_01 Mock A 110
#> ---
#> 169000: 0.95239703 12_WNV_06 WNV A 72
#> 169001: 0.31090687 12_WNV_06 WNV A 46
#> 169002: -0.43920651 12_WNV_06 WNV A 133
#> 169003: 0.06163746 12_WNV_06 WNV A 133
#> 169004: -0.03184782 12_WNV_06 WNV A 103
#> FlowSOM_metacluster Population UMAP_X UMAP_Y
#> <fctr> <char> <num> <num>
#> 1: 2 Microglia -2.3603757 6.201213
#> 2: 2 Microglia 2.7505242 7.119595
#> 3: 2 Microglia -2.9486033 4.012670
#> 4: 2 Microglia 0.6482904 6.481466
#> 5: 4 NK cells -2.3941295 6.975885
#> ---
#> 169000: 3 Infil Macrophages -2.9640724 -5.058265
#> 169001: 3 Infil Macrophages -1.2644785 -3.555824
#> 169002: 3 Infil Macrophages -2.3592682 -2.429467
#> 169003: 3 Infil Macrophages -1.9531062 -4.049705
#> 169004: 3 Infil Macrophages -0.7404098 -4.686928
# Remove columns that have zero variance (e.g. if MFI is the same for all
# samples for a marker)
dat <- data.table::as.data.table(dat)
dat <- dat[ , lapply(.SD, function(v) if(data.table::uniqueN(v, na.rm = TRUE) > 1) v)]
# Ellipses are only generated in 'plot.ind.group' when there are at least
# 2 samples per group ('group.ind')
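One way to check the group-size requirement before plotting is sketched below. The toy table and its column names (Sample, Group) are hypothetical; substitute your own data and the column passed to group.ind.

```r
# Toy table with one row per sample (illustrative values only)
samples <- data.frame(
  Sample = c("S1", "S2", "S3", "S4", "S5"),
  Group  = c("Mock", "Mock", "WNV", "WNV", "Other")
)

# Count samples per group and flag groups with fewer than two samples
group.sizes <- table(samples$Group)
too.small <- names(group.sizes)[group.sizes < 2]
print(too.small)

# Keep only groups that have at least two samples
samples.ok <- samples[!samples$Group %in% too.small, ]
print(samples.ok)
```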