Run the PCA algorithm (using stats::prcomp)

Method to run a PCA dimensionality reduction algorithm. A principal component analysis (PCA) reduces the number of dimensions (i.e. parameters) while retaining as much of the variation in the given dataset as possible. This function runs the PCA calculation, which is fast, and then generates plots, which takes most of the run time. For individuals (such as samples or patients), a PCA can group them based on their similarities. A PCA can also rank variables/parameters (such as markers or cell counts) by their contribution to the variability across a dataset. In cytometry, this can be useful to identify marker(s) that differentiate between subset(s) of cells. Uses the base R package "stats" for PCA, "factoextra" for PCA and scree plots, "data.table" for saving .csv files, "ggplot2" for saving plots, "gtools" for rearranging data order, and "RColorBrewer" and "viridis" for colour schemes. More information on PCA plots can be found at http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/118-principal-component-analysis-in-r-prcomp-vs-princomp/.
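
As a rough illustration of the calculation described above, the sketch below calls stats::prcomp directly on the built-in iris dataset (a stand-in chosen here, not cytometry data) and derives the quantities that scree and variable contribution plots summarise. It is not the internal code of run.pca, and the object names (var.explained, coords, contrib.pc1) are hypothetical.

```r
# Illustrative only: the built-in iris data stands in for a cytometry table
dat <- iris
use.cols <- 1:4  # numeric columns to include in the PCA

# Run the PCA with unit-variance scaling (compare with scale = TRUE)
pca <- stats::prcomp(dat[, use.cols], scale. = TRUE)

# Proportion of variance explained by each component (what a scree plot shows)
var.explained <- pca$sdev^2 / sum(pca$sdev^2)

# Coordinates of each individual on the first two components
coords <- pca$x[, 1:2]

# Percentage contribution of each variable to PC1, using the squared loading.
# Loadings of one component have unit length, so these sum to 100.
contrib.pc1 <- 100 * pca$rotation[, 1]^2

# Contribution expected if all variables contributed equally
equal.contrib <- 100 / length(use.cols)
```

Variables whose value in contrib.pc1 exceeds equal.contrib contribute more than average to the first component under this definition.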

Usage

run.pca(dat, use.cols, scale = TRUE, add.pca.col = FALSE, 
pca.col.no = 50, pca.lite = FALSE, scree.plot = TRUE, comp.no = 2, 
variable.contribution = TRUE, plot.individuals = TRUE, 
plot.ind.label = "point", pointsize.ind = 1.5, row.names = NULL, 
plot.ind.group = FALSE, group.ind = NULL, colour.group = "viridis", 
pointsize.group = 1.5, ellipse.type = "confidence", 
ellipse.level = 0.95, mean.point = TRUE, 
randomise.order = TRUE, order.seed = 42, 
plot.variables = TRUE, colour.var = "solid", 
plot.combined = TRUE, repel = FALSE, var.numb = 20, path = getwd())

Arguments

dat: NO DEFAULT. data.frame.
use.cols: NO DEFAULT. Vector of numbers, reflecting the columns to use for dimensionality reduction (may not want parameters such as "Time" or "Sample").
scale: DEFAULT = TRUE. A logical value indicating whether the variables should be scaled to have unit variance before the analysis takes place.
add.pca.col: DEFAULT = FALSE. Option to add PC coordinates to input data.
pca.col.no: DEFAULT = 50. Number of PC to be added to input data.
pca.lite: DEFAULT = FALSE. Will stop running the function after PCA coordinates have been added to input data (assuming add.pca.col = TRUE).
scree.plot: DEFAULT = TRUE. Option to create scree plots. Will save generated scree plot. Note this will require the input of an elbow point during run if comp.no = NULL.
comp.no: DEFAULT = 2. Select number of components to be saved. If NULL, user will be asked during run to select number based on scree plot.
variable.contribution: DEFAULT = TRUE. Option to create plot showing the contribution of each variable. Horizontal red line represents the average variable contribution if all variables contributed equally. Requires scree.plot = TRUE.
plot.individuals: DEFAULT = TRUE. Option to create PCA plots on individuals (samples/patients).
plot.ind.label: DEFAULT = "point". Option to add text to PCA plots on individuals as an extra identifier. Use c("point", "text") to include both text and point.
pointsize.ind: DEFAULT = 1.5. Numeric. Size of dots of individuals on PCA plot.
row.names: DEFAULT = NULL. Column (as character) that defines individuals. Used to label individuals on the plot created by plot.individuals.
plot.ind.group: DEFAULT = FALSE. Option to group individuals with ellipses (which by default show the 95 % confidence interval). Must specify column that groups individuals with group.ind.
group.ind: DEFAULT = NULL. Column (as character) that defines groups of individuals. Works with plot.ind.group which must be set to TRUE.
colour.group: DEFAULT = "viridis". Colour scheme for each group. Options include "jet", "spectral", "viridis", "inferno", "magma".
pointsize.group: DEFAULT = 1.5. Numeric. Size of shapes of group individuals on PCA plot.
ellipse.type: DEFAULT = "confidence". Set type of ellipse. Options include "confidence", "convex", "concentration", "t", "norm", "euclid". See factoextra::fviz for more information.
ellipse.level: DEFAULT = 0.95. Size of ellipses. By default 95 % (0.95).
mean.point: DEFAULT = TRUE. Option to plot the mean on PCA plot with different groups.
randomise.order: DEFAULT = TRUE. Option to randomise plotting order of individuals to control for overlap.
order.seed: DEFAULT = 42. Set the seed for randomising plotting order of individuals.
plot.variables: DEFAULT = TRUE. Option to create PCA plots on variables (markers/cell counts).
colour.var: DEFAULT = "solid". Colour scheme for PCA plot with variables. Options include "solid", "jet", "spectral", "viridis", "inferno", "magma", "BuPu". Note some colours are pale and may not appear clearly on plot.
plot.combined: DEFAULT = TRUE. Option to create a combined PCA plot with both individuals and variables.
repel: DEFAULT = FALSE. Option to avoid overlapping text in PCA plots. Can greatly increase plot time if there is a large number of samples.
var.numb: DEFAULT = 20. Top number of variables to be plotted. Note the greater the number, the longer plots will take.
path: DEFAULT = getwd(). The location to save plots. By default, will save to current working directory. Can be overridden.
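
To make the add.pca.col and pca.col.no arguments more concrete, here is a minimal base R sketch of appending PC coordinates to the input data. It assumes the built-in iris dataset as a stand-in, and capping the number of added columns at the number of available components is an assumption of this sketch rather than documented behaviour of run.pca. The names n.pc and dat.with.pc are hypothetical.

```r
# Illustrative only: iris stands in for the input data
dat <- iris
use.cols <- 1:4
pca.col.no <- 50  # documented default for the number of PCs to add

pca <- stats::prcomp(dat[, use.cols], scale. = TRUE)

# Assumption of this sketch: never request more PCs than were computed
n.pc <- min(pca.col.no, ncol(pca$x))

# Append the PC coordinates (columns PC1, PC2, ...) to the original data
dat.with.pc <- cbind(dat, pca$x[, seq_len(n.pc), drop = FALSE])
```

With four input columns only four components exist, so dat.with.pc gains the columns PC1 to PC4 even though pca.col.no is 50.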

Author

Felix Marsh-Wakefield, felix.marsh-wakefield@sydney.edu.au

Examples

# Set directory to save files with the 'path' argument. By default, files are saved to getwd()

# Load the demonstration dataset
dat <- Spectre::demo.clustered
                        
# Run PCA on demonstration dataset
Spectre::run.pca(dat = Spectre::demo.clustered,
                use.cols = c(11:19),
                repel = TRUE
                )
#> Warning: Ignoring empty aesthetic: `width`.
#> Saving 6.67 x 6.67 in image
#> Saving 6.67 x 6.67 in image
#> Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
#> ℹ Please use tidy evaluation idioms with `aes()`.
#> ℹ See also `vignette("ggplot2-in-packages")` for more information.
#> ℹ The deprecated feature was likely used in the factoextra package.
#>   Please report the issue at <https://github.com/kassambara/factoextra/issues>.
#> Saving 6.67 x 6.67 in image
#> Saving 6.67 x 6.67 in image

# Compare between groups
if (FALSE) { # \dontrun{
Spectre::run.pca(dat = Spectre::demo.clustered,
                 use.cols = c(11:19),
                 comp.no = NULL,
                 plot.ind.label = c("point", "text"), #individual cells will be labelled as numbers
                 plot.ind.group = TRUE,
                 group.ind = "Group",
                 mean.point = FALSE,
                 randomise.order = TRUE
                 )
} # }
        
# When prompted (because comp.no = NULL), type in "5" and press Enter to continue the function
# (this selects the elbow point based on the scree plot)

## Possible issues ##
# Remove any NA present
na.omit(dat)
#> Index: <Sample>
#>                  FileName      NK11        CD3     CD45       Ly6G    CD11b
#>                    <char>     <num>      <num>    <num>      <num>    <num>
#>      1:   CNS_Mock_01.csv   42.3719  40.098700  6885.08  -344.7830 14787.30
#>      2:   CNS_Mock_01.csv   42.9586 119.014000  1780.29  -429.6650  5665.73
#>      3:   CNS_Mock_01.csv   59.2366 206.238000 10248.30 -1603.8400 19894.30
#>      4:   CNS_Mock_01.csv  364.9480  -0.233878  3740.04  -815.9800  9509.43
#>      5:   CNS_Mock_01.csv  440.2470  40.035200  9191.38    40.5055  5745.82
#>     ---                                                                    
#> 169000: CNS_WNV_D7_06.csv  910.8890  72.856100 31466.20  -316.5570 28467.80
#> 169001: CNS_WNV_D7_06.csv  -10.2642  64.188700 45188.00  -540.5140 22734.00
#> 169002: CNS_WNV_D7_06.csv -184.2910  -9.445650 11842.60   -97.9383 17237.00
#> 169003: CNS_WNV_D7_06.csv  248.3860 229.986000 32288.20  -681.1630 19255.80
#> 169004: CNS_WNV_D7_06.csv  738.9810  95.470300 46185.10 -1004.6000 22957.80
#>              B220      CD8a       Ly6C       CD4  NK11_asinh    CD3_asinh
#>             <num>     <num>      <num>     <num>       <num>        <num>
#>      1:  -40.2399   83.7175   958.7000  711.0720  0.04235923  0.040087962
#>      2:   86.6673   34.7219   448.2590  307.2720  0.04294540  0.118734817
#>      3:  427.8310  285.8800  1008.8300  707.0940  0.05920201  0.204803270
#>      4:  182.4200  333.6050   440.0710  249.7840  0.35729716 -0.000233878
#>      5: -211.6940  149.2200    87.4815  867.5700  0.42713953  0.040024513
#>     ---                                                                  
#> 169000:   -7.7972 -271.8040 12023.7000 1103.0500  0.81693878  0.072791800
#> 169001:  202.4110 -936.4920  4188.3300  315.9400 -0.01026402  0.064144703
#> 169002:  123.4760 -219.9320  8923.4000 -453.4640 -0.18326344 -0.009445510
#> 169003: -656.0540 -201.5880 10365.7000   61.6765  0.24590035  0.228005328
#> 169004: -661.6280   72.3356  9704.4700  -31.8532  0.68430866  0.095325863
#>         CD45_asinh  Ly6G_asinh CD11b_asinh   B220_asinh  CD8a_asinh Ly6C_asinh
#>              <num>       <num>       <num>        <num>       <num>      <num>
#>      1:   2.627736 -0.33829345    3.388057 -0.040229048  0.08362002  0.8518665
#>      2:   1.340828 -0.41743573    2.435282  0.086559169  0.03471493  0.4344615
#>      3:   3.022631 -1.25101677    3.684212  0.415750122  0.28212257  0.8876036
#>      4:   2.029655 -0.74509796    2.948184  0.181423123  0.32770787  0.4269784
#>      5:   2.914359  0.04049443    2.449108 -0.210143906  0.14867171  0.0873703
#>     ---                                                                       
#> 169000:   4.142314 -0.31149515    4.042229 -0.007797121 -0.26856390  3.1817517
#> 169001:   4.504101 -0.51715205    3.817492  0.201053740 -0.83574631  2.1394053
#> 169002:   3.166628 -0.09778240    3.541046  0.123164374 -0.21819650  2.8849492
#> 169003:   4.168089 -0.63716643    3.651633 -0.616293228 -0.20024703  3.0339681
#> 169004:   4.525922 -0.88462254    3.827279 -0.620947819  0.07227267  2.9683779
#>           CD4_asinh     Sample  Group  Batch FlowSOM_cluster
#>               <num>     <char> <char> <char>           <num>
#>      1:  0.66171351 01_Mock_01   Mock      A              23
#>      2:  0.30263135 01_Mock_01   Mock      A              55
#>      3:  0.65846851 01_Mock_01   Mock      A              64
#>      4:  0.24725691 01_Mock_01   Mock      A              53
#>      5:  0.78456678 01_Mock_01   Mock      A             110
#>     ---                                                     
#> 169000:  0.95239703  12_WNV_06    WNV      A              72
#> 169001:  0.31090687  12_WNV_06    WNV      A              46
#> 169002: -0.43920651  12_WNV_06    WNV      A             133
#> 169003:  0.06163746  12_WNV_06    WNV      A             133
#> 169004: -0.03184782  12_WNV_06    WNV      A             103
#>         FlowSOM_metacluster        Population     UMAP_X    UMAP_Y
#>                      <fctr>            <char>      <num>     <num>
#>      1:                   2         Microglia -2.3603757  6.201213
#>      2:                   2         Microglia  2.7505242  7.119595
#>      3:                   2         Microglia -2.9486033  4.012670
#>      4:                   2         Microglia  0.6482904  6.481466
#>      5:                   4          NK cells -2.3941295  6.975885
#>     ---                                                           
#> 169000:                   3 Infil Macrophages -2.9640724 -5.058265
#> 169001:                   3 Infil Macrophages -1.2644785 -3.555824
#> 169002:                   3 Infil Macrophages -2.3592682 -2.429467
#> 169003:                   3 Infil Macrophages -1.9531062 -4.049705
#> 169004:                   3 Infil Macrophages -0.7404098 -4.686928

# Remove columns that have zero variance (e.g. if MFI is the same for all 
# samples for a marker)
dat <- data.table::as.data.table(dat)
dat <- dat[ , lapply(.SD, function(v) if(data.table::uniqueN(v, na.rm = TRUE) > 1) v)] 


# Ellipses are only generated in 'plot.ind.group' when there are at least 
# 2 samples per group ('group.ind')
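
One way to check the group-size requirement before plotting is sketched below. The small data.frame and the names group.sizes and too.small are hypothetical and used only for illustration; with real data, dat and group.ind would be the same values passed to run.pca.

```r
# Hypothetical grouping column: "Ctrl" has a single sample
dat <- data.frame(Group = c("Mock", "Mock", "WNV", "WNV", "WNV", "Ctrl"))
group.ind <- "Group"

# Count samples per group
group.sizes <- table(dat[[group.ind]])

# Groups with fewer than 2 samples, for which no ellipse would be drawn
too.small <- names(group.sizes)[group.sizes < 2]

if (length(too.small) > 0) {
  message("Groups with fewer than 2 samples: ", paste(too.small, collapse = ", "))
}
```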