Runs the UMAP (uniform manifold approximation and projection) dimensionality reduction algorithm, which is a useful means of visualising high-dimensional data. Because UMAP reduces dimensionality, some information will be lost, so it is good practice to validate any populations seen on the plot (namely through manual gating). For more information on parameter choices, see ?umap::umap.defaults. Uses the R package "umap" to calculate plots and "data.table" to handle data.
Usage
run.umap(dat, use.cols, umap.x.name = "UMAP_X",
umap.y.name = "UMAP_Y", umap.seed = 42, neighbours = 15, n_components = 2,
metric = "euclidean", n_epochs = 200, input = "data", init = "spectral",
min_dist = 0.1, set_op_mix_ratio = 1, local_connectivity = 1,
bandwidth = 1, alpha = 1, gamma = 1, negative_sample_rate = 5,
a_gradient = NA, b_gradient = NA, spread = 1, transform_state = 42,
knn.repeats = 1, verbose = TRUE, umap_learn_args = NA, fast = TRUE,
n_threads = "auto", n_sgd_threads = "auto", batch = TRUE)
Arguments
- dat
NO DEFAULT. Input data.table or data.frame.
- use.cols
NO DEFAULT. Vector of column names or numbers to use for dimensionality reduction.
- umap.x.name
DEFAULT = "UMAP_X". Character. Name of UMAP x-axis.
- umap.y.name
DEFAULT = "UMAP_Y". Character. Name of UMAP y-axis.
- umap.seed
DEFAULT = 42. Numeric. Seed value for reproducibility.
- neighbours
DEFAULT = 15. Numeric. Number of nearest neighbours.
- n_components
DEFAULT = 2. Numeric. Number of dimensions for output results.
- metric
DEFAULT = "euclidean". Character or function. Determines how distances between data points are computed. Can also be "manhattan".
- n_epochs
DEFAULT = 200. Numeric. Number of iterations performed during layout optimisation.
- input
DEFAULT = "data". Character. Determines whether primary input argument is a data or distance matrix. Can also be "dist".
- init
DEFAULT = "spectral". Character or matrix. Default "spectral" computes an initial embedding using eigenvectors of the connectivity graph matrix. Can also use "random" (creates an initial layout based on random coordinates).
- min_dist
DEFAULT = 0.1. Numeric. Determines how close points appear in final layout.
- set_op_mix_ratio
DEFAULT = 1. Numeric in the range [0, 1]. Determines how the knn-graph is used to create a fuzzy simplicial graph.
- local_connectivity
DEFAULT = 1. Numeric. Used during construction of fuzzy simplicial set.
- bandwidth
DEFAULT = 1. Numeric. Used during construction of fuzzy simplicial set.
- alpha
DEFAULT = 1. Numeric. Initial value of "learning rate" of layout optimisation.
- gamma
DEFAULT = 1. Numeric. Together with alpha, it determines the learning rate of layout optimisation.
- negative_sample_rate
DEFAULT = 5. Numeric. Determines how many non-neighbour points are used per point and per iteration during layout optimisation.
- a_gradient
DEFAULT = NA. Numeric. Contributes to gradient calculations during layout optimisation. When left at NA, a suitable value will be estimated automatically.
- b_gradient
DEFAULT = NA. Numeric. Contributes to gradient calculations during layout optimisation. When left at NA, a suitable value will be estimated automatically.
- spread
DEFAULT = 1. Numeric. Used during automatic estimation of a_gradient/b_gradient parameters.
- transform_state
DEFAULT = 42. Numeric. Seed for random number generation used during predict().
- knn.repeats
DEFAULT = 1. Numeric. Number of times to restart knn search.
- verbose
DEFAULT = TRUE. Logical. Determines whether to show progress messages.
- umap_learn_args
DEFAULT = NA. Vector. Vector of arguments passed to the Python package umap-learn.
- fast
DEFAULT = TRUE. Logical. Whether to run the uwot implementation of UMAP, which is much faster.
- n_threads
DEFAULT = "auto". Numeric. Number of threads to use (except during stochastic gradient descent). For nearest neighbor search, only applies if nn_method = "annoy". If n_threads > 1, then the Annoy index will be temporarily written to disk in the location determined by tempfile. The default "auto" option will automatically set this to the maximum number of threads in the computer - 1.
- n_sgd_threads
DEFAULT = "auto". Number of threads to use during stochastic gradient descent. If set to > 1, then be aware that if batch = FALSE, results will not be reproducible, even if set.seed is called with a fixed seed before running. Set to "auto" to use the same value as n_threads.
- batch
DEFAULT = TRUE. If set to TRUE, then embedding coordinates are updated at the end of each epoch rather than during the epoch. In batch mode, results are reproducible with a fixed random seed even with n_sgd_threads > 1, at the cost of a slightly higher memory use. You may also have to modify learning_rate and increase n_epochs, so whether this provides a speed increase over the single-threaded optimization is likely to be dataset and hardware-dependent.
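The "auto" thread settings described above can be sketched in base R. This is only an illustration under stated assumptions, not Spectre's own code: the helper name resolve.threads is hypothetical, and parallel::detectCores() is assumed to be an acceptable stand-in for "the maximum number of threads in the computer".

```r
# Hypothetical helper (not part of Spectre): turn an "auto" thread setting
# into a concrete count, following the rule described for n_threads.
resolve.threads <- function(n_threads = "auto", fallback = NULL) {
  if (identical(n_threads, "auto")) {
    if (!is.null(fallback)) {
      # n_sgd_threads = "auto" reuses the value chosen for n_threads
      return(fallback)
    }
    cores <- parallel::detectCores()  # can return NA on some platforms
    if (is.na(cores)) cores <- 1L
    # Leave one thread free, but never drop below one
    return(max(1L, cores - 1L))
  }
  as.integer(n_threads)
}

n_threads     <- resolve.threads("auto")
n_sgd_threads <- resolve.threads("auto", fallback = n_threads)

# The resolved values could then be passed on explicitly, for example:
# cell.dat <- run.umap(cell.dat, use.cols,
#                      n_threads = n_threads, n_sgd_threads = n_sgd_threads)
```

Setting the counts explicitly like this may be worth considering when the same analysis has to run on machines with different core counts, since the "auto" result depends on the hardware.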
Author
Thomas Ashhurst, thomas.ashhurst@sydney.edu.au
Felix Marsh-Wakefield, felix.marsh-wakefield@sydney.edu.au
Examples
# Run UMAP on a subset of the demonstration dataset
library(Spectre)
# Subsample the demo dataset to 10000 cells
cell.dat <- do.subsample(demo.clustered, 10000)
# Remove any existing UMAP coordinates before recalculating them
cell.dat$UMAP_X <- NULL
cell.dat$UMAP_Y <- NULL
cell.dat <- run.umap(dat = cell.dat,
                     use.cols = c("NK11_asinh", "CD3_asinh",
                                  "CD45_asinh", "Ly6G_asinh", "CD11b_asinh",
                                  "B220_asinh", "CD8a_asinh", "Ly6C_asinh",
                                  "CD4_asinh"))
#> 00:51:20 UMAP embedding parameters a = 1.577 b = 0.8951
#> 00:51:20 Converting dataframe to numerical matrix
#> 00:51:20 Read 10000 rows and found 9 numeric columns
#> 00:51:20 Using Annoy for neighbor search, n_neighbors = 15
#> 00:51:21 Building Annoy index with metric = euclidean, n_trees = 50
#> 0%   10   20   30   40   50   60   70   80   90   100%
#> [----|----|----|----|----|----|----|----|----|----|
#> **************************************************|
#> 00:51:21 Writing NN index file to temp file /tmp/RtmpfFypGg/file21fe5c0a46a6
#> 00:51:21 Searching Annoy index using 3 threads, search_k = 1500
#> 00:51:22 Annoy recall = 100%
#> 00:51:23 Commencing smooth kNN distance calibration using 3 threads
#> with target n_neighbors = 15
#> 00:51:24 Initializing from normalized Laplacian + noise (using RSpectra)
#> 00:51:24 Commencing optimization for 200 epochs, with 200182 positive edges using 3 threads
#> 00:51:24 Using rng type: pcg
#> 00:51:25 Optimization finished
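After a run like the one above, it can be useful to confirm that the embedding columns are present before plotting. The sketch below assumes that run.umap returns the input table with two added numeric columns named by umap.x.name and umap.y.name, as the example suggests. The helper has.umap.cols is a hypothetical name, and the toy table holds made-up values used only to exercise it.

```r
# Hypothetical helper (not part of Spectre): check that both embedding
# columns exist and contain only finite numbers.
has.umap.cols <- function(dat, x.name = "UMAP_X", y.name = "UMAP_Y") {
  cols <- c(x.name, y.name)
  # Both columns must be present
  if (!all(cols %in% names(dat))) {
    return(FALSE)
  }
  # Every coordinate must be a finite number
  all(vapply(cols, function(col) {
    is.numeric(dat[[col]]) && all(is.finite(dat[[col]]))
  }, logical(1)))
}

# Stand-in table with made-up coordinates, used only to exercise the helper
toy <- data.frame(marker = c(0.1, 0.5, 0.9),
                  UMAP_X = c(-1.2, 0.3, 2.1),
                  UMAP_Y = c(0.8, -0.4, 1.5))

has.umap.cols(toy)                            # embedding columns present
has.umap.cols(toy[, "marker", drop = FALSE])  # embedding columns missing
```

With the objects from the example above, has.umap.cols(cell.dat) could be run before plotting cell.dat$UMAP_X against cell.dat$UMAP_Y.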