Run UMAP — run.umap • Spectre

Runs the UMAP (uniform manifold approximation and projection) dimensionality reduction algorithm. A UMAP plot is a useful way to visualise high-dimensional data, but because dimensionality reduction discards some information, it is good practice to validate any populations seen on the plot (for example, through manual gating). For more information on parameter choices, see ?umap::umap.defaults. Uses the R package "umap" to calculate the embedding (or "uwot" when fast = TRUE) and "data.table" to handle data.

Usage

run.umap(dat, use.cols, umap.x.name = "UMAP_X",
         umap.y.name = "UMAP_Y", umap.seed = 42, neighbours = 15, n_components = 2,
         metric = "euclidean", n_epochs = 200, input = "data", init = "spectral",
         min_dist = 0.1, set_op_mix_ratio = 1, local_connectivity = 1,
         bandwidth = 1, alpha = 1, gamma = 1, negative_sample_rate = 5,
         a_gradient = NA, b_gradient = NA, spread = 1, transform_state = 42,
         knn.repeats = 1, verbose = TRUE, umap_learn_args = NA, fast = TRUE,
         n_threads = "auto", n_sgd_threads = "auto", batch = TRUE)

Arguments

dat: NO DEFAULT. Input data.table or data.frame.
use.cols: NO DEFAULT. Vector of column names or numbers to use when calculating the UMAP.
umap.x.name: DEFAULT = "UMAP_X". Character. Name of UMAP x-axis.
umap.y.name: DEFAULT = "UMAP_Y". Character. Name of UMAP y-axis.
umap.seed: DEFAULT = 42. Numeric. Seed value for reproducibility.
neighbours: DEFAULT = 15. Numeric. Number of nearest neighbours.
n_components: DEFAULT = 2. Numeric. Number of dimensions for output results.
metric: DEFAULT = "euclidean". Character or function. Determines how distances between data points are computed. Can also be "manhattan".
n_epochs: DEFAULT = 200. Numeric. Number of iterations performed during layout optimisation.
input: DEFAULT = "data". Character. Determines whether primary input argument is a data or distance matrix. Can also be "dist".
init: DEFAULT = "spectral". Character or matrix. Default "spectral" computes an initial embedding using eigenvectors of the connectivity graph matrix. Can also use "random" (creates an initial layout based on random coordinates).
min_dist: DEFAULT = 0.1. Numeric. Determines how close points appear in final layout.
set_op_mix_ratio: DEFAULT = 1. Numeric in the range [0, 1]. Determines how the knn-graph is used to create a fuzzy simplicial graph.
local_connectivity: DEFAULT = 1. Numeric. Used during construction of fuzzy simplicial set.
bandwidth: DEFAULT = 1. Numeric. Used during construction of fuzzy simplicial set.
alpha: DEFAULT = 1. Numeric. Initial value of "learning rate" of layout optimisation.
gamma: DEFAULT = 1. Numeric. Together with alpha, it determines the learning rate of layout optimisation.
negative_sample_rate: DEFAULT = 5. Numeric. Determines how many non-neighbour points are used per point and per iteration during layout optimisation.
a_gradient: DEFAULT = NA. Numeric. Contributes to gradient calculations during layout optimisation. When left at NA, a suitable value will be estimated automatically.
b_gradient: DEFAULT = NA. Numeric. Contributes to gradient calculations during layout optimisation. When left at NA, a suitable value will be estimated automatically.
spread: DEFAULT = 1. Numeric. Used during automatic estimation of a_gradient/b_gradient parameters.
transform_state: DEFAULT = 42. Numeric. Seed for random number generation used during predict().
knn.repeats: DEFAULT = 1. Numeric. Number of times to restart knn search.
verbose: DEFAULT = TRUE. Logical. Determines whether to show progress messages.
umap_learn_args: DEFAULT = NA. Vector. Arguments passed to the Python package umap-learn.
fast: DEFAULT = TRUE. Logical. Whether to run the uwot implementation of UMAP, which is much faster.
n_threads: DEFAULT = "auto". Numeric or "auto". Number of threads to use (except during stochastic gradient descent). For nearest neighbor search, this only applies if nn_method = "annoy". If n_threads > 1, the Annoy index is temporarily written to disk at the location determined by tempfile. The default "auto" sets this to one less than the maximum number of threads available on the computer.
n_sgd_threads: DEFAULT = "auto". Numeric or "auto". Number of threads to use during stochastic gradient descent. If set to > 1 while batch = FALSE, results will not be reproducible, even if set.seed is called with a fixed seed before running. Set to "auto" to use the same value as n_threads.
batch: DEFAULT = TRUE. Logical. If TRUE, embedding coordinates are updated at the end of each epoch rather than during the epoch. In batch mode, results are reproducible with a fixed random seed even with n_sgd_threads > 1, at the cost of slightly higher memory use. You may also have to modify learning_rate and increase n_epochs, so whether this provides a speed increase over single-threaded optimization is likely to be dataset- and hardware-dependent.
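
The "auto" behaviour described for n_threads can be pictured with a small sketch. This is an illustration only, not Spectre's actual code: resolve.threads is a hypothetical helper, the use of parallel::detectCores() to count threads is an assumption, and so is the floor of one thread on single-core machines.

```r
# Hypothetical helper (not part of Spectre): turn an n_threads value
# into a concrete thread count, mirroring the documented "auto" rule.
resolve.threads <- function(n_threads = "auto") {
  if (identical(n_threads, "auto")) {
    # Assumption: logical cores as reported by the 'parallel' package
    cores <- parallel::detectCores()
    if (is.na(cores)) {
      cores <- 1L  # detectCores() can return NA on some platforms
    }
    # "maximum number of threads - 1"; the floor of 1 is an assumption
    return(max(1L, as.integer(cores) - 1L))
  }
  # Any explicit value is passed through as an integer
  as.integer(n_threads)
}

resolve.threads()   # one less than the detected core count (at least 1)
resolve.threads(4)  # an explicit value is used as given
```

Passing an explicit number rather than "auto" may be preferable on shared machines, where leaving only one thread free could still crowd out other users.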

Author

Thomas Ashhurst, thomas.ashhurst@sydney.edu.au
Felix Marsh-Wakefield, felix.marsh-wakefield@sydney.edu.au

Examples

# Run UMAP on a subset of the demonstration dataset

library(Spectre)
# Subsample the demo dataset to 10000 cells
cell.dat <- do.subsample(demo.clustered, 10000)
cell.dat$UMAP_X <- NULL
cell.dat$UMAP_Y <- NULL

cell.dat <- run.umap(dat = cell.dat,
                     use.cols = c("NK11_asinh", "CD3_asinh",
                                  "CD45_asinh", "Ly6G_asinh", "CD11b_asinh",
                                  "B220_asinh", "CD8a_asinh", "Ly6C_asinh",
                                  "CD4_asinh"))
#> 00:51:20 UMAP embedding parameters a = 1.577 b = 0.8951
#> 00:51:20 Converting dataframe to numerical matrix
#> 00:51:20 Read 10000 rows and found 9 numeric columns
#> 00:51:20 Using Annoy for neighbor search, n_neighbors = 15
#> 00:51:21 Building Annoy index with metric = euclidean, n_trees = 50
#> 0%   10   20   30   40   50   60   70   80   90   100%
#> [----|----|----|----|----|----|----|----|----|----|
#> **************************************************|
#> 00:51:21 Writing NN index file to temp file /tmp/RtmpfFypGg/file21fe5c0a46a6
#> 00:51:21 Searching Annoy index using 3 threads, search_k = 1500
#> 00:51:22 Annoy recall = 100%
#> 00:51:23 Commencing smooth kNN distance calibration using 3 threads
#>  with target n_neighbors = 15
#> 00:51:24 Initializing from normalized Laplacian + noise (using RSpectra)
#> 00:51:24 Commencing optimization for 200 epochs, with 200182 positive edges using 3 threads
#> 00:51:24 Using rng type: pcg
#> 00:51:25 Optimization finished
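
After a run like the one above, it can be worth confirming that the embedding columns exist before plotting or gating on them. The sketch below assumes that run.umap returns the input table with two extra columns named by umap.x.name and umap.y.name, as the argument descriptions suggest. has.embedding.cols is a hypothetical helper, not part of Spectre, and the Spectre call is skipped when the package is not installed.

```r
# Hypothetical helper (not part of Spectre): TRUE if both embedding
# columns are present in the table.
has.embedding.cols <- function(dat, x.name = "UMAP_X", y.name = "UMAP_Y") {
  all(c(x.name, y.name) %in% names(dat))
}

# Only run the Spectre part if the package is available
if (requireNamespace("Spectre", quietly = TRUE)) {
  library(Spectre)

  # Same preparation as the example above
  cell.dat <- do.subsample(demo.clustered, 10000)
  cell.dat$UMAP_X <- NULL
  cell.dat$UMAP_Y <- NULL

  marker.cols <- c("NK11_asinh", "CD3_asinh", "CD45_asinh",
                   "Ly6G_asinh", "CD11b_asinh", "B220_asinh",
                   "CD8a_asinh", "Ly6C_asinh", "CD4_asinh")

  cell.dat <- run.umap(dat = cell.dat, use.cols = marker.cols,
                       verbose = FALSE)

  # Assumption: the result carries the UMAP_X / UMAP_Y columns
  if (has.embedding.cols(cell.dat)) {
    # Plain base R scatter plot of the two embedding axes
    plot(cell.dat$UMAP_X, cell.dat$UMAP_Y, pch = ".",
         xlab = "UMAP_X", ylab = "UMAP_Y")
  }
}
```

If umap.x.name or umap.y.name were changed in the run.umap call, the same names would need to be passed to the helper and used in the plot.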