In this workflow, we provide an example on how to use classification to transfer the cell annotation of a dataset to another. This workflow is most suitable those who have performed a simple discovery workflow on a small dataset and would like to annotate a larger dataset without having to repeat the entire workflow again. A step-by-step detailed simple discovery workflow is available on our wiki page: Simple Discovery Workflow.
Note that if you need to perform some batch alignment to align the small analysed and annotated dataset and the large to-be-annotated dataaset, please refer to our general discovery workflow available on our wiki page: Batch Alignment Workflow.
Here, we will simply load up the Spectre package and the demo dataset which has previously been analysed and annotated using our simple discovery workflow.
library(Spectre)
library(data.table)
cell.dat <- demo.clustered
head(cell.dat)
The population is depicted in the Population column, and there are 6 different cell types:
unique(cell.dat$Population)
## [1] "Microglia" "NK cells" "Neutrophils"
## [4] "CD4 T cells" "Infil Macrophages" "CD8 T cells"
For the purpose of demonstration, we’re going to split the demo dataset into 2; the small annotated dataset and the large un-annotated dataset. For the large dataset, we will simply remove the Population column. To split the dataset, we will employ our subsample function.
Please do not do this for your own dataset as you should already have 2 different datasets ready!
cell.dat$id <- c(1: nrow(cell.dat))
annot.cell.dat <- do.subsample(dat = cell.dat, targets = rep(3000, 6), divide.by = 'Population')
unannot.cell.dat <- cell.dat[!cell.dat$id %in% annot.cell.dat$id]
setnames(x = unannot.cell.dat, old = "Population", "ActualPopulation")
unannot.cell.dat$FlowSOM_cluster <- NULL
unannot.cell.dat$FlowSOM_metacluster <- NULL
unannot.cell.dat$UMAP_X <- NULL
unannot.cell.dat$UMAP_Y <- NULL
Let’s quickly inspect the data.
Annotated dataset:
head(annot.cell.dat)
Unannotated dataset, to be annotated using automated classifier:
head(unannot.cell.dat)
In this workflow, we will be using the simple (yet effective!) K-Nearest Neighbour (KNN) classifier. For those who are not familiar with KNN, here is a super brief 1 sentence explanation. For each new un-annotated cell, KNN will identify K number of cells from the annotated dataset that lie the closest to it. This is often called neighbours. Then it will determine the majority population label of the neighbours and assign it to the new un-annotated cell.
Setting this K number is often non-trivial. If you are not sure what is suitable, then maybe it is best to try a few and see each K value perform. To do this, Spectre provides the facility to try out KNN using a range of K values on the annotated dataset:
classify_cols <- c("NK11_asinh", "CD3_asinh", "CD45_asinh", "Ly6G_asinh", "CD11b_asinh", "B220_asinh", "CD8a_asinh", "Ly6C_asinh", "CD4_asinh")
knn.stats <- train.knn.classifier(dat = annot.cell.dat,
use.cols = classify_cols,
label.col = "Population",
max.num.neighbours = 20)
## Loading required package: caret
## Loading required package: lattice
## Loading required package: ggplot2
## Evaluating 1
## Loading required package: FNN
## Evaluating 2
## Evaluating 3
## Evaluating 4
## Evaluating 5
## Evaluating 6
## Evaluating 7
## Evaluating 8
## Evaluating 9
## Evaluating 10
## Evaluating 11
## Evaluating 12
## Evaluating 13
## Evaluating 14
## Evaluating 15
## Evaluating 16
## Evaluating 17
## Evaluating 18
## Evaluating 19
## Evaluating 20
A side note, the parameter classify_cols basically refer to the markers you want KNN to use when computing the distance between cells. Most of the time, these should be your cell type markers.
The max.num.neighours indicates the maximum number of neighbours the function will try. You can set this to any whole number > 1, but note that the larger this number is, the longer the function will take to run as it will basically run through every number between 1 to max.num.neighbours. You can set a minimum number of neighbours using the parameter min.num.neighbours.
The message basically tells you which K value is being trained.
Let’s inpsect the result:
knn.stats
From this table, it looks like K = 19 yielded the best accuracy score. It is worthwhile to note that larger K value does not imply better performance!
It is as simple as running the following function:
unannot.cell.dat <- Spectre::run.knn.classifier(train.dat = annot.cell.dat,
unlabelled.dat = unannot.cell.dat,
use.cols = classify_cols,
label.col = "Population",
num.neighbours = 19)
train.dat refers to the dataset which has previously been annotated.
unlabelled.dat refers to the dataset to be annotated.
num.neighbours refers to the number of neighbours.
Let’s inspect the result:
head(unannot.cell.dat)
There are several columns to be aware of:
That’s about it! Good luck!