In this workflow, we provide an example on how to use classification to transfer the cell annotation of a dataset to another. This workflow is most suitable those who have performed a simple discovery workflow on a small dataset and would like to annotate a larger dataset without having to repeat the entire workflow again. A step-by-step detailed simple discovery workflow is available on our wiki page: Simple Discovery Workflow.

Note that if you need to perform some batch alignment to align the small *analysed and annotated* dataset and the large *to-be-annotated* dataaset, please refer to our general discovery workflow available on our wiki page: Batch Alignment Workflow.

Here, we will simply load up the Spectre package and the demo dataset which has previously been analysed and annotated using our simple discovery workflow.

```
library(Spectre)
library(data.table)
cell.dat <- demo.clustered
head(cell.dat)
```

The population is depicted in the *Population* column, and there are 6 different cell types:

`unique(cell.dat$Population)`

```
## [1] "Microglia" "NK cells" "Neutrophils"
## [4] "CD4 T cells" "Infil Macrophages" "CD8 T cells"
```

For the purpose of demonstration, we’re going to split the demo dataset into 2; the small ** annotated** dataset and the large

*Please do not do this for your own dataset as you should already have 2 different datasets ready!*

```
cell.dat$id <- c(1: nrow(cell.dat))
annot.cell.dat <- do.subsample(dat = cell.dat, targets = rep(3000, 6), divide.by = 'Population')
unannot.cell.dat <- cell.dat[!cell.dat$id %in% annot.cell.dat$id]
setnames(x = unannot.cell.dat, old = "Population", "ActualPopulation")
unannot.cell.dat$FlowSOM_cluster <- NULL
unannot.cell.dat$FlowSOM_metacluster <- NULL
unannot.cell.dat$UMAP_X <- NULL
unannot.cell.dat$UMAP_Y <- NULL
```

Let’s quickly inspect the data.

Annotated dataset:

`head(annot.cell.dat)`

Unannotated dataset, to be annotated using automated classifier:

`head(unannot.cell.dat)`

In this workflow, we will be using the simple (yet effective!) K-Nearest Neighbour (KNN) classifier. For those who are not familiar with KNN, here is a super brief 1 sentence explanation. For each new un-annotated cell, KNN will identify *K* number of cells from the annotated dataset that lie the closest to it. This is often called *neighbours*. Then it will determine the majority population label of the neighbours and assign it to the new un-annotated cell.

Setting this *K* number is often non-trivial. If you are not sure what is suitable, then maybe it is best to try a few and see each *K* value perform. To do this, Spectre provides the facility to try out KNN using a range of *K* values on the annotated dataset:

```
classify_cols <- c("NK11_asinh", "CD3_asinh", "CD45_asinh", "Ly6G_asinh", "CD11b_asinh", "B220_asinh", "CD8a_asinh", "Ly6C_asinh", "CD4_asinh")
knn.stats <- train.knn.classifier(dat = annot.cell.dat,
use.cols = classify_cols,
label.col = "Population",
max.num.neighbours = 20)
```

`## Loading required package: caret`

`## Loading required package: lattice`

`## Loading required package: ggplot2`

`## Evaluating 1`

`## Loading required package: FNN`

`## Evaluating 2`

`## Evaluating 3`

`## Evaluating 4`

`## Evaluating 5`

`## Evaluating 6`

`## Evaluating 7`

`## Evaluating 8`

`## Evaluating 9`

`## Evaluating 10`

`## Evaluating 11`

`## Evaluating 12`

`## Evaluating 13`

`## Evaluating 14`

`## Evaluating 15`

`## Evaluating 16`

`## Evaluating 17`

`## Evaluating 18`

`## Evaluating 19`

`## Evaluating 20`

A side note, the parameter *classify_cols* basically refer to the markers you want KNN to use when computing the distance between cells. Most of the time, these should be your cell type markers.

The *max.num.neighours* indicates the maximum number of neighbours the function will try. You can set this to any whole number > 1, but note that the larger this number is, the longer the function will take to run as it will basically run through every number between 1 to *max.num.neighbours*. You can set a minimum number of neighbours using the parameter *min.num.neighbours*.

The message basically tells you which *K* value is being trained.

Let’s inpsect the result:

`knn.stats`

From this table, it looks like *K = 19* yielded the best accuracy score. It is worthwhile to note that larger *K* value does not imply better performance!

It is as simple as running the following function:

```
unannot.cell.dat <- Spectre::run.knn.classifier(train.dat = annot.cell.dat,
unlabelled.dat = unannot.cell.dat,
use.cols = classify_cols,
label.col = "Population",
num.neighbours = 19)
```

*train.dat* refers to the dataset which has previously been annotated.

*unlabelled.dat* refers to the dataset to be annotated.

*num.neighbours* refers to the number of neighbours.

Let’s inspect the result:

`head(unannot.cell.dat)`

There are several columns to be aware of:

- Prediction: this is the cell population assigned by the KNN classifier
- Neigbour_x: it tells you which cell in the annotated dataset is the x-th closest neighbour.

That’s about it! Good luck!