Identifying cell types from single-cell RNA sequencing data is now more efficient than ever
Single Cell Clustering Assessment Framework (SCCAF) is a new automated method to identify cell types. Credit: Spencer Phillips/EMBL-EBI

Identifying different types of cells within a tissue or an organ can be very challenging and time-consuming. Methods to identify cell types from single-cell RNA sequencing data have been proposed, but they all fall short when it comes to discovering potentially new cell types. Researchers from the Wellcome Sanger Institute and EMBL’s European Bioinformatics Institute (EMBL-EBI) have created a new method called Single Cell Clustering Assessment Framework (SCCAF) that fills this gap.

Published today (18th May) in Nature Methods, the automated method uses machine learning and can both replicate the manual expert annotations normally used for this task and characterise new cell types.

All somatic cells in a multicellular organism have the same genome, yet they perform a variety of functions. This functional diversity occurs between cells of different types (skin cells and neurons, for instance), but also between states of the same cell lineage as it differentiates.

Historically, researchers have identified cell types or states based on visible features or the expression of a handful of genes. Single-cell RNA sequencing (scRNA-seq) has brought high-throughput gene expression data into the picture.

A cell’s gene expression pattern (which genes are expressed at what level) serves as a proxy for its function and allows scientists to classify or “cluster” that cell with others that have the same function. Until now, annotating cells from scRNA-seq data has required time-consuming human intervention, with automated methods unable to identify cell types or states that had not been previously annotated by human experts.

The researchers came up with a method that uses machine learning to address these challenges.

Single Cell Clustering Assessment Framework (SCCAF) starts by using a clustering algorithm to group the cells of a sample into many clusters, based on their gene expression patterns. Each cell cluster is split into a “training set” and a “testing set” for the second stage of the analysis. A classifying model then takes over, using the training set to learn to distinguish cell clusters, and predicting likely clusters in the testing set. The model’s accuracy is assessed by comparing its prediction with the original clusters.
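
The cluster, split, classify and compare steps described above can be illustrated with a minimal Python sketch. This is not the SCCAF implementation: the choice of k-means for clustering, logistic regression as the classifying model, the scikit-learn library, the synthetic data and the helper name `assess_clustering` are all assumptions made for illustration only.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def assess_clustering(expression, n_clusters, test_size=0.5, seed=0):
    """Cluster cells, then measure how well a classifier re-learns the clusters.

    Hypothetical helper: k-means and logistic regression are stand-ins,
    not necessarily the algorithms used by SCCAF.
    """
    # Stage 1: group cells into clusters based on their expression patterns.
    labels = KMeans(
        n_clusters=n_clusters, n_init=10, random_state=seed
    ).fit_predict(expression)

    # Split each cluster into a training set and a testing set
    # (stratifying keeps every cluster represented in both halves).
    x_train, x_test, y_train, y_test = train_test_split(
        expression, labels,
        test_size=test_size, stratify=labels, random_state=seed,
    )

    # Stage 2: the classifier learns to distinguish clusters on the
    # training set, then predicts likely clusters for the testing set.
    model = LogisticRegression(max_iter=1000).fit(x_train, y_train)
    predicted = model.predict(x_test)

    # Accuracy: agreement between the predictions and the original clusters.
    return accuracy_score(y_test, predicted)


# Synthetic stand-in for a cells-by-genes expression matrix
# (600 "cells", 20 "genes", 4 underlying groups).
expression, _ = make_blobs(
    n_samples=600, n_features=20, centers=4, random_state=0
)
accuracy = assess_clustering(expression, n_clusters=4)
print(f"Held-out accuracy: {accuracy:.2f}")
```

In this sketch, a high held-out accuracy suggests that the clusters are distinguishable from one another by their expression patterns, while a low accuracy suggests that some clusters may not be meaningfully distinct.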