Inference Methods, Pseudotime Analysis and RNA Velocity.
The complex tissues that form the adult body arise during development through cell differentiation. The blueprint of this astonishing process is encoded in the genome, which is shared by all cells within an organism and serves as the instruction set for a single zygote to give rise to the complex tissues and organs of the adult body. During this process, the emerging cell populations adapt their transcriptional program to coordinate their behavior in adulthood. Similar regulatory circuits govern tissue homeostasis, regeneration, and repair and, when they go awry, initiate disease [1]. Mapping the developmental or disease history of differentiated cells is a fundamental goal of developmental and stem cell biologists and has a major impact on other fields such as regenerative medicine and cancer biology [2].
Historically, cell lineages were studied by observing individual cells with simple light microscopy by utilizing dyes for labelling of cells (e.g. surface proteins stained with antibodies) or inheritable markers (e.g. fluorescently labelled marker proteins or genetic barcodes) to perform lineage tracing [3]. Advancements in single cell (sc) methods, in particular RNAseq, enabled the mapping of clonal relationships between cells based on transcriptomes to infer cell trajectories [2]. In the context of cellular differentiation, a plethora of studies applied scRNA sequencing to decipher the molecular mechanisms of lineage specification in vivo and in vitro. Examples include specification of adipocytes from adipocyte stem and progenitor cells (ASPC) [4], hematopoiesis from hematopoietic stem cells [5] or the lineage allocation and tissue organization in the early embryo [6].
To extract meaningful biological information and map cell trajectories from complex, noisy, and high-dimensional scRNA sequencing data, computational tools were developed. An scRNA-seq dataset is generated at a defined moment and can be considered a snapshot collection of all the different cell states present within the sample at that time, with each cell represented as a point in a high-dimensional Euclidean space [2], [5]. From this collection of cell states, a cell state manifold can be inferred using dimensionality reduction and clustering methods based on similarity in gene expression. The most common methods used for this include Principal Component Analysis (PCA), t-distributed Stochastic Neighbour Embedding (t-SNE) and Uniform Manifold Approximation and Projection (UMAP).
See Figure 1 for more details.
These methods can provide information on the relationships between cells: cells with similar gene expression profiles cluster close together in the cell state manifold. This can be used to classify the different cell types and states present in the snapshot. However, it gives no information on the real-time temporal dynamics between cell states, nor on the direction of the transition from one cell state to another: which cell state came first? To address this limitation, approaches such as pseudotime analysis and RNA velocity have been developed.
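The construction of a cell state manifold described above can be sketched in a few lines. The following is a minimal illustration, not a reference pipeline: it assumes a toy count matrix drawn from two synthetic "cell types", a simple log-normalization, PCA computed via singular value decomposition, and a brute-force nearest-neighbour search. All function names (log_normalize, pca, knn_indices) and parameter values are hypothetical choices made for this example.

```python
import numpy as np

def log_normalize(counts, target_sum=1e4):
    """Scale every cell to the same total count, then apply log(1 + x)."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # guard against empty cells
    return np.log1p(counts / totals * target_sum)

def pca(x, n_components=2):
    """Project cells (rows) onto the leading principal components via SVD."""
    centered = x - x.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

def knn_indices(points, k=5):
    """For each cell, return the indices of its k nearest neighbours."""
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)  # a cell is not its own neighbour
    return np.argsort(dist, axis=1)[:, :k]

# Toy data (assumption): 2 cell types x 30 cells x 50 genes; the two types
# differ in the expression level of the first 10 genes only.
rng = np.random.default_rng(0)
lam_a = np.array([30.0] * 10 + [5.0] * 40)
lam_b = np.array([3.0] * 10 + [5.0] * 40)
counts = np.vstack([rng.poisson(lam_a, size=(30, 50)),
                    rng.poisson(lam_b, size=(30, 50))])
labels = np.repeat([0, 1], 30)

embedding = pca(log_normalize(counts), n_components=2)
neighbours = knn_indices(embedding, k=5)

# Fraction of nearest neighbours that share the cell type of the query cell.
same_type = (labels[neighbours] == labels[:, None]).mean()
print(f"fraction of neighbours of the same type: {same_type:.2f}")
```

In this toy setting, cells of the same synthetic type end up as each other's nearest neighbours, which is the property that clustering and manifold methods exploit. Note that nothing in the neighbour graph encodes which state precedes which.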
Figure 1: Making sense of scRNA-seq data. A) scRNA-seq data can be visualized as a count matrix with individual cells and gene counts as dimensions. B) This matrix can be plotted with cells as points and genes as axes in a high-dimensional Euclidean space. C) Dimensionality reduction methods facilitate visualization while preserving local and global structure. Among the most prominent algorithms are Principal Component Analysis (PCA) (Ca) and nearest neighbour approaches (NNA) (Cb). PCA is a linear transformation that identifies the major axes of variability, whereas NNAs measure the local distances between cells to generate clusters. D) NNAs include, among others, state tree building methods (Da), t-distributed stochastic neighbour embedding (t-SNE) (Db) and uniform manifold approximation and projection (UMAP) (Dc). State tree building methods connect individual cells or cell clusters along the shortest path(s), thereby generating a tree-like structure. t-SNE locally groups similar cells based on a local Gaussian distribution to generate plots in which similar cells or clusters are depicted in proximity. UMAP builds a topological representation of the high-dimensional space by stitching together local neighbourhoods into more complex structures called simplicial complexes.
Estimating dynamics in cell state trajectories using Pseudotime analysis and RNA velocity
To predict temporal dynamics from scRNA sequencing data, pseudotime algorithms can be deployed, with the caveat that the directionality of cell state transitions must be previously known or assumed. Alternatively, directionality can be accounted for either by taking transcriptional snapshots of cell states at defined time intervals (scRNA-seq time course experiments) or by incorporating independently measured state dynamics using methods such as RNA velocity.
Adding directionality: inferring cell trajectories from scRNA data using pseudotime analysis
There is a plethora of algorithms that attempt to infer the directionality of cell fate trajectories, which are often summarized under the term “pseudotime”. It is up to the researcher to define a starting point of the trajectory based on previous knowledge of the experimental system [7]. From this starting point, pseudotime algorithms order the cells in the cell state manifold according to how similar the transcriptome of each cell or cluster is to that of the cells defined as the starting point [8]. However, the choice of starting point is an assumption that can introduce bias into the results.
Figure 2: Inferring trajectories using pseudotime algorithms. Pseudotime is a one-dimensional, latent representation of cellular states. In more biological terms: it provides a distance function from the progenitor cell to all downstream cells based on sc expression profiles. A) A cell state manifold is constructed based on the expression profiles of individual cells (depicted here as a kNN manifold). B) Pseudotime is inferred from a state manifold by finding a smooth and continuous curve/path that passes through the manifold, representing the most likely trajectory of cell transitions from a fixed starting point (here the progenitor cell). The pseudotime of each cell is then defined as its distance along the curve from the initial or root cell state. C) By examining the expression patterns of individual genes or gene clusters along the pseudotemporal ordering, mechanistic insights into the specification of the individual branches can be gained.
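The idea of pseudotime as a distance from a root cell along the manifold can be illustrated with a small sketch. The example below is one simple way to realize it, assuming that pseudotime is approximated by the shortest-path (geodesic) distance on a kNN graph from a user-chosen root cell; published pseudotime tools use a variety of other formulations. The function names, the toy data (cells placed along a noisy arc) and the choice k=3 are all hypothetical.

```python
import heapq
import numpy as np

def knn_graph(points, k=3):
    """Build a symmetric kNN graph as {cell: [(neighbour, distance), ...]}."""
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    graph = {i: [] for i in range(len(points))}
    for i in range(len(points)):
        for j in np.argsort(dist[i])[1:k + 1]:  # position 0 is the cell itself
            graph[i].append((int(j), float(dist[i, j])))
            graph[int(j)].append((i, float(dist[i, j])))
    return graph

def pseudotime(graph, root):
    """Dijkstra shortest-path distance from the root cell, rescaled to [0, 1]."""
    dist = np.full(len(graph), np.inf)
    dist[root] = 0.0
    heap = [(0.0, root)]
    while heap:
        d, i = heapq.heappop(heap)
        if d > dist[i]:
            continue  # stale heap entry
        for j, w in graph[i]:
            if d + w < dist[j]:
                dist[j] = d + w
                heapq.heappush(heap, (d + w, j))
    reachable = np.isfinite(dist)
    longest = dist[reachable].max()
    if longest > 0:
        dist[reachable] /= longest
    return dist  # cells not connected to the root keep the value inf

# Toy data (assumption): 40 cells progressing along a half circle, with noise.
rng = np.random.default_rng(1)
progress = np.linspace(0.0, np.pi, 40)           # hidden "true" progress
cells = np.column_stack([np.cos(progress), np.sin(progress)])
cells += rng.normal(scale=0.01, size=cells.shape)

pt = pseudotime(knn_graph(cells, k=3), root=0)   # the root is chosen by the user
print("correlation with true progress:", np.corrcoef(pt, progress)[0, 1])
```

The ordering depends entirely on the root passed in: choosing the last cell as root would reverse the pseudotime axis without any change in the data, which is why the choice of starting point matters.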
Limitations of inferring dynamics into cell fate trajectories using Pseudotime analysis
The most prominent and critical assumption is that transcriptional similarity between cells directly implies a developmental or functional relationship, which is not inherently true. A good example is provided by the primitive endoderm and the definitive endoderm in early mammalian development. The primitive endoderm is specified at the blastocyst stage directly from the inner cell mass, whereas the definitive endoderm emerges during gastrulation from the epiblast. However, both the primitive and the definitive endoderm will contribute to the gut endoderm and have similar expression profiles [9]. Because of these transcriptional similarities, a method based on similarity clustering would place both cell states close together in a cell fate trajectory even though the cells emerged from different precursors.
To obtain the transcriptome data, the cells must be destroyed. The measurement is therefore by definition an endpoint measurement that yields a snapshot of cellular states. In other words, it is not possible to measure gene expression changes of the same cell over time with current methods. This fundamental restriction of current methodologies forces us to make the assumptions that:
- the sample from which the single-cell transcriptomes are recorded captures all cell states (stem, progenitor and differentiated cells) at sufficient coverage.
- all cells follow the same process, so that their states collapse into one trajectory.
These assumptions hold true for samples where the tissue is in homeostasis and cell fate transitions are therefore continuous. A good example is the adult mammalian skin where stem cells, progenitor cells and differentiated cells are in balance to continuously renew the stratified epithelium [10]. Nevertheless, for a detailed reconstruction of this cell fate trajectory exhaustive sampling is required to ensure that rare cell types or fast cell state transitions are sufficiently represented in the data set.
In developmental processes, where cell state transitions are strictly temporally ordered and immediate, snapshot data will not capture the entire trajectory and minor cell states might be missed. Gastrulation is a good example of this limitation: within a short period of time, the pluripotent cells of the inner cell mass form the epiblast cells and later, during gastrulation, are specified as cells of the definitive endoderm, mesoderm, and ectoderm. Although this process takes only two days during mouse development, capturing all these cell states within one snapshot is impossible [6].
Once the cell state manifold is constructed from the scRNA-seq data by the methods introduced above, temporal directionality within the manifold is not given and must be inferred. This introduces several critical sources of uncertainty. Firstly, and most prominently, inference of the trajectories requires previous knowledge. In other words, a starting or end point of the trajectory must be defined based on previous experiments, thereby introducing a bias. This leads to a dilemma: too much prior information overconstrains the trajectory, whereas too little restriction on the starting/end point can cause the inference process to collapse. In addition, some inference methods assume fixed trajectory topologies that do not allow complex structures such as loops, alternative paths and cross-differentiation, adding another layer of bias.
Inferring directionality through RNA velocity and Cell Barcoding methods
RNA velocity is a computational method that uses the relative abundance of unspliced (precursor) and spliced (mature) RNA transcripts to estimate the future state of individual cells. By comparing the ratio of unspliced to spliced transcripts for each gene, RNA velocity can infer the direction and speed of gene expression changes along the inferred trajectory. The timescale of cellular development is comparable to the kinetics of the mRNA life cycle: transcription of precursor mRNA, production of mature mRNA via splicing, followed by mRNA degradation. Therefore, the ratio of unspliced to spliced mRNA can be leveraged to predict the rate and direction of change in gene expression. A balanced ratio indicates homeostasis (steady state), whereas an excess of unspliced mRNA indicates upcoming induction and a deficit indicates upcoming repression of the gene. Thus, RNA velocity can be used as a proxy for how mRNA levels might evolve over time.
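The reasoning above can be made concrete with a small numerical sketch. It assumes a simplified steady-state model in which the change of spliced mRNA is ds/dt = u - gamma * s (with the splicing rate normalized to 1), and it estimates the gene-specific ratio gamma by a least-squares fit through the origin over all cells. Both the model simplification and the function names (fit_gamma, velocity) are assumptions made for this illustration; dedicated RNA velocity tools use more elaborate fitting procedures.

```python
import numpy as np

def fit_gamma(unspliced, spliced):
    """Per-gene slope of a least-squares line through the origin: u ~ gamma * s."""
    numerator = (unspliced * spliced).sum(axis=0)
    denominator = (spliced ** 2).sum(axis=0)
    safe = np.where(denominator > 0, denominator, 1.0)  # avoid division by zero
    return numerator / safe

def velocity(unspliced, spliced, gamma):
    """Velocity under the assumed model ds/dt = u - gamma * s."""
    return unspliced - gamma * spliced

# Toy data (assumption): one gene measured in six cells (rows = cells).
# Cells 0-3 lie on the steady-state line u = 0.5 * s.
# Cell 4 has excess unspliced mRNA, cell 5 has a deficit.
spliced = np.array([[1.0], [2.0], [3.0], [4.0], [2.0], [2.0]])
unspliced = np.array([[0.5], [1.0], [1.5], [2.0], [2.0], [0.2]])

gamma = fit_gamma(unspliced, spliced)
v = velocity(unspliced, spliced, gamma)

for cell, value in enumerate(v[:, 0]):
    trend = "induction" if value > 0.1 else "repression" if value < -0.1 else "near steady state"
    print(f"cell {cell}: velocity {value:+.2f} ({trend})")
```

The threshold of 0.1 used for labelling is arbitrary and only serves the printout. The sign of the velocity carries the directional information: positive values suggest that spliced mRNA levels of the gene are about to rise, negative values that they are about to fall.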
This makes it possible to predict the differentiation potential and fate decisions of cells and to identify key regulators of cell transitions. RNA velocity adds a temporal dimension to single-cell transcriptomics and can improve the accuracy and resolution of trajectory inference. However, RNA velocity also has limitations. First, it relies on the assumption that splicing rates are constant and gene-specific, which may not hold true in dynamic biological processes. Second, it requires many cells and high sequencing depth to achieve reliable estimates of velocity vectors, which may not be feasible for some samples or systems.
Figure 3: Inferring directionality by superimposing additional data layers. A) By combining the cell state trajectory obtained from scRNA-seq data (left) with the cell lineage tree from a barcoding experiment, the true direction of the trajectory can be revealed (lower panel). Note that the cell state trajectory is inferred based on transcriptional similarity, while the lineage relationships depicted in the cell lineage tree are clonal histories. B) By modelling mRNA splicing dynamics, RNA velocity can add a temporal dimension to scRNA-seq data. This additional dimension can be used to predict the future cell state. Each individual cell can then be projected into a low-dimensional embedding using its assigned RNA velocity vector, thereby predicting its direction and speed along the trajectory. The probabilities for each possible cell transition are used to embed the cell in the neighbourhood graph. The arrows in an RNA velocity plot, therefore, show the directional flow of cells in the low-dimensional embedding.
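The projection step described in panel B can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure of any particular tool: each cell is assumed to already have a velocity vector in gene space, transition probabilities are taken as a softmax over the cosine similarity between that velocity and the displacement to every other cell, and the arrow in the embedding is the probability-weighted average of unit displacements. The function names and the scale parameter sigma are hypothetical.

```python
import numpy as np

def transition_probabilities(x, v, sigma=0.1):
    """Softmax over the cosine similarity between velocity and displacement."""
    n = len(x)
    probs = np.zeros((n, n))
    for i in range(n):
        delta = x - x[i]  # displacement from cell i to every cell
        norms = np.linalg.norm(delta, axis=1) * np.linalg.norm(v[i])
        cos = np.divide(delta @ v[i], norms, out=np.zeros(n), where=norms > 0)
        weights = np.exp(cos / sigma)
        weights[i] = 0.0  # no self-transition
        probs[i] = weights / weights.sum()
    return probs

def embed_velocity(embedding, probs):
    """Probability-weighted mean of unit displacements in the embedding."""
    arrows = np.zeros_like(embedding)
    for i in range(len(embedding)):
        delta = embedding - embedding[i]
        norms = np.linalg.norm(delta, axis=1, keepdims=True)
        unit = np.divide(delta, norms, out=np.zeros_like(delta), where=norms > 0)
        arrows[i] = probs[i] @ unit
    return arrows

# Toy data (assumption): four cells along one axis of a 3-gene space,
# all with a velocity pointing towards higher values of gene 0.
x = np.array([[0.0, 0.0, 0.0],
              [1.0, 0.0, 0.0],
              [2.0, 0.0, 0.0],
              [3.0, 0.0, 0.0]])
v = np.tile([1.0, 0.0, 0.0], (4, 1))
embedding = x[:, :2].copy()  # stand-in for a 2D embedding such as PCA

probs = transition_probabilities(x, v)
arrows = embed_velocity(embedding, probs)
print(arrows.round(2))
```

In this toy example the arrows of the first three cells point towards the cells that lie ahead of them. The last cell has no neighbour in the direction of its velocity, so its arrow is not meaningful here, which hints at why velocity arrows near the ends of a sampled trajectory should be read with care.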
Another approach to infer directionality and dynamics is to combine molecular barcoding with single-cell transcriptomics. Molecular barcoding is a technique that labels cells with unique DNA or RNA sequences, allowing their clonal history and lineage relationships to be tracked over time. By integrating barcode information with gene expression data, it is possible to reconstruct fine-grained clonal trees annotated with transcriptional states. This can reveal the heterogeneity and plasticity of cell fate decisions, as well as the effects of external stimuli or perturbations on cell trajectories. However, this approach also has drawbacks. First, it requires the introduction of exogenous barcodes into cells, which may interfere with their natural behavior or viability. Second, it depends on the availability and efficiency of barcode delivery methods, which may vary across cell types and systems. Third, it faces technical challenges in barcode sequencing, such as errors, dropout, or amplification bias, which may affect the accuracy and robustness of the inferred clonal trees.
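As a minimal illustration of integrating barcode information with transcriptional states, the sketch below groups cells by a shared barcode and tabulates which states each clone contributes to. It assumes that every cell carries exactly one error-free barcode and one state label obtained from clustering; the barcodes, state names and function names are invented for the example, and real analyses additionally have to handle the sequencing errors and dropout mentioned above.

```python
from collections import Counter, defaultdict

def clone_state_table(barcodes, states):
    """Count, for every barcode (clone), how many cells fall into each state."""
    table = defaultdict(Counter)
    for barcode, state in zip(barcodes, states):
        table[barcode][state] += 1
    return table

def fate_bias(table, barcode, state):
    """Fraction of a clone's cells that are found in the given state."""
    counts = table[barcode]
    total = sum(counts.values())
    return counts[state] / total if total else 0.0

# Toy data (assumption): six cells, three clones, three transcriptional states.
barcodes = ["AAC", "AAC", "AAC", "GTT", "GTT", "CCA"]
states = ["progenitor", "fate_1", "fate_1", "progenitor", "fate_2", "fate_2"]

table = clone_state_table(barcodes, states)
for barcode, counts in table.items():
    print(barcode, dict(counts))
print("clone AAC bias towards fate_1:", round(fate_bias(table, "AAC", "fate_1"), 2))
```

Because cells sharing a barcode descend from the same labelled ancestor, a clone that contains both progenitor and differentiated cells links those states by lineage rather than by transcriptional similarity alone.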
Live-cell Sequencing (Live-seq): a new kid on the block
Pseudotime analysis, RNA velocity and barcoding methods offer ways to add directionality and dynamics to scRNA-seq datasets. However, trajectory inference still faces the fundamental conundrum that we cannot measure the same cell along its trajectory, but only snapshots of different cells at different stages. This limits our ability to capture the full complexity and variability of cell dynamics and transitions.
This is particularly challenging for rare cell types that exist within heterogeneous cell populations, as these are difficult to address with conventional methods. Consider, for example, the transcriptional response of individual cells to a stimulus such as a cancer drug. Rare cell types could be more resistant, but this is difficult to track in snapshot data if these rare cells are not sufficiently represented in the sample. The ability to link the phenotypic response of a single cell to the cancer drug with its transcriptome in real time would provide great insights into the dynamics and heterogeneous response of the cell population. To overcome this limitation, a new technique has emerged that promises to revolutionize trajectory inference: live-cell sequencing (Live-seq). Live-seq is a novel approach for sequencing individual living cells over time [11]. The technique involves taking cytoplasmic biopsies from a single cell using a FluidFM® Nanosyringe without impairing the cell’s viability. The transcriptome is then determined using a low-input RNAseq protocol.
Figure 4: Live-cell sequencing (Live-seq): a novel approach to sequence a cell while keeping it alive. Transcriptome profiling on the same cell is now possible.
Unlike conventional single-cell RNAseq methods, Live-seq does not require cell lysis or fixation, which means that the cells stay alive and can be sequenced again at different time points. This overcomes the inherent limitation of trajectory inference methods that assume a continuous process from discrete data points. By tracking the same cell along its developmental pathway, Live-seq can achieve true temporal resolution at the single cell level and uncover the dynamic cellular processes in the real time domain. This is essential for elucidating the cellular behavior in its natural state, without perturbing or destroying it. By sequencing the same cell multiple times, Live-seq can generate time series data that reveal the temporal patterns and fluctuations of gene expression. This can shed light on the molecular mechanisms and regulatory networks that control cell fate decisions and transitions.