Estimating the number of true cell barcodes in single cell RNA-Seq (part 1)

Single Cell RNA-Seq (scRNA-Seq) took a major leap forward in 2015 when two Harvard groups published oil droplet-based methods for sequencing >10,000 cells in a single experiment for << $1/cell (Klein et al, Macosko et al). Then in early 2016, 10X Genomics launched their commercial version of droplet scRNA-Seq, followed more recently by the Illumina and Bio-Rad competitor technology. From my limited exposure, the 10X product seems to be the most popular right now, but this is a very new field so there are likely more improved methods on the horizon which could shake things up. These techniques all label cells in fundamentally the same manner: using microfluidic devices, individual cells are encapsulated together with a random cell barcode in a droplet. The reverse transcription of the RNA and the subsequent PCR amplification are then performed inside each droplet separately, such that all cDNA from a single cell is labelled with the same cell barcode.

The techniques themselves are really exciting but I want to focus here on a technical question during the analysis. A crucial aspect of droplet scRNA-Seq is that the number of possible cell barcodes is many orders of magnitude greater than the number of cells sequenced. Hence, if each sequenced cell gets one cell barcode at random, the probability of two cells having the same barcode is very low (if you’re interested, you can work out the probability using the solution to the famous ‘birthday problem’). However, when you get your sequencing data back and extract the cell barcodes detected, not all of these will be ‘true’ cell barcodes: some will contain sequencing errors or PCR errors. Furthermore, some cell barcodes may have been incorporated into cDNA that does not originate from your cells (more on this later). Hence, one of the first challenges in droplet scRNA-Seq is working out how many cells you’ve actually sequenced! In the supplementary material of the original drop-seq paper (Macosko et al) they suggest plotting the cumulative frequency and looking for an inflection point, or ‘knee’. They show this in supplementary Figure 3A (reproduced below in Figure 1 with permission). The 10X analysis pipeline also appears to take a similar approach, from what I can gather from their website.


Figure 1. Cumulative frequency for drop-seq data. Plot taken from Macosko et al supplementary figure 3 with permission.  Dashed line shows inferred position of point of inflection.
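
As an aside, the expected number of barcode collisions is easy to estimate with the birthday problem approximation. Here’s a quick sketch in python (the 12bp barcode length is the drop-seq design; the Poisson approximation is an assumption for illustration):

import math

def expected_collisions(n_cells, n_barcodes):
    # expected number of cell pairs sharing a barcode (birthday problem)
    return n_cells * (n_cells - 1) / (2.0 * n_barcodes)

def p_any_collision(n_cells, n_barcodes):
    # Poisson approximation for P(at least one pair collides)
    return 1 - math.exp(-expected_collisions(n_cells, n_barcodes))

# e.g. 10,000 cells with 12bp barcodes (4**12 = ~16.8M possibilities)
print(expected_collisions(10000, 4 ** 12))  # ~3 colliding pairs

So even with 10,000 cells, we expect only a handful of them to share a barcode.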

Now, if you’ve been on this blog before, you’ll probably have seen the posts about UMI-tools. This is a tool suite I’ve co-developed with Ian Sudbery (University of Sheffield) to accurately identify PCR duplicates in techniques utilising unique molecular identifiers (UMIs). UMIs are similar to cell barcodes in that they are just a sequence of bases used to barcode a read, though in this case, the barcode relates to the unique molecule prior to PCR amplification. The novel aspect of our tools is that we use networks to identify PCR duplicates in an error-aware manner. For background on the network methods, please see our publication or my previous blog post. For the next major UMI-tools release (version 0.5), we’re planning to expand our support for single cell RNA-Seq analyses as this seems to be the major application of the tool to date according to our GitHub issues. As part of this, we wanted to expand the extract command. Currently this command extracts the UMI sequence from the reads and appends it to the read name so that the UMI can be accessed in the downstream BAM. We wanted to extend the extract command so that it can:

  • extract cell barcodes
  • filter cell barcodes against a whitelist of expected cell barcodes
  • or, where no whitelist exists (as in the case of droplet scRNA-Seq), create a whitelist of ‘true’ cell barcodes from the data and filter against this

For the final point, we needed to implement an automated version of the ‘knee’ method as described in the drop-seq paper. We also wondered whether our network based methods might be suitable for this purpose as they are designed to handle the situation where you have an unknown number of ‘true’ barcodes plus additional barcodes which have been generated via errors from the ‘true’ barcodes. I therefore set out to implement a ‘knee’ method and compare it with our network methods.

To start with, I took the SCRB-Seq data from Soumillon et al since I had it to hand, and this includes 384 expected cell barcodes which I could use as a ground truth. All other barcodes observed should therefore have originated as errors from the 384 true barcodes. To be clear, this is not droplet scRNA-Seq, but it gives us a definitive ground truth for the purposes of initial performance testing. Plotting the cumulative frequency we can clearly see an inflection near the 384th barcode as expected (Figure 2; blue line).


Figure 2. Cumulative frequency for SCRB-Seq cell barcodes. Blue line represents the position of the 384th cell which is close to the inflection point in the curve.

I initially attempted to identify the inflection point by fitting a spline function to the cumulative frequency, but quickly realised it was much easier to work from the distribution of counts (log10) per cell barcode (Figure 3). Here the inflection point is represented by the local minimum between the distributions of counts for ‘true’ and ‘false’ barcodes. We can see that the local minimum is very close to the 384th cell barcode, so this automated knee method seems to work well for this data. The obvious downside of this hard thresholding on read counts with the knee method is that there will likely be some overlap between the read counts for real and false barcodes (as we can see in Figure 3). This is why we believed the network-based methods might be more accurate.


Figure 3. Density of reads per SCRB-Seq cell barcode. Blue line represents the position of the 384th cell which is close to the inflection point in the curve in figure 2. Orange line represents the local minimum used in my implementation of the ‘knee’ method.
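
For the curious, here’s a minimal sketch of this local minimum detection (the smoothing parameters are assumptions for illustration; the eventual UMI-tools implementation may differ):

import numpy as np
from scipy.stats import gaussian_kde
from scipy.signal import argrelextrema

def knee_threshold(counts_per_barcode):
    ''' find the read count at the local minimum separating the
    putative false barcodes from the true barcodes in the density
    of log10(counts per barcode) '''
    log_counts = np.log10(np.array(counts_per_barcode))
    density = gaussian_kde(log_counts)
    xs = np.linspace(log_counts.min(), log_counts.max(), 1000)
    ys = density(xs)
    local_minima = argrelextrema(ys, np.less)[0]
    # assume the last local minimum before the final (true barcode)
    # peak marks the knee
    return 10 ** xs[local_minima[-1]]

Barcodes with counts above the returned threshold would then form the whitelist.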

To compare with our network-based methods, I took increasing samples of reads from the SCRB-Seq sample, estimated the true cell barcodes using the knee method and our ‘adjacency’ and ‘directional’ methods, and calculated the sensitivity and false positive rate (FPR) of the different approaches. As you can see in Figure 4, the ‘directional’ method has a slightly higher sensitivity and lower FPR than the ‘knee’ method, which suggests that this network-based method might be more suitable for identifying the true cell barcodes. The knee method slightly outperforms just selecting the top 384 most abundant barcodes (‘top 384’) in terms of higher sensitivity, although this comes at the cost of an increased FPR. The ‘adjacency’ method has a high sensitivity but tends towards overestimating the number of true cell barcodes as the number of sampled reads is increased. This is because the adjacency method only groups barcodes together if they are all separated from the same hub barcode by a single edge. Hence, any barcode with 2 or more errors will always be in a separate group from the true barcode from which it originates. In contrast, the directional method is not limited to single edge paths.


Figure 4. Sensitivity (Top) and False Positive Rate (FPR; Middle, Bottom) for 3 methods to detect the number of true cells sequenced. “Top 384” = select the top 384 barcodes. Shown for comparison with knee method. Middle = all samples. Bottom = adjacency excluded to show difference between knee and directional.
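
For clarity, here’s how the two benchmark metrics could be computed (a sketch assuming sets of barcodes, with the 384 known barcodes as the truth and ‘observed’ meaning every barcode detected in the sample):

def sensitivity(called, true_bc):
    # fraction of true barcodes recovered
    return len(called & true_bc) / float(len(true_bc))

def false_positive_rate(called, true_bc, observed):
    # fraction of false (non-true) observed barcodes wrongly accepted
    false_bc = observed - true_bc
    return len(called & false_bc) / float(len(false_bc))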

Next, I took a single sample from the drop-seq paper where the authors had inferred that there were ~570 cells using a visual assessment of the cumulative frequency plot (this is the sample shown in Figure 1). They also backed up this assertion with an assessment of the species-specificity of the reads for each cell barcode, which suggested that most of the “cells” below this inflection point were actually droplets which did not contain a cell and simply contained ‘ambient’ RNA. So, 570 seemed a reasonable truth to test the method against. My ‘knee’ method worked great: using the first 50M reads, it estimated that there were 573 cell barcodes (Figure 5).


Figure 5. Cumulative frequency plot for cell barcodes in the first 50M drop-seq reads. The ‘knee’ method estimated 573 true cell barcodes and hence is plotted over the ‘top 570’ inflection point prediction. The X-axis has been limited to the 10 000 most abundant cell barcodes.

In contrast, the directional method failed miserably! For starters, the sheer number of cell barcodes precluded the inclusion of all cell barcodes detected, as the network building requires an all-vs-all comparison between the barcodes to calculate the edit distances. With 50M reads, the sensitivity was OK (95%), but the specificity was just 11%, and the specificity unexpectedly increased as the number of reads parsed was decreased! Then came the realisation…

[Image: Homer Simpson, “D’oh!”]

As mentioned above, the network-based methods were designed to deal with a pool of barcodes where some barcodes are true and the remaining barcodes are the result of errors from the true barcodes. However, here we have a situation where the ‘false’ barcodes could be due to either errors (I’ll call these ‘error barcodes’), or droplets without a cell which have amplified ambient RNA (‘ambient barcodes’). We can identify these two separate types of false barcodes by looking at the minimum edit distance between each barcode and the true barcodes. For error barcodes, we would expect this minimum edit distance to be 1 in most cases. For ambient barcodes, we would expect the distribution of minimum edit distances relative to the true barcodes to resemble that of barcodes selected at random. Figure 6 below shows the distribution of counts per barcode, split into 4 bins. Figure 7 shows the distribution of edit distances for each bin, along with a null distribution from a randomly generated set of cell barcodes. Bin 4 contains the true barcodes and shows a null-like distribution of edit distances, as expected. We can clearly see that bin 3, which has slightly lower counts per cell barcode than the true barcodes, is enriched in single edit distances relative to the null. This strongly suggests many of these cell barcodes are error barcodes with a single error relative to the true cell barcode from which they originate. However, bin 2, and bin 1 especially, show a more null-like distribution, suggesting most of these cell barcodes are instead ambient barcodes.


Figure 6. Distribution of reads per cell barcode (log 10) from first 50M reads, separated into 4 bins. Bin 4 represents the ‘true’ barcodes.


Figure 7. Distribution of minimum edit distance compared to true barcodes (bin 4) for each bin shown in figure 6. Random = 10 000 randomly generated cell barcodes.
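
The diagnostic above boils down to a few lines (a sketch; since the cell barcodes are all the same length, the hamming distance serves as the edit distance):

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def min_edit_distances(candidates, true_barcodes):
    ''' minimum hamming distance from each candidate barcode to any
    true barcode '''
    return {bc: min(hamming(bc, t) for t in true_barcodes)
            for bc in candidates}

Plotting these minimum distances per count bin gives Figure 7: error barcodes pile up at a distance of 1, while ambient barcodes match the random null.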

 

In summary, the knee method as implemented here works very well on the samples tested so far, so we can proceed with automatic detection of the number of true cell barcodes in UMI-tools extract. We will also enable the use of the directional method, since this is more accurate when you know that all false cell barcodes are the result of errors. However, I’d recommend sticking with the knee method unless you are absolutely certain this is the case.

Finally, if you’re already using UMI-tools for scRNA-Seq, or thinking about doing so, keep a look out for version 0.5. Along with the added functionality described here, we’re adding a count command to yield counts per cell per gene directly from featureCounts/HTSeq output, and we hope to include a scRNA-Seq-specific tutorial with this release too.

EDIT: Following the comments below I’ve continued this analysis into a second post to look at INDELs in cell barcodes and comparing UMI-tools with sircel.

Speeding up python

I was recently contacted by Ilgar Abdullayev from the Ludwig Institute for Cancer Research, Karolinska Institutet, with a query about why the dedup_umi.py script in our UMI-tools repository was taking so long to run on some of his BAMs. Finding the bottleneck was relatively easy and the immediate solution was straightforward thanks to cython (and our technical director Andreas!). Thanks to Ilgar for sharing some of his data to let me optimise the code. Before the details though, I thought it might be a good time to take stock of why python is quick to write but slow to run. Skip to “Reducing the interpretation overhead” if you’re already a whiz at python.

Quick development

One of the things I really love about python is how quick it is to write functional code. Python is dynamically typed, which means we don’t have to worry about the type of a variable until we want to do something with it. The following is perfectly valid.

variable = "hello world"

variable = 6

It appears to the user that the type of variable has changed from string to int. Actually, we first created a python string object and bound the reference “variable” to it, then we created a new int object and bound the reference “variable” to it. The first object no longer has any references bound to it, so it’s garbage collected (i.e. deleted).

Another great feature of python is duck typing by which we can perform the same operation on different python types to different ends. The classic example being the use of “+” for both concatenating strings and for adding numeric types.

variable = "hello world"
print variable + "I'm python"
hello world, I'm python

variable = 6
print variable + 2
8

Along with a comprehensive set of well-maintained modules for more complex classes, data manipulation and statistics, dynamic typing and duck typing make writing a complex python script something that can be achieved in minutes rather than days.

Slower to run

The penalty you pay for dynamic typing is that the python code must be interpreted, and a section of code is re-interpreted each time it’s run. In the, admittedly contrived, example below we can clearly see why the code in the loop needs to be interpreted anew on each iteration.

dict_1 = {1:"bar", 0:"foo"}
list_1 = ["bar", "foo"]

objects = (dict_1, list_1)
for obj in (objects):
    print obj[0]

foo
bar

In the first iteration, obj is a dictionary. The “[0]” index on a dictionary returns the value for key 0, which in this case is “foo”. In the second iteration, obj is a list. The “[0]” index on a list returns the element at position 0 (0-indexed), which in this case is the first element, “bar”.

Reducing the interpretation overhead

The re-interpretation of code takes a lot of time, and python is therefore considerably slower than compiled languages like C. And this is where cython (see what they did there) comes in. Cython allows you to statically declare the type of an object, reducing the work of the python interpreter.

Now to return to the original problem with our script. For a background on our UMI-tools see (CGAT blog – UMIs). The dedup_umi.py script in question takes an alignment file (SAM/BAM format) and returns an alignment file with the duplicated reads removed. In my hands this was taking about 10 seconds per 1 million reads, which seems perfectly reasonable. For Ilgar however, it was taking around 12 minutes per 1 million reads, which is less than ideal when alignment files may contain >100M reads.

To identify the bottleneck I first used a bit of trial and error to find a single read start position which was particularly troublesome. In hindsight it was hardly surprising that this was an area of extreme coverage. With the use of the python cProfile module we can identify the part of the script which was taking the longest:

python -m cProfile -s tottime dedup_umi.py -I subset.bam -S out.bam --method=adjacency > profile.log

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 26352342   47.813    0.000   76.867    0.000 dedup_umi.py:196(edit_dist)
 26352397   20.992    0.000   20.992    0.000 {zip}
 26352353    8.062    0.000    8.062    0.000 {sum}


So the majority of the time was spent in the edit_dist function, which was run 26352342 times! Looking back at the script, we can see this is because our adjacency-based methods take the list of UMIs observed and create a dictionary mapping each UMI to the UMIs which are a single edit distance apart (hamming distance = 1). This is an all-vs-all operation, so the time taken increases quadratically with the number of UMIs observed. The obvious solution was therefore to improve the speed of the adjacency dictionary building. Let’s look at the relevant portion of the original code.

def edit_dist(first, second):
    ''' returns the edit distance/hamming distances between
    its two arguments '''

    dist = sum([not a == b for a, b in zip(first, second)])
    return dist

def _get_adj_list_adjacency(self, umis, counts, threshold):
        ''' identify all umis within hamming distance threshold'''

        return {umi: [umi2 for umi2 in umis if
                      edit_dist(umi, umi2) <= threshold]
                for umi in umis}

The _get_adj_list_adjacency method uses a dictionary comprehension with the values generated by a list comprehension, so the edit_dist function is called on every combination of UMIs. For a quick improvement we can put the code for the edit distance calculation inside the list comprehension so we avoid the function call overhead, but the gains are minimal. Alternatively, as you may have noticed, we’re only interested in UMIs where the distance is <= threshold, so we could optimise edit_dist to return as soon as the distance exceeds the threshold:

def edit_dist(first, second, threshold):
    ''' returns 1 if the hamming distance is at most threshold,
    otherwise 0 '''

    dist = 0
    for a, b in zip(first, second):
        if a != b:
            dist += 1
        if dist > threshold:
            return 0
    return 1

def _get_adj_list_adjacency(self, umis, counts, threshold):
        '''identify all umis within hamming distance threshold'''

        return {umi: [umi2 for umi2 in umis if
                      edit_dist(umi, umi2, threshold)]
                for umi in umis}

This improves the speed ~2-fold which is nice but not really enough!

Speaking to Andreas about this problem, he quickly identified cython as a simple solution since the hamming distance calculation is very simple. Here’s the cython code snippet and the pertinent code in dedup_umi.py

_dedup_umi.pyx:

cpdef int edit_distance(a, b):
    # statically typed C strings and ints take the inner loop out of
    # the interpreter's hands
    cdef char * aa = a
    cdef char * bb = b

    cdef int k, l, c
    c = 0
    l = len(a)
    for k from 0 <= k < l:
        if aa[k] != bb[k]:
            c += 1
    return c

dedup_umi.py:

...
import pyximport
pyximport.install(build_in_temp=False)
from _dedup_umi import edit_distance
...
def _get_adj_list_adjacency(self, umis, counts, threshold):
        ''' identify all umis within hamming distance threshold '''

        return {umi: [umi2 for umi2 in umis if
                      edit_distance(umi, umi2) <= threshold]
                for umi in umis}

In the cython function we statically define the types for the two C strings aa & bb and the ints k, l & c. Then we count the positions in each string where the characters differ. In dedup_umi.py we just need to import the pyximport module and use it to compile _dedup_umi.pyx. Then we import the edit_distance function from the compiled _dedup_umi module and use it like any other python function. This simple modification reduces the run time on the troublesome genomic coordinate from 91 secs to 11 secs – the most time-consuming function for this particular region is now remove_umis.

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     6432    5.902    0.001    5.902    0.001 dedup_umi.py:217(remove_umis)
       11    3.626    0.330    5.544    0.504 dedup_umi.py:296(<dictcomp>)
 26359658    1.918    0.000    1.918    0.000 {_dedup_umi.edit_distance}

The cython function could be further improved along the lines of the python function by returning as soon as the threshold is exceeded. I’m sure there are other improvements we could make to the run time of the script. For example, transforming the UMI strings into 2-bit encodings would make the hamming distance calculation a very cheap bitwise xor operation. However, I’ve not yet tested whether the cost of the transformation outweighs the gains in the edit distance calculation.
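
To illustrate the 2-bit idea (a hypothetical sketch; whether it actually beats the cython loop is exactly what I haven’t tested):

BASE_TO_BITS = {'A': 0b00, 'C': 0b01, 'G': 0b10, 'T': 0b11}

def encode_2bit(umi):
    # pack each base into 2 bits of a single integer
    value = 0
    for base in umi:
        value = (value << 2) | BASE_TO_BITS[base]
    return value

def hamming_2bit(x, y, umi_length):
    # a position differs if either of its 2 bits differs after the xor
    diff = x ^ y
    return sum(1 for k in range(umi_length)
               if (diff >> (2 * k)) & 0b11)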

This is the first time I’ve used cython – something that Ian Sudbery (UMI-tools co-author) had suggested might be worth considering a while ago. To be honest, the solution was so straightforward that Andreas ended up implementing it himself whilst he demonstrated it. I think it might be time for me to read that cython documentation fully now!

Unique Molecular Identifiers – the problem, the solution and the proof

[EDIT: UMI-Tools open access publication is now out]

I’ve been working with Dr. Ian Sudbery (former CGAT fellow, currently Lecturer in Bioinformatics at the University of Sheffield) on a script called dedup_umi.py to correctly remove PCR duplicates from alignment files when using Unique Molecular Identifiers (UMIs). The script is available from GitHub as part of the UMI-tools repository. As part of the development I’ve been running some simulations to compare different methods which have proved quite illuminating. Before I go into the details though, some background.

PCR duplication artifacts

Many sequencing-based techniques rely upon an amplification step to increase the concentration of the library generated from the fragmented DNA prior to sequencing. Following alignment to the genome, it’s usually preferable to identify and remove PCR duplicates, since the amplification step is inherently biased: some sequences become overrepresented in the final library compared to their true abundance in the fragmented DNA. The brilliant Picard tool is commonly used to identify and remove these PCR duplicates using their genomic coordinates. Of course, it’s also possible that two sequence reads with the same genomic coordinates will have originated from two independent fragments. However, where the alignment density is essentially random (as in DNA-Seq) this is unlikely, even more so if paired end sequencing has been performed. The human genome has ~3 x 10^9 bases and the insert size will vary, so the number of possible genomic coordinate combinations for the paired ends is so vast that we don’t expect to sequence two fragments with the same genomic coordinates.

Conversely, in RNA-Seq, some cDNA sequences will be highly abundant and we are therefore much more likely to sequence multiple fragments with the exact same genomic coordinates. This problem is most acute where a high number of PCR cycles have been used to generate the sequencing library, for example where the concentration of starting material is low, as in single cell RNA-Seq, since this necessitates many PCR cycles.

Using unique molecular identifiers to correctly identify PCR duplicates

Unique molecular identifiers (UMIs; also called Random Molecular Tags (RMTs)) are random sequences of bases used to tag each molecule (fragment) prior to library amplification (see figure below and Kivioja et al (2012)), thereby aiding in the identification of PCR duplicates. If two reads align to the same location and have the same UMI, it is highly likely that they are PCR duplicates originating from the same fragment prior to amplification. We can therefore collapse all the reads with the same genomic coordinates and UMI into a single representative read and obtain an accurate estimate of the relative concentration of the fragments in the original sample.

This figure is adapted from Islam et al (2014).
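
In its simplest form, this deduplication is just grouping on genomic coordinates plus UMI. A toy sketch (the read attribute names here are hypothetical, not pysam’s):

def dedup_unique(reads):
    ''' keep one read per (contig, start, strand, UMI) combination '''
    seen = set()
    for read in reads:
        key = (read.contig, read.start, read.strand, read.umi)
        if key not in seen:
            seen.add(key)
            yield read

The rest of this post is about why this naive approach (the “unique” method below) is not quite enough.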

The problem

The use of UMIs as described above would work perfectly were it not for base-calling errors, which create sequencing reads with the same genomic coordinates whose UMIs are identical except at the base where the error occurred. Base-calling errors therefore inflate the apparent number of unique fragments sequenced (see below). An additional source of errors is the amplification itself, which can introduce substitutions during the DNA duplication step.

[Schematic 1]

The solution

Ian initially identified this issue in his own data and wrote a python script utilising pysam to parse an alignment BAM file and inspect the UMIs of reads with the same genomic coordinates. He reasoned that sequencing errors would produce UMIs which are a single edit distance away from another UMI. Therefore, by forming an adjacency matrix it should be possible to identify the sequencing errors and collapse the reads to the true number of unique UMIs present (see schematic below for a simple example). The first step is to identify connected components, where UMIs are represented by nodes and edges connect nodes which are a single edit distance apart. The counts of each node are also retained. Within each component, the most abundant node is selected and all nodes within a single edit distance are removed. This process is repeated using the second most abundant node and so on, until all edges are removed and the number of nodes remaining is inferred to be the number of unique fragments sequenced. We refer to this method as the “adjacency” method.

[Schematic 2]
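
A minimal sketch of the adjacency procedure described above (assumed data structures; the real dedup_umi.py implementation differs in detail):

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def adjacency_reduce(counts):
    ''' counts: dict of UMI -> read count at one genomic position.
    Returns the UMIs inferred to be real. '''
    adjacent = {u: set(v for v in counts
                       if v != u and hamming(u, v) == 1)
                for u in counts}
    remaining = set(counts)
    retained = []
    # take UMIs in decreasing order of abundance; each selected UMI
    # absorbs its single-edit-distance neighbours
    for umi in sorted(counts, key=counts.get, reverse=True):
        if umi in remaining:
            retained.append(umi)
            remaining.discard(umi)
            remaining -= adjacent[umi]
    return retained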

It turns out this works pretty well; however, one can envisage situations where this may not correctly identify the true number of unique fragments. For example, an error during the PCR amplification may create an apparently distinct UMI from which sequencing errors could create further UMIs, all of which actually derive from a single unique fragment (see schematic below). A similar situation could also arise if one particular position in the reads was more likely to contain an error (this is quite common in illumina sequencing). In this case, one would see lots of UMIs a single edit distance apart due to an error at the common position, and further errors may branch off these UMIs.

[Schematic 3]

In order to deal with these more complicated networks of connected UMIs, I developed the “directional_adjacency” method. I reasoned that we could resolve these networks by forming edges between nodes where the count of node A is >= 2 x the count of node B. Now, each connected component is inferred to originate from a single unique fragment.
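
Sketched in the same style, the directional variant only follows an edge when the count at least halves, and keeps walking, so whole components collapse to one UMI (again, a simplification of the actual implementation):

def directional_reduce(counts):
    ''' counts: dict of UMI -> read count. Returns one representative
    UMI per inferred original fragment. '''
    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))

    assigned = set()
    representatives = []
    for umi in sorted(counts, key=counts.get, reverse=True):
        if umi in assigned:
            continue
        representatives.append(umi)
        assigned.add(umi)
        # walk directed edges A -> B where hamming(A, B) == 1 and
        # count(A) >= 2 x count(B)
        stack = [umi]
        while stack:
            node = stack.pop()
            for other in counts:
                if (other not in assigned
                        and hamming(node, other) == 1
                        and counts[node] >= 2 * counts[other]):
                    assigned.add(other)
                    stack.append(other)
    return representatives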

In addition to these methods, we also wanted to try a modification of the “adjacency” method in which the whole connected component is inferred to originate from a single unique fragment (“cluster”).

This issue has been previously observed by Islam et al (2014) who proposed that UMIs with less than 1% of the average counts should be removed “to account for UMIs that stem from PCR-induced mutations or sequencing errors”. We implemented this method within our tool for comparison (“percentile”).
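
The percentile method is a one-liner by comparison (a sketch, as I’ve understood the Islam et al description):

def percentile_reduce(counts):
    # discard UMIs with less than 1% of the mean count per UMI
    cutoff = 0.01 * sum(counts.values()) / float(len(counts))
    return [umi for umi, n in counts.items() if n >= cutoff]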

Finally, we implemented a “unique” method, where all unique UMIs were retained. Note, this appears to be the method which is most commonly used. The schematic below shows a comparison of the various methods (I’ve obviously rigged it to make directional adjacency look the best).

[Schematic: comparison of the methods]

The proof

We’re taking two approaches to prove the value of improved detection of PCR duplicates using the adjacency methods implemented in dedup_umi.py. Firstly, we’ve conducted simulations to show they consistently estimate the true number of unique fragments more accurately. Secondly, we’ve applied dedup_umi.py to multiple single-cell RNA-Seq studies and shown an improvement with respect to the distribution of UMIs, the quantification of spike-in RNAs and the clustering of cells. I’ll focus on the simulations for now and leave the single-cell RNA-Seq to another post when we’ve submitted the manuscript.

In order to compare the various methods, we wanted to know how the methods fared across a range of UMI lengths and depths of sequencing.  For this I turned to python to replicate the process of amplification and sequencing in silico and model the impact of the various parameters on the efficacy of the methods. The whole process is contained in a standalone ipython notebook which you can view here. If you would like to download and run the notebook yourself you can do so from here.

In brief, the simulation starts with a set number of unique molecules (fragments), our “ground truth”. A UMI is then randomly assigned to each molecule and the pool of molecules is amplified by duplicating each molecule at every in silico PCR cycle. The final pool of molecules is then “sequenced” to produce the reads. Within this process, base-calling errors are incorporated during the sequencing, and substitutions are incorporated during the PCR cycles. Furthermore, the amplification efficiency of each UMI varies, to naively model the impact of amplification biases. The various methods are then applied to the reads generated to simulate the de-duplication process within dedup_umi.py, and an estimate of the number of unique molecules is obtained (a sketch of this process follows the list below). To compare the methods I used two metrics:

  • The difference between this estimate and the ground truth
  • The variability of the estimate, for which I used the coefficient of variation (CV; sd/mean)
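
In outline, each simulation iteration looks something like this (a heavily simplified sketch with assumed default parameters; the notebook linked above is the real thing):

import random

BASES = "ACGT"

def mutate(umi, error_rate):
    # substitute each base independently with probability error_rate
    return "".join(random.choice(BASES) if random.random() < error_rate
                   else base for base in umi)

def simulate_reads(n_molecules=100, umi_length=6, pcr_cycles=6,
                   pcr_error=1e-5, seq_error=0.01, n_reads=1000):
    ''' returns the sequenced UMIs from an in silico experiment with
    n_molecules unique starting fragments (the ground truth) '''
    pool = ["".join(random.choice(BASES) for _ in range(umi_length))
            for _ in range(n_molecules)]
    efficiency = {}  # per-UMI amplification efficiency

    for _ in range(pcr_cycles):
        new = []
        for umi in pool:
            if umi not in efficiency:
                efficiency[umi] = random.uniform(0.7, 1.0)
            if random.random() < efficiency[umi]:
                new.append(mutate(umi, pcr_error))
        pool += new

    # sequencing: sample reads from the pool with base-calling errors
    return [mutate(random.choice(pool), seq_error)
            for _ in range(n_reads)]

Each de-duplication method is then applied to the simulated reads and its estimate compared against n_molecules.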

Each of the plots below shows the impact of changing one of the simulation parameters on these two metrics. In each plot, the red dashed line shows the parameter value that was used in the other simulations.

We can see that under the assumptions of the simulations, the directional adjacency method is the least sensitive to variations in the input parameters.

The unique method (the method used in most publications) severely overestimates the number of unique molecules, since each sequencing error or PCR error erroneously increases the estimate. As the number of starting molecules increases, all methods eventually struggle to identify the true number of unique molecules, with the adjacency, directional adjacency and percentile methods showing the best performance. As we would expect, the percentile method tends to overestimate the number of unique molecules since it fails to remove sequencing errors unless they are very rare events. The adjacency method tends towards an overestimation too, as it can only identify erroneous UMIs within a single edit distance. Interestingly, the directional adjacency method tends towards an underestimation, as it can link together a string of UMIs each separated by a single edit distance. The directional aspect of the network is designed to avoid this happening between truly unique UMIs, but occasionally these will be collapsed together. The cluster method tends towards a more severe underestimation, as it does not take counts into account at all and collapses whole networks into a single UMI.

Overall, the directional adjacency method shows very little sensitivity to the number of starting molecules and outperforms all other methods. As expected, the percentile method becomes more consistent as the number of molecules increase, since the variability between the counts in each one of the individual iterations is reduced. In contrast, the cluster and adjacency methods become less consistent as the number of starting molecules increases, as the variability in the topology of the networks formed will also increase.

[Plot: difference from ground truth and CV, varying the number of starting molecules]

As the UMI length increases (from an initial, very short, 3bp), the methods which utilise a network quickly improve in terms of accuracy and variability. However, the percentile and unique methods actually become less accurate, as a longer UMI obviously means a higher probability of errors accumulating. Perhaps counter-intuitively then, if you’re not going to account for errors when using UMIs, you’re better off not using a very long UMI.

[Plot: difference from ground truth and CV, varying the UMI length]

As the PCR cycles are increased, all methods except directional adjacency show a decrease in accuracy and an increase in variability, with adjacency being the second most robust to high PCR cycle numbers. We see the same as we increase the DNA polymerase error rate. This confirms that the directional adjacency method is better able to handle errors originating from the amplification stage.

[Plot: difference from ground truth and CV, varying the number of PCR cycles]

[Plot: difference from ground truth and CV, varying the DNA polymerase error rate]

Is a non-heuristic solution possible?

The methods we propose are definitely an improvement on “unique” deduplication using UMIs within these simulations.  We’ve also shown that the application of our proposed methods can have a real impact on biological interpretation of single cell RNA-Seq data (more to come on that soon). However, all implemented methods are heuristic approaches to the problem. My initial expectation was that we would be able to apply some probabilistic approach to determine the most likely set of UMIs which would account for the UMIs observed at a given genomic location. So far, we’ve been unable to develop such a method. If anyone would like to discuss alternative probabilistic approaches, or the script in general, please get in touch.