Estimating the number of true cell barcodes in single cell RNA-Seq (part 1)

Single Cell RNA-Seq (scRNA-Seq) took a major leap forward in 2015 when two Harvard groups published oil droplet-based methods for sequencing >10,000 cells in a single experiment for << 1$ / cell (Klein et alMacosko et al). Then in early 2016, 10X Genomics launched their commerical version of droplet scRNA-Seq, followed more recently by the illumina and Bio-Rad competitor technology. From my limited exposure, the 10X product seems to be the most popular right now but this is a very new field so there are likely more improved methods on the horizon which could shake things up. These techniques label cells in fundamentally in the same manner. Using microfluidic devices individual cells are encapsulated together with a random cell barcode in a droplet. The reverse transcription of the RNA and subsequent PCR amplification is then performed inside each droplet separately such that all cDNA from a single cell are labelled with the same cell barcode.

The techniques themselves are really exciting but I want to focus here on a technical question during the analysis. A crucial aspect of droplet scRNA-seq is that the number of possible cell barcodes is many orders of magnitude greater than the number of cells sequenced. Hence, if each sequenced cell gets one cell barcode at random, the probability of two cells having the same barcode is very low (if you’re interested, you can work out the probability using the solution to the famous ‘birthday problem’). However, when you get your sequencing data back and extract the cell barcodes detected, not all of these will be ‘true’ cell barcodes and some barcodes will contain sequencing errors or PCR errors. Furthermore, some cell barcodes may have been incorporated into cDNA that does not originate from your cells (more on this later). Hence, one of the first challenges in droplet scRNA-Seq is working out how many cells you’ve actually sequenced! In the supplementary material of the original drop-seq paper (Macosko et al) they suggest plotting the cumulative frequency and looking for an inflection point, or ‘knee’. They show this in supplementary Figure 3A (reproduced below in Figure 1 with permission). The 10X analysis pipeline also appears to take a similar approach by what I can gather from their website.


Figure 1. Cumulative frequency for drop-seq data. Plot taken from Macosko et al supplementary figure 3 with permission.  Dashed line shows inferred position of point of inflection.

Now, if you’ve been on this blog before, you’ll probably have seen the posts about UMI-tools. This is a tool suite I’ve co-developed with Ian Sudbery (University of Sheffield) to accurately identify PCR duplicates in techniques utilising unique molecular identifiers (UMIs). UMIs are similar to cell barcodes in that they are just a sequence of bases used to barcode a read, though in this case, the barcode relates to the unique molecule prior to PCR amplification. The novel aspect of our tools is that we used networks to identify PCR duplicates in an error-aware manner. For background on the network methods, please see our publication or my previous blog post.  For the next major UMI-tools release (version 0.5), we’re planning to expand our support for single cell RNA-Seq analyses as this seems to be the major application of the tool to date according to our GitHub issues. As part of this, we wanted to expand the extract command. Currently this command extracts the UMI sequence from the reads and appends it to the read name so that the UMI can be accessed in the downstream BAM, We wanted to extend the extract command so that it can

  • extract cell barcodes
  • filtered cell barcodes against a whitelist of expected cell barcodes
  • or, where no whitelist exists (as in the case of droplet scRNA-Seq), create a whitelist of ‘true’ cell barcodes from the data and filter against this

For the final point, we needed to implement an automated version of the ‘knee’ method as described in the drop-seq paper. We also wondered whether our network based methods might be suitable for this purpose as they are designed to handle the situation where you have an unknown number of ‘true’ barcodes plus additional barcodes which have been generated via errors from the ‘true’ barcodes. I therefore set out to implement a ‘knee’ method and compare it with our network methods.

To start with, I took the SCRB-Seq data from Soumillon et al since I had it to hand and this includes 384 expected cell barcodes which I could use as a ground truth. All other barcodes observed should therefore have originated as errors from the 384 true barcodes. To be clear, this is not droplet scRNA-Seq but this give us a definitive ground truth for the purposes of initiation performance testing. Plotting the cumulative frequency we can clearly see an inflection near the 384th barcode as expected (Figure 2; blue line).


Figure 2. Cumulative frequency for SCRB-Seq cell barcodes. Blue line represents the position of the 384th cell which is close to the inflection point in the curve.

I initially attempted to identify the inflection point by fitting a spline function to the cumulative frequency but quickly realised it was much easier to work from the distribution of counts (log10) per cell barcode (Figure 3). Here the inflection point is represented by the local minima between the distribution of counts for ‘true’ and ‘false’ barcodes. We can see that the local minima is very close to the 384th cell barcode so this automated knee method seems to work well for this data. The obvious downside of this hard thresholding on read counts with the knee method is that there will likely be some overlap between the read counts for real and false barcodes (as we can see in Figure 3) . This is why we believed the network-based methods may be more accurate.


Figure 3. Density of reads per SCRB-Seq cell barcode. Blue line represents the position of the 384th cell which is close to the inflection point in the curve in figure 2. Orange line represents the local minima used in my implementation of the ‘knee’ method.

To compare with our network-based methods, I took increasing samples of reads from the SCRB-Seq sample and estimated the true cell barcodes using the knee method and our ‘adjacency’ and ‘directional’ methods and calculated the sensitivity and false positive rate (FPR) of the different approaches. As you can see in Figure 4, the ‘directional’ method has a slightly higher sensitivity and lower FPR than the ‘knee’ method, which suggests that this network-based method might be more suitable for identifying the true cell barcodes. The knee method slightly outperforms just selecting the top 384 most abundant barcodes (‘top 384’) in terms of higher sensitivity although this comes at the cost of increased FPR. The ‘adjacency’ method has a high sensitivity but tends towards overestimating the number of true cell barcodes as the number of sampled reads is increased. This is because adjacency only groups barcodes together if they are all separated from the same hub barcode and each by a single edge. Hence, any barcode with 2 or more errors will be always be in a separate group from the true barcode from which it originates. In contrast, the directional method is not limited to single edge paths.


Figure 4. Sensitivity (Top) and False Positive Rate (FPR; Middle, Bottom) for 3 methods to detect the number of true cells sequenced. “Top 384” = select the top 384 barcodes. Shown for comparison with knee method. Middle = all samples. Bottom = adjacency excluded to show difference between knee and directional.

Next, I took a single sample from the drop-seq paper where the authors had inferred that there were ~570 cells using a visual assessment of the cumulative frequency plot (This is the sample shown in Figure 1). They also backed up this assertion with an assessment of the species-specificity of the reads for each cell barcode which suggested that most of the “cells” below this inflection point were actually droplets which did not contain a cell and simply contained ‘ambient’ RNA. So, 570 seemed a reasonable truth to test the method against. My ‘knee’ method worked great, using the first 50M reads, this method estimated that there were 573 cell barcodes (Figure 5).


Figure 5. Cumulative frequency plot for cell barcodes in the first 50M drop-seq reads. The ‘knee’ method estimated 573 true cell barcodes and hence is plotted over the ‘top 570’ inflection point prediction. The X-axis has been limited to the 10 000 most abundant cell barcodes.

In contrast, the directional method failed miserably! For starters, the shear number of cell barcodes precluded the inclusion of all cell barcodes detected as the network building requires an all-vs-all comparison between the barcodes to calculate the edit distances. With 50M reads, the sensitivity was OK (95%), but the specificity was just 11% and the specificity unexpectedly increased as the number of reads parsed was decreased! Then came the realisation…

Image result for homer doh

As mentioned above, the network-based methods were designed to deal with a pool of barcodes, where some barcodes are true and the remaining barcodes are the result of errors from the true barcodes. However, here we have a situation where the ‘false’ barcodes could be due to either errors (I’ll call these ‘error barcodes’), or droplets without a cell which have amplified ambient RNA (‘ambient barcodes’). We can indentify these two separate types of false barcodes by looking at the minimum edit-distance between each barcode and the true barcodes. For error barcodes, we would expect this minimum edit-distance to be 1 in most cases. For ambient barcodes we expect them to have a similar distribution of minimum edit distances relative to the true barcodes as barcodes selected at random. Figure 6 below shows the distribution of counts per barcode, split into 4 bins. Figure 7 shows the distribution of edit-distances for each bin and a null distribution from a randomly generated set of cell barcodes. Bin 4 contains the true barcodes and shows a null-like distribution of edit-distances as expected. We can clearly see that bin 3, which has slightly lower counts per cell barcode compared to the true barcodes, is enriched in single edit-distances relative to the null. This strongly suggests many of these cell barcodes are error barcodes with a single error relative to the true cell barcode from which they originate. However, bin 2, and bin 1 especially, show a more null-like distribution, suggesting most of these cell barcodes are instead ambient barcodes.


Figure 6. Distribution of reads per cell barcode (log 10) from first 50M reads, separated into 4 bins. Bin 4 represents the ‘true’ barcodes.


Figure 7. Distribution of minimum edit distance compared to true barcodes (bin 4) for each bin shown in figure 6. Random = 10 000 randomly generated cell barcodes.


In summary, the knee method as implemented here works very well with the samples tested so far so we can proceed with automatic detection of the number of true cell barcodes with UMI-tools extract. We will also enable the use of the directional method since this is more accurate when you know that all false cell barcodes are the result of errors. However, I’d recommend sticking with the knee method unless you are absolutely certain this is the case.

Finally, if you’re already using UMI-tools for scRNA-Seq, or thinking about doing so, keep a look out for version 0.5. Along with the added functionality discribed here, we’re adding a count commands to yield counts per cell per gene directly from featureCounts/HTSeq output and we hope to include a scRNA-Seq-specific tutorial for this release too.

EDIT: Following the comments below I’ve continued this analysis into a second post to look at INDELs in cell barcodes and comparing UMI-tools with sircel.

7 thoughts on “Estimating the number of true cell barcodes in single cell RNA-Seq (part 1)

  1. Avi Srivastava

    Hi Tom,

    Really nice post. I wasn’t aware of the error in the form of ambient barcodes before, it was really an informative read.
    Digging more about the same I went to the Drop-seq paper, especially figure S3B

    If I assume we have used dataset from
    which had the concentration of 12.5 cells per microliter then by figure 3B, the cells should have a purity of 98.8%. So my question is why would we see such a huge density of reads coming from bin1 of figure6 in the blog post/data?

    Are they really ambient or there are some other forms of error?



    • Hi Avi. Thanks for your kind words. My understanding of what they mean by purity here is what proportion of the reads came from just a single species as they mixed rat and human cells in a single sample in order to detect doublets. So ~99% of droplets, that contained a cell, contained a single cell only.

      In terms of other errors, I neglected to mention INDELs which could well explain bin 2 as INDELs occur at ~ 1 order of magnitude less frequently than sequencing errors with illumina sequencing. I’ll definitely have a look at this…


      • Avi Srivastava

        Hi Tom,
        Thanks for clarifying the doubt on the purity of the cell, it does make sense.
        Now Wrt the indels, a recent paper has made similar claims of 25% error rate due to indels, but I can’t seem to validate the claim (I can be wrong) from the original (Macosko et al.) paper. Do you think Drop-seq specifically has this problem of high indel rate or in general all single-cell protocols face similar issues in barcodes?

        Thanks again!


  2. Hi Avi. Thanks for the link. I only became aware of Sircel after posting and I hadn’t realised there was a BioRiv manuscript out already. I agree it’s not clear where the 25% INDEL error rate claim comes from. My guess is that they are referring to Table S1 but I can’t actually understand exactly how these numbers were generated? For instance, how were substitution errors and INDELs identified and how was the merging done? Why is the reduction greater within genes than across genes? A 25% INDEL rate in a 12bp sequencing is alarmingly high. That would imply a per-base INDEL rate of 0.024 (0.976^12 ~ 0.24). INDELs are more frequent at the start of a read where the cell barcode usually resides but per-base INDEL rates rarely exceed 0.001 at even the worst position in the read (see Fig. 2/Table 2 here –

    Getting back to Sircel, as I’d expect from the Pachter lab, it’s an elegant solution and an interesting alternative approach. We initially developed our network-based approach in the framework of detecting the true number of unique molecules when using UMIs. As such, we try and account for all barcodes in the network. This is a reasonable approach for UMIs where all UMIs are either ‘true’ or contain errors relative to a ‘true’ UMI. However, as I said in the blog post, for cell barcodes with droplet scRNA-Seq, the ambient RNA amplification makes this approach unsuitable. Whether or not they considered this with Sircel, they get around this by identifying the ‘true’ barcodes determining a threshold on the ‘path capacity’ (Fig 2B) and then merging all remaining barcodes to the ‘true’ barcode with which its path overlaps the most. This makes me wonder if our network methods would be applicable to droplet scRNA-Seq cell barcodes if we apply a similar approach to threshold which true barcodes will be accepted.

    On a related note, we’ve been thinking about with how to deal with INDELs in UMIs for a while. To date, we’ve chosen to ignore them on the basis that they are at least an order of magnitude less frequent than sequencing errors. Furthermore, our current method relies upon considering all reads with the same start position together which precludes detection of INDELs. In contrast, when we’re dealing with cell barcodes, we’re considering all barcodes together so it’s much easier to detect INDELs. I’m going to try out extending our network-based methods to consider INDELs by replacing the hamming distance threshold with a fuzzy regex match and see how our results compare with Sircel. Hopefully I can write this up as a follow-up post soon.

    I had rather given up applying our network method to detect cell barcodes but thanks to your input and the Sircel manuscript, I think this is still worth pursing!


    • Avi Srivastava

      Hi Tom,

      First of all thanks for quick and elaborative replies!
      You might be right about the Table 1 from Drop-seq paper, Just to make sure we are on the same page
      I am talking about table 1 in the following PDF.

      However, I have following doubts about it:
      1. I think these numbers are for UMI, even if we assume the error rates are library-preparation dependent and we expect similar errors in barcodes (is it true?) then this brings to doubt second.
      2. If I understand it right, the table explains percentage of UMI’s collapsed then according to the table Drop-seq had ~75% error rate i.e. ~75% of all the UMI were collapsed whereas the Sircel paper claims 25% of all the barcodes has 1 indel, which makes me skeptical are they really talking about Table S1?
      3. Answer to your question of how the subs and indels identified could be, (IMO) the table is based on ERCC gene so that you can control the sampling rate of UMI and using cell-barcode you look for detected ERCC genes per cell. Now for each detected gene, you collapse the UMI within 1 edit distance (could be indels or subs) just like your `adjacency` model in Umitools.
      4. Is it ok to use the information from ERCC data about barcodes/UMI and use it as an evidence for the actual gene(endo) sequence? As we have evidence of ERCC following multiple types of biases disjoint to the actual gene.

      To summarize, I agree there could be indels in the barcodes but my question would be are they frequent enough to be considered for correction? Which I don’t have right answer of.



      • Hi Avi,
        You’re perfectly right about the table. This is of course just dealing with UMIs in the way you describe in point 3. So I’m none the wiser about where the 25% figure comes from. In answer to point 4, my further analysis suggests the rate of substitutions in the cell barcodes with the drop-seq (GSE66693) sample is ~7% and ~4% for INDELs. I’ll describe how I got to those numbers in the next post.

        I’m coming around to the idea that we really want a very conservative approach for identifying cell barcodes for droplet scRNA-Seq. In the sample above, the ‘true’ barcodes (bin 4) account for 60% of the reads. At first, it would seem a waste of sequencing costs to discard the remaining 40% of reads and we might want to ‘correct’ the erorrs in the remaining cell barcodes. However, bin 3, which definitely contains error barcodes, only accounts for 5.5% of reads and ~40% of the reads in bin 3 appear to have an ambient RNA cell barcode. So at best we might be able to rescue a few percent of reads by correcting errors in the cell barcodes. We could also try and correct errors in bins 3 & 4 but here the fraction of cell barcodes from ambient RNA is even greater. The obvious risk with any level of cell barcode correction is that we could merge an ‘ambient RNA’ cell into a real cell barcode, or else merge together two real cells which could have a dramatic effect on the analysis and lead to all sorts of potential misinterpretations, e.g the cell appears to be in between two discrete states.


  3. […] left off in the first part. If you haven’t already done so, I’d recommend reading the first part […]


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s