Estimating the number of true cell barcodes in single cell RNA-Seq (part 2)

This post starts where we left off in the first part. If you haven’t already done so, I’d recommend reading the first part now.

I hadn’t planned on writing another post but no sooner had I finished the first part of this blog post than I discovered that the Pachter group have developed a tool to perform the same task (sircel, pronounced ‘circle’). This appears to be in the early stages of development but there’s a bioAxiv manuscript out which happens to cover the same input data as I discussed in the previous post (thanks to Avi Srivastava for bringing this to my attention).

In the discussion with Avi in the comments of the previous post I also realised that I had failed to clarify that only base substitutions will be detectable when using our network methods. As it happens, one of the motivations for the development of sircel appears to be handling INDELs which they claim can be present in >25% of dropseq barcodes. I can’t see where they get this number from (In the manuscript they reference the original Macosko et al paper but I can’t see this 25% figure anywhere) so I’m assuming for now that this is based on additional analyses they’ve performed. This seems like an alarmingly high figure to me given that the rate of INDELs is ~1 order of magnitude less frequent than substituions (see Fig. 2/Table 2 here) and even at the start of the read where the cell barcode resides, the per-base rate of INDELs does not usually exceed 0.001. I would also assume that cell barcodes with INDELs would have far fewer reads and so drop below the threshold using the ‘knee’ method. Hence, while some cell barcodes may be discarded due to INDELs, I would be surprised if a INDEL-containing barcode was erroneously included in the accepted cell barcodes.

How common are INDELs?

Before I tested out sircel, I wanted to independently assess what proportion of cell barcodes might have a substitution or INDEL. To do this, I returned to the drop-seq data from the last post (GSE66693). Previously, I showed the distribution of reads per cell barcode separated into 4 bins, where bin 4 represents the accepted barcodes above the knee threshold and bins 1-3 are largely populated by different types of erroneous barcodes (reproduced below as Figure 1). I compared each barcode to all barcodes in bin 4 and looked for a match allowing for one error (substitution, insertion or deletion). The results are tabulated below.  These figures are an upper estimate of the number of barcodes/reads harbouring an error relative to the true cell barcodes since each barcode may contribute to all three error types if at least one match to a bin 4 barcode exists with each error type. We can see that ~1.5% of cell barcodes may contain an INDEL, far less than the 25% suggested in the sircel manuscript. Furthermore, when we present the figures according to the number of reads, it appears around 3% of reads have a cell barcode with an INDEL, whilst 7% have a substitution. Thus, we can be quite confident that we’re not losing lots of data if we don’t error correct cell barcodes or just correct substitution errors.


Figure 1. Distribution of reads per cell barcode (log 10) from first 50M reads, separated into 4 bins. Bin 4 represents the ‘true’ barcodes.

Untitled 2

Table 1: Columns 1 & 2: The number and fraction of cell barcodes which match a bin 4 barcode allowing one error. Columns 3 & 4: The number and fraction of reads with a cell barcodes which match a bin 4 barcode allowing one error.

Are INDEL-containing barcodes passing the knee threshold?

Next, I want to investigate whether the INDEL barcodes pass the knee threshold and become ‘accepted’. I therefore split the above data according to bin and focused on bin 4. As you can see in Figure 2 below,  there are indeed a small number of bin 4 barcodes which are a single error away from another bin 4 barcodes. Our null expectations (from 100 000 randomly generated barcodes) for the percentage of subsitution, deletion and insertion matches by chance are 0.14, 0.11 and 0.13%, respectively. In contrast, amongst the 575 bin 4 barcodes, there are 6 barcodes (1.0%) which match at least one other bin 4 barcode with one error. 4/6 of these barcodes have a match with an INDEL and all have relatively low counts suggesting these are indeed errors. It seems I may have been wrong to assume a INDEL-containing barcodes will not pass the knee threshold. The results for bins 1-3 minor my previous analysis using edit-distance and suggest the bin3 is mainly made up of error-containing barcodes whilst bins 1 & 2 appear to originate from amplification of ambient RNA in empty droplets.

error_detectionFigure 2. The frequency of cell barcodes in each bin which are one error away from at least one of the bin 4 barcodes. Random = 100 000 randomly generated barcodes

How does sircle compare to the knee method?

So, INDELs are observed in this drop-Seq data, although perhaps not 25% of all cell barcodes, and INDEL-containing barcodes can pass the hard knee threshold for acceptance. Now on to sircel. As you would expect from the Pachter group, sircel introduces a novel, elegant solution to detect substitutions and INDELs in barcodes. Rather than using edit-distance or fuzzy matching, they ‘circularise’ the barcodes and form a de Bruijn graph using the kmers from the circularised barcodes. The circularisation ensures that even with a large kmer size, barcodes containing an error will share at least a portion of their path through the de Bruijn graph with the cognate error-free barcodes. The edges are weighted according to the number of times each edge is observed and each circular path is given a score according to the minimal weight of all its edges. This score is used to filter the paths to select the high weight, correct paths and incorrect paths are merged to the correct paths. Notably, the thresholding is done using a method analogous to the knee method, whereby the threshold is set as the local maxima of the smoothed first derivative from the cumulative distribution of path scores. The neat thing about the sircel method is that it enables simultaneous identification of this inflection point to identify the threshold for true barcodes, and error correction to the true barcodes. To compare the results for sircel and my knee method, I ran sircel on the GSE66693 drop-seq sample where the authors had suggested there were ~570 true cell barcodes. Previously, using the first 50M reads I had estimated that there were 573 true barcodes using the knee method. As per the sircel manuscript recommendation, I used only 1M reads to run sircel, which yielded 567 true barcodes. This is slightly lower than the number quoted in the manuscript for 1M reads (582) but perhaps this is due to more recent updates? The methods agree on 563 cell barcodes but what about the cell barcodes where they agree. 4/10 out of the knee method-specific cell barcodes are a single error away from another barcode above the threshold, suggesting these may in fact be errors. The remaining 6 barcodes all have very low counts. Likewise, the 4 sircel-specific cell barcodes fall just below the knee threshold (6170-9576 reads per barcode; threshold = 9960). For these 10 barcodes, the difference appears to be due to differences in the number of reads sampled. I wanted to check this by using 50M reads with sircel but the run-time becomes excessively long. Unfortunately, going to other way and using just 1M reads with the knee method yields unstable estimates and since it takes <6 minutes to run the knee method with 50M reads we would never recommend using the knee method with just 1M reads. So in essence, the major difference between the knee method and sircel is that sircel effectively corrects the small number of error barcodes above the threshold.

Can we improve the knee method?

It does look like sircel does a good job of removing error barcodes but we could easily replicate this performance by performing an INDEL-aware error detection step after the knee threshold. During the development of UMI-tools we considered detecting INDELs as well as substitutions but decided against it because: a) as mentioned at the top, the rate of INDELs is usually ~1 order of magnitude lower than substituions (that this doesn’t appear to hold true for cell barcodes), and b) UMI-tools groups reads together by their alignment position so the current framework does not allow the detection of INDELs. However, when we’re dealing with cell barcodes, we consider all barcodes at once and hence, it’s much easier to identify INDELs. For our network methods, we can simply replace the edit-distance threshold with a fuzzy regex, allowing one error (substitution, insertion or deletion). The fuzzy regex matching is ~30-fold slower than our cythonised edit-distance function but this isn’t an issue here since we only need to build the network from the cell barcodes which pass the knee threshold. From the network we can identify the error barcodes and either correct these to the correct barcode or else discard them. This will now be the default method for UMI-tools extract with the option to discard, correct or ignore potential barcode errors above the knee threshold. We will also include the option to correct barcodes which fall below the threshold.

A final thought: To correct or not?

The cost of incorrectly merging two cell barcodes through cell barcode error correction could be drastic, e.g appearing to create cells ‘transitioning’ between discrete states. On the other hand, discarding reads from error barcodes rather than correcting them reduces the sequencing depth and hence increases the noise. In the drop-seq data above 55.7% of reads are retained without error correction with a further 9.2% of reads potentially correctable. If all barcodes above the threshold and within one error of another barcode above the threshold were removed, this would only mean the loss of 0.3% of reads. Given the potential dangers of incorrect error correction, it may be prudent in some cases to discard all cell barcodes with errors and work with 55.4% of the input reads rather than correcting errors in order to retain 64.9% or reads

Speeding up python

I was recently contacted by Ilgar Abdullayev from the Ludwig Institute for Cancer Research, Karolinska Institutet with a query about why the script in our UMI-Tools repository was taking so long to run on some of his BAMs. Finding the bottleneck was relatively easy and the immediate solution was straightforward thanks to cython (and our technical director Andreas!). Thanks to Ilgar for sharing some of his data to let me optimise the code. Before the details though, I thought it might be a good time to take stock on why python is quick to write but slow to run. Skip to “Reducing the interpretation overhead” if you’re already a whiz at python.

Quick development

One of the things I really love about python is how quick it is to write function code.  Python is dynamically typed which means we don’t have to worry about the type of a variable until we want to do something with it. The following is perfectly valid.

variable = "hello world"

variable = 6

It appears to the user that the type of variable has changed from string to int. Actually, we first created a python string object and bound the reference “variable” to it, then we created a new int object and bound the reference “variable” to it. The first object no longer has any references bound to it so it’s garbage collected (e.g deleted).

Another great feature of python is duck typing by which we can perform the same operation on different python types to different ends. The classic example being the use of “+” for both concatenating strings and for adding numeric types.

variable = "hello world"
print variable + "I'm python"
hello world, I'm python

variable = 6
print variable + 2

Along with a comprehensive set of well-maintained modules for more complex classes, data manipulations and statistics, dynamic typing and duck typing makes writing a complex python script something that can be achieved in minutes rather than days.

Slower to run

The penalty you pay for dynamic typing is that the python code must be interpreted, and a section of code is re-interpreted each time its run. In the, admittedly contrived, example below we can clearly see why the code in the loop needs to be interpreted each time anew.

dict_1 = {1:"bar", 0:"foo"}
list_1 = ["bar", "foo"]

objects = (dict_1, list_1)
for obj in (objects):
    print obj[0]


In the first loop, obj is a dictionary. The “[0]” index in a dictionary returns the value for key “0”, which in this case is “foo”. In the second loop, obj is a list. The “[0]” operation on a list slices the list to the xth element (0-indexed), which in this case is the first element, “bar”.

Reducing the interpretation overhead

The reinterpretation of code takes a lot of time and python is therefore considerably slower than languages like C. And this is where cython (see what they did there) comes in.  Cython allows you to statically determine the type of an object reducing the work of the python interpreter.

Now to return to the original problem with our script. For a background on our UMI-tools see (CGAT blog – UMIs). The script in question takes the alignment file (SAM/BAM format) and returns an alignment with the duplicated reads removed. In my hands this was taking about 10 seconds for 1 million reads which seems perfectly reasonable. For Ilgar however, it was taking around 12 minutes per 1 million reads which is less than ideal when alignment files may contain >100M reads.

To identify the bottleneck I first used a bit of trial and error to identify a single read start position which was particularly troublesome. In hindsight it was hardly surprising this was an area of extreme coverage. With the use of of the python profiler module we can identify the part of the script which was taking the longest time.

python -m cProfile -s tottime -I subset.bam -S out.bam --method=adjacency > profile.log

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 26352342   47.813    0.000   76.867    0.000
 26352397   20.992    0.000   20.992    0.000 {zip}
 26352353    8.062    0.000    8.062    0.000 {sum}

So the majority of the time taken was the edit_dist function which was run 26352342 times! Looking back at the script, we can see this is because our adjacency-based methods take the list of UMIs observed and create a dictionary mapping each UMI to the UMIs which are a single edit distance apart (hamming distance = 1). This is an all vs. all operation so the time taken increases quadratically with the number of UMIs observed. The obvious solution was therefore to improve the speed of the adjacency dictionary building. Let’s look at relevant portion of the original code.

def edit_dist(first, second):
    ''' returns the edit distance/hamming distances between
    its two arguments '''

    dist = sum([not a == b for a, b in zip(first, second)])
    return dist

def _get_adj_list_adjacency(self, umis, counts, threshold):
        ''' identify all umis within hamming distance threshold'''

        return {umi: [umi2 for umi2 in umis if
                      edit_distance(umi, umi2) <= threshold]
                for umi in umis}

The _get_adj_list_adjacency method uses a dictionary comprehension with the values generated by a list comprehension so the edit_distance function is called on every combination of UMIS. For a quick improvement we can put the code for the edit distance calculation inside the list comprehension so we avoid recalling the function, but the gains are minimal. Alternatively, as you may have noticed, we’re only interested in UMIs where the distance is >= threshold so we could optimise the edit_distance method by returning as soon as the distance is over the threshold:

def edit_dist(first, second, threshold):
    ''' returns True if hamming distance is less than threshold '''

    dist = 0
    for a, b in in zip(first, second):
        if a != b:
            dist += 1
        if dist > threshold:
            return 0
    return 1

def _get_adj_list_adjacency(self, umis, counts, threshold):
        '''identify all umis within hamming distance threshold'''

        return {umi: [umi2 for umi2 in umis if
                      edit_dist(umi, umi2, threshold)]
                for umi in umis}

This improves the speed ~2-fold which is nice but not really enough!

Speaking to Andreas about this problem, he quickly identified cython as simple solution since the hamming distance calculation is very simple. Here’s the cython code snippet and the pertinent code in


cpdef int edit_distance(a, b):
    cdef char * aa = a
    cdef char * bb = b

    cdef int k, l, c
    c = 0
    l = len(a)
    for k from 0 <= k < l:
        if aa[k] != bb[k]:
            c += 1
    return c

import pyximport
from _dedup_umi import edit_distance
def _get_adj_list_adjacency(self, umis, counts, threshold):
        ''' identify all umis within hamming distance threshold '''

        return {umi: [umi2 for umi2 in umis if
                      edit_distance(umi, umi) <= threshold]
                for umi in umis}

In the cython function we statically define the type for the two C strings a & b and the ints k, l & c. Then we count the positions in each string where the characters are different. In we just need to import the pyximport module and use it to compile Then we import the edit_distance function from and use it like any other python function. This simple modification reduces the run time on the troublesome genomic coordinate from 91 secs to 11 secs – the most time-consuming function for this particular region is now remove_umis.

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     6432    5.902    0.001    5.902    0.001
       11    3.626    0.330    5.544    0.504<dictcomp>)
 26359658    1.918    0.000    1.918    0.000 {_dedup_umi.edit_distance}

The cython function could be further improved along the lines of the python function by returning as soon as the threshold is exceeded. I’m sure there are other improvements we could make to the run time of the script. For example, transforming the UMI strings to 2bit arrays would make the hamming distance calculation a very cheap xor bitwise operation. However, I’ve not tested yet if the cost of the transformation outweighs the gains in the edit distance calculation.

This is the first time I’ve used cython – something that Ian Sudbury (UMI-tools co-author) had suggested might be worth considering a while ago. To be honest, the solution was so straightforward that Andreas ended up implementing it himself whilst he demonstrated it. I think it might be time for me to read that cython documentation fully now!

Road-testing Kallisto

In this blog post I’m going to discuss some testing I’ve been doing with Kallisto, a lightning-fast tool for RNA-Seq transcript expression quantification. First though, I’ll discuss very briefly how Kallisto works (skip to “Testing Kallisto” if you don’t want my very simplified explanation of Kallisto).

Alignment-free quantification in RNA-Seq

RNA-Seq bioinformatics can take many forms but typically involves alignment of sequence reads to a reference genome, in order to establish where the reads originate from, followed by estimation of transcript abundance using a reference gene set. These steps are computationally heavy and can be very time-consuming even with multi-threading. Last year, Rob Patro et al released Sailfish which demonstrated that it was not necessary to actually align each read to the genome in order to obtain accurate transcript abundance estimates, all you actually need to do is establish the most likely transcript for each read. They achieved this by shredding the transcriptome and reads into kmers (short overlapping sequences; see Fig. 1), and then matching the transcriptome and read kmers which is a very fast and has a low memory usage. We discussed this paper in our journal club last year and I think most of us took a bit of time to accept that this was actually a viable approach, it just felt like you were losing too much information by shredding the reads into kmers. The results were very convincing though and the Sailfish paper was quickly followed by further “alignment-free” tools including a modified method using only a subset of the read kmers called RNASkim, an update to Sailfish called Salmon, and most recently Kallisto. Kallisto is actually a slight deviation from the kmer shredding method as it forms a de Bruijn graph from the transcript kmers and then aligns the read kmers to the transcriptome de Bruijn graph to form what they describe as “pseudoalignments” (Fig.2). The accuracy of Kallisto is equivalent to the best in class methods whilst being >100 x faster! In fact Kallisto is so quick, it’s perfectly feasible to bootstrap the quantification estimates 100 times to obtain an estimate of the technical variance which is very useful when testing for differentially expressed transcripts (utilised by Sleuth).


Figure 1. Kmers. Sequences can be split into overlapping (k)mers where k is less than the length of the sequence. Typically for assembly purposes, a single nucleotide overlap is used and the kmers are transformed into a de Bruijn graph which links overlapping kmers. Figure from

kallisto method schematic

Figure 2. Schematic representation of pseudoalignment. a. Transcripts (shown in colours) are shredded into kmers and then used to form coloured de Bruijn graph (c) in which each transcript corresponds to a path through the graph and each node is a kmer. A read (shown in black at top) is aligned to the graph by matching kmers to find the compatibility of the read to the transcripts. In this case, the read could have originated from either the pink or blue transcripts. Figure adapted from Kallisto publication.

Testing Kallisto

As I mentioned briefly in a previous blog post, Highlights from Genome Science 2015,  Mick Watson gave a very engaging presentation of his recent paper with Christelle Robert at Genome Science 2015 in which they identified human genes that cannot be accurately quantified in isolation using standard RNA-Seq analyses. He explained that for some genes, many of the reads which originate from the gene will align to multiple places in the genome so the quantification of these genes is very inaccurate with many methods. Their proposal is to use the multi-mapping reads to define “multi-map groups” (MMGs) which can treated as a meta-gene for the downstream differential expression testing. Mick presented an example where this analysis of MMGs leads to an interesting finding which would be missed by standard analyses.

After Mick’s talk, I got thinking about a similar analysis for Kallisto to flag transcripts which it cannot accurately quantify. There’s been lots of excitement about the speed to the alignment-free methods, and with the introduction of Sleuth, a pipeline from sequence reads to differentially expressed transcripts can now easily analyse dozens of samples on a small cluster within just a few hours. However, I wanted to get a feel for the accuracy of Kallisto before I jumped in with any real data. My assumption was that transcripts with little unique sequence would be very difficult to quantify by Kallisto (or any other tool for that matter). Originally I followed the methodology from Robert and Watson and simulated 1000 reads per transcript – I quickly realised this was inappropriate. Because each transcript has the same number of reads assigned to it, the simulation wasn’t properly testing whether the expectation–maximization (EM) algorithm used by Kallisto is able to correctly resolve the number of reads per transcript. I therefore took a slightly different approach and simulated 1-100 reads at random from each gene in my reference gene set (hg38 Ensembl v.80 filtered to retain transcripts from protein-coding genes only) with 30 repeat simulations (x30). The two metrics I was interested in were the difference between the sum of ground truths and the sum of Kallisto estimated counts, and the correlation between the ground truth and the estimated counts per simulation. The first metric describes how accurately Kallisto quantifies the true abundance of a transcript within the transcriptome and the second metric described how accurate the Kallisto estimate is for differential expression purposes. N.B It’s possible that a transcript may show a very good correlation but the estimated counts are systematically over- or under-estimated. For the purposes of differential expression analysis this may not be so much of an issue so long as the fold changes in expression are accurately captured.

Figure 3 shows the relationship between the number of unique kmers (here shown as the fraction of all kmers across the transcript) and the fold difference between the sum of ground truths and the sum of Kallisto estimates. As expected, transcripts with a higher fraction of unique kmers are quantified more accurately by Kallisto. However, I was surprised to see Kallisto still seems to accurately quantify many of the transcripts which don’t have a single unique kmer(!). I’m not sure whether this might be an artefact of the simulation somehow?

Kallisto simulation accuracy fold difference

Figure 3. Increased fraction of unique kmers is associated with lower fold differences between Kallisto estimated counts and ground truth. Transcripts were binned by the fraction of kmers which are unique to the transcript. The log2-fold difference between the sum of ground truths and Kallisto estimated counts is shown, along with a boxplot per bin to show the distribution within each bin. Transcripts with absolute differences > 2-fold are shown in red at 1 or -1 respectively.

Figure 4 shows the overall correlation between the sum of ground truths and the sum of estimated counts. Transcripts where the estimated counts were 1.5-fold higher or lower than the ground truths are flagged as “Poor” accuracy. In total, just 701/182510 (0.46%) transcripts failed this threshold. With a even-tighter threshold of 1.2-fold, still only 2.8% of transcripts fail. There were a few transcripts which are considerably overestimated in each individual simulation, the most extreme of which is ENST00000449947, a transcripts of the Azoospermia factor DAZ2 which is ~9-fold overestimated on average across the simulations. When we look at the transcript table and model for DAZ2, (Fig. 5 & Fig. 6) we can see that there are 9 transcripts within this locus, and they appear to vary only in the inclusion/exclusion of particular exons so their sequences will be very similar. In fact, of the 9 transcripts, only ENST0000382440 contains any unique kmers (13/3223; 0.4%), whereas the rest contain 1784-3232 non-unique kmers. It’s no surprise then that Kallisto struggles to accurately quantify the abundance of all 9 transcripts (Fig. 7) with three of the transcripts having zero estimated counts across all simulated data sets. When we look back at the Ensembl table, we can see that a number of the transcripts are annotated with the flag TSL:5 (Transcript Support Level 5) which indicates that there are no mRNAs or ESTs which support this transcript, i.e it’s highly suspect. If all these TSL:5 transcripts were removed from the gene set, every transcript would have some unique kmers which would certainly improve the Kallisto quantification for the transcripts of this gene. Since these transcripts are likely to be miss-assembly artefacts, it seems sensible to suggest they should be removed from the gene set to allow accurate quantification at the other transcripts of this gene.

Kallisto simulation accuracy correlation

Figure 4. Correlation between sum of ground truths and estimated counts from Kallisto across all simulation (n=30). The number of reads per transcript per simulation was randomly sampled from a uniform distribution [0-100], hence the sum of ground truths is not the same for each transcript. Transcripts for which the Kallisto estimates are >1.5-fold higher or lower are flagged as “Poor” quality

Ensembl transcripts

Figure 5. ENSG00000205944 Transcript table from Ensembl.

Ensembl transcript models

Figure 6. ENSG00000205944 Transcript models from Ensembl.

Kallisto accuracy DAZ2

Figure 7. As per Figure 4 except transcripts from gene DAZ2 (ENSG00000205944) are highlighted.

So far, so good for most (99.54%) transcripts with regards to accurate overall quantification. The next thing I wanted to look at was the Pearson product-moment correlation coefficient (r) between the ground truth and estimated counts. Most of the time I’m interested in doing differential expression testing with RNA-Seq. I know with a real data set the estimates themselves will be systematically biased by all sorts of things, chiefly amplification biases and illumina sequencing biases, but I need to know the Kallisto estimates correlate well with the transcript abundance. Figures 8 & 9 shows the relationship between the fraction of unique kmers and the correlation between ground truth and estimted counts. Again as expected, Kallisto performs much better when the transcript contains a high fraction of unique kmers, with a clear drop off in performance below 5% unique kmers. Yet again, Kallisto is still performing very well for some transcripts which don’t have a single unique kmer. In fact, it does slightly better for transcripts with 0-1% unique kmers than those with 1-2% unique kmers(!), which I’m at a complete loss to explain. Perhaps someone from Lior Pachter’s group can explain how this is possible?

In this case it’s slightly more difficult to define a threshold to flag transcripts with a low correlation as the range of ground truths for each transcript will be different so the comparison between transcripts wouldn’t be on the same basis. For my purposes, I’ve just flagged any transcript with less than 5% kmers as being difficult to quantify for now.

If we were to set a correlation coefficient threshold of 0.75, then 10.3% of transcripts would be flagged as failing the threshold. We can see in Figure 10 that transcripts from loci with a larger number of transcripts show much worse correlation. Returning to the example of ENSG00000205944 from earlier, all the transcripts fail to meet this threshold, with ENST00000382440 (the only transcript with unique kmers) showing the highest correlation (0.729).

Kallisto accuracy unique kmers

Figure 8. Transcripts with more unique kmers have a better correlation between ground truth and Kallisto estimates. Transcripts were binned by the fraction of kmers which are unique to the transcript. The boxplot shows the distribution of correlation coefficients per bin.

Kallisto accuracy unique kmers zoom

Figure 9. As per Fig.8 except x-axis restricted to [0-0.1].

Kallisto accuracy correlation transcripts per gene

Figure 10. Transcripts from loci with more transcripts have lower correlations between ground truth and Kallisto estimates. Transcripts were binned by the number of transcripts for the gene which they are annotated to. Boxplots show the distribution of correlation values per bin


Kallisto appears to be highly accurate for the majority of transcripts, with estimated counts for 99.54% of transcripts within 1.5-fold of the true counts and 97.2% within 1.2-fold. This seems pretty fantastic to me. Most of the transcripts show a good correlation between the simulated read counts and the Kallisto estimates (~90% coefficient > 0.75) which seems pretty good to me for a transcript level analysis with a very complex simulated transcriptome. Right now, I’ve not got anything to compare this to but I’m happy Kallisto’s performing very well. I’m going to add Sailfish and Salmon to the pipeline soon for comparison and then compare to some alignment-based methods.

Finally, this simulation has reminded me of the importance of properly considering the gene set. Inclusion of the poorly supported transcripts here looks likely to have reduced the accuracy of well supported transcripts from the same locus. Before I go onto quantify expression in my samples, I clearly need to define a high confidence set of transcripts. One of the fantastic things about the speed of Kallisto is that it’s so quick that I can include this simulation with every run of the pipeline and retest my new geneset at run-time. This also means anyone else using it within the group will get a report on the suitability of their gene set and transcripts which can’t be accurately quantified will be flagged in the final results.

With thanks:

There have been a number of really informative blog posts on alignment-free quantification, including from the authors of the tools. The Kallisto team have also been very quick at responding helpfully to questions on their google groups forum which is fantastic to see. Any misunderstanding or misrepresentation of how these methods work is, however, entirely my fault!

Rob Patro – Salmon and Kallisto

Lior Pachter – Kallisto

Nextgenseek – Kallisto

Highlights from Genome Science 2015

Genome Science is an annual conference focusing on genome research in the UK, with this year’s conference hosted by Nick Loman in Birmingham. Big thanks out to Nick and the organising committee for bringing together such as diverse range of speakers, ranging all the way from functional genomics through to clinical genomics and from new technologies through to fundamental research. CGAT was out in force with 8 of us attending, probably only outdone in pure numbers by the might of TGAC!

I thought I’d put a few highlights up on the blog, partly because I’m so bad at remembering names and faces and this might help me remember who it was who gave that really interesting talk at Genome Science 2015 in a few months time! This is far from a comprehensive summary of all the talks though.

In the opening talk of the conference, Daniel MacArthur’s presented the results from aggregating 1000s of exome studies. This included one of those “why hadn’t I thought about this before” moments when he explained that we’ve now sequenced the exomes of so many individuals that one can start applying statistical tests to determine genes where the number of single nucleotide variants (SNVs) identified is significantly lower than we would expect by chance. The obvious explanation for this is that mutations in these gene lead to such a severe phenotype that the mutation is removed from the gene-pool by purifying selection. This is indeed the case for a number of the genes which are implicated in severe diseases. Interestingly some of the genes they’ve identified have never been implicated in a human disease before. However, we can now say that a mutation causing an amino-acid change would likely be fatal. So there’s value not just in what you observe but what you don’t observe too! What really excites me is the potential to turn this analysis to non-coding regions when we have enough full genomes sequenced – this could really help increase our understanding of the regions of the genome which are important in regulating gene expression. Ed Yong’s got an article up on explaining this much better than I have (How Data-Wranglers Are Building the Great Library of Genetic Variation)

For me, some of the most interesting talks at Genome Science 2015 were those which introduced novel technologies or analyses and prompted me to think about new ways to answer those burning biological questions. There were many talks about nanopore-based sequencing and two in particular gave me some food for thought.

Mike Akeson (University of California Santa Cruz) spoke about the development of nanopore sequencing technologies, from the early work of George Church and others through to the recent results coming out of Oxford Nanopore’s early access program. There’s obviously a lot of excitement about the rapid improvements in base calling, however, what really excited me was the mention of nanopore sequencing of proteins. Whisper it quietly but clinical applications of current sequencing technologies often use transcript abundance merely as a proxy for protein abundance. If we could start reliably quantifying proteins with nanopore-based technologies, this could be a very disruptive technology. On the research side, as someone who’s interested in the question of how much alternative splicing of transcripts is actually propagated to the protein level, being able to directly sequence both the transcriptome and proteome would be fantastic.

Keeping with the nanopore theme, Matt Loose (University of Nottingham) presented the latest incarnation of his minoTour platform for realtime minION analysis. It looks like a really slick interface but I must confess that up to this point I’d not really seen the value in the realtime aspect of nanopore analyses within research settings. Watching a minION generate reads in a live demonstration is very cool but apart from a constantly updated histogram of read length, what do you really gain from this? Of course in a clinical setting you could hook up the output directly to the downstream analysis and obtain your results 24 hours quicker, which could be of real benefit. But for me the downstream analysis is probably best measured in months not hours! What I hadn’t considered until Matt’s talk though was that the realtime analysis allows you to selectively focus on particular sequences. In the example he gave, if you barcode DNA from each sample, you can identify the sample source of the DNA in realtime. This then enables you to obtain roughly equal sequencing depth from all your samples in a multiplexed pool of DNA by rejecting pores which are occupied by an over-sequenced sample and switching to another pore. There’ll no doubt be other similar applications of the realtime sequence analysis to come. Interestingly, Matt mentioned that this is currently only possible because the speed which the DNA is being ratcheted through the nanopore is suppressed to improve accuracy, when the speed increases, it may start to become difficult to analyses the sequences quick enough!

Other highlights included Mick Watson’s very engaging presentation of his recent publication which defines genes which can’t be accurately quantified in isolation but can yield biologically relevant results when considered as a group of genes. Definitely worth a read: Errors in RNA-Seq quantification affect genes of relevance to human disease. I’ll return to this topic in more depth with some of my own simulation data using Kallisto soon.

Unique Molecular Identifiers – the problem, the solution and the proof

[EDIT: UMI-Tools open access publication is now out]

I’ve been working with Dr. Ian Sudbery (former CGAT fellow, currently Lecturer in Bioinformatics at the University of Sheffield) on a script called to correctly remove PCR duplicates from alignment files when using Unique Molecular Identifiers (UMIs). The script is available from GitHub as part of the UMI-tools repository. As part of the development I’ve been running some simulations to compare different methods which have proved quite illuminating. Before I go into the details though, some background.

PCR duplication artifacts

Many sequencing-based techniques rely upon an amplification step to increase the concentration of the library generated from the fragmented DNA prior to sequencing. Following alignment to the genome, it’s usually preferable to identify and remove PCR duplicates as there are inherent biases in the amplification step as some sequences become overrepresented in the finally library compared to their true abundance in the fragmented DNA. The brilliant Picard tool is commonly used to identify and remove these PCR duplicates using their genomic coordinates. Of course, it’s also possible that two sequence reads with the same genomic coordinates will have the originated from two independent fragments. However, where the alignment density is essentially random (as in DNA-Seq) this is unlikely, even more so if paired end sequencing has been performed. The human genome has ~3 x 109 bases and the insert size will vary so the combinations of possible genomic coordinates for the paired ends is so vast we don’t expect to sequence two fragments with the same genomic coordinates.

Conversely, in RNA-Seq, some cDNA sequences will be highly abundant and we are therefore much more likely to sequence multiple fragments with the exact same genomic coordinates. This problem is most acute where a high number of PCR cycles have been used to generate the sequencing library, for example where the concentration of starting material in low, such as in single cell RNA-Seq as this necessitates many PCR cycles.

Using unique molecular identifiers to correctly identify PCR duplicates

Unique molecular identifiers (UMIs; also called Random Molecular Tags (RMTs)) are random sequences of bases used to tag each molecule (fragment) prior to library amplification (see figure below and Kivioja et al (2012), thereby aiding in the identification of PCR duplicates.  If two reads align to the same location and have the same UMI, it is highly likely that they are PCR duplicates originating from the same fragment prior to amplification. We can therefore collapse all the reads with the same genomic coordinates and UMI into a single representative read and obtain an accurate estimate of the relative concentration of the fragments in the original sample.

nmeth.2772-F1This figure is adapted from Islam et al (2014)

The problem

The use of UMIs as described above would work perfectly if it were not for base-calling errors, which erroneously create sequencing reads with the same genomic coordinates and UMIs and that are identical except for the base at which the error occurred. Base-calling errors therefore inflate the apparent numbers of unique fragments sequenced (see below). An additional source of errors comes from the amplification itself which can introduce errors during the DNA duplication step.


The solution

Ian initially identified this issue in his own data and wrote a python script utilising pysam to parse an alignment BAM file and inspect the UMIs of reads with the same genomic coordinates. He reasoned that sequencing errors would produce UMIs which are a single edit distance away from another UMI. Therefore, by forming an adjacency matrix it should be possible to identify the sequencing errors and collapse the reads to the true number of unique UMIs present (see schematic below for a simple example). The first step is to identify connected components, where UMIs are represented by nodes and edges represent links between nodes which are a single edit distance apart. The counts of each node are also retained. Within each component, the most abundant node is selected and all nodes within a single edit distance are removed. This process is repeated using the second most abundant node and so on until all edges are removed and the number of nodes remaining is inferred to be the number of unique fragments sequenced. We refer to this method as the “adjacency” method.


It turns out this works pretty well, however, one can envisage situations where this may not correctly identify the true number of unique fragments. For example,  an error during the PCR amplication may create an apparently distinct UMI from which sequencing errors could created further UMIs, all of which are actually from a single unique fragment (see schematic below). A similar situation could also arise if one particular position in the reads was more likely to contain an error (this is quite common in illumina sequencing). In this case, one would see lots of UMIs with a single edit distance due to an error at the common position and further errors may be present off of this UMI.

schematic_3In order to deal with these more complicated networks of connected UMIs, I developed the “directional_adjacency” method. I reasoned that we could resolve these networks by forming edges between nodes where the count of node A was >= 2 x node B. Now, each connected component is inferred to originate from a single unique fragment.

In addition to these methods, we also wanted to try a modification of the “adjacency” method in which the whole connected component is inferred to originate from a single unique fragment (“cluster”).

This issue has been previously observed by Islam et al (2014) who proposed that UMIs with less than 1% of the average counts should be removed “to account for UMIs that stem from PCR-induced mutations or sequencing errors”. We implemented this method within our tool for comparison (“percentile”).

Finally, we implemented a “unique” method, where all unique UMIs were retained. Note, this appears to be the method which is most commonly used. The schematic below shows a comparison of the various methods (I’ve obviously rigged it to make directional adjacency look the best).


The proof

We’re taking two approaches to prove the value of improved detection of PCR duplicates using the adjacency methods implemented in Firstly, we’ve conducted simulations to show they consistently estimate the true number of unique fragments more accurately. Secondly, we’ve applied to multiple single-cell RNA-Seq studies and show an improvement, with respect to the distribution of UMIs, the quantification of spike-in RNAs and the clustering of cells. I’ll focus on the simulation for now and leave the single-cell RNA-Seq to another post when we’ve submitted the manuscript.

In order to compare the various methods, we wanted to know how the methods fared across a range of UMI lengths and depths of sequencing.  For this I turned to python to replicate the process of amplification and sequencing in silico and model the impact of the various parameters on the efficacy of the methods. The whole process is contained in a standalone ipython notebook which you can view here. If you would like to download and run the notebook yourself you can do so from here.

In brief, the simulation starts with a set number of unique molecules (fragments), our “ground truth”. A UMI is then randomly assigned to each molecule and the pool of molecules amplified by duplicating each molecules for every in silico PCR cycle. The final pool of molecules are then “sequenced” to produce the reads. Within this process, base-calling errors are incorporated during the sequencing and substitutions are incorporated during the PCR cycles. Furthermore, amplification efficiency of each UMI varies to naively model the impact of amplification biases. Various methods are then applied to the reads generated to simulate the de-duplication process within and an estimate of the number of unique molecules is obtained. To compare the methods I used 2 metrics:

  • The difference between this estimate and the ground truth
  • The variability of the estimate for which I used the coefficient of variation (CV; sd/mean)

Each of the following plots below show the impact of changing one of the simulation parameters on theses two metrics. In each plot, the red dashed line shows the parameter value that was used in the other simulations.

We can see that under the assumptions of the simulations, the directional adjacency method is the least sensitive to variations in the input parameters.

The unique method (the method used in most publications) severely overestimates the number of unique molecules since each sequencing error or PCR error erroneously increases the estimate. As the number of starting molecules increase, all methods eventually struggle to identify the true number of unique molecules, with the adjacency, directional adjacency and percentile methods showing the best performance.  As we would expect, the percentile method tends to overestimate the number of unique molecules since it fails to remove sequencing errors unless they are very rare events. The adjacency method tends towards an overestimation too, as it can only identify erroneous UMIs within a single edit distance. Interestingly, the directional adjacency method tends towards an underestimation as it can link together a string of UMIs each separated by a single edit distance. The directional aspect of the network is designed to avoid this happening between truly unique UMIs, however occasionally these will be collapsed together. The cluster method tends towards a more severe underestimation as it does not take counts into account at all and collapses whole networks into a single UMI.

Overall, the directional adjacency method shows very little sensitivity to the number of starting molecules and outperforms all other methods. As expected, the percentile method becomes more consistent as the number of molecules increase, since the variability between the counts in each one of the individual iterations is reduced. In contrast, the cluster and adjacency methods become less consistent as the number of starting molecules increases, as the variability in the topology of the networks formed will also increase.


As the UMI length increases (from a initial very short 3bp), the methods which utilise a network quickly improve in terms of accuracy and variability. However, the percentile and unique methods actually become less accurate as a longer UMI obviously means more probability of errors accumulating. Perhaps counter intuitively then, if you’re not going to account for errors when using UMIs, you’re better off not using a very long UMI.


As the PCR cycles are increased, all methods except directional adjacency show a decrease in accuracy and an increase in variability, with adjacency being the second most robust to high PCR cycle numbers. We see the same as we increase the DNA polymerase error rate. This confirms that the directional adjacency method is better able to handle errors originating from the amplification stage.



Is a non-heuristic solution possible?

The methods we propose are definitely an improvement on “unique” deduplication using UMIs within these simulations.  We’ve also shown that the application of our proposed methods can have a real impact on biological interpretation of single cell RNA-Seq data (more to come on that soon). However, all implemented methods are heuristic approaches to the problem. My initial expectation was that we would be able to apply some probabilistic approach to determine the most likely set of UMIs which would account for the UMIs observed at a given genomic location. So far, we’ve been unable to develop such a method. If anyone would like to discuss alternative probabilistic approaches, or the script in general, please get in touch.

An introduction to the CGAT group

Introducing CGAT

This blog follows the journey of a group of postdoc biologists into the world of computational genomics.  In this initial post we’ll introduce who we are and why we’re undertaking this adventure together.

It took more than ten years to sequence a single human genome using capillary sequencing technology.  We’ve come an even longer way since the dark days of painstakingly running a ruler down a sequencing gel (tales abound of even John Sulston taking his turn on the gels in the early stages of the human genome project).  Capillary sequencing allowed this process to be automated, but still it was slow and arduous, with relatively low throughput per sequencer.  The advent of sequencing by synthesis and massively parallel sequencing has hugely increased output whilst reducing the cost per nucleotide to fractions of a penny.  Initially reads were short, mapping was hard and error rates were high.  Tools were developed to align these short reads back to the genomes and transcriptomes they were derived from.  As the chemistry improved so did the read lengths and error rates, biases were minimised and more sophisticated aligners were developed to map short sequencing reads to genomes across the kingdoms of life on Earth.

With these advances came accessibility to large data sets as small labs started to generate gigabytes of sequencing data at relatively low costs.  However, analysis of these data sets requires large computational infrastructure and expert knowledge; a resource that is not available to all biology labs and institutes.  Additionally, analysis of NGS data needs to be driven by the biology, after all the sequencing data was generated to answer a biological or medical question.  This has led to a bottleneck in the analyses of these data and a need for more skilled individuals to bring together biological knowledge, statistical understanding and computational skills in handling and integrating large data sets.

The Computational Genomics Analysis and Training programme was setup by the UK Medical Research Council to train researchers in these skills and capitalise on the wealth of NGS data being generated.  CGAT fellows embark on a 3 year program which combines training and research.  Fellows train in statistics and computational biology whilst collaborating with UK scientists to perform data analysis and interpretation in wide ranging research projects.

This blog is about the fellows as we develop through this training program, our experiences as well as more technical information as befits the nature of our work.  Expect paper recommendations, brief highlights of analytical tools we’re using, lessons we’ve learned in programming and some guest posts from colleagues and collaborators.

As with all modern biological research, this blog is the result of a collaboration of individuals with a shared motivation.  We’ve all come from biological “wet lab” backgrounds to learn how to leverage computational tools to discover some cool biology.

If you’re interested in who the current fellows are then follow the links in the side bar to our Gravatar profiles or checkout the profiles and testimonials on the CGAT website.

That’s all for now folks.