Estimating the number of true cell barcodes in single cell RNA-Seq (part 2)

This post starts where we left off in the first part. If you haven’t already done so, I’d recommend reading the first part now.

I hadn’t planned on writing another post but no sooner had I finished the first part of this blog post than I discovered that the Pachter group have developed a tool to perform the same task (sircel, pronounced ‘circle’). This appears to be in the early stages of development but there’s a bioRxiv manuscript out which happens to cover the same input data as I discussed in the previous post (thanks to Avi Srivastava for bringing this to my attention).

In the discussion with Avi in the comments of the previous post I also realised that I had failed to clarify that only base substitutions will be detectable when using our network methods. As it happens, one of the motivations for the development of sircel appears to be handling INDELs, which they claim can be present in >25% of dropseq barcodes. I can’t see where they get this number from (in the manuscript they reference the original Macosko et al paper but I can’t see this 25% figure anywhere) so I’m assuming for now that this is based on additional analyses they’ve performed. This seems like an alarmingly high figure to me given that INDELs are ~1 order of magnitude less frequent than substitutions (see Fig. 2/Table 2 here) and even at the start of the read, where the cell barcode resides, the per-base rate of INDELs does not usually exceed 0.001. I would also assume that cell barcodes with INDELs would have far fewer reads and so drop below the threshold using the ‘knee’ method. Hence, while some cell barcodes may be discarded due to INDELs, I would be surprised if an INDEL-containing barcode was erroneously included in the accepted cell barcodes.

How common are INDELs?

Before I tested out sircel, I wanted to independently assess what proportion of cell barcodes might have a substitution or INDEL. To do this, I returned to the drop-seq data from the last post (GSE66693). Previously, I showed the distribution of reads per cell barcode separated into 4 bins, where bin 4 represents the accepted barcodes above the knee threshold and bins 1-3 are largely populated by different types of erroneous barcodes (reproduced below as Figure 1). I compared each barcode to all barcodes in bin 4 and looked for a match allowing for one error (substitution, insertion or deletion). The results are tabulated below.  These figures are an upper estimate of the number of barcodes/reads harbouring an error relative to the true cell barcodes since each barcode may contribute to all three error types if at least one match to a bin 4 barcode exists with each error type. We can see that ~1.5% of cell barcodes may contain an INDEL, far less than the 25% suggested in the sircel manuscript. Furthermore, when we present the figures according to the number of reads, it appears around 3% of reads have a cell barcode with an INDEL, whilst 7% have a substitution. Thus, we can be quite confident that we’re not losing lots of data if we don’t error correct cell barcodes or just correct substitution errors.


Figure 1. Distribution of reads per cell barcode (log 10) from first 50M reads, separated into 4 bins. Bin 4 represents the ‘true’ barcodes.


Table 1: Columns 1 & 2: The number and fraction of cell barcodes which match a bin 4 barcode allowing one error. Columns 3 & 4: The number and fraction of reads with a cell barcode which matches a bin 4 barcode allowing one error.
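For reference, the one-error matching used for Table 1 can be sketched in a few lines of Python (a simplified, stand-alone version for illustration rather than the exact code used for the analysis):

```python
def within_one_error(bc, ref):
    """Classify how barcode bc matches reference barcode ref:
    'exact', 'substitution', 'insertion' (bc has one extra base),
    'deletion' (bc is missing one base), or None if >1 error."""
    if bc == ref:
        return "exact"
    if len(bc) == len(ref):
        # one substitution if the strings differ at exactly one position
        return "substitution" if sum(a != b for a, b in zip(bc, ref)) == 1 else None
    if len(bc) == len(ref) + 1:
        longer, shorter, label = bc, ref, "insertion"
    elif len(ref) == len(bc) + 1:
        longer, shorter, label = ref, bc, "deletion"
    else:
        return None
    # an INDEL match if removing one base from the longer string
    # recovers the shorter one
    return label if any(longer[:i] + longer[i + 1:] == shorter
                        for i in range(len(longer))) else None
```

Each barcode outside bin 4 can then be compared against every bin 4 barcode, recording which of the three error types yields a match.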

Are INDEL-containing barcodes passing the knee threshold?

Next, I wanted to investigate whether the INDEL barcodes pass the knee threshold and become ‘accepted’. I therefore split the above data according to bin and focused on bin 4. As you can see in Figure 2 below, there are indeed a small number of bin 4 barcodes which are a single error away from another bin 4 barcode. Our null expectations (from 100 000 randomly generated barcodes) for the percentage of substitution, deletion and insertion matches by chance are 0.14, 0.11 and 0.13%, respectively. In contrast, amongst the 575 bin 4 barcodes, there are 6 barcodes (1.0%) which match at least one other bin 4 barcode with one error. 4/6 of these barcodes have a match with an INDEL and all have relatively low counts, suggesting these are indeed errors. It seems I may have been wrong to assume an INDEL-containing barcode will not pass the knee threshold. The results for bins 1-3 mirror my previous analysis using edit-distance and suggest that bin 3 is mainly made up of error-containing barcodes whilst bins 1 & 2 appear to originate from amplification of ambient RNA in empty droplets.

Figure 2. The frequency of cell barcodes in each bin which are one error away from at least one of the bin 4 barcodes. Random = 100 000 randomly generated barcodes.

How does sircel compare to the knee method?

So, INDELs are observed in this drop-seq data, although perhaps not in 25% of all cell barcodes, and INDEL-containing barcodes can pass the hard knee threshold for acceptance. Now on to sircel. As you would expect from the Pachter group, sircel introduces a novel, elegant solution to detect substitutions and INDELs in barcodes. Rather than using edit-distance or fuzzy matching, they ‘circularise’ the barcodes and form a de Bruijn graph using the kmers from the circularised barcodes. The circularisation ensures that even with a large kmer size, barcodes containing an error will share at least a portion of their path through the de Bruijn graph with the cognate error-free barcodes. The edges are weighted according to the number of times each edge is observed and each circular path is given a score according to the minimal weight of all its edges. This score is used to filter the paths, selecting the high-weight, correct paths; the incorrect paths are then merged into the correct ones. Notably, the thresholding is done using a method analogous to the knee method, whereby the threshold is set as the local maximum of the smoothed first derivative of the cumulative distribution of path scores. The neat thing about the sircel method is that it enables simultaneous identification of this inflection point, to identify the threshold for true barcodes, and error correction to the true barcodes. To compare the results for sircel and my knee method, I ran sircel on the GSE66693 drop-seq sample where the authors had suggested there were ~570 true cell barcodes. Previously, using the first 50M reads, I had estimated that there were 573 true barcodes using the knee method. As per the sircel manuscript recommendation, I used only 1M reads to run sircel, which yielded 567 true barcodes. This is slightly lower than the number quoted in the manuscript for 1M reads (582) but perhaps this is due to more recent updates?
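To make the circularisation idea concrete, here is a toy sketch of my understanding of the approach (this is emphatically not sircel's implementation; the kmer size and scoring are simplified):

```python
from collections import Counter

def circular_kmers(barcode, k):
    """Kmers from the circularised barcode: append the first k-1 bases
    so that every position starts a kmer and the path forms a cycle."""
    circ = barcode + barcode[:k - 1]
    return [circ[i:i + k] for i in range(len(barcode))]

def edge_weights(barcodes, k):
    """Weight each de Bruijn edge (kmer) by how often it is observed
    across all barcodes."""
    weights = Counter()
    for bc in barcodes:
        weights.update(circular_kmers(bc, k))
    return weights

def path_score(barcode, k, weights):
    """Score a circular path by the minimum weight of its edges."""
    return min(weights[kmer] for kmer in circular_kmers(barcode, k))
```

With ten copies of a true barcode and one copy carrying a substitution, the true barcode's minimum edge weight stays high, while the error barcode's path score is dragged down by its low-weight kmers, so thresholding on path score separates the two.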
The methods agree on 563 cell barcodes but what about the cell barcodes where they disagree? 4 out of the 10 knee method-specific cell barcodes are a single error away from another barcode above the threshold, suggesting these may in fact be errors. The remaining 6 barcodes all have very low counts. Likewise, the 4 sircel-specific cell barcodes fall just below the knee threshold (6170-9576 reads per barcode; threshold = 9960). For these discordant barcodes, the difference appears to be due to differences in the number of reads sampled. I wanted to check this by using 50M reads with sircel but the run-time becomes excessively long. Unfortunately, going the other way and using just 1M reads with the knee method yields unstable estimates, and since it takes <6 minutes to run the knee method with 50M reads we would never recommend using the knee method with just 1M reads. So in essence, the major difference between the knee method and sircel is that sircel effectively corrects the small number of error barcodes above the threshold.

Can we improve the knee method?

It does look like sircel does a good job of removing error barcodes but we could easily replicate this performance by performing an INDEL-aware error detection step after the knee threshold. During the development of UMI-tools we considered detecting INDELs as well as substitutions but decided against it because: a) as mentioned at the top, the rate of INDELs is usually ~1 order of magnitude lower than substitutions (though this doesn’t appear to hold true for cell barcodes), and b) UMI-tools groups reads together by their alignment position so the current framework does not allow the detection of INDELs. However, when we’re dealing with cell barcodes, we consider all barcodes at once and hence it’s much easier to identify INDELs. For our network methods, we can simply replace the edit-distance threshold with a fuzzy regex, allowing one error (substitution, insertion or deletion). The fuzzy regex matching is ~30-fold slower than our cythonised edit-distance function but this isn’t an issue here since we only need to build the network from the cell barcodes which pass the knee threshold. From the network we can identify the error barcodes and either correct them to the true barcode or else discard them. This will now be the default method for UMI-tools extract, with the option to discard, correct or ignore potential barcode errors above the knee threshold. We will also include the option to correct barcodes which fall below the threshold.
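As a sketch of the idea (with a plain Python one-error check standing in for the fuzzy regex, and hypothetical function names):

```python
from itertools import combinations

def one_error_apart(a, b):
    """True if barcodes a and b differ by a single substitution,
    insertion or deletion (a stand-in for the fuzzy regex match)."""
    if len(a) < len(b):
        a, b = b, a
    if len(a) == len(b):
        return sum(x != y for x, y in zip(a, b)) == 1
    if len(a) != len(b) + 1:
        return False
    return any(a[:i] + a[i + 1:] == b for i in range(len(a)))

def flag_error_barcodes(counts):
    """Build the one-error network amongst the accepted barcodes and
    flag, for each connected pair, the lower-count barcode as a likely
    error, with its higher-count neighbour as the candidate correction."""
    errors = {}
    for a, b in combinations(counts, 2):
        if one_error_apart(a, b):
            low, high = sorted((a, b), key=counts.get)
            errors[low] = high
    return errors
```

The returned dictionary maps each putative error barcode to its higher-count neighbour, which can then be used to correct the barcode or discard its reads.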

A final thought: To correct or not?

The cost of incorrectly merging two cell barcodes through cell barcode error correction could be drastic, e.g. appearing to create cells ‘transitioning’ between discrete states. On the other hand, discarding reads from error barcodes rather than correcting them reduces the sequencing depth and hence increases the noise. In the drop-seq data above, 55.7% of reads are retained without error correction, with a further 9.2% of reads potentially correctable. If all barcodes above the threshold and within one error of another barcode above the threshold were removed, this would only mean the loss of 0.3% of reads. Given the potential dangers of incorrect error correction, it may be prudent in some cases to discard all cell barcodes with errors and work with 55.4% of the input reads rather than correcting errors in order to retain 64.9% of reads.

Estimating the number of true cell barcodes in single cell RNA-Seq (part 1)

Single Cell RNA-Seq (scRNA-Seq) took a major leap forward in 2015 when two Harvard groups published oil droplet-based methods for sequencing >10,000 cells in a single experiment for well under $1/cell (Klein et al; Macosko et al). Then in early 2016, 10X Genomics launched their commercial version of droplet scRNA-Seq, followed more recently by the Illumina and Bio-Rad competitor technology. From my limited exposure, the 10X product seems to be the most popular right now but this is a very new field so there are likely more improved methods on the horizon which could shake things up. These techniques all label cells in fundamentally the same manner. Using microfluidic devices, individual cells are encapsulated together with a random cell barcode in a droplet. The reverse transcription of the RNA and subsequent PCR amplification are then performed inside each droplet separately such that all cDNAs from a single cell are labelled with the same cell barcode.

The techniques themselves are really exciting but I want to focus here on a technical question during the analysis. A crucial aspect of droplet scRNA-seq is that the number of possible cell barcodes is many orders of magnitude greater than the number of cells sequenced. Hence, if each sequenced cell gets one cell barcode at random, the probability of two cells having the same barcode is very low (if you’re interested, you can work out the probability using the solution to the famous ‘birthday problem’). However, when you get your sequencing data back and extract the cell barcodes detected, not all of these will be ‘true’ cell barcodes, as some barcodes will contain sequencing errors or PCR errors. Furthermore, some cell barcodes may have been incorporated into cDNA that does not originate from your cells (more on this later). Hence, one of the first challenges in droplet scRNA-Seq is working out how many cells you’ve actually sequenced! In the supplementary material of the original drop-seq paper (Macosko et al) they suggest plotting the cumulative frequency and looking for an inflection point, or ‘knee’. They show this in supplementary Figure 3A (reproduced below in Figure 1 with permission). The 10X analysis pipeline also appears to take a similar approach from what I can gather from their website.


Figure 1. Cumulative frequency for drop-seq data. Plot taken from Macosko et al supplementary figure 3 with permission.  Dashed line shows inferred position of point of inflection.
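As an aside, the birthday-problem approximation is easy to compute. Assuming a 12bp cell barcode (as in drop-seq), there are 4^12 ≈ 16.8M possible barcodes:

```python
import math

def collision_probability(n_cells, n_barcodes):
    """Birthday-problem approximation: the probability that at least
    two of n_cells are assigned the same barcode, drawn uniformly at
    random from n_barcodes possibilities."""
    return 1 - math.exp(-n_cells * (n_cells - 1) / (2 * n_barcodes))

# ~570 cells with 12bp barcodes: under a 1% chance of any collision
p = collision_probability(570, 4 ** 12)
```

So even with hundreds of cells per run, the chance of two cells sharing a barcode stays below 1%.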

Now, if you’ve been on this blog before, you’ll probably have seen the posts about UMI-tools. This is a tool suite I’ve co-developed with Ian Sudbery (University of Sheffield) to accurately identify PCR duplicates in techniques utilising unique molecular identifiers (UMIs). UMIs are similar to cell barcodes in that they are just a sequence of bases used to barcode a read, though in this case, the barcode relates to the unique molecule prior to PCR amplification. The novel aspect of our tools is that we used networks to identify PCR duplicates in an error-aware manner. For background on the network methods, please see our publication or my previous blog post. For the next major UMI-tools release (version 0.5), we’re planning to expand our support for single cell RNA-Seq analyses as this seems to be the major application of the tool to date according to our GitHub issues. As part of this, we wanted to expand the extract command. Currently this command extracts the UMI sequence from the reads and appends it to the read name so that the UMI can be accessed in the downstream BAM. We wanted to extend the extract command so that it can:

  • extract cell barcodes
  • filter cell barcodes against a whitelist of expected cell barcodes
  • or, where no whitelist exists (as in the case of droplet scRNA-Seq), create a whitelist of ‘true’ cell barcodes from the data and filter against this

For the final point, we needed to implement an automated version of the ‘knee’ method as described in the drop-seq paper. We also wondered whether our network based methods might be suitable for this purpose as they are designed to handle the situation where you have an unknown number of ‘true’ barcodes plus additional barcodes which have been generated via errors from the ‘true’ barcodes. I therefore set out to implement a ‘knee’ method and compare it with our network methods.

To start with, I took the SCRB-Seq data from Soumillon et al since I had it to hand and this includes 384 expected cell barcodes which I could use as a ground truth. All other barcodes observed should therefore have originated as errors from the 384 true barcodes. To be clear, this is not droplet scRNA-Seq, but it gives us a definitive ground truth for the purposes of initial performance testing. Plotting the cumulative frequency we can clearly see an inflection near the 384th barcode as expected (Figure 2; blue line).


Figure 2. Cumulative frequency for SCRB-Seq cell barcodes. Blue line represents the position of the 384th cell which is close to the inflection point in the curve.

I initially attempted to identify the inflection point by fitting a spline function to the cumulative frequency but quickly realised it was much easier to work from the distribution of counts (log10) per cell barcode (Figure 3). Here the inflection point is represented by the local minimum between the distributions of counts for ‘true’ and ‘false’ barcodes. We can see that the local minimum is very close to the 384th cell barcode so this automated knee method seems to work well for this data. The obvious downside of this hard thresholding on read counts with the knee method is that there will likely be some overlap between the read counts for real and false barcodes (as we can see in Figure 3). This is why we believed the network-based methods may be more accurate.


Figure 3. Density of reads per SCRB-Seq cell barcode. Blue line represents the position of the 384th cell which is close to the inflection point in the curve in figure 2. Orange line represents the local minimum used in my implementation of the ‘knee’ method.
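A bare-bones version of this local-minimum search might look like the following (a histogram-based sketch under simplifying assumptions, with a hypothetical function name; the actual implementation works from a smoothed density rather than a raw histogram):

```python
import math

def knee_threshold(counts, n_bins=50):
    """Estimate the knee as the local minimum in the density of
    log10(reads per barcode) between the 'false' and 'true' peaks.
    Returns a read-count threshold; barcodes above it are accepted."""
    logs = [math.log10(c) for c in counts]
    lo, hi = min(logs), max(logs)
    width = (hi - lo) / n_bins
    heights = [0] * n_bins
    for x in logs:
        heights[min(int((x - lo) / width), n_bins - 1)] += 1
    # the two peaks: the tallest bin, and the tallest bin far from it
    p1 = max(range(n_bins), key=heights.__getitem__)
    p2 = max((i for i in range(n_bins) if abs(i - p1) > n_bins // 4),
             key=heights.__getitem__)
    left, right = min(p1, p2), max(p1, p2)
    # take the middle of the lowest stretch between the two peaks
    valley_h = min(heights[left + 1:right])
    valleys = [i for i in range(left + 1, right) if heights[i] == valley_h]
    mid = valleys[len(valleys) // 2]
    return 10 ** (lo + (mid + 0.5) * width)
```

On a toy bimodal distribution (many low-count false barcodes, a few high-count true barcodes) the threshold lands in the valley between the two modes, and counting barcodes above it recovers the true set.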

To compare with our network-based methods, I took increasing samples of reads from the SCRB-Seq sample, estimated the true cell barcodes using the knee method and our ‘adjacency’ and ‘directional’ methods, and calculated the sensitivity and false positive rate (FPR) of each approach. As you can see in Figure 4, the ‘directional’ method has a slightly higher sensitivity and lower FPR than the ‘knee’ method, which suggests that this network-based method might be more suitable for identifying the true cell barcodes. The knee method slightly outperforms just selecting the top 384 most abundant barcodes (‘top 384’) in terms of higher sensitivity, although this comes at the cost of an increased FPR. The ‘adjacency’ method has a high sensitivity but tends towards overestimating the number of true cell barcodes as the number of sampled reads is increased. This is because adjacency only groups barcodes together if each is separated from the same hub barcode by a single edge. Hence, any barcode with 2 or more errors will always be in a separate group from the true barcode from which it originates. In contrast, the directional method is not limited to single edge paths.


Figure 4. Sensitivity (Top) and False Positive Rate (FPR; Middle, Bottom) for 3 methods to detect the number of true cells sequenced. “Top 384” = select the top 384 barcodes. Shown for comparison with knee method. Middle = all samples. Bottom = adjacency excluded to show difference between knee and directional.
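For background, the ‘directional’ grouping can be sketched as below (substitution-only distance here for brevity; the count condition count(A) >= 2*count(B) - 1 follows the UMI-tools paper):

```python
def directional_groups(counts):
    """Group barcodes with the 'directional' network method: an edge
    runs from A to B when they are one substitution apart and
    count(A) >= 2*count(B) - 1. Each group's root (its highest-count
    member) is a putative true barcode."""
    def one_sub(a, b):
        return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1

    ordered = sorted(counts, key=counts.get, reverse=True)
    assigned, groups = set(), []
    for root in ordered:
        if root in assigned:
            continue
        group, stack = {root}, [root]
        while stack:
            node = stack.pop()
            for other in ordered:
                if other not in assigned and other not in group \
                        and one_sub(node, other) \
                        and counts[node] >= 2 * counts[other] - 1:
                    group.add(other)
                    stack.append(other)
        assigned |= group
        groups.append(group)
    return groups
```

Counting the groups gives the estimated number of true barcodes. Note that in the example below, AATT is two errors from AAAA, yet it is still reached via the chain through AAAT, which is exactly what adjacency cannot do.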

Next, I took a single sample from the drop-seq paper where the authors had inferred that there were ~570 cells using a visual assessment of the cumulative frequency plot (this is the sample shown in Figure 1). They also backed up this assertion with an assessment of the species-specificity of the reads for each cell barcode, which suggested that most of the “cells” below this inflection point were actually droplets which did not contain a cell and simply contained ‘ambient’ RNA. So, 570 seemed a reasonable truth to test the method against. My ‘knee’ method worked great: using the first 50M reads, it estimated that there were 573 cell barcodes (Figure 5).


Figure 5. Cumulative frequency plot for cell barcodes in the first 50M drop-seq reads. The ‘knee’ method estimated 573 true cell barcodes and hence is plotted over the ‘top 570’ inflection point prediction. The X-axis has been limited to the 10 000 most abundant cell barcodes.

In contrast, the directional method failed miserably! For starters, the sheer number of cell barcodes precluded the inclusion of all cell barcodes detected, as the network building requires an all-vs-all comparison between the barcodes to calculate the edit distances. With 50M reads, the sensitivity was OK (95%), but the specificity was just 11%, and the specificity unexpectedly increased as the number of reads parsed was decreased! Then came the realisation…


As mentioned above, the network-based methods were designed to deal with a pool of barcodes, where some barcodes are true and the remaining barcodes are the result of errors from the true barcodes. However, here we have a situation where the ‘false’ barcodes could be due to either errors (I’ll call these ‘error barcodes’), or droplets without a cell which have amplified ambient RNA (‘ambient barcodes’). We can identify these two separate types of false barcodes by looking at the minimum edit-distance between each barcode and the true barcodes. For error barcodes, we would expect this minimum edit-distance to be 1 in most cases. For ambient barcodes, we expect a similar distribution of minimum edit distances relative to the true barcodes as barcodes selected at random. Figure 6 below shows the distribution of counts per barcode, split into 4 bins. Figure 7 shows the distribution of edit-distances for each bin and a null distribution from a randomly generated set of cell barcodes. Bin 4 contains the true barcodes and shows a null-like distribution of edit-distances as expected. We can clearly see that bin 3, which has slightly lower counts per cell barcode compared to the true barcodes, is enriched in single edit-distances relative to the null. This strongly suggests many of these cell barcodes are error barcodes with a single error relative to the true cell barcode from which they originate. However, bin 2, and bin 1 especially, show a more null-like distribution, suggesting most of these cell barcodes are instead ambient barcodes.


Figure 6. Distribution of reads per cell barcode (log 10) from first 50M reads, separated into 4 bins. Bin 4 represents the ‘true’ barcodes.


Figure 7. Distribution of minimum edit distance compared to true barcodes (bin 4) for each bin shown in figure 6. Random = 10 000 randomly generated cell barcodes.
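The minimum edit-distance calculation behind Figure 7 is just a standard dynamic-programming edit distance (a plain-Python stand-in for the cythonised function in UMI-tools):

```python
def levenshtein(a, b):
    """Edit distance (substitutions, insertions and deletions) between
    two strings, computed row by row to keep memory at O(len(b))."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

def min_distance_to_true(barcode, true_barcodes):
    """Minimum edit distance from a barcode to any accepted (bin 4)
    barcode, as plotted per bin in Figure 7."""
    return min(levenshtein(barcode, t) for t in true_barcodes)
```

Error barcodes give a minimum distance of 1 in most cases, whereas ambient barcodes give a null-like spread of distances.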


In summary, the knee method as implemented here works very well with the samples tested so far, so we can proceed with automatic detection of the number of true cell barcodes with UMI-tools extract. We will also enable the use of the directional method since this is more accurate when you know that all false cell barcodes are the result of errors. However, I’d recommend sticking with the knee method unless you are absolutely certain this is the case.

Finally, if you’re already using UMI-tools for scRNA-Seq, or thinking about doing so, keep a look out for version 0.5. Along with the added functionality described here, we’re adding a count command to yield counts per cell per gene directly from featureCounts/HTSeq output and we hope to include a scRNA-Seq-specific tutorial for this release too.

EDIT: Following the comments below I’ve continued this analysis into a second post to look at INDELs in cell barcodes and comparing UMI-tools with sircel.

Why you should use alignment-independent quantification for RNA-Seq

[Edit] I’ve changed the title to better reflect the conclusions drawn herein.

This post follows on previous posts about the wonderful new world of alignment-free quantification (Road-testing Kallisto, Improving kallisto quantification accuracy by filtering the gene set). Previously I focused on the quantification accuracy of kallisto and how to improve this by removing poorly supported transcripts. Two questions came up frequently in the comments and discussions with colleagues.

  1. Which “alignment-independent” tool (kallisto, salmon or sailfish) is best?
  2. How do the “alignment-independent” tools compare to the more commonplace “alignment-dependent” tools for read counting (featureCounts/HTSeq) or isoform abundance estimation (cufflinks2)?

To some extent these questions have been addressed in the respective publications for the alignment-independent tools (kallisto, salmon, sailfish). However, the comparisons presented don’t go into any real detail and, unsurprisingly, the tool presented in the publication always shows the highest accuracy. In addition, all these tools have undergone fairly considerable updates since their initial release. I felt there was plenty of scope for a more detailed analysis, especially since I’ve not yet come across a detailed comparison between the alignment-independent tools and read counting methods. Before I go into the results, first some thoughts (and confessions) on RNA-Seq expression quantification.

Gene-level vs. transcript-level quantification

When thinking about how to quantify expression from RNA-Seq data, a crucial consideration is whether to quantify at the transcript-level or gene-level. Transcript-level quantification is much less accurate than gene-level quantification since transcripts contain much less unique sequence, largely because different transcripts from the same gene may contain some of the same exons and untranslated regions (see Figure 1).


Figure 1. Distribution of unique sequence for genes and transcripts. The fraction of unique kmers (31mers) represents the “uniqueness” of the transcript. For transcripts, each kmer is classified as unique to that transcript or non-unique. For genes, a kmer is considered unique if it is only observed in transcripts of from a single gene. Genes are more unique than transcripts.
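To make the uniqueness definition concrete, here is a sketch of the gene-level calculation (a toy version with a made-up data layout and a small k for illustration; the real analysis used 31mers over the full transcriptome):

```python
def kmers(seq, k=31):
    """The set of kmers of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def unique_kmer_fractions(transcripts, k=31):
    """transcripts: dict of transcript_id -> (gene_id, sequence).
    For each transcript, the fraction of its kmers observed only in
    transcripts of its own gene (the gene-level notion of 'unique')."""
    # record which gene 'owns' each kmer; None marks a kmer shared
    # between genes
    owner = {}
    for tid, (gene, seq) in transcripts.items():
        for km in kmers(seq, k):
            owner[km] = gene if owner.get(km, gene) == gene else None
    fractions = {}
    for tid, (gene, seq) in transcripts.items():
        ks = kmers(seq, k)
        fractions[tid] = sum(owner[km] == gene for km in ks) / len(ks)
    return fractions
```

A kmer shared between two transcripts of the same gene still counts as unique at the gene level, which is why genes come out more "unique" than their transcripts in Figure 1.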

The biological question at hand will obviously largely dictate whether transcript-level quantification is required, but other factors are also important, including the accuracy of the resultant quantification and the availability of tools for downstream analyses. For my part, I’ve tended to focus on gene-level quantification unless there is a specific reason to consider individual transcripts, such as a splicing factor mutation which will specifically impact the expression of particular isoforms rather than genes as a whole. I think this is probably a common view in the field. In addition, every RNA-Seq dataset I’ve worked with has always been generated for the purposes of discovering differentially expressed genes/transcripts between particular conditions or cohorts. A considerable amount of effort has been made to decide how best to model read count gene expression data and, as such, differential expression analysis with read count data is a mature field with well supported R packages such as DESeq2 and edgeR. In contrast, differential expression using isoform abundance quantification is somewhat of a work in progress. Of course, it’s always possible to derive gene-level quantification by simply aggregating the expression from all transcripts for each gene. However, until recently, transcript-level quantification tools did not return estimated transcript counts but rather some form of normalised expression such as fragments per kilobase per million reads (FPKM). Thus, to obtain gene-level counts for DESeq2/edgeR analysis, one needed to either count reads per gene or count reads per transcript and then aggregate. Transcript-level read counting is not a simple task: when a read counting tool encounters a read which could derive from multiple transcripts, it has to either disregard the read entirely, assign it to a transcript at random, or assign it to all transcripts.
However, when performing read counting at the gene level, the tool only needs to ensure that all the transcripts a given read may originate from belong to the same gene in order to assign the read to the gene. As such, my standard method for basic RNA-Seq differential expression analysis has been to first align to the reference genome and then count reads aligning to genes using featureCounts. Since the alignment-independent tools all return counts per transcript, it’s now possible to count reads per transcript and aggregate to gene-level counts for DESeq2/edgeR analysis. Note: this requires rounding the gene-level counts to the nearest integer, which results in an insignificant loss of accuracy.
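The aggregation itself is trivial (a minimal sketch, assuming a transcript-to-gene mapping extracted from, e.g., the GTF):

```python
def gene_counts(transcript_counts, tx2gene):
    """Aggregate estimated transcript counts to gene-level counts,
    rounding to the nearest integer for DESeq2/edgeR input."""
    totals = {}
    for tx, count in transcript_counts.items():
        gene = tx2gene[tx]
        totals[gene] = totals.get(gene, 0.0) + count
    return {gene: round(total) for gene, total in totals.items()}
```

The rounding only matters because the count-based models expect integer counts; the fractional part is estimation noise anyway.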

Simulation results

Given my current RNA-Seq analysis method and my desire to perform more accurate gene-level and transcript-level analysis, the primary questions I was interested in were:

  1. How do the three alignment-free quantification tools compare to each other for transcript-level and gene-level quantification?
  2. For transcript-level quantification, how does alignment-independent quantification compare to alignment-dependent (cufflinks2)?
  3. For gene-level quantification, how does alignment-independent quantification compare to alignment-dependent (featureCounts)?

To test this, I simulated a random number of RNA-Seq reads from all annotated transcripts of human protein coding genes (hg38, Ensembl 82; see methods below for more detail) and repeated this 100 times to give me 100 simulated paired end RNA-Seq samples.

Quantification was performed using the alignment-independent tools directly from the fastq files, using the same set of annotated transcripts as the reference transcriptome. For the alignment-dependent methods, the alignment was performed with Hisat2 before quantification using Cufflinks2 (transcripts), featureCounts (genes) and an in-house read counting script, “gtf2table” (transcripts & genes). The major difference between featureCounts and gtf2table is how they deal with reads which could be assigned to multiple features (genes or transcripts). By default featureCounts ignores these reads whereas gtf2table counts the read for each feature.

In all cases, default or near-default settings were used (again, more detail in the methods). Transcript counts and TPM values from the alignment-independent tools were aggregated to gene counts. To maintain uniform units of expression, cufflinks2 transcript FPKMs and gtf2table transcript read counts were converted to transcript TPMs.
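The conversion to TPM only requires rescaling the values so they sum to one million, since FPKM and TPM are both length-normalised and differ only by a constant factor within a sample:

```python
def fpkm_to_tpm(fpkms):
    """Convert per-transcript FPKM values to TPM:
    TPM_i = FPKM_i / sum(FPKM) * 1e6."""
    total = sum(fpkms.values())
    return {tx: f / total * 1e6 for tx, f in fpkms.items()}
```

By construction the TPM values for a sample always sum to 1e6, which is what makes them comparable across samples.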

Gene-level results

A quick look at a single simulation indicated that the featureCounts method underestimates the abundance of genes with less than 90% unique sequence, which is exactly what we’d expect, as reads which could be assigned to multiple genes will be ignored. See Figures 2-5 for a comparison with salmon.


Figure 2. Correlation between ground truth and featureCounts estimates of read counts per gene for a single simulation. Each point represents a single gene. Point colour represents the fraction of unique kmers (see Figure 1).


Figure 3. Correlation between ground truth and salmon estimates of read counts per gene for a single simulation. Each point represents a single gene. Point colour represents the fraction of unique kmers (see Figure 1).


Figure 4. Impact of fraction unique sequence on featureCounts gene-level read count estimates. X-axis shows the upper end of the bin. Y-axis shows the difference between the log2 ground truth counts and log2 featureCounts estimate. A clear underestimation of read counts is observed for genes with less unique sequence.


Figure 5. Impact of fraction unique sequence on salmon gene-level read count estimates. X-axis shows the upper end of the bin. Y-axis shows the difference between the log2 ground truth counts and log2 salmon estimate. Genes with less unique sequence are more likely to show a greater difference to the ground truth but, in contrast to featureCounts (Figure 4), there is no bias towards underestimation for genes with low unique sequence. The overall overestimation of counts is due to the additional pre-mRNA reads which were included in the simulation (see methods).

Since differential expression analyses are relative, the underestimation by featureCounts need not be a problem so long as it is consistent between samples, as we might expect given that the fraction of unique sequence for a given gene is invariant. With this in mind, to assess the accuracy of quantification, I calculated the Spearman’s rank correlation coefficient (Spearman’s rho) for each transcript or gene over the 100 simulations. Figure 6 shows the distribution of Spearman’s rho values across bins of genes by fraction unique sequence. As expected, the accuracy of all methods is highest for genes with a greater proportion of unique sequence. Strikingly, the alignment-independent methods outperform the alignment-dependent methods for genes with <80% unique sequence (11% of genes). At the extreme end of the scale, for genes with 1-2% unique sequence, median Spearman’s rho values for the alignment-independent methods are 0.93-0.94, compared to 0.7-0.78 for the alignment-dependent methods. There was no consistent difference between featureCounts and gtf2table, although gtf2table tended to show a slightly higher correlation for more unique genes. There was no detectable difference between the three alignment-free methods.


Figure 6. Spearman’s correlation Rho for genes binned by sequence uniqueness as represented by fraction unique kmers (31mers). Boxes show inter-quartile range (IQR), with median as black line. Notches represent confidence interval around median (+/- 1.57 x IQR/sqrt(n)). Whiskers represent +/- 1.5 * IQR from box. Alignment-independent tools are indistinguishable and more accurate than alignment-dependent workflows. featureCounts and gtf2table do not show consistent differences. All methods show higher accuracy with greater fraction unique sequence


Transcript-level results

Having established that alignment-independent methods are clearly superior for gene-level quantification, I repeated the analysis with the transcript-level quantification estimates. First of all,  I confirmed that the correlation between ground truth and expression estimates is much worse at the transcript-level than gene-level (Figure 7) as we would expect due to the aforementioned reduction in unique sequences when considering transcripts.

Figure 7. Correlation coefficient histogram for transcript-level and gene-level salmon quantification


Figure 8 shows the same boxplot representation as Figure 6 but for transcript-level quantification. This time only salmon is shown for the alignment-independent methods as kallisto and sailfish were again near identical to salmon. For the alignment-dependent methods, cufflinks2 and gtf2table are shown. Again, the alignment-independent methods are clearly more accurate for transcripts with <80% unique sequence (96% of transcripts), and more unique transcripts were more accurately quantified with alignment-independent tools or gtf2table. Oddly, cufflinks2 does not show a monotonic relationship between transcript uniqueness and quantification accuracy, although the most accurately quantified transcripts were those with 90-100% unique sequence. gtf2table is less accurate than cufflinks2 for transcripts with <15% unique sequence but as transcript uniqueness increases beyond 20%, gtf2table quickly begins to outperform cufflinks2, with median Spearman's rho >0.98 for transcripts with 70-80% unique sequence compared to 0.61 for cufflinks2.


Figure 8. Spearman’s correlation Rho for transcripts binned by sequence uniqueness as represented by fraction unique kmers (31mers). Boxes show inter-quartile range (IQR), with median as black line. Notches represent confidence interval around median (+/- 1.57 x IQR/sqrt(n)). Whiskers represent +/- 1.5 * IQR from box. Alignment-independent tools are indistinguishable (only salmon shown) and more accurate than alignment-dependent workflows. gtf2table is more accurate than cufflinks for transcripts with >20% unique sequence. Alignment-independent methods and gtf2table show higher accuracy with greater fraction unique sequence. Cufflinks does not show a monotonic relationship between fraction unique kmers bin and median correlation coefficient.

The high accuracy of gtf2table for highly unique transcripts indicates that read counting is accurate for these transcripts. However, this method is far too simplistic to achieve accurate estimates for transcripts with less than 50% unique sequence (>86% of transcripts) for the reasons described above, and thus transcript read counting is not a sensible option for a complex transcriptome. The poor performance of cufflinks2 relative to the alignment-independent methods is not a surprise, not least because the alignment step introduces additional noise into the quantification from misaligned reads etc. However, the poor performance relative to simple read counting suggests the inaccuracy is largely due to the underlying algorithm for estimating abundance, which is less accurate than read counting for transcripts with >15% unique sequence. This is really quite alarming if true (I'd be very happy to hear any ideas about where I might have gone wrong with the cufflinks2 quantification).

Overall conclusions and thoughts

As expected, accuracy was much higher where the transcript or gene contains a high proportion of unique sequence, and hence gene-level quantification is overall much more accurate. Unless one specifically requires transcript-level quantification, I would therefore always recommend using gene-level quantification.

From the simulations presented here, I'd go as far as saying there is no reason whatsoever to use read counting if you want gene counts, and although I've not tested HTSeq, I have no reason to believe it should perform significantly better. From these simulations, it appears to be far better to use alignment-independent quantification and aggregate to the gene level. Indeed, a recent paper by Soneson et al. recommends exactly this.

Even more definitively, I see no reason to use alignment-dependent methods to obtain transcript-level quantification estimates, since the alignment-independent methods are considerably more accurate in the simulations here. Of course, these tools rely upon having a reference transcriptome to work with, which may require prior alignment to a reference genome where annotation is poor. However, once a suitable reference transcriptome has been obtained using cufflinks2, Bayesembler, trinity etc., it makes much more sense to switch to an alignment-independent method for the final quantification.

In addition to the higher accuracy for the point expression estimates, the alignment-independent tools also allow the user to bootstrap the expression estimates to get an estimate of the technical variability associated with the point estimate. As demonstrated in the sleuth package, the bootstraps can be integrated into the differential expression to partition variance into technical and biological variance.

The alignment-independent methods are very simple to integrate into existing workflows and run rapidly with a single thread and minimal memory requirements so there really is no barrier to their usage in place of read counting tools. I didn’t observe any significant difference between kallisto, salmon and sailfish so if you’re using just the one, you can feel confident sticking with it! Personally, I’ve completely stopped using featureCounts and use salmon for all my RNA-Seq analyses now.

[EDIT] : Following the comment from Ian Sudbery I decided to have a look at the accuracy of expression for transcripts which aren’t expressed(!)

Quantification accuracy for unexpressed transcripts

Ian Sudbery comments below that he's observed abundant non-zero expression estimates with kallisto when quantifying against novel transcripts which he didn't believe were real and should therefore have zero expression. The thrust of Ian's comment was that this might indicate that alignment-dependent quantification might be more accurate when using de-novo transcript assemblies. This also led me to wonder whether kallisto is just worse at correctly quantifying expression of unexpressed transcripts, which causes problems when using incorrectly assembled transcript models. So I thought I would use the simulations to assess the "bleed-through" of expression from truly expressed transcripts to unexpressed transcripts from the same gene – something that CGAT's technical director Andreas Heger has also mentioned before. To test this I re-ran the RNA-Seq sample simulations but this time reads were only generated for 5592/19947 (28%) of the transcripts included in the reference genome; 72% of the transcripts were truly unexpressed.

The figure below shows the average expression estimate over the 100 simulations for the 14354 transcripts from which zero reads were simulated ("Absent") and the 5592 "Present" transcripts. Ideally, all the absent transcripts should have expression values of zero or close to zero. Where the expression is above zero, the implication is that the isoform deconvolution/expression estimation step has essentially misassigned expression from an expressed transcript to an unexpressed transcript. There are a number of observations from the figure. The first is that some absent transcripts can show expression values of the same order of magnitude as the present transcripts, regardless of the quantification tool used, although the vast majority do show zero or near-zero expression.
Within the three alignment-independent tools, kallisto and sailfish are notably more likely to assign truly unexpressed transcripts non-zero expression estimates, compared to salmon, even when the transcript sequence contains a high proportion of unique kmers. On the other hand, cufflinks2 performs very poorly for transcripts with little unique sequence but similarly to salmon as the uniqueness of the transcript increases. This clearly deserves some further investigation in a separate post…


Average expression estimates for transcripts which are known to be absent or present in the simulation. Fraction unique kmers (31-mers) is used to classify the “uniqueness” of the transcript sequence. Note that cufflinks2 systematically underestimates the expression of many transcripts, partly due to overestimation of the expression of “absent” transcripts.


The simulated RNA-Seq reads included random errors to simulate sequencing errors, and reads were simulated from pre-mRNA at 1% of the depth of the respective mRNA to simulate the presence of immature transcripts. No attempt was made to simulate biases such as library preparation fragmentation bias, random hexamer priming bias, three-prime bias or biases from sequencing. The purpose of this simulation was to compare the best possible accuracy achievable for each workflow under near-perfect conditions, without going down the rabbit hole of correctly simulating these biases.

[EDIT: The original text sounded like a get-out for my bolder statements] Some comparisons may not be considered "fair" given that the alignment-free methods only quantify against the reference transcriptome, whereas the genome alignment-based methods of course depend on alignment to the reference genome and hence can also be used to quantify novel transcripts. However, the intention of these comparisons was to provide myself and other potential users of these tools with a comparison between common workflows to obtain expression estimates for differential expression analysis, rather than directly testing e.g. kallisto vs. cufflinks2.

The intention of this simulation is to provide myself and other interested parties with a comparison of alignment-independent and dependent methods for gene and transcript level quantification. It is not intended to be a direct comparison between featureCounts and Cufflinks2 and salmon, sailfish and kallisto – a direct comparison is not possible due to the intermediate alignment step – it is a comparison of the alignment-dependent and independent workflows for quantification prior to differential expression analysis when a reference transcriptome is available.


If you really want some more detail…


Genome annotations

Genome build: hg38. Gene annotations: Ensembl 82. The "gene_biotype" field of the ensembl gtf was used to filter transcripts to retain those deriving from protein coding genes. Note, some transcripts from protein coding genes will not themselves be protein coding. In total, 143548 transcripts from 19766 genes were retained

Unique kmer counting

To count unique kmers per transcript/gene, I wrote a script available through the CGAT code collection, fasta2unique_kmers. This script first parses through all transcript sequences and classifies each kmer as "unique" or "non-unique" depending on whether it is observed in just one transcript or more than one transcript. The transcript sequences are then re-parsed to count the unique and non-unique kmers per transcript. To perform the kmer counting at the gene-level, kmers are classified as "unique" or "non-unique" depending on whether they are observed in just one gene (but possibly multiple transcripts) or more than one gene. The kmer size was set to 31.
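The two-pass logic can be sketched in plain python (an illustration of the algorithm described above, not the actual fasta2unique_kmers script; here each distinct kmer is counted once per transcript, which may differ from the script's exact accounting):

```python
from collections import defaultdict

def count_unique_kmers(transcript_seqs, k=31):
    """Two-pass unique k-mer counting.

    transcript_seqs: dict mapping transcript_id -> sequence.
    Returns dict mapping transcript_id -> (n_unique, n_non_unique).
    """
    # Pass 1: record which transcripts each k-mer occurs in
    kmer2transcripts = defaultdict(set)
    for tid, seq in transcript_seqs.items():
        for i in range(len(seq) - k + 1):
            kmer2transcripts[seq[i:i + k]].add(tid)

    # Pass 2: classify each transcript's distinct k-mers
    counts = {}
    for tid, seq in transcript_seqs.items():
        kmers = set(seq[i:i + k] for i in range(len(seq) - k + 1))
        unique = sum(1 for km in kmers if len(kmer2transcripts[km]) == 1)
        counts[tid] = (unique, len(kmers) - unique)
    return counts
```

For gene-level counting, the pass-1 sets would hold gene_ids rather than transcript_ids, so a kmer shared only between isoforms of one gene still counts as "unique".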


I’m not aware of a gold-standard tool for simulating RNA-Seq data. I’ve therefore been using my own script from the CGAT code collection. For each transcript, sufficient reads to make 0-20 copies of each transcript were generated. Paired reads were simulated at random from across the transcript with a mean insert size of 100 and standard deviation of 25. Naive sequencing errors were then simulated by randomly changing 1/1000 bases. In addition, reads were simulated from the immature pre-mRNA at a uniform depth of 1% of the mature mRNA. No attempt was made to simulate biases such as library preparation fragmentation bias, random hexamer priming bias or biases from sequencing, since the purpose of this simulation was to compare the best possible accuracy achievable for each workflow under near-perfect conditions.

Alignment-independent quantification

Kallisto (v0.43.0), Salmon (v0.6.0) and Sailfish (v0.9.0) were used with default settings, except that the strandedness was specified as --fr-stranded, ISF and ISF respectively. The Salmon index type was fmd. The kmer size was set to 31.

Simulations and alignment-independent quantification were performed using the CGAT pipeline pipeline_transcriptdiffexpression using the target “simulation”.

Read alignment

Reads were aligned to the reference genome (hg38) using hisat2 (v2.0.4) with default settings except --rna-strandness=FR, and the filtered reference junctions were supplied with --known-splicesite-infile.

Alignment-dependent quantification

featureCounts (v1.4.6) was run with default settings except -Q 10 (MAPQ >=10) and strandedness specified using -s 2. Cufflinks2 was run with default settings with the following additional options: --compatible-hits-norm --no-effective-length-correction. Removing these cufflinks2 options had no impact on the final results.





Speeding up python

I was recently contacted by Ilgar Abdullayev from the Ludwig Institute for Cancer Research, Karolinska Institutet with a query about why the script in our UMI-Tools repository was taking so long to run on some of his BAMs. Finding the bottleneck was relatively easy and the immediate solution was straightforward thanks to cython (and our technical director Andreas!). Thanks to Ilgar for sharing some of his data to let me optimise the code. Before the details though, I thought it might be a good time to take stock on why python is quick to write but slow to run. Skip to “Reducing the interpretation overhead” if you’re already a whiz at python.

Quick development

One of the things I really love about python is how quick it is to write functional code. Python is dynamically typed which means we don't have to worry about the type of a variable until we want to do something with it. The following is perfectly valid.

variable = "hello world"

variable = 6

It appears to the user that the type of variable has changed from string to int. Actually, we first created a python string object and bound the reference "variable" to it, then we created a new int object and bound the reference "variable" to it. The first object no longer has any references bound to it so it's garbage collected (i.e. deleted).
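We can make the rebinding visible with the built-in id() function, which returns an object's identity (in CPython, its memory address):

```python
variable = "hello world"
string_id = id(variable)   # identity of the str object

variable = 6               # rebind the name to a new int object

# The name now refers to a completely different object; the old
# string has no remaining references and can be garbage collected.
assert id(variable) != string_id
assert isinstance(variable, int)
```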

Another great feature of python is duck typing by which we can perform the same operation on different python types to different ends. The classic example being the use of “+” for both concatenating strings and for adding numeric types.

variable = "hello world"
print variable + ", I'm python"
hello world, I'm python

variable = 6
print variable + 2
8

Along with a comprehensive set of well-maintained modules for more complex classes, data manipulations and statistics, dynamic typing and duck typing makes writing a complex python script something that can be achieved in minutes rather than days.

Slower to run

The penalty you pay for dynamic typing is that the python code must be interpreted, and a section of code is re-interpreted each time it's run. In the, admittedly contrived, example below we can clearly see why the code in the loop needs to be interpreted anew each time.

dict_1 = {1:"bar", 0:"foo"}
list_1 = ["bar", "foo"]

objects = (dict_1, list_1)
for obj in (objects):
    print obj[0]


In the first loop, obj is a dictionary. The "[0]" index on a dictionary returns the value for the key 0, which in this case is "foo". In the second loop, obj is a list. The "[0]" operation on a list returns the element at index 0 (0-indexed), which in this case is the first element, "bar".

Reducing the interpretation overhead

The reinterpretation of code takes a lot of time and python is therefore considerably slower than languages like C. And this is where cython (see what they did there) comes in. Cython allows you to statically declare the type of an object, reducing the work of the python interpreter.

Now to return to the original problem with our script. For a background on our UMI-tools see (CGAT blog – UMIs). The script in question takes the alignment file (SAM/BAM format) and returns an alignment with the duplicated reads removed. In my hands this was taking about 10 seconds for 1 million reads which seems perfectly reasonable. For Ilgar however, it was taking around 12 minutes per 1 million reads which is less than ideal when alignment files may contain >100M reads.

To identify the bottleneck I first used a bit of trial and error to identify a single read start position which was particularly troublesome. In hindsight it was hardly surprising that this was an area of extreme coverage. With the python cProfile module we can identify the part of the script which was taking the longest time.

python -m cProfile -s tottime -I subset.bam -S out.bam --method=adjacency > profile.log

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 26352342   47.813    0.000   76.867    0.000
 26352397   20.992    0.000   20.992    0.000 {zip}
 26352353    8.062    0.000    8.062    0.000 {sum}

So the majority of the time was spent in the edit_dist function, which was run 26352342 times! Looking back at the script, we can see this is because our adjacency-based methods take the list of UMIs observed and create a dictionary mapping each UMI to the UMIs which are a single edit distance apart (hamming distance = 1). This is an all vs. all operation so the time taken increases quadratically with the number of UMIs observed. The obvious solution was therefore to improve the speed of the adjacency dictionary building. Let's look at the relevant portion of the original code.

def edit_dist(first, second):
    ''' returns the edit distance/hamming distances between
    its two arguments '''

    dist = sum([not a == b for a, b in zip(first, second)])
    return dist

def _get_adj_list_adjacency(self, umis, counts, threshold):
        ''' identify all umis within hamming distance threshold'''

        return {umi: [umi2 for umi2 in umis if
                      edit_dist(umi, umi2) <= threshold]
                for umi in umis}

The _get_adj_list_adjacency method uses a dictionary comprehension with the values generated by a list comprehension, so the edit_dist function is called on every combination of UMIs. For a quick improvement we can put the code for the edit distance calculation inside the list comprehension so we avoid the cost of the function call, but the gains are minimal. Alternatively, as you may have noticed, we're only interested in UMIs where the distance is <= threshold so we could optimise the edit_dist function by returning as soon as the distance is over the threshold:

def edit_dist(first, second, threshold):
    ''' returns 1 if the hamming distance is at most the threshold,
    0 otherwise '''

    dist = 0
    for a, b in zip(first, second):
        if a != b:
            dist += 1
        if dist > threshold:
            return 0
    return 1

def _get_adj_list_adjacency(self, umis, counts, threshold):
        '''identify all umis within hamming distance threshold'''

        return {umi: [umi2 for umi2 in umis if
                      edit_dist(umi, umi2, threshold)]
                for umi in umis}

This improves the speed ~2-fold which is nice but not really enough!

Speaking to Andreas about this problem, he quickly identified cython as a simple solution since the hamming distance calculation is very simple. Here's the cython code snippet and the pertinent python code.


cpdef int edit_distance(a, b):
    cdef char * aa = a
    cdef char * bb = b

    cdef int k, l, c
    c = 0
    l = len(a)
    for k from 0 <= k < l:
        if aa[k] != bb[k]:
            c += 1
    return c

import pyximport
from _dedup_umi import edit_distance

def _get_adj_list_adjacency(self, umis, counts, threshold):
        ''' identify all umis within hamming distance threshold '''

        return {umi: [umi2 for umi2 in umis if
                      edit_distance(umi, umi2) <= threshold]
                for umi in umis}

In the cython function we statically define the type for the two C strings a & b and the ints k, l & c. Then we count the positions in each string where the characters differ. In the python script we just need to import the pyximport module and use it to compile the _dedup_umi module. Then we import the edit_distance function and use it like any other python function. This simple modification reduces the run time on the troublesome genomic coordinate from 91 secs to 11 secs – the most time-consuming function for this particular region is now remove_umis.

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     6432    5.902    0.001    5.902    0.001
       11    3.626    0.330    5.544    0.504<dictcomp>)
 26359658    1.918    0.000    1.918    0.000 {_dedup_umi.edit_distance}

The cython function could be further improved along the lines of the python function by returning as soon as the threshold is exceeded. I’m sure there are other improvements we could make to the run time of the script. For example, transforming the UMI strings to 2bit arrays would make the hamming distance calculation a very cheap xor bitwise operation. However, I’ve not tested yet if the cost of the transformation outweighs the gains in the edit distance calculation.
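For what it's worth, here is a sketch of the 2-bit encoding idea in plain python (untested for speed, and assuming UMIs contain only ACGT; the function names are my own):

```python
_ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def pack_umi(umi):
    """Pack an ACGT string into an integer, 2 bits per base."""
    value = 0
    for base in umi:
        value = (value << 2) | _ENCODE[base]
    return value

def hamming_packed(packed_a, packed_b, length):
    """Hamming distance via xor: each differing base leaves a
    non-zero 2-bit group in the xor of the packed values."""
    xor = packed_a ^ packed_b
    dist = 0
    for _ in range(length):
        if xor & 0b11:
            dist += 1
        xor >>= 2
    return dist
```

The packing cost is paid once per UMI, after which each pairwise comparison is a single xor plus a scan of the 2-bit groups, so the saving should grow with the number of all-vs-all comparisons.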

This is the first time I've used cython – something that Ian Sudbery (UMI-tools co-author) had suggested might be worth considering a while ago. To be honest, the solution was so straightforward that Andreas ended up implementing it himself whilst he demonstrated it. I think it might be time for me to read that cython documentation fully now!

Improving kallisto quantification accuracy by filtering the gene set

In my previous post Road testing kallisto, I presented the results of a simulation study to test the accuracy of kallisto. Specifically, I hypothesised that kallisto would perform poorly for transcripts with little or no unique sequence and I wanted to use the simulation to flag these transcripts in my downstream analysis. The simulation was therefore intentionally naive and ignored biases from library preparation and sequencing so I could focus on the accuracy of quantification with “near perfect” sequence reads. The simulation results broadly fitted my expectations but also threw up a few more questions in the comments section. I’m going to post an update for each of the following questions:

  1. What’s the impact of including low-support transcripts on the quantification of well-supported transcripts?
  2. The correlation between truth and estimated counts drops off as a transcript has fewer unique k-mers. Is this also reflected in the bootstrap variances?
  3. How does kallisto compare to sailfish, salmon and cufflinks?
    • This comparison will come with caveats because of the way the reads were simulated

For this post I’ll focus on the first question.

The impact of including low-support transcripts on the quantification of well-supported transcripts

Previously I showed that the 9 DAZ2 transcripts (Ensembl v82) are very poorly quantified by kallisto. DAZ genes contain variable numbers of a repeated 72-bp exon (DAZ repeats) and I noted that transcript models for DAZ2 overlapped one another perfectly with regards to the splice sites, so some isoforms differ only in the number of DAZ repeats, hence 8/9 of the isoforms had no unique kmers whatsoever. I further suggested that the inclusion of low-support transcript models in the gene set may be reducing the accuracy of quantification for the high-support transcripts. I then suggested this was likely to be a problem at many other loci and filtering out low-support transcripts may be beneficial.

To test this, I created two genesets, "Filtered" and "Unfiltered", both containing transcripts from hg38 Ensembl v82 protein-coding genes. For the "Filtered" gene set I removed all transcripts with a low support level (TSL > 3; no supporting non-suspect mRNA or EST). For the "Unfiltered" gene set, I retained all transcripts. I then compared the accuracy of quantification over all transcripts with a support level of 1-3.

We would expect this filtering to increase the fraction of unique 31-mers in the remaining transcripts. Figure 1 demonstrates that this is indeed the case.

kmers filtered vs unfiltered

Figure 1. Distribution of fraction of unique 31-mers per transcript in the filtered and unfiltered gene sets

Figure 2 is a re-representation of the previous figure showing the correlation between the sum of ground truths across the 30 simulations and the sum of estimated counts, with separate panels for the two gene sets. The two panels only show quantification over the high-support transcripts. There's a clear but slight decrease in the accuracy of quantification when using the unfiltered gene set. With the arbitrary threshold of 1.5-fold that I used previously, 278 / 97375 (0.28%) are flagged as "poor" accuracy, compared to 448 / 97375 (0.46%) when we include the low-support transcripts. Honestly, I was expecting a bigger change, but this hard thresholding somewhat masks the overall decrease in accuracy.

kallisto Filtering geneset improves accuracy

Figure.2 Filtering low-support transcripts improves the quantification accuracy for high-support transcripts with kallisto

Figure 3 shows the same plot but focusing on the DAZ genes. All the transcript models for DAZ4 are support level 5 so this gene is not included. When the gene set is not filtered, one of the DAZ1 (ENSG00000188120) transcripts is accurately quantified, with the other transcripts having zero or close-to-zero estimated counts across the 30 simulations. The filtering leads to a much improved quantification accuracy for the other DAZ1 transcript and one of the DAZ2 (ENSG00000205944) transcripts. However, there is no real improvement for the remaining transcripts, with two now considerably over-estimated and the other 4 showing no improvement.

Kallisto filtering improves DAZ gene accuracy

Figure 3. Improvement in quantification of DAZ gene transcripts when low-support transcripts are filtered out

The other important metric when assessing the accuracy of quantification is the Pearson’s correlation coefficient (r) between the ground truth and estimate counts across the simulations. For the majority of transcripts r>0.9 even when using the unfiltered gene set, however there is a noticeable improvement when the gene set is filtered (Figure 4).


Figure 4. Distribution of Pearson’s correlation coefficients between ground truths and kallisto estimated counts across 30 simulations. Transcripts were quantified using a filtered and unfiltered gene set

We would expect the greatest improvement where the correlation is low with the unfiltered gene set and the filtering leads to a large gain in the fraction of unique 31-mers. To show this, Figure 5 presents the gain in fraction of unique 31-mers (x-axis) against the improvement in correlation with the filtered geneset (y-axis), split by the correlation when using the unfiltered geneset. The transcripts in the top left panel showed very poor correlation with the unfiltered gene set (<0.5), so improvements here may be partly affected by regression to the mean. Looking at the other panels which show the transcripts with correlation >0.5 when using the unfiltered gene set, we can see that as the gain in fraction of unique 31-mers approaches 1 the improvement in the correlation tends towards the maximum possible improvement. Even those transcripts which only show a very slight gain in the fraction of unique 31-mers show a considerable improvement in the correlation. This is in line with my observation in the previous post that kallisto shows a much better performance for transcripts with >0.05 unique kmers. 11047 / 97375 (11.3%) of high-support transcripts show at least a 0.05 increase in the fraction of unique kmers.

Transcripts unique kmers improved accuracy

Figure 5. Change in correlation when using the filtered gene set relative to the unfiltered gene set, split by the gain in fraction of unique 31-kmers and the correlation in the unfiltered gene set


The figures I’ve shown here fit perfectly with the expectation that filtering out the low-support transcripts increases the fraction of unique kmers in a subset of the remaining transcripts, in turn improving the accuracy of the estimated counts from kallisto. Removing the 44023 low-support transcripts improves the accuracy of the sum of estimates for the high-support transcripts, but the improvement is only slight. Whether this gain is worth removing 44023 isoforms is really down to how much you believe these low-support isoforms. Personally, I worry that a lot of these low-support transcripts are spurious models from poor transcript assembly.

The DAZ genes provide an insight into a very hard-to-quantify set of transcripts, indicating that there are gains to be made from filtering out the low-support transcripts but in some instances these will be minimal. Indeed, in some cases, even transcripts we know are real are just going to be near-impossible to quantify with short-read RNA-Seq due to their sequence similarity with other transcripts from the same locus and/or homologous loci.

The improvement to the correlation across the 30 simulations is somewhat clearer than the improvement in the sum of estimates, and most prominent where the gain in unique kmers is greatest, as expected. 11.3% of transcripts show a significant improvement in the fraction of unique kmers, which seems to me a worthwhile gain for the loss of the low-support transcripts from the gene set. The next step will be to explore the effect of the filtering on differential expression analysis, since this is what I want to use kallisto for, and I expect this will be its major use case.

UPDATE (22/10/15):

In response to Ian’s question below, I had a look at whether removing a random selection of transcript models has a similar impact on the accuracy of the remaining models. To do this, I identified the number of transcripts which would be removed by the transcript support level (TSL) filtering for each gene_id and removed a random set (of the same size) of transcript models for the gene_id. This should ensure the comparison is “fair” with regards to the complexity of the loci at which transcript models are removed and the total number of transcript models filtered out.
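The matched random filtering can be sketched as below (illustrative only; it assumes transcripts are already grouped by gene and the per-gene removal counts from the TSL filtering have been computed):

```python
import random

def matched_random_filter(transcripts_by_gene, n_removed_by_gene, seed=42):
    """For each gene, drop the same number of transcript models as
    TSL filtering would remove, but chosen at random."""
    rng = random.Random(seed)
    kept = {}
    for gene_id, transcripts in transcripts_by_gene.items():
        n_drop = n_removed_by_gene.get(gene_id, 0)
        dropped = set(rng.sample(transcripts, n_drop))
        kept[gene_id] = [t for t in transcripts if t not in dropped]
    return kept
```

Matching the removals per gene keeps the comparison fair with respect to locus complexity and the total number of models removed.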

My expectation was that low-support transcripts would be contaminated by transcript models from incomplete transcript assembly and would therefore be shorter and show a greater overlap with other transcript models from the same locus. I therefore expected that the removal of the random transcripts would also increase the fraction of unique kmers, as suggested by Ian, but that the effect would be greater when the low-support transcripts were removed. It turns out I was wrong! Figure 6 shows the distribution of fraction unique 31-mers with the unfiltered, TSL-filtered and randomly filtered annotations. The random filtering actually leads to a slightly greater increase in the fraction unique 31-mers for the remaining transcripts. This also leads to a slightly greater correlation for the remaining transcripts (Figure 7).

So what about the length and overlap of the low-support transcripts? Well, it turns out the transcript models do decrease in length from TSL1-4, but oddly transcripts with TSL5 ("no single transcript supports the model structure") are very similar in length to TSL1 transcripts (Figure 8). To compare the overlap between the transcript models, I took a shortcut and just used the fraction of unique kmers as a proxy: transcripts which overlap more with other transcripts from the same locus will have a lower fraction of unique kmers. This is complicated by homology between gene loci, but good enough for a quick check, I think. Figure 9 shows the fraction of unique kmers at each TSL. Again, TSL1-4 show a trend, with TSL5 looking more similar to TSL1 & 2. Interestingly, the trend is actually in the opposite direction to what I had expected, although this comes with the caveat that I'm not controlling for any differences in the genome-wide distribution of the different TSLs, so I'd take this with a pinch of salt.

Why TSL5 transcripts are more similar to TSL1-2 transcripts I'm not sure; perhaps this scale shouldn't be considered as ordinal after all. I think I'd better get in contact with Ensembl for some more information about the TSL. If anyone has any thoughts, please do comment below.

Of course, this doesn't take away from the central premise that better-supported transcript models are more likely to be "real", and that accurate quantification of these transcripts should be a priority over quantifying as many transcript models as possible, given that some of those models will be wrong.

Thanks to Ian for the suggestion; it turned up an unexpected result!


Figure 6. Impact of filtering on fraction unique 31-mers


Figure 7. Impact of filtering on correlation coefficient. X-axis trimmed to [0.7-1]


Figure 8. Transcript length (log10) split by Transcript Support Level (TSL)


Figure 9. Fraction unique 31-mers for transcripts split by Transcript Support Level (TSL)

Road-testing Kallisto

In this blog post I’m going to discuss some testing I’ve been doing with Kallisto, a lightning-fast tool for RNA-Seq transcript expression quantification. First though, I’ll discuss very briefly how Kallisto works (skip to “Testing Kallisto” if you don’t want my very simplified explanation of Kallisto).

Alignment-free quantification in RNA-Seq

RNA-Seq bioinformatics can take many forms but typically involves alignment of sequence reads to a reference genome, in order to establish where the reads originate from, followed by estimation of transcript abundance using a reference gene set. These steps are computationally heavy and can be very time-consuming even with multi-threading. Last year, Rob Patro et al. released Sailfish, which demonstrated that it is not necessary to actually align each read to the genome to obtain accurate transcript abundance estimates; all you really need to do is establish the most likely transcript for each read. They achieved this by shredding the transcriptome and reads into kmers (short overlapping sequences; see Fig. 1) and then matching the transcriptome and read kmers, which is very fast and has low memory usage. We discussed this paper in our journal club last year and I think most of us took a bit of time to accept that this was actually a viable approach; it just felt like you were losing too much information by shredding the reads into kmers. The results were very convincing though, and the Sailfish paper was quickly followed by further "alignment-free" tools, including RNASkim, a modified method using only a subset of the read kmers, an update to Sailfish called Salmon, and most recently Kallisto. Kallisto deviates slightly from the kmer-shredding method: it forms a de Bruijn graph from the transcript kmers and then aligns the read kmers to this transcriptome de Bruijn graph to form what the authors describe as "pseudoalignments" (Fig. 2). The accuracy of Kallisto is equivalent to the best-in-class methods whilst being >100x faster! In fact, Kallisto is so quick that it's perfectly feasible to bootstrap the quantification estimates 100 times to obtain an estimate of the technical variance, which is very useful when testing for differentially expressed transcripts (utilised by Sleuth).


Figure 1. Kmers. Sequences can be split into overlapping (k)mers where k is less than the length of the sequence. Typically for assembly purposes, consecutive kmers are offset by a single nucleotide and the kmers are transformed into a de Bruijn graph which links overlapping kmers. Figure from
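For illustration, shredding a sequence into kmers is a one-liner in Python (using a toy k of 4 here rather than the k=31 used elsewhere in this post):

```python
def kmers(seq, k=31):
    """Shred a sequence into its overlapping k-mers, stepping one
    nucleotide at a time."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

# A toy example with k=4:
print(kmers("ATGGCTA", k=4))  # ['ATGG', 'TGGC', 'GGCT', 'GCTA']
```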

kallisto method schematic

Figure 2. Schematic representation of pseudoalignment. a. Transcripts (shown in colours) are shredded into kmers, which are then used to form a coloured de Bruijn graph (c) in which each transcript corresponds to a path through the graph and each node is a kmer. A read (shown in black at the top) is aligned to the graph by matching kmers to find the compatibility of the read with the transcripts. In this case, the read could have originated from either the pink or blue transcripts. Figure adapted from the Kallisto publication.
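To make the compatibility idea concrete, here is a heavily simplified sketch: a plain hash index stands in for kallisto's coloured de Bruijn graph, and none of kallisto's skipping optimisations or the downstream EM step are included. The function names (`build_index`, `pseudoalign`) are my own, for illustration only:

```python
def build_index(transcriptome, k=31):
    """Map each k-mer to the set of transcripts containing it
    (a hash-based stand-in for the coloured de Bruijn graph)."""
    index = {}
    for name, seq in transcriptome.items():
        for i in range(len(seq) - k + 1):
            index.setdefault(seq[i:i + k], set()).add(name)
    return index

def pseudoalign(read, index, k=31):
    """Intersect the transcript sets of the read's k-mers to find which
    transcripts the read is compatible with."""
    compatible = None
    for i in range(len(read) - k + 1):
        hits = index.get(read[i:i + k])
        if hits is None:
            continue  # k-mer absent from the transcriptome (e.g. a sequencing error)
        compatible = hits if compatible is None else compatible & hits
    return compatible or set()
```

With two toy transcripts differing only at their final base, a read covering the shared region pseudoaligns to both, while a read covering the distinguishing base pseudoaligns to just one, mirroring the pink/blue example in Figure 2.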

Testing Kallisto

As I mentioned briefly in a previous blog post, Highlights from Genome Science 2015, Mick Watson gave a very engaging presentation of his recent paper with Christelle Robert at Genome Science 2015, in which they identified human genes that cannot be accurately quantified in isolation using standard RNA-Seq analyses. He explained that for some genes, many of the reads which originate from the gene will align to multiple places in the genome, so the quantification of these genes is very inaccurate with many methods. Their proposal is to use the multi-mapping reads to define "multi-map groups" (MMGs) which can be treated as a meta-gene for the downstream differential expression testing. Mick presented an example where this analysis of MMGs leads to an interesting finding which would be missed by standard analyses.

After Mick’s talk, I got thinking about a similar analysis for Kallisto to flag transcripts which it cannot accurately quantify. There’s been lots of excitement about the speed of the alignment-free methods, and with the introduction of Sleuth, a pipeline from sequence reads to differentially expressed transcripts can now easily analyse dozens of samples on a small cluster within just a few hours. However, I wanted to get a feel for the accuracy of Kallisto before I jumped in with any real data. My assumption was that transcripts with little unique sequence would be very difficult to quantify with Kallisto (or any other tool for that matter). Originally I followed the methodology from Robert and Watson and simulated 1000 reads per transcript, but I quickly realised this was inappropriate: because each transcript has the same number of reads assigned to it, the simulation wasn’t properly testing whether the expectation-maximization (EM) algorithm used by Kallisto is able to correctly resolve the number of reads per transcript. I therefore took a slightly different approach and simulated 1-100 reads at random from each transcript in my reference gene set (hg38 Ensembl v.80, filtered to retain transcripts from protein-coding genes only), with 30 repeat simulations. The two metrics I was interested in were the difference between the sum of ground truths and the sum of Kallisto estimated counts, and the correlation between the ground truth and the estimated counts per simulation. The first metric describes how accurately Kallisto quantifies the true abundance of a transcript within the transcriptome, and the second describes how accurate the Kallisto estimates are for differential expression purposes. N.B. it’s possible that a transcript may show a very good correlation even though the estimated counts are systematically over- or under-estimated.
For the purposes of differential expression analysis this may not be so much of an issue so long as the fold changes in expression are accurately captured.
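The simulation design and the two metrics can be sketched roughly as below. The function names are my own, and in the real pipeline reads would be simulated from the transcript sequences and quantified with Kallisto in between the two steps shown here:

```python
import math
import random

def simulate_truth(transcript_ids, n_sims=30, low=1, high=100):
    """Draw a uniform-random ground-truth read count per transcript per
    simulation, so the EM step has unequal abundances to resolve."""
    return {t: [random.randint(low, high) for _ in range(n_sims)]
            for t in transcript_ids}

def pearson(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def accuracy_metrics(truth, estimates):
    """Per transcript: the log2-fold difference between summed estimates and
    summed ground truths (overall quantification accuracy), and the
    correlation across simulations (accuracy for DE purposes)."""
    return {t: (math.log2(sum(estimates[t]) / sum(truth[t])),
                pearson(truth[t], estimates[t]))
            for t in truth}
```

Note how the two metrics can disagree: estimates of [20, 40, 60] against truths of [10, 20, 30] give a perfect correlation of 1.0 despite a systematic 2-fold overestimate, which is exactly the caveat above.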

Figure 3 shows the relationship between the number of unique kmers (here shown as the fraction of all kmers across the transcript) and the fold difference between the sum of ground truths and the sum of Kallisto estimates. As expected, transcripts with a higher fraction of unique kmers are quantified more accurately by Kallisto. However, I was surprised to see that Kallisto still seems to accurately quantify many of the transcripts which don’t have a single unique kmer(!). I’m not sure whether this might somehow be an artefact of the simulation.

Kallisto simulation accuracy fold difference

Figure 3. Increased fraction of unique kmers is associated with lower fold differences between Kallisto estimated counts and ground truth. Transcripts were binned by the fraction of kmers which are unique to the transcript. The log2-fold difference between the sum of ground truths and Kallisto estimated counts is shown, along with a boxplot per bin to show the distribution within each bin. Transcripts with absolute differences > 2-fold are shown in red at 1 or -1 respectively.
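For reference, the fraction of unique kmers per transcript (the binning variable in Figure 3) can be computed with a simple sketch like the following; this is my own toy reimplementation, not the code used to generate the figures:

```python
from collections import Counter

def unique_kmer_fraction(transcriptome, k=31):
    """For each transcript, the fraction of its k-mers that occur in no
    other transcript in the gene set."""
    counts = Counter()
    per_tx = {}
    for name, seq in transcriptome.items():
        kms = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        per_tx[name] = kms
        counts.update(kms)  # count each distinct k-mer once per transcript
    return {name: sum(counts[km] == 1 for km in kms) / len(kms)
            for name, kms in per_tx.items()}
```

Counting each distinct kmer once per transcript (rather than once per occurrence) means a kmer repeated within a single transcript still counts as unique to it.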

Figure 4 shows the overall correlation between the sum of ground truths and the sum of estimated counts. Transcripts where the estimated counts were 1.5-fold higher or lower than the ground truths are flagged as “Poor” accuracy. In total, just 701/182510 (0.46%) transcripts failed this threshold. With an even tighter threshold of 1.2-fold, still only 2.8% of transcripts fail. There were a few transcripts which are considerably overestimated in each individual simulation, the most extreme of which is ENST00000449947, a transcript of the Azoospermia factor DAZ2 which is ~9-fold overestimated on average across the simulations. When we look at the transcript table and model for DAZ2 (Fig. 5 & Fig. 6), we can see that there are 9 transcripts within this locus, and they appear to vary only in the inclusion/exclusion of particular exons, so their sequences will be very similar. In fact, of the 9 transcripts, only ENST00000382440 contains any unique kmers (13/3223; 0.4%), whereas the rest contain 1784-3232 non-unique kmers. It’s no surprise then that Kallisto struggles to accurately quantify the abundance of all 9 transcripts (Fig. 7), with three of the transcripts having zero estimated counts across all simulated data sets. When we look back at the Ensembl table, we can see that a number of the transcripts are annotated with the flag TSL:5 (Transcript Support Level 5), which indicates that there are no mRNAs or ESTs supporting the transcript, i.e. it’s highly suspect. If all these TSL:5 transcripts were removed from the gene set, every transcript would have some unique kmers, which would certainly improve the Kallisto quantification for the transcripts of this gene. Since these transcripts are likely to be mis-assembly artefacts, it seems sensible to suggest they should be removed from the gene set to allow accurate quantification of the other transcripts of this gene.

Kallisto simulation accuracy correlation

Figure 4. Correlation between the sum of ground truths and estimated counts from Kallisto across all simulations (n=30). The number of reads per transcript per simulation was randomly sampled from a uniform distribution [1-100], hence the sum of ground truths is not the same for each transcript. Transcripts for which the Kallisto estimates are >1.5-fold higher or lower are flagged as “Poor” quality.

Ensembl transcripts

Figure 5. ENSG00000205944 Transcript table from Ensembl.

Ensembl transcript models

Figure 6. ENSG00000205944 Transcript models from Ensembl.

Kallisto accuracy DAZ2

Figure 7. As per Figure 4 except transcripts from gene DAZ2 (ENSG00000205944) are highlighted.

So far, so good for most (99.54%) transcripts with regards to accurate overall quantification. The next thing I wanted to look at was the Pearson product-moment correlation coefficient (r) between the ground truth and estimated counts. Most of the time I’m interested in doing differential expression testing with RNA-Seq. I know that with a real data set the estimates themselves will be systematically biased by all sorts of things, chiefly amplification biases and Illumina sequencing biases, but I need to know the Kallisto estimates correlate well with the transcript abundance. Figures 8 & 9 show the relationship between the fraction of unique kmers and the correlation between ground truth and estimated counts. Again, as expected, Kallisto performs much better when the transcript contains a high fraction of unique kmers, with a clear drop-off in performance below 5% unique kmers. Yet again, Kallisto still performs very well for some transcripts which don’t have a single unique kmer. In fact, it does slightly better for transcripts with 0-1% unique kmers than those with 1-2% unique kmers(!), which I’m at a complete loss to explain. Perhaps someone from Lior Pachter’s group can explain how this is possible?

In this case it’s slightly more difficult to define a threshold to flag transcripts with a low correlation, as the range of ground truths will differ between transcripts, so the comparisons wouldn’t be on the same basis. For my purposes, I’ve just flagged any transcript with less than 5% unique kmers as being difficult to quantify for now.

If we were to set a correlation coefficient threshold of 0.75, then 10.3% of transcripts would be flagged as failing the threshold. We can see in Figure 10 that transcripts from loci with a larger number of transcripts show much worse correlation. Returning to the example of ENSG00000205944 from earlier, all the transcripts fail to meet this threshold, with ENST00000382440 (the only transcript with unique kmers) showing the highest correlation (0.729).
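Putting the two criteria together, the flagging logic might look something like this (a hypothetical helper using the thresholds from this post, not code from the actual pipeline):

```python
import math

def flag_transcripts(fold_diffs, unique_fractions,
                     fold_threshold=1.5, unique_threshold=0.05):
    """Flag a transcript as difficult to quantify if its summed estimates
    deviate by more than fold_threshold from the summed ground truths
    (fold_diffs holds log2-fold differences), or if fewer than
    unique_threshold of its kmers are unique to it."""
    max_log2 = math.log2(fold_threshold)
    return {t: abs(fold) > max_log2
               or unique_fractions.get(t, 0.0) < unique_threshold
            for t, fold in fold_diffs.items()}
```

Flagging on unique-kmer fraction rather than a correlation cut-off sidesteps the problem that per-transcript correlations aren't directly comparable, while still catching the transcripts where quantification is expected to struggle.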

Kallisto accuracy unique kmers

Figure 8. Transcripts with more unique kmers have a better correlation between ground truth and Kallisto estimates. Transcripts were binned by the fraction of kmers which are unique to the transcript. The boxplot shows the distribution of correlation coefficients per bin.

Kallisto accuracy unique kmers zoom

Figure 9. As per Fig.8 except x-axis restricted to [0-0.1].

Kallisto accuracy correlation transcripts per gene

Figure 10. Transcripts from loci with more transcripts have lower correlations between ground truth and Kallisto estimates. Transcripts were binned by the number of transcripts for the gene which they are annotated to. Boxplots show the distribution of correlation values per bin


Kallisto appears to be highly accurate for the majority of transcripts, with estimated counts for 99.54% of transcripts within 1.5-fold of the true counts and 97.2% within 1.2-fold, which seems pretty fantastic to me. Most of the transcripts show a good correlation between the simulated read counts and the Kallisto estimates (~90% have a coefficient > 0.75), which seems good for a transcript-level analysis with a very complex simulated transcriptome. Right now I’ve not got anything to compare this to, but I’m happy Kallisto’s performing very well. I’m going to add Sailfish and Salmon to the pipeline soon for comparison, and then compare to some alignment-based methods.

Finally, this simulation has reminded me of the importance of properly considering the gene set. Inclusion of the poorly supported transcripts here looks likely to have reduced the accuracy of well-supported transcripts from the same locus. Before I go on to quantify expression in my samples, I clearly need to define a high-confidence set of transcripts. One of the fantastic things about Kallisto is that it’s so quick I can include this simulation with every run of the pipeline and re-test my new gene set at run-time. This also means anyone else using it within the group will get a report on the suitability of their gene set, and transcripts which can’t be accurately quantified will be flagged in the final results.

With thanks:

There have been a number of really informative blog posts on alignment-free quantification, including from the authors of the tools. The Kallisto team have also been very quick at responding helpfully to questions on their google groups forum which is fantastic to see. Any misunderstanding or misrepresentation of how these methods work is, however, entirely my fault!

Rob Patro – Salmon and Kallisto

Lior Pachter – Kallisto

Nextgenseek – Kallisto

Highlights from Genome Science 2015

Genome Science is an annual conference focusing on genome research in the UK, with this year’s conference hosted by Nick Loman in Birmingham. Big thanks out to Nick and the organising committee for bringing together such a diverse range of speakers, ranging all the way from functional genomics through to clinical genomics and from new technologies through to fundamental research. CGAT was out in force with 8 of us attending, probably only outdone in pure numbers by the might of TGAC!

I thought I’d put a few highlights up on the blog, partly because I’m so bad at remembering names and faces and this might help me remember who it was who gave that really interesting talk at Genome Science 2015 in a few months’ time! This is far from a comprehensive summary of all the talks though.

In the opening talk of the conference, Daniel MacArthur presented the results from aggregating 1000s of exome studies. This included one of those “why hadn’t I thought about this before” moments when he explained that we’ve now sequenced the exomes of so many individuals that one can start applying statistical tests to determine genes where the number of single nucleotide variants (SNVs) identified is significantly lower than we would expect by chance. The obvious explanation for this is that mutations in these genes lead to such a severe phenotype that the mutation is removed from the gene pool by purifying selection. This is indeed the case for a number of the genes which are implicated in severe diseases. Interestingly, some of the genes they’ve identified have never been implicated in a human disease before; however, we can now say that a mutation causing an amino-acid change in them would likely be fatal. So there’s value not just in what you observe but in what you don’t observe too! What really excites me is the potential to turn this analysis to non-coding regions when we have enough full genomes sequenced – this could really help increase our understanding of the regions of the genome which are important in regulating gene expression. Ed Yong’s got an article up explaining this much better than I have (How Data-Wranglers Are Building the Great Library of Genetic Variation).

For me, some of the most interesting talks at Genome Science 2015 were those which introduced novel technologies or analyses and prompted me to think about new ways to answer those burning biological questions. There were many talks about nanopore-based sequencing and two in particular gave me some food for thought.

Mark Akeson (University of California Santa Cruz) spoke about the development of nanopore sequencing technologies, from the early work of George Church and others through to the recent results coming out of Oxford Nanopore’s early access program. There’s obviously a lot of excitement about the rapid improvements in base calling, however, what really excited me was the mention of nanopore sequencing of proteins. Whisper it quietly, but clinical applications of current sequencing technologies often use transcript abundance merely as a proxy for protein abundance. If we could start reliably quantifying proteins with nanopore-based technologies, this could be a very disruptive technology. On the research side, as someone who’s interested in the question of how much alternative splicing of transcripts is actually propagated to the protein level, being able to directly sequence both the transcriptome and proteome would be fantastic.

Keeping with the nanopore theme, Matt Loose (University of Nottingham) presented the latest incarnation of his minoTour platform for realtime minION analysis. It looks like a really slick interface, but I must confess that up to this point I’d not really seen the value of the realtime aspect of nanopore analyses within research settings. Watching a minION generate reads in a live demonstration is very cool, but apart from a constantly updated histogram of read lengths, what do you really gain from this? Of course in a clinical setting you could hook up the output directly to the downstream analysis and obtain your results 24 hours quicker, which could be of real benefit. But for me the downstream analysis is probably best measured in months, not hours! What I hadn’t considered until Matt’s talk, though, was that realtime analysis allows you to selectively focus on particular sequences. In the example he gave, if you barcode DNA from each sample, you can identify the sample source of the DNA in realtime. This then enables you to obtain roughly equal sequencing depth across all your samples in a multiplexed pool of DNA by rejecting pores which are occupied by an over-sequenced sample and switching to another pore. There’ll no doubt be other similar applications of realtime sequence analysis to come. Interestingly, Matt mentioned that this is currently only possible because the speed at which the DNA is ratcheted through the nanopore is suppressed to improve accuracy; when the speed increases, it may start to become difficult to analyse the sequences quickly enough!

Other highlights included Mick Watson’s very engaging presentation of his recent publication which defines genes which can’t be accurately quantified in isolation but can yield biologically relevant results when considered as a group of genes. Definitely worth a read: Errors in RNA-Seq quantification affect genes of relevance to human disease. I’ll return to this topic in more depth with some of my own simulation data using Kallisto soon.