Road-testing Kallisto

In this blog post I’m going to discuss some testing I’ve been doing with Kallisto, a lightning-fast tool for RNA-Seq transcript expression quantification. First, though, I’ll very briefly discuss how Kallisto works (skip to “Testing Kallisto” if you don’t want my very simplified explanation).

Alignment-free quantification in RNA-Seq

RNA-Seq bioinformatics can take many forms but typically involves alignment of sequence reads to a reference genome, in order to establish where the reads originate from, followed by estimation of transcript abundance using a reference gene set. These steps are computationally heavy and can be very time-consuming even with multi-threading. Last year, Rob Patro et al. released Sailfish, which demonstrated that it is not necessary to actually align each read to the genome in order to obtain accurate transcript abundance estimates; all you actually need to do is establish the most likely transcript for each read. They achieved this by shredding the transcriptome and reads into kmers (short overlapping sequences; see Fig. 1) and then matching the transcriptome and read kmers, which is very fast and has low memory usage. We discussed this paper in our journal club last year and I think most of us took a bit of time to accept that this was actually a viable approach; it just felt like you were losing too much information by shredding the reads into kmers. The results were very convincing though, and the Sailfish paper was quickly followed by further “alignment-free” tools, including RNASkim (a modified method using only a subset of the read kmers), an update to Sailfish called Salmon, and most recently Kallisto. Kallisto actually deviates slightly from the kmer-shredding method: it forms a de Bruijn graph from the transcript kmers and then aligns the read kmers to this transcriptome de Bruijn graph to form what the authors describe as “pseudoalignments” (Fig. 2). The accuracy of Kallisto is equivalent to the best-in-class methods whilst being >100× faster! In fact, Kallisto is so quick that it’s perfectly feasible to bootstrap the quantification estimates 100 times to obtain an estimate of the technical variance, which is very useful when testing for differentially expressed transcripts (as utilised by Sleuth).

kmers

Figure 1. Kmers. A sequence can be split into overlapping kmers, where k is less than the length of the sequence. Typically for assembly purposes, consecutive kmers are offset by a single nucleotide (so they overlap by k-1 bases) and the kmers are transformed into a de Bruijn graph which links overlapping kmers. Figure from http://homes.cs.washington.edu/~dcjones/assembly-primer/index.html.
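To make the kmer shredding concrete, here’s a minimal Python sketch of extracting overlapping kmers from a sequence. The function name and toy example are mine; the default k=31 is kallisto’s default kmer size, but everything else is illustrative only.

```python
# Minimal sketch: split a sequence into all overlapping kmers (as in Fig. 1).
def kmers(sequence, k=31):
    """Yield every overlapping substring of length k."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]

print(list(kmers("ATCGGAT", k=4)))
# ['ATCG', 'TCGG', 'CGGA', 'GGAT']
```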

kallisto method schematic

Figure 2. Schematic representation of pseudoalignment. a. Transcripts (shown in colours) are shredded into kmers, which are used to form a coloured de Bruijn graph (c) in which each node is a kmer and each transcript corresponds to a path through the graph. A read (shown in black at the top) is aligned to the graph by matching kmers to find the compatibility of the read with the transcripts. In this case, the read could have originated from either the pink or the blue transcript. Figure adapted from the Kallisto publication.
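To illustrate the intuition behind pseudoalignment (though emphatically not kallisto’s actual implementation, which walks the de Bruijn graph and skips redundant kmers for speed), here’s a toy Python sketch: build an index mapping each kmer to the set of transcripts containing it, then intersect those sets across the kmers of a read. The names, sequences and tiny k are all made up for illustration.

```python
from functools import reduce

def kmers(seq, k):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def build_index(transcripts, k):
    """Map each kmer to the set of transcripts it occurs in."""
    index = {}
    for name, seq in transcripts.items():
        for km in kmers(seq, k):
            index.setdefault(km, set()).add(name)
    return index

def pseudoalign(read, index, k):
    """Intersect the transcript sets of all the read's indexed kmers."""
    sets = [index[km] for km in kmers(read, k) if km in index]
    return reduce(set.intersection, sets) if sets else set()

transcripts = {"pink": "ATCGGATCGA", "blue": "TTCGGATCGT", "green": "ACGTACGTAC"}
index = build_index(transcripts, k=5)
print(pseudoalign("CGGATCG", index, k=5))  # contains 'pink' and 'blue'
```

As with the read in Fig. 2, the toy read above is compatible with two transcripts; it’s then down to the EM step to apportion such reads between them.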


Testing Kallisto

As I mentioned briefly in a previous blog post, Highlights from Genome Science 2015, Mick Watson gave a very engaging presentation at Genome Science 2015 of his recent paper with Christelle Robert, in which they identified human genes that cannot be accurately quantified in isolation using standard RNA-Seq analyses. He explained that, for some genes, many of the reads which originate from the gene will align to multiple places in the genome, so the quantification of these genes is very inaccurate with many methods. Their proposal is to use the multi-mapping reads to define “multi-map groups” (MMGs) which can be treated as a meta-gene for the downstream differential expression testing. Mick presented an example where this analysis of MMGs leads to an interesting finding which would be missed by standard analyses.

After Mick’s talk, I got thinking about a similar analysis for Kallisto to flag transcripts which it cannot accurately quantify. There’s been lots of excitement about the speed of the alignment-free methods, and with the introduction of Sleuth, a pipeline from sequence reads to differentially expressed transcripts can now easily analyse dozens of samples on a small cluster within just a few hours. However, I wanted to get a feel for the accuracy of Kallisto before I jumped in with any real data. My assumption was that transcripts with little unique sequence would be very difficult to quantify with Kallisto (or any other tool for that matter). Originally I followed the methodology from Robert and Watson and simulated 1000 reads per transcript, but I quickly realised this was inappropriate: because each transcript has the same number of reads assigned to it, the simulation wasn’t properly testing whether the expectation–maximization (EM) algorithm used by Kallisto is able to correctly resolve the number of reads per transcript. I therefore took a slightly different approach and simulated 1-100 reads at random from each transcript in my reference gene set (hg38 Ensembl v.80, filtered to retain transcripts from protein-coding genes only), with 30 repeat simulations. The two metrics I was interested in were the difference between the sum of ground truths and the sum of Kallisto estimated counts, and the correlation between the ground truth and the estimated counts per simulation. The first metric describes how accurately Kallisto quantifies the true abundance of a transcript within the transcriptome; the second describes how accurate the Kallisto estimate is for differential expression purposes. N.B. it’s possible that a transcript may show a very good correlation while the estimated counts are systematically over- or under-estimated. For the purposes of differential expression analysis this may not be so much of an issue, so long as the fold changes in expression are accurately captured.
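For clarity, here’s a hedged sketch of how these two metrics could be computed, assuming the ground truths and Kallisto estimates are arranged as (transcripts × simulations) arrays. The random data below is just a stand-in for the real simulation output, and all the variable names are mine.

```python
import numpy as np
from scipy.stats import pearsonr

# Stand-in data: 1000 transcripts x 30 simulations (not my real output).
n_transcripts, n_sims = 1000, 30
rng = np.random.default_rng(0)
truth = rng.integers(1, 101, size=(n_transcripts, n_sims)).astype(float)
estimated = np.clip(truth + rng.normal(0, 5, size=truth.shape), 0, None)

# Metric 1: log2 fold difference between summed estimates and summed truths.
log2_fold = np.log2(estimated.sum(axis=1) / truth.sum(axis=1))

# Metric 2: per-transcript Pearson correlation across the 30 simulations.
r = np.array([pearsonr(truth[i], estimated[i])[0] for i in range(n_transcripts)])

print("flagged >1.5-fold out:", int(np.sum(np.abs(log2_fold) > np.log2(1.5))))
print("flagged r < 0.75:", int(np.sum(r < 0.75)))
```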

Figure 3 shows the relationship between the number of unique kmers (here shown as the fraction of all kmers across the transcript) and the fold difference between the sum of ground truths and the sum of Kallisto estimates. As expected, transcripts with a higher fraction of unique kmers are quantified more accurately by Kallisto. However, I was surprised to see that Kallisto still seems to accurately quantify many of the transcripts which don’t have a single unique kmer(!). I’m not sure whether this might somehow be an artefact of the simulation.
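For reference, here’s a rough sketch of how the unique-kmer fraction per transcript might be computed. This naive version holds every kmer in memory, so it’s for illustration only; computing this over a full transcriptome needs a much more careful implementation.

```python
from collections import Counter

def unique_kmer_fractions(transcripts, k=31):
    """transcripts: dict of name -> sequence; returns name -> unique fraction."""
    per_tx = {name: {seq[i:i + k] for i in range(len(seq) - k + 1)}
              for name, seq in transcripts.items()}
    # Count how many distinct transcripts each kmer occurs in.
    occurrences = Counter(km for kms in per_tx.values() for km in kms)
    return {name: sum(occurrences[km] == 1 for km in kms) / len(kms)
            for name, kms in per_tx.items() if kms}
```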

Kallisto simulation accuracy fold difference

Figure 3. Increased fraction of unique kmers is associated with lower fold differences between Kallisto estimated counts and ground truth. Transcripts were binned by the fraction of kmers which are unique to the transcript. The log2-fold difference between the sum of ground truths and Kallisto estimated counts is shown, along with a boxplot per bin to show the distribution within each bin. Transcripts with absolute differences > 2-fold are shown in red at 1 or -1 respectively.

Figure 4 shows the overall correlation between the sum of ground truths and the sum of estimated counts. Transcripts where the estimated counts were 1.5-fold higher or lower than the ground truths are flagged as “Poor” accuracy. In total, just 701/182510 (0.46%) transcripts failed this threshold. With an even tighter threshold of 1.2-fold, still only 2.8% of transcripts fail. There were a few transcripts which were considerably overestimated in each individual simulation, the most extreme of which is ENST00000449947, a transcript of the azoospermia factor DAZ2, which is ~9-fold overestimated on average across the simulations. When we look at the transcript table and models for DAZ2 (Fig. 5 & Fig. 6), we can see that there are 9 transcripts within this locus, and they appear to vary only in the inclusion/exclusion of particular exons, so their sequences will be very similar. In fact, of the 9 transcripts, only ENST00000382440 contains any unique kmers (13/3223; 0.4%), whereas the rest contain 1784-3232 non-unique kmers. It’s no surprise then that Kallisto struggles to accurately quantify the abundance of all 9 transcripts (Fig. 7), with three of the transcripts having zero estimated counts across all simulated data sets. When we look back at the Ensembl table, we can see that a number of the transcripts are annotated with the flag TSL:5 (Transcript Support Level 5), which indicates that there are no mRNAs or ESTs which support the transcript, i.e. it’s highly suspect. If all these TSL:5 transcripts were removed from the gene set, every transcript would have some unique kmers, which would certainly improve the Kallisto quantification for the transcripts of this gene. Since these transcripts are likely to be misassembly artefacts, it seems sensible to suggest they should be removed from the gene set to allow accurate quantification of the other transcripts of this gene (a crude filtering sketch follows Fig. 7 below).

Kallisto simulation accuracy correlation

Figure 4. Correlation between the sum of ground truths and the sum of estimated counts from Kallisto across all simulations (n=30). The number of reads per transcript per simulation was randomly sampled from a uniform distribution [0-100], hence the sum of ground truths is not the same for each transcript. Transcripts for which the Kallisto estimates are >1.5-fold higher or lower are flagged as “Poor” quality.

Ensembl transcripts

Figure 5. ENSG00000205944 Transcript table from Ensembl.

Ensembl transcript models

Figure 6. ENSG00000205944 Transcript models from Ensembl.

Kallisto accuracy DAZ2

Figure 7. As per Figure 4 except transcripts from gene DAZ2 (ENSG00000205944) are highlighted.
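As an aside, if one wanted to try dropping the TSL:5 models before building the kallisto index, a crude filter over the Ensembl GTF might look like the sketch below. This assumes the GTF carries a transcript_support_level attribute; the file names are placeholders and this is not a robust GTF parser.

```python
import re

# Placeholder file names; adjust for the actual gene set.
with open("Homo_sapiens.GRCh38.80.gtf") as gtf, \
        open("filtered.gtf", "w") as out:
    for line in gtf:
        match = re.search(r'transcript_support_level "(\w+)"', line)
        if match and match.group(1) == "5":
            continue  # drop records for poorly supported (TSL:5) transcripts
        out.write(line)
```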

So far, so good for most (99.54%) transcripts with regards to accurate overall quantification. The next thing I wanted to look at was the Pearson product-moment correlation coefficient (r) between the ground truth and estimated counts. Most of the time I’m interested in doing differential expression testing with RNA-Seq. I know that with a real data set the estimates themselves will be systematically biased by all sorts of things, chiefly amplification biases and Illumina sequencing biases, but I need to know that the Kallisto estimates correlate well with the transcript abundance. Figures 8 & 9 show the relationship between the fraction of unique kmers and the correlation between ground truth and estimated counts. Again as expected, Kallisto performs much better when the transcript contains a high fraction of unique kmers, with a clear drop-off in performance below 5% unique kmers. Yet again, Kallisto is still performing very well for some transcripts which don’t have a single unique kmer. In fact, it does slightly better for transcripts with 0-1% unique kmers than those with 1-2% unique kmers(!), which I’m at a complete loss to explain. Perhaps someone from Lior Pachter’s group can explain how this is possible?

In this case it’s slightly more difficult to define a threshold to flag transcripts with a low correlation, as the range of ground truths for each transcript will be different, so the comparison between transcripts wouldn’t be on the same basis. For my purposes, I’ve just flagged any transcript with less than 5% unique kmers as being difficult to quantify for now.

If we were to set a correlation coefficient threshold of 0.75, then 10.3% of transcripts would be flagged as failing the threshold. We can see in Figure 10 that transcripts from loci with a larger number of transcripts show much worse correlation. Returning to the example of ENSG00000205944 from earlier, all the transcripts fail to meet this threshold, with ENST00000382440 (the only transcript with unique kmers) showing the highest correlation (0.729).

Kallisto accuracy unique kmers

Figure 8. Transcripts with more unique kmers have a better correlation between ground truth and Kallisto estimates. Transcripts were binned by the fraction of kmers which are unique to the transcript. The boxplot shows the distribution of correlation coefficients per bin.

Kallisto accuracy unique kmers zoom

Figure 9. As per Fig.8 except x-axis restricted to [0-0.1].

Kallisto accuracy correlation transcripts per gene

Figure 10. Transcripts from loci with more transcripts have lower correlations between ground truth and Kallisto estimates. Transcripts were binned by the number of transcripts for the gene to which they are annotated. Boxplots show the distribution of correlation values per bin.


Summary

Kallisto appears to be highly accurate for the majority of transcripts, with estimated counts for 99.54% of transcripts within 1.5-fold of the true counts and 97.2% within 1.2-fold, which seems pretty fantastic to me. Most of the transcripts also show a good correlation between the simulated read counts and the Kallisto estimates (~90% with a coefficient > 0.75), which seems very reasonable for a transcript-level analysis with a very complex simulated transcriptome. Right now I’ve not got anything to compare this to, but I’m happy that Kallisto’s performing very well. I’m going to add Sailfish and Salmon to the pipeline soon for comparison, and then compare to some alignment-based methods.

Finally, this simulation has reminded me of the importance of properly considering the gene set. Inclusion of the poorly supported transcripts here looks likely to have reduced the accuracy of well-supported transcripts from the same locus. Before I go on to quantify expression in my samples, I clearly need to define a high-confidence set of transcripts. One of the fantastic things about Kallisto is that it’s so quick I can include this simulation with every run of the pipeline and retest my new gene set at run-time. This also means anyone else using the pipeline within the group will get a report on the suitability of their gene set, and transcripts which can’t be accurately quantified will be flagged in the final results.


With thanks:

There have been a number of really informative blog posts on alignment-free quantification, including from the authors of the tools. The Kallisto team have also been very quick to respond helpfully to questions on their Google Groups forum, which is fantastic to see. Any misunderstanding or misrepresentation of how these methods work is, however, entirely my fault!

Rob Patro – Salmon and Kallisto http://robpatro.com/blog/?p=248

Lior Pachter – Kallisto https://liorpachter.wordpress.com/2015/05/10/near-optimal-rna-seq-quantification-with-kallisto/

Nextgenseek – Kallisto http://nextgenseek.com/2015/05/kallisto-a-new-ultra-fast-rna-seq-quantitation-method/

17 thoughts on “Road-testing Kallisto”

  1. Nice write up Tom!


  2. Really great, but I would love to see a comparison to Tophat-Cuffdiff for this, as I think it’s probably the most common tool (at the transcript level) and a lot of people need convincing that they should leave it behind.


    • I agree. A comparison with the tuxedo pipeline is definitely required as well. Along with a comparison with Sailfish/Salmon, this is top of my to-do list for when I have time to return to this.

      When considering what tools to use with real RNA-Seq data, I think it’s important to note that this simulation was only really trying to confirm my expectations that transcripts with a low fraction of unique kmers would be difficult to quantify. The simulation is very naive (all the reads align to exons, random uniform sampling, random uniform probability of sequencing error) so it doesn’t really test the performance under real RNA-Seq conditions. For example, Kallisto currently ignores the strandedness of the reads. I know of at least one example in an RNA-Seq sample where reads which appear intronic from one transcript are assigned to the antisense transcript within the intron (both transcripts are real!). I’ve raised this with the Kallisto team so hopefully Kallisto will consider strandedness where appropriate soon. Personally, I think this is an essential feature for my RNA-Seq samples.


  3. Glad I was somewhat of an inspiration! 😉

    I think what’s interesting is that, in the context of a *good* annotation, the alignment step is a real weakness and just introduces error around repeats and gene families. So with a *good* annotation, you in fact wouldn’t align at all, not only for speed but for accuracy.

    I’m still interested in how robust the new, fast methods are to poor annotation…


    • Yeah, I think the observation that including the “bad” annotations for DAZ2 leads to poor quantification for the “good” annotations could have serious implications if this is a widespread phenomenon, especially with the more recent Ensembl releases which include more isoforms. I noticed that you’ve previously raised the issue of the DAZ genes in a conversation with Lior Pachter, Páll Melsted and Ewan Birney via Twitter from your own simulations https://twitter.com/biomickwatson/status/596048917051088897. One of the things I want to look at more systematically is the impact of including these transcripts with poor support. More to come soon hopefully…


    • The recommended way to run the standard tuxedo pipeline is to include a reference annotation. This (I think) always includes all of the provided isoforms in the output, as well as any extra, even if the data provides no evidence for these isoforms in your particular samples. This always struck me as a bit odd: half the point of using Cufflinks was supposed to be that you need the correct annotation to accurately quantify transcript (or even gene) levels.


  4. When we developed Cufflinks we actually frequently ran it using only the annotation provided by the assembly Cufflinks itself outputs. That was the reason for developing cuffcompare, and that mode of use is how we wrote the paper http://www.nature.com/nbt/journal/v28/n5/full/nbt.1621.html
    With time I think people just felt more comfortable using the reference annotation (at least in human). Maybe this was partly due to the fact that cuffcompare still leaves a bit of a mess and it’s not always easy to map discoveries made on the basis of assembly to known genes/transcripts.


    • Speaking as someone who’s always included the reference transcripts, my reasoning was that these transcripts are well supported by previous analyses, including by cDNA etc. My understanding of transcript assembly from short reads is that it’s a very tricky problem with lots of inherent noise, and transcript models are often fragmented. Unfortunately, I think the Ensembl annotations may now include many spurious/incorrect transcript models which, in at least one case here, make quantification of the well-supported transcripts impossible. This is something I want to return to in another post, maybe by simply comparing the accuracy of Kallisto/Salmon/Cufflinks for well-supported transcripts with and without the poorly supported transcripts included in the reference.



      • We have seen that Ensembl provides worse TPM estimates than RefSeq using Kallisto, Salmon or Sailfish. In fact, the TPM estimates improve when you use just the Ensembl CDS regions. My conjecture here is that RefSeq has more complete UTR regions than Ensembl. Ensembl must have many incomplete transcripts, which makes the quantification less accurate. We put some of these analyses in the SUPPA paper: http://rnajournal.cshlp.org/content/21/9/1521.long


  5. Great analysis! Thanks for sharing.

    You show that the correlation between truth and estimated counts drops off as a transcript has fewer unique k-mers. Do you know offhand whether this is also reflected in the bootstrap variances? Are transcripts with a low fraction of unique k-mers more variable? Probably this needs to be done in some relative sense (e.g. dispersion), since the variance will naturally also be related to the (true) mean.


    • Hi Mark. I had the exact same thought earlier. I’ve just been using the point estimate so far (no bootstraps) but this shouldn’t take long to re-run and check. Of course, if the bootstrap variance does correlate well that would be brilliant for differential expression purposes. I’ll update some time later this week hopefully…


  6. I wrote a simple, dirty Python script which is quick enough for my purposes but very memory hungry: it takes about 18G to run on the hg38 Ensembl protein-coding gene set(!) The script has just been pushed to our code collection (https://github.com/CGATOxford/cgat/blob/master/scripts/fasta2unique_kmers.py). If you know of any suitable open source tool I’d be very tempted to switch!


