The dependency hell in software development


Most software projects depend on other software packages to provide its functionality. Actually, the bigger your software project, the more likely is that you end up having a large number of dependencies either internally between different parts of your code or externally with third-party libraries.

The goal of adding a new dependency to your code is to avoid reinventing the wheel. If there is code already available that does what you want, normally you would prefer to re-use it (there are exceptions, though) rather than investing your time in re-writing new code to solve the same task. On the other hand, once your software project has matured over time and is ready for production it may end up being a dependency for other software projects as well.

Grouping off-the-shelf functions into software packages and defining dependencies between them has been traditionally at the core of software development. It is specially important if you try to follow the Unix Design Philosophy where the aim is to build small programs that do one thing well, and can be easily re-used.

At the time that you decide to write new software to solve a problem that is not being tackled by other software out there, you will also have to choose the operating system, the programming language, system libraries, and a collection of other software packages available on the system. These choices will create dependencies between your new piece of code and a specific software environment that will need to be reproduced to run properly on a different system.

Initially, re-using software packages as much as possible looks like the best way forward to save time and effort in the long run; and it is indeed true. However, creating dependencies with other software packages also brings its own disadvantages, specially when the number of dependencies becomes larger and larger. Here is a nice example illustrating what I am talking about.


Copy and paste the original text to explain the picture:

This graph shows the dependencies between software packages in the Gentoo Linux operating system. In total there are 14,319 packages with 63,988 dependencies between them!

Packages are drawn on the circumference of the circle and a dependency is indicated by a line draw from one package to another. The colour of a package is randomly generated, while the colour of a line is determined by the package that is required by the other packages.

Every package has a line starting on the rim of the circle drawn radially outwards. The length of the line is determined by the amount of other packages that depend on the package.

As you can imagine, the larger the number of dependencies between packages, the more likely is to create conflicts between them (among other issues). This problem is known as the dependency hell and is the driver for the rise of new technologies out there, like containers, that are trying to propose a feasible solution.

The example about Gentoo Linux might seem an extreme one. However, the codebase in CGAT currently depends on 58 R packages and 45 Python packages, plus third party tools available in our compute cluster, where we have over 200 packages installed. Click here if you dare to check out the graph of internal dependencies for the CGAT scripts!

In the rest of this post I will revisit quickly how the dependency hell problem has been addressed up to date, from no dependency resolution at all to the full isolation provided by containers. Then, I present how I have approached the dependency hell problem for the CGAT scripts. Finally, I provide my conclusions and some advice that will hopefully be useful for someone out there.

A brief history of open-source software distribution

Before talking a bit more about containers, let us now have a quick look at the evolution of how open-source software has been distributed since the Internet was available.


Back in the old days the only way of sharing software was to create a tarball containing the code, configuration files and scripts to compile/install the code. As mentioned, unless the software package was completely self-contained, the dependencies were not shipped with it and if you did not have them installed on your box, it would rapidly complain either at compilation or run time. The amount of time spent making the new software package to run properly varied with the number of dependencies and the conflicts with the packages already installed. Finding the right versions for all the dependencies manually was a big limitation preventing the distribution of open source software, if not impossible in some cases.

Another problem was to build the software package on a CPU architecture that was compatible with the target system. Nowadays, the amount of different CPU architectures available in the market has shrunk quite a lot, but that has not been always the case.

Having said that, it is also true that if your software only depends on a couple of packages, and your reference CPU architecture is the widely used x86, tarballs are still today a feasible way of sharing code (although you should consider better options as explained below!)

Package managers

In order to ease the distribution of open source software, package managers were created. Instead of sharing just the code, the software was pre-compiled in a reference CPU architecture and packaged into special file format that included a way of specifying where the code should be located in the destination file system along with a list of the required software dependencies and version numbers. RPM and APT are probably the most popular package managers for Red Hat and Debian based Linux distributions, respectively.

Depending on the complexity of a software project, it is not trivial to create a package for a package manager. However, the benefits are evident. The package just needs to be uploaded to a proper repository and it can be easily installed by end users with no effort via the package manager, which will download the package and all its dependencies if they are not installed on the target system.

Although dependency resolution is a key feature of a package manager, it is worth mentioning that advanced package managers also play other important roles like applying security updates in a timely manner or checking that the software to be installed is not malicious.

Package managers started as a core part of Linux distributions but the list has grown to other operating systems as well. The concept has been extended outside operating systems to manage packages for programming languages like Perl, Python and R to name a few. Even more remarkable are initiatives like Conda, a package manager that was initially intended for Python packages only but has been generalised to manage packages in different programming languages (Python, R, Ruby, Lua, Scala, Java, Javascript, C/ C++ and FORTRAN) and can be installed in all the major operating systems in the market (Linux, OS X and Windows). More about Conda later.

Virtual machines

When the number of dependencies either with other software packages or with the underlying operating system is quite specific and/or large, another approach is to create a virtual machine where the complete software stack is under control. Virtual machines can be shared via its virtual disk only or as virtual appliances. The former gives freedom to the end user to select the amount of virtual resources assigned to the virtual machine (e.g. number of CPU cores and size of main memory). The latter not only comes with the software pre-installed but also with the virtual resources pre-selected for you so your hypervisor of choice just runs it.


Since Docker irrupted in 2014, there has been a lot of hype around containers in software development in the last years. Here is a good definition of containers:

Using containers, everything required to make a piece of software run is packaged into isolated containers. Unlike virtual machines, containers do not bundle a full operating system – only libraries and settings required to make the software work are needed. This makes for efficient, lightweight, self-contained systems and guarantees that software will always run the same, regardless of where it’s deployed.

Docker is available for all major operating systems (Linux, OS X and Windows) and the idea is that you create a container for your application once, and it runs everywhere. That is quite appealing and is the reason why Docker in particular, and containers in general, are getting a huge attention these days.

Conda package for CGAT scripts

CGAT maintains a collection of scripts and pipelines for the analysis of high-throughput sequencing data. As stated above, the code has a large number of dependencies and it has been traditionally a challenge to deploy outside the CGAT cluster. Basically, we are facing the dependency hell ourselves and here I describe our approach to solve it.

In the past, I have been playing more or less with several options presented in this post to make our code portable. Right now, I consider Conda the most suitable option going forward but time will tell whether I am wrong or not.

Conda is easy to learn and install plus there are already a couple of community led efforts that make this option attractive: Bioconda and conda-forge. These teams are making great contributions to enable reproducible research and solve the dependency hell problem. Special thanks to Johannes Köster for setting out the Bioconda community and asking me to become a contributor. I am also grateful to Bioconda’s the core team for their hard work and dedication to keep this project moving on.

Recently, I have added a Conda package to the Bioconda channel for the CGAT scripts. Next, I will describe the installation steps so you can give it a try and confirm whether the Conda approach is really a solution to our dependency hell!

The installation instructions below are for Linux. OS X and Windows users must install Docker and use the container option. I will use the a temporary folder, but feel free to choose a different location on your system.

# create an installation folder
mkdir /tmp/conda
cd /tmp/conda

# download and install Conda
bash -b -p /tmp/conda/install
export PATH=/tmp/conda/install/bin:$PATH

# configure Conda
conda config --set allow_softlinks False
conda config --add channels conda-forge
conda config --add channels defaults
conda config --add channels r
conda config --add channels bioconda

# install CGAT scripts into a Conda environment
conda create -n cgat-scripts cgat-scripts -y

Since the number of dependencies is high, the installation will take up to 10 minutes depending on your system’s configuration. According to my test, you’ll need 3GB of free disk space for the complete installation as well.

Now, you are ready to go:

# on the same or a different terminal, activate the Conda environment
source /tmp/conda/install/bin/activate cgat-scripts

# and now you should be able to use the CGAT scripts
cgat --help
cgat --help Genomics
cgat --help Conversion
cgat gtf2table -h
cgat bam2geneprofile -h

So all you need to decide is the path on your file system to install Conda and all the dependencies will be located under a single folder that you can then enable on demand. Please note, we have made use of Conda environments, which gives you flexibility to have a specific set of dependencies without conflicting with other environments (e.g. different versions of Python or R) on the same Conda installation. Or, if you are really paranoid about breaking things (like myself), you can create a separate Conda installation on a different location on your file system, to ensure full isolation.

If that worked for you, and are interested in more details about how the Conda package was created, please feel free to comment on this post and I will try to help.


In summary, I explained the dependency hell and how the problem has been tackled until today, and later I presented the Conda package for CGAT scripts as a way of solving our own dependency hell, which will hopefully work on your systems!

Finally, based on my experience solving the dependency hell for the CGAT scripts, here is my advice on how to address the problem properly if you are a software developer. Firstly, I recommend to be really picky when choosing new dependencies for your code. Make sure the software you depend on is in good health and maintained. Secondly, have a way of tracking the dependencies for your software. For example, a simple up to date list of software packages and the required versions will do the job.

matplotlib Tutorial

Hi Everyone,

I have made a quick tutorial on the basics of plotting with matplotlib.

It covers:

  • Matplotlib vocabulary – figures and axes
  • The different ways to use matplotlib: the interactive interface, the object-oriented interface and the object-oriented class library
  • Basic plotting commands
  • Customising plots using containers and artists

It is available to view here.

Or to download as an interactive IPython notebook here.

Good luck!


Resources to help you on your way to learning Python for biology

Having been a wet lab biologist for 5 years with very little programming knowledge (zero python, a little C++), my first task when joining the Computational Biology and Training Department (CGAT) was to develop the Python programming skills. However, knowing where to start was more problematic.

My first port of call was to buy the ‘Python for biologists’ books that are amazing introductions to the basic use of python in biology. However, I quickly realised that even these simple to understand books were far too advanced for me at the time, as I hadn’t even grasped how to use the for loop yet!.

My lack of knowledge on the simple basics of python led me to the Coursera python course, where basic principles are introduced and then the course explores some of the more advanced aspects of python, which I felt at the time were far too complicated for what I needed. However, I persisted and completed the course and it allowed me to begin my new life as a computational biologist.

However, It was only after completing the Coursera series that I discovered Codeacademy. If I had discovered this first I think that my road to becoming a python programmer would have been simpler and less complicated, as the interactive session used to teach python is really intuitive. Moreover, it covers the basic principles clearly and concisely.

I think the most significant issue when embarking on learning a programming language wasn’t actually getting access to material; it was trying to decide where to start first. Therefore, for anyone embarking on learning python for biology related purposes I would go through these sources in order:

  1. Codeacademy – this is a great free resource and introduces the principles of python perfectly.
  2. ‘Python for Biologists’ – this is an excellent introduction to building python code and then applying it to simple biological problems. – However, don’t expect too much from this book, it wont give you solutions to complicated research questions.
  3. Practical computing for biologists – Again another great resource for beginners how to use python to answer simple scientific questions.
  4. Coursera (Python programming) – This was a great course to begin with but goes into some more advanced topics that at the time I didn’t need. I would consider doing this course and then stopping when you either get bored or find that it isn’t really helping you anymore. After a year of using python I re-enrolled in this course and found the more advanced aspects of the tutorials so much more informative.

All in all, it took me a month to have a good grasp of python (I have no idea whether this is quick or slow) and about another month to start using the language to a sufficiently advanced level to be useful for my work.

Adam Cribbs (

Speeding up python

I was recently contacted by Ilgar Abdullayev from the Ludwig Institute for Cancer Research, Karolinska Institutet with a query about why the script in our UMI-Tools repository was taking so long to run on some of his BAMs. Finding the bottleneck was relatively easy and the immediate solution was straightforward thanks to cython (and our technical director Andreas!). Thanks to Ilgar for sharing some of his data to let me optimise the code. Before the details though, I thought it might be a good time to take stock on why python is quick to write but slow to run. Skip to “Reducing the interpretation overhead” if you’re already a whiz at python.

Quick development

One of the things I really love about python is how quick it is to write function code.  Python is dynamically typed which means we don’t have to worry about the type of a variable until we want to do something with it. The following is perfectly valid.

variable = "hello world"

variable = 6

It appears to the user that the type of variable has changed from string to int. Actually, we first created a python string object and bound the reference “variable” to it, then we created a new int object and bound the reference “variable” to it. The first object no longer has any references bound to it so it’s garbage collected (e.g deleted).

Another great feature of python is duck typing by which we can perform the same operation on different python types to different ends. The classic example being the use of “+” for both concatenating strings and for adding numeric types.

variable = "hello world"
print variable + "I'm python"
hello world, I'm python

variable = 6
print variable + 2

Along with a comprehensive set of well-maintained modules for more complex classes, data manipulations and statistics, dynamic typing and duck typing makes writing a complex python script something that can be achieved in minutes rather than days.

Slower to run

The penalty you pay for dynamic typing is that the python code must be interpreted, and a section of code is re-interpreted each time its run. In the, admittedly contrived, example below we can clearly see why the code in the loop needs to be interpreted each time anew.

dict_1 = {1:"bar", 0:"foo"}
list_1 = ["bar", "foo"]

objects = (dict_1, list_1)
for obj in (objects):
    print obj[0]


In the first loop, obj is a dictionary. The “[0]” index in a dictionary returns the value for key “0”, which in this case is “foo”. In the second loop, obj is a list. The “[0]” operation on a list slices the list to the xth element (0-indexed), which in this case is the first element, “bar”.

Reducing the interpretation overhead

The reinterpretation of code takes a lot of time and python is therefore considerably slower than languages like C. And this is where cython (see what they did there) comes in.  Cython allows you to statically determine the type of an object reducing the work of the python interpreter.

Now to return to the original problem with our script. For a background on our UMI-tools see (CGAT blog – UMIs). The script in question takes the alignment file (SAM/BAM format) and returns an alignment with the duplicated reads removed. In my hands this was taking about 10 seconds for 1 million reads which seems perfectly reasonable. For Ilgar however, it was taking around 12 minutes per 1 million reads which is less than ideal when alignment files may contain >100M reads.

To identify the bottleneck I first used a bit of trial and error to identify a single read start position which was particularly troublesome. In hindsight it was hardly surprising this was an area of extreme coverage. With the use of of the python profiler module we can identify the part of the script which was taking the longest time.

python -m cProfile -s tottime -I subset.bam -S out.bam --method=adjacency > profile.log

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 26352342   47.813    0.000   76.867    0.000
 26352397   20.992    0.000   20.992    0.000 {zip}
 26352353    8.062    0.000    8.062    0.000 {sum}

So the majority of the time taken was the edit_dist function which was run 26352342 times! Looking back at the script, we can see this is because our adjacency-based methods take the list of UMIs observed and create a dictionary mapping each UMI to the UMIs which are a single edit distance apart (hamming distance = 1). This is an all vs. all operation so the time taken increases quadratically with the number of UMIs observed. The obvious solution was therefore to improve the speed of the adjacency dictionary building. Let’s look at relevant portion of the original code.

def edit_dist(first, second):
    ''' returns the edit distance/hamming distances between
    its two arguments '''

    dist = sum([not a == b for a, b in zip(first, second)])
    return dist

def _get_adj_list_adjacency(self, umis, counts, threshold):
        ''' identify all umis within hamming distance threshold'''

        return {umi: [umi2 for umi2 in umis if
                      edit_distance(umi, umi2) <= threshold]
                for umi in umis}

The _get_adj_list_adjacency method uses a dictionary comprehension with the values generated by a list comprehension so the edit_distance function is called on every combination of UMIS. For a quick improvement we can put the code for the edit distance calculation inside the list comprehension so we avoid recalling the function, but the gains are minimal. Alternatively, as you may have noticed, we’re only interested in UMIs where the distance is >= threshold so we could optimise the edit_distance method by returning as soon as the distance is over the threshold:

def edit_dist(first, second, threshold):
    ''' returns True if hamming distance is less than threshold '''

    dist = 0
    for a, b in in zip(first, second):
        if a != b:
            dist += 1
        if dist > threshold:
            return 0
    return 1

def _get_adj_list_adjacency(self, umis, counts, threshold):
        '''identify all umis within hamming distance threshold'''

        return {umi: [umi2 for umi2 in umis if
                      edit_dist(umi, umi2, threshold)]
                for umi in umis}

This improves the speed ~2-fold which is nice but not really enough!

Speaking to Andreas about this problem, he quickly identified cython as simple solution since the hamming distance calculation is very simple. Here’s the cython code snippet and the pertinent code in


cpdef int edit_distance(a, b):
    cdef char * aa = a
    cdef char * bb = b

    cdef int k, l, c
    c = 0
    l = len(a)
    for k from 0 <= k < l:
        if aa[k] != bb[k]:
            c += 1
    return c

import pyximport
from _dedup_umi import edit_distance
def _get_adj_list_adjacency(self, umis, counts, threshold):
        ''' identify all umis within hamming distance threshold '''

        return {umi: [umi2 for umi2 in umis if
                      edit_distance(umi, umi) <= threshold]
                for umi in umis}

In the cython function we statically define the type for the two C strings a & b and the ints k, l & c. Then we count the positions in each string where the characters are different. In we just need to import the pyximport module and use it to compile Then we import the edit_distance function from and use it like any other python function. This simple modification reduces the run time on the troublesome genomic coordinate from 91 secs to 11 secs – the most time-consuming function for this particular region is now remove_umis.

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     6432    5.902    0.001    5.902    0.001
       11    3.626    0.330    5.544    0.504<dictcomp>)
 26359658    1.918    0.000    1.918    0.000 {_dedup_umi.edit_distance}

The cython function could be further improved along the lines of the python function by returning as soon as the threshold is exceeded. I’m sure there are other improvements we could make to the run time of the script. For example, transforming the UMI strings to 2bit arrays would make the hamming distance calculation a very cheap xor bitwise operation. However, I’ve not tested yet if the cost of the transformation outweighs the gains in the edit distance calculation.

This is the first time I’ve used cython – something that Ian Sudbury (UMI-tools co-author) had suggested might be worth considering a while ago. To be honest, the solution was so straightforward that Andreas ended up implementing it himself whilst he demonstrated it. I think it might be time for me to read that cython documentation fully now!

Learning a programming language for the first time

So, having been at CGAT for 6 months I thought that it would be a good opportunity to share the problems that have occurred during my endeavour to learn a new programming language.

One of the first tasks a CGAT fellow is required to do is to learn how to program. My computational background was extremely limited, with a very basic knowledge of C++ that I learned over 6 weeks in the school summer holidays when I was about 14 (I know I was a sad child). Therefore, on my first day I asked others in the CGAT office which programming  language I should learn first,  the overwhelming response was to focus on python. However, others suggested that I also learn R as a lot of downstream data analysis is performed using this language. Not being able to decide which language to focus on first I googled the question and came over an interesting blog post setting out the pros and cons of each language. Although this was very informative it didn’t really give me the answer I was looking for so I decided to embark on learning both, dedicating one day to python then the next to R. With hindsight this wasn’t a very good decision as I found myself getting confused between the different syntaxes. As a consequence I decided to drop learning R for the time being, not because I though python would be more useful in the short term (which it actually has turned out to be) but because the online learning resources for python are much more readily available, some of which I have listed below:

Python resources that I found extremely useful when first starting out:

I think to understand the basics of python and start to write functional programs took me around a month, I’m not sure whether this is quick or slow but it felt like a good pace for me. I think I was a little naive before I joined CGAT because I thought that once you have a basic understanding of the syntax then you could simply go about writing a program to fulfil any needs. However, this soon changed after 2/3 months when I was given my first project which was to find and annotate CRISPR gRNAs sites across the genome. This has been a fantastic first project that has allowed me to develop the skills that are required to write and develop CGAT code. Although the project has progressed at a good pace, the more I have delved into writing python code the more I have realised how much more I have to learn to become proficient at writing not only functional code but code that is succinct and computationally efficient.

Following my intensive one month learning python I then decided to focus my attention on learning the basics of R. R is a fantastic tool that makes data analysis and plot generation the go-to language. It has been around for many years and one of its major bonuses is the availability of packages that you can download and use to make specialist data analysis easier. The problem I encountered early on though was the availability of good online training material, none of it seemed to give me the right level of knowledge to let me perform the analysis I want (although I list some of the best online courses that I found helpful below). Therefore,  I have had to look at other offline courses to supplement my training. One of the courses that I found useful as a basic introduction are those offered by the Oxford uni IT department, the course gives a really good introduction to the basics of R through lecture and practical based learning. Since CGAT fellows are allowed to spend £2000 a year on outside training, I have used some of this money to attend a 5 day bioconductor R course given at Newcastle university to try to improve my skills in this area.

R courses that I found useful:

Highlights from Genome Science 2015

Genome Science is an annual conference focusing on genome research in the UK, with this year’s conference hosted by Nick Loman in Birmingham. Big thanks out to Nick and the organising committee for bringing together such as diverse range of speakers, ranging all the way from functional genomics through to clinical genomics and from new technologies through to fundamental research. CGAT was out in force with 8 of us attending, probably only outdone in pure numbers by the might of TGAC!

I thought I’d put a few highlights up on the blog, partly because I’m so bad at remembering names and faces and this might help me remember who it was who gave that really interesting talk at Genome Science 2015 in a few months time! This is far from a comprehensive summary of all the talks though.

In the opening talk of the conference, Daniel MacArthur’s presented the results from aggregating 1000s of exome studies. This included one of those “why hadn’t I thought about this before” moments when he explained that we’ve now sequenced the exomes of so many individuals that one can start applying statistical tests to determine genes where the number of single nucleotide variants (SNVs) identified is significantly lower than we would expect by chance. The obvious explanation for this is that mutations in these gene lead to such a severe phenotype that the mutation is removed from the gene-pool by purifying selection. This is indeed the case for a number of the genes which are implicated in severe diseases. Interestingly some of the genes they’ve identified have never been implicated in a human disease before. However, we can now say that a mutation causing an amino-acid change would likely be fatal. So there’s value not just in what you observe but what you don’t observe too! What really excites me is the potential to turn this analysis to non-coding regions when we have enough full genomes sequenced – this could really help increase our understanding of the regions of the genome which are important in regulating gene expression. Ed Yong’s got an article up on explaining this much better than I have (How Data-Wranglers Are Building the Great Library of Genetic Variation)

For me, some of the most interesting talks at Genome Science 2015 were those which introduced novel technologies or analyses and prompted me to think about new ways to answer those burning biological questions. There were many talks about nanopore-based sequencing and two in particular gave me some food for thought.

Mike Akeson (University of California Santa Cruz) spoke about the development of nanopore sequencing technologies, from the early work of George Church and others through to the recent results coming out of Oxford Nanopore’s early access program. There’s obviously a lot of excitement about the rapid improvements in base calling, however, what really excited me was the mention of nanopore sequencing of proteins. Whisper it quietly but clinical applications of current sequencing technologies often use transcript abundance merely as a proxy for protein abundance. If we could start reliably quantifying proteins with nanopore-based technologies, this could be a very disruptive technology. On the research side, as someone who’s interested in the question of how much alternative splicing of transcripts is actually propagated to the protein level, being able to directly sequence both the transcriptome and proteome would be fantastic.

Keeping with the nanopore theme, Matt Loose (University of Nottingham) presented the latest incarnation of his minoTour platform for realtime minION analysis. It looks like a really slick interface but I must confess that up to this point I’d not really seen the value in the realtime aspect of nanopore analyses within research settings. Watching a minION generate reads in a live demonstration is very cool but apart from a constantly updated histogram of read length, what do you really gain from this? Of course in a clinical setting you could hook up the output directly to the downstream analysis and obtain your results 24 hours quicker, which could be of real benefit. But for me the downstream analysis is probably best measured in months not hours! What I hadn’t considered until Matt’s talk though was that the realtime analysis allows you to selectively focus on particular sequences. In the example he gave, if you barcode DNA from each sample, you can identify the sample source of the DNA in realtime. This then enables you to obtain roughly equal sequencing depth from all your samples in a multiplexed pool of DNA by rejecting pores which are occupied by an over-sequenced sample and switching to another pore. There’ll no doubt be other similar applications of the realtime sequence analysis to come. Interestingly, Matt mentioned that this is currently only possible because the speed which the DNA is being ratcheted through the nanopore is suppressed to improve accuracy, when the speed increases, it may start to become difficult to analyses the sequences quick enough!

Other highlights included Mick Watson’s very engaging presentation of his recent publication which defines genes which can’t be accurately quantified in isolation but can yield biologically relevant results when considered as a group of genes. Definitely worth a read: Errors in RNA-Seq quantification affect genes of relevance to human disease. I’ll return to this topic in more depth with some of my own simulation data using Kallisto soon.

Unique Molecular Identifiers – the problem, the solution and the proof

[EDIT: UMI-Tools open access publication is now out]

I’ve been working with Dr. Ian Sudbery (former CGAT fellow, currently Lecturer in Bioinformatics at the University of Sheffield) on a script called to correctly remove PCR duplicates from alignment files when using Unique Molecular Identifiers (UMIs). The script is available from GitHub as part of the UMI-tools repository. As part of the development I’ve been running some simulations to compare different methods which have proved quite illuminating. Before I go into the details though, some background.

PCR duplication artifacts

Many sequencing-based techniques rely upon an amplification step to increase the concentration of the library generated from the fragmented DNA prior to sequencing. Following alignment to the genome, it’s usually preferable to identify and remove PCR duplicates as there are inherent biases in the amplification step as some sequences become overrepresented in the finally library compared to their true abundance in the fragmented DNA. The brilliant Picard tool is commonly used to identify and remove these PCR duplicates using their genomic coordinates. Of course, it’s also possible that two sequence reads with the same genomic coordinates will have the originated from two independent fragments. However, where the alignment density is essentially random (as in DNA-Seq) this is unlikely, even more so if paired end sequencing has been performed. The human genome has ~3 x 109 bases and the insert size will vary so the combinations of possible genomic coordinates for the paired ends is so vast we don’t expect to sequence two fragments with the same genomic coordinates.

Conversely, in RNA-Seq, some cDNA sequences will be highly abundant and we are therefore much more likely to sequence multiple fragments with the exact same genomic coordinates. This problem is most acute where a high number of PCR cycles have been used to generate the sequencing library, for example where the concentration of starting material in low, such as in single cell RNA-Seq as this necessitates many PCR cycles.

Using unique molecular identifiers to correctly identify PCR duplicates

Unique molecular identifiers (UMIs; also called Random Molecular Tags (RMTs)) are random sequences of bases used to tag each molecule (fragment) prior to library amplification (see figure below and Kivioja et al (2012), thereby aiding in the identification of PCR duplicates.  If two reads align to the same location and have the same UMI, it is highly likely that they are PCR duplicates originating from the same fragment prior to amplification. We can therefore collapse all the reads with the same genomic coordinates and UMI into a single representative read and obtain an accurate estimate of the relative concentration of the fragments in the original sample.

nmeth.2772-F1This figure is adapted from Islam et al (2014)

The problem

The use of UMIs as described above would work perfectly if it were not for base-calling errors, which erroneously create sequencing reads with the same genomic coordinates and UMIs and that are identical except for the base at which the error occurred. Base-calling errors therefore inflate the apparent numbers of unique fragments sequenced (see below). An additional source of errors comes from the amplification itself which can introduce errors during the DNA duplication step.


The solution

Ian initially identified this issue in his own data and wrote a python script utilising pysam to parse an alignment BAM file and inspect the UMIs of reads with the same genomic coordinates. He reasoned that sequencing errors would produce UMIs which are a single edit distance away from another UMI. Therefore, by forming an adjacency matrix it should be possible to identify the sequencing errors and collapse the reads to the true number of unique UMIs present (see schematic below for a simple example). The first step is to identify connected components, where UMIs are represented by nodes and edges represent links between nodes which are a single edit distance apart. The counts of each node are also retained. Within each component, the most abundant node is selected and all nodes within a single edit distance are removed. This process is repeated using the second most abundant node and so on until all edges are removed and the number of nodes remaining is inferred to be the number of unique fragments sequenced. We refer to this method as the “adjacency” method.


It turns out this works pretty well, however, one can envisage situations where this may not correctly identify the true number of unique fragments. For example,  an error during the PCR amplication may create an apparently distinct UMI from which sequencing errors could created further UMIs, all of which are actually from a single unique fragment (see schematic below). A similar situation could also arise if one particular position in the reads was more likely to contain an error (this is quite common in illumina sequencing). In this case, one would see lots of UMIs with a single edit distance due to an error at the common position and further errors may be present off of this UMI.

schematic_3In order to deal with these more complicated networks of connected UMIs, I developed the “directional_adjacency” method. I reasoned that we could resolve these networks by forming edges between nodes where the count of node A was >= 2 x node B. Now, each connected component is inferred to originate from a single unique fragment.

In addition to these methods, we also wanted to try a modification of the “adjacency” method in which the whole connected component is inferred to originate from a single unique fragment (“cluster”).

This issue has been previously observed by Islam et al (2014) who proposed that UMIs with less than 1% of the average counts should be removed “to account for UMIs that stem from PCR-induced mutations or sequencing errors”. We implemented this method within our tool for comparison (“percentile”).

Finally, we implemented a “unique” method, where all unique UMIs were retained. Note, this appears to be the method which is most commonly used. The schematic below shows a comparison of the various methods (I’ve obviously rigged it to make directional adjacency look the best).


The proof

We’re taking two approaches to prove the value of improved detection of PCR duplicates using the adjacency methods implemented in Firstly, we’ve conducted simulations to show they consistently estimate the true number of unique fragments more accurately. Secondly, we’ve applied to multiple single-cell RNA-Seq studies and show an improvement, with respect to the distribution of UMIs, the quantification of spike-in RNAs and the clustering of cells. I’ll focus on the simulation for now and leave the single-cell RNA-Seq to another post when we’ve submitted the manuscript.

In order to compare the various methods, we wanted to know how the methods fared across a range of UMI lengths and depths of sequencing.  For this I turned to python to replicate the process of amplification and sequencing in silico and model the impact of the various parameters on the efficacy of the methods. The whole process is contained in a standalone ipython notebook which you can view here. If you would like to download and run the notebook yourself you can do so from here.

In brief, the simulation starts with a set number of unique molecules (fragments), our “ground truth”. A UMI is then randomly assigned to each molecule and the pool of molecules amplified by duplicating each molecules for every in silico PCR cycle. The final pool of molecules are then “sequenced” to produce the reads. Within this process, base-calling errors are incorporated during the sequencing and substitutions are incorporated during the PCR cycles. Furthermore, amplification efficiency of each UMI varies to naively model the impact of amplification biases. Various methods are then applied to the reads generated to simulate the de-duplication process within and an estimate of the number of unique molecules is obtained. To compare the methods I used 2 metrics:

  • The difference between this estimate and the ground truth
  • The variability of the estimate for which I used the coefficient of variation (CV; sd/mean)

Each of the following plots below show the impact of changing one of the simulation parameters on theses two metrics. In each plot, the red dashed line shows the parameter value that was used in the other simulations.

We can see that under the assumptions of the simulations, the directional adjacency method is the least sensitive to variations in the input parameters.

The unique method (the method used in most publications) severely overestimates the number of unique molecules since each sequencing error or PCR error erroneously increases the estimate. As the number of starting molecules increase, all methods eventually struggle to identify the true number of unique molecules, with the adjacency, directional adjacency and percentile methods showing the best performance.  As we would expect, the percentile method tends to overestimate the number of unique molecules since it fails to remove sequencing errors unless they are very rare events. The adjacency method tends towards an overestimation too, as it can only identify erroneous UMIs within a single edit distance. Interestingly, the directional adjacency method tends towards an underestimation as it can link together a string of UMIs each separated by a single edit distance. The directional aspect of the network is designed to avoid this happening between truly unique UMIs, however occasionally these will be collapsed together. The cluster method tends towards a more severe underestimation as it does not take counts into account at all and collapses whole networks into a single UMI.

Overall, the directional adjacency method shows very little sensitivity to the number of starting molecules and outperforms all other methods. As expected, the percentile method becomes more consistent as the number of molecules increase, since the variability between the counts in each one of the individual iterations is reduced. In contrast, the cluster and adjacency methods become less consistent as the number of starting molecules increases, as the variability in the topology of the networks formed will also increase.


As the UMI length increases (from a initial very short 3bp), the methods which utilise a network quickly improve in terms of accuracy and variability. However, the percentile and unique methods actually become less accurate as a longer UMI obviously means more probability of errors accumulating. Perhaps counter intuitively then, if you’re not going to account for errors when using UMIs, you’re better off not using a very long UMI.


As the PCR cycles are increased, all methods except directional adjacency show a decrease in accuracy and an increase in variability, with adjacency being the second most robust to high PCR cycle numbers. We see the same as we increase the DNA polymerase error rate. This confirms that the directional adjacency method is better able to handle errors originating from the amplification stage.



Is a non-heuristic solution possible?

The methods we propose are definitely an improvement on “unique” deduplication using UMIs within these simulations.  We’ve also shown that the application of our proposed methods can have a real impact on biological interpretation of single cell RNA-Seq data (more to come on that soon). However, all implemented methods are heuristic approaches to the problem. My initial expectation was that we would be able to apply some probabilistic approach to determine the most likely set of UMIs which would account for the UMIs observed at a given genomic location. So far, we’ve been unable to develop such a method. If anyone would like to discuss alternative probabilistic approaches, or the script in general, please get in touch.