The dependency hell in software development

Introduction

Most software projects depend on other software packages to provide their functionality. In fact, the bigger your software project, the more likely it is that you end up with a large number of dependencies, either internally between different parts of your code or externally with third-party libraries.

The goal of adding a new dependency to your code is to avoid reinventing the wheel. If there is code already available that does what you want, you would normally prefer to re-use it (there are exceptions, though) rather than investing your time in writing new code to solve the same task. On the other hand, once your software project has matured over time and is ready for production, it may end up being a dependency for other software projects as well.

Grouping off-the-shelf functions into software packages and defining dependencies between them has traditionally been at the core of software development. It is especially important if you try to follow the Unix Design Philosophy, where the aim is to build small programs that do one thing well and can be easily re-used.

When you decide to write new software to solve a problem that is not tackled by other software out there, you will also have to choose the operating system, the programming language, system libraries, and a collection of other software packages available on the system. These choices create dependencies between your new piece of code and a specific software environment, which will need to be reproduced for the code to run properly on a different system.

Initially, re-using software packages as much as possible looks like the best way to save time and effort in the long run; and it is indeed true. However, creating dependencies on other software packages also brings its own disadvantages, especially when the number of dependencies becomes larger and larger. Here is a nice example illustrating what I am talking about.

[Figure: gentoo-deps, a circular graph of the dependencies between Gentoo Linux packages]

Quoting the original text that explains the picture:

This graph shows the dependencies between software packages in the Gentoo Linux operating system. In total there are 14,319 packages with 63,988 dependencies between them!

Packages are drawn on the circumference of the circle and a dependency is indicated by a line drawn from one package to another. The colour of a package is randomly generated, while the colour of a line is determined by the package that is required by the other packages.

Every package has a line starting on the rim of the circle drawn radially outwards. The length of the line is determined by the amount of other packages that depend on the package.

As you can imagine, the larger the number of dependencies between packages, the more likely it is that conflicts arise between them (among other issues). This problem is known as dependency hell and is the driver behind the rise of new technologies, like containers, which try to propose a feasible solution.

The example about Gentoo Linux might seem an extreme one. However, the CGAT codebase currently depends on 58 R packages and 45 Python packages, plus third-party tools available in our compute cluster, where we have over 200 packages installed. Click here if you dare to check out the graph of internal dependencies for the CGAT scripts!

In the rest of this post I will quickly revisit how the dependency hell problem has been addressed to date, from no dependency resolution at all to the full isolation provided by containers. Then, I present how I have approached the dependency hell problem for the CGAT scripts. Finally, I provide my conclusions and some advice that will hopefully be useful for someone out there.

A brief history of open-source software distribution

Before talking a bit more about containers, let us have a quick look at how the distribution of open-source software has evolved since the Internet became available.

Tarballs

Back in the old days the only way of sharing software was to create a tarball containing the code, configuration files and scripts to compile and install it. As mentioned, unless the software package was completely self-contained, the dependencies were not shipped with it, and if you did not have them installed on your box it would quickly complain either at compile or run time. The amount of time spent getting the new software package to run properly varied with the number of dependencies and the conflicts with the packages already installed. Finding the right versions for all the dependencies manually was a big barrier to the distribution of open-source software, and in some cases made it impossible.
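
For readers who have never built software this way, here is a minimal sketch of the classic workflow, assuming a hypothetical Autotools-based project called mytool (the URL and options are made up, and the exact steps vary from project to project):

# download and unpack the source tarball
wget https://example.org/mytool-1.0.tar.gz
tar -xzf mytool-1.0.tar.gz
cd mytool-1.0

# configure, compile and install; this is where missing or conflicting
# dependencies on your box typically make the process fail
./configure --prefix=$HOME/software
make
make install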

Another problem was building the software package on a CPU architecture compatible with the target system. Nowadays, the number of different CPU architectures available in the market has shrunk quite a lot, but that has not always been the case.

Having said that, it is also true that if your software only depends on a couple of packages, and your reference CPU architecture is the widely used x86, tarballs are still a feasible way of sharing code today (although you should consider better options, as explained below!).

Package managers

In order to ease the distribution of open-source software, package managers were created. Instead of sharing just the code, the software was pre-compiled for a reference CPU architecture and packaged into a special file format that specified where the code should be located in the destination file system, along with a list of the required software dependencies and their version numbers. RPM and APT are probably the most popular package managers for Red Hat and Debian based Linux distributions, respectively.

Depending on the complexity of a software project, it is not trivial to create a package for a package manager. However, the benefits are evident. The package just needs to be uploaded to a proper repository and end users can then install it effortlessly via the package manager, which will download the package and all of its dependencies if they are not already present on the target system.
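
As an illustration, on a Debian-based system a single command is enough to install a tool and pull in whatever it needs (samtools is just an example here; any packaged tool works the same way):

# refresh the package index and install a package; APT resolves and
# installs all required dependencies automatically
sudo apt-get update
sudo apt-get install samtools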

Although dependency resolution is a key feature of a package manager, it is worth mentioning that advanced package managers also play other important roles like applying security updates in a timely manner or checking that the software to be installed is not malicious.

Package managers started as a core part of Linux distributions but the list has grown to include other operating systems as well. The concept has also been extended beyond operating systems to manage packages for programming languages like Perl, Python and R, to name a few. Even more remarkable are initiatives like Conda, a package manager that was initially intended for Python packages only but has been generalised to manage packages in different programming languages (Python, R, Ruby, Lua, Scala, Java, JavaScript, C/C++ and FORTRAN) and can be installed on all the major operating systems in the market (Linux, OS X and Windows). More about Conda later.
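
As a quick taste of what that looks like in practice (Conda is covered in more detail below), a single environment can mix packages from different languages; the environment name and package selection below are just examples:

# create an environment mixing Python and R packages in one go
conda create -n mixed-env python=3.6 numpy r-base r-ggplot2 -y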

Virtual machines

When the number of dependencies, either on other software packages or on the underlying operating system, is quite specific and/or large, another approach is to create a virtual machine where the complete software stack is under control. Virtual machines can be shared via their virtual disk only or as virtual appliances. The former gives the end user freedom to select the amount of virtual resources assigned to the virtual machine (e.g. number of CPU cores and size of main memory). The latter not only comes with the software pre-installed but also with the virtual resources pre-selected for you, so your hypervisor of choice just runs it.
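
For example, with VirtualBox an appliance shipped as an OVA file can be imported with a single command (the file name below is hypothetical):

# import a pre-built virtual appliance; the virtual hardware comes pre-configured
VBoxManage import my-appliance.ova
# the resulting virtual machine can then be started from the VirtualBox GUI
# or with VBoxManage startvm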

Containers

Since Docker burst onto the scene in 2014, there has been a lot of hype around containers in software development. Here is a good definition of containers:

Using containers, everything required to make a piece of software run is packaged into isolated containers. Unlike virtual machines, containers do not bundle a full operating system – only libraries and settings required to make the software work are needed. This makes for efficient, lightweight, self-contained systems and guarantees that software will always run the same, regardless of where it’s deployed.

Docker is available for all major operating systems (Linux, OS X and Windows) and the idea is that you create a container for your application once, and it runs everywhere. That is quite appealing and is the reason why Docker in particular, and containers in general, are getting so much attention these days.
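
For instance, running a container from a public image takes only a couple of commands, and the image behaves the same on any machine with Docker installed:

# download a public image and start an interactive shell inside an isolated container
docker pull ubuntu:16.04
docker run --rm -it ubuntu:16.04 /bin/bash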

Conda package for CGAT scripts

CGAT maintains a collection of scripts and pipelines for the analysis of high-throughput sequencing data. As stated above, the code has a large number of dependencies and it has traditionally been a challenge to deploy outside the CGAT cluster. Basically, we are facing dependency hell ourselves, and here I describe our approach to solving it.

In the past, I have played with several of the options presented in this post to make our code portable. Right now, I consider Conda the most suitable option going forward, but time will tell whether I am wrong or not.

Conda is easy to learn and install, plus there are already a couple of community-led efforts that make this option attractive: Bioconda and conda-forge. These teams are making great contributions to enable reproducible research and solve the dependency hell problem. Special thanks to Johannes Köster for setting up the Bioconda community and asking me to become a contributor. I am also grateful to Bioconda's core team for their hard work and dedication to keep this project moving on.

Recently, I have added a Conda package to the Bioconda channel for the CGAT scripts. Next, I will describe the installation steps so you can give it a try and confirm whether the Conda approach is really a solution to our dependency hell!

The installation instructions below are for Linux. OS X and Windows users must install Docker and use the container option. I will use a temporary folder, but feel free to choose a different location on your system.

# create an installation folder
mkdir /tmp/conda
cd /tmp/conda

# download and install Conda
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p /tmp/conda/install
export PATH=/tmp/conda/install/bin:$PATH

# configure Conda
conda config --set allow_softlinks False
conda config --add channels conda-forge
conda config --add channels defaults
conda config --add channels r
conda config --add channels bioconda

# install CGAT scripts into a Conda environment
conda create -n cgat-scripts cgat-scripts -y

Since the number of dependencies is high, the installation will take up to 10 minutes, depending on your system's configuration. According to my tests, you will also need 3GB of free disk space for the complete installation.

Now, you are ready to go:

# on the same or a different terminal, activate the Conda environment
source /tmp/conda/install/bin/activate cgat-scripts

# and now you should be able to use the CGAT scripts
cgat --help
cgat --help Genomics
cgat --help Conversion
cgat gtf2table -h
cgat bam2geneprofile -h

So all you need to decide is the path on your file system where Conda will be installed, and all the dependencies will be located under a single folder that you can then enable on demand. Please note that we have made use of Conda environments, which give you the flexibility to have a specific set of dependencies without conflicting with other environments (e.g. different versions of Python or R) on the same Conda installation. Or, if you are really paranoid about breaking things (like myself), you can create a separate Conda installation in a different location on your file system, to ensure full isolation.
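
To illustrate that flexibility, here is a small sketch of a second environment with a different Python version living alongside cgat-scripts in the same Conda installation (the environment name is arbitrary):

# a separate environment with its own Python, fully independent of cgat-scripts
conda create -n py27-test python=2.7 -y
source activate py27-test
python --version
source deactivate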

If that worked for you, and you are interested in more details about how the Conda package was created, please feel free to comment on this post and I will try to help.

Conclusions

In summary, I explained dependency hell and how the problem has been tackled to date, and then presented the Conda package for the CGAT scripts as a way of solving our own dependency hell, which will hopefully work on your systems too!

Finally, based on my experience solving dependency hell for the CGAT scripts, here is my advice on how to address the problem properly if you are a software developer. Firstly, I recommend being really picky when choosing new dependencies for your code. Make sure the software you depend on is in good health and is maintained. Secondly, have a way of tracking the dependencies of your software. For example, a simple up-to-date list of software packages and their required versions will do the job.
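
If you already use Conda, one simple way of keeping such a list (a sketch, not the only option) is to export the environment specification and keep it under version control:

# record the exact packages and versions in the active environment
conda env export > environment.yml
# alternatively, for a plain Python project
pip freeze > requirements.txt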

2 thoughts on "The dependency hell in software development"

  1. Very interesting post Sebastian. Dependency hell is something I think I only really appreciated once I’d left the comforts of the CGAT environment!

    Really nice to have the brief history of how dependencies have been handled. A few questions: In your opinion, what are the relative merits of virtual machines and containers? I.e. when would you use one over the other? Also, have you heard of Singularity? It's available on the Cambridge HPC and I'd be interested to hear your thoughts about it.


  2. Sebastian Luna-Valero

    Hi Tom, many thanks!

    Virtual machines and containers offer full isolation from the host operating system, and therefore full control of the runtime environment. There are different opinions on when to use one over the other, especially when the workloads are similar, but in my opinion virtual machines are better when you require dedicated access to virtualised hardware (a fixed number of CPU cores, amount of RAM and disk space), whereas containers are better for lightweight workloads. For example, I would use a virtual machine for deploying a small-scale web server and a container for running the CGAT scripts on Windows.

    Indeed, I have heard of Singularity. Especially in the high performance computing world, I would say it is the competitor of Docker, the main advantage of Singularity being that it is able to run completely without root permissions, whereas Docker is not. Thus, Singularity is quite attractive for sysadmins as we don't have to worry about privilege escalation problems in our clusters. I played with Singularity a little bit last year, and although it was less mature than Docker it was a promising project. There are two upcoming workshops in Cambridge about containers [1][2], the latter being specifically targeted at Singularity, which I might attend so I hope to see you there 😉

    [1] https://www.software.ac.uk/c4rr
    [2] https://rse-cambridge.github.io/hpc-container-workshop/

