Most software projects depend on other software packages to provide its functionality. Actually, the bigger your software project, the more likely is that you end up having a large number of dependencies either internally between different parts of your code or externally with third-party libraries.
The goal of adding a new dependency to your code is to avoid reinventing the wheel. If there is code already available that does what you want, normally you would prefer to re-use it (there are exceptions, though) rather than investing your time in re-writing new code to solve the same task. On the other hand, once your software project has matured over time and is ready for production it may end up being a dependency for other software projects as well.
Grouping off-the-shelf functions into software packages and defining dependencies between them has been traditionally at the core of software development. It is specially important if you try to follow the Unix Design Philosophy where the aim is to build small programs that do one thing well, and can be easily re-used.
At the time that you decide to write new software to solve a problem that is not being tackled by other software out there, you will also have to choose the operating system, the programming language, system libraries, and a collection of other software packages available on the system. These choices will create dependencies between your new piece of code and a specific software environment that will need to be reproduced to run properly on a different system.
Initially, re-using software packages as much as possible looks like the best way forward to save time and effort in the long run; and it is indeed true. However, creating dependencies with other software packages also brings its own disadvantages, specially when the number of dependencies becomes larger and larger. Here is a nice example illustrating what I am talking about.
Copy and paste the original text to explain the picture:
This graph shows the dependencies between software packages in the Gentoo Linux operating system. In total there are 14,319 packages with 63,988 dependencies between them!
Packages are drawn on the circumference of the circle and a dependency is indicated by a line draw from one package to another. The colour of a package is randomly generated, while the colour of a line is determined by the package that is required by the other packages.
Every package has a line starting on the rim of the circle drawn radially outwards. The length of the line is determined by the amount of other packages that depend on the package.
As you can imagine, the larger the number of dependencies between packages, the more likely is to create conflicts between them (among other issues). This problem is known as the dependency hell and is the driver for the rise of new technologies out there, like containers, that are trying to propose a feasible solution.
The example about Gentoo Linux might seem an extreme one. However, the codebase in CGAT currently depends on 58 R packages and 45 Python packages, plus third party tools available in our compute cluster, where we have over 200 packages installed. Click here if you dare to check out the graph of internal dependencies for the CGAT scripts!
In the rest of this post I will revisit quickly how the dependency hell problem has been addressed up to date, from no dependency resolution at all to the full isolation provided by containers. Then, I present how I have approached the dependency hell problem for the CGAT scripts. Finally, I provide my conclusions and some advice that will hopefully be useful for someone out there.
A brief history of open-source software distribution
Before talking a bit more about containers, let us now have a quick look at the evolution of how open-source software has been distributed since the Internet was available.
Back in the old days the only way of sharing software was to create a tarball containing the code, configuration files and scripts to compile/install the code. As mentioned, unless the software package was completely self-contained, the dependencies were not shipped with it and if you did not have them installed on your box, it would rapidly complain either at compilation or run time. The amount of time spent making the new software package to run properly varied with the number of dependencies and the conflicts with the packages already installed. Finding the right versions for all the dependencies manually was a big limitation preventing the distribution of open source software, if not impossible in some cases.
Another problem was to build the software package on a CPU architecture that was compatible with the target system. Nowadays, the amount of different CPU architectures available in the market has shrunk quite a lot, but that has not been always the case.
Having said that, it is also true that if your software only depends on a couple of packages, and your reference CPU architecture is the widely used x86, tarballs are still today a feasible way of sharing code (although you should consider better options as explained below!)
In order to ease the distribution of open source software, package managers were created. Instead of sharing just the code, the software was pre-compiled in a reference CPU architecture and packaged into special file format that included a way of specifying where the code should be located in the destination file system along with a list of the required software dependencies and version numbers. RPM and APT are probably the most popular package managers for Red Hat and Debian based Linux distributions, respectively.
Depending on the complexity of a software project, it is not trivial to create a package for a package manager. However, the benefits are evident. The package just needs to be uploaded to a proper repository and it can be easily installed by end users with no effort via the package manager, which will download the package and all its dependencies if they are not installed on the target system.
Although dependency resolution is a key feature of a package manager, it is worth mentioning that advanced package managers also play other important roles like applying security updates in a timely manner or checking that the software to be installed is not malicious.
When the number of dependencies either with other software packages or with the underlying operating system is quite specific and/or large, another approach is to create a virtual machine where the complete software stack is under control. Virtual machines can be shared via its virtual disk only or as virtual appliances. The former gives freedom to the end user to select the amount of virtual resources assigned to the virtual machine (e.g. number of CPU cores and size of main memory). The latter not only comes with the software pre-installed but also with the virtual resources pre-selected for you so your hypervisor of choice just runs it.
Using containers, everything required to make a piece of software run is packaged into isolated containers. Unlike virtual machines, containers do not bundle a full operating system – only libraries and settings required to make the software work are needed. This makes for efficient, lightweight, self-contained systems and guarantees that software will always run the same, regardless of where it’s deployed.
Docker is available for all major operating systems (Linux, OS X and Windows) and the idea is that you create a container for your application once, and it runs everywhere. That is quite appealing and is the reason why Docker in particular, and containers in general, are getting a huge attention these days.
Conda package for CGAT scripts
CGAT maintains a collection of scripts and pipelines for the analysis of high-throughput sequencing data. As stated above, the code has a large number of dependencies and it has been traditionally a challenge to deploy outside the CGAT cluster. Basically, we are facing the dependency hell ourselves and here I describe our approach to solve it.
In the past, I have been playing more or less with several options presented in this post to make our code portable. Right now, I consider Conda the most suitable option going forward but time will tell whether I am wrong or not.
Conda is easy to learn and install plus there are already a couple of community led efforts that make this option attractive: Bioconda and conda-forge. These teams are making great contributions to enable reproducible research and solve the dependency hell problem. Special thanks to Johannes Köster for setting out the Bioconda community and asking me to become a contributor. I am also grateful to Bioconda’s the core team for their hard work and dedication to keep this project moving on.
Recently, I have added a Conda package to the Bioconda channel for the CGAT scripts. Next, I will describe the installation steps so you can give it a try and confirm whether the Conda approach is really a solution to our dependency hell!
The installation instructions below are for Linux. OS X and Windows users must install Docker and use the container option. I will use the a temporary folder, but feel free to choose a different location on your system.
# create an installation folder mkdir /tmp/conda cd /tmp/conda # download and install Conda wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh bash Miniconda3-latest-Linux-x86_64.sh -b -p /tmp/conda/install export PATH=/tmp/conda/install/bin:$PATH # configure Conda conda config --set allow_softlinks False conda config --add channels conda-forge conda config --add channels defaults conda config --add channels r conda config --add channels bioconda # install CGAT scripts into a Conda environment conda create -n cgat-scripts cgat-scripts -y
Since the number of dependencies is high, the installation will take up to 10 minutes depending on your system’s configuration. According to my test, you’ll need 3GB of free disk space for the complete installation as well.
Now, you are ready to go:
# on the same or a different terminal, activate the Conda environment source /tmp/conda/install/bin/activate cgat-scripts # and now you should be able to use the CGAT scripts cgat --help cgat --help Genomics cgat --help Conversion cgat gtf2table -h cgat bam2geneprofile -h
So all you need to decide is the path on your file system to install Conda and all the dependencies will be located under a single folder that you can then enable on demand. Please note, we have made use of Conda environments, which gives you flexibility to have a specific set of dependencies without conflicting with other environments (e.g. different versions of Python or R) on the same Conda installation. Or, if you are really paranoid about breaking things (like myself), you can create a separate Conda installation on a different location on your file system, to ensure full isolation.
If that worked for you, and are interested in more details about how the Conda package was created, please feel free to comment on this post and I will try to help.
In summary, I explained the dependency hell and how the problem has been tackled until today, and later I presented the Conda package for CGAT scripts as a way of solving our own dependency hell, which will hopefully work on your systems!
Finally, based on my experience solving the dependency hell for the CGAT scripts, here is my advice on how to address the problem properly if you are a software developer. Firstly, I recommend to be really picky when choosing new dependencies for your code. Make sure the software you depend on is in good health and maintained. Secondly, have a way of tracking the dependencies for your software. For example, a simple up to date list of software packages and the required versions will do the job.