Estimating the number of true cell barcodes in single cell RNA-Seq (part 2)

This post starts where we left off in the first part. If you haven’t already done so, I’d recommend reading the first part now.

I hadn’t planned on writing another post but no sooner had I finished the first part of this blog post than I discovered that the Pachter group have developed a tool to perform the same task (sircel, pronounced ‘circle’). This appears to be in the early stages of development but there’s a bioAxiv manuscript out which happens to cover the same input data as I discussed in the previous post (thanks to Avi Srivastava for bringing this to my attention).

In the discussion with Avi in the comments of the previous post I also realised that I had failed to clarify that only base substitutions will be detectable when using our network methods. As it happens, one of the motivations for the development of sircel appears to be handling INDELs which they claim can be present in >25% of dropseq barcodes. I can’t see where they get this number from (In the manuscript they reference the original Macosko et al paper but I can’t see this 25% figure anywhere) so I’m assuming for now that this is based on additional analyses they’ve performed. This seems like an alarmingly high figure to me given that the rate of INDELs is ~1 order of magnitude less frequent than substituions (see Fig. 2/Table 2 here) and even at the start of the read where the cell barcode resides, the per-base rate of INDELs does not usually exceed 0.001. I would also assume that cell barcodes with INDELs would have far fewer reads and so drop below the threshold using the ‘knee’ method. Hence, while some cell barcodes may be discarded due to INDELs, I would be surprised if a INDEL-containing barcode was erroneously included in the accepted cell barcodes.

How common are INDELs?

Before I tested out sircel, I wanted to independently assess what proportion of cell barcodes might have a substitution or INDEL. To do this, I returned to the drop-seq data from the last post (GSE66693). Previously, I showed the distribution of reads per cell barcode separated into 4 bins, where bin 4 represents the accepted barcodes above the knee threshold and bins 1-3 are largely populated by different types of erroneous barcodes (reproduced below as Figure 1). I compared each barcode to all barcodes in bin 4 and looked for a match allowing for one error (substitution, insertion or deletion). The results are tabulated below.  These figures are an upper estimate of the number of barcodes/reads harbouring an error relative to the true cell barcodes since each barcode may contribute to all three error types if at least one match to a bin 4 barcode exists with each error type. We can see that ~1.5% of cell barcodes may contain an INDEL, far less than the 25% suggested in the sircel manuscript. Furthermore, when we present the figures according to the number of reads, it appears around 3% of reads have a cell barcode with an INDEL, whilst 7% have a substitution. Thus, we can be quite confident that we’re not losing lots of data if we don’t error correct cell barcodes or just correct substitution errors.

indrop_bins

Figure 1. Distribution of reads per cell barcode (log 10) from first 50M reads, separated into 4 bins. Bin 4 represents the ‘true’ barcodes.

Untitled 2

Table 1: Columns 1 & 2: The number and fraction of cell barcodes which match a bin 4 barcode allowing one error. Columns 3 & 4: The number and fraction of reads with a cell barcodes which match a bin 4 barcode allowing one error.

Are INDEL-containing barcodes passing the knee threshold?

Next, I want to investigate whether the INDEL barcodes pass the knee threshold and become ‘accepted’. I therefore split the above data according to bin and focused on bin 4. As you can see in Figure 2 below,  there are indeed a small number of bin 4 barcodes which are a single error away from another bin 4 barcodes. Our null expectations (from 100 000 randomly generated barcodes) for the percentage of subsitution, deletion and insertion matches by chance are 0.14, 0.11 and 0.13%, respectively. In contrast, amongst the 575 bin 4 barcodes, there are 6 barcodes (1.0%) which match at least one other bin 4 barcode with one error. 4/6 of these barcodes have a match with an INDEL and all have relatively low counts suggesting these are indeed errors. It seems I may have been wrong to assume a INDEL-containing barcodes will not pass the knee threshold. The results for bins 1-3 minor my previous analysis using edit-distance and suggest the bin3 is mainly made up of error-containing barcodes whilst bins 1 & 2 appear to originate from amplification of ambient RNA in empty droplets.

error_detectionFigure 2. The frequency of cell barcodes in each bin which are one error away from at least one of the bin 4 barcodes. Random = 100 000 randomly generated barcodes

How does sircle compare to the knee method?

So, INDELs are observed in this drop-Seq data, although perhaps not 25% of all cell barcodes, and INDEL-containing barcodes can pass the hard knee threshold for acceptance. Now on to sircel. As you would expect from the Pachter group, sircel introduces a novel, elegant solution to detect substitutions and INDELs in barcodes. Rather than using edit-distance or fuzzy matching, they ‘circularise’ the barcodes and form a de Bruijn graph using the kmers from the circularised barcodes. The circularisation ensures that even with a large kmer size, barcodes containing an error will share at least a portion of their path through the de Bruijn graph with the cognate error-free barcodes. The edges are weighted according to the number of times each edge is observed and each circular path is given a score according to the minimal weight of all its edges. This score is used to filter the paths to select the high weight, correct paths and incorrect paths are merged to the correct paths. Notably, the thresholding is done using a method analogous to the knee method, whereby the threshold is set as the local maxima of the smoothed first derivative from the cumulative distribution of path scores. The neat thing about the sircel method is that it enables simultaneous identification of this inflection point to identify the threshold for true barcodes, and error correction to the true barcodes. To compare the results for sircel and my knee method, I ran sircel on the GSE66693 drop-seq sample where the authors had suggested there were ~570 true cell barcodes. Previously, using the first 50M reads I had estimated that there were 573 true barcodes using the knee method. As per the sircel manuscript recommendation, I used only 1M reads to run sircel, which yielded 567 true barcodes. This is slightly lower than the number quoted in the manuscript for 1M reads (582) but perhaps this is due to more recent updates? The methods agree on 563 cell barcodes but what about the cell barcodes where they agree. 4/10 out of the knee method-specific cell barcodes are a single error away from another barcode above the threshold, suggesting these may in fact be errors. The remaining 6 barcodes all have very low counts. Likewise, the 4 sircel-specific cell barcodes fall just below the knee threshold (6170-9576 reads per barcode; threshold = 9960). For these 10 barcodes, the difference appears to be due to differences in the number of reads sampled. I wanted to check this by using 50M reads with sircel but the run-time becomes excessively long. Unfortunately, going to other way and using just 1M reads with the knee method yields unstable estimates and since it takes <6 minutes to run the knee method with 50M reads we would never recommend using the knee method with just 1M reads. So in essence, the major difference between the knee method and sircel is that sircel effectively corrects the small number of error barcodes above the threshold.

Can we improve the knee method?

It does look like sircel does a good job of removing error barcodes but we could easily replicate this performance by performing an INDEL-aware error detection step after the knee threshold. During the development of UMI-tools we considered detecting INDELs as well as substitutions but decided against it because: a) as mentioned at the top, the rate of INDELs is usually ~1 order of magnitude lower than substituions (that this doesn’t appear to hold true for cell barcodes), and b) UMI-tools groups reads together by their alignment position so the current framework does not allow the detection of INDELs. However, when we’re dealing with cell barcodes, we consider all barcodes at once and hence, it’s much easier to identify INDELs. For our network methods, we can simply replace the edit-distance threshold with a fuzzy regex, allowing one error (substitution, insertion or deletion). The fuzzy regex matching is ~30-fold slower than our cythonised edit-distance function but this isn’t an issue here since we only need to build the network from the cell barcodes which pass the knee threshold. From the network we can identify the error barcodes and either correct these to the correct barcode or else discard them. This will now be the default method for UMI-tools extract with the option to discard, correct or ignore potential barcode errors above the knee threshold. We will also include the option to correct barcodes which fall below the threshold.

A final thought: To correct or not?

The cost of incorrectly merging two cell barcodes through cell barcode error correction could be drastic, e.g appearing to create cells ‘transitioning’ between discrete states. On the other hand, discarding reads from error barcodes rather than correcting them reduces the sequencing depth and hence increases the noise. In the drop-seq data above 55.7% of reads are retained without error correction with a further 9.2% of reads potentially correctable. If all barcodes above the threshold and within one error of another barcode above the threshold were removed, this would only mean the loss of 0.3% of reads. Given the potential dangers of incorrect error correction, it may be prudent in some cases to discard all cell barcodes with errors and work with 55.4% of the input reads rather than correcting errors in order to retain 64.9% or reads

Speeding up python

I was recently contacted by Ilgar Abdullayev from the Ludwig Institute for Cancer Research, Karolinska Institutet with a query about why the dedup_umi.py script in our UMI-Tools repository was taking so long to run on some of his BAMs. Finding the bottleneck was relatively easy and the immediate solution was straightforward thanks to cython (and our technical director Andreas!). Thanks to Ilgar for sharing some of his data to let me optimise the code. Before the details though, I thought it might be a good time to take stock on why python is quick to write but slow to run. Skip to “Reducing the interpretation overhead” if you’re already a whiz at python.

Quick development

One of the things I really love about python is how quick it is to write function code.  Python is dynamically typed which means we don’t have to worry about the type of a variable until we want to do something with it. The following is perfectly valid.

variable = "hello world"

variable = 6

It appears to the user that the type of variable has changed from string to int. Actually, we first created a python string object and bound the reference “variable” to it, then we created a new int object and bound the reference “variable” to it. The first object no longer has any references bound to it so it’s garbage collected (e.g deleted).

Another great feature of python is duck typing by which we can perform the same operation on different python types to different ends. The classic example being the use of “+” for both concatenating strings and for adding numeric types.

variable = "hello world"
print variable + "I'm python"
hello world, I'm python

variable = 6
print variable + 2
8

Along with a comprehensive set of well-maintained modules for more complex classes, data manipulations and statistics, dynamic typing and duck typing makes writing a complex python script something that can be achieved in minutes rather than days.

Slower to run

The penalty you pay for dynamic typing is that the python code must be interpreted, and a section of code is re-interpreted each time its run. In the, admittedly contrived, example below we can clearly see why the code in the loop needs to be interpreted each time anew.

dict_1 = {1:"bar", 0:"foo"}
list_1 = ["bar", "foo"]

objects = (dict_1, list_1)
for obj in (objects):
    print obj[0]

foo
bar

In the first loop, obj is a dictionary. The “[0]” index in a dictionary returns the value for key “0”, which in this case is “foo”. In the second loop, obj is a list. The “[0]” operation on a list slices the list to the xth element (0-indexed), which in this case is the first element, “bar”.

Reducing the interpretation overhead

The reinterpretation of code takes a lot of time and python is therefore considerably slower than languages like C. And this is where cython (see what they did there) comes in.  Cython allows you to statically determine the type of an object reducing the work of the python interpreter.

Now to return to the original problem with our script. For a background on our UMI-tools see (CGAT blog – UMIs). The dedup_umi.py script in question takes the alignment file (SAM/BAM format) and returns an alignment with the duplicated reads removed. In my hands this was taking about 10 seconds for 1 million reads which seems perfectly reasonable. For Ilgar however, it was taking around 12 minutes per 1 million reads which is less than ideal when alignment files may contain >100M reads.

To identify the bottleneck I first used a bit of trial and error to identify a single read start position which was particularly troublesome. In hindsight it was hardly surprising this was an area of extreme coverage. With the use of of the python profiler module we can identify the part of the script which was taking the longest time.

python -m cProfile -s tottime dedup.umi.py -I subset.bam -S out.bam --method=adjacency > profile.log

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 26352342   47.813    0.000   76.867    0.000 dedup_umi.py:196(edit_dist)
 26352397   20.992    0.000   20.992    0.000 {zip}
 26352353    8.062    0.000    8.062    0.000 {sum}


So the majority of the time taken was the edit_dist function which was run 26352342 times! Looking back at the script, we can see this is because our adjacency-based methods take the list of UMIs observed and create a dictionary mapping each UMI to the UMIs which are a single edit distance apart (hamming distance = 1). This is an all vs. all operation so the time taken increases quadratically with the number of UMIs observed. The obvious solution was therefore to improve the speed of the adjacency dictionary building. Let’s look at relevant portion of the original code.

def edit_dist(first, second):
    ''' returns the edit distance/hamming distances between
    its two arguments '''

    dist = sum([not a == b for a, b in zip(first, second)])
    return dist

def _get_adj_list_adjacency(self, umis, counts, threshold):
        ''' identify all umis within hamming distance threshold'''

        return {umi: [umi2 for umi2 in umis if
                      edit_distance(umi, umi2) <= threshold]
                for umi in umis}

The _get_adj_list_adjacency method uses a dictionary comprehension with the values generated by a list comprehension so the edit_distance function is called on every combination of UMIS. For a quick improvement we can put the code for the edit distance calculation inside the list comprehension so we avoid recalling the function, but the gains are minimal. Alternatively, as you may have noticed, we’re only interested in UMIs where the distance is >= threshold so we could optimise the edit_distance method by returning as soon as the distance is over the threshold:

def edit_dist(first, second, threshold):
    ''' returns True if hamming distance is less than threshold '''

    dist = 0
    for a, b in in zip(first, second):
        if a != b:
            dist += 1
        if dist > threshold:
            return 0
    return 1

def _get_adj_list_adjacency(self, umis, counts, threshold):
        '''identify all umis within hamming distance threshold'''

        return {umi: [umi2 for umi2 in umis if
                      edit_dist(umi, umi2, threshold)]
                for umi in umis}

This improves the speed ~2-fold which is nice but not really enough!

Speaking to Andreas about this problem, he quickly identified cython as simple solution since the hamming distance calculation is very simple. Here’s the cython code snippet and the pertinent code in dedup.py

_dedup_umi.pyx:

cpdef int edit_distance(a, b):
    cdef char * aa = a
    cdef char * bb = b

    cdef int k, l, c
    c = 0
    l = len(a)
    for k from 0 <= k < l:
        if aa[k] != bb[k]:
            c += 1
    return c

dedup_umi.py:

...
import pyximport
pyximport.install(build_in_temp=False)
from _dedup_umi import edit_distance
...
def _get_adj_list_adjacency(self, umis, counts, threshold):
        ''' identify all umis within hamming distance threshold '''

        return {umi: [umi2 for umi2 in umis if
                      edit_distance(umi, umi) <= threshold]
                for umi in umis}

In the cython function we statically define the type for the two C strings a & b and the ints k, l & c. Then we count the positions in each string where the characters are different. In dedup_umi.py we just need to import the pyximport module and use it to compile _dedup_umi.py. Then we import the edit_distance function from _dedup_umi.py and use it like any other python function. This simple modification reduces the run time on the troublesome genomic coordinate from 91 secs to 11 secs – the most time-consuming function for this particular region is now remove_umis.

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     6432    5.902    0.001    5.902    0.001 dedup_umi.py:217(remove_umis)
       11    3.626    0.330    5.544    0.504 dedup_umi.py:296(<dictcomp>)
 26359658    1.918    0.000    1.918    0.000 {_dedup_umi.edit_distance}

The cython function could be further improved along the lines of the python function by returning as soon as the threshold is exceeded. I’m sure there are other improvements we could make to the run time of the script. For example, transforming the UMI strings to 2bit arrays would make the hamming distance calculation a very cheap xor bitwise operation. However, I’ve not tested yet if the cost of the transformation outweighs the gains in the edit distance calculation.

This is the first time I’ve used cython – something that Ian Sudbury (UMI-tools co-author) had suggested might be worth considering a while ago. To be honest, the solution was so straightforward that Andreas ended up implementing it himself whilst he demonstrated it. I think it might be time for me to read that cython documentation fully now!

Unique Molecular Identifiers – the problem, the solution and the proof

[EDIT: UMI-Tools open access publication is now out]

I’ve been working with Dr. Ian Sudbery (former CGAT fellow, currently Lecturer in Bioinformatics at the University of Sheffield) on a script called dedup_umi.py to correctly remove PCR duplicates from alignment files when using Unique Molecular Identifiers (UMIs). The script is available from GitHub as part of the UMI-tools repository. As part of the development I’ve been running some simulations to compare different methods which have proved quite illuminating. Before I go into the details though, some background.

PCR duplication artifacts

Many sequencing-based techniques rely upon an amplification step to increase the concentration of the library generated from the fragmented DNA prior to sequencing. Following alignment to the genome, it’s usually preferable to identify and remove PCR duplicates as there are inherent biases in the amplification step as some sequences become overrepresented in the finally library compared to their true abundance in the fragmented DNA. The brilliant Picard tool is commonly used to identify and remove these PCR duplicates using their genomic coordinates. Of course, it’s also possible that two sequence reads with the same genomic coordinates will have the originated from two independent fragments. However, where the alignment density is essentially random (as in DNA-Seq) this is unlikely, even more so if paired end sequencing has been performed. The human genome has ~3 x 109 bases and the insert size will vary so the combinations of possible genomic coordinates for the paired ends is so vast we don’t expect to sequence two fragments with the same genomic coordinates.

Conversely, in RNA-Seq, some cDNA sequences will be highly abundant and we are therefore much more likely to sequence multiple fragments with the exact same genomic coordinates. This problem is most acute where a high number of PCR cycles have been used to generate the sequencing library, for example where the concentration of starting material in low, such as in single cell RNA-Seq as this necessitates many PCR cycles.

Using unique molecular identifiers to correctly identify PCR duplicates

Unique molecular identifiers (UMIs; also called Random Molecular Tags (RMTs)) are random sequences of bases used to tag each molecule (fragment) prior to library amplification (see figure below and Kivioja et al (2012), thereby aiding in the identification of PCR duplicates.  If two reads align to the same location and have the same UMI, it is highly likely that they are PCR duplicates originating from the same fragment prior to amplification. We can therefore collapse all the reads with the same genomic coordinates and UMI into a single representative read and obtain an accurate estimate of the relative concentration of the fragments in the original sample.

nmeth.2772-F1This figure is adapted from Islam et al (2014)

The problem

The use of UMIs as described above would work perfectly if it were not for base-calling errors, which erroneously create sequencing reads with the same genomic coordinates and UMIs and that are identical except for the base at which the error occurred. Base-calling errors therefore inflate the apparent numbers of unique fragments sequenced (see below). An additional source of errors comes from the amplification itself which can introduce errors during the DNA duplication step.

schematic_1

The solution

Ian initially identified this issue in his own data and wrote a python script utilising pysam to parse an alignment BAM file and inspect the UMIs of reads with the same genomic coordinates. He reasoned that sequencing errors would produce UMIs which are a single edit distance away from another UMI. Therefore, by forming an adjacency matrix it should be possible to identify the sequencing errors and collapse the reads to the true number of unique UMIs present (see schematic below for a simple example). The first step is to identify connected components, where UMIs are represented by nodes and edges represent links between nodes which are a single edit distance apart. The counts of each node are also retained. Within each component, the most abundant node is selected and all nodes within a single edit distance are removed. This process is repeated using the second most abundant node and so on until all edges are removed and the number of nodes remaining is inferred to be the number of unique fragments sequenced. We refer to this method as the “adjacency” method.

schematic_2

It turns out this works pretty well, however, one can envisage situations where this may not correctly identify the true number of unique fragments. For example,  an error during the PCR amplication may create an apparently distinct UMI from which sequencing errors could created further UMIs, all of which are actually from a single unique fragment (see schematic below). A similar situation could also arise if one particular position in the reads was more likely to contain an error (this is quite common in illumina sequencing). In this case, one would see lots of UMIs with a single edit distance due to an error at the common position and further errors may be present off of this UMI.

schematic_3In order to deal with these more complicated networks of connected UMIs, I developed the “directional_adjacency” method. I reasoned that we could resolve these networks by forming edges between nodes where the count of node A was >= 2 x node B. Now, each connected component is inferred to originate from a single unique fragment.

In addition to these methods, we also wanted to try a modification of the “adjacency” method in which the whole connected component is inferred to originate from a single unique fragment (“cluster”).

This issue has been previously observed by Islam et al (2014) who proposed that UMIs with less than 1% of the average counts should be removed “to account for UMIs that stem from PCR-induced mutations or sequencing errors”. We implemented this method within our tool for comparison (“percentile”).

Finally, we implemented a “unique” method, where all unique UMIs were retained. Note, this appears to be the method which is most commonly used. The schematic below shows a comparison of the various methods (I’ve obviously rigged it to make directional adjacency look the best).

schematic_2

The proof

We’re taking two approaches to prove the value of improved detection of PCR duplicates using the adjacency methods implemented in dedup_umi.py. Firstly, we’ve conducted simulations to show they consistently estimate the true number of unique fragments more accurately. Secondly, we’ve applied dedup_umi.py to multiple single-cell RNA-Seq studies and show an improvement, with respect to the distribution of UMIs, the quantification of spike-in RNAs and the clustering of cells. I’ll focus on the simulation for now and leave the single-cell RNA-Seq to another post when we’ve submitted the manuscript.

In order to compare the various methods, we wanted to know how the methods fared across a range of UMI lengths and depths of sequencing.  For this I turned to python to replicate the process of amplification and sequencing in silico and model the impact of the various parameters on the efficacy of the methods. The whole process is contained in a standalone ipython notebook which you can view here. If you would like to download and run the notebook yourself you can do so from here.

In brief, the simulation starts with a set number of unique molecules (fragments), our “ground truth”. A UMI is then randomly assigned to each molecule and the pool of molecules amplified by duplicating each molecules for every in silico PCR cycle. The final pool of molecules are then “sequenced” to produce the reads. Within this process, base-calling errors are incorporated during the sequencing and substitutions are incorporated during the PCR cycles. Furthermore, amplification efficiency of each UMI varies to naively model the impact of amplification biases. Various methods are then applied to the reads generated to simulate the de-duplication process within dedup_umi.py and an estimate of the number of unique molecules is obtained. To compare the methods I used 2 metrics:

  • The difference between this estimate and the ground truth
  • The variability of the estimate for which I used the coefficient of variation (CV; sd/mean)

Each of the following plots below show the impact of changing one of the simulation parameters on theses two metrics. In each plot, the red dashed line shows the parameter value that was used in the other simulations.

We can see that under the assumptions of the simulations, the directional adjacency method is the least sensitive to variations in the input parameters.

The unique method (the method used in most publications) severely overestimates the number of unique molecules since each sequencing error or PCR error erroneously increases the estimate. As the number of starting molecules increase, all methods eventually struggle to identify the true number of unique molecules, with the adjacency, directional adjacency and percentile methods showing the best performance.  As we would expect, the percentile method tends to overestimate the number of unique molecules since it fails to remove sequencing errors unless they are very rare events. The adjacency method tends towards an overestimation too, as it can only identify erroneous UMIs within a single edit distance. Interestingly, the directional adjacency method tends towards an underestimation as it can link together a string of UMIs each separated by a single edit distance. The directional aspect of the network is designed to avoid this happening between truly unique UMIs, however occasionally these will be collapsed together. The cluster method tends towards a more severe underestimation as it does not take counts into account at all and collapses whole networks into a single UMI.

Overall, the directional adjacency method shows very little sensitivity to the number of starting molecules and outperforms all other methods. As expected, the percentile method becomes more consistent as the number of molecules increase, since the variability between the counts in each one of the individual iterations is reduced. In contrast, the cluster and adjacency methods become less consistent as the number of starting molecules increases, as the variability in the topology of the networks formed will also increase.

simulation_number_of_umis_xintercept_difference_cv_deduped

As the UMI length increases (from a initial very short 3bp), the methods which utilise a network quickly improve in terms of accuracy and variability. However, the percentile and unique methods actually become less accurate as a longer UMI obviously means more probability of errors accumulating. Perhaps counter intuitively then, if you’re not going to account for errors when using UMIs, you’re better off not using a very long UMI.

simulation_umi_length_xintercept_difference_cv_deduped

As the PCR cycles are increased, all methods except directional adjacency show a decrease in accuracy and an increase in variability, with adjacency being the second most robust to high PCR cycle numbers. We see the same as we increase the DNA polymerase error rate. This confirms that the directional adjacency method is better able to handle errors originating from the amplification stage.

simulation_pcr_cycles_xintercept_difference_cv_deduped

simulation_dna_pol_error_rate_xintercept_difference_cv_deduped

Is a non-heuristic solution possible?

The methods we propose are definitely an improvement on “unique” deduplication using UMIs within these simulations.  We’ve also shown that the application of our proposed methods can have a real impact on biological interpretation of single cell RNA-Seq data (more to come on that soon). However, all implemented methods are heuristic approaches to the problem. My initial expectation was that we would be able to apply some probabilistic approach to determine the most likely set of UMIs which would account for the UMIs observed at a given genomic location. So far, we’ve been unable to develop such a method. If anyone would like to discuss alternative probabilistic approaches, or the script in general, please get in touch.

Lessons from C

One of the first things all new CGAT fellows have to do is learn how to program.  We do this in Python, firstly because it’s lovely and intuitive, it has an amazing community using, supporting and developing it and teaches some of the basics of programming like data structures, flow-control, etc without requiring a background in computing or computer science.  Oh, and our code base is mostly in Python and it helps if everyone is singing from the same hymn sheet.  In essence it’s a great beginner language (certainly more so than the cryptic Perl).  This blog isn’t going to be all about technical aspects of what we do, but a few of us have been delving into C recently.  I thought I’d just cover a few of the cool things I’ve learnt from C, which I hope, will improve my Python programming and my overall understanding of computing.

Lesson 1: The difference between static and dynamic typing
Python variable assignment is really easy:

var = something

It doesn’t matter per se what the “something” is, be it a string, integer, float or a Python list or dictionary.  It’s all the same.  Nice.  Simple.  In C you can assign a variable a value in much the same way, but you have to declare the type of the variable at the start of the program and/or function where it is being used, and you can’t change it.  In Python var can start as an integer and later be re-assigned as a string or tuple.  Sometimes in Python you can accidentally switch variable type without realising, which can cause unexpected code behaviour. As Python does not know the type of a variable before it is actually used, any errors resulting from wrong variable types manifest themselves only when your program is executing. And it might have been running for a while. Not so in C. As the variable type is declared, the compiler can check at the compilation stage for any instances where you, for example, attempt to add a number to string. I now realize that with freedom comes responsibility and why best practise in coding is so important – even if you can reuse variable names in Python, it is usually not a good idea.

Lesson 2: computers work in bits

Perhaps a little obvious that computers only understand 0’s and 1’s at the end of the day, but understanding data representation in bits is also really important.  It’s rare that we would have to think in this way with our day-to-day Python programming.  Enter C and the bitwise operators (NB: if, like us, you are torturing yourself by learning C from “The C programming language” by Kernighan and Ritchie you may have also experienced this lesson).  Bitwise operators are pretty neat, you can do some pretty funky manipulations by just changing individual or groups of bits.  Also this was a great lesson in the Two’s complement number system and bit arithmetic.  Even if you aren’t learning C and you don’t know about this it’s worth checking out.  This helped me work out how the SAM bitwise flags work!

Lesson 3: characters are just numbers

Each ASCII character is represented by a number, which means you can do arithmetic on characters. The following is valid syntax in both Python and C:

print(1 + 'a')

In Python this will throw a TypeError because the operation of adding an integer value to a string is not defined. In C this is perfectly valid because it recognises ‘a’ as a constant of type char for character, which is simply shorthand to write the byte value of 97, so it’ll print 98!  This example helped me to understand how different languages map supposedly similar symbolic representations to different CPU instructions. In Python, it is necessary to convert the character to its ASCII code explicitely. The equivalent operation is

print (1 + ord('a'))

There is a lot of legacy from C, which is evident in some of the syntax for bash commands and other languages, such as R, which we use for statistical analysis.  I’m a big fan of working from first principles, and learning C is about as close as I can get without having to dive into assembly language.