Dr Andy Law on bioinformatics

The difficulties of explaining coding best practice to biologists, applications in genetic research and a potential career as a sports professional.

[Image: Dr Andy Law]

Dr Andy Law’s group works to develop tools and resources for biologists who research the genetics of complex traits in animals. In this interview, he explains his work to Science Communication Intern Maggie Szymanska.

Could you tell me about your research in a nutshell?

I’m a bit of an outlier and possibly a slight fraud because I don’t really do research, per se. What I do is I play with other people’s data. More specifically, I build database systems for other people to manage their data in and I work with other scientists to write code to improve their analyses, helping them work more efficiently. That is really my niche.

The reason I’ve ended up doing that is that my background is in biology but I’ve also always been a “computer guy”. As projects started requiring more and more data management, I became the person that people turned to and it became the central part of my job.

Over the past two decades I’ve spent a lot of time designing and building systems to help people manage their data, and my work currently involves a lot of communicating with others and working with them to help them manage their own data and code. A lot of bioinformatics code tends to be written by biologists rather than computer scientists. While it’s really easy to write code and to evolve it to do exactly what you need it to do right now, it needs a lot of background understanding of the way that code can be fitted together and a good deal of planning to make the code fully understandable, reliable, efficient and re-usable. There are many, many ways to write ugly (bad) code. I made most of those mistakes in my early years of writing code and had to pay the price of going back and having to re-engineer it all to be more flexible so we could add new features. So, I try to work with people to evangelise good habits from the start.

Someone once told me that there is nothing more zealous than a reformed smoker – I think sometimes that I’ve become the coding equivalent!

You said you have a background in biology?

Yes, in the past I’ve worked on reproductive physiology in sheep and then in cows, the molecular biology of Transforming Growth Factors Beta in chickens and then genome mapping – before I became “the database guy”. Fundamentally, I am a biologist at heart.

Several years ago when I was just about to fall off the end of a fixed-term contract, I was offered a job working with databases in a bank, which made me realise that although I am interested in the databases, it’s the data that sits underneath that – the biology – that really gives me the buzz.

However, I do love the predictability of coding. If a piece of code doesn’t do what I expect it to do, it’s probably because I’ve written it wrong. And if I have a unit test in place to exercise that particular piece of code, I can see that problem immediately. Whereas, if a biological experiment doesn’t work as anticipated, there are so many different possible reasons that need to be investigated and checked out. That’s fun too, but it’s a long process. I get my feedback much quicker in the coding realm and I’m fortunate to be able to combine that with the excitement of also dealing with the biology in the data.

Why did you decide to become a biologist?

I have always enjoyed working with animals – I worked on farms in my summer holidays when I was at school and it was always a “happy place”. I studied agriculture at university but quickly changed to animal science after the first term when I realised I wasn’t that interested in the plants. I guess that in the back of my mind then was the thought that I would work in agriculture somewhere, but doing my honours project (on reproductive physiology in sheep) gave me a taste for research. In general, biological research is just an awesome thing to get paid to do. I find it utterly mind-boggling when I sit back and think about the complexity of DNA and genomes and plants and animals and the unbelievably complex set of interactions that go on inside every cell of every living organism every fraction of a second of every day. And we get paid to try to work out how it all fits together!

In terms of career paths – becoming a biologist and then a bioinformatician – I have to confess that there’s never been a plan of any kind. I’ve always just been lucky enough to be in the right place at the right time with the right set of skills. When the Institute needed someone who understood the biology but also knew the computing side of things, I was just fortunate to be there. I know that having a plan is much, much better than not having a plan, but so is grabbing opportunities when they arise. One of the most valuable pieces of advice I can offer to those starting out on their career paths is to tell them to find something that they enjoy doing and just go for it and do it (and if you can find someone to pay you for doing it then that’s an ideal outcome).

It seems that programming and coding are becoming ever more important in biological research.

[Image: The genetic code of E. coli is converted to machine code]

Yes, they certainly are. However, just being able to code a script isn’t really enough; it’s also important to know how to code well. There’s good software engineering and there’s bad software engineering and if you want your code to be useful for more than a week or two then it’s going to be to your advantage to learn the good habits.

An analogy I’ve used in the past is building a go-kart; most people could probably put together a go-kart out of some pram wheels and bits of wood that would go down a hill at least once. However, if you needed to reuse the go-kart several times then you might need to build it stronger. If you needed to add some extra functionality – like a motor so it could go up the hill – then some of the decisions that you made in the first phase might make that easier or harder. And then if you wanted to take that go-kart racing, you might need to look into how to deal with the aerodynamics, the suspension etc. all whilst retaining the core functionality and being certain that it was still a functional go-kart. It takes good engineering to do that, and the decisions that you make early on in the process have consequences for what can be done (and how easily) later on. It’s exactly the same with software.

So, recently I’ve been working with a colleague on some code that he developed and he asked if we could make the code run in parallel so it would work faster. But it turned out that we didn’t need to – by re-engineering the code and replacing some inefficient parts with sections that performed the same task but with better understanding of how the computer actually works internally, we were able to make the scripts run almost 100 times faster, meaning you could do 100 times more analyses or simulations in the same time as before. That is a really big difference! Understanding the architecture behind the code – knowing the consequences of different design decisions and knowing how to deconstruct and reconstruct code in a controlled manner so you can be confident that you still get the same answer for a given input – is really useful, especially when working with high-level scripting languages.
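To give a flavour of what this kind of re-engineering can look like, here is a minimal Python sketch. It is not the code from the project described above; the function names and the task (finding which sample identifiers appear in a reference list) are assumptions chosen purely for illustration. Both functions return the same answer, but the second builds a set first, so each lookup is a hash check rather than a scan through the whole list.

```python
def shared_ids_slow(sample_ids, reference_ids):
    """Keep the sample IDs that also appear in reference_ids (hypothetical task).

    Membership tests on a list scan it element by element, so the total
    work grows with len(sample_ids) * len(reference_ids).
    """
    return [s for s in sample_ids if s in reference_ids]


def shared_ids_fast(sample_ids, reference_ids):
    """Same result as shared_ids_slow, using a set for the lookups.

    A set is backed by a hash table, so each membership test takes
    roughly constant time regardless of how many references there are.
    """
    reference_set = set(reference_ids)  # built once, reused for every lookup
    return [s for s in sample_ids if s in reference_set]


# Illustrative data only.
samples = ["s3", "s1", "s9", "s1"]
reference = ["s1", "s2", "s3"]

# The re-engineered version must give the same answer for the same input.
assert shared_ids_slow(samples, reference) == shared_ids_fast(samples, reference)
```

How much faster the second version runs depends entirely on the size of the inputs, so no particular speed-up should be read into this sketch.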

How do you see the future of your work here?

Hopefully much more of the same. One additional task that is likely to be passed on to me in the near future, though, is the responsibility for overseeing the research computing infrastructure. Again, that’s one of these roles where it’s vital to understand the relationship between the data and the types of analyses that the Institute needs to be able to perform and the actual physical hardware that we use to run those analyses on. Obviously, resources are limited so it’s important to get the right balance between expensive fast disk and cheaper slow disk, nodes with huge amounts of memory vs larger numbers of faster nodes with less memory, CPUs vs computing on GPUs etc. So there will likely be a lot more time spent reviewing demand (and predicting demand) and then monitoring and managing systems to make sure we spend our money efficiently and have the right technology for the problems we face. As has been the pattern in my career so far, I’ve just ended up being in the right place at the right time.

Could you briefly describe a project you liked doing?

I think I can honestly say that I’ve only ever worked on one project that I didn’t enjoy. The less said about that one the better. But, actually, I am really enjoying a project I am currently working on. I am developing some code that tries to assess how important certain genes are in a particular biological process by looking at multiple different assays and noting how often those genes are identified as being involved.

It’s a combination of an interesting biological question and an interesting coding challenge and there are multiple different levels in each. How important are the individual genes in the process? How likely is the assay to be able to detect them? How robust was the experimental design for one of those assays and can we identify that and weight it based on the genes that the specific publication turned up? Do I trust this assay more than the other types? Why is that gene in a particular position on the list? How many genes are involved in the process? Questions like that keep me intrigued.
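The basic idea of counting how often a gene turns up across assays, with some assays trusted more than others, might be sketched in Python as follows. This is only an illustration under simple assumptions, not the method used in the project: the function name `score_genes`, the assay and gene labels, and the weights are all hypothetical.

```python
from collections import defaultdict


def score_genes(assay_hits, assay_weights=None):
    """Rank genes by how often they are reported across assays.

    assay_hits maps an assay name to the genes it identified.
    assay_weights optionally maps an assay name to a trust weight;
    assays without an entry count with a weight of 1.0.
    """
    scores = defaultdict(float)
    for assay, genes in assay_hits.items():
        weight = 1.0 if assay_weights is None else assay_weights.get(assay, 1.0)
        for gene in set(genes):  # count each gene once per assay
            scores[gene] += weight
    # Highest score first; ties are broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Made-up example data: three assays and three placeholder genes.
hits = {
    "assay_1": ["geneA", "geneB"],
    "assay_2": ["geneA"],
    "assay_3": ["geneA", "geneC"],
}

print(score_genes(hits))
print(score_genes(hits, {"assay_3": 0.5}))  # trust assay_3 half as much
```

A simple weighted count like this ignores most of the questions raised above, such as how likely each assay is to detect a given gene, which is part of what makes the real problem interesting.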

Challenging myself, learning new things about biology and about coding is what has always got me out of bed and is what makes projects like this so much fun.

What is a challenge you often experience?

Explaining software engineering to biologist coders with no software engineering background and a deadline to meet.

In particular, a common question I hear is, “Why do I need to write unit tests?” and the answer always is “Just trust me, you do”. Unit tests are small, self-contained bits of code that check that equally small pieces (units) of your analysis code work properly. They check that the code produces the expected output or response, given a particular input. Obviously, you have to write the analysis code regardless but writing the set of unit tests is an additional exercise which takes additional time. However, the time spent up-front has a serious payback later on in development when you need to rebuild and extend what you’ve already written. It’s back to that “I started with a wooden kart and now I’m racing at Brands Hatch” analogy. Having a comprehensive set of unit tests allows you to take bits of your code apart and re-engineer them safe in the knowledge that the expected behaviour of the other component parts is not affected. That’s a difficult message to sell, because most biologist coders don’t particularly care about long-term re-use and – effectively – you’re asking them to put extra effort in up-front that they can’t see a return on. However, having the right unit tests in place (and I would be lying if I claimed that all unit tests were good) is essential in my opinion for anyone who is serious about code quality and auditability.
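As a concrete illustration of what a unit test looks like, here is a minimal sketch using Python’s built-in `unittest` module. The function being tested, `gc_content`, is a hypothetical example of a small unit of analysis code and is not taken from any project mentioned in this interview.

```python
import unittest


def gc_content(sequence):
    """Return the fraction of G and C bases in a DNA sequence (hypothetical helper)."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    upper = sequence.upper()
    return (upper.count("G") + upper.count("C")) / len(upper)


class GcContentTest(unittest.TestCase):
    """Each test checks one expected output or response for one input."""

    def test_known_sequence(self):
        # Two of the four bases are G or C.
        self.assertAlmostEqual(gc_content("GATC"), 0.5)

    def test_lower_case_input(self):
        # Lower-case bases should be treated the same as upper-case ones.
        self.assertAlmostEqual(gc_content("ggcc"), 1.0)

    def test_empty_sequence_is_rejected(self):
        # An empty sequence has no meaningful GC content.
        with self.assertRaises(ValueError):
            gc_content("")
```

Saved in a file such as `test_gc_content.py` (an assumed name), the tests can be run with `python -m unittest test_gc_content`. If `gc_content` is later rewritten to be faster, re-running the same tests confirms that its behaviour has not changed.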

If you weren’t a scientist, what would you be?

My other big passion, my out of work passion, is sport. I’m sport mad and it’s an obsession that’s fortunately shared by the rest of my family. I played rugby for many years and I was a decent cricketer too. So, if I wasn’t a scientist, I would definitely be doing something related to sport.