Wednesday, January 20, 2016

Good News for the Market


The great news for the market is that bioinformatics is going to continue to grow, with a CAGR (compound annual growth rate) of approximately 20.4% through 2020. The bad news is that the bioinformatics market is hindered by a "lack of interoperability among data formats" and by a lack of talented, skilled professionals.

So let's first address the "lack of interoperability among data formats". In other words ... as a profession we lack standardization. We write code to fit the test data we are using, and that test data might not be in the same format as your data. So this leaves you having to either reformat the data or re-write the code. And this cycle repeats itself. If you have been following along with my posts already, you know that lack of standards drives me (and others) nuts! This is especially infuriating for researchers who are mostly interested in applied bioinformatics, and it is half of the reason why almost every science lab needs a bioinformatician (the other half of the reason is that "gone are the days of Excel"). Two things are going to drive this change. First, clinical research and big pharma cannot afford to allow this nonsense to go on. This is why there is a big boom in commercial bioinformatics software designed to run and re-run the same analyses over and over again (a production pipeline vs. an R&D pipeline). I believe that those of us writing bioinformatics code in the hope of one day going commercial will start writing to a standard that is more appealing to buyers, and that will in turn create a de facto standard. Second, I really think there will eventually be a small group of bioinformatics professionals who will release guidelines to hold the profession accountable. The only reason I can think of as to why this hasn't happened yet is that the market is growing too rapidly for anyone to take a moment to step back and approach it.

Now to address the lack of skilled and trained professionals. For all intents and purposes, bioinformatics is a data science field, and it competes for data scientists with other industries. As I mentioned in a previous post, I personally know colleagues who were trained in bioinformatics and found employment in other fields. As depressing as it is, when you Google "bioinformatics as a profession" you will stumble on at least one rant from someone leaving the profession for another career. It is true that the market is growing, but so is the market for data scientists in general. Bioinformatics programs combine data science training with training in biology (you have to know something about the biology). Glassdoor estimates that the national average salary for a bioinformatics scientist is $85,149. The national average salary for a data scientist is $118,709. Keep in mind: bioinformatics is still tied strongly to academia (where salaries in general are much lower), and there are certain biotech hubs where salaries for bioinformaticians are much more comparable to a data scientist's. Every business is looking for a data scientist, and the field of data science has had to define itself and is still defining itself (here is a really fun discussion on YouTube by the Royal Statistical Society). Not only are salaries better, but data science jobs are not restricted to a few hubs. Plus there is less of an emphasis on degree and publication record for data science positions outside of bioinformatics.

So how can we make bioinformatics more attractive?

1. Pay better. Although this is much harder to achieve given that bioinformatics is still deeply rooted in academia, and sadly biologists (the group of people most often in need of applied bioinformatics) already sit at the low end of PhD earnings.

2. Replace people with tools. R-bloggers approached this question for data science back in 2012, and I would agree with that opinion: bioinformatics isn't just technical, it is sociological. There will always be a need for people who understand the principles and methods of bioinformatics to adequately communicate about the data and its findings, and to advocate for future approaches.

It will be interesting to see how the market looks in the next 5 years, and to see some real data on what graduates do after earning degrees in bioinformatics. I would like to think that as companies place emphasis on data scientists' ability to save millions of dollars, the bioinformatics industry will place emphasis on bioinformaticians' ability to improve quality of life. Or, more likely, health insurance companies will place emphasis on bioinformaticians saving them money with pre-diagnostics...


Monday, January 11, 2016

Blog Description - I think? At least for now? Like most of my life, it is an ever-evolving pipeline.

I feel the need to sort of "define" this blog. Everyone seems to have a blog today. What makes this blog different from any other? I probably should have started with this post, but, despite my overwhelming need to plan everything out first, I still have a tendency to jump into things and then figure them out later. So ...

This blog will probably never be incredibly technical. I will talk about pipelines, software, general cool tools I get excited about, etc., but you probably won't see any code of mine up here. I feel as though there are plenty of resources already available, and if I do something technical, I'll post it as a Vimeo (I have a few in the making, but time is one resource you cannot replenish and always need more of). But I do see this blog developing into a resource for soft skills. And soft skills can be just as important as technical skills for career development! Now don't freak out if you are slightly socially inept (or very inept). There are plenty of careers for those types of people too. But in order to explain your work to an audience, soft skills are important. And a certain amount of soft skill is necessary for a general working environment with other people (specifically gives a knowing look to the network guys).

One thing that I will share on this blog is bits and pieces of conversations with my husband (Ace). He is a bit of an economics/statistics wizard! Seriously. And a lot of what we both do at our jobs overlaps. Data mining, analytics, databases and reporting are comparable no matter what data you start with. In fact, it isn't uncommon to find someone with training in bioinformatics in another field such as banking. Why? Well, to be blunt, these individuals have the same skill sets businesses are looking for, they make more money, and there are more opportunities for career growth. Personally, I am passionate about the science. I get a high from working in biotech because I see it as a much needed service for human advancement. Ace gets a high from saving companies millions of dollars. I get a high from helping with the research and development of biologics that will save millions of lives. To each their own.

But since Ace and I have similar skills but different areas of expertise, we often look at data from different angles. This is super cool because we can pick each other's brains, but the trick is that we have to explain our data in terms the other can understand. This is a great exercise for each of us because it helps us work on explaining complex concepts without all of the jargon. This can be fairly difficult for me because it also requires an explanation of the science behind the data, and that adds to the complexity.

It really doesn't matter what job you have; at some point you will need to be able to adequately explain your work to your boss, a client, etc. Given different backgrounds and expertise, this can be challenging. Data scientists, IT, bioinformaticians, etc. are usually considered black boxes. A black box means that other people at work have no idea what you do, but they know it is important. This can be a good thing because it usually means no one will ask you what you are doing and you sort of get a "free" pass with some things (meaning you are generally left alone to do your job, which requires the ability to "tinker" with your computer all day). BUT a black box also comes with frustration. No one understands why you don't have results yet, why it is taking so long to fix a network issue, or why you cannot just use the same pipeline someone else published with. So, I try not to be a black box, and explaining things without the jargon to Ace over dinner helps me hone the soft skills necessary to explain myself to the people I work with.



This shirt was a vendor handout at the 2015 ACTG conference and plays on how bioinformatics is a black box. I challenge my soft skills to match my technical skills so that this shirt becomes just a joke and not the general perception of my profession.

Friday, January 8, 2016

Another Paper on Reproducible Research

Here is yet another paper on our lack of reproducibility. OK, we hopefully now "get it" and understand that we need to do a better job of reporting and publishing our protocols, including bioinformatics pipelines. Can we please start publishing guidelines and standards to adhere to? And then make journals enforce that those practices get followed?

So today I'm going to give three suggestions on how to better capture your pipeline, based on my own experiences.


Cartoon by Sidney Harris (The New Yorker)


1. Flow Charts
I am often expressing my love and appreciation for flow charts. I love visuals. I love making lists. And I super love making a list into a visual - TaDa - flow chart. Look, it just helps me stay on track, and clients appreciate having something to look at while I verbally run them through the pipeline I am using with their data. Creating a flow chart can take some time (depending on how OCD you are and how much pride you take in your work before giving something to a client), but I truly believe that the impact on your work and your client relationship is worth the time you put into it.



2. Version Control
Oh yes, version control is all the rage. Look, it is no secret that changes in versions can give different results. Version releases are important. Releases clean up mistakes, decrease compute time, help maintain relevancy as technology changes, etc. But even a small release that changes how a number gets rounded can result in huge changes in outcome. So write down your versions and MAKE NOTES when you upgrade them! This is incredibly important for anyone trying to replicate your results.
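As a minimal sketch of what I mean (the tool names and log file below are placeholders, not necessarily what your pipeline uses), a few lines at the top of a run script can capture versions for you automatically:

    #!/bin/bash
    # versions.sh - append tool versions to a dated log before a pipeline run
    # (swap in whichever tools your pipeline actually calls)
    LOG="versions_$(date +%Y%m%d).log"
    {
      echo "Run date: $(date)"
      bwa 2>&1 | grep -i version        # bwa prints its version to stderr
      samtools --version | head -n 1
      R --version | head -n 1
    } >> "$LOG"

That way the versions get written down even on the days you forget to.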

3. DeBugging
(scene set): You have identified a new, amazing bioinformatics tool. People are blowing up Twitter about it, Bioinformatics has published on it, etc. And you have a light bulb moment: Ah ... ha, maybe I can use this tool for my research since the data are somewhat similar to what other people are using this tool for! So you start out making a new FLOW CHART to include this new tool. It should be easy, but all of a sudden you keep getting errors that you can't figure out (so you post on Reddit). You go to the documentation; nothing helps. You search all of the forums, but you either hit a dead end or find out that everyone else is having issues. You email the tool's contact only to find out that they have since been hired by SAS and just don't have the time to help you.


You are now frustrated (probably because someone has you on a timeline or wrongly thinks you aren't working since there are "no results") and you just know that if you could get this tool to work the result would be INCREDIBLE (I don't know why, but I always think IT WILL BE INCREDIBLE). So you start debugging by trying everything you can possibly think might be wrong. This is where we generally shut down on documenting the process. If you are like me, you get some sort of sick kick out of debugging that is OBSESSIVE IN NATURE. You can't stop to write down what you just tried because your mind is already racing to the next trial. AND when you FINALLY get that thing to work (because you probably won't stop until you get it to bend to your every whim) you have the BEST HIGH EVER. And if you are anything like me, you'll start high fiving yourself and anyone who is around you ... and you will immediately forget the details of the last several hours. BUMMER. But you don't care because you are still riding that high.


I know this about myself. I accept it. But I still need to document that process so that I don't have to repeat it (because several months later I might have to). PRO TIP: SCREEN RECORD THE PROCESS. I also like to include audio because it makes the videos more fun for me to watch later. Seriously, get some friends over later, have some drinks and put that video on. If you have geek friends or have an amusing narrative style, it will be the highlight of the party.
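And if a full screen recording with audio is more than you want to deal with, one lighter-weight option (just a suggestion, not part of my own routine) is to let the terminal record itself with the standard 'script' command:

    # start capturing everything typed and printed in this terminal session
    script debug_session_$(date +%Y%m%d).log
    # ... flail around, try things, swear a little ...
    # when you finally get it working, stop recording
    exit

It won't capture your victory dance, but it will capture every command you tried on the way there.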





Thursday, January 7, 2016

Command Line - for Science

My morning routine usually includes my ONE cup of coffee (I'm incredibly proud of the fact that I can get through an entire event-filled day with just ONE cup of coffee. This is a huge decrease from where I started!) and a look through the news. Today I went out of the norm and headed over to Twitter to start the day. I typically do not like Twitter in the morning because the majority of Twitter activity happens in the afternoon. (See this great brief on how to post to social media!) However, today was different, and I'm glad I started with Twitter because otherwise I may have missed this tweet that took me to a Kickstarter project for Learning the Command Line - for Science.

I posted some tips for novice bioinformaticians a month ago and mentioned how important it is to learn command line tools. There is a lot of emphasis on learning scripting (Perl) and/or programming (Java) languages, and tons of resources are available to help you. I think that is great! I also think it is great to learn the command line ... and I mean learning more commands than how to change directories or create/remove directories. If your goal is to use bioinformatics tools and you are not interested in creating them, I would focus on learning Bash, a command line language such as AWK or SED, and R. Other bioinformatics tools are already available to you, and the three suggestions above will help you execute pretty much every pipeline available.

Bash:
Bash is your default shell on both Linux and Mac. When you call 'cd' to change a directory, you are using Bash. There are LOTS of sweet commands that make Bash really useful for bioinformatics, and I encourage you to check out what tutorials are already available to learn more.

Some of my favorite commands are 'find' and the loop commands such as 'for', 'while' and 'until'. 'Find' is great for finding files (genius, I know). This command is really helpful on servers with multiple users (because people tend to move files around and then forget where they have put them) and for doing things with all of the files in a directory and its child directories (such as moving all of your raw reads scattered in a project folder to another volume). Loop commands are really nice for bioinformatics tools that do not accept multiple inputs (mostly newer tools that haven't been through version changes). Instead of running the command for a tool over and over again for each of your input files (e.g. if you have 2+ samples that you want to process), you can just write a Bash loop once to call the command and have it hand over a new input each iteration.
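As a rough sketch of both ideas (the paths, file names and the tool call are all made up here, so adjust them to your own project):

    # gather every fastq.gz scattered under a project folder into one raw_reads volume
    find /projects/my_project -name "*.fastq.gz" -exec mv {} /volumes/raw_reads/ \;

    # run a single-input tool once per sample with a for loop
    for f in /volumes/raw_reads/*.fastq.gz; do
      some_tool --in "$f" --out "${f%.fastq.gz}_out.txt"
    done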

Command Line Language:
I use AWK all of the time. Most of the files output by bioinformatics tools, or files that you would download from a public repository, are in a spreadsheet-like format - meaning they have rows and columns. AWK is a great manipulator for data in this format, and it is faster than writing a Perl script to input a file and read each line to blahblahblah. You get my point.

Now I feel the need to address 'manipulator for data', since out of context this phrase could sound like I am changing data. I am absolutely not changing any data, but instead maybe changing the column order or the data type. For example: a chromosome column may have chromosome 1 denoted as either 'chr1' or '1'. They both mean chromosome 1, but their data types are different, since 'chr1' is a string and '1' is an integer. Data tools hate when you mix data types, which is why a data type sometimes needs to be manipulated so that every time a chromosome is denoted, it is always denoted in the same way (same data type).
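A quick sketch of what that looks like in AWK (the file names are hypothetical, and check which column your chromosome actually sits in):

    # strip the 'chr' prefix from column 1 so every row uses the same notation
    awk 'BEGIN{OFS="\t"} {sub(/^chr/, "", $1); print}' input.bed > input_fixed.bed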

Bioinformatics pipelines change so often as new tools are created that the output from one tool may not be formatted for input into another tool (i.e. what is output in column 1 may need to be in column 4 as input for the next tool). With AWK you can fix all of these things with a quick script, and it executes fast.
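For instance, a hedged one-liner (the column positions and file names are invented) that moves column 1 to the fourth position for the next tool in the chain:

    # reorder columns: print columns 2, 3, 4 first, then column 1, tab-separated
    awk 'BEGIN{OFS="\t"} {print $2, $3, $4, $1}' tool_a_output.txt > tool_b_input.txt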

SED is fairly comparable to AWK; I just happen to use AWK most of the time.
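For the same hypothetical chromosome example above, the SED version would look something like this:

    # strip a leading 'chr' from each line (i.e. from column 1 of a BED-like file)
    sed 's/^chr//' input.bed > input_fixed.bed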

R - or (said with a pirate accent) AARRRR
R is an incredible environment for running statistics and creating really nice graphics and visuals with your data. PLUS, there has been a real effort to make its use REPRODUCIBLE (insert angels singing). I'm just going to stop there because there are lots of dedicated R users who will tell you of the glories of R. Just know it is definitely worth your while to learn!
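Since this post is all about the command line, here is a tiny, hypothetical example (the CSV file and its 'count' column are made up) of calling R straight from the shell rather than a proper R tutorial:

    # quick summary statistics of a results table without leaving the terminal
    Rscript -e 'summary(read.csv("gene_counts.csv"))'

    # save a simple histogram of one column to a PDF
    Rscript -e 'd <- read.csv("gene_counts.csv"); pdf("counts_hist.pdf"); hist(d$count); dev.off()'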

So to summarize, if you want to just use bioinformatics tools, you will benefit heavily from the command line (Bash), a command line language such as AWK, and a great statistical environment with graphics capabilities like R. Every other step in a pipeline will be filled by a tool, which is pretty much plug and play with different parameters (or arguments) that you should become fairly familiar with.



Disclaimer: The above post makes bioinformatics sound "easy". I am sorry to say that it probably won't be "easy". Any decent data scientist will tell you that data anything just isn't as easy as we tend to describe it. Data are messy and data tools can be a mess too. Cleaning up messes is really frustrating. So if you feel like you are wrangling cats, then you are probably doing it right.