Friday, June 24, 2016

See this is the sort of thing that drives me nuts ...

Look, I get that we (and I'm including me) are not perfect. I think that imperfections make us much more interesting and I appreciate that.

But can we PLEASE start to maintain our code/vignettes/library/tutorials or whatnot? I mean, it really burns me when I go through the trouble of reading the vignettes, reference materials and whatever tutorials have been released to the internet and ... after hours of my time ... I'm still scratching my head wondering what the BEEP.

It should also burn the creators of these documents: they have spent time putting together all of this supporting material, and people still cannot use their libraries.

Here's an example (and I apologize for picking on this particular library):

Chip Analysis Methylation Pipeline for Illumina HumanMethylation450 and EPIC

or ChAMP, is one of only two pipelines I can find in Bioconductor that will analyze Illumina EPIC arrays. I have two objectives: 1) analyze my data (obviously) and 2) compare both pipelines. To be fair, I'm having issues with the other pipeline (RnBeads) as well, but ChAMP provides a great example of how something so minor can become so time consuming.


So there are two ways to approach this library:

1. You can run all modules using the function champ.process()
2. You can run each module and subsequent steps individually 


Since I am comparing libraries, I am approaching things both ways. In theory, both options should give you the same outcome given the same arguments. When I run each module individually (option 2), I have to set up my directory and load my data with champ.load(). The loading function requires a sampleSheet.csv file. This file includes your sample names, Sentrix ID, Sentrix Position, etc. All that good stuff that comes with analyzing Illumina arrays. No problems.
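
For reference, the two entry points look something like this (a minimal sketch; beyond the directory argument I'm just assuming the defaults):

library(ChAMP)

# Option 1: run every module in one shot
# (the directory holds the .idat files plus the sampleSheet.csv)
# champ.process(directory = getwd())

# Option 2: run the modules step by step, starting with the loader
myLoad <- champ.load(directory = getwd())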

But then I get to champ.SVD(), and I want to include an argument with even more sample information that isn't contained in the sampleSheet.csv. I read the reference for the function and two things pop out:

Argument        Description

sampleSheet     If the data has not been loaded from .idat files and fromFile=TRUE, then this points to the required sampleSheet. Default is "sampleSheet.txt".

studyInfoFile   If studyInfo=T, this file will include the additional study information. Default is "studyInfo.txt".

Whoa ... wait a minute. Is the sampleSheet supposed to be a CSV or a TXT? To some people this may look like a small issue, but for me it is sort of a big one. CSV files separate fields (columns) with commas, whereas the .txt files here presumably separate them with tabs. This really matters because the loading functions rely on parsers that locate each field based on the delimiter (a comma or a tab). The wrong file format can seriously screw up a pipeline (we've all been there).
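
For the record, the call I was attempting looked something like this (argument names lifted from the reference manual entry quoted above, with beta and pd passed along from champ.load()'s output; whether the installed version still honours these names is, of course, the open question):

# Sketch only -- argument names taken from the reference manual, not verified
# against the installed version of ChAMP
champ.SVD(beta = myLoad$beta,
          pd = myLoad$pd,
          studyInfo = TRUE,
          studyInfoFile = "studyInfo.txt")   # .txt per the manual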

So my thought process goes like this:

1. This library has been updated but no one updated the reference manual ... so all files now have to be CSV. So I create a studyInfo.csv and that fails; studyInfo has to stay a TXT file.

2. This library has been updated and sampleSheet now has to be a CSV file to reflect a new change, but studyInfo still needs to be a TXT. Well, that runs, but it doesn't seem to include anything from the studyInfo file. So ... that fails ... sort of.

3. This library has been updated, but champ.SVD() has not been changed to reflect the changes to champ.load() (which now requires a sampleSheet.csv). This would be super dumb because it would require me to create two sample sheets (both a CSV and a TXT). And this also fails.


So in the end, I spent 30 minutes changing file formats and I still didn't get the result I wanted. Frankly, I shouldn't have to spend any time tinkering with file formats, and doing so is a bad habit on my part because I am just enabling the problem: we tinker and force code to work instead of making authors fix their code/vignettes/reference manuals. That just means the next person will have to walk through the same problem I did.

I would like to see libraries (including their vignettes and reference manuals) kept up to date, and libraries that are not updated as errors are reported should be pulled down and marked obsolete if the authors are no longer maintaining the code and its supporting documents.

But there is a silver lining ... these issues make me much more diligent about documenting my own code and pipelines. I now pay much closer attention to detail during my own documentation process.



Friday, April 22, 2016

Using Shiny Apps to Anticipate Questions from Your Audience

I like to think one of my strengths in what I do is presenting data in an interesting manner that tells a story AND anticipating the questions that may arise from that presentation. When I was a graduate student, I had an adviser who encouraged his students to add slides at the end of a presentation for the questions we anticipated from our audience. You couldn't cover all of the material in a 20 minute talk, but you could use the 10 minutes of questions/follow-up to add material if you had anticipated those questions ahead of time.

Today, I apply the same method to data reports and data presentations. I cannot point out and answer all questions that may arise from data discovery in a data report, but I can use my data presentation to anticipate those questions.

Here's an example:

I have some data looking at stress levels, energy levels, etc. from six different locations, and the data set will likely add more locations as it grows. The story that I want the data to tell is the trend in each location. At the moment, I'm really not sure of the importance of each location, and I predict that the importance of the locations will change as I continue doing data discovery on this set.

Below is an example of a figure that I would add to a data report as part of a dynamic document (see note below). With six different locations, the figure is kind of busy (and I will probably add more locations in the future)! To follow one location, you really need to outline it and study it on its own. So I could split this figure into six figures with one location per figure. That's one solution, but if I then want to compare a location to the others, I'm left flipping between figures. Flipping figures gets messy, and I know I have trouble keeping everything straight in my head.

So my solution is to include the figure above in the data report and to create a data dashboard with Shiny where I can highlight selected locations and present them later. This lets me highlight all of the data from, say, the second location to emphasize a point or answer a specific question about that location with a strong visual.


A plus to creating a Shiny dashboard for use in a presentation is that it can be handed to your stakeholders to interact with at a later date. You can host Shiny apps through RStudio's hosted Shiny service or create your own Shiny server. Hosting apps through the hosted service usually makes them public and limits the number of apps you can host (unless you pay for the cloud tiers). If you have your own hardware or use another cloud provider, you can set up your own Shiny server and host your apps from there. I choose the latter because I enjoy having admin rights.

Note: Dynamic documents are incredibly important for all data, but they are especially important for data that will grow. I can create a single Word document/HTML page/presentation whose figures all update as new data are integrated (plus add my own unseen notes <- everybody should make dynamic documents).
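
A minimal sketch of what I mean, assuming an R Markdown report (the file name here is made up for illustration):

# Re-render the report: every figure lives in an R chunk that reads the current
# data file, so new rows in wellness.txt show up automatically on the next render.
rmarkdown::render("wellness_report.Rmd", output_format = "html_document")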


SNIPPETS OF CODE:

I use the following function to create figures with a shared legend. I found it on Stack Overflow:

# Requires ggplot2, gridExtra and grid (loaded in the server code below)
grid_arrange_shared_legend <- function(...) {
    plots <- list(...)
    # Pull the legend grob off the first plot (all plots share the same legend)
    g <- ggplotGrob(plots[[1]] + theme(legend.position="bottom"))$grobs
    legend <- g[[which(sapply(g, function(x) x$name) == "guide-box")]]
    lheight <- sum(legend$height)
    # Stack the legend-less plots above the single shared legend
    grid.arrange(
        do.call(arrangeGrob, lapply(plots, function(x)
            x + theme(legend.position="none"))),
        legend,
        ncol = 1,
        heights = unit.c(unit(1, "npc") - lheight, lheight))
}
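
A quick toy call, just to show the usage (made-up data; any set of ggplots that share the same colour legend will work):

library(ggplot2)
library(gridExtra)
library(grid)

df <- data.frame(x = rep(1:10, 2), y = rnorm(20), grp = rep(c("A", "B"), each = 10))
p1 <- ggplot(df, aes(x, y, colour = grp)) + geom_point()
p2 <- ggplot(df, aes(x, y, colour = grp)) + geom_line()
grid_arrange_shared_legend(p1, p2)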

For my Shiny app, the UI is pretty straightforward (IDs and labels redacted):

ui <- shinyUI(fluidPage(
 
   # title
   titlePanel(""),
 
   # Conditional sidebars
   sidebarLayout(
     sidebarPanel(
      conditionalPanel(condition="input.conditionedPanels==1",
        radioButtons("", label = h3(""),
                     choices = list("" = 1, "" = 2,
                                    "" = 3, "" = 4),selected = 1)
      ),
      conditionalPanel(condition="input.conditionedPanels==2",
        radioButtons("",label=h3(""),
                     choices = list("" = 1, "" = 2, "" = 3), selected=1)
        ),
      conditionalPanel(condition="input.conditionedPanels==3",
                       radioButtons("Wellness",label=h3(""),
                                    choices=list(""=1,""=2,""=3,
                                                 ""=4,""=5,""=6), selected=1)),
      conditionalPanel(condition="input.conditionedPanels==4",
                       radioButtons("",label=h3(""),
                                    choices=list(""=1,""=2,""=3,
                                                 ""=4,""=5)))
      ),
   
      # Main panel separated into tabbed conditional panels
      mainPanel(
        tabsetPanel(type = "tabs",
         tabPanel("", value =1, plotOutput("")),
         tabPanel("", value=2, plotOutput("")),
         tabPanel("Wellness", value=3, plotOutput("WellnessPlot")),
         tabPanel("", value=4, plotOutput("")),
         id="conditionedPanels"
        )
      )
   )
))

Server (Redacted):

server <- shinyServer(function(input, output) {
  # Calling requirements
  require(dplyr)
  require(gridExtra)
  require(grid)
  require(ggplot2)

  grid_arrange_shared_legend <- function(...) {
   ...
  }

  # data calls (this sits inside the server function, so the file is re-read
  # for each new session, picking up any new data)

  wellness<-read.table("../wellness.txt", header=TRUE)

   output$WellnessPlot <- renderPlot({
   
     p1 <- ggplot(data= wellness %>% group_by(Test) %>% filter(Test=="Stress_level"), aes(x=Date, y=Ave_score, group=, colour=)) +
       geom_line() +
       geom_point() +
       theme_bw() +
       labs(y="Stress Level", x="Date") +
       theme(legend.text=element_text(face="bold", size=12)) +
       theme(axis.title = element_text(face="bold", size=12)) +
       theme(axis.text.x = element_text(face="bold", size=10)) +
       theme(axis.text.y = element_text(face="bold", size=10))
   
     p2 <- ggplot(data= wellness %>% group_by(Test) %>% filter(Test=="Perceived_health"), aes(x=Date, y=Ave_score, group=, colour=)) +
       geom_line() +
       geom_point() +
       theme_bw() +
       labs(y="Perceived Health", x="Date") +
       theme(legend.text=element_text(face="bold", size=12)) +
       theme(axis.title = element_text(face="bold", size=12)) +
       theme(axis.text.x = element_text(face="bold", size=10)) +
       theme(axis.text.y = element_text(face="bold", size=10))
   
     p3 <- ggplot(data= wellness %>% group_by(Test) %>% filter(Test=="Energy_level"), aes(x=Date, y=Ave_score, group=, colour=)) +
       geom_line() +
       geom_point() +
       theme_bw() +
       labs(y="Energy Level", x="Date") +
       theme(legend.text=element_text(face="bold", size=12)) +
       theme(axis.title = element_text(face="bold", size=12)) +
       theme(axis.text.x = element_text(face="bold", size=10)) +
       theme(axis.text.y = element_text(face="bold", size=10))
   
     p4 <- ggplot(data= wellness %>% group_by(Test) %>% filter(Test=="SatisfiedwHealth"), aes(x=Date, y=Ave_score, group=, colour=)) +
       geom_line() +
       geom_point() +
       theme_bw() +
       labs(y="% Satisfied w/ health", x="Date") +
       theme(legend.text=element_text(face="bold", size=12)) +
       theme(axis.title = element_text(face="bold", size=12)) +
       theme(axis.text.x = element_text(face="bold", size=10)) +
       theme(axis.text.y = element_text(face="bold", size=10))
   
   if(input$Wellness == 1) {
     p1 <- p1 + geom_line(data=wellness %>% group_by(Test) %>% filter(Test=="Stress_level" & Location==""),size =3)

     p2 <- p2 + geom_line(data=wellness %>% group_by(Test) %>% filter(Test=="Perceived_health" & Location==""),  size =3)

     p3 <- p3 + geom_line(data=wellness %>% group_by(Test) %>% filter(Test=="Energy_level" & Location==""), size =3)

     p4 <- p4 + geom_line(data=wellness %>% group_by(Test) %>% filter(Test=="SatisfiedwHealth" & Location==""),size =3)
     grid_arrange_shared_legend(p1,p2,p3,p4)
   } else if (input$Wellness == 2) {
     ...
   } else if (input$Wellness == 3) {
    ...
   } else if( input$Wellness == 4) {
     ...
   } else if (input$Wellness == 5) {
    ...
   } else {
     ...
   }
   })
})

Wednesday, January 20, 2016

Good News for the Market


The great news for the market is that bioinformatics is going to continue to grow, with a CAGR (compound annual growth rate) of approximately 20.4% through 2020. The bad news is that the bioinformatics market is hindered by a "lack of interoperability among data formats" and a lack of talented, skilled professionals.

So let's first address the "lack of interoperability among data formats". In other words ... as a profession we lack standardization. We write code to fit the test data we are using, and the test data might not be in the same format as your data. So this leaves you having to either reformat the data or rewrite the code, and the cycle repeats itself. If you have been following along with my posts, you already know that a lack of standards drives me (and others) nuts! This is especially infuriating for researchers who are mostly interested in applied bioinformatics, and it is half of the reason why almost every science lab needs a bioinformatician (the other half being that "gone are the days of Excel").

Two things are going to drive this change. First, clinical research and big pharma cannot afford to let this nonsense go on. This is why there is a big boom in commercial bioinformatics software designed to run and re-run the same analyses over and over again (a production pipeline vs. an R&D pipeline). I believe that those of us writing bioinformatics code in the hope of one day going commercial will start writing to a standard that is more appealing to buyers, and a de facto standard will emerge from that. Second, I really think there will eventually be a small group of bioinformatics professionals who will release guidelines to hold the profession accountable. The only reason I can think of as to why this hasn't happened yet is that the market is growing too rapidly for anyone to take a moment to tackle it.

Now to address the lack of skilled and trained professionals. For all intents and purposes, bioinformatics is a data science field, and it competes for data scientists with other industries. As I mentioned in a previous post, I personally know colleagues trained in bioinformatics who have found employment in other fields. As depressing as it is, when you Google "bioinformatics as a profession" you will stumble on at least one rant from someone leaving the profession for another career. It is true that the market is growing, but so is the market for data scientists in general. Bioinformatics programs combine data science training with training in biology (you have to know something about the biology). Glassdoor estimates that the national average salary for a bioinformatics scientist is $85,149; the national average salary for a data scientist is $118,709. Keep in mind that bioinformatics is still tied strongly to academia (where salaries in general are much lower) and that there are certain biotech hubs where salaries for bioinformaticians are much more comparable to those of data scientists. Every business is looking for a data scientist, and the field of data science has also had to define itself and is still defining itself (here is a really fun discussion on YouTube by the Royal Statistical Society). Not only are salaries better outside of bioinformatics, but those jobs are not always restricted to a few hubs. Plus, there is less of an emphasis on degree and publication record for data science positions outside of bioinformatics.

So how can we make bioinformatics more attractive?

1. Pay better. Although this is much harder to achieve given that bioinformatics is still deeply rooted in academia and, sadly, biologists already lose out on earnings among PhDs (and they are the group most often in need of applied bioinformatics).

2. Replace people with tools, where we can. R-bloggers tackled this question for data science back in 2012, and I agree with that take: bioinformatics isn't just technical, it's sociological. There will always be a need for people who understand the principles and methods of bioinformatics to adequately communicate about the data and its findings and to advocate for future approaches.

It will be interesting to see how the market looks in the next 5 years, along with some real data on graduates after they earn degrees in bioinformatics. I would like to think that, just as companies place emphasis on data scientists' ability to save millions of dollars, the bioinformatics industry will place emphasis on bioinformaticians' ability to improve quality of life. Or, more likely, health insurance companies will place emphasis on bioinformaticians saving them money with pre-diagnostics...


Monday, January 11, 2016

Blog Description - I think? At least for now? Like most of my life, it is an ever evolving pipeline.

I feel the need to sort of "define" this blog. Everyone seems to have a blog today, so what makes this one different from any other? I probably should have started with this post, but, despite my overwhelming need to plan everything out first, I still have a tendency to jump into things and figure them out later. So ...

This blog will probably never be incredibly technical. I will talk about pipelines, software, general cool tools I get excited about, etc., but you probably won't see any code of mine up here. I feel as though there are plenty of resources already available and, if I do something technical, I'll post it as a Vimeo video (I have a few in the making, but time is one resource you cannot replenish and always need more of). But I do see this blog developing into a resource for soft skills. And soft skills can be just as important as technical skills for career development! Now don't freak out if you are slightly socially inept (or very inept). There are plenty of careers for those types of people too. But in order to explain your work to an audience, soft skills are important. And a certain amount of soft skills is necessary for a general working environment with other people (specifically gives a knowing look to the network guys).

One thing that I will share on this blog is bits and pieces of conversations with my husband (Ace). He is a bit of an economics/statistics wizard! Seriously. And a lot of what we both do at our jobs overlaps. Data mining, analytics, databases and reporting are comparable no matter what data you start with. In fact, it isn't uncommon to find someone with training in bioinformatics working in another field such as banking. Why? To be blunt, these individuals have the same skill sets businesses are looking for, they make more money and there are more opportunities for career growth. Personally, I am passionate about the science. I get a high from working in biotech because I see it as a much-needed service for human advancement. Ace gets a high from saving companies millions of dollars. I get a high from helping with the research and development of biologics that will save millions of lives. To each their own.

But since Ace and I have similar skills and different areas of expertise, we often look at data from different angles. This is super cool because we can pick each other's brains, but the trick is that we have to explain our data in terms the other can understand. This is a great exercise for both of us because it helps us work on explaining complex concepts without all of the jargon. It can be fairly difficult for me because it also requires an explanation of the science behind the data, and that adds to the complexity.

It really doesn't matter what job you have; at some point you will need to be able to adequately explain your work to your boss, a client, etc. Given different backgrounds and expertise, this can be challenging. Data scientists, IT staff, bioinformaticians, etc. are usually considered black boxes. Being a black box means that other people at work have no idea what you do, but they know it is important. This can be a good thing because it usually means no one will ask you what you are doing and you sort of get a "free" pass on some things (meaning you are generally left alone to do your job, which requires the ability to "tinker" with your computer all day). BUT a black box also comes with frustration. No one understands why you don't have results yet, why it is taking so long to fix a network issue, or why you cannot just use the same pipeline someone else published with. So, I try not to be a black box, and explaining things without the jargon to Ace over dinner helps me hone the soft skills necessary to explain myself to the people I work with.



This shirt was a vendor handout at the 2015 ACTG conference and plays on the idea of bioinformatics as a black box. I challenge my soft skills to match my technical skills so that this shirt becomes just a joke and not the general perception of my profession.

Friday, January 8, 2016

Another Paper on Reproducible Research

Here is yet another paper on our lack of reproducibility. OK, we hopefully now "get it" and understand that we need to do a better job of reporting and publishing our protocols, including bioinformatics pipelines. Can we please start publishing guidelines and standards to adhere to? And then get journals to enforce that those practices are followed?

So today I'm going to give three suggestions on how to better capture your pipeline using my own experiences.


Cartoon by Sidney Harris (The New Yorker)


1. Flow Charts
I am often expressing my love and appreciation for flow charts. I love visuals. I love making lists. And I super love turning a list into a visual - TaDa - a flow chart. Look, it just helps me stay on track, and clients appreciate having something to look at while I verbally walk them through the pipeline I am using on their data. Creating a flow chart can take some time (depending on how OCD you are and how much pride you take in your work before handing something to a client), but I truly believe that the impact on your work and your client relationship is worth the time you put into it.



2. Version Control
Oh yes, version control is all the rage. Look, it is no secret that changes in versions can give different results. Version releases are important. Releases clean up mistakes, decrease compute time, help maintain relevancy as technology changes, etc. But even a small release that changes how a number is rounded can result in huge changes in outcome. So write down your versions and MAKE NOTES when you upgrade them! This is incredibly important for someone trying to replicate your results.
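
In R, the laziest way I know to write versions down is a sketch like this (the file name is mine, not a standard; adapt as you like):

# Dump the R version and every attached package version alongside the results
writeLines(capture.output(sessionInfo()), "sessionInfo_2016-01-08.txt")

# Quick check of a single package before/after an upgrade
packageVersion("ggplot2")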

3. DeBugging
(Scene set): You have identified a new, amazing bioinformatics tool. People are blowing up Twitter about it, Bioinformatics has published on it, etc. And you have a light bulb moment: Ah ... ha, maybe I can use this tool for my research, since my data are somewhat similar to what other people are using this tool for! So you start out making a new FLOW CHART to include this new tool. It should be easy, but all of a sudden you keep getting errors that you can't figure out (so you post on Reddit). You go to the documentation; nothing helps. You search all of the forums, but you either hit a dead end or find out that everyone else is having issues too. You email the tool's contact only to find out that they have since been hired by SAS and just don't have the time to help you.


You are now frustrated (probably because someone has you on a timeline or wrongly thinks you aren't working since there are "no results"), and you just know that if you could get this tool to work the result will be INCREDIBLE (I don't know why, but I always think IT WILL BE INCREDIBLE). So you start debugging by trying everything you can possibly think could be wrong. This is where we generally shut down on documenting the process. If you are like me, you get some sort of sick kick out of debugging that is OBSESSIVE IN NATURE. You can't stop to write down what you just tried because your mind is already racing to the next trial. AND when you FINALLY get that thing to work (because you probably won't stop until you get it to bend to your every whim), you have the BEST HIGH EVER. And if you are anything like me, you'll start high-fiving yourself and anyone who is around you ... and you will immediately forget the details of the last several hours. BUMMER. But you don't care because you are still riding that high.


I know this about myself. I accept it. But I still need to document that process so that I don't have to repeat it (because several months later I might have to). PRO TIP: SCREEN RECORD THE PROCESS. I also like to include audio because it makes the videos more fun for me to watch later. Seriously, get some friends over later, have some drinks and put that video on. If you have geek friends or have an amusing narrative style, it will be the highlight of the party.





Thursday, January 7, 2016

Command Line - for Science

My morning routine usually includes my ONE cup of coffee (I'm incredibly proud of the fact that I can get through an entire event-filled day with just ONE cup of coffee. This is a huge decrease from where I started!) and a look through the news. Today I went out of the norm and headed over to Twitter to start the day. I typically do not like Twitter in the morning because the majority of Twitter activity happens in the afternoon. (See this great brief on how to post to social media!) However, today was different, and I'm glad I started with Twitter because I might otherwise have missed this tweet that took me to a Kickstarter project for learning the command line - for science.

I posted some tips for novice bioinformaticians a month ago, and I mentioned how important it is to learn command line tools. There is a lot of emphasis on learning scripting (Perl) and/or programming (Java) languages, and there are tons of resources available to help you. I think that is great! I also think it is great to learn the command line ... and I mean learning more commands than how to change directories or create/remove directories. If your goal is to use bioinformatics tools and you are not interested in creating them, I would focus on learning Bash, a command line text-processing language such as AWK or SED, and R. Other bioinformatics tools are already available to you, and the three suggestions above will help you execute pretty much every pipeline out there.

Bash:
Bash is the default shell on both Linux and Mac. When you call 'cd' to change a directory, you are using Bash. There are LOTS of sweet commands that make Bash really useful for bioinformatics, and I encourage you to check out the tutorials already available to learn more.

Some of my favorite commands are 'find' and the loop constructs 'for', 'while' and 'until'. 'find' is great for finding files (genius, I know). This command is really helpful on servers with multiple users (because people tend to move files around and then forget where they put them) and for doing things with all of the files in a directory and its child directories (such as moving all of the raw reads scattered around a project folder onto another volume). Loops are really nice for bioinformatics tools that do not accept multiple inputs (mostly newer tools that haven't been through many version changes). Instead of running the command over and over again for each of your input files (e.g., if you have 2+ samples to process), you can write a loop once to call the command and feed it a new input on each iteration, as in the sketch below.
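
Since the rest of the code on this blog is R, here is the same loop-over-inputs pattern sketched in R with system2() rather than as a Bash for loop (the tool name and file layout are made up for illustration; in Bash this would be a one-line for loop):

# Run a (hypothetical) tool once per FASTQ file found in raw_reads/
fastqs <- list.files("raw_reads", pattern = "\\.fastq$", full.names = TRUE)
for (f in fastqs) {
  system2("some_tool", args = c("--input", f, "--output", paste0(f, ".out")))
}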

Command Line Language:
I use AWK all of the time. Most of the files that are output by bioinformatics tools, or that you would download from a public repository, are in a spreadsheet-like format - meaning they have rows and columns. AWK is a great manipulator for data in this format, and it is faster than writing a Perl script to read in a file line by line and blahblahblah. You get my point.

Now I feel the need to address 'manipulator for data', since out of context this phrase could sound like I am changing data. I am absolutely not changing any data; I'm only changing things like column order or how a value is represented. For example: a chromosome column may denote a chromosome as either 'chr1' or '1'. Both mean chromosome 1, but 'chr1' is a string while '1' reads as an integer. Data tools hate it when you mix representations, which is why a column sometimes needs to be manipulated so that every time a chromosome is denoted, it is denoted the same way.

Bioinformatics pipelines change so often as new tools are created that the output of one tool may not be formatted correctly as input for the next (e.g., what sits in column 1 of the output needs to be column 4 in the next tool's input). With AWK you can fix all of these things with a quick script that executes fast.
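
To keep this blog's code in one language, here is the same kind of clean-up sketched in R rather than AWK (AWK would do it in a one-liner, which is the whole point above; the file and column names are made up for illustration):

# Hypothetical tab-delimited variant table with columns chrom, pos, ref, alt
variants <- read.table("calls.txt", header = TRUE, sep = "\t",
                       stringsAsFactors = FALSE)
variants$chrom <- sub("^chr", "", variants$chrom)          # 'chr1' -> '1', one consistent form
variants <- variants[, c("chrom", "pos", "ref", "alt")]    # reorder columns for the next tool
write.table(variants, "calls_reformatted.txt",
            sep = "\t", quote = FALSE, row.names = FALSE)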

SED is fairly comparable to AWK, I just happen to use AWK most of the time.

R - or (said with a pirate accent) AARRRR
R is an incredible environment for doing statistics and creating really nice graphics and visuals with your data. PLUS, there has been a real effort to make its use REPRODUCIBLE (insert angels singing). I'm just going to stop there because there are lots of dedicated R users who will tell you of the glories of R. Just know it is definitely worth your while to learn!

So to summarize: if you just want to use bioinformatics tools, you will benefit heavily from the command line, a command line language such as AWK, and a great statistical environment with graphics capabilities like R. Every other step in a pipeline will be filled by a tool that is pretty much plug and play, with different parameters (or arguments) that you should become fairly familiar with.



Disclaimer: The above post makes bioinformatics sound "easy". I am sorry to say that it probably won't be "easy". Any decent data scientist will tell you that data anything just isn't as easy as we tend to describe it. Data are messy and data tools can be a mess too. Cleaning up messes is really frustrating. So if you feel like you are wrangling cats, then you are probably doing it right.


Thursday, December 10, 2015

Reproducibility is Key - suggestions for WGS from ICGC

Nature Communications published an article from the International Cancer Genome Consortium (ICGC) on benchmarking somatic variant calling. The gist of the article concerns issues of reproducibility. You probably don't need me to say this, but ...

REPRODUCIBILITY IS KEY.

Reproducibility is key to data science in any field (banks would be extremely upset if your pipeline couldn't return the same results on the same set of data). And reproducibility is a key element of the scientific method.

Yet, we have a little problem with reproducibility at the moment in the field. First, our pipelines are all different.

"Calling mutations with different pipelines on differently prepared sequence read sets resulted in a low level of consensus."

Different pipelines will give you different results. It surprises me that we have to publish studies to show that this is an issue. I mean, bioinformatics tools are written and distributed precisely because they are different from previous tools, so it sort of makes perfect sense that your results would be different. Different aligners will give you different alignments. Different variant callers may call different variants if they use different statistical approaches, and they are also dependent on their input.

Second, the way we sequence to generate our data is not consistent.

"Using a standard pipeline had the potential of improving on this but still suffered from inadequate controls for library preparation and sequencing artefacts."

Well ... even if we standardize the bioinformatics pipeline, variation at the sequencing center will still diminish reproducibility (if you have a good sequencing center, I recommend you stay with it).

So, what can we do? ICGC gives their recommendations for whole genome sequencing studies:

  • PCR-free library preparation
  • Tumour coverage >100 ×
  • Control coverage close to tumour coverage (±10%)
  • Reference genome hs37d5 (with decoy sequences) or GRCh38 (untested)
  • Optimize aligner/variant caller combination
  • Combine several mutation callers
  • Allow mutations in or near repeats (regions of the genome likely to be more prone to mutation)
  • Filter by mapping quality, strand bias, positional bias, presence of soft-clipping to minimize mapping artefacts
AND

They suggest calibrating bioinformatics tools to known results and developing additional benchmarks.

Some key points I absolutely love: 1) Good experimental design matters for any study. Making sure that you have your controls and that your sequencing center is transparent about its standards means you can have confidence in the data going into your pipelines. Plus, there is this thing called statistics that really puts the pressure on you to have the right number and types of replicates. 2) I am also in favor of optimizing aligner/variant caller combinations. I would love to see effort put into optimizing pre-existing tools rather than into developing yet more new ones (I really don't see how people keep up with all of the new tools). 3) Benchmarking sounds really great!

What we still don't discuss is how we set "gold standards". Now, I realize that standards will change depending on the experiment and the model system, but who will develop each standard and who will maintain those standards? Several entities have already developed standards. I happen to like The ENCODE Consortium's guidelines for NGS experiments and recommend them to PIs when consulting prior to experimentation. The American College of Medical Genetics published their own standards for variant calling this May that I also like to use. But is letting each field establish its own sort of "gold standard" the best way to approach this issue? I'm not entirely sure what the right answer is. I do think there is a need for some entity in the field to publish and maintain a guideline. Some people may argue that we do this already through previous publications in the field. I'm sorry, but I just don't see how designing an experiment because "everyone else is doing it that way" is the best approach to setting guidelines. And the work isn't just about creating guidelines; bioinformaticians need to be aware of these guidelines prior to analysis, and they need to make following them a priority.

We also do not discuss how important bioinformaticians are to these studies. This paper demonstrates that the BIOINFORMATICS WORK REQUIRES A LOT MORE THAN PLUG-AND-PLAY PIPELINES. Filtering, optimizing and benchmarking take expertise. Reproducibility will also require bioinformaticians to commit to version control and to publishing adequate methods. This comes down to good training and to standards that bioinformaticians commit to as a profession. I hope to see publications like this make a serious impact on both future science and bioinformatics as a profession.

FYI, if you don't want to read the nitty-gritty of the Nature Communications publication, GenomeWeb did a good job summarizing the paper. Enjoy!