Thursday, December 10, 2015

Reproducibility is Key - suggestions for WGS from ICGC

Nature Communications published an article from the International Cancer Genome Consortium (ICGC) on characterizing somatic variant calling. The gist of the article is the problem of reproducibility. You probably don't need me to say this, but ...

REPRODUCIBILITY IS KEY.

Reproducibility is key to data science in any field (banks would be extremely upset if your pipeline couldn't return the same results with the same set of data). And reproducibility is a key element of the scientific method.

Yet, we have a little problem with reproducibility at the moment in the field. First, our pipelines are all different.

"Calling mutations with different pipelines on differently prepared sequence read sets resulted in a low level of consensus."

Different pipelines will give you different results. It surprises me that we have to publish studies to show that this is an issue. I mean, bioinformatics tools are being written and distributed because they are different from previous tools. So it makes perfect sense that your results would be different. Different aligners will give you different alignments. Different variant callers may call different variants because they use different statistical approaches, and their calls also depend on their input.

Second, the way we sequence to generate our data is not consistent.

"Using a standard pipeline had the potential of improving on this but still suffered from inadequate controls for library preparation and sequencing artefacts."

Well... Even if we standardize the bioinformatics pipeline, variation between sequencing centers will still diminish reproducibility (if you have a good sequencing center, I recommend you stay with that center).

So, what can we do? ICGC gives their recommendations for whole genome sequencing studies:

  • PCR-free library preparation
  • Tumour coverage >100 ×
  • Control coverage close to tumour coverage (±10%)
  • Reference genome hs37d5 (with decoy sequences) or GRCh38 (untested)
  • Optimize aligner/variant caller combination
  • Combine several mutation callers
  • Allow mutations in or near repeats (regions of the genome likely to be more prone to mutation)
  • Filter by mapping quality, strand bias, positional bias, presence of soft-clipping to minimize mapping artefacts
AND

They suggest calibrating bioinformatics tools to known results and developing additional benchmarks.
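
Two of the recommendations above are easy to put into code: combining several mutation callers, and keeping control coverage within ±10% of tumour coverage. What follows is only a minimal sketch, not the ICGC's method. The function names, the caller names, the variants and the "at least 2 callers" rule are all made up for illustration.

```python
from collections import Counter
from typing import Dict, Set, Tuple

# A variant is keyed as (chromosome, 1-based position, ref allele, alt allele).
Variant = Tuple[str, int, str, str]


def consensus_calls(call_sets: Dict[str, Set[Variant]], min_callers: int = 2) -> Set[Variant]:
    """Keep variants reported by at least `min_callers` of the callers."""
    counts = Counter(v for calls in call_sets.values() for v in calls)
    return {v for v, n in counts.items() if n >= min_callers}


def coverage_matched(tumour_cov: float, control_cov: float, tolerance: float = 0.10) -> bool:
    """True if control coverage is within +/- `tolerance` of tumour coverage."""
    return abs(control_cov - tumour_cov) <= tolerance * tumour_cov


# Hypothetical call sets from three unnamed callers (illustrative data only).
calls = {
    "caller_a": {("1", 1000, "A", "T"), ("2", 500, "G", "C")},
    "caller_b": {("1", 1000, "A", "T"), ("3", 42, "C", "G")},
    "caller_c": {("1", 1000, "A", "T"), ("2", 500, "G", "C")},
}

print(sorted(consensus_calls(calls)))      # variants seen by 2 or more callers
print(coverage_matched(110, 100))          # control within 10% of tumour?
```

A simple vote like this trades sensitivity for agreement, so the threshold is something to calibrate against a benchmark rather than pick by feel.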

Some key points I absolutely love: 1) Designing a better experiment is important to any experiment. Making sure that you have your controls and that your sequencing center is being transparent with their standards means you can have confidence in your data going into your pipelines. Plus, there is this thing about statistics that really puts the pressure on you to make sure you have the right number and types of replicates. 2) I am also in favor of optimizing aligner/variant caller combinations. I would love to see efforts made on optimizing pre-existing tools over the development of new tools (I really don't see how people can keep up with all of the new tools). 3) Benchmarking sounds really great!

What we still don't discuss is how to set "gold standards". Now I realize that standards will change depending on the experiment and the model system, but who will develop each standard and who will maintain those standards? Several entities have already developed standards. I happen to like The ENCODE Consortium's guidelines for NGS experiments and recommend these guidelines to PIs when consulting prior to experimentation. The American College of Medical Genetics published their own standards for variant calling this May that I also like to use. But is letting each field establish their own sort of "gold standards" the best way to approach this issue? I'm not entirely sure what the right answer is. I do think there is a need for some entity in the field to publish and maintain a guideline. Some people may argue that we do this already through previous publications in the field. I'm sorry, but I just don't see how designing an experiment because "everyone else is doing it that way" is the best approach to setting guidelines. And the work isn't just about creating guidelines. Bioinformaticians need to be aware of these guidelines prior to analysis and they need to make following them a priority.

We also do not discuss how important Bioinformaticians are to these studies. This paper demonstrates that the BIOINFORMATICS WORK REQUIRES A LOT MORE THAN PLUG AND PLAY PIPELINES. Filtering, optimizing and benchmarking will take expertise. Reproducibility will also require Bioinformaticians to be committed to version control and to publishing adequate methods. This comes down to good training and to standards that Bioinformaticians need to commit to in their profession. I hope to see publications like this make a serious impact on both future science and bioinformatics as a profession.
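
One hedged illustration of what "adequate methods" can look like in practice is a small run manifest saved next to the results. It records checksums of the inputs, the parameters used and the tool versions. The function names, the manifest fields and the idea of passing tool versions in by hand are assumptions of this sketch, not an established standard.

```python
import hashlib
import json
import platform
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Checksum a file in chunks so large inputs do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(inputs, parameters, tool_versions):
    """Collect what is needed to describe a run: inputs, settings and versions."""
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "inputs": {str(p): sha256_of(Path(p)) for p in inputs},
        "parameters": parameters,
        # Supplied by the caller; this sketch does not try to detect versions.
        "tool_versions": tool_versions,
    }


def write_manifest(manifest, out_path):
    """Write the manifest as sorted, indented JSON so diffs stay readable."""
    Path(out_path).write_text(json.dumps(manifest, indent=2, sort_keys=True) + "\n")
```

Committing a file like this to version control alongside the scripts makes it much easier to write the methods section later, and to check whether two runs really used the same data.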

FYI, if you don't want to read the nitty-gritty of the Nature Communications publication, Genomeweb did a good job summarizing the paper. Enjoy!

Friday, December 4, 2015

Bioinformatics, what is that? (Everyone starts here)

Oh boy, describing what I do for a living is one of my favorite parts of an introductory conversation. And pretty much every day I have to tell someone else what I do for work now that my husband and I have moved to Dallas, TX (I could count the number of people I knew in Dallas prior to moving on one hand). This whole exchange used to cause me a little bit of anxiety. Especially if my husband were around. A conversation would go:

Me: I'm a bioinformatics scientist and I process and analyze biological data using computers.
Newbie: What?
Husband (butting in): She makes woolly mammoths and turns chickens into dinosaurs. No, I am serious! There are these crazy scientists who want to make a real Jurassic Park.

*It is really hard to backtrack at this point to not make yourself look like a mad scientist. So you have two options: 1) change the subject or 2) just go with it and proceed to tell the Newbie how you are also working on making elephants miniature so that you can sell them as pets.

So I don't actually like going around and making up (but sort of true) stories about my job. It makes for some great bar conversations, but at some point people want to know what you actually do with your life (my grandparents in particular). I am going to share what my go-to response is for someone like my 86-year-old grandpappy (keep in mind that my pappy quit school in 9th grade - he went on to start his own successful business and continues to work to this day *my hero*):

_____________
I am a bioinformatics scientist and I process and analyze biological data using computers. Whenever you go to build something, you usually follow some sort of a blueprint or design. DNA is the blueprint or design for all living things. If you are building a house, the blueprint directs where the building materials, such as wood and cement, are supposed to go and how they are supposed to be arranged. Amino acids/proteins are the building materials for living things. If you are building a person, the DNA for that person would direct where the amino acids/proteins need to go and in what order so that a person is the end product. There are different blueprints for different houses in your neighborhood. Every person has slightly different DNA, much like the blueprints in your neighborhood differ slightly from one another. A blueprint for a bridge is very different from a blueprint for a house. The same thing applies to DNA for a plant and DNA for a person. Everything is still written on blue and white paper, but the directions are different.

Small changes in the blueprint can sometimes result in barely noticeable changes in the look of the house such as moving a window over by an inch. But sometimes small changes result in big changes such as moving a grade beam a few inches so it no longer supports a wall in your house. Changes like this could make the wall immediately fall down or could gradually deteriorate the structure of your house with time. This is similar to how some diseases occur: small changes to the DNA can have very little impact on a person or could cause disease. So how do you know if a small change will have a little or big impact? If we were still talking about a house, your builder would look over the blueprint with some prior knowledge of how houses should be built and find the error. DNA blueprints are much, much bigger. Imagine trying to find a change like this in all of the blueprints of houses in Dallas. There are over 450,000 houses in Dallas. This would take the builder a REALLY LONG TIME. And it is possible the builder may get so exhausted looking at all of those blueprints that they miss a small change, such as a single grade beam that is off by a few inches. We humans are not very good at repetitive analyses like that, but computers are GREAT for things like this.

So I use my computer savvy to repeatedly look at all of the blueprints for houses in Dallas to find a small change. Or rather to look at a person's blueprint to find small changes in their DNA. I can of course do this for many, many people and find patterns that are shared by certain groups of people such as those afflicted with a certain disease.
_____________

This usually satisfies people who are really interested in what I do for a living. You should of course keep engaging with the person as you go, and wrap up the conversation quickly the moment they glaze over and you determine a recovery cannot be attained. Also, be prepared to remind the person that you are not a medical doctor and please refer them to their primary care physician the moment they start to describe their ailments. I study disease, but I do not diagnose diseases.

It is both cliche and appropriate that my first post would be this post.

Advice for Young Bioinformaticians

If you browse the Internet you will find a lot of advice and good practices for bioinformaticians. Yet, I still see students struggle who either are unaware of all of this FREE advice or who just like to make life hard for themselves (some people live for the struggle). I am going to assume these students are not masochists and are instead just unaware of the good advice of their colleagues. So, here is another list of the good practices that I find most valuable in my everyday life.

1. No One Likes A Quitter

Don't quit because you are frustrated. In any field managing and analyzing data is frustrating. Data are like messy little toddlers that you have to wrangle with. Just like every parent goes through it, so do all bioinformaticians. It is just a fact of life. Go take a break (or go break something) and then come back and handle your situation. It is OK to change your approach. If something does not work, go find something that will get the job done.

2. Take Notes

Take the time to write down notes. Write some notes on your installs, on your pipelines... everything. I am 100% certain that you will have to repeat the same thing again and you will want those darn notes so that you don't have to beat your head against a wall for the second time.
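
Notes are easier to keep when the computer writes some of them for you. As a minimal sketch, the wrapper below runs a command and appends what was run, when, and whether it succeeded to a plain-text notes file. The function name `run_and_note` and the default file name `NOTES.log` are made up for illustration.

```python
import shlex
import subprocess
from datetime import datetime, timezone
from pathlib import Path


def run_and_note(command, notes_file="NOTES.log"):
    """Run `command` (a list of arguments) and append a one-line record of it."""
    started = datetime.now(timezone.utc).isoformat(timespec="seconds")
    result = subprocess.run(command, capture_output=True, text=True)
    # One tab-separated line per command: timestamp, exit code, full command line.
    line = f"{started}\texit={result.returncode}\t{shlex.join(command)}\n"
    with Path(notes_file).open("a") as handle:
        handle.write(line)
    return result
```

This does not replace notes on why you did something, but it does mean the exact command line is there when you need to repeat an install or a pipeline step.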

3. Google, Search, Ask

You spend your life in front of a computer and your most valuable resource is your browser. There are 3 steps you should always do in this particular order until you find an answer: 1) Google it, 2) search a forum for it (Seqanswers, Biostar, etc.) and 3) ask the Internet for the answers you seek! And don't be ashamed that you are on Google searching for something. Here is a news flash: we all search for things. Our field is ever growing and ever changing, and no one is so knowledgeable that they never need to Google for an answer.

4. Organize

Organization will save you time. I am a huge lover of flowcharts. I have flowcharts hanging up on my wall because I love them so much. Before I do anything, I make two flowcharts. My first flowchart is going to include my pipeline (this includes testing/checks to make sure things are running correctly and reporting). My second flowchart includes how I plan on structuring my directory. NEVER put all of your files in a single directory dump. Learn to create child directories and store things in a neat sort of manner.
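
The second flowchart, the directory plan, can be turned into a few lines of code so every project starts with the same structure. This is a sketch under assumptions: the folder names in `LAYOUT` are hypothetical examples, not a recommended standard, so swap in whatever your own flowchart says.

```python
from pathlib import Path

# Hypothetical layout; rename the folders to match your own conventions.
LAYOUT = [
    "data/raw",
    "data/processed",
    "scripts",
    "results/figures",
    "results/tables",
    "logs",
    "docs",
]


def make_project(root, layout=LAYOUT):
    """Create the child directories under `root` and return the paths created."""
    root = Path(root)
    created = []
    for sub in layout:
        path = root / sub
        # parents=True builds intermediate folders; exist_ok=True makes reruns safe.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling `make_project("my_new_project")` twice is harmless, which makes it safe to run at the top of a pipeline.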

5. Hold Yourself to Standards

Be consistent. One of the biggest complaints about the field is the lack of standardization and consistency. You will one day be incredibly frustrated by this too ... and then you will look back at your own work and realize you are just as bad about your own standards. Develop your standards and base them on what the majority of your colleagues are doing in the field. Then stick to it!

6. Learn About Your Tools

Learn something about the tools you are using. Try to learn enough about your tool to answer the following: 1) Why am I using this tool over another? (and "because everyone else is publishing with this tool" should not be your only answer) and 2) If something breaks or if I have data that are not structured as the correct input, can I debug this issue? Also, if you are using proprietary software (nothing wrong with that), please learn about its toolkit. A lot of the tools used by these software packages are built from freeware versions and repackaged in a much more user-friendly GUI. So if you plan on using two or more software packages to validate each other, you may be stacking your deck unfairly if you do not know what freeware versions your proprietary software is built from.

7. Please, No More Redundant Tools

Do not reinvent the wheel. If there is a tool out there that you can use, take the time to learn how to use it instead of writing another one. We really don't need to be overwhelmed with tools. Seriously, how many alignment algorithms do we need? 10? 50? 200?

8. Be Relevant

Be relevant by continuing your education. You have chosen a field where the technology changes quickly. You too should be able to adapt to those changes and stay relevant.

9. Take Time to Think About How to Communicate

Most likely you are reporting your analyses and results back to someone else. Take a bit of time to think about how to best explain and describe what you are doing. And remember who your audience is so that you can communicate to their understanding.

10. Keep It Fun

Remember, your job is fun! I strongly believe that the most successful careers are the ones that you ENJOY and I hope you got into this field because you like data, computers and biology. There is no shame in taking some time to give yourself an air high-five.

Your BLAST completed after a week of anticipation - high five!
You changed all 'chr1' to '1' before making that rookie mistake (it gets us all) - high five! 
You gave a great explanation of mapping (my husband thinks all I do is map things) - high five!
You just made the most complicated looking data into a great visual - high stinking five! 

Relish all the victories, no matter how small.
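
For anyone who has not yet earned the 'chr1' to '1' high five, here is a minimal sketch of that conversion for tab-delimited files whose first column is the chromosome name. It is deliberately conservative and only renames chromosomes 1-22, X, Y and M. Mapping 'chrM' to 'MT' is an assumption about the target naming scheme, so check what your reference actually uses. The function names are made up for illustration.

```python
# Only primary chromosomes are renamed; other contigs are left untouched.
PRIMARY = {str(i) for i in range(1, 23)} | {"X", "Y"}


def strip_chr_prefix(name: str) -> str:
    """Convert 'chr1' -> '1', 'chrX' -> 'X', 'chrM' -> 'MT'; leave other names alone."""
    if name == "chrM":
        return "MT"  # assumption: the target reference names the mitochondrion 'MT'
    if name.startswith("chr") and name[3:] in PRIMARY:
        return name[3:]
    return name


def convert_line(line: str) -> str:
    """Rewrite the first (chromosome) column of one tab-delimited line."""
    if not line.strip() or line.startswith("#"):
        return line  # keep blank lines and comment/header lines as they are
    ending = "\n" if line.endswith("\n") else ""
    fields = line.rstrip("\n").split("\t")
    fields[0] = strip_chr_prefix(fields[0])
    return "\t".join(fields) + ending


print(convert_line("chr1\t100\t200\n"), end="")
```

Whichever direction you convert in, do it once, up front, for every file in the analysis, so that the aligner, the variant caller and the annotation files all agree.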