REPRODUCIBILITY IS KEY.
Reproducibility is key to data science in any field (banks would be extremely upset if your pipeline couldn't return the same results from the same set of data). It is also a key element of the scientific method.
Yet we have a bit of a problem with reproducibility in the field at the moment. First, our pipelines are all different.
"Calling mutations with different pipelines on differently prepared sequence read sets resulted in a low level of consensus."
Different pipelines will give you different results. It surprises me that we have to publish studies to show that this is an issue. I mean, bioinformatics tools are written and distributed precisely because they differ from previous tools, so it sort of makes perfect sense that your results would differ too. Different aligners will give you different alignments. Different variant callers may call different variants because they use different statistical approaches and because they depend on the alignments they are given as input.
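To make that disagreement concrete, here is a minimal sketch in plain Python that loads the variants two callers report on the same sample and measures how well they agree. The file names are hypothetical and the parser assumes simple plain-text VCFs; a low Jaccard index is exactly the "low level of consensus" the quote below describes.

```python
# Minimal sketch of caller disagreement: load each caller's reported
# variants and measure the overlap. File names are hypothetical and the
# parser assumes plain-text, single-sample VCFs.

def load_variants(vcf_path):
    """Return the set of (chrom, pos, ref, alt) tuples in a VCF file."""
    variants = set()
    with open(vcf_path) as handle:
        for line in handle:
            if line.startswith("#"):          # skip header lines
                continue
            fields = line.rstrip("\n").split("\t")
            chrom, pos, ref, alt = fields[0], fields[1], fields[3], fields[4]
            for allele in alt.split(","):     # count multi-allelic sites separately
                variants.add((chrom, int(pos), ref, allele))
    return variants

calls_a = load_variants("caller_a.vcf")       # hypothetical output of caller A
calls_b = load_variants("caller_b.vcf")       # hypothetical output of caller B

shared = calls_a & calls_b
jaccard = len(shared) / len(calls_a | calls_b)
print(f"A: {len(calls_a)}  B: {len(calls_b)}  shared: {len(shared)}  "
      f"Jaccard: {jaccard:.2f}")
```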
Second, the way we sequence to generate our data is not consistent.
"Using a standard pipeline had the potential of improving on this but still suffered from inadequate controls for library preparation and sequencing artefacts."
Well... even if we standardize the bioinformatics pipeline, variation between sequencing centers will still diminish reproducibility (if you have a good sequencing center, I recommend you stay with that center).
So, what can we do? The ICGC gives its recommendations for whole-genome sequencing studies:
- PCR-free library preparation
- Tumour coverage >100×
- Control coverage close to tumour coverage (±10%)
- Reference genome hs37d5 (with decoy sequences) or GRCh38 (untested)
- Optimize aligner/variant caller combination
- Combine several mutation callers (a sketch of this and the filtering step below follows these recommendations)
- Allow mutations in or near repeats (regions of the genome likely to be more prone to mutation)
- Filter by mapping quality, strand bias, positional bias, and presence of soft-clipping to minimize mapping artefacts
AND
They suggest calibrating bioinformatics tools to known results and developing additional benchmarks.
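Two of those recommendations, combining several mutation callers and filtering on mapping quality, are easy to picture in code. Here is a hedged sketch in plain Python: the caller names, file paths, the MQ INFO key, and the threshold are all assumptions for illustration, not ICGC prescriptions.

```python
# Hedged sketch of two recommendations: combine several mutation callers
# by majority vote, then filter on mapping quality. Caller names, paths,
# the MQ key, and the threshold are assumptions, not ICGC prescriptions.
from collections import Counter

def load_calls(vcf_path):
    """Map (chrom, pos, ref, alt) -> INFO string for one caller's VCF."""
    calls = {}
    with open(vcf_path) as handle:
        for line in handle:
            if line.startswith("#"):
                continue
            f = line.rstrip("\n").split("\t")
            calls[(f[0], int(f[1]), f[3], f[4])] = f[7]  # f[7] is INFO
    return calls

def info_value(info, key):
    """Pull a numeric value like MQ=57.3 out of a VCF INFO string."""
    for entry in info.split(";"):
        if entry.startswith(key + "="):
            return float(entry.split("=", 1)[1])
    return None

callers = [load_calls(p) for p in
           ("mutect.vcf", "strelka.vcf", "varscan.vcf")]  # hypothetical paths

# Majority vote: keep sites reported by at least 2 of the 3 callers.
votes = Counter(site for calls in callers for site in calls)
consensus = {site for site, n in votes.items() if n >= 2}

# Mapping-quality filter; keep sites where MQ is missing or above threshold.
MIN_MQ = 40.0  # assumed threshold
passing = set()
for site in consensus:
    info = next(calls[site] for calls in callers if site in calls)
    mq = info_value(info, "MQ")
    if mq is None or mq >= MIN_MQ:
        passing.add(site)

print(f"{len(consensus)} consensus sites, {len(passing)} pass MQ filter")
```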
Some key points I absolutely love:
1) Designing a better experiment matters for any study. Making sure you have your controls and that your sequencing center is transparent about its standards means you can have confidence in the data going into your pipelines. Plus, statistics puts real pressure on you to have the right number and types of replicates.
2) I am also in favor of optimizing aligner/variant caller combinations. I would love to see effort put into optimizing pre-existing tools rather than into developing new ones (I really don't see how people can keep up with all of the new tools).
3) Benchmarking sounds really great!
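Since I just said benchmarking sounds great, here is what the simplest version of it looks like: compare a pipeline's calls against a known truth set and report precision and recall. The file names are hypothetical (I re-define the little load_variants() helper from the earlier sketch so this block stands alone); a real benchmark against something like a Genome in a Bottle-style truth set would also need to handle representation differences and confident regions.

```python
# Minimal benchmarking sketch: precision/recall of pipeline calls against
# a known truth set. File names are hypothetical.

def load_variants(vcf_path):
    """Return the set of (chrom, pos, ref, alt) records in a VCF file."""
    with open(vcf_path) as handle:
        return {(f[0], int(f[1]), f[3], f[4])
                for f in (line.rstrip("\n").split("\t")
                          for line in handle if not line.startswith("#"))}

truth = load_variants("truth_set.vcf")        # the "gold standard" calls
calls = load_variants("pipeline_calls.vcf")   # the pipeline under evaluation

true_positives = len(calls & truth)
precision = true_positives / len(calls) if calls else 0.0
recall = true_positives / len(truth) if truth else 0.0
f1 = (2 * precision * recall / (precision + recall)
      if (precision + recall) else 0.0)

print(f"precision: {precision:.3f}  recall: {recall:.3f}  F1: {f1:.3f}")
```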
What we still don't discuss is how we set "gold standards". Now I realize that standards will change depending on the experiment and the model system, but who will develop each standard, and who will maintain it? Several entities have already developed standards. I happen to like The ENCODE Consortium's guidelines for NGS experiments and recommend them to PIs when consulting prior to experimentation. The American College of Medical Genetics published its own standards for variant calling this May that I also like to use. But is letting each field establish its own sort of "gold standard" the best way to approach this issue? I'm not entirely sure what the right answer is. I do think there is a need for some entity in the field to publish and maintain guidelines. Some people may argue that we already do this through previous publications in the field. I'm sorry, but I just don't see how designing an experiment because "everyone else is doing it that way" is the best approach to setting guidelines. And the work isn't just about creating guidelines: bioinformaticians need to be aware of these guidelines prior to analysis, and they need to make following them a priority.
We also do not discuss how important bioinformaticians are to these studies. This paper demonstrates that the BIOINFORMATICS WORK REQUIRES A LOT MORE THAN PLUG-AND-PLAY PIPELINES. Filtering, optimizing, and benchmarking take expertise. Reproducibility will also require bioinformaticians to commit to version control and to publishing adequate methods. This comes down to good training and to standards that bioinformaticians must commit to as a profession. I hope to see publications like this make a serious impact on both future science and bioinformatics as a profession.
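One low-effort habit that helps with the version-control and methods side: write a machine-readable manifest of tool versions and the command line next to every result. Here is a hedged sketch, assuming samtools and bcftools happen to be the tools in your pipeline (swap in whatever you actually run):

```python
# Sketch of a run manifest: record tool versions and the command line
# alongside results so the run can be reproduced later. The tools listed
# are assumptions; log whatever your pipeline actually uses.
import json
import subprocess
import sys
from datetime import datetime, timezone

def tool_version(cmd):
    """Capture the first line of a tool's version banner."""
    try:
        out = subprocess.run(cmd, capture_output=True, text=True)
        lines = (out.stdout or out.stderr).strip().splitlines()
        return lines[0] if lines else "unknown"
    except FileNotFoundError:
        return "not installed"

manifest = {
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "python": sys.version.split()[0],
    "command_line": " ".join(sys.argv),
    "tools": {
        "samtools": tool_version(["samtools", "--version"]),
        "bcftools": tool_version(["bcftools", "--version"]),
    },
}

with open("run_manifest.json", "w") as handle:
    json.dump(manifest, handle, indent=2)
print("wrote run_manifest.json")
```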
FYI, if you don't want to read the nitty-gritty of the Nature Communications publication, GenomeWeb did a good job of summarizing the paper. Enjoy!