In The Pipeline: See this is the sort of thing that drives me nuts ...

Look, I get that we (and I'm including me) are not perfect. I think that imperfections make us much more interesting and I appreciate that.

But can we PLEASE start to maintain our code/vignettes/library/tutorials or whatnot? I mean, it really burns me when I go through the trouble of reading the vignettes, reference materials and whatever tutorials have been released to the internet and ... after hours of my time ... I'm still scratching my head wondering what the BEEP.

It should also burn the creators of these documents that they have spent time making all of these supporting documents and people still cannot use their libraries.

Here's an example (and I apologize for picking on this particular library):

Chip Analysis Methylation Pipeline for Illumina HumanMethylation450 and EPIC

or ChAMP is one of two pipelines I can find that will analyze Illumina EPIC arrays in Bioconductor. I have two objectives: 1) analyze my data (obviously) and 2) compare both pipelines. Now to be fair, I'm having issues with the other pipeline (RnBeads) as well, but ChAMP is a great example of how something so minor can be so time-consuming.

So there are two ways to approach this library:

1. You can run all modules using the function champ.process()

2. You can run each module and subsequent steps individually

Since I am comparing libraries, I am approaching things using both options. In theory, both options should give you the same outcome as long as you use the same arguments. When I run each module (option 2), I have to set up my directory and load my data with champ.load(). The loading function requires a sampleSheet.csv file. This file includes your sample names, Sentrix ID, Sentrix Position, etc. All that good stuff that comes with analyzing Illumina arrays. No problems.

But then I get to champ.SVD() and I want to include an argument with even more sample information that is not contained in sampleSheet.csv. I read the reference on the function and two things pop out:

Argument: Description

sampleSheet: If the data has not been loaded from .idat files and fromFile=TRUE then this points to the required sampleSheet. Default is "sampleSheet.txt"

studyInfoFile: If studyInfo =T, this file will include the additional study information. Default is "studyInfo.txt".

Whoa ... wait a minute. Is the sampleSheet supposed to be a CSV or a TXT? To some people this may appear to be a small issue, but for me this is sort of big. CSV files are separated with a comma between fields (columns), whereas TXT files are usually separated with tabs. This really matters because data loaders use parsers that find each field based on what separates the fields (a comma or a tab). Hand a parser the wrong delimiter and it will happily read every row as one giant field. File format can seriously screw up a pipeline (we've all been there).
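To make that concrete, here is a minimal sketch in Python using the standard csv module. The two-line sample sheet is made up for illustration, and this is not how ChAMP reads files internally; it only shows what happens when the delimiter a parser expects does not match the file.

```python
import csv
import io


def parse(text, delimiter):
    """Split each line of `text` into fields using the given delimiter."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


# Hypothetical comma-separated sample sheet (column names are illustrative).
comma_text = "Sample_Name,Sentrix_ID,Sentrix_Position\nS1,1234,R01C01\n"

right = parse(comma_text, ",")   # parser told to expect commas
wrong = parse(comma_text, "\t")  # parser told to expect tabs

print(len(right[0]))  # number of columns found with the correct delimiter
print(len(wrong[0]))  # number of columns found with the wrong delimiter
```

With the matching delimiter the header splits into three columns; with the wrong one the whole header comes back as a single field, and no error is raised. That silence is why this kind of mismatch is so easy to miss.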

So my thought process goes like this:

1. This library has been updated but no one updated the reference manual ... so all files now have to be CSV. So I create a studyInfo.csv and that fails. So studyInfo has to be a TXT file.

2. This library has been updated and sampleSheet now has to be a CSV file to reflect a new change, but studyInfo still needs to be a TXT. Well, that runs, but it doesn't seem to include anything from the studyInfo file. So ... that fails ... sort of.

3. This library has been updated but champ.SVD() has not been changed to reflect the changes to champ.load() (which now requires a sampleSheet.csv). Well this would be super dumb because it would require me to create two sampleSheets (both a CSV and a TXT). And this also fails.
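One way to cut down on this kind of guessing is to check what is actually inside a file instead of trusting its extension. The sketch below is a hypothetical helper of my own (the names guess_delimiter and guess_delimiter_of_file are not part of ChAMP or any library). It assumes the header row is representative and only chooses between commas and tabs.

```python
def guess_delimiter(header_line):
    """Return ',' or '\\t', whichever appears more often in the header line.

    Returns None when neither character appears at all.
    Ties are resolved in favor of the comma.
    """
    counts = {",": header_line.count(","), "\t": header_line.count("\t")}
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None


def guess_delimiter_of_file(path):
    """Apply guess_delimiter to the first line of the file at `path`."""
    with open(path, newline="") as handle:
        return guess_delimiter(handle.readline())


# Hypothetical header lines, one per format.
print(repr(guess_delimiter("Sample_Name,Sentrix_ID,Sentrix_Position")))
print(repr(guess_delimiter("Sample_Name\tAge\tSex")))
```

This does not tell you which format a given function wants (only accurate documentation can do that), but it does confirm what you are actually handing it before you start renaming files.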

So in the end, I spent 30 minutes changing file formats and I still didn't get the result I wanted. To be frank, I shouldn't have to spend any time tinkering with file formats just because the reference manual doesn't reflect what the arguments actually require. It's also a bad practice on my part, since I am just enabling the problem: we tinker and force code to run instead of getting authors to fix their code, vignettes and reference manuals. This just means that the next person will have to face the same problem I just walked through.

I would like to see libraries (including vignettes and reference manuals) kept up to date. And when errors are reported and nothing gets fixed because the authors are no longer maintaining the code and its supporting documents, I would like to see those libraries pulled down and marked obsolete.

But there is a silver lining ... these issues make me much more diligent about documenting my own code and pipelines. I now pay much closer attention to details when I write my own documentation.


Friday, June 24, 2016
