Thursday, January 7, 2016

Command Line - for Science

My morning routine usually includes my ONE cup of coffee (I'm incredibly proud of the fact that I can get through an entire event-filled day with just ONE cup of coffee. This is a huge decrease from where I started!) and a look through the news. Today I went out of the norm and headed over to Twitter to start the day. I typically do not like Twitter in the morning because the majority of Twitter activity happens in the afternoon. (See this great brief on how to post to social media!) Today was different, though, and I'm glad I started with Twitter, because otherwise I would have missed the tweet that took me to a Kickstarter project for Learning the Command Line - for Science.

I posted some tips for novice bioinformaticians a month ago, and I mentioned how important it is to learn command line tools. There is a lot of emphasis on learning scripting (Perl) and/or programming (Java) languages, and tons of resources are available to help you. I think that is great! I also think it is great to learn the command line ... and I mean learning more commands than how to change directories or create/remove them. If your goal is to use bioinformatics tools and you are not interested in creating them, I would focus on learning Bash, a command line language such as AWK or SED, and R. Other bioinformatics tools are already available to you, and these three suggestions will help you execute pretty much every pipeline out there.

Bash:
Bash is the default shell for both Linux and Mac. When you call 'cd' to change a directory, you are using Bash. There are LOTS of sweet commands that make Bash really useful for bioinformatics, and I encourage you to check out the tutorials that are already available to learn more.

Some of my favorite commands are 'find' and the loop constructs 'for', 'while' and 'until'. 'find' is great for finding files (genius, I know). This command is really helpful on servers with multiple users (because people tend to move files around and then forget where they have put them) and for doing things with all files in a directory and its child directories (such as moving all of your raw reads scattered in a project folder to another volume). Loops are really nice for bioinformatics tools that do not accept multiple inputs (mostly newer tools that haven't been through many version changes). Instead of running a tool's command over and over again for each of your input files (e.g. if you have two or more samples to process), you can write a Bash loop once and have it hand the tool a new input each iteration.
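To make that concrete, here is a minimal sketch of both ideas. All of the file names and folders are made up for illustration, and 'wc -l' stands in for whatever single-input tool you actually run:

```shell
# Set up a toy project with reads scattered in subdirectories
# (hypothetical names, just so the commands have something to chew on).
mkdir -p project/runA project/runB raw_reads
touch project/runA/s1.fastq project/runB/s2.fastq

# 'find' locates every .fastq at any depth and moves them in one shot.
find project -name '*.fastq' -type f -exec mv {} raw_reads/ \;

# A 'for' loop feeds a single-input tool one file per iteration;
# wc -l is a stand-in for a real bioinformatics tool here.
for f in raw_reads/*.fastq; do
    wc -l "$f" > "${f%.fastq}.counts"
done
```

The same loop works no matter whether you have 2 samples or 200, which is exactly why it beats retyping the command.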

Command Line Language:
I use AWK all of the time. Most of the files output by bioinformatics tools, or files that you would download from a public repository, are in a spreadsheet-like format - meaning they have rows and columns. AWK is a great manipulator for data in this format, and it is faster than writing a Perl script to open a file and read each line to blahblahblah. You get my point.

Now I feel the need to address 'manipulator for data', since out of context this phrase could sound like I am changing data. I am absolutely not changing any data; I am just changing things like the column order or the notation. For example: a chromosome column may denote chromosome 1 as either 'chr1' or '1'. They both mean chromosome 1, but their data types differ, since 'chr1' is a string and '1' is an integer. Data tools hate it when you mix data types, which is why the notation sometimes needs to be manipulated so that every chromosome is always denoted the same way (with the same data type).
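Here is what that cleanup looks like in practice. The file contents are invented for illustration; the AWK one-liner strips a leading "chr" from column 1 so every row uses the same notation:

```shell
# A toy variants file mixing 'chr1' and '1' notation (made-up data).
printf 'chr1\t100\n1\t200\nchr2\t300\n' > variants.tsv

# sub() deletes a leading "chr" from the first field; OFS keeps tabs.
awk 'BEGIN{OFS="\t"} {sub(/^chr/, "", $1); print}' variants.tsv > variants_fixed.tsv

cat variants_fixed.tsv
```

Every chromosome now reads '1', '2', etc., so a downstream tool sees one consistent notation.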

Bioinformatics pipelines change so often as new tools are created that the output from one tool may not be formatted correctly as input for the next (e.g. what one tool outputs in column 1 needs to be in column 4 for the next tool). With AWK you can fix all of these things with a quick one-liner, and it executes fast.
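Column shuffling is a one-liner too. A tiny sketch with a made-up four-column file, moving column 1 to the end so it lands in position 4 for the next tool:

```shell
# Toy tab-separated file: the next tool wants column 1 in position 4.
printf 'a\tb\tc\td\n' > in.tsv

# Just print the fields in the order the next tool expects.
awk 'BEGIN{OFS="\t"} {print $2, $3, $4, $1}' in.tsv > out.tsv
```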

SED is fairly comparable to AWK; I just happen to reach for AWK most of the time.
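For comparison, the same "chr" cleanup from above written in sed, which applies a substitution to each line of the stream (again with invented file contents):

```shell
# Toy regions file with the 'chr' prefix (made-up data).
printf 'chr1\t100\nchrX\t50\n' > regions.tsv

# s/^chr// deletes a leading "chr" on every line.
sed 's/^chr//' regions.tsv > regions_fixed.tsv
```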

R - or (said with a pirate accent) AARRRR
R is an incredible environment for running statistics and creating really nice graphics and visuals from your data. PLUS, there has been a real effort to make its use REPRODUCIBLE (insert angels singing). I'm going to stop there, because there are lots of dedicated R users who will tell you of the glories of R. Just know it is definitely worth your while to learn!

So to summarize: if you just want to use bioinformatics tools, you will benefit heavily from Bash, a command line language such as AWK, and a great statistical environment with graphics capabilities like R. Every other step in a pipeline will be filled by a tool that is pretty much plug and play, with different parameters (or arguments) that you should become fairly familiar with.



Disclaimer: The above post makes bioinformatics sound "easy". I am sorry to say that it probably won't be "easy". Any decent data scientist will tell you that data anything just isn't as easy as we tend to describe it. Data are messy and data tools can be a mess too. Cleaning up messes is really frustrating. So if you feel like you are wrangling cats, then you are probably doing it right.

