Musings on Galaxy #1

Galaxy!

The final frontier is here, and it's dressed up as a web framework for biomedical research named Galaxy. As we stand at the crossroads between 2012 and 2013, the world is more technologically orientated than ever, and biomedical research is no different. In genetics (something I am familiar with), it is now common for an experiment to involve no ‘wet-lab’ work at all. Data in publicly accessible databases adds up to petabytes (big guess, but it's a lot!), and there is still value in that data, with many new discoveries lurking in the 1s and 0s.

So how does a biomedical researcher in 2012 spend their valuable time? Less time with PCR (polymerase chain reaction) and agar plates, and more time sitting in front of a screen very similar to the one I am writing this blog post from. The data needs to be processed and analysed, usually from a terminal. Science needs to be reproducible, and bash scripts that are unusable to others are not really going to cut it. Galaxy provides a framework for making command-line tools accessible through a web browser, in an easy and understandable user interface. Today I will attempt to go over Galaxy tools in some detail.

Tools

A standard installation comes with many biomedical tools, but research is very diverse and the standard set may not be enough. Galaxy allows you to easily extend the tool list using its XML tool syntax. The tool's parameters are specified in the Galaxy tool XML file, and the command line you would normally run goes between the <command> tags.

Example

<tool name="my_nobel_prize" version="1">
  <command>
    ./nobel_prize $input_nobel > $output_nobel
  </command>
  <inputs>
    <param name="input_nobel" type="data" format="madeupdatatype"/>
  </inputs>
  <outputs>
    <data name="output_nobel" format="madeupdatatype"/>
  </outputs>
</tool>
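
To make the <command> line a bit more concrete, here is a rough Python sketch of how the $input_nobel and $output_nobel placeholders could be filled in before the command is run. Galaxy itself does this with the Cheetah template engine; this sketch swaps in Python's standard string.Template as a stand-in, and the function name and file paths are made up for illustration.

```python
from string import Template

# The text between the <command> tags of the example tool.
COMMAND = "./nobel_prize $input_nobel > $output_nobel"


def render_command(command, params):
    """Fill each $name placeholder with the value chosen for this job.

    Stand-in for Galaxy's Cheetah templating (an assumption made for
    illustration). substitute() raises KeyError if a placeholder has
    no matching value, so a missing parameter fails loudly.
    """
    return Template(command).substitute(params)


# Hypothetical dataset paths; a real Galaxy server chooses these itself.
job_params = {
    "input_nobel": "/galaxy/files/dataset_1.dat",
    "output_nobel": "/galaxy/files/dataset_2.dat",
}

print(render_command(COMMAND, job_params))
```

The point is only that the names in <inputs> and <outputs> are the same names used as placeholders in the command, so whatever the user picks in the web form ends up on the command line.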

This imaginary tool runs the program nobel_prize, which takes one input of type madeupdatatype and writes its result to output_nobel, which the user can then view and download from Galaxy. Galaxy tracks the format of each dataset along with all relevant metadata, which makes the diverse array of biomedical data formats manageable. A Galaxy tool is as complicated as the command-line tool it wraps is to operate, so adding a tool to Galaxy is not always straightforward; don’t even get me started on interactive programs. Many tools require an external script, usually written in bash or Python, to perform pre- and post-processing of the data. But what makes this different from just running the tool from the command line, apart from the obvious point that it's easier to fill out a web form than to use a terminal (with which I would disagree anyway!)?
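
As a rough idea of what one of those external scripts might look like, here is a minimal Python wrapper sketch. Everything in it is hypothetical: the pre-processing step (dropping blank lines), the post-processing step (adding a header line) and the script name are invented for illustration, and nobel_prize is still the imaginary tool from the example.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def preprocess(raw, cleaned):
    """Hypothetical pre-processing: drop blank lines before the tool runs."""
    lines = [line for line in raw.read_text().splitlines() if line.strip()]
    cleaned.write_text("\n".join(lines) + "\n")


def postprocess(tool_output):
    """Hypothetical post-processing: put a header line above the results."""
    return "# produced by nobel_prize\n" + tool_output


def run_wrapped(tool_cmd, input_path, output_path):
    """Clean the input, run the tool on it, then tidy up what it printed."""
    with tempfile.TemporaryDirectory() as tmp:
        cleaned = Path(tmp) / "cleaned_input"
        preprocess(input_path, cleaned)
        # check=True makes a failing tool raise instead of passing silently.
        result = subprocess.run(
            tool_cmd + [str(cleaned)],
            capture_output=True,
            text=True,
            check=True,
        )
    output_path.write_text(postprocess(result.stdout))


if __name__ == "__main__" and len(sys.argv) == 3:
    # Assumed usage: python nobel_wrapper.py INPUT OUTPUT
    run_wrapped(["./nobel_prize"], Path(sys.argv[1]), Path(sys.argv[2]))
```

With a wrapper like this, the <command> line in the tool XML would call the script rather than the tool directly, for example python nobel_wrapper.py $input_nobel $output_nobel (the script name being an assumption).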

The differences are in workflows, data sharing and histories. In this post I will describe these briefly, and I will go over them in detail in future posts. Histories are analyses in Galaxy that show all input, intermediate and final datasets, as well as every step in the process and the settings used with each. Workflows let you plug tools together into a series of steps, which is very useful for repeat analyses and does not require any scripting. Datasets, workflows and histories can also be shared in Galaxy, so someone with a useful workflow can make it available to others performing similar experiments.

If you would like to learn more about Galaxy follow this link.

https://wiki.galaxyproject.org/

Submitted by James Boocock on

Comments

Does XML really make things simpler?

Thanks James, this is really interesting. Looking at the website, it seems that the focus of Galaxy is really about facilitating "accessible, reproducible, and transparent computational biomedical research". Do you think that a bespoke XML format could be a bit of a barrier to entry? I guess it's better than the most likely alternative, writing small scripts. The focus for Galaxy seems to be on maintaining provenance, and it could possibly sacrifice usability to some degree to achieve that.

Do you have any dealings with researchers and what they think of the system?

Yes

Hi Tim, thanks for your comment.

Yes, you are right: it can be difficult at times to write the XML tool wrapper, and I agree it is a barrier to entry. That said, the Galaxy mailing list is very active, and new users are constantly querying the list and getting help when starting out with Galaxy. Seeing as XML is rigid and well defined, simple tools are really easy to create, and the documentation on the syntax is easy to follow.

Many researchers are not in the business of creating Galaxy tools. Those who are skilled in bioinformatics (who know how to program) are often the ones putting tools into Galaxy. Galaxy has a support network for people to add custom tools to the Galaxy tool shed, and a Galaxy maintainer can add those tools to their own Galaxy very easily. A simpler, less flexible method of adding tools would not be able to wrap any arbitrary command line in Galaxy.

The end user who needs to perform a task on a set of data can do so from within Galaxy, provided the tool is available; otherwise they can visit the tool shed, or us in biochem at Otago, and get a tool added so they can progress with their research.

Researchers in the department at Otago are keen on using Galaxy; some datasets have already been added and shared with the users of our server and tools. These command-line tools used to live in a reference book that was passed around the lab whenever someone needed to perform a particular task.

Here's the tool XML specification if you are interested.

http://wiki.galaxyproject.org/Admin/Tools/ToolConfigSyntax?action=show&redirect=Admin%2FTools%2FTool+Config+Syntax

Second Opinion

Well, that certainly sounds promising compared to what we had to go through in genetics. Believe it or not, barely 2 years back, in one of the "dry" labs where we were completing a PCR experiment, we were given the base sequences for several isoforms of a nicotinic acetylcholine receptor and were expected to manually go through and find the area of highest base similarity... quite painstaking, even with the whole class going at it, and quite error prone. When I asked why we couldn't just use a program to do it, as we had for BLAST etc., he said there was no tool that they currently had access to which does that.

I ended up writing something rudimentary for this purpose and hosting it on my own personal site:
http://www.legionbre.4fd.us/gene.php

But it took quite a while and is not really available to anyone. It's nice to see that this will provide a more open means to share these programs, so that future students don't bleed from the eyes while searching by hand for the mysterious line of highest sequence homology.

Anyway as a side-point in my ramblings, I graduated in Biomedical Science, so if you want a relatively lay person from an XML point of view to assess the usability, I am at your disposal.

Brendan

Too True.

Hi Brendan,

I am also currently a genetics student and many of the things you have said also ring true for me. Thanks for your input by the way.

On a side note, I tried to register for the game at your domain and the validation link does not work. Could you look into this?

Thanks!

Hi Brendan,

We would appreciate the feedback. We currently have a couple of labs using our tools here at Otago, but any others are definitely welcome. It currently isn't accessible outside of Otago, but that will be changing shortly; when we do have an instance of Galaxy set up for NZ researchers, probably hosted by NeSI, we would definitely appreciate any feedback. Tasks such as the one you listed above are already in place within Galaxy (added by the Galaxy developers), so hopefully people shouldn't struggle with those! Being able to perform difficult manual tasks with the aid of a computer, avoiding human-made errors, is the reason we have worked hard on this project. Galaxy really is something to keep your eye on when doing research in this field.

Thanks!
Ed