eResearch Symposium 2010: a fascinating beginning

With some 120 people registered for the first annual eResearch Symposium, the event wasn't exactly a small one - and its subject is garnering increasing interest (and becoming increasingly important).

First, though, a note of explanation: what, exactly, _is_ eResearch?  Still in its infancy in many ways, the field (and here I shall quote extensively, as the explanation is so good)  "works with and supports research communities, looking for approaches and insights that can apply across research disciplines, including the humanities and social sciences, physical and biological sciences, math and engineering. It includes the computing and software platforms that connect equipment, data, and other computing resources with people, along with collections management, platforms to run experiments, and advanced collaboration tools.  eResearch communities thrive on deep engagement with researchers, and aim to support the formation and operation of effective digitally-supported research communities."  

In the event's opening speech, Dr George Slim of MoRST talked about how, in the future, all research will be eResearch, and about the need to start dealing with it now lest we face problems later. MoRST has been involved with the field over the last few years, both with 'easy wins' such as video conferencing and with more under-the-radar work such as data access management. He also talked about problems such as unpublished data languishing in archives and desk drawers, rather than being part of the knowledge base of science.

He finished off by saying:

"We're looking at the [symposium] as the near-beginning of a long process in which eResearch has a tremendous impact on the research community, and where NZ continues to be a connected and functioning part of the world's research infrastructure, based on the abilities of modern information technology". 

The next speaker was Australia's Dr Andrew Treloar, who works for the Australian National Data Service (ANDS).  ANDS is particularly interested in encouraging and facilitating the re-use of data, given how much of it (as also mentioned by Dr Slim) never sees the light of day.   

Jaws throughout the audience dropped when he spoke about the origins of civilisation.  In a nutshell (or, possibly, clay sack), 'we have accountants to thank for civilisation'.  Written language began, some 9,000 years ago, as a means of keeping track of contracts and ownership.  Fascinating stuff!

For the last 200 years, however, data has been treated as the poor stepdaughter by journals, which prefer to concentrate on the papers based upon it.  Indeed, he said that the state of data in many scholarly journals is terrible: it's inconvenient, imprisoned, invisible, inaccessible, and sometimes even ignored (a particular problem for research which generates negative results, or results that have already been published).  In essence, much of the data from which research papers are written can't be accessed by people who might want to use it.  Even worse, because most negative or duplicated results are never published, researchers keep reinventing the wheel.

Case studies including the Hubble Space Telescope programme show that making data accessible greatly increases the number of papers published from that data, and recent research has shown that open access data also increases the number of citations a paper's authors receive.

So, what is driving this renewed focus on the importance of data?  According to Dr Treloar, two primary developments: the shift to more data-intensive research, and far better (and continually improving) information systems.

A good example of more data-intensive research was discussed by another keynote speaker, Dr Nicole Cloonan of the Centre for Medical Genomics, The University of Queensland.

Dr Cloonan spoke of the challenges associated with mapping the 'cancer genome', the first of which is that every cancer - it being a 'disease of the genome' - is different. This means that it simply isn't enough to sequence a few genomes: any improvement in our understanding of what goes wrong, and how best to treat it, will require the sequencing of the genome, transcriptome and epigenome of thousands of people.

Why is this difficult?  Of course, the cost and time involved are major issues, but what had people at the Symposium sitting forward, open-mouthed, was the sheer size of the data sets generated.  Each 'run' (sequencing of a genome) generates about 15 terabytes of data, with 4-6 runs needed per patient.  This means that a study of 500 patients needs storage for somewhere around 30-45 petabytes of raw data - and regulations stipulate that the raw data be kept for 7 years!  Even with improved compression, the amount of storage required remains enormous.
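As a rough sanity check of those numbers, here's a minimal back-of-the-envelope sketch using only the figures quoted above (the run size, runs per patient and cohort size are Dr Cloonan's approximate numbers; everything else is illustrative):

```python
# Back-of-the-envelope estimate of raw storage for a cancer genomics cohort,
# using the approximate figures quoted above.

TB_PER_RUN = 15            # raw data per sequencing run, in terabytes
RUNS_PER_PATIENT = (4, 6)  # typical range of runs needed per patient
PATIENTS = 500             # hypothetical cohort size

low_tb = TB_PER_RUN * RUNS_PER_PATIENT[0] * PATIENTS   # 30,000 TB
high_tb = TB_PER_RUN * RUNS_PER_PATIENT[1] * PATIENTS  # 45,000 TB

# 1 petabyte = 1,000 terabytes (decimal units, as used loosely above)
print(f"Raw storage needed: {low_tb / 1000:.0f}-{high_tb / 1000:.0f} PB")
# -> Raw storage needed: 30-45 PB, all of which must be retained for 7 years.
```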

The field of eResearch is going to be valuable in many other ways, too.  Dr Shaun Hendy, Deputy Director of the MacDiarmid Institute for Advanced Materials and Nanotechnology, talked about how eResearch could help New Zealand become the 'city of four million people' needed to boost our productivity.

As Hendy pointed out: the world is not flat.  Knowledge-intensive activities incur spatial transaction costs and therefore tend to accrete.  And there is increasing acceptance that New Zealand's future needs to be heavily bound up with knowledge-intensive activities, given our small size and great distance from much of the rest of the world.

Key to such an endeavour would be the creation of strong collaborative networks throughout the country, as Dr Hendy's research has shown that innovation tracks strongly to population size and density.  So, while New Zealand cities are as productive per capita (in terms of patents, at least) as their Australian counterparts, Australian cities are bigger, and so tend to gather larger numbers of inventors.  Fortunately for New Zealand, however, collaborative networks don't actually need to be geographically clustered - he cited California as a perfect example. 

Networks and support infrastructure such as the KAREN network and BeSTGRID are already doing great work in helping Kiwi scientists work more collaboratively, faster, and with better data.

Of course, this brings us back to the problem of data storage and re-use.  Professor William Michener, Professor and Director of e-Science Initiatives at University Libraries at the University of New Mexico, and the final keynote speaker, spoke about cyberinfrastructure, and in particular how it is addressing this problem in the environmental sciences.  Projects such as the Data Observation Network for Earth (DataONE) are generating vast amounts of information, but they are also using a new type of distributed, sustainable network which allows that information to be easily discoverable and usable.

There's still some way to go. The data storage issue is getting worse, rapidly.  This year alone, humanity is going to generate around a zettabyte of data, for which we have only some 600 exabytes of storage.  Since a zettabyte is a thousand exabytes, that means throwing away roughly 40% of it.  And that's just this year - before we even get to building, maintaining, and persuading people to use the systems that make what data we can store accessible to others.
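A quick back-of-the-envelope check of that 40% figure, in the same spirit as the sketch above (the round numbers are simply those quoted in the previous paragraph):

```python
# Quick check of the storage-gap figure quoted above (round numbers only).
GENERATED_EB = 1000  # ~1 zettabyte generated this year, expressed in exabytes
STORAGE_EB = 600     # estimated storage available, in exabytes

discarded_eb = GENERATED_EB - STORAGE_EB
print(f"Data with nowhere to go: {discarded_eb} EB ({discarded_eb / GENERATED_EB:.0%})")
# -> Data with nowhere to go: 400 EB (40%)
```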

Far from viewing this as an overwhelming and intractable problem, the feeling amongst the speakers and attendees at the Symposium was that it is one which can be solved - and an opportunity to finally begin giving data its rightful place in research, something towards which the event was another great step.

For a record of the speeches, and the other events which took place, have a look for the #nzes tweets, here.

The presentations, complete with speakers' slides, are available here on the eResearchNZ website, and the podcasts can be found here.
