In the event's opening speech, Dr George Slim of MoRST talked about how, in the future, all research will be eResearch, and about the need to start dealing with it now lest we all face problems later. MoRST has been involved with the field over the last few years, both with 'easy wins' such as video conferencing and with more under-the-radar work such as data access management. He also talked about problems such as unpublished data languishing in archives and desk drawers, rather than being part of the knowledge base of science.
eResearch Symposium 2010 Presentations and Recordings
Research data is increasingly becoming important in its own right, not just as the means of deriving a publication. We have been dealing with the data deluge since the turn of the millennium, and the scale of the challenges continues to increase. This presentation will review how we got to where we are today, looking at the pivotal role of data and data management in the history of communication. It will then move to consider the present role of data in scholarly communication by examining a range of problems in the published literature. It will conclude by examining some of the initiatives being taken to start to fix the future of data, and the sorts of services and approaches that will be required.
The data challenges in the environmental sciences lie in discovering relevant data, dealing with extreme data heterogeneity, and converting data to information and knowledge. Addressing these challenges requires new approaches for managing, preserving, analyzing, and sharing data. In this talk, I first introduce several environmental science challenges and relate those to current cyberinfrastructure challenges. Second, I introduce DataONE (Data Observation Network for Earth), which represents a new virtual organization that will enable new science and knowledge creation through universal access to data about life on earth and the environment that sustains it. DataONE encompasses a distributed framework and sustainable cyberinfrastructure that meets the needs of science and society for open, persistent, robust, and secure access to well-described and easily discovered Earth observational data. Third, I conclude by presenting several opportunities for international collaboration in the environmental sciences and cyberinfrastructure areas.
Most commentators agree that the only way forward for New Zealand is to forge a high-productivity knowledge-based economy. However, in the late twentieth and early twenty-first centuries, it is the large global cities that have driven innovation and the generation of knowledge. If New Zealand is to take the high productivity path, it must overcome its geographical isolation and low population density by learning to act like a city of four million people. In this talk, I will discuss the nature and magnitude of this challenge by looking quantitatively at innovation and the generation of knowledge around the world. I will discuss how eResearch will play an essential role in building scale and collaboration within New Zealand and in extending the Kiwi knowledge network around the world.
Cancer is Australia's largest disease burden, and arises from the accumulation of genetic damage. Typically cancers accumulate multiple mutations, and these will vary from one cancer type to another, from person to person, and may even vary between different tumour sites in the same person. This variation could mean the best treatment for one patient might have no effect for another, or that a treatment that worked in the past might have no effect on a cancer relapse. The ultimate dream for cancer patients would be to determine exactly what mutations caused the disease, and exactly what treatments would work the best - a concept known as personalized medical genomics. Although conceptually simple, collecting, storing, and analysing the large-scale biological data generated as part of medical genomics studies represents a huge informatics challenge - eclipsed only by the challenge of integrating this data with existing biological resources and knowledge.
For twelve hours on 9 May 2010, a combination of six radio telescopes in Australia and New Zealand (including the first SKA dish in Western Australia and the AUT radio telescope in Warkworth) observed the core of the radio galaxy Centaurus A. A few weeks earlier the same set of Australian and NZ radio telescopes successfully observed an active galactic nucleus with a supermassive black hole and relativistic jet structure (PKS 1934-638). Both Centaurus A and PKS 1934-638 are objects of great scientific interest. Following the installation of the KAREN connection at the AUT radio telescope, Warkworth data was transferred to Western Australia, where it was correlated, calibrated and imaged. The main objective of this activity was to virtually create a “skeleton” of the Australasian SKA to demonstrate the advantage of the 5500 km baseline and provide the first science from this Australasian SKA “prototype”. It was achieved on time and resulted in significant science return. Warkworth is now available to the international radio astronomy community for VLBI (very long baseline interferometry) and its real-time eResearch version, eVLBI, as a part of the Australian Long Baseline Array. Challenges and future plans for this exciting international eResearch are outlined.
Deep engagement between biologists, clinicians and computational experts greatly increases the amount of biological understanding we can gain from high-content human data. In an example of this approach, we have performed a meta-analysis of breast cancer microarray data from around the world. Several novel analytic methods were applied to this data set, most of which would not be feasible without the use of collaboration tools, grid computing and high performance computing.
Drs Lance Miller (Wake Forest University, USA), Anita Muthukaruppan (University of Auckland) and Mik Black (University of Otago) assembled and annotated a collection of Affymetrix microarray datasets comprising breast tumours from 950 women (including NZ women) with extensive clinical details. This data set was large enough to allow studies of small subgroups of breast cancer patients that previously we could not explore with any degree of statistical power. Four novel analysis methods were developed in collaboration with clinicians and applied to this data set, to answer key questions about breast cancer:
1) What transcription factors in tumours are most relevant to the survival of breast cancer patients?
2) What gene networks are active in breast cancer patients?
3) What clinically significant genes are amplified in breast cancer?
4) What genes modulate transcription factor activity in breast cancer?
The use of these novel, computationally intensive methods to analyse a large clinical data set provides a good example of the power of generating a small collaborative eResearch community focused on a specific problem. With publicly available collections of clinical and molecular data continuing to grow rapidly, there are tremendous opportunities for biological discovery using approaches such as those outlined here, for which eResearch tools are essential.
The classical approach to drug discovery and development is to test large collections of chemical compounds for therapeutic activity in "Wet Labs", in solution, within biological assays that report on a disease-specific target. Once compounds active in the assays ("hits") have been identified, a medicinal chemistry programme is initiated that explores the chemical space around a molecule by making a large series of directly related compounds, known as analogues. It is the resulting relationship between chemical structure and biological activity that identifies the drug lead. This is known as hit-to-lead development, and while this phase can be accomplished within an academic setting, engaging in hit discovery is much less accessible, limited by the absence of High Throughput Chemical Screening facilities in New Zealand and the cost of accessing Australian facilities.
When the 3-dimensional structure of a target is known at atomic resolution, it is possible to use this information to screen digital libraries of compounds by matching computed physico-chemical properties - this process is known as virtual screening. This virtual screening process is a digital equivalent to the High Throughput Chemical Screening approach, and with low setup costs represents a more readily accessed discovery platform capable of stimulating wet lab drug discovery within an academic setting by identifying small numbers of likely active compounds.
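As a rough illustration of screening a digital library by matching computed properties, the sketch below filters a small in-memory library against a set of property windows. The compound names, property values and windows are all invented for illustration; a real virtual screen works against the 3-dimensional target structure and is considerably more involved than this filter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Compound:
    name: str
    mol_weight: float    # g/mol
    logp: float          # computed octanol-water partition coefficient
    h_bond_donors: int


# Hypothetical property windows assumed to suit the target's binding site.
TARGET_WINDOWS = {
    "mol_weight": (250.0, 450.0),
    "logp": (1.0, 4.0),
    "h_bond_donors": (0, 3),
}


def matches(compound, windows):
    """Return True if every listed property lies inside its window."""
    return all(
        low <= getattr(compound, prop) <= high
        for prop, (low, high) in windows.items()
    )


def screen(library, windows):
    """Keep only the compounds whose computed properties match the windows."""
    return [c for c in library if matches(c, windows)]


# Invented example library; not real screening data.
LIBRARY = [
    Compound("cmpd-A", 320.4, 2.1, 2),
    Compound("cmpd-B", 512.7, 3.0, 1),   # outside the mol_weight window
    Compound("cmpd-C", 298.3, 5.2, 0),   # outside the logp window
    Compound("cmpd-D", 410.0, 3.9, 3),
]

hits = screen(LIBRARY, TARGET_WINDOWS)
print([c.name for c in hits])
```

The point of the sketch is only that each compound is evaluated independently of the others, which is why this kind of workload spreads naturally across many processors.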
Within some of the drug discovery and development programmes at the Auckland Cancer Society Research Centre, virtual screening has proved useful for new "hit" discovery where the atomic structure of the target was known, with one screen taking approximately 7 weeks to complete on a desktop machine using a single CPU. In a collaboration between the Auckland Cancer Society Research Centre and the Centre for eResearch, we are building on this discovery success by using the high performance computing environment provided by BeSTGRID to develop a large-scale virtual screening environment based on the Grisu framework that will facilitate an increase in hit discovery performance in the University of Auckland environment.
Early results are showing significant improvements in time to discovery, with the current increases in scale of computing environments leading to a 7x speed up in analysis run times. Moreover, it has facilitated a change in the use of virtual screening, to include concurrent focussed screening around specific chemical features of hit compounds. Our current plans further increase the scale of analysis possible, both in terms of the digital libraries of compounds and with respect to the number of projects enabled by both Grisu and the scientific technology.
The use of calculated molecular descriptors is now an integral part of drug design. Nowadays medicinal chemists can rely on defined areas within chemical space, such as fragment space, lead-like chemical space, drug-like chemical space and known drug space, based on these descriptors. The molecular descriptors reflect the physicochemical properties of the chemicals under investigation, which in turn affect their pharmacokinetic profile. The molecular descriptors employed are based on empirical data or are even simple counts of phenomena in the chemical structure, leading to a large theoretical uncertainty. In order to put the definition of regions within chemical space on a robust theoretical footing, quantum chemical calculations were performed, which can be directly related to the physical properties of the molecules under investigation, e.g., polarisability and dipole moments. In order to perform these calculations a robust computational infrastructure is essential.
Biomedical problems are often found in complex environments consisting of vast numbers of interrelated components at multiple temporal and spatial scales. The exponentially expanding number of possible interactions renders the development of understanding of many phenomena impossible through conventional experimentation alone. Systems Biology offers an alternative approach using computational analysis of quantitative models in order to discern the causes and effects of emergent behaviour in complex systems. Adopting this strategy of enquiry, the limiting factor for analysis leading to insight generation becomes one of available computer processing cycles. At present, our science is artificially limited to the questions we can tractably pursue in the computer time available.
BeSTGRID, providing online access to distributed computing resources, represents a significant scientific opportunity. Here we have developed a prototype model analysis software tool and user interface for developers of quantitative biomedical models. The tool understands the model-exchange format CellML, a proven format for Systems and Synthetic Biology. Each model is marked up with metadata providing simulation instructions, and instances of the model, each parameterised with pre-determined parameter sets, are scheduled on BeSTGRID via a Java-based user interface built using the Grisu framework. These BeSTGRID jobs are then computed using the open-source CellML Simulator, before time-course results are compiled and returned to a user-accessible area.
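The fan-out pattern described here (one model, many pre-determined parameter sets, one job per set, results collected afterwards) can be sketched locally as follows. This is only an illustration under stated assumptions: the `simulate` function is a toy stand-in for a CellML simulation run, the thread pool is a stand-in for grid job submission, and none of the names correspond to the Grisu or CellML Simulator interfaces.

```python
from concurrent.futures import ThreadPoolExecutor


def simulate(params, t_end=1.0, steps=100):
    """Toy stand-in for one model run: forward-Euler integration of dx/dt = -k*x.

    Returns the time course as a list of steps + 1 values.
    """
    k, x = params["k"], params["x0"]
    dt = t_end / steps
    course = [x]
    for _ in range(steps):
        x += dt * (-k * x)
        course.append(x)
    return course


def run_sweep(parameter_sets, max_workers=4):
    """Run one simulation per parameter set; results come back in input order.

    A local thread pool only illustrates the scheduling pattern; it is an
    assumption standing in for submitting independent jobs to a grid.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(simulate, parameter_sets))


# Hypothetical pre-determined parameter sets.
parameter_sets = [{"k": k, "x0": 1.0} for k in (0.5, 1.0, 2.0)]
results = run_sweep(parameter_sets)

for params, course in zip(parameter_sets, results):
    print(params["k"], round(course[-1], 4))
```

Because each run depends only on its own parameter set, the runs can be distributed across as many machines as are available without any coordination between them.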
To verify the applicability of this prototype, we repeated the computationally expensive portion of the analysis of a significant intracellular signalling system in cardiac myocytes, which had previously been performed using more limited computational resources. Briefly, a model of a signal transduction pathway was analysed to determine the molecular species and reactions most significant to cells’ production of a key signalling molecule (inositol 1,4,5-trisphosphate). Testable parameter sets were determined via the Morris method as previously, and BeSTGRID jobs were scheduled to perform the ~600,000 model simulations required. As an indicator, the simulation computation was completed in less than 50 hours real-time on BeSTGRID, as opposed to 2 weeks using the high performance computing platform in the original study.
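For readers unfamiliar with the Morris method, the sketch below shows a simplified version of the idea: parameters are perturbed one at a time along random trajectories, and the resulting "elementary effects" are summarised per parameter. It is not the code used in the study. The toy model, the restriction to upward steps only, and all function names are assumptions made for illustration, and inputs are taken to be scaled to the unit interval.

```python
import random
import statistics


def morris_screen(model, n_params, n_trajectories=20, levels=4, seed=0):
    """Simplified Morris screening on the unit hypercube.

    Returns (mu_star, sigma): the mean absolute elementary effect and the
    standard deviation of the elementary effects for each parameter.
    Simplification: every step is +delta (the full method also allows -delta).
    Assumes an even number of levels.
    """
    rng = random.Random(seed)
    delta = levels / (2.0 * (levels - 1))
    # Base levels are chosen so that x + delta stays inside [0, 1].
    base_levels = [i / (levels - 1) for i in range(levels // 2)]
    effects = [[] for _ in range(n_params)]

    for _ in range(n_trajectories):
        x = [rng.choice(base_levels) for _ in range(n_params)]
        y = model(x)
        order = list(range(n_params))
        rng.shuffle(order)
        for i in order:               # perturb one parameter at a time
            x_next = list(x)
            x_next[i] += delta
            y_next = model(x_next)
            effects[i].append((y_next - y) / delta)
            x, y = x_next, y_next     # continue the trajectory from here

    mu_star = [statistics.mean([abs(e) for e in ee]) for ee in effects]
    sigma = [statistics.pstdev(ee) for ee in effects]
    return mu_star, sigma


def toy_model(x):
    """Invented model: x[0] is strongly linear, x[1] and x[2] interact, x[3] is inert."""
    return 5.0 * x[0] + x[1] * x[2] + 0.0 * x[3]


mu_star, sigma = morris_screen(toy_model, n_params=4)
for i, (m, s) in enumerate(zip(mu_star, sigma)):
    print(f"param {i}: mu*={m:.3f} sigma={s:.3f}")
```

A large mu* marks an influential parameter, while a large sigma suggests interaction or non-linearity. Each trajectory costs one model run per parameter plus one, which is how a screen over many parameters and trajectories grows into a very large number of simulations.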
The ~6.7-fold decrease in the time taken to generate results for signalling research scientists enables the examination of systems of greater complexity. Where it was possible to analyse one pathway previously, we can now examine the behaviour of a network of 6 or 7 interacting pathways. Moreover, since we have used a generic model format and analysis technology, research speed increases here are applicable not just to signalling but to any biomedical research model that can be expressed in CellML, including tissue- and organ-level work.
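The ~6.7-fold figure follows directly from the run times quoted above, as this short check shows. It assumes "2 weeks" means 14 full days of wall-clock time and treats 50 hours as an upper bound on the BeSTGRID run, so the result is a lower bound on the speed-up.

```python
HOURS_PER_WEEK = 7 * 24

original_hours = 2 * HOURS_PER_WEEK   # "2 weeks" in the original study
grid_hours = 50                       # upper bound: "less than 50 hours"

speedup = original_hours / grid_hours
print(f"{speedup:.2f}x")              # prints "6.72x"
```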
The increase in computational power will now enable additional innovations in model simulation technology. Our future work includes developing the simulation software and associated metadata specifications to allow more flexible specification of the simulations and post-processing. This will facilitate the examination of not just one cellular activity at a time, but life-cycles of biological entities such as cells, tissues, organs or complete organisms. This is essential to the development of meaningful biomedical insights into the functioning of whole systems of life.