Data-intensive science - trends in data reuse and management

The reuse and management of research data is becoming increasingly important. Data-intensive science represents a transition from traditional hypothesis-driven experimentation to identifying patterns and undertaking modelling and simulation using increasingly massive volumes of data collected by thousands of researchers the world over. This means more breakthroughs across research discipline boundaries, and more bang for the research buck.

Because of this, data reuse is rapidly becoming a focus of policy and funding agencies internationally and in New Zealand. Open data is now government policy1. Managing research data well is also becoming essential to ensuring the integrity, transparency and robustness of research, so it can be defended against criticism and attack.

This article explores the trends in research data reuse and management. Further posts will look at the current policy context in New Zealand, the future requirements for institutional data management, the risks of doing it poorly, and the benefits of doing it well.

This post is part of consultation on the Framework for eResearch Adoption project. As such, your comments and feedback are very welcome, will be considered thoughtfully, and will inform the framework, and future government policy and investment in eResearch.

What are the implications of these trends for eResearch and data reuse and management in New Zealand? What are the differences between Crown Research Institutes (CRIs) and Universities in this regard? What do we need in place institutionally and nationally to support improved data management, and uptake of data-intensive scientific methods?

Please comment at the end of this article, or email to give feedback.

Global Trends in Research Data Reuse and Management

The most important global trend impacting on research data management is the emergence of data-intensive science, or the ‘Fourth Paradigm2’. This involves:

  • The transition of science from being about 1) empirical observation of the natural world, to 2) theory-based, with hypotheses and experimentation, to 3) computational, using modelling and simulation, and now to 4) data-intensive ‘eScience’.
  • Collecting more and more data through automated systems, including sensor networks, instruments large and small, DNA sequencing, and satellite imagery. This means data is no longer collected in a bare-minimum sense specifically for each research project; instead, oceans of data are becoming available for use by many different researchers.
  • The huge increase in data volumes enables researchers to sift through the data, identify patterns, draw connections, develop and test hypotheses, and run experiments ‘in-virtuo’ using simulations.
  • This unification through information systems of the processes of observation, hypothesis, experimentation and simulation means science can tackle bigger problems, on larger scales, and involving greater numbers of researchers across the globe.

The transition to data-intensive science includes a number of specific trends in the research sector. These are as follows.

Data collection and aggregation:

  • The proliferation of automated data capture and collection technologies in almost every field of research (e.g. fMRI in neuroscience, EM soil mapping, bathymetry, and satellite imagery such as NZ’s Land Cover Database)
  • Increased use of national level and global level discipline specific data repositories (e.g. Genbank, GBIF, GEOSS3, NZ Social Science Data Service4)
  • Different research disciplines adopting data sharing and reuse, and the fourth paradigm in general, at different paces. Uptake is often driven by the existence, or not, of very large scale infrastructure (e.g. the Hubble Space Telescope, the Large Hadron Collider) that stores data centrally by default, and by the emergence of professional norms around central deposit of data on publication (e.g. Genbank).

Changes in the scale of simulation:

  • The nature of models is changing from representing small local systems to much broader spatial/temporal scopes, and simulations are being used to understand larger scale phenomena and make predictions (e.g. global climate simulations, detailed simulations of the human heart).
  • Models are becoming larger than any one researcher can program themselves, and are instead large collaborative efforts.
  • Desktop computers are no longer sufficient to run models; simulations require high performance computers and clusters, and both require and generate massive amounts of data that need to be moved between research institutions across the globe.

Changes in the nature of and demand for verification/defensibility:

  • The grand challenges facing humanity require science to be done at large scales, and to challenge current consumption and behaviour patterns. This increasingly generates tension, and the scientific process is coming under much greater levels of scrutiny. Data on which conclusions are based, and the methods used to produce those conclusions, have to be available for independent verification.
  • Scientific workflow technologies are emerging to automate and allow replication of data aggregation, analysis and interpretation, again opening research, and the data underpinning it, to greater examination.

Discovery and access:

  • Discovery of relevant data is becoming an issue as the number of data sets, and the volume of data grows. Metadata catalogues and federated data search engines are becoming essential, as are data preservation and curation activities.
  • Researchers are starting to require the ability to trawl datasets and run automated comparisons to see whether other datasets resemble their own, then drill down, look at the attributes measured, disambiguate terms, and determine to what extent the datasets are comparable.
  • Ways of structuring data to support discovery, access and comparison are being rapidly developed and adopted, including Linked Data, structured ontologies, and the semantic web.
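The disambiguate-then-compare step described above can be sketched in a few lines. This is a minimal illustration only, not any real system: the dataset attribute names and the synonym table standing in for a structured ontology are all invented for the example.

```python
# Hypothetical sketch of automated dataset comparability checking.
# The attribute names and the synonym table are invented; a real
# system would resolve terms against a shared, structured ontology.

# A tiny 'ontology': local attribute terms mapped to canonical concepts.
SYNONYMS = {
    "air_temp": "air_temperature",
    "temperature": "air_temperature",
    "rainfall": "precipitation",
    "precip_mm": "precipitation",
    "wind_spd": "wind_speed",
}

def canonical(term):
    """Disambiguate a local attribute name to a canonical concept."""
    return SYNONYMS.get(term, term)

def comparability(attrs_a, attrs_b):
    """Return the shared canonical concepts and a rough overlap score."""
    a = {canonical(t) for t in attrs_a}
    b = {canonical(t) for t in attrs_b}
    shared = a & b
    score = len(shared) / len(a | b) if (a | b) else 0.0
    return shared, score

# Two hypothetical climate-station datasets with differing local terms.
shared, score = comparability(
    ["air_temp", "rainfall", "humidity"],
    ["temperature", "precip_mm", "wind_spd"],
)
print(sorted(shared))  # the attributes the two datasets have in common
print(round(score, 2))
```

In practice the synonym table would be replaced by lookups against shared vocabularies such as Linked Data ontologies, but the shape of the check, disambiguate each term and then compare, stays the same.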

Increased collaboration (nationally, internationally, and cross-disciplinary):

  • Increased specialisation in research expertise, and in support functions such as informatics & data management, means bigger research teams are necessary and in turn require more collaboration/coordination and processes to allow data sharing and reuse.
  • Methods such as remote diagnostics are being developed, where data will move between someone who needs to know an answer, and a specialist in another part of the world (e.g. high resolution video imaging in real time through a stereo microscope at a shipping port to a biosystematics expert in another country).
  • Increased engagement of ‘citizen scientists’ doing some of the work of data gathering, and crowdsourcing of data analysis (e.g. species observation networks; Goldcorp opening its prospecting data and offering a reward to geologists, prospectors and academics worldwide who helped it locate deposits).

Shifts in publication processes:

  • In some fields the number of publications based on reuse of data is starting to outstrip the number based on primary collection of data.
  • Publishers are requiring datasets supporting the research to be lodged at the time of publication, given unique identifiers, and in some cases made available for review before papers are published.
  • Methods such as dataset citation and scientific workflows are emerging to cope with the need to manage complex data attribution chains.
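As a rough illustration of what managing an attribution chain involves, the sketch below walks from a derived dataset back to all the primary datasets it builds on, so each can be credited. The identifiers and the derivation registry are invented for the example; a real scheme would use persistent identifiers such as DOIs and a proper provenance store.

```python
# Hypothetical sketch: walking a data attribution chain so every
# upstream dataset a result depends on can be credited. All
# identifiers here are invented, not real DOIs.

# Each dataset records the identifiers of the datasets it was derived from.
DERIVED_FROM = {
    "doi:10.9999/model-output-v2": ["doi:10.9999/cleaned-obs-v1"],
    "doi:10.9999/cleaned-obs-v1": ["doi:10.9999/station-raw-2010",
                                   "doi:10.9999/satellite-raw-2010"],
    "doi:10.9999/station-raw-2010": [],
    "doi:10.9999/satellite-raw-2010": [],
}

def attribution_chain(dataset_id, registry):
    """Return every dataset (direct or transitive) that dataset_id builds on."""
    seen, stack = [], [dataset_id]
    while stack:
        current = stack.pop()
        for parent in registry.get(current, []):
            if parent not in seen:
                seen.append(parent)
                stack.append(parent)
    return seen

# Citing the derived model output should credit both raw datasets too.
print(attribution_chain("doi:10.9999/model-output-v2", DERIVED_FROM))
```

The point of recording derivation links at deposit time is exactly that this walk stays mechanical: attribution for a tenth-generation derived dataset is no harder than for a first-generation one.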

These global trends in science are also driven by technology trends, including:

  • User expectations about search, discovery, visualisation, and collaboration tools are being set by global-scale, consumer-level providers such as Google, Amazon and Facebook, which are funded commercially (e.g. Google’s annual R&D budget is NZD $2B, about the same as NZ’s entire science system spend).
  • Cloud computing is emerging as a way to significantly reduce cost, massively increase scalability of, and increase access to commodity computing and data storage infrastructure.
  • Mobile devices and their accompanying sensors, data storage and display technologies are being rapidly advanced by the consumer market (e.g. digital cameras, iPhones).
  • Software is increasingly shifting to the web as a delivery vehicle/user interface, software as a service is becoming more pervasive, and in many areas open source has become the dominant mode of software production.

National Trends

Research data management is also impacted by national level trends, in other countries and in New Zealand specifically. These include:

  • The establishment of national centres to provide expert advice and services on data preservation, collection, curation and reuse (e.g. the Australian National Data Service, the UK Digital Curation Centre)
  • Increased coordination and sharing of data management infrastructure and tools across research institutions
  • The emerging requirement from research funding agencies for data management planning to be included in funding bids (e.g. US National Science Foundation announced this in May 20105, the UK Natural Environment Research Council has this as a requirement)
  • The rapid development of the ‘Open and Transparent Government’ movement in the last two years meaning elevated expectations about data access from the public and politicians, and more public money being put into data infrastructure (e.g. the US Open Government Initiative)
  • Open access licensing frameworks being adopted by individual countries, often based on Creative Commons and/or open source licences (e.g. the UK Open Government Licence, the New Zealand Government Open Access and Licensing (NZGOAL)6 framework)
  • Increasing use of open data in public consultation processes (e.g. the recent National Environmental Standard for Plantation Forestry in New Zealand7 used an online discussion forum and provided access to relevant government datasets)
  • The establishment of an ‘open data’ community outside of government and research organisations, who have the skills and desire to take publicly funded data and develop value added tools and services (e.g. GreenVictoria8, a service aimed at increasing public awareness of climate change, using water consumption data and other Australian Government datasets; SwimWhere9, an NZ mashup and iPhone app using water quality data)

In New Zealand the government has strongly signalled a move towards coordination and sharing of ICT systems, resources and data across the public sector. This is expressed in the recently released ‘Directions and Priorities for Government ICT’1. This mandates the use of shared services where they are available. It also has a particular focus on open data, covered in Direction 2 ‘Support open and transparent government’ which includes the following priority:

“Support the public, communities and business to contribute to policy development and performance improvement”

It is accompanied by the following statements:

“Open and active release of government data will create opportunities for innovation, and encourage the public and government organisations to engage in joint efforts to improve service delivery.”

“Government data effectively belongs to the New Zealand public, and its release and re-use has the potential to:

  • allow greater participation in government policy development by offering insight and expert knowledge on released data (e.g. using geospatial data to analyse patterns of crime in communities)
  • enable educational, research, and scientific communities to build on existing data to gain knowledge and expertise and use it for new purposes”

Government agencies and research organisations in New Zealand are being encouraged to use NZGOAL rather than ‘all rights reserved’ copyright licences. There is an expectation from government that publicly funded research data be made openly available unless there are very good reasons not to (e.g. public safety, privacy, commercial sensitivity, exclusive use until after publication).

Archives New Zealand is currently planning a Government Digital Archive. This will enable Archives New Zealand to take in large-scale transfers of government agency digital records, such as email messages, videos, databases and electronic documents. This may also be able to take in historical research datasets where organisations are not able to archive and publish these themselves. This project is being done in collaboration with the National Library of New Zealand and the existing infrastructure of the Library’s National Digital Heritage Archive (NDHA) will be leveraged to provide a full solution for digital public archives.

In the New Zealand research sector the National eScience Infrastructure (NeSI)10 business case has recently been approved by Cabinet. NeSI represents the most significant infrastructure investment for New Zealand’s Science System in the last twenty years. It will provide a nationally networked virtual high performance computing and data infrastructure facility distributed across NZ’s research institutions. NeSI is an initiative led by Canterbury University, Auckland University, NIWA and AgResearch, and is supported by Otago University and Landcare Research. It will coordinate access to high performance computing facilities at these institutions, the BeSTGRID eScience data fabric, research tools and applications, and community engagement.

What do you think?

So, what are the implications of these trends for eResearch and data reuse and management in New Zealand? What are the differences between Crown Research Institutes (CRIs) and Universities in this regard? What do we need in place institutionally and nationally to support improved data management, and uptake of data-intensive scientific methods?

Please share your thoughts by commenting on this post, or by email. Feedback will be incorporated into the Framework for eResearch Adoption project.


  1. Directions and Priorities for Government ICT
  2. Hey, Tony; Tansley, Stewart; Tolle, Kristin, eds. “The Fourth Paradigm: Data-Intensive Scientific Discovery.” Microsoft Research, Redmond, Wash: 2009.
  3. The Global Earth Observation System of Systems (GEOSS) Geoportal
  4. New Zealand Social Science Data Service
  5. Scientists Seeking NSF Funding Will Soon Be Required to Submit Data Management Plans
  6. New Zealand Government Open Access and Licensing (NZGOAL) framework
  7. National Environmental Standard for Plantation Forestry in New Zealand
  8. GreenVictoria
  9. SwimWhere
  10. NeSI
Submitted by julian.carver on


international links & Science Magazine - Special Issue on Data


Julian, great to see this enquiring overview of the scientific data context in NZ.

Whatever happens in New Zealand, it would be important to build and retain connections with international initiatives. I, like many others, see it as critical that NZ researchers and research institutions establish common yet localised approaches shared with our international research data management colleagues. You've highlighted the DCC; JISC too have an excellent programme (the current iteration just completing in March), with the key findings and the connections they've made to the international community being discussed at a workshop in late March.

Interesting to see the special issue of Science on data, out right now.

I'll be pondering the questions you pose further, and will aim to come back here to provide more comment at some point.

cheers, Nick Jones

Which CRIs have most to offer, and what are the challenges

Hi Julian

Some top-of-the-head comments from me as an individual, rather than a representative of Plant & Food Research. I think that you've done a great job of laying out the trends driving us toward open data, and the benefit of an open data approach. My comments are focussed more on which areas of science have the most to offer in terms of opening their data, and on some issues that need to be addressed on the way.

One can think of the CRIs on a kind of spectrum from industry-aligned (AgResearch, Plant & Food Research, Scion, IRL, parts of NIWA, parts of GNS) through to natural science (Landcare, parts of NIWA, parts of GNS) with ESR having a strong service component.

Much of the data collected by industry-aligned research organisations comes from controlled experiments (eg a trial of the response of a winter wheat crop in Canterbury to various formulations of the same kind of fertiliser). The range of treatments can be huge, and in order to make sense of the data one needs to understand the context of the experiment. The scope for, and ease of, reusing such data is quite limited compared to, for example, the reuse of measurements on a proton (as I understand it, protons are pretty much the same everywhere in the universe). With a fertiliser trial, in order to combine the data with other trial data, one would need to somehow take account of the wheat variety, the soil type in that particular field, the weather over the season, the timing of the fertiliser applications, what was growing in the field the year before ...

I think that there is more scope for data reuse on the natural science side, and that this is reflected in the new Environmental Data policy.

This is reflected in a kind of data flow within CRIs from the "natural science" end of the spectrum to the more industry-aligned research end. So, for example, Plant & Food Research is a consumer of NIWA climate data and of Landcare Research soils data. There isn't much flow in the opposite direction.

The industry-aligned CRIs are encouraged to commercialise research, and in some cases might be reluctant to jeopardise, for example, patents by releasing experimental results.

In some cases there is also likely to be reluctance from the commercial partners/funders of industry-aligned CRIs to the release of data because of IP issues, and the potential for use or misuse of the data by competitors.

You've mentioned the need for attribution of data, and of treating the publication of data in a similar way to the peer reviewing of journal papers, and I think that this is an important area. Those going through the long process of securing research funding, and then perhaps getting up at all hours and in all weathers to collect data, and then taking the time to ensure the quality of the data would want to be sure that their efforts and initiative were adequately acknowledged by re-users.

Measurement standards become important when data is combined and re-used. It can be a real challenge to ensure that even "straightforward" data, such as air temperatures from different sources, can reasonably be compared.

The focus of the Public Records Act on maintaining "provenance" and tracking any changes to the data is helpful from the point of view of ensuring data quality.

Lastly is the question of how data publication should be funded. Web services such as the NZ Organisms Register and Web applications such as the Ocean Survey 20/20 portal are costly to develop and require ongoing maintenance and support. Science funders (and science providers) are typically reluctant to fund ongoing maintenance and support activity because they see their pot of money eventually being consumed by business as usual activities. Some discussion of possible business models for ongoing support of open data services would be warranted in my view.

Regards Matthew Laurenson