About the project

The MaRDI-Gross project is funded by JISC, as part of its Managing Research Data Programme 2011-13.

‘Big Science’ – that is, facilities science with large data volumes and multi-national investments – handles its data differently from other disciplines.  Largely because of the data volumes, the data management systems are typically and rationally bespoke, but this means that the planning for data management and preservation (DMP) must be bespoke, too.  This in turn means that these disciplines cannot benefit from the considerable effort (much from JISC) going into providing technical and software support for DMP planning, and a recent JISC-funded study of data at this scale (MRD-GW, see below) suggested that the most useful prescription that funders could make was to require projects to digest and profile the OAIS standard.

This is much more reasonable than it might at first appear, because of the extent of the technical resources available to such projects, but even so, the process can be assisted by providing a methodological ‘toolkit’, including overviews, case-studies and costing models.  The MaRDI-Gross project will deliver this toolkit, targeted primarily at projects’ engineering managers, but intending also to help funders collaborate on DMP plans which satisfy the requirements imposed on them.

Background

One of the priorities of the JISC MRD programme is to “improv[e] practice in research data management planning”, and as such it is part of a coordinated approach to the academic, technical and political concerns about data management and preservation (DMP) planning.  There appears to be a rough paradigm in the DMP industry within which the central concerns are about the usability of repositories, the challenges of persuading researchers to deposit their data in repositories, how best to manage the citation of data in repositories, with a side-order of worries about how those researchers may best receive professional credit for the data they have lodged in the repositories.  These are all important concerns, and various organisations fund research in how to address them: primarily, in the UK, JISC’s MRD programme and the various arms of the DCC.

Within this paradigm, however, there appears to be a rather simple conceptual model of what it is that researcher-users actually do to create the data.  Researchers (i) obtain grants, which (ii) they use to generate data which (iii) they curate and then (iv) share either as datasets or linked to publications.  Much of the interest in research information systems, within this area, presumes a rather simple relationship between (i), (ii) and (iv), and much of the DMP effort appears to be concerned with persuading researchers to do step (iii) better, possibly with suitable institutional assistance, cajoling or prescription.  The DCC data lifecycle model can very naturally be read with this paradigm in mind.

We do not suggest that this model is wrong (and it is of course necessary to address simple cases before complicated ones), but we do believe that it is incomplete.  The large-scale physical sciences – ‘big science’ – have decades of experience with data management and sharing, at scale, incorporating a data management workflow which is different from this paradigmatic one under each of its four headings.  This incompleteness suggests firstly that the DMP solutions created under this paradigm, when applied to other disciplines, may not be as generally applicable as expected; and secondly (and more positively) that there are data-management problems outside those automatically considered by that paradigm, which are nonetheless well-understood, and for which practical solutions already exist.

The recent project “Managing Research Data – Gravitational Waves” (MRD-GW, funded under the JISC MRD programme; see that project’s final report) discussed the way in which big science manages a set of problems which are significantly and interestingly different from the paradigm described above. Preservation policy and practice in big science deals (i) with large volumes of data, (ii) in large (100s to 1000s) collaborations, with (iii) technically sophisticated users and computing support.  Of these features, the data volume is the least significant in the present context, since it is ‘only’ a technical problem; the other two features change the game.

For our purposes, ‘Big Science’ projects tend to share many features which distinguish them from the way that experimental science has worked in the past.  These features include being large collaborations, with large volumes of complicated and instrument-specific data (1–10 PB/year, with exabyte/year rates anticipated in the next decade), and elaborate internal organisations.

The key feature, from the point of view of this project, is that this is facilities science – there is a core facility, with multinational funders, a multi-decadal existence, and a conceptual and administrative separation between the elaborately-engineered resource and the research scientists.

Particle physics has the longest experience with this model of doing science, but gravitational wave physics (LIGO) and radio astronomy (SKA) have or will have similar collaboration sizes and data volumes. Other areas of astronomy have long cultural experience with internationally shared facilities, though working at a different scale; and nuclear physics and structural sciences are moving towards this model of working.  There is a reason why STFC is the Science and Technology Facilities Council.

This scale of working produces some simplifications: (i) It is well resourced – data management and preservation is not the responsibility of quarter-time junior researchers, but a key concern of the project’s engineering management. (ii) There is a collaborative ethos, which has data sharing (though initially only within the collaboration) at the core of it.  Data, once acquired, goes directly into the archive, and is retrieved from there for processing by researchers.

However, the scale also produces a variety of complications:

  • There will be multiple funders in multiple countries, imposing various, and sometimes conflicting, requirements on data management and dissemination.
  • The multiplicity of funders often means that no one funder can reasonably dictate terms.
  • Experiments and their datasets are governed by networks of MoUs and SLAs, and in-collaboration decision-making processes which, however intricate the process, are fundamentally consensus-based.
  • The IP on the data is often complex.

The complexity of the funding landscape, combined with the fact that the data management systems will (because of the data volume) usually be bespoke, means that it is essentially infeasible to produce any reusable repository, or to produce useful step-by-step guidance or training.

The MRD-GW project studied the data-management culture of science at this scale.  That report’s recommendations, bearing in mind the scale and available technical expertise within big-science projects, included (Sect. 2.6, p25):

  • 2. Funders should simply require that a project develop a high-level DMP as a suitable profile of the OAIS specification.
  • 3. Funders should support projects in creating per-project OAIS profiles which are appropriate to the project and meet funders’ strategic priorities and responsibilities.

That is, the ‘elevator pitch’ version of these recommendations is this: funders of such projects can most effectively and appropriately discharge their data-preservation responsibilities by saying to large projects “here’s a copy of the OAIS spec; get on with it!”

In many disciplines, this would be dreadful advice, but facilities-scale science projects have the financial and engineering resources, and technical expertise, to produce bespoke DMP plans for bespoke data-management systems.  What must be avoided, however, is pointless reinvention, and so there is an outstanding need for a fast-track to an optimal solution.  This is where funder support can be helpful, in supporting the relevant technical personnel by connecting them to high-level DMP best practice. The MaRDI-Gross project will build on these recommendations by describing what such funder-provided support would consist of.  It aims to build capacity within the greater sphere of large-scale science DMP planning, by giving the planners a rapid boost towards relevant disciplinary best practice.

It is a consequence of the above features that the infrastructure here (perhaps ‘superstructure’ is a better term) exists in a multinational context, and a fortiori works on a bigger scale than a single institution. LHC data is distributed from the single Tier-0 site (CERN) to national Tier-1 sites (RAL in the UK) and thence to institutional or regional Tier-2 sites, which currently comprise around 16 HEFCE-funded institutions, Lancaster among them. This model is becoming common in other big-data projects.

Thus, it is within institutions that the actual data use takes place, and where the data management experience is ultimately located and preserved; but the DMP development must be undertaken in a multi-institutional context if it is to be at all meaningful, and this is why the MaRDI-Gross project must have a project focus, led by a site with key UK experience of GridPP data preservation challenges. The multi-institutional nature of big-science data management arguably implements the KRDS report’s vision of seamlessly interoperating institutional repositories. Although this has happened for top-down reasons rather than bottom-up ones, there are lessons to be learned, and experience to be gained within single institutions, which can use this intellectual infrastructure to support the leadership of similar DMP projects in other institutional contexts.
