Friday, 24 January 2014

Cite what you use

Poster for "Screen as Landscape" Exhibition at the Stanley Picker Gallery, Kingston University, December 2011, and "Screen as Landscape", Dan Hays, PhD thesis, 2012, Kingston University. (From

What do you cite, the dataset or the data article? Or should it be both?

There's a lot of confusion about this, mainly stemming from the whole notion that the data article is a direct citation substitute (or proxy) for the dataset it describes (which, to be fair, it can be). Citing both the dataset and the data article gives rise to accusations of "salami-slicing" and double accounting, whereas citing only the dataset could be seen as taking citations away from the article (or vice versa).

The way I see it is that the dataset, and its corresponding data article are two separate, though related, things. It's time for another analogy!

Consider the Fine Arts. If you were wanting to do a PhD in the Fine Arts, you would need to produce a Work of Art (or possibly several, depending on your chosen form of Art) and you would also need to write a thesis about that Work, providing information about how you created the Work, why you did it the way you did, the context and reasoning behind it, and all that sort of important background information.

Now, if I was wanting to write a critique of your Work of Art, I could do so without ever reading your thesis. In that case it'd be entirely appropriate to cite the Work, but I'd have no need to cite the thesis.

If, on the other hand, I was wanting to write an article about the history and practice of a technique you used to create your Work of Art, and I read and used information from your thesis to support my argument, then I'd definitely need to cite your thesis. (I could chose to cite the Work of Art as well, in passing, but might not need to. After all, anyone wanting to find out about the Work can read the thesis I've cited and get to it that way. And I'm not actually discussing the Work itself.)

With me so far?

Ok, so the Work of Art is the dataset, and the thesis is the data article. It starts getting a bit murky in the data world, because often there isn't enough contextualising information in the dataset itself to allow it to be used/critiqued/whatever easily, and that information is captured and published in the data article (which is one of the main reasons for having data articles - to make that sort of important information and metadata available!).

Historically, in many disciplines (in the dark days before data citation), important datasets were cited by proxy - i.e. the authors of the dataset published a paper about it, and then others cited that paper as a stand-in for the dataset. The citation counts for that paper then became the citation counts for the dataset, which had the virtue of being simple enough and a valid work-around to the problem of the lack of a common practice of data citation.

But now we have the situation where a dataset can be cited independently from its data article. And we have the following situations:
  1. Both dataset and article are cited. Data creator is very happy (two citations!). Data publisher is happy (citation!). Data article publisher is happy (citation!). Reader of the citing article may not be happy (potential accusations of double counting of citations and salami-slicing...) Publisher of citing article might not be happy (not enough space in reference lists, potentially two citations that look like they're for the same thing).
  2. Only the dataset is cited.  Data creator is happy (citation!). Data publisher is happy (citation!). Data article publisher is not happy (though might be mollified by the fact that there are links from  the dataset back to the data article). Reader of the citing article may not be happy (may want more info about the dataset that is only provided in the data article). Publisher of citing article is probably not bothered one way or another (depending on journal policies for citing data).
  3. Only the data article is cited. Data creator is happy (citation!). Data publisher is not so happy (but probably resigned, no citation, but link from data article to dataset, so not as bad as old days with no link to the data at all). Data article publisher is happy (citation!). Reader of the citing article may not be happy (may want a direct link to the data). Publisher of citing article is content (situation normal).
It's a balancing act!

Honestly? I do think cultural norms will evolve within the different research domains over time. We should be prepared to give them a gentle nudge if they look like they're going completely haywire, but for the most part I'd say let them grow.

And for me, when asked "But what should I cite?!?", my default answer will be "Cite what you use".

  • If you use a data article to understand and make use of a dataset, cite them both.
  • If you use a dataset, but don't use any of the extra information given in the data article, cite the dataset.
  • If you use a data article, but don't do anything with the dataset, cite the article.

Cite what you use!

Tuesday, 26 November 2013

Citing dynamic data

Beautiful animation from - go check out the larger versions!
Citing dynamic data is a topic that just keeps coming around, every time data citation is mentioned, usually as a way of pointing out that data citation is not like text citation, because people can and will want to get their hands on the most recent data in a dataset, and simply don't want to wait for a frozen version. There's also confusion about what makes something citeable or not (see "DOI != citeable" by Carl Boettiger), tied into the whole DOI for citation thing and the requirements for a dataset to have a DOI assigned.

As I've said many times before, citing data is all about using it to support the scholarly record. We have other methods of linking data to papers, or data to other data - that's what the Internet is all about after all. I maintain that citation is all about getting back to exactly the thing the author of the article was talking about when they put the citation in the article.

If you’re citing something so you can simply point to it ("the most recent version of the dataset can be found at blah"), and aren’t really that worried about whether it’s changed since the pointer was made, then you can do that easily with a citation with a http link in it. That way you go automatically to the most recent version of the dataset. 

If however, you need to be sure that the user gets back to exactly the same data each time, because that's the data you used in your analysis, then that data becomes part of the scientific record and needs to be frozen. How you get back to that exact version is up to the dataset archive – it can be done via frozen snapshots, or by backing out changes on a database – whatever works.

(For a more in-depth discussion of frozen data versus active data, see the previous post here.)

Even if you’re using a DOI to get to a frozen version of the dataset, there should still be a link on the DOI landing page which points to the most recent version of the dataset. So if a scientist wants to get to the most recent version of the dataset, but only has a DOI to a frozen version, then they can still get to the most recent version in a couple of hops.

It is (theoretically) possible to record all changes to a dynamic dataset and guarantee (audited by someone) that, if needed, the data repository could back out all those changes to recreate the original dataset as it was on a certain date. However, the BODC did a few tests a while back, and discovered that backing out the changes made to their database would take weeks, depending on how long ago the requested version was. (This is a technical issue though, so I’m sure people are already working on solving it.)

You could instigate a system where citation is simply a unique reference based on a database identifier and the timestamp of extraction – as is already used in some cases. The main issue with this (in my opinion) is convincing users and journal editors that this is an appropriate way to cite the data. It’s been done in some fields (e.g. accession numbers) but hasn’t really gained world-wide traction. I know from our own experience at BADC that telling people to cite our data using our own (permanent) URLs didn’t get anywhere because people don’t trust urls. (To be fair, we were telling them this at a time when data citation was even less used than it is now, so that might change here and now.)

Frozen data is definitely the easiest and safest type to cite. But, we regularly manage datasets that are continually being updated, and for a long term time series, we can't afford to wait the twenty odd years for the series to be finished and frozen before we start using and citing it.

So we've got a few work-arounds.
  1. For the long running dataset, we break the dataset up into appropriate chunks, and assign DOIs to those chunks. These chunks are generally defined on a time basis (yearly, monthly), and this works particularly well for datasets where new data is continually being appended, but the old data isn't being changed. (Using a dead-tree analogy, the chunks are volumes of the same work which is released in a series and at different times - think of the novels in the series A Song of Ice and Fire for example - now that's a long running dataset which is still being updated*)
    1. A related method is the ONS (Office for National Statistics) model, where the database is cited with a DOI and an access date, on the understanding that the database is only changed by appending new data to it – hence any data from before the access date will not have changed between now and when the citation was made. As soon as old data is updated, the database is frozen and archived, and a new DOI is assigned to the new version. 
  2. For datasets where the data is continually being updated, and old measurements are being changed as well as new measurements appended, we take snapshots of the dataset at a given point in time, and those snapshots are frozen, and have the DOIs assigned to them. This is effectively what we do when we have a changing dataset, but the dataset is subject to version control. It also parallels the system used for software releases.
It's worth noting that we're not the only group thinking about these issues, there's a lot of clever people out there trying to come up with solutions. The key thing there is bringing them all together so that the different solutions can work together rather than against each other - one of the key tenets of the RDA.  

DOIs aren’t suitable for everything, and citing dynamic data is a problem that we have to get our heads around. It may well turn out that citing frozen datasets is a special case, in which case we’ll need to come up with another solution. But we need to get people used to citing data first!

So, in summary – if all you want from a citation is a link to the current version of the data: use a url. If you want to get back to the exact version of the data used in the paper so that you can check and verify their results: that’s when you need a DOI.

* Pushing the analogy a bit further - I'd bet there's hordes of "Game of Thrones" fans out there who'd dearly love to get their hands on the active version of the next book in "A Song of Ice and Fire", but I'm pretty sure George R.R. Martin would prefer they didn't!

Frozen Datasets are Useful, So are Active ones

Frozen Raspberry are Tasty
"Frozen Raspberry are Tasty" by

I think there's a crucial distinction we need to draw between data that is "active" or "working" and data that is  "finished" or "frozen"*, i.e. suitable for publication/consumption by others.

There's a lot of parallels that can be drawn between writing a novel (or a text book, or an article, or a blog post) and creating a dataset. When I sit down to write a blog post, sometimes I start at the beginning and write until I reach the end. In which case, if I was doing it interactively, then it might be useful for a reader to watch me type, and get access to the post as I'm adding to it. I'm not that disciplined a writer however - I reread and rewrite things. I go back, I shuffle text around, and to be honest, it'd get very confusing for someone watching the whole process. (Not to mention the fact that I don't really want people to watch while I'm writing - it'd feel a bit uncomfortable and odd.)

In fact, this post has just been created as a separate entity in its own right - it was originally part of the next post on citing dynamic data  - so if the reader wanted to cite the above paragraph and was only accessing the working draft of the dynamic data post, well, when they came back to the dynamic data post, that paragraph wouldn't be there anymore.

It's only when the blog post is what I consider to be finished, and is spell-checked and proofread, that I hit the publish button.

Now, sometimes I write collaboratively. I recently put in a grant proposal which involved coordinating people from all around the world, and I wrote the proposal text openly on a Google document with the help of a lot of other people. That text was constantly in flux, with additions and changes being made all the time. But it was only finally nailed down and finished just before I hit the submit button and sent it in to the funders. Now that that's done, the text is frozen, and is the official version of record, as (if it gets funded) it will become part of the official project documentation.

The process of creating a dataset can be a lot like that. Researchers understandably want to check their data before making it available to other people, in case of others finding errors. They work collaboratively in group workspaces, where a dataset may be changed lots very quickly, without proper version control, and that's ok. There has to be a process that says "this dataset is now suitable for use by other people and is a version of record" - i.e. hitting the submit, or the publish button.

But at the same time, creating datasets can be more like writing a multi-volume epic than a blog post. They take time, and need to be released in stages (or versions, or volumes, if you'd prefer). But each of those volumes/versions is a "finished" thing in its own right.

I'm a firm believer that if you cite something, you're using it to support your argument. In that case, any reader who reads your argument needs to be able to get to the thing you've used to support it. If that thing doesn't exist anymore, or has changed since you cited it, then your argument immediately falls flat. And that is why it's dangerous to cite active datasets. If you're using data to support your argument, that data needs to be part of the record, and it needs to be frozen. Yes, it can be superseded, or flat out wrong, but the data still has to be there.

You don't have this issue when citing articles - an article is always frozen before it is published. The closest analogy in the text world for active data is things like wiki pages, but they're generally not accepted in scholarly publishing to be suitable citation sources, because they change.

But if you're not looking to use data to support your argument, you're just doing the equivalent of saying "the dataset can be found at blah", well, that's when a link to a working dataset might be more appropriate.

My main point here is that you need to know whether the dataset is active or frozen before you link/cite it, as that can determine how you do the linking/citing. The user of the link/citation needs to know whether the dataset is active or not as well.

In the text world, a reader can tell from the citation (usually the publisher info) whether the cited text is active or frozen. For example, a paper from the Journal of Really Important Stuff (probably linked with a DOI), will be frozen, whereas a Wikipedia page (linked with a URL) won't be. For datasets, the publishers are likely to be the same (the host repository) whether the data is frozen or not - hence ideally we need a method of determining the "frozen-ness" of the data from the citation string text.

In the NERC data centres, it's easy. If the text after the "Please cite this dataset as:" bit on the dataset catalogue page has a DOI in it, then the dataset is frozen, and won't be changed. If it's got a URL, the dataset is still active. Users can still cite it, but the caveat there is that it will change over time.

We'll always have active datasets and we'll want to link to them (and potentially even freeze bits of them to cite). We (and others) are still trying to figure out the best ways to do this, and we haven't figured it out completely yet, but we're getting there! Stay tuned for the next blog post, all about citing dynamic (i.e. active) data.

In the meantime, when you're thinking of citing data, just take a moment to think about whether it's active or not, and how that will affect your citing method. Active versus frozen is an important distinction!

* I love analogies and terminology. Even in this situation, calling something frozen implies that you can de-frost it and refreeze it (but once that's done, is it still the same thing?) More to ponder...

Thursday, 14 November 2013

Presentations, presentations, presentations...

Scruffy Duck helps me prepare my slides before LCPD13, Malta
Long time, no post and all that - but I'm still here!

The past few months have been a bit busy, what with the RDA Second Plenary, the DataCite Summer Meeting, and the CODATA and Force 11 Task Groups on Data Citation meetings in Washington DC, followed by Linking and Contextualising Publications and Datasets, in Malta, and a quick side trip to CERN for the ODIN codesprint and first year conference. (My slides from the presentations at the DataCite, LCPD and ODIN meetings are all up on their respective sites.)

On top of that I also managed to decide it'd be a good idea to apply for a COST Action on data publication. Thankfully 48 other people from 25 different countries decided that it'd be a good idea too, and the proposal got submitted last Friday (and now we wait...) Oh, and I put a few papers in for the International Digital Curation Conference being held in San Francisco in February next year.

Anyway, they're all my excuse for not having blogged for a while, despite the list I've been building up of things to blog about. This post is really by way of an update, and also to break the dry spell. Normal service (or whatever passes for it 'round these parts) will be resumed shortly.

And just to make it interesting, a couple of my presentations this year were videoed. So, you can hear me present about the CODATA TG on data citation's report "Out of Cite, Out of Mind" here. And the lecture I gave on data management for the OpenAIRE workshop May 28, Ghent Belgium can be found here.

Friday, 6 September 2013

My Story Collider story - now available for all your listening needs

Way back last year, I was lucky/brave/foolhardy enough to take part in a Story Collider event where I stood on stage in front of a microphone and told a story about my life in science*.

And here is that very recording! With many thanks to the fine folk at the Story Collider for agreeing to let me post it on my blog.

*This was right in the middle of my three month long missing voice period, so I sound a bit croaky.

Monday, 12 August 2013

How to review a dataset: a couple of case studies

"Same graph as last year, but now I have an additional dot"

As part of the PREPARDE project, I've been doing some thinking recently about how exactly one would go about peer-reviewing data. So far, the project (and friends and other interested parties) have come up with some general principles, which are still being discussed and will be published soon. 

Being more of a pragmatic and experimental bent myself, I thought I'd try to actually review some publicly accessible datasets out there and see what I could learn from the process. Standard disclaimers: with a sample size of 2, and an admittedly biased way of choosing what datasets to review, this is not going to be statistically valid!

I'm also bypassing a bit of the review process that would probably be done by the journal's editorial assistant, asking important questions like: 
  • Does the dataset have a permanent identifier? 
  • Does it have a landing page (or README file or similar) with additional information/metadata, which allows you to determine that this is indeed the dataset you're looking for?
  • Is it in an accredited/trusted repository?*
  • Is the dataset accessible? If not, are the terms and conditions for access clearly defined?
If the answer to any of those questions is no, then the editorial assistant should just bounce the dataset back to the author without even sending it to scientific review, as the poor scientific reviewer will have no chance of either accessing the data, or understanding it.

In my opinion, the main purpose of peer-review of data is to check for obvious mistakes and determine if the dataset (or article) is of value to the scientific community. I also err on the side of pragmatism - for most things, quality is assessed over the long term by how much the thing is used. Data's no different. So, for the most part, the purpose of the scientific peer review is to determine if there's enough information with the data to allow it to be reused.

Dataset 1: Institute of Meteorology and Geophysics (2013): Air temperature and precipitation time series from weather station Obergurgl, 1953-1959. University of Innsbruck, doi:10.1594/PANGAEA.806618,

I found this dataset by going to and typing "precipitation" into their search box, and then looking at the search results until I found a title that I liked the sound of and thought I'd have the domain expertise to review. (Told you the process was biased!)

Then I started poking around an asking myself a few questions:
  • Are the access terms an conditions appropriate? 
    • Open access and downloadable with a click of a button, so yes. It also clearly stated that the license for the data is CC-BY 3.0
  • Is the format of the data acceptable? 
    • You can download the dataset as tab-delimited text in a wide variety of standards that you can choose from a drop down menu. You can also view the first 2,000 rows in a nicely formatted html table on the webpage.
  • Does the format conform to community standards?
    • I'm used to stuff in netCDF, but I suspect tab delimited text is more generic.
  • Can I open the files and view the data? (If not, reject straight away)
    • I can view the first 2,000 lines on the webpage. Downloading the file was no problem, but the .tab extension confused my computer. I tried opening it in notepad first (which looked terrible) but then quickly figured out that I could open the file in Excel and it would format it nicely for me.
  • Is the metadata appropriate? Does it accurately describe the data?
    • Yes. I can't spot any glaring errors, and short of going to the measurement site itself and measuring, I have to trust that the latitude and longitude are correct, but that's to be expected.
  • Are there unexplained/non-standard acronyms in the dataset title/metadata?
    • No. I like the way parameter DATE/TIME is linked out to a description of the format that it follows.
  • Is the data calibrated? If so, is the calibration supplied?
    • No mention of callibration, but these are old measurements from the 1950s, so I'm not surprised.
  • Is information/metadata given about how/why the dataset was collected? (This may be found in publications associated with the dataset)
  • Are the variable names clear and unambiguous, and defined (with their units)?
    • Yes, in a Parameter(s) table on the landing page. I'm not sure why they decided to call temperature "TTT", but it's easy enough to figure out, given the units are given next to the variable name. 
    • It also took me a minute to figure out what the 7-21h and 21-7h meant in the table next to the Precipitation, sum - but looking at the date/time of the measurements made me realise that it meant the precipitation was summed over the time between 7am and 9pm for one measurement and 9pm and 7am (the following morning) for the other - an artefact of when the measurements were actually taken.
    • The metadata gives the height above ground of the sensor, but doesn't give the height above mean sea level for the measurements station - you have to go to the dataset collection page to find that out. It does say that location is in the Central Alps though.
  • Is there enough information provided so that data can be reused by another researcher?
    • Yes, I think so
  • Is the data of value to the scientific community? 
    • Yes, it's measurement data that can't be repeated.
  • Does the data have obvious mistakes? 
    • Not that I can see. The precision of the precipitation measurement is 0.1mm, which is small, but plausible. 
  • Does the data stay within expected ranges?
    • Yes. I can't spot any negative rainrates, or temperatures in the minus values in the middle of summer.
  • If the dataset contains multiple data variables, is it clear how they relate to each other?
    • Yes - the temperature and precipitation measurements are related according to the time of the measurement. 
Verdict: Accept. I'm pretty sure I'd be able to use this data, if I ever needed precipitation measurements from the 1950s in the Austrian Alps.

I found this dataset in a versy similar way as before, i.e. by going to and typing "precipitation" into their search box, ticking the box in the advance search to restrict to datasets, and then picking the first appropriate sounding title.

At first glance, I haven't a clue what this dataset is about. The data itself is easily viewed on the webpage as a table with some location codes (explained a bit in the description - I think they're in the USA?) and some figures for annual rainfall and coefficients of variation.

Going through my questions:
  • Are the access terms an conditions appropriate? 
    •  Don't know. It's obviously open, but I don't know what license it's under (if any)
  • Is the format of the data acceptable? 
    •  I can easily download it as an Excel spreadsheet (make comments as you'd like regarding Excel and proprietary formats and backwards compatibility...)
  • Does the format conform to community standards?
    •  No, but I can open them easily, so it's not too bad
  • Can I open the files and view the data? (If not, reject straight away)
    •  Yes
  • Is the metadata appropriate? Does it accurately describe the data?
    •  No
  • Are there unexplained/non-standard acronyms in the dataset title/metadata?
    •  Yes
  • Is the data calibrated? If so, is the calibration supplied?
    •  No idea
  • Is information/metadata given about how/why the dataset was collected? (This may be found in publications associated with the dataset)
  • Are the variable names clear and unambiguous, and defined (with their units)?
    •  No
  • Is there enough information provided so that data can be reused by another researcher?
    •  No
  • Is the data of value to the scientific community? 
    •  I have no idea
  • Does the data have obvious mistakes? 
    •  No idea
  • Does the data stay within expected ranges?
    •  Well, there's no minus rainfall - other than that, who knows?
  • If the dataset contains multiple data variables, is it clear how they relate to each other?
    •  Not clear
Verdict: Reject. On the figshare site, there simply isn't enough metadata to review the dataset, or even figure out what the data is. Yes, "Annual rainfall (mm)" is clear enough, but that makes me ask: for what year? Or is is averaged? Or what?

But! Looking at the paper which is linked to the dataset reveals an awful lot more information. This dataset is the figures behind table 1 of the paper, shared in a way that makes them easier to use in other work (which I approve of). The paper also has a paragraph about the precipitation data in the table, describing what it is and how it was created. 

It turns out the main purpose of this dataset was to study the plant resource use by populations of desert tortoises (Gopherus agassizii) across a precipitation gradient in the Sonoran Desert of Arizona, USA. And, from the look of the paper (very much outside my field!) it did the job it was supposed to, and might be of use for other people studying animals in that region. My main concern is if that dataset ever becomes disconnected from that paper, then the dataset as it is now would be pretty much worthless.

Here's a picture of a desert tortoise:
Desert Tortoise (Gopherus agassizii) in Rainbow Basin near Barstow, California. Photograph taken by Mark A. Wilson (Department of Geology, The College of Wooster). Public Domain


So, what have I learned from this little experiment?
  1. There's an awful lot of metadata and information in a journal article that relates to a dataset (which is good) and linking the two is vital if you're not going to duplicate information from the paper in the same location as the dataset. BUT! if the link between the dataset and the paper is broken, you've lost all the information about the dataset, rendering it useless.
  2. Having standard (and possibly mandatory) metadata fields which have to be filled out before the dataset is stored in the repository means that you've got a far better chance of being able to understand the dataset without having to look elsewhere for information (that might be spread across multiple publications). The down side of this is that it increases the effort needed to deposit the data in the repository, duplicates metadata and may increase the chances of error (when metadata with the dataset is different from that in the publication).
  3. I picked a pair of fairly easy datasets to review, and it took me about 3 hours (admittedly, there was a large proportion of that which was devoted to writing this post). 
  4. Having a list of questions to answer does help very much with the data review process. The questions above are ones I've come up with myself, based on my knowledge of datasets and also of observational measurements. They'll not be applicable for every scientific domain, so I think they're only really guidelines. But I'd be surprised if there weren't some common questions there.
  5. Data review probably isn't as tricky as people are worried. Besides, there's always the option of rejecting stuff out of hand, if, for example, you can't open the downloaded data file. It's the dataset authors' responsibility (with some help from the data repository) to make the dataset usable and understandable if they want it to be published.
  6. Searching for standard terms like "precipitation" in data repositories can return some really strange results.
  7. Desert tortoises are cute!
I'd very much like to thank the authors who's datasets I've reviewed (assuming they ever see this). They put their data out there, open to everyone, and I'm profoundly grateful! Even in the case where I'd reject the dataset as not being suitable to publish in a data journal, I still think the authors did the right thing in making it available, seeing as it's an essential part of another published article.
* Believe me, we've had a lot of discussions about what exactly it means to be an accredited/trusted repository. I'll be blogging about it later.

Monday, 17 June 2013

NFDP13: Plenary Panel Two: Where do we want to go?

This is my final post about the Now and Future of Data Publishing symposium and is a write-up of my speaking notes from the last plenary panel of the day.

As before, I didn't have any slides, but used the above xkcd picture as a backdrop, because I thought it summed things up nicely!

My topic was: "Data and society - how can we ensure future political discussions are evidence led?"

"I work for the British Atmospheric Data Centre, and we happen to be one of the data nodes hosting the data from the 5th Climate Model Intercomparison Project (CMIP5). What this means is that we're hosting and managing about 2 Petabytes worth of climate model output, which will feed into the next Intergovernmental Panel on Climate Change's Assessment Report and will be used national and local governments to set policy given future projections of climate change.

But if we attempted to show politicians the raw data from these model runs, they'd probably need to go and have a quiet lie down in a darkened room somewhere. The raw data is just too much and too complicated, for anyone other than the experts. That's why we need to provide tools and services. But we also need to keep the raw data so the outputs of those tools and services can be verified.

Communication is difficult. It's hard enough to cross scientific domains, let alone the scientist/non-scientist divide. A repositories, we collect metadata bout the datasets in our archives, but this metadata is often far to specific and specialised for a member of the general public or a science journalist to understand. Data papers allow users to read the story of the dataset and find out details of how and why it was made, while putting it into context. And data papers are a lot easier for humans to read that an xml catalogue page.

Data publication can help us with transparency and trust. Politicians can't be scientific experts - they need to be political experts. So they need to rely on advisors who are scientists or science journalists for that advice - and preferably more than one advisor.

Making researchers' data open means that it can be checked by others. Publishing data (in to formal data journal sense) means that it'll be peer-reviewed, which (in theory at least) will cut down on fraud. It's harder to fake a dataset than a graph - I know this personally, because I spent my PhD trying to simulate radar measurements of rain fields, with limited success!)

With data publishing, researchers can publish negative results. The dataset is what it is and can be published even if it doesn't support a hypothesis - helpful when it comes to avoiding going down the wrong research track.

As for what we, personally can do? I'd say: lead by example. If you're a researcher, be open with your data (if you can. Not all data should be open, for very good reason, for example if it's health data and personal details are involved). If you're and editor, reviewer, or funder, simply ask the question: "where's the data?"

And everyone: pick on your MP. Query the statistics reported by them (and the press), ask for evidence. Remember, the Freedom of Information Act is your friend.

And never forget, 87.3% of statistics are made up on the spot!"


After a question from the audience, I did need to make clear that when you're pestering people about where their data is, be nice about it! Don't buttonhole people at parties or back them into corners. Instead of yelling "where's your data?!?" ask questions like: "Your work sounds really interesting. I'd love to know more, do you make your data available anywhere?" "Did you hear about this new data publication thing? Yeah, it means you can publish your data in a trusted repository and get a paper out of it to show the promotion committee." Things like that.

If you're talking the talk, don't forget to walk the walk.