Friday, 13 June 2014

Link roundup

In no particular order, some interesting stuff that has been cluttering up my browser tabs...


"Preamble
Sound, reproducible scholarship rests upon a foundation of robust, accessible data.  For this to be so in practice as well as theory, data must be accorded due importance in the practice of scholarship and in the enduring scholarly record.  In other words, data should be considered legitimate, citable products of research.  Data citation, like the citation of other evidence and sources, is good research practice and is part of the scholarly ecosystem supporting data reuse.

In support of this assertion, and to encourage good practice, we offer a set of guiding principles for data within scholarly literature, another dataset, or any other research object."

I strongly recommend that everyone with an interest in data citation endorse these principles, either on an individual basis, or on behalf of their organisation!

For Comment: The Role of Publishers in Access to Data

"Call to Action
We envision a future information ecosystem where research data is considered an integral part of scholarly communications. We propose a new metaphor to characterize our vision: a social contract. This contract is an agreement amongst all stakeholders based on shared, governing principles: data should be preserved, discoverable, measured, and integrated into evaluation processes; and data sharing is a fundamental practice. Adherence to this social contract will entail dramatic changes to existing workflows; technologies; and social norms for all the members of the research ecosystem."


"Scientists can be reluctant to share data because of the need to publish journal articles and receive recognition. But what if the data sets were actually a better way of getting credit for your work? Chris Belter measured the impact of a few openly accessible data sets and compared to journal articles in his field. His results provide hard evidence that the production, archival, and sharing of data may actually be a more effective way to contribute to the advancement of scientific knowledge."

DOIs and the danger of data “quality”

"...NERC state “by assigning a DOI the [Environmental Data Centre] are giving it a ‘data center stamp of approval’”. Effectively they see a DOI name (or by implication any other form of Persistent Uniform Resource Locator (PURL)) as a quality check-mark in addition to its role as a reference to an object. Except the DOI system isn’t designed to suggest the “quality goes in before the name goes on”. Just to remind myself, I quickly looked at the International DOI Foundation handbook and it doesn’t mention data quality. Identification, yes. Resolution, yes. Management, yes. Quality, no."

(With a response from me)

Citation Rates Highlight Uphill Battle for Women in Research Careers

"One of the most important and institutionalized forms of science communication is the peer-reviewed journal article. These articles are essential to disseminating information among researchers in specific fields of study, and the extent to which those journal articles are cited by researchers in later articles is of enormous professional importance to researchers – particularly researchers who work in academic settings. But it appears that many researchers face an uphill battle when it comes to getting citations and related professional benefits. Specifically, researchers who are women."

What’s the Point of Academic Publishing?

"In December 2013, Nobel Prize-winning physicist Peter Higgs made a startling announcement. “Today I wouldn't get an academic job,” he told The Guardian. “It's as simple as that. I don't think I would be regarded as productive enough.”

Higgs noted that quantity, not quality, is the metric by which success in the sciences is measured. Unlike in 1964, when he was hired, scientists are now pressured to churn out as many papers as possible in order to retain their jobs. Had he not been nominated for the Nobel, Higgs says, he would have been fired. His scientific discovery was made possible by his era’s relatively lax publishing norms, which left him time to think, dream, and discover."

Scientists losing data at a rapid rate

"In their parents' attic, in boxes in the garage, or stored on now-defunct floppy disks — these are just some of the inaccessible places in which scientists have admitted to keeping their old research data. Such practices mean that data are being lost to science at a rapid rate, a study has now found.

The authors of the study, which is published today in Current Biology, looked for the data behind 516 ecology papers published between 1991 and 2011. The researchers selected studies that involved measuring characteristics associated with the size and form of plants and animals, something that has been done in the same way for decades. By contacting the authors of the papers, they found that, whereas data for almost all studies published just two years ago were still accessible, the chance of them being so fell by 17% per year. Availability dropped to as little as 20% for research from the early 1990s."

Guidelines / Recommendations for Citing Data

An excellent set of resources from the Virtual Solar Observatory.

The Robot Army of Good Enough

"Pretty much any organization of any size has certain themes, beliefs and outlooks baked into them. Some of them might be obvious from the outside. Others are so inherent that the members might not even notice they’re completely steeped in it.

At the Internet Archive, there’s a philosophy set about access and acceptance of materials and presentation of said materials that’s pretty inherent throughout the engineering and the website. Paraphrased, in my own words, it’s this:

  • Always provide the original.
  • Never ask why a user wants something.
  • Now is better than tomorrow.
  • We can hold it.
  • Many inexpensively is better than one or none luxuriously.
  • Never send a person where a machine can go.
  • Enjoy yourself."


"Being the largest land predator, the fearsome and enigmatic Polar Bear is seen by many as a powerful symbol to highlight of the threats to the environment through global warming. With a new publication on the Polar Bear genome out last week in Cell, they surprisingly are also an impressive example of how far data publication and citation has come in the last few years, and help debunk many of the negative arguments about the early release of datasets in this manner."

How Bitcoin’s Technology Could Revolutionize Intellectual Property Rights

"The bitcoin block chain is well known for its use as a ledger for digital currency transactions, but it has the potential for other, more radical uses too – uses that are only now beginning to be explored.

The online service Proof of Existence is an example of how the power of this new technology can have applications far beyond the world of finance, in this case, giving a glimpse of how bitcoin could one day have a substantial impact in the fields of intellectual property and law.

Although in its initial stages, Proof of Existence can be used to demonstrate document ownership without revealing the information it contains, and to provide proof that a document was authored at a particular time."
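The trick underneath this is a hash commitment: you publish a cryptographic digest of the document rather than the document itself. Here's a minimal sketch of the digest step in Python (the anchoring of that digest in a bitcoin transaction, which is the part Proof of Existence handles, is a separate step and isn't shown):

import hashlib

def document_digest(path):
    """Compute the SHA-256 digest of a file. Publishing this digest
    commits to the document's exact contents at a point in time,
    without revealing any of those contents."""
    sha256 = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            sha256.update(chunk)
    return sha256.hexdigest()

# Later, anyone holding the original file can recompute the digest and
# compare it with the one recorded in the block chain; a match shows the
# document existed, unchanged, when the record was made.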

LIBRE

"is a free open peer review platform developed by a growing community of volunteer research scholars who envision a new era of openness and transparency in scholarly evaluation and communication. Join us and let’s liberate research together!"

Frontiers for Young Minds

"Frontiers in Neuroscience for Young Minds is a scientific journal that includes young people (from 8 to 15) in the review of articles. This has the double benefit of bringing kids into the world of scientific research – many of them for the first time – and offering active scientists a platform for reaching out to the broadest of all publics.

All articles in Frontiers for Young Minds will be reviewed and approved for publication by young people themselves. Established neuroscientists will mentor these young Review Editors and help them review the manuscript and focus their queries to authors. To avoid overburdening the young Review Editors, revised manuscripts will in turn be reviewed by one of the stellar Associate Editors of Frontiers for Young Minds."



Friday, 2 May 2014

Who owns the data?

Its mine MINE ITS ALL MINE!!!
http://cheezburger.com/3724637184

So, who does own a dataset, anyway?

Is it the researcher who sets up the instrument and makes the measurement?
Is it the company that built the instrument?
Is it the organisation that operates the instrument (from whom the researcher has bought instrument time)?
Is it the researcher's institution, who employs the researcher to make measurements?
Is it the institution's data repository, who publishes the data, or restricts access to it?
Is it the funder whose grant pays the institution for the researcher to make the measurement?
Is it the government, who provides the funder with the budget to hand out grants?
Is it the taxpayer, whose taxes fund the government?*

Like so many things in life, the answers to these questions are "well, it depends..."

Ownership is a social construct. I own a car because I have a document in my filing cabinet, giving details of the car's make and model, which says that I do. This document is also registered in a national database (the DVLA) recording that the car specified is mine. The car itself sits outside my house, and I have the key, which means I can use it, and other people can't without my express permission. If the car gets stolen, it's uniquely registered, so there's a good chance that (barring an encounter with a quick respray and fake plates) it'll still be identifiable as mine.

I also have many books. These are mine, because I bought them. But they're not uniquely identified - most don't even have my name written on them, and I don't have a register of them, not even one independently verified by an external body. If a desperate book thief were to come and nick one of my books, well, I'd be very unlikely to get exactly that same volume back again. Yet I still own them, and feel possessive about them.

[Edited to add: my better half points out that if someone steals a book from me, they take away my ability to read that book. If someone steals a digital object, like a dataset, they're stealing a copy, and unless they destroy the original, then it's still available for use by the original owner.]

And that feeling of possession is key to how people react to data. The person who feels most strongly about the data is the researcher who created it (part of the IKEA effect, which leads people to value things they assemble, customize or build themselves more highly than premade, finished goods**). But an owner of something can have no feelings for it at all, as witnessed by all those paintings locked in a vault somewhere until their value improves.

That's why I think ownership is not a helpful thing to think about when it comes to data. Ownership focuses on possession - who has the data now. With it being so easy to make copies of datasets, many people can be "owners" - i.e. have the dataset in their possession. Ownership then becomes about who holds the "one, true dataset", and who can then assert rights based on it***.

As for the responsibilities of owners, well, I may be having a failure of imagination here, but I can't really think of any. I am perfectly within my rights to burn my book without asking anyone's permission (though causing a nuisance to the neighbours with the smoke wouldn't be good). And if someone nicks my car and goes joyriding, I'm not responsible for the damage they do. If I own a dataset, I can delete it, change it, whatever. Other people might want to use it, but tough. I own it. I get to decide what to do with it.

It's better, then, to think more about the other roles involved in data, the roles that have responsibilities as well as rights. Roles like the data creator (the researcher who made the measurement), who is responsible for the contents of the dataset and the supporting information around it, and deserves credit for their work. Roles like the data publisher (the data repository and/or library), who is responsible for releasing the data to defined subsets of the population. Roles like the data licenser, the party responsible for determining which other parts of the population are allowed access to the dataset, and under what conditions. Roles like the data archiver, who decides whether a dataset should still be kept or should be deleted as it's no longer useful.

These roles don't have to be carried out by individuals; institutions are capable of doing them as well. For example, the Unseen University could act as the licenser, corporate author and publisher of data that it holds. Corporate authorship is particularly useful for datasets with large numbers of creators, as it enables credit while keeping the number of names in the citation string to meaningful levels (see as an example the list of volunteers for Galaxy Zoo at http://authors.galaxyzoo.org/ - note that the URL for the list names them all as authors!)

So, when discussing data, especially with the people who have put weeks, months and years of their life into the datasets they've created, it's a good idea to think about more than ownership of the data. Think and talk about those other roles and responsibilities. That way it becomes less about asserting rights and possessiveness, and more about the data itself.

And, in the future, as data becomes more open, and the mechanisms exist for giving the data creators (and their employers, funders and support staff) the credit they deserve, then hopefully the issue of ownership won't be so much of a problem.

________________________________
* This happens to be my personal opinion. The results of publicly funded research should be made available for the benefit of all. In other words, open, unless there's a damn good reason not to.
** The proper link to the paper publishing this study is http://dx.doi.org/10.1016/j.jcps.2011.08.002, but it's paywalled.
*** I'm sure I'm missing out all sort of technical, legal stuff here...

Friday, 24 January 2014

Cite what you use



Poster for "Screen as Landscape" Exhibition at the Stanley Picker Gallery, Kingston University, December 2011, and "Screen as Landscape", Dan Hays, PhD thesis, 2012, Kingston University. (From http://danhays.org)

What do you cite, the dataset or the data article? Or should it be both?

There's a lot of confusion about this, mainly stemming from the whole notion that the data article is a direct citation substitute (or proxy) for the dataset it describes (which, to be fair, it can be). Citing both the dataset and the data article gives rise to accusations of "salami-slicing" and double counting, whereas citing only the dataset could be seen as taking citations away from the article (or vice versa).

The way I see it is that the dataset, and its corresponding data article are two separate, though related, things. It's time for another analogy!

Consider the Fine Arts. If you wanted to do a PhD in the Fine Arts, you would need to produce a Work of Art (or possibly several, depending on your chosen form of Art), and you would also need to write a thesis about that Work, providing information about how you created the Work, why you did it the way you did, the context and reasoning behind it, and all that sort of important background information.

Now, if I wanted to write a critique of your Work of Art, I could do so without ever reading your thesis. In that case it'd be entirely appropriate to cite the Work, but I'd have no need to cite the thesis.

If, on the other hand, I wanted to write an article about the history and practice of a technique you used to create your Work of Art, and I read and used information from your thesis to support my argument, then I'd definitely need to cite your thesis. (I could choose to cite the Work of Art as well, in passing, but might not need to. After all, anyone wanting to find out about the Work can read the thesis I've cited and get to it that way. And I'm not actually discussing the Work itself.)

With me so far?

Ok, so the Work of Art is the dataset, and the thesis is the data article. It starts getting a bit murky in the data world, because often there isn't enough contextualising information in the dataset itself to allow it to be used/critiqued/whatever easily, and that information is captured and published in the data article (which is one of the main reasons for having data articles - to make that sort of important information and metadata available!).

Historically, in many disciplines (in the dark days before data citation), important datasets were cited by proxy - i.e. the authors of the dataset published a paper about it, and then others cited that paper as a stand-in for the dataset. The citation counts for that paper then became the citation counts for the dataset, which had the virtue of being simple, and was a valid work-around for the lack of a common practice of data citation.

But now we have the situation where a dataset can be cited independently from its data article. And we have the following situations:
  1. Both dataset and article are cited. Data creator is very happy (two citations!). Data publisher is happy (citation!). Data article publisher is happy (citation!). Reader of the citing article may not be happy (potential accusations of double counting of citations and salami-slicing...). Publisher of citing article might not be happy (not enough space in reference lists, potentially two citations that look like they're for the same thing).
  2. Only the dataset is cited. Data creator is happy (citation!). Data publisher is happy (citation!). Data article publisher is not happy (though might be mollified by the fact that there are links from the dataset back to the data article). Reader of the citing article may not be happy (may want more info about the dataset that is only provided in the data article). Publisher of citing article is probably not bothered one way or another (depending on journal policies for citing data).
  3. Only the data article is cited. Data creator is happy (citation!). Data publisher is not so happy (but probably resigned; no citation, but there's a link from the data article to the dataset, so it's not as bad as the old days with no link to the data at all). Data article publisher is happy (citation!). Reader of the citing article may not be happy (may want a direct link to the data). Publisher of citing article is content (situation normal).
It's a balancing act!

Honestly? I do think cultural norms will evolve within the different research domains over time. We should be prepared to give them a gentle nudge if they look like they're going completely haywire, but for the most part I'd say let them grow.

And for me, when asked "But what should I cite?!?", my default answer will be "Cite what you use".

  • If you use a data article to understand and make use of a dataset, cite them both.
  • If you use a dataset, but don't use any of the extra information given in the data article, cite the dataset.
  • If you use a data article, but don't do anything with the dataset, cite the article.


Cite what you use!

Tuesday, 26 November 2013

Citing dynamic data


Beautiful animation from http://uxblog.idvsolutions.com/2013/07/a-breathing-earth.html - go check out the larger versions!

Citing dynamic data is a topic that keeps coming around whenever data citation is mentioned, usually as a way of pointing out that data citation is not like text citation: people can and will want to get their hands on the most recent data in a dataset, and simply don't want to wait for a frozen version. There's also confusion about what makes something citeable or not (see "DOI != citeable" by Carl Boettiger), tied into the whole DOI-for-citation thing and the requirements for a dataset to have a DOI assigned.

As I've said many times before, citing data is all about using it to support the scholarly record. We have other methods of linking data to papers, or data to other data - that's what the Internet is all about after all. I maintain that citation is all about getting back to exactly the thing the author of the article was talking about when they put the citation in the article.

If you’re citing something so you can simply point to it ("the most recent version of the dataset can be found at blah"), and aren’t really that worried about whether it’s changed since the pointer was made, then you can do that easily with a citation containing an HTTP link. That way you go automatically to the most recent version of the dataset.

If however, you need to be sure that the user gets back to exactly the same data each time, because that's the data you used in your analysis, then that data becomes part of the scientific record and needs to be frozen. How you get back to that exact version is up to the dataset archive – it can be done via frozen snapshots, or by backing out changes on a database – whatever works.

(For a more in-depth discussion of frozen data versus active data, see the previous post here.)

Even if you’re using a DOI to get to a frozen version of the dataset, there should still be a link on the DOI landing page which points to the most recent version of the dataset. So if a scientist wants to get to the most recent version of the dataset, but only has a DOI to a frozen version, then they can still get to the most recent version in a couple of hops.

It is (theoretically) possible to record all changes to a dynamic dataset and guarantee (audited by someone) that, if needed, the data repository could back out all those changes to recreate the original dataset as it was on a certain date. However, the BODC did a few tests a while back, and discovered that backing out the changes made to their database would take weeks, depending on how long ago the requested version was. (This is a technical issue though, so I’m sure people are already working on solving it.)
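For illustration only: if every change is kept as an append-only, time-ordered log, then recovering the state of the data at a given date is conceptually just a replay of the log up to that date - simple in principle, but, as the BODC tests suggest, potentially very slow at scale. A toy sketch (all the names here are made up):

from datetime import datetime

def state_at(change_log, when):
    """Rebuild a dataset's state at time `when` from an append-only
    change log of (timestamp, key, value) records, oldest first.
    Later records overwrite earlier ones, so replaying only the
    records up to `when` reproduces the dataset as it was then."""
    state = {}
    for timestamp, key, value in change_log:
        if timestamp > when:
            break  # log is time-ordered; everything later is too new
        state[key] = value
    return state

log = [
    (datetime(2012, 1, 5), "station_42/temperature", 13.1),
    (datetime(2013, 6, 2), "station_42/temperature", 13.4),  # later correction
]
print(state_at(log, datetime(2012, 12, 31)))  # recovers the original 2012 value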

You could instigate a system where citation is simply a unique reference based on a database identifier and the timestamp of extraction – as is already used in some cases. The main issue with this (in my opinion) is convincing users and journal editors that this is an appropriate way to cite the data. It’s been done in some fields (e.g. accession numbers) but hasn’t really gained world-wide traction. I know from our own experience at BADC that telling people to cite our data using our own (permanent) URLs didn’t get anywhere, because people don’t trust URLs. (To be fair, we were telling them this at a time when data citation was even less used than it is now, so that might be changing.)

Frozen data is definitely the easiest and safest type to cite. But we regularly manage datasets that are continually being updated, and for a long-term time series, we can't afford to wait the twenty-odd years for the series to be finished and frozen before we start using and citing it.

So we've got a few work-arounds.
  1. For the long-running dataset, we break the dataset up into appropriate chunks, and assign DOIs to those chunks. These chunks are generally defined on a time basis (yearly, monthly), and this works particularly well for datasets where new data is continually being appended, but the old data isn't being changed. (Using a dead-tree analogy, the chunks are volumes of the same work, released in a series at different times - think of the novels in A Song of Ice and Fire, for example - now that's a long-running dataset which is still being updated*)
    1. A related method is the ONS (Office for National Statistics) model, where the database is cited with a DOI and an access date, on the understanding that the database is only changed by appending new data to it – hence any data from before the access date will not have changed between now and when the citation was made. As soon as old data is updated, the database is frozen and archived, and a new DOI is assigned to the new version. 
  2. For datasets where the data is continually being updated, and old measurements are being changed as well as new measurements appended, we take snapshots of the dataset at a given point in time, and those snapshots are frozen and have the DOIs assigned to them. This is effectively what we do when we have a changing dataset, but the dataset is subject to version control. It also parallels the system used for software releases. (Both schemes are sketched just below.)
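To make the two work-arounds concrete, here's a minimal sketch of how a repository might construct identifiers under each scheme. The DOI prefix and naming patterns are invented for illustration - they're not what NERC or DataCite actually prescribe:

def chunk_doi(dataset, year):
    """Work-around 1: an append-only series is split into yearly
    chunks, each frozen and given its own DOI.
    (The 10.5555 prefix and dataset names are hypothetical.)"""
    return f"10.5555/{dataset}.{year}"

def snapshot_doi(dataset, version):
    """Work-around 2: a changing dataset is version-controlled, and
    each frozen snapshot gets its own DOI, like a software release."""
    return f"10.5555/{dataset}.v{version}"

print(chunk_doi("rain-gauge-series", 2013))  # 10.5555/rain-gauge-series.2013
print(snapshot_doi("reanalysis", "2.1"))     # 10.5555/reanalysis.v2.1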
It's worth noting that we're not the only group thinking about these issues; there are a lot of clever people out there trying to come up with solutions. The key thing is bringing them all together so that the different solutions can work together rather than against each other - one of the key tenets of the RDA.

DOIs aren’t suitable for everything, and citing dynamic data is a problem that we have to get our heads around. It may well turn out that citing frozen datasets is a special case, in which case we’ll need to come up with another solution. But we need to get people used to citing data first!

So, in summary – if all you want from a citation is a link to the current version of the data: use a url. If you want to get back to the exact version of the data used in the paper so that you can check and verify their results: that’s when you need a DOI.

_________________________________________
* Pushing the analogy a bit further - I'd bet there's hordes of "Game of Thrones" fans out there who'd dearly love to get their hands on the active version of the next book in "A Song of Ice and Fire", but I'm pretty sure George R.R. Martin would prefer they didn't!

Frozen Datasets are Useful, So are Active ones

Frozen Raspberry are Tasty
"Frozen Raspberry are Tasty" by epSos.de

I think there's a crucial distinction we need to draw between data that is "active" or "working" and data that is "finished" or "frozen"*, i.e. suitable for publication/consumption by others.

There are a lot of parallels between writing a novel (or a text book, or an article, or a blog post) and creating a dataset. When I sit down to write a blog post, sometimes I start at the beginning and write until I reach the end. In which case, if I were doing it interactively, it might be useful for a reader to watch me type, and get access to the post as I'm adding to it. I'm not that disciplined a writer, however - I reread and rewrite things. I go back, I shuffle text around, and to be honest, it'd get very confusing for someone watching the whole process. (Not to mention the fact that I don't really want people to watch while I'm writing - it'd feel a bit uncomfortable and odd.)

In fact, this post has just been created as a separate entity in its own right - it was originally part of the next post on citing dynamic data - so if the reader wanted to cite the above paragraph and was only accessing the working draft of the dynamic data post, well, when they came back to the dynamic data post, that paragraph wouldn't be there anymore.

It's only when the blog post is what I consider to be finished, and is spell-checked and proofread, that I hit the publish button.

Now, sometimes I write collaboratively. I recently put in a grant proposal which involved coordinating people from all around the world, and I wrote the proposal text openly on a Google document with the help of a lot of other people. That text was constantly in flux, with additions and changes being made all the time. But it was only finally nailed down and finished just before I hit the submit button and sent it in to the funders. Now that that's done, the text is frozen, and is the official version of record, as (if it gets funded) it will become part of the official project documentation.

The process of creating a dataset can be a lot like that. Researchers understandably want to check their data before making it available to other people, in case others find errors. They work collaboratively in group workspaces, where a dataset may be changed a lot, very quickly, without proper version control, and that's ok. There has to be a process that says "this dataset is now suitable for use by other people and is a version of record" - i.e. hitting the submit, or the publish, button.

But at the same time, creating datasets can be more like writing a multi-volume epic than a blog post. They take time, and need to be released in stages (or versions, or volumes, if you'd prefer). But each of those volumes/versions is a "finished" thing in its own right.

I'm a firm believer that if you cite something, you're using it to support your argument. In that case, any reader who reads your argument needs to be able to get to the thing you've used to support it. If that thing doesn't exist anymore, or has changed since you cited it, then your argument immediately falls flat. And that is why it's dangerous to cite active datasets. If you're using data to support your argument, that data needs to be part of the record, and it needs to be frozen. Yes, it can be superseded, or flat out wrong, but the data still has to be there.

You don't have this issue when citing articles - an article is always frozen before it is published. The closest analogy in the text world for active data is things like wiki pages, but they're generally not accepted in scholarly publishing to be suitable citation sources, because they change.

But if you're not looking to use data to support your argument, you're just doing the equivalent of saying "the dataset can be found at blah", well, that's when a link to a working dataset might be more appropriate.

My main point here is that you need to know whether the dataset is active or frozen before you link/cite it, as that can determine how you do the linking/citing. The user of the link/citation needs to know whether the dataset is active or not as well.

In the text world, a reader can tell from the citation (usually the publisher info) whether the cited text is active or frozen. For example, a paper from the Journal of Really Important Stuff (probably linked with a DOI) will be frozen, whereas a Wikipedia page (linked with a URL) won't be. For datasets, the publisher is likely to be the same (the host repository) whether the data is frozen or not - hence ideally we need a method of determining the "frozen-ness" of the data from the citation string text.

In the NERC data centres, it's easy. If the text after the "Please cite this dataset as:" bit on the dataset catalogue page has a DOI in it, then the dataset is frozen, and won't be changed. If it's got a URL, the dataset is still active. Users can still cite it, but the caveat there is that it will change over time.
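That convention is easy enough to check mechanically. A rough sketch of the heuristic (the regular expression matches the common 10.prefix/suffix shape of a DOI, and the DOIs and URLs shown are made up; treat it as a guide, not a validator):

import re

DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/\S+")

def looks_frozen(citation_string):
    """The NERC convention described above, as a heuristic: a DOI in
    the recommended citation means the dataset is frozen; a bare URL
    means it's still active and may change under the reader's feet."""
    return bool(DOI_PATTERN.search(citation_string))

print(looks_frozen("... http://dx.doi.org/10.5285/abcd-1234"))       # True: frozen
print(looks_frozen("... http://badc.nerc.ac.uk/data/some-dataset"))  # False: active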

We'll always have active datasets and we'll want to link to them (and potentially even freeze bits of them to cite). We (and others) are still trying to figure out the best ways to do this, and we haven't figured it out completely yet, but we're getting there! Stay tuned for the next blog post, all about citing dynamic (i.e. active) data.

In the meantime, when you're thinking of citing data, just take a moment to think about whether it's active or not, and how that will affect your citing method. Active versus frozen is an important distinction!

____________________________
* I love analogies and terminology. Even in this situation, calling something frozen implies that you can de-frost it and refreeze it (but once that's done, is it still the same thing?) More to ponder...

Thursday, 14 November 2013

Presentations, presentations, presentations...

Scruffy Duck helps me prepare my slides before LCPD13, Malta
Long time, no post and all that - but I'm still here!

The past few months have been a bit busy, what with the RDA Second Plenary, the DataCite Summer Meeting, and the CODATA and Force 11 Task Groups on Data Citation meetings in Washington DC, followed by Linking and Contextualising Publications and Datasets, in Malta, and a quick side trip to CERN for the ODIN codesprint and first year conference. (My slides from the presentations at the DataCite, LCPD and ODIN meetings are all up on their respective sites.)

On top of that I also managed to decide it'd be a good idea to apply for a COST Action on data publication. Thankfully 48 other people from 25 different countries decided that it'd be a good idea too, and the proposal got submitted last Friday (and now we wait...) Oh, and I put a few papers in for the International Digital Curation Conference being held in San Francisco in February next year.

Anyway, they're all my excuse for not having blogged for a while, despite the list I've been building up of things to blog about. This post is really by way of an update, and also to break the dry spell. Normal service (or whatever passes for it 'round these parts) will be resumed shortly.

And just to make it interesting, a couple of my presentations this year were videoed. So, you can hear me present about the CODATA TG on data citation's report "Out of Cite, Out of Mind" here. And the lecture I gave on data management for the OpenAIRE workshop on 28 May in Ghent, Belgium, can be found here.

Friday, 6 September 2013

My Story Collider story - now available for all your listening needs

Way back last year, I was lucky/brave/foolhardy enough to take part in a Story Collider event where I stood on stage in front of a microphone and told a story about my life in science*.

And here is that very recording! With many thanks to the fine folk at the Story Collider for agreeing to let me post it on my blog.


_________________
*This was right in the middle of my three month long missing voice period, so I sound a bit croaky.