Tuesday 13 November 2012

Citing Sensitive Data - workshop report

"Burned" DVD, microwaved to ensure total elimination of private data (image by NightRStar)

On the 29th October, I went to the British Library for a workshop on the topic of managing and citing sensitive data, one of a series of workshops all about data citation.

I won't go into the detail of what was said during the presentations, as all the slides are available online here, and there's a good blog post summarising the workshop here.

I will take the opportunity to reiterate what I said in my previous post: citation doesn't equal open. I will expand on that, though, and say that there need to be extremely good reasons for keeping data closed when public money has funded its collection (reasons along the lines of patient confidentiality or saving endangered species, not "but I need extra time to write a paper!").

After all the presentations, we were split up into groups and made to do some work, it being a workshop and all. First of all, we had to come up with some example scenarios for how to cite data given certain access conditions or embargoes, and then we had to swap these with another group and try to solve them. This turned out to be a lot of fun, though I did somehow manage to wind up in the group that was threatening to fire people left, right and centre if they didn't behave!

The Yellow group were looking at access conditions for a study where different participants had given different levels of consent. The solutions they came up with were: 1) have an umbrella DOI for the whole dataset, with multiple DOIs for the subsets with different access conditions; 2) have a hierarchical DOI; or 3) have an umbrella DOI linking to subsets. The trade-off here was clarity versus nuance, and it was generally agreed that communities in different disciplines would have to decide the best approach. We also can't draw an inference on a subset of the data without taking the whole dataset into account.
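To make the umbrella-plus-subsets idea a bit more concrete, here is a minimal sketch of how the records might point at each other. This is my own illustration, not something produced at the workshop: the DOIs, titles and access labels are invented placeholders, and the `DatasetRecord` class and `link_subsets` function are hypothetical. The only borrowed pieces are the relation names `HasPart` and `IsPartOf`, which the DataCite metadata schema uses for related identifiers.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetRecord:
    """Hypothetical, simplified stand-in for a DOI metadata record."""
    doi: str
    title: str
    access: str  # invented labels, e.g. "open", "restricted", "mixed"
    related: list = field(default_factory=list)  # (relation_type, doi) pairs


def link_subsets(umbrella, subsets):
    """Record the parent/child links in both directions."""
    for subset in subsets:
        umbrella.related.append(("HasPart", subset.doi))
        subset.related.append(("IsPartOf", umbrella.doi))


# Placeholder DOIs, for illustration only.
umbrella = DatasetRecord("10.5072/example.study", "Whole study", "mixed")
subsets = [
    DatasetRecord("10.5072/example.study.open",
                  "Participants who consented to open release", "open"),
    DatasetRecord("10.5072/example.study.restricted",
                  "Participants who consented to restricted use", "restricted"),
]
link_subsets(umbrella, subsets)
```

A citation can then point at the umbrella DOI for clarity, or at one subset DOI for nuance, and a reader can follow the links either way.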

The Red group were looking at embargoed data. First up was "researchers want to gain more research credit". Suggestions included: early deposit, while the embargo is still in play; access by request during the embargo; a DOI minted on deposit; an open landing page in the repository (so people know the data exists, even if they can't access it yet) showing the date the embargo ends; and metadata specified on deposit too.
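As a rough sketch of what such an open landing page might expose, assuming the repository stores an embargo end date alongside the DOI: the `landing_page` function, its field names and the DOI below are all hypothetical, not any real repository's API.

```python
from datetime import date


def landing_page(doi, title, embargo_end, today):
    """Build the public view of a deposit; the page exists even during embargo."""
    under_embargo = today < embargo_end
    return {
        "doi": doi,      # minted on deposit, so it is citable straight away
        "title": title,  # metadata supplied on deposit
        "data_access": "on request" if under_embargo else "open",
        "embargo_end": embargo_end.isoformat(),
    }


# Placeholder DOI and dates, for illustration only.
page = landing_page(
    "10.5072/example.embargoed",
    "Example embargoed dataset",
    embargo_end=date(2013, 6, 1),
    today=date(2012, 11, 13),
)
```

The point is that the DOI, title and embargo end date are always visible, and only the access route changes when the embargo lifts.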

Next the Red group looked at the situation of longitudinal cohort studies which may change and have multi-layered embargoes. Access to variables could be dependent on layers of the dataset, with access to layers potentially increasing in time. The suggestion was to have multiple DOIs for multiple layers, with links between the landing pages to show how the layers fit together.
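The layered approach could be sketched like this, assuming each layer gets its own DOI and release date. Again, the DOIs, dates and helper names (`open_layers`, `sibling_links`) are made up for illustration.

```python
from datetime import date

# One entry per layer of the cohort dataset; DOIs and dates are placeholders.
LAYERS = [
    {"doi": "10.5072/example.cohort.layer1", "release": date(2013, 1, 1)},
    {"doi": "10.5072/example.cohort.layer2", "release": date(2015, 1, 1)},
    {"doi": "10.5072/example.cohort.layer3", "release": date(2020, 1, 1)},
]


def open_layers(layers, today):
    """DOIs of the layers whose embargo has ended by the given date."""
    return [layer["doi"] for layer in layers if layer["release"] <= today]


def sibling_links(layers, doi):
    """DOIs of the other layers, to list on a layer's landing page."""
    return [layer["doi"] for layer in layers if layer["doi"] != doi]


available = open_layers(LAYERS, date(2016, 1, 1))
links = sibling_links(LAYERS, "10.5072/example.cohort.layer1")
```

Each landing page would show its own access status plus the sibling links, so a reader can see how the layers fit together even when some are still closed.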

The Green group also looked at embargoes - specifically the situation where there was retrospective withdrawal of permission for a dataset and the data was embargoed while an investigation took place. (The assumption was that the DOI had already been minted for the dataset.) The suggested action was to retain the same landing page, but add text to it detailing the embargo and the expected date when the investigation would end (compliant with the institution's investigations policy). An option for users to register to be notified when the dataset becomes un-embargoed would be a nice thing to have. When the investigation is complete, update the metadata depending on the results. And, at the beginning of the data collection, make sure that the permissions and data policy are set out clearly!

The Blue group were looking at access criteria, in two cases. The first was "White rhino numbers and GPS tracking information". The suggestions were: assign a DOI to the analysed rather than the raw data, and apply access conditions to the raw data so that user credentials can be verified. The format of the public dataset could be varied, e.g. releasing it as snapshots instead of time series, or delaying the release of the dataset until the death of the tagged rhinos. Some of the rich descriptive data might also be kept back from the DataCite metadata store in order to protect the subjects.

The second scenario the Blue group looked at was animal experiments - medical testing on guinea pigs with photos and survival times. This one was noted as being difficult - though there was agreement that releasing data should be guided by funders and ethics committees. The metadata should not name individuals, and the possibility of embargoing data, or publishing subsets (without photos?) should be investigated.


In the general discussion afterwards it was (quite rightly!) pointed out that it's OK to cite and make available different levels of data (raw/processed), as raw data might well be completely incomprehensible to non-experts. We also had a lot of discussion about those two favourite topics in data citation - granularity and versioning. Happily enough, they'll be the subject of the next workshop, booked for Mon 3rd Dec.
