Friday, 1 July 2016

COPE Seminar: An Introduction to Publication Ethics, 13th May 2016, Oxford

Old books in my local second hand bookshop

The COPE (Committee on Publication Ethics) Seminar: An Introduction to Publication Ethics was held on Friday 13th May 2016 in Oxford.

Being fairly new to this being-an-editor business, and the workshop being so local, I took the opportunity to go, and found it all really useful, not only from my perspective as someone in charge of a journal, but also from the data management and publication point of view. A lot of the issues raised during the workshop, like attribution, authorship and plagiarism, are just as easily applied to datasets as they are to journal articles.

The workshop was a mixture of talks and discussion sessions, where we were given examples of actual cases that COPE had been told about, and we had to discuss and decide what the best course of action was. Then we were told what the response from the COPE members was in those particular cases - reassuringly we were pretty much in agreement in all cases!

Key notes that I jotted down during the day include:

The main take-home message for me was that COPE have a lot of resources on their website, all free to use.

Data visualisation and the future of academic publishing, Oxford, 10 June 2016

Once again wearing my Editor-in-Chief hat, I was invited to the "Data visualisation and the future of academic publishing" workshop, hosted by the University of Oxford and Oxford University Press on Friday 10th June 2016.

It was a pretty standard workshop format - lots of talks - but there was a wide variety of speakers, coming from a wide spread of backgrounds, which really helped make people think about the issues involved in data visualisation. I particularly enjoyed the interactive demonstrations from the speakers from the BBC and the Financial Times - both saying things that seem really obvious in retrospect, but are worth remembering when doing your own data visualisations (like keep it simple and self-contained, and make sure it tells a story).

For those who are interested, I've copied my (slightly edited) notes from the workshop below. Hopefully they'll make sense!

Richard O’Beirne (Digital Strategy Group, Oxford University Press)

  • What is a figure? A scientific result converted into a collection of pixels
  • Steep growth in "data visualisation" in Web of Science, PubMed
  • Data visualisation in Review: Summary, Canada 2012 
  • Infographics tell a story about datasets
  • Preservation of visualisations is an issue
  • OUP got funding to identify suitable datasets to create visualisations (using 3rd party tools) and embed them in papers

Mark Hahnel (figshare)

  • Consistency of how you get to files on the internet is key
  • Institutional instances of figshare now happening globally e.g. /
  • Making files available on the internet allows the creation of a story
  • How do you get credit? Citation counts? Not being done yet
  • Files on the internet -> context -> visualisation
  • Data FAIRport initiative - to join and support existing communities that try to realise and enable a situation where valuable scientific data is ‘FAIR’ in the sense of being Findable, Accessible, Interoperable and Reusable
  • Hard to make visualisations scale!
  • Open data and APIs make it easier to understand the context behind the stories
  • Whose responsibility is it to look after these data visualisations?
  • Need to make files human and machine readable - add sufficient metadata!
  • Making things FAIR just allows people to build on stuff that has gone before - but it's easy to break if people don't share
  • How to deal with long-tail data? Standardisation...

John Walton (Senior Broadcast Journalist, BBC News)

  • Example of data visualisation of number of civilians killed by month in Syria 
  • Visualisation has to make things clear - the layer of annotation around a dataset is really important
  • Most interactive visualisations are bespoke
  • It's helpful to keep things simple and clear!
  • Explain the facts behind things with data visualisation, but not just to people who like hard numbers - also include human stories
  • Lots of BBC web users are on mobile devices - need to take that into account
  • Big driver for BBC content is sharing on social media - BBC spend time making the content rigorous and collaborating with academia
  • Jihadism: tracking a month of deadly attacks - during the month there were about 600 deaths and ~700 attacks around the world
  • Digest the information for your audience
  • Keep interaction simple - remember different devices are used to access content

Rowan Wilson (Research Technology Specialist, University of Oxford)

  • Creating crosswalks for common types of research data to get it into Blender
  • People aren't that used to navigating around three-dimensional data - example imported into Minecraft (as a sizeable proportion of the population are comfortable with navigating around that environment)
  • Issues with confidentiality and data protection, data ownership, copyright and database rights. Open licences are good for data, but should consider waiving any hard requirement for attribution, as cumbersome attribution lists will put people off using data
  • MeshLab - tool to convert scientific data into Blender format

Felix Krawatzek (Department of Politics and International Relations, University of Oxford)

  • Visualising 150 years of correspondence between the US and Germany
  • Letters (handwritten/typed) need significant resource and time to process them before they can be used 
  • Software produced to systematically correct OCR mistakes
  • Visualise the temporal dynamics of the letters
  • Visualisation of political attitudes 
  • Can correlate geographic data from the corpus with census data
  • Always questions about availability of time or resources
  • Crowdsourcing projects that tend to work are those that appeal to people's sense of wonder, or their human interest. You get more richly annotated data if you can harness the power of crowds.
  • Zooniverse created a byline to give the public credit for their work in Zooniverse projects

Andrea Rota (Technical Lead and Data Scientist, Pattrn)

  • Origin of the platform: the Gaza platform - documenting atrocities of war, humanitarian and environmental crises
    • "improving the global understanding of human evil"
  • Not a data analysis tool - for visualisation and exploration
  • Data in Google Sheets (no setup needed)
  • Web-based editor to submit/approve new event data
  • Information and computational politics - Actor Network Theory - network of human and non-human actors - how to cope with loss
  • Pattrn platform for sharing of knowledge, data, tools and research, not for profit
  • Computational agency - what are we trading in exchange for short-term convenience?
  • "How to protect the future web from its founders' own frailty" Cory Doctorow 2016
  • Issues with private data backends e.g. dependency on cloud proprietary systems
  • Computational capacity - where do we run code? Computation is cheap, managing computation isn't easy

Alan Smith (Data Visualisation Editor, Financial Times)

  • Gave a lovely example of a bad chart published in the Times, and how it should have been presented
  • Visuals need to carry the story
  • Avoid chart junk!
  • Good example of taking an academic chart and reformatting it to make the story clearer
  • Graphics have impact on accompanying copy
  • Opportunity to "start with the chart"
  • Self-contained = good for social media sharing
  • Fewer charts, but better
  • Content should adapt to different platforms
  • The Chart Doctor - monthly column in the FT
  • Visualisation has a grammar and a vocabulary; it needs to be read, like written text

Scott Hale (Data Scientist, Oxford Internet Institute, University of Oxford)

  • Making existing tools easy to use, online interfaces to move from data file to visualisation
  • Key: make it easy
  • Plugin to Gephi to export data as JavaScript plugin for website
  • - compiles straight to JavaScript - write code once - attach tables/plot to HTML element. Interactive environment that can go straight into HTML page

Alejandra Gonzalez-Beltran (Research Lecturer, Oxford e-Research Centre)

  • All about Scientific Data journal
  • Paper on survey about reproducibility - "More than 70% of researchers have tried and failed to reproduce another scientist's experiments, and more than half have failed to reproduce their own experiments."
  • FAIR principles
  • isaexplorer to find and filter data descriptor documents

Philippa Matthews (Honorary Research Fellow, Nuffield Department of Medicine)

  • Work is accessible if you know where to look
  • Lots of researcher profiles in lots of different places - LinkedIn, ResearchFish, ORCID, ...
  • Times for publication are long
  • Spotted minor error with data in a supplementary data file - couldn't correct it
  • Want to be able to share things better - especially entering dialogue with patients and research participants
  • Want to publish a database of HBV epitopes - publish as a peer-reviewed journal article, but journals wary of publishing a live resource
    • my response to this was to query the underlying assumption that a database needs to be published like a paper - again a casualty of the "papers are the only true academic output" meme.
  • Public engagement - dynamic and engaging rather than static images e.g. Tropical medicine sketchbook  

3rd LEARN Workshop, Helsinki, June 2016

Cute bollard at Helsinki airport

The 3rd LEARN (Leaders Activating Research Networks) workshop on Research Data Management, “Make research data management policies work” was held in Helsinki on Tuesday 28th June. I was invited wearing my CODATA hat (as Editor-in-Chief for the Data Science Journal) to give the closing keynote about the Science International Accord "Open Data in a Big Data World".

The problem with doing closing talks is that so much of what I wanted to say had pretty much already been said by someone during the course of the day - sometimes even by me during the breakout sessions! Still, it was a really interesting workshop, with excellent discussion (despite the pall that Brexit cast over the coffee and lunchtime conversation - but that's a topic for another time).

There were three breakout sessions on offer, and the timings meant that you could go to two of them.

I started with Group 3: Making possible and encouraging the reuse of data: incentives needed. This is my day job - taking data in from researchers, making it understandable and reusable, and figuring out ways to give them credit and rewards for doing so. And my group has been doing this for more than 2 decades, so I'm afraid I might have gone off on a bit of a rant. Regardless, we covered a lot, though mainly the old chestnuts of the promotion and tenure system being fixated on publications as the main academic output, the requirements for standards (especially for metadata - acknowledging just how difficult it would be to come up with a universal metadata standard applicable to all research data), and the fact that repositories can control (to a certain extent) the technology, but culture change still needs to happen. Though there were some positives on the culture change - I noted that journals are now pushing DOIs for data, and this has had an impact on people coming to us to get DOIs.

The next breakout group I went to was Group 1: Research Data services planning, implementation and governance. What surprised me in this session (maybe it shouldn't have) was just how far advanced the UK is when it comes to research data management policies and the like, in comparison to other countries. This did mean that my UK colleagues and I got quizzed a fair bit about our experiences, which made sense. I had a bit of a different perspective from most of the other attendees - being a discipline-specific repository means that we can pick and choose what data we take in, unlike institutional repositories, which have to be more general. On being asked about what other services we provide, I did manage to name-drop JASMIN, in the context of a UK infrastructure for data analysis and storage.

I think the key driver in the UK for getting research data management policies working was the Research Councils, and their policies, but also their willingness to stump up the cash to fund the work. A big push on institutional repositories was EPSRC's putting the onus on research institutions to manage EPSRC-funded research data. But the increasing importance of data, and people's increased interest in it, is coming from a wide range of drivers - funders, policies, journals, repositories, etc.

I understand that the talks and notes from the breakouts will be put up on the workshop website, but they're not up as of the time of me writing this. You can find the slides from my talk here.

Friday, 2 October 2015

RDA Plenary 6, DataCite and EPIC, and e-Infrastructures - Paris, September 2015

La Tour Eiffel
Last week was the 6th Plenary of the Research Data Alliance, held in Paris, France. It officially started on the Wednesday, but I was there from the Monday to take advantage of the other co-located events.

DataCite and EPIC -  Persistent Identifiers: Enabling Services for Data Intensive Research

Monday, September 21, 2015

This workshop consisted of a quick-fire selection of presentations (12 of them!) all in the space of one afternoon, covering such topics as busting DOI myths; persistent identifiers other than DOIs; persistent identifiers for people (ORCIDs and ISNI, including a look at Brian May's ISNI account, which links his research with his music); persistent identifiers for use in climate science; the International Geo Sample Number (IGSN) - persistent identifiers for physical samples; the THOR project - all about establishing seamless integration between articles, data, and researchers across the research lifecycle; and Making Data Count - a project to develop data-level metrics.

(I also learned that DOIs are assigned to movies, as part of their supply chain management.)

Questions were collected via Google doc during the course of the workshop, and have all since been answered, which is very helpful! I understand that the slides presented at the workshop will also be collected and made available soon.

e-Infrastructures & RDA for data intensive science

Tuesday, 22 September, 2015

This was a day-long event featuring several parallel streams. Of course, I went to the stream on Research Data infrastructures for Environmental related Societal Challenges, though I had to miss the afternoon session because I needed to be at the RDA co-chairs meeting (providing an update on my Working Group and also discussing important processes, like, what exactly happens when a Working Group finishes?). Thankfully, all the slides presented in that stream are available on the programme page.

Unsurprisingly, a lot of the presentations at this workshop dealt with the importance of e-infrastructures to address the big changes we'll need to face as a result of things like climate change. There was also talk about the importance of de-fragmenting the infrastructure, across geographical, technological and domain boundaries (RDA being a key part of these efforts).

A common thread in this, and the other RDA meetings, was the analogy between data infrastructures and other infrastructures, like those for water or electricity. Users aren't worried about how the water or power gets to them, or how the pipes, agreements and standards are put in place. They just want to be able to get water when they turn the tap, and electricity when they flick a switch. Another interesting point was that there's a false dichotomy between social and technical solutions: what we really have is a technical solution with a social choice attached to it.

Common themes across the presentations were the sheer complexity of the data we're managing now, whether it's from climate science, oceanography or agriculture, and the need to standardise and to fill in the gaps in infrastructure that exist now.

RDA 6th Plenary

Wednesday 23rd to Friday 25th September, 2015

As ever, the RDA plenaries are a glorious festival of data, with many, many parallel streams, and even more interesting people to talk to! It's impossible to capture the whole event, even with my pages of notes.

If I can pick out a few themes though, these are them:

  • Data is important to lots of people, and the RDA is a key part of keeping things going in the right direction.
  • Infrastructures that exist aren't always interoperable - this needs to be changed for the vast quantities of data we'll be getting in the future.
  • The RDA is all about building bridges, connecting people and creating solutions with people, not for them. 
  • Uncertainty is the enemy of investment – shared information reduces uncertainty

Axelle Lemaire, Minister of State for Digital Technology, French Ministry of Economy, Industry and Digital Technology, said that people say that data are the oil of the 21st century, but this isn't such a good comparison – better to compare it to light – the more light gets diffused, the better it is, and the more the curtains are open the more light gets in. She is launching a public consultation on a digital bill she's preparing and is looking for views from people outside of France - the RDA will distribute the information about this consultation at a later date.

It's interesting that the RDA has now matured to the point that several working groups are either finished, or will be finished by the next plenary (though there is still some uncertainty about what "finished" actually means). The 18-month lifespan of the working groups is enough time to build/develop something, but the actual time to get the community to adopt those outputs will be a lot longer. So there was plenty of discussion about what outputs could/should be, and how the adoption phase could be handled. I suspect that, even with all our discussions, no definite solution was found, so we'll have another phase of seeing what the working groups decide to do over the next few months.

This is of particular relevance to me, as my working group on Bibliometrics for Data is due to finish before the next plenary in March. We had a packed meeting room (standing room only!) which was great, and we achieved my main aim for the session, which was to decide what the final group outputs would be, and how to achieve them. Now we have a plan - hopefully the plan will work for us!

A key part of that plan is collecting information about what metrics data repositories already collect - if you are part of a library/repository, please take a look at this spreadsheet and add things we might have missed!

I went to the following working group and Birds of a Feather meetings:

So, that was RDA Plenary 6. Next plenary will be held in Tokyo, Japan from the 1st to the 3rd of March 2016. In the meantime, we've got work to be getting on with!

Friday, 31 July 2015

Just because we can measure something...

What are you trying to tell me? - Day 138, Year 2

So, I recently finished a 100 day challenge, where I gave up chocolate, cake, biscuits, sweets, etc., and attempted to eat more healthily and exercise as often as I could. This was to see if I could keep off the sugar for 100 days, and also in the hope that I'd lose some weight.

At the end of my 100 days, I stood on the bathroom scales, and I'd lost a grand total of... wait for it... 0 lb. Bum.

And my brain being what it is, I instantly thought "well, that was a waste of time, wasn't it? Why did I even bother?"

Then my inner physicist kicked in with: "I like not this metric! Bring me another!" (So I found more metrics about how many km I'd run in the hundred days, and how many personal bests had been achieved, and I felt better.)

But that all got me thinking about metrics, and about how easy it is to doom good work, simply because it doesn't meet expectations with regards to one number. Currently, research stands or falls by its citation count - and we're trying to apply this single metric to even more things.

And that got me thinking. What we want to know is: "how useful is our research?" But an awful lot of metrics come at it from another angle: "what can we measure and what does that mean?"

So, citations. We are counting the number of times a paper (which is a proxy for a large amount of research work) is mentioned in other papers. That is all. We are assuming that those mentions actually mean something (and to be fair, they often do) but what that meaning is, isn't necessarily clear. Is the paper being cited because it's good, or because it's rubbish? Does the citer agree with the paper, or do they refute it? This is the sort of information we don't get when we count how many times a paper has been cited, though there are moves towards quantifying a bit better what a citation actually means. See CiTO, the Citation Typing Ontology, for example.
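To make that concrete, here's a minimal sketch of the difference between a raw citation count and a typed one. The relation labels borrow CiTO property names (supports, disputes, refutes, usesDataFrom), but the citation records themselves and the two helper functions are entirely made up for illustration - this isn't how any real citation index works.

```python
from collections import Counter

# Hypothetical citation records for one paper: (citing work, CiTO-style relation).
# The relation labels follow CiTO property names; the data are invented.
citations = [
    ("paper-A", "cito:supports"),
    ("paper-B", "cito:disputes"),
    ("paper-C", "cito:supports"),
    ("paper-D", "cito:usesDataFrom"),
    ("paper-E", "cito:refutes"),
]


def raw_count(records):
    """The usual metric: how many times the paper was cited."""
    return len(records)


def typed_counts(records):
    """Break the same citations down by what the citer meant."""
    return Counter(relation for _, relation in records)


print("Total citations:", raw_count(citations))
for relation, n in sorted(typed_counts(citations).items()):
    print(f"  {relation}: {n}")
```

The single number treats the paper that refutes the work the same as the two that support it; the typed breakdown at least lets you see the difference.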

Similarly for Twitter, we can count the number of tweets that something gets, but figuring out what that number actually means is the hard part. I've been told that tweets don't correlate with citations, but then that begs the question, is that what we want to use tweet counts for? I'm not sure we do.

We can count citations, tweets, mentions in social media, bookmarks in reference managers, downloads, etc., etc., etc. But are they actually helping us figure out the fundamental question: "how useful is our research?" I don't think they are.

If we take it back to that question, "how useful is my research?" then that makes us rethink things. The question then becomes: "how useful is my research to my scientific community?", or "to industry?", or "to education?". And once we start asking those questions, we can then think of metrics to answer those questions.

It might be the case that for the research community, citation counts are a good indicator of how useful a piece of research is. It's definitely not going to work like that for education or industry! But if those sectors of society are important consumers of research, then we need to figure out how to quantify that usefulness. 

This being just a blog post, I don't have any answers. But maybe, looking at metrics from the point of view of "what we want to measure" rather than simply "what can we measure and what does it mean?" could get us thinking in a different way. 

(Now, if you'll excuse me, I have an important meeting with a piece of chocolate!)

Thursday, 30 April 2015

Data, Metadata and Cake

data cake

I saw this analogy and thought it was a good one - because of course you need to consume the information before it can become knowledge (and because cake - does anyone need another reason?)

And then, thinking about it a bit more, I developed the analogy further:

If we consider that the raw data, straight out of the instrument/wherever, are the raw ingredients, then obviously there's a bit of processing to be done to turn it into something consumable, like this cake.

Recipe photo: Basic plain sponge cake

This dataset/cake looks very nice. Someone's obviously taken care with it, it's nice and level and not burned or anything. But it still looks a bit dry, and would definitely need something to go with it, a nice cup of tea, perhaps.

Now, if we consider adding a layer of metadata/icing around the outside of the dataset/cake...


Doesn't that look so much more appealing? (Or it does to me anyway - you might be someone who doesn't like chocolate, or strawberries, or cream...but the analogy still works for your preferred cake topping!)

Metadata makes your dataset easier to consume, and makes it more appealing too.

Of course, you get good metadata, that adds to the dataset, makes it look gorgeous and yummy and delicious...

And then there's the bad metadata, which, er... doesn't.

And the moral of my analogy? Your dataset might be tasty enough for people to consume without metadata, but adding a bit of metadata can make it even yummier!


Thursday, 26 February 2015

Just why is citation important anyway?

The four capital mistakes of open source

I recently had it hammered home to me just how important citations are in scientific research. This came about as the result of me reviewing a document*.

Me being me, the first thing I did was turn to the back to look at the bibliography**. It was a mess, but I can understand how citation strings get all mucked up. I remember when I was writing my PhD, I had to copy and paste, or even retype, all my citations into the files that were my thesis chapters (files - multiple, because Word couldn't cope with having all the chapters in the one file). Nowadays I have discovered the wonder that is Mendeley, and citations are so much easier to deal with - they even do data citations!

Then I read the document, and at one point I said to myself, "Self, this equation looks a bit funny to me. Oh look, here's the citation for the paper it comes from - let's look at the original source to make sure that there are no copying errors in the equation." So verily, I looked up the cited paper, and yay! It was open and accessible. But could I find the quoted equation in the cited paper? Er, no.

There was another moment, where one of my publications was cited as the source for a particular figure. I looked at the figure, and at my name in the caption next to it, and went and checked the cited document. Again, this figure was not contained in the cited publication.

These were the only examples of mis-citation that I caught, but I did find myself scrawling [citation needed] repeatedly in various places throughout the whole work. And every time I did so, my confidence in the research being presented waned a little bit more.

(Unfortunately, it goes without saying that none of the data presented in this work was cited properly either...)

Yes, all researchers stand on the shoulders of giants, and use work that has been published before to support their arguments. But it's important to not rely on unsupported statements of fact being "stuff everyone knows". Yes, the report might be written for a specialist audience who do indeed know all that, and know the citations you'd use to support the statement, but they're not your only audience. And providing citations demonstrates that you've done your due diligence, and can back up your assertions properly.

At the end of the day, when I read a paper or report, I can't check everything that the author(s) have done, so I have to take a certain amount on trust. This trust can be damaged seriously by some silly little things, like too many typos or unreadable graphs (curves all printed in similar shades of grey), and by some serious things, like mis-citations, or no citations at all.

So, citations. They're not just for helping reproducibility, or assigning credit - they also act as a marker that the author(s) know their background and pay attention to those tricky details that can easily catch you out in science. Honestly - citations are the easy part, but if you don't have the energy to care about them (even though they're annoying) then how can your reader be sure you've applied the same care to the "more important" bits of your research?

* I'm not going to give any names or details about the document, because that's not fair, and not the point of this post.

** Yes, I am a pedant!