Friday, 31 July 2015

Just because we can measure something...

What are you trying to tell me? - Day 138, Year 2

So, I recently finished a 100-day challenge, where I gave up chocolate, cake, biscuits, sweets, etc., tried to eat more healthily, and exercised as often as I could. This was to see if I could keep off the sugar for 100 days, and also in the hope that I'd lose some weight.

At the end of my 100 days, I stood on the bathroom scales, and I'd lost a grand total of... wait for it... 0 lb. Bum.

And my brain being what it is, I instantly thought "well, that was a waste of time, wasn't it? Why did I even bother?"

Then my inner physicist kicked in with: "I like not this metric! Bring me another!" (So I found more metrics about how many km I'd run in the hundred days, and how many personal bests had been achieved, and I felt better.)

But that all got me thinking about metrics, and about how easy it is to doom good work, simply because it doesn't meet expectations with regard to one number. Currently, research stands or falls by its citation count - and we're trying to apply this single metric to even more things.

And that got me thinking. What we want to know is: "how useful is our research?" But an awful lot of metrics come at it from another angle: "what can we measure and what does that mean?"

So, citations. We are counting the number of times a paper (which is a proxy for a large amount of research work) is mentioned in other papers. That is all. We are assuming that those mentions actually mean something (and to be fair, they often do) but what that meaning is isn't necessarily clear. Is the paper being cited because it's good, or because it's rubbish? Does the citer agree with the paper, or do they refute it? This is the sort of information we don't get when we count how many times a paper has been cited, though there are moves towards quantifying a bit better what a citation actually means. See CiTO, the Citation Typing Ontology, for example.
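To make the difference concrete, here's a minimal sketch in Python. The list of citing papers is entirely made up for illustration, and the relation names are just borrowed from CiTO-style properties; this isn't how any real citation index works. It only shows that a raw count and a typed tally can tell quite different stories about the same paper.

```python
from collections import Counter

# Hypothetical typed citations of a single paper:
# (citing paper, CiTO-style relation). All values are invented for illustration.
citations = [
    ("paper-A", "cito:agreesWith"),
    ("paper-B", "cito:refutes"),
    ("paper-C", "cito:usesDataFrom"),
    ("paper-D", "cito:refutes"),
]


def raw_count(cites):
    """The usual metric: how many times the paper was mentioned."""
    return len(cites)


def typed_counts(cites):
    """Tally the mentions by what the citing paper says it is doing."""
    return Counter(relation for _, relation in cites)


print("Citation count:", raw_count(citations))
for relation, n in typed_counts(citations).items():
    print(f"  {relation}: {n}")
```

The single number says "cited four times"; the tally says that half of those mentions are refutations, which is a rather different message.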

Similarly for Twitter, we can count the number of tweets that something gets, but figuring out what that number actually means is the hard part. I've been told that tweets don't correlate with citations, but then that raises the question: is that what we want to use tweet counts for? I'm not sure we do.

We can count citations, tweets, mentions in social media, bookmarks in reference managers, downloads, etc., etc., etc. But are they actually helping us figure out the fundamental question: "how useful is our research?" I don't think they are.

If we take it back to that question, "how useful is my research?" then that makes us rethink things. The question then becomes: "how useful is my research to my scientific community?", or "to industry?", or "to education?". And once we start asking those questions, we can then think of metrics to answer those questions.

It might be the case that for the research community, citation counts are a good indicator of how useful a piece of research is. It's definitely not going to work like that for education or industry! But if those sectors of society are important consumers of research, then we need to figure out how to quantify that usefulness. 

This being just a blog post, I don't have any answers. But maybe, looking at metrics from the point of view of "what we want to measure" rather than simply "what can we measure and what does it mean?" could get us thinking in a different way. 

(Now, if you'll excuse me, I have an important meeting with a piece of chocolate!)

Thursday, 30 April 2015

Data, Metadata and Cake

data cake

I saw this analogy and thought it was a good one - because of course you need to consume the information before it can become knowledge (and because cake - does anyone need another reason?)

And then, thinking about it a bit more, I developed the analogy further:

If we consider that the raw data, straight out of the instrument/wherever is the raw ingredients, then obviously there's a bit of processing to be done to turn it into something consumable, like this cake.

Recipe photo: Basic plain sponge cake

This dataset/cake looks very nice. Someone's obviously taken care with it, it's nice and level and not burned or anything. But it still looks a bit dry, and would definitely need something to go with it, a nice cup of tea, perhaps.

Now, if we consider adding a layer of metadata/icing around the outside of the dataset/cake...


Doesn't that look so much more appealing? (Or it does to me anyway - you might be someone who doesn't like chocolate, or strawberries, or cream...but the analogy still works for your preferred cake topping!)

Metadata makes your dataset easier to consume, and makes it more appealing too.

Of course, you get good metadata, that adds to the dataset, makes it look gorgeous and yummy and delicious...

And then there's the bad metadata, which, er... doesn't.

And the moral of my analogy? Your dataset might be tasty enough for people to consume without metadata, but adding a bit of metadata can make it even yummier!


Thursday, 26 February 2015

Just why is citation important anyway?

The four capital mistakes of open source by, on Flickr

I recently had it hammered home to me just how important citations are in scientific research. This came about as the result of me reviewing a document*.

Me being me, the first thing I did was turn to the back to look at the bibliography**. It was a mess, but I can understand how citation strings get all mucked up. I remember when I was writing my PhD, I had to copy and paste, or even retype, all my citations into the files that were my thesis chapters (files - multiple, because Word couldn't cope with having all the chapters in the one file). Nowadays I have discovered the wonder that is Mendeley, and citations are so much easier to deal with - they even do data citations!

Then I read the document, and at one point I said to myself, "Self, this equation looks a bit funny to me. Oh look, here's the citation for the paper it comes from - let's look at the original source to make sure that there are no copying errors in the equation." So verily, I looked up the cited paper, and yay! It was open and accessible. But could I find the quoted equation in the cited paper? Er, no.

There was another moment, where one of my publications was cited as the source for a particular figure. I looked at the figure, and at my name in the caption next to it, and went and checked the cited document. Again, this figure was not contained in the cited publication.

These were the only examples of mis-citation that I caught, but I did find myself scrawling [citation needed] repeatedly in various places throughout the whole work. And every time I did so, my confidence in the research being presented waned a little bit more.

(Unfortunately, it goes without saying that none of the data presented in this work was cited properly either...)

Yes, all researchers stand on the shoulders of giants, and use work that has been published before to support their arguments. But it's important to not rely on unsupported statements of fact being "stuff everyone knows". Yes, the report might be written for a specialist audience who do indeed know all that, and know the citations you'd use to support the statement, but they're not your only audience. And providing citations demonstrates that you've done your due diligence, and can back up your assertions properly.

At the end of the day, when I read a paper or report, I can't check everything that the author(s) have done, so I have to take a certain amount on trust. This trust can be damaged seriously by some silly little things, like too many typos or unreadable graphs (curves all printed in similar shades of grey), and by some serious things, like mis-citations, or no citations at all.

So, citations. They're not just for helping reproducibility, or assigning credit - they also act as a marker that the author(s) knows their background and pays attention to those tricky details that can easily catch you out in science. Honestly - citations are the easy part, but if you don't have the energy to care about them (even though they're annoying) then how can your reader be sure you've applied the same care to the "more important" bits of your research?

* I'm not going to give any names or details about the document, because that's not fair, and not the point of this post.

** Yes, I am a pedant!

My biography

Dr Sarah Callaghan is a senior scientific researcher and project manager for the British Atmospheric Data Centre, part of the Centre for Environmental Data Archival (CEDA), at STFC Rutherford Appleton, UK. CEDA also incorporates the IPCC Data Distribution Centre, and the NERC Earth Observation Data Centre, and with the STFC Scientific Computing Department, hosts the JASMIN super-data-computer.

She currently project manages several large scale projects including the EU FP7 project CLIPC and internal JASMIN development and operations. She is Communications Manager for the NERC Data Operations Group - working with members of the other NERC data centres. She is also a co-chair of the CODATA-ICSTI Task Group on Data Citation, a co-chair of the RDA/WDS Working Group on Publishing Data Bibliometrics, and an associate editor for the scientific journals Atmospheric Science Letters and Geoscience Data Journal. She has experience of both creating and managing large datasets, and so understands well the frustrations that scientists can experience as a result of dealing with data!

Her publication list can be found here.


Monday, 1 December 2014

2nd Data Management Workshop, University of Cologne, 28-29 November 2014

Cologne cathedral at night (and in the rain)

I was very honoured to be invited as a guest speaker at the 2nd Data Management Workshop, held at the University of Cologne on the 28-29 November 2014. 

It was a very interesting workshop, with many excellent national and international speakers. What was particularly good was its focus on interactions between the attendees - the coffee and lunch breaks were particularly long, which gave everyone the chance to really look at the many posters that had been submitted to the workshop, and talk to the people who were presenting them. The workshop proceedings will also be published as a special issue on data management in ISPRS International Journal of Geo-Information - I'm expecting further details of that to be on the workshop website in due course.

I took about 8 pages of hand-scribbled notes from the talks, so I won't be inflicting them all on you. Instead I'll just pull out the highlights that jumped out at me. The talks themselves were videoed, and will be made available on-line too.

The workshop opened with a pair of presentations from Stefan Winkler-Nees and Brit Redohl, both from the German Research Foundation (DFG), discussing the mechanisms in Germany for funding data management activities. They seemed very keen to receive more applications for data management funding!

Kevin Ashley (Digital Curation Centre) was next, giving an overview of the landscape of data management - highlighting the DCC guidance documents and Jisc's Research Data Spring, as well as the need for good research data management to root out cases of fraud, and aid data reuse. A key quote I jotted down was "Often your data tells stories that your publications do not."

Arnulf Christl (Metaspatial) gave an amusing and informative talk about open source software and what we can learn from it when it comes to open data. He made the very valid point that scientific data should be clearly licensed, as this allows attribution and credit to be given to the creators. He also showed the following video, which everyone enjoyed!

Tomi Kauppinen (Aalto University School of Science) spoke about linked data and our need for online tools to visualise and assess data, as well as the fact that linked data makes data, and data about data, machine-processable.

Jane Greenberg (Dryad) gave an overview of the data publishing system in operation at Dryad, their guidance on data citation, and the costs involved in creating the Dryad metadata records. (This discussion of data publication was a theme that kept coming back throughout the workshop.)

Cyril Pommier (French National Institute for Agricultural Research, INRA) gave a talk about the data management difficulties in coupling phenotype with plant genome studies, for studies into crop security, adaptation to climate change, etc. (Being a physicist, a lot of the science went straight over my head, but what I found fascinating was the fact that the data management problems being described were the same ones that we get in atmospheric science, so we may have more in common than not from a data management point of view. Which made me think - how many of the solutions are applicable across domains? We need to find out!)

The second day of the workshop kicked off with a pair of archaeological talks. First was Gerd-Christian Weniger (Neanderthal Museum) talking about making 3D scans of items from the Pleistocene period, including Neanderthal fossils. They use Confluence, which is a business wiki, as their repository software, as it allows easy up- and download of data. These scans, and the high resolution surface scans of rock art and stone tools, allow research to be done without having to travel to where the original tool or fossil is actually held - opening up the artifacts for study by schools and teacher training.

Katie Green (Archaeology Data Service) gave a talk about how the ADS does what it does, touching on their workflows for ingest and data publication (with the journal Internet Archaeology, who are also publishing data papers). She talked about the Jisc project, investigating the value of ADS to the community (a related project looked at the BADC last year) - a synthesis report can be found here.

Marjan Grootveld (Data Archiving and Networked Services) talked about how DANS operates, specifically about their front office - back office model for dealing with researchers, where the front office provide guidance and information, while the back office deal with the technical aspects of storage and preservation. DANS provide training for front office staff, who can be embedded in university libraries and other locations. Another quote that resonated with me was: "Data management planning is more important than the plan".

Wolfram Horstmann (State and University Library of Göttingen) discussed data services and policies from universities, funding bodies and journals. He also differentiated between a "post hoc data library", which is strong in service reputation but weak in subject specific expertise, and an "ad hoc data library", which has good subject specific knowledge, but often no recurrent funding. Of course, hybrids of these two exist.

And Hans Pfeiffenberger (Alfred Wegener Institute and Earth System Science Data) finished off the workshop with a discussion about data publication, giving examples of lessons that were learned from data papers published in ESSD. He also showed us that all these data publication issues are not new - Kepler's laws were based on Tycho Brahe's data and observations, which Kepler only got access to after Brahe's death. ESSD requires authors to describe the provenance of the data, the methods used to create/collect it, the limitations of the data, and provide estimates of the error. Reviewers must look at the data, and assess the consistency of the data and the article.

I'd like to thank the organisers again for inviting me to the workshop - and I hope to visit Cologne again sometime!

Monday, 17 November 2014

The WISE Awards 2014

HRH Princess Anne presenting the RCUK-sponsored Champion award. 

The WISE awards are known as “the Oscars of the Scientific World”. They recognise and celebrate individual women and girls who are ideal role models to inspire the next generation of girls to go into STEM careers, as well as the teachers, careers advisers and women in leadership who support and grow the talent pipeline.

I was honoured to have been asked to be a judge for this year's awards, especially as it meant I could attend the gala dinner where the awards were presented to the winners by HRH Princess Anne. It was a lovely evening, and I got the chance to meet and talk to some amazing and inspiring women in STEM.

Further details of the awards, details of all the nominees, and the list of all the winners can be found on the WISE website. I'd also really like to encourage everyone who reads my blog to seriously think about their friends and colleagues and if they think someone would be a good fit for an award, then please nominate them! 

WISE also held a daytime event at the Southbank Centre that day (Thursday 13th November), entitled “Time for Action: The STEM workforce we want to build for the next 30 years”, where they formally announced the release of their new report “‘Not for people like me?’ Under-represented groups in science, technology and engineering”.

That session was particularly interesting, as it made us think about how we describe ourselves (say, if we were at a speed-dating event). I describe myself by what I do, along with half of the people in the room at the time. The other half described themselves by what personal attributes they have. Job adverts recruiting for STEM posts need to reflect this.

Also in job adverts - the language used in them can be very off-putting for women. Things that are especially off-putting are if the company appears to be "arrogant", if the advert is unclear about what the job actually is, and if there is no salary quoted.

Another point was made that it's not enough to talk about the outputs of a piece of work ("we built a bridge"), but we should also talk about the outcomes ("and this joined a community together"). This resonated with me, because the reason I do science is because I want to change the world for the better, even if only in a little way. (Another motivator is "being an expert", which I admit works for me too!)

All of the points made (and there's many more in the report - well worth a read!) are backed up by full references. The author, Prof Averil Macdonald, really did a good job on making it accessible and readable, while at the same time backing up every assertion she makes.

WISE are pushing to get "1 million more" women into STEM, on the grounds that that number would take the total proportion of women in STEM up to 30%, which is generally accepted as critical mass. It's not going to be easy, but tackling the way STEM is presented will be a good start. As Imran Khan (Chief Executive - British Science Association) said during the panel discussion: "It's not about changing the girls, it's about changing the science". And changing the way that science is taught in schools - we should be teaching the methods of science, not just teaching the facts.

Another session that was really great was the workshop presented by the Institution of Civil Engineers (ICE). It started with the usual statistics and plots, but then went and got a group of five young apprentices to tell their story about how they came to be an apprentice. These girls were amazing! All of them had a non-standard route into their jobs; common themes included failed exams, or not getting the right grades, or having chosen the wrong subjects at school (which makes me even more sure that the way the UK's school and exam system insists on specialising in a limited number of subjects at age fifteen really does cause problems!) It's these types of stories that we need to be publicising. After all, if we want to do it all, we just have to do things a little differently.

So, to sum up!
  1. Read (and share) the report!
  2. And when the call for nominees for next year's awards is out - think about who you could nominate!

Wednesday, 12 November 2014

My donation to the Museum of Curiosity

Goin' nuts with the label maker by Bryan Kennedy, on Flickr

No, unfortunately I haven't been asked to be on the popular Radio 4 show "The Museum of Curiosity". But just in case I ever am, I've already decided what I'd like to donate.

But first, a bit of background on the show. The Museum of Curiosity is a panel comedy show, but with a twist. Instead of funny people being funny, they get in funny people, and experts on all sorts of things (and sometimes those people are one and the same), and they have a bit of a chat with the show presenter about their life and work. And each of the guests then gets to donate something to the Museum.

In the show's own words:
"The idea of the show is to bring together the most interesting people we can find and ask them to submit one item each to fill the Museum's empty plinths"

The seventh season is being broadcast at the moment (you can catch up on the listen again part of the BBC Radio 4 website), and over the seven seasons there have been such weird and wacky things donated as the alphabet, a pubic louse, silence, Father Christmas, nothing and Epping Forest.

So, dear producers of the Museum of Curiosity. If I were ever to be invited to make a donation to the Museum, here's what it'd be:

(drum roll please!)

A Telepathic Label Maker!

(Note, this is a label maker that makes telepathic labels, not a label maker that is telepathic.)

And as for my reason for donating it - well, the Museum has an awful lot of stuff in it already, with more being added to it all the time. And some of the things (like silence, or nothing) are the sort of things that are really hard to identify if you don't know what it is you're looking at. The label maker would allow the curator of the Museum to label everything*, and provide the casual visitor with all the information they'd need to understand the exhibits, and provide credit to the person who donated it in the first place.

As for the telepathic labels, well, that's me thinking ahead. Assuming that the Museum is around for a long, long time (like I'm sure the curator hopes it is), language is going to change, so a label written in present day English (or heaven forbid, jargon!) won't be very useful. A telepathic label will be able to change to address the person (or alien!) who is viewing it in their own preferred language. Plus, it'd be a big saving on translation services, and would draw a lot more visitors in. 

I await your call, Mr Curator!

* I'm deliberately not using the word metadata here (in case it scares off the media types), though that's essentially what I'm talking about.