Friday, May 25, 2018

Silverbacks and scientific progress: no more co-authorship for just supervision

A silverback gorilla. CC-BY-SA Raul654.
Barend Mons (GO FAIR) frequently uses the term silverback to refer to more senior scientists effectively (intentionally or not) blocking progress. When Bjoern Brembs posted on G+ today that Stevan Harnad proposed to publish all research online, I was reminded of Mons' gorillas.

My conclusion is that basically every senior scholar (after PhD) is a silverback. And the older we get the more back we become, and the less silver. That includes me, I'm fully aware of that. I'm trying to give the PhD candidates I am supervising (Ryan, Denise, Marvin) as much space as I can and focus only on what I can teach them. To be fair, I am limited in that too: grants put pressure on what the candidates must deliver on top of the thesis.

The problem is the human bias that we prefer to listen to more senior people. Most of us fail that way. It takes great effort to overcome that bias. Off topic, that is one thing which I really like about the International Conference on Chemical Structures that starts this Sunday: no invited speakers, no distinction between PhD candidates and award winners (well, we get pretty close to that); also, organizers and SAB members never get an oral presentation: the silverbacks take a step back.

But 80% of the innovation and discovery we do is progress that is hanging in the air. Serendipity and the availability of and access to the right tools (which explains a lot of why "top" universities stay "top") introduce some bias in who is lucky enough to find it. It's privilege.

No more co-authorship for just supervision
Besides the many other things that need serious revision in journal publishing (really, we're trying that at J.Cheminform!), one thing is that we must stop being co-authors on papers just for being the supervisor: if we did not contribute practical research, we should not be a co-author.

Of course, the research world does not work like that, because people count articles (rather than seeing what research someone does) and we value grant acquisition more than doing research (only 20% of my research time is still research, and even that small amount takes great effort). And full professors are judged on the number of papers they publish, rather than the amount of research done by people in his group. Of course, the supervision is essential, but that makes you a great teacher, not an active researcher.

BTW, did you notice that Nobel prizes are always awarded to the last authors of the papers describing the work, and the award never seems to mention the first author?

BTW, did you notice how sneakily the gender bias sneaked in? Just to make clear, female scholars can be academic silverbacks just as well!

Sunday, March 25, 2018

SPLASHes in Wikidata

Mass spectrum from the OSDB (see also this post).
A bit over a year ago I added EPA CompTox Dashboard IDs to Wikidata. Considering that an entry in that database means that there is likely something known about the adverse properties of that compound, the identifier can be used as a proxy for that. Better, once the EPA team starts supporting RDF with a SPARQL end point, we will be able to do some cool federated queries.

For metabolomics the availability of mass spectra is of interest for metabolite identification. A while ago the SPLASH was introduced (doi:10.1038/nbt.3689), and adopted by several databases around the world. After the recent metabolomics winterschool it became apparent that it is now adopted widely enough to be used in Wikidata. So, I proposed a new SPLASH Wikidata property, which was approved last week (see P4964). The MassBank of North America (MoNA; Fiehn's lab) team made available, as CCZero, a mapping that links the InChI of each compound to the SPLASH identifiers of the spectra for that compound.

So, over the weekend I pushed some 37 thousand SPLASHes into Wikidata :)

This is for about 4800 compounds.

Yes, technically, I used the same Bioclipse script approach as with the CompTox identifiers, resulting in QuickStatements. Next up are the SPLASHes from Chalk's aforementioned OSDB.
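To give an idea of what such a script produces, here is a minimal sketch in Python (not the Bioclipse script itself). It assumes the mapping has already been reduced to (InChIKey, SPLASH) pairs and that a lookup from InChIKey to Wikidata item is available; the function name, the lookup, and the example values are all hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

SPLASH_PROPERTY = "P4964"  # the SPLASH property in Wikidata


def to_quickstatements(
    mapping: Iterable[Tuple[str, str]],
    qid_by_inchikey: Dict[str, str],
) -> List[str]:
    """Turn (InChIKey, SPLASH) pairs into QuickStatements (V1) lines.

    Pairs whose InChIKey has no known Wikidata item are skipped.
    """
    lines = []
    for inchikey, splash in mapping:
        qid = qid_by_inchikey.get(inchikey)
        if qid is None:
            continue  # compound not (yet) in Wikidata
        # V1 commands are tab-separated; string values go in double quotes.
        lines.append(f'{qid}\t{SPLASH_PROPERTY}\t"{splash}"')
    return lines


if __name__ == "__main__":
    # Placeholder data: Q4115189 is the Wikidata Sandbox item, and the
    # InChIKey and SPLASH below are made up for illustration only.
    pairs = [("EXAMPLEKEY-ONE", "splash10-0000-0000000000-00000000000000000000")]
    known_items = {"EXAMPLEKEY-ONE": "Q4115189"}
    print("\n".join(to_quickstatements(pairs, known_items)))
```

The resulting lines can be pasted into QuickStatements as a batch; a compound with several spectra simply yields several lines for the same item.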

Wednesday, February 21, 2018

When were articles cited by WikiPathways published?

Number of articles cited by curated WikiPathways, using data in Wikidata (see text).
One of the consequences of the high publication pressure is that we cannot keep up with converting all those facts into knowledge bases. Indeed, publishers, and journals more specifically, do not care so much about migrating new knowledge into such bases. Probably this has to do with the business: they give the impression they are more interested in disseminating PDFs than disseminating knowledge. Yes, sure there are projects around this, but they are missing the point, IMHO. But that's the situation and text mining and data curation will be around for the next decade at the very least.

That makes the up-to-dateness of any database pretty volatile. Our knowledge extends every 15 seconds [0,1] and extracting machine-readable facts accurately (i.e. as the author intended) is not trivial. Thankfully we have projects like ContentMine! Keeping database content up to date is still a massive task. Indeed, I have an (electronic) pile of 50 recent papers of which I want to put facts into WikiPathways.

That made me wonder how WikiPathways is doing. That is, in which years were the articles published that are cited by pathways from the "approved" collection (the collection of pathways suitable for pathway analysis)? After all, if it does not include the latest knowledge, people will be less eager to use it to analyse their excellent new data.

Now, the WikiPathways RDF only provides the PubMed identifiers of cited articles, but Andra Waagmeester (Micelio) put a lot of information in Wikidata (mind you, several pathways were already in Wikidata, because they were in Wikipedia). That data is currently not complete. The current number of cited PubMed identifiers (~4200) can be counted on the WikiPathways SPARQL end point with:
    PREFIX cur: <>
    SELECT (COUNT(DISTINCT ?pubmed) AS ?count)
    WHERE {
      ?pubmed a wp:PublicationReference ;
        dcterms:isPartOf ?pathway .
      ?pathway wp:ontologyTag cur:AnalysisCollection .
    }
Wikidata, however, lists at this moment about 1200:
    SELECT (COUNT(DISTINCT ?citedArticle) AS ?count) WHERE {
      ?pathway wdt:P2410 ?wpid ;
               wdt:P2860 ?citedArticle .
    }
Taking advantage of the Wikidata Query Service visualization options, we can generate a graphical overview with this query:
    SELECT (STR(SAMPLE(?year)) AS ?year)
           (COUNT(DISTINCT ?citedArticle) AS ?count)
    WHERE {
      ?pathway wdt:P2410 ?wpid ;
               wdt:P2860 ?citedArticle .
      ?citedArticle wdt:P577 ?pubDate .
      BIND (year(?pubDate) AS ?year)
    } GROUP BY ?year
The result is the figure given at the start (right) of this post.
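If you prefer the numbers over the chart, the per-year counts can also be fetched from a script. The following is a rough Python sketch, not how the figure was made: it sends a simplified variant of the query above (plain `?year`, without the `STR(SAMPLE(...))` wrapper) to the Wikidata Query Service and turns the JSON result into a dictionary. The function names and the User-Agent string are my own placeholders.

```python
import json
import urllib.parse
import urllib.request

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# Simplified variant of the per-year query (an assumption, not the original).
QUERY = """
SELECT ?year (COUNT(DISTINCT ?citedArticle) AS ?count)
WHERE {
  ?pathway wdt:P2410 ?wpid ;
           wdt:P2860 ?citedArticle .
  ?citedArticle wdt:P577 ?pubDate .
  BIND (YEAR(?pubDate) AS ?year)
} GROUP BY ?year ORDER BY ?year
"""


def build_url(query: str) -> str:
    """Build a GET URL that asks the endpoint for JSON results."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    return WDQS_ENDPOINT + "?" + params


def parse_counts(result: dict) -> dict:
    """Map publication year to article count from a SPARQL JSON result."""
    counts = {}
    for row in result["results"]["bindings"]:
        if "year" not in row:
            continue  # skip a group without a usable publication date
        counts[int(row["year"]["value"])] = int(row["count"]["value"])
    return counts


def fetch_counts(query: str = QUERY) -> dict:
    """Run the query over the network (placeholder User-Agent string)."""
    request = urllib.request.Request(
        build_url(query),
        headers={"User-Agent": "wikipathways-citation-years-sketch/0.1"},
    )
    with urllib.request.urlopen(request) as response:
        return parse_counts(json.load(response))


if __name__ == "__main__":
    for year, count in sorted(fetch_counts().items()):
        print(year, count)
```

Keeping the parsing separate from the network call makes it easy to check against a saved result file.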

Saturday, February 17, 2018

FAIR-er Compound Interest Christmas Advent 2017: learnability and citability

Compound Interest infographics of yesterday.
I love Compound Interest! I love what it does for popularization of the chemistry in our daily life. I love that the infographics have a pretty liberal license.

But I also wish they would be more usable. That is, the usability is greatly diminished by the lack of learnability. Of course, there is not a lot of room to give pointers. Second, they do not have DOIs and are hard to cite as a source. That said, the lack of sourcing information may not make it the best source, but let's consider these aspects separately. I would also love to see the ND clause dropped, as it makes it harder to translate these infographics (you do not have the legal permission to do so) and fixing small glitches has to involve Andy Brunning personally.

The latter I cannot change, but the license allows me to reshare the graphics. I contacted Andy and proposed something I wanted to try. This post details some of the outcomes of that.

Improving the citability
This turns out to be the easy part, thanks to the great integration of GitHub and Zenodo. So, I just started a GitHub repository, added the proper license, and copied in the graphics. I wrapped it with some Markdown, taking advantage of another cool GitHub feature, and got this simple webpage:

By making the result a release, it got automatically archived on Zenodo. Now Andy's Compound Interest Christmas Advent 2017 has a DOI: 10.5281/zenodo.1164788:

So, this archive can be cited as:
    Andy Brunning, & Egon Willighagen. (2018, February 2). egonw/ci-advent-2017: Compound Interest Christmas Advent 2017 - Version 1 (Version v1). Zenodo.
Clearly, my contribution is just the archiving and, well, what I did as explained in the next section. The real work is done by Andy Brunning, of course!

Improving the learnability
One of the reasons I love the graphics is that they show the chemicals around us. Just look out of your window and you'll see the chemicals that make flowers turn colorful, berries taste good, and several plants poisonous. Really, just look outside! You see them now? (BTW, forget about those nanopores and minions, I want my portable metabolomics platform :)

But if I want to learn more about those chemicals (what are their properties, how do I extract them from the plants, how will I recognize them, what toxins am I (deliberately, but in very low doses) eating during lunch, who discovered them, etc, etc?), those infographics don't help me.

Scholia to the rescue (see doi:10.1007/978-3-319-70407-4_36): using Wikidata (and SPARQL queries) this can tell me a lot about chemicals, and there is a good community that cares about the above questions too, and adds information to Wikidata. Make sure to check out WikiProject Chemistry. All it needed was a Scholia extension for chemicals, something we've been working on. For example, check out bornyl acetate (from Day 2: Christmas tree aroma):

This list of identifiers is perhaps not the most interesting, and we're still working out how we can make it properly link out with the current code. Also, this compound is not so interesting for properties, but if there is enough information, it can look like this (for acetic acid):

I can recommend exploring the information it provides, and note the links to literature (which may include primary literature, though not in this screenshot).

But I underestimated the situation, as Compound Interest actually includes quite a few compound classes, and I had yet to develop a Scholia aspect for that. Fortunately, I got that finished too (and I love it), and it has distinct features and is properly integrated, but to give you some idea, here is what phoratoxin (see Day 9: Poisonous Mistletoe) looks like:

Well, I'm sure it will look quite different a year from now, but I hope you can see where this is going. It is essential we improve the FAIR-ness (see doi:10.1038/sdata.2016.18) of resources, small and large. If projects like Compound Interest would set an example, this will show the next generation of scientists how to do science better.

Tuesday, February 06, 2018

PubMed Commons is shutting down

Where NanoCommons only just started, another Commons, PubMed Commons, is shutting down. There is a lot of discussion about this and many angles. But the bottom line is, not enough people used it.

That leaves me with the question of what to do with those 39 comments I left on the system (see screenshot on the right). I can copy/paste them to PubPeer, ScienceOpen, or something else. Someone also brought up that those services can go down too (life cycle) and maybe I should archive them?

Or maybe they are not important enough to justify the effort?

I will keep you posted...