Fee waivers for the Wikipedia tutorial at ECCB 2012 – apply now!

The logo of WikiProject Computational Biology, which supervises the coverage of Computational Biology on the English Wikipedia. Based on Fig. 1 of Monnet, C.; Loux, V.; Gibrat, J. F. O.; Spinnler, E.; Barbe, V. R.; Vacherie, B.; Gavory, F.; Gourbeyre, E. et al. (2010). Ahmed, Niyaz (ed.). “The Arthrobacter arilaitensis Re117 Genome Sequence Reveals Its Genetic Adaptation to the Surface of Cheese”. PLoS ONE 5 (11): e15489. doi:10.1371/journal.pone.0015489. PMC 2991359. PMID 21124797. CC BY.

Were you planning to attend the European Conference on Computational Biology this year? It will take place in Basel, Switzerland from September 9-12 and features a rich program that covers recent developments in all corners of the field, including a number of workshops and tutorials being held on the first day.

One of these tutorials is about editing Wikipedia and is specifically designed for researchers working in Computational Biology and related fields. Why? Well, what is the first topic that comes to your mind when you think of Computational Biology? What does your favourite search engine deliver for the term? Chances are that the Wikipedia entry for it will feature quite prominently in the results, be they personalized or not.

This means that whoever else is going to search that term will also stumble across that Wikipedia article, which may well become the first point of contact with the topic for your research administrators, prospective students, collaborators from different fields, or even some of the reviewers of your papers, grant proposals or tenure dossiers. Would the article in its present state benefit their understanding of the matter? If so, try another topic. Otherwise, how can you help improve the article, or even start one if it does not exist yet?

Logan, D. W.; Sandal, M.; Gardner, P. P.; Manske, M.; Bateman, A. (2010). “Ten Simple Rules for Editing Wikipedia”. PLoS Computational Biology 6 (9): e1000941. doi:10.1371/journal.pcbi.1000941. PMC 2947980. PMID 20941386. CC BY.

In the tutorial, Alex Bateman and I will address these issues both in a general manner and by way of concrete examples. We will also provide you with an opportunity to get your hands on Wikipedia – or any of its sister projects, if you prefer – and work on a Computational Biology topic of your choice.

We can offer up to five fee waivers for the tutorial and will distribute them on a rolling basis starting August 20. To apply, simply post a comment on this post and explain in a few sentences, with up to one link (which could point to a blog post of yours, where you can include as many links as you wish), why you should receive the fee waiver, and why it should be partial or full. We are especially inclined to approve requests that signal some initial engagement with Wikipedia, open research or open knowledge more generally. Some ideas for that:

  1. Review an existing Wikipedia article.
  2. Provide some suitably licensed images or multimedia files for the illustration of Wikipedia articles.
  3. Translate an existing Wikipedia article.
  4. Write a blog post on why you think researchers should contribute to Wikipedia or similar collaborative open knowledge projects.
  5. Release the raw data (under CC0) and code (under GPL or BSD) for a paper you have submitted or published.
  6. Release (under CC BY) a grant proposal you have written, whether it got funded or not.
  7. Lay out concrete plans to participate in ISCB’s Wikipedia Computational Biology Article Competition or to write a manuscript for PLOS Computational Biology in its Topic Pages track.
  8. Free style – whatever comes to your mind as probably appropriate. More risk, more fun.

We reserve the right to split two of these waivers in half, so as to partially support up to four participants. One full fee waiver will be earmarked for a participant willing to arrange for live streaming and video recording of the tutorial, so that remote participation becomes possible. Should that fee waiver still be available on August 27, its earmark shall be removed. Participants who already registered are eligible to apply for a partial refund.

Those of you who do not want to apply are encouraged to comment on the submissions as they come in.

Let the fun begin!

Wodak, S. J.; Mietchen, D.; Collings, A. M.; Russell, R. B.; Bourne, P. E. (2012). “Topic Pages: PLoS Computational Biology Meets Wikipedia”. PLoS Computational Biology 8 (3): e1002446. doi:10.1371/journal.pcbi.1002446. PMC 3315447. PMID 22479174. CC BY.

Posted in events, Topic Pages | 1 Comment

Open Access Media Importer: Database, Upload Testing

Since the last post, I rewrote the Open Access Media Importer to use a proper database – SQLite – instead of CSV text files. This step should aid both maintainability and performance; while generating the database using oa-cache find-media now takes longer than before, querying contents is considerably faster and easier. While the data can no longer be viewed using a text editor or office program, one can browse it using the SQLite Database Browser.

For demonstration and testing purposes, I built a new source that, given a list of DOIs, downloads only the metadata regarding the articles in question (if it can be found on PubMed Central). This means that importing media from specific articles only is now feasible; this function will probably be used after the bulk import is done.

There have been two test runs of the bot on Wikimedia Commons so far, both of which helped uncover serious oversights. It turned out that new accounts on Wikimedia Commons can upload material unhindered immediately, but are only allowed to edit pages after four days or after solving a CAPTCHA. Due to this, the first uploads did not have any metadata.

Furthermore, oa-put upload-media neither had rate-limiting functionality nor checked whether a video had already been uploaded. Both are now handled: oa-put simply sleeps 10 seconds between uploads, and duplicates are detected by querying for pages containing the article title, supplementary material label and supplementary material caption.
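
For illustration, here is a minimal Python 2 sketch of both measures. The MediaWiki search API call is real, but the exact query, field names and the placeholder upload callable are assumptions for the sake of the example, not oa-put's actual code.

    import json
    import time
    import urllib
    import urllib2

    COMMONS_API = 'https://commons.wikimedia.org/w/api.php'

    def already_uploaded(title, label, caption):
        # Hypothetical duplicate check: full-text search for pages that mention
        # the article title, supplementary material label and caption together.
        query = urllib.urlencode({
            'action': 'query',
            'list': 'search',
            'srsearch': ' '.join((title, label, caption)),
            'format': 'json',
        })
        request = urllib2.Request(COMMONS_API + '?' + query,
                                  headers={'User-Agent': 'oa-put sketch'})
        result = json.load(urllib2.urlopen(request))
        return len(result['query']['search']) > 0

    def upload_all(materials, upload):
        # `materials` is an iterable of (title, label, caption, path) tuples and
        # `upload` a callable doing the actual upload -- both placeholders here.
        for title, label, caption, path in materials:
            if already_uploaded(title, label, caption):
                continue
            upload(path)
            time.sleep(10)  # crude rate limiting between uploads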

Several people suggested querying for the hash of a media file before uploading it. To understand why this is not appropriate for our use case, one has to know that Ogg stream IDs are pseudorandom to simplify muxing. Due to this, encoding the same source material twice does not yield the exact same output.

Daniel Mietchen identified two features still missing: First, oa-cache find-media disregards media files that are referenced inline in the article text rather than as supplementary materials. Second, the PMC XML does not contain MeSH terms. Since categorization is important for Wikimedia Commons, I am already working on the latter part. As soon as the keyword issue is fixed, we will continue testing.

As my original motivation for writing the SQLite backend was to get more accurate statistics, here are some graphs showing the licensing and media types of supplementary materials in the PMC Open Access Subset. Regarding licensing, readers should be aware that “None” only means that no license URL was given or could be determined from equivalent text, not that all rights are reserved.
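
As a sketch of the kind of query this makes possible (the database path, table and column names below are assumptions for illustration, not the importer's actual schema):

    import sqlite3

    connection = sqlite3.connect('oa-cache.sqlite')  # hypothetical path
    cursor = connection.cursor()
    cursor.execute("""
        SELECT COALESCE(license_url, 'None'), COUNT(*)
        FROM supplementary_materials
        GROUP BY license_url
        ORDER BY COUNT(*) DESC
    """)
    for license_url, count in cursor.fetchall():
        print '%8d  %s' % (count, license_url)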

Bar chart showing licensing of Supplementary Materials in the PubMed Central Open Access Subset

Bar chart showing MIME types of free supplementary materials in the PubMed Central Open Access Subset

Bar chart showing MIME types of non-free supplementary materials in the PubMed Central Open Access Subset

Posted in Open Access Media Importer | Leave a comment

Open Access Media Importer: Usage and Statistics

A few days ago, at Wikimedia’s free culture brunch, I met up with Daniel Mietchen. He will talk about the Open Access Media Importer at Wikimania. Seeing another person trying to use the software turned out to be an interesting experience; we found new bugs and I realized the necessity of good documentation.

Using the Open Access Media Importer is similar to using a package manager like apt in several ways. First, there are three tools for the purposes of downloading data (oa-get), local operations (oa-cache) and uploading data (oa-put). Second, those programs can work with different sources, but for now only the pubmed source – corresponding to the PMC Open Access Subset – is actually useful.

Running the toolchain consists of five steps:

  1. oa-get download-metadata pubmed downloads several huge files that contain XML from the PMC Open Access Subset describing articles.
  2. oa-cache find-media pubmed looks for articles containing supplementary materials and saves a list of those.
  3. oa-get download-media pubmed filters the list of supplementary materials based on criteria like license or media type and downloads the results.
  4. oa-cache convert-media pubmed converts downloaded files to Ogg Theora + Vorbis.
  5. oa-put upload-media pubmed uploads the converted files to a MediaWiki installation given in the configuration file.

A screencast demonstrating usage is available.

Due to the new statistics functionality, we can be reasonably sure that our efforts will be successful. The output of oa-cache stats pubmed shows that among 231146 supplementary materials in the PMC Open Access Subset (earlier estimates: 694158, then 481663; see the updates below), at least 78005 (earlier estimates: 160393, then 162395) are licensed under a free license, most commonly CC BY. Furthermore, at least 2273 of those (earlier estimates: 3322, then 3511) are videos.

The following figure shows the MIME types of the free content the Open Access Media Importer has gathered. It was generated using oa-cache stats pubmed | plot-helper. Be aware that these are lower bounds, as many supplementary materials are not properly labeled with the URL of the corresponding license. This means that, barring a major reasoning error, the amount of free content that can be extracted will only go up.
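
The plot-helper script itself is not shown here; the following is only a sketch of the plotting side, with made-up counts, to illustrate how such a bar chart of MIME types can be produced with matplotlib.

    import matplotlib
    matplotlib.use('Agg')  # render to a file instead of opening a window
    import matplotlib.pyplot as plt

    # Example values only, not the actual statistics output.
    counts = {'video/mpeg': 1200, 'video/x-msvideo': 900, 'video/quicktime': 400}
    mime_types = sorted(counts, key=counts.get, reverse=True)
    plt.bar(range(len(mime_types)), [counts[m] for m in mime_types])
    plt.xticks(range(len(mime_types)), mime_types, rotation=45, ha='right')
    plt.ylabel('number of supplementary materials')
    plt.tight_layout()
    plt.savefig('mime-types.png')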

Update: When working on the licensing statistics code, I noticed that I had overestimated the number of supplementary materials by a large margin (about 200000) due to counting newlines instead of records in a CSV file. The amount of free content, however, was underestimated, as predicted; around 2000 more papers and around 200 more videos than originally thought are under free licenses.

Update (2): A rewrite of the Open Access Media Importer backend to use SQLite suggests there was a major reasoning error after all: Apparently, many supplementary materials were referred to twice; I had not anticipated that and had not checked for duplicates. In the interest of not exaggerating, I have updated the post to contain the new lower-bound values.

Plot of MIME types of Supplementary Materials in the PubMed Open Access Subset

Posted in Open Access Media Importer, Tools | 4 Comments

Open Access Media Importer: Almost works as advertised!

This month, after creating the upload functionality (oa-cache upload-media), my laptop’s SSD died. Since, by habit, I only push feature-complete commits, I lost the entire upload routine. Additionally, my backup from one month earlier was corrupted – and the backup from January did not contain any emails regarding the project.

Besides working on uploads and restoring backups, some of my time was spent collecting plain-text licensing statements and assigning proper license URLs to them. It is frustrating to see publishers giving useless or inconsistent licensing information like the following:

  • This work is licensed under a Creative Commons Attribution 3.0 License (by-nc 3.0)
  • This work is licensed under a Creative Commons Attr0ibution 3.0 License (by-nc 3.0)
  • This document may be redistributed and reused, subject to certain conditions .
  • creative commons
  • Open Access

Recently, PubMed Central seems to dislike the user agent string of Python’s urllib2, answering with a 403 error. Doing this to hinder bots seems to be common, as Wikipedia does it too, even though changing the user agent defeats the measure.
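
The workaround is simply to send a descriptive user agent; a minimal sketch (the URL and the user agent string below are examples only, not the bot's actual values):

    import urllib2

    request = urllib2.Request(
        'http://www.ncbi.nlm.nih.gov/pmc/',  # example URL only
        headers={'User-Agent': 'open-access-media-importer (contact: example@example.org)'})
    response = urllib2.urlopen(request)
    print response.getcode()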

As the software is almost finished, Raphael has ordered a server; we will be starting the batch import next week. I am certain this will uncover quite a few bugs in the crawler. I will not be at the Hackathon on the coming weekend, but I’ll write a short post that may help people extend the Open Access Media Importer. Participants who want to extend the toolchain should look at the dummy and pubmed modules.

Update: A screencast detailing usage of the toolchain is available. It should be played back with ttyrec (the JavaScript player breaks when encountering unicode characters).

Posted in Open Access Media Importer, Tools | 1 Comment

Let the White House know what you think about Open Access to the research literature

Have you ever failed to get hold of a scientific article you wanted to read? Did that failure have something to do with a paywall surrounding the article and its siblings?

If your answer was affirmative on both counts, chances are that you would like to have a closer look at a petition that aims to make such situations rarer for research funded by U.S. taxpayers. Started last night, the petition targets the White House, and anyone can sign it, not just U.S. citizens or residents. If 25,000 signatures are reached by June 19 (the current count is 3859), the White House will have to respond in an official manner to the proposal, which reads:

We petition the Obama administration to Require free access over the Internet to scientific journal articles arising from taxpayer-funded research. We believe in the power of the Internet to foster innovation, research, and education. Requiring the published results of taxpayer-funded research to be posted on the Internet in human and machine readable form would provide access to patients and caregivers, students and their teachers, researchers, entrepreneurs, and other taxpayers who paid for the research. Expanding access would speed the research process and increase the return on our investment in scientific research. The highly successful Public Access Policy of the National Institutes of Health proves that this can be done without disrupting the research process, and we urge President Obama to act now to implement open access policies for all federal agencies that fund scientific research.

 

If you replied “no” to either or both of the introductory questions, or if you disagree with the petition, I would be especially interested in your comments. I have signed the document (as #16), since one of the ways to satisfy such a “public access policy” is actual Open Access (in the sense of the Budapest Open Access Initiative), which is the first step towards a more open science communication culture. It also provides a foundation on which free knowledge projects like those run by the Wikimedia Foundation or the Open Knowledge Foundation can develop.

The links between Wikimedia and Open Access have recently been the subject of a special report in the Signpost, and further details can be found in the Research Committee’s response to the White House Request for Information on Open Access. Both documents emphasize the importance of the reusability of Open Access materials.

By way of illustration, I am pasting in below the list of files that have served as Open Access File of the Day on Wikimedia Commons so far. Click on an image to get to its metadata page. Together, these 174 files have been used to date on well over 10,000 pages within 176 Wikimedia projects.

2012

2011

Posted in In the news, Open Access File of the Day | Leave a comment

Open Access Media Importer: Encoding

Until now, the amount of appropriately licensed data the Open Access Media Importer found was limited by a notable idiosyncrasy: The most straightforward way of providing licensing information is a URL that can easily be checked against a whitelist of known free licenses. Most publishers, though, put human-readable statements into their otherwise machine-readable XML. To handle this case, I collected those statements, mapping them to URLs where appropriate.
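
In code, the collected mapping boils down to a dictionary from exact statement texts to canonical license URLs; the entry below is an invented example, not part of the collected corpus.

    LICENSE_URL_FROM_STATEMENT = {
        # Invented example statement from a fictitious publisher:
        'Example Publisher: This article is distributed under the Creative Commons Attribution License 3.0':
            'http://creativecommons.org/licenses/by/3.0/',
    }

    def license_url(statement):
        # Returns a known license URL, or None if the statement is missing,
        # unknown or too ambiguous to map.
        if statement is None:
            return None
        return LICENSE_URL_FROM_STATEMENT.get(statement.strip())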

In other news, the Open Access Media Importer suite gained two new commands:

  • oa-cache clear-metadata deletes the cache of found supplementary materials. It should be run if a new version of oa-cache recognizes metadata (for example, licenses) better than an old one.
  • oa-cache clear-media deletes all converted media files. It may be useful if a disk is filling up or one wants to try a new set of conversion parameters.

Meanwhile, the oa-get sub-commands metadata and media were renamed to be more self-describing; they are now download-metadata and download-media.

Also, oa-cache convert-media finally works; media conversion to Theora (and optionally, Vorbis) is done using the GStreamer framework. It provides a flow-based approach to multimedia programming: First, one selects elements suitable for the task at hand – for example, a file source or a media format decoder – then one creates a pipeline by connecting those elements to each other. Since GStreamer has lots of elements for decoding, encoding, compositing, streaming and so on, almost anything is possible by writing a pipeline; Wikipedia has an in-depth explanation.

One issue I encountered was pipelines stalling when audio data was present – leading to oa-cache convert-media pubmed hanging instead of encoding media files. After (erroneously) blaming the questionable quality of several GStreamer plugins, a post on Stack Overflow alerted me to the fact that I did not properly queue, convert and resample the input data.
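
For readers unfamiliar with GStreamer, here is a sketch of such a transcoding pipeline written against the current GStreamer 1.0 Python bindings; the element names and fixed file names are illustrative, not the importer's actual conversion code. Note the queue, audioconvert and audioresample elements on the audio branch, the pieces whose absence caused the stalls described above.

    import gi
    gi.require_version('Gst', '1.0')
    from gi.repository import Gst

    Gst.init(None)
    pipeline = Gst.parse_launch(
        'filesrc location=input.avi ! decodebin name=demux '
        'demux. ! queue ! videoconvert ! theoraenc ! mux. '
        'demux. ! queue ! audioconvert ! audioresample ! vorbisenc ! mux. '
        'oggmux name=mux ! filesink location=output.ogv'
    )
    pipeline.set_state(Gst.State.PLAYING)
    # Block until the stream ends or an error occurs.
    bus = pipeline.get_bus()
    bus.timed_pop_filtered(Gst.CLOCK_TIME_NONE,
                           Gst.MessageType.EOS | Gst.MessageType.ERROR)
    pipeline.set_state(Gst.State.NULL)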

A lesser problem I encountered was GStreamer being unable to reliably determine the duration of media files. This manifested itself in a progress bar stalling at 99 percent. I solved this by including an initially silent progressreport element in the pipeline, which gets unmuted as soon as the playback position takes on a bogus value (for example, greater than the assumed duration, or negative). I also found one file that caused a hangup at 99 percent without my being able to determine why.

Metadata is written to video files in a format called Vorbis comments. Though several modules exist for this, most have subtle issues; I used mutagen after neither tagpy nor pyogg was able to write to Ogg files containing only a Theora (video) stream, complaining about a missing Vorbis (audio) stream. To add insult to injury, tagpy tries to infer the file type from the file name and fails if the extension is oga or ogv.
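
A sketch of what writing Vorbis comments with mutagen looks like; the file name, field names and values are examples only, not the importer's actual metadata fields.

    from mutagen.oggtheora import OggTheora

    video = OggTheora('example.ogv')  # an existing Ogg Theora file
    video['TITLE'] = ['Video S1 of an example article']
    video['LICENSE'] = ['http://creativecommons.org/licenses/by/2.5/']
    video.save()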

As all of this is very abstract, tomorrow I will provide a screencast of the process, possibly also demonstrating the upload functionality. In the meantime, enjoy a video from a paper about tiny parasite worms that I used as a test file – somehow all the content I have found so far was either parasites, flu, or eye scream.

Posted in Open Access Media Importer, Tools | 3 Comments

PLoS Computational Biology goes wiki

Today saw an important step forward towards a wikification of scholarly workflows: PLoS Computational Biology published an article that not only followed the journal’s own author guidelines but also those for writing articles on the English Wikipedia. A copy of the journal article has been pasted into [[Circular Permutation in Proteins]], where it shall live on in the hands of the wiki community.

The article is the first in a new manuscript track – Topic pages – that adds a dynamic component to articles published in the journal, as explained in the accompanying editorial:

This month, we have published our first Topic Page on “Circular Permutations in Proteins” by Spencer Bliven and Andreas Prlić [6] as part of our Education section. Topic Pages are the version of record of a page to be posted to (the English version of) Wikipedia. In other words, PLoS Computational Biology publishes a version that is static, includes author attributions, and is indexed in PubMed. In addition, we intend to make the reviews and reviewer identities of Topic Pages available to our readership. Our hope is that the Wikipedia pages subsequently become living documents that will be updated and enhanced by the Wikipedia community, assuming they are in keeping with Wikipedia’s guidelines and policies, either by individuals, or, perhaps as is already happening in medicine and molecular and cell biology, by something more organized, or with a more formal review structure. We also hope this will lead to improved scholarship in a changing medium of learning, in this case made possible by the Creative Commons Attribution License that we use.

 

The editorial also discusses the issue of reward for scholars to contribute to endeavours like Wikipedia, for which Topic Pages provide a novel mechanism.

Like the quoted section, the paper contains direct links to Wikipedia pages for background, which dramatically reduces the need to rehash what is already known, while still allowing for a minimum of context.

The reviews that have been produced as a result of the journal’s peer review process have since been posted to the talk page of the Wikipedia entry, along with some further procedural explanations.

Through this manuscript track, PLoS Computational Biology joins the so far very small circle of journals that have experimented with dynamic features (which, by the way, form a core aspect of the Criteria for the journal of the future). Most closely related is the effort at RNA Biology (interestingly, not an Open Access journal), where a dedicated manuscript track established in 2008 requires that a manuscript on a new family of RNA be accompanied by the draft for a corresponding entry on the English Wikipedia. This effort is part of the Rfam project whose scope now includes over 900 articles on the English Wikipedia that are integrated with Rfam, a database dedicated to RNA families. Ideas for a similar project have been put forward in relation to the journal Gene and the Gene Wiki project.

Much older efforts to render publications more dynamic are the Living Reviews series of physics-related journals (established in 1998) and Scholarpedia (2005), which is implemented on a highly customized version of MediaWiki, the same software that Wikipedias run on. Both platforms, however, employ licensing schemes that are incompatible with reuse on Wikipedias.

While the workflow for Topic Pages is a bit convoluted (as described by Andreas Prlić), automated journal-to-wiki export has been routine practice for about a year now with several Pensoft journals. In both cases, the workflows involve dedicated wikis, for reasons that have to do with licensing (Wikipedias are more restrictively licensed than Open-Access journals) or with policies (taxonomic treatments are considered original research and thus not allowed on Wikipedias).

It can thus be expected that the workflows for Topic Pages at PLoS Computational Biology will be streamlined. Of note, other journals are invited to take advantage of that by using the dedicated Topic Pages wiki for preparing articles in their wiki track. In addition to that, Wikimedia Germany has approved funds to help journals integrate their workflows with those of Wikimedia projects. The support will be limited to the first journal per integration step, but any resulting software will be made openly available for reuse, so that other journals can build on these efforts. The funds can also be used to cover up to 50% of author-side publication fees for the first paper in such wiki tracks in up to five journals.

More general support on matters of Open Access on Wikimedia projects is also available via WikiProject Open Access.

Posted in Topic Pages | 17 Comments

Open Access Media Importer: Plugging in your own data source

From the beginning, the Open Access Media Importer was intended to be modular. Only with the latest patches has that feature actually landed, as a result of decoupling the interface from the actual application logic: Data sources are now Python modules that expose a number of functions for data retrieval and refinement.

When issuing a command like oa-get metadata pubmed, the wrapper script imports the chosen module (in this case sources/pubmed.py) and calls the module’s function associated with the action (in this case, the metadata action is associated with the download_metadata function). Those functions then do their work and communicate their state back to the calling process.
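
A minimal sketch of this dispatch pattern (not the actual wrapper code; only the metadata action and its download_metadata function are taken from the text above, the rest is illustrative):

    import importlib

    FUNCTION_FOR_ACTION = {'metadata': 'download_metadata'}  # illustrative mapping

    def run(action, source_name):
        source = importlib.import_module('sources.%s' % source_name)
        function = getattr(source, FUNCTION_FOR_ACTION[action])
        # The real source functions may take arguments (none are passed here);
        # they are generators, so their output is iterated over as it arrives.
        for state in function():
            print state

    run('metadata', 'pubmed')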

However, the functions called by the wrapper scripts are unlike most “normal” functions: They are generators, functions that save their state when yielding a value and resume execution on the next call. In practical terms, that means a function does not have to yield a complete result to the caller: It can provide information in chunks to be iterated over – be it download completion or refined metadata.

To demonstrate and test the implementation, I created a dummy module that returns fake data before rewriting the code that crawls pubmed for the new interface. Compared to the old spaghetti code, the simplicity is beautiful: Only two functions are needed to provide a new data source – and much of the dummy module’s content is simply the fake data it yields.
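
For illustration, a stripped-down source module in the spirit of the dummy module might look as follows; the function names beyond download_metadata, the dictionary keys and the fake values are guesses for the sake of the example, not the dummy module's actual contents.

    # sources/example.py -- hypothetical data source

    FAKE_ARTICLES = [
        {'title': 'An example article',
         'license-url': 'http://creativecommons.org/licenses/by/2.5/',
         'supplementary-materials': ['http://example.org/video-s1.avi']},
    ]

    def download_metadata():
        # Generator: pretend to download metadata, yielding progress as we go.
        for completed in (0.5, 1.0):
            yield {'completed': completed}

    def list_articles():
        # Generator: yield one metadata dictionary per article.
        for article in FAKE_ARTICLES:
            yield article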

New frontend functionality is provided by oa-cache’s list-articles action: Inspired by a request from Daniel Mietchen to find out which papers about Malaria are licensed under a specific Creative Commons license, it returns metadata for articles as CSV (commonly known as “the format that Excel can read”). With that, the mentioned task becomes as easy as oa-cache list-articles pubmed | grep Malaria | grep 'creativecommons.org/licenses/by/'.
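
The same filtering can of course be done in Python instead of grep; a sketch, with the caveat that the column names are assumptions about the CSV that list-articles emits:

    import csv
    import subprocess

    listing = subprocess.Popen(['oa-cache', 'list-articles', 'pubmed'],
                               stdout=subprocess.PIPE)
    for row in csv.DictReader(listing.stdout):
        if 'Malaria' in row.get('title', '') and \
                'creativecommons.org/licenses/by/' in row.get('license-url', ''):
            print row['title']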

Implementing the list-articles action helped me iron out many corner cases in parsing the data returned by PubMed Central and laid the groundwork for the reworked find-media action that extracts supplementary materials. The next article will concern itself with downloading, converting and uploading media files from supplementary materials.

Posted in Open Access Media Importer, Tools | 2 Comments

Open Access Media Importer: Apology, frontend & usage

More than a month ago, I promised to blog about my quest to build the Open Access Media Importer for Wikimedia Commons about once a week. Obviously, I failed. In the first few weeks, I made no noticeable progress, being stuck on two problems. When I resolved those issues, I did not blog about it – constantly postponing it out of a then-acquired bad habit I now feel ashamed of.

Later, I became ill – having headaches, being easily exhausted and almost constantly sleepy for weeks (I have since been diagnosed with high blood pressure). Unable to cope, I became quite lethargic, pushing things aside with ease; when the prototype did not run on the development server due to an old version of the Python interpreter (since fixed), development stalled completely. I hereby apologize for the delay and for not bringing this up earlier.

That being said, the rest of this article deals less with personal and more with technical issues: It describes the design of the Open Access Media Importer frontend. Accessing all elements of the envisioned scraper/transcoder/upload toolchain in a uniform way is important – nobody likes to use unusable software. After some deliberation, I chose to closely model it on the apt-get utility of the Debian GNU/Linux distribution, coming up with three wrapper scripts named oa-get, oa-cache and oa-put.

oa-get takes care of everything regarding downloads, acquiring both metadata and media. With the simple invocation oa-get download-metadata, it downloads index files from PubMed Central, skipping already acquired files and displaying a progress bar (screenshot). Its less complex sister invocation, oa-get download-media, can be thought of as somewhat analogous to wget -i.

oa-cache is the complementary tool for any activity that does not need network connectivity. It is able to find suitable supplementary materials, writing their URLs and available metadata to a CSV file, and it writes an additional file identifying articles that have no, or only non-audiovisual, supplementary materials. Known-useless files can thus be skipped on subsequent runs (screenshot); since many articles do not contain any usable media, this speeds up processing tremendously.
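
The skip list is conceptually just a set of known-useless article identifiers persisted to disk; a sketch of the idea, in which the file name and function names are hypothetical rather than oa-cache's actual ones:

    SKIP_FILE = 'articles-without-usable-media.txt'  # hypothetical file name

    def load_skip_list():
        try:
            with open(SKIP_FILE) as f:
                return set(line.strip() for line in f)
        except IOError:
            return set()

    def remember_useless(article_id, skip_list):
        # Record an article that has no audiovisual supplementary materials,
        # so subsequent runs can skip it without re-parsing its metadata.
        skip_list.add(article_id)
        with open(SKIP_FILE, 'a') as f:
            f.write(article_id + '\n')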

oa-put’s purpose will be to handle upload activities for Wikimedia Commons. Unlike the other tools, it currently cannot do anything. Like the others, it will be usable both manually and in shell scripts in a consistent manner.

Stay tuned: The next post will outline how you can write your own plugin for the Open Access Media Importer, extending the functionality of oa-get download-metadata and oa-cache find-media beyond accessing PubMed Central. If you are impatient, in the meantime you should follow the project on GitHub.

Posted in Open Access Media Importer, Tools | Leave a comment

Embed test

For Open Access Week last year, Alex Holcombe made a video in which an academic publisher explains why researchers have to sign copyright transfer agreements for the scholarly articles they wrote.

This video is featured on the Main Page of Wikimedia Commons today under Media of the day. I tried to embed a YouTube copy, but for some reason, I couldn’t get WordPress to accept iframe or object tags today.

Posted in Uncategorized | Leave a comment