OPEN ACCESS REPORT
Open Access Media Importer
While walking, an
Eland antelope emits multiple signals related to its fighting ability, including
click sounds caused by knee movements.
After having been approved as a bot on Commons late in October, the
Open Access Media Importer ran almost continuously throughout November, scanning
PubMed Central for suitably licensed scholarly articles with supplementary multimedia and importing these into Wikimedia Commons, to a current total of
well over 9,000, raising the number of files under
Category:Open access (publishing) and its subcategories to
almost 14,000 by the end of the month.
The bot attempts to provide the files with categories based on keywords, subject categories or
MeSH terms supplied by the journal, by PubMed Central or by
PubMed for the corresponding article. This sometimes leads to miscategorizations, often to overcategorization, and occasionally to no categories at all. At present, the files are spread over more than 20,000 categories, almost 10% of which had to be newly created for the purpose (e.g. for
well over a hundred journals). Some of these categories (e.g.
Caenorhabditis elegans or
Green fluorescent proteins) are now filled with hundreds of files, which will eventually have to be distributed across more fine-grained categories. For many topics covered by PubMed Central (mostly biomedicine), there are now far more multimedia files available than the current Wikipedia entries (if they exist) can accommodate. For further examples, see
actin cytoskeleton (
on the English Wikipedia and
on Commons),
Gap junction (
Wikipedia;
Commons) or
Woronin body (
Wikipedia;
Commons).
The review of the categorization of the files and of these new categories themselves continues – a process that you can facilitate by
checking out (thanks to the overburdened
Toolserver) a few of them and adding or removing categories as appropriate. If you can think of wiki pages where these files could be useful, please add them there.
Bug fixes continued but shifted in focus from providing functionality to minimizing the effects of inconsistent and incorrect metadata available from PubMed Central.
Metadata at PubMed Central
The most prominent issue with the XML is that of incorrect or self-contradictory licensing statements. While this had already been
noticed in spring (e.g. “
licensed under a Creative Commons Attr0ibution 3.0 License (by-nc 3.0)” – yes, with a typo on top of it, in an article from
Orthopedic Reviews, published by PagePress), actually deploying the bot to larger parts of the database made it clear that the phenomenon is rather common and not restricted to small and lesser-known publishers.
The Open Access Media Importer analyzes the
XML of articles stored in PubMed Central’s
Open Access subset. That XML is delivered by the individual journals or publishers, which gives rise to a plethora of individual styles that may or may not be close to the actual specifications of the
National Library of Medicine’s
Document Type Definition (now named
JATS).
License mismatch in
Stem Cells (Dayton, Ohio), published by Wiley-Blackwell: while the machine-readable license is CC BY, the human-readable version has a non-commercial clause (
PMC3468739).
Besides the two journals highlighted in the figures, other journals affected by contradictory license statements include
Evolutionary Applications (
Wiley-Blackwell),
Traffic (Copenhagen, Denmark) (Wiley-Blackwell),
Cellular Microbiology (Wiley-Blackwell),
Cytotherapy (
Informa),
The American Journal of Tropical Medicine and Hygiene (
American Society of Tropical Medicine and Hygiene),
The FEBS Journal (Wiley-Blackwell),
Hepatology (Baltimore, Md.) (Wiley-Blackwell),
Journal of Cellular Physiology (Wiley-Blackwell) and
Database: The Journal of Biological Databases and Curation (Oxford University Press). At the
Journal of Neurochemistry (Wiley-Blackwell), the self-contradictory notice “Re-use of this article is permitted in accordance with the Creative Commons Deed, Attribution 2.5, which does not permit commercial exploitation.” is even
displayed directly on the article’s page. The closest match to the term “Creative Commons Deed, Attribution 2.5” would be the
Creative Commons Attribution 2.5 Generic License (CC BY 2.5), which is indeed linked from the XML and
does permit commercial exploitation.
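A contradiction of this kind can be spotted mechanically by comparing the license link with the license text. The following is a minimal sketch, not the bot's actual code: it assumes the link sits in an `xlink:href` attribute on a `license` element, as in the cases described above, and it uses a hand-picked list of phrases (an assumption, far from exhaustive) to recognize a non-commercial restriction. The sample fragment is hand-written around the notice quoted above, not copied from a real article.

```python
import re
import xml.etree.ElementTree as ET

# Name of the xlink:href attribute in ElementTree's namespace notation.
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Heuristic phrases hinting at a non-commercial restriction (assumed list).
NC_PATTERN = re.compile(
    r"non-?\s?commercial|by-nc|does not permit commercial", re.IGNORECASE
)

# Matches plain CC BY links such as http://creativecommons.org/licenses/by/2.5/
CC_BY_URL = re.compile(r"creativecommons\.org/licenses/by/\d")


def license_statements(xml_text):
    """Return (license URL or None, whitespace-normalized license text)."""
    root = ET.fromstring(xml_text)
    license_el = root.find(".//license")
    if license_el is None:
        return None, ""
    url = license_el.get(XLINK_HREF)
    text = " ".join("".join(license_el.itertext()).split())
    return url, text


def is_contradictory(url, text):
    """True if the link says plain CC BY but the text restricts commercial use."""
    if url is None:
        return False
    return bool(CC_BY_URL.search(url)) and bool(NC_PATTERN.search(text))


# Hand-written illustrative fragment (hypothetical, not a real article).
SAMPLE = (
    '<article xmlns:xlink="http://www.w3.org/1999/xlink"><front><permissions>'
    '<license xlink:href="http://creativecommons.org/licenses/by/2.5/">'
    "<license-p>Re-use of this article is permitted in accordance with the "
    "Creative Commons Deed, Attribution 2.5, which does not permit "
    "commercial exploitation.</license-p></license>"
    "</permissions></front></article>"
)

url, text = license_statements(SAMPLE)
print(url, is_contradictory(url, text))
```

A check like this only flags candidates for manual review; it cannot decide which of the two statements reflects the publisher's intent.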
While contradictions between machine-readable and human-readable license statements are one sort of problem, many journals – including those published by
PLOS, which account for the majority of the bot’s uploads so far – do
not provide a license link at all or mix up the license and copyright tags in other ways. On a related note, even articles clearly and unambiguously labeled CC BY may occasionally
contain materials incompatible with such licensing, and some journals otherwise using Creative Commons licenses occasionally publish articles under
Crown copyright or similar conditions, causing the bot to skip the articles.
This licensing mess raises a number of questions: if the licensing for a given article agrees in its human- and machine-readable versions
on PubMed Central, can we then be sure that this information is correct? This is the case, for instance, with the journal
Molecular Vision. What if the same article does not have any licensing statement
at the journal’s site (or
in the XML there), or if the journal’s copyright policy states
CC BY-NC-ND as the only option? What if Google
finds several articles from the same journal that are also labeled as CC BY?
The mismatch between stated licenses and actual licensing conditions also makes it difficult to assess, in an automated fashion, what amount of audio, video or other materials is available from PubMed Central under Wikimedia-compatible licenses. For some plots on the matter, see
this blog post, which also highlights another frequent issue: that of a mismatch between the actual
MIME type and that stated in the XML.
As a rough estimate, MIME type mismatches of this kind affect on the order of
10% of the supplementary files in the database. Since this translates to hundreds of multimedia files, the bot now attempts to determine the MIME type of all supplementary materials and chooses those that are, in fact, audio or video, irrespective of what the XML states about them, thereby even covering cases in which the XML makes
no statement about the MIME types. The bot naturally fails, however, in cases where suitably licensed articles
do have supplementary multimedia files but these are
not mentioned in the XML available from PMC (another case:
journal;
PMC).
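Determining the MIME type from the file itself, rather than from the XML, can be sketched as follows. This is an illustration under stated assumptions, not the bot's implementation: it inspects a handful of well-known file signatures in the first bytes of a file, and the function names and the sample entries are hypothetical.

```python
def sniff_mime(header):
    """Guess a MIME type from the first bytes of a file; None if unrecognized."""
    if header.startswith(b"OggS"):
        return "application/ogg"
    if header.startswith(b"RIFF") and header[8:12] == b"WAVE":
        return "audio/x-wav"
    if header.startswith(b"RIFF") and header[8:12] == b"AVI ":
        return "video/x-msvideo"
    if header[4:8] == b"ftyp":
        # MP4/QuickTime family; may also be audio-only in practice.
        return "video/mp4"
    if header.startswith(b"\x1a\x45\xdf\xa3"):
        return "video/x-matroska"
    if header.startswith(b"\x00\x00\x01\xba"):
        return "video/mpeg"
    if header.startswith(b"ID3"):
        return "audio/mpeg"
    return None


def is_multimedia(header):
    """True if the sniffed type is audio, video or an Ogg container."""
    detected = sniff_mime(header)
    if detected is None:
        return False
    return detected.startswith(("audio/", "video/")) or detected == "application/ogg"


def select_multimedia(files):
    """files: iterable of (name, MIME type declared in the XML or None, first bytes).

    The declared type is deliberately ignored, so files with a wrong or
    missing declaration are still picked up.
    """
    return [name for name, _declared, header in files if is_multimedia(header)]


# Hypothetical supplementary files with made-up declarations.
EXAMPLE_FILES = [
    ("audio_s1.wav", "application/pdf", b"RIFF\x24\x00\x00\x00WAVEfmt "),
    ("video_s2.mp4", None, b"\x00\x00\x00\x18ftypmp42"),
    ("text_s3.pdf", "video/quicktime", b"%PDF-1.4"),
]
print(select_multimedia(EXAMPLE_FILES))
```

A production tool would more likely rely on an established file-type detection library than on a hand-maintained signature list.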
Another obstacle to importing some suitably licensed materials is that
files are frequently hidden in zip archives, which the bot ignores for the time being.
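Looking inside such archives is not something the bot does at present. As a rough sketch of how it could be approached, the following lists archive members whose file extension suggests audio or video. The extension list and the function name are assumptions made for illustration.

```python
import io
import zipfile

# Assumed list of extensions that hint at audio or video content.
MEDIA_EXTENSIONS = (
    ".ogg", ".oga", ".ogv", ".wav", ".mp3",
    ".mp4", ".mov", ".avi", ".mpg", ".wmv",
)


def media_members(zip_bytes):
    """Return names of archive members that look like audio or video files."""
    try:
        with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
            return [
                name
                for name in archive.namelist()
                if name.lower().endswith(MEDIA_EXTENSIONS)
            ]
    except zipfile.BadZipFile:
        # Not a readable zip archive; nothing to report.
        return []
```

Since file extensions can be as unreliable as the MIME types stated in the XML, members found this way would still need to be checked against their actual content.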
Once suitably licensed multimedia have been identified as such, they have to be
converted to a format accepted at Wikimedia Commons, i.e.
OGG. This does
not always work, since some authors use rather unusual file formats, or the metadata about the files (e.g. the length of a video) is incorrect or not stated at all. Most journals have a disclaimer that proper functioning of supplementary materials is the authors’ responsibility, but it would be nice to establish a standard for testing that supplementary files submitted to journals actually convert properly to common standard formats.
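A conversion test of the kind suggested here could be as simple as attempting the conversion and checking whether it succeeds. The sketch below assumes that the `ffmpeg` command-line tool is installed and built with Theora and Vorbis support. That is an assumption for illustration only, since the tool the bot itself uses is not stated here, and the quality settings are arbitrary.

```python
import subprocess
from pathlib import Path


def ogg_target(source, has_video=True):
    """Derive the output name: .ogv for video, .oga for audio-only files."""
    return Path(source).with_suffix(".ogv" if has_video else ".oga")


def build_command(source, has_video=True):
    """Assemble an ffmpeg call that encodes to Theora/Vorbis in an Ogg container."""
    command = ["ffmpeg", "-n", "-i", str(source)]  # -n: never overwrite output
    if has_video:
        command += ["-c:v", "libtheora", "-q:v", "7"]
    else:
        command += ["-vn"]  # drop any video stream
    command += ["-c:a", "libvorbis", "-q:a", "4", str(ogg_target(source, has_video))]
    return command


def converts_properly(source, has_video=True):
    """Return True if ffmpeg ran and exited successfully, False otherwise."""
    try:
        result = subprocess.run(build_command(source, has_video), capture_output=True)
    except FileNotFoundError:
        # ffmpeg is not installed on this machine.
        return False
    return result.returncode == 0
```

A journal's submission system could run a check like `converts_properly("video_s1.mov")` on each uploaded file and warn the authors when it fails.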
Further issues arise when the files are converted and need to be associated with their metadata in Commons style: sometimes, there is
no description whatsoever of supplementary files, or the description of several files is
lumped together in a way that the bot cannot parse. Some minor issues include
line breaks in article titles or
typos in categories or keywords provided by the journal, which the bot uses for the initial categorization of the files.
A problem not really solved so far is that of duplicate detection – while this
works well for images, this is not the case for multimedia files, since multiple copies of a file will normally have different hashes.
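The limitation can be illustrated with a small sketch of hash-based duplicate detection. The choice of SHA-1 and the sample byte strings are assumptions for illustration. The two "video" entries stand in for two copies of the same recording whose bytes differ slightly, for example because they were converted separately, which is one plausible reason for the differing hashes.

```python
import hashlib


def content_hash(data):
    """Hex digest of the raw file bytes (SHA-1 chosen here as an example)."""
    return hashlib.sha1(data).hexdigest()


def find_exact_duplicates(files):
    """Group file names by content hash; files maps name -> bytes."""
    by_hash = {}
    for name, data in files.items():
        by_hash.setdefault(content_hash(data), []).append(name)
    return [names for names in by_hash.values() if len(names) > 1]


# Made-up stand-ins for real files.
PAYLOAD = b"\x01\x02\x03" * 100
EXAMPLE = {
    "fig1.png": b"\x89PNG" + PAYLOAD,
    "fig1_copy.png": b"\x89PNG" + PAYLOAD,            # byte-identical copy
    "video_a.ogv": b"OggS" + b"encoder=A" + PAYLOAD,  # same content, different header
    "video_b.ogv": b"OggS" + b"encoder=B" + PAYLOAD,
}
print(find_exact_duplicates(EXAMPLE))
```

Only the byte-identical pair is reported. Catching the two videos would require comparing the decoded content rather than the file bytes, which is a considerably harder problem.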
Gallery
The following files represent a selection of what has been uploaded by the Open Access Media Importer this month. If you can think of wiki pages where these files could be useful, please add them there or
let us know. For metadata about the files, please click on the Menu button.
Videos
Can you guess the research question addressed in the corresponding scholarly article?
Sound files
Can you guess what these sounds represent?
Beyond PubMed Central
While PubMed Central is the only database currently spidered by the Open Access Media Importer, it is designed in a modular fashion, such that other sources could easily be plugged in. To lay the groundwork for such future work, a number of (manual) test uploads from such potential sources have been made this month.
- A “singing” iceberg – the first file from a data repository (PANGAEA).
WikiProject Open Access
The following news from WikiProject Open Access has been posted this month:
Open Access File of the Day
The following files have been featured as
Open Access File of the Day this month:
- November 28: A range of putative disease-causing mechanisms for the case of the disease progeria
- November 1: MRI scans of a microcephalic patient (right) and a healthy control (left).