A few days ago, at Wikimedia's free culture brunch, I met up with Daniel Mietchen, who will talk about the Open Access Media Importer at Wikimania. Seeing another person try to use the software turned out to be an interesting experience: we found new bugs, and I realized how necessary good documentation is.
Using the Open Access Media Importer is similar to using a package manager like apt in several ways. First, there are three tools: oa-get for downloading data, oa-cache for local operations, and oa-put for uploading data. Second, those programs can work with different sources, but for now only the pubmed source, which corresponds to the PMC Open Access Subset, is actually useful.
Running the toolchain consists of five steps:
- oa-get download-metadata pubmed downloads several huge files that contain XML from the PMC Open Access Subset describing articles.
- oa-cache find-media pubmed looks for articles containing supplementary materials and saves a list of those.
- oa-get download-media pubmed filters the list of supplementary materials based on criteria like license or media type and downloads the results.
- oa-cache convert-media pubmed converts downloaded files to Ogg Theora + Vorbis.
- oa-put upload-media pubmed uploads the converted files to the MediaWiki installation specified in the configuration file.
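The five steps above can be chained in a small driver script. The following is a minimal sketch, not part of the Open Access Media Importer itself: the command lines are the ones listed above, while the `PIPELINE` list, the `run_pipeline` helper and its `dry_run` flag are hypothetical names introduced for illustration. It also assumes that each tool exits with a non-zero status on failure, which I have not verified.

```python
import subprocess

SOURCE = "pubmed"  # the only source that is currently useful

# The five steps, in the order they have to run.
PIPELINE = [
    ["oa-get", "download-metadata", SOURCE],
    ["oa-cache", "find-media", SOURCE],
    ["oa-get", "download-media", SOURCE],
    ["oa-cache", "convert-media", SOURCE],
    ["oa-put", "upload-media", SOURCE],
]


def run_pipeline(dry_run=True):
    """Handle each step in order and return the command lines as strings.

    With dry_run=True the commands are only printed. With dry_run=False
    each command is executed, and the first failing step raises
    subprocess.CalledProcessError (assuming the tool signals failure
    through its exit status).
    """
    handled = []
    for command in PIPELINE:
        line = " ".join(command)
        print(line)
        if not dry_run:
            subprocess.run(command, check=True)
        handled.append(line)
    return handled


# Print the steps without executing anything.
commands = run_pipeline(dry_run=True)
```

Calling `run_pipeline(dry_run=False)` would execute the steps for real; since the first step downloads several huge files, the dry run is the default here.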
A screencast demonstrating usage is available.
Thanks to the new statistics functionality, we can be reasonably sure that our efforts will be successful. The output of oa-cache stats pubmed shows that among 481663 supplementary materials in the PMC Open Access Subset (originally reported as 694158), at least 162395 (originally 160393) are licensed under a free license, most commonly CC BY. Furthermore, at least 3511 (originally 3322) of those are videos.
The following figure shows the MIME types of the free content the Open Access Media Importer has gathered. It was generated using oa-cache stats pubmed | plot-helper. Be aware that these are lower bounds, as many supplementary materials are not properly labeled with the URL of the corresponding license. This means that, barring a major reasoning error, the amount of free content that can be extracted will only go up.
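To illustrate why these counts are lower bounds, here is a minimal sketch of the tallying logic. It is not the code behind oa-cache stats: the record fields `license_url` and `mime_type`, the `summarize` function, and the choice of which license URLs count as free are all assumptions made for this example.

```python
from collections import Counter

# Assumption: materials whose license URL starts with one of these
# prefixes are treated as free. "by/" does not match "by-nc/" or "by-nd/".
FREE_LICENSE_PREFIXES = (
    "http://creativecommons.org/licenses/by/",
    "http://creativecommons.org/licenses/by-sa/",
    "http://creativecommons.org/publicdomain/zero/",
)


def summarize(materials):
    """Return (free materials per MIME type, number of unlabeled materials)."""
    free_by_mime = Counter()
    unlabeled = 0
    for material in materials:
        url = material.get("license_url")
        if not url:
            # No license URL: cannot be counted as free, even if it is.
            unlabeled += 1
        elif url.startswith(FREE_LICENSE_PREFIXES):
            free_by_mime[material["mime_type"]] += 1
    return free_by_mime, unlabeled


# Hypothetical records, for illustration only.
sample = [
    {"mime_type": "video/quicktime",
     "license_url": "http://creativecommons.org/licenses/by/2.0/"},
    {"mime_type": "application/pdf",
     "license_url": "http://creativecommons.org/licenses/by-nc/3.0/"},
    {"mime_type": "video/mpeg", "license_url": None},
]

free_by_mime, unlabeled = summarize(sample)
print(dict(free_by_mime), unlabeled)
```

Every material that lands in `unlabeled` is left out of the free count, so labeling more of them correctly can only raise the totals, never lower them.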
Update: While working on the licensing statistics code, I noticed that I had overestimated the number of supplementary materials by roughly 200000, because I counted newlines instead of records in a CSV file. The amount of free content, however, was underestimated as predicted: around 2000 more papers and around 200 more videos than originally thought are under free licenses.
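A minimal sketch of how counting newlines can overestimate the number of CSV records. It assumes the extra newlines sit inside quoted fields, which is one plausible cause but not necessarily what happened in this case; the sample data and function names are made up for illustration.

```python
import csv
import io

# Hypothetical CSV: one header row and two records. The caption of the
# first record contains a newline inside a quoted field.
SAMPLE = 'id,caption\n1,"Line one\nLine two"\n2,"Single line"\n'


def count_newlines(text):
    """Naive count: every newline character is taken for one record."""
    return text.count("\n")


def count_records(text, has_header=True):
    """Count rows as parsed by the csv module, which honors quoting."""
    reader = csv.reader(io.StringIO(text, newline=""))
    rows = sum(1 for _ in reader)
    return rows - 1 if has_header and rows else rows


print(count_newlines(SAMPLE), count_records(SAMPLE))
```

Here the naive count includes both the header line and the newline embedded in the caption, while the parser-based count reports only the actual records.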
Update (2): A rewrite of the Open Access Media Importer backend to use SQLite suggests there was a major reasoning error after all: apparently, many supplementary materials were referred to twice. I had not anticipated that and had not checked for duplicates. In the interest of not exaggerating, I have updated the post to contain the new lower-bound values.
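A minimal sketch of the kind of duplicate check that SQLite makes easy. The table name `supplementary_material` and its columns are hypothetical and do not reflect the actual schema of the Open Access Media Importer backend.

```python
import sqlite3


def count_materials(rows):
    """Return (total rows, distinct rows) for (article_id, url) pairs."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE supplementary_material (article_id TEXT, url TEXT)"
    )
    conn.executemany(
        "INSERT INTO supplementary_material VALUES (?, ?)", rows
    )
    total = conn.execute(
        "SELECT COUNT(*) FROM supplementary_material"
    ).fetchone()[0]
    distinct = conn.execute(
        "SELECT COUNT(*) FROM "
        "(SELECT DISTINCT article_id, url FROM supplementary_material)"
    ).fetchone()[0]
    conn.close()
    return total, distinct


# Hypothetical data: the first material is referred to twice.
rows = [
    ("PMC0000001", "http://example.org/movie1.mov"),
    ("PMC0000001", "http://example.org/movie1.mov"),
    ("PMC0000002", "http://example.org/movie2.avi"),
]

total, distinct = count_materials(rows)
print(total, distinct)
```

If `total` is larger than `distinct`, some materials were counted more than once. A `UNIQUE` constraint on the two columns would be another way to keep such duplicates out at insert time.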