Open Access Media Importer: Usage and Statistics

A few days ago, at Wikimedia's free culture brunch, I met up with Daniel Mietchen. He will talk about the Open Access Media Importer at Wikimania. Watching another person try to use the software turned out to be an interesting experience: we found new bugs, and I realized how necessary good documentation is.

Using the Open Access Media Importer is similar to using a package manager like apt in several ways. First, there are three tools: one for downloading data (oa-get), one for local operations (oa-cache) and one for uploading data (oa-put). Second, those programs can work with different sources, but for now only the pubmed source, which corresponds to the PMC Open Access Subset, is actually useful.

Running the toolchain consists of five steps:

  1. oa-get download-metadata pubmed downloads several huge files that contain XML from the PMC Open Access Subset describing articles.
  2. oa-cache find-media pubmed looks for articles containing supplementary materials and saves a list of those.
  3. oa-get download-media pubmed filters the list of supplementary materials based on criteria like license or media type and downloads the results.
  4. oa-cache convert-media pubmed converts downloaded files to Ogg Theora + Vorbis.
  5. oa-put upload-media pubmed uploads the converted files to the MediaWiki installation given in the configuration file.
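
The five steps above can be chained from a small script. The following is only a sketch: the command names and the pubmed argument come from the list above, while the helper names build_commands and run_toolchain are made up for illustration and are not part of the Open Access Media Importer.

```python
import subprocess

# The five steps from the list above, in order, as (program, action) pairs.
STEPS = [
    ("oa-get", "download-metadata"),
    ("oa-cache", "find-media"),
    ("oa-get", "download-media"),
    ("oa-cache", "convert-media"),
    ("oa-put", "upload-media"),
]


def build_commands(source="pubmed"):
    """Return one argument list per step, e.g. ['oa-get', 'download-metadata', 'pubmed']."""
    return [[program, action, source] for program, action in STEPS]


def run_toolchain(source="pubmed", runner=subprocess.check_call):
    """Run every step in order.

    With the default runner, subprocess.check_call raises CalledProcessError
    as soon as a step exits with a non-zero status, so later steps are not
    started on top of a failed one.
    """
    for command in build_commands(source):
        print("running:", " ".join(command))
        runner(command)
```

Calling run_toolchain() would execute all five commands, assuming the three tools are on the PATH. Passing a different runner, for example runner=print, gives a dry run that only lists the commands.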

A screencast demonstrating usage is available.

Thanks to the new statistics functionality, we can be reasonably sure that our efforts will be successful. The output of oa-cache stats pubmed shows that among 231146 supplementary materials in the PMC Open Access Subset (revised from 694158, then 481663), at least 78005 are licensed under a free license (revised from 160393, then 162395), most commonly CC BY. Furthermore, at least 2273 of those are videos (revised from 3322, then 3511).
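
To put the current figures in proportion, here is a small worked calculation. It uses only the latest numbers quoted above (231146, 78005 and 2273); the helper name share is made up for illustration.

```python
# Current lower-bound figures quoted in the post.
total_materials = 231146
free_materials = 78005
free_videos = 2273


def share(part, whole):
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


free_share = share(free_materials, total_materials)
video_share = share(free_videos, free_materials)

print(f"freely licensed: {free_share:.1f}% of all supplementary materials")
print(f"videos: {video_share:.1f}% of the freely licensed materials")
```

So roughly a third of the counted supplementary materials are known to be freely licensed, and about 3% of those are videos.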

The following figure shows the MIME types of the free content the Open Access Media Importer has gathered. It was generated using oa-cache stats pubmed | plot-helper. Be aware that these are lower bounds, as many supplementary materials are not properly labeled with the URL of the corresponding license. This means that, barring a major reasoning error, the amount of free content that can be extracted will only go up.

Update: When working on the licensing statistics code, I noticed that I had overestimated the number of supplementary materials by roughly 200000 because I counted newlines instead of records in a CSV file. The amount of free content, however, was underestimated as predicted; around 2000 more papers and around 200 more videos than originally thought are under free licenses.
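
One way such a discrepancy can arise is a quoted CSV field that contains a line break. The post does not say whether that was the exact cause here, so treat the following as an illustrative sketch with made-up sample data, not a reconstruction of the actual bug.

```python
import csv
import io

# Hypothetical sample: a header plus three records. The second record has a
# quoted field that spans two physical lines.
sample = (
    "article,label\n"
    '"article-a","Video S1"\n'
    '"article-b","Supplementary\nmovie 2"\n'
    '"article-c","Audio S3"\n'
)


def count_newlines(text):
    """Naive count: every newline character, including those inside fields."""
    return text.count("\n")


def count_records(text):
    """Count parsed CSV records, skipping the header row."""
    reader = csv.reader(io.StringIO(text))
    next(reader, None)  # skip the header
    return sum(1 for _ in reader)


print("newlines:", count_newlines(sample))
print("records: ", count_records(sample))
```

The naive count also includes the header line and the line break inside the quoted field, so it comes out higher than the number of records the csv module actually parses.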

Update (2): A rewrite of the Open Access Media Importer backend to use SQLite suggests there was a major reasoning error: Apparently, many supplementary materials were referred to twice; I had not anticipated that and not checked for duplicates. In the interest of not exaggerating, I have updated the post to contain the new lower-bound values.
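
A minimal sketch of how an SQLite backend can guard against counting the same material twice. The table and column names, the sample rows and the helper store_materials are assumptions made for illustration; they are not the importer's actual schema.

```python
import sqlite3


def store_materials(rows):
    """Insert (article, url) pairs into an in-memory database, skipping duplicates."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE material ("
        " article TEXT NOT NULL,"
        " url TEXT NOT NULL,"
        " UNIQUE (article, url))"
    )
    # INSERT OR IGNORE silently drops rows that would violate the UNIQUE constraint.
    conn.executemany(
        "INSERT OR IGNORE INTO material (article, url) VALUES (?, ?)", rows
    )
    conn.commit()
    return conn


# Hypothetical input in which one supplementary material is referred to twice.
references = [
    ("article-a", "https://example.org/a/movie1.avi"),
    ("article-a", "https://example.org/a/movie1.avi"),
    ("article-b", "https://example.org/b/audio1.wav"),
]

conn = store_materials(references)
(unique_count,) = conn.execute("SELECT COUNT(*) FROM material").fetchone()
print(len(references), "references,", unique_count, "unique materials")
```

Letting the database enforce uniqueness means the statistics are computed over distinct materials even when the input lists the same file more than once.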

Plot of MIME types of Supplementary Materials in the PubMed Open Access Subset


4 Responses to Open Access Media Importer: Usage and Statistics

  1. Chris Maloney says:

    Hi, Nils,

    This looks great – thanks for these detailed step-by-step instructions. I’m looking forward to trying this out very soon.

    I have a couple of questions/suggestions, some of which I’ve mentioned before.

    • I wonder if you could open up the issues feature on the github site?

    • Could you either accept my pull request, or let me know why you won’t? It only has some very simple, benign changes. I rewrote the README file to add a “quick start” section, for example.

    • Could you please stop referring to PMC as PubMed? They are not the same. That is another change that I made in my pull request, that, by now, might be hard to merge. If you want, I can redo it based on the latest commit in your master branch.

    • The documentation for this project is still at, and it looks more like a very preliminary sketch of design requirements. Could we start an official documentation wiki somewhere with a nice, concise, title and URL. How about the OKFN wiki, a page at

    Thanks again for this great tool!

  2. Oh sorry, I just procrastinated and forgot that pull request. Your contributions are good! Can you rebase it on my latest commits?

  3. Chris Maloney says:


  4. Ha, referring to PMC as PubMed just bit me hard when querying eutils. Seems PMC IDs and PubMed IDs are different namespaces.
