Since the last post, I rewrote the Open Access Media Importer to use a proper database – SQLite – instead of CSV text files. This step should aid both maintainability and performance; while generating the database using oa-cache find-media now takes longer than before, querying contents is considerably faster and easier. While the data can no longer be viewed using a text editor or office program, one can browse it using the SQLite Database Browser.
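To illustrate why querying is easier with SQLite, here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names, DOI and URLs are invented for this example and are not necessarily what the importer uses.

```python
import sqlite3

# Hypothetical schema and sample rows; the importer's real tables may differ.
connection = sqlite3.connect(":memory:")
connection.executescript("""
    CREATE TABLE article (doi TEXT PRIMARY KEY, title TEXT, license_url TEXT);
    CREATE TABLE supplementary_material (
        article_doi TEXT REFERENCES article(doi),
        label TEXT,
        mimetype TEXT,
        url TEXT
    );
""")
connection.execute(
    "INSERT INTO article VALUES (?, ?, ?)",
    ("10.1234/example.1", "An example article",
     "http://creativecommons.org/licenses/by/3.0/"),
)
connection.executemany(
    "INSERT INTO supplementary_material VALUES (?, ?, ?, ?)",
    [
        ("10.1234/example.1", "Video S1", "video/quicktime",
         "http://example.org/s1.mov"),
        ("10.1234/example.1", "Table S1", "application/pdf",
         "http://example.org/t1.pdf"),
    ],
)


def find_videos(connection, doi):
    """Return (label, url) pairs for the video materials of one article."""
    rows = connection.execute(
        "SELECT label, url FROM supplementary_material "
        "WHERE article_doi = ? AND mimetype LIKE 'video/%' ORDER BY label",
        (doi,),
    )
    return rows.fetchall()


videos = find_videos(connection, "10.1234/example.1")
```

With CSV files, the same lookup would mean reading and filtering every row by hand; here it is a single parameterized query.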
For demonstration and testing purposes, I built a new source that, given a list of DOIs, downloads only the metadata of the articles in question (if they can be found on PubMed Central). This makes it feasible to import media from specific articles only; this function will probably be used after the bulk import is done.
There have been two test runs of the bot on Wikimedia Commons so far; both helped to find serious oversights. It turned out that new accounts on Wikimedia Commons can upload material immediately, but may edit pages only after four days or after solving a CAPTCHA. Because of this, the first uploads did not have any metadata.
Furthermore, oa-put upload-media neither had rate limiting nor checked whether a video had already been uploaded. Both are now handled: oa-put simply sleeps 10 seconds between uploads, and duplicates are detected by querying for pages containing the article title, supplementary material label and supplementary material caption.
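The upload loop can be sketched roughly as follows. This is not the actual oa-put code: search_pages and upload are hypothetical callables standing in for the wiki queries, and the demo at the bottom uses in-memory fakes instead of Wikimedia Commons.

```python
import time

UPLOAD_DELAY_SECONDS = 10  # the delay between uploads mentioned above


def is_duplicate(search_pages, title, label, caption):
    """True if an existing page text contains title, label and caption."""
    # search_pages is a hypothetical callable returning page texts for a query.
    return any(
        title in text and label in text and caption in text
        for text in search_pages(title)
    )


def upload_all(materials, search_pages, upload, sleep=time.sleep):
    """Upload every material that is not yet present, pausing after each."""
    uploaded = []
    for material in materials:
        if is_duplicate(search_pages, material["title"],
                        material["label"], material["caption"]):
            continue  # already on the wiki, skip it
        upload(material)
        uploaded.append(material)
        sleep(UPLOAD_DELAY_SECONDS)
    return uploaded


# Demo with fakes: one material already exists, one is new.
existing_pages = ["An example article. Video S1. A cell dividing."]
materials = [
    {"title": "An example article", "label": "Video S1",
     "caption": "A cell dividing."},
    {"title": "An example article", "label": "Video S2",
     "caption": "A cell migrating."},
]
sent, slept = [], []
uploaded = upload_all(materials, lambda query: existing_pages,
                      upload=sent.append, sleep=slept.append)
```

Passing sleep as a parameter keeps the sketch testable without actually waiting 10 seconds per upload.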
Several people suggested querying for the hash of a media file before uploading it. To understand why this is not appropriate for our use case, one has to know that Ogg stream IDs are pseudorandom to simplify muxing. Due to this, encoding the same source material twice does not yield the exact same output.
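A small experiment shows the effect. The sketch below does not run a real encoder; it builds a simplified Ogg-like page whose 27-byte header follows the layout in RFC 3533, leaves the checksum at zero, and uses two fixed serial numbers as stand-ins for the values an encoder would pick pseudorandomly.

```python
import hashlib
import struct


def fake_ogg_page(serial, payload):
    """Build a simplified Ogg-like page; the CRC is not computed here."""
    header = struct.pack(
        "<4sBBqIIIB",
        b"OggS",  # capture pattern
        0,        # stream structure version
        2,        # header type: beginning of stream
        0,        # granule position
        serial,   # bitstream serial number (pseudorandom in practice)
        0,        # page sequence number
        0,        # checksum, left at zero in this sketch
        1,        # number of segments
    )
    # One segment table entry, so the payload must be shorter than 255 bytes.
    return header + bytes([len(payload)]) + payload


payload = b"identical encoded media data"
first = fake_ogg_page(0x1A2B3C4D, payload)
second = fake_ogg_page(0x5E6F7081, payload)

first_hash = hashlib.sha1(first).hexdigest()
second_hash = hashlib.sha1(second).hexdigest()
print(first_hash == second_hash)
```

The payload is byte-for-byte the same in both pages, yet the digests differ because the serial number is part of the hashed data.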
Daniel Mietchen identified two features that are still missing: First, oa-cache find-media disregards referenced media files that are not supplementary materials (that is, inline media). Second, the PMC XML does not contain MeSH terms. Since categorization is important for Wikimedia Commons, I am already working on the latter. As soon as the keyword issue is fixed, we will continue testing.
Since my original motivation for writing the SQLite backend was to get more accurate statistics, here are some graphs showing the licensing and media types of supplementary materials in the PMC Open Access Subset. Regarding licensing, readers should be aware that “None” only means that no license URL was given or could be determined from equivalent text, not that all rights are reserved.
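Counts like these are a single aggregate query each once the data is in SQLite. The sketch below assumes a hypothetical table layout and three made-up sample rows; it does not reproduce the real statistics.

```python
import sqlite3

# Hypothetical table and made-up sample rows, for illustration only.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE supplementary_material (license_url TEXT, mimetype TEXT)"
)
connection.executemany(
    "INSERT INTO supplementary_material VALUES (?, ?)",
    [
        ("http://creativecommons.org/licenses/by/3.0/", "video/mp4"),
        ("http://creativecommons.org/licenses/by/3.0/", "application/pdf"),
        (None, "video/quicktime"),  # no license URL could be determined
    ],
)

# Materials per license; a missing URL is reported as "None".
license_counts = connection.execute(
    "SELECT COALESCE(license_url, 'None'), COUNT(*) "
    "FROM supplementary_material "
    "GROUP BY license_url ORDER BY COUNT(*) DESC"
).fetchall()

# Materials per major media type, i.e. the part before the slash.
type_counts = connection.execute(
    "SELECT substr(mimetype, 1, instr(mimetype, '/') - 1) AS major, COUNT(*) "
    "FROM supplementary_material "
    "GROUP BY major ORDER BY COUNT(*) DESC"
).fetchall()
```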