More than a month ago, I promised to blog about my quest to build the Open Access Media Importer for Wikimedia Commons about once a week. Obviously, I failed. In the first few weeks, I was stuck on two problems and made no noticeable progress. When I resolved those issues, I did not blog about it, constantly postponing it out of a then-acquired bad habit I now feel ashamed of.
Later, I became ill: for weeks I had headaches, was easily exhausted and was almost constantly sleepy (I have since been diagnosed with high blood pressure). Unable to cope, I became quite lethargic and pushed things aside with ease; when the prototype did not run on the development server due to an old version of the Python interpreter (since fixed), development stalled completely. I hereby apologize for the delay and for not bringing this up earlier.
That being said, the rest of this article deals less with personal and more with technical issues: it describes the design of the Open Access Media Importer frontend. Being able to access all elements of the envisioned scraper / transcoder / upload toolchain in a uniform way is important, since nobody likes to use unusable software. After some deliberation, I chose to model it closely on the apt-get utility of the Debian GNU/Linux distribution, coming up with three wrapper scripts named oa-get, oa-cache and oa-put.
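To illustrate what an apt-get-style frontend looks like, here is a minimal sketch of a wrapper script that takes an action as its first argument and dispatches to a handler. It is not the actual oa-get code: the handler functions, their return values and the use of argparse are assumptions made for illustration; only the script name and the two action names come from this post.

```python
import argparse


# Hypothetical handlers; the real tool would do the actual work here.
def download_metadata():
    return "would download index files"


def download_media():
    return "would download media files"


# Maps each action name given on the command line to its handler.
ACTIONS = {
    "download-metadata": download_metadata,
    "download-media": download_media,
}


def run(argv, prog="oa-get", actions=ACTIONS):
    """Parse `prog ACTION` and call the handler registered for ACTION."""
    parser = argparse.ArgumentParser(prog=prog)
    parser.add_argument("action", choices=sorted(actions))
    args = parser.parse_args(argv)
    return actions[args.action]()


if __name__ == "__main__":
    import sys
    print(run(sys.argv[1:]))
```

Because every wrapper script would share the same `tool action` shape, an unknown action yields the same kind of usage error everywhere, which is a large part of what makes the tools feel uniform.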
oa-get takes care of everything regarding downloads, acquiring metadata and media. With the simple invocation oa-get download-metadata, it downloads index files from PubMed Central, skipping already acquired files and displaying a progress bar (screenshot). Its less complex sister invocation, oa-get download-media, can be imagined as somewhat analogous to wget -i.
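Skipping files that were already acquired can be sketched roughly as follows. This is an assumption about how such a step might work, not the importer's actual implementation: the function names, the check by local file name and the temporary ".part" suffix are all hypothetical, and the progress bar is left out.

```python
import os
import urllib.request


def fetch_bytes(url):
    """Default fetcher: return the body of `url` as bytes."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def download_missing(urls, target_dir, fetch=fetch_bytes):
    """Fetch each URL unless a file with the same name already exists.

    Returns a (downloaded, skipped) pair of URL lists.
    """
    os.makedirs(target_dir, exist_ok=True)
    downloaded, skipped = [], []
    for url in urls:
        filename = url.rstrip("/").rsplit("/", 1)[-1]
        path = os.path.join(target_dir, filename)
        if os.path.exists(path):
            skipped.append(url)
            continue
        # Write to a temporary name first, so that an interrupted
        # download is not mistaken for a complete file on the next run.
        partial = path + ".part"
        with open(partial, "wb") as handle:
            handle.write(fetch(url))
        os.replace(partial, path)
        downloaded.append(url)
    return downloaded, skipped
```

Running the function twice with the same list of URLs fetches everything on the first run and nothing on the second, which is the behaviour that makes repeated invocations cheap.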
oa-cache is the complementary tool for any activity that does not need network connectivity. It is able to find suitable supplementary materials, writing their URLs and any available metadata to a CSV file, and it writes an additional file identifying articles that have no supplementary materials or only non-audiovisual ones. Known-useless files can thus be skipped on subsequent runs (screenshot); since many articles do not contain any usable media, this speeds up processing tremendously.
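The idea of remembering known-useless articles can be sketched as follows. The data layout is purely an assumption for illustration: articles are modelled as dictionaries with an "id" and a list of "materials", audiovisual material is recognised by its MIME type prefix, and the CSV column names are invented. The real oa-cache may differ in all of these respects.

```python
import csv

# Assumption: audiovisual materials are identified by MIME type prefix.
AUDIOVISUAL_PREFIXES = ("audio/", "video/")


def find_media(articles, known_useless):
    """Return (rows, useless) for the given articles.

    `rows` holds one (article id, URL, MIME type) tuple per usable
    material; `useless` is the updated set of ids of articles without
    audiovisual materials. Ids in `known_useless` are not inspected again.
    """
    rows = []
    useless = set(known_useless)
    for article in articles:
        if article["id"] in useless:
            continue  # already known to contain nothing usable
        media = [
            material for material in article["materials"]
            if material["mimetype"].startswith(AUDIOVISUAL_PREFIXES)
        ]
        if not media:
            useless.add(article["id"])
        for material in media:
            rows.append((article["id"], material["url"], material["mimetype"]))
    return rows, useless


def write_csv(rows, handle):
    """Write the found materials to an open text file as CSV."""
    writer = csv.writer(handle)
    writer.writerow(["article", "url", "mimetype"])
    writer.writerows(rows)
```

Persisting the returned set of ids between runs (for instance, one id per line in a text file) is what lets a later run skip those articles without looking at them again.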
oa-put's purpose will be upload activities for Wikimedia Commons. Unlike the other tools, it currently cannot do anything. Like the others, it will be usable both manually and in shell scripts in a consistent manner.
Stay tuned: The next post will outline how you can write your own plugin for the Open Access Media Importer, extending the functionality of oa-get download-metadata and oa-cache find-media beyond accessing PubMed Central. If you are impatient, you can follow the project on GitHub in the meantime.