Last Friday, I visited Wikimedia Deutschland to give a presentation on the current state of the Open Access Media Importer (slides), which is funded by WMDE. Attending were project manager Nicole Ebber and senior software developer Daniel Kinzler; Daniel Mietchen participated remotely via Skype.
The modular software architecture and the resulting workflow, adapted from the original proposal, were quickly grasped, possibly because they were inspired by the Debian package manager apt. When discussing details, however, I realized that the documentation should explicitly mention that, despite the size of the PMC Open Access Subset, the OAMI can also be run with modest computing resources by focusing on particular documents.
It is unclear where the software will run once the initial bulk import is done. According to Daniel Kinzler, it is not particularly well suited to the Wikimedia Toolserver: extracting metadata from archives and converting media are both processing-intensive tasks, while downloading and uploading are I/O-bound; additionally, a considerable amount of storage is required. Since a demonstration was not part of the visit, Daniel Kinzler now has an account on our development server to test and possibly review parts of the OAMI as soon as he finds the time.
On the feature side, many incremental improvements were made: besides bugfixes, there is now support for audio (which is converted to Vorbis), and the new oa-cache subcommand update-mimetypes downloads the first 4 kB of each supplementary materials file and detects its MIME type via magic. While slow, this step is useful because some of the files in the PMC Open Access Subset are listed with the wrong type. For example, in the machine-readable version of this paper, all videos are reported as being plain text.
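To illustrate the idea behind update-mimetypes, here is a minimal sketch of fetching only the head of a file and guessing its type from the leading bytes. It is not the OAMI code: the function names are made up for this example, and the small hand-written signature table is a stand-in for magic so that the snippet runs with nothing but the Python standard library. It also assumes the server honours HTTP Range requests.

```python
import urllib.request

HEAD_BYTES = 4096  # "the first 4 kB"


def build_head_request(url, size=HEAD_BYTES):
    """Build a request for only the first `size` bytes of `url` (hypothetical helper)."""
    return urllib.request.Request(url, headers={"Range": "bytes=0-%d" % (size - 1)})


def fetch_head(url, size=HEAD_BYTES):
    """Download the head of a file; read() caps the amount if Range is ignored."""
    with urllib.request.urlopen(build_head_request(url, size)) as response:
        return response.read(size)


def sniff_mime(head):
    """Guess a MIME type from leading bytes.

    A tiny stand-in for libmagic: it knows a handful of signatures only.
    """
    if head.startswith(b"OggS"):
        return "application/ogg"
    if head[4:8] == b"ftyp":
        return "video/mp4"
    if head.startswith(b"RIFF") and head[8:12] == b"AVI ":
        return "video/x-msvideo"
    if head.startswith(b"RIFF") and head[8:12] == b"WAVE":
        return "audio/x-wav"
    if head.startswith(b"%PDF"):
        return "application/pdf"
    if head and b"\x00" not in head:
        # No known signature and no NUL bytes: treat as text.
        return "text/plain"
    return "application/octet-stream"
```

Calling sniff_mime(fetch_head(url)) on a video that is listed as plain text would then reveal the mismatch without downloading the whole file, at the cost of one request per file.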
When Daniel Mietchen came to Berlin on September 23rd, we corrected some subtle bugs in the data extraction from XML and implemented a cheap heuristic for proper (not too broad) categories: to be used in the page template, a category has to contain a space. To get the upload done in fewer API calls, I imported a patched version of python-wikitools into our repository. Since python-wikitools was the only dependency that is not in Debian, installation is now possible with a simple git clone https://github.com/erlehmann/open-access-media-importer.git.
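The category heuristic is simple enough to sketch in a few lines. The function names and the sample keywords below are invented for illustration and are not taken from the OAMI source; the only rule carried over from the text is that a category must contain a space.

```python
def is_specific_category(name):
    """Cheap heuristic: multi-word names are assumed to be narrow enough."""
    return " " in name.strip()


def select_categories(candidates):
    """Keep only the candidates that pass the heuristic, with whitespace trimmed."""
    return [c.strip() for c in candidates if is_specific_category(c)]


# Hypothetical keywords as they might be extracted from an article's XML.
keywords = ["Humans", "Magnetic resonance imaging", " Animals ", "Cell movement"]
print(select_categories(keywords))
```

Single-word terms such as "Humans" are dropped as too broad, while multi-word terms survive. The rule will occasionally reject a good one-word category, which is the price of keeping it this cheap.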