Until now, the amount of appropriately licensed data the Open Access Media Importer found was limited by a notable idiosyncrasy: The most straightforward way of providing licensing information is a URL that can easily be checked against a whitelist of known free licenses. Most publishers, though, put human-readable statements into their otherwise machine-readable XML. To handle this case, I collected those statements and mapped them to license URLs where appropriate.
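To make the idea concrete, here is a minimal sketch of such a lookup. The whitelist entries, the example statement, and the function names are assumptions made up for illustration; they are not the importer's actual data or code.

```python
import re

# Hypothetical whitelist of free license URLs (illustrative entries only).
FREE_LICENSE_URLS = {
    "http://creativecommons.org/licenses/by/2.0/",
    "http://creativecommons.org/licenses/by/3.0/",
}

# Hypothetical table of collected human-readable statements, stored in
# normalized form (lower case, single spaces) and mapped to license URLs.
STATEMENT_TO_URL = {
    "this work is licensed under a creative commons attribution 3.0 license.":
        "http://creativecommons.org/licenses/by/3.0/",
}


def normalize(text):
    """Collapse whitespace and lower-case a statement so trivial variants match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def license_url(license_text):
    """Return a whitelisted license URL for the given text, or None."""
    candidate = license_text.strip()
    # Easy case: the publisher supplied a URL that is on the whitelist.
    if candidate in FREE_LICENSE_URLS:
        return candidate
    # Otherwise look the statement up in the table of known statements.
    url = STATEMENT_TO_URL.get(normalize(license_text))
    return url if url in FREE_LICENSE_URLS else None
```

Anything that is neither a whitelisted URL nor a known statement yields None, so unknown licensing text is never treated as free by accident.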
In other news, the Open Access Media Importer suite gained two new commands:
- oa-cache clear-metadata deletes the cache of found supplementary materials. It should be run if a new version of oa-cache recognizes metadata (for example, licenses) better than an old one.
- oa-cache clear-media deletes all converted media files. It may be useful if a disk is filling up or one wants to try a new set of conversion parameters.
Meanwhile, the oa-get sub-commands metadata and media were renamed to be more self-describing; they are now download-metadata and download-media.
Also, oa-cache convert-media finally works; media conversion to Theora (and optionally, Vorbis) is done using the GStreamer framework. It provides a flow-based approach to multimedia programming: First, one selects elements suitable for the task at hand – for example, a file source or a media format decoder – then one creates a pipeline by connecting those elements to each other. Since GStreamer has lots of elements for decoding, encoding, compositing, streaming and so on, almost anything is possible by writing a pipeline; Wikipedia has an in-depth explanation.
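As a rough illustration of the flow-based approach, the sketch below builds a textual pipeline description by chaining elements with "!". It assumes GStreamer 1.0 element names (videoconvert in particular) and a hypothetical helper name; the pipeline oa-cache actually uses may look different.

```python
def pipeline_description(source_path, target_path):
    """Build a video-only Theora transcoding pipeline in gst-launch syntax."""
    elements = [
        'filesrc location="%s"' % source_path,   # read the input file
        "decodebin",                             # pick a suitable decoder
        "videoconvert",                          # convert to a format theoraenc accepts
        "theoraenc",                             # encode video as Theora
        "oggmux",                                # wrap the stream in an Ogg container
        'filesink location="%s"' % target_path,  # write the output file
    ]
    # "!" links the source pad of one element to the sink pad of the next.
    return " ! ".join(elements)


print(pipeline_description("input.mov", "output.ogv"))
```

A description like this can be handed to the gst-launch-1.0 command line tool or to Gst.parse_launch() in the Python bindings.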
One issue I encountered was pipelines stalling when audio data was present, which left oa-cache convert-media pubmed hanging instead of encoding media files. I first (erroneously) blamed the questionable quality of several GStreamer plugins, until a post on Stack Overflow alerted me to the fact that I did not properly queue, convert and resample the input data.
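The fix can be sketched as a pipeline description in which every branch coming out of the decoder is queued, and the audio branch is converted and resampled before encoding. The function name and the element names "decoder" and "muxer" are hypothetical, and GStreamer 1.0 element names are assumed.

```python
def transcode_description(source_path, target_path, with_audio=True):
    """Build an Ogg Theora/Vorbis transcoding pipeline in gst-launch syntax."""
    parts = [
        # Source and decoder; the decoder is named so branches can refer to it.
        'filesrc location="%s" ! decodebin name=decoder' % source_path,
        # Muxer and sink; the muxer is named so both branches can feed it.
        'oggmux name=muxer ! filesink location="%s"' % target_path,
        # Video branch, decoupled from the decoder and muxer by queues.
        "decoder. ! queue ! videoconvert ! theoraenc ! queue ! muxer.",
    ]
    if with_audio:
        # Audio branch: queue, convert and resample before encoding to Vorbis.
        parts.append(
            "decoder. ! queue ! audioconvert ! audioresample "
            "! vorbisenc ! queue ! muxer."
        )
    return " ".join(parts)


print(transcode_description("input.mp4", "output.ogv"))
```

The queues give each branch its own buffer and thread, so a muxer waiting for data on one branch does not block the decoder from delivering data to the other.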
A lesser problem I encountered was GStreamer being unable to reliably determine the duration of media files. This manifested itself in a progress bar stalling at 99 percent. I solved this by including an initially silent progressreport element in the pipeline, which gets un-muted as soon as the playback position has a bogus value (for example, greater than the assumed duration or negative). I also found one file that caused a hang at 99 percent, and I have not been able to determine why.
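The check for a bogus position boils down to a small predicate. This is a sketch with hypothetical function names, taking position and duration as plain numbers in the same unit; querying them from the pipeline and toggling the progressreport element are left out.

```python
def position_is_bogus(position, duration):
    """Return True if the position is negative or beyond the assumed duration."""
    return position < 0 or position > duration


def progress_percent(position, duration):
    """Return progress as an integer percentage, or None if it cannot be trusted."""
    if duration <= 0 or position_is_bogus(position, duration):
        # The caller would fall back to the progressreport element here.
        return None
    return int(100 * position / duration)
```

Returning None instead of clamping the value to 99 or 100 makes the unreliable case explicit, so the caller can decide when to switch to the fallback.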
Metadata is written to video files in a format called Vorbis comments. Though several modules exist for this, most have subtle issues; I used mutagen after neither tagpy nor pyogg was able to write to Ogg files containing only a Theora (video) stream, both complaining about a missing Vorbis (audio) stream. To add insult to injury, tagpy tries to infer the file type from the file name and fails if the extension is oga or ogv.
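For illustration, writing Vorbis comments to a Theora-only Ogg file with mutagen can look roughly like this. The helper names and the metadata keys in the usage line are assumptions; only mutagen.oggtheora.OggTheora and its dictionary-style tag interface come from the library itself.

```python
def vorbis_comments(metadata):
    """Turn a flat dict into Vorbis comment form: lower-case keys, lists of strings."""
    return {key.lower(): [str(value)] for key, value in metadata.items()}


def tag_theora_file(path, metadata):
    """Write the given metadata as Vorbis comments to an Ogg Theora file."""
    # Imported here so that vorbis_comments() works without mutagen installed.
    from mutagen.oggtheora import OggTheora

    ogg = OggTheora(path)
    for key, values in vorbis_comments(metadata).items():
        ogg[key] = values  # each comment field holds a list of values
    ogg.save()
```

A call such as tag_theora_file("video.ogv", {"TITLE": "Example", "LICENSE": "http://creativecommons.org/licenses/by/3.0/"}) would then store both fields in the file's comment header.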
As all of this is very abstract, tomorrow I will provide a screencast of the process, possibly also demonstrating the upload functionality. In the meantime, enjoy a video from a paper about tiny parasite worms that I used as a test file – somehow all content I found until now was either parasites, flu, or eye scream.