Open Access Media Importer: Plugging in your own data source

From the beginning, the Open Access Media Importer was intended to be modular. Only with the latest patches has that feature actually landed, as a result of decoupling the interface from the application logic: data sources are now Python modules that expose a number of functions for data retrieval and refinement.

When issuing a command like oa-get metadata pubmed, the wrapper script imports the chosen module (in this case sources/pubmed.py) and calls the module’s function associated with the action (in this case, the metadata action is associated with the download_metadata function). Those functions then do their work and communicate their state back to the calling process.
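The dispatch described above can be sketched roughly as follows. This is not the actual wrapper script: the ACTION_FUNCTIONS table, the helper names resolve_action and run_action, and the assumption that the function takes no arguments are all made up for illustration. Only the pairing of the metadata action with download_metadata and the sources/pubmed.py location come from the text.

```python
import importlib

# Hypothetical action table; only the "metadata" entry is taken from the text.
ACTION_FUNCTIONS = {"metadata": "download_metadata"}


def resolve_action(module, action):
    """Look up the function a source module exposes for the given action."""
    return getattr(module, ACTION_FUNCTIONS[action])


def run_action(source_name, action, package="sources"):
    """Import e.g. sources/pubmed.py and print whatever its function yields.

    Calling the function without arguments is an assumption; the real
    functions may well expect parameters such as a target directory.
    """
    module = importlib.import_module(f"{package}.{source_name}")
    for state in resolve_action(module, action)():
        print(state)
```

With such a layout, oa-get metadata pubmed would boil down to something like run_action("pubmed", "metadata").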

However, the functions called by the wrapper scripts are unlike most “normal” functions: they are generators, functions that save their state when yielding a value and resume execution on the next call. In practical terms, that means a function does not have to return a complete result to the caller at once: it can provide information in chunks to be iterated over, be it download completion or refined metadata.
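For readers unfamiliar with generators, here is a minimal, self-contained illustration. The function download_progress and its parameters are invented for this example and do not appear in the importer; the download itself is only simulated.

```python
def download_progress(total_bytes, chunk_size):
    """Yield the completed fraction after each simulated chunk."""
    done = 0
    while done < total_bytes:
        done = min(done + chunk_size, total_bytes)
        # Execution pauses here; `done` survives until the next call.
        yield done / total_bytes


progress = download_progress(total_bytes=1000, chunk_size=400)
print(next(progress))  # runs the body up to the first yield
print(next(progress))  # resumes right after that yield

# More commonly, the caller simply iterates over the generator.
for fraction in download_progress(total_bytes=1000, chunk_size=400):
    print(f"{fraction:.0%} complete")
```

The caller never sees a finished list; it receives each value as soon as the generator produces it, which is what lets a frontend update a progress display while the work is still going on.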

To demonstrate and test the implementation, I created a dummy module that returns fake data before rewriting the code that crawls PubMed for the new interface. Compared to the old spaghetti code, the simplicity is beautiful: only two functions are needed to provide a new data source, and much of the dummy module’s content is simply the fake data it yields.
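A data source in this style might look roughly like the sketch below. It is not the actual dummy module: download_metadata is the name mentioned above, while the second function name (list_articles), the target_directory parameter, and the dictionary keys are assumptions chosen for illustration, and all values are fake.

```python
def download_metadata(target_directory):
    """Pretend to download metadata, yielding completion as a fraction."""
    for step in range(1, 5):
        yield step / 4


def list_articles(target_directory):
    """Yield one dictionary of fake, already refined metadata per article."""
    # Hypothetical field names; the real module may use different ones.
    yield {
        "title": "A fake article about malaria",
        "license-url": "http://creativecommons.org/licenses/by/2.0/",
    }
    yield {
        "title": "Another fake article",
        "license-url": "http://creativecommons.org/licenses/by-nc/2.0/",
    }


for article in list_articles("/tmp/dummy"):
    print(article["title"])
```

As in the dummy module described above, most of the sketch is the fake data itself; the logic around it is a handful of lines.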

New frontend functionality is provided by oa-cache's list-articles action: inspired by a request from Daniel Mietchen to find out which papers about malaria are licensed under a specific Creative Commons license, it returns metadata for articles as CSV (commonly known as “the format that Excel can read”). With that, the mentioned task becomes as easy as oa-cache list-articles pubmed | grep Malaria | grep 'creativecommons.org/licenses/by/'.
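For those who prefer Python to a shell pipeline, the same filtering can be sketched with the standard csv module. Since the column layout of the CSV is not given here, the sketch makes no assumption about it and, like grep, matches against every field; filter_rows and the sample rows are made up for this example.

```python
import csv
import io


def filter_rows(lines, *needles):
    """Yield CSV rows in which every needle occurs in at least one field."""
    for row in csv.reader(lines):
        if all(any(needle in field for field in row) for needle in needles):
            yield row


# Made-up sample data standing in for the output of list-articles.
sample = io.StringIO(
    "Malaria in the tropics,http://creativecommons.org/licenses/by/2.0/\n"
    "Malaria vaccines,http://creativecommons.org/licenses/by-nc/2.0/\n"
    "Influenza,http://creativecommons.org/licenses/by/2.0/\n"
)

for row in filter_rows(sample, "Malaria", "creativecommons.org/licenses/by/"):
    print(row)
```

Unlike grep, which matches whole lines, this checks individual fields, so a field containing a quoted comma or newline is handled correctly. To filter real output, sys.stdin could be passed in place of the sample.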

Implementing the list-articles action helped me to iron out many corner cases in parsing the data returned by PubMed Central and laid the groundwork for the reworked find-media action that extracts supplementary materials. The next article will concern itself with downloading, converting and uploading media files from supplementary materials.


2 Responses to Open Access Media Importer: Plugging in your own data source

  1. Do you see an easy way to store the large CSV on the server in an accessible manner, so that users can then perform the “grep X” parts for any X of their choice (perhaps even a text file with one X per line) and export the results, without having to download the whole OA set themselves?

  2. Given that I do not actually know how big the CSV file will become in the future (I have never done a full run through the entire data set), I see one easy possibility: providing that file read-only on a shell account on some computer that has the GNU coreutils installed. In the end, it will probably be easiest to periodically regenerate the file, then make a gzip archive of it and host that on a web server.
