Open Access Media Importer: Almost works as advertised!

This month, after I had written the upload functionality (oa-cache upload-media), my laptop’s SSD died. Since, by habit, I only push feature-complete commits, I lost the entire upload routine. Additionally, my backup from one month earlier was corrupted – and the backup from January did not contain any emails regarding the project.

Besides working on uploads and restoring backups, I spent some of my time collecting plain-text licensing statements and assigning proper license URLs to them. It is frustrating to see publishers giving useless or inconsistent licensing information like the following:

  • This work is licensed under a Creative Commons Attribution 3.0 License (by-nc 3.0)
  • This document may be redistributed and reused, subject to certain conditions .
  • creative commons
  • Open Access
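
To illustrate the problem, here is a minimal Python sketch of how such statements could be mapped to license URLs. It is not the importer's actual code: the function name `license_url`, the two regular expressions, and the small name table are assumptions made for this example. A statement only gets a URL if it makes exactly one unambiguous claim; anything vague or self-contradictory yields `None` and would need manual review.

```python
import re

# Hypothetical table: license names as they appear in prose -> CC license code.
NAME_TO_CODE = {
    "attribution": "by",
    "attribution-noncommercial": "by-nc",
    "attribution-sharealike": "by-sa",
}

# "Creative Commons <name> <version>", e.g. "creative commons attribution 3.0"
NAME_RE = re.compile(r"creative commons ([a-z-]+) (\d\.\d)")
# Parenthesised short code, e.g. "(by-nc 3.0)"
CODE_RE = re.compile(r"\((by(?:-[a-z]{2})*)\s+(\d\.\d)\)")


def license_url(statement):
    """Return a Creative Commons URL, or None if the statement is unusable."""
    # Normalise case and whitespace before matching.
    text = " ".join(statement.lower().split())
    claims = set()

    name_match = NAME_RE.search(text)
    if name_match and name_match.group(1) in NAME_TO_CODE:
        claims.add((NAME_TO_CODE[name_match.group(1)], name_match.group(2)))

    code_match = CODE_RE.search(text)
    if code_match:
        claims.add((code_match.group(1), code_match.group(2)))

    # No claim at all, or two claims that disagree: refuse to guess.
    if len(claims) != 1:
        return None
    code, version = claims.pop()
    return "https://creativecommons.org/licenses/%s/%s/" % (code, version)
```

With this sketch, the first statement in the list is rejected because "Attribution" and "by-nc" name two different licenses, and "creative commons" or "Open Access" are rejected because they name no specific license at all.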

Recently, PubMed Central seems to dislike the user agent string of Python’s urllib2, answering with a 403 error. Doing this to hinder bots seems to be common, as Wikipedia does it too, even though changing the user agent defeats the measure.
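
Changing the user agent takes only a few lines. The sketch below uses Python 3's `urllib.request`, the successor to urllib2, and is an illustration rather than the importer's actual code; the helper names `build_request` and `fetch` and any user agent string passed to them are assumptions made for this example.

```python
import urllib.request


def build_request(url, user_agent):
    """Return a Request that sends user_agent instead of the library default."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent, timeout=30):
    """Download url with a custom user agent and return the response body as bytes."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

A descriptive string that names the tool and gives a way to contact its operator is probably the more polite choice than imitating a browser.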

As the software is almost finished, Raphael has ordered a server; we will be starting the batch import next week. I am certain this will uncover quite a few bugs in the crawler. I will not be at the Hackathon this coming weekend, but I’ll write a short post that may help people extend the Open Access Media Importer. Participants who want to extend the toolchain should look at the dummy and pubmed modules.

Update: A screencast detailing usage of the toolchain is available. It should be played back with ttyrec (the JavaScript player breaks when encountering unicode characters).


One Response to Open Access Media Importer: Almost works as advertised!

  1. Chris Maloney says:

    PMC shouldn’t be giving you 403s if all you are doing is downloading stuff from the FTP site (the OA subset), regardless of what your ua string is. Are you sure that’s all you were doing? If you’re hitting www with HTTP, though, that would be different.
