Open Access scientific literature contains, almost by definition, content suitable – both in substance and licensing – for Wikimedia Commons. However, there currently seems to be no easy, automated way to identify such files, convert them into appropriate formats, and import them into Commons.
In November 2011, Daniel Mietchen submitted a proposal tackling the issue to the WissensWert funding scheme run by the German chapter of Wikimedia. Among other projects, it was chosen to receive funding (see Daniel’s post). As part of the team implementing the envisioned software, I will blog here about once a week until the project concludes.
Initially, the project will focus on audio and video content available in PubMed Central’s Open Access Subset – however, the toolchain is intended to be modular, so other sources can be added as development continues.
The only component that currently exists is a proof-of-concept crawler/downloader: it downloads archives of XML files from PubMed Central – each archive about a GiB in size – identifies articles that refer to supplementary materials (attachments), and displays the URLs for retrieving them.
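To give an idea of what the identification step involves, here is a minimal sketch in Python. It is not the project's actual code. It assumes the article XML marks attachments with `supplementary-material` elements carrying `xlink:href` attributes (either directly or on nested elements such as `media`), as the NLM/JATS article format does; the function name and the sample document are made up for illustration. Turning the collected references into full download URLs is left out.

```python
import xml.etree.ElementTree as ET

# Clark-notation name of the xlink:href attribute as ElementTree sees it.
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"


def supplementary_hrefs(article_xml):
    """Return the href values found inside <supplementary-material> elements.

    Hypothetical helper: assumes un-namespaced element names and
    xlink:href attributes, as in NLM/JATS article XML.
    """
    root = ET.fromstring(article_xml)
    hrefs = []
    for supplement in root.iter("supplementary-material"):
        # iter() visits the element itself and all of its descendants,
        # so an href on a nested <media> element is picked up as well.
        for element in supplement.iter():
            href = element.get(XLINK_HREF)
            if href:
                hrefs.append(href)
    return hrefs


# Made-up sample article with a single attachment.
SAMPLE = """<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <body>
    <supplementary-material id="S1">
      <media xlink:href="video1.mov"/>
    </supplementary-material>
    <p>No attachment here.</p>
  </body>
</article>"""

if __name__ == "__main__":
    for href in supplementary_hrefs(SAMPLE):
        print(href)
```

An article without any `supplementary-material` element simply yields an empty list, so the same function can serve as the filter for "articles referring to supplementary materials".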
By next week, I intend to add metadata collection – at a minimum author, source and licensing terms – and downloading of supplementary materials. Raphael Wimmer also proposed an option to download only new articles, which could reduce network load by several orders of magnitude compared to the current naive implementation.
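The bookkeeping behind a "new articles only" option could look roughly like the following Python sketch. Everything in it is an assumption made for illustration: the function names, the JSON state file, and the use of article identifiers such as `PMC1` as keys. It only covers remembering what has been processed already; how to learn which articles exist without fetching the full archives is outside this sketch.

```python
import json
from pathlib import Path


def load_seen(path):
    """Load the set of already processed article IDs (empty if no state file)."""
    state_file = Path(path)
    if not state_file.exists():
        return set()
    return set(json.loads(state_file.read_text(encoding="utf-8")))


def save_seen(path, seen):
    """Persist the set of processed article IDs as a sorted JSON list."""
    Path(path).write_text(json.dumps(sorted(seen)), encoding="utf-8")


def select_new(article_ids, seen):
    """Return the IDs not processed before, in order, and record them in `seen`.

    Note that `seen` is modified in place.
    """
    new_ids = []
    for article_id in article_ids:
        if article_id not in seen:
            seen.add(article_id)
            new_ids.append(article_id)
    return new_ids
```

A run would call `load_seen`, pass the current list of article IDs through `select_new`, process only the result, and finish with `save_seen`, so that the next run skips everything handled before.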
In line with the principles of free culture, all tools will be released as Free Software, licensed under the GNU General Public License, version 3 (or any later version of the License published by the Free Software Foundation).