Arcas: Using Python to access open research literature

Collecting literature is a crucial step in any study. It places the research in its historical context and shows how it differs from, or is original relative to, what others have done. Not many years ago, assembling sources was not an easy task. A major part of the problem was overcome with the creation of scholarly databases and collections that live on the web.

The beauty of programming is that it allows repeated tasks, such as harvesting the web for data, to be automated; this is how web scraping, or web data extraction, came about. Similarly, an automated process for scraping scholarly databases would greatly benefit researchers.

This has inspired the development of an open source library called Arcas, which allows academic articles to be scraped using open access APIs such as those of PLOS, IEEE and arXiv.
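To illustrate what querying one such API involves, here is a minimal sketch against the public arXiv API, using only the Python standard library. It is not Arcas's own code or interface: the function names `build_query_url`, `parse_feed` and `fetch_articles` are hypothetical, and the choice of fields (title, publication date, authors) is an assumption made for the example.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Public arXiv API endpoint; it returns results as an Atom feed.
ARXIV_API = "http://export.arxiv.org/api/query"
ATOM = "{http://www.w3.org/2005/Atom}"  # Atom XML namespace prefix


def build_query_url(keyword, start=0, max_results=5):
    """Build an arXiv query URL that searches all fields for `keyword`."""
    params = {
        "search_query": f"all:{keyword}",
        "start": start,
        "max_results": max_results,
    }
    return f"{ARXIV_API}?{urllib.parse.urlencode(params)}"


def parse_feed(xml_text):
    """Turn an Atom feed into a list of dictionaries, one per article."""
    root = ET.fromstring(xml_text)
    articles = []
    for entry in root.iter(f"{ATOM}entry"):
        title = entry.findtext(f"{ATOM}title", default="")
        articles.append({
            # Collapse line breaks and repeated spaces inside titles.
            "title": " ".join(title.split()),
            "published": entry.findtext(f"{ATOM}published", default=""),
            "authors": [
                author.findtext(f"{ATOM}name", default="")
                for author in entry.findall(f"{ATOM}author")
            ],
        })
    return articles


def fetch_articles(keyword, max_results=5):
    """Query the API over the network and return the parsed articles."""
    url = build_query_url(keyword, max_results=max_results)
    with urllib.request.urlopen(url) as response:
        return parse_feed(response.read().decode("utf-8"))
```

Calling `fetch_articles("game theory")` would need network access. Keeping URL construction and parsing in separate functions means each can be checked offline, which helps with reproducibility.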

My proposal for EuroSciPy 2017 is to introduce the Arcas library and to show how scraping online APIs was implemented in a sustainable and reproducible manner using Python.

Furthermore, to test the abilities of the library, a data set was collected on a specific topic. I also propose to give a very brief analysis of this data set using both supervised and unsupervised machine learning algorithms.