101 Data Harvesting

Building out a fully-fledged data-harvesting bot is not our purpose here; instead we will quickly demonstrate a 101 harvester in action against a financial variable. Even though simplified, the exercise is useful: the modules of work shown below stay consistent at any level of complexity.

Let’s work with FX since it is straightforward. (We’ll likely look at a more complicated example later, capturing financial records, factors and the like.)

URL Selection
Choose the base URL you want to work with simply by googling and clicking through webpages. The best choice is often the base URL that exhibits the maximum information-to-ink ratio (with a nod to Tufte); that is, the page shows the information you are interested in and little else.

Our base URL choice in this instance is a fairly arbitrary one from any number of sources. We know that we may soon be using thousands of variants of it, so we set it up in the code in a particular way –

pairs      = 'http://fxtop.com/en/historical-exchange-rates.php?A=1&C1={0}&C2={1}'
date_range = '&DD1={0}&MM1={1}&YYYY1={2}&B=1&P=&I=1&DD2={3}&MM2={4}&YYYY2={5}&btnOK=Go!'

Looking at those base URLs, you will notice {}'s dotted throughout; these are placeholders that let us slot variables in. In our case that means a pair of currencies and a time-frame.

Variable Modifications
We create a simple user interface to gather the user's selections, and these will automatically modify our base URL –

curr, curr2  = 'GBP', 'USD'
start, stop  = '01-01-2013', '15-08-2014'   # DD-MM-YYYY
# auto-modified urls
d1, m1, y1   = start.split('-')
d2, m2, y2   = stop.split('-')
pairs        = pairs.format(curr, curr2)
date_range   = date_range.format(d1, m1, y1, d2, m2, y2)
url          = pairs + date_range
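As a quick, purely illustrative sanity check using the example selections above, printing the assembled URL at this point gives –

print(url)
# http://fxtop.com/en/historical-exchange-rates.php?A=1&C1=GBP&C2=USD&DD1=01&MM1=01&YYYY1=2013&B=1&P=&I=1&DD2=15&MM2=08&YYYY2=2014&btnOK=Go!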

Extraction
So far, so set-up; now we need to read the data. Requests is an alternative library to urllib2 for opening URLs, and similarly lxml.html is an alternative to BeautifulSoup; we use it to parse what comes back.

Libraries are like tools in the workshop – you need to get familiar with many in order to be able to select the right one for the job.

Below, we set up a very simple function that takes our URL and returns a tree of html, from which we want only the table rows (the '//tr' xpath) –

import requests
import lxml.html

def sapper(url):
    'url in, tree out'
    page = requests.get(url)                   # fetch the page
    return lxml.html.fromstring(page.text)     # parse it into an html tree

tree = sapper(url)
rows = tree.xpath('//tr')                      # every table row on the page
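For comparison only, and as a nod to the alternatives mentioned above, the same step with urllib2 and BeautifulSoup might look roughly like this (a sketch, not the route we take here) –

import urllib2
from bs4 import BeautifulSoup

page = urllib2.urlopen(url).read()         # fetch the raw html
soup = BeautifulSoup(page, 'html.parser')  # parse it
rows = soup.find_all('tr')                 # every table row, as before
# note: bs4 rows expose .get_text() rather than lxml's .text_content()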

Tailoring
Looking at the webpage itself, we can see that towards its header and footer the table has rows starting with 'Date…' and 'MinMax…'.

The code below exists so that we only collect data between these two points. Specifically, the rather long line in the middle finds the positions, within our excessively long list of rows, of the rows matching the keywords we have supplied, so that the list can be clipped to the content between them.

This tailoring exercise is a very common operation in scraping.

keywords = 'Date', 'MinMax'
# positions of the rows that start with our keywords
clipped = [count for count, row in enumerate(rows)
           for word in keywords if row.text_content().startswith(word)]
# clip to the rows from the 'Date' header up to (but not including) the 'MinMax' row
rows = rows[clipped[0]:clipped[1]]
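A quick, purely illustrative check that the clip has landed where we expect –

print(len(rows))                        # how many rows survived the clip
print(rows[0].text_content()[:40])      # should start with 'Date'
print(rows[-1].text_content()[:40])     # the last data row before 'MinMax'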

A key drawback of webscraping is the amount of time spent tailoring in this fashion. On the other hand, once designed it doesn't need revisiting (at least not until the webpage is rewritten).

Webpages designed with APIs (Application Programming Interfaces) in mind (something we will look at in a separate post) often represent a much cleaner way of gathering data. And, in truth, we would never use the page above as the source of our FX data (we would want OHLC data for a start). Regardless, as a fallback you always want data-harvesting skills to rely upon – they provide huge freedom of information. If it's public data, regardless of format, you can grab, catalogue and store it in an automated and organised manner.

Output
The final stage in any scraping exercise is the question of data output and/or information storage, and there are always important design choices to be made (complex database / pickling / HDF files / excel etc.), particularly where scraping happens on a regular basis and fresh information updates the previously gathered data.

But in this case the choice is rather simple: we end up with multiple entries in rows, each holding three table cells whose .text_content() is the info we seek. Since this is a one-off we can simply save these down as a date-stamped text file and the job is complete.
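A minimal sketch of that final step (the filename and the tab-separated layout are our own illustrative choices) –

import datetime

fname = 'gbpusd_{0}.txt'.format(datetime.date.today())           # e.g. gbpusd_2014-08-15.txt
with open(fname, 'w') as f:
    for row in rows:
        cells = [cell.text_content().strip() for cell in row]   # the three cells in each row
        f.write('\t'.join(cells) + '\n')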
