Late one evening some years ago, while reading Nathan Yau’s excellent “Visualize This”, I stumbled upon this indecipherable snippet:
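import urllib2
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 import, assumed here
url = 'http://www.wunderground.com/history/airport/KBUF/2009/1/1/DailyHistory.html'
page = urllib2.urlopen(url)  # download the page (Python 2)
soup = BeautifulSoup(page)  # parse the HTML
# findAll() returns a list, which has no .text, so take a single <span> instead;
# this is the first <span> on the page, and a real scraper needs a more specific selector
dayTemp = soup.find('span').text
print dayTemp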
Apparently there were Pythons involved, the soup was a parser, and we now knew the weather in Buffalo for 1 Jan 2009. Of course there is no earthly reason for us to care about Buffalonian(?) weather – but a seed had been planted.
You might also be thinking “big deal – that’s what the internet and your browser are for” – and trust me, clicking would have been a much, much faster way to discover the temperature.
Clicking doesn’t let you do this though –
import urllib2
url = 'http://www.wunderground.com/history/airport/KBUF/%s/%s/%s/DailyHistory.html'
yrs, mos, days = range(2003, 2013), range(1, 13), range(1, 32)
# one request per year/month/day combination (this includes impossible dates such as 30 February)
pages = [urllib2.urlopen(url % (yr, mo, day)) for yr in yrs for mo in mos for day in days]
With these few lines we can now grab daily data over a ten-year period. Better still, we don’t care what format that data is in: if it’s public, we can gather, catalogue and store it in an organised fashion; mindless browsing is a thing of the past. We impose our own structure on data where none existed before.
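For the curious, here is a rough Python 3 sketch of the same idea. It walks real calendar dates (so no 30th of February) and builds one record per day, ready for a scraper to fill in; nothing is actually downloaded. The names `URL_TEMPLATE`, `daily_urls` and `records` are my own inventions for illustration, and the URL pattern is simply the one quoted above, which may well no longer be served.

```python
from datetime import date, timedelta

# URL pattern copied from the post; assumed, not verified, to still exist.
URL_TEMPLATE = "http://www.wunderground.com/history/airport/KBUF/%d/%d/%d/DailyHistory.html"


def daily_urls(start, end):
    """Yield (day, url) pairs for every calendar date from start to end inclusive."""
    day = start
    while day <= end:
        yield day, URL_TEMPLATE % (day.year, day.month, day.day)
        day += timedelta(days=1)


# Hypothetical record layout: one dict per real date, to be filled in later.
records = [
    {"date": d.isoformat(), "url": u}
    for d, u in daily_urls(date(2003, 1, 1), date(2012, 12, 31))
]

print(len(records))
```

Stepping through dates with `timedelta` rather than nesting three ranges means every URL corresponds to a day that actually happened, which keeps the eventual data set tidy.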
Hmm. Now where else might that be useful?
Welcome to the world of data science.