Hopefully you’re already convinced that you can grab whatever data you wish from webpages (and if not, later examples should tip the balance in our favour), but sooner or later you are going to ask about pdf. And let’s face it, when the latest report comes out that everyone wants to be able to say they have read (not always the same as actually reading it; we will talk lots more about text summarisation later), it is invariably in pdf format…
So let’s dive in.
Something in the news caught your eye and made RBS’ covered bond prospectus of some interest, but so far you can only find it here in pdf format. As any litigator will tell you, that choice of format was no accident.
At 265 pages that’s quite the tome. Wouldn’t it be nice to type a couple of lines…
    h = Handler('./Global_Covered_Bond_Programme_Update_2013.pdf')
    raw = h.parse_handler(parse_pages)  # read in the pages
    toc = h.parse_handler(parse_toc)    # read in the table of contents
… so that you can copy raw (holds all of the text in the document) and toc (the table of contents) over to Word?
Why do we care so much about text?
For a few pounds you can buy commercial packages that convert pdf, so we confess that simply converting to text is not our true end-goal. On the other hand, the ability to handle a wide variety of data-types is of immediate import to any harvesting operation, and pdf is one such type that needs to be dealt with. (If pushed we might also confess that we do quite enjoy being able to figure these things out too.)
We care about text for the hardly-controversial reason that it provides information, which in turn informs investment. When we ask (and we have) we see that investors struggle to articulate their information strategies generally, but particularly when it comes to text. This isn’t surprising: most of us read pretty much the same news reports, fully aware that everyone else is reading the same and that our breadth of knowledge-gathering is pitifully small in the face of the terabytes thrown at us daily.
From this starting point we will launch ourselves into the world of machine-reading, text analytics & linguistics, with a series of subsequent posts, making sure to include plenty of real-world examples from the world of finance as well as taking the time to cover some of the technical aspects that need to be managed along the way.
For the Code-Hungry
I’m not going to produce all of the code here (see Credits below for why), but the operative section is as follows:
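    # Import paths assume pdfminer 20131113 or later (also pdfminer.six)
    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdfdocument import PDFDocument

    class Handler:
        def __init__(self, pdf_doc=None, password=None):
            self.pdf_doc = pdf_doc
            self.password = password

        def parse_handler(self, fn, *args):
            'opens our file for processing'
            res = None  # returned as-is if the file can't be opened or extracted
            try:
                with open(self.pdf_doc, 'rb') as fp:
                    parser = PDFParser(fp)
                    document = PDFDocument(parser, self.password)
                    if document.is_extractable:
                        # run the supplied function against the open document
                        res = fn(document, *args)
            except IOError:
                print('problem opening document')
            return res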
That code defines a handler of pdf files in the form of a class called Handler. Classes are simply production lines requiring feedstock, so when we create the handler (Line 1 in the first code block above) it needs to be supplied with a pdf and, if the file is protected, a password. In our case we called ‘h’ the handler of RBS’ Covered Bond Prospectus.
Our handler runs on pdfminer (which explains most of the code directly above), but the operative line is res = fn(document, *args): it executes whatever function you feed it against the open document, plus any extra arguments, and returns a result (if possible).
Lines 2 & 3 (code block 1) show that operation in action, using two tailored definitions to parse pages and generate tables of contents respectively.
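The two parsing functions themselves aren't shown in this post, so what follows is only a guess at the shape of one of them, not the original code. It assumes the document behaves like a pdfminer PDFDocument, whose get_outlines() method yields (level, title, dest, a, se) tuples. The max_level parameter is our own hypothetical addition, there to show how extra arguments travel through *args.

```python
def parse_toc(document, max_level=None):
    """Collect (level, title) pairs from the document outline.

    Assumes `document` behaves like a pdfminer PDFDocument, whose
    get_outlines() yields (level, title, dest, a, se) tuples.
    `max_level` is a hypothetical extra argument for illustration.
    """
    toc = []
    for level, title, dest, a, se in document.get_outlines():
        # Optionally skip entries nested deeper than max_level.
        if max_level is None or level <= max_level:
            toc.append((level, title))
    return toc
```

With this in place, h.parse_handler(parse_toc) returns every outline entry, while h.parse_handler(parse_toc, 2) keeps only the top two levels. Note that pdfminer raises PDFNoOutlines for a document with no outline; this sketch leaves that for the caller to handle.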
If you do want the code in full, just leave a line below and we’ll be happy to send it over.
Credit where credit’s due…
pdfminer was written by Yusuke Shinyama and I’ve borrowed heavily from his work as well as from Denis’ work here. If you look to those pages you will see that pdfminer is fairly low-level (there’s quite a bit of lifting you have to do yourself) but it’s also incredibly powerful. There’s much more that can be done with analysing images, layout structures and the like.
Oh, and finally, we should not omit to add that Yusuke also gave us the catchy title!