{"id":1709,"date":"2014-12-19T11:21:31","date_gmt":"2014-12-19T11:21:31","guid":{"rendered":"http:\/\/www.circadian-capital.co.uk\/?p=1709"},"modified":"2015-06-17T12:14:51","modified_gmt":"2015-06-17T12:14:51","slug":"machine-reading","status":"publish","type":"post","link":"https:\/\/circadian-capital.com\/machine-reading\/","title":{"rendered":"Machine Reading"},"content":{"rendered":"
Grabbing electronic data from public sources is all very well, but what about processing and analytics? Any data type will succumb to analysis – pictures, sounds, you name it – but for now we will stick with numbers & words. And because it is not something that investors do every day, we'll focus on the words first.

Dumb Computers

When parsing text (linguistic information, if we want to get fancy) from a data harvest, the raw material comes back as text strings. Those characters have no meaning, knowledge or information as far as the computer is concerned; they're just a visual representation of stored binary code.

Encodings Sidebar
There's an important topic here around text encoding and its handling – these errors trip all of us up from time to time, crashing our otherwise smoothly-running scripts – but we will set this rather dry theme aside for now and return to it in a later post.
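To make the sidebar concrete, here is a minimal sketch (the byte string and encodings are purely illustrative, not from the post) of the kind of decode step that crashes a script, and one defensive way around it:

# Illustrative only: turning harvested bytes into a Python string.
raw_bytes = b'Coupon: 4\xbc%'            # hypothetical scrape output; Latin-1 bytes for "4¼%"
try:
    decoded = raw_bytes.decode('utf-8')   # assuming UTF-8 raises UnicodeDecodeError here
except UnicodeDecodeError:
    decoded = raw_bytes.decode('latin-1') # fall back to a legacy encoding
    # or: raw_bytes.decode('utf-8', errors='replace') to substitute the offending bytes
print(decoded)                            # Coupon: 4¼%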
HTML –> Raw –> Text

By its very nature, parsing gets rid of the excess HTML (tags, structure, formatting and the like), but you are still left with material that can be quite raw (e.g. excess whitespace, line breaks, blank lines etc.) and needs further pre-processing.
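As a rough illustration (the sample string is ours, not from the original post), collapsing that leftover whitespace can be as simple as splitting and re-joining:

# Hypothetical parser output: tags gone, stray whitespace and blank lines left behind.
raw = "  The Notes will bear interest\n\n   at a rate of 4.22%  per annum.\n"
raw = ' '.join(raw.split())   # split on any whitespace run, re-join with single spaces
print(raw)                    # The Notes will bear interest at a rate of 4.22% per annum.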
We convert that raw string into text via tokenization (again, a fancy way to say "break into word tokens"). There are design choices about that process (e.g. should £4.22 be split four ways?) but, give or take, words exist between whitespaces.

Our first Bag of Words

So we've got our bond prospectus in usable format (handily translated of course, see PDF is Evil!), we convert it to text, and then we can run any analysis we choose. We'll use the brilliant nltk package in the code below, but there are many others to choose from – or, in truth, we could have performed it ourselves without any imports, using a .split(' ') method.

import nltk

toks = nltk.wordpunct_tokenize(raw)   # break the raw string into word and punctuation tokens
text = nltk.Text(toks)                # wrap the tokens for nltk's analysis helpers
With our bag of words in token form, let's give it one last clean: lower-casing everything, getting rid of stop-words ('i', 'we', 'up', 'in' etc.) and complicating affixes, and ensuring our stem-words sit in dictionaries (lemmatisation). There are of course variants, but this approach will suffice. Here's that all wrapped up in a quick definition –

from nltk.corpus import stopwords as sw
from nltk.stem import WordNetLemmatizer
stopwords = set(sw.words('english'))   # common function words to throw away
wnl = WordNetLemmatizer()              # maps tokens back to their dictionary head-words

def processraw(text):
    'final clean-up of a list of tokens'
    text = [t.lower() for t in text if t.lower() not in stopwords]  # lowercase, drop stop-words
    text = [wnl.lemmatize(t) for t in text]                         # wordnet lemmatising
    return [t for t in text if t.isalnum()]                         # keep alphanumeric tokens only
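As a quick usage note (variable names follow the snippets above), the definition can then be applied straight to the tokenised Text object:

clean = processraw(text)   # lower-cased, lemmatised, stop-word-free alphanumeric tokens
print(clean[:10])          # peek at the first few cleaned tokens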
Quick & Dirty Analysis

Obviously we can instantly calculate document stats (character counts, words, sentences, vocabulary, keywords etc. – see below), but that only takes you so far.
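A minimal sketch of those quick stats, building on the raw string and cleaned tokens from above (the 'keywords' here are nothing cleverer than the most frequent cleaned tokens):

n_chars = len(raw)                      # character count
n_sents = len(nltk.sent_tokenize(raw))  # sentence count (needs nltk's 'punkt' model downloaded)
n_words = len(clean)                    # word count after clean-up
vocab = len(set(clean))                 # vocabulary size, i.e. distinct words
fdist = nltk.FreqDist(clean)            # frequency distribution as a crude keyword finder
print(n_chars, n_sents, n_words, vocab)
print(fdist.most_common(10))            # the ten most common cleaned tokens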
Similarly, using the underlying text we can actively search* for the usage of certain expressions or phrases (sketched in code below):

– generating the contexts in which it is used (concordance);
– finding other words used in similar contexts;
– we can pair our similarity-words above to find common contexts; or
– we can develop n-grams, n words appearing next to one another (collocations).

*Of course we can beef up Search using Regular Expressions (Regex), but that's for another post.
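For reference, each of those lookups maps onto a helper that ships with nltk's Text class; a minimal sketch, with purely illustrative query words:

text.concordance('coupon')                    # every occurrence of 'coupon', shown in context
text.similar('coupon')                        # words that appear in similar contexts
text.common_contexts(['coupon', 'interest'])  # contexts shared by a pair of words
text.collocations()                           # frequent word pairings (bigram collocations)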
Again, you get the idea; it's a bit more useful, but not terribly dramatic. Let's change gears.