{"id":1709,"date":"2014-12-19T11:21:31","date_gmt":"2014-12-19T11:21:31","guid":{"rendered":"http:\/\/www.circadian-capital.co.uk\/?p=1709"},"modified":"2015-06-17T12:14:51","modified_gmt":"2015-06-17T12:14:51","slug":"machine-reading","status":"publish","type":"post","link":"https:\/\/circadian-capital.com\/machine-reading\/","title":{"rendered":"Machine Reading"},"content":{"rendered":"
Grabbing electronic data from public sources is all very well, but what about processing and analytics? Any data type will succumb to analysis – pictures, sounds, you name it – but for now we will stick with numbers & words. And because it is not something that investors do every day, we'll focus on the words first.

Dumb Computers

When parsing text (linguistic information, if we want to get fancy) from a data harvest, the raw material comes back as text strings. Those characters have no meaning, knowledge or information as far as the computer is concerned; they're just a visual representation of stored binary code.

Encodings Sidebar
There's an important topic here around text encoding and its handling – these errors trip all of us up from time to time, crashing our otherwise smoothly-running scripts – but we will set this rather dry theme aside for now and return to it in a later post.
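To make the sidebar concrete, here is a minimal sketch (the byte string and encodings are purely illustrative, not from the post) of the kind of decode step that crashes a script, and one defensive way around it:

# Illustrative only: turning harvested bytes into a Python string.
raw_bytes = b'Coupon: 4\xbc%'            # hypothetical scrape output; Latin-1 bytes for "4¼%"
try:
    decoded = raw_bytes.decode('utf-8')   # assuming UTF-8 raises UnicodeDecodeError here
except UnicodeDecodeError:
    decoded = raw_bytes.decode('latin-1') # fall back to a legacy encoding
    # or: raw_bytes.decode('utf-8', errors='replace') to substitute the offending bytes
print(decoded)                            # Coupon: 4¼%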
HTML –> Raw –> Text

By its very nature, parsing gets rid of the excess HTML (tags, structure, formatting and the like), but you are still left with material that can be quite raw (e.g. excess whitespace, line breaks, blank lines etc.) and needs further pre-processing.
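As a rough illustration (the sample string is ours, not from the original post), collapsing that leftover whitespace can be as simple as splitting and re-joining:

# Hypothetical parser output: tags gone, stray whitespace and blank lines left behind.
raw = "  The Notes will bear interest\n\n   at a rate of 4.22%  per annum.\n"
raw = ' '.join(raw.split())   # split on any whitespace run, re-join with single spaces
print(raw)                    # The Notes will bear interest at a rate of 4.22% per annum.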
We convert that raw string into text via tokenization (again, a fancy way to say "break into word tokens"). There are design choices about that process (e.g. should £4.22 be split four ways?) but, give or take, words exist between whitespaces.

Our first Bag of Words

So we've got our bond prospectus in usable format (handily translated of course, see PDF is Evil!), we convert it to text, and then we can run any analysis we choose. We'll use the brilliant nltk package in the code below, but there are many others to choose from – or, in truth, we could have performed it ourselves without any imports, using a .split(' ') method.

import nltk

toks = nltk.wordpunct_tokenize(raw)   # break the raw string into word and punctuation tokens
text = nltk.Text(toks)                # wrap the tokens for nltk's analysis helpers
With our bag of words in token form, let's give it one last clean: lower-casing everything, getting rid of stop-words ('i', 'we', 'up', 'in' etc.) and complicating affixes, and ensuring our stem-words sit in dictionaries (lemmatisation). There are of course variants, but this approach will suffice. Here's that all wrapped up in a quick definition –

from nltk.corpus import stopwords as sw
from nltk.stem import WordNetLemmatizer
stopwords = set(sw.words('english'))   # common function words to throw away
wnl = WordNetLemmatizer()              # maps tokens back to their dictionary head-words

def processraw(text):
    'final clean-up of a list of tokens'
    text = [t.lower() for t in text if t.lower() not in stopwords]  # lowercase, drop stop-words
    text = [wnl.lemmatize(t) for t in text]                         # wordnet lemmatising
    return [t for t in text if t.isalnum()]                         # keep alphanumeric tokens only
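As a quick usage note (variable names follow the snippets above), the definition can then be applied straight to the tokenised Text object:

clean = processraw(text)   # lower-cased, lemmatised, stop-word-free alphanumeric tokens
print(clean[:10])          # peek at the first few cleaned tokens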
Quick & Dirty Analysis

Obviously we can instantly calculate document stats (character counts, words, sentences, vocabulary, keywords etc. – see below), but that only takes you so far.
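A minimal sketch of those quick stats, building on the raw string and cleaned tokens from above (the 'keywords' here are nothing cleverer than the most frequent cleaned tokens):

n_chars = len(raw)                      # character count
n_sents = len(nltk.sent_tokenize(raw))  # sentence count (needs nltk's 'punkt' model downloaded)
n_words = len(clean)                    # word count after clean-up
vocab = len(set(clean))                 # vocabulary size, i.e. distinct words
fdist = nltk.FreqDist(clean)            # frequency distribution as a crude keyword finder
print(n_chars, n_sents, n_words, vocab)
print(fdist.most_common(10))            # the ten most common cleaned tokens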
Similarly, using the underlying text we can actively search* for the usage of certain expressions or phrases (sketched in code below):

– generating the contexts in which it is used (concordance);
– finding other words used in similar contexts;
– we can pair our similarity-words above to find common contexts; or
– we can develop n-grams, n words appearing next to one another (collocations).

*Of course we can beef up Search using Regular Expressions (Regex), but that's for another post.
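For reference, each of those lookups maps onto a helper that ships with nltk's Text class; a minimal sketch, with purely illustrative query words:

text.concordance('coupon')                    # every occurrence of 'coupon', shown in context
text.similar('coupon')                        # words that appear in similar contexts
text.common_contexts(['coupon', 'interest'])  # contexts shared by a pair of words
text.collocations()                           # frequent word pairings (bigram collocations)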
Again, you get the idea; it's a bit more useful, but not terribly dramatic. Let's change gears.