{"id":1765,"date":"2015-05-19T12:00:17","date_gmt":"2015-05-19T12:00:17","guid":{"rendered":"http:\/\/www.circadian-capital.co.uk\/?p=1765"},"modified":"2015-06-17T15:56:21","modified_gmt":"2015-06-17T15:56:21","slug":"glorious-ambiguity-context-is-all","status":"publish","type":"post","link":"https:\/\/circadian-capital.com\/glorious-ambiguity-context-is-all\/","title":{"rendered":"Glorious Ambiguity: context is all"},"content":{"rendered":"
\t\t\t
English of course not only has ambiguity, but is all the richer for it. I'm sure there's good evolutionary social science making the case for ambiguity being absolutely essential for the success of a language, but we all instinctively know why this is so. We all value occasionally not quite meaning what we say.
Latent Semantic Analysis
Sounding like a bad stomach bug, Latent Semantic Analysis (LSA) is a technique for trying to arrive at understanding from context. LSA starts by looking at the presence of words in a document (each document is treated as just a bag of words). When this is done across a bundle of different documents we see that some docs show higher representation for certain words, i.e. clustering occurs. This makes sense: chickens, pigs and sheep are much more likely to appear together, and in docs from Farmers' Weekly, than say rates, credit and reverse floaters (showing my age there!). It seems contradictory, but LSA's starting point is to ignore context / meaning / word-ordering / ambiguity etc. altogether and simply count the presence of words. From counting, concepts are inferred from the presence of groups of words.
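Before any cleverness, it helps to see the raw counting step. The sketch below is not from the post itself: it uses scikit-learn's CountVectorizer on three made-up documents, and note that it lays documents out as rows (the transpose of the words-down, documents-across picture used later).

```python
# A minimal sketch of the counting step, using scikit-learn's CountVectorizer.
# The three toy documents are invented purely for illustration.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "Chickens, pigs and sheep filled the farmyard.",
    "The pigs and chickens need shelter before winter.",
    "Rates fell and credit spreads widened on reverse floaters.",
]

vectorizer = CountVectorizer()            # lowercases and strips punctuation by default
counts = vectorizer.fit_transform(docs)   # documents x words count matrix (sparse)

print(vectorizer.get_feature_names_out())
print(counts.toarray())                   # farm words cluster in docs 0-1, finance words in doc 2
```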

The Count Matrix
So we can think of LSA as presenting itself in two directions: words (vertical) and document reference (horizontal). There are a few modifications to the counting process (we get rid of stopwords – common words such as 'and', 'it', 'the' – plus uppercase, punctuation etc.), the most important of which is TF-IDF. For any word we care about its Term Frequency (the frequency of that term/word within a document) as a measure of its in-document relevance, divided by (the Inverse bit) its Document Frequency (the fraction of documents containing the term), so that terms frequently used across lots of documents are reduced in importance. (Log counting is used, but that shouldn't trouble us here.)
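As a rough sketch of that weighting, here it is written out by hand so each piece is visible; scikit-learn's TfidfVectorizer does the same job with slightly different smoothing, and the documents and numbers below are made up.

```python
# TF-IDF written out by hand: term frequency within a document, scaled down by
# how many documents in the corpus contain the term. Toy documents only.
import math
from collections import Counter

docs = [
    "chickens pigs sheep pigs".split(),
    "pigs sheep market prices".split(),
    "rates credit reverse floaters".split(),
]

n_docs = len(docs)
doc_freq = Counter(word for doc in docs for word in set(doc))   # documents containing each term

def tf_idf(word, doc):
    tf = doc.count(word) / len(doc)              # term frequency within this document
    idf = math.log(n_docs / doc_freq[word])      # inverse document frequency (the "log counting")
    return tf * idf

print(round(tf_idf("pigs", docs[0]), 3))   # common across the corpus -> dampened (about 0.203)
print(round(tf_idf("rates", docs[2]), 3))  # appears in only one document -> boosted (about 0.275)
```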

A tall & slender Corpus (there's a reason it's not short and fat!)
In thinking about it, you will quickly come to realise that the y-axis tends to be very long: each author has the whole of the English language to choose from, and each subsequent 'book' will have lots of overlap in word choice but also much heterogeneity. Since any single document uses only a small fraction of that total vocabulary, we end up with something called a sparse matrix (lots of blanks / zeroes).

Into the Matrix: SVD
So, we've got this grid showing modified word counts (more accurately 'frequencies') by document. Now what? We will use Singular Value Decomposition to break it down. You can look up the details of SVD elsewhere, but in essence it breaks apart that grid. Today the grid captures two concepts/things at once (let's call them 'wordishness' and 'bookishness' – bear with me, not as absurd as it sounds). SVD splits these so that we have two separate grids, one capturing 'wordishness' only (words down the left; wordishness across the top) and the other 'bookishness' only (books across the top; bookishness down the left). When we multiply them together we perfectly (see below) recreate the original. Visually it looks a bit like this -
[Figure: the word-document grid factored into a 'wordishness' grid and a 'bookishness' grid]
A bit more detail
The mechanics of matrix maths mean a third grid gets slotted in between these two: a diagonal grid of scaling factors, which lets the columns of the two outer grids be kept at unit length. The proper terminology you see for my three elements – the wordy grid, the scaler and the booky grid – is U s V (the left singular vectors, the singular values and the right singular vectors).
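As a small sketch of that split (this is not the post's promised code, just an illustration with NumPy on a made-up 5-word by 3-document grid):

```python
# SVD of a toy word-by-document grid: U is the "wordy" grid, s the diagonal of
# scaling factors (singular values) and Vt the "booky" grid. Numbers are invented.
import numpy as np

A = np.array([
    [2, 1, 0],   # chickens
    [1, 2, 0],   # pigs
    [1, 1, 0],   # sheep
    [0, 0, 3],   # rates
    [0, 0, 2],   # credit
], dtype=float)  # rows = words, columns = documents

U, s, Vt = np.linalg.svd(A, full_matrices=False)

print(U.shape, s.shape, Vt.shape)      # (5, 3) (3,) (3, 3)
reconstructed = U @ np.diag(s) @ Vt
print(np.allclose(A, reconstructed))   # True: multiplying the three grids recreates the original
```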

Perfect Reconstruction?
I said 'perfectly' above, but that isn't always entirely true (and really, what would be the point?). What we actually find is that each of the columns of U (tall and thin) comes in order of decreasing importance. That is, the first column captures most of the information we need to reproduce the word-book grid; the second a bit less, and so on. Funnily enough, the same happens with the V matrix (short and wide), where each row decreases in importance. So the net effect of SVD is to reorder the signalling / important information toward the leftmost columns / uppermost rows and push the noise toward the right and bottom. And, in fact, the s matrix gives us a quantifiable measure of the decreasing importance of each column. In this way we can begin to think of these U columns and V rows as representing concept space. The U matrix tells us where our word sits within word-space, and the V matrix where our doc sits within doc-space.
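Continuing the toy grid from the sketch above, we can keep only the first couple of columns/rows and see that the broad farm-versus-finance pattern survives even though a whole dimension has been thrown away (again, all numbers here are illustrative):

```python
# Truncate the SVD of the toy word-by-document grid to k = 2 "concepts".
import numpy as np

A = np.array([
    [2, 1, 0], [1, 2, 0], [1, 1, 0],   # farm words: chickens, pigs, sheep
    [0, 0, 3], [0, 0, 2],              # finance words: rates, credit
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print(np.round(s, 2))        # singular values arrive in decreasing order of importance
print(np.round(approx, 2))   # the farm/finance block structure survives with one dimension dropped

# Rows of U[:, :k] give each word coordinates in 2-D concept space;
# columns of Vt[:k, :] do the same for each document.
print(U[:, :k].shape, Vt[:k, :].T.shape)   # (5, 2) (3, 2)
```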

Next Post
We'll follow this post up with a real code example, but we've gotten to where we wanted to be today: seeing that with not much more than counting (and understanding nothing about language per se) we can begin to improve upon bag-o-words models and see pattern and context within documents.
