We’re obviously very interested in financial disclosure. I say ‘obviously’ well aware that our purist technical-trader friends are interested in nothing but price; but let us assume that the disclosure information has a function, and leave ideas about predictive capability for another time.
Roughly stated, US companies accessing the capital markets need to file regularly with the SEC, and those filings can be found on the SEC’s Edgar system.
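To make that concrete: Edgar publishes quarterly master index files listing every filing as a pipe-delimited row (CIK|Company Name|Form Type|Date Filed|Filename). A minimal sketch of filtering such an index down to 10-Ks – the sample rows below are illustrative, with made-up accession numbers:

```python
# Sketch: filter an EDGAR master index down to 10-K entries.
# Each data row in master.idx reads: CIK|Company Name|Form Type|Date Filed|Filename
def parse_index_line(line):
    cik, name, form, date, path = line.strip().split('|')
    return {'cik': cik, 'name': name, 'form': form, 'date': date, 'path': path}

def ten_ks(lines):
    # Skip header/separator lines by requiring exactly four pipes.
    rows = (parse_index_line(l) for l in lines if l.count('|') == 4)
    return [r for r in rows if r['form'] == '10-K']

# Illustrative rows (accession numbers invented for the example):
sample = [
    "320193|Apple Inc.|10-K|2015-10-28|edgar/data/320193/0001193125-15-000001.txt",
    "320193|Apple Inc.|8-K|2015-07-22|edgar/data/320193/0001193125-15-000002.txt",
]
```

The same filter generalises, of course, to any of the several hundred form types.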
The filings are obligatory and under penalty of civil and criminal enforcement authority, so we can safely assume that every word written is considered and serves purpose. I say words, but, as you might imagine, the statements also serve as the official repository of financial and numerical data.
This ‘formalisation’ process – whether by law, or industry norm – proves to be of great assistance, here and in any other (and you’ll have seen us reference scientific research elsewhere) industry, as it shortcuts the parsing process.
To use its own words, the SEC makes this information available to all ‘in an attempt to level the playing field for all investors’.
Like most searchable databases, Edgar proves to be a fantastic resource. As with many highly technical sources, it can initially prove a little cumbersome to navigate, but the data quality is phenomenal. (Having encountered this a few times now, one presumes the initial awkwardness is a symptom of the complexity of the data indexing involved.)
Our problem, and you might be picking up a theme here from other posts, is output. We don’t just want to be able to read HTML. We want to be able to download, store and structure data for future analysis and comparison. And, generally, we want to be able to do that very quickly.
Understanding the plethora of different filing types (over 400 by my reckoning) is beyond the scope of this blog, so let’s focus on one of the most important: the 10-K (the annual report).
Setting aside all the financial data (structured information), the 10-K contains two (unstructured) narrative sections that are of greater interest to most investors –
ITEM 1A. RISK FACTORS
ITEM 7. MANAGEMENT’S DISCUSSION AND ANALYSIS OF FINANCIAL CONDITION AND RESULTS OF OPERATIONS
Between them, these provide an account of the previous year’s operations, explanations (or excuses), and views on future risks.
Let’s get going. I’m interested in the health-care sector.
Let’s take the top 40 or 50 largest health-care names globally (a list you can just grab off the internet). Our database maintains an index of all filings with the SEC (about 3,400 filings are added every month), so we query it, generating a reference to all of the 10-Ks filed by large health-care companies in 2015. (Notice I did something behind the scenes there: moving from names to tickers to SEC coding references – trivial, of course, but play with Edgar yourself and that triviality will become anything-but fairly quickly.)
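That behind-the-scenes step – names to tickers to SEC coding references (CIK numbers) – can be sketched as a pair of lookups. The mapping data here is hand-typed for illustration only (in practice you would build it from Edgar’s own company/ticker records), so treat the specific values as assumptions:

```python
# Sketch: resolve a company name to its SEC CIK code via its ticker.
# Hypothetical, hand-typed mapping data -- illustration only.
NAME_TO_TICKER = {'Johnson & Johnson': 'JNJ', 'Pfizer Inc.': 'PFE'}
TICKER_TO_CIK = {'JNJ': '0000200406', 'PFE': '0000078003'}

def to_cik(name):
    """Name -> ticker -> zero-padded CIK string; raises KeyError if unknown."""
    return TICKER_TO_CIK[NAME_TO_TICKER[name]]
```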
So far, so index. Now let’s read those records looking for (i) financial data; and (ii) textual data.
We don’t want to spend too much time here on the gathering of the financial data. Owing to its structured nature, it is a fairly freely available resource, and you can pick up downloads of financial data anywhere.
Although, as ever, those simple freeware tools with which you are all familiar are pretty basic. By comparison, each morning Matt auto-updates 350 days’ worth of Open, Hi, Lo, Close prices on 1,850 stocks across the UK, Europe and the USA in R, performing feature generation (channel breakouts, moving averages, volatility filtering and the like) as it goes, all in around six minutes (on a Mac Air).
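The feature generation mentioned there is nothing exotic. Here is a minimal sketch of two of the named features – a moving average and a Donchian-style channel breakout – written in plain Python rather than R, for consistency with the pseudo-code elsewhere in the post:

```python
# Sketch: two of the features named above, on plain lists of prices.
def moving_average(closes, n):
    """n-period simple moving average; returns len(closes) - n + 1 values."""
    return [sum(closes[i - n + 1:i + 1]) / n for i in range(n - 1, len(closes))]

def channel_breakout(highs, lows, closes, n):
    """1 if today's close exceeds the prior n-day high, -1 if it falls below
    the prior n-day low, else 0 -- a simple Donchian-style signal."""
    out = []
    for i in range(n, len(closes)):
        hi, lo = max(highs[i - n:i]), min(lows[i - n:i])
        out.append(1 if closes[i] > hi else -1 if closes[i] < lo else 0)
    return out
```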
Regardless, for completeness’ sake and for those interested in factor analysis, we do perform that data grab, picking up anything up to 50 factors per ticker. The factors we choose to gather are, of course, entirely flexible and can be tailored to any particular investor’s interests.
Since it is more forward-looking, we focused on the Risk Section, which for each entity turns out to be a bit less than half the length of an average-sized book.
We fed the machine its index of filings, and off it went reading all of those ‘books’ automatically storing as it went. And in truth, ‘reading’ is a fairly good proxy for what the machine is doing – these aren’t just ASCII characters being picked up, but words, sitting within clauses & sentences analysed for linguistic structure and context (and, subsequently, feature extraction). And now they sit on our database.
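A crude stand-in for that reading-and-storing step: split filing text into sentences and keep them keyed by filing. (The real pipeline performs proper linguistic analysis; this naive regex split is illustration only, and the `store` dict merely stands in for the database.)

```python
import re

def split_sentences(text):
    """Naive sentence splitter: break on whitespace following . ! or ?"""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

store = {}  # filing id -> list of sentences; stands in for the database

def ingest(filing_id, text):
    store[filing_id] = split_sentences(text)
```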
For an idea of scale: in this particular exercise we gathered about 650,000 words. (In truth, how can we humans be expected to keep up?)
Interested in commentary regarding blockbuster drugs? This is essentially a variant of keyword searching (and we could instead search by any keyword we chose).
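A sketch of that keyword search, assuming the same trailing-‘*’ wildcard convention as the drug-name list in the pseudo-code (the function names here are my own, not the system’s):

```python
import re

def wildcard_to_regex(pattern):
    """Turn a trailing-'*' pattern like 'humira*' into a case-insensitive
    prefix match on whole words."""
    return re.compile(r'\b' + re.escape(pattern.rstrip('*')) + r'\w*',
                      re.IGNORECASE)

def sentences_containing(sentences, pattern):
    rx = wildcard_to_regex(pattern)
    return [s for s in sentences if rx.search(s)]
```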
In any case, here’s a snippet of commentary about the world’s top 100 selling drugs –
And to give you an idea of the simplicity of operation, here is the pseudo-code –
# 1. CREATE INDEX: FILTER META DATABASE TO CONTAIN 10-Ks ONLY
tenks = (db.session.query(MoMetaData)
         .filter_by(edgar_formtype='10-K')
         .order_by(MoMetaData.edgar_acceptancedatetime))

# 2. RENDER FILINGS ASSOCIATED WITH INDEX (SNIPPING ACCORDING TO OUR LABEL REQUIREMENT)
for t in tenks:
    FilingsRender(t.enclosure, t.index_url, scrape_labels=labs)

# 3. READ IN ALL THOSE DRUG NAMES (APPENDING WILDCARDS)
with open('./100drugs.txt') as fname:
    nam = [line.strip() + '*' for line in fname]

# 4. BUILD SENTENCES: QUERY FILINGS DATABASE FOR SENTENCES CONTAINING EACH DRUG NAME
for n in nam:
    tmp = dbo.build_di_raw_sentences_containing_query(n)
We will, in due course, drop a GUI over our interface so that anyone – not just code-geeks – can easily search.
All feel a bit too easy?
The purpose of today was merely to show the harvesting machinery in action (particularly now that it’s paired with a database) and, for the time being, to focus on speed-reading (something we can build on in later posts).
Perhaps that doesn’t feel ambitious enough, but test that thought: see if you can coax the same from your browser, or, failing that, ask how long it would take your team of (well-paid) industry analysts to produce something similar.