Monday, October 13, 2014

Real Technology Agnosticism

Consider these two opinions:

"Use whatever tools will make you the most productive. If they cost a lot of money, just talk to us and explain why the tool is important. If we can afford it, we will. If we can't afford it, we will make the best compromise with other tools we can, and we'll work on affording the better tool. If it involves changing something that is a standard in our company, meaning that many people and established processes would have to change too, we're not going to adopt the change quickly -- but we will listen to evidence about why the change would be cost effective and we will be open with you about considering it."

"Use only the tools that we have decided, as a company, to use. These tools are our technology conventions. Whether you like them or not, they are the tools available to you and you will not be considered to be a 'team player' unless you find a way to get your work done with the tools we give you. If you recommend changes we will view it as a waste of your time to even contemplate tool changes, even if you have evidence of their cost-effectiveness. Changing something that is already a standard within the company is impossible unless the idea originates with senior-level employees; the more you ask about changing established policies, the more you will be viewed as uncooperative."

Which of these attitudes falls more under the banner of "technology agnosticism"? In most bureaucratic settings, the second attitude is trumpeted as a pragmatic, technology-agnostic viewpoint. But really it is an excuse to avoid dealing with the consequences of depriving talented workers of affordable, productivity-enhancing technologies (generally for political reasons, like deflecting blame with standards, and emphatically not in the name of legitimate business concerns).

If anything, it is technology dogmatic.

Meanwhile, the first attitude strives to be actually pragmatic rather than merely paying lip service to pragmatism. If a better tech tool is available and affordable: just use it. If it's not affordable: justify it with numbers and be content to wait until it's affordable. If the scope of the change is massive within the company: expect that you will need to present equally massive evidence that the change is beneficial, but also expect that we will appreciate it when you do present this kind of evidence.

That sounds much more technology agnostic to me -- not to mention less dehumanizing, more pragmatic, and fairer. And it has the added benefit that technology policies end up influenced more by an organization's technology experts than by its managers.

Sunday, March 23, 2014

From 0 to Fama/French Postgres Database Tables with Python and Emacs


In this post, I am just recording the steps that I took to go from not having PostgreSQL installed or configured on my laptop, and not having downloaded any of Kenneth French's available financial data sets, all the way to performing a very simple regression on the data in Python via native Python queries to a PostgreSQL database table.

Setup Postgres

I'm using Ubuntu 12.10, so for me, the following was needed.
sudo apt-get update
sudo apt-get install postgresql postgresql-contrib

Then I needed to do some things as the postgres administrator (named "postgres" by default).
sudo su postgres 
< enter password >
Add a new user (users are called 'roles' in Postgres).
postgres@eschaton:/home/ely$ createuser --pwprompt
Enter name of role to add: ely
Enter password for new role:
Enter it again:
Shall the new role be a superuser? (y/n) y
Create a database where the data will eventually go: 
postgres@eschaton:/home/ely/$ createdb FamaFrench
Exit back to regular user.
postgres@eschaton:/home/ely$ exit
Now we can try logging in to the new database:
ely@eschaton:~$ psql -d FamaFrench
psql (9.1.12)
Type "help" for help.


FamaFrench=#
So Postgres is up and running and we have a place to keep the data. Let's go get some data.

Getting some data.

Kenneth French provides a nice assortment of academic finance data sets at the French Data Library. One of the most commonly used files is the set of monthly US market "factor" values: a set of 3 things that are supposed to do a pretty good job of explaining where financial returns in the US market come from.

This data set is found at the link titled "Fama/French Factors," currently the first link under the downloadable files section. We can click the link and save the file locally. I saved it to "/home/ely/databases/raw_data/F-F_Research_Factors.zip" and extracted it to get "F-F_Research_Factors.txt" which we can open with any standard text editor and examine.

There is a column for dates. The factors that come after the date are the excess market return, something called "SMB" and something called "HML" -- and for good measure the risk free rate is also provided. You can read Wikipedia about why these are supposed to matter: Fama French 3-Factor Model.
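
For quick reference (this is just my paraphrase of the model, not anything stored in the file itself), the standard regression these three factors feed into looks roughly like:

    StockReturn - RiskFreeRate = alpha + b * ExcessMarket + s * SMB + h * HML + noise

where b, s, and h are the stock's estimated loadings on the three factors.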

Also note that way at the bottom of the file, there is a bit of a gap and then the same sort of data is repeated on a yearly basis -- the yearly factor data. For our purposes, let's ignore this. So I just copied all of the monthly data and pasted it into a new file, "ff_cleaned_monthly.csv" -- you can see where I'm going: I want to turn the file into a suitable .csv file.

The first thing I notice is that the date format is pretty inconvenient. Integer dates? Really? That's positively French. But what should we use to associate the dates with an actual date type? We could debate that a lot, but the convention I picked was to associate each row with the final weekday of the month that it describes. This isn't perfect, because some weekdays at the end of the month can be holidays (like Memorial Day), and before September 29, 1952, the NYSE actually traded on Saturdays (can you believe it?). So the last weekday might not be the final day on which trading could have happened during a month in some of the older data.

But who cares. We have to pick something, and it happens to be easy to generate a list of the final weekdays from Python. How, you ask? I wrote the following little function:
import calendar
import numpy as np

def last_weekday(year, month):
    # monthcalendar() lays the month out as a matrix of week rows, with days
    # outside the month set to 0; columns 0-4 are Monday through Friday, so
    # the max over those columns is the last weekday of the month.
    return np.asarray(calendar.monthcalendar(year, month))[:, 0:5].max()
and then I ran it for the particular months in our data file, 1926-07 through 2014-02:
for year in range(1926, 2015):
    for month in range(1, 13):
        ld = last_weekday(year, month)
        print year, month, ld

This generated a long listing of year, month, and last-weekday values, one line per month.


Over in Emacs, my soon-to-be .csv at this point is still just the raw monthly rows with their integer dates.


So what I want to do is change all those nasty dates from something like '192607' into something like '1926-07-30'.

Putting in the first "-" symbol is easy. I move the cursor between the "6" and the "0", hit C-<space> to set the mark, and then page all the way down to the bottom, effectively selecting a 0-width column. Then I type C-x r t followed by the thing I want to insert, -, and hit enter. Voila, it writes that text down the entire column. I can repeat this on the other side of the month number to get the second - symbol for the date strings.

Now I need to do something similar for writing in the appropriate last day of the month. For this, I first copied all of the month-end output from my IPython session into the bottom of my work-in-progress .csv file. I moved to the top left of this pasted data, hit C-<space> to set the mark, and moved down to the bottom right. At that point I type M-x kill-rectangle to kill the rectangle of text -- Emacs-speak for cutting it.

I move up to the spot where I'd want the top left of that column to be pasted, right after the "06-" in the first line of data. Then I type M-x yank-rectangle to paste it. Yeah, Emacs calls copying killing and pasting yanking. I wonder if anyone 500 years ago ever thought that someone would write the English sentence "Yeah, Emacs calls copying killing and pasting yanking."? Kind of makes you wonder what sorts of English sentences might be floating around 500 years from now.

Now we're almost done cleaning the data. We have the nicely formatted dates at the beginning of each line. But we need the file to be a .csv. Right now it is some kind of space- or tab-delimited file and I don't want to fiddle with that.

This is where the Emacs ability to bind keyboard macros on the fly comes in super handy. If you type C-x ( <some key strokes> C-x ), then Emacs will remember <some key strokes> and re-execute precisely that key stroke sequence when you type C-x e.

I played around for a while and it didn't seem super easy to come up with a set of key strokes that neatly deleted the white space and added commas at the needed spots. Here's what I finally came up with (I use the notation (right) to mean pressing the right-arrow key, and so on for up, down or left):
C-x (                                           # Line 1
C-u 3 C-(right)                                 # Line 2
M-x search-forward-regexp [^[:space:]] (left) , # Line 3
C-space C-(left) (right) (right) <del>          # Line 4
# repeat the above two lines 3 more times       # Line 5
(down) <home>                                   # Line 6
C-x )                                           # Line 7
Line 1: Starts the remembering-all-keystrokes mode.

Line 2: Equivalent of pressing Ctrl and the right arrow 3 times in a row, to move through the date characters.

Line 3: Moves ahead to just after the next non-space character, steps back by one position, and places a comma.

Line 4: Sets a mark, moves back to the previous decimal point, then moves to the right by two spots since everything has two decimal places. Then deletes the selection, which will be all of the white space.

Line 5: Repeat lines 3 and 4 for each of the remaining data columns. This could be done by saving lines 3 and 4 together as their own macro and C-u-ing it.

Line 6: Go to the start of the next line.

Line 7: Stop listening to keystrokes, thus defining our new macro.

Then do C-x e a bunch of times.

Protip: after doing C-x e once, you can just press e by itself. Heh, you might say that e is the way that Emacs naturally logs your command (rimshot). You can also type C-u <number> <command-to-repeat> to repeat <command-to-repeat> <number> times in a row.

So we want to do this for 6 months in 1926 + 2 months in 2014 + 12*(2013-1926) months from the full years in between - 1 line that we already processed while making the macro = 1051 remaining lines. So after defining the keyboard macro, we could move the cursor to the beginning of the second line of data, type C-u 1051 C-x e and, as Emeril might have said if he had been attracted to a career in computer science instead of cooking, BAM: the rest of the file snaps into shape.


I'm certain that people more wizardly with Emacs would scoff at this and come up with a much shorter set of key strokes to achieve the same thing. But hey, I'm distracted watching NCAA basketball at the moment. To quote Marge Simpson, who gives a doodle.
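
(If you'd rather skip the Emacs gymnastics entirely, here's a rough Python sketch that would do the same cleaning in one pass. It's not what I actually did, and it assumes you've saved just the monthly block of the raw file to a hypothetical "ff_raw_monthly.txt" whose columns are, in order: date, ExcessMarket, SMB, HML, RiskFreeRate.)

import calendar
import numpy as np

def last_weekday(year, month):
    # last Monday-Friday date appearing in this month's calendar matrix
    return np.asarray(calendar.monthcalendar(year, month))[:, 0:5].max()

with open("ff_raw_monthly.txt") as raw, open("ff_cleaned_monthly.csv", "w") as out:
    for line in raw:
        fields = line.split()
        if len(fields) != 5 or not fields[0].isdigit():
            continue  # skip header and blank lines
        yyyymm, values = fields[0], fields[1:]
        year, month = int(yyyymm[:4]), int(yyyymm[4:])
        date = "%04d-%02d-%02d" % (year, month, last_weekday(year, month))
        out.write(",".join([date] + values) + "\n")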

Yippee, now we have a nice .csv file with the kind of dates we want. Now let's put it into a database table.

Put some data into the database.

This part is pretty easy. All we have to do is create a table and tell Postgres what the columns will be and what type of data each column holds. Then, with an empty skeleton of a table, we can use the nifty \copy command to copy the data in from our .csv file:
ely@eschaton:~$ psql -d FamaFrench
psql (9.1.12)
Type "help" for help.

FamaFrench=# create table MonthlyFactors_USMarket (FactorDate date, ExcessMarket double precision, SMB double precision, HML double precision, RiskFreeRate double precision);

CREATE TABLE
FamaFrench=# \copy MonthlyFactors_USMarket from '/home/ely/databases/raw_data/ff_cleaned_monthly.csv' delimiter ',' CSV;
FamaFrench=# select * from MonthlyFactors_USMarket limit 5;
 factordate | excessmarket |  smb  |  hml  | riskfreerate
------------+--------------+-------+-------+--------------
 1926-07-30 |         2.95 |  -2.5 | -2.67 |         0.22
 1926-08-31 |         2.63 |  -1.2 |   4.5 |         0.25
 1926-09-30 |         0.38 | -1.33 |  -0.3 |         0.23
 1926-10-29 |        -3.24 | -0.13 |  0.79 |         0.32
 1926-11-30 |         2.54 | -0.24 | -0.39 |         0.31
(5 rows)

FamaFrench=# select count(*) from MonthlyFactors_USMarket;
 count
-------
  1052
(1 row)
Awesome! It worked! Now let's access it directly from Python!

Access PostgreSQL data from Python

For this, you will need to install unixodbc and the drivers for Postgres:
sudo apt-get install unixodbc unixodbc-dev odbc-postgresql
You will also want to install pyodbc:

With conda (recommended):
conda install pyodbc
or with pip:
pip install pyodbc
You may choose to use sudo with these commands depending on how you have set up your Python environment.

Then we will need to put an entry into the odbc initialization file, /etc/odbcinst.ini, that looks like this:
[PostgreSQL]
Description   = PostgreSQL ODBC driver (ANSI version)
Driver        = /path/to/psqlodbca.so
Setup         = /path/to/libodbcpsqlS.so
Database      = FamaFrench
Debug         = 0
CommLog       = 1
UsageCount    = 1
The driver and setup locations can vary. I suggest you go to /usr and just try:
find . -name psqlodbca.so
For me, with 64-bit Ubuntu, they are found as follows:
ely@eschaton:/usr$ find . -name psqlodbca.so
./lib/x86_64-linux-gnu/odbc/psqlodbca.so
So the path for me is: /usr/lib/x86_64-linux-gnu/odbc/...

Then, lastly, place the following in /etc/odbc.ini:
[PostgreSQL]
Description    = PostgreSQL
Driver         = PostgreSQL
Trace          = No
TraceFile      = /tmp/psqlodbc.log
Servername     = localhost
ReadOnly       = Yes
If you really want the gory details on all this .ini business, here's a good link: Unix ODBC Internals. But basically, think of odbc.ini as defining the named data sources you connect to, each of which refers to a driver spelled out in odbcinst.ini.

Anyway, once all of the above is installed and configured, you can just pop open IPython and do the following:
import pyodbc

conn = pyodbc.connect("DRIVER={PostgreSQL};SERVER=localhost;DATABASE=FamaFrench;UID=ely;PWD=*****")

conn.execute("select * from MonthlyFactors_USMarket limit 10").fetchall()
which prints:

[(datetime.date(1926, 7, 30), 2.95, -2.5, -2.67, 0.22),
 (datetime.date(1926, 8, 31), 2.63, -1.2, 4.5, 0.25),
 (datetime.date(1926, 9, 30), 0.38, -1.33, -0.3, 0.23),
 (datetime.date(1926, 10, 29), -3.24, -0.13, 0.79, 0.32),
 (datetime.date(1926, 11, 30), 2.54, -0.24, -0.39, 0.31),
 (datetime.date(1926, 12, 31), 2.62, -0.22, -0.11, 0.28),
 (datetime.date(1927, 1, 31), -0.11, -0.17, 4.86, 0.25),
 (datetime.date(1927, 2, 28), 4.11, 0.37, 3.26, 0.26),
 (datetime.date(1927, 3, 31), -0.15, -1.62, -2.62, 0.3),
 (datetime.date(1927, 4, 29), 0.52, 0.25, 0.66, 0.25)]
And if we wanted the data as a Pandas DataFrame:
import pandas
df = pandas.DataFrame(map(tuple, conn.execute("select * from MonthlyFactors_USMarket").fetchall()), columns=["FactorDate", "ExcessMarket", "SMB", "HML", "RiskFreeRate"])
Note that we needed to map the returned rows to tuple types in Python. The usual type will be pyodbc.Row, which is not recognized by the Pandas DataFrame constructor.
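
(As an aside, and just a sketch: you could avoid hardcoding the column names by asking the cursor for them via the standard DB-API description attribute. Note that Postgres folds unquoted identifiers to lowercase, so the names come back as 'factordate', 'excessmarket', and so on.)

cursor = conn.execute("select * from MonthlyFactors_USMarket")
colnames = [col[0] for col in cursor.description]  # lowercase column names from Postgres
df = pandas.DataFrame([tuple(row) for row in cursor.fetchall()], columns=colnames)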

So what did the distribution of this SMB factor look like over all time?
df.SMB.hist(bins=71)

If we regress the excess market on the 1-month lagged values of SMB and HML, are they good predictors? (I reformatted the output a bit to be more readable.)
In [35]: pandas.ols(y=df.ExcessMarket, x=df.shift()[["SMB", "HML"]])
Out[35]:

----- Summary of Regression Analysis -----

Formula: Y ~ <SMB> + <HML> + <intercept>

Number of Observations:         1051
Number of Degrees of Freedom:   3

R-squared:         0.0069
Adj R-squared:     0.0050

Rmse:              5.4000

F-stat (2, 1048):     3.6279, p-value: 0.0269

Degrees of Freedom: model 2, resid 1048

-- Summary of Coefficients --
Variable    Coef     Std Err   t-stat   p-value
-----------------------------------------------
SMB         0.0390   0.0520    0.75     0.4534
HML         0.1185   0.0478    2.48     0.0133
intercept   0.5911   0.1679    3.52     0.0004
Eh, they're ok. It's pretty hard to predict the overall US market with much accuracy.

This sort of regression is not the common mode of use for this data. Instead, folks will normally load in a single stock's time-series of monthly excess returns and regress that on all three of the Fama/French factors. Then they will repeat this for all the stocks in their investable universe. Usually it will be a rolling regression, with a window size of somewhere between 12 and 60 monthly periods.

In the end, for each (stock, date) pair, you'll get coefficients for that stock (at that time) on the 3 factors. You can aggregate these by averaging over cross-sections at the same time, or by averaging for single stocks (or groups of stocks) over time. Ultimately, the goal is to save some coefficients on these factors that you believe will link today's observable data (this month's excess market return, SMB, and HML) with the returns each stock will realize over the next 1-month period.
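
To make that concrete, here's a rough sketch of one such rolling regression, written as a plain numpy least-squares loop rather than any particular pandas rolling API. It assumes a hypothetical stock_excess Series holding one stock's monthly excess returns, aligned to a factors DataFrame that has been re-indexed by FactorDate:

import numpy as np
import pandas

def rolling_three_factor_coefs(stock_excess, factors, window=36):
    # Regress the stock's excess returns on ExcessMarket, SMB, and HML over a
    # rolling window of monthly observations; returns one row of coefficients
    # (intercept plus three loadings) per window-ending date.
    cols = ["ExcessMarket", "SMB", "HML"]
    out = {}
    for end in range(window, len(factors) + 1):
        y = stock_excess.iloc[end - window:end].values
        X = factors[cols].iloc[end - window:end].values
        X = np.column_stack([np.ones(len(X)), X])  # add an intercept column
        out[factors.index[end - 1]] = np.linalg.lstsq(X, y, rcond=-1)[0]
    return pandas.DataFrame(out, index=["intercept"] + cols).T

Each row of the result holds the stock's estimated intercept and factor loadings as of that month, which you can then average across stocks at a point in time or across time for a single stock, as described above.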

Exercise for the reader

Use the Pandas io module to get some stock level data for AAPL over a few decades. Clean the data so the dates align with the ones fetched from our new database. Join those stock-level returns onto our database data, and perform one of these rolling regressions for AAPL to see what kind of coefficients AAPL exhibits in the Fama/French 3-factor model.




Saturday, March 1, 2014

Summary blog and some summaries

I've decided to start blogging again. To keep myself motivated, I will try to keep posts shorter, create fewer, more-targeted posts, and make most posts function as summaries of interesting things that I have recently read. Here goes!

By the time you give them a raise, they are already out the door.

If finding talented folks is priority number one, retaining them is priority number two. Do this by paying them at or above market rates (and be sure you get these rates right) and by developing a clear path for growth. Undervalued folks also tend to complain indirectly; if you hear them complaining about peripheral aspects of the job, address the complaints fast. Otherwise, by the time you offer a fix (a raise, improved conditions), such folks have already committed to leaving.


The Engineer Crunch

Many firms currently say that for non-engineering jobs they can find candidates quickly, yet for engineering jobs they cannot find suitable candidates at all. To attract engineers, some of the following may work: be a good engineer yourself, already employ good engineers, provide excellent pay, provide a compelling (and clear) mission, and offer freedom and interesting projects. This is highly related to Paul Graham's essay "Great Hackers."


Hanson on Gopnik on Religion

Many measures show societal decline in religiosity. Gopnik claims that as wealth increases, religiosity decreases as past-poverty-based pains no longer motivate belief. Hanson notes that despite this, most folks still partake in rituals and profess some degree of belief. Will religion surge again if in the future most folks live at a subsistence level?


Calm

Om.


Worse than Wal-Mart: Amazon...

Amazon uses an uber-quantitative, modern form of Taylorism to track employee movements, with possibly privacy-intruding data collection (such as whether an employee chose the nearest possible bathroom or not). Folks with some characteristic that prevents them from meeting the employer's demands (such as age preventing a box-checker from checking enough boxes per hour) seem to be fired or treated badly. The tone is very critical, but it's not clear why or what the author wants.


Some notes on the Amazon thing, just for reference.

There are (at least) three tensions that might be mutually inconsistent (even morally): (1) wanting organizations to efficiently satisfy consumer demands (e.g. keeping only the best workers from among a pool of competing candidates), (2) wanting all people to have a quality of life above some threshold (e.g. a person's age-induced slowness won't result in poverty-inducing job loss), and (3) if you can even operate above the threshold in item (2), allowing a person's productivity (as measured in financially meaningful units of output for the firm) to determine how that person is compensated (so that harder, better, or smarter work == more pay).

One conjecture is that because (3) is hard to measure and asymmetrically favors a small segment of society, folks will try to argue that (2) should morally defeat (1). This is related to Hanson's "Inequality Talk is About Grabbing", though there is a kernel of legitimacy: it is inefficient (and unfair) if a marginal increase in an activity does result in more profit for a firm yet zero share of that profit is conferred back to the worker. An often overlooked point is that when a worker increases personal productivity, some of that increase is due to previously deployed capital, some is due to good management, some is due to the fact that the company paid the electricity bill and the employee used some of that lighting while becoming more productive, and so on. So the Marxist idea that the worker's "surplus effort" goes unrepaid can never be fully consistent: a worker is never logically entitled to the full proceeds from the stream of future 'surplus effort,' since many other factors were required for there to be any surplus effort at all.

Friday, January 18, 2013

Treatment > Cure

There is a particular flavor of conspiracy theory surrounding disease cures that can be stated like this: private interests actively suppress disease cures because they can make more money by "milking" the diseased through on-going treatments that last a lifetime instead of cures which only earn money until the disease is mostly gone. Historical evidence does not favor this theory, but putting that aside I am interested in the underlying incentive problem.

Suppose it is extremely costly to research and develop a mechanism that totally kills all of the roaches, ants, and termites living in or around your home. Such a panacea insecticide could exist; it would just be very costly to develop. On the other hand, suppose that common insect repellents are relatively cheap and easy to produce, or at least that the chemists consulting with would-be insecticide company founders can give convincing reasons to expect them to be cheap.

Then the business making the cheaper insecticides, with its forecastable ongoing income stream, is more likely to succeed, more likely to get needed venture capital funding, and more likely to actually produce insect repellents that improve the lives of consumers. The savvy business investor will be incented to make products that solve a problem. Yes, an insect "cure" would be better than an ongoing insect "treatment," and we could grumble that the evil business investor is diverting funds that could otherwise be exhaustively spent on a cure search. But then in the meantime we might not have the useful stopgap repellents that, while not a cure, sure make life better and are more of a sure thing due to their lower production burden.

Why doesn't similar reasoning apply to disease? I sure don't want to get cancer. It looks like a pretty difficult disease to understand and treat, much more so to cure. But smart people have already devoted a lot of time to exploring potential cures, most of which haven't shown signs of working. So we should expect a full cure to be very, very difficult and expensive. So should I want folks to go Indiana Jones style after that cure, or praise them if they do? Their highly risky and expensive research efforts may fail to produce even the "cancer repellent" equivalent along the way, leaving us with no cure, no treatment, and lost wealth.

Instead, perhaps a pharmaceutical company might look at the business implications of a long-term revenue stream from ongoing cancer treatments... the better the treatment (less pain, no hair loss, less weakness or morbidity), the more money people will pay for the treatment stream. That sounds like a good world for future me to live in. Yes, I'd like a flat-out cure more than that, but if the incentives can more reliably steer people toward a treatment that also improves my life, I'll happily take it.

To summarize, ongoing cancer treatments may very well be "better" than cancer cures, in the sense that treatments could be more cheaply and reliably achieved, and they actually offer investors an attractive revenue stream that acts as an incentive to solve the problem. Going only for the cure, and sneering at treatments as "greedy" ways to solve the problem, is disingenuous: if something improves lives reliably then it's a good thing, even if it's not the hard-to-get best thing.

I suspect some folks will not like this and will see it as "giving up" on a cure when, because of the high value of human life, we should practically give up anything to find a bulletproof cure. Economic behavior suggests that people only pay lip service to such an idea. But if you're more committed to your convictions than most, an excellent avenue to explore is prize-based charity, so that you actively incent people to hit specific milestones of achievement. If it were true that people really wanted expensive cancer cures instead of just the cheaper cancer repellent, we ought to see a lot more private prize donations.

Wednesday, December 26, 2012

Why Loss Lingering?

I just re-watched 'Mother Simpson' (season 7, episode 8 of The Simpsons) and was more struck this time by the end of the episode than the other dozen or so times that I've seen it. In it, Homer's long-presumed-dead mother makes a dramatic reappearance, and her past as a law-breaking activist is revealed. At the end, the law is after her again and she must suddenly flee, leaving Homer motherless again. Homer says goodbye as she gets into a van that speeds into the distance. Then Homer remains right there, on the side of a rural road, until long after nightfall. It's that last part that made me think.

I've had many times in life where I've had to make dramatic goodbyes, most often with my family. One thing I often notice is that I retrace steps or linger in places where now-gone people once were. I walk my family to their car or a cab, wave goodbye, then go back to my living room and feel a bit sad to think that they were just in that very living room. I've had these experiences when retracing my steps after saying goodbye at my own apartment, at airports, at train stations, and at various points of interest where paths diverge. I've had this feeling on a small scale when I know it will only be weeks or months before seeing the person again; I've had this feeling on a large scale when I am unsure if I will ever see the person again.

Of course it fades and it is not the same feeling as grief over death or other separation emotions. But it makes me curious about why physical surroundings and retracing steps combine like this to amplify separation loss feelings. It's very easy to come up with simple/obvious answers for this, but are they really good explanations?

Here are a few (very speculative) ideas:

  1. Sensory surprise. When you are around family, friends, or other loved ones, you are more attentive to things that would usually just be cursory elements of your setting. The specific sensory experiences that relate to your family, friends, or loved ones become familiar very fast and then their absence later is very stark, providing a depressive environment for previously-heightened senses. For example, it's well-known that peer and mate bonding has a lot to do with oxytocin levels in the brain, and things like eye-to-eye contact can increase this. If you become accustomed to receiving an oxytocin trigger in a given setting and then the trigger goes away, it could be like a mild form of withdrawal.
  2. Immediacy of loss. If the impact of separation falls off according to some power law, then the moments right after a separation would be relatively more difficult than distant moments. Thus, we might expect to blow small details out of proportion, like the significance of someone having just been with you at a certain spot, more when separation is extremely recent.
  3. Milestone effects. Culturally, communities and societies choose occasions and events that have "intrinsic" meaning. Sure, some have other valuable meanings too (like a college graduation conferring certification of a certain level of hard work or knowledge, or the birth of a child implying all sorts of emotional and physical lifestyle changes). But many things, like birthdays, vacations, retirement, rituals, holidays, or reunions, have meaning that more or less is dreamed up out of thin air and persists only as long as cultural pressures make it persist. But even so, we grow up embedded in audio, video, and tactile sense streams that reinforce the importance of milestone events all the time. Most goodbyes accompany milestone events, and therefore we might be prone to contrast a milestone-setting with a non-milestone-setting.
  4. Maybe we don't. It could be that I perceive this effect to be more common than it is. Maybe most people don't linger in post-separation places or states of mind. Perhaps this is more strongly felt the more "emo" or whiny someone is? What other personality traits does linger-after-loss correlate with? Why would such a trait appear in people? Does it indicate brooding? Is it a signal of commitment to others? What ways have we evolved to detect fake grief, so that such brooding or loyalty signals could be relied upon?
What other explanations help model this linger-after-loss behavior? If it was never very advantageous for social species to develop quick, robust, elastic happiness feelings that pop back into place after a loss, why not?