Saturday, January 3, 2015

A (an?) Haiku

Inspired to totally rip off a phrase from this:

I am a quilt of // loosely cobbled-together // coping strategies.

Tuesday, October 21, 2014

Bad Job Avoidance

I've been unemployed for just over four months. I've been searching for jobs every day: I've conducted over 30 phone interviews (not counting the dozens of third-party recruiters I've spoken with), filled out well over 100 applications, participated in more than 10 later-stage on-site interviews, and rejected 5 formal job offers. It's definitely stressful and demoralizing to go through such a difficult search.

I spoke recently with a third-party recruiter who did not like my preferences: I expect to have a job with satisfactory compensation and benefits, satisfactory vacation time, a healthy work/life balance, freedom to use the computational tools that are best for the job, the opportunity to gain quality work experience, the opportunity to learn new things, and the ability to work in a suitably quiet and private space to facilitate software development productivity.

I believe these are minimally acceptable preferences: a job that fails to meet even a single one of these items is not just less than ideal; it is unhealthy and should be flat out rejected.

The recruiter asked with a smug arrogance, "So, how's the job search going?"

"Great," I replied. "Here I am enjoying the fact that I'm not stuck in a horrible job."

Monday, October 13, 2014

Real Technology Agnosticism

Consider the two opinions:

"Use whatever tools will make you the most productive. If they cost a lot of money, just talk to us and explain why the tool is important. If we can afford it, we will. If we can't afford it, we will make the best compromise with other tools we can, and we'll work on affording the better tool. If it involves changing something that is a standard in our company, meaning that many people and established processes would have to change too, we're not going to adopt the change quickly -- but we will listen to evidence about why the change would be cost effective and we will be open with you about considering it."

"Use only the tools that we have decided, as a company, to use. These tools are our technology conventions. Whether you like them or not, they are the tools available to you, and you will not be considered a 'team player' unless you find a way to get your work done with the tools we give you. If you recommend changes, we will view even contemplating tool changes as a waste of your time, even if you have evidence of their cost-effectiveness. Changing something that is already a standard within the company is impossible unless the idea originates with senior-level employees; the more you ask about changing established policies, the more you will be viewed as uncooperative."

Which of these attitudes falls more under the banner of "technology agnosticism"? In most bureaucratic settings, the second attitude is trumpeted as a pragmatic, technology-agnostic viewpoint. But really it is an excuse to avoid dealing with the consequences of depriving talented workers of affordable, productivity-enhancing technologies (generally for political reasons, like deflecting blame with standards, and emphatically not in the name of legitimate business concerns).

If anything, it is technology dogmatic.

Meanwhile, the first attitude strives to be actually pragmatic rather than merely paying lip service. If a better tech tool is available and affordable: just use it. If it's not affordable: justify it with numbers and be content to wait until it's affordable. If the scope of the change is massive within the company: expect that you will need to present equally massive evidence that the change is beneficial, but also expect that we will appreciate it when you do present this kind of evidence.

That sounds much more technology agnostic to me -- not to mention less dehumanizing, more pragmatic, and fairer. And you will have the benefit that technology policies will be influenced more by technology experts in an organization than by managers.

Sunday, March 23, 2014

From 0 to Fama/French Postgres Database Tables with Python and Emacs

In this post, I'm recording the steps I took to go from having neither PostgreSQL installed and configured on my laptop nor any of Kenneth French's financial data sets downloaded, all the way to performing a very simple regression on the data in Python via native Python queries to a PostgreSQL database table.

Setup Postgres

I'm using Ubuntu 12.10, so for me, the following was needed.
sudo apt-get update
sudo apt-get install postgresql postgresql-contrib

Then I needed to do some things as the postgres administrator (named "postgres" by default).
sudo su postgres 
< enter password >
Add a new user (users are called 'roles' in Postgres).
postgres@eschaton:/home/ely$ createuser --pwprompt
Enter name of role to add: ely
Enter password for new role:
Enter it again:
Shall the new role be a superuser? (y/n) y
Create a database where the data will eventually go: 
postgres@eschaton:/home/ely/$ createdb FamaFrench
Exit back to regular user.
postgres@eschaton:/home/ely$ exit
Now we can try logging in to the new database:
ely@eschaton:~$ psql -d FamaFrench
psql (9.1.12)
Type "help" for help.

So Postgres is up and running and we have a place to keep the data. Let's go get some data.

Getting some data.

Kenneth French provides a nice assortment of academic finance data sets at the French Data Library. One of the most commonly used files is the set of monthly US market "factor" values: a set of 3 things that are supposed to do a pretty good job of explaining where financial returns will come from in the US market.

This data set is found at the link titled "Fama/French Factors," currently the first link under the downloadable files section. We can click the link and save the file locally. I saved it to "/home/ely/databases/raw_data/" and extracted it to get "F-F_Research_Factors.txt" which we can open with any standard text editor and examine.

There is a column for dates. The factors that come after the date are the excess market return, something called "SMB" and something called "HML" -- and for good measure the risk free rate is also provided. You can read Wikipedia about why these are supposed to matter: Fama French 3-Factor Model.
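In equation form (standard textbook notation, not something in the data file itself), the three-factor model posits that stock i's excess return decomposes as:

```latex
r_{i,t} - r_{f,t} = \alpha_i
  + \beta_i \,(r_{m,t} - r_{f,t})
  + s_i \,\mathrm{SMB}_t
  + h_i \,\mathrm{HML}_t
  + \varepsilon_{i,t}
```

where the market excess return, SMB, and HML are the three columns in this file, and the risk-free rate column supplies the r_f terms.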

Also note that way at the bottom of the file, there is a bit of a gap and then the same sort of data is repeated but on a yearly basis -- the yearly factor data. For our purposes, let's ignore this. So I just copied all of the monthly data and pasted it into a new file, "ff_cleaned_monthly.csv" -- you can see where I'm going: I want to make the file into a suitable .csv file.

The first thing I notice is that the date format is pretty inconvenient. Integer dates? Really? That's positively French. But what should we use to associate the dates with an actual date type? We could debate that a lot, but the convention I picked was to associate each row with the final weekday of the month that it describes. This isn't perfect, because some weekdays at the end of the month can be holidays (like Memorial Day), and before September 29, 1952, the NYSE actually traded on Saturdays (can you believe it?). So the last weekday might not be the final day on which trading could have happened during a month in some of the older data.

But who cares. We have to pick something, and it happens to be easy to generate a list of the final weekdays from Python. How, you ask? I wrote the following little function:
def last_weekday(year, month):
    import calendar
    import numpy as np
    # monthcalendar returns the month as rows of [Mon, ..., Sun],
    # padded with 0s for days outside the month; columns 0-4 are the
    # weekdays, so their max is the last weekday's day-of-month.
    return np.asarray(
        calendar.monthcalendar(year, month))[:, 0:5].max()
and then I ran it for the particular months in our data file, 1926-07 through 2014-02:
for year in range(1926, 2015):
    for month in range(1, 13):
        ld = last_weekday(year, month)
        print year, month, ld

This generated output that looked like this (trimmed to the months in our file):
1926 7 30
1926 8 31
1926 9 30
...
2014 2 28

Over in Emacs, the current status of my soon-to-be .csv looks like this:

So what I want to do is change all those nasty dates from something like '192607' into something like '1926-07-30'.

Putting in the first "-" symbol is easy. I move the cursor over between the "6" and the "0", hit C-<space> to set a marker, and then page all the way down to the bottom, effectively selecting a 0-width column. Then I type C-x r t followed by the thing I want to type, -, and hit enter. Voila, it writes that text along the entire column. I can repeat this on the other side of the month number to add the second - symbol for the date strings.

Now I need to do something similar for writing in the appropriate last day of the month. For this, I first copied all of the month-end output from my IPython session into the bottom of my work-in-progress .csv file. I moved to the top left of this pasted data, hit C-<space> to set a mark, and moved down to the bottom right. At that point I typed M-x kill-rectangle to copy the rectangle of text to the Emacs clipboard.

I move up to the spot where I'd want the top left of that column to be pasted, right after the "06-" in the first line of data. Then I type M-x yank-rectangle to paste it. Yeah, Emacs calls copying killing and pasting yanking. I wonder if anyone 500 years ago ever thought that someone would write the English sentence "Yeah, Emacs calls copying killing and pasting yanking."? Kind of makes you wonder what sorts of English sentences there might be floating around in 500 years from now.

Now we're almost done cleaning the data. We have the nicely formatted dates at the beginning of each line. But we need the file to be a .csv. Right now it is some kind of space- or tab-delimited file and I don't want to fiddle with that.

This is where the Emacs ability to bind keyboard macros on the fly comes in super handy. If you type C-x ( <some key strokes> C-x ), then Emacs will remember <some key strokes> and re-execute precisely that key stroke sequence when you type C-x e.

I played around for a while and it didn't seem super easy to come up with a set of key strokes that neatly deleted the white space and added commas at the needed spots. Here's what I finally came up with (I use the notation (right) to mean pressing the right-arrow key, and so on for up, down or left):
C-x (                                           # Line 1
C-u 3 C-(right)                                 # Line 2
M-x search-forward-regexp [^[:space:]] (left) , # Line 3
C-space C-(left) (right) (right) <del>          # Line 4
# repeat the above two lines 3 more times       # Line 5
(down) <home>                                   # Line 6
C-x )                                           # Line 7
Line 1: Starts the remembering-all-keystrokes mode.

Line 2: Equivalent of pressing Ctrl and the right arrow 3 times in a row, to move through the date characters.

Line 3: Moves ahead to just after the next non-space character, steps back by one position, and places a comma.

Line 4: Sets a mark, moves back to the previous decimal point, then moves right by two spots since everything has two decimal places. Then it deletes the selection, which will be all of the white space.

Line 5: Repeat lines 3 and 4 for each of the remaining data columns. This could be done by saving lines 3 and 4 together as their own macro and C-u-ing it.

Line 6: Go to the start of the next line.

Line 7: Stop listening to keystrokes, thus defining our new macro.

Then do C-x e a bunch of times.

Protip: after doing C-x e once, you can just press e by itself. Heh, you might say that e is the way that Emacs naturally logs your command (rimshot). You can also type C-u <number> <command-to-repeat> to repeat <command-to-repeat> <number> times in a row.

So we want to do this for 6 months in 1926 + 2 months in 2014 + 12*(2013-1926) - 1 line that we already processed while making the macro = 1051 repetitions. So after defining the keyboard macro, we could move the cursor to the beginning of the second line of data, and type C-u 1051 C-x e and, as Emeril might have said if he had been attracted to a career in computer science instead of cooking, BAM:

I'm certain that people more wizardly with Emacs would scoff at this and come up with a much shorter set of key strokes to achieve the same thing. But hey, I'm distracted watching NCAA basketball at the moment. To quote Marge Simpson, who gives a doodle.
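Of course, the whole cleanup could also be scripted in Python instead of Emacs. Here is a rough sketch (the file names are hypothetical, and it reuses the last-weekday idea from above, just without numpy):

```python
import calendar

def last_weekday(year, month):
    # monthcalendar gives the month as [Mon, ..., Sun] rows, padded
    # with 0s; the first five entries of each week are the weekdays.
    return max(day for week in calendar.monthcalendar(year, month)
               for day in week[:5])

def clean_line(line):
    # Turn e.g. "192607    2.95   -2.50   -2.67    0.22" into
    # "1926-07-30,2.95,-2.50,-2.67,0.22".
    fields = line.split()
    yyyymm, values = fields[0], fields[1:]
    year, month = int(yyyymm[:4]), int(yyyymm[4:])
    date = "%04d-%02d-%02d" % (year, month, last_weekday(year, month))
    return ",".join([date] + values)

# Usage (hypothetical file names):
# with open("ff_raw_monthly.txt") as src, \
#         open("ff_cleaned_monthly.csv", "w") as dst:
#     for line in src:
#         if line.strip():
#             dst.write(clean_line(line) + "\n")
```

But where's the fun in that?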

Yippee, now we have a nice .csv file with the kind of dates we want. Now let's put it into a database table.

Put some data into the database.

This part is pretty easy. All we have to do is create a table and tell Postgres what kind of columns there will be and what the type of data is for each column. Then, with an empty skeleton of a table, we can use the nifty copy command to copy the data in from our .csv file:
ely@eschaton:~$ psql -d FamaFrench
psql (9.1.12)
Type "help" for help.

FamaFrench=# create table MonthlyFactors_USMarket (FactorDate date, ExcessMarket double precision, SMB double precision, HML double precision, RiskFreeRate double precision);

FamaFrench=# \copy MonthlyFactors_USMarket from '/home/ely/databases/raw_data/ff_cleaned_monthly.csv' delimiter ',' CSV;
FamaFrench=# select * from MonthlyFactors_USMarket limit 5;
 factordate | excessmarket |  smb  |  hml  | riskfreerate
------------+--------------+-------+-------+--------------
 1926-07-30 |         2.95 |  -2.5 | -2.67 |         0.22
 1926-08-31 |         2.63 |  -1.2 |   4.5 |         0.25
 1926-09-30 |         0.38 | -1.33 |  -0.3 |         0.23
 1926-10-29 |        -3.24 | -0.13 |  0.79 |         0.32
 1926-11-30 |         2.54 | -0.24 | -0.39 |         0.31
(5 rows)

FamaFrench=# select count(*) from MonthlyFactors_USMarket;
 count
-------
  1052
(1 row)
Awesome! It worked! Now let's access it directly from Python!

Access PostgreSQL data from Python

For this, you will need to install unixodbc and the drivers for Postgres:
sudo apt-get install unixodbc unixodbc-dev odbc-postgresql
You will also want to install pyodbc:

With conda (recommended):
conda install pyodbc
or with pip:
pip install pyodbc
You may choose to use sudo with these commands depending on how you have set up your Python environment.

Then we will need to put an entry into the odbc initialization file, /etc/odbcinst.ini, that looks like this:
[PostgreSQL]
Description   = PostgreSQL ODBC driver (ANSI version)
Driver        = /path/to/
Setup         = /path/to/
Database      = FamaFrench
Debug         = 0
CommLog       = 1
UsageCount    = 1
The driver and setup locations can vary. I suggest you go to /usr and just try:
find . -name
For me, with 64-bit Ubuntu, they are found as follows:
ely@eschaton:/usr$ find . -name
So the path for me is: /usr/lib/x86_64-linux-gnu/odbc/...

Then lastly place the following in /etc/odbc.ini.
[PostgreSQL]
Description    = PostgreSQL
Driver         = PostgreSQL
Trace          = No
TraceFile      = /tmp/psqlodbc.log
Servername     = localhost
ReadOnly       = Yes
If you really want the gory details on all this .ini business, here's a good link: Unix ODBC Internals. But basically, think of the odbc.ini file as the shorthand notation for how you want to connect to things that are spelled out in the odbcinst.ini file.

Anyway, once all of the above is installed and configured, you can just pop open IPython and do the following:
import pyodbc

conn = pyodbc.connect("DRIVER={PostgreSQL};SERVER=localhost;DATABASE=FamaFrench;UID=ely;PWD=*****")

conn.execute("select * from MonthlyFactors_USMarket limit 10").fetchall()
which prints:

[(datetime.date(1926, 7, 30), 2.95, -2.5, -2.67, 0.22),
 (datetime.date(1926, 8, 31), 2.63, -1.2, 4.5, 0.25),
 (datetime.date(1926, 9, 30), 0.38, -1.33, -0.3, 0.23),
 (datetime.date(1926, 10, 29), -3.24, -0.13, 0.79, 0.32),
 (datetime.date(1926, 11, 30), 2.54, -0.24, -0.39, 0.31),
 (datetime.date(1926, 12, 31), 2.62, -0.22, -0.11, 0.28),
 (datetime.date(1927, 1, 31), -0.11, -0.17, 4.86, 0.25),
 (datetime.date(1927, 2, 28), 4.11, 0.37, 3.26, 0.26),
 (datetime.date(1927, 3, 31), -0.15, -1.62, -2.62, 0.3),
 (datetime.date(1927, 4, 29), 0.52, 0.25, 0.66, 0.25)]
And if we wanted the data as a Pandas DataFrame:
df = pandas.DataFrame(map(tuple, conn.execute("select * from MonthlyFactors_USMarket").fetchall()), columns=["FactorDate", "ExcessMarket", "SMB", "HML", "RiskFreeRate"])
Note that we needed to map the returned rows to tuple types in Python. The usual type will be pyodbc.Row, which is not recognized by the Pandas DataFrame constructor.

So what did the distribution of this SMB factor look like over all time?

If we regress the excess market on the 1-month lagged values of SMB and HML, are they good predictors? (I reformatted the output a bit to be more readable.)
In [35]: pandas.ols(y=df.ExcessMarket, x=df.shift()[["SMB", "HML"]])

----- Summary of Regression Analysis -----

Formula: Y ~ <SMB> + <HML> + <intercept>

Number of Observations:         1051
Number of Degrees of Freedom:   3

R-squared:         0.0069
Adj R-squared:     0.0050

Rmse:              5.4000

F-stat (2, 1048):     3.6279, p-value: 0.0269

Degrees of Freedom: model 2, resid 1048

-- Summary of Coefficients --
Variable    Coef     Std Err   t-stat   p-value
-----------------------------------------------
SMB         0.0390   0.0520    0.75     0.4534
HML         0.1185   0.0478    2.48     0.0133
intercept   0.5911   0.1679    3.52     0.0004   
Eh, they're ok. It's pretty hard to predict the overall US market with much accuracy.

This sort of regression is not the common mode of use for this data. Instead, folks will normally load in a single stock's time-series of monthly excess returns and regress that on all three of the Fama/French factors. Then they will repeat this for all the stocks in their investable universe. Usually it will be a rolling regression, with a window size of somewhere between 12 and 60 monthly periods.

In the end, for each (stock, date) pair, you'll get coefficients of that stock (at that time) on the 3 factors. You can aggregate these by averaging over cross-sections at the same time, or averaging for single (or groups of) stocks over time. In the end, the goal is to save some coefficients on these factors that you believe will link today's observable data (this month's excess market return, SMB, and HML) with the returns each stock will realize over the next 1-month period.
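The rolling regression described above can be sketched roughly as follows. This is my own sketch, not a standard recipe: it uses numpy's plain least squares rather than pandas.ols (which has long since been removed from pandas), and the column names match our database table.

```python
import numpy as np
import pandas as pd

def rolling_three_factor(excess_returns, factors, window=36):
    # excess_returns: Series of one stock's monthly excess returns.
    # factors: DataFrame with columns ExcessMarket, SMB, HML on the
    # same monthly index.
    # Returns a DataFrame of OLS coefficients, one row per window end.
    X = np.column_stack([np.ones(len(factors)), factors.values])
    y = excess_returns.values
    rows = {}
    for end in range(window, len(y) + 1):
        # Ordinary least squares over the trailing `window` months.
        beta, _, _, _ = np.linalg.lstsq(
            X[end - window:end], y[end - window:end], rcond=None)
        rows[excess_returns.index[end - 1]] = beta
    return pd.DataFrame.from_dict(
        rows, orient="index",
        columns=["intercept", "ExcessMarket", "SMB", "HML"])
```

With a 36-month window over our 1052 months of factor data, this would produce one set of (intercept, market, SMB, HML) coefficients for each month from mid-1929 onward.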

Exercise for the reader

Use the Pandas io module to get some stock level data for AAPL over a few decades. Clean the data so the dates align with the ones fetched from our new database. Join those stock-level returns onto our database data, and perform one of these rolling regressions for AAPL to see what kind of coefficients AAPL exhibits in the Fama/French 3-factor model.

Saturday, March 1, 2014

Summary blog and some summaries

I've decided to start blogging again. To keep myself motivated, I will try to reduce the size of my posts, create fewer, more-targeted posts, and make most posts function as summaries of interesting things I have recently read. Here goes!

By the time you give them a raise, they are already out the door.

If finding talented folks is priority number one, retaining them is priority number two. Do this by paying them at or above market rates (and be sure you get these rates right) and by developing a clear path for growth. Undervalued folks also tend to complain indirectly; if you hear them complaining about side properties of the situation, address those complaints fast. Otherwise, by the time you offer a fix (a raise, improved conditions), such folks have already committed to leaving.

The Engineer Crunch

Many firms currently state that for non-engineering jobs they can find candidates fast, yet for engineering jobs they cannot find suitable candidates at all. To attract engineers, some of the following may work: be a good engineer yourself, already employ good engineers, provide excellent pay, provide a compelling (and clear) mission, offer freedom and interesting projects. This is highly related to Paul Graham's essay "Great Hackers."

Hanson on Gopnik on Religion

Many measures show societal decline in religiosity. Gopnik claims that as wealth increases, religiosity decreases as past-poverty-based pains no longer motivate belief. Hanson notes that despite this, most folks still partake in rituals and profess some degree of belief. Will religion surge again if in the future most folks live at a subsistence level?



Worse than Wal-Mart: Amazon...

Amazon uses an uber-quantitative modern form of Taylorism to track employee movements with possible privacy-intruding info grabbing (such as whether an employee chose to go to the nearest possible bathroom or not). Folks who have some characteristic preventing them from meeting the employer's demands (such as age preventing a box-checker from checking enough boxes per hour) seem to be fired or treated badly. The tone is very critical but it's not clear why or what the author wants.

Some notes on the Amazon thing, just for reference.

There are (at least) three tensions that might be mutually inconsistent (even morally): (1) wanting organizations to efficiently satisfy consumer demands (e.g. keeping only the best workers from among a pool of competing employee candidates), (2) wanting all people to have quality of life above some threshold (e.g. a person's age-induced slowness won't result in poverty-inducing job losses), and (3) if you even can operate above the threshold in item two, then allowing a person's productivity (as measured in financially meaningful units of output for the firm) to determine how that person is compensated (so that harder, better, or smarter work == more pay).

One conjecture is that because (3) is hard to measure and asymmetrically favors a small segment of society, then folks will try to argue that (2) should morally defeat (1). This is related to Hanson's "Inequality Talk is About Grabbing", though there is a kernel of legitimacy: it is inefficient (and unfair) if a marginal increase in an activity does result in more profit for a firm yet zero share of that profit is conferred back to the worker. An often overlooked point is that when a worker increases personal productivity, some of that is due to previously deployed capital, some is due to good management, some is due to the fact that the company paid the electricity bill and the employee consumed usage of some lighting while improving personal productivity, etc. So the Marxist idea that this "surplus effort" by the worker is not repaid to the worker can never be fully consistent. A worker is never logically entitled to full rights on the proceeds from the stream of the worker's future 'surplus effort' since many other factors were required for it to even be possible that there would be surplus effort.