9 Assorted Examples of Toolmaking

9.1 Criticism of faulty regression methodology

9.1.1 Catchy titles in academic papers attract readers. It’s hard to resist.

When Harry Fired Sally: The Double Standard in Punishing Misconduct Mark L. Egan, Gregor Matvos, and Amit Seru NBER Working Paper No. 23242 March 2017, Revised August 2018 JEL No. D18,G24,G28,J71

9.1.2 The paper examines gender inequality in the securities business

The authors identify a gender punishment gap in their analysis of over one million cases in which an employee who was required to be licensed by the Securities and Exchange Commission (SEC) through the Financial Industry Regulatory Authority (FINRA) was found to have engaged in misconduct in violation of FINRA’s rules. They find that women are terminated at a higher rate than their male counterparts and that they take longer to obtain another position at a different firm.

That this gender-based disparity exists is highly plausible, based on anecdotal evidence in many areas of economic and social life and on many academic studies.

My interest lies in their methodological approach. They give their model on page 9 of the paper.

9.1.3 The authors construct a multiple linear regression model

Separation model:

\(Misconduct_{iqjlt} = \alpha Female_i + \beta X_{it} + \mu_{qjlt} + \varepsilon_{iqjlt}\)

(© 2017 by Mark L. Egan, Gregor Matvos, and Amit Seru. All rights reserved. Short sections of text, not to exceed two paragraphs, may be quoted without explicit permission provided that full credit, including © notice, is given to the source.)

The authors characterize their model as linear regression. Multiple regression equations are an extension of ordinary least squares regression, where

\(y = mx + b\)

with the addition of other terms.

The subscripts indicate individual, licensure, firm, time, and county.
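Written out in generic form, this extension is simply

\(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \varepsilon\)

with one coefficient per predictor plus an error term, estimated by ordinary least squares.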

9.1.4 The authors collect an impressively large database

The authors collected monthly records on 1.2 million registered advisers during the 2005-2015 period. They carefully analyzed first names to determine gender and censored cases where the given name was ambiguous. I was impressed by the effort put into controlling for variables with the potential to bias their results.

9.1.5 The authors make an unforced error in their choice of methodology

But here be dragons. The dependent variable is binary: given misconduct in year 0, how likely is a subject to be fired in year 2? In the aggregate, are there gender differences in that likelihood?

The technical term for this class of problem is logistic regression, commonly performed with a statistical technique similar to linear regression: a generalized linear model with a binary outcome specified.
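To make the distinction concrete, here is a minimal sketch of that kind of model in Python with statsmodels. Everything in it, the file advisers.csv and the columns separated, female, and experience, is a hypothetical stand-in for illustration, not the authors’ data or specification.

```python
# Minimal sketch of a generalized linear model with a binary outcome
# (logistic regression). The file and column names are hypothetical
# stand-ins, not the paper's data or specification.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.read_csv("advisers.csv")  # hypothetical: one row per adviser with misconduct

# separated and female are 0/1 indicators; experience is a control
logit_model = smf.glm(
    "separated ~ female + experience",
    data=df,
    family=sm.families.Binomial(),  # binary outcome: binomial family, logit link
).fit()

print(logit_model.summary())       # coefficients are on the log-odds scale
print(np.exp(logit_model.params))  # exponentiate to read them as odds ratios
```

The key point is the binomial family: the model predicts a probability between 0 and 1, rather than fitting a straight line to a 0/1 outcome.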

9.1.6 How could this happen?

The authors had good data and thought carefully about controls, yet applied a mistaken analysis. How could they have put so much time and effort into the analysis, yet fallen into such an obvious methodological trap?

My gut reaction is that they had no data science colleague down the hall to drop in and consult with.

9.1.7 Who am I to criticize?

Two of the authors are affiliated with business schools that The Economist’s 2017 rankings place in the top five worldwide. The third is affiliated with another university that is highly regarded in a number of other fields.

I am a mere student of applied statistics. If the question of methodological appropriateness should be decided solely on credentials, stop here.

9.1.8 Why am I doing this?

It’s an important issue, and the cause of promoting gender equality in this business can be set back by a faulty analysis.

Harumph, says top management. Must not be true, then. – imaginary executive

It’s also an important reminder not to take impressive looking quantitative analyses at face value.

9.1.9 How do I know?

I’ve run my own multiple regression analyses and logistic models, and I will show what the output looks like. I’ll take only their principal model, separation given misconduct, as the basis for illustrating the issues.

9.1.10 Their results

Table 3b of the paper shows their results on separation given misconduct. If accurate, the results are not pretty. Whether or not accurate, the results are incomplete and misleading, even if their methodology were applicable.

Table 3b

Starting at the top, the first task is to determine what the numbered columns refer to. The first column does not control for differences in the advisers’ experience and licensing, year, firm, or county. The second column adds adviser experience and licensing. The third column adds year, firm, and county. The fourth column adds the possible combinations of licenses that the individual holds.

Essentially these are four different models.

  1. \(Misconduct_i = Person_i\)
  2. \(Misconduct_{iq} = \alpha Female_i + \varepsilon_{iqjlt}\)
  3. \(Misconduct_{iqj} = \alpha Female_i + \beta X_{it} + \varepsilon_{iqjlt}\)
  4. \(Misconduct_{iqjlt} = \alpha Female_i + \beta X_{it} + \mu_{qjlt} + \varepsilon_{iqjlt}\)

as described on page 11.

On page 11, the authors state: “The coefficient on misconduct measures the probability that a male adviser experiences a job separation following misconduct.” (emphasis added). Thus, predicting male separation from some measure of female misconduct and female status, with or without other controls, seems nonsensical. Perhaps the authors expect the reader to infer the corresponding coefficient for the probability that a female adviser experiences a job separation. That dog won’t hunt.

Each column reports a number of “observations,” which correspond to the 10-year period of the data, meaning about 400K-600K observations per year. According to Table 1 on page 62, however, the incidence of misconduct by male advisers in any given year over the 10-year period was 0.72%. Therefore, we expect to observe, on average, observations/10 × 0.072 misconduct incidents by males per year, or, for the first model, approximately 43,350 incidents, of which some proportion resulted in separation. That is the n, or sample size, that goes into the calculations, including the degrees of freedom needed to calculate the F-statistic. From the information presented, we cannot be sure what value of n was relied on.

Finally, the means of the dependent variable show an unexplained similarity. Case three should not be identical to cases one and two if the added independent variables are increasing explanatory power, and the mean in case four differs too little from case three, given the increase in R-squared. Without access to the authors’ data, however, there is no way for me to decide that.

The first line, “Misconduct,” contains the coefficients for each of the columns, given in percentages. (More often, coefficients are expressed as proportions.) A coefficient is a prediction of the effect of the variable on the outcome, holding the other variables constant. The asterisks following each estimate are explained in a note to Table 3b as indicating standard errors less than 0.01. The terminology of expressing standard errors in the form p < 0.01 is also unusual. A p-value, the usual measure of “statistical significance,” is the probability of obtaining a result at least as extreme by chance alone, and it is conventionally tied to whether the t value (the coefficient divided by its standard error) exceeds a critical value. Standard errors are also not generally expressed as percentages.
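As a rough illustration of the usual reporting convention, with made-up numbers rather than values from Table 3b: the t value is the coefficient divided by its standard error, and the p-value is the probability of a t at least that large in magnitude arising by chance alone.

```python
# Illustration only: hypothetical coefficient, standard error, and degrees of
# freedom, not numbers from Table 3b.
from scipy import stats

coef = 0.027        # hypothetical coefficient estimate
se = 0.004          # hypothetical standard error
dof = 100_000       # hypothetical residual degrees of freedom

t_value = coef / se
p_value = 2 * stats.t.sf(abs(t_value), dof)   # two-sided p-value

print(f"t = {t_value:.2f}, p = {p_value:.3g}")
```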

R-squared, sometimes called the coefficient of determination, indicates the proportion of the variance in the data that the independent variables account for. It is usually adjusted, as adjusted R-squared, to penalize additional predictors and better screen out random variation.
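The usual adjustment penalizes the number of predictors \(k\) relative to the number of observations \(n\):

\(R^2_{adj} = 1 - (1 - R^2)\frac{n-1}{n-k-1}\)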

An R-squared equal to 1 means the model completely explains the variance; a value of 0 indicates a complete lack of explanatory power. For the first column, R-squared is 0.004. Forty basis points on a billion dollars is not chump change; the same proportion in an analytic study is. The second column gives an R-squared of 0.014. These two columns tell us essentially nothing, except that the model fails to reflect reality when not controlling for other factors.

Columns three and four show R-squared of 0.332 and 0.403, respectively, which would represent appreciable explanatory power in an appropriately applied multiple linear regression model.

9.1.11 A counterexample: what a data scientist would expect to see in a multiple linear regression model output

Data on Seattle-area housing prices provide a convenient way to illustrate the usual output of a multiple linear regression model. This is a dataset of roughly 21,000 records with 20 variables on housing characteristics and sale price.

The first thing I am going to do is to split the dataset 2:1 into a training set and a test set. I’ll model the training set, then see if I get comparable results on the test set. That’s something the authors of the paper omitted. I’ll be using the R statistical programming language.

```{r bonehead, results="asis", echo=FALSE, warning=FALSE}
library(tidyverse)
house <- read_csv("/Users/rc/projects/statistics-for-data-scientists")
glimpse(house)
train <- house %>% sample_frac(0.66, replace = TRUE)
test <- setdiff(house, train)
mod <- lm(price ~ bedrooms + bathrooms + sqft_living + sqft_lot + yr_built,
          data = train)
summary(mod)
```

Well, now! I pick a handful of variables off the top of my head and presto! My model explains over 50% of the variation in price. Am I good or what?

Not really, and I’ll explain all the bonehead errors in another post (didn’t normalize, used variables that are not truly independent, didn’t check the distribution of residual errors and a host of other sins). But let’s talk about the information you should expect from a multiple linear regression.
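To show what checking against the held-out data looks like, here is a minimal Python sketch of the same train/test workflow, using pandas and scikit-learn. The file name house.csv is a placeholder; the column names mirror the R example above.

```python
# Minimal sketch of the train/test workflow in Python (pandas + scikit-learn).
# "house.csv" is a placeholder path; the columns mirror the R example above.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

house = pd.read_csv("house.csv")
predictors = ["bedrooms", "bathrooms", "sqft_living", "sqft_lot", "yr_built"]

X_train, X_test, y_train, y_test = train_test_split(
    house[predictors], house["price"], test_size=1/3, random_state=42
)

mod = LinearRegression().fit(X_train, y_train)

# R-squared on the data the model saw, and on the data it did not
print("train R^2:", mod.score(X_train, y_train))
print("test  R^2:", mod.score(X_test, y_test))
```

If the test-set R-squared falls well below the training R-squared, the model is overfitting rather than explaining anything.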

9.2 DIY R Packages

keywords: workflow, documentation, repository, github

During exploratory analysis, I often find that after hitting on a promising analysis and beginning to apply it to, say, stratifications, I’m cutting and pasting. That’s when I know it’s time to move beyond scraps, or even scripts, and create a package. I started one for the subprime mortgage analysis, which you can see at subprimeR on GitHub.

9.3 The Short Genome File: Day-to-Day Python

Keywords: Python, one-off, utility

9.3.1 Case Description

One of the recent courses I took at MIT included a simple, short genome data file representing a portion of the DNA sequence of Caulobacter crescentus, a species of bacteria very popular in biological studies. The class problem was K-means clustering based on principal component analysis of the genome data.

9.3.2 Conversion Problem

That data consisted of 1,528 lines of 200-character fragments. The building blocks of DNA genomes are very simple, consisting of sequences of only four possible character values: ‘A’, ‘C’, ‘G’, ‘T’.

The immediate problem was that the assignment called for 300-character fragments. This is a recurring fact of life in data science: transforming data from one structure to another.

9.3.3 Everyday Solution

The obvious tool was Python, and it took no more than 20 minutes, even for someone who uses Python as a hand tool rather than industrial machinery.

In short, we start with

>fragment of c.crescentus genome
gccgatagcctatgatccc… [to 200 characters]

and then apply a short Python program

# required modules
import csv
import itertools
import textwrap

data = "/Users/rc/Desktop/MITx8/ccrescentus.fa"  # your path here
# desired length of genome fragments
fraglen = 300
OBJECT = []  # empty bucket
f = csv.reader(open(data, 'r'))

# read the source file into a list
for row in f:
    OBJECT.append(row)
# remove header '>fragment of c.crescentus genome'
del OBJECT[0]
# combine all of the 200-character lines from the source
laundry = list(itertools.chain(*OBJECT))
# join them back into a single string
washed = ''.join(laundry)
# check whether the data divide evenly into fraglen-character lines
overage = len(washed) % fraglen
# prints True if the string splits evenly into fraglen-character pieces
print(overage == 0)
# split the string into 300-character list elements
dried = textwrap.wrap(washed, fraglen)
# write the processed file back to a processed file, adding
# opening and closing double quotes and new lines
with open('fragments.csv', 'w') as f:
    f.write('"')
    f.write('"\n'.join(dried))
# "fragments.csv" is now ready for analysis with Python or R

and end up with

“gccgatagcctatgatcccc … [to 300 characters]

Although I checked that I had enough data to make each line 300 characters long, I did lose a few bytes in the conversions, which isn’t material for purposes of the exercise.

This is not intended as an example of best-practices Python programming, nor am I a programmer. I use programming to solve problems. If I have a solution like this, I go to a programmer to scale it, when necessary.

9.4 Hard but simple – parsing YAML files in Haskell

It is a truth universally acknowledged that the best way to learn a language is from the lips of a lover. Mine, alas, has no Haskell, so I have made the choice I usually do. Rather than attempting first to master the rules of syntax and the rationale underlying a new computer language, I launch into trying to solve a real problem, by which I mean automating some repetitive task that I would otherwise have to do by hand.

In this case, it was tweaking a LaTeX table. I chose Haskell because I was already using the estimable pandoc to convert the rest of my document and it has a facility to pipe its AST (abstract syntax tree) through to a filter for intermediate processing before rendering the final document.

Alas, the functionality I sought was not available without doing some heavy lifting of internals that is beyond my pay grade as a rank Haskell beginner. I did manage to get a version working, but it had two major defects: First, it relied on parsing regular expressions. I imagine a term like unPythonic applies to this approach – unHaskellian? Second, it lacked a clean separation between logic and data, meaning it would have to be rewritten for each new table that differed from this first use.

The functionality could also be achieved through sed and awk in combination with the other standard tools of bash. But, as I had come this far with Haskell, I determined to continue.

I learned a lot along the way, but I’m only going to report the results. There are any number of great resources on theory and concepts, but recipes seem hard to come by.

To begin, a yaml file, table.yaml:

stripComments: true
stripLable: true
zeroEntry: "."
justifyLeft: true
stubHeader: "Cause of Death"
subHeader: "& (\\%) & (\\%) & (\\%) & (\\%)"

In the same directory, mwe.hs

{-# LANGUAGE OverloadedStrings #-}  -- yaml

module Mwe where

import Data.Yaml 
import Data.Maybe (fromJust)

data ReadData = ReadData { stripComments :: Bool
                         , stripLable    :: Bool
                         , zeroEntry     :: String
                         , justifyLeft   :: Bool
                         , stubHeader    :: String
                         , subHeader     :: String
                         } deriving (Eq, Ord, Show)

instance FromJSON ReadData where
  parseJSON (Object v) = ReadData <$>
                         v .: "stripComments" <*>
                         v .: "stripLable"    <*>
                         v .: "zeroEntry"     <*>
                         v .: "justifyLeft"   <*>
                         v .: "stubHeader"    <*>
                         v .: "subHeader" 
  parseJSON _ = error "Can't parse ReadData from YAML/JSON"

I won’t embarrass myself by revealing how long it took to get this working. I did pick up a fairly solid understanding of types, some insight into typeclasses and instances, and got on the right track for IO and Maybe, with some notion of what it means to be pure in Haskell. What took the longest was figuring out how to do anything with a successful read of a yaml file other than print it, which is where the examples I found stopped. I acknowledge my debt to the many sources I consulted to figure this out.

9.5 Contextual Awareness

Whenever I see a time series that purports to show dramatic results, I like to look back to trace the prior history.

Misleading unemployment data

9.6 Minimalism Throws You into the Pool

I have been dabbling in the hipster language Lua, which you have to be cool even to have heard about. Its claim to fame is minimalism. It has a single data-structuring type, called a table, and I found myself wanting to combine two or more of them à la cat (concatenate). WTF, there is no built-in way to do this? Incredulity began to lift when I discovered that there is not even a built-in way to print tables. As a public service for the benefit of other pilgrims, here it is:

--cat.lua concatenate tables
 function cat( ... ) -- append 1D tables
     local List = require 'pl.List' -- penlight library
     local args = {...}
     local l = List()
     for i=1, #args do
         l:extend(args[i])
     end
     return l
 end
 --[[Example
 a = {
    "gala",
    "cameo",
    "jazz"
 }
 o = {
    "seville",
    "valencia",
    "navel"
 }
 g = {
    "concord",
    "thompson",
    "muscat"
 }
 f = cat(a,o,g)
 {gala,cameo,jazz,seville,valencia,navel,concord,thompson,muscat}
 --]]

9.7 Parsing system date strings into Python datetime objects

from datetime import datetime
from dateutil.parser import parse
from dateutil import tz

s = "Mon Aug 15 21:17:14 GMT 2011"   # a typical system date string
d = parse(s)                         # parse it into a timezone-aware datetime

GMT = tz.gettz('UTC')                # time zone objects
Beijing = tz.gettz('Asia/Shanghai')

there = d.astimezone(GMT)            # the same instant expressed in UTC
here = d.astimezone(Beijing)         # and in Beijing time

print(here)
print(there)

9.8 Combination of k items, taken n at a time

Very functional

-- combo.hs
-- problem: C(k,n), where k = the integers from 1 to 9, inclusive,
-- and n = 3, without regard to order, then sum the subsets
import Data.List (tails)

combinations :: Int -> [a] -> [[a]]
combinations 0 _   = [[]]
combinations n lst = do
    (x:xs) <- tails lst              -- choose x; xs holds the later elements
    rest   <- combinations (n-1) xs  -- choose the remaining n-1 from xs
    return (x : rest)

result :: [Int]
result = map sum (combinations 3 [1..9])

main :: IO ()
main = print result

Python alternative

import itertools
result = map(sum, itertools.combinations([1,2,3,4,5,6,7,8,9], 3))
for i in result: print(i)

9.9 Flex/Bison to compile data parser for June 20, 2018 form

datify.l Flex script

/* NB: OSX, cannot link to -lfl, use -ll */
/* can't have internal comments */

%{
%}

%%

^[0-9]+         {printf("")             ; }
"January "      {printf("2015-01-")     ; }
"February "     {printf("2015-02-")     ; }
"March "        {printf("2015-03-")     ; }
"April "        {printf("2015-04-")     ; }
"May "          {printf("2015-05-")     ; }
"June "         {printf("2015-06-")     ; }
"July "         {printf("2015-07-")     ; }
"August "       {printf("2015-08-")     ; }
"September "    {printf("2015-09-")     ; }
"October "      {printf("2015-10-")     ; }
"November "     {printf("2015-11-")     ; }
"December "     {printf("2015-12-")     ; }

%%

int main()

{
 yylex();
}

datify2.l Flex script

/* NB: OSX, cannot link to -lfl, use -ll */
/* can't have internal comments */

%{
%}

%%

"-1"{1}[ \t\n]  {printf("-01\n")        ; }
"-2"{1}[ \t\n]  {printf("-02\n")        ; }
"-3"{1}[ \t\n]  {printf("-03\n")        ; }
"-4"{1}[ \t\n]  {printf("-04\n")        ; }
"-5"{1}[ \t\n]  {printf("-05\n")        ; }
"-6"{1}[ \t\n]  {printf("-06\n")        ; }
"-7"{1}[ \t\n]  {printf("-07\n")        ; }
"-8"{1}[ \t\n]  {printf("-08\n")        ; }
"-9"{1}[ \t\n]  {printf("-09\n")        ; }

%%

int main()
{
 yylex();
}