3 The 2007 subprime mortgage crisis, part 1

Keywords: large database, MySQL, Python, R, data wrangling, exploratory data analysis, ordinary least squares linear regression, diagnostics

3.1 Case Description

This is a residential mortgage credit underwriting review of 125,000 subprime mortgage loans sold in mortgage-backed securities form in 2006 by a single issuer, Washington Mutual Bank, whose assets JPMorgan Chase acquired in 2008.

No single definition of what made a loan subprime is universally accepted. The market, however, distinquishes subprime loans from other loans based on the cash flow features of the related mortgage backed security. In particular subprime transactions often feature a credit enhancement known as over collateralization, which means that the dollar volume of loans in the pool is greater than the dollar volume of securities issued, providing a cushion against loss.

Each loan has a handful of associated quantitative variables and a few dozen qualitative variables. The characteristics of the dataset help illustrate many of the challenges of data science

Correctly classifying problems as confirmatory or observational
Data wrangling
Treating a problem as a predictive or classification
Exploratory data analysis, knowing what you have to work with
Keeping an open mind on methods
Carefully checking the assumptions that the methods require
Worrying over diagnostics
Finding domain knowledge

3.2 Data Wrangling

3.2.1 The pool dataset

Washington Mutual sold $25 billion of subprime mortgage securities in 2006.

+----------+-------------+
| count(*) | sum(obal)   |
+----------+-------------+
|   124645 | 25542466576 |
+----------+-------------+

During 2006, however, the pool shrank to just under 100,000 loans with an aggregate original principal balance of just over $19 billion.

+----------+-------------+
| count(*) | sum(obal)   |
+----------+-------------+
|    96271 | 19148472600 |
+----------+-------------+

This diminished pool, as it existed as of the last date in early 2007 for which I had data, I will refer to as the pool. Here is its layout after assembly and the addition of some calculated fields. The name of the database, dlf stands for delinquency, loss and foreclosure, a business term. The table name y7 stands for 2007 to distinguish it from the original pool y6. The original version was done under the MySQL open source database, and converted to MariaDB, which was created to maintain the software as open source following Oracle’s acquisition of the MySQL organization.

MariaDB [dlf]> describe y7;
+----------+---------------+------+-----+---------+-------+
| Field    | Type          | Null | Key | Default | Extra |
+----------+---------------+------+-----+---------+-------+
| city     | varchar(25)   | YES  |     | NULL    |       |
| cltv     | decimal(10,0) | YES  |     | NULL    |       |
| ctapeno  | decimal(10,0) | YES  |     | NULL    |       |
| deal     | varchar(25)   | YES  |     | NULL    |       |
| down     | int(11)       | YES  |     | NULL    |       |
| dti      | decimal(10,0) | YES  |     | NULL    |       |
| dtype    | varchar(25)   | YES  |     | NULL    |       |
| ebal     | decimal(10,0) | YES  |     | NULL    |       |
| fico     | decimal(10,0) | YES  |     | NULL    |       |
| fpd      | date          | YES  |     | NULL    |       |
| gainloss | decimal(10,0) | YES  |     | NULL    |       |
| grade    | varchar(5)    | YES  |     | NULL    |       |
| irate    | decimal(10,0) | YES  |     | NULL    |       |
| issuer   | varchar(10)   | YES  |     | NULL    |       |
| lat      | decimal(10,0) | YES  |     | NULL    |       |
| lien     | int(11)       | YES  |     | NULL    |       |
| liq      | decimal(10,0) | YES  |     | NULL    |       |
| lng      | decimal(10,0) | YES  |     | NULL    |       |
| lstatus  | int(11)       | YES  |     | NULL    |       |
| ltype    | varchar(25)   | YES  |     | NULL    |       |
| margin   | decimal(10,0) | YES  |     | NULL    |       |
| metro    | varchar(25)   | YES  |     | NULL    |       |
| nrate    | decimal(10,0) | YES  |     | NULL    |       |
| obal     | decimal(10,0) | YES  |     | NULL    |       |
| odate    | date          | YES  |     | NULL    |       |
| oltv     | decimal(10,0) | YES  |     | NULL    |       |
| orate    | decimal(10,0) | YES  |     | NULL    |       |
| oterm    | int(11)       | YES  |     | NULL    |       |
| payments | int(11)       | YES  |     | NULL    |       |
| pmiflag  | int(11)       | YES  |     | NULL    |       |
| pocode   | int(11)       | YES  |     | NULL    |       |
| pod      | date          | YES  |     | NULL    |       |
| ppp      | int(11)       | YES  |     | NULL    |       |
| ptd      | date          | YES  |     | NULL    |       |
| purpose  | varchar(25)   | YES  |     | NULL    |       |
| remit    | decimal(10,0) | YES  |     | NULL    |       |
| sbal     | decimal(10,0) | YES  |     | NULL    |       |
| servno   | decimal(10,0) | YES  |     | NULL    |       |
| sint     | decimal(10,0) | YES  |     | NULL    |       |
| sprin    | decimal(10,0) | YES  |     | NULL    |       |
| spymt    | decimal(10,0) | YES  |     | NULL    |       |
| st       | varchar(4)    | YES  |     | NULL    |       |
| zip      | int(11)       | YES  |     | NULL    |       |
| otype    | varchar(25)   | YES  |     | NULL    |       |
| rdate    | date          | YES  |     | NULL    |       |
| ptype    | varchar(15)   | YES  |     | NULL    |       |
| dptd     | int(7)        | YES  |     | NULL    |       |
| dfpd     | int(7)        | YES  |     | NULL    |       |
+----------+---------------+------+-----+---------+-------+
48 rows in set (0.00 sec)

The field names are mostly standard abbreviations in the business, although practice varied.

As of the end of the first quarter 2007 down is the number of months that a loan has missed payment. A value greater than 3 generally indicates a loan in foreclosure or a loan that has been foreclosed and is in the process of being sold. However, a borrower in many jurisdictions may reinstate the loan by paying past due amounts with interest and fees.

3.2.2 Creating the data set

Data came from public and semi-public sources. The loan level data at origination came from 14 HTML files. The performance data came from spreadsheets posted to a password protected website available to institutional investors, the only customers for mortgage backed securities not issued by a non-governmental sponsored issuer. I describe the reasons why the in-house systems were bypassed in this sidenote.

The HTML data was easy to extract. The header and each record was tagged as preformatted

<pre>payload</pre>

where payload represents the data fields. An example source document comes from a filing with the Securities and Exchange Commission for the first transaction in 2006. This type of file was processed with a bash command line script, the separate tables imported into temporary databases that were joined on the loan number key.

One of the typical data scrubbing tasks was converting dates into ISO form – from 3/6/2006, say, to 2006-03-06, for which I wrote a version 2 Python script.

The performance data was available only in spreadsheet form, by month and transaction, which required considerable scrubbing to bring into the database. (For example, some dollar values were entered as character strings; when these problems cropped up, I wrote small bash, Perl or Python scripts.) I combined the spreadsheets into a single plaintext file to import to the database, used a supplemental zipcode lookup table to identify the metropolitan area (for possible use with the Case Shiller Index of Home Prices) and added the longitude/latitude coordinates of centroid of the Census version of the zipcode (the ZCTA) for possible mapping.

3.3 Choosing the outcome measure

A mortgage payment is delinquent when a payment due on the first of the month has not been received by the first business day of the following month. (For some government sponsored programs, not included here, the convention is receipt by the last business day of the month in which due.)

The first decision is whether to measure outcome on the number of loans or on the weighted principal balance of each loan. Investors generally look at pool performance in terms of the scheduled aggregate principal balance due versus the amount received each month. Because each class of investor has a different perspective depending on where it falls in the priority of payment (the waterfall), no single measure of balance weighted delinquency will be relevant to all investors. Therefore, I begin the analysis with the unweighted number of delinquencies.

3.4 Defining the outcome measure

The simplest measure is the percentage of loans that are current in any given month, regardless of past payment history. For the July 1, 2007 due date, out of 96,271 loans, 87,369 were current, or approximately 90.8%. Of the remainder 4,844, approximately 5.0% were down one payment, and 3,015, approximately 3.1% were in default (three payments down) or in the process of foreclosure.

The delinquency rate is often defined as the percentage of loans 60 or more days delinquent, meaning that a payment due in month one was not receive in month two or three. Under that measure, the delinquency rate for the pool at July 1, 2007 was approximtely 8.1%.

A retrospective study found that the median delinquency rate for all subprime loans, including those originated in earlier years, to be 12.2%, with a range of 3% to 25%. The mid-year performance of the pool at mid-2007 was, accordingly, below the median.

The most complex measure of delinquency would be the pattern over a period of, say, 12 months of payment versus non-payment. A loan, for example, could be delinquent every other month but happen to be current on the measurement date. Unfortunately there are 4,096 payment patterns in a year. Also, many of the payment patterns consist of a long string of defaulted loans in foreclosure.

Other alternatives would the number of months since the last payment, whether the loan had missed two or more payments at the measurement date or three or more payments, and the percentage of payments made before entering the foreclosure process,

Payment status at June 1, 2007

3.5 Preliminary analysis, based on FICO scores, using linear regression

3.5.1 FICO’s importance

In 1995, Freddie Mac advised lenders that it had found that consumer credit scores developed by Fair, Issac and Company, Inc. (FICO scores) strongly predicted the likelihood of default on mortgage loans. Freddie Mac’s communication is reproduced as Attachment 2 in testimony before the U.S. House Committee on Oversight and Government Reform, the [Pinto testimony], beginning at page 28 of the pdf.

A traditional description of the limitations of credit scores is similar to the following:

Third-party credit reporting organizations provide credit scores as an aid to lenders in evaluating the creditworthiness of borrowers. Although different credit reporting organizations use different methodologies, higher credit scores indicate greater creditworthiness. Credit scores do not necessarily correspond to the probability of default over the life of the related mortgage loan because they reflect past credit history, rather than an assessment of future payment performance. In addition, the credit scores shown were collected from a variety of sources over a period of weeks, months or longer, and the credit scores do not necessarily reflect the credit scores that would be reported as of the date of this prospectus supplement. Credit scores also only indicate general consumer creditworthiness, and credit scores are not intended to specifically apply to mortgage debt. Therefore, credit scores should not be considered as an accurate predictor of the likelihood of repayment of the related mortgage loans.

(From a 2002 transaction.)

The rating agencies and buyers involved in residential mortgage backed securities, however, attached considerable importance to credit scores, generically referred to as FICOs. Therefore, the FICO composition was an obvious starting point.

3.5.2 Performance outcome measure

Down is the number of missed payments as of the latest date. If a borrower misses three consequtive payments, foreclosure is triggered, which may be a relatively quick process in states that do not require a court proceeding or relatively lengthy in those that do. It is what we are trying to predict with the data at hand. It will be the dependent variable.

3.5.3 FICO scores in the pool

FICO scores have a minimum value of 300, and a maximum value of 850. Foreign borrowers have a FICO score of 0. The summary statistics are:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   603.0   632.0   633.6   665.0   821.0

and the mode is 620. The pool included two loans with FICO scores of 0, indicating that they were foreign buyers.

Sixteen other loans had scores below 500:

Table 3.1: Number of low-FICO loans by deal
deal	down	fico	dti	cltv	obal	orate	fpd	ptd	grade	purpose	dtype	ltype	otype	ptype	zip	metro
LBMLT 2006-WL1	0	497	32	100	58,000	9	2005-06-01	2007-06-01	AP	Purchase	Full	2/28 LIBOR	OO	SFR	29,301	Greenville-Spartanburg-An
LBMLT 2006-WL1	0	494	51	80	52,800	10	2005-10-01	2007-06-01	B	Refi - Cashout	Full	3/27 LIBOR	OO	SFR	77,021	Houston-Galveston-Brazori
LBMLT 2006-WL1	1	473	29	43	50,000	9	2005-10-01	2007-05-01	B	Refi - Cashout	Stated	2/28 LIBOR	OO	SFR	49,461	Grand Rapids-Muskegon-Hol
LBMLT 2006-WL1	0	497	51	80	388,000	8	2005-06-01	2007-06-01	B+	Refi - Cashout	Full	2/28 LIBOR	OO	SFR	93,635	Merced, CA
LBMLT 2006-WL3	1	498	52	100	67,500	10	2005-11-01	2007-05-01	A	Purchase	Full	2/28 LIBOR	OO	SFR	61,603	Peoria-Pekin, IL
LBMLT 2006-2	0	470	33	80	327,975	8	2005-09-01	2007-06-01	AP	Refi - Cashout	Full	2/28 LIBOR	OO	SFR	19,341	Philadelphia-Wilmington-A
LBMLT 2006-2	2	481	42	77	195,194	8	2005-11-01	2007-04-01	B	Refi - Cashout	Full	2/28 LIBOR	OO	SFR	20,019	Washington-Baltimore, DC-
LBMLT 2006-2	0	493	39	80	148,000	10	2006-01-01	2007-06-01	C	Refi - No Cashout	Full	2/28 LIBOR	OO	SFR	30,115	Atlanta, GA
LBMLT 2006-2	0	491	37	95	142,500	9	2005-07-01	2007-06-01	AP	Purchase	Full	2/28 LIBOR	OO	SFR	75,052	Dallas-Fort Worth, TX (C)
LBMLT 2006-2	0	465	39	75	93,750	8	2005-07-01	2007-06-01	AP	Refi - Cashout	Full	2/28 LIBOR	NOO	2-4 Units	27,406	Greensboro–Winston Salem

The cliffs around 500 (below which only the small number of loans in the tables above are included), 600, 630 and 635 represent the cumulative segmentation of the pools to obtain favorable ratings. Not all deals followed this approach, however.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The variability among transactions suggest that if FICO scores have an influence on default rates, it may be necessary to stratify of otherwise transform the data to obtain useful results.

3.5.4 The FICO scores are not normally distributed

As a group, FICO scores have fat tails, a trait that is present in each of the deals to some degree. There are more loans with very low FICO scores than you would expect if score were randomly distributed plus more loans with high score. We also see a dip in the 500-600 range (loans that may only have been originated due to compensating factors), which are fewer than the sub-500 FICO loans.

An approximately 12% random sample with replacement of normalized scores shows a similar pattern.

With variations, all the deals in the sample of normalized FICO scores have similiar distributions.

A candidate variable that is not normally distributed, even when sampled, raises a red flag. Can it be used as part of a regression model? Let’s find out.

3.5.5 None of the quantitative variables can be used in linear regression for the pool as a whole

fico_sam_mod <- lm(down ~ fico, data = y7samtrain_scaled)
library(ggfortify)
autoplot(fico_sam_mod, label.size = 3)

The diagnostic on the upper right corner shows the distribution of the residuals (differences between the data point and the best fit line) compared to the expected difference, represented by the dashed line, a normal distribution.

This eliminates fico as a candidate independent variable in any ordinary least square linear regression model.

We see the same problem with the debt-to-income ratio, dti,

combined loan-to-value ratio, cltv,

original balance, obal,

original rate, orate,

and all the quantitative variables taken together, dipping our toes into multiple linear regression.

Clearly something is amiss. It’s time to test whether the model assumptions are sane:


Call:
lm(formula = down ~ dti + cltv + obal + orate, data = y7samtrain_scaled)

Coefficients:
(Intercept)          dti         cltv         obal        orate  
    1.47369      0.01819      0.32053      0.47700      0.52147  


ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance =  0.05 

Call:
 gvlma(x = lm(down ~ dti + cltv + obal + orate, data = y7samtrain_scaled)) 

                       Value   p-value                   Decision
Global Stat        2.019e+04 0.000e+00 Assumptions NOT satisfied!
Skewness           7.879e+03 0.000e+00 Assumptions NOT satisfied!
Kurtosis           1.230e+04 0.000e+00 Assumptions NOT satisfied!
Link Function      1.618e+01 5.752e-05 Assumptions NOT satisfied!
Heteroscedasticity 5.138e-01 4.735e-01    Assumptions acceptable.

In short, a test of the model residuals highlights numerically what we can see in the diagnostic plots, that they are not normally distributed, but markedly skewed with a high kurtosis (many extreme outliers). The Global Stats measure provided by the gvlma package tests for linearity of each predictor variable, holding the others constant, that the slopes bear no interdependence and the effects of each are additive. The failure of the Link Function is just another indication of the non-normality of the residuals

Heteroscedasticity is absent, none of the residuals have different variability from the others, which is the only assumption satisfied.

Had we relied solely on the summary output of the model and failed to look at either the diagnostic plots or the output of gvlma, we might have fooled ourselves into thinking that we had at least accounted for a small proportion of the variance in the response variable.


Call:
lm(formula = down ~ dti + cltv + obal + orate, data = y7samtrain_scaled)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.9930 -1.6141 -1.0718  0.0088 16.4915 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.47369    0.03484  42.295   <2e-16 ***
dti          0.01819    0.03549   0.513    0.608    
cltv         0.32053    0.03565   8.990   <2e-16 ***
obal         0.47700    0.03919  12.173   <2e-16 ***
orate        0.52147    0.03908  13.344   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.058 on 7701 degrees of freedom
Multiple R-squared:  0.04327,   Adjusted R-squared:  0.04277 
F-statistic: 87.07 on 4 and 7701 DF,  p-value: < 2.2e-16

In the following sections, we will reconsider outcome measures and turn to logistic regression and use the categorical variables available in the data.