1 Purpose

This site shows where Richard Careaga’s data science skills stand in mid-2018.

Its intent is to provide prospective employers with concrete evidence of my abilities to do the work of a data sciencist. The content is all my own work, and none of the cases are based on classroom assignments, except where indicated. The style of these pages is directed to that purpose. They are not academic papers, industry conference papers, nor reports to management. They are designed to show you my thought processes and methods.

1.1 Updates

2018-06-23 added qqnorm tests to FICO example in Failure Analysis chapter, part 1
2018-06-26 stratified Failure Analysis 125K dataset to censor loan dropouts during 2006, classified remaining loans into performance categories based on monthly payment/nonpayment data and compared FICO distributions in the categories
2018-06-28 illustrated non-normality of FICO distributions with Shapiro-Wilks test; refactored database for principal component analysis
2018-07-01 performed PCA on sample
2018-07-02 performed K-means clustering
2018-07-06 Correspondence analysis performance x clusters
2018-07-08 Correspondence analysis of categorical variables
2018-07-11 Revised qq plots to take advantage of new ggplot2 v 3.0.0 feature, added R package to tools chapter, new credential on this page
2018-07-14 Logistic regression of FICO on performance
2018-07-21 added PH125.6x: Data Science: Wrangling certificate
2018-08-04 added PH125.5x: Data Science: Productivity certificate
2018-08-04 new blog at http://technocrat.rbind.io
2018-08-14 added PH125.7x: Data Science: Linear Regression certificate
2018-08-15 Stratified suprime data set
2018-09-01 Exhuasted linear regression on subprime loan dataset
2018-09-07 Major revisions to part 1 and part 2 of subprime dataset
2018-09-08 Completed revisions to part 1 of subprime dataset (linear regression)
2018-09-09 Completed revisions to part 2 of subprime dataset (logistic regression using only quntitative variables)
2018-09-10 corrected keywords in revised part 2 of subprime dataset
2018-09-11 site configuration change
2018-09-19 added section for credits
2018-10-02 Added final certificate for BerkeleyX program
2018-10-09 fixed broken links
2018-10-25 revised linear regression discussion in subprime dataset part 1

1.2 Literate programming: the tight integration of code and text

An analysis can have different audiences, and one of those may be peers, who may want to look under the hood to see exactly how the data were processed to produce the results given. This book is in that style and the RMarkdown files used to produce this portfolio are all available:

git clone https://github.com/technocrat/technocrat.github.io

1.3 Background

In 2007, I put much effort into acquiring the nuts-and-bolts of data science, dusting off old statistial learning, and throwing them into the front line of the initial skirmishes of the Great Recession. This was not something in my job description at Washington Mutual Bank, the largest ever to fail in the U.S. Fortunately, I was senior enough to decide on my own how best to spend my time.

I spent a month split between data acquisition and learning three new software tools – MySQL, R and Python. That dataset is one of several cases studies in this document that I use to show what you can expect if you take me on as a data scientist.

The cases range from spreadsheet sized (a few hundred records) to small (a thousand), middling (125,000) and largest (500,000). As the cases get larger, the number of fields grow, as well, along with the data clean-up.

1.4 The Cases

Each of the cases is designed to illustrate one or more specific skill by presenting an example and explaining what motivated it, what it does, the tools used, and what its output accomplished.

2015 police involved homicides: Descriptive statistics, observational data hypothesis testing
The financial cash flow model: OOP Python, model derivation from narrative
Failure analysis of subprime Loans: MySQL,R, exploratory data analysis, high-dimensional data, covariance, clustering, regression, principa component analyis, machine learning
The Enron email corpus: Python, NLTK, social network analysis, de-duplication, stopwords, boilerplate stripping
Daycare costs and unexamined Assumptions: Data skepticism
Examples of utility programming in Python, Haskell, Lua and Flex/Bison

1.5 Current Credentials

I’ve completed the following online courses, to consolidate and expand my previous training and experience in data science, which sprung, ultimately, from undergraduate and graduate degrees in geology and geophysics. My plan is to use these cases to apply the many new techniques that I have learned to date, and expect to learn as I complete the remaining courses in the series.

1.5.1 Graduate school level

MITxPro Data Science and Big Data Analytics: Making Data-Driven Decisions

HarvardX: PH559x Causal Diagrams: Draw Your Assumptions Before Your Conclusions

PH125.1x: Data Science: R Basics

PH125.2x: Data Science: Visualization

PH125.3x: Data Science: Probability

PH125.4x: Data Science: Inference and Modeling

Harvard PH125.5x: Data Science, Productivity Tools

HarvardX PH125.6x: Data Science: Data Wrangling

PH125.7x: Data Science: Linear Regression

1.5.2 Undergraduate level

BerkeleyX Professional Certificate Foundations of Data Science

1.5.3 Prior analytic and programming experience

My prior analytic education and experience was in geology/geophysics (M.S.) and law (J.D.) I have been on *nix as my own sysadm since 1984, including Venix (a v7 derivative), Irix (SGI), Ubuntu and other Linux versions, and Mac OSx. My orientation is strongly CLI, rather than GUI, but I have used Excel and Word since their early release. I have non-PC experience with the IBM S/34 and implementation of payroll, general ledger and budgeting. During the mid-90s, I installed, configured and operated http, mail, news, proxy, certificate and LDAP servers. I am familiar with many of the important bash tools (which I use on a daily basis) and with C and Perl (which I can still read, but seldom use).

When we got our numbers off of greenbar, no one cared how pretty it looked.

Data Science Portfolio