1 Purpose

This site shows where Richard Careaga’s data science skills stand in mid-2018.

Its intent is to provide prospective employers with concrete evidence of my ability to do the work of a data scientist. The content is all my own work, and none of the cases are based on classroom assignments, except where indicated. The style of these pages is directed to that purpose: they are not academic papers, industry conference papers, or reports to management. They are designed to show you my thought processes and methods.

1.1 Updates

  • 2018-06-23 added qqnorm tests to the FICO example in the Failure Analysis chapter, part 1
  • 2018-06-26 stratified the Failure Analysis 125K dataset to censor loan dropouts during 2006, classified the remaining loans into performance categories based on monthly payment/nonpayment data, and compared FICO distributions across the categories
  • 2018-06-28 illustrated the non-normality of FICO distributions with the Shapiro-Wilk test; refactored the database for principal component analysis
  • 2018-07-01 performed PCA on a sample
  • 2018-07-02 performed K-means clustering
  • 2018-07-06 correspondence analysis of performance x clusters
  • 2018-07-08 correspondence analysis of categorical variables
  • 2018-07-11 revised qq plots to take advantage of a new ggplot2 v3.0.0 feature (see the sketch at the end of this list), added an R package to the tools chapter, new credential on this page
  • 2018-07-14 logistic regression of FICO on performance
  • 2018-07-21 added PH125.6x: Data Science: Wrangling certificate
  • 2018-08-04 added PH125.5x: Data Science: Productivity certificate
  • 2018-08-04 new blog at http://technocrat.rbind.io
  • 2018-08-14 added PH125.7x: Data Science: Linear Regression certificate
  • 2018-08-15 stratified the subprime dataset
  • 2018-09-01 exhausted linear regression on the subprime loan dataset
  • 2018-09-07 major revisions to parts 1 and 2 of the subprime dataset
  • 2018-09-08 completed revisions to part 1 of the subprime dataset (linear regression)
  • 2018-09-09 completed revisions to part 2 of the subprime dataset (logistic regression using only quantitative variables)
  • 2018-09-10 corrected keywords in the revised part 2 of the subprime dataset
  • 2018-09-11 site configuration change
  • 2018-09-19 added a section for credits
  • 2018-10-02 added the final certificate for the BerkeleyX program
  • 2018-10-09 fixed broken links
  • 2018-10-25 revised the linear regression discussion in subprime dataset part 1
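
As a companion to the 2018-07-11 entry above, here is a minimal sketch of a qq plot using stat_qq_line(), the reference-line layer introduced in ggplot2 v3.0.0. The scores below are simulated placeholders, not data from the portfolio:

library(ggplot2)  # requires version >= 3.0.0 for stat_qq_line()

set.seed(42)
scores <- data.frame(fico = rnorm(500, mean = 690, sd = 60))  # simulated stand-in for FICO scores

ggplot(scores, aes(sample = fico)) +
  stat_qq() +       # quantile-quantile points against a theoretical normal
  stat_qq_line()    # reference line through the quartiles, new in v3.0.0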

1.2 Literate programming: the tight integration of code and text

An analysis can have different audiences, and one of those may be peers who want to look under the hood to see exactly how the data were processed to produce the results given. This book is written in that style, and the RMarkdown files used to produce this portfolio are all available:

git clone https://github.com/technocrat/technocrat.github.io
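
Once cloned, the site can be rebuilt from those sources. A minimal sketch, assuming the repository is a conventional bookdown project whose entry file is index.Rmd (the exact file name is an assumption):

# one-time setup
install.packages(c("bookdown", "rmarkdown"))

# render every chapter into the site's output directory
bookdown::render_book("index.Rmd")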

1.3 Background

In 2007, I put much effort into acquiring the nuts and bolts of data science, dusting off old statistical learning and throwing it into the front line of the initial skirmishes of the Great Recession. This was not something in my job description at Washington Mutual Bank, the largest bank ever to fail in the U.S. Fortunately, I was senior enough to decide on my own how best to spend my time.

I spent a month split between data acquisition and learning three new software tools – MySQL, R and Python. That dataset is one of several case studies in this document that I use to show what you can expect if you take me on as a data scientist.

The cases range from spreadsheet-sized (a few hundred records) to small (a thousand), middling (125,000) and large (500,000). As the cases get larger, the number of fields grows as well, along with the data clean-up effort.

1.4 The Cases

Each of the cases is designed to illustrate one or more specific skills by presenting an example and explaining what motivated it, what it does, the tools used, and what its output accomplished.

  • 2015 police-involved homicides: Descriptive statistics, hypothesis testing on observational data
  • The financial cash flow model: OOP Python, model derivation from narrative
  • Failure analysis of subprime loans: MySQL, R, exploratory data analysis, high-dimensional data, covariance, clustering, regression, principal component analysis, machine learning
  • The Enron email corpus: Python, NLTK, social network analysis, de-duplication, stopwords, boilerplate stripping
  • Daycare costs and unexamined assumptions: Data skepticism
  • Examples of utility programming in Python, Haskell, Lua and Flex/Bison

1.5 Current Credentials

I’ve completed the following online courses to consolidate and expand my previous training and experience in data science, which sprang, ultimately, from undergraduate and graduate degrees in geology and geophysics. My plan is to use these cases to apply the many new techniques that I have learned to date, and expect to learn as I complete the remaining courses in the series.

1.5.3 Prior analytic and programming experience

My prior analytic education and experience was in geology/geophysics (M.S.) and law (J.D.). I have been on *nix as my own sysadmin since 1984, including Venix (a v7 derivative), Irix (SGI), Ubuntu and other Linux distributions, and Mac OS X. My orientation is strongly CLI rather than GUI, but I have used Excel and Word since their early releases. I have non-PC experience with the IBM S/34 and the implementation of payroll, general ledger and budgeting systems. During the mid-90s, I installed, configured and operated HTTP, mail, news, proxy, certificate and LDAP servers. I am familiar with many of the important bash tools (which I use on a daily basis) and with C and Perl (which I can still read, but seldom use).

When we got our numbers off greenbar, no one cared how pretty they looked.