Intro to R

Jacob Fenton
Sunlight Foundation
jfenton@sunlightfoundation.com

This presentation is available at:
http://bit.ly/correlation-does-not-imply-causation

If you're following along at home jump to installation

0. Why R?

1. Why not R?

2. OK, it sounds good, but I hate the command line

3. I'm a bad-ass programmer running a monstrous horde of news apps. I don't wanna do anything outside ruby.

4. *Disclaimer* - Running with scissors

5. Fire up R

If you're following along at home jump to installation.

6. Do some basic viz

I sure hope this works out in the labs.

  • Run a histogram of the median income.
        > hist(counties$median_hh_income)
        
    You should see something like this:

    Here's how to print out the graphic to a file:
        > png("median_income.png", width=500, height=300, units="px")
        > hist(counties$median_hh_income)
        > # Now turn the graphic output off
        > dev.off()
        quartz 
             2 
        > # wait, where did it go? 
        > getwd()
        [1] "/path/to/where/the/png/is/now"
        >        
        
  • 7. Side track: what the heck is c

    8. Style the chart a bit more

    Relationships between the data

    9. Look at the data more

  • R's plot command is pretty robust. Lets get to it.
       > plot(counties$median_hh_income, counties$ba_plus)
       
        

  • So there's obviously a relationship there. How much?
        > cor(counties$median_hh_income, counties$ba_plus)
        [1] NA
        > # We can't correlate because there are null values.
        > cor(counties$median_hh_income, counties$ba_plus, use="complete.obs")
        [1] 0.6763103
        
    In the above, we're just saying use only the complete observations. That could be dangerous.
    Also note--the default is pearson correlation. But sometimes it's useful to use spearman (rank)
        > cor(counties$median_hh_income, counties$ba_plus, use="complete.obs", method='pearson')
        [1] 0.6763103
        > cor(counties$median_hh_income, counties$ba_plus, use="complete.obs", method='spearman')
        [1] 0.6236564
    
  • What about the uncertainty of the correlation?
         > cor.test(counties$median_hh_income, counties$ba_plus, use="complete.obs")
    
         	Pearson's product-moment correlation
    
         data:  counties$median_hh_income and counties$ba_plus 
         t = 51.4071, df = 3135, p-value < 2.2e-16
         alternative hypothesis: true correlation is not equal to 0 
         95 percent confidence interval:
          0.6568609 0.6948604 
         sample estimates:
               cor 
         0.6763103 
    
         >             
            
  • But we're only examining one relationship this way. Lets do all of them. I'm gonna transform it into a matrix. Don't ask why, I'm not really sure.
        > county_matrix <- data.matrix(counties)
        > cor(county_matrix, use="complete.obs")
                               state           co         pop median_hh_income      ba_plus
        state             1.00000000  0.013705958 -0.05401844       0.03493939  0.035086788
        co                0.01370596  1.000000000  0.01460693       0.02956972  0.004772263
        pop              -0.05401844  0.014606927  1.00000000       0.26945683  0.325710503
        median_hh_income  0.03493939  0.029569722  0.26945683       1.00000000  0.685827752
        ba_plus           0.03508679  0.004772263  0.32571050       0.68582775  1.000000000
        white_pcnt        0.12284483  0.002043147 -0.17966726       0.13377545  0.013097471
        black_pcnt       -0.12748272 -0.008496116  0.06352790      -0.23045513 -0.097418256
        hispanic_pcnt     0.08728682  0.028717294  0.18910339       0.01684250  0.020222111
        pov_pcnt         -0.06326769 -0.014163910 -0.13208919      -0.79398962 -0.432999042
                           white_pcnt   black_pcnt hispanic_pcnt    pov_pcnt
        state             0.122844834 -0.127482722    0.08728682 -0.06326769
        co                0.002043147 -0.008496116    0.02871729 -0.01416391
        pop              -0.179667258  0.063527896    0.18910339 -0.13208919
        median_hh_income  0.133775455 -0.230455130    0.01684250 -0.79398962
        ba_plus           0.013097471 -0.097418256    0.02022211 -0.43299904
        white_pcnt        1.000000000 -0.832365355   -0.15548590 -0.41024198
        black_pcnt       -0.832365355  1.000000000   -0.11994898  0.42466780
        hispanic_pcnt    -0.155485898 -0.119948976    1.00000000  0.09953580
        pov_pcnt         -0.410241980  0.424667802    0.09953580  1.00000000        
        
    That's hard to read, lets write it out to a table in the filesystem:
        > cor_found <- cor(county_matrix, use="complete.obs")
        > write.table( cor_found, file="correlations.txt", sep="|", eol="\n", row.names=TRUE)
    
  • statecopopincomeba_pluswhite_pcntblack_pcnthispanic_pcntpov_pcnt
    state1.000.01-0.050.030.040.12-0.130.09-0.06
    co0.011.000.010.030.000.00-0.010.03-0.01
    pop-0.050.011.000.270.33-0.180.060.19-0.13
    income0.030.030.271.000.690.13-0.230.02-0.79
    ba_plus0.040.000.330.691.000.01-0.100.02-0.43
    white_pcnt0.120.00-0.180.130.011.00-0.83-0.16-0.41
    black_pcnt-0.13-0.010.06-0.23-0.10-0.831.00-0.120.42
    hispanic_pcnt0.090.030.190.020.02-0.16-0.121.000.10
    pov_pcnt-0.06-0.01-0.13-0.79-0.43-0.410.420.101.00

    10. Pearson correlation is for *linear* correlation

    Shamelessly ripped off from wikipedia

    If you're looking for *non-linear* relationships use spearman correlation...

    Done!

    But if we're not out of time yet, here's more

    11. Adding libraries

    All kinds of good stuff is put out by random people (typically people who know way more about it than I). Stuff is different on windows. Here's how it works on Mac, roughly:

    I skipped this because it varies with platform. It's pretty painless on my rig, just remember to click the 'add dependencies' button, which oddly isn't checked by default.

    12. Getting help

    There's a system help. It's not incredibly helpful for beginners (though it's invaluable if you really want to know how something works). You can get to it with windows. In general, typing '?command' at the console *should* launch it, but in the labs it doesn't. I could make it work by cutting a pasting the local help url--once it's started in the labs--into a browser bar.
    http://127.0.0.1:10788/doc/html/index.html This might work in the labs (?)
    http://127.0.0.1:10788/library/base/html/getwd.html

    13. Working example: dates, ggplot2, etc.

    Here's a data file, call it faceted.txt

        a <- read.delim("/Users/jacobfenton/IRW/banks_small_biz/parse_custom/custom_read.csv",header=TRUE, sep=",", quote='"')
        
    

    14. Installation

    There are installers for Mac and for windows. Installing with homebrew to use at the command line, for example, seems to make it harder to do thing like load and install packages from remote libraries.