DS 4100 Data Collection, Integration, and Analysis

Everyone is downloading R; meanwhile, I’m just sitting here finishing up Java.

Data is stored as objects in R. Objects are created by:
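
The most common way is plain assignment, so here is a minimal sketch of that. Every name in it (n, name, v) is made up for illustration and doesn't come from the lecture.

```r
# All object names below are hypothetical, chosen only for illustration.
n <- 42               # <- is the usual assignment operator
name = "sunspots"     # = also assigns when used at the top level
v <- c(1.5, 2.5, 4)   # c() combines values into a single vector

class(n)      # type of the object n
class(name)   # type of the object name
length(v)     # number of elements in v
ls()          # lists the objects created so far
```

Both operators create the object if it doesn't exist and overwrite it if it does.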

Now we’re going over the basics of R, which is pretty much what I learned in the past six tutorials.

> seq(from=50, to=52, along=x)
[1] 50.0 50.2 50.4 50.6 50.8 51.0 51.2 51.4 51.6 51.8 52.0
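
The x in that call isn't shown in these notes. Since along= (short for along.with) makes the result as long as x, it must have had 11 elements. A small sketch, assuming a stand-in x of that length:

```r
# x is a hypothetical stand-in; only its length (11) matters here.
x <- 1:11

# along.with makes the sequence exactly as long as x
s_along <- seq(from = 50, to = 52, along.with = x)

# Two other ways to ask for the same sequence
s_len <- seq(from = 50, to = 52, length.out = 11)
s_by  <- seq(from = 50, to = 52, by = 0.2)

length(s_along) == length(x)
isTRUE(all.equal(s_along, s_len))
isTRUE(all.equal(s_along, s_by))
```

all.equal() is used instead of == because the steps are decimals, and decimal arithmetic isn't always exact.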

That’s pretty cool. Now we’re going over data frames and how to query them. R has a bunch of built-in data sets that can be loaded easily:

> data(sunspot.year)
> sunspot_stuff = data.frame(year=1700:1988, count=sunspot.year)
> sunspot_stuff[sunspot_stuff[,1]==1950,]
> sunspot_stuff[sunspot_stuff$count>=50,]
> summary(sunspot_stuff)
      year          count       
 Min.   :1700   Min.   :  0.00  
 1st Qu.:1772   1st Qu.: 15.60  
 Median :1844   Median : 39.00  
 Mean   :1844   Mean   : 48.61  
 3rd Qu.:1916   3rd Qu.: 68.90  
 Max.   :1988   Max.   :190.20  
> sunspot_stuff$count == 190.20
Time Series:
Start = 1700 
End = 1988 
Frequency = 1 
  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [97] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[109] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[121] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[133] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[145] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[157] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[169] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[181] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[193] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[205] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[217] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[229] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[241] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[253] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
[265] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[277] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[289] FALSE
> sunspot_stuff[sunspot_stuff$count == 190.20,]
    year count
258 1957 190.2
> sum(sunspot_stuff$count)
[1] 14049.3
> 
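
All of the queries above work the same way: a comparison gives one TRUE/FALSE per row, and that logical vector picks the rows. Here's a rough sketch of that, next to the loop it replaces. The names sunspots, is_busy, busy_years, peak and busy_loop are my own, and as.numeric() is only there to drop the time-series wrapper.

```r
# Rebuild the data frame from the notes; sunspot.year ships with base R.
sunspots <- data.frame(year = 1700:1988, count = as.numeric(sunspot.year))

# A comparison yields one TRUE/FALSE per row...
is_busy <- sunspots$count >= 50

# ...and that logical vector selects the matching rows.
busy_years <- sunspots[is_busy, ]

# which.max() finds the peak row without comparing decimals with ==
peak <- sunspots[which.max(sunspots$count), ]

# The explicit loop that the one-line query replaces
busy_loop <- c()
for (i in seq_len(nrow(sunspots))) {
  if (sunspots$count[i] >= 50) {
    busy_loop <- c(busy_loop, sunspots$year[i])
  }
}

identical(busy_loop, busy_years$year)
```

Comparing with == 190.20 worked here, but exact equality on decimals can miss, so which.max() is the safer way to find the maximum.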

I really enjoy the querying system in R. Say what you will, but not having to write loops to find stuff is really nice. The rest of class is much of the same: load data, analyze data. The homework is interesting: we have to unzip a file in R and then write a couple of functions to deal with the data. It’s probably going to take a bunch of queries and functions to do it properly.
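
For the unzip part, a first guess at the shape of it, assuming the archive holds CSV files (I don't know that yet). The helper load_zipped_csv and the file name homework.zip are made up. unzip() and read.csv() are the standard functions from utils.

```r
# Hypothetical helper: extract a zip archive and read every CSV inside it.
load_zipped_csv <- function(zip_path, exdir = tempdir()) {
  # unzip() returns the paths of the files it extracted
  extracted <- unzip(zip_path, exdir = exdir)

  # Keep only the CSV files (assumption: the data is stored as CSV)
  csv_files <- extracted[grepl("\\.csv$", extracted, ignore.case = TRUE)]

  # One data frame per CSV, named after its file
  frames <- lapply(csv_files, read.csv)
  names(frames) <- basename(csv_files)
  frames
}

# Usage, with a made-up file name:
# tables <- load_zipped_csv("homework.zip")
# summary(tables[[1]])
```

Returning a named list keeps every table reachable by file name, so the query tricks from class can be applied to each one.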