DS 4100 Day 5

DS 4100 Data Collection, Integration, and Analysis

Like my OOD posts, I wasn't here in class for day 4, so I'm skipping right to day 5. Today/ this week we're doing data importing from multiple sources as well as scraping. We started by importing CSVs and txt files. Now we're going to import from other normal files, JSON, and XML. Eventually we're going to be getting data from databases.

Packages

Packages are collections of R functions, data, and compield code in a well-defined format. It's important where packages are stored.

Functions to use with packages:

.libPaths()
library() to list all installed packages
search() to list currently loaded packages
require() to load a package for use

Use install.packages() to install a package

We can load most file formats:

.csv
.txt
.xls
.xlsx
.RData

Package called "foreign" that allows R to load data from stata, SPSS, SAS, MATLAB, etc. Data is located in folders, so be sure a working directory is set.

Data

use file.exists to see if a file exists, use dir.create to create a directory.

download.file("FILEURL", destfile="DESTINATION_PATH", method="curl")

When a file is loaded its content are stored in the computer's RAM, R is default 32 bit, so you need to specify 64bit to hold more data. When "big data" needs to be analyzed use a database! Files can also take a long time to load.

CSV doesn't have a standard file format so you have to be careful when your read from a CSV. If there are double quotes surrounding a string in a CSV R will take them as the literal value. You can actually specify the seperator in read.csv which is kind of ironic cause it's COMMA SEPERATED VALUES. There's also read.table which read.csv is basically a wrapper around.

library(xlsx)
people <- read.xlsx("some.xlsx", sheetIndex=1)

It's best to use CSV, because XLS and XLSX are proprietary. Saving R objects in .RData files is fast and convenient but is not compatible AT ALL with other programs.

XML

XML stands for Extensible Markup Language. It was designed to describe data in a human readable format that is simple to parse. It's purpose is as a software and hardware independent encoding format for carrying information. The data is described within XML in the form of a tree. Excel, Word, Powerpoint all use XML. For each open tag in XML, a tag most have a closing tag. It's often up to the programmer how to encode data in XML.

To do CRUD operations on XML you use the aptly named "XML" package in R.

> xmlobj <- xmlParse("pubmed_sample.xml")
> xmldf <- xmlToDataFrame("pubmed_sample.xml")
> url <- "http://www.statistics.life.ku.dk/primer/mydata.xml"
> data <- xmlToDataFrame(url)
> head(data)
  Girth Height Volume
1   8.3     70   10.3
2   8.6     65   10.3
3   8.8     63   10.2
4  10.5     72   16.4
5  10.7     81   18.8
6  10.8     83   19.7

Parsing through XML is PRETTY UGLY, it's a whole mess of square brackets thrown together inside of functions which are poorly named, they're so incredibely ugly I'm not even going to put it here. My teacher says it's fast, but because it's all positional, it's not near as safe!