DS 4100 Data Collection, Integration, and Analysis

Course Website

Just like my CS 3500 posts, this series is for my data science class notes. This class is what most of my R tutorial posts have been for (I’m going to continue those until I finish the book). Here are the units:

So it seems like we start with R and eventually get to Python. The first part is all about the concepts. The teacher calls the collection and integration stuff “plumbing”: getting data from SO many different sources (files, JSON, XML, scraping, APIs). Then we go to data storage in SQL and NoSQL. Then we leave “plumbing” and do data analytics: descriptive analytics and predictive analytics :) plus data mining, machine learning, etc. (REALLY COOL STUFF). Then we do visualization and communicating results, leading into building analytics reports. Then a little Python programming and checking the quality of data. It also sounds like this is a job-prep kind of class, where at the end I’ll be job-ready for DS.

Module 1

This module is about the role of data scientists, the value of data, and the course structure.

Data drives decision making in organizations. The product team needs information to satisfy customers. We’re not going to be doing machine learning; this is going to be mainly human-driven learning (we’ll do a little bit of AI). Many people think that a computer-generated answer is more true than a human’s, but just because you built a predictive model doesn’t mean that it’s true. Algorithms may contain bias.

Loan decisions in the 1950s were often biased. In the ’70s a company began collecting data about bank customers. This company then created the FICO credit score, which is a number that predicts the likelihood that someone will pay back a loan. Colleges now use a similar system for accepting students. When you start building models like this, do the biases go away? Initially it looks like they do, but oftentimes they actually don’t. There is an ethical component to the work we do whether you like it or not: biases can be inadvertently built into a model. The teacher recommends reading Weapons of Math Destruction by Cathy O’Neil. It’s about a woman who realized how often her work was being misused and how often there was inherent bias in her models.

The role of the Data Scientist

Data scientists turn data into actionable information. A lot of data science work is cleaning up data. 80-90% of data looks like crap. How do you know the data types if they aren’t given? How can you make inferences so that you end up with better information? E.g., how do you determine gender based on a first name? A possible project later in the semester is to participate in a data challenge.
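To make the "how do you know data types if they aren't there?" question concrete, here is a minimal sketch of guessing a column's type from raw strings. This is my own illustration, not something from the class: `infer_type` is a hypothetical helper, and the choice of candidate types (integer, float, ISO date, text) is an assumption.

```python
from datetime import datetime


def infer_type(values):
    """Guess a column type from raw string values (hypothetical helper)."""
    # Ignore missing entries so a few blanks don't force everything to "text".
    present = [v.strip() for v in values if v is not None and v.strip() != ""]
    if not present:
        return "unknown"

    def all_parse(parser):
        # True only if every present value parses without a ValueError.
        for v in present:
            try:
                parser(v)
            except ValueError:
                return False
        return True

    # Try the strictest type first, then fall back to looser ones.
    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "float"
    # Assumption: dates are written as YYYY-MM-DD.
    if all_parse(lambda v: datetime.strptime(v, "%Y-%m-%d")):
        return "date"
    return "text"
```

For example, `infer_type(["1", "2", ""])` gives `"integer"` and `infer_type(["Alice", "42"])` gives `"text"`. Real data is messier than this (thousands separators, mixed date formats), so treat it as a starting point only.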

Data Repositories

Data is stored in TONS of ways:

A lot of time is spent grabbing data from tons of sources and integrating it with other data.
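Here is a rough sketch of what "integrating" can look like: the same entities described in a JSON source and a CSV source, combined by a shared key. The sample data, the `integrate` function, and the `id` key are all made up for illustration; the class may do this differently (and probably in R first).

```python
import csv
import io
import json

# Made-up sample data standing in for two different sources.
JSON_TEXT = '[{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}]'
CSV_TEXT = "id,city\n1,London\n3,Boston\n"


def integrate(json_text, csv_text, key="id"):
    """Merge records from a JSON array and a CSV table on a shared key."""
    merged = {}
    sources = [json.loads(json_text), csv.DictReader(io.StringIO(csv_text))]
    for source in sources:
        for record in source:
            # JSON gives the key as an int, CSV as a string: normalize to str.
            k = str(record[key])
            entry = merged.setdefault(k, {})
            entry.update(record)
            entry[key] = k
    return merged
```

Records that only show up in one source are kept with whatever fields they have, which is one possible design choice (similar in spirit to a full outer join).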

This course goes like this:

Data collection through scripting >>>

Quality Assessment & Cleaning >>>

Storage in analysis-appropriate Data Stores >>>

Clean data ready to retrieve for analytics & visualization.
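The four stages above can be sketched end to end in a few lines of Python. Everything here is an assumption for illustration: the sample CSV text, the `people` table, the cleaning rules (trim names, require a numeric age, drop duplicates), and the use of an in-memory SQLite database as the "analysis-appropriate" store.

```python
import csv
import io
import sqlite3

# Made-up raw input with a missing age, a bad age, and a duplicate row.
RAW = "name,age\nAda,36\n  grace ,\nLinus,abc\nAda,36\n"


def collect(text):
    """Stage 1: collect rows by script (here, parsing CSV text)."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Stage 2: assess quality and clean; returns (records, rejected count)."""
    seen, cleaned, rejected = set(), [], 0
    for row in rows:
        name = (row.get("name") or "").strip().title()
        age_text = (row.get("age") or "").strip()
        record = (name, int(age_text)) if age_text.isdecimal() else None
        # Reject rows with no name, a non-numeric age, or an exact duplicate.
        if not name or record is None or record in seen:
            rejected += 1
            continue
        seen.add(record)
        cleaned.append(record)
    return cleaned, rejected


def store(records, conn):
    """Stage 3: store clean records in a relational data store."""
    conn.execute("CREATE TABLE IF NOT EXISTS people (name TEXT, age INTEGER)")
    conn.executemany("INSERT INTO people VALUES (?, ?)", records)
    conn.commit()


def retrieve(conn):
    """Stage 4: retrieve clean data for analytics or visualization."""
    return conn.execute("SELECT name, age FROM people ORDER BY name").fetchall()


def run_pipeline(text):
    conn = sqlite3.connect(":memory:")
    cleaned, rejected = clean(collect(text))
    store(cleaned, conn)
    rows = retrieve(conn)
    conn.close()
    return rows, rejected
```

Calling `run_pipeline(RAW)` keeps only the one valid, unique row and reports how many rows were thrown out, which is a tiny version of the quality assessment step.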

Module 2

Big data:

Important facts:

Sources of data:

Definition of “Big Data”: Big data is the integration of large amounts of multiple types of structured and unstructured data into a single data set that can be analyzed to gain insight and new understanding of an industry, business, the environment, medicine, disease control, science, and human interactions and expectations.

Big Data Characteristics:

Information Quality:

The 6 V’s of Big Data

Self-reported data has a lot of veracity concerns.