DS 4100 Day 1

Author: Jacob Aronoff

DS 4100 Data Collection, Integration, and Analysis

Course Website

Just like my CS 3500 posts, this series is for my data science class notes. This class is what most of my R tutorial posts have been for (I'm going to continue those until I finish the book). Here are the units:

  • Unit 1 - Essential Concepts of Data Science
  • Unit 2 - Programming in R for Data Science
  • Unit 3 - Data Collection & Integration
  • Unit 4 - Data Storage
  • Unit 5 - Data Analytics
  • Unit 6 - Python Programming for Data Science
  • Unit 7 - Data Quality & Governance

So it seems like we start with R and eventually get to Python. The first part is all about the concepts. The teacher calls the collection and integration material "plumbing": getting data from SO many different sources (files, JSON, XML, scraping, APIs). Then we move on to data storage in SQL and NoSQL. Then we leave the "plumbing" and do data analytics: descriptive and predictive analytics :), data mining, machine learning, etc. (REALLY COOL STUFF). Then we do visualization and communicating results, leading into building analytics reports. Finally, a little Python programming and checking the quality of data. It also sounds like this is a job-prep kind of class where at the end I'll be job-ready for DS.

Module 1

This module is about the role of data scientists, the value of data, and the course structure.

Data drives decision making in organizations. The product team needs information to satisfy customers. We're not going to be doing machine learning; this is going to be mainly human-driven learning (we'll do a little bit of AI). Many people think a computer-generated answer is more true than a human one, but just because you built a predictive model doesn't mean it's true. Algorithms may contain bias. Lending decisions in the 1950s often had bias; in the 70s a company began collecting data about bank customers, and that company went on to create the FICO credit score, a number that predicts the likelihood that someone will pay back a loan. Colleges now use a similar system for accepting students. When you start building models like this, do the biases go away? Initially it looks like they do, but often they actually don't. There is an ethical component to the work we do, whether you like it or not; biases can be inadvertently built into a model. The teacher recommends reading Weapons of Math Destruction, written by a data scientist who realized how often her work was being misused and how much inherent bias there was in her models.

The role of the Data Scientist

Data scientists turn data into actionable information. A lot of data science work is cleaning up data; 80-90% of data looks like crap. How do you know the data types if they aren't given? How can you make inferences so that you end up with better information, e.g. how do you determine gender based on a first name? A possible project later in the semester is to participate in a data challenge.
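To make the cleanup idea concrete, here's a minimal R sketch of my own (not from the lecture): the toy data and the specific cleanup steps are just assumptions about what this kind of work looks like.

```r
# Made-up example: a column arrives as text with no declared type and
# inconsistent capitalization (base R only).
raw <- data.frame(
  name = c("alice", "BOB", "Charlie"),
  age  = c("23", "thirty", "41"),       # ages stored as strings, one unusable
  stringsAsFactors = FALSE
)

clean <- raw
clean$name <- tools::toTitleCase(tolower(clean$name))   # normalize capitalization
clean$age  <- suppressWarnings(as.numeric(clean$age))   # coerce to numeric; "thirty" becomes NA

clean
#      name age
# 1   Alice  23
# 2     Bob  NA
# 3 Charlie  41
```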

Data Repositories

Data is stored in TONS of ways:

  • CSV
  • XML
  • JSON
  • MySQL
  • Oracle
  • SQL Server
  • JavaDB
  • CouchDB
  • Mongo
  • Hadoop
  • Redis
  • Cassandra
  • HTML
  • Some bad human formats

A lot of the time is spent grabbing data from tons of sources and integrating it with data from others.
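As a taste of what that integration might look like in R, here's a rough sketch of my own (not from class). The file names are hypothetical, and it assumes the jsonlite and xml2 packages plus matching id/name fields in each source.

```r
library(jsonlite)   # for fromJSON()
library(xml2)       # for read_xml() and friends

# Hypothetical files, each holding the same kind of customer record
from_csv  <- read.csv("customers.csv", stringsAsFactors = FALSE)[, c("id", "name")]

from_json <- fromJSON("customers.json")[, c("id", "name")]   # array of objects -> data frame

doc      <- read_xml("customers.xml")
from_xml <- data.frame(
  id   = xml_text(xml_find_all(doc, "//customer/id")),
  name = xml_text(xml_find_all(doc, "//customer/name")),
  stringsAsFactors = FALSE
)

# Integrate the three sources into one table for downstream cleaning and analysis
customers <- rbind(from_csv, from_json, from_xml)
```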

This course goes like this:

Data collection through scripting >>>

Quality Assessment & Cleaning >>>

Storage in analysis-appropriate Data Stores >>>

Clean data ready to retrieve for analytics & visualization.

Module 2

Big data:

  • Big data is a new term that describes large, complex data sets that need to be stored and processed in different ways
  • Larger data sets allow for more detailed analysis
  • The definition is time-dependent: what counts as big data in the future will be very different from what counts as big data now

Important facts:

  • The volume of data is growing; over 2.5 quintillion bytes of data are generated each day.
  • 90% of the world's data has been generated over the past two years
  • Data from multiple sources is being integrated into one massive source

Sources of data:

  • There are nearly five billion web pages
    • Collected data includes network traffic, site and page visits, page navigation, page searches
  • User-generated content
    • Facebook
    • Twitter
    • Instagram
    • Blogs
    • YouTube
    • Forums
    • Wikis
    • etc.
  • RFID (radio frequency IDs)
    • Tags for tracking merchandise and shipments, sports performance, automated toll collection
  • GPS tracking data generated by mobile devices
    • Tracking of movement of equipment, vehicles, and people
  • Weather conditions
  • Tidal movements
  • Transactional data
  • Census
  • Polls
  • Healthcare data
  • Education, law and order, economic activity, agriculture, food production
  • Radio telescopes, particle physics

Definition of "Big Data": Big data is the integration of large amounts of multiple types of structured and unstructured data into a single data set that can be analyzed to gain insight and new understanding of an industry, business, the environment, medicine, disease control, science, and human interactions and expectations.

Big Data Characteristics:

  • Large, distributed aggregations of loosely structured data
  • In excess of multiple petabytes and exabytes
  • Billions of records about people or transactions

Information Quality:

  • Information/data quality is a measure of how fit information is for a particular use
  • Poor information quality can cost businesses 10% of revenue
  • Bad information can account for a $600 billion loss in the USA

The 6 V's of Big Data

  • Volume
    • Large quantity
    • 2.7 zettabytes of data in the universe, expected to double every two years
  • Variety
    • A lot of data sources
    • Not just one type
  • Velocity
    • How quickly we can process it
    • How quickly it emerges
  • Veracity
    • Truthfulness of data
    • Reliability of data
  • Validity
    • Correct syntax & expected type
    • Appropriate for the analysis you're doing
  • Volatility
    • How long you can store data before it becomes obsolete

Self-reported data has a lot of veracity concerns.