DS 4100 Day 1

Author: Jacob Aronoff

DS 4100 Data Collection, Integration, and Analysis

Course Website

Just like my CS 3500 posts, this series is for my data science class notes. This class is what most of my R tutorial posts have been for (I'm going to continue those until I finish the book). Here are the units:

  • Unit 1 - Essential Concepts of Data Science
  • Unit 2 - Programming in R for Data Science
  • Unit 3 - Data Collection & Integration
  • Unit 4 - Data Storage
  • Unit 5 - Data Analytics
  • Unit 6 - Python Programming for Data Science
  • Unit 7 - Data Quality & Governance

So it seems like we start with R and eventually get to Python. The first part is all about the concepts. The teacher calls the collection and integration material "plumbing": getting data from SO many different sources (files, JSON, XML, scraping, APIs). Then we move on to data storage in SQL and NoSQL. Then we leave the "plumbing" and do data analytics: descriptive and predictive analytics :), data mining, machine learning, etc. (REALLY COOL STUFF). Then we do visualization and communicating results, leading into building analytics reports. Finally, a little Python programming and checking the quality of data. It also sounds like this is a job-prep kind of class where at the end I'll be job-ready for DS.

Module 1

This module is about the role of data scientists, the value of data, and the course structure.

Data drives decision making in organizations. The product team needs information to satisfy customers. We're not going to be doing machine learning; this is going to be mainly human-driven learning (we'll do a little bit of AI). Many people think a computer-generated answer is more true than a human one, but just because you built a predictive model doesn't mean it's true. Algorithms may contain bias. Lending decisions in the 1950s often had bias; in the 70s a company began collecting data about bank customers, and that company went on to create the FICO credit score, a number that predicts the likelihood that someone will pay back a loan. Colleges now use a similar system for accepting students. When you start building models like this, do the biases go away? Initially it looks like they do, but often they actually don't. There is an ethical component to the work we do, whether you like it or not; biases can be inadvertently built into a model. The teacher recommends reading Weapons of Math Destruction, written by a data scientist who realized how often her work was being misused and how much inherent bias there was in her models.

The role of the Data Scientist

Data scientists turn data into actionable information. A lot of data science work is cleaning up data; 80-90% of data looks like crap. How do you know the data types if they aren't given? How can you make inferences so that you end up with better information, e.g. how do you determine gender based on a first name? A possible project later in the semester is to participate in a data challenge.
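To make the cleanup idea concrete, here's a minimal R sketch of my own (not from the lecture): the toy data and the specific cleanup steps are just assumptions about what this kind of work looks like.

```r
# Made-up example: a column arrives as text with no declared type and
# inconsistent capitalization (base R only).
raw <- data.frame(
  name = c("alice", "BOB", "Charlie"),
  age  = c("23", "thirty", "41"),       # ages stored as strings, one unusable
  stringsAsFactors = FALSE
)

clean <- raw
clean$name <- tools::toTitleCase(tolower(clean$name))   # normalize capitalization
clean$age  <- suppressWarnings(as.numeric(clean$age))   # coerce to numeric; "thirty" becomes NA

clean
#      name age
# 1   Alice  23
# 2     Bob  NA
# 3 Charlie  41
```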

Data Repositories

Data is stored in TONS of ways:

  • CSV
  • XML
  • JSON
  • MySQL
  • Oracle
  • SQL Server
  • JavaDB
  • CouchDB
  • Mongo
  • Hadoop
  • Redis
  • Cassandra
  • HTML
  • Some bad human formats

A lot of the time is spent grabbing data from tons of sources and integrating it with data from others.
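As a taste of what that integration might look like in R, here's a rough sketch of my own (not from class). The file names are hypothetical, and it assumes the jsonlite and xml2 packages plus matching id/name fields in each source.

```r
library(jsonlite)   # for fromJSON()
library(xml2)       # for read_xml() and friends

# Hypothetical files, each holding the same kind of customer record
from_csv  <- read.csv("customers.csv", stringsAsFactors = FALSE)[, c("id", "name")]

from_json <- fromJSON("customers.json")[, c("id", "name")]   # array of objects -> data frame

doc      <- read_xml("customers.xml")
from_xml <- data.frame(
  id   = xml_text(xml_find_all(doc, "//customer/id")),
  name = xml_text(xml_find_all(doc, "//customer/name")),
  stringsAsFactors = FALSE
)

# Integrate the three sources into one table for downstream cleaning and analysis
customers <- rbind(from_csv, from_json, from_xml)
```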

This course goes like this:

Data collection through scripting >>>

Quality Assessment & Cleaning >>>

Storage in analysis-appropriate Data Stores >>>

Clean data ready to retrieve for analytics & visualization.

Module 2

Big data:

  • Big data is a new term that describes large, complex data sets that need to be stored and processed in different ways
  • Larger data sets allow for more detailed analysis
  • The definition is time-dependent: what counts as big data in the future will be very different from what counts as big data now

Important facts:

  • The volume of data is growing; over 2.5 quintillion bytes of data are generated each day.
  • 90% of the world's data has been generated over the past two years
  • Data from multiple sources is being integrated into one massive source

Sources of data:

  • There are nearly five billion web pages
    • Collected data includes network traffic, site and page visits, page navigation, page searches
  • User-generated content
    • Facebook
    • Twitter
    • Instagram
    • Blogs
    • YouTube
    • Forums
    • Wikis
    • etc.
  • RFID (radio frequency IDs)
    • Tags for tracking merchandise and shipments, sports performance, automated toll collection
  • GPS tracking data generated by mobile devices
    • Tracking of movement of equipment, vehicles, and people
  • Weather conditions
  • Tidal movements
  • Transactional data
  • Census
  • Polls
  • Healthcare data
  • Education, law and order, economic activity, agriculture, food production
  • Radio telescopes, particle physics

Definition of "Big Data": Big data is the integration of large amounts of multiple types of structured and unstructured data into a single data set that can be analyzed to gain insight and new understanding of an industry, business, the environment, medicine, disease control, science, and human interactions and expectations.

Big Data Characteristics:

  • Large, distributed aggregations of loosely structured data
  • In excess of multiple petabytes and exabytes
  • Billions of records about people or transactions

Information Quality:

  • Information/data quality is a measure of how fit information is for a particular use
  • Poor information quality can cost businesses 10% of revenue
  • Bad information can account for a $600 billion loss in the USA

The 6 V's of Big Data

  • Volume
    • Large quantity
    • 2.7 zettabytes of data in the universe, expected to double every two years
  • Variety
    • A lot of data sources
    • Not just one type
  • Velocity
    • How quickly we can process it
    • How quickly it emerges
  • Veracity
    • Truthfulness of data
    • Reliability of data
  • Validity
    • Correct syntax & expected type
    • Appropriate for the analysis you're doing
  • Volatility
    • How long you can store data before it becomes obsolete

Self-reported data has a lot of veracity concerns.