Published on

DS 4100 Day 11

Authors
  • avatar
    Name
    Jacob Aronoff
    Twitter

DS 4100 Data Collection, Integration, and Analysis

Today we're talking about storing data in a relational database (MySQL). We're going to be focusing on SQL right now, in a couple of weeks we're going to be doing work in NoSQL; MongoDB, couch, etc. Eventually, I'll have to make a full pipeline of data, starting with collection, then storage, then retrieval, and finally a predictive model. Finally, there needs to be a report; possibly a visualization.

I think what I would want to do is a website where you give a subreddit, a post, and what you want to comment, and the website predicts how many upvotes it could get. What I'm thinking is:

  1. Constantly running scraper adding data to a database
  2. Backend either creates another scraper or retrieves data given a subreddit
  3. Create a generalized model for upvotes
  4. Apply the model given a comment
  5. Return the information

Back to relational database

Relational Data Storage

  • Tables in a database are stored as tables
  • DBMS are tasked with performance, backup, recovery, storage, access, etc.
  • Relational databases store data in a relational (tabular) way.
  • You get a table back from a database call.

If the majority of the computation happens on the client side, it's called a "Fat Client". The opposite is where a server does most of the work, called a "Fat Server". The benefit of a fat server is sharing logic across clients. It's bad to use just file access as a means of a database, because file access only allows one user at a time. Also, a DBMS keeps all the data, and querying in one place, keeping your client lean. The relational data model is the most commonly used model.

Database types:

OLTP (On-line transcation processing):

  • Support customer service
  • Interactive queries and real time updates
  • Concurrent access by many users

OLAP (On-line analytical processing):

  • facilitate decision support
  • Analytical queries, no real-time update
  • Precompute, and redundantly store data to speed up queries

Data warehouses:

  • an OLAP database
  • an integrated and consolidated collection of data from multiple sources
  • Primarily queried and is non-volatile
  • contains data records for a long time

Process:

Source database -> extract data -> transform data by running scripts -> store in data warehouse

The transformation is the ETL layer (extraction, transformation, loading)

Data marts are the result of the extraction of a subset of data from a data warehouse

Queries that you would run against a data warehouse are more complex, and require data crunching and models

Fundamental structure:

  • Tables are the fundamental structural entity in a relational database

  • Each table represents an entity from the business domain and consists of:

    • columns
    • rows
    • key column

Foreign keys are used to establish links between two different tables.

Table rules:

  • All rows are formatted identically
  • Rows are unsorted unless SPECIFICALLY TOLD TO BE SORTED
  • Columns are unsorted
  • Each row in a table is a unique entity
  • There are no duplicate column names
  • Within the table, values represnet data
  • Each column value is atomic (in a single column, you store a single value)

Now onto basic SQL queries, which I'm not going to write down.

We're going to be using SQLite because it's easy, small, and available. Other than that, my teacher specified, DONT USE IT :P