DS 4100 Day 11

DS 4100 Data Collection, Integration, and Analysis

Today we're talking about storing data in a relational database (MySQL). We're going to be focusing on SQL right now, in a couple of weeks we're going to be doing work in NoSQL; MongoDB, couch, etc. Eventually, I'll have to make a full pipeline of data, starting with collection, then storage, then retrieval, and finally a predictive model. Finally, there needs to be a report; possibly a visualization.

I think what I would want to do is a website where you give a subreddit, a post, and what you want to comment, and the website predicts how many upvotes it could get. What I'm thinking is:

Constantly running scraper adding data to a database
Backend either creates another scraper or retrieves data given a subreddit
Create a generalized model for upvotes
Apply the model given a comment
Return the information

Back to relational database

Relational Data Storage

Tables in a database are stored as tables
DBMS are tasked with performance, backup, recovery, storage, access, etc.
Relational databases store data in a relational (tabular) way.
You get a table back from a database call.

If the majority of the computation happens on the client side, it's called a "Fat Client". The opposite is where a server does most of the work, called a "Fat Server". The benefit of a fat server is sharing logic across clients. It's bad to use just file access as a means of a database, because file access only allows one user at a time. Also, a DBMS keeps all the data, and querying in one place, keeping your client lean. The relational data model is the most commonly used model.

Database types:

OLTP (On-line transcation processing):

Support customer service
Interactive queries and real time updates
Concurrent access by many users

OLAP (On-line analytical processing):

facilitate decision support
Analytical queries, no real-time update
Precompute, and redundantly store data to speed up queries

Data warehouses:

an OLAP database
an integrated and consolidated collection of data from multiple sources
Primarily queried and is non-volatile
contains data records for a long time

Process:

Source database -> extract data -> transform data by running scripts -> store in data warehouse

The transformation is the ETL layer (extraction, transformation, loading)

Data marts are the result of the extraction of a subset of data from a data warehouse

Queries that you would run against a data warehouse are more complex, and require data crunching and models

Fundamental structure:

Tables are the fundamental structural entity in a relational database
Each table represents an entity from the business domain and consists of:
- columns
- rows
- key column

Foreign keys are used to establish links between two different tables.

Table rules:

All rows are formatted identically
Rows are unsorted unless SPECIFICALLY TOLD TO BE SORTED
Columns are unsorted
Each row in a table is a unique entity
There are no duplicate column names
Within the table, values represnet data
Each column value is atomic (in a single column, you store a single value)

Now onto basic SQL queries, which I'm not going to write down.

We're going to be using SQLite because it's easy, small, and available. Other than that, my teacher specified, DONT USE IT :P