DS 4100 Data Collection, Integration, and Analysis

Today we’re talking about storing data in a relational database (MySQL). We’re going to be focusing on SQL right now, in a couple of weeks we’re going to be doing work in NoSQL; MongoDB, couch, etc. Eventually, I’ll have to make a full pipeline of data, starting with collection, then storage, then retrieval, and finally a predictive model. Finally, there needs to be a report; possibly a visualization.

I think what I would want to do is a website where you give a subreddit, a post, and what you want to comment, and the website predicts how many upvotes it could get. What I’m thinking is:

1) Constantly running scraper adding data to a database 2) Backend either creates another scraper or retrieves data given a subreddit 3) Create a generalized model for upvotes 4) Apply the model given a comment 5) Return the information

Back to relational database

Relational Data Storage

If the majority of the computation happens on the client side, it’s called a “Fat Client”. The opposite is where a server does most of the work, called a “Fat Server”. The benefit of a fat server is sharing logic across clients. It’s bad to use just file access as a means of a database, because file access only allows one user at a time. Also, a DBMS keeps all the data, and querying in one place, keeping your client lean. The relational data model is the most commonly used model.

Database types:

OLTP (On-line transcation processing):

OLAP (On-line analytical processing):

Data warehouses:

Process:

Source database -> extract data -> transform data by running scripts -> store in data warehouse

The transformation is the ETL layer (extraction, transformation, loading)

Data marts are the result of the extraction of a subset of data from a data warehouse

Queries that you would run against a data warehouse are more complex, and require data crunching and models

Fundamental structure:

Foreign keys are used to establish links between two different tables.

Table rules:

Now onto basic SQL queries, which I’m not going to write down.

We’re going to be using SQLite because it’s easy, small, and available. Other than that, my teacher specified, DONT USE IT :P