DS 4100 Weekly Review

This weekend I worked on the extra credit assignments and did some more research for my project. I’ll start this post with what I learned while researching my project and then walk through how my extra credit assignments went.

Research project

So this weekend, I was looking a lot into the publicly available data sets for Reddit. I found a HUGE one available on Google BigQuery, which I’ve never used before. Basically, it’s a place for people to store GINORMOUS data sets and run SQL queries on them. The Reddit data sets I found are around 1 TB in size, and queries take a couple of seconds to finish. The next thing I think I’m going to do is set up my backend/frontend project on Heroku or AWS (still can’t decide which one is better for me). Once I have that going, I’m going to begin constructing the frontend for the project and simultaneously connect to BigQuery so that I can run some basic queries and get basic information. After that I may or may not throw Facebook’s GraphQL into the mix, because I think that should make collaborative filtering a bit easier.

Extra Credit

The extra credit assignments were pretty cool. The first one involved some XML parsing and writing some functions to aggregate XML data. That one wasn’t too interesting. The second one, however, was WAY cooler. The task was to write a parser for a custom file format. The data you’re given is more than 3,000,000 lines of IMDb data. At first glance, the entries have lots of formatting differences; however, after looking at the data and playing with it for a couple of minutes, I figured out the difference between a movie and a TV show (we only want to get the movies into a new data frame). I realized I had two options once I figured out the formatting: 1) write some code and do a bunch of contains checks, or 2) write some regexes.

I decided to do option 2) because I expected the runtime to be a lot faster. The issue with regex is that I don’t really know it :P so I spent the last couple of hours tweaking and constructing a few different regexes to find the year a movie was released, the name of the movie, and whether the entry was a movie or a TV show. I’m really happy with the final product, and it’s running pretty quickly. I’ve been running it for a while and it’s still on A; that said, it’s gone through millions of rows already (it had to sift through all the TV shows first).
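
To give a rough idea of what the regex approach looks like, here is a minimal Python sketch. It is not the code from the assignment. The line layout it assumes (TV show titles wrapped in double quotes, movie titles unquoted, release year in parentheses right after the title) is my guess at the format, and the names MOVIE_RE, TV_RE, parse_movie, and parse_movies are made up for illustration.

```python
import re

# ASSUMPTION: a TV show line starts with a double-quoted title, e.g.
#   "Some Show" (2008)<tab>2008-2013
TV_RE = re.compile(r'^"(?P<title>.+)"\s+\((?P<year>\d{4})')

# ASSUMPTION: a movie line starts with an unquoted title, e.g.
#   Some Movie (1979)<tab>1979
MOVIE_RE = re.compile(r'^(?P<title>[^"].*?)\s+\((?P<year>\d{4})\)')


def is_tv_show(line):
    """Return True if the line looks like a TV show entry."""
    return TV_RE.match(line) is not None


def parse_movie(line):
    """Return {'title': ..., 'year': ...} for a movie line, else None."""
    match = MOVIE_RE.match(line)
    if match is None:
        return None
    return {"title": match.group("title"), "year": int(match.group("year"))}


def parse_movies(lines):
    """Keep only the lines that parse as movies."""
    parsed = (parse_movie(line) for line in lines)
    return [row for row in parsed if row is not None]


# Hypothetical sample lines in the assumed layout.
sample = [
    '"Some Show" (2008)\t\t2008-2013',
    "Some Movie (1979)\t\t1979",
    "Another Movie (Part 2) (1999)\t\t1999",
]
movies = parse_movies(sample)
```

The resulting list of rows (one dict per movie) can then be loaded into a data frame. Compiling the patterns once up front, rather than on every line, matters when there are millions of lines to go through.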