DS 4100 Weekly Review

This week I had to write a scraper for class. This was a fun assignment that I got to do in Python. I had a couple of ideas for this assignment, but I was able to decide upon scraping reddit. Reddit has an api, however, this assignment specified that we scrape HTML. I’m really interested in eventually doing sentiment analysis; how does what someone says affect the amount of points they get?

This was a fun project because I got to build a lot of functionality from nothing. I started with the basics: get the page in Python. Once I had the page in Python, I went through the basics. First, I found the different links to the comments pages. I did this by using chrome to find a common trait shared between the comments links. Often times it was a class shared by the text. Once I had the class I would make a method that got all the instances of this property. I then chained each of these properties with methods like so: get the pages > get the comment blocks > get the [user, comment, points]. I would then run this through another method which would go through x amount of pages. This amount of pages would vary based on the amount of data that I wanted.

To begin these tests, I started with one page. After fixing all the issues from that, I moved on to ten pages. After ten, I expanded it to 100. Once that was done, I had a relatively large data set ~17832 rows. This sort of scraping proved effective. I was worried that I was missing a lot of data, so I went to the actual link to do a sanity check. I found that my scraper was getting everything correct. I really enjoyed this project, and I’m excited to expand on it in the next assignment.

In the future, I think I would want to turn this into a real website. I would want to program it in Django, so that I could reuse my code for right now. Actually, if I were to turn this into a website I would use the reddit API, which I assume would be much faster than the scraping I’m currently doing. I’d also want to pair up with someone in the data visualizations class. I’d be interested to pair up the data I get with some really beautiful live visualizations that update in real time. I’d also be interested in opening up the website so that people could make their own visualizations with data set. I feel like that may be pretty difficult, however, because I’d either have to make a way for people to code with given data or make a drag and drop interface with basic mathematical operations. Either way, i think it could turn into an interesting project.