DS 4100 Project Proposal

For my project, I want to investigate Reddit. There’s a lot of openly available data about Reddit that I’ll be able to scrape. Recently, FiveThirtyEight published a brief study in which the author did what he called “Reddit Algebra”, which was actually just a fancier way of explaining set operations.

FiveThirtyEight Article

For the past couple of weeks in my weekly reviews, I’ve talked a lot about analyzing Reddit data to generate predictions or statistics. What I want to do with this project is take the basics of what the aforementioned author did and go one step further. Here would be my process:

  • Get a large Reddit data set (available on the web)
  • Store it in MySQL (data is already in SQL format)
  • Write a GraphQL schema to connect the code to the DB
  • Construct GraphQL (GQL) queries from the given parameters:

    • User inputs a Reddit username
    • A GQL query gets the subreddits that user has commented on and reduces them to a comment count for each subreddit
    • Get the users who have commented the most in the subreddits where the given user has also commented
    • Aggregate the subreddits those other users have also commented on, returning a list of suggested subreddits
    • I can also add the ability to find subreddits like a given subreddit, listing other subreddits ordered by their percent similarity
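
The username-to-suggestions steps above can be sketched in a few lines. This is only a rough illustration under stated assumptions, not the planned implementation: it uses Python and an in-memory list in place of the MySQL and GraphQL layers, and the sample data, the function names (comment_counts, suggest_subreddits), and the scoring rule (rank other users by how many comments they made in the given user's subreddits) are all hypothetical choices made for the example.

```python
from collections import Counter, defaultdict

# Hypothetical sample data: one (username, subreddit) pair per comment.
# In the real project these rows would come from the database.
COMMENTS = [
    ("alice", "python"), ("alice", "python"), ("alice", "datascience"),
    ("bob", "python"), ("bob", "python"), ("bob", "python"),
    ("bob", "machinelearning"),
    ("carol", "datascience"), ("carol", "statistics"),
    ("carol", "machinelearning"),
    ("dave", "cooking"), ("dave", "baking"),
]


def comment_counts(comments):
    """Map each user to a Counter of {subreddit: number of comments}."""
    counts = defaultdict(Counter)
    for user, subreddit in comments:
        counts[user][subreddit] += 1
    return counts


def suggest_subreddits(username, comments, top_users=10, top_subreddits=5):
    """Suggest subreddits the user has not commented in yet."""
    counts = comment_counts(comments)
    mine = counts.get(username, Counter())

    # Score every other user by the number of comments they made
    # in subreddits where the given user has also commented.
    overlap = Counter()
    for other, theirs in counts.items():
        if other == username:
            continue
        score = sum(n for sub, n in theirs.items() if sub in mine)
        if score > 0:
            overlap[other] = score
    neighbours = [user for user, _ in overlap.most_common(top_users)]

    # Aggregate the neighbours' other subreddits, weighted by comment count.
    suggestions = Counter()
    for other in neighbours:
        for sub, n in counts[other].items():
            if sub not in mine:
                suggestions[sub] += n
    return [sub for sub, _ in suggestions.most_common(top_subreddits)]


print(suggest_subreddits("alice", COMMENTS))
```

With this toy data, the call for "alice" ranks "machinelearning" ahead of "statistics", because both of her overlapping commenters also post in "machinelearning". The top_users and top_subreddits cutoffs are arbitrary defaults and would need tuning on the real data set.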


  • MySQL backend
  • GraphQL middleware
  • Node.js frontend

The end goal of my project will be the following: given a Reddit username, suggest other subreddits that user may like, or given a subreddit, give a list of similar subreddits. I would accomplish this using collaborative filtering.
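
For the subreddit-to-subreddit case, one simple way to get a percent similarity is to compare the sets of users who commented in each subreddit. The sketch below assumes Jaccard overlap (shared commenters divided by total distinct commenters) as the similarity measure; that is my assumption for illustration, and the final project might use a different measure. The sample data and the names commenters_by_subreddit and similar_subreddits are hypothetical, and Python stands in for the real stack.

```python
from collections import defaultdict

# Hypothetical sample data: (username, subreddit) pairs, one per comment.
PAIRS = [
    ("alice", "python"), ("alice", "datascience"),
    ("bob", "python"), ("bob", "machinelearning"),
    ("carol", "datascience"), ("carol", "statistics"),
    ("carol", "machinelearning"),
    ("dave", "cooking"), ("dave", "baking"),
]


def commenters_by_subreddit(pairs):
    """Map each subreddit to the set of users who commented in it."""
    users = defaultdict(set)
    for user, subreddit in pairs:
        users[subreddit].add(user)
    return users


def similar_subreddits(target, pairs):
    """Return (subreddit, percent similarity) pairs, most similar first."""
    users = commenters_by_subreddit(pairs)
    base = users.get(target, set())
    results = []
    for subreddit, members in users.items():
        if subreddit == target:
            continue
        union = base | members
        if not union:
            continue
        # Jaccard overlap expressed as a percentage.
        percent = 100.0 * len(base & members) / len(union)
        if percent > 0:
            results.append((subreddit, round(percent, 1)))
    # Highest similarity first; ties broken alphabetically.
    return sorted(results, key=lambda item: (-item[1], item[0]))


print(similar_subreddits("python", PAIRS))
```

Set overlap ignores how often each user comments, so a single comment counts the same as a hundred. Weighting by comment counts would be a natural next step if the plain overlap turns out to be too coarse.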