DS 4100 Data Collection, Integration, and Analysis

The reason some of the previous days are missing is because they were either work days or we didn’t have class. Today we’re talking about scraping.

Scraping

Scraping often starts with making GET requests. Take the HTML and search through the HTML for the data.

Process:

Get a website -> Scrape it -> Get data

Be careful of websites that do not allow web scraping. A LOT of data is copyrighted. Regex is very useful when writing a scraper.

Now we’re going over how to scrape in R, which is a weird process. I wrote a scraper in python, and I’m just going to translate it into R, it’s going to be weird, but it should work. I think I’m going to use the rvest package, it seeems easy to use and it’s supposed to be inspired by BeautifulSoup which is the package I was using in Python.