How do I scrape data from the Internet to conduct analysis?

Sept. 21, 2019, 10:58 a.m.

Scraping Data from the Web

We are going to use Python to collect data from the web, which comes in handy whenever the data you need is published on a page rather than offered as a download. This is one of those things that I use all the time, so I am happy to write a little about it. Go ahead and try some of these methods on your own website for practice, or use them to collect data from another website, as long as you comply with that site's robots.txt rules. Check out our post on robots.txt to learn more about how to read and interpret this common file.
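Before pointing a crawler at someone else's site, it can help to check its robots.txt programmatically. Here is a minimal sketch using the standard library's `urllib.robotparser`. The `is_allowed` helper, the sample robots.txt content, and the example.com URLs are all made up for illustration so the snippet runs offline. Against a live site you would instead call `set_url()` with the site's robots.txt address, then `read()`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs without a network call.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
"""


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/blog/post"))    # True
print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/admin/login"))  # False
```

If the check comes back False for the page you want, skip it.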

For now, let's dive right in. Spin up a remote Ubuntu Linux instance with Python 3 and the beautifulsoup4, lxml, and pandas packages installed, and we can get started.

###########################
## html_crawler_temps.py ##
###########################

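# IMPORTING DEPENDENCIES
# (urllib.request is imported explicitly; a bare "import urllib" does not load the submodule)
import urllib.request
from bs4 import BeautifulSoup
import pandas as pd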

# DEFINING THE TARGET SITE
target = 'https://datasciencelife.com/blog/data-crawling-example'

# USING BEAUTIFUL SOUP TO COLLECT HTML DATA
t = urllib.request.urlopen(target)
soup = BeautifulSoup(t.read(), 'lxml')

# FILTERING ONLY THE <TR> TAGS
titles = soup.find_all('tr')

# BIT OF DATA ENGINEERING TO DEFINE LIST OF LISTS
list_temps = []
for tr in titles:
    td = tr.find_all('td')
    # skip rows that contain no <td> cells
    if len(td) == 0:
        continue
    list_row = [d.text for d in td]
    list_temps.append(list_row)

# CONVERTING LIST TO PANDAS DATA FRAME
df_temps = pd.DataFrame(list_temps)
df_temps.columns = ['Month','High (F)','Low (F)']
print(df_temps)
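The request step in the script above has no timeout or error handling, so a slow or unreachable site will either hang or crash the run. One way to harden it is sketched below. This is just one option, not the only way to do it. The `fetch_html` name, the 10-second timeout, and the `my-scraper/0.1` User-Agent string are my own placeholders.

```python
import urllib.request


def fetch_html(url, timeout=10, user_agent="my-scraper/0.1"):
    """Return the raw response body as bytes, or None if the request fails."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read()
    except OSError as err:
        # URLError, HTTPError, and socket timeouts all derive from OSError.
        print(f"Request to {url} failed: {err}")
        return None
```

In the script, the bytes returned by `fetch_html(target)` can be passed to `BeautifulSoup` in place of `t.read()`, after checking that the result is not `None`.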

Now that we have the pandas data frame, we can start incorporating it into our analysis!
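One thing to watch for: every cell scraped from HTML arrives as a string, so the temperature columns need converting before you can do any math on them. Here is a rough sketch of that first step. The rows are made-up stand-ins for the scraped table (the real values will differ), and `to_numeric_temps` and the `Range (F)` column are names I picked for illustration.

```python
import pandas as pd

# Hypothetical rows standing in for the scraped table; real values will differ.
df_temps = pd.DataFrame(
    [["January", "40", "25"], ["February", "45", "28"], ["March", "55", "35"]],
    columns=["Month", "High (F)", "Low (F)"],
)


def to_numeric_temps(df):
    """Return a copy of df with the temperature columns converted to numbers."""
    out = df.copy()
    for col in ["High (F)", "Low (F)"]:
        # Strip stray whitespace; unparseable cells become NaN instead of raising.
        out[col] = pd.to_numeric(out[col].str.strip(), errors="coerce")
    return out


df_numeric = to_numeric_temps(df_temps)
df_numeric["Range (F)"] = df_numeric["High (F)"] - df_numeric["Low (F)"]
print(df_numeric)
print(df_numeric[["High (F)", "Low (F)"]].mean())
```

From here the usual pandas tools (`describe()`, `groupby()`, plotting) work as they would on any other numeric data frame.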

Cheers,

