How do I scrape data from the Internet to conduct analysis?

Sept. 21, 2019, 10:58 a.m.

Scraping Data from the Web

We are going to use Python to collect data from the web, which is useful in many situations.  I use this technique all the time, so I am happy to write a little about it.  Try these methods on your own website for practice, or use them to collect data from another website, so long as you comply with the rules in the target site's robots.txt file.  Check out our post on robots.txt to learn more about how to read and interpret this common file.
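
As a rough illustration of the robots.txt check, here is a minimal sketch using Python's standard-library urllib.robotparser.  The helper name can_fetch, the SAMPLE_ROBOTS text, and the example.com URLs are all made up for this example; they are not taken from any real site.

```python
from urllib.robotparser import RobotFileParser


def can_fetch(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text allows user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed here
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# HYPOTHETICAL robots.txt CONTENT, FOR ILLUSTRATION ONLY
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /admin/
"""

print(can_fetch(SAMPLE_ROBOTS, "https://example.com/blog/post"))
print(can_fetch(SAMPLE_ROBOTS, "https://example.com/admin/login"))
```

Against a live site you would normally point the parser at the real file with set_url() and read() instead of passing in text by hand.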

For now, let's dive right in.  If you spin up a remote Ubuntu Linux instance with Python 3 and the beautifulsoup4, lxml, and pandas packages installed, then we can get started.

###########################
## html_crawler_temps.py ##
###########################

# IMPORTING DEPENDENCIES
import urllib.request  # URLOPEN LIVES IN THE REQUEST SUBMODULE
from bs4 import BeautifulSoup
import pandas as pd

# DEFINING THE TARGET SITE
target = 'https://datasciencelife.com/blog/data-crawling-example'

# DOWNLOADING THE PAGE AND PARSING THE HTML WITH BEAUTIFUL SOUP
with urllib.request.urlopen(target) as response:
    soup = BeautifulSoup(response.read(), 'lxml')

# FILTERING ONLY THE <TR> TAGS (ONE PER TABLE ROW)
rows = soup.find_all('tr')

# BIT OF DATA ENGINEERING TO DEFINE LIST OF LISTS
list_temps = []
for tr in rows:
    td = tr.find_all('td')
    # SKIP ROWS WITH NO <TD> CELLS (E.G. A HEADER ROW OF <TH> CELLS)
    if not td:
        continue
    list_row = [d.text.strip() for d in td]
    list_temps.append(list_row)

# CONVERTING LIST TO PANDAS DATA FRAME
df_temps = pd.DataFrame(list_temps)
df_temps.columns = ['Month','High (F)','Low (F)']
print(df_temps)
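
If you want to see what the row loop above does without hitting the network, here is a small offline sketch.  The SAMPLE_HTML string is a made-up stand-in for the downloaded page (I am assuming a simple table with one header row), and it uses the built-in 'html.parser' instead of 'lxml' so there is nothing extra to install.

```python
from bs4 import BeautifulSoup

# HYPOTHETICAL HTML STANDING IN FOR THE DOWNLOADED PAGE
SAMPLE_HTML = """
<table>
  <tr><th>Month</th><th>High (F)</th><th>Low (F)</th></tr>
  <tr><td>January</td><td>40</td><td>25</td></tr>
  <tr><td>July</td><td>85</td><td>65</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

sample_rows = []
for tr in soup.find_all("tr"):
    cells = tr.find_all("td")
    # THE HEADER ROW HAS ONLY <TH> CELLS, SO IT IS SKIPPED HERE
    if not cells:
        continue
    sample_rows.append([cell.get_text(strip=True) for cell in cells])

print(sample_rows)
```

The header row drops out because it contains no td cells, which is why the column names get assigned by hand afterwards.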

Now that we have the pandas data frame, we can start incorporating it into our analysis!
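
One thing to watch for: text pulled out of HTML arrives as strings, so the temperature columns usually need converting before you can do any math on them.  Here is a minimal sketch of that first step.  The rows are made-up placeholder values, not the real scraped data, and the 'Range (F)' column is just something I added for illustration.

```python
import pandas as pd

# HYPOTHETICAL ROWS STANDING IN FOR THE SCRAPED TABLE
sample = [["January", "40", "25"], ["July", "85", "65"]]
df = pd.DataFrame(sample, columns=["Month", "High (F)", "Low (F)"])

# CONVERT THE TEXT COLUMNS TO NUMBERS; UNPARSEABLE VALUES BECOME NaN
for col in ["High (F)", "Low (F)"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")

# EXAMPLE DERIVED COLUMN: SPREAD BETWEEN HIGH AND LOW
df["Range (F)"] = df["High (F)"] - df["Low (F)"]
print(df)
```

Using errors="coerce" keeps the script running when a cell holds something unexpected, at the cost of needing to check for NaN values afterwards.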

Cheers,

J

