Convert table sourced from HTML webpage into pandas dataframe - python

I'm trying to obtain a table from a webpage and convert it into a dataframe to be used in analysis. I've used the BeautifulSoup package to scrape the url and parse the table info, but I can't seem to export the info to a dataframe. My code is below:
from bs4 import BeautifulSoup as bs
from urllib.request import urlopen

source = urlopen("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M").read()
soup = bs(source, "html.parser")
table = soup.table
table_rows = table.find_all("tr")
for tr in table_rows:
    td = tr.find_all("td")
    row = [i.text for i in td]
    print(row)
By doing this I can see each row, but I'm not sure how to convert it to a dataframe. Any ideas?

Please try this:
from bs4 import BeautifulSoup as bs
from urllib.request import urlopen
import pandas as pd

source = urlopen("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M").read()
soup = bs(source, "html.parser")
table = soup.table
table_rows = table.find_all("tr")

postal_codes = []
for tr in table_rows:
    td = tr.find_all("td")
    row = [i.text[:-1] for i in td]  # [:-1] drops the trailing newline in each cell
    postal_codes.append(row)

postal_codes.pop(0)  # drop the empty row produced by the header (th cells, not td)
df = pd.DataFrame(postal_codes, columns=['PostalCode', 'Borough', 'Neighborhood'])
print(df)
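Note that i.text[:-1] merely chops the last character (the trailing newline these cells happen to carry); if a cell ever lacks that newline, real data gets eaten. A safer variant of that line, as a small sketch, strips whitespace instead:

    row = [i.get_text(strip=True) for i in td]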

You can use pandas' read_html:
# reads all the tables on the page & returns them as a list; pick the table that meets your need
table_list = pd.read_html("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
print(table_list[0])
  Postal Code       Borough      Neighborhood
0         M1A  Not assigned               NaN
1         M2A  Not assigned               NaN
2         M3A    North York         Parkwoods
3         M4A    North York  Victoria Village
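If you only want the assigned postal codes, a small follow-up sketch using the column names shown in the printed output above:

df = table_list[0]
df = df[df["Borough"] != "Not assigned"].reset_index(drop=True)  # keep only assigned rows
print(df.head())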


Web scrape and pull an attribute value instead of the text value out of a td for the entire table with Beautiful Soup

I am trying to scrape some data from a table, but the content I actually want is in an attribute.
Example xml:
'''
<tr data-row="0">
  <th scope="row" class="left" data-append-csv="AlleRi00" data-stat="player" csk="Allen, Ricardo">
    Ricardo Allen
  </th>
  <td class="center poptip out dnp" data-stat="week_4" data-tip="Out: Concussion" csk="4">
    <strong>O</strong>
  </td>
</tr>
'''
When scraping the table I use the following code:
'''
import pandas as pd
from bs4 import BeautifulSoup
import requests

url = 'https://www.pro-football-reference.com/teams/atl/2017_injuries.htm'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'lxml')
table = soup.find('table', attrs={'class': 'sortable', 'id': 'team_injuries'})
table_rows = table.find_all('tr')

final_data = []
for tr in table_rows:
    td = tr.find_all(['th', 'td'])
    row = [cell.text for cell in td]
    final_data.append(row)

df = pd.DataFrame(final_data[1:], final_data[0])
'''
With my current code, I get a good-looking dataframe with headers and all the info that is visible when looking at the table. However, I would like to get "Out: Concussion" instead of "O" within the table. I've tried numerous approaches and cannot figure it out. Please let me know if this is possible with the current process or if I am approaching it all wrong.
This should help you:
import pandas as pd
from bs4 import BeautifulSoup
import requests

url = 'https://www.pro-football-reference.com/teams/atl/2017_injuries.htm'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'lxml')
table = soup.find('table', attrs={'class': 'sortable', 'id': 'team_injuries'})
table_rows = table.find_all('tr')

final_data = []
for tr in table_rows:
    td = tr.find_all(['th', 'td'])
    # prefer the data-tip attribute (e.g. "Out: Concussion") when a cell has one,
    # otherwise fall back to the visible text
    row = [cell['data-tip'] if cell.has_attr("data-tip") else cell.text for cell in td]
    final_data.append(row)

m = final_data[1:]
# transpose the data so each header in final_data[0] lines up with a row of values
final_dataa = [[m[j][i] for j in range(len(m))] for i in range(len(m[0]))]
df = pd.DataFrame(final_dataa, final_data[0]).T
df.to_csv("D:\\injuries.csv", index=False)
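As a side note, since final_data[0] already holds the header row, an arguably simpler construction gives the same frame without the transpose step (a sketch, assuming every row has the same number of cells as the header):

df = pd.DataFrame(final_data[1:], columns=final_data[0])
df.to_csv("injuries.csv", index=False)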

Python BeautifulSoup and Pandas: extract tables from a list of urls and save all the tables into a single dataframe or csv

I am trying to extract tabular data from a list of urls, and I want to save all the tables into a single csv file.
I am relatively new to Python and from a non-CS background, but I am very eager to learn.
import pandas as pd
import urllib.request
import bs4 as bs

urls = ['A', 'B','C','D',...'Z']
for url in urls:
    source = urllib.request.urlopen(url).read()
    soup = bs.BeautifulSoup(source,'lxml')
    table = soup.find('table', class_='tbldata14 bdrtpg')
    table_rows = table.find_all('tr')

data = []
for tr in table_rows:
    td = tr.find_all('td')
    row = [tr.text for tr in td]
    data.append(row)

final_table = pd.DataFrame(data, columns=["ABC", "XYZ",...])
final_table.to_csv(r'F:\Projects\McData.csv', index=False, header=True)
What I get from the above code in the newly created csv file is:

ABC  XYZ  PQR  MNL  CYP  ZXS
1    2    3    4    5    6
My above code only gets the table from the last url, 'Z'; as I have checked, it is indeed the table from the last url in the list.
What I am trying to achieve is getting all the tables from the list of urls, i.e. A to Z, into a single csv file.
This is an issue with indentation and order. table_rows gets reset every time through the for url in urls loop, so you only end up with the last URL's worth of data. If you want all of the URLs' worth of data in one final CSV, see the changes I made below.
import pandas as pd
import urllib.request
import bs4 as bs

urls = ['A', 'B','C','D',...'Z']

data = []  # moved to the start, so it accumulates across URLs
for url in urls:
    source = urllib.request.urlopen(url).read()
    soup = bs.BeautifulSoup(source,'lxml')
    table = soup.find('table', class_='tbldata14 bdrtpg')
    table_rows = table.find_all('tr')
    # indented the following loop so it runs for every URL's data
    for tr in table_rows:
        td = tr.find_all('td')
        row = [cell.text for cell in td]
        data.append(row)

final_table = pd.DataFrame(data, columns=["ABC", "XYZ",...])
final_table.to_csv(r'F:\Projects\McData.csv', index=False, header=True)
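An alternative sketch with the same effect, building one DataFrame per URL and concatenating at the end (urls and the column names are placeholders, as above); this makes it easier to see which URL a failure came from:

frames = []
for url in urls:
    source = urllib.request.urlopen(url).read()
    soup = bs.BeautifulSoup(source, 'lxml')
    table = soup.find('table', class_='tbldata14 bdrtpg')
    # one list of cell texts per table row
    rows = [[cell.text for cell in tr.find_all('td')] for tr in table.find_all('tr')]
    frames.append(pd.DataFrame(rows))
final_table = pd.concat(frames, ignore_index=True)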

Web scraping with BeautifulSoup on Wikipedia

I am new to Python and trying to use BeautifulSoup to extract all the train station names from the third column of the wikitable on a Wikipedia page.
I have tried the code below, but it seems to return every row of cells as one group of information:
contentTable = soup.find('table', { "class" : "wikitable"})
cols = contentTable.find_all('td')
for col in cols:
    soup.find_all("a")
    print(col.get_text())
Output as below, representing one row from the table:
CG2
TE [a]
Changi Airport
樟宜机场
சாங்கி விமானநிலையம்
8 February 2002
Changi Airport
CGA
Changi
Singapore Changi Airport, Changi Airport PTB2 Bus Terminal
Expected dataframe column Station Names:
Station Names
Jurong East
Bukit Batok
etc...
Can someone teach me how to code this correctly?
Thank you!
Your program is simply printing the text contents of every 'td' tag in the wikitable.
Try this instead:
contentTable = soup.find('table', {"class": "wikitable"})
trs = contentTable.find_all('tr')
for tr in trs:
    tds = tr.find_all('td')
    for td in tds:
        if tds.index(td) == 2:
            print(td.get_text())
First it scrapes each row, then finds each 'td' tag on that row and prints its contents if it is the third 'td' tag on that row.
Try this:
import requests
from bs4 import BeautifulSoup

# url to be scraped
URL = "https://en.wikipedia.org/wiki/List_of_Singapore_MRT_stations"
PAGE = requests.get(URL)

# get HTML content
SOUP = BeautifulSoup(PAGE.content, 'lxml')  # lxml is faster than html.parser

contentTable = SOUP.find('table', {"class": "wikitable"})
rows = contentTable.findAll('tr')
for tr in rows:
    columns = tr.find_all('td')
    for index, td in enumerate(columns):
        if index == 2:
            print(td.text)
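To get the dataframe column the question asks for rather than printed output, here is a minimal sketch building on the rows variable above (assuming, as the question states, that the third column holds the station names):

import pandas as pd

station_names = []
for tr in rows:
    columns = tr.find_all('td')
    if len(columns) > 2:  # skip header rows and short rows
        station_names.append(columns[2].get_text(strip=True))

df = pd.DataFrame({'Station Names': station_names})
print(df.head())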

Beautiful Soup scrape table with table breaks

I'm trying to scrape a table into a dataframe. My attempt only returns the table name, not the data within the rows for each region.
This is what i have so far:
from bs4 import BeautifulSoup as bs4
import requests

url = 'https://www.eia.gov/todayinenergy/prices.php'
r = requests.get(url)
soup = bs4(r.text, "html.parser")
table_regions = soup.find('table', {'class': "t4"})
regions = table_regions.find_all('tr')
for row in regions:
    print(row)
The ideal outcome I'd like to get:

region        | price
--------------|------
new england   | 2.59
new york city | 2.52
Thanks for any assistance.
If you check your HTML response (soup), you will see that the table tag you get in the line table_regions = soup.find('table', {'class': "t4"}) is closed before the rows that contain the information you need (the ones that contain the td's with the class names up, dn, d1 and s1).
So how about using the raw td tags like this:
from bs4 import BeautifulSoup as bs4
import requests
import pandas as pd

url = 'https://www.eia.gov/todayinenergy/prices.php'
r = requests.get(url)
soup = bs4(r.text, "html.parser")
a = soup.find_all('tr')

rows = []
subel = []
for tr in a[42:50]:
    b = tr.find_all('td')
    for td in b:
        subel.append(td.string)
    rows.append(subel)
    subel = []

df = pd.DataFrame(rows, columns=['Region', 'Price_1', 'Percent_change_1', 'Price_2', 'Percent_change_2', 'Spark Spread'])
Notice that I use just the a[42:50] slice of the results, because a contains every tr on the page. You can use the rest too if you need to.
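A hard-coded slice like a[42:50] breaks whenever the page layout shifts, so here is a quick sketch for re-locating the right rows: print each tr with its index and eyeball the output to pick the new bounds.

for i, tr in enumerate(a):
    print(i, tr.get_text(" ", strip=True)[:60])  # first 60 chars of each row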

Beautiful Soup: Scrape Table Data

I'm looking to extract table data from the url below. Specifically, I would like to extract the data in the first column. When I run the code below, the data in the first column repeats multiple times. How can I get the values to show only once, as they appear in the table?
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.pythonscraping.com/pages/page3.html').read()
soup = BeautifulSoup(html, 'lxml')
table = soup.find('table', {'id': 'giftList'})
rows = table.find_all('tr')
for row in rows:
    data = row.find_all('td')
    for cell in data:
        print(data[0].text)
Try this:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.pythonscraping.com/pages/page3.html').read()
soup = BeautifulSoup(html, 'lxml')
table = soup.find('table', {'id': 'giftList'})
rows = table.find_all('tr')
for row in rows:
    data = row.find_all('td')
    if len(data) > 0:
        cell = data[0]
        print(cell.text)
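A slightly leaner variant of the same idea, as a sketch: find returns only the first matching td (or None for the header row), so no indexing is needed.

for row in rows:
    cell = row.find('td')
    if cell is not None:
        print(cell.text)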
Using the requests module in combination with CSS selectors, you can also try the following:
import requests
from bs4 import BeautifulSoup

link = 'http://www.pythonscraping.com/pages/page3.html'
soup = BeautifulSoup(requests.get(link).text, 'lxml')
for row in soup.select('table#giftList tr')[1:]:  # [1:] skips the header row
    cell = row.select_one('td').get_text(strip=True)
    print(cell)
Output:
Vegetable Basket
Russian Nesting Dolls
Fish Painting
Dead Parrot
Mystery Box
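Since the goal is tabular data anyway, pandas' read_html is also an option here, as a sketch: the attrs argument narrows the match to the giftList table, and the result is a list of DataFrames.

import pandas as pd

tables = pd.read_html('http://www.pythonscraping.com/pages/page3.html', attrs={'id': 'giftList'})
print(tables[0].iloc[:, 0])  # the first column, one value per row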
