Beautiful Soup:Scrape Table Data - python

I'm looking to extract table data from the url below. Specifically I would like to extract the data in first column. When I run the code below, the data in the first column repeats multiple times. How can I get the values to show only once as it appears in the table?
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen('http://www.pythonscraping.com/pages/page3.html').read()
soup = BeautifulSoup(html, 'lxml')
table = soup.find('table',{'id':'giftList'})
rows = table.find_all('tr')
for row in rows:
data = row.find_all('td')
for cell in data:
print(data[0].text)

Try this:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen('http://www.pythonscraping.com/pages/page3.html').read()
soup = BeautifulSoup(html, 'lxml')
table = soup.find('table',{'id':'giftList'})
rows = table.find_all('tr')
for row in rows:
data = row.find_all('td')
if (len(data) > 0):
cell = data[0]
print(cell.text)

Using requests module in combination with selectors you can try like the following as well:
import requests
from bs4 import BeautifulSoup
link = 'http://www.pythonscraping.com/pages/page3.html'
soup = BeautifulSoup(requests.get(link).text, 'lxml')
for table in soup.select('table#giftList tr')[1:]:
cell = table.select_one('td').get_text(strip=True)
print(cell)
Output:
Vegetable Basket
Russian Nesting Dolls
Fish Painting
Dead Parrot
Mystery Box

Related

I cannot separate scraped data into different category

I need help from python expert. This is a site where I have to scrape table data and separate into four different category then convert it into excel file but problem is all table category's classes are same.
There should be different four classes but same four classes
Thanks
Mariful
Website for scrape
import requests
from bs4 import BeautifulSoup
import csv
import re
import pandas as pd
url = "https://www.kpaa.or.kr/kpaa/eng/list.do?"
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
items = soup.find_all(class_='title')
for item in items:
n = item.text
print(n)
df = pd.Dataframe({'name':n, 'office':n, 'phone':n, 'email':n})
Here is i try to convert single data to 2D list to use in data pandas data frame.
from bs4 import BeautifulSoup
import csv
import re
import pandas as pd
url = "https://www.kpaa.or.kr/kpaa/eng/list.do?"
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
data_list = [td.getText(strip=True, separator=',').split(',') for td in soup.find('div', {'class':'cont_box2'}).find_all('tr')[:-1]]
df = pd.DataFrame(data_list)
df.to_excel('x.xlsx')

Convert table sourced from html webpage in to pandas dataframe

I'm trying to obtain a table from a webpage and convert in to a dataframe to be used in analysis. I've used the BeautifulSoup package to scrape the url and parse the table info, but I can't seem to export the info to a dataframe. My code is below:
from bs4 import BeautifulSoup as bs
from urllib import request
source = urllib.request.urlopen("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M").read()
soup = bs(source, "html.parser")
table = soup.table
table_rows = table.find_all("tr")
for tr in table_rows:
td = tr.find_all("td")
row = [i.text for i in td]
print(row)
By doing this I can see each row, but I'm not sure how to convert it to df. Any ideas?
please try this.
from bs4 import BeautifulSoup as bs
from urllib.request import urlopen
import pandas as pd
source = urlopen("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M").read()
soup = bs(source, "html.parser")
table = soup.table
table_rows = table.find_all("tr")
postal_codes = []
for tr in table_rows:
td = tr.find_all("td")
row = [ i.text[:-1] for i in td]
postal_codes.append(row)
#print(row)
postal_codes.pop(0)
df = pd.DataFrame(postal_codes, columns=['PostalCode', 'Borough', 'Neighborhood'])
print(df)
u can utilize pandas read_html
# read's all the tables & return as an array, pick the data table that meets your need
table_list = pd.read_html("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
print(table_list[0])
Postal Code Borough Neighborhood
0 M1A Not assigned NaN
1 M2A Not assigned NaN
2 M3A North York Parkwoods
3 M4A North York Victoria Village

How to specify table for BeautifulSoup to find?

I'm trying to grab the table on this page https://nces.ed.gov/collegenavigator/?id=139755 under the Net Price expandable object. I've gone through tutorials for BS4, but I get so confused by the complexity of the html in this case that I can't figure out what syntax and which tags to use.
Here's a screenshot of the table and html I'm trying to get:
This is what I have so far. How do I add other tags to narrow down the results to just that one table?
import requests
from bs4 import BeautifulSoup
page = requests.get('https://nces.ed.gov/collegenavigator/?id=139755')
soup = BeautifulSoup(page.text, 'html.parser')
soup = soup.find(id="divctl00_cphCollegeNavBody_ucInstitutionMain_ctl02")
print(soup.prettify())
Once I can parse that data, I will format into a dataframe with pandas.
In this case I'd probably just use pandas to retrieve all tables then index in for appropriate
import pandas as pd
table = pd.read_html('https://nces.ed.gov/collegenavigator/?id=139755')[10]
print(table)
If you are worried about future ordering you could loop the tables returned by read_html and test for presence of a unique string to identify table or use bs4 functionality of :has , :contains (bs4 4.7.1+) to identify the right table to then pass to read_html or continue handling with bs4
import pandas as pd
from bs4 import BeautifulSoup as bs
r = requests.get('https://nces.ed.gov/collegenavigator/?id=139755')
soup = bs(r.content, 'lxml')
table = pd.read_html(str(soup.select_one('table:has(td:contains("Average net price"))')))
print(table)
ok , maybe this can help you , I add pandas
import requests
from bs4 import BeautifulSoup
import pandas as pd
page = requests.get('https://nces.ed.gov/collegenavigator/?id=139755')
soup = BeautifulSoup(page.text, 'html.parser')
div = soup.find("div", {"id": "divctl00_cphCollegeNavBody_ucInstitutionMain_ctl02"})
table = div.findAll("table", {"class": "tabular"})[1]
l = []
table_rows = table.find_all('tr')
for tr in table_rows:
td = tr.find_all('td')
if td:
row = [i.text for i in td]
l.append(row)
df=pd.DataFrame(l, columns=["AVERAGE NET PRICE BY INCOME","2015-2016","2016-2017","2017-2018"])
print(df)
Here is a basic script to scrape that first table in that accordion:
from bs4 import BeautifulSoup
from urllib.request import urlopen
url = "https://nces.ed.gov/collegenavigator/?id=139755#netprc"
page = urlopen(url)
soup = BeautifulSoup(page, 'html.parser')
parent_table = soup.find('div', attrs={'id':'netprc'})
desired_table = parent_table.find('table')
print(desired_table.prettify())
I assume you only want the values within the table so I did an overkill version of this as well that will combine the column names and values together:
from bs4 import BeautifulSoup
from urllib.request import urlopen
url = "https://nces.ed.gov/collegenavigator/?id=139755#netprc"
page = urlopen(url)
soup = BeautifulSoup(page, 'html.parser')
parent_table = soup.find('div', attrs={'id':'netprc'})
desired_table = parent_table.find('table')
header_row = desired_table.find_all('th')
headers = []
for header in header_row:
header_text = header.get_text()
headers.append(header_text)
money_values = []
data_row =desired_table.find_all('td')
for rows in data_row:
row_text = rows.get_text()
money_values.append(row_text)
for yrs,money in zip(headers,money_values):
print(yrs,money)
This will print out the following:
Average net price
2015-2016 $13,340
2016-2017 $15,873
2017-2018 $16,950

Beautiful Soup scrape table with table breaks

I'm trying to scrape a table into a dataframe. My attempt only returns the table name and not the data within rows for each region.
This is what i have so far:
from bs4 import BeautifulSoup as bs4
import requests
url = 'https://www.eia.gov/todayinenergy/prices.php'
r = requests.get(url)
soup = bs4(r.text, "html.parser")
table_regions = soup.find('table', {'class': "t4"})
regions = table_regions.find_all('tr')
for row in regions:
print row
ideal outcome i'd like to get:
region | price
---------------|-------
new england | 2.59
new york city | 2.52
Thanks for any assistance.
If you check your html response (soup) you will see that the table tag you get in this line table_regions = soup.find('table', {'class': "t4"}) its closed up before the rows that contain the information you need (the ones that contain the td's with the class names: up dn d1 and s1.
So how about using the raw td tags like this:
from bs4 import BeautifulSoup as bs4
import requests
import pandas as pd
url = 'https://www.eia.gov/todayinenergy/prices.php'
r = requests.get(url)
soup = bs4(r.text, "html.parser")
a = soup.find_all('tr')
rows = []
subel = []
for tr in a[42:50]:
b = tr.find_all('td')
for td in b:
subel.append(td.string)
rows.append(subel)
subel = []
df = pd.DataFrame(rows, columns=['Region','Price_1', 'Percent_change_1', 'Price_2', 'Percent_change_2', 'Spark Spread'])
Notice that I use just the a[42:50] slice of the results because a contains all the td's of the website. You can use the rest too if you need to.

Just like the BeautifulSoup example...what's wrong?

So I am trying to scrape this vocabulary table using beautifulsoup:
http://www.homeplate.kr/korean-baseball-vocabulary
I tried to scrape it just as I did this table of football teams:
http://www.bcsfootball.org/
The first case:
import urllib2
from bs4 import BeautifulSoup
url = 'http://www.homeplate.kr/korean-baseball-vocabulary'
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
for row in soup('table',{'class': 'tableizer-table'}):
tds = row('td')
print tds[0].string, tds[1].string
This outputs only one line of the table.
The second case:
import urllib2
from bs4 import BeautifulSoup
soup = BeautifulSoup(urllib2.urlopen('http://www.bcsfootball.org').read())
for row in soup('table',{'class': 'mod-data'})[0].tbody('tr'):
tds = row('td')
print tds[0].string, tds[1].string
This outputs the rank and school name for all 25 schools.
What am I doing wrong between the two examples?
Only one of them has ...[0].tbody('tr').
In the first code snippet, you're iterating over tables (despite your variable name of row), of which there is (presumably) only one.

Categories

Resources