I'm trying to grab the table on this page https://nces.ed.gov/collegenavigator/?id=139755 under the Net Price expandable section. I've gone through tutorials for BS4, but the HTML here is complex enough that I can't figure out which syntax and tags to use.
This is what I have so far. How do I add other tags to narrow down the results to just that one table?
import requests
from bs4 import BeautifulSoup
page = requests.get('https://nces.ed.gov/collegenavigator/?id=139755')
soup = BeautifulSoup(page.text, 'html.parser')
soup = soup.find(id="divctl00_cphCollegeNavBody_ucInstitutionMain_ctl02")
print(soup.prettify())
Once I can parse that data, I will format it into a dataframe with pandas.
In this case I'd probably just use pandas to retrieve all tables, then index in for the appropriate one:
import pandas as pd
table = pd.read_html('https://nces.ed.gov/collegenavigator/?id=139755')[10]
print(table)
If you are worried about future ordering, you could loop over the tables returned by read_html and test for the presence of a unique string to identify the right table (see the loop sketch after the next snippet), or use bs4's :has and :contains pseudo-classes (bs4 4.7.1+) to identify the right table and then pass it to read_html, or continue handling it with bs4:
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs

r = requests.get('https://nces.ed.gov/collegenavigator/?id=139755')
soup = bs(r.content, 'lxml')
# select the table that has a cell containing "Average net price",
# then hand its html to read_html (which returns a list of DataFrames)
table = pd.read_html(str(soup.select_one('table:has(td:contains("Average net price"))')))[0]
print(table)
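A minimal sketch of the loop-and-test alternative mentioned above (it assumes the string "Average net price" appears in exactly one of the tables):

import pandas as pd

tables = pd.read_html('https://nces.ed.gov/collegenavigator/?id=139755')
# test each DataFrame for a unique string instead of relying on its position
for t in tables:
    if t.astype(str).apply(lambda col: col.str.contains('Average net price', na=False)).any().any():
        print(t)
        break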
OK, maybe this can help you; I added pandas:
import requests
from bs4 import BeautifulSoup
import pandas as pd

page = requests.get('https://nces.ed.gov/collegenavigator/?id=139755')
soup = BeautifulSoup(page.text, 'html.parser')
div = soup.find("div", {"id": "divctl00_cphCollegeNavBody_ucInstitutionMain_ctl02"})
table = div.find_all("table", {"class": "tabular"})[1]

l = []
table_rows = table.find_all('tr')
for tr in table_rows:
    td = tr.find_all('td')
    if td:
        row = [i.text for i in td]
        l.append(row)

df = pd.DataFrame(l, columns=["AVERAGE NET PRICE BY INCOME", "2015-2016", "2016-2017", "2017-2018"])
print(df)
Here is a basic script to scrape that first table in that accordion:
from bs4 import BeautifulSoup
from urllib.request import urlopen
url = "https://nces.ed.gov/collegenavigator/?id=139755#netprc"
page = urlopen(url)
soup = BeautifulSoup(page, 'html.parser')
parent_table = soup.find('div', attrs={'id':'netprc'})
desired_table = parent_table.find('table')
print(desired_table.prettify())
I assume you only want the values within the table, so I did an overkill version as well that combines the column names and values:
from bs4 import BeautifulSoup
from urllib.request import urlopen

url = "https://nces.ed.gov/collegenavigator/?id=139755#netprc"
page = urlopen(url)
soup = BeautifulSoup(page, 'html.parser')

parent_table = soup.find('div', attrs={'id': 'netprc'})
desired_table = parent_table.find('table')

header_row = desired_table.find_all('th')
headers = []
for header in header_row:
    header_text = header.get_text()
    headers.append(header_text)

money_values = []
data_row = desired_table.find_all('td')
for rows in data_row:
    row_text = rows.get_text()
    money_values.append(row_text)

for yrs, money in zip(headers, money_values):
    print(yrs, money)
This will print out the following:
Average net price
2015-2016 $13,340
2016-2017 $15,873
2017-2018 $16,950
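Since the question mentions loading the result into a pandas DataFrame, here is a hedged follow-up to the script above (it assumes the headers and money_values lists pair up one-to-one, as in the printed output):

import pandas as pd

# continue from the script above: one row, with the scraped <th> texts as columns
df = pd.DataFrame([dict(zip(headers, money_values))])
print(df)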
Related
I have searched through a lot of similar questions, but I'm unable to resolve the issue with the code below.
I am trying to scrape the same information from 2 separate URLs.
There is no issue when I scrape 1 URL (code 1). When I then attempt to loop through multiple URLs (code 2), it throws this error:
ResultSet object has no attribute 'find_all'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
Is it the case that the line where the error is returned (marked below) should not be included within the for loop? (I have tried this unsuccessfully.)
Could someone please educate me on why this is not working (my guess is that the structure is wrong in some way, but I've been unable to adjust it correctly), or whether this is in fact not the optimal method at all?
First code:
import csv
import pandas as pd
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as ureq
import numpy as np
import re

url = "https://www.espncricinfo.com/series/pepsi-indian-premier-league-2014-695871/chennai-super-kings-vs-royal-challengers-bangalore-42nd-match-734013/full-scorecard"

url_contents = ureq(url)  # opening the URL
soup = soup(url_contents, "html.parser")  # parse the html
batsmen = soup.find_all("table", {"class": ["table batsman"]})
bowlers = soup.find_all("table", {"class": ["table bowler"]})

for batsman in batsmen[0]:
    with open('testcsv3.csv', 'a', newline='') as csvfile:
        f = csv.writer(csvfile)
        print(batsmen)
        for x in batsman:
            rows = batsman.find_all('tr')[:-2]  # find all tr tags (rows)
            for tr in rows:
                data = []
                cols = tr.find_all('td')  # find all td tags (columns)
                for td in cols:
                    data.append(td.text.strip())
                f.writerow(data)
                print(data)

for bowler in bowlers[1]:
    with open('testcsv3.csv', 'a', newline='') as csvfile:
        f = csv.writer(csvfile)
        print(bowlers)
        for x in bowler:
            rows = bowler.find_all('tr')  # find all tr tags (rows)
            for tr in rows:
                data = []
                cols = tr.find_all('td')  # find all td tags (columns)
                for td in cols:
                    data.append(td.text.strip())
                f.writerow(data)
                print(data)
Second code:
import csv  # to do operations on CSV
import pandas as pd  # file operations
from bs4 import BeautifulSoup as soup  # scraping tool
from urllib.request import urlopen as ureq  # for requesting data from a link
import numpy as np
import re

urls = ["https://www.espncricinfo.com/series/pepsi-indian-premier-league-2014-695871/chennai-super-kings-vs-royal-challengers-bangalore-42nd-match-734013/full-scorecard",
        "https://www.espncricinfo.com/series/pepsi-indian-premier-league-2014-695871/chennai-super-kings-vs-kolkata-knight-riders-21st-match-733971/full-scorecard"]

for url in urls:
    url_contents = ureq(url)  # opening the URL
    soup = soup(url_contents, "html.parser")  # parse the html
    batsmen = soup.find_all("table", {"class": ["table batsman"]})  # error here
    bowlers = soup.find_all("table", {"class": ["table bowler"]})

for batsman in batsmen[0]:
    with open('testcsv3.csv', 'a', newline='') as csvfile:
        f = csv.writer(csvfile)
        print(batsmen)
        for x in batsman:
            rows = batsman.find_all('tr')[:-2]  # find all tr tags (rows)
            for tr in rows:
                data = []
                cols = tr.find_all('td')  # find all td tags (columns)
                for td in cols:
                    data.append(td.text.strip())
                f.writerow(data)
                print(data)

for bowler in bowlers[1]:
    with open('testcsv3.csv', 'a', newline='') as csvfile:
        f = csv.writer(csvfile)
        print(bowlers)
        for x in bowler:
            rows = bowler.find_all('tr')  # find all tr tags (rows)
            for tr in rows:
                data = []
                cols = tr.find_all('td')  # find all td tags (columns)
                for td in cols:
                    data.append(td.text.strip())
                f.writerow(data)
                print(data)
Your problem is that you use the same name soup for both the class/function (soup(...)) and the result (soup = ...), and you run it in a loop:
from bs4 import BeautifulSoup as soup

for url in urls:
    soup = soup(...)
In the first loop everything works correctly, but the class/function soup() is replaced by the result soup = ..., and in the next loop Python tries to use that result as a class/function - which causes the problem.
In the first code you run soup = soup() only once, so there is no problem.
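A minimal standalone reproduction of the same error (the HTML string is made up for illustration):

from bs4 import BeautifulSoup as soup

html = "<table><tr><td>1</td></tr></table>"
for _ in range(2):
    # 1st pass: soup is the class, this builds a BeautifulSoup object
    # 2nd pass: soup is that object, so calling it runs find_all and returns a ResultSet
    soup = soup(html, "html.parser")
    # 2nd pass: AttributeError - ResultSet object has no attribute 'find_all'
    tables = soup.find_all("table")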
If you use different names - i.e. BeautifulSoup instead of soup - then it will work:
from bs4 import BeautifulSoup

for url in urls:
    soup = BeautifulSoup(...)
BTW:
In the second code you have wrong indentation - you should run for batsman in ... and for bowler in ... inside for url in urls:, but you run them outside (after exiting the for url in urls: loop), so you only get results for the last url. A corrected skeleton is sketched below.
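A hedged sketch of the corrected structure (reusing the question's imports and urls list, with BeautifulSoup under its own name so nothing gets shadowed; the CSV-writing details are elided):

from bs4 import BeautifulSoup
from urllib.request import urlopen as ureq

for url in urls:
    url_contents = ureq(url)
    page = BeautifulSoup(url_contents, "html.parser")
    batsmen = page.find_all("table", {"class": ["table batsman"]})
    bowlers = page.find_all("table", {"class": ["table bowler"]})

    # process this url's tables *inside* the url loop
    for batsman in batsmen:
        ...  # write this table's rows to the CSV
    for bowler in bowlers:
        ...  # write this table's rows to the CSV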
You can use the requests library and try this:
import requests as req
from bs4 import BeautifulSoup

urls = ["https://www.espncricinfo.com/series/pepsi-indian-premier-league-2014-695871/chennai-super-kings-vs-royal-challengers-bangalore-42nd-match-734013/full-scorecard",
        "https://www.espncricinfo.com/series/pepsi-indian-premier-league-2014-695871/chennai-super-kings-vs-kolkata-knight-riders-21st-match-733971/full-scorecard"]

for url in urls:
    otp = req.get(url)
    if otp.ok:
        soup = BeautifulSoup(otp.text, 'lxml')
        batsmen = soup.find_all('table', {'class': 'table batsman'})
        bowlers = soup.find_all('table', {'class': 'table bowler'})
        for bat in batsmen:
            print(bat.find_all('td'))  # here you can use the find/find_all methods
        for bowl in bowlers:
            print(bowl.find_all('td'))  # here you can use the find/find_all methods
I would like to scrape the reviews from this page and save them as a data frame, but I'm not getting the star ratings and the review text separately - just a single blob of text. What did I do wrong?
import csv
import pandas as pd
import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.morele.net/pralka-candy-cs4-1062d3-950636/?sekcja=reviews-all")
soup = BeautifulSoup(page.content, "html.parser").find_all("div", {"class": "reviews-item"})
# print(soup)
morele = [div.getText(strip=True) for div in soup]
print(morele)

csv_table = pd.DataFrame(morele)
csv_table = csv_table.reset_index(drop=True)
csv_table.insert(0, 'No.', csv_table.index)
You are mostly there - just navigate further down the DOM and you can pick out the review text and the star rating separately.
import requests
import pandas as pd
from bs4 import BeautifulSoup

page = requests.get("https://www.morele.net/pralka-candy-cs4-1062d3-950636/?sekcja=reviews-all")
soup = BeautifulSoup(page.content, "html.parser")

# for each review item, pull the description and the star rating separately
data = [{"text": ri.find("div", {"class": "rev-desc"}).getText(strip=True),
         "stars": ri.find("div", {"class": "rev-stars"}).getText(strip=True)}
        for ri in soup.find_all("div", {"class": "reviews-item"})]

df = pd.DataFrame(data)
print(df)
How can I extract the value of the last traded price (LTP) for the 12,000.00 strike price from the given URL in Python?
https://nseindia.com/live_market/dynaContent/live_watch/option_chain/optionKeys.jsp?symbolCode=-10006&symbol=NIFTY&symbol=NIFTY&instrument=-&date=-&segmentLink=17&symbolCount=2&segmentLink=17
The LTP of the 12,000.00 strike price is 25.35.
With bs4 4.7.1+ you can use :has and :contains. Use :contains with td:nth-of-type to search the right column, then :has to retrieve the parent row, and the descendant combinator with td:nth-of-type again to get the LTP column value for that row.
import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://nseindia.com/live_market/dynaContent/live_watch/option_chain/optionKeys.jsp?symbolCode=-10006&symbol=NIFTY&symbol=NIFTY&instrument=-&date=-&segmentLink=17&symbolCount=2&segmentLink=17')
soup = bs(r.content, 'lxml')
# row whose 12th cell (strike price) contains 12000.00, then that row's 6th cell (LTP)
ltp = soup.select_one('#octable tr:has(td:nth-of-type(12):contains("12000.00")) td:nth-of-type(6)').text.strip()
print(ltp)
import requests
from bs4 import BeautifulSoup

page = requests.get('https://nseindia.com/live_market/dynaContent/live_watch/option_chain/optionKeys.jsp?symbolCode=-10006&symbol=NIFTY&symbol=NIFTY&instrument=-&date=-&segmentLink=17&symbolCount=2&segmentLink=17')
soup = BeautifulSoup(page.content, "lxml")

# collect every data row of the option chain table (skip the header rows and the footer row)
data = []
for tr in soup.select('table#octable tr')[2:-1]:
    data.append([td.text.strip() for td in tr.select('td')])

def get_ltp(data, strike_price):
    # the 12th column holds the strike price, the 6th holds the LTP
    for d in data:
        if strike_price == d[11]:
            return d[5]

print(get_ltp(data, '12000.00'))
Prints:
25.35
I'm trying to scrape a table into a dataframe. My attempt only returns the table name, not the data within the rows for each region.
This is what i have so far:
from bs4 import BeautifulSoup as bs4
import requests

url = 'https://www.eia.gov/todayinenergy/prices.php'
r = requests.get(url)
soup = bs4(r.text, "html.parser")

table_regions = soup.find('table', {'class': "t4"})
regions = table_regions.find_all('tr')
for row in regions:
    print(row)
Ideal outcome I'd like to get:
region | price
---------------|-------
new england | 2.59
new york city | 2.52
Thanks for any assistance.
If you check your HTML response (soup), you will see that the table tag you get in the line table_regions = soup.find('table', {'class': "t4"}) is closed before the rows that contain the information you need (the ones whose td's have the class names up, dn, d1 and s1).
So how about using the raw tr tags like this:
from bs4 import BeautifulSoup as bs4
import requests
import pandas as pd

url = 'https://www.eia.gov/todayinenergy/prices.php'
r = requests.get(url)
soup = bs4(r.text, "html.parser")

a = soup.find_all('tr')
rows = []
subel = []
for tr in a[42:50]:
    b = tr.find_all('td')
    for td in b:
        subel.append(td.string)
    rows.append(subel)
    subel = []

df = pd.DataFrame(rows, columns=['Region', 'Price_1', 'Percent_change_1', 'Price_2', 'Percent_change_2', 'Spark Spread'])
Notice that I use just the a[42:50] slice of the results because a contains all the tr's of the page. You can use the rest too if you need to; if the fixed slice feels brittle, see the sketch below.
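A hedged alternative to the fixed slice is to match rows by region name instead (the names in wanted are illustrative, taken from the desired output above):

# keep only rows whose first cell is a region we care about
wanted = {'New England', 'New York City'}
rows = []
for tr in soup.find_all('tr'):
    cells = [td.get_text(strip=True) for td in tr.find_all('td')]
    if cells and cells[0] in wanted:
        rows.append(cells)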
I'm looking to extract table data from the url below. Specifically, I would like to extract the data in the first column. When I run the code below, the data in the first column repeats multiple times. How can I get the values to show only once, as they appear in the table?
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.pythonscraping.com/pages/page3.html').read()
soup = BeautifulSoup(html, 'lxml')

table = soup.find('table', {'id': 'giftList'})
rows = table.find_all('tr')
for row in rows:
    data = row.find_all('td')
    for cell in data:
        print(data[0].text)
Your inner loop prints the first cell once for every cell in the row; print it once per row instead. Try this:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.pythonscraping.com/pages/page3.html').read()
soup = BeautifulSoup(html, 'lxml')

table = soup.find('table', {'id': 'giftList'})
rows = table.find_all('tr')
for row in rows:
    data = row.find_all('td')
    if len(data) > 0:
        cell = data[0]
        print(cell.text)
Using the requests module in combination with CSS selectors, you can also try the following:
import requests
from bs4 import BeautifulSoup

link = 'http://www.pythonscraping.com/pages/page3.html'
soup = BeautifulSoup(requests.get(link).text, 'lxml')

# skip the header row, then take the first td of each remaining row
for row in soup.select('table#giftList tr')[1:]:
    cell = row.select_one('td').get_text(strip=True)
    print(cell)
Output:
Vegetable Basket
Russian Nesting Dolls
Fish Painting
Dead Parrot
Mystery Box