How do I get the second table class? - python

I am trying to find a table in a Wikipedia page using BeautifulSoup. I know how to get the first table, but how do I get the second table (Recent changes to the list of S&P 500 Components) with the same class wikitable sortable?
my code:
import bs4 as bs
import requests
url='https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
r=requests.get(url)
url=r.content
soup = bs.BeautifulSoup(url,'html.parser')
tab = soup.find("table",{"class":"wikitable sortable"})
https://en.wikipedia.org/wiki/List_of_S%26P_500_companies

You can use soup.find_all and take the last table. Because the page has only two table tags whose class is wikitable sortable, the last element of the resulting list is the "Recent changes" table:
soup.find_all("table", {"class":"wikitable sortable"})[-1]
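As a minimal sketch of the indexing approach, the snippet below runs find_all on a small inline HTML sample and picks the second match by position. The sample markup and its ids are assumptions standing in for the Wikipedia page, so no network request is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page: two tables sharing the same classes.
html = """
<table class="wikitable sortable" id="constituents"><tr><td>row</td></tr></table>
<p>Some text in between.</p>
<table class="wikitable sortable" id="changes"><tr><td>row</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all returns matches in document order, so index 1 is the second table.
tables = soup.find_all("table", {"class": "wikitable sortable"})
second = tables[1] if len(tables) > 1 else None

print(len(tables))
print(second["id"] if second else "no second table")
```

Checking the length before indexing avoids an IndexError if the page layout changes and only one table matches.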

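You could use an nth-of-type CSS selector. Note that nth-of-type counts sibling <table> elements, not matches of the class, so it picks the second matching table only when no other table precedes the two under the same parent: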
import bs4 as bs
import requests
url = 'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
r = requests.get(url)
url = r.content
soup = bs.BeautifulSoup(url,'lxml')
tab = soup.select_one("table.wikitable.sortable:nth-of-type(2)")
print(tab)
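To illustrate how nth-of-type counts, here is a small sketch on assumed markup where an unrelated table comes before the two wikitable tables. The ids and the infobox table are hypothetical and exist only to make the difference visible.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: an unrelated table precedes the two matching ones.
html = """
<div>
  <table class="infobox" id="box"></table>
  <table class="wikitable sortable" id="first"></table>
  <table class="wikitable sortable" id="second"></table>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# nth-of-type(2) means "second <table> among its siblings"; the class
# filter is applied on top, so here it lands on the first wikitable.
by_nth = soup.select_one("table.wikitable.sortable:nth-of-type(2)")

# Selecting all matches and indexing counts only the matching tables.
by_index = soup.select("table.wikitable.sortable")[1]

print(by_nth["id"], by_index["id"])
```

If the goal is "the second table with these classes", indexing the result of select is the more direct way to express it.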

Related

Can't scrape <h3> tag from page

It seems I can scrape any tag and class on this page except h3; the search keeps returning None or an empty list. I'm trying to get this h3 tag:
...on the following webpage:
https://www.empireonline.com/movies/features/best-movies-2/
And this is the code I use:
from pprint import pprint
import requests
from bs4 import BeautifulSoup
URL = "https://www.empireonline.com/movies/features/best-movies-2/"
response = requests.get(URL)
web_html = response.text
soup = BeautifulSoup(web_html, "html.parser")
movies = soup.findAll(name="h3", class_="jsx-4245974604")
movies_text = []
for item in movies:
    result = item.getText()
    movies_text.append(result)
print(movies_text)
Can you please help with the solution for this problem?
As other people mentioned, this is dynamic content: it is generated only once the page is opened and its scripts run, so the class "jsx-4245974604" is not in the HTML that requests receives and BS4 can't find it.
If you print your soup variable you can see that it isn't there. But if you simply want the names of the movies, you can use another part of the HTML in this case.
The movie name is in the alt attribute of the image (and also in several other parts of the HTML).
import requests
from pprint import pprint
from bs4 import BeautifulSoup
URL = "https://www.empireonline.com/movies/features/best-movies-2/"
response = requests.get(URL)
web_html = response.text
soup = BeautifulSoup(web_html, "html.parser")
movies = soup.find_all("img", class_="jsx-952983560")
movies_text = []
for item in movies:
    # The title is stored in the image's alt attribute.
    result = item.get('alt')
    movies_text.append(result)
print(movies_text)
If you run into this issue in the future, print the initial HTML that soup holds and check by eye whether the information you need is there.
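The check described above can also be done in code. The following is a rough sketch on an assumed server response (the markup, class names, and titles are made up for illustration): first confirm whether the tag exists in the raw HTML, then fall back to an element that carries the same information.

```python
from bs4 import BeautifulSoup

# Hypothetical server response: no <h3> yet, but titles sit in alt text.
html = """
<div class="listicle">
  <img class="poster" alt="Movie A" src="a.jpg">
  <img class="poster" alt="Movie B" src="b.jpg">
  <img class="spacer" src="c.gif">
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Step 1: check whether the tag you are after exists in the raw HTML at all.
h3_count = len(soup.find_all("h3"))

# Step 2: fall back to another element that carries the same information.
# alt=True keeps only <img> tags that actually have an alt attribute.
titles = [img["alt"] for img in soup.find_all("img", alt=True)]

print(h3_count)
print(titles)
```

Filtering on the attribute rather than on a generated class name such as jsx-952983560 may also hold up better if the site rebuilds its stylesheets.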

Extract specific value from a table using Beautiful Soup (Python)

I looked around on Stack Overflow, and most guides focus on extracting all the data from a table. However, I only need one value, and I can't seem to extract that specific value from the table.
Scrape link:
https://gis.vgsi.com/portsmouthnh/Parcel.aspx?pid=38919
I am looking to extract the "Style" value from the table within the link.
Code:
import bs4
import requests
styleData=[]
pagedata = requests.get("https://gis.vgsi.com/portsmouthnh/Parcel.aspx?pid=38919")
cleanpagedata = bs4.BeautifulSoup(pagedata.text, 'html.parser')
table=cleanpagedata.find('div',{'id':'MainContent_ctl01_panView'})
style=table.findall('tr')[3]
style=style.findall('td')[1].text
print(style)
styleData.append(style)
The method is called find_all, not findall. Try this:
style=table.find_all('tr')[3]
style=style.find_all('td')[1].text
print(style)
It will give you the expected output.
You can use a CSS Selector:
#MainContent_ctl01_grdCns tr:nth-of-type(4) td:nth-of-type(2)
This selects the element with id "MainContent_ctl01_grdCns", then its fourth <tr>, then the second <td> in that row.
To use a CSS selector, call .select() instead of find_all(), or select_one() instead of find().
import requests
from bs4 import BeautifulSoup
URL = "https://gis.vgsi.com/portsmouthnh/Parcel.aspx?pid=38919"
soup = BeautifulSoup(requests.get(URL).content, "html.parser")
print(
    soup.select_one(
        "#MainContent_ctl01_grdCns tr:nth-of-type(4) td:nth-of-type(2)"
    ).text
)
Output:
Townhouse End
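For comparison, here is a small offline sketch showing that the CSS selector and the find_all indexing reach the same cell. The table below is an assumed approximation of the page's layout (only the id and the "Townhouse End" value come from the thread; the other rows are placeholders).

```python
from bs4 import BeautifulSoup

# Hypothetical table shaped like the one discussed above (label, value).
html = """
<table id="MainContent_ctl01_grdCns">
  <tr><th>Field</th><th>Description</th></tr>
  <tr><td>Label 1</td><td>Value 1</td></tr>
  <tr><td>Label 2</td><td>Value 2</td></tr>
  <tr><td>Style:</td><td>Townhouse End</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# CSS route: fourth <tr>, second <td> (nth-of-type is 1-based).
css_value = soup.select_one(
    "#MainContent_ctl01_grdCns tr:nth-of-type(4) td:nth-of-type(2)"
).text

# find_all route: the same cell, but list indexes are 0-based.
table = soup.find("table", {"id": "MainContent_ctl01_grdCns"})
idx_value = table.find_all("tr")[3].find_all("td")[1].text

print(css_value, idx_value)
```

Both routes depend on the row position staying fixed, which is worth keeping in mind if the page adds or removes rows.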
You could also do something like:
import bs4
import requests
style_data = []
url = "https://gis.vgsi.com/portsmouthnh/Parcel.aspx?pid=38919"
soup = bs4.BeautifulSoup(requests.get(url).content, 'html.parser')
# Select the first `td` tag whose text contains the substring `Style:`.
row = soup.select_one('td:-soup-contains("Style:")')
if row:
    # If that cell was found, get its sibling, which should hold the value you want.
    home_style_tag = row.next_sibling
    style_data.append(home_style_tag.text)
A couple of notes:
This uses CSS selectors rather than the find methods. See SoupSieve docs for more details.
select_one relies on the table always being ordered the same way. If that is not the case, use select and iterate through the results to find the bs4.Tag whose text is exactly 'Style:', then grab its next sibling.
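Using select:
rows = soup.select('td:-soup-contains("Style:")')
# Keep the first cell whose text is exactly 'Style:' (None if there is no such cell).
row = next((r for r in rows if r.text == 'Style:'), None)
home_style_text = row.next_sibling.text if row else None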
You can use :contains on a td to get the node whose text contains "Style", then an adjacent sibling combinator with a td type selector to get the value in the next td. (Recent Soup Sieve versions deprecate :contains in favor of :-soup-contains.)
import bs4, requests
pagedata = requests.get("https://gis.vgsi.com/portsmouthnh/Parcel.aspx?pid=38919")
cleanpagedata = bs4.BeautifulSoup(pagedata.text, 'html.parser')
print(cleanpagedata.select_one('td:contains("Style") + td').text)
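A label-based lookup can also be written without CSS pseudo-classes. The sketch below is one possible approach, not the method from any answer above: value_for_label is a hypothetical helper, and the sample rows (including "Roof Style:") are invented to show why an exact text match can matter when another label contains the same substring.

```python
from bs4 import BeautifulSoup

# Hypothetical rows: the first label merely contains the substring "Style:".
html = """
<table>
  <tr><td>Roof Style:</td><td>Gable</td></tr>
  <tr><td>Style:</td><td>Townhouse End</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")


def value_for_label(soup, label):
    """Return the text of the <td> following the <td> whose text is exactly `label`."""
    for cell in soup.find_all("td"):
        if cell.get_text(strip=True) == label:
            # find_next_sibling skips whitespace text nodes between cells.
            sibling = cell.find_next_sibling("td")
            return sibling.get_text(strip=True) if sibling else None
    return None


style = value_for_label(soup, "Style:")
print(style)
```

Returning None for a missing label lets the caller decide how to handle pages that lack the field.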

BeautifulSoup: "Exception has occurred: KeyError 'id'" when using a function with .find_all() method

I am trying to use BeautifulSoup to build a scraper that pulls box scores from www.basketball-reference.com. An example box score page would be this. The box score tables I want are table tags whose id contains the word 'basic' (this distinguishes them from the advanced stats tables). I figured a function would be the best way to pick them out. Html looks like this.
My code:
import requests
from bs4 import BeautifulSoup
r = requests.get("https://www.basketball-reference.com/boxscores/202003110ATL.html").content
soup = BeautifulSoup(r, 'lxml')
def get_boxscore_basic_table(tag):
    return ('basic' in tag.attrs['id']) and ('sortable' in tag.attrs['class'])
tables = soup.find_all(get_boxscore_basic_table)
This throws KeyError: 'id' and I am confused about how to fix it. I've checked the keys by grabbing just the first instance using .find():
table = soup.find('table')
print(table.attrs)
And the key 'id' is there. Why can't it handle my request when searching through the whole HTML, and how can I fix this?
You were quite close! The issue is that find_all calls your function on every tag, and some elements don't have an id or a class, so indexing tag.attrs raises KeyError for the missing attribute.
This should work correctly:
import requests
from bs4 import BeautifulSoup
r = requests.get("https://www.basketball-reference.com/boxscores/202003110ATL.html")
soup = BeautifulSoup(r.content, 'lxml')
def valid_boxscore_basic_table_elem(tag):
    # .get returns None instead of raising KeyError when the attribute is missing.
    tag_id = tag.get("id")
    tag_class = tag.get("class")
    return (tag_id and tag_class) and ("basic" in tag_id and "sortable" in tag_class)
tables = soup.find_all(valid_boxscore_basic_table_elem)
print(tables)
Be careful when using in, though: on a string it is a substring test, so "cat" in "caterpillar" is True.
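A short sketch of what in actually tests here, assuming the html.parser builder and a made-up table tag: Beautiful Soup returns id as a string but class as a list of class names, so the two checks behave differently.

```python
from bs4 import BeautifulSoup

# Hypothetical tag modelled on the box score tables described above.
html = '<table id="box-ATL-game-basic" class="sortable stats_table"></table>'
tag = BeautifulSoup(html, "html.parser").table

tag_id = tag.get("id")        # a str
tag_class = tag.get("class")  # a list of str, one entry per class

substring_hit = "basic" in tag_id    # substring test on a string
partial_class = "sort" in tag_class  # membership test on a list
exact_class = "sortable" in tag_class
missing = tag.get("data-missing")    # None rather than a KeyError

print(substring_hit, partial_class, exact_class, missing)
```

So the class check is already exact per class name, while the id check would also accept ids such as "basically-anything".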
The code can be simplified and made more versatile through the use of some basic regex:
import re
import requests
from bs4 import BeautifulSoup
r = requests.get("https://www.basketball-reference.com/boxscores/202003110ATL.html")
soup = BeautifulSoup(r.content, 'lxml')
valid_id_re = re.compile(r"-basic$")
valid_class_re = re.compile(r" ?sortable ?")
tables = soup.find_all("table", attrs={"id": valid_id_re.search, "class": valid_class_re.search})
You can try this. It uses a CSS selector to find tables whose id contains basic and whose class attribute contains sortable:
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.basketball-reference.com/boxscores/202003110ATL.html').content
soup = BeautifulSoup(r, 'html.parser')
print(soup.select('table[id*="basic"][class*="sortable"]'))
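The substring attribute selectors can match more than intended. As a hedged sketch on invented markup, the snippet below compares the *= form with a stricter variant that uses a class selector and an id suffix match (the -basic suffix follows the regex answer above; the ids themselves are assumptions).

```python
from bs4 import BeautifulSoup

# Hypothetical tables; only the first is a "basic" box score table.
html = """
<table id="box-ATL-game-basic" class="sortable stats_table"></table>
<table id="box-ATL-game-advanced" class="sortable stats_table"></table>
<table id="basic-info" class="unsortable"></table>
<div>no id or class here</div>
"""

soup = BeautifulSoup(html, "html.parser")

# *= is a substring match on the raw attribute value, so "unsortable"
# and "basic-info" both qualify.
loose = soup.select('table[id*="basic"][class*="sortable"]')

# .sortable matches a whole class name; $= requires the id to end in "-basic".
strict = soup.select('table.sortable[id$="-basic"]')

print([t["id"] for t in loose])
print([t["id"] for t in strict])
```

Neither selector raises on elements that lack an id or class, which sidesteps the KeyError from the question.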

Unable to find the class for price - web scraping

I want to extract the price from the website, but I'm having trouble locating the right class.
On this website we see that the price for this course is $5141. When I check the source code, the class for the price appears to be "field-items".
from bs4 import BeautifulSoup
import pandas as pd
import requests
url = "https://www.learningconnection.philips.com/en/course/pinnacle%C2%B3-advanced-planning-education"
html = requests.get(url)
soup = BeautifulSoup(html.text, 'html.parser')
price = soup.find(class_='field-items')
print(price)
However, when I ran the code I got a description of the course instead of the price. I'm not sure what I did wrong. Any help appreciated, thanks!
There are actually several elements with class "field-item even" on your page, so you have to pick the one inside the price field's container. Here's the code:
from bs4 import BeautifulSoup
import pandas as pd
import requests
url = "https://www.learningconnection.philips.com/en/course/pinnacle%C2%B3-advanced-planning-education"
html = requests.get(url)
soup = BeautifulSoup(html.text, 'html.parser')
section = soup.find(class_='field field-name-field-price field-type-number-decimal field-label-inline clearfix view-mode-full')
price = section.find(class_="field-item even").text
print(price)
And the result :
5141.00
With bs4 4.7.1+ you can use :contains to isolate the appropriate preceding tag, then use adjacent sibling and descendant combinators to get to the target:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://www.learningconnection.philips.com/en/course/pinnacle%C2%B3-advanced-planning-education')
soup = bs(r.content, 'lxml')
print(soup.select_one('.field-label:contains("Price:") + div .field-item').text)
This
.field-label:contains("Price:")
looks for an element with class field-label (the . is a CSS class selector) that contains the text Price:. The + is an adjacent sibling combinator that moves to the div immediately following it. In " .field-item", the leading space is a descendant combinator and .field-item is a class selector, so it matches any descendant of that div with class field-item. select_one returns the first match in the DOM for the whole selector.
Reading:
css selectors
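To make the combinators concrete, here is a minimal offline sketch on assumed markup modelled on the snippet quoted in this thread (the Duration field is invented). It uses :-soup-contains, the newer spelling of :contains, which assumes a reasonably recent Soup Sieve release.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two fields with the same inner class names.
html = """
<div class="field">
  <div class="field-label">Duration:</div>
  <div class="field-items"><div class="field-item even">2 days</div></div>
</div>
<div class="field">
  <div class="field-label">Price:</div>
  <div class="field-items"><div class="field-item even">5141.00</div></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# label containing "Price:"  ->  next sibling div  ->  descendant .field-item
selector = '.field-label:-soup-contains("Price:") + div .field-item'
price = soup.select_one(selector).text

print(price)
```

Anchoring on the label text means the lookup does not depend on how many other fields appear before the price.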
To get the price you can try a CSS selector with select_one(), which is precise and less error-prone.
import requests
from bs4 import BeautifulSoup
url = "https://www.learningconnection.philips.com/en/course/pinnacle%C2%B3-advanced-planning-education"
html = requests.get(url)
soup = BeautifulSoup(html.text, 'html.parser')
price = soup.select_one("[class*='field-price'] .even").text
print(price)
Output:
5141.00
Actually, the class I see using the Firefox inspector is field-item even; that's where the text is:
<div class="field-items"><div class="field-item even">5141.00</div></div>
But you need to change your code a little:
price = soup.find_all("div",{"class":'field-item even'})[2]
More than one element has the class "field-item even", and the price is not the first one.
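A positional index such as [2] breaks if the page adds or reorders fields. As an alternative sketch on assumed markup (the summary field is invented; the field-name-field-price class comes from an earlier answer), the lookup can be scoped to the price container instead:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the first "field-items" block holds a description.
html = """
<div class="field field-name-field-summary">
  <div class="field-items"><div class="field-item even">Course description</div></div>
</div>
<div class="field field-name-field-price">
  <div class="field-items"><div class="field-item even">5141.00</div></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# What the question's code does: the first match in the document wins.
first_hit = soup.find(class_="field-items").get_text(strip=True)

# Scoped lookup: find the price container first, then search inside it.
container = soup.find("div", class_="field-name-field-price")
price = (
    container.find("div", class_="field-item").get_text(strip=True)
    if container
    else None
)

print(first_hit)
print(price)
```

This also shows why the original code printed a description: find returns the first element with the class, wherever it sits in the page.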

BeautifulSoup - find table with specified class on Wikipedia page

I am trying to find a table in a Wikipedia page using BeautifulSoup and for some reason I don't get the table.
Can anyone tell why I don't get the table?
my code:
from bs4 import BeautifulSoup
import requests
url='https://en.wikipedia.org/wiki/List_of_National_Historic_Landmarks_in_Louisiana'
r=requests.get(url)
url=r.content
soup = BeautifulSoup(url,'html.parser')
tab=soup.find("table",{"class":"wikitable sortable jquery-tablesorter"})
print(tab)
prints: None
Don't select on jquery-tablesorter in the response you get from requests: that class is applied dynamically after the page loads in a browser. If you omit it, you should be good to go.
tab = soup.find("table",{"class":"wikitable sortable"})
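One caveat worth sketching: passing a multi-class string to find compares it against the full class attribute value, so it may stop matching if the served HTML carries an extra class. The example below uses invented markup (the plainrowheaders class and the id are assumptions) to contrast that with per-class matching.

```python
from bs4 import BeautifulSoup

# Hypothetical table with one more class than the search string lists.
html = '<table class="wikitable sortable plainrowheaders" id="landmarks"></table>'
soup = BeautifulSoup(html, "html.parser")

# Multi-class string: compared against the whole attribute value.
exact = soup.find("table", {"class": "wikitable sortable"})

# CSS selector: each class is tested on its own; order and extras are ignored.
css = soup.select_one("table.wikitable.sortable")

# A single class name also matches regardless of the other classes.
single = soup.find("table", class_="wikitable")

print(exact)
print(css["id"], single["id"])
```

When the exact class list is not under your control, the CSS form is usually the more tolerant choice.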
