Extract specific value from a table using Beautiful Soup (Python)

I looked around on Stack Overflow, and most guides seem to focus on extracting all of the data from a table. However, I only need one value, and I can't seem to extract that specific value from the table.
Scrape link:
https://gis.vgsi.com/portsmouthnh/Parcel.aspx?pid=38919
I am looking to extract the "Style" value from the table within the link.
Code:
import bs4
import requests

styleData = []
pagedata = requests.get("https://gis.vgsi.com/portsmouthnh/Parcel.aspx?pid=38919")
cleanpagedata = bs4.BeautifulSoup(pagedata.text, 'html.parser')
table = cleanpagedata.find('div', {'id': 'MainContent_ctl01_panView'})
style = table.findall('tr')[3]
style = style.findall('td')[1].text
print(style)
styleData.append(style)

You probably misused the find_all function (Beautiful Soup's method is find_all, not findall); try this instead:
style=table.find_all('tr')[3]
style=style.find_all('td')[1].text
print(style)
This will give you the expected output.
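Putting the fix together, a minimal corrected version of the original script might look like this (the row/cell indices follow the question's own code, and assume the page layout stays as it is today):
import bs4
import requests

styleData = []
pagedata = requests.get("https://gis.vgsi.com/portsmouthnh/Parcel.aspx?pid=38919")
cleanpagedata = bs4.BeautifulSoup(pagedata.text, 'html.parser')

# Locate the panel that holds the details table
table = cleanpagedata.find('div', {'id': 'MainContent_ctl01_panView'})

# Fourth row, second cell holds the "Style" value per the question's indexing
style = table.find_all('tr')[3].find_all('td')[1].text
print(style)  # e.g. "Townhouse End"
styleData.append(style)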

You can use a CSS Selector:
#MainContent_ctl01_grdCns tr:nth-of-type(4) td:nth-of-type(2)
This selects the element with the id MainContent_ctl01_grdCns, then its fourth <tr>, then that row's second <td>.
To use a CSS selector, call the .select() method instead of find_all(), or select_one() instead of find().
import requests
from bs4 import BeautifulSoup
URL = "https://gis.vgsi.com/portsmouthnh/Parcel.aspx?pid=38919"
soup = BeautifulSoup(requests.get(URL).content, "html.parser")
print(
    soup.select_one(
        "#MainContent_ctl01_grdCns tr:nth-of-type(4) td:nth-of-type(2)"
    ).text
)
Output:
Townhouse End

Could also do something like:
import bs4
import requests
style_data = []
url = "https://gis.vgsi.com/portsmouthnh/Parcel.aspx?pid=38919"
soup = bs4.BeautifulSoup(requests.get(url).content, 'html.parser')
# Select the first `td` tag whose text contains the substring "Style:".
row = soup.select_one('td:-soup-contains("Style:")')
if row:
    # If that row was found, get its next sibling, which should be the value you want
    home_style_tag = row.next_sibling
    style_data.append(home_style_tag.text)
A couple of notes
This uses CSS selectors rather than the find methods. See SoupSieve docs for more details.
The select_one call relies on the table always being ordered in a certain way. If that is not the case, use select and iterate through the results to find the bs4.Tag whose text is exactly 'Style:', then grab its next sibling.
Using select:
rows = soup.select('td:-soup-contains("Style:")')
row = [r for r in rows if r.text == 'Style:'][0]
home_style_text = row.next_sibling.text

You can use :contains on a td to get the cell whose inner text is "Style", then an adjacent sibling combinator (+) with a td type selector to get the adjacent td holding the value.
import bs4, requests
pagedata = requests.get("https://gis.vgsi.com/portsmouthnh/Parcel.aspx?pid=38919")
cleanpagedata = bs4.BeautifulSoup(pagedata.text, 'html.parser')
print(cleanpagedata.select_one('td:contains("Style") + td').text)

Related

BeautifulSoup/Python Problems Parsing Websites

I'm sure this may have been asked in the past, but I am attempting to parse a website (and hopefully eventually automate parsing multiple websites at once), and it's not working properly. I may be having issues grabbing the appropriate tags. Essentially, I want to go to this website, pull all of the items from the lists on it (possibly with hrefs intact, or in a separate document), and stick them into a file I can share in an easy-to-read format. So far this is my code:
url = "http://catalog.apu.edu/academics/college-liberal-arts-sciences/math-physics-statistics/applied-mathematics-bs/" `
page = urlopen(url)
html = page.read().decode("utf-8")
soup = BeautifulSoup(html, "html.parser")
print(soup.get_text())
results = soup.find_all('div', class_"tab_content")
for element in results:
    title_elem = element.find('h1')
    h2_elem = element.find('h2')
    h3_elem = element.find('h3')
    href_elem = element.find('href')
    if None in (title_elem, h2_elem, h3_elem, href_elem):
        continue
    print(title_elem.text.strip())
    print(h2_elem.text.strip())
    print(h3_elem.text.strip())
    print(href_elem.text.strip())
    print()
I even attempted to write this for a table but I get the same type of output, which are a bunch of empty elements:
for table in soup.find_all('table'):
    for subtable in table.find_all('table'):
        print(subtable)
Does anyone have any insight as to why this may be the case? I would also not be opposed to regex parsing, but the main goal here is to go into this site (and hopefully others like it) and write the entire tables/lists/descriptions of the individual programs for each major into an easy-to-read file.
My approach is similar in that I also combined bs4 with pandas, but I tested for the presence of the hyperlink class.
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
url = 'http://catalog.apu.edu/academics/college-liberal-arts-sciences/math-physics-statistics/applied-mathematics-bs/'
r = requests.get(url)
soup = bs(r.content, 'lxml')
for table in soup.select('.sc_courselist'):
    tbl = pd.read_html(str(table))[0]
    links_column = [
        'http://catalog.apu.edu' + i.select_one('.bubblelink')['href']
        if i.select_one('.bubblelink') is not None else ''
        for i in table.select('td:nth-of-type(1)')
    ]
    tbl['Links'] = links_column
    print(tbl)
With BeautifulSoup, an alternative to find/find_all is select_one/select. The latter two apply CSS selectors, with select_one returning the first match for the selector passed in and select returning a list of all matches. "." is a class selector, meaning it selects elements with the specified class, e.g. sc_courselist or bubblelink. bubblelink is the class of the elements with the desired hrefs. These sit within the first column of each table, which is selected with td:nth-of-type(1).
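As a minimal illustration of those selectors, consider this self-contained sketch (the markup here is invented to mirror the structure described above; the real page will differ):
from bs4 import BeautifulSoup

html = '''
<table class="sc_courselist">
  <tr><td><a class="bubblelink" href="/search/?P=MATH+110">MATH 110</a></td><td>Calculus I</td></tr>
  <tr><td>MATH 120</td><td>Calculus II</td></tr>
</table>
'''
soup = BeautifulSoup(html, 'html.parser')

# "." is a class selector: it matches every element carrying that class
for table in soup.select('.sc_courselist'):
    # td:nth-of-type(1) selects the first cell of each row
    for cell in table.select('td:nth-of-type(1)'):
        link = cell.select_one('.bubblelink')  # None when the cell has no hyperlink
        print(link['href'] if link is not None else '(no link)')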

How to fetch text from BeautifulSoup, getting error

I am trying to fetch text from a webpage - https://www.symantec.com/security_response/definitions.jsp?pid=sep14
Exactly where it says:
File-Based Protection (Traditional Antivirus)
Extended Version: 4/18/2019 rev. 2
But I keep running into errors. How can I get the part where it says 4/18/2019 rev. 2?
from bs4 import BeautifulSoup
import requests
import re
page = requests.get("https://www.symantec.com/security_response/definitions.jsp?pid=sep14")
soup = BeautifulSoup(page.content, 'html.parser')
extended = soup.find_all('div', class_='unit size1of2 feedBody')
print(extended)
You can actually use CSS selectors to do this with Beautiful Soup 4.7+. Here we target the same div and classes that you did above, but we also look for the descendant li and its direct child > strong. We then use the custom pseudo-class :contains() to ensure that the strong element contains the text Extended Version:. We use the select_one API call as it returns the first element that matches; select would return all matching elements in a list, but we only need one.
Once we have the strong element, we know the next sibling text node has the information we want, so we can just use next_sibling to grab that text:
from bs4 import BeautifulSoup
import requests
page = requests.get("https://www.symantec.com/security_response/definitions.jsp?pid=sep14")
soup = BeautifulSoup(page.content, 'html.parser')
extended = soup.select_one('div.unit.size1of2.feedBody li:contains("Extended Version:") > strong')
print(extended.next_sibling)
Output
4/18/2019 rev. 7
EDIT: As @QHarr mentions in the comments, you can most likely get away with a more simplified strong:contains("Extended Version:"). It is important to remember that :contains() searches all child text nodes of the given element, even sub-text nodes of child elements, so being specific is important. I wouldn't use a bare :contains("Extended Version:") as it would also match the div, the list elements, etc.; specifying (at the very minimum) strong should narrow the selection enough to give you exactly what you need.
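For reference, the simplified form would look something like this (reusing the soup object from the snippet above):
extended = soup.select_one('strong:contains("Extended Version:")')
print(extended.next_sibling)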
I changed your code as below; now it shows what you want:
from bs4 import BeautifulSoup
import requests
import re
page = requests.get("https://www.symantec.com/security_response/definitions.jsp?pid=sep14")
soup = BeautifulSoup(page.content, 'html.parser')
extended = soup.find('div', class_='unit size1of2 feedBody').find_all('li')
print(extended[2])
Try this maybe?
from bs4 import BeautifulSoup
import requests
import re
page = requests.get("https://www.symantec.com/security_response/definitions.jsp?pid=sep14")
soup = BeautifulSoup(page.content, 'html.parser')
extended = soup.find('div', class_='unit size1of2 feedBody').find_all('li')
print(extended[2].text.strip())

How do I get the second table class?

I am trying to find a table in a Wikipedia page using BeautifulSoup. I know how to get the first table, but how do I get the second table (Recent changes to the list of S&P 500 Components) with the same class wikitable sortable?
my code:
import bs4 as bs
import requests
url='https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
r=requests.get(url)
url=r.content
soup = bs.BeautifulSoup(url,'html.parser')
tab = soup.find("table",{"class":"wikitable sortable"})
You can use soup.find_all and access the last table. Since there are only two table tags with wikitable sortable as their class, the last element in the resulting list will be the "Recent changes" table:
soup.find_all("table", {"class":"wikitable sortable"})[-1]
You could use an nth-of-type CSS selector to specify the second matching table:
import bs4 as bs
import requests
url = 'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
r = requests.get(url)
url = r.content
soup = bs.BeautifulSoup(url,'lxml')
tab = soup.select_one("table.wikitable.sortable:nth-of-type(2)")
print(tab)

How do I get the HTML between two div elements in Python

I'm trying to scrape all the paragraphs from Wikipedia that come between the main heading of the page and the table of contents. I noticed that they always come between two div elements as shown below:
<div id="some-div">...</div>
<p>...</p>
<p>...</p>
<p>...</p>
<div id="some-other-div">...</div>
I want to grab all of the HTML between the two div elements (not just the text)
Looking for a solution in Python.
I doubt you can depend on utterly consistent formatting. However, this seems to work for the 'Python (programming language)' page, where the introductory text is delimited by the 'Contents' box.
I offer a few notes:
find_previous_siblings returns the paragraphs in reverse order.
I would check the length of contents against the unlikely possibility of more than one occurrence.
It's almost certainly necessary with this approach to check for rubbish.
from urllib.request import urlopen
from bs4 import BeautifulSoup

URL = 'https://en.wikipedia.org/wiki/Python_(programming_language)'
HTML = urlopen(URL).read()
soup = BeautifulSoup(HTML, 'html.parser')
contents = soup.find_all('div', attrs={'id': 'toc'})
paras = contents[0].find_previous_siblings('p')
With BeautifulSoup you will find the first div and the second div by their ids:
from bs4 import BeautifulSoup
bs = BeautifulSoup(html,"html.parser")
first_div = bs.find(id="some-div")
second_div = bs.find(id="some-other-div")
After this, we create a list with all the elements in between the two divs (converted to strings) and afterwards join them together. For this we loop through all the siblings after the first_div and break when we reach the second div:
in_between = []
for sibling in first_div.next_siblings:
    if sibling == second_div:
        break
    else:
        in_between.append(str(sibling))
in_between = "".join(in_between)
The previous code block can be replaced by this one-line list comprehension (takewhile needs to be imported from itertools):
from itertools import takewhile

in_between = "".join([str(sibling) for sibling in takewhile(lambda x: x != second_div, first_div.next_siblings)])

Using Python and BeautifulSoup to Parse a Table

I am trying to access content in certain td tags with Python and BeautifulSoup. I can either get the first td tag meeting the criteria (with find) or all of them (with findAll).
Now, I could just use findAll, get them all, and pull the content I want out of them, but that seems inefficient (even if I put limits on the search). Is there any way to go to a certain td tag meeting the criteria I want? Say the third, or the 10th?
Here's my code so far:
from __future__ import division
from __future__ import unicode_literals
from __future__ import print_function
from mechanize import Browser
from BeautifulSoup import BeautifulSoup
br = Browser()
url = "http://finance.yahoo.com/q/ks?s=goog+Key+Statistics"
page = br.open(url)
html = page.read()
soup = BeautifulSoup(html)
td = soup.findAll("td", {'class': 'yfnc_tablehead1'})
for x in range(len(td)):
    var1 = td[x]
    var2 = var1.contents[0]
    print(var2)
Is there any way to go to a certain td tag meeting the criteria I want? Say the third, or the 10th?
Well...
all_tds = soup.findAll("td", {'class': 'yfnc_tablehead1'})
print(all_tds[3])
...there is no other way..
find and findAll are very flexible; the BeautifulSoup.findAll docs say:
You can pass in a callable object which takes a Tag object as its only argument, and returns a boolean. Every Tag object that findAll encounters will be passed into this object, and if the call returns True then the tag is considered to match.
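For example, here is a sketch of that callable approach using the modern bs4 API (the "Market Cap" label is just an illustrative guess at one of the cells on that page):
from bs4 import BeautifulSoup
import requests

html = requests.get("http://finance.yahoo.com/q/ks?s=goog+Key+Statistics").text
soup = BeautifulSoup(html, 'html.parser')

def wanted(tag):
    # Every tag in the document is passed in; return True only for the cells we care about
    return (tag.name == 'td'
            and 'yfnc_tablehead1' in (tag.get('class') or [])
            and tag.get_text(strip=True).startswith('Market Cap'))

cell = soup.find(wanted)  # first tag for which wanted(tag) returns True, or None
if cell is not None:
    print(cell.get_text(strip=True))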
