I am trying to access content in certain td tags with Python and BeautifulSoup. I can either get the first td tag meeting the criteria (with find), or all of them (with findAll).
Now, I could just use findAll, get them all, and pull the content I want out of them, but that seems inefficient (even if I put limits on the search). Is there any way to go to a certain td tag meeting the criteria I want? Say the third, or the 10th?
Here's my code so far:
from __future__ import division
from __future__ import unicode_literals
from __future__ import print_function
from mechanize import Browser
from BeautifulSoup import BeautifulSoup
br = Browser()
url = "http://finance.yahoo.com/q/ks?s=goog+Key+Statistics"
page = br.open(url)
html = page.read()
soup = BeautifulSoup(html)
td = soup.findAll("td", {'class': 'yfnc_tablehead1'})
for x in range(len(td)):
    var1 = td[x]
    var2 = var1.contents[0]
    print(var2)
Is there any way to go to a certain td tag meeting the criteria I want? Say the third, or the 10th?
Well...
all_tds = soup.findAll("td", {'class': 'yfnc_tablehead1'})
print all_tds[2]  # indexing is zero-based, so this is the third match
...there is no other way..
find and findAll are very flexible; the BeautifulSoup.findAll docs say:

5. You can pass in a callable object which takes a Tag object as its only argument, and returns a boolean. Every Tag object that findAll encounters will be passed into this object, and if the call returns True then the tag is considered to match.
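For instance, here is a sketch of the callable approach using the bs4 API (where the method is spelled find_all; in the older BeautifulSoup it is findAll), run against made-up markup rather than the live Yahoo page:

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the statistics table.
html = """
<table>
  <td class="yfnc_tablehead1">Market Cap</td>
  <td class="yfnc_tablehead1">P/E</td>
  <td class="other">skip me</td>
  <td class="yfnc_tablehead1">Beta</td>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# The callable receives each Tag and returns True when it matches.
def is_stat_header(tag):
    return tag.name == "td" and "yfnc_tablehead1" in tag.get("class", [])

matches = soup.find_all(is_stat_header)
print(matches[2].get_text())  # Beta  (the third matching td)
```

There is still no way to jump straight to the nth match without collecting the earlier ones, but indexing the returned list is cheap.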
Related
I looked around on Stack Overflow, and most guides seem to be very specific about extracting all data from a table. However, I only need to extract one value, and I just can't seem to extract that specific value from the table.
Scrape link:
https://gis.vgsi.com/portsmouthnh/Parcel.aspx?pid=38919
I am looking to extract the "Style" value from the table within the link.
Code:
import bs4
import requests

styleData = []
pagedata = requests.get("https://gis.vgsi.com/portsmouthnh/Parcel.aspx?pid=38919")
cleanpagedata = bs4.BeautifulSoup(pagedata.text, 'html.parser')
table = cleanpagedata.find('div', {'id': 'MainContent_ctl01_panView'})
style = table.findall('tr')[3]
style = style.findall('td')[1].text
print(style)
styleData.append(style)
You most likely misspelled the find_all function (BeautifulSoup has no findall method); try this solution:
style=table.find_all('tr')[3]
style=style.find_all('td')[1].text
print(style)
It will give you the expected output
You can use a CSS Selector:
#MainContent_ctl01_grdCns tr:nth-of-type(4) td:nth-of-type(2)
This selects the element with the id "MainContent_ctl01_grdCns", then its fourth <tr>, then that row's second <td>.
To use a CSS Selector, use the .select() method instead of find_all(). Or select_one() instead of find().
import requests
from bs4 import BeautifulSoup
URL = "https://gis.vgsi.com/portsmouthnh/Parcel.aspx?pid=38919"
soup = BeautifulSoup(requests.get(URL).content, "html.parser")
print(
    soup.select_one(
        "#MainContent_ctl01_grdCns tr:nth-of-type(4) td:nth-of-type(2)"
    ).text
)
Output:
Townhouse End
Could also do something like:
import bs4
import requests
style_data = []
url = "https://gis.vgsi.com/portsmouthnh/Parcel.aspx?pid=38919"
soup = bs4.BeautifulSoup(requests.get(url).content, 'html.parser')
# select the first `td` tag whose text contains the substring `Style:`
row = soup.select_one('td:-soup-contains("Style:")')
if row:
    # if that row was found, get its next sibling, which should hold the value you want
    home_style_tag = row.next_sibling
    style_data.append(home_style_tag.text)
A couple of notes
This uses CSS selectors rather than the find methods. See SoupSieve docs for more details.
The select_one call relies on the table always being ordered in a certain way. If that is not the case, use select and iterate through the results to find the bs4.Tag whose text is exactly 'Style:', then grab its next sibling.
Using select:
rows = soup.select('td:-soup-contains("Style:")')
matches = [r for r in rows if r.text == 'Style:']
home_style_text = matches[0].next_sibling.text
You can use :contains on a td to get the node with innerText "Style" then an adjacent sibling combinator with td type selector to get the adjacent td value.
import bs4, requests
pagedata = requests.get("https://gis.vgsi.com/portsmouthnh/Parcel.aspx?pid=38919")
cleanpagedata = bs4.BeautifulSoup(pagedata.text, 'html.parser')
print(cleanpagedata.select_one('td:contains("Style") + td').text)
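The live parcel page may change, so here is the same adjacent-sibling pattern sketched against a minimal stand-in table (using :-soup-contains, the newer spelling of the pseudo-class in current soupsieve):

```python
import bs4

# Made-up stand-in for the parcel table on the live page.
html = """
<table>
  <tr><td>Style:</td><td>Townhouse End</td></tr>
  <tr><td>Grade:</td><td>Average</td></tr>
</table>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

# td matching the label text, then `+ td` selects the adjacent sibling cell.
value = soup.select_one('td:-soup-contains("Style:") + td').text
print(value)  # Townhouse End
```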
I am trying to fetch text from a webpage - https://www.symantec.com/security_response/definitions.jsp?pid=sep14
Exactly where it says:
File-Based Protection (Traditional Antivirus)
Extended Version: 4/18/2019 rev. 2
But I am still facing errors. How can I get just the part where it says 4/18/2019 rev. 2?
from bs4 import BeautifulSoup
import requests
import re
page = requests.get("https://www.symantec.com/security_response/definitions.jsp?pid=sep14")
soup = BeautifulSoup(page.content, 'html.parser')
extended = soup.find_all('div', class_='unit size1of2 feedBody')
print(extended)
You can actually use CSS selectors to do this, with Beautiful Soup 4.7+. Here we target the same div and classes that you did above, but we also look for the descendant li and its direct child > strong. We then use the custom pseudo-class :contains() to ensure that the strong element contains the text Extended Version:. We use the select_one API call because it returns the first element that matches; select would return all matching elements in a list, but we only need one.
Once we have the strong element, we know the next sibling text node has the information we want, so we can just use next_sibling to grab that text:
from bs4 import BeautifulSoup
import requests
page = requests.get("https://www.symantec.com/security_response/definitions.jsp?pid=sep14")
soup = BeautifulSoup(page.content, 'html.parser')
extended = soup.select_one('div.unit.size1of2.feedBody li:contains("Extended Version:") > strong')
print(extended.next_sibling)
Output
4/18/2019 rev. 7
EDIT: As @QHarr mentions in the comments, you can most likely get away with the more simplified strong:contains("Extended Version:"). It is important to remember that :contains() searches all child text nodes of the given element, even text nodes of child elements, so being specific is important. I wouldn't use a bare :contains("Extended Version:"), as it would also match the div, the list elements, etc.; specifying (at the very minimum) strong narrows the selection enough to give you exactly what you need.
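A quick illustration of that point, against made-up markup modeled on the page (written with :-soup-contains, the newer spelling of the pseudo-class in current soupsieve):

```python
from bs4 import BeautifulSoup

# Made-up fragment modeled on the Symantec page.
html = """
<div class="feedBody">
  <ul>
    <li><strong>Extended Version:</strong> 4/18/2019 rev. 2</li>
  </ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# The bare pseudo-class matches every element whose text contains the string,
# including all ancestors of the node that actually holds it.
hits = soup.select(':-soup-contains("Extended Version:")')
print([t.name for t in hits])  # the div, ul and li match too, not just strong

# Narrowing to strong gives exactly the element whose sibling we want.
strong = soup.select_one('strong:-soup-contains("Extended Version:")')
print(strong.next_sibling.strip())  # 4/18/2019 rev. 2
```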
I changed your code as below; now it shows what you want:
from bs4 import BeautifulSoup
import requests
import re
page = requests.get("https://www.symantec.com/security_response/definitions.jsp?pid=sep14")
soup = BeautifulSoup(page.content, 'html.parser')
extended = soup.find('div', class_='unit size1of2 feedBody').find_all('li')
print(extended[2])
Try this maybe?
from bs4 import BeautifulSoup
import requests
import re
page = requests.get("https://www.symantec.com/security_response/definitions.jsp?pid=sep14")
soup = BeautifulSoup(page.content, 'html.parser')
extended = soup.find('div', class_='unit size1of2 feedBody').findAll('li')
print(extended[2].text.strip())
I am having a problem finding a value in a soup based on text. Here is the code
from bs4 import BeautifulSoup as bs
import requests
import re
html='http://finance.yahoo.com/q/ks?s=aapl+Key+Statistics'
r = requests.get(html)
soup = bs(r.text)
findit=soup.find("td", text=re.compile('Market Cap'))
This returns None, yet there absolutely is text in a td tag containing 'Market Cap'.
When I use
soup.find_all("td")
I get a result set which includes:
<td class="yfnc_tablehead1" width="74%">Market Cap (intraday)<font size="-1"><sup>5</sup></font>:</td>
Explanation:
The problem is that this particular tag has other child elements and the .string value, which is checked when you apply the text argument, is None (bs4 has it documented here).
Solutions/Workarounds:
Don't specify the tag name here at all, find the text node and go up to the parent:
soup.find(text=re.compile('Market Cap')).parent.get_text()
Or, you can use find_parent() if td is not the direct parent of the text node:
soup.find(text=re.compile('Market Cap')).find_parent("td").get_text()
You can also use a "search function" to search for the td tags and see if the direct text child nodes has the Market Cap text:
soup.find(lambda tag: tag and
          tag.name == "td" and
          tag.find(text=re.compile('Market Cap'), recursive=False))
Or, if you are looking to find the following number 5:
soup.find(text=re.compile('Market Cap')).next_sibling.get_text()
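Applied to the td quoted in the question (inlined here so the sketch doesn't depend on the long-changed Yahoo page), the first workaround looks like this:

```python
import re
from bs4 import BeautifulSoup

# The td from the result set above: its .string is None because of the
# font/sup children, so find("td", text=...) never matches it.
html = ('<td class="yfnc_tablehead1" width="74%">Market Cap (intraday)'
        '<font size="-1"><sup>5</sup></font>:</td>')
soup = BeautifulSoup(html, "html.parser")
assert soup.find("td", text=re.compile("Market Cap")) is None

# Matching the text node and climbing to the parent works:
td = soup.find(text=re.compile("Market Cap")).parent
print(td.get_text())  # Market Cap (intraday)5:
```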
You can't combine a tag name with a regex on its text here; it just won't work when the tag has child elements (I don't know whether that's a bug or by design). So I just search all the text nodes, then get the parent back in a list comprehension, keeping only the matches whose parent is a td.
Code
from bs4 import BeautifulSoup as bs
import requests
import re
html='http://finance.yahoo.com/q/ks?s=aapl+Key+Statistics'
r = requests.get(html)
soup = bs(r.text, "lxml")
findit=soup.find_all(text=re.compile('Market Cap'))
findit=[x.parent for x in findit if x.parent.name == "td"]
print(findit)
Output
[<td class="yfnc_tablehead1" width="74%">Market Cap (intraday)<font size="-1"><sup>5</sup></font>:</td>]
Regex is just a terrible thing to integrate into parsing code and in my humble opinion should be avoided whenever possible.
Personally, I don't like BeautifulSoup due to its lack of XPath support. What you're trying to do is the sort of thing that XPath is ideally suited for. If I were doing what you're doing, I would use lxml for parsing rather than BeautifulSoup's built in parsing and/or regex. It's really quite elegant and extremely fast:
from lxml import etree
import requests
source = requests.get('http://finance.yahoo.com/q/ks?s=aapl+Key+Statistics').content
parsed = etree.HTML(source)
tds_w_market_cap = parsed.xpath('//td[contains(., "Market Cap")]')
FYI the above returns an lxml object rather than the text of the page source. In lxml you don't really work with the source directly, per se. If you need to return a list of the actual source for some reason, you would add something like:
print [etree.tostring(i) for i in tds_w_market_cap]
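Since the Yahoo page has long since changed, here is the same XPath sketched against a trimmed, made-up stand-in for that table row:

```python
from lxml import etree

# Made-up stand-in for the Market Cap row on the old Yahoo page.
html = ('<table><tr>'
        '<td class="yfnc_tablehead1">Market Cap (intraday)<sup>5</sup>:</td>'
        '</tr></table>')
parsed = etree.HTML(html)

# contains(., "...") tests the node's full string value, child elements included.
tds_w_market_cap = parsed.xpath('//td[contains(., "Market Cap")]')
print(etree.tostring(tds_w_market_cap[0]))
print(tds_w_market_cap[0].xpath("string()"))  # Market Cap (intraday)5:
```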
If you absolutely have to use BeautifulSoup for this task, then I'd use a list comprehension:
from bs4 import BeautifulSoup as bs
import requests
source = requests.get('http://finance.yahoo.com/q/ks?s=aapl+Key+Statistics').content
parsed = bs(source, 'lxml')
tds_w_market_cap = [i for i in parsed.find_all('td') if 'Market Cap' in i.get_text()]
I am trying to parse a webpage (forums.macrumors.com) and get a list of all the threads posted.
So I have got this so far:
import urllib2
import re
from BeautifulSoup import BeautifulSoup
address = "http://forums.macrumors.com/forums/os/"
website = urllib2.urlopen(address)
website_html = website.read()
text = urllib2.urlopen(address).read()
soup = BeautifulSoup(text)
Now the webpage source has this code at the start of each thread:
<li id="thread-1880" class="discussionListItem visible sticky WikiPost "
data-author="ABCD">
How do I parse this so I can then get to the thread link within this li tag? Thanks for the help.
From your code here, you have the soup object, which contains the BeautifulSoup parse of your html. The question is: what part of the tag you're looking for is static? Is the id always the same? The class?
Finding by the id:
my_li = soup.find('li', {'id': 'thread-1880'})
Finding by the class:
my_li = soup.find('li', {'class': 'discussionListItem visible sticky WikiPost '})
Ideally you would figure out the unique class you can check for and use that instead of a list of classes.
if you are expecting an a tag inside of this object, you can do this to check:
if my_li and my_li.a:
    print my_li.a.attrs.get('href')
I always recommend checking though, because if the my_li ends up being None or there is no a inside of it, your code will fail.
For more details, check out the BeautifulSoup documentation
http://www.crummy.com/software/BeautifulSoup/bs4/doc/
The idea here would be to use CSS selectors and to get the a elements inside the h3 with class="title" inside the div with class="titleText" inside the li element having the id attribute starting with "thread":
for link in soup.select("div.discussionList li[id^=thread] div.titleText h3.title a[href]"):
print link["href"]
You can tweak the selector further, but this should give you a good starting point.
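As a sketch, here is that selector applied to a trimmed, made-up fragment modeled on the markup quoted in the question (the href is a hypothetical example):

```python
from bs4 import BeautifulSoup

# Made-up fragment; structure inferred from the question's markup.
html = """
<div class="discussionList">
  <li id="thread-1880" class="discussionListItem visible sticky WikiPost" data-author="ABCD">
    <div class="titleText">
      <h3 class="title"><a href="/threads/example-thread.1880/">Example thread</a></h3>
    </div>
  </li>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# [id^=thread] matches any id attribute that starts with "thread".
links = soup.select("div.discussionList li[id^=thread] div.titleText h3.title a[href]")
for link in links:
    print(link["href"])  # /threads/example-thread.1880/
```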
from bs4 import BeautifulSoup
source_code = """
"""
soup = BeautifulSoup(source_code)
print soup.a['name'] #prints 'One'
Using BeautifulSoup, I can grab the first name attribute, which is One, but I am not sure how I can print the second, which is Two.
Anyone able to help me out?
You should read the documentation. There you can see that soup.find_all returns a list, so you can iterate over the list and, for each element, extract the attribute you are looking for. So you should do something like (not tested here):
from bs4 import BeautifulSoup
soup = BeautifulSoup(source_code)
for item in soup.find_all('a'):
    print item['name']
To get any a child element other than the first, use find_all. For the second a tag:
print soup.find_all('a', recursive=False)[1]['name']
To stay on the same level and avoid a deep search, pass the argument: recursive=False
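A small sketch of recursive=False on hypothetical markup (the question's source_code is elided, so the structure here is assumed), showing that only direct children are searched:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two <a> tags as direct children of <p>, one nested deeper.
html = ('<p><a name="One">first</a><a name="Two">second</a>'
        '<span><a name="Three">nested</a></span></p>')
soup = BeautifulSoup(html, "html.parser")

# recursive=False searches only direct children, so the <a> inside <span> is skipped.
anchors = soup.p.find_all("a", recursive=False)
print([a["name"] for a in anchors])  # ['One', 'Two']
print(anchors[1]["name"])  # Two
```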
This will give you all the tags of "a":
>>> from BeautifulSoup import BeautifulSoup
>>> aTags = BeautifulSoup(source_code).findAll('a')
>>> for tag in aTags: print tag["name"]
...
One
Two