Get text and child from a span with BeautifulSoup - python

I am in a specific situation where I want to extract both the text and a child node from a span:
<span>condition:<b>good</b></span>
However, when I try to select the span with the text:
x = soup.find('span', text=re.compile(r'^condition:$'))
I get None back.
I've verified that that tag exists in the HTML document I am working with.
And I can't figure out how to get the internal tag either.
What am I doing wrong?

The following question has the same issue: BeautifulSoup - search by text inside a tag
You can separate the function to solve the problem, like this:
def find_all_with_regex(soup, target_tag, regex):
    elements = soup.find_all(target_tag)
    return list(filter(lambda tag_found: regex.match(tag_found.text), elements))

print(find_all_with_regex(soup, 'span', re.compile(r'^condition:.*')))

Try the following CSS selector.
print(soup.select_one('span:contains("condition:")').text)
Code:
from bs4 import BeautifulSoup
html='''<span>condition:<b>good</b></span>'''
soup=BeautifulSoup(html,"html.parser")
print(soup.select_one('span:contains("condition:")').text)
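To answer the original question of getting both the direct text and the child tag separately, here is a minimal sketch once you have the span (find(text=True, recursive=False) reads only the span's own text node, not the child's):

```python
from bs4 import BeautifulSoup

html = '<span>condition:<b>good</b></span>'
soup = BeautifulSoup(html, "html.parser")

span = soup.find('span')
label = span.find(text=True, recursive=False)  # the span's direct text node
value = span.b.get_text()                      # the <b> child's text
print(label, value)  # condition: good
```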

Related

Using BeautifulSoup, is it possible to move to the parent tag when using the search for text function?

Is it possible to move from the current position in the DOM up and down when only the text is a common identifier?
<div>changing text</div>
<div>fixed text</div>
How do I get the text "changing text" when searching for the "fixed text" and moving up to the parent div?
What I tried:
x = soup.body.findAll(text=re.compile('fixed text')).parent
AttributeError: 'ResultSet' object has no attribute 'parent'
This program might do what you want:
from bs4 import BeautifulSoup
import re
html = '<body><div>changing text</div><div>fixed text</div></body>'
soup = BeautifulSoup(html, "html.parser")
x = soup.body.findAll(text=re.compile('fixed text'))[0].parent.previous_sibling
assert x.text == 'changing text'
The error you are having is due to calling parent on a ResultSet, which is a list of results. If you need to have multiple results, try:
x = soup.body.find_all(text=re.compile('fixed text'))
for i in x:
    previous_div = i.parent.previous_sibling
If you don't want multiple results, just change find_all to find:
x = soup.body.find(text=re.compile('fixed text')).parent.previous_sibling
Note that I added previous_sibling after parent, as the two divs are at the same level.

Scrape a specific h2 tag inside a div class

I am trying to scrape the emojis inside the h2 tag 'Events' from http://emojipedia.org/food-drink/. I have written the following code, but the head_links is an empty list:
import requests
from bs4 import BeautifulSoup
import json
import csv
url2 = "http://emojipedia.org/food-drink/"
html2 = requests.get(url2).content
soup2 = BeautifulSoup(html2)
head_links = soup2.findAll('h2', {'class':'Events'})
I also tried to use soup.select, but again I got an empty list.
Any help is much appreciated!
The thing you're looking for isn't actually an h2 tag with the class Events; you're looking for a div tag that contains an h2 tag whose content is "Events".
This should get you started:
div_contents = soup2.find('h2', text='Events').findParent()
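From that parent you can walk to the emoji list itself. A sketch against hypothetical markup in roughly the same shape as the page (the class names and hrefs here are made up for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a heading followed by a list of emoji entries
html = '''
<div>
  <h2>Events</h2>
  <ul class="emoji-list">
    <li><a href="/jack-o-lantern/"><span class="emoji">🎃</span> Jack-O-Lantern</a></li>
    <li><a href="/christmas-tree/"><span class="emoji">🎄</span> Christmas Tree</a></li>
  </ul>
</div>
'''
soup = BeautifulSoup(html, "html.parser")

# Find the heading by its text, then move to the following ul of emojis:
heading = soup.find('h2', text='Events')
emojis = [span.text for span in heading.find_next('ul').select('span.emoji')]
print(emojis)
```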

Python Beautiful Soup - find value based on text in HTML

I am having a problem finding a value in a soup based on text. Here is the code
from bs4 import BeautifulSoup as bs
import requests
import re
html='http://finance.yahoo.com/q/ks?s=aapl+Key+Statistics'
r = requests.get(html)
soup = bs(r.text)
findit=soup.find("td", text=re.compile('Market Cap'))
This returns [], yet there absolutely is text in a 'td' tag with 'Market Cap'.
When I use
soup.find_all("td")
I get a result set which includes:
<td class="yfnc_tablehead1" width="74%">Market Cap (intraday)<font size="-1"><sup>5</sup></font>:</td>
Explanation:
The problem is that this particular tag has other child elements and the .string value, which is checked when you apply the text argument, is None (bs4 has it documented here).
Solutions/Workarounds:
Don't specify the tag name here at all, find the text node and go up to the parent:
soup.find(text=re.compile('Market Cap')).parent.get_text()
Or, you can use find_parent() if td is not the direct parent of the text node:
soup.find(text=re.compile('Market Cap')).find_parent("td").get_text()
You can also use a "search function" to search for the td tags and see if the direct text child nodes has the Market Cap text:
soup.find(lambda tag: tag and
          tag.name == "td" and
          tag.find(text=re.compile('Market Cap'), recursive=False))
Or, if you are looking to find the following number 5:
soup.find(text=re.compile('Market Cap')).next_sibling.get_text()
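To see the .string behavior concretely, here is a minimal, self-contained sketch using a hypothetical td in the same shape as the Yahoo row above:

```python
from bs4 import BeautifulSoup
import re

# A hypothetical row mirroring the Yahoo markup above:
html = '<td class="yfnc_tablehead1">Market Cap (intraday)<sup>5</sup>:</td>'
soup = BeautifulSoup(html, "html.parser")

# The td has child elements, so its .string is None and text= matches nothing:
miss = soup.find("td", text=re.compile('Market Cap'))
print(miss)  # None

# Matching the text node itself and climbing up to the td works:
td = soup.find(text=re.compile('Market Cap')).find_parent("td")
print(td.get_text())  # Market Cap (intraday)5:
```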
You can't combine a tag name with a text regex here; it just won't work (I don't know whether it's a bug or by specification). So I search for the text alone, and then get the parent back in a list comprehension, keeping only the td tags.
Code
from bs4 import BeautifulSoup as bs
import requests
import re
html='http://finance.yahoo.com/q/ks?s=aapl+Key+Statistics'
r = requests.get(html)
soup = bs(r.text, "lxml")
findit=soup.find_all(text=re.compile('Market Cap'))
findit=[x.parent for x in findit if x.parent.name == "td"]
print(findit)
Output
[<td class="yfnc_tablehead1" width="74%">Market Cap (intraday)<font size="-1"><sup>5</sup></font>:</td>]
Regex is just a terrible thing to integrate into parsing code and in my humble opinion should be avoided whenever possible.
Personally, I don't like BeautifulSoup due to its lack of XPath support. What you're trying to do is the sort of thing XPath is ideally suited for. If I were doing this, I would use lxml for parsing rather than BeautifulSoup's built-in parsing and/or regex. It's really quite elegant and extremely fast:
from lxml import etree
import requests
source = requests.get('http://finance.yahoo.com/q/ks?s=aapl+Key+Statistics').content
parsed = etree.HTML(source)
tds_w_market_cap = parsed.xpath('//td[contains(., "Market Cap")]')
FYI the above returns an lxml object rather than the text of the page source. In lxml you don't really work with the source directly, per se. If you need to return a list of the actual source for some reason, you would add something like:
print([etree.tostring(i) for i in tds_w_market_cap])
If you absolutely have to use BeautifulSoup for this task, then I'd use a list comprehension:
from bs4 import BeautifulSoup as bs
import requests
source = requests.get('http://finance.yahoo.com/q/ks?s=aapl+Key+Statistics').content
parsed = bs(source, 'lxml')
tds_w_market_cap = [i for i in parsed.find_all('td') if 'Market Cap' in i.get_text()]

parsing webpage using python

I am trying to parse a webpage (forums.macrumors.com) and get a list of all the threads posted.
So I have got this so far:
import urllib2
import re
from BeautifulSoup import BeautifulSoup
address = "http://forums.macrumors.com/forums/os/"
website = urllib2.urlopen(address)
website_html = website.read()
text = urllib2.urlopen(address).read()
soup = BeautifulSoup(text)
Now the webpage source has this code at the start of each thread:
<li id="thread-1880" class="discussionListItem visible sticky WikiPost "
data-author="ABCD">
How do I parse this so I can then get to the thread link within this li tag? Thanks for the help.
So from your code here, you have the soup object, which contains the BeautifulSoup parse of your HTML. The question is: which part of the tag you're looking for is static? Is the id always the same? The class?
Finding by the id:
my_li = soup.find('li', {'id': 'thread-1880'})
Finding by the class:
my_li = soup.find('li', {'class': 'discussionListItem visible sticky WikiPost '})
Ideally you would figure out the unique class you can check for and use that instead of a list of classes.
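As a side note, BeautifulSoup treats class as a multi-valued attribute, so you can match on a single class even when the tag carries several. A minimal sketch:

```python
from bs4 import BeautifulSoup

html = '<li id="thread-1880" class="discussionListItem visible sticky WikiPost"></li>'
soup = BeautifulSoup(html, "html.parser")

# Matching one class is enough, even though the tag has four:
my_li = soup.find('li', class_='discussionListItem')
print(my_li['id'])  # thread-1880
```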
If you are expecting an a tag inside this object, you can check for it like this:
if my_li and my_li.a:
    print my_li.a.attrs.get('href')
I always recommend checking, because if my_li ends up being None or there is no a inside it, your code will fail.
For more details, check out the BeautifulSoup documentation
http://www.crummy.com/software/BeautifulSoup/bs4/doc/
The idea here would be to use CSS selectors and to get the a elements inside the h3 with class="title" inside the div with class="titleText" inside the li element having the id attribute starting with "thread":
for link in soup.select("div.discussionList li[id^=thread] div.titleText h3.title a[href]"):
    print link["href"]
You can tweak the selector further, but this should give you a good starting point.
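A self-contained sketch of that selector against hypothetical markup in the same shape as the forum page (the id and href here are made up for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical markup mirroring the structure described above:
html = '''
<div class="discussionList">
  <li id="thread-1880">
    <div class="titleText"><h3 class="title"><a href="/threads/1880/">OS thread</a></h3></div>
  </li>
</div>
'''
soup = BeautifulSoup(html, "html.parser")

links = [a["href"] for a in
         soup.select("div.discussionList li[id^=thread] div.titleText h3.title a[href]")]
print(links)  # ['/threads/1880/']
```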

How can i extract only text in scrapy selector in python

I have this code
site = hxs.select("//h1[@class='state']")
log.msg(str(site[0].extract()),level=log.ERROR)
The ouput is
[scrapy] ERROR: <h1 class="state"><strong>
1</strong>
<span> job containing <strong>php</strong> in <strong>region</strong> paying <strong>$30-40k per year</strong></span>
</h1>
Is it possible to only get the text without any html tags
//h1[@class='state']
In your above XPath you are selecting the h1 tag that has the class attribute state, so that's why it's selecting everything that comes inside the h1 element.
If you just want to select the text of the h1 tag itself, all you have to do is
//h1[@class='state']/text()
If you want to select the text of the h1 tag as well as its children tags, you have to use
//h1[@class='state']//text()
So the difference is: /text() for the specific tag's own text, and //text() for the text of the specific tag as well as its children tags.
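Here is a minimal sketch of that difference using lxml directly (assuming lxml is installed), on markup like the h1 above:

```python
from lxml import etree

html = ('<h1 class="state"><strong>1</strong>'
        '<span> job containing <strong>php</strong></span></h1>')
root = etree.HTML(html)

# /text() selects only the h1's direct text nodes; //text() descends
direct = root.xpath("//h1[@class='state']/text()")
all_text = root.xpath("//h1[@class='state']//text()")
print(direct)    # [] -- the h1 has no direct text nodes here
print(all_text)  # ['1', ' job containing ', 'php']
```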
The code below should work for you:
site = ''.join(hxs.select("//h1[@class='state']/text()").extract()).strip()
You can use BeautifulSoup get_text() feature.
from bs4 import BeautifulSoup
text = '''
<td>Please can you strip me?
<br/>I am waiting....
</td>
'''
soup = BeautifulSoup(text, "html.parser")
print(soup.get_text())
I haven't got a scrapy instance running so I couldn't test this; but you could try to use text() within your search expression.
For example:
site = hxs.select("//h1[@class='state']/text()")
(got it from the tutorial)
You can use BeautifulSoup to strip html tags, here is an example:
from BeautifulSoup import BeautifulSoup
''.join(BeautifulSoup(str(site[0].extract())).findAll(text=True))
You can then strip all the additional whitespaces, new lines etc.
If you don't want to use additional modules, you can try a simple regex:
# replace html tags with ' '
text = re.sub(r'<[^>]*?>', ' ', str(site[0].extract()))
You can use html2text
import html2text
converter = html2text.HTML2Text()
print converter.handle("<div>Please!!!<span>remove me</span></div>")
