BeautifulSoup fails at finding element using regex because of extra <br> [duplicate]

I'm trying to extract the text from inside a <dt> tag with a <span> inside on www.uszip.com:
Here is an example of what I'm trying to get:
<dt>Land area<br><span class="stype">(sq. miles)</span></dt>
<dd>14.28</dd>
I want to get the 14.28 out of the tag. This is how I'm currently approaching it:
Note: soup is the BeautifulSoup version of the entire webpage's source code:
soup.find("dt",text="Land area").contents[0]
However, this is giving me a
AttributeError: 'NoneType' object has no attribute 'contents'
I've tried a lot of things and I'm not sure how to approach this. This method works for some of the other data on this page, like:
<dt>Total population</dt>
<dd>22,234<span class="trend trend-down" title="-15,025 (-69.77% since 2000)">▼</span></dd>
Using soup.find("dt",text="Total population").next_sibling.contents[0] on this returns '22,234'.
How should I try to first identify the correct tag and then get the right data out of it?

Unfortunately, you cannot match a tag that contains both text and nested tags based on the contained text alone.
You'd have to loop over all <dt> tags that have no direct string (text=False) instead:
for dt in soup.find_all('dt', text=False):
    if 'Land area' in dt.text:
        print dt.contents[0]
This sounds counter-intuitive, but the .string attribute for such tags is empty, and that is what BeautifulSoup is matching against. .text contains all strings in all nested tags combined, and that is not matched against.
You could also use a custom function to do the search:
soup.find_all(lambda t: t.name == 'dt' and 'Land area' in t.text)
which essentially does the same search with the filter encapsulated in a lambda function.
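A minimal, runnable sketch of the lambda approach, assuming BeautifulSoup 4 and the HTML snippet from the question (the <dl> wrapper is added here so the fragment parses cleanly); once the <dt> is matched, find_next('dd') reaches the 14.28 the question asks for:

```python
from bs4 import BeautifulSoup

html = '''
<dl>
<dt>Land area<br><span class="stype">(sq. miles)</span></dt>
<dd>14.28</dd>
</dl>
'''
soup = BeautifulSoup(html, 'html.parser')

# Match on the combined text of the tag (.text), not its .string attribute,
# which is None for a <dt> containing nested tags.
dt = soup.find(lambda t: t.name == 'dt' and 'Land area' in t.text)
value = dt.find_next('dd').text
print(value)  # 14.28
```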

Related

Regular Expressions or BeautifulSoup - Varying Cases

I have 3 strings I am looking to retrieve that are characterized by the presence of two words: section and front. I'm terrible with regex.
contentFrame wsj-sectionfront economy_sf
contentFrame wsj-sectionfront business_sf
section-front markets
How can I match both of these words using one regular expression? This will be used to match the contents of a html page parsed by BeautifulSoup.
UPDATE:
I want to extract the main body of a webpage (https://www.wsj.com/news/business) that has the div tag: Main Content Housing. For some reason, BeautifulSoup isn't recognizing the highlighted class attribute using:
wsj_soup.find('div', attrs={'class': 'contentFrame wsj-sectionfront business_sf'})
# Returns []
I'm trying to stay in BeautifulSoup as much as possible, but if regex is the way to go I will use that. From there I will more than likely search using the contents attribute to search for relevant keywords, but if anyone has a better idea of how to approach it please share.
One way to handle this would be to use two separate lookaheads which check for each of these words:
^(?=.*section)(?=.*front).*$
Demo
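As a quick check in plain Python (not tied to the question's page), the double-lookahead pattern matches all three of the class strings listed above:

```python
import re

# Two lookaheads: the string must contain "section" somewhere AND "front" somewhere.
pattern = re.compile(r'^(?=.*section)(?=.*front).*$')

classes = [
    "contentFrame wsj-sectionfront economy_sf",
    "contentFrame wsj-sectionfront business_sf",
    "section-front markets",
]
matches = [c for c in classes if pattern.match(c)]
print(matches)  # all three class strings match
```

A compiled pattern like this can also be handed to BeautifulSoup, e.g. soup.find_all('div', class_=pattern), though bs4's matching of regexes against multi-valued class attributes has its own semantics worth verifying against the docs.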

Get the text from multiple elements with the same class in Selenium for Python?

I'm trying to scrape data from a page with JavaScript loaded content. For example, the content I want is in the following format:
<span class="class">text</span>
...
<span class="class">more text</span>
I used find_element_by_xpath('//span[@class="class"]').text but it only returned the first instance of the specified class. Basically, I would want a list like [text, more text] etc. I found the find_elements_by_xpath() function, but adding .text at the end results in an error: AttributeError: 'list' object has no attribute 'text'.
find_element_by_xpath returns a single element, which has a text attribute.
find_elements_by_xpath() returns all matching elements as a list, so you need to loop through it and read the text attribute of each element.
all_spans = driver.find_elements_by_xpath("//span[@class='class']")
for span in all_spans:
    print span.text
Please refer to Selenium Python API docs here for more details about find_elements_by_xpath(xpath).

Find different elements according to different criteria using BeautifulSoup in Python

Using the findAll() function of the Beautiful Soup Python library, I would like to find several elements in an HTML document. Each element must meet one of several criteria, independently.
For example, suppose that my object looks like this:
<div class="my_class">
<span class="not_cool">
<p name="p_1">A</p>
<p name="p_2">B</p>
</span>
<span class="cool">
<p name="p_3">C</p>
</span>
</div>
And I want to find every span of class="cool" and every p with name="p_1" (here there is only one of each, but imagine that this is not the case).
Individually, I will do:
.findAll("span",attrs={"class":"cool"})
.findAll("p",attrs={"name":"p_1"})
In a perfect world, I would like to do:
.findAll([
    ["span", attrs={"class": "cool"}],
    ["p", attrs={"name": "p_1"}]
])
But of course, it does not work like this.
Actually, I am trying to make a function that translates HTML to BBCode (I do not want to, and cannot, use an existing one).
So, I need to keep only the tags that interest me.
However, I must also know the order of these elements. If I use two separate .findAll() calls, I will not know which element comes before which.
Does anyone have a solution please?
You'd have to use a search function:
.find_all(lambda t: (t.name == 'span' and 'cool' in t.get('class', [])) or
                    (t.name == 'p' and t.get('name') == 'p_1'))
A callable argument will be passed each tag object in the tree; if the callable returns True it is included. The above lambda tests if the tag name matches and if a specific attribute is there. The class attribute is special in that when present, it is always parsed out to a list.
Note that for BeautifulSoup 4, the camel-case function names have been deprecated; the lower_case_with_underscore names are the canonical methods. If you are still using BeautifulSoup 3, you probably want to upgrade. Version 3 has not seen updates in over 2 years now.
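A runnable sketch with the question's HTML, assuming BeautifulSoup 4, showing that a single find_all preserves document order (the p_1 tag appears before the cool span):

```python
from bs4 import BeautifulSoup

html = '''
<div class="my_class">
<span class="not_cool">
<p name="p_1">A</p>
<p name="p_2">B</p>
</span>
<span class="cool">
<p name="p_3">C</p>
</span>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

# One traversal, two independent criteria; results come back in document order.
matches = soup.find_all(lambda t: (t.name == 'span' and 'cool' in t.get('class', [])) or
                                  (t.name == 'p' and t.get('name') == 'p_1'))
print([t.name for t in matches])  # ['p', 'span'] -- p_1 comes first in the document
```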
Simply find all children of each span with that particular class by iterating over all desired spans.
spans = soup.findAll("span", attrs={"class": "cool"})
for span in spans:
    ps = span.findAll("p", attrs={"name": "p_1"})

Finding Tags And Text In BeautifulSoup

I'm having some trouble formulating a findAll query for BeautifulSoup that'll do what I want. Previously, I was using findAll to extract only the text from some html, essentially stripping away all the tags. For example, if I had:
<b>Cows</b> are being abducted by aliens according to the
<a href="www.washingtonpost.com>Washington Post</a>.
It would be reduced to:
Cows are being abducted by aliens according to the Washington Post.
I would do this by using ''.join(html.findAll(text=True)). This was working great, until I decided I would like to keep only the <a> tags, but strip the rest of the tags away. So, given the initial example, I would end up with this:
Cows are being abducted by aliens according to the
<a href="www.washingtonpost.com>Washington Post</a>.
I initially thought that the following would do the trick:
''.join(html.findAll({'a':True}, text=True))
However, this doesn't work, since the text=True seems to indicate that it will only find text. What I'm in need of is some OR option - I would like to find text OR <a> tags. It's important that the tags stay around the text they are tagging - I can't have the tags or text appearing out of order.
Any thoughts?
Note: BeautifulSoup.findAll is a search API. Its first named argument, name, can be used to restrict the search to a given set of tags. With a single findAll it is not possible to select all text between tags and, at the same time, select both the text and the tag for <a>. So I came up with the solution below.
This solution depends on BeautifulSoup.Tag being imported.
from BeautifulSoup import BeautifulSoup, Tag
soup = BeautifulSoup('<b>Cows</b> are being abducted by aliens according to the <a href="www.washingtonpost.com>Washington Post</a>.')
parsed_soup = ''
We navigate the parse tree like a list via the contents attribute. We extract text only when the item is a tag and the tag is not <a>; otherwise we keep the entire string, tag included. This uses the parse-tree navigation API.
for item in soup.contents:
    if type(item) is Tag and u'a' != item.name:
        parsed_soup += ''.join(item.findAll(text=True))
    else:
        parsed_soup += unicode(item)
The order of the text is preserved.
>>> parsed_soup
u'Cows are being abducted by aliens according to the <a href=\'"www.washingtonpost.com\'>Washington Post</a>.'
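The same approach ported to BeautifulSoup 4 as a sketch: isinstance replaces the type check, get_text() replaces the findAll(text=True) join, and str() replaces unicode(); the href quoting is corrected here so the fragment parses cleanly:

```python
from bs4 import BeautifulSoup, Tag

html = ('<b>Cows</b> are being abducted by aliens according to the '
        '<a href="www.washingtonpost.com">Washington Post</a>.')
soup = BeautifulSoup(html, 'html.parser')

parts = []
for item in soup.contents:
    if isinstance(item, Tag) and item.name != 'a':
        parts.append(item.get_text())   # strip the tag, keep its text
    else:
        parts.append(str(item))         # keep <a> tags and bare strings verbatim
result = ''.join(parts)
print(result)
```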

Using BeautifulSoup's findAll to search html element's innerText to get same result as searching attributes?

For instance if I am searching by an element's attribute like id:
soup.findAll('span',{'id':re.compile("^score_")})
I get back a list of the whole span element that matches (which I like).
But if I try to search by the innerText of the html element like this:
soup.findAll('a',text = re.compile("discuss|comment"))
I get back only the innerText part of element back that matches instead of the whole element with tags and attributes like I would above.
Is this possible to do without finding the match and then getting its parent?
Thanks.
You don't get back the text. You get a NavigableString with the text. That object has methods to go to the parent, etc.
from BeautifulSoup import BeautifulSoup
import re
soup = BeautifulSoup('<html><p>foo</p></html>')
r = soup.findAll('p', text=re.compile('foo'))
print r[0].parent
prints
<p>foo</p>
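For reference, in BeautifulSoup 4 (where the text= argument is now spelled string=), combining a tag name with a string filter returns the matching tags themselves, so no .parent hop is needed; a small sketch:

```python
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup('<html><p>foo</p></html>', 'html.parser')

# name + string together: bs4 returns tags whose .string matches, not the strings.
tags = soup.find_all('p', string=re.compile('foo'))
print(tags[0])  # <p>foo</p>
```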
