I am using Python 2.7 and BeautifulSoup.
Apologies if I am unable to explain exactly what I want.
There is an HTML page in which data is embedded in a specific structure.
I want to pull the data while ignoring the first block.
But the problem is that when I do:
self.tab = soup.findAll("div","listing-row")
it also gives me the first block, which is actually the unwanted HTML block:
("div","listing-row wide-featured-listing")
I am not using
soup.find("div","listing-row")
since I want every element with the class "listing-row" on the entire page.
How can I ignore the class named "listing-row wide-featured-listing"?
Help/Guidance in any form is appreciated. Thanks a lot !
Alternatively, you can use a CSS selector that matches the class attribute exactly:
soup.select("div[class=listing-row]")
Demo:
>>> from bs4 import BeautifulSoup
>>>
>>> data = """
... <div>
... <div class="listing-row">result1</div>
... <div class="listing-row wide-featured-listing">result2</div>
... <div class="listing-row">result3</div>
... </div>
... """
>>>
>>> soup = BeautifulSoup(data, "html.parser")
>>> print [row.text for row in soup.select("div[class=listing-row]")]
[u'result1', u'result3']
You could just filter out that element:
self.tab = [el for el in soup.find_all('div', class_='listing-row')
            if 'wide-featured-listing' not in el['class']]
You could use a custom function:
self.tab = soup.find_all(lambda e: e.name == 'div' and
                         'listing-row' in e.get('class', []) and
                         'wide-featured-listing' not in e.get('class', []))
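For reference, a runnable sketch of the filter approach against sample markup (using html.parser so no extra parser dependency is needed; `e.get('class', [])` is a guard for tags that have no class attribute at all):

```python
from bs4 import BeautifulSoup

data = """
<div>
<div class="listing-row">result1</div>
<div class="listing-row wide-featured-listing">result2</div>
<div class="listing-row">result3</div>
</div>
"""

soup = BeautifulSoup(data, "html.parser")

# Keep divs that carry listing-row but not wide-featured-listing;
# e.get('class', []) avoids a KeyError on tags without a class attribute
rows = soup.find_all(lambda e: e.name == 'div' and
                     'listing-row' in e.get('class', []) and
                     'wide-featured-listing' not in e.get('class', []))
print([row.text for row in rows])
```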
Related
I have some HTML code like this:
<p><span class="map-sub-title">abc</span>123</p>
I used BeautifulSoup, and here's my code:
html = '<p><span class="map-sub-title">abc</span>123</p>'
soup1 = BeautifulSoup(html,"lxml")
p = soup1.text
I get the result 'abc123',
but I want to get the result '123', not 'abc123'.
You can use the decompose() function to remove the span tag and then get the text you want.
from bs4 import BeautifulSoup
html = '<p><span class="map-sub-title">abc</span>123</p>'
soup = BeautifulSoup(html, "lxml")
for span in soup.find_all("span", {'class': 'map-sub-title'}):
    span.decompose()
print(soup.text)
You can also use extract() to remove the unwanted tag before getting the text from the tag, as below.
from bs4 import BeautifulSoup
html = '<p><span class="map-sub-title">abc</span>123</p>'
soup1 = BeautifulSoup(html,"lxml")
soup1.p.span.extract()
print(soup1.text)
Although every answer in this thread seems acceptable, I will point out another method for this case:
soup.find("span", {'class':'map-sub-title'}).next_sibling
You can use next_sibling to navigate between elements that share the same parent, in this case the p tag.
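A minimal runnable sketch of this approach (using html.parser to stay dependency-free; the node immediately after the span is the text '123'):

```python
from bs4 import BeautifulSoup

html = '<p><span class="map-sub-title">abc</span>123</p>'
soup = BeautifulSoup(html, "html.parser")

# The sibling right after the span inside the same <p> is the text '123'
result = soup.find("span", {"class": "map-sub-title"}).next_sibling
print(result)
```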
One of the many ways is to use contents on the parent tag (in this case, <p>).
If you know the position of the string, you can directly use this:
>>> from bs4 import BeautifulSoup, NavigableString
>>> soup = BeautifulSoup('<p><span class="map-sub-title">abc</span>123</p>', 'lxml')
>>> # check the contents
... soup.find('p').contents
[<span class="map-sub-title">abc</span>, '123']
>>> soup.find('p').contents[1]
'123'
If you want a generalized solution where you don't know the position, you can check whether the type of each piece of content is NavigableString, like this:
>>> final_text = [x for x in soup.find('p').contents if isinstance(x, NavigableString)]
>>> final_text
['123']
With the second method, you'll be able to get all the text that is directly a child of the <p> tag. For completeness's sake, here's one more example:
>>> html = '''
... <p>
... I want
... <span class="map-sub-title">abc</span>
... foo
... <span class="map-sub-title">abc2</span>
... text
... <span class="map-sub-title">abc3</span>
... only
... </p>
... '''
>>> soup = BeautifulSoup(html, 'lxml')
>>> ' '.join([x.strip() for x in soup.find('p').contents if isinstance(x, NavigableString)])
'I want foo text only'
If there’s more than one thing inside a tag, you can still look at just the strings. Use the .strings generator:
>>> from bs4 import BeautifulSoup
>>> html = '<p><span class="map-sub-title">abc</span>123</p>'
>>> soup1 = BeautifulSoup(html,"lxml")
>>> soup1.p.strings
<generator object _all_strings at 0x00000008768C50>
>>> list(soup1.p.strings)
['abc', '123']
>>> list(soup1.p.strings)[1]
'123'
I am trying to use BeautifulSoup to find all div containers with the class attribute beginning by "foo bar". I had hoped the following would work:
from bs4 import BeautifulSoup
import re
soup.findAll('div',class_=re.compile('^foo bar'))
However, it seems that the class definition is separated into a list, like ['foo','bar'], such that regular expressions are not able to accomplish my task. Is there a way I can accomplish this task? (I have reviewed a number of other posts, but have not found a working solution)
You can pass a function that returns True or False; a lambda does the trick too:
from bs4 import BeautifulSoup
html = '''
<div class="foo bar bing"></div>
<div class="foo bang"></div>
<div class="foo bar1 bang"></div>
'''
soup = BeautifulSoup(html, 'lxml')
res = soup.find_all('div', class_=lambda s: s and s.startswith('foo bar '))
print(res)
>>> [<div class="foo bar bing"></div>]
res = soup.find_all('div', class_=lambda s: s and s.startswith('foo bar'))  # without the trailing space
print(res)
>>> [<div class="foo bar bing"></div>, <div class="foo bar1 bang"></div>]
Another possible syntax, with a named function:
def is_a_match(css_class):
    return css_class is not None and css_class.startswith('foo bar')

res = soup.find_all('div', class_=is_a_match)
Maybe this answer can help you too : https://stackoverflow.com/a/46719313/6655211
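A runnable sketch of the function-based variant (the None check is an assumption added here so that tags without any class attribute don't raise an AttributeError):

```python
from bs4 import BeautifulSoup

html = '''
<div class="foo bar bing"></div>
<div class="foo bang"></div>
<div class="foo bar1 bang"></div>
<div>no class here</div>
'''
soup = BeautifulSoup(html, 'html.parser')

def is_a_match(css_class):
    # css_class is None for tags without a class attribute (an added guard)
    return css_class is not None and css_class.startswith('foo bar')

res = soup.find_all('div', class_=is_a_match)
print([' '.join(div['class']) for div in res])
```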
I have HTML like:
<tr>
<td>Title:</td>
<td>Title value</td>
</tr>
I have to specify after which <td> with a given text I want to grab the text of the next <td>. Something like: grab the text of the first <td> after the <td> that contains the text Title:. The result should be: Title value
I have a basic understanding of Python and BeautifulSoup, and I have no idea how to do this when there is no class to specify.
I have tried this:
row = soup.find_all('td', string='Title:')
text = str(row.nextSibling)
print(text)
and I receive the error: AttributeError: 'ResultSet' object has no attribute 'nextSibling'
First of all, soup.find_all() returns a ResultSet containing all the elements with tag td and string Title:.
For each such element in the result set, you will need to get the nextSibling separately (you should also loop until you find a sibling with tag td, since other nodes, such as NavigableStrings, can appear in between).
Example -
>>> from bs4 import BeautifulSoup
>>> s="""<tr>
... <td>Title:</td>
... <td>Title value</td>
... </tr>"""
>>> soup = BeautifulSoup(s,'html.parser')
>>> row = soup.find_all('td', string='Title:')
>>> for r in row:
...     nextSib = r.nextSibling
...     while nextSib is not None and nextSib.name != 'td':
...         nextSib = nextSib.nextSibling
...     print(nextSib.text)
...
Title value
Or you can use another library with XPath support; with XPath you can do this easily. Such libraries include lxml and xml.etree.ElementTree.
What you intend to do is relatively easy with lxml using XPath. You can try something like this:
from lxml import etree
tree = etree.parse(<your file>)
path_list = tree.xpath('//<xpath to td>')
for i in range(len(path_list) - 1):
    if path_list[i].text == '<What you want>':
        your_text = path_list[i + 1].text
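For reference, XPath's following-sibling axis can also express the sibling lookup in a single expression, avoiding the index arithmetic; a minimal sketch parsing the markup from a string:

```python
from lxml import etree

html = '''<table><tr>
<td>Title:</td>
<td>Title value</td>
</tr></table>'''

# etree.HTML parses a string; following-sibling::td[1] is the next <td>
# after the <td> whose text is exactly 'Title:'
tree = etree.HTML(html)
values = tree.xpath('//td[text()="Title:"]/following-sibling::td[1]/text()')
print(values)
```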
I've researched this question but haven't found an actual solution. I'm using BeautifulSoup with Python, and what I want to do is get all the image tags from a page, loop through each, and check whether its immediate parent is an anchor tag.
Here's some pseudo code:
html = BeautifulSoup(responseHtml)
for image in html.findAll('img'):
    if (image.parent.name == 'a'):
        image.hasParent = image.parent.link
Any ideas on this?
You need to check parent's name:
for img in soup.find_all('img'):
    if img.parent.name == 'a':
        print "Parent is a link"
Demo:
>>> from bs4 import BeautifulSoup
>>>
>>> data = """
... <body>
... <a href="link.html"><img src="image.png"/></a>
... </body>
... """
>>> soup = BeautifulSoup(data, "html.parser")
>>> img = soup.img
>>>
>>> img.parent.name
'a'
You can also retrieve the img tags that have a direct a parent using a CSS selector:
soup.select('a > img')
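A minimal demo of that selector (using html.parser; the tag names and src values here are just illustrative):

```python
from bs4 import BeautifulSoup

html = '''
<a href="page.html"><img src="linked.png"/></a>
<img src="plain.png"/>
'''
soup = BeautifulSoup(html, 'html.parser')

# 'a > img' selects only <img> tags whose direct parent is an <a>
linked = soup.select('a > img')
print([img['src'] for img in linked])
```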
I have an HTML page with multiple divs like:
<div class="article-additional-info">
A peculiar situation arose in the Supreme Court on Tuesday when two lawyers claimed to be the representative of one of the six accused in the December 16 gangrape case who has sought shifting of t...
<a class="more" href="http://www.thehindu.com/news/national/gangrape-case-two-lawyers-claim-to-be-engaged-by-accused/article4332680.ece">
<span class="arrows">»</span>
</a>
</div>
<div class="article-additional-info">
Power consumers in the city will have to brace for spending more on their monthly bills as all three power distribution companies – the Anil Ambani-owned BRPL and BYPL and the Tatas-owned Tata Powe...
<a class="more" href="http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece">
<a class="commentsCount" href="http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece#comments">
</div>
and I need to get the <a href=> value for all the divs with class article-additional-info
I am new to BeautifulSoup,
so I need the URLs:
"http://www.thehindu.com/news/national/gangrape-case-two-lawyers-claim-to-be-engaged-by-accused/article4332680.ece"
"http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece"
What is the best way to achieve this?
According to your criteria, it returns three URLs (not two); did you want to filter out the third?
The basic idea is to iterate through the HTML, pulling out only the elements with your class, and then iterate through all of the links in those elements, pulling out the actual URLs:
In [1]: from bs4 import BeautifulSoup
In [2]: html = # your HTML
In [3]: soup = BeautifulSoup(html)
In [4]: for item in soup.find_all(attrs={'class': 'article-additional-info'}):
   ...:     for link in item.find_all('a'):
   ...:         print link.get('href')
   ...:
http://www.thehindu.com/news/national/gangrape-case-two-lawyers-claim-to-be-engaged-by-accused/article4332680.ece
http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece
http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece#comments
This limits your search to just those elements with the article-additional-info class, and within those it looks for all anchor (a) tags and grabs their corresponding href links.
After working with the documentation, I did it the following way. Thank you all for your answers; I appreciate them.
>>> import urllib2
>>> from bs4 import BeautifulSoup
>>> f = urllib2.urlopen('http://www.thehindu.com/news/cities/delhi/?union=citynews')
>>> soup = BeautifulSoup(f)
>>> for link in soup.select('.article-additional-info'):
...     print link.find('a').attrs['href']
...
http://www.thehindu.com/news/cities/Delhi/airport-metro-express-is-back/article4335059.ece
http://www.thehindu.com/news/cities/Delhi/91-more-illegal-colonies-to-be-regularised/article4335069.ece
http://www.thehindu.com/news/national/gangrape-case-two-lawyers-claim-to-be-engaged-by-accused/article4332680.ece
http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece
http://www.thehindu.com/news/cities/Delhi/nurses-women-groups-demand-safety-audit-of-workplaces/article4331470.ece
http://www.thehindu.com/news/cities/Delhi/delhi-bpl-families-to-get-12-subsidised-lpg-cylinders/article4328990.ece
http://www.thehindu.com/news/cities/Delhi/shias-condemn-violence-against-religious-minorities/article4328276.ece
http://www.thehindu.com/news/cities/Delhi/new-archbishop-of-delhi-takes-over/article4328284.ece
http://www.thehindu.com/news/cities/Delhi/delhi-metro-to-construct-subway-without-disrupting-traffic/article4328290.ece
http://www.thehindu.com/life-and-style/Food/going-for-the-kill-in-patparganj/article4323210.ece
http://www.thehindu.com/news/cities/Delhi/fire-at-janpath-bhavan/article4335068.ece
http://www.thehindu.com/news/cities/Delhi/fiveyearold-girl-killed-as-school-van-overturns/article4335065.ece
http://www.thehindu.com/news/cities/Delhi/real-life-stories-of-real-women/article4331483.ece
http://www.thehindu.com/news/cities/Delhi/women-councillors-allege-harassment-by-male-councillors-of-rival-parties/article4331471.ece
http://www.thehindu.com/news/cities/Delhi/airport-metro-resumes-today/article4331467.ece
http://www.thehindu.com/news/national/hearing-today-on-plea-to-shift-trial/article4328415.ece
http://www.thehindu.com/news/cities/Delhi/protestors-demand-change-in-attitude-of-men-towards-women/article4328277.ece
http://www.thehindu.com/news/cities/Delhi/bjp-promises-5-lakh-houses-for-poor-on-interestfree-loans/article4328280.ece
http://www.thehindu.com/life-and-style/metroplus/papad-bidi-and-a-dacoit/article4323219.ece
http://www.thehindu.com/life-and-style/Food/gharana-of-food-not-just-music/article4323212.ece
>>>
from bs4 import BeautifulSoup as BS
html = # Your HTML
soup = BS(html)
for text in soup.find_all('div', class_='article-additional-info'):
    for links in text.find_all('a'):
        print links.get('href')
Which prints:
http://www.thehindu.com/news/national/gangrape-case-two-lawyers-claim-to-be-engaged-by-accused/article4332680.ece
http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece
http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece#comments