Python BeautifulSoup find_all function without duplication

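from bs4 import BeautifulSoup

markup = '<b></b><a></a><p>hey</p><li><p>How</p></li><li><p>are you </p></li>'
soup = BeautifulSoup(markup, 'html.parser')
# Pass a list to match either tag name; find_all('p', 'li') would look for <p class="li">
data = soup.find_all(['p', 'li'])
print(data)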
The result looks like this:
['<p>hey</p>,<p>How </p>,<li><p>How</p></li>,<p>are you </p>,<li><p>are you </p></li>']
How can I make it return only
['<p>hey</p>,<li><p>How</p></li>,<li><p>are you </p></li>']
or is there any other way to remove the duplicated p tags? Thanks.

BeautifulSoup does not compare the text of the HTML; it returns every tag that matches the requested structure, so a <p> nested inside an <li> is returned both on its own and as part of its parent. In this case, you need an additional step that checks whether the text of a tag has already been seen in an earlier match:
import re
from bs4 import BeautifulSoup as soup

markup = '<b></b><a></a><p>hey</p><li><p>How</p></li><li><p>are you </p></li>'
s = soup(markup, 'lxml')
# Match tags named exactly 'p' or 'li'
final_s = s.find_all(re.compile('^(p|li)$'))
# Keep a tag only if no earlier match has the same text
final_data = [','.join(map(str, [a for i, a in enumerate(final_s) if not any(a.text == b.text for b in final_s[:i])]))]
Output:
['<p>hey</p>,<li><p>How</p></li>,<li><p>are you </p></li>']
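As an alternative to comparing text, the duplicates can also be dropped structurally, by keeping a matched tag only when none of its ancestors is itself a match. The following is a minimal sketch under that assumption; the names wanted, top_level and result are illustrative, and html.parser is used here only to avoid an extra dependency:

```python
from bs4 import BeautifulSoup

markup = '<b></b><a></a><p>hey</p><li><p>How</p></li><li><p>are you </p></li>'
soup = BeautifulSoup(markup, 'html.parser')

wanted = ['p', 'li']
# Keep a matched tag only if none of its ancestors is also a match
top_level = [tag for tag in soup.find_all(wanted) if tag.find_parent(wanted) is None]
result = [str(tag) for tag in top_level]
print(result)
```

Unlike the text comparison, this keeps two separate tags that happen to contain the same text, which may or may not be what you want.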

If you are looking for the text and not the <li> tags, you can just search for the <p> tags and get the desired result without duplication.
>>> data = soup.find_all('p')
>>> data
[<p>hey</p>, <p>How</p>, <p>are you </p>]
Or, if you are simply looking for the text, you can use this:
>>> data = [x.text for x in soup.find_all('p')]
>>> data
['hey', 'How', 'are you ']

Related

how to use beautiful soup to get all text "except" a specific class

I'm trying to use soup.get_text to get some text out of a webpage, but I want to exclude a specific class.
I tried to use a = soup.find_all(class_ = "something") and b=[i.get_text() for i in a], but that allows me to choose one class, and doesn't allow me to exclude one specific class.
I also tried:
a = soup.select('span:not([class_ ="something"])')
b = [i.get_text() for i in a]
First, the output wasn't text only. More importantly, it returned elements of every class, including the "something" class that I wanted to exclude.
Is there some other way to do that?
Thanks in advance.
Example:
from urllib.request import urlopen
from bs4 import BeautifulSoup

link = "https://stackoverflow.com/questions/74620106/how-to-use-beautiful-soup-to-get-all-text-except-a-specific-class"
f = urlopen(link)
soup = BeautifulSoup(f, 'html.parser')
a = soup.find_all(class_ = "mt24 mb12")
b = [i.get_text() for i in a]
text = soup.select('div:not([class_ ="mt24 mb12"])')
text1 = [i.get_text() for i in text]
If you want every element except those with one particular class, you can loop through all the elements and keep only the ones you want:
for p in soup.find_all("p", "review_comment"):
    if p.find(class_="something-archived"):
        continue
    # p is now a wanted p
Source: Excluding unwanted results of findAll using BeautifulSoup
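Two other options may be worth trying. In a CSS selector the attribute is called class (class_ is only the Python keyword argument), so a selector such as span:not(.something) excludes the class directly. Alternatively, the unwanted elements can be removed from the tree before calling get_text. The sketch below uses made-up HTML and class names purely for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical markup, not taken from the page in the question
html = """
<div>
  <span class="keep">alpha</span>
  <span class="something">beta</span>
  <span>gamma</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Option 1: select every span that does not carry the class
wanted = soup.select("span:not(.something)")
texts = [tag.get_text(strip=True) for tag in wanted]
print(texts)

# Option 2: delete the unwanted elements, then take the remaining text
for tag in soup.select(".something"):
    tag.decompose()
remaining = soup.get_text(" ", strip=True)
print(remaining)
```

Note that decompose modifies the soup in place, so parse the document again if you still need the removed elements later.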

BeautifulSoup to get 'span' contents next to each other

A part of HTML looks like below. I want to extract the contents in the 'span' tags:
from bs4 import BeautifulSoup
data = """
<section><h2>Team</h2><ul><li><ul><li><span>J36</span>—<span>John</span></li><li><span>B56</span>—<span>Bratt</span></li><li><span>K3</span>—<span>Kate</span></li></ul></li></ul></section>
... """
soup = BeautifulSoup(data, "html.parser")
classification = soup.find_all('section')[0].find_all('span')
for c in classification:
    print(c.text)
It prints:
J36
John
B56
Bratt
K3
Kate
But the wanted output is:
J36-John
B56-Bratt
K3-Kate
What's the proper BeautifulSoup way to extract the contents, other than below? Thank you.
contents = [c.text for c in classification]
l = contents[0::2]
ll = contents[1::2]
for a in zip(l, ll):
    print('-'.join(a))
You could use the next sibling of each span. If the span is followed by the dash, the dash is printed along with the text and the line is left open; otherwise just the text is printed.
for c in classification:
    if c.next_sibling:
        print(c.text + str(c.next_sibling), end='')
    else:
        print(c.text)
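Another way to pair the values, sketched here as one possibility rather than the definitive approach, is to walk the list items and join the spans that are direct children of each one. The rows variable is introduced only for this example, and the hyphen separator follows the wanted output in the question:

```python
from bs4 import BeautifulSoup

data = """
<section><h2>Team</h2><ul><li><ul><li><span>J36</span>—<span>John</span></li><li><span>B56</span>—<span>Bratt</span></li><li><span>K3</span>—<span>Kate</span></li></ul></li></ul></section>
"""
soup = BeautifulSoup(data, "html.parser")

rows = []
for li in soup.find("section").find_all("li"):
    # recursive=False looks at direct children only, so the outer <li>
    # (which only holds the nested <ul>) contributes nothing
    spans = li.find_all("span", recursive=False)
    if spans:
        rows.append("-".join(span.get_text() for span in spans))

print("\n".join(rows))
```

This does not depend on the spans always coming in pairs, since each row is built from whatever spans its own list item contains.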

Hashtags python html

I want to extract all the hashtags from a given website:
For example, "I love #stack overflow because #people are very #helpful!"
This should pull the 3 hashtags into a table.
In the website I am targeting there is a table with a #tag description
So we can find #love this hashtag speaks about love
This is my work:
#import the library used to query a website
import urllib2
#specify the url
wiki = "https://www.symplur.com/healthcare-hashtags/tweet-chats/all"
#Query the website and return the html to the variable 'page'
page = urllib2.urlopen(wiki)
#import the Beautiful soup functions to parse the data returned from the website
from bs4 import BeautifulSoup
#Parse the html in the 'page' variable, and store it in Beautiful Soup format
soup = BeautifulSoup(page, "lxml")
print soup.prettify()
s = soup.get_text()
import re
re.findall("#(\w+)", s)
I have an issue with the output, which looks like this:
[u'eeeeee',
u'333333',
u'222222',
u'222222',
u'222222',
u'222222',
u'222222',
u'222222',
u'222222',
u'AASTGrandRoundsacute'
The output concatenates the hashtag with the first word of the description. For the example I gave above, the output would be 'lovethis'.
How can I extract only the word right after the hashtag?
Thank you
I think there's no need to use a regex to parse the text you get from the page; you can use BeautifulSoup itself for that. I'm using Python 3.6 in the code below, just to show the entire code, but the important line is hashtags = soup.findAll('td', {'id':'tweetchatlist_hashtag'}). Notice that every hashtag in the table sits in a td tag whose id attribute is tweetchatlist_hashtag, so calling .findAll is the way to go here:
import requests
import re
from bs4 import BeautifulSoup
wiki = "https://www.symplur.com/healthcare-hashtags/tweet-chats/all"
page = requests.get(wiki).text
soup = BeautifulSoup(page, "lxml")
hashtags = soup.findAll('td', {'id':'tweetchatlist_hashtag'})
Now let's have a look at the first item of our list:
>>> hashtags[0]
<td id="tweetchatlist_hashtag" itemprop="location">#AASTGrandRounds</td>
So we see that what we really want is the value of the title attribute of the a tag inside the cell:
>>> hashtags[0].a['title']
'#AASTGrandRounds'
To get a list of all the hashtags, use a list comprehension:
>>> lst = [hashtag.a['title'] for hashtag in hashtags]
If you are not used to list comprehension syntax, the line above is equivalent to this:
>>> lst = []
>>> for hashtag in hashtags:
...     lst.append(hashtag.a['title'])
lst is then the desired output. Here are the first 20 items of the list:
>>> lst[:20]
['#AASTGrandRounds', '#abcDrBchat', '#addictionchat', '#advocacychat', '#AetnaMyHealthy', '#AlzChat', '#AnatQ', '#anzOTalk', '#AskAvaility', '#ASPChat', '#ATtalk', '#autchat', '#AXSChat', '#ayacsm', '#bcceu', '#bccww', '#BCSM', '#benurse', '#BeTheDifference', '#bioethx']
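If you would still rather use the regex, the concatenation seen in the question most likely comes from get_text() joining the text of neighbouring cells with nothing in between. Passing a separator to get_text keeps the words apart. The table below is a made-up stand-in for the real page, used only to illustrate the effect:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical two-cell row: hashtag in one cell, description in the next
html = "<table><tr><td>#love</td><td>this hashtag speaks about love</td></tr></table>"
soup = BeautifulSoup(html, "html.parser")

# Without a separator the two cells run together
joined = re.findall(r"#(\w+)", soup.get_text())
# With a separator the hashtag ends where its cell ends
separated = re.findall(r"#(\w+)", soup.get_text(" "))

print(joined)
print(separated)
```

The regex will still pick up anything else on the page that looks like a hashtag, such as the colour codes in the question's output, so selecting the table cells directly remains the more precise option.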

Removing new line '\n' from the output of python BeautifulSoup

I am using python Beautiful soup to get the contents of:
<div class="path">
abc
def
ghi
</div>
My code is as follows:
html_doc="""<div class="path">
abc
def
ghi
</div>"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc)
path = soup.find('div',attrs={'class':'path'})
breadcrum = path.findAll(text=True)
print breadcrum
The output is as follows:
[u'\n', u'abc', u'\n', u'def', u'\n', u'ghi',u'\n']
How can I get the result as a single string of the form abc,def,ghi?
I would also like to understand why the output looks the way it does.
You could do this:
breadcrum = [item.strip() for item in breadcrum if item.strip()]
The if item.strip() condition drops the items that are empty once the newline characters are stripped.
If you want to join the strings, then do:
','.join(breadcrum)
This will give you abc,def,ghi
EDIT
Although the above gives you what you want, as pointed out by others in the thread, the way you are using BS to extract the anchor texts is not correct. Once you have the div you are interested in, you should use it to get its children and then get the anchor text:
path = soup.find('div',attrs={'class':'path'})
anchors = path.find_all('a')
data = []
for ele in anchors:
    data.append(ele.text)
And then do a ','.join(data)
If you just strip the items in breadcrum, you end up with empty items in your list. You can either do as shaktimaan suggested and then use
breadcrum = filter(None, breadcrum)
Or you can strip the newlines beforehand (in html_doc):
html_doc = html_doc.replace('\n', '').replace('\r', '')
Either way to get your string output, do something like this:
','.join(breadcrum)
Unless I'm missing something, just combine strip and list comprehension.
Code:
from bs4 import BeautifulSoup as bsoup
ofile = open("test.html", "r")
soup = bsoup(ofile)
res = ",".join([a.get_text().strip() for a in soup.find("div", class_="path").find_all("a")])
print res
Result:
abc,def,ghi
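Beautiful Soup also offers the stripped_strings generator, which yields each text node with surrounding whitespace removed and skips the ones that are empty afterwards. The sketch below assumes the div contains one anchor per line, as the answers above suggest; the href values are placeholders:

```python
from bs4 import BeautifulSoup

# Assumed markup: one anchor per line inside the div (hrefs are placeholders)
html_doc = """<div class="path">
<a href="#">abc</a>
<a href="#">def</a>
<a href="#">ghi</a>
</div>"""

soup = BeautifulSoup(html_doc, "html.parser")
path = soup.find("div", class_="path")

# stripped_strings skips the whitespace-only nodes between the anchors
result = ",".join(path.stripped_strings)
print(result)
```

This also explains the original output: the newline between two tags is a text node of its own, so findAll(text=True) returns it alongside the anchor texts.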

Remove <br> tags from a parsed Beautiful Soup list?

I'm currently getting into a for loop with all the rows I want:
page = urllib2.urlopen(pageurl)
soup = BeautifulSoup(page)
tables = soup.find("td", "bodyTd")
for row in tables.findAll('tr'):
At this point, I have my information, but the <br /> tags are ruining my output.
What's the cleanest way to remove these?
for e in soup.findAll('br'):
    e.extract()
If you want to translate the <br />'s to newlines, do something like this:
def text_with_newlines(elem):
    text = ''
    for e in elem.recursiveChildGenerator():
        if isinstance(e, basestring):
            text += e.strip()
        elif e.name == 'br':
            text += '\n'
    return text
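With bs4 on Python 3, a similar effect can be had by swapping each br tag for a newline string in place and then reading the text. This is a small sketch on a made-up cell, not a drop-in replacement for the function above (it does not strip the individual strings):

```python
from bs4 import BeautifulSoup

# Hypothetical cell content with line breaks
html = "<td>Some data<br/>More data<br/></td>"
soup = BeautifulSoup(html, "html.parser")

# Replace every <br> element with a newline text node
for br in soup.find_all("br"):
    br.replace_with("\n")

text = soup.get_text()
print(text)
```

Replacing with a space instead of "\n" keeps the cell on one line, if that suits the output better.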
Replace the tags up front with a space.
Beautiful Soup also accepts the result of .read() on the urlopen object, so this should work:
import re

page = urllib2.urlopen(pageurl)
page_text = page.read()
# Match <br>, <br/> and <br />
new_text = re.sub(r'<br\s*/?>', ' ', page_text)
soup = BeautifulSoup(new_text)
tables = soup.find("td", "bodyTd")
for row in tables.findAll('tr'):
    .....
The re.sub call replaces each br tag with a space.
Maybe some_string.replace('<br />','\n') to replace the breaks with newlines.
>>> print 'Some data<br />More data<br />'.replace('<br />','\n')
Some data
More data
You might want to check out html5lib and lxml, which are both pretty great at parsing html. lxml is really fast and html5lib is designed to be extremely robust.
