Removing new line '\n' from the output of python BeautifulSoup - python

I am using Python's Beautiful Soup to get the contents of:
<div class="path">
abc
def
ghi
</div>
My code is as follows:
html_doc="""<div class="path">
abc
def
ghi
</div>"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc)
path = soup.find('div',attrs={'class':'path'})
breadcrum = path.findAll(text=True)
print breadcrum
The output is as follows:
[u'\n', u'abc', u'\n', u'def', u'\n', u'ghi', u'\n']
How can I get the result as a single string in this form: abc,def,ghi?
I would also like to understand why the output looks like this.

You could do this:
breadcrum = [item.strip() for item in breadcrum if item.strip()]
The if item.strip() takes care of getting rid of the items that become empty once the newline characters are stripped.
If you want to join the strings, then do:
','.join(breadcrum)
This will give you abc,def,ghi
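End to end, a minimal sketch against the question's html_doc (Python 2, as in the question):
from bs4 import BeautifulSoup
html_doc = """<div class="path">
abc
def
ghi
</div>"""
soup = BeautifulSoup(html_doc)
path = soup.find('div', attrs={'class': 'path'})
breadcrum = path.findAll(text=True)
breadcrum = [item.strip() for item in breadcrum if item.strip()]
print ','.join(breadcrum)  # abc,def,ghi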
EDIT
Although the above gives you what you want, as pointed out by others in the thread, the way you are using BS to extract anchor text is not correct. Once you have the div you are interested in, you should use it to get its children and then pull the text from each anchor. For example:
path = soup.find('div',attrs={'class':'path'})
anchors = path.find_all('a')
data = []
for ele in anchors:
    data.append(ele.text)
And then do a ','.join(data)
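For example, if the div actually contained anchors (hypothetical markup, since the question's div holds bare text, so this is only a sketch of the idea):
from bs4 import BeautifulSoup
html_doc = '<div class="path"><a href="#">abc</a><a href="#">def</a><a href="#">ghi</a></div>'
soup = BeautifulSoup(html_doc)
path = soup.find('div', attrs={'class': 'path'})
anchors = path.find_all('a')
data = []
for ele in anchors:
    data.append(ele.text)  # collect the text of each anchor
print ','.join(data)  # abc,def,ghi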

If you just strip the items in breadcrum you will end up with empty items in your list. You can either do as shaktimaan suggested and then use
breadcrum = filter(None, breadcrum)
Or you can strip them out of html_doc beforehand:
html_doc = html_doc.replace('\n', ' ').replace('\r', '')
Either way, to get your string output, do something like this:
','.join(breadcrum)
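Putting the two steps together, a small sketch (Python 2; assumes the path tag from the question):
breadcrum = path.findAll(text=True)
breadcrum = [item.strip() for item in breadcrum]  # strip the newline characters
breadcrum = filter(None, breadcrum)               # drop the now-empty items
print ','.join(breadcrum)  # abc,def,ghi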

Unless I'm missing something, just combine strip() with a list comprehension.
Code:
from bs4 import BeautifulSoup as bsoup
ofile = open("test.html", "r")
soup = bsoup(ofile)
res = ",".join([a.get_text().strip() for a in soup.find("div", class_="path").find_all("a")])
print res
Result:
abc,def,ghi
[Finished in 0.2s]

Related

How can I split the text elements out from this HTML String? Python

Good Morning,
I'm doing some HTML parsing in Python and I've run across the following, which is a time and name pairing in a single table cell. I'm trying to extract each piece of information separately and have tried several different approaches to split the following string.
HTML String:
<span><strong>13:30</strong><br/>SecondWord</span></a>
My output would hopefully be:
text1 = 13:30
text2 = "SecondWord"
I'm currently looping through all the rows in the table, taking the text and splitting it on a newline. I noticed the HTML has a line break tag in between, which is why it renders on separate lines on the web, so I was trying to replace that with a newline and run my split on it; however, my string.replace() and re.sub() approaches don't seem to be working.
I'd love to know what I'm doing wrong.
Latest Approach:
resub_pat = r'<br/>'
rows=list()
for row in table.findAll("tr"):
    a = re.sub(resub_pat, "\n", row.text).split("\n")
This is a bit hashed together, but I hope I've captured my problem! I wasn't able to find any similar issues.
You could try:
from bs4 import BeautifulSoup
import re
# the soup
soup = BeautifulSoup("<span><strong>13:30</strong><br/>SecondWord</span></a>", 'lxml')
# the regex object
rx = re.compile(r'(\d+:\d+)(.+)')
# time, text
text = soup.find('span').get_text()
x,y = rx.findall(text)[0]
print(x)
print(y)
Using recursive=False to get only direct text and strong.text to get the other one.
Ex:
from bs4 import BeautifulSoup
soup = BeautifulSoup("<span><strong>13:30</strong><br/>SecondWord</span></a>", 'lxml')
# text1
print(soup.find("span").strong.text) # --> 13:30
# text2
print(soup.find("span").find(text=True, recursive=False)) # --> SecondWord
from bs4 import BeautifulSoup
txt = '''<span><strong>13:30</strong><br/>SecondWord</span></a>'''
soup = BeautifulSoup(txt, 'html.parser')
text1, text2 = soup.span.get_text(strip=True, separator='|').split('|')
print(text1)
print(text2)
Prints:
13:30
SecondWord
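Applied back to the question's table loop, a sketch (it assumes table is the already-parsed table tag; note that the original re.sub never matched because row.text has already dropped the <br/> tags, so there is nothing left to substitute):
rows = []
for row in table.findAll("tr"):
    # get_text() inserts the separator where the <br/> sat, so the split works
    parts = row.get_text(separator='|', strip=True).split('|')
    rows.append(parts)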

Change scraped output

I have a loop putting URLs into my browser and scraping their content, generating this output:
2PRACE,0.0014
Hispanic,0.1556
API,0.0688
Black,0.0510
AIAN,0.0031
White,0.7200
The code looks like this:
f1 = open('urlz.txt','r',encoding="utf8")
ethnicity_urls = f1.readlines()
f1.close()
from urllib import request
from bs4 import BeautifulSoup
import time
import openpyxl
import pprint
for each in ethnicity_urls:
    time.sleep(1)
    scraped = request.urlopen(each)
    soup = BeautifulSoup(scraped)
    soup1 = soup.select('p')
    print(soup1)
    resultFile = open('results.csv','a')
    resultFile.write(pprint.pformat(soup1))
    resultFile.close()
My problem is quite simple, yet I cannot find any tool that helps me achieve it. I would like to change the output from a list with "\n" in it to this:
2PRACE,0.0014 Hispanic,0.1556 API,0.0688 Black,0.0510 AIAN,0.0031 White,0.7200
I did not succeed with replace, as it told me I was treating a list of elements like a single element.
My approach here was:
for each in ethnicity_urls:
    time.sleep(1)
    scraped = request.urlopen(each)
    soup = BeautifulSoup(scraped)
    soup1 = soup.select('p')
    soup2 = soup1.replace('\n',' ')
    print(soup2)
    resultFile = open('results.csv','a')
    resultFile.write(pprint.pformat(soup2))
    resultFile.close()
Can you help me find the correct approach to mutate the output before writing it to a csv?
The error message I get:
AttributeError: ResultSet object has no attribute 'replace'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
See the solution to the problem in my answer below. Thanks for all the responses!
soup1 seems to be an iterable, so you cannot just call replace on it.
Instead you could loop through all the items in soup1, call replace on the string form of each one, and append the changed string to your soup2 list. Something like this:
soup2 = []
for e in soup1:
    soup2.append(str(e).replace('\n', ' '))
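To end up with everything on one line, as in the desired output, the pieces can then be joined (a small sketch, assuming soup2 was built as above):
one_line = ' '.join(soup2)
print(one_line)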
You need to iterate over the soup.
soup1 is a list of elements.
The BS4 documentation is excellent and has many examples:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Use strip() to remove the \n
for x in soup1:
    for r in x.children:
        try:
            print(r.strip())
        except TypeError:
            pass
Thank you both for the ideas and resources. I was able to implement what you suggested. The current build is:
for each in ethnicity_urls:
    time.sleep(1)
    scraped = request.urlopen(each)
    soup = BeautifulSoup(scraped)
    soup1 = soup.select('p')
    for e in soup1:
        soup2 = str(soup1)
        soup2 = soup2.replace('\n','')
    print(soup2)
    resultFile = open('results.csv','a')
    resultFile.write(pprint.pformat(soup2))
    resultFile.close()
And it works just fine. I can do the final adjustments in Excel now.
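An alternative sketch that joins the text of each <p> directly instead of going through str() and replace (it assumes the same urlz.txt input and results.csv output as above; the exact tag contents depend on the pages being scraped):
for each in ethnicity_urls:
    time.sleep(1)
    scraped = request.urlopen(each)
    soup = BeautifulSoup(scraped)
    # collapse all whitespace inside each <p>, then join them with spaces
    texts = [' '.join(p.get_text().split()) for p in soup.select('p')]
    line = ' '.join(texts)
    print(line)
    resultFile = open('results.csv', 'a')
    resultFile.write(line + '\n')
    resultFile.close()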

Python Beautifulsoup find all function without duplication

from bs4 import BeautifulSoup
markup = '<b></b><a></a><p>hey</p><li><p>How</p></li><li><p>are you </p></li>'
soup = BeautifulSoup(markup)
data = soup.find_all(['p', 'li'])
print(data)
The result looks like this
['<p>hey</p>,<p>How </p>,<li><p>How</p></li>,<p>are you </p>,<li><p>are you </p></li>']
How can I make it only return
['<p>hey</p>,<li><p>How</p></li>,<li><p>are you </p></li>']
or any way that I can remove the duplicated p tags? Thanks.
BeautifulSoup is not analyzing the text of the HTML; instead, it is acquiring the desired tag structure. In this case, you need to perform an additional step and check whether the text of a <p> has already been seen in a different structure:
import re
from bs4 import BeautifulSoup as soup
markup = '<b></b><a></a><p>hey</p><li><p>How</p></li><li><p>are you </p></li>'
s = soup(markup, 'lxml')
final_s = s.find_all(re.compile('p|li'))
final_data = [','.join(map(str, [a for i, a in enumerate(final_s) if not any(a.text == b.text for b in final_s[:i])]))]
Output:
['<p>hey</p>,<li><p>How</p></li>,<li><p>are you </p></li>']
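The same check can also be written with a set of the texts already seen, as a sketch:
from bs4 import BeautifulSoup
markup = '<b></b><a></a><p>hey</p><li><p>How</p></li><li><p>are you </p></li>'
soup = BeautifulSoup(markup, 'lxml')
seen = set()
result = []
for tag in soup.find_all(['p', 'li']):
    text = tag.get_text()
    if text not in seen:  # keep only the first structure that carries this text
        seen.add(text)
        result.append(str(tag))
print(result)  # ['<p>hey</p>', '<li><p>How</p></li>', '<li><p>are you </p></li>']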
If you are looking for the text and not the <li> tags, you can just search for the <p> tags and get the desired result without duplication.
>>> data = soup.find_all('p')
>>> data
[<p>hey</p>, <p>How</p>, <p>are you </p>]
Or, if you are simply looking for the text, you can use this:
>>> data = [x.text for x in soup.find_all('p')]
>>> data
['hey', 'How', 'are you ']

Remove lines getting empty after BeautifulSoup decompose

I am trying to strip certain HTML tags and their content from a file with BeautifulSoup. How can I remove lines that become empty after applying decompose()? In this example, I want the line between a and 3 to be gone, as this is where the <span>...</span> block was, but not the blank line at the end.
from bs4 import BeautifulSoup
Rmd_data = 'a\n<span class="answer">\n2\n</span>\n3\n'
print(Rmd_data)
#OUTPUT
# a
# <span class="answer">
# 2
# </span>
# 3
#
# END OUTPUT
soup = BeautifulSoup(Rmd_data, "html.parser")
answers = soup.find_all("span", "answer")
for a in answers:
    a.decompose()
Rmd_data = str(soup)
print(Rmd_data)
# OUTPUT
# a
#
# 3
#
# END OUTPUT
I'm surprised that BeautifulSoup's prettify() does not offer an option for this. Instead of manipulating the HTML manually, you could re-parse your HTML:
str(BeautifulSoup(str(soup), 'html.parser'))
As always, enjoy.
For removing the empty lines, the easiest way is via re:
import re
re.sub(r'[\n\s]+', '\n', text, flags=re.MULTILINE)
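Applied to the question's example, a sketch of the whole flow (here the pattern r'\n\s*\n' targets just the empty line that decompose() leaves behind):
import re
from bs4 import BeautifulSoup
Rmd_data = 'a\n<span class="answer">\n2\n</span>\n3\n'
soup = BeautifulSoup(Rmd_data, "html.parser")
for a in soup.find_all("span", "answer"):
    a.decompose()
cleaned = re.sub(r'\n\s*\n', '\n', str(soup))
print(cleaned)
# a
# 3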

re.sub isn't matching when it seems it should

Any help as to why this regex isn't matching <td>\n etc.? I tested it successfully on pythex.org. Basically I'm just trying to clean up the output so it just says myfile.doc. I also tried (<td>)?\\n\s+(</td>)?
>>> from bs4 import BeautifulSoup
>>> from pprint import pprint
>>> import re
>>> soup = BeautifulSoup(open("/home/user/message_tracking.html"), "html.parser")
>>>
>>> filename = str(soup.findAll("td", text=re.compile(r"\.[a-z]{3,}")))
>>> print filename
[<td>\n myfile.doc\n </td>]
>>> duh = re.sub("(<td>)?\n\s+(</td>)?", '', filename)
>>> print duh
[<td>\n myfile.doc\n </td>]
It's hard to tell without seeing the repr(filename), but I think your problem is the confusion of real newline characters with escaped newline characters.
Compare and contrast the examples below:
>>> pattern = "(<td>)?\n\s+(</td>)?"
>>> filename1 = '[<td>\n myfile.doc\n </td>]'
>>> filename2 = r'[<td>\n myfile.doc\n </td>]'
>>>
>>> re.sub(pattern, '', filename1)
'[myfile.doc]'
>>> re.sub(pattern, '', filename2)
'[<td>\\n myfile.doc\\n </td>]'
If your goal is to just get the stripped string from within the <td> tag you can just let BeautifulSoup do it for you by getting the stripped_strings attribute of a tag:
import re
from bs4 import BeautifulSoup
soup = BeautifulSoup(open("/home/user/message_tracking.html"), "html.parser")
filename_tag = soup.find("td", text=re.compile(r"\.[a-z]{3,}"))  # finds the first td in the html with the specified text
filename_string = list(filename_tag.stripped_strings)  # stripped_strings is a generator, so materialise it
print filename_string
If you want to extract further strings from tags of the same type you can then use findNext to extract the next td tag after the current one:
filename_tag = filename_tag.findNext("td", text=re.compile(r"\.[a-z]{3,}"))  # finds the next matching td after the current one
filename_string = list(filename_tag.stripped_strings)
print filename_string
And then loop through...
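Looping until there are no more matches, a sketch (assumes the same message_tracking.html file as above):
import re
from bs4 import BeautifulSoup
soup = BeautifulSoup(open("/home/user/message_tracking.html"), "html.parser")
pattern = re.compile(r"\.[a-z]{3,}")
tag = soup.find("td", text=pattern)
while tag is not None:
    print list(tag.stripped_strings)  # the stripped strings inside this td
    tag = tag.findNext("td", text=pattern)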
