How can I turn <br> and <p> into line breaks? - python

Let's say I have an HTML document with <p> and <br> tags inside. Afterwards, I'm going to strip the HTML to clean up the tags. How can I turn them into line breaks?
I'm using Python's BeautifulSoup library, if that helps at all.

Without some specifics, it's hard to be sure this does exactly what you want, but this should give you the idea... it assumes your br tags are wrapped inside p elements.
from BeautifulSoup import BeautifulSoup
import six

def replace_with_newlines(element):
    text = ''
    for elem in element.recursiveChildGenerator():
        if isinstance(elem, six.string_types):
            text += elem.strip()
        elif elem.name == 'br':
            text += '\n'
    return text
page = """<html>
<body>
<p>America,<br>
Now is the<br>time for all good men to come to the aid<br>of their country.</p>
<p>pile on taxpayer debt<br></p>
<p>Now is the<br>time for all good men to come to the aid<br>of their country.</p>
</body>
</html>
"""
soup = BeautifulSoup(page)
lines = soup.find("body")
for line in lines.findAll('p'):
    line = replace_with_newlines(line)
    print line
Running this results in...
(py26_default)[mpenning@Bucksnort ~]$ python thing.py
America,
Now is the
time for all good men to come to the aid
of their country.
pile on taxpayer debt
Now is the
time for all good men to come to the aid
of their country.
(py26_default)[mpenning@Bucksnort ~]$

get_text seems to do what you need
>>> from bs4 import BeautifulSoup
>>> doc = "<p>This is a paragraph.</p><p>This is another paragraph.</p>"
>>> soup = BeautifulSoup(doc)
>>> soup.get_text(separator="\n")
u'This is a paragraph.\nThis is another paragraph.'
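For example, feeding it markup with both <p> and <br> tags (a quick sketch, assuming bs4; the exact whitespace depends on your input):

from bs4 import BeautifulSoup

doc = "<p>America,<br>Now is the time</p><p>pile on taxpayer debt</p>"
soup = BeautifulSoup(doc, "html.parser")
# every string in the document is joined with the separator,
# so both <br> and <p> boundaries turn into newlines
print(soup.get_text(separator="\n"))
# America,
# Now is the time
# pile on taxpayer debt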

This is a Python 3 version of @Mike Pennington's answer (it really helps); I did a little refactoring.
def replace_with_newlines(element):
    text = ''
    for elem in element.recursiveChildGenerator():
        if isinstance(elem, str):
            text += elem.strip()
        elif elem.name == 'br':
            text += '\n'
    return text

def get_plain_text(soup):
    plain_text = ''
    lines = soup.find("body")
    for line in lines.findAll('p'):
        line = replace_with_newlines(line)
        plain_text += line
    return plain_text
To use this, just pass the BeautifulSoup object to the get_plain_text method.
soup = BeautifulSoup(page)
plain_text = get_plain_text(soup)
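If you are on bs4 rather than the old BeautifulSoup 3 module, the same idea can be written against the documented descendants generator (a sketch under that assumption; get_plain_text stays the same):

from bs4 import BeautifulSoup, NavigableString

def replace_with_newlines(element):
    text = ''
    for elem in element.descendants:          # all nested strings and tags
        if isinstance(elem, NavigableString):
            text += elem.strip()
        elif elem.name == 'br':
            text += '\n'
    return text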

I use the following small library to accomplish this:
https://github.com/TeamHG-Memex/html-text
pip install html-text
As simple as:
>>> import html_text
>>> html_text.extract_text('<h1>Hello</h1> world!')
'Hello\n\nworld!'

I'm not fully sure what you're trying to accomplish, but if you're just trying to remove the HTML elements, I would use a program like Notepad2 and its Replace All function - I think you can also insert a new line with Replace All. Make sure that if you replace the <p> element you also remove the closing </p>. Additionally, just an FYI: plain <br> is valid HTML5 and <br /> is the XHTML-style self-closing form, but it doesn't really matter here. Python wouldn't be my first choice for this, so it's a little out of my area of knowledge; sorry I couldn't help more.
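If you do end up doing it in Python instead of an editor, the same Replace All idea is just chained str.replace calls - a rough sketch that assumes the tags appear exactly as written, with no attributes or different casing:

html = "<p>America,<br>Now is the time</p>"
text = (html.replace("<br>", "\n")
            .replace("<br />", "\n")
            .replace("<p>", "")
            .replace("</p>", "\n"))
print(text)
# America,
# Now is the time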

Related

How can I clean html code to return only the number values?

<div class="bb-fl" style="background:Tomato;width:0.63px" title="10">​</div>,
<div class="bb-fl" style="background:SkyBlue;width:0.19px" title="3">​</div>,
<div class="bb-fl" style="background:Tomato;width:1.14px" title="18">​</div>,
<div class="bb-fl" style="background:SkyBlue;width:0.19px" title="3">​</div>,
<div class="bb-fl" style="background:Tomato;width:1.52px" title="24">​</div>,
I currently have the above HTML code in a list. I want to use Python to output the following values and then append them to a list:
10
3
18
3
24
I would recommend using Beautiful Soup, a very popular HTML parsing module that is well suited for this kind of thing. If each element has a title attribute, then you could do something like this:
from bs4 import BeautifulSoup
import requests

def randomFacts(url):
    r = requests.get(url)
    bs = BeautifulSoup(r.content, 'html.parser')
    title = bs.find_all('div')
    for each in title:
        print(each['title'])
Beautiful Soup is my normal go to for html parsing, hope this helps.
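Since the original goal was to append the numbers to a list rather than print them, here is a small sketch doing that with the sample divs, assuming every div you care about actually carries a title attribute:

from bs4 import BeautifulSoup

html = '<div class="bb-fl" title="10"></div><div class="bb-fl" title="3"></div>'
soup = BeautifulSoup(html, 'html.parser')
# title=True keeps only the divs that really have a title attribute
numbers = [int(div['title']) for div in soup.find_all('div', title=True)]
print(numbers)  # [10, 3]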
Here are 3 possibilities. In the first 2 versions we make sure the class checks out before appending it to the list - just in case there are other divs that you don't want to include. In the third method there isn't really a good way to do that. Unlike adrianp's method of splitting, mine doesn't care where the title is.
The third method may be a bit confusing so, allow me to explain it. First we split everywhere that title=" appears. We dump the first index of that list because it is everything before the first title. We then loop over the remainder and split on the first quote. Now the number you want is in the first index of that split. We do an inline pop to get that value so we can keep everything in a list comprehension, instead of expanding the entire loop and wrestling the values out with specific indexes.
To load the html remotely, uncomment the commented html var and replace "yourURL" with the proper one for you.
I think I have given you every possible way of doing this - certainly the most obvious ones.
from bs4 import BeautifulSoup
import re, requests
html = '<div class="bb-fl" style="background:Tomato;width:0.63px" title="10">​</div> \
<div class="bb-fl" style="background:SkyBlue;width:0.19px" title="3">​</div> \
<div class="bb-fl" style="background:Tomato;width:1.14px" title="18">​</div> \
<div class="bb-fl" style="background:SkyBlue;width:0.19px" title="3">​</div> \
<div class="bb-fl" style="background:Tomato;width:1.52px" title="24">​</div>'
#html = requests.get(yourURL).content
# possibility 1: BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
# assumes that all bb-fl classed divs have a title and all divs have a class
# you may need to disassemble this generator and add some extra checks
bs_titleval = [div['title'] for div in soup.find_all('div') if 'bb-fl' in div['class']]
print(bs_titleval)
# possibility 2: Regular Expressions ~ not the best way to go
# this isn't going to work if the tag attribute signature changes
title_re = re.compile('<div class="bb-fl" style="[^"]*" title="([0-9]+)">', re.I)
re_titleval = [m.group(1) for m in title_re.finditer(html)]
print(re_titleval)
# possibility 3: String Splitting ~
# probably the best method if there is nothing extra to weed out
title_sp = html.split('title="')
title_sp.pop(0) # get rid of first index
# title_sp is now ['10"></div>...', '3"></div>...', '18"></div>...', etc...]
sp_titleval = [s.split('"').pop(0) for s in title_sp]
print(sp_titleval)
Assuming that each div is saved as a string in the variable div, you can do the following:
number = div.split()[3].split('"')[1]
Each div should be in the same format for this to work.
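Applied to the whole list of div strings from the question, that splitting idea might look like this (a sketch, assuming each string contains exactly one title="..." attribute):

div_strings = [
    '<div class="bb-fl" style="background:Tomato;width:0.63px" title="10"></div>',
    '<div class="bb-fl" style="background:SkyBlue;width:0.19px" title="3"></div>',
]
numbers = []
for div in div_strings:
    # keep whatever sits between title=" and the next quote
    numbers.append(int(div.split('title="')[1].split('"')[0]))
print(numbers)  # [10, 3]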

Parse element's tail with requests-html

I want to parse an HTML document like this with requests-html 0.9.0:
from requests_html import HTML
html = HTML(html='<span><span class="data">important data</span> and some rubbish</span>')
data = html.find('.data', first=True)
print(data.html)
# <span class="data">important data</span> and some rubbish
print(data.text)
# important data and some rubbish
I need to distinguish the text inside the tag (enclosed by it) from the tag's tail (the text that follows the element up to the next tag). This is the behaviour I initially expected:
data.text == 'important data'
data.tail == ' and some rubbish'
But tail is not defined for Elements. Since requests-html provides access to inner lxml objects, we can try to get it from lxml.etree.Element.tail:
from lxml.etree import tostring
print(tostring(data.lxml))
# b'<html><span class="data">important data</span></html>'
print(data.lxml.tail is None)
# True
There's no tail in lxml representation! The tag with its inner text is OK, but the tail seems to be stripped away. How do I extract 'and some rubbish'?
Edit: I discovered that full_text provides the inner text only (so much for “full”). This enables a dirty hack of subtracting full_text from text, although I'm not positive it will work if there are any links.
print(data.full_text)
# important data
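For completeness, the subtraction hack described above could look like the sketch below; it assumes .text keeps the inner text as a literal prefix of the combined text, which may break once extra whitespace or nested links are involved:

# data.text      == 'important data and some rubbish'
# data.full_text == 'important data'
tail = data.text[len(data.full_text):]
print(repr(tail))  # ' and some rubbish'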
I'm not sure I've understood your problem, but if you just want to get 'and some rubbish' you can use the code below:
from requests_html import HTML
from lxml.html import fromstring
html = HTML(html='<span><span class="data">important data</span> and some rubbish</span>')
data = fromstring(html.html)
# or without using requests_html.HTML: data = fromstring('<span><span class="data">important data</span> and some rubbish</span>')
print(data.xpath('//span[span[@class="data"]]/text()')[-1]) # " and some rubbish"
NOTE that data = html.find('.data', first=True) returns the <span class="data">important data</span> node, which doesn't contain " and some rubbish" - that text is a child text node of the parent span!
The tail property exists on objects of type 'lxml.html.HtmlElement'.
I think what you are asking for is very easy to implement.
Here is a very simple example using requests_html and lxml:
from requests_html import HTML
html = HTML(html='<span><span class="data">important data</span> and some rubbish</span>')
data = html.find('span')
print(data[0].text) # important data and some rubbish
print(data[-1].text) # important data
print(data[-1].element.tail) # and some rubbish
The element attribute points to the 'lxml.html.HtmlElement' object.
Hope this helps.

Extracting a line of text using BeautifulSoup

I have two numbers (NUM1 and NUM2) that I am trying to extract across webpages that have the same format:
<div style="margin-left:0.5em;">
<div style="margin-bottom:0.5em;">
NUM1 and NUM2 are always followed by the same text across webpages
</div>
I am thinking that regex might be the way to go for these particular fields. Here's my attempt (borrowed from various sources):
def nums(self):
    nums_regex = re.compile(r'\d+ and \d+ are always followed by the same text across webpages')
    nums_match = nums_regex.search(self)
    nums_text = nums_match.group(0)
    digits = [int(s) for s in re.findall(r'\d+', nums_text)]
    return digits
By itself, outside of a function, this code works when specifying the actual source of the text (e.g., nums_regex.search(text)). However, I am modifying another person's code and I myself have never really worked with classes or functions before. Here's an example of their code:
@property
def title(self):
    tag = self.soup.find('span', class_='summary')
    title = unicode(tag.string)
    return title.strip()
As you might have guessed, my code isn't working. I get the error:
nums_match = nums_regex.search(self)
TypeError: expected string or buffer
It looks like I'm not feeding in the original text correctly, but how do I fix it?
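For what it's worth, the immediate TypeError goes away once the regex is given a string instead of self - for example the document text taken from the same soup attribute the title property uses. A hypothetical sketch inside that same class (assuming re is already imported):

@property
def nums(self):
    # self.soup is the BeautifulSoup object the other properties already use
    text = self.soup.get_text()
    nums_regex = re.compile(r'\d+ and \d+ are always followed by the same text across webpages')
    nums_match = nums_regex.search(text)
    return [int(s) for s in re.findall(r'\d+', nums_match.group(0))]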
You can use the same regular expression pattern to find with BeautifulSoup by text and then to extract the desired numbers:
import re
pattern = re.compile(r"(\d+) and (\d+) are always followed by the same text across webpages")
for elm in soup.find_all("div", text=pattern):
print(pattern.search(elm.text).groups())
Note that, since you are trying to match a part of text and not anything HTML structure related, I think it's pretty much okay to just apply your regular expression to the complete document instead.
Complete working code samples below.
With BeautifulSoup regex/"by text" search:
import re
from bs4 import BeautifulSoup
data = """<div style="margin-left:0.5em;">
<div style="margin-bottom:0.5em;">
10 and 20 are always followed by the same text across webpages
</div>
</div>
"""
soup = BeautifulSoup(data, "html.parser")
pattern = re.compile(r"(\d+) and (\d+) are always followed by the same text across webpages")
for elm in soup.find_all("div", text=pattern):
print(pattern.search(elm.text).groups())
Regex-only search:
import re
data = """<div style="margin-left:0.5em;">
<div style="margin-bottom:0.5em;">
10 and 20 are always followed by the same text across webpages
</div>
</div>
"""
pattern = re.compile(r"(\d+) and (\d+) are always followed by the same text across webpages")
print(pattern.findall(data)) # prints [('10', '20')]

get specific text such as "Something new" using BeautifulSoup Python

I am making a focused crawler and facing an issue while searching for a key phrase in the document.
Suppose the key phrase I want to search for in the document is "Something new".
Using BeautifulSoup with Python, I do the following:
if soup.find_all(text=re.compile("Something new", re.IGNORECASE)):
    print True
I want it to print true only for the following cases
"something new" --> true
"$#something new,." --> true
AND not for the following cases:
"thisSomething news" --> false
"Somethingnew" --> false
assuming special characters are allowed.
Has anyone ever done something like this before?
Thanks for the help.
Search for the lowercase something new and don't apply re.IGNORECASE:
import re
from bs4 import BeautifulSoup
data = """
<div>
<span>something new</span>
<span>$#something new,.</span>
<span>thisSomething news</span>
<span>Somethingnew</span>
</div>
"""
soup = BeautifulSoup(data)
for item in soup.find_all(text=re.compile("something new")):
    print item
Prints:
something new
$#something new,.
You can also take a non-regex approach and pass a function instead of a compiled regex pattern:
for item in soup.find_all(text=lambda x: 'something new' in x):
    print item
For the example HTML used above, it also prints:
something new
$#something new,.
This is one of the alternative methods that I used:
soup.find_all(text = re.compile("\\bSomething new\\b",re.IGNORECASE))
Thanks everyone.

How can i remove <p> </p> with python sub

I have an html file and I want to replace the empty paragraphs with a space.
mystring = "This <p></p><p>is a test</p><p></p><p></p>"
result = mystring.sub("<p></p>" , " ")
This is not working.
Please, don't try to parse HTML with regular expressions. Use a proper parsing module, like htmlparser or BeautifulSoup to achieve this. "Suffer" a short learning curve now and benefit:
Your parsing code will be more robust, handling corner cases you may not have considered that will fail with a regex
For future HTML parsing/munging tasks, you will be empowered to do things faster, so eventually the time investment pays off as well.
You won't be sorry! Profit guaranteed!
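To make that concrete, here is a short sketch with BeautifulSoup (bs4) that swaps empty <p> elements for a space, where "empty" is taken to mean no child tags and no non-whitespace text:

from bs4 import BeautifulSoup

mystring = "This <p></p><p>is a test</p><p></p><p></p>"
soup = BeautifulSoup(mystring, "html.parser")
for p in soup.find_all("p"):
    # empty: no child tags and nothing but whitespace inside
    if p.find(True) is None and not p.get_text(strip=True):
        p.replace_with(" ")
print(soup)  # prints: This  <p>is a test</p>  (with spaces left where the empty tags were)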
I think it's always nice to give an example of how to do this with a real parser, as well as just repeating the sound advice that Eli Bendersky gives in his answer.
Here's an example of how to remove empty <p> elements using lxml. lxml's HTMLParser deals with HTML very well.
from lxml import etree
from StringIO import StringIO

input = '''This <p> </p><p>is a test</p><p></p><p><b>Bye.</b></p>'''
parser = etree.HTMLParser()
tree = etree.parse(StringIO(input), parser)
for p in tree.xpath("//p"):
    if len(p):
        continue
    t = p.text
    if not (t and t.strip()):
        p.getparent().remove(p)
print etree.tostring(tree.getroot(), pretty_print=True)
... which produces the output:
<html>
<body>
<p>This </p>
<p>is a test</p>
<p>
<b>Bye.</b>
</p>
</body>
</html>
Note that I misread the question when replying to this, and I'm only removing the empty <p> elements, not replacing them with &nbsp. With lxml, I'm not sure of a simple way to do this, so I've created another question to ask:
How can one replace an element with text in lxml?
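One way to do the "replace with text" part directly in lxml is to fold the replacement (plus the removed element's tail) into the surrounding text before dropping the element. A sketch; the helper name is mine:

def replace_with_text(p, replacement=" "):
    parent = p.getparent()
    prev = p.getprevious()
    text = replacement + (p.tail or "")
    if prev is not None:
        prev.tail = (prev.tail or "") + text   # attach after the previous sibling
    else:
        parent.text = (parent.text or "") + text   # no sibling: attach to the parent's leading text
    parent.remove(p)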
I think for this particular problem a parsing module would be overkill.
Simply use the replace function:
>>> mystring = "This <p></p><p>is a test</p><p></p><p></p>"
>>> mystring.replace("<p></p>"," ")
'This <p>is a test</p> '
What if <p> is entered as <P>, or < p >, or has an attribute added, or is given using the empty tag syntax <P/>? Pyparsing's HTML tag support handles all of these variations:
from pyparsing import makeHTMLTags, replaceWith, withAttribute
mystring = 'This <p></p><p>is a test</p><p align="left"></p><P> </p><P/>'
p,pEnd = makeHTMLTags("P")
emptyP = p.copy().setParseAction(withAttribute(empty=True))
null_paragraph = emptyP | p+pEnd
null_paragraph.setParseAction(replaceWith(" "))
print null_paragraph.transformString(mystring)
Prints:
This <p>is a test</p>
Using a regexp?
import re
result = re.sub("<p>\s*</p>"," ", mystring, flags=re.MULTILINE)
compile the regexp if you use it often.
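If the substitution runs often, precompiling keeps the pattern object around; a small sketch (re.IGNORECASE added here so <P> is caught too):

import re

empty_p = re.compile(r"<p>\s*</p>", re.IGNORECASE)
mystring = "This <p></p><p>is a test</p><p> </p>"
result = empty_p.sub(" ", mystring)
print(result)  # This  <p>is a test</p>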
I wrote this code:
from lxml import etree
from StringIO import StringIO

html_tags = """<div><ul><li>PID temperature controller</li> <li>Smart and reliable</li> <li>Auto-diagnosing</li> <li>Auto setting</li> <li>Intelligent control</li> <li>2-Rows 4-Digits LED display</li> <li>Widely applied in the display and control of the parameter of temperature, pressure, flow, and liquid level</li> <li> </li> <p> </p></ul> <div> </div></div>"""
document = etree.iterparse(StringIO(html_tags), html=True)
for a, e in document:
    if not (e.text and e.text.strip()) and len(e) == 0:
        e.getparent().remove(e)
print etree.tostring(document.root)
