I have an html file and I want to replace the empty paragraphs with a space.
mystring = "This <p></p><p>is a test</p><p></p><p></p>"
result = mystring.sub("<p></p>" , " ")
This is not working.
Please, don't try to parse HTML with regular expressions. Use a proper parsing module, like htmlparser or BeautifulSoup to achieve this. "Suffer" a short learning curve now and benefit:
Your parsing code will be more robust, handling corner cases you may not have considered that will fail with a regex
For future HTML parsing/munging tasks, you will be empowered to do things faster, so eventually the time investment pays off as well.
You won't be sorry! Profit guaranteed!
I think it's always nice to give an example of how to do this with a real parser, as well as just repeating the sound advice that Eli Bendersky gives in his answer.
Here's an example of how to remove empty <p> elements using lxml. lxml's HTMLParser deals with HTML very well.
from lxml import etree
from StringIO import StringIO
input = '''This <p> </p><p>is a test</p><p></p><p><b>Bye.</b></p>'''
parser = etree.HTMLParser()
tree = etree.parse(StringIO(input), parser)
for p in tree.xpath("//p"):
if len(p):
continue
t = p.text
if not (t and t.strip()):
p.getparent().remove(p)
print etree.tostring(tree.getroot(), pretty_print=True)
... which produces the output:
<html>
<body>
<p>This </p>
<p>is a test</p>
<p>
<b>Bye.</b>
</p>
</body>
</html>
Note that I misread the question when replying to this, and I'm only removing the empty <p> elements, not replacing them with  . With lxml, I'm not sure of a simple way to do this, so I've created another question to ask:
How can one replace an element with text in lxml?
I think for this particular problem a parsing module would be overkill
simply that function:
>>> mystring = "This <p></p><p>is a test</p><p></p><p></p>"
>>> mystring.replace("<p></p>"," ")
'This <p>is a test</p> '
What if <p> is entered as <P>, or < p >, or has an attribute added, or is given using the empty tag syntax <P/>? Pyparsing's HTML tag support handles all of these variations:
from pyparsing import makeHTMLTags, replaceWith, withAttribute
mystring = 'This <p></p><p>is a test</p><p align="left"></p><P> </p><P/>'
p,pEnd = makeHTMLTags("P")
emptyP = p.copy().setParseAction(withAttribute(empty=True))
null_paragraph = emptyP | p+pEnd
null_paragraph.setParseAction(replaceWith(" "))
print null_paragraph.transformString(mystring)
Prints:
This <p>is a test</p>
using regexp ?
import re
result = re.sub("<p>\s*</p>"," ", mystring, flags=re.MULTILINE)
compile the regexp if you use it often.
I wrote that code:
from lxml import etree
from StringIO import StringIO
html_tags = """<div><ul><li>PID temperature controller</li> <li>Smart and reliable</li> <li>Auto-diagnosing</li> <li>Auto setting</li> <li>Intelligent control</li> <li>2-Rows 4-Digits LED display</li> <li>Widely applied in the display and control of the parameter of temperature, pressure, flow, and liquid level</li> <li> </li> <p> </p></ul> <div> </div></div>"""
document = etree.iterparse(StringIO(html_tags), html=True)
for a, e in document:
if not (e.text and e.text.strip()) and len(e) == 0:
e.getparent().remove(e)
print etree.tostring(document.root)
Related
I want to parse an HTML document like this with requests-html 0.9.0:
from requests_html import HTML
html = HTML(html='<span><span class="data">important data</span> and some rubbish</span>')
data = html.find('.data', first=True)
print(data.html)
# <span class="data">important data</span> and some rubbish
print(data.text)
# important data and some rubbish
I need to distinguish the text inside the tag (enclosed by it) from the tag's tail (the text that follows the element up to the next tag). This is the behaviour I initially expected:
data.text == 'important data'
data.tail == ' and some rubbish'
But tail is not defined for Elements. Since requests-html provides access to inner lxml objects, we can try to get it from lxml.etree.Element.tail:
from lxml.etree import tostring
print(tostring(data.lxml))
# b'<html><span class="data">important data</span></html>'
print(data.lxml.tail is None)
# True
There's no tail in lxml representation! The tag with its inner text is OK, but the tail seems to be stripped away. How do I extract 'and some rubbish'?
Edit: I discovered that full_text provides the inner text only (so much for “full”). This enables a dirty hack of subtracting full_text from text, although I'm not positive it will work if there are any links.
print(data.full_text)
# important data
I'm not sure I've understood your problem, but if you just want to get 'and some rubbish' you can use below code:
from requests_html import HTML
from lxml.html import fromstring
html = HTML(html='<span><span class="data">important data</span> and some rubbish</span>')
data = fromstring(html.html)
# or without using requests_html.HTML: data = fromstring('<span><span class="data">important data</span> and some rubbish</span>')
print(data.xpath('//span[span[#class="data"]]/text()')[-1]) # " and some rubbish"
NOTE that data = html.find('.data', first=True) returns you <span class="data">important data</span> node which doesn't contain " and some rubbish" - it's a text child node of parent span!
the tail property exists with objects of type 'lxml.html.HtmlElement'.
I think what you are asking for is very easy to implement.
Here is a very simple example using requests_html and lxml:
from requests_html import HTML
html = HTML(html='<span><span class="data">important data</span> and some rubbish</span>')
data = html.find('span')
print (data[0].text) # important data and some rubbish
print (data[-1].text) # important data
print (data[-1].element.tail) # and some rubbish
The element attribute points to the 'lxml.html.HtmlElement' object.
Hope this helps.
I have two numbers (NUM1; NUM2) I am trying to extract across webpages that have the same format:
<div style="margin-left:0.5em;">
<div style="margin-bottom:0.5em;">
NUM1 and NUM2 are always followed by the same text across webpages
</div>
I am thinking that regex might be the way to go for these particular fields. Here's my attempt (borrowed from various sources):
def nums(self):
nums_regex = re.compile(r'\d+ and \d+ are always followed by the same text across webpages')
nums_match = nums_regex.search(self)
nums_text = nums_match.group(0)
digits = [int(s) for s in re.findall(r'\d+', nums_text)]
return digits
By itself, outside of a function, this code works when specifying the actual source of the text (e.g., nums_regex.search(text)). However, I am modifying another person's code and I myself have never really worked with classes or functions before. Here's an example of their code:
#property
def title(self):
tag = self.soup.find('span', class_='summary')
title = unicode(tag.string)
return title.strip()
As you might have guessed, my code isn't working. I get the error:
nums_match = nums_regex.search(self)
TypeError: expected string or buffer
It looks like I'm not feeding in the original text correctly, but how do I fix it?
You can use the same regular expression pattern to find with BeautifulSoup by text and then to extract the desired numbers:
import re
pattern = re.compile(r"(\d+) and (\d+) are always followed by the same text across webpages")
for elm in soup.find_all("div", text=pattern):
print(pattern.search(elm.text).groups())
Note that, since you are trying to match a part of text and not anything HTML structure related, I think it's pretty much okay to just apply your regular expression to the complete document instead.
Complete working sample code samples below.
With BeautifulSoup regex/"by text" search:
import re
from bs4 import BeautifulSoup
data = """<div style="margin-left:0.5em;">
<div style="margin-bottom:0.5em;">
10 and 20 are always followed by the same text across webpages
</div>
</div>
"""
soup = BeautifulSoup(data, "html.parser")
pattern = re.compile(r"(\d+) and (\d+) are always followed by the same text across webpages")
for elm in soup.find_all("div", text=pattern):
print(pattern.search(elm.text).groups())
Regex-only search:
import re
data = """<div style="margin-left:0.5em;">
<div style="margin-bottom:0.5em;">
10 and 20 are always followed by the same text across webpages
</div>
</div>
"""
pattern = re.compile(r"(\d+) and (\d+) are always followed by the same text across webpages")
print(pattern.findall(data)) # prints [('10', '20')]
I am making a focused crawler and facing an issue while find a for a key phrase in the document.
Supposing the key phrase I want to search in the document is "Something new"
using BeautifulSoup with python I do the following
if soup.find_all(text = re.compile("Something new",re.IGNORECASE)):
print true
I want it to print true only for the following cases
"something new" --> true
"$#something new,." --> true
AND not for the following cases:
"thisSomething news" --> false
"Somethingnew" --> false
assuming special characters are allowed.
Has anyone ever done something like this before. ??
Thanks for the help.
Then, search for something new and don't apply re.IGNORECASE:
import re
from bs4 import BeautifulSoup
data = """
<div>
<span>something new</span>
<span>$#something new,.</span>
<span>thisSomething news</span>
<span>Somethingnew</span>
</div>
"""
soup = BeautifulSoup(data)
for item in soup.find_all(text=re.compile("something new")):
print item
Prints:
something new
$#something new,.
You can also take a non-regex approach and pass a function instead of a compiled regex pattern:
for item in soup.find_all(text=lambda x: 'something new' in x):
print item
For the example HTML used above, it also prints:
something new
$#something new,.
This is one of the alternative methods that I used:
soup.find_all(text = re.compile("\\bSomething new\\b",re.IGNORECASE))
Thanks everyone.
I am trying to remove all the html surrounding the data that I seek from a webpage so that all that is left is the raw data that I will then be able to input into a database. so if I have something like:
<p class="location"> Atlanta, GA </p>
The following code would return
Atlanta, GA </p>
But what I expect is not what is returned. This is a more specific solution to the basic problem I found here. Any help would be appreciated, thanks! Code is found below.
def delHTML(self, html):
"""
html is a list made up of items with data surrounded by html
this function should get rid of the html and return the data as a list
"""
for n,i in enumerate(html):
if i==re.match('<p class="location">',str(html[n])):
html[n]=re.sub('<p class="location">', '', str(html[n]))
return html
As rightfully pointed out in the comments, you should be using a specific library to parse HTML and extract text, here are some examples:
html2text: Limited functionnality, but exactly what you need.
BeautifulSoup: More complex, more powerful.
Assuming all you want is to extract the data contained in <p class="location"> tags, you could use a quick & dirty (but correct) approach with the Python HTMLParser module (a simple HTML SAX parser), like this:
from HTMLParser import HTMLParser
class MyHTMLParser(HTMLParser):
PLocationID=0
PCount=0
buf=""
out=[]
def handle_starttag(self, tag, attrs):
if tag=="p":
self.PCount+=1
if ("class", "location") in attrs and self.PLocationID==0:
self.PLocationID=self.PCount
def handle_endtag(self, tag):
if tag=="p":
if self.PLocationID==self.PCount:
self.out.append(self.buf)
self.buf=""
self.PLocationID=0
self.PCount-=1
def handle_data(self, data):
if self.PLocationID:
self.buf+=data
# instantiate the parser and fed it some HTML
parser = MyHTMLParser()
parser.feed("""
<html>
<body>
<p>This won't appear!</p>
<p class="location">This <b>will</b></p>
<div>
<p class="location">This <span class="someclass">too</span></p>
<p>Even if <p class="location">nested Ps <p class="location"><b>shouldn't</b> <p>be allowed</p></p> <p>this will work</p></p> (this last text is out!)</p>
</div>
</body>
</html>
""")
print parser.out
Output:
['This will', 'This too', "nested Ps shouldn't be allowed this will work"]
This will extract all the text contained inside any <p class="location"> tag, stripping all the tags inside it. Separate tags (if not nested - which shouldn't be allowed anyhow for paragraphs) will have a separate entry in the out list.
Notice that for more complex requirements this can easily get out of hand; in those cases a DOM parser is way more appropriate.
Let's say I have an HTML with <p> and <br> tags inside. Aftewards, I'm going to strip the HTML to clean up the tags. How can I turn them into line breaks?
I'm using Python's BeautifulSoup library, if that helps at all.
Without some specifics, it's hard to be sure this does exactly what you want, but this should give you the idea... it assumes your b tags are wrapped inside p elements.
from BeautifulSoup import BeautifulSoup
import six
def replace_with_newlines(element):
text = ''
for elem in element.recursiveChildGenerator():
if isinstance(elem, six.string_types):
text += elem.strip()
elif elem.name == 'br':
text += '\n'
return text
page = """<html>
<body>
<p>America,<br>
Now is the<br>time for all good men to come to the aid<br>of their country.</p>
<p>pile on taxpayer debt<br></p>
<p>Now is the<br>time for all good men to come to the aid<br>of their country.</p>
</body>
</html>
"""
soup = BeautifulSoup(page)
lines = soup.find("body")
for line in lines.findAll('p'):
line = replace_with_newlines(line)
print line
Running this results in...
(py26_default)[mpenning#Bucksnort ~]$ python thing.py
America,
Now is the
time for all good men to come to the aid
of their country.
pile on taxpayer debt
Now is the
time for all good men to come to the aid
of their country.
(py26_default)[mpenning#Bucksnort ~]$
get_text seems to do what you need
>>> from bs4 import BeautifulSoup
>>> doc = "<p>This is a paragraph.</p><p>This is another paragraph.</p>"
>>> soup = BeautifulSoup(doc)
>>> soup.get_text(separator="\n")
u'This is a paragraph.\nThis is another paragraph.'
This a python3 version of #Mike Pennington's Answer(it really helps),I did a litter refactor.
def replace_with_newlines(element):
text = ''
for elem in element.recursiveChildGenerator():
if isinstance(elem, str):
text += elem.strip()
elif elem.name == 'br':
text += '\n'
return text
def get_plain_text(soup):
plain_text = ''
lines = soup.find("body")
for line in lines.findAll('p'):
line = replace_with_newlines(line)
plain_text+=line
return plain_text
To use this,just pass the Beautifulsoup object to get_plain_text methond.
soup = BeautifulSoup(page)
plain_text = get_plain_text(soup)
I use the following small library to accomplish this:
https://github.com/TeamHG-Memex/html-text
pip install html-text
As simple as:
>>> import html_text
>>> html_text.extract_text('<h1>Hello</h1> world!')
'Hello\n\nworld!'
I'm not fully sure what you're trying to accomplish but if you're just trying to remove the HTML elements, I would just use a program like Notepad2 and use the Replace All function - I think you can also insert a new line using Replace All as well. Make sure if you replace the <p> element that you also remove the closing as well (</p>). Additionally just an FYI the proper HTML5 is <br /> instead of <br> but it doesn't really matter. Python wouldn't be my first choice for this so it's a little out of my area of knowledge, sorry I couldn't help more.