I am trying to remove all the HTML surrounding the data I want from a webpage, so that all that is left is the raw data, which I can then insert into a database. So if I have something like:
<p class="location"> Atlanta, GA </p>
The following code would return
Atlanta, GA </p>
But what I expect is not what is returned. This is a more specific solution to the basic problem I found here. Any help would be appreciated, thanks! Code is found below.
def delHTML(self, html):
    """
    html is a list made up of items with data surrounded by html
    this function should get rid of the html and return the data as a list
    """
    for n,i in enumerate(html):
        if i==re.match('<p class="location">',str(html[n])):
            html[n]=re.sub('<p class="location">', '', str(html[n]))
    return html
As rightfully pointed out in the comments, you should use a dedicated library to parse HTML and extract text. Here are some examples:
html2text: Limited functionality, but exactly what you need.
BeautifulSoup: More complex, more powerful.
Assuming all you want is to extract the data contained in <p class="location"> tags, you could use a quick & dirty (but correct) approach with the Python HTMLParser module (a simple HTML SAX parser), like this:
from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):
    PLocationID=0
    PCount=0
    buf=""
    out=[]

    def handle_starttag(self, tag, attrs):
        if tag=="p":
            self.PCount+=1
            if ("class", "location") in attrs and self.PLocationID==0:
                self.PLocationID=self.PCount

    def handle_endtag(self, tag):
        if tag=="p":
            if self.PLocationID==self.PCount:
                self.out.append(self.buf)
                self.buf=""
                self.PLocationID=0
            self.PCount-=1

    def handle_data(self, data):
        if self.PLocationID:
            self.buf+=data

# instantiate the parser and feed it some HTML
parser = MyHTMLParser()
parser.feed("""
<html>
<body>
<p>This won't appear!</p>
<p class="location">This <b>will</b></p>
<div>
<p class="location">This <span class="someclass">too</span></p>
<p>Even if <p class="location">nested Ps <p class="location"><b>shouldn't</b> <p>be allowed</p></p> <p>this will work</p></p> (this last text is out!)</p>
</div>
</body>
</html>
""")
print parser.out
Output:
['This will', 'This too', "nested Ps shouldn't be allowed this will work"]
This will extract all the text contained inside any <p class="location"> tag, stripping all the tags inside it. Separate tags (if not nested - which shouldn't be allowed anyhow for paragraphs) will have a separate entry in the out list.
Notice that for more complex requirements this can easily get out of hand; in those cases a DOM parser is way more appropriate.
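On Python 3 the module is named html.parser. The sketch below is one way the same counting idea might look there; the class name LocationParser and its attribute names are my own, state is kept per instance rather than on the class, and the class attribute must be exactly "location" for a paragraph to be captured.

```python
from html.parser import HTMLParser


class LocationParser(HTMLParser):
    """Collect the text inside every <p class="location"> element."""

    def __init__(self):
        super().__init__()
        self.depth = 0          # number of <p> tags currently open
        self.capture_depth = 0  # depth of the <p class="location"> being captured, 0 if none
        self.buf = []           # text chunks of the paragraph being captured
        self.out = []           # finished paragraphs

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.depth += 1
            if ("class", "location") in attrs and not self.capture_depth:
                self.capture_depth = self.depth

    def handle_endtag(self, tag):
        if tag == "p":
            # Close the capture only when the matching <p> ends.
            if self.capture_depth and self.capture_depth == self.depth:
                self.out.append("".join(self.buf).strip())
                self.buf = []
                self.capture_depth = 0
            self.depth -= 1

    def handle_data(self, data):
        if self.capture_depth:
            self.buf.append(data)


parser = LocationParser()
parser.feed('<p class="location"> Atlanta, GA </p><p>skipped</p>'
            '<p class="location">Boston, <b>MA</b></p>')
parser.close()
print(parser.out)  # ['Atlanta, GA', 'Boston, MA']
```

Keeping out and buf on the instance means two parsers never share results, which the class-level list in the original version would do.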
To support both a JPEG and WEBP compressed image, I'd like to include the following HTML code in a web page:
<picture>
<source srcset="img/awesomeWebPImage.webp" type="image/webp">
<source srcset="img/creakyOldJPEG.jpg" type="image/jpeg">
<img src="img/creakyOldJPEG.jpg" alt="Alt Text!">
</picture>
I've been using Python Dominate and it has generally worked well for me.
But I think the picture and source tags are not supported by Dominate.
I could add the HTML as a raw() Dominate tag, but was wondering if there was a way to get Dominate to recognize these tags.
p = picture()
with p:
    source(srcset=image.split('.')[0]+'.webp', type="image/webp")
    source(srcset=image, type="image/jpeg")
    img(src=image, alt=imagealt)
I am seeing this kind of error:
p = picture()
NameError: global name 'picture' is not defined
You can create a picture class by inheriting from the dominate.tags.html_tag class:
from dominate.tags import html_tag

class picture(html_tag):
    pass
This can now be used as any of the predefined tags.
Dominate is used to generate HTML(5) documents.
The list of elements is defined in the tags.py file; see the repository on GitHub: https://github.com/Knio/dominate/blob/master/dominate/tags.py.
However, picture is not in that list.
You may look at the lxml library, which contains an ElementMaker similar to Dominate for building XML trees easily. See the E-factory.
For instance:
>>> from lxml import etree
>>> from lxml.builder import E
>>> def CLASS(*args): # class is a reserved word in Python
...     return {"class":' '.join(args)}
>>> html = page = (
...   E.html(       # create an Element called "html"
...     E.head(
...       E.title("This is a sample document")
...     ),
...     E.body(
...       E.h1("Hello!", CLASS("title")),
...       E.p("This is a paragraph with ", E.b("bold"), " text in it!"),
...       E.p("This is another paragraph, with a", "\n      ",
...         E.a("link", href="http://www.python.org"), "."),
...       E.p("Here are some reserved characters: <spam&egg>."),
...       etree.XML("<p>And finally an embedded XHTML fragment.</p>"),
...     )
...   )
... )
>>> print(etree.tostring(page, pretty_print=True))
<html>
  <head>
    <title>This is a sample document</title>
  </head>
  <body>
    <h1 class="title">Hello!</h1>
    <p>This is a paragraph with <b>bold</b> text in it!</p>
    <p>This is another paragraph, with a
      <a href="http://www.python.org">link</a>.</p>
    <p>Here are some reserved characters: &lt;spam&amp;egg&gt;.</p>
    <p>And finally an embedded XHTML fragment.</p>
  </body>
</html>
I have the following code for parsing some HTML. I need to save the output (the HTML result) as a single line, with escape sequences such as \n kept literally. But either I get a representation from repr() that I can't use because of the surrounding single quotes, or the output is written across multiple lines, with the escape sequences interpreted, like so:
<section class="prog__container">
<span class="prog__sub">Title</span>
<p>PEP 336 - Make None Callable</p>
<span class="prog__sub">Description</span>
<p>
<p>
<code>
None
</code>
should be a callable object that when called with any
arguments has no side effect and returns
<code>
None
</code>
.
</p>
</p>
</section>
What I require (including the escape sequences):
<section class="prog__container">\n <span class="prog__sub">Title</span>\n <p>PEP 336 - Make None Callable</p>\n <span class="prog__sub">Description</span>\n <p>\n <p>\n <code>\n None\n </code>\n should be a callable object that when called with any\n arguments has no side effect and returns\n <code>\n None\n </code>\n .\n </p>\n </p>\n </section>
My Code
soup = BeautifulSoup(html, "html.parser")
for match in soup.findAll(['div']):
    match.unwrap()
for match in soup.findAll(['a']):
    match.unwrap()
html = soup.contents[0]
html = str(html)
html = html.splitlines(True)
html = " ".join(html)
html = re.sub(re.compile("\n"), "\\n", html)
html = repr(html) # my current solution works, but unusable
The above is my solution, but an object representation is no good, I need the string representation. How can I achieve this?
Why not just use repr?
a = """this is the first line
this is the second line"""
print repr(a)
Or even, if I understand correctly that you want the exact output without the surrounding quotes:
print repr(a).strip("'")
Output:
'this is the first line\nthis is the second line'
this is the first line\nthis is the second line
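One caveat with repr(a).strip("'"): repr switches to double quotes when the string itself contains a single quote, and strip also removes quotes that belong to the data at either end. A sketch of two alternatives that never add quotes in the first place (Python 3; the helper name escape_newlines is my own):

```python
def escape_newlines(text):
    """Return text on one line, with every newline written as the two characters \\n."""
    return text.replace("\r\n", "\n").replace("\n", "\\n")


a = """this is the first line
this is the second line"""

one_line = escape_newlines(a)
print(one_line)

# The unicode_escape codec goes further: it also escapes tabs, backslashes
# and non-ASCII characters, which may or may not be what is wanted.
escaped = a.encode("unicode_escape").decode("ascii")
print(escaped)
```

Both print this is the first line\nthis is the second line for this input; they differ only once the text contains other characters the codec escapes.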
import bs4
html = '''<section class="prog__container">
<span class="prog__sub">Title</span>
<p>PEP 336 - Make None Callable</p>
<span class="prog__sub">Description</span>
<p>
<p>
<code>
None
</code>
should be a callable object that when called with any
arguments has no side effect and returns
<code>
None
</code>
.
</p>
</p>
</section>'''
soup = bs4.BeautifulSoup(html, 'lxml')
str(soup)
out:
'<html><body><section class="prog__container">\n<span class="prog__sub">Title</span>\n<p>PEP 336 - Make None Callable</p>\n<span class="prog__sub">Description</span>\n<p>\n</p><p>\n<code>\n None\n </code>\n should be a callable object that when called with any\n arguments has no side effect and returns\n <code>\n None\n </code>\n .\n </p>\n</section></body></html>'
There is a more involved way to get the HTML code of a document:
from bs4 import BeautifulSoup
import urllib.request
r = urllib.request.urlopen('https://www.example.com')
soup = BeautifulSoup(r.read(), 'html.parser')
html = str(soup)
This will give you the HTML as one string, with lines separated by \n.
I have a little bit of screen scraping code in Python, using BeautifulSoup, that is giving me a headache. A small change to the HTML made my code break, but I can't see why it fails to work. This is basically a demo of how the HTML looked when parsed:
soup=BeautifulSoup("""
<td>
<a href="https://alink.com">
Foo Some text Bar
</a>
</td>
""")
links = soup.find_all('a',text=re.compile('Some text'))
links[0]['href'] # => "https://alink.com"
After an upgrade, the a tag body now includes an img tag, which makes the code break.
<td>
<a href="https://alink.com">
<img src="dummy.gif" >
Foo Some text Bar
</a>
</td>
'links' is now an empty list, so the regex is not finding anything.
I hacked around it by matching on the text alone, then finding
its parent, but that seems even more fragile:
links = soup.find_all(text=re.compile('Some text'))
links[0].parent['href'] # => "https://alink.com"
Why is the addition of an img tag as a sibling to the text
content breaking the search done by BeautifulSoup, and is there
a way of modifying the first code to work?
The difference is that the 2nd example has an incomplete img tag:
it should be either
<img src="dummy.gif" />
Foo Some text Bar
or
<img src="dummy.gif" > </img>
Foo Some text Bar
Instead, it is parsed as
<img src="dummy.gif" >
Foo Some text Bar
</img>
So the element found isn't a any longer, but img, whose parent is a.
The first example works only if a.string is not None i.e., iff the text is the only child.
As a workaround, you could use a function predicate:
a = soup.find(lambda tag: tag.name == 'a' and tag.has_attr('href') and 'Some text' in tag.text)
print(a['href'])
# -> 'https://alink.com'
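The point about a.string can be checked directly. A small sketch, assuming bs4 is installed and using the built-in html.parser backend (other parsers may build a different tree for the unclosed img); the variable names are mine:

```python
import re

from bs4 import BeautifulSoup

plain = BeautifulSoup('<a href="https://alink.com">Foo Some text Bar</a>',
                      "html.parser")
mixed = BeautifulSoup(
    '<a href="https://alink.com"><img src="dummy.gif">Foo Some text Bar</a>',
    "html.parser",
)

pattern = re.compile("Some text")

# With a single text child, .string is that text and the filter matches.
print(plain.a.string)
print(len(plain.find_all("a", string=pattern)))

# With an img next to the text, .string is None, so the same filter finds nothing.
print(mixed.a.string)
print(len(mixed.find_all("a", string=pattern)))

# The function predicate looks at .text, which joins all descendant strings.
link = mixed.find(lambda tag: tag.name == "a" and tag.has_attr("href")
                  and "Some text" in tag.text)
print(link["href"])
```

The string argument is the current name for what the question passes as text.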
How can I add an optional subexpression to a regular expression in Python?
That is, how do I indicate that some HTML code may or may not appear in the text?
I'm making an API for filmaffinity and want to write a RE to filter search results, but I have problems.
In the HTML of some results there is a rating image, and in others there isn't. I would like to add a subexpression to the RE so that where the image appears it captures the movie's rating (an integer), and where it doesn't it returns an empty string.
For example, this is a section of the results HTML:
<div class="mc-title">Movie Name (2012) <img src="/imgs/countries/CF.jpg" title="Country Name"></div>
<img src="http://www.filmaffinity.com/imgs/ratings/8.png" border="0" alt="Notable" > <div class="mc-director">Some Director</div>
In this other HTML fragment, the rating img tag is absent:
<div class="mc-title">Another movie name (2015) <img src="/imgs/countries/XY.jpg" title="Another Country"></div>
<div class="mc-director">Another director</div>
So... I need a RE that returns this:
>>>R=findall(expression, html_Code)
>>>print R
[('111111', 'Movie Name', '2012', '8', 'Some Director'), ('000000', 'Another Movie Name', '2015', '', 'Another director')]
Note that in the second tuple, there is not a rating... only a '' string.
My poor RE is this:
<div class="mc-title">([^<]*)\s*\((\d{4})\)\s*<img src="/imgs/countries/([A-Z]{2}).jpg" title="[^"]*"></div>\s*<img src="http://www.filmaffinity.com/imgs/ratings/(\d+).png" border="0" alt="\w*" ?>\s*<div class="mc-director">[^<]*</div>
For parsing HTML, I find BeautifulSoup better than using straight regular expressions. There's also PyQuery which seems nice, but I've never used it.
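For the literal question of making part of a pattern optional, a non-capturing group followed by ? is the usual tool. A minimal sketch on the two fragments from the question; the pattern is a simplified one of my own (it does not capture the numeric id shown in the expected output, which is not present in the HTML given, and it assumes titles contain no parentheses):

```python
import re

html_code = '''
<div class="mc-title">Movie Name (2012) <img src="/imgs/countries/CF.jpg" title="Country Name"></div>
<img src="http://www.filmaffinity.com/imgs/ratings/8.png" border="0" alt="Notable" > <div class="mc-director">Some Director</div>
<div class="mc-title">Another movie name (2015) <img src="/imgs/countries/XY.jpg" title="Another Country"></div>
<div class="mc-director">Another director</div>
'''

pattern = re.compile(
    # title, year, then the country flag image
    r'<div class="mc-title">([^<(]*?)\s*\((\d{4})\)\s*<img[^>]*></div>\s*'
    # optional rating image: (?: ... )? may match zero times
    r'(?:<img src="[^"]*/ratings/(\d+)\.png"[^>]*>\s*)?'
    # director
    r'<div class="mc-director">([^<]*)</div>'
)

results = pattern.findall(html_code)
print(results)
```

When the optional group does not take part in a match, findall fills the corresponding slot with an empty string, so the second tuple comes back as ('Another movie name', '2015', '', 'Another director').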
I have an html file and I want to replace the empty paragraphs with a space.
mystring = "This <p></p><p>is a test</p><p></p><p></p>"
result = mystring.sub("<p></p>" , " ")
This is not working.
Please, don't try to parse HTML with regular expressions. Use a proper parsing module, like htmlparser or BeautifulSoup to achieve this. "Suffer" a short learning curve now and benefit:
Your parsing code will be more robust, handling corner cases you may not have considered that will fail with a regex
For future HTML parsing/munging tasks, you will be empowered to do things faster, so eventually the time investment pays off as well.
You won't be sorry! Profit guaranteed!
I think it's always nice to give an example of how to do this with a real parser, as well as just repeating the sound advice that Eli Bendersky gives in his answer.
Here's an example of how to remove empty <p> elements using lxml. lxml's HTMLParser deals with HTML very well.
from lxml import etree
from StringIO import StringIO

input = '''This <p> </p><p>is a test</p><p></p><p><b>Bye.</b></p>'''

parser = etree.HTMLParser()
tree = etree.parse(StringIO(input), parser)

for p in tree.xpath("//p"):
    if len(p):
        continue
    t = p.text
    if not (t and t.strip()):
        p.getparent().remove(p)

print etree.tostring(tree.getroot(), pretty_print=True)
... which produces the output:
<html>
<body>
<p>This </p>
<p>is a test</p>
<p>
<b>Bye.</b>
</p>
</body>
</html>
Note that I misread the question when replying to this, and I'm only removing the empty <p> elements, not replacing them with a space. With lxml, I'm not sure of a simple way to do this, so I've created another question to ask:
How can one replace an element with text in lxml?
I think a parsing module would be overkill for this particular problem; str.replace is enough:
>>> mystring = "This <p></p><p>is a test</p><p></p><p></p>"
>>> mystring.replace("<p></p>"," ")
'This <p>is a test</p> '
What if <p> is entered as <P>, or < p >, or has an attribute added, or is given using the empty tag syntax <P/>? Pyparsing's HTML tag support handles all of these variations:
from pyparsing import makeHTMLTags, replaceWith, withAttribute
mystring = 'This <p></p><p>is a test</p><p align="left"></p><P> </p><P/>'
p,pEnd = makeHTMLTags("P")
emptyP = p.copy().setParseAction(withAttribute(empty=True))
null_paragraph = emptyP | p+pEnd
null_paragraph.setParseAction(replaceWith(" "))
print null_paragraph.transformString(mystring)
Prints:
This <p>is a test</p>
Using a regexp?
import re
result = re.sub(r"<p>\s*</p>", " ", mystring, flags=re.MULTILINE)
Compile the regexp if you use it often.
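Compiling once might look like the sketch below. The pattern is my own slightly broader variant: it ignores case and tolerates attributes on the opening tag, but it is still a regex, so it leaves forms such as <P/> or < p > untouched. The name EMPTY_P is an assumption for illustration.

```python
import re

# Compiled once, reused for every document.
EMPTY_P = re.compile(r"<p\b[^>]*>\s*</p>", re.IGNORECASE)

mystring = 'This <p></p><p>is a test</p><p align="left"></p><P> </p><P/>'
result = EMPTY_P.sub(" ", mystring)
print(repr(result))
```

The \b after p keeps the pattern from matching other tags that merely start with the letter p, such as <pre>.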
I wrote that code:
from lxml import etree
from StringIO import StringIO

html_tags = """<div><ul><li>PID temperature controller</li> <li>Smart and reliable</li> <li>Auto-diagnosing</li> <li>Auto setting</li> <li>Intelligent control</li> <li>2-Rows 4-Digits LED display</li> <li>Widely applied in the display and control of the parameter of temperature, pressure, flow, and liquid level</li> <li> </li> <p> </p></ul> <div> </div></div>"""

document = etree.iterparse(StringIO(html_tags), html=True)

for a, e in document:
    if not (e.text and e.text.strip()) and len(e) == 0:
        e.getparent().remove(e)

print etree.tostring(document.root)