I have two numbers (NUM1; NUM2) I am trying to extract across webpages that have the same format:
<div style="margin-left:0.5em;">
<div style="margin-bottom:0.5em;">
NUM1 and NUM2 are always followed by the same text across webpages
</div>
I am thinking that regex might be the way to go for these particular fields. Here's my attempt (borrowed from various sources):
def nums(self):
    nums_regex = re.compile(r'\d+ and \d+ are always followed by the same text across webpages')
    nums_match = nums_regex.search(self)
    nums_text = nums_match.group(0)
    digits = [int(s) for s in re.findall(r'\d+', nums_text)]
    return digits
By itself, outside of a function, this code works when specifying the actual source of the text (e.g., nums_regex.search(text)). However, I am modifying another person's code and I myself have never really worked with classes or functions before. Here's an example of their code:
@property
def title(self):
    tag = self.soup.find('span', class_='summary')
    title = unicode(tag.string)
    return title.strip()
As you might have guessed, my code isn't working. I get the error:
nums_match = nums_regex.search(self)
TypeError: expected string or buffer
It looks like I'm not feeding in the original text correctly, but how do I fix it?
You can use the same regular expression pattern both to find the element with BeautifulSoup's "by text" search and to extract the desired numbers:
import re
pattern = re.compile(r"(\d+) and (\d+) are always followed by the same text across webpages")
for elm in soup.find_all("div", text=pattern):
    print(pattern.search(elm.text).groups())
Note that, since you are trying to match a part of the text and not anything related to the HTML structure, I think it's pretty much okay to just apply your regular expression to the complete document instead.
Complete working code samples below.
With BeautifulSoup regex/"by text" search:
import re
from bs4 import BeautifulSoup
data = """<div style="margin-left:0.5em;">
<div style="margin-bottom:0.5em;">
10 and 20 are always followed by the same text across webpages
</div>
</div>
"""
soup = BeautifulSoup(data, "html.parser")
pattern = re.compile(r"(\d+) and (\d+) are always followed by the same text across webpages")
for elm in soup.find_all("div", text=pattern):
    print(pattern.search(elm.text).groups())
Regex-only search:
import re
data = """<div style="margin-left:0.5em;">
<div style="margin-bottom:0.5em;">
10 and 20 are always followed by the same text across webpages
</div>
</div>
"""
pattern = re.compile(r"(\d+) and (\d+) are always followed by the same text across webpages")
print(pattern.findall(data)) # prints [('10', '20')]
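And, to wire the pattern back into the class-based code being modified, here is a minimal sketch; the Page class name and the self.soup attribute are assumptions modeled on the title example:

```python
import re
from bs4 import BeautifulSoup

PATTERN = re.compile(r"(\d+) and (\d+) are always followed by the same text across webpages")

class Page:
    def __init__(self, html):
        # assumption: the surrounding code keeps the parsed document on self.soup
        self.soup = BeautifulSoup(html, "html.parser")

    @property
    def nums(self):
        # search the document's text, not `self` -- passing `self` (the object)
        # to re.search is what raised "TypeError: expected string or buffer"
        match = PATTERN.search(self.soup.get_text())
        return [int(n) for n in match.groups()] if match else None

page = Page("<div>10 and 20 are always followed by the same text across webpages</div>")
print(page.nums)  # [10, 20]
```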
Related
I am crawling a website where I need to grab the sentence starting with "Confirmed ...".
The HTML for the corresponding sentence looks like this:
<span class='text-secondary ml-2 d-none d-sm-inline-block'
title='Estimated duration between time First Seen and included in block'> | <i class='fal fa-stopwatch ml-1'></i>
Confirmed within 25 secs</span>
Using requests_html from Python I can retrieve:
r.html.find("span", containing="Confirmed ")
[<Element 'span' class=('text-secondary', 'ml-2', 'd-none', 'd-sm-inline-block') title='Estimated duration between time First Seen and included in block'>]
But for some reason, it doesn't return the rest. What am I missing?
Have you tried to find the span element using the parameter containing="Confirmed "?
Like this:
r.html.find("span", containing="Confirmed ")
I did some testing on localhost using your HTML and it does return the element: screenshot
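Note that find() returns Element objects, and their repr shows only the tag and attributes; the sentence itself lives on the element (e.g. its .text). If you want the same "element containing text" lookup without requests_html, here is a sketch using BeautifulSoup instead (a substitution on my part, not the library from the question):

```python
from bs4 import BeautifulSoup

html = ("<span class='text-secondary ml-2 d-none d-sm-inline-block' "
        "title='Estimated duration between time First Seen and included in block'>"
        "<i class='fal fa-stopwatch ml-1'></i> Confirmed within 25 secs</span>")

soup = BeautifulSoup(html, "html.parser")
# match any <span> whose rendered text contains "Confirmed"
span = soup.find(lambda tag: tag.name == "span" and "Confirmed" in tag.get_text())
sentence = " ".join(span.get_text().split())  # collapse stray whitespace
print(sentence)  # Confirmed within 25 secs
```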
<div class="bb-fl" style="background:Tomato;width:0.63px" title="10"></div>,
<div class="bb-fl" style="background:SkyBlue;width:0.19px" title="3"></div>,
<div class="bb-fl" style="background:Tomato;width:1.14px" title="18"></div>,
<div class="bb-fl" style="background:SkyBlue;width:0.19px" title="3"></div>,
<div class="bb-fl" style="background:Tomato;width:1.52px" title="24"></div>,
I currently have the above HTML code in a list. I wish to use Python so that it may output the following and then append to a list:
10
3
18
3
24
I would recommend using Beautiful Soup, a very popular HTML parsing module that is uniquely suited to this kind of thing. If each element has the title attribute, then you could do something like this:
from bs4 import BeautifulSoup
import requests
def randomFacts(url):
    r = requests.get(url)
    bs = BeautifulSoup(r.content, 'html.parser')
    title = bs.find_all('div')
    for each in title:
        print(each['title'])
Beautiful Soup is my normal go-to for HTML parsing; hope this helps.
Here are three possibilities. In the first two versions we make sure the class checks out before appending to the list, just in case there are other divs that you don't want to include. In the third method there isn't really a good way to do that. Unlike adrianp's method of splitting, mine doesn't care where the title is.
The third method may be a bit confusing, so allow me to explain it. First we split everywhere that title=" appears. We dump the first index of that list because it is everything before the first title. We then loop over the remainder and split on the first quote. Now the number you want is in the first index of that split. We do an inline pop to get that value so we can keep everything in a list comprehension, instead of expanding the entire loop and wrestling the values out with specific indexes.
To load the html remotely, uncomment the commented html var and replace "yourURL" with the proper one for you.
I think I have given you every possible way of doing this - certainly the most obvious ones.
from bs4 import BeautifulSoup
import re, requests
html = '<div class="bb-fl" style="background:Tomato;width:0.63px" title="10"></div> \
<div class="bb-fl" style="background:SkyBlue;width:0.19px" title="3"></div> \
<div class="bb-fl" style="background:Tomato;width:1.14px" title="18"></div> \
<div class="bb-fl" style="background:SkyBlue;width:0.19px" title="3"></div> \
<div class="bb-fl" style="background:Tomato;width:1.52px" title="24"></div>'
#html = requests.get(yourURL).content
# possibility 1: BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
# assumes that all bb-fl classed divs have a title and all divs have a class
# you may need to disassemble this generator and add some extra checks
bs_titleval = [div['title'] for div in soup.find_all('div') if 'bb-fl' in div['class']]
print(bs_titleval)
# possibility 2: Regular Expressions ~ not the best way to go
# this isn't going to work if the tag attribute signature changes
title_re = re.compile('<div class="bb-fl" style="[^"]*" title="([0-9]+)">', re.I)
re_titleval = [m.group(1) for m in title_re.finditer(html)]
print(re_titleval)
# possibility 3: String Splitting ~
# probably the best method if there is nothing extra to weed out
title_sp = html.split('title="')
title_sp.pop(0) # get rid of first index
# title_sp is now ['10"></div>...', '3"></div>...', '18"></div>...', etc...]
sp_titleval = [s.split('"').pop(0) for s in title_sp]
print(sp_titleval)
Assuming that each div is saved as a string in the variable div, you can do the following:
number = div.split('title="')[1].split('"')[0]
Each div should be in the same format for this to work.
I use lxml for parsing HTML files in Python.
And I use cssselect.
Something like this:
from lxml.html import parse
page = parse('http://.../').getroot()
img = page.cssselect('div.photo cover div.outer a') # problem
But I have a problem. There are spaces in the class names in the HTML:
<div class="photo cover"><div class=outer>
<a href=...
Without them everything is OK. How can I parse it (I can't edit the HTML code)?
To match a div with both the photo and cover classes, use div.photo.cover.
img = page.cssselect('div.photo.cover div.outer a')
Instead of thinking of class="photo cover" as a class attribute with photo cover as a single value, think of it as a class attribute with photo and cover as two values.
Let's say I have an HTML document with <p> and <br> tags inside. Afterwards, I'm going to strip the HTML to clean up the tags. How can I turn them into line breaks?
I'm using Python's BeautifulSoup library, if that helps at all.
Without some specifics, it's hard to be sure this does exactly what you want, but this should give you the idea... it assumes your br tags are wrapped inside p elements.
from BeautifulSoup import BeautifulSoup
import six
def replace_with_newlines(element):
    text = ''
    for elem in element.recursiveChildGenerator():
        if isinstance(elem, six.string_types):
            text += elem.strip()
        elif elem.name == 'br':
            text += '\n'
    return text
page = """<html>
<body>
<p>America,<br>
Now is the<br>time for all good men to come to the aid<br>of their country.</p>
<p>pile on taxpayer debt<br></p>
<p>Now is the<br>time for all good men to come to the aid<br>of their country.</p>
</body>
</html>
"""
soup = BeautifulSoup(page)
lines = soup.find("body")
for line in lines.findAll('p'):
    line = replace_with_newlines(line)
    print line
Running this results in...
(py26_default)[mpenning@Bucksnort ~]$ python thing.py
America,
Now is the
time for all good men to come to the aid
of their country.
pile on taxpayer debt
Now is the
time for all good men to come to the aid
of their country.
(py26_default)[mpenning@Bucksnort ~]$
get_text seems to do what you need
>>> from bs4 import BeautifulSoup
>>> doc = "<p>This is a paragraph.</p><p>This is another paragraph.</p>"
>>> soup = BeautifulSoup(doc)
>>> soup.get_text(separator="\n")
u'This is a paragraph.\nThis is another paragraph.'
This is a Python 3 version of @Mike Pennington's answer (it really helps); I did a little refactoring.
def replace_with_newlines(element):
    text = ''
    for elem in element.recursiveChildGenerator():
        if isinstance(elem, str):
            text += elem.strip()
        elif elem.name == 'br':
            text += '\n'
    return text

def get_plain_text(soup):
    plain_text = ''
    lines = soup.find("body")
    for line in lines.findAll('p'):
        line = replace_with_newlines(line)
        plain_text += line
    return plain_text
To use this, just pass the BeautifulSoup object to the get_plain_text method.
soup = BeautifulSoup(page)
plain_text = get_plain_text(soup)
I use the following small library to accomplish this:
https://github.com/TeamHG-Memex/html-text
pip install html-text
As simple as:
>>> import html_text
>>> html_text.extract_text('<h1>Hello</h1> world!')
'Hello\n\nworld!'
I'm not fully sure what you're trying to accomplish, but if you're just trying to remove the HTML elements, I would use a program like Notepad2 and its Replace All function; I think you can also insert a new line with Replace All. Make sure that if you replace the <p> element you also remove the closing </p>. Additionally, just an FYI, the self-closing form <br /> is also valid alongside <br>, but it doesn't really matter. Python wouldn't be my first choice for this, so it's a little out of my area of knowledge; sorry I couldn't help more.
I have an html file and I want to replace the empty paragraphs with a space.
mystring = "This <p></p><p>is a test</p><p></p><p></p>"
result = mystring.sub("<p></p>" , " ")
This is not working.
Please, don't try to parse HTML with regular expressions. Use a proper parsing module, like htmlparser or BeautifulSoup to achieve this. "Suffer" a short learning curve now and benefit:
Your parsing code will be more robust, handling corner cases you may not have considered that will fail with a regex
For future HTML parsing/munging tasks, you will be empowered to do things faster, so eventually the time investment pays off as well.
You won't be sorry! Profit guaranteed!
I think it's always nice to give an example of how to do this with a real parser, as well as just repeating the sound advice that Eli Bendersky gives in his answer.
Here's an example of how to remove empty <p> elements using lxml. lxml's HTMLParser deals with HTML very well.
from lxml import etree
from StringIO import StringIO
input = '''This <p> </p><p>is a test</p><p></p><p><b>Bye.</b></p>'''
parser = etree.HTMLParser()
tree = etree.parse(StringIO(input), parser)
for p in tree.xpath("//p"):
    if len(p):
        continue
    t = p.text
    if not (t and t.strip()):
        p.getparent().remove(p)
print etree.tostring(tree.getroot(), pretty_print=True)
... which produces the output:
<html>
<body>
<p>This </p>
<p>is a test</p>
<p>
<b>Bye.</b>
</p>
</body>
</html>
Note that I misread the question when replying to this, and I'm only removing the empty <p> elements, not replacing them with a space. With lxml, I'm not sure of a simple way to do this, so I've created another question to ask:
How can one replace an element with text in lxml?
I think for this particular problem a parsing module would be overkill.
Simply use this function:
>>> mystring = "This <p></p><p>is a test</p><p></p><p></p>"
>>> mystring.replace("<p></p>"," ")
'This <p>is a test</p> '
What if <p> is entered as <P>, or < p >, or has an attribute added, or is given using the empty tag syntax <P/>? Pyparsing's HTML tag support handles all of these variations:
from pyparsing import makeHTMLTags, replaceWith, withAttribute
mystring = 'This <p></p><p>is a test</p><p align="left"></p><P> </p><P/>'
p,pEnd = makeHTMLTags("P")
emptyP = p.copy().setParseAction(withAttribute(empty=True))
null_paragraph = emptyP | p+pEnd
null_paragraph.setParseAction(replaceWith(" "))
print null_paragraph.transformString(mystring)
Prints:
This <p>is a test</p>
Using a regexp?
import re
result = re.sub(r"<p>\s*</p>", " ", mystring, flags=re.MULTILINE)
compile the regexp if you use it often.
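For instance, a compiled version of the same substitution, with a raw string so \s is not treated as a string escape:

```python
import re

EMPTY_P = re.compile(r"<p>\s*</p>")

mystring = "This <p></p><p>is a test</p><p></p><p></p>"
result = EMPTY_P.sub(" ", mystring)
print(repr(result))  # 'This  <p>is a test</p>  '
```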
I wrote this code:
from lxml import etree
from StringIO import StringIO
html_tags = """<div><ul><li>PID temperature controller</li> <li>Smart and reliable</li> <li>Auto-diagnosing</li> <li>Auto setting</li> <li>Intelligent control</li> <li>2-Rows 4-Digits LED display</li> <li>Widely applied in the display and control of the parameter of temperature, pressure, flow, and liquid level</li> <li> </li> <p> </p></ul> <div> </div></div>"""
document = etree.iterparse(StringIO(html_tags), html=True)
for a, e in document:
    if not (e.text and e.text.strip()) and len(e) == 0:
        e.getparent().remove(e)
print etree.tostring(document.root)