I have some lines like this inside an HTML page:
<div>
<p class="match"> this sentence should match </p>
some text
<a class="a"> some text </a>
</div>
<div>
<p class="match"> this sentence shouldn't match</p>
some text
<a class ="b"> some text </a>
</div>
I want to extract the lines inside the <p class="match">, but only when they are inside a <div> containing an <a class="a">.
What I've done so far is below (I first find the divs containing an <a class="a">, then iterate over the results to find the sentences inside a <p class="match">):
import re

# Match a whole <div>...</div> block that contains "a" (the class of interest).
regex_div = re.compile(r'<div>.+"a".+?</div>', re.DOTALL)
regex_match = re.compile(r'<p class="match">(.+)</p>')

with open("a") as file_to_r:
    for m in regex_div.findall(file_to_r.read()):
        print(regex_match.findall(m))
but I wonder if there is another (still efficient) way to do it in one pass?
Use an HTML parser, like BeautifulSoup.
Find the a tag with class "a", then find its previous sibling: the p tag with class "match":
from bs4 import BeautifulSoup
data = """
<div>
<p class="match"> this sentence should match </p>
some text
<a class="a"> some text </a>
</div>
<div>
<p class="match"> this sentence shouldn't match</p>
some text
<a class ="b"> some text </a>
</div>
"""
soup = BeautifulSoup(data, 'html.parser')
a = soup.find('a', class_='a')
print(a.find_previous_sibling('p', class_='match').text)
Prints:
this sentence should match
Also see why you should avoid using regex for parsing HTML here:
RegEx match open tags except XHTML self-contained tags
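If the document can contain several qualifying divs, the same idea extends with find_all(). A hedged sketch reusing the soup from above (not part of the original answer):

for a in soup.find_all('a', class_='a'):
    p = a.find_previous_sibling('p', class_='match')
    if p is not None:
        print(p.get_text(strip=True))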
You should use an HTML parser, but if you still want a regex you can use something like this:
<div>\s*<p class="match">([\w\s]+)</p>[\w\s]+(?=<a class="a").*?</div>
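To try that regex from Python (a hedged sketch; re.DOTALL is needed so .*? can cross newlines, and html stands for the snippet from the question):

import re

pattern = re.compile(r'<div>\s*<p class="match">([\w\s]+)</p>[\w\s]+(?=<a class="a").*?</div>', re.DOTALL)
print(pattern.findall(html))  # [' this sentence should match ']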
<div>\s*\n\s*.*?<p class=.*?>(.*?)<\/p>\s*\n\s*.*?\s*\n\s*(?=(\<a class=\"a\"\>))
You can use this. See the demo: http://regex101.com/r/lK9iD2/7
I will ask my question with an example that does "almost what I want":
from bs4 import BeautifulSoup
p = BeautifulSoup(features='lxml').new_tag('p')
p.append('This is my paragraph. I can add a ')
a = BeautifulSoup(features='lxml').new_tag('a', href='www.google.com')
a.string = 'link to Google'
p.append(a)
p.append(' and finish my paragraph.')
div = BeautifulSoup(features='lxml').new_tag('div')
div.append("I want to append the paragraph content into this div, but only its content without the <p> and </p>, and don't want to escape anything in the contents, i.e. I want to keep the a tag. ")
div.append(p)
print(div.prettify())
As a result, print(div.prettify()) shows
<div>
I want to append the paragraph content into this div, but only its content without the <p> and </p>, and don't want to escape anything in the contents, i.e. I want to keep the a tag.
<p>
This is my paragraph. I can add a
<a href="www.google.com">
link to Google
</a>
and finish my paragraph.
</p>
</div>
As the text in the example itself says, I want to append the inner HTML of p without the <p> and </p> tags, while keeping all other tags (in this case the a tag). So for this example, this is the result I want to get:
<div>
I want to append the paragraph content into this div, but only its content without the <p> and </p>, and don't want to escape anything in the contents, i.e. I want to keep the a tag. This is my paragraph. I can add a
<a href="www.google.com">
link to Google
</a>
and finish my paragraph.
</div>
How can this be done? I have tried a number of options like div.append(p.unwrap()) or div.append(p.text) and some others without luck. div.append(str(p)[3:-4]) does not work because it escapes all the < and > from the inner elements, in this case a.
You can use unwrap() like this to get the desired result.
import bs4 as bs
s = '''
<div>
I want to append the paragraph content into this div, but only its content without the <p> and </p>, and don't want to escape anything in the contents, i.e. I want to keep the a tag.
<p>
This is my paragraph. I can add a
<a href="www.google.com">
link to Google
</a>
and finish my paragraph.
</p>
</div>
'''
soup = bs.BeautifulSoup(s, 'html.parser')
div_tag = soup.find('div')
div_tag.p.unwrap()
print(soup)
Output
<div>
I want to append the paragraph content into this div, but only its content without the <p> and </p>, and don't want to escape anything in the contents, i.e. I want to keep the a tag.
This is my paragraph. I can add a
<a href="www.google.com">
link to Google
</a>
and finish my paragraph.
</div>
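Applied to the construction code from the question, one extra line after div.append(p) is enough. A minimal sketch, assuming the div and p variables built there:

div.append(p)  # as in the question; p is now part of div's tree
p.unwrap()     # strips the <p>/</p> wrapper but keeps its children, including <a>
print(div.prettify())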
I have a website that I want to scrape, and I have a part of the website where the HTML is like so:
<p class="abc xyz">
<em class="efg">Whatever:</em>
test
</p>
<p class="abc xyz">
<em class="efg">Phone:</em>
+1-111-222-3333
</p>
I only want to get the text of the p tag where the em tag's text, inside the p tag, is Phone. So in the example above, I want to save +1-111-222-3333.
The classes are all the same, and the structure is the same as well, but I don't need any data other than the phone number.
Is there any way I can do this, or do I have to save all the data and then remove what I don't need from my csv file afterwards?
If you don't want to parse it with dedicated tools and prefer a simple string search, you can try the regex search below.
<p[^>]*>\s*<em[^>]*>\s*Phone[^<]*<\/em>\s*(.*)\s*<\/p>
Example code
import re
html="""
<p class="abc xyz">
<em class="efg">Whatever:</em>
test
</p>
<p class="abc xyz">
<em class="efg">Phone:</em>
+1-111-222-3333
</p>
"""
print(re.findall(r"<p[^>]*>\s*<em[^>]*>\s*Phone[^<]*</em>\s*(.*)\s*</p>", html))
Output
['+1-111-222-3333']
Create a list of all the parent tags:
from bs4 import BeautifulSoup

t = """
<p class="abc xyz">
<em class="efg">Whatever:</em>
test
</p>
<p class="abc xyz">
<em class="efg">Phone:</em>
+1-111-222-3333
</p>"""
soup = BeautifulSoup(t, 'html.parser')
result = soup.find_all('p', class_='abc xyz')
result will be a list of the parent p tags with class abc xyz, i.e.
[<p class="abc xyz">
<em class="efg">Whatever:</em>
test
</p>,
<p class="abc xyz">
<em class="efg">Phone:</em>
+1-111-222-3333
</p>]
Now iterate over the list and compare the text of the child tag you want, here "Phone:":
for tag in result:
    if tag.find('em').text == 'Phone:':
        print(tag.text.replace("\n", "").split(" ")[-1])
This gives the result: +1-111-222-3333
If you check result[0].text you get:
'\nWhatever: \n test \n'
and result[1].text gives: '\nPhone: \n +1-111-222-3333\n'
They contain newlines, so just replace them.
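A more direct variant is also possible (a sketch, assuming the em text is exactly 'Phone:'; it reuses the soup from above): the phone number is simply the text node that follows the matching em.

em = soup.find('em', string='Phone:')
print(em.next_sibling.strip())  # +1-111-222-3333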
Consider the following html snippet
<html>
.
.
.
<div>
<p> Hello </p>
<div>
<b>
Text1
</b>
<p>
This is a huge paragraph text
</p>
.
.
.
</div>
</div>
.
.
.
<div>
<i>
Text2
</i>
</div>
Let us say that I need to extract everything from Text1 to Text2, including the tags.
Using a few methods, I have been able to locate those two tags, i.e. I have unique references to them.
Essentially I have two lxml etree elements, corresponding to the two tags I require.
How do I extract everything in between the two tags?
(One possible solution I can think of is to find the two tags' common ancestor, do an iterwalk(), and start extracting at the first element and stop at the second. However, I'm not exactly sure how this would work.)
Any solution would be appreciated.
Please note that I have already found the two tags that I require, and I'm not looking for solutions to find those tags (e.g. using XPath).
Edit: My desired output is
<b>
Text1
</b>
<p>
This is a huge paragraph text
</p>
.
.
.
</div>
</div>
.
.
.
<div>
<i>
Text2
</i>
Please note that I wouldn't mind the initial two <div> tags, but I do not want the Hello.
The same goes for the closing tags at the end. I'm mostly interested in the in-between contents.
Edit 2: I have extracted the Etree elements using complex xpath conditions, which was not feasible with other alternatives such as bs4, so any solution using the lxml elements would be appreciated :)
After review and questioning:
from essentials.tokening import CreateToken # This was imported just to generate a random string - pip install mknxgn_essentials
import bs4
HTML = """<html>
<div>
<div>
<div id="start">
Hello, My name is mark
</div>
</div>
</div>
<div>
This is in the middle
</div>
<div>
<div id="end">
This is the end
</div>
</div>
<div>
Do not include this.
</div>
</html>"""
RandomString = CreateToken(30, HTML)  # Generate a random string that could never occur on its own in the file; if it did occur, use something else
soup = bs4.BeautifulSoup(HTML, features="lxml")  # Convert the text into soup
start_div = soup.find("div", attrs={"id": "start"})  # assuming you can find this element
start_div.insert_before(RandomString)  # insert the random string before this element
end_div = soup.find("div", attrs={"id": "end"})  # again, assuming you can also find this element
end_div.insert_after(RandomString)  # insert the random string after this element
print(str(soup).split(RandomString)[1])  # Get the part between both random strings
The output of this returns:
<div id="start">
Hello, My name is mark
</div>
</div>
</div>
<div>
This is in the middle
</div>
<div>
<div id="end">
This is the end
</div>
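Since Edit 2 asks for a solution on the lxml elements themselves, the same marker trick can be applied there. A minimal sketch, assuming well-formed markup (for real-world HTML you would parse with lxml.html instead) and with the find() calls standing in for the two elements you already have:

from lxml import etree

html = "<html><div><p> Hello </p><div><b>Text1</b><p>This is a huge paragraph text</p></div></div><div><i>Text2</i></div></html>"
root = etree.fromstring(html)
start = root.find('.//b')  # placeholder for the element you already located
end = root.find('.//i')    # placeholder for the other element

# Put unique comment markers around the two elements, serialize the tree,
# then slice out the text between the markers.
start.addprevious(etree.Comment('SLICE-START'))
end.addnext(etree.Comment('SLICE-END'))
doc = etree.tostring(root, encoding='unicode')
print(doc.split('<!--SLICE-START-->')[1].split('<!--SLICE-END-->')[0])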
I am trying to comment out parts of an HTML page that I want to keep for later, instead of extracting them with BeautifulSoup's tag.extract() function. Ex:
<h1> Name of Article </h1>
<p>First Paragraph I want</p>
<p>More Html I'm interested in</p>
<h2> Subheading in the article I also want </h2>
<p>Even more Html i want blah blah blah</p>
<h2> References </h2>
<p>Html I want commented out</p>
I want everything below and including the References heading commented out. Obviously I can extract things like so, using BeautifulSoup's extract feature:
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(data, "lxml")
references = soup.find("h2", text=re.compile("References"))
for elm in references.find_next_siblings():
    elm.extract()
references.extract()
I also know that BeautifulSoup has a comment creation feature, which you can use like so:
from bs4 import Comment
commented_tag = Comment(chunk_of_html_parsed_somewhere_else)
soup.append(commented_tag)
This seems very unpythonic and a cumbersome way to simply wrap a specific tag in <!-- --> comment markers, especially if the tag is located in the middle of a thick HTML tree. Isn't there some easier way to find a tag with BeautifulSoup and simply place <!-- and --> directly before and after it? Thanks in advance.
Assuming I understand the problem correctly, you can use replace_with() to replace a tag with a Comment instance. This is probably the simplest way to comment out an existing tag:
import re
from bs4 import BeautifulSoup, Comment
data = """
<div>
<h1> Name of Article </h1>
<p>First Paragraph I want</p>
<p>More Html I'm interested in</p>
<h2> Subheading in the article I also want </h2>
<p>Even more Html i want blah blah blah</p>
<h2> References </h2>
<p>Html I want commented out</p>
</div>"""
soup = BeautifulSoup(data, "lxml")
elm = soup.find("h2", text=re.compile("References"))
elm.replace_with(Comment(str(elm)))
print(soup.prettify())
Prints:
<html>
<body>
<div>
<h1>
Name of Article
</h1>
<p>
First Paragraph I want
</p>
<p>
More Html I'm interested in
</p>
<h2>
Subheading in the article I also want
</h2>
<p>
Even more Html i want blah blah blah
</p>
<!--<h2> References </h2>-->
<p>
Html I want commented out
</p>
</div>
</body>
</html>
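If you also want everything after the heading commented out, as the question asked, the same call works in a loop over the following siblings. A hedged variant of the answer's code, run on a fresh soup of the same data:

soup = BeautifulSoup(data, "lxml")
elm = soup.find("h2", text=re.compile("References"))
# find_next_siblings() returns a list, so replacing while iterating over it is safe
for sib in elm.find_next_siblings():
    sib.replace_with(Comment(str(sib)))
elm.replace_with(Comment(str(elm)))
print(soup.prettify())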
I'm trying to parse an HTML document using the BeautifulSoup Python library, but the structure is getting distorted by <br> tags. Let me give you an example.
Input HTML:
<div>
some text <br>
<span> some more text </span> <br>
<span> and more text </span>
</div>
HTML that BeautifulSoup interprets:
<div>
some text
<br>
<span> some more text </span>
<br>
<span> and more text </span>
</br>
</br>
</div>
In the source, the spans could be considered siblings. After parsing (using the default parser), the spans are suddenly no longer siblings, as the br tags became part of the structure.
The solution I can think of is to strip the <br> tags altogether before feeding the HTML into BeautifulSoup, but that doesn't seem very elegant, as it requires me to change the input. What's a better way to solve this?
Your best bet is to extract() the line breaks. It's easier than you think :).
>>> from bs4 import BeautifulSoup as BS
>>> html = """<div>
... some text <br>
... <span> some more text </span> <br>
... <span> and more text </span>
... </div>"""
>>> soup = BS(html, 'lxml')
>>> for linebreak in soup.find_all('br'):
... linebreak.extract()
...
<br/>
<br/>
>>> print(soup.prettify())
<html>
<body>
<div>
some text
<span>
some more text
</span>
<span>
and more text
</span>
</div>
</body>
</html>
You could also do something like this:
str(soup).replace("</br>", "")
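Note that this gives you back a plain string, not a soup object; if you still need a tree afterwards, a hedged follow-up is to re-parse the cleaned markup:

cleaned = str(soup).replace("</br>", "")
soup = BeautifulSoup(cleaned, 'lxml')  # re-parse so you have a tree again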
This is a super old question, but I just had a similar problem because my document contained closing </br> tags. Because of these, massive chunks of the document were simply ignored by BeautifulSoup (the parser trying to deal with a closing tag, I assume). soup.find_all('br') didn't actually find anything because there was no opening br tag, so I couldn't use the extract() method.
After bashing my head against it for an hour, I found that using the lxml parser instead of the default fixed the problem.
soup = BeautifulSoup(page, 'lxml')
Here are two more workarounds:
Workaround 1 works for this example because there are spaces in between, so you can remove the redundant whitespace to get properly formatted text.
I myself had an HTML document where workaround 1 would not work, because all the text parts would be concatenated without spaces. There I used workaround 2 and processed the list items afterwards; a sketch of that step follows the code below.
from bs4 import BeautifulSoup
html = """<div>
some text <br>
<span> some more text </span> <br>
<span> and more text </span>
</div>"""
soup = BeautifulSoup(html, 'lxml')
#Workaround 1
soup.find("div").text
#returns '\n some text \n some more text \n and more text \n'
#Workaround 2
[t for t in soup.find("div").children]
#and process further list items by removing br tags, whitespaces etc.
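A minimal sketch of that post-processing step (my own assumption of what "removing br tags, whitespaces etc." could look like; it reuses the soup from above):

from bs4 import Tag

parts = []
for child in soup.find("div").children:
    if isinstance(child, Tag) and child.name == 'br':
        continue  # drop the line breaks
    text = child.get_text() if isinstance(child, Tag) else str(child)
    text = ' '.join(text.split())  # collapse runs of whitespace
    if text:
        parts.append(text)
print(parts)  # ['some text', 'some more text', 'and more text']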