Python HTML parsing

I need to do some HTML parsing with Python. If I have an HTML file like the one below:
<body>
<div class="mydiv">
<p>i want got it</p>
<div>
<p> good </p>
<a> boy </a>
</div>
</div>
</body>
how can I get the content of <div class="mydiv">? That is, I want to get:
<p>i want got it</p>
<div>
<p> good </p>
<a> boy </a>
</div>
I have tried HTMLParser, but I found it can't do this.
Is there any other way? Thanks!

With BeautifulSoup it is as simple as:
from BeautifulSoup import BeautifulSoup
html = """
<body>
<div class="mydiv">
<p>i want got it</p>
<div>
<p> good </p>
<a> boy </a>
</div>
</div>
</body>
"""
soup = BeautifulSoup(html)
result = soup.findAll('div', {'class': 'mydiv'})
tag = result[0]
print tag.contents
[u'\n', <p>i want got it</p>, u'\n', <div>
<p> good </p>
<a> boy </a>
</div>, u'\n']
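The snippet above uses the older BeautifulSoup 3 import and the Python 2 print statement. A minimal sketch of the same idea with the bs4 package on Python 3 might look like this; it assumes bs4 is installed and uses decode_contents() to return the inner markup as a single string rather than a list of nodes:

```python
from bs4 import BeautifulSoup

html = (
    '<body><div class="mydiv"><p>i want got it</p>'
    "<div><p> good </p><a> boy </a></div></div></body>"
)

soup = BeautifulSoup(html, "html.parser")

# Find the first <div class="mydiv"> and serialize everything inside it.
div = soup.find("div", class_="mydiv")
inner_html = div.decode_contents()
print(inner_html)
```

If the individual child nodes are needed instead of a string, div.contents still works as in the original answer.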

Use lxml. Or BeautifulSoup.

I would prefer lxml.html.
import lxml.html as H
doc = H.fromstring(html)
node = doc.xpath("//div[@class='mydiv']")

Related

Python 3.8 - BeautifulSoup 4 - unwrap() does not remove all tags

I've been googling through SO for quite some time, but I couldn't find a solution for this one. Sorry if it's a duplicate.
I'm trying to remove all the HTML tags from a snippet, but I don't want to use get_text() because there might be some other tags, like img, that I'd like to use later. BeautifulSoup doesn't quite behave as I expect it to:
from bs4 import BeautifulSoup
html = """
<div>
<div class="somewhat">
<div class="not quite">
</div>
<div class="here">
<blockquote>
<span>
<br />content<br />
</span>
</blockquote>
</div>
<div class="not here either">
</div>
</div>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
la_lista = []
for x in soup.find_all('div', {"class":"somewhat"}): # in all the "somewhat" divs
    for y in x.find_all('div', {"class":"here"}): # find all the "here" divs
        for inp in y.find_all("blockquote"): # in a "here" div find all blockquote tags for the relevant content
            for newlines in inp('br'):
                inp.br.replace_with("\n") # replace br tags
            for link in inp('a'):
                inp.a.unwrap() # unwrap all a tags
            for quote in inp('span'):
                inp.span.unwrap() # unwrap all span tags
            for block in inp('blockquote'):
                inp.blockquote.unwrap() # <----- should unwrap blockquote
            la_lista.append(inp)
print(la_lista)
The result is as follows:
[<blockquote>
content
</blockquote>]
Any ideas?
Each item returned by y.find_all("blockquote") is a bs4.element.Tag. Calling inp('blockquote') on it searches only its descendants, so it never matches the blockquote tag itself and the unwrap() inside that loop never runs.
The solution is to remove:
for block in inp('blockquote'):
    inp.blockquote.unwrap()
and replace:
la_lista.append(inp)
with:
la_lista.append(inp.decode_contents())
This answer is based on the answer to the question "BeautifulSoup innerhtml".
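Putting that fix together, a minimal self-contained sketch could look like the following. The shortened HTML and the names results and quote are illustrative assumptions; an img tag is included only to show that other tags survive:

```python
from bs4 import BeautifulSoup

html = (
    '<div class="here"><blockquote><span><br />content<br /></span>'
    '<a href="#">link</a><img src="x.png" /></blockquote></div>'
)

soup = BeautifulSoup(html, "html.parser")
results = []

for quote in soup.find_all("blockquote"):
    # Replace every <br> with a newline character.
    for br in quote.find_all("br"):
        br.replace_with("\n")
    # Unwrap <a> and <span>, keeping their children in place.
    for tag in quote.find_all(["a", "span"]):
        tag.unwrap()
    # decode_contents() serializes the children without the <blockquote> itself.
    results.append(quote.decode_contents())

print(results)
```

Iterating over the list returned by find_all() and acting on each found tag directly avoids relying on inp.br or inp.span, which only ever point at the first matching descendant.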

How to ignore tags on beautifulsoup4 python

I'm working on a new project and I have run into an issue.
My problem looks like this:
<div class="news">
<p class="breaking"> </p>
...
<p> i need to pull here. </p>
but the p tag with class="breaking" gets in the way. I want to ignore the class "breaking" and pull the other <p>.
Maybe class_='' would do, with find_all or findAll:
from bs4 import BeautifulSoup
html = """
<div class="news">
<p class="breaking"> </p>
...
<p> i need to pull here. </p>
"""
soup = BeautifulSoup(html, 'html.parser')
print(soup.find_all('p', class_=''))
print(soup.findAll(True, {'class': ''}))
Output
[<p> i need to pull here. </p>]
[<p> i need to pull here. </p>]
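If matching on an empty class feels too implicit, an alternative sketch is to pass a function to find_all and exclude the unwanted class explicitly. The helper name is_plain_paragraph is hypothetical, introduced here only for illustration:

```python
from bs4 import BeautifulSoup

html = """
<div class="news">
<p class="breaking"> </p>
...
<p> i need to pull here. </p>
"""

soup = BeautifulSoup(html, "html.parser")


def is_plain_paragraph(tag):
    # Keep <p> tags whose class list does not contain "breaking".
    return tag.name == "p" and "breaking" not in tag.get("class", [])


texts = [p.get_text() for p in soup.find_all(is_plain_paragraph)]
print(texts)
```

This version also keeps p tags that carry other classes, which the class_='' filter would skip.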

Python: I used BeautifulSoup, but the href value disappeared from the result

This is the html
<div class="s_write">
<p style="text-align:left;"></P>
<div app_paragraph="Dc_App_Img_0" app_editorno="0">
<img src="http://dcimg6.dcinside.co.kr/viewimage.php?id=2fbcc323e7d334aa51b1d3a240&no=24b0d769e1d32ca73fef84fa11d028318f52c0eeb141bee560297996d466c894cf2d16427672bba3d66d67f244141456484ebe788e4b1ac8601ef468abc7cad6754f440d9ddbfc0370c7" style="cursor:pointer;" onclick="javascript:imgPop('http://image.dcinside.com/viewimagePop.php?id=2fbcc323e7d334aa51b1d3a240&no=24b0d769e1d32ca73fef84fa11d028318f52c0eeb141bee560297996d466c894cf2d16427672bba3d66d67f24452490c8b9fb90ae74e4d6a2435010d29956ad37f400586d9cb','image','fullscreen=yes,scrollbars=yes,resizable=no,menubar=no,toolbar=no,location=no,status=no');"></div>
I want to extract
http://image.dcinside.com/viewimagePop.php?id=2fbcc323e7d334aa51b1d3a240&no=24b0d769e1d32ca73fef84fa11d028318f52c0eeb141bee560297996d466c894cf2d16427672bba3d66d67f24452490c8b9fb90ae74e4d6a2435010d29956ad37f400586d9cb
So I programmed like this
for link in internal.find_all('div',class_="s_write"):
    print (link)
But result is
<div class="s_write"><p style="text-align:left;"><div app_editorno="0" app_paragraph="Dc_App_Img_0"><img src="http://dcimg6.dcinside.co.kr/viewimage.php?id=2fbcc323e7d334aa51b1d3a240&no=24b0d769e1d32ca73fef84fa11d028318f52c0eeb141bee560297996d466c894cf2d16427672bba3d66d67f24452490c8a9dbf0ae34e4a6a20370e5a2d9633d5701c48dc23ac"/></p></div>
The href is not in the result.
What's the problem?
Issue with your code:
You are searching for the div, which returns only the div block. If you want the img link, you need to search for the img tag.
The following is a rough example of how to do this.
import bs4 as bs
markup = """<div class="s_write">
<p style="text-align:left;"></P>
<div app_paragraph="Dc_App_Img_0" app_editorno="0">
<img src="http://dcimg6.dcinside.co.kr/viewimage.php?id=2fbcc323e7d334aa51b1d3a240&no=24b0d769e1d32ca73fef84fa11d028318f52c0eeb141bee560297996d466c894cf2d16427672bba3d66d67f244141456484ebe788e4b1ac8601ef468abc7cad6754f440d9ddbfc0370c7" style="cursor:pointer;" onclick="javascript:imgPop('http://image.dcinside.com/viewimagePop.php?id=2fbcc323e7d334aa51b1d3a240&no=24b0d769e1d32ca73fef84fa11d028318f52c0eeb141bee560297996d466c894cf2d16427672bba3d66d67f24452490c8b9fb90ae74e4d6a2435010d29956ad37f400586d9cb','image','fullscreen=yes,scrollbars=yes,resizable=no,menubar=no,toolbar=no,location=no,status=no');"></div>"""
soup = bs.BeautifulSoup(markup,"html.parser")
imglink = soup.find_all("img")[0]
print(imglink.attrs["src"])
Output
http://dcimg6.dcinside.co.kr/viewimage.php?id=2fbcc323e7d334aa51b1d3a240&no=24b0d769e1d32ca73fef84fa11d028318f52c0eeb141bee560297996d466c894cf2d16427672bba3d66d67f244141456484ebe788e4b1ac8601ef468abc7cad6754f440d9ddbfc0370c7
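The example above prints the src attribute, whereas the URL the question asks for sits inside the onclick attribute as the first argument of imgPop(...). A hedged sketch of pulling it out with a regular expression follows; the example.com URLs are placeholders rather than the real ones, and the pattern assumes the URL is single-quoted as in the markup shown:

```python
import re
from bs4 import BeautifulSoup

markup = (
    '<div class="s_write"><img src="http://example.com/thumb.php?id=1" '
    "onclick=\"javascript:imgPop('http://example.com/full.php?id=1',"
    "'image','fullscreen=yes');\"></div>"
)

soup = BeautifulSoup(markup, "html.parser")
img = soup.find("img")

# Capture the first single-quoted argument passed to imgPop(...).
match = re.search(r"imgPop\('([^']+)'", img.get("onclick", ""))
popup_url = match.group(1) if match else None
print(popup_url)
```

This only helps if the onclick attribute is actually present in the HTML that was fetched; the result printed in the question shows an img tag without it.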

How to wrap string by tag in Beautifulsoup?

I want to wrap the content of a lot of div-elements/blocks with p tags:
<div class='value'>
some content
</div>
It should become:
<div class='value'>
<p>
some content
</p>
</div>
My idea was to get the content (using bs4) by filtering strings with find_all and then wrap it with the new tag. I don't know if that works, because I can't filter content from tags with specific attributes/values.
I could do this with a regex instead of bs4, but I'd like to do all transformations (there are some more besides this one) in bs4.
Believe it or not, you can use wrap. :-)
Because you might, or might not, want to wrap inner div elements I decided to alter your HTML code a little bit, so that I could give you code that shows how to alter an inner div without changing the one 'outside' it. You will see how to alter all divs, I'm sure.
Here's how.
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(open('pjoern.htm').read(), 'lxml')
>>> inner_div = soup.findAll('div')[1]
>>> inner_div
<div>
some content
</div>
>>> inner_div.contents[0].wrap(soup.new_tag('p'))
<p>
some content
</p>
>>> print(soup.prettify())
<html>
<body>
<div class="value">
<div>
<p>
some content
</p>
</div>
</div>
</body>
</html>
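To apply the same idea to every div with class 'value', one possible sketch is to move all of each div's children into a fresh p tag. This is an assumption about what "wrap the content" should mean when a div has several children; the two-div sample HTML is made up for illustration:

```python
from bs4 import BeautifulSoup

html = "<div class='value'>some content</div><div class='value'>more content</div>"

soup = BeautifulSoup(html, "html.parser")

for div in soup.find_all("div", class_="value"):
    p = soup.new_tag("p")
    # Copy the child list first, because moving nodes changes div.contents.
    for child in list(div.contents):
        p.append(child.extract())
    # The div is now empty; put the filled <p> back inside it.
    div.append(p)

result = str(soup)
print(result)
```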

Beautiful Soup: Best ways to comment out a tag instead of extracting it?

I am trying to comment out parts of an HTML page that I want later instead of extracting it with the beautiful soup tag.extract() function. Ex:
<h1> Name of Article </h2>
<p>First Paragraph I want</p>
<p>More Html I'm interested in</p>
<h2> Subheading in the article I also want </h2>
<p>Even more Html i want blah blah blah</p>
<h2> References </h2>
<p>Html I want commented out</p>
I want everything below and including the References heading commented out. Obviously I can extract things like so using beautiful soup's extract features:
soup = BeautifulSoup(data, "lxml")
references = soup.find("h2", text=re.compile("References"))
for elm in references.find_next_siblings():
    elm.extract()
references.extract()
I also know that beautiful soup allows a comment creation feature which you can use like so
from bs4 import Comment
commented_tag = Comment(chunk_of_html_parsed_somewhere_else)
soup.append(commented_tag)
This seems very unpythonic and a cumbersome way to simply encapsulate html comment tags directly outside of a specific tag, especially if the tag was located in the middle of a thick html tree. Isn't there some easier way you can just find a tag on beautifulsoup and simply place <!-- --> directly before and after it? Thanks in advance.
Assuming I understand the problem correctly, you can use the replace_with() to replace a tag with a Comment instance. This is probably the simplest way to comment an existing tag:
import re
from bs4 import BeautifulSoup, Comment
data = """
<div>
<h1> Name of Article </h2>
<p>First Paragraph I want</p>
<p>More Html I'm interested in</p>
<h2> Subheading in the article I also want </h2>
<p>Even more Html i want blah blah blah</p>
<h2> References </h2>
<p>Html I want commented out</p>
</div>"""
soup = BeautifulSoup(data, "lxml")
elm = soup.find("h2", text=re.compile("References"))
elm.replace_with(Comment(str(elm)))
print(soup.prettify())
Prints:
<html>
<body>
<div>
<h1>
Name of Article
</h1>
<p>
First Paragraph I want
</p>
<p>
More Html I'm interested in
</p>
<h2>
Subheading in the article I also want
</h2>
<p>
Even more Html i want blah blah blah
</p>
<!--<h2> References </h2>-->
<p>
Html I want commented out
</p>
</div>
</body>
</html>
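As the printed output shows, only the References heading itself ends up commented out. To cover everything below it as well, the same replace_with() call could be applied to each following sibling; this is a sketch with a shortened, made-up HTML sample, using html.parser and the string= argument:

```python
import re
from bs4 import BeautifulSoup, Comment

data = (
    "<div><h2> Subheading </h2><p>Keep me</p>"
    "<h2> References </h2><p>Comment me out</p><p>Me too</p></div>"
)

soup = BeautifulSoup(data, "html.parser")
heading = soup.find("h2", string=re.compile("References"))

# Collect the heading and its following sibling tags before modifying the tree.
targets = [heading, *heading.find_next_siblings()]
for elm in targets:
    # Swap each tag for a comment holding its original markup.
    elm.replace_with(Comment(str(elm)))

result = str(soup)
print(result)
```

Each tag becomes its own comment here; producing a single comment around the whole block would mean joining the serialized tags first and replacing them once.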
