How can I get the text in a class with BeautifulSoup - python

How can I get the text of this line:
<p class="Type__TypeElement-sc-9snywk-0 dHxvMA ProfileSection__value--1bo-L" data-hj-suppress="true" data-qa="Profile Field: Country">SE</p>
I want to get the "SE". I tried a lot of things but none of them worked.

You could do something along the following lines:
soup.find('p').getText()
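If the page has more than one <p>, it is safer to match on an attribute instead of grabbing the first paragraph. A fuller sketch, assuming BeautifulSoup 4 (bs4) and the tag from the question as the input:
from bs4 import BeautifulSoup
html = '<p class="Type__TypeElement-sc-9snywk-0 dHxvMA ProfileSection__value--1bo-L" data-hj-suppress="true" data-qa="Profile Field: Country">SE</p>'
soup = BeautifulSoup(html, 'html.parser')
# the generated class names look unstable, so match on the data-qa attribute instead
country = soup.find('p', attrs={'data-qa': 'Profile Field: Country'})
print(country.get_text())  # SE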

I recommend a simple library. Here's an example: https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples
from simplified_scrapy.simplified_doc import SimplifiedDoc
html = '''
<p class="Type__TypeElement-sc-9snywk-0 dHxvMA ProfileSection__value--1bo-L" data-hj-suppress="true" data-qa="Profile Field: Country">SE</p>
'''
doc = SimplifiedDoc(html)
print(doc.p.text)

Related

Use pyquery to filter HTML

I'm trying to use pyquery to parse HTML and I'm facing an issue. My code is below:
from pyquery import PyQuery as pq
document = pq('<p id="hello">Hello</p><p id="world">World !!</p>')
p = document('p')
print(p.filter("#hello"))
The expected print result is:
<p id="hello">Hello</p>
But the actual output is:
<p id="hello">Hello</p><p id="world">World !!</p></div></html>
If I just want the specific part of the HTML instead of the entire HTML content, how should I write it?
Thanks
You can use the built-in library ElementTree:
import xml.etree.ElementTree as ET
html = '''<html><p id="hello">Hello</p><p id="world">World !!</p></html>'''
root = ET.fromstring(html)
p = root.find('.//p[@id="hello"]')
print(ET.tostring(p))
output
b'<p id="hello">Hello</p>'
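If you would rather stay within pyquery itself, one option is to serialize only the selected element. A minimal sketch, assuming your pyquery version provides outer_html() (check your release if unsure):
from pyquery import PyQuery as pq
document = pq('<p id="hello">Hello</p><p id="world">World !!</p>')
hello = document('p').filter('#hello')
# outer_html() serializes just the first matched element, without the surrounding document
print(hello.outer_html())  # <p id="hello">Hello</p>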

Parse element's tail with requests-html

I want to parse an HTML document like this with requests-html 0.9.0:
from requests_html import HTML
html = HTML(html='<span><span class="data">important data</span> and some rubbish</span>')
data = html.find('.data', first=True)
print(data.html)
# <span class="data">important data</span> and some rubbish
print(data.text)
# important data and some rubbish
I need to distinguish the text inside the tag (enclosed by it) from the tag's tail (the text that follows the element up to the next tag). This is the behaviour I initially expected:
data.text == 'important data'
data.tail == ' and some rubbish'
But tail is not defined for Elements. Since requests-html provides access to inner lxml objects, we can try to get it from lxml.etree.Element.tail:
from lxml.etree import tostring
print(tostring(data.lxml))
# b'<html><span class="data">important data</span></html>'
print(data.lxml.tail is None)
# True
There's no tail in lxml representation! The tag with its inner text is OK, but the tail seems to be stripped away. How do I extract 'and some rubbish'?
Edit: I discovered that full_text provides the inner text only (so much for “full”). This enables a dirty hack of subtracting full_text from text, although I'm not positive it will work if there are any links.
print(data.full_text)
# important data
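(For illustration only, a sketch of that subtraction hack; it just slices off the length of full_text and will break if the markup gets more complicated:)
tail = data.text[len(data.full_text):]
print(repr(tail))
# ' and some rubbish'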
I'm not sure I've understood your problem, but if you just want to get 'and some rubbish' you can use the code below:
from requests_html import HTML
from lxml.html import fromstring
html = HTML(html='<span><span class="data">important data</span> and some rubbish</span>')
data = fromstring(html.html)
# or without using requests_html.HTML: data = fromstring('<span><span class="data">important data</span> and some rubbish</span>')
print(data.xpath('//span[span[@class="data"]]/text()')[-1])  # " and some rubbish"
NOTE that data = html.find('.data', first=True) returns the <span class="data">important data</span> node, which doesn't contain " and some rubbish" - that text is a child text node of the parent span!
The tail property exists on objects of type 'lxml.html.HtmlElement'.
I think what you are asking for is very easy to implement.
Here is a very simple example using requests_html and lxml:
from requests_html import HTML
html = HTML(html='<span><span class="data">important data</span> and some rubbish</span>')
data = html.find('span')
print(data[0].text)  # important data and some rubbish
print(data[-1].text)  # important data
print(data[-1].element.tail)  # and some rubbish
The element attribute points to the 'lxml.html.HtmlElement' object.
Hope this helps.

Python - cssselect with spaces

I use lxml for parsing HTML files in Python, and I use cssselect.
Something like this:
from lxml.html import parse
page = parse('http://.../').getroot()
img = page.cssselect('div.photo cover div.outer a') # problem
But I have a problem. There are spaces in class-names in HTML:
<div class="photo cover"><div class=outer>
<a href=...
Without the spaces everything is OK. How can I parse it (I can't edit the HTML)?
To match a div with both the photo and cover classes, use div.photo.cover:
img = page.cssselect('div.photo.cover div.outer a')
Instead of thinking of class="photo cover" as a class attribute with photo cover as a single value, think of it as a class attribute with two values, photo and cover.
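A self-contained sketch, assuming lxml plus the cssselect package are installed; the markup and href below are stand-ins for the real page:
from lxml.html import fromstring
html = '''
<div class="photo cover">
  <div class="outer">
    <a href="/img/cover.jpg">photo</a>
  </div>
</div>
'''
page = fromstring(html)
# div.photo.cover matches a div that carries BOTH classes, whatever their order in the attribute
links = page.cssselect('div.photo.cover div.outer a')
print(links[0].get('href'))  # /img/cover.jpg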

Get specific text such as "Something new" using BeautifulSoup (Python)

I am making a focused crawler and facing an issue while searching for a key phrase in the document.
Supposing the key phrase I want to search in the document is "Something new"
using BeautifulSoup with python I do the following
if soup.find_all(text=re.compile("Something new", re.IGNORECASE)):
    print True
I want it to print true only for the following cases
"something new" --> true
"$#something new,." --> true
AND not for the following cases:
"thisSomething news" --> false
"Somethingnew" --> false
assuming special characters are allowed.
Has anyone ever done something like this before?
Thanks for the help.
The problem is that with re.IGNORECASE, "thisSomething news" would also match (its lowercase form contains "something new"). So search for the lowercase phrase something new and don't apply re.IGNORECASE:
import re
from bs4 import BeautifulSoup
data = """
<div>
<span>something new</span>
<span>$#something new,.</span>
<span>thisSomething news</span>
<span>Somethingnew</span>
</div>
"""
soup = BeautifulSoup(data)
for item in soup.find_all(text=re.compile("something new")):
    print item
Prints:
something new
$#something new,.
You can also take a non-regex approach and pass a function instead of a compiled regex pattern:
for item in soup.find_all(text=lambda x: 'something new' in x):
    print item
For the example HTML used above, it also prints:
something new
$#something new,.
This is one of the alternative methods that I used:
soup.find_all(text=re.compile(r"\bSomething new\b", re.IGNORECASE))
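For reference, a quick check (a minimal sketch reusing the data HTML from the answer above) showing that the word-boundary pattern matches only the first two cases:
import re
from bs4 import BeautifulSoup
data = """
<div>
<span>something new</span>
<span>$#something new,.</span>
<span>thisSomething news</span>
<span>Somethingnew</span>
</div>
"""
soup = BeautifulSoup(data, "html.parser")
pattern = re.compile(r"\bSomething new\b", re.IGNORECASE)
print(soup.find_all(text=pattern))
# ['something new', '$#something new,.']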
Thanks everyone.

How can I remove <p></p> with Python sub

I have an html file and I want to replace the empty paragraphs with a space.
mystring = "This <p></p><p>is a test</p><p></p><p></p>"
result = mystring.sub("<p></p>" , " ")
This is not working.
Please, don't try to parse HTML with regular expressions. Use a proper parsing module, like HTMLParser or BeautifulSoup, to achieve this. "Suffer" a short learning curve now and benefit:
Your parsing code will be more robust, handling corner cases you may not have considered that will fail with a regex
For future HTML parsing/munging tasks, you will be empowered to do things faster, so eventually the time investment pays off as well.
You won't be sorry! Profit guaranteed!
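To make that concrete, a minimal BeautifulSoup 4 (bs4) sketch, assuming "empty" means a <p> with no child tags and no non-whitespace text:
from bs4 import BeautifulSoup
mystring = "This <p></p><p>is a test</p><p> </p><p></p>"
soup = BeautifulSoup(mystring, "html.parser")
for p in soup.find_all("p"):
    # empty = no child tags and no non-whitespace text
    if not p.find(True) and not p.get_text(strip=True):
        p.replace_with(" ")
print(soup)  # the empty paragraphs are now single spaces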
I think it's always nice to give an example of how to do this with a real parser, as well as just repeating the sound advice that Eli Bendersky gives in his answer.
Here's an example of how to remove empty <p> elements using lxml. lxml's HTMLParser deals with HTML very well.
from lxml import etree
from StringIO import StringIO
input = '''This <p> </p><p>is a test</p><p></p><p><b>Bye.</b></p>'''
parser = etree.HTMLParser()
tree = etree.parse(StringIO(input), parser)
for p in tree.xpath("//p"):
    if len(p):
        continue
    t = p.text
    if not (t and t.strip()):
        p.getparent().remove(p)
print etree.tostring(tree.getroot(), pretty_print=True)
... which produces the output:
<html>
<body>
<p>This </p>
<p>is a test</p>
<p>
<b>Bye.</b>
</p>
</body>
</html>
Note that I misread the question when replying to this, and I'm only removing the empty <p> elements, not replacing them with &nbsp;. With lxml, I'm not sure of a simple way to do this, so I've created another question to ask:
How can one replace an element with text in lxml?
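For completeness, one common way to do that replacement in lxml is to move the text onto the preceding sibling's tail (or the parent's text) before removing the element. A hedged sketch; replace_with_text is just an illustrative helper, not an lxml API:
def replace_with_text(el, text):
    # lxml stores the text that follows an element in .tail, so to replace an
    # element with text we attach the replacement (plus the element's own tail)
    # to whatever node precedes it, then drop the element
    parent = el.getparent()
    prev = el.getprevious()
    text = text + (el.tail or "")
    if prev is not None:
        prev.tail = (prev.tail or "") + text
    else:
        parent.text = (parent.text or "") + text
    parent.remove(el)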
I think for this particular problem a parsing module would be overkill.
Simply use str.replace:
>>> mystring = "This <p></p><p>is a test</p><p></p><p></p>"
>>> mystring.replace("<p></p>"," ")
'This <p>is a test</p> '
What if <p> is entered as <P>, or < p >, or has an attribute added, or is given using the empty tag syntax <P/>? Pyparsing's HTML tag support handles all of these variations:
from pyparsing import makeHTMLTags, replaceWith, withAttribute
mystring = 'This <p></p><p>is a test</p><p align="left"></p><P> </p><P/>'
p,pEnd = makeHTMLTags("P")
emptyP = p.copy().setParseAction(withAttribute(empty=True))
null_paragraph = emptyP | p+pEnd
null_paragraph.setParseAction(replaceWith(" "))
print null_paragraph.transformString(mystring)
Prints:
This <p>is a test</p>
Using a regexp?
import re
result = re.sub(r"<p>\s*</p>", " ", mystring, flags=re.MULTILINE)
Compile the regexp if you use it often.
I wrote this code:
from lxml import etree
from StringIO import StringIO
html_tags = """<div><ul><li>PID temperature controller</li> <li>Smart and reliable</li> <li>Auto-diagnosing</li> <li>Auto setting</li> <li>Intelligent control</li> <li>2-Rows 4-Digits LED display</li> <li>Widely applied in the display and control of the parameter of temperature, pressure, flow, and liquid level</li> <li> </li> <p> </p></ul> <div> </div></div>"""
document = etree.iterparse(StringIO(html_tags), html=True)
for a, e in document:
    if not (e.text and e.text.strip()) and len(e) == 0:
        e.getparent().remove(e)
print etree.tostring(document.root)
