I am using the lxml and requests modules to parse an article from a website. I also tried find_all from BeautifulSoup, but still came up empty.
from lxml import html
import requests
page = requests.get('https://www.thehindu.com/news/national/karnataka/kumaraswamy-congress-leaders-meet-to-discuss-cabinet-reshuffle/article27283040.ece')
tree = html.fromstring(page.content)
article = tree.xpath('//div[@class="article"]/text()')
Once I print article, I get a list of ['\n','\n','\n','\n','\n'], rather than the body of the article. Where exactly am I going wrong?
I would use bs4 and pass the class name as a CSS selector to select_one:
import requests
from bs4 import BeautifulSoup as bs
page = requests.get('https://www.thehindu.com/news/national/karnataka/kumaraswamy-congress-leaders-meet-to-discuss-cabinet-reshuffle/article27283040.ece')
soup = bs(page.content, 'lxml')
print(soup.select_one('.article').text)
If you use
article = tree.xpath('//div[@class="article"]//text()')
you get a list that still contains all the \n entries, but also the text, which you can filter out with re.sub or conditional logic.
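For example, a minimal sketch that drops the whitespace-only strings (same tree and XPath as above):
# keep only the entries that contain non-whitespace text
texts = tree.xpath('//div[@class="article"]//text()')
cleaned = [t.strip() for t in texts if t.strip()]
print(cleaned)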
In Python 3, how would you go about taking the string between header tags, for example printing Hello, world! out of <h1>Hello, world!</h1>? Here is my attempt:
from urllib.request import urlopen

# example URL that includes an <h1> tag: http://www.hobo-web.co.uk/headers/
userAddress = input("Enter a website URL: ")
webPage = urlopen(userAddress)

# collect the response body line by line
lines = []
for line in webPage:
    lines.append(line.decode("utf-8", errors="replace"))
You need an HTML Parser. For example, BeautifulSoup:
from bs4 import BeautifulSoup
soup = BeautifulSoup(webPage, "html.parser")
print(soup.find("h1").get_text(strip=True))
Demo:
>>> from urllib.request import urlopen
>>> from bs4 import BeautifulSoup
>>>
>>> url = "http://www.hobo-web.co.uk/headers/"
>>> webPage = urlopen(url)
>>>
>>> soup = BeautifulSoup(webPage, "html.parser")
>>> print(soup.find("h1").get_text(strip=True))
How To Use H1-H6 HTML Elements Properly
I'm not allowed to use any additional libraries, aside from what comes with Python. Does Python come with the ability to parse HTML, albeit in a less efficient way?
If you are, for some reason, not allowed to use third-party packages, you can use the built-in html.parser module. Some people also use regular expressions to parse HTML. That is not always a bad idea, but you have to be very careful with it; see:
RegEx match open tags except XHTML self-contained tags
HTMLParser is definitely your best friend for this issue.
There are related questions that already exist and cover your needs.
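For instance, a minimal sketch using only the standard library to pull the <h1> text from the example page above (the class name here is my own):
from html.parser import HTMLParser
from urllib.request import urlopen

class H1Parser(HTMLParser):
    # collects the text found inside <h1> tags
    def __init__(self):
        super().__init__()
        self.in_h1 = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_h1 = False

    def handle_data(self, data):
        if self.in_h1:
            self.chunks.append(data)

html_text = urlopen("http://www.hobo-web.co.uk/headers/").read().decode("utf-8", errors="replace")
parser = H1Parser()
parser.feed(html_text)
print("".join(parser.chunks).strip())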
I'm trying to parse an XML page with BeautifulSoup and for some reason it's not able to find the XML parser. I don't think it's a path issue as I've used lxml to parse pages in the past, just not XML. Here's the code:
from bs4 import *
import urllib2
import lxml
from lxml import *
BASE_URL = "http://auctionresults.fcc.gov/Auction_66/Results/xml/round/66_115_database_round.xml"
proxy = urllib2.ProxyHandler({'http': 'http://myProxy.com'})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)
page = urllib2.urlopen(BASE_URL)
soup = BeautifulSoup(page,"xml")
print soup
I'm probably missing something simple, but all the XML parsing with BS questions I found on here were around bs3 and I'm using bs4 which uses a different method for parsing XML. Thanks.
If you have lxml installed, just pass it as BeautifulSoup's parser, like below.
Code:
from bs4 import BeautifulSoup as bsoup
import requests as rq
url = "http://auctionresults.fcc.gov/Auction_66/Results/xml/round/66_115_database_round.xml"
r = rq.get(url)
soup = bsoup(r.content, "lxml")
print(soup)
Result:
<html><body><dataroot xmlns:od="urn:schemas-microsoft-com:officedata" xmlns:xsi="http://www.w3.org/2000/10/XMLSchema-instance" xsi:nonamespaceschemalocation="66_database.xsd"><all_bids>
<auction_id>66</auction_id>
<auction_description>Advanced Wireless Services</auction_description>
... really long list follows...
[Finished in 34.9s]
Let us know if this helps.
I have a fairly complex template script that BeautifulSoup4 isn't understanding for some reason. As you can see below, BS4 is only parsing partially into the tree before giving up. Why is this and is there a way to fix it?
>>> from bs4 import BeautifulSoup
>>> html = """<script id="scriptname" type="text/template"><section class="sectionname"><header><h1>Test</h1></header><table><tr><th>Title</th><td class="class"></td><th>Title</th><td class="class"></td></tr><tr><th>Title</th><td class="class"></td><th>Another row</th><td class="checksum"></td></tr></table></section></script> Other stuff I want to stay"""
>>> soup = BeautifulSoup(html)
>>> soup.findAll('script')
[<script id="scriptname" type="text/template"><section class="sectionname"><header><h1>Test</script>]
Edit: on further testing, for some reason it appears that BS3 is able to parse this correctly:
>>> from BeautifulSoup import BeautifulSoup as bs3
>>> soup = bs3(html)
>>> soup.script
<script id="scriptname" type="text/template"><section class="sectionname"><header><h1>Test</h1></header><table><tr><th>Title</th><td class="class"></td><th>Title</th><td class="class"></td></tr><tr><th>Title</th><td class="class"></td><th>Another row</th><td class="checksum"></td></tr></table></section></script>
Beautiful Soup sometimes fails with its default parser. Beautiful Soup supports the HTML parser included in Python's standard library, but it also supports a number of third-party Python parsers.
In some cases I have to switch to a different parser, such as lxml or html5lib.
Here is an example of the above:
from bs4 import BeautifulSoup
soup = BeautifulSoup(markup, "lxml")
I recommend you read the documentation on installing a parser: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser
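For the template <script> from this question, a minimal sketch of switching parsers might look like this (assuming html5lib is installed; html is the string from above):
from bs4 import BeautifulSoup

# html5lib treats script content as raw text, so the template survives intact
soup = BeautifulSoup(html, "html5lib")
print(soup.find("script"))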
I am using BeautifulSoup to scrape a URL, and I have the following code to find the td tag whose class is 'empformbody':
import urllib
import urllib2
from BeautifulSoup import BeautifulSoup
url = "http://www.example.com/servlet/av/ResultTemplate=AVResult.html"
req = urllib2.Request(url)
response = urllib2.urlopen(req)
the_page = response.read()
soup = BeautifulSoup(the_page)
soup.findAll('td',attrs={'class':'empformbody'})
Now in the above code we can use findAll to get tags and information related to them, but I want to use XPath. Is it possible to use XPath with BeautifulSoup? If possible, please provide me example code.
Nope, BeautifulSoup, by itself, does not support XPath expressions.
An alternative library, lxml, does support XPath 1.0. It has a BeautifulSoup compatible mode where it'll try and parse broken HTML the way Soup does. However, the default lxml HTML parser does just as good a job of parsing broken HTML, and I believe is faster.
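That compatibility mode lives in lxml.html.soupparser; a minimal sketch (note that it requires BeautifulSoup to be installed as well):
from lxml.html import soupparser

broken_html = "<p>Some<b>broken<p>HTML"
root = soupparser.fromstring(broken_html)  # parses via BeautifulSoup
print(root.xpath("//p/text()"))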
Once you've parsed your document into an lxml tree, you can use the .xpath() method to search for elements.
try:
    # Python 2
    from urllib2 import urlopen
except ImportError:
    # Python 3
    from urllib.request import urlopen

from lxml import etree

url = "http://www.example.com/servlet/av/ResultTemplate=AVResult.html"
response = urlopen(url)
htmlparser = etree.HTMLParser()
tree = etree.parse(response, htmlparser)
tree.xpath(xpathselector)  # substitute your own XPath expression here
There is also a dedicated lxml.html module with additional functionality.
Note that in the above example I passed the response object directly to lxml, as having the parser read directly from the stream is more efficient than reading the response into a large string first. To do the same with the requests library, you want to set stream=True and pass in the response.raw object after enabling transparent transport decompression:
import lxml.html
import requests
url = "http://www.example.com/servlet/av/ResultTemplate=AVResult.html"
response = requests.get(url, stream=True)
response.raw.decode_content = True
tree = lxml.html.parse(response.raw)
Of possible interest to you is the CSS Selector support; the CSSSelector class translates CSS statements into XPath expressions, making your search for td.empformbody that much easier:
from lxml.cssselect import CSSSelector

td_empformbody = CSSSelector('td.empformbody')
for elem in td_empformbody(tree):
    # Do something with these table cells.
    print(elem.text_content())
Coming full circle: BeautifulSoup itself does have very complete CSS selector support:
for cell in soup.select('table#foobar td.empformbody'):
    # Do something with these table cells.
    print(cell.get_text())
I can confirm that there is no XPath support within Beautiful Soup.
As others have said, BeautifulSoup doesn't have XPath support. There are a number of ways to evaluate an XPath, including using Selenium. However, here's a solution that works in either Python 2 or 3:
from lxml import html
import requests
page = requests.get('http://econpy.pythonanywhere.com/ex/001.html')
tree = html.fromstring(page.content)
#This will create a list of buyers:
buyers = tree.xpath('//div[@title="buyer-name"]/text()')
#This will create a list of prices
prices = tree.xpath('//span[@class="item-price"]/text()')
print('Buyers: ', buyers)
print('Prices: ', prices)
I used this as a reference.
BeautifulSoup has a function named findNext, which searches forward from the current element, so:
father.findNext('div',{'class':'class_value'}).findNext('div',{'id':'id_value'}).findAll('a')
The code above imitates the following XPath:
//div[@class="class_value"]//div[@id="id_value"]//a
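Here is a runnable sketch of that chain; the HTML and the class/id values are hypothetical:
from bs4 import BeautifulSoup

html = """
<div>
  <div class="class_value">
    <div id="id_value">
      <a href="#first">first</a>
      <a href="#second">second</a>
    </div>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
father = soup.find("div")  # the starting element
links = father.findNext("div", {"class": "class_value"}).findNext("div", {"id": "id_value"}).findAll("a")
print(links)  # both <a> tags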
from lxml import etree
from bs4 import BeautifulSoup

soup = BeautifulSoup(open('path of your localfile.html'), 'html.parser')
dom = etree.HTML(str(soup))
print(dom.xpath('//*[@id="BGINP01_S1"]/section/div/font/text()'))
The above combines the Soup object with lxml, so you can extract values using XPath.
When you use lxml, it is all simple:
import lxml.html

tree = lxml.html.fromstring(html)
i_need_element = tree.xpath('//a[@class="shared-components"]/@href')
But when you use BeautifulSoup (bs4), it is simple too:
first, remove "//" and "@"
second, add a star before "="
try this magic:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")
i_need_element = soup.select('a[class*="shared-components"]')
As you can see, this does not support sub-attributes, so I removed the "/@href" part.
I've searched through their docs and it seems there is no XPath option.
Also, as you can see here on a similar question on SO, the OP is asking for a translation from XPath to BeautifulSoup, so my conclusion would be - no, there is no XPath parsing available.
Maybe you can try the following without XPath
from simplified_scrapy.simplified_doc import SimplifiedDoc
html = '''
<html>
<body>
<div>
<h1>Example Domain</h1>
<p>This domain is for use in illustrative examples in documents. You may use this
domain in literature without prior coordination or asking for permission.</p>
<p>More information...</p>
</div>
</body>
</html>
'''
# It can do whatever XPath can do
doc = SimplifiedDoc(html)
# The result is the same as doc.getElementByTag('body').getElementByTag('div').getElementByTag('h1').text
print (doc.body.div.h1.text)
print (doc.div.h1.text)
print (doc.h1.text) # Shorter paths will be faster
print (doc.div.getChildren())
print (doc.div.getChildren('p'))
This is a pretty old thread, but there is a work-around solution now, which may not have been in BeautifulSoup at the time.
Here is an example of what I did. I use the "requests" module to read an RSS feed and get its text content in a variable called "rss_text". With that, I run it through BeautifulSoup, search for the XPath /rss/channel/title, and retrieve its contents. It's not exactly XPath in all its glory (wildcards, multiple paths, etc.), but if you just have a basic path you want to locate, this works.
from bs4 import BeautifulSoup

# rss_text holds the feed's XML, fetched earlier with requests
rss_obj = BeautifulSoup(rss_text, 'xml')
cls.title = rss_obj.rss.channel.title.get_text()
Use soup.find(class_='myclass').
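For example, a small sketch against a hypothetical document:
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div class="myclass">hello</div>', "html.parser")
print(soup.find(class_="myclass").text)  # hello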