can we use XPath with BeautifulSoup? - python

I am using BeautifulSoup to scrape a URL, and I have the following code to find the td tag whose class is 'empformbody':
import urllib2
from BeautifulSoup import BeautifulSoup
url = "http://www.example.com/servlet/av/ResultTemplate=AVResult.html"
req = urllib2.Request(url)
response = urllib2.urlopen(req)
the_page = response.read()
soup = BeautifulSoup(the_page)
soup.findAll('td', attrs={'class': 'empformbody'})
Now in the above code I can use findAll to get tags and the information related to them, but I want to use XPath. Is it possible to use XPath with BeautifulSoup? If possible, please provide example code.

Nope, BeautifulSoup, by itself, does not support XPath expressions.
An alternative library, lxml, does support XPath 1.0. It has a BeautifulSoup-compatible mode where it will try to parse broken HTML the way Soup does. However, the default lxml HTML parser does just as good a job of parsing broken HTML, and I believe it is faster.
Once you've parsed your document into an lxml tree, you can use the .xpath() method to search for elements.
try:
    # Python 2
    from urllib2 import urlopen
except ImportError:
    # Python 3
    from urllib.request import urlopen
from lxml import etree
url = "http://www.example.com/servlet/av/ResultTemplate=AVResult.html"
response = urlopen(url)
htmlparser = etree.HTMLParser()
tree = etree.parse(response, htmlparser)
tree.xpath('//td[@class="empformbody"]')  # e.g. the td search from the question
There is also a dedicated lxml.html module with additional functionality.
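For instance, a minimal sketch of the same td search via lxml.html (reusing the example URL from the question; lxml.html.parse() accepts a file object or URL):
import lxml.html
from urllib.request import urlopen
url = "http://www.example.com/servlet/av/ResultTemplate=AVResult.html"
tree = lxml.html.parse(urlopen(url))  # fetch and parse in one step
cells = tree.xpath('//td[@class="empformbody"]')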
Note that in the above example I passed the response object directly to lxml, as having the parser read directly from the stream is more efficient than reading the response into a large string first. To do the same with the requests library, you want to set stream=True and pass in the response.raw object after enabling transparent transport decompression:
import lxml.html
import requests
url = "http://www.example.com/servlet/av/ResultTemplate=AVResult.html"
response = requests.get(url, stream=True)
response.raw.decode_content = True  # enable transparent transport decompression
tree = lxml.html.parse(response.raw)
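From there the parsed tree supports the same .xpath() calls as in the previous example, e.g. tree.xpath('//td[@class="empformbody"]').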
Of possible interest to you is the CSS Selector support; the CSSSelector class translates CSS statements into XPath expressions, making your search for td.empformbody that much easier:
from lxml.cssselect import CSSSelector
td_empformbody = CSSSelector('td.empformbody')
for elem in td_empformbody(tree):
    print(elem.text_content())  # do something with each table cell
Coming full circle: BeautifulSoup itself does have very complete CSS selector support:
for cell in soup.select('table#foobar td.empformbody'):
    print(cell.get_text())  # do something with each table cell

I can confirm that there is no XPath support within Beautiful Soup.

As others have said, BeautifulSoup doesn't have XPath support. There are a number of ways to get something from an XPath, including using Selenium. However, here's a solution that works in either Python 2 or 3:
from lxml import html
import requests
page = requests.get('http://econpy.pythonanywhere.com/ex/001.html')
tree = html.fromstring(page.content)
# This will create a list of buyers:
buyers = tree.xpath('//div[@title="buyer-name"]/text()')
# This will create a list of prices:
prices = tree.xpath('//span[@class="item-price"]/text()')
print('Buyers: ', buyers)
print('Prices: ', prices)
I used this as a reference.

BeautifulSoup has a method named findNext that searches forward from the current element, so you can chain it:
father.findNext('div', {'class': 'class_value'}).findNext('div', {'id': 'id_value'}).findAll('a')
The chain above roughly imitates the following XPath:
//div[@class='class_value']//div[@id='id_value']//a
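As a rough, self-contained sketch of that chain (the HTML, class, and id values here are made up for illustration):
from bs4 import BeautifulSoup
html = '''<h2>start</h2>
<div class="class_value">
<div id="id_value"><a href="/a">A</a><a href="/b">B</a></div>
</div>'''
soup = BeautifulSoup(html, 'html.parser')
father = soup.find('h2')  # hypothetical starting element
links = father.findNext('div', {'class': 'class_value'}).findNext('div', {'id': 'id_value'}).findAll('a')
print([a['href'] for a in links])  # ['/a', '/b']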

from lxml import etree
from bs4 import BeautifulSoup
soup = BeautifulSoup(open('path of your localfile.html'), 'html.parser')
dom = etree.HTML(str(soup))
print(dom.xpath('//*[@id="BGINP01_S1"]/section/div/font/text()'))
The above combines a Soup object with lxml, so you can extract values using XPath.

When you use lxml it is all simple:
tree = lxml.html.fromstring(html)
i_need_element = tree.xpath('//a[@class="shared-components"]/@href')
But when you use BeautifulSoup (bs4) it is all simple too:
first, remove "//" and "@";
second, add a star before "=".
Try this magic:
soup = BeautifulSoup(html, "lxml")
i_need_element = soup.select('a[class*="shared-components"]')
As you can see, this does not support selecting the attribute itself, so I removed the "/@href" part.
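If you still need the href values, one way (ordinary bs4 attribute access) is to pull the attribute off each matched tag yourself:
hrefs = [a['href'] for a in soup.select('a[class*="shared-components"]')]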

I've searched through their docs and it seems there is no XPath option.
Also, as you can see in a similar question on SO, the OP is asking for a translation from XPath to BeautifulSoup, so my conclusion would be: no, there is no XPath parsing available.

Maybe you can try the following without XPath:
from simplified_scrapy.simplified_doc import SimplifiedDoc
html = '''
<html>
<body>
<div>
<h1>Example Domain</h1>
<p>This domain is for use in illustrative examples in documents. You may use this
domain in literature without prior coordination or asking for permission.</p>
<p>More information...</p>
</div>
</body>
</html>
'''
# Anything XPath can do, this can do too
doc = SimplifiedDoc(html)
# The result is the same as doc.getElementByTag('body').getElementByTag('div').getElementByTag('h1').text
print(doc.body.div.h1.text)
print(doc.div.h1.text)
print(doc.h1.text)  # Shorter paths will be faster
print(doc.div.getChildren())
print(doc.div.getChildren('p'))

This is a pretty old thread, but there is a workaround now, which may not have been in BeautifulSoup at the time.
Here is an example of what I did. I use the "requests" module to read an RSS feed and get its text content in a variable called "rss_text". With that, I run it through BeautifulSoup, search for the XPath /rss/channel/title, and retrieve its contents. It's not exactly XPath in all its glory (wildcards, multiple paths, etc.), but if you just have a basic path you want to locate, this works.
from bs4 import BeautifulSoup
rss_obj = BeautifulSoup(rss_text, 'xml')
cls.title = rss_obj.rss.channel.title.get_text()

Use soup.find(class_='myclass').
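A minimal sketch with throwaway HTML, just to show what that returns:
from bs4 import BeautifulSoup
soup = BeautifulSoup('<div class="myclass">hello</div>', 'html.parser')
print(soup.find(class_='myclass').get_text())  # hello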

Related

Python HTML scraping cannot find attribute I know exists?

I am using the lxml and requests modules, and just trying to parse the article from a website. I tried using find_all from BeautifulSoup but still came up empty
from lxml import html
import requests
page = requests.get('https://www.thehindu.com/news/national/karnataka/kumaraswamy-congress-leaders-meet-to-discuss-cabinet-reshuffle/article27283040.ece')
tree = html.fromstring(page.content)
article = tree.xpath('//div[@class="article"]/text()')
Once I print article, I get a list of ['\n','\n','\n','\n','\n'], rather than the body of the article. Where exactly am I going wrong?
I would use bs4 and the class name in a CSS selector with select_one:
import requests
from bs4 import BeautifulSoup as bs
page = requests.get('https://www.thehindu.com/news/national/karnataka/kumaraswamy-congress-leaders-meet-to-discuss-cabinet-reshuffle/article27283040.ece')
soup = bs(page.content, 'lxml')
print(soup.select_one('.article').text)
If you use
article = tree.xpath('//div[@class="article"]//text()')
you get a list that still contains all the '\n' entries, but also the text, which you can handle with re.sub or conditional logic.
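For example, one simple way to drop the whitespace-only entries from that list (a sketch, not the only option; tree is the one parsed above):
texts = tree.xpath('//div[@class="article"]//text()')
cleaned = [t.strip() for t in texts if t.strip()]  # keep only non-empty text nodes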

How to extract ids and classes from a webpage using python?

This is my code so far:
import urllib2
with urllib2.urlopen("https://quora.com") as response:
    html = response.read()
I am new to Python, and somehow I managed to fetch the webpage; now, how do I extract the ids and classes from it?
A better way to do so would be using the BeautifulSoup (bs4) web-scraping library, and requests.
After installing both with pip, you can start like so:
import requests
from bs4 import BeautifulSoup
r = requests.get("http://quora.com")
soup = BeautifulSoup(r.content, "html.parser")
To find an element with a specific id:
soup.find(id="your_id")
To find all elements with the "Answer" class:
soup.find_all(class_="Answer")
You can then use .get_text() to strip the HTML tags, and use plain Python string operations to organize your data.
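For example (reusing the id placeholder from above and the soup object already built):
element = soup.find(id="your_id")
if element is not None:
    text = element.get_text(strip=True)  # tags removed, whitespace trimmed
    words = text.split()  # ordinary Python string operations from here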
You may try parsing the HTML using a dedicated library, for instance BeautifulSoup.
You can do it easily with HTML parsing:
from lxml import html
import requests
page = requests.get('http://google.com')
tree = html.fromstring(page.content)  # parse the HTML so it can be queried
with open('/home/Desktop/test.txt', 'wb') as f:
    f.write(page.content)

Python Beautiful Soup - find value based on text in HTML

I am having a problem finding a value in a soup based on text. Here is the code
from bs4 import BeautifulSoup as bs
import requests
import re
html='http://finance.yahoo.com/q/ks?s=aapl+Key+Statistics'
r = requests.get(html)
soup = bs(r.text)
findit=soup.find("td", text=re.compile('Market Cap'))
This returns [], yet there absolutely is text in a 'td' tag with 'Market Cap'.
When I use
soup.find_all("td")
I get a result set which includes:
<td class="yfnc_tablehead1" width="74%">Market Cap (intraday)<font size="-1"><sup>5</sup></font>:</td>
Explanation:
The problem is that this particular tag has other child elements, and the .string value, which is checked when you apply the text argument, is None (bs4 documents this behavior).
Solutions/Workarounds:
Don't specify the tag name here at all, find the text node and go up to the parent:
soup.find(text=re.compile('Market Cap')).parent.get_text()
Or, you can use find_parent() if td is not the direct parent of the text node:
soup.find(text=re.compile('Market Cap')).find_parent("td").get_text()
You can also use a "search function" to search for the td tags and see if the direct text child nodes has the Market Cap text:
soup.find(lambda tag: tag and
          tag.name == "td" and
          tag.find(text=re.compile('Market Cap'), recursive=False))
Or, if you are looking to find the following number 5:
soup.find(text=re.compile('Market Cap')).next_sibling.get_text()
You can't use a regex together with a tag name; it just won't work. I don't know whether it's a bug or by specification. So I just search for the text, and then get the parent back in a list comprehension, keeping only matches whose parent is a td tag.
Code
from bs4 import BeautifulSoup as bs
import requests
import re
html='http://finance.yahoo.com/q/ks?s=aapl+Key+Statistics'
r = requests.get(html)
soup = bs(r.text, "lxml")
findit = soup.find_all(text=re.compile('Market Cap'))
findit = [x.parent for x in findit if x.parent.name == "td"]
print(findit)
Output
[<td class="yfnc_tablehead1" width="74%">Market Cap (intraday)<font size="-1"><sup>5</sup></font>:</td>]
Regex is just a terrible thing to integrate into parsing code and in my humble opinion should be avoided whenever possible.
Personally, I don't like BeautifulSoup due to its lack of XPath support. What you're trying to do is the sort of thing that XPath is ideally suited for. If I were doing what you're doing, I would use lxml for parsing rather than BeautifulSoup's built in parsing and/or regex. It's really quite elegant and extremely fast:
from lxml import etree
import requests
source = requests.get('http://finance.yahoo.com/q/ks?s=aapl+Key+Statistics').content
parsed = etree.HTML(source)
tds_w_market_cap = parsed.xpath('//td[contains(., "Market Cap")]')
FYI, the above returns lxml element objects rather than the text of the page source. In lxml you don't really work with the source directly, per se. If you need to get back a list of the actual source for some reason, you would add something like:
print([etree.tostring(i) for i in tds_w_market_cap])
If you absolutely have to use BeautifulSoup for this task, then I'd use a list comprehension:
from bs4 import BeautifulSoup as bs
import requests
source = requests.get('http://finance.yahoo.com/q/ks?s=aapl+Key+Statistics').content
parsed = bs(source, 'lxml')
tds_w_market_cap = [i for i in parsed.find_all('td') if 'Market Cap' in i.get_text()]

python list.append between text

In Python 3, how would you go about taking the string between header tags? For example, printing Hello, world! out of <h1>Hello, world!</h1>:
import urllib
from urllib.request import urlopen
# example URL that includes an <h1> tag: http://www.hobo-web.co.uk/headers/
userAddress = input("Enter a website URL: ")
webPage = urllib.request.urlopen(userAddress)
list = []
while webPage != "":
    webPage.read()
    list.append()
You need an HTML Parser. For example, BeautifulSoup:
from bs4 import BeautifulSoup
soup = BeautifulSoup(webPage, "html.parser")
print(soup.find("h1").get_text(strip=True))
Demo:
>>> from urllib.request import urlopen
>>> from bs4 import BeautifulSoup
>>>
>>> url = "http://www.hobo-web.co.uk/headers/"
>>> webPage = urlopen(url)
>>>
>>> soup = BeautifulSoup(webPage, "html.parser")
>>> print(soup.find("h1").get_text(strip=True))
How To Use H1-H6 HTML Elements Properly
I'm not allowed to use any additional libraries aside from what comes with Python. Does Python come with the ability to parse HTML, albeit in a less efficient way?
If you are, for some reason, not allowed to use third-party packages, you can use the built-in html.parser module. Some people also use regular expressions to parse HTML. That is not always a bad thing, but you have to be very careful with it; see:
RegEx match open tags except XHTML self-contained tags
Definitely HTMLParser is your best friend to deal with that issue.
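As a minimal standard-library-only sketch (the class name H1Parser is made up for illustration), a subclass that collects the text inside <h1> tags could look like this:
from html.parser import HTMLParser
class H1Parser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_h1 = False
        self.headings = []
    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.in_h1 = True
    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_h1 = False
    def handle_data(self, data):
        if self.in_h1:
            self.headings.append(data.strip())
parser = H1Parser()
parser.feed("<h1>Hello, world!</h1>")
print(parser.headings)  # ['Hello, world!']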
There are related questions that already exist and cover your needs.

python url regexp

I have a regexp and I want to append its output to my URL. For example:
url = 'blabla.com'
r = re.findall(r'<p>(.*?)</a>', html)
# r output: /any_string/on/any/server/
But I don't know how to make a GET request with the regexp output:
blabla.com/any_string/on/any/server/
Don't use regex to parse html. Use a real parser.
I suggest using the lxml.html parser. lxml supports xpath, which is a very powerful way of querying structured documents. There's a ready-to-use make_links_absolute() method that does what you ask. It's also very fast.
As an example, in this question's page HTML source code (the one you're reading now) there's this part:
<li><a id="nav-tags" href="/tags">Tags</a></li>
The XPath expression //a[@id='nav-tags']/@href means: "Get me the href attribute of all <a> tags with an id attribute equal to nav-tags". Let's use it:
from lxml import html
url = 'http://stackoverflow.com/questions/3423822/python-url-regexp'
doc = html.parse(url).getroot()
doc.make_links_absolute()
links = doc.xpath("//a[@id='nav-tags']/@href")
print(links)
The result:
['http://stackoverflow.com/tags']
Just get Beautiful Soup:
http://www.crummy.com/software/BeautifulSoup/documentation.html#Parsing+a+Document
import urllib2
from BeautifulSoup import BeautifulSoup
page = urllib2.urlopen(url)
soup = BeautifulSoup(page)
soup.findAll('p')
