Why do the results of response.xpath('//html') differ from response.body? - python

I'm trying to parse this page using scrapy
http://mobileshop.ae/one-x
I need to extract the links of the products.
The problem is that the links are present in the response.body result, but not available if you try response.xpath('//body').extract().
The results of response.body and response.xpath('//body') are different:
>>> body = response.body
>>> body_2 = response.xpath('//html').extract()[0]
>>> len(body)
238731
>>> len(body_2)
67520
The same short result appears for response.xpath('.').extract()[0].
Any idea why this happens, and how can I extract the data I need?

So, the issue here is a lot of malformed content in that page, including several unclosed tags. One way to solve this problem is to use lxml's soupparser to parse the malformed content (it uses BeautifulSoup under the covers) and build a Scrapy Selector with it.
Example session with scrapy shell http://mobileshop.ae/one-x:
>>> from lxml.html import soupparser
>>> from scrapy import Selector
>>> sel = Selector(_root=soupparser.fromstring(response.body))
>>> sel.xpath('//h4[@class="name"]/a/text()').extract()
[u'HTC One X 3G 16GB Grey',
u'HTC One X 3G 16GB White',
u'HTC One X 3G 32GB Grey',
u'HTC One X 3G 32GB White']
Note that using the BeautifulSoup parser is a lot slower than lxml's default parser. You probably want to do this only in the places where it's really needed.
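If you only want to pay that cost where it is needed, one possible fallback pattern (my own sketch, reusing the selector and the _root construction from the snippet above) is to re-parse with soupparser only when the regular lxml-based selector comes back empty:
from lxml.html import soupparser
from scrapy import Selector

def product_names(response):
    # Try Scrapy's normal lxml-based selector first.
    names = response.xpath('//h4[@class="name"]/a/text()').extract()
    if not names:
        # Page too broken for lxml's default parser: re-parse leniently.
        sel = Selector(_root=soupparser.fromstring(response.body))
        names = sel.xpath('//h4[@class="name"]/a/text()').extract()
    return names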

response.xpath("//body") returns the body element of the HTML contained in the response, while response.body returns the body (the message-body) of the whole HTTP response, i.e. all of the HTML in the response, including the head and body elements.
response.xpath("//body") is actually a shortcut that converts the body of the HTTP response into a Selector object that can be navigated with XPath.
The links you need are contained in the HTML body element; they cannot really be anywhere else, so I'm not sure why you suggest they are not there. response.xpath("//body//a/@href") will give you all the links on the page; you probably need to craft an XPath that selects only the links you need.
The lengths you mention come from what you are measuring. len(response.xpath('//body').extract()) returns the number of elements matching the XPath, because .extract() returns a list of matching elements, and there is only one body in the document. With len(response.xpath('//body').extract()[0]) you are getting the body element as a string and taking the length of that string (the number of characters in the body). len(response.body) gives you the number of characters in the whole HTTP response; that number is higher most likely because the HTML head contains lots of scripts and stylesheets that are not present in the HTML body.
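To make the distinction concrete, here is a minimal sketch meant to be run inside scrapy shell (my illustration; the exact numbers depend on the page, so none are shown):
len(response.xpath('//body'))                # number of <body> elements matched -> 1
len(response.xpath('//body').extract()[0])   # characters in the serialized <body> string
len(response.body)                           # characters in the raw HTTP response (head + body)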

Related

Python Requests Module Doesn't Fetch All The Elements It Ought To

This piece of code works properly
from lxml import html
import requests
page = requests.get(c)
tree = html.fromstring(page.content)
link = tree.xpath('//script/text()')
But it doesn't fetch the whole content. Like it is hidden or something.
I can see this is the case because the next thing I do is this
print len(link)
and it returns nine (9)
I then go to the page (the string c above in the code), view its source (view-source:) in Mozilla, hit Ctrl+F and search for <script with a space at the end.
That returns thirty-three (33) matches. The one I want cannot be fetched.
What's happening? I can't understand it. Am I blocked or something? How can I get around this and make the requests module see what Mozilla is seeing?
If you try
tree.xpath('//script')
you should get 33 matches.
On your page only nine script elements contain any text between the opening and closing tags, which is why //script/text() returns only nine items.
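A small self-contained sketch (my own example, not the asker's page) of why the two counts differ: text() only matches script elements that actually have inline text between the tags:
from lxml import html

doc = html.fromstring(
    '<html><head>'
    '<script src="a.js"></script>'   # empty element: contributes no text node
    '<script>var x = 1;</script>'    # has an inline text node
    '</head></html>')
print(len(doc.xpath('//script')))         # 2 -> every script element
print(len(doc.xpath('//script/text()')))  # 1 -> only scripts with inline text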

xpath how to format path

I would like to get the @src value '/pol_il_DECK-SANTA-CRUZ-STAR-WARS-EMPIRE-STRIKES-BACK-POSTER-8-25-20135.jpg' from the webpage
from lxml import html
import requests
URL = 'http://systemsklep.pl/pol_m_Kategorie_Deskorolka_Deski-281.html'
session = requests.session()
page = session.get(URL)
HTMLn = html.fromstring(page.content)
print HTMLn.xpath('//html/body/div[1]/div/div/div[3]/div[19]/div/a[2]/div/div/img/@src')[0]
but I can't. No matter how I format the XPath, it doesn't work.
In the spirit of @pmuntima's answer, if you already know it's the 14th sourced image, but want to stay with lxml, then you can:
print HTMLn.xpath('//img/@data-src')[14]
To get that particular image. It similarly reports:
/pol_il_DECK-SANTA-CRUZ-STAR-WARS-EMPIRE-STRIKES-BACK-POSTER-8-25-20135.jpg
If you want to do your indexing in XPath (possibly more efficient in very large result sets), then:
print HTMLn.xpath('(//img/@data-src)[15]')[0]
It's a little bit uglier, given the need to parenthesize in the XPath, and then to index out the first element of the list that .xpath always returns. (Note that XPath positions are 1-based, so [15] here corresponds to Python's [14] above.)
Still, as discussed in the comments above, strictly numerical indexing is generally a fragile scraping pattern.
Update: So why is the XPath given by browser inspect tools not leading to the right element? Because the content seen by a browser, after a dynamic JavaScript-based update process, is different from the content seen by your request. Your request is not running JS, and is doing no such updates. Different content, different address needed--if the address is static and fragile, at any rate.
Part of the updates here seems to be taking src URIs, which initially point to an "I'm loading!" gif, and replacing them with the "real" src values, which are found in the data-src attribute to begin with.
So you need two changes:
a stronger way to address the content you want (a way that doesn't break when you move from browser inspect to program fetch) and
to fetch the URIs you want from data-src not src, because in your program fetch, the JS has not done its load-and-switch trick the way it did in the browser.
If you know text associated with the target image, that can be the trick. E.g.:
search_phrase = 'DECK SANTA CRUZ STAR WARS EMPIRE STRIKES BACK POSTER'
path = '//img[contains(@alt, "{}")]/@data-src'.format(search_phrase)
print HTMLn.xpath(path)[0]
This works because the alt attribute contains the target text. You look for images that have the search phrase contained in their alt attributes, then fetch the corresponding data-src values.
I used a combination of the requests and Beautiful Soup libraries. They are both wonderful, and I would recommend them for scraping and parsing/extracting HTML. If you have a complex scraping job, scrapy is really good.
So for your specific example, I can do
from bs4 import BeautifulSoup
import requests
URL = 'http://systemsklep.pl/pol_m_Kategorie_Deskorolka_Deski-281.html'
r = requests.get(URL)
soup = BeautifulSoup(r.text, "html.parser")
specific_element = soup.find_all('a', class_="product-icon")[14]
res = specific_element.find('img')["data-src"]
print(res)
It will print out
/pol_il_DECK-SANTA-CRUZ-STAR-WARS-EMPIRE-STRIKES-BACK-POSTER-8-25-20135.jpg

using python lxml+xpath to get videos from a page, get a list but can't print out the result?

I'm new to Python and would like to use lxml+xpath to get a video link from a web page. What I have now is:
import urllib2
from lxml import etree
url=u"http://hkdramas.se/fashion-war-%E6%BD%AE%E6%B5%81%E6%95%99%E4%B8%BB-episode-20/"
xpath=u"//script[contains(.,'label:\"360p\"')]"
html=urllib2.urlopen(url).read()
selector=etree.HTML(html)
get=selector.xpath(xpath)
print get
I've checked type() of get, which shows it's a list, but when I print get, it shows an unexpected [<Element script at 0x2a34b88>]. What does that mean, and how can I extract the actual URL of the video instead of the Element script?
Finally, I figured out why I had this problem, thanks @unutbu.
xpath=u"//script[contains(.,'label:\"360p\"')]"
should be
xpath=u"//script[contains(.,'label:\"360p\"')]//text()"
which adds text() to make sure only text, not elements, is returned under the selected element. Notice the //, which keeps it working when there are many sub-elements under the selection.
selector.xpath(xpath) returns a list of tags (or more accurately, Elements). When you print a list of objects, Python shows the repr of those Elements. <Element script at 0x2a34b88> is the repr of the script Element.
If elt is the script Element, then
elt.text will return the text inside the <script> tag, but you'll need to use something else (besides lxml) to extract the url from the text. You could, for example, use the regex pattern r'"(http[^"]+)"' to search for text which begins with "http and continues until another double quote, ", is found:
import re
import lxml.html as LH
url = u"http://hkdramas.se/fashion-war-%E6%BD%AE%E6%B5%81%E6%95%99%E4%B8%BB-episode-20/"
xpath = u"""//script[contains(.,'label:"360p"')]"""
root = LH.parse(url)
for elt in root.xpath(xpath):
    for url in re.findall(r'"(http[^"]+)"', elt.text):
        print(url)
yields
http://hkdramas.se/wp-content/plugins/BSplugin-version-1.2/lib/grab.php?link1=NS71jbj8NVNANTN7N0Nq7Y7FjeN0NojTN47HNcN77_Nhjh7INm7ONLNijCNc7-7UN_NXNCjcNYjeNwNF7uNQNA7dNvNm7-Nr7vNW7-NtjN72N4jVNCN8NfN-NANm7l7rNP7ff5aa877861da31d8cc9dd087d6ce2417fb1308a676a771b787adbffbaa4a0bffNfNHjtj-N6NDNg7HjLND7F7fjMj.jVjKN1N-jMj7NXj7jNNyjTNwjgjmji7INANtNONsN2NvN6jMNaNTNdNlNON8j7N~NEjO7lNyN.jQNaNuN1NYNjjzNnNENUNmNm7Z707dNaNTNFN0N6N8N.NRNuN_7dNtjhjJN-jmNZNpjjNo7fNHjTNNNSNLjMNqNUjN7IN7NPNfNENKN3jT7dNs&link2=
http://hkdramas.se/wp-content/plugins/BSplugin-version-1.2/lib/grab.php?link1=NvNeNVN4N276Nz7JNSjz7lNLNvNV7Ij3Nx7FNn7.Ni7FNU76NDNMN.NqNkNo7QNKNINiNhjPNJjmNKjPNGN.No7B7BNC7Y7B7B7lN67tjb7JNJNT7rNANrNBN7N6Nt7lN1ND0ba06b7bac4bab5fbb42dbff6c27647ea71b4f725a0c73f175eadf3b459424edN0NBNvNZj77wNL7Wj_j_71NnN0jpNfjPNqNvjDN.jEN4NRNDjijejmjXNINqNijEjENKNfNdN3jiNDNOjcNyN4NwNzN4NqNlNqNAjDNQNBN0Nk7a7Rj8NXN_NiN6NFNmNmNLNwNm7YN7j77vNfNpNljw7HjENRjmNMjVNLNEjq7BN0NON57JNyNyjpN8Nbjz7lN-NfNYNMN.7IjD7.NQ&link2=
Note that you do not need to import urllib2. You can pass a url directly to LH.parse.
To get only the url which is followed by the string '360p', you could use
for url in re.findall(r'"(http[^"]+).*360p"', elt.text):
    print(url)

Extract Text from HTML div using Python and lxml

I'm trying to get python to extract text from one spot of a website. I've identified the HTML div:
<div class="number">76</div>
which is in:
...div/div[1]/div/div[2]
I'm trying to use lxml to extract the '76' from that, but can't get a return out of it other than:
[]
Here's my code:
from lxml import html
import requests
url = 'https://sleepiq.sleepnumber.com/#/##1'
values = {'username': 'my@gmail.com',
          'password': 'mypassword'}
page = requests.get(url, data=values)
tree = html.fromstring(page.content)
hr = tree.xpath('//div[@class="number"]/text()')
print hr
Any suggestions? I feel this should be pretty easy, thanks in advance!
Update: the element I want is not contained in the page.content from requests.get
Updated Update: It looks like this is not logging me in to the page where the content I want is. It is only getting the login screen content.
Have you tried printing your page.content to make sure your requests.get is retrieving the content you want? That is often where things break. And your empty list returned off the xpath search indicates "not found."
Assuming that's okay, your parsing is close. I just tried the following, which is successful:
from lxml import html
tree = html.fromstring('<body><div class="number">76</div></body>')
number = tree.xpath('//div[@class="number"]/text()')[0]
number now equals '76'. Note the [0] indexing, because xpath always returns a list of what's found. You have to dereference to find the content.
A common gotcha here is that the XPath text() function isn't as inclusive or straightforward as it might seem. If there are any sub-elements to the div--e.g. if the text is really <div class="number"><strong>76</strong></div> then text() will return an empty list, because the text belongs to the strong not the div. In real-world HTML--especially HTML that's ever been cut-and-pasted from a word processor, or otherwise edited by humans--such extra elements are entirely common.
While it won't solve all known text management issues, one handy workaround is to use the // multi-level indirection instead of the / single-level indirection to text:
number = ''.join(tree.xpath('//div[#class="number"]//text()'))
Now, regardless of whether there are sub-elements or not, the total text will be concatenated and returned.
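As an illustration of that gotcha (my own minimal example, alongside the snippet above):
from lxml import html

tree = html.fromstring('<body><div class="number"><strong>76</strong></div></body>')
print(tree.xpath('//div[@class="number"]/text()'))             # [] -- the text belongs to <strong>
print(''.join(tree.xpath('//div[@class="number"]//text()')))   # '76' -- descendant text gathered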
Update Ok, if your problem is logging in, you probably want to try a requests.post (rather than .get) at minimum. In simpler cases, just that change might work. In others, the login needs to be done to a separate page than the page you want to retrieve/scape. In that case, you probably want to use a session object:
with requests.Session() as session:
    # First POST to the login page
    landing_page = session.post(login_url, data=values)
    # Now make the authenticated request within the session
    page = session.get(url)
    # ...use page as above...
This is a bit more complex, but it shows the logic for a separate login page. Many sites (e.g. WordPress sites) require this. Post-authentication, they often take you to pages (like the site home page) that aren't interesting content (though they can be scraped to confirm whether the login was successful). This altered login workflow doesn't change any of the parsing techniques, which work as above.
Beautiful Soup (http://www.pythonforbeginners.com/beautifulsoup/web-scraping-with-beautifulsoup) will help you out.
Another way:
http://docs.python-guide.org/en/latest/scenarios/scrape/
I'd use plain regex over xml tools in this case. It's easier to handle.
import re
import requests
url = 'http://sleepiq.sleepnumber.com/#/user/-9223372029758346943##2'
values = {'email-email': 'my@gmail.com', 'password-clear': 'Combination',
          'password-password': 'mypassword'}
page = requests.get(url, data=values, timeout=5)
m = re.search(r'(\w*)(<div class="number">)(.*)(<\/div>)', page.content)
# m = re.search(r'(\w*)(<title>)(.*)(<\/title>)', page.content)
if m:
    print(m.group(3))
else:
    print('Not found')

lxml tree head and some other elements broken

I have tried many different solutions for the following problem and couldn't find one that works so far.
I need to get some information from meta tags in several webpages. For this purpose I found lxml very useful, because I also need to find specific content using XPath to parse it. XPath works on the tree; however, for about 20% of the websites (out of around 100 in total) it doesn't work, specifically the head seems to be broken.
tree = html.fromstring(htmlfrompage)  # using html from the lxml package
head_object = tree.head  # access the head object of this webpage
On all of these websites, accessing the head object (which is only a shortcut to an XPath) fails with the same error:
print tree.head
IndexError: list index out of range
Because the following xpath fails:
self.xpath('//head|//x:head', namespaces={'x':XHTML_NAMESPACE})[0]
This XPath result is empty, so accessing the first element fails. I navigated the tree myself, and self.xpath('//head'), self.xpath('//html/head') and even self.xpath('//body') are empty. But if I try to access the meta tags directly anywhere in the document:
head = tree.xpath("//meta")
for meta_tag in head:
    print meta_tag.text # Just printing something
It works, so it means the metas are somehow not attached to the head but are floating somewhere else in the tree; the head doesn't exist anyway. Of course I could try to "patch" this by accessing the head and, if I get an index-out-of-range exception, navigating the metas to find what I'm looking for, but I expected lxml to fix broken HTML (as I read in the documentation).
Has anybody had the same issue and solved it in a better way?
Using requests I can load the tree just fine:
>>> import requests
>>> from lxml import html
>>> r = requests.get('http://www.lanacion.com.ar/1694725-ciccone-manana-debera-declarar-carosso-donatiello-el-inquilino-de-boudou')
>>> tree = html.fromstring(r.content)
>>> tree.head
<Element head at 0x10681b4c8>
Do note that you want to pass a byte string to html.fromstring(); don't use r.text as that'll pass in Unicode instead.
Moreover, if the server did not indicate the encoding in the headers, requests falls back to the HTTP RFC default, which is ISO-8859-1 for text/ responses. For this specific response that is incorrect:
>>> r.headers['Content-Type']
'text/html'
>>> r.encoding
'ISO-8859-1'
>>> r.apparent_encoding # make an educated guess
'utf-8'
This means r.text will use Latin-1 to decode the UTF-8 data, leading to an incorrectly decoded Unicode string, further confusing matters.
The HTML parser, on the other hand, can make use of the <meta> header present to tell it what encoding to use:
>>> tree.find('.//meta').attrib
{'content': 'text/html; charset=utf-8', 'http-equiv': 'Content-Type'}
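Putting that together, a minimal sketch (mine, not part of the original answer) of the two safe options: pass the raw bytes so lxml can honor the page's <meta> charset, or correct r.encoding before using r.text:
import requests
from lxml import html

r = requests.get('http://www.lanacion.com.ar/1694725-ciccone-manana-debera-declarar-carosso-donatiello-el-inquilino-de-boudou')
tree = html.fromstring(r.content)     # bytes in: lxml reads the charset from the <meta> tag
print(tree.head)                      # head is now found

# Alternatively, fix the encoding guess before decoding to text:
r.encoding = r.apparent_encoding
tree = html.fromstring(r.text)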
