Scrapy xpath selector not parsing - python

I'm trying to scrape https://www.grailed.com/ using Scrapy. I have been able to get the elements I want in each listing (price, item title, size). I am currently trying to get the href of each listing's anchor on the home page.
When I try response.xpath('.//div[starts-with(@id, "product")]').extract
it returns
<bound method SelectorList.extract of [<Selector xpath='.//div[starts-with(@id, "product")]'
data=u'<div id="products">\n<div id="loading">\n<'>]>
Based on inspect element, shouldn't it be returning <div class="feed-wrapper">?
I'm just trying to get those links so scrapy knows to go into each listing. Thank you for any help.

When you scrape, always check the source of the page (not in the inspector, but via view-source) - that is the real data you operate on.
That div is added dynamically after the page loads; JavaScript does that job.
When you send a request to the server you receive plain HTML - the JavaScript is never executed - so what you get is the real server response, and that is what you are supposed to work with.
<div class="feed-wrapper"> is not in that response; the raw HTML the server returns is what you must deal with.
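A quick way to confirm this from scrapy shell (a minimal sketch; the exact markup of grailed.com may have changed since this answer was written):

$ scrapy shell https://www.grailed.com/
>>> 'feed-wrapper' in response.text   # False would confirm the div is injected by JavaScript
>>> response.xpath('//div[starts-with(@id, "product")]').extract()   # note the () and @id

If the check comes back False, the listing links have to be pulled from whatever JSON or API call the page's JavaScript makes, or from a JavaScript-rendering tool such as Splash or Selenium.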

Related

No content returns from Scrapy and Instagram

I am trying to get the tag text from an Instagram image, for example: https://www.instagram.com/p/CHPoTitFdEz/, but Scrapy returns no content.
In the Scrapy shell I have written:
response.xpath('//span[@class=""]/a[@class="xil3i"]/text()').get()
or
response.xpath('//span[@class=""]/a[@class="xil3i"]/text()').extract()
which should get me the content of the first tag. However, the Scrapy shell returns no content or an empty array. I use Scrapy for other, simpler websites and everything has always gone fine. I also tried including more divs, but the content is always empty.
The first rule of thumb when using Scrapy is to open "view page source" in the browser: that is mostly the same response we get in Scrapy. (If we are getting blocked, that is a totally different question.)
Viewing the source shows that this website loads all of its content dynamically via AJAX requests; nothing is present in the page source itself. You can search for your desired content in the network tab and then try replicating the request that returns it, or you can use Splash (a sketch follows below; the Splash documentation covers the setup in detail).
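A minimal sketch of the Splash route, assuming a Splash instance is running (e.g. via Docker) and scrapy-splash is enabled in settings.py as its documentation describes; the spider name and class name here are just placeholders:

import scrapy
from scrapy_splash import SplashRequest


class TagsSpider(scrapy.Spider):
    name = 'instagram_tags'

    def start_requests(self):
        # the post URL from the question; swap in whatever post you need
        url = 'https://www.instagram.com/p/CHPoTitFdEz/'
        yield SplashRequest(url, self.parse, args={'wait': 2})

    def parse(self, response):
        # once the JavaScript has run, the tag links should exist in the rendered HTML
        # (the xil3i class comes from the question and may change at any time)
        yield {'tags': response.xpath('//a[@class="xil3i"]/text()').getall()}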
If you look at the response body, you will see that the Instagram page is not fully loaded, but the post's data is available in other tags. For example:
<meta property="instapp:hashtags" content="cyberpunk" />
So you may want to change your selector path and extract the information from the meta tags.
In this case it would be:
response.xpath('//meta[@property="instapp:hashtags"]')
and you will get your content back. If you need other information, first inspect response.body and see whether it is there.
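For example, pulling the hashtag text itself out of those meta tags in scrapy shell might look like this (a sketch; Instagram's markup may have changed since the answer was written):

>>> response.xpath('//meta[@property="instapp:hashtags"]/@content').getall()

The @content step returns the attribute values (e.g. 'cyberpunk' from the tag above) instead of the selector objects.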

XPath returns no result for some elements with scrapy shell

I am using scrapy shell to extract data from the following web page:
https://www.apo-in.de/product/acc-akut-600-brausetabletten.24170.html
Most data works, but there is a table in the lower part whose content (the PZN, for example) I don't seem to be able to extract.
scrapy shell
fetch('https://www.apo-in.de/product/acc-akut-600-brausetabletten.24170.html')
>>> response.xpath('//*[@id="accordionContent5e95408f73b10"]/div/table/tbody/tr[1]/td/text()').extract()
It returns: []
I also downloaded the page to view as scrapy sees it:
scrapy fetch --nolog https://www.apo-in.de/product/acc-akut-600-brausetabletten.24170.html > test.html
Although it looks OK in the HTML and I can grab it in Chrome, it does not work in scrapy shell.
How can I retrieve this data?
The problem you have encountered is that the id 'accordionContent5e95408f73b10' is dynamically generated, so the id in your browser and the one in Scrapy's response are different.
In common cases a good workaround is to write the XPath with a "substring search" (//*[contains(@id, 'accordionContent')]), but in this case there are a lot of such ids.
I would advise writing a more specific XPath:
//div[@id='accordion']/div[contains(@class, 'panel')][1]/div[contains(@id, 'accordionContent')]/div[@class='panel-body']/table/tbody/tr[1]/td
What this XPath does:
Find all "subpanels" with descriptions: //div[@id='accordion']/div[contains(@class, 'panel')];
Take the first "subpanel" (where the PZN is located) and navigate into the table with the data: //div[@id='accordion']/div[contains(@class, 'panel')][1]/div[contains(@id, 'accordionContent')]/div[@class='panel-body']/table;
The last part retrieves the first tr's td.
By the way, the XPath can be simplified to //div[@id='accordion']/div[contains(@class, 'panel')][1]//table/tbody/tr[1]/td, but I've written out the full XPath to make it clearer what we're navigating through.
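Put together in scrapy shell, this might look like the following (a sketch; the table layout on apo-in.de may have changed since the answer was written):

$ scrapy shell https://www.apo-in.de/product/acc-akut-600-brausetabletten.24170.html
>>> pzn = response.xpath(
...     "//div[@id='accordion']/div[contains(@class, 'panel')][1]"
...     "//table/tbody/tr[1]/td/text()").get()
>>> pzn.strip() if pzn else None   # guard against None in case the structure changed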

How to get text inside div

Take a look at this webpage:
https://www.michaelkors.com/large-crossgrain-leather-dome-crossbody-bag/_/R-US_32S9SF5C3L?color=2519
I want to get text under details section. When I look at the div it has class detail and text under it. This is the statement I am using:
details = response.xpath('.//div[@class="detail"]/text()').extract()
However, it is returning nothing.
Looks like the div you're trying to parse does not exist when the page is loaded.
Product data is stored as JSON inside a script tag, and the div is generated from it using JavaScript.
This leaves you with a couple of options:
Parse the JavaScript and extract the data yourself (a sketch of this approach follows below)
Use a browser (e.g. scrapy-splash) to run the JavaScript, and parse the resulting HTML
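A rough sketch of the first option, runnable from scrapy shell or inside a spider callback (wherever a response object is available); the script-tag selection and the key names in the parsed dict are assumptions, so check view-source to see where the product JSON actually lives and what its keys are called:

import json
import re

# find a <script> block that appears to contain the product JSON
script = response.xpath('//script[contains(., "product")]/text()').get()
if script:
    match = re.search(r'\{.*\}', script, re.DOTALL)   # grab the outermost {...} blob
    if match:
        # json.loads will fail if the blob is not strict JSON; adjust as needed
        data = json.loads(match.group(0))
        # the key path below is a placeholder - adapt it to the real structure
        details = data.get('product', {}).get('details')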
The class="detail" element is not found in the page source, which means it is not found in the response loaded by the Scrapy request.
Scrapy deals with static requests: it returns only the elements present in the page source.
If the content is loaded dynamically (what you see in inspect element has been produced by JavaScript or AJAX-type requests), you need to use some other package along with Scrapy to scrape that data.
Examples: Splash, Selenium, etc.
In your case you should handle it as a dynamic request. You can verify this quickly from scrapy shell, as sketched below.
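A quick check before reaching for Splash or Selenium (a minimal sketch; quote the URL so the shell does not mangle the query string):

$ scrapy shell "https://www.michaelkors.com/large-crossgrain-leather-dome-crossbody-bag/_/R-US_32S9SF5C3L?color=2519"
>>> response.xpath('//div[@class="detail"]')   # [] means the div is not in the static HTML
>>> 'detail' in response.text                  # a rough check on the raw response body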

python scrappy football data

I am trying to learn how to use Scrapy with Python; I am not familiar with CSS.
the website i am trying to scrape: https://fantasydata.com/nfl-stats/point-spreads-and-odds?season=2018&seasontype=1&week=17
So when I copy the selector for the date, this is the result:
#stats_grid > div.k-grid-content.k-auto-scrollable > table > tbody > tr:nth-child(1) > td:nth-child(1) > span
When I bring up the scrapy shell by doing: scrapy shell "url"
and type in response.css('selector here')
I get no results!
How do I retrieve the date information?
Thanks for reading this message!
So the issue here is that the data you're trying to scrape is not available when Scrapy receives the page response.
If you have your browser's developer console open when the page loads, check out the XHR request on the network tab to this URL:
https://fantasydata.com/NFLTeamStats/Odds_Read
If you check out its payload, you'll see that it contains exactly the data you are trying to scrape. In other words, it's loaded from the site's app via an HTTP fetch AFTER the initial page has loaded.
So, when you use a web scraper (like Scrapy), you're unable to see that kind of data: you really only get the initial page template, and anything loaded by JavaScript afterwards is unavailable.
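If you want to stay with Scrapy, one option (not part of the original answer, just a sketch) is to replicate that XHR request directly. The form fields below are placeholders: copy the real ones from the request shown in your network tab.

import json

import scrapy


class OddsSpider(scrapy.Spider):
    name = 'nfl_odds'

    def start_requests(self):
        # parameter names are guessed from the page URL (season/seasontype/week);
        # verify them against the actual request payload in the network tab
        yield scrapy.FormRequest(
            'https://fantasydata.com/NFLTeamStats/Odds_Read',
            formdata={'season': '2018', 'seasontype': '1', 'week': '17'},
            callback=self.parse,
        )

    def parse(self, response):
        data = json.loads(response.text)   # the endpoint appears to return JSON
        for row in data.get('Data', []):   # the 'Data' key is an assumption
            yield row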
If you're looking for general NFL and fantasy related stats, there's an app called FFDB that allows you to easily create databases using its engine:
FFDB Github Repository
Disclaimer: I am the author of the app.
As a last note, be aware that the css tag is not relevant for this question; a scraping or web-scraping tag would be more appropriate.
Best of luck!

My first scrapy xpath selector

I am very new at this and have been trying to get my head around my first selector. Can somebody help me? I am trying to extract data from this page:
http://groceries.asda.com/asda-webstore/landing/home.shtml?cmpid=ahc--ghs-d1--asdacom-dsk-_-hp#/shelf/1215337195041/1/so_false
all the info under div class="listing clearfix shelfListing", but I can't seem to figure out how to format response.xpath().
I have managed to launch the scrapy console, but no matter what I type into response.xpath() I can't seem to select the right node. I know it works because when I type
>>> response.xpath('//div[@class="container"]')
I get a response. Yet I don't know how to navigate to the listing clearfix shelfListing div. I am hoping that once I get this bit I can continue working my way through the spider.
PS: I wonder whether it is even possible to scan this site - can the owners block spiders?
The content inside the div with the listings class (and id) is loaded asynchronously via an XHR request. In other words, the HTML code that Scrapy gets doesn't contain it:
$ scrapy shell http://groceries.asda.com/asda-webstore/landing/home.shtml?cmpid=ahc--ghs-d1--asdacom-dsk-_-hp#/shelf/1215337195041/1/so_false
>>> response.xpath('//div[@id="listings"]')
[]
Using browser developer tools, you can see the request going to http://groceries.asda.com/api/items/viewitemlist url with a bunch of GET parameters.
One option would be to simulate that request and parse the resulting JSON; how to do that is really the subject of a separate question, but a rough sketch is shown below.
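A sketch of the JSON route; the query parameter below is an assumption taken from the shelf id in the page URL, so copy the real parameters from the viewitemlist request you see in the browser's network tab:

import json

import scrapy


class ShelfSpider(scrapy.Spider):
    name = 'asda_shelf'

    def start_requests(self):
        # hypothetical parameter; check the viewitemlist request in devtools
        url = 'http://groceries.asda.com/api/items/viewitemlist?shelfid=1215337195041'
        yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        data = json.loads(response.text)
        # the JSON structure is an assumption - inspect it and adjust the keys
        for item in data.get('items', []):
            yield {'name': item.get('name')}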
Here's another possible solution, using the selenium package:
from selenium import webdriver

driver = webdriver.Firefox()
driver.get('http://groceries.asda.com/asda-webstore/landing/home.shtml?cmpid=ahc--ghs-d1--asdacom-dsk-_-hp#/shelf/1215337195041/1/so_false')

# the listings container only exists once the browser has executed the page's JavaScript
div = driver.find_element_by_id('listings')
for item in driver.find_elements_by_xpath('//div[@id="listings"]//a[@title]'):
    print(item.text.strip())
driver.close()
Prints:
Kellogg's Coco Pops
Kelloggs Rice Krispies
Kellogg's Coco Pops Croco Copters
...
