I am very new at this and have been trying to get my head around my first selector. Can somebody help me? I am trying to extract data from this page:
http://groceries.asda.com/asda-webstore/landing/home.shtml?cmpid=ahc--ghs-d1--asdacom-dsk-_-hp#/shelf/1215337195041/1/so_false
all the info under the div with class="listing clearfix shelfListing", but I can't seem to figure out how to write the right response.xpath() expression.
I have managed to launch the scrapy console but no matter what I type in response.xpath() I can't seem to select the right node. I know it works because when I type
>>> response.xpath('//div[@class="container"]')
I get a response. Yet I don't know how to navigate to the listing clearfix shelfListing div. I am hoping that once I get this bit I can continue working my way through the spider.
PS: I wonder whether it is even possible to scrape this site - can the owners block spiders?
The content inside the div with the listings class (and id) is loaded asynchronously via an XHR request. In other words, the HTML that Scrapy receives doesn't contain it:
$ scrapy shell http://groceries.asda.com/asda-webstore/landing/home.shtml?cmpid=ahc--ghs-d1--asdacom-dsk-_-hp#/shelf/1215337195041/1/so_false
>>> response.xpath('//div[@id="listings"]')
[]
Using the browser developer tools, you can see a request going to the http://groceries.asda.com/api/items/viewitemlist URL with a bunch of GET parameters.
One option would be to simulate that request and parse the resulting JSON; how to do that in detail is really a separate question.
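As a rough sketch of that approach with the requests library: the endpoint is the one visible in the developer tools above, but the query parameters here are hypothetical placeholders you would need to copy from the real request your browser sends:
import requests
# hypothetical parameters - copy the actual ones from the "viewitemlist"
# request shown in your browser's Network tab
params = {
    'items': '1215337195041',  # placeholder: shelf id taken from the page URL
}
response = requests.get('http://groceries.asda.com/api/items/viewitemlist', params=params)
data = response.json()  # parse the returned JSON payload
print(data)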
Here's one possible solution using the selenium package:
from selenium import webdriver
driver = webdriver.Firefox()
driver.get('http://groceries.asda.com/asda-webstore/landing/home.shtml?cmpid=ahc--ghs-d1--asdacom-dsk-_-hp#/shelf/1215337195041/1/so_false')
driver.implicitly_wait(10)  # give the asynchronously loaded listings time to appear
for item in driver.find_elements_by_xpath('//div[@id="listings"]//a[@title]'):
    print(item.text.strip())
driver.close()
Prints:
Kellogg's Coco Pops
Kelloggs Rice Krispies
Kellogg's Coco Pops Croco Copters
...
I am using scrapy shell to extract data from the following web page:
https://www.apo-in.de/product/acc-akut-600-brausetabletten.24170.html
Most data works, but there is a table in the lower part whose content (the PZN, e.g.) I can't seem to extract.
scrapy shell
fetch('https://www.apo-in.de/product/acc-akut-600-brausetabletten.24170.html')
>>> response.xpath('//*[@id="accordionContent5e95408f73b10"]/div/table/tbody/tr[1]/td/text()').extract()
It returns: []
I also downloaded the page to view as scrapy sees it:
scrapy fetch --nolog https://www.apo-in.de/product/acc-akut-600-brausetabletten.24170.html > test.html
Although the table looks OK in that HTML, and although I can grab it in Chrome, it does not work in scrapy shell.
How can I retrieve this data?
The problem you have encountered is that the id 'accordionContent5e95408f73b10' is dynamically generated, so the id in your browser and the one in Scrapy's response are different.
In common cases a good workaround is to write an xpath with a "substring search" (//*[contains(@id, 'accordionContent')]), but in this case there are a lot of such ids.
So I would advise writing a more specific xpath:
//div[@id='accordion']/div[contains(@class, 'panel')][1]/div[contains(@id, 'accordionContent')]/div[@class='panel-body']/table/tbody/tr[1]/td
What this xpath does:
Find all "subpanels" with descriptions: //div[@id='accordion']/div[contains(@class, 'panel')];
Take the first "subpanel" (where the PZN is located) and navigate into the table with the data: //div[@id='accordion']/div[contains(@class, 'panel')][1]/div[contains(@id, 'accordionContent')]/div[@class='panel-body']/table;
The last part retrieves the first tr's td.
By the way, the xpath can be simplified to //div[@id='accordion']/div[contains(@class, 'panel')][1]//table/tbody/tr[1]/td, but I've written the full xpath for a more accurate understanding of what we're navigating.
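For example, you can verify it in scrapy shell with a session like the one in the question (a sketch; if the raw HTML Scrapy receives has no tbody element, drop /tbody from the path, since browsers insert it automatically):
scrapy shell
fetch('https://www.apo-in.de/product/acc-akut-600-brausetabletten.24170.html')
>>> response.xpath("//div[@id='accordion']/div[contains(@class, 'panel')][1]//table/tbody/tr[1]/td/text()").extract()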
I've tried to get the world population from this website: https://www.worldometers.info/world-population/
but I can only get the html code, not the data of the actual numbers.
I already tried to find children of the object I tried to get data from. I also tried to list the whole object, but nothing seemed to work.
import requests
from bs4 import BeautifulSoup
# getting html from the website as text
r = requests.get('https://www.worldometers.info/world-population/')
soup = BeautifulSoup(r.text, 'html.parser')
# here it only finds the one object that is listed below
current_population = soup.find('div', {'class': 'maincounter-number'}).find_all('span', recursive=False)
print(current_population)
This is the object the information is stored in:
<span class="rts-counter" rel="current_population">retrieving data... </span>
and in 'inspect-mode' you can see this:
<span class="rts-counter" rel="current_population"><span class="rts-nr-sign"></span><span class="rts-nr-int rts-nr-10e9">7</span><span class="rts-nr-thsep">,</span><span class="rts-nr-int rts-nr-10e6">703</span><span class="rts-nr-thsep">,</span><span class="rts-nr-int rts-nr-10e3">227</span><span class="rts-nr-thsep">,</span><span class="rts-nr-int rts-nr-10e0">630</span></span>
I always get only the first version ("retrieving data..."), but I want the populated second version from 'inspect-mode'.
You are going to need a method that lets JavaScript run, such as selenium, as this number is set up by a counter generated in this script: https://www.realtimestatistics.net/rts/RTSp.js
from selenium import webdriver
d = webdriver.Chrome()
d.get('https://www.worldometers.info/world-population/')
# the populated counter element carries rel="current_population"
print(d.find_element_by_css_selector('[rel="current_population"]').text)
You could try writing your own version of that JavaScript counter, but I wouldn't recommend it.
I didn't need an explicit wait condition for the selenium script, but one could be added, as sketched below.
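For instance, a minimal sketch of such a wait using Selenium's WebDriverWait and expected_conditions helpers:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
d = webdriver.Chrome()
d.get('https://www.worldometers.info/world-population/')
# wait up to 10 seconds for the counter element to appear in the DOM
element = WebDriverWait(d, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, '[rel="current_population"]')))
print(element.text)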
The website you are scraping is a JavaScript web app. The element content you see in inspect mode is the result of running some JavaScript code after the page downloads that populates that element. Prior to the JavaScript running, the element only contains the text "retrieving data...", which is what you see in your Python code. Neither the Python requests library nor BeautifulSoup run JavaScript in downloaded HTML -- they only download and parse the HTML, and that is why your code only sees the initial text.
You have two options:
Inspect the JavaScript code or website calls and figure out what HTTP URL the page is calling to retrieve the value it puts into that element. Have your Python code fetch that URL instead and parse the value from the response for that URL.
Use a full browser engine. This StackOverflow answer provides a solution: Web-scraping JavaScript page with Python
The numbers are rendered into the DOM by JavaScript, so Beautiful Soup alone will not work as you want it to.
You will have to use something that lets JavaScript run (e.g. a browser), so you could build your own browser using Qt4 or the like. Sentdex had a good tutorial on it here:
https://www.youtube.com/watch?v=FSH77vnOGqU
Otherwise, you could use Selenium:
from selenium import webdriver
import time
driver = webdriver.Firefox()
driver.get('https://www.worldometers.info/world-population/')
time.sleep(5)  # crude wait so the JavaScript counter can populate the page
html = driver.page_source
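From there you can hand the rendered source back to Beautiful Soup and select the counter span shown in the markup above; a minimal sketch:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
# the span is populated now because the JavaScript has already run
current_population = soup.find('span', {'rel': 'current_population'})
print(current_population.get_text())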
I am attempting to retrieve player statistics from MLB.com for the 2016 season. I am using Beautiful Soup in Python, and I need to extract the information in the table seen here:
http://mlb.mlb.com/stats/sortable.jsp#elem=%5Bobject+Object%5D&tab_level=child&click_text=Sortable+Player+hitting&game_type='R'&season=2016&season_type=ANY&league_code='MLB'&sectionType=sp&statType=hitting&page=1&ts=1493672037085&playerType=ALL&sportCode='mlb'&split=&team_id=&active_sw=&position=&page_type=SortablePlayer&sortOrder='desc'&sortColumn=ab&results=&perPage=442&timeframe=&last_x_days=&extended=0
Here is what I have attempted:
import requests
from bs4 import BeautifulSoup
# url is the sortable stats page quoted above
url = "http://mlb.mlb.com/stats/sortable.jsp#elem=%5Bobject+Object%5D&tab_level=child&click_text=Sortable+Player+hitting&game_type='R'&season=2016&season_type=ANY&league_code='MLB'&sectionType=sp&statType=hitting&page=1&ts=1493672037085&playerType=ALL&sportCode='mlb'&split=&team_id=&active_sw=&position=&page_type=SortablePlayer&sortOrder='desc'&sortColumn=ab&results=&perPage=442&timeframe=&last_x_days=&extended=0"
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
gdata = soup.find_all('div', {'id': 'datagrid'})
print(gdata)
This should return all of the subtags within the datagrid div, but it does not. It results in the following:
[<div id="datagrid"></div>]
Can anyone explain why this is not producing the contents of the table? Furthermore, what can I do to access the contents of the table?
Thanks
If you look at the source for the webpage, it looks like the datagrid div is actually empty and the stats are inserted dynamically as JSON fetched from a separate URL. Maybe you can use that URL instead. To figure this out, I looked at the page source to see that the div had no children, and then used the Chrome developer tools Network tab to find the request where it pulled the data:
Open the web page
Open the chrome developer tools, Command+Option+I (Mac) or Control+Shift+I (Windows, Linux).
Refresh the web page with the tools open so it processes the network requests, then wait for the page to load
(optional) Type xml in the Network tab's filter bar to narrow the results to requests that are likely to have data
Click on each request and look at the preview of the response. At this point I just manually examined the responses to see which had your data. I got lucky and found yours on the first try, since it has stats in the name.
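Once you've identified the request, the general pattern is to fetch it directly and parse the JSON. Note that the URL below is a hypothetical placeholder; copy the exact request URL from the Network tab entry you found:
import requests
# hypothetical placeholder - use the exact URL copied from the Network tab
stats_url = 'http://mlb.mlb.com/path/to/stats/endpoint'
data = requests.get(stats_url).json()  # parse the JSON response
print(data)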
I'm trying to scrape https://www.grailed.com/ using scrapy. I have been able to get the elements I want in each listing (price, item title, size). I am currently trying to get the href links for each listing on the home page.
When I try response.xpath('.//div[starts-with(@id, "product")]').extract
returns
<bound method SelectorList.extract of [<Selector xpath='.//div[starts-with(@id, "product")]'
data=u'<div id="products">\n<div id="loading">\n<'>]>
Based on the inspect element view, shouldn't it be returning <div class="feed-wrapper">?
I'm just trying to get those links so scrapy knows to go into each listing. Thank you for any help.
When you do scraping, always check the page source (not in the inspector but via view-source) - that is the real data you operate on.
That div is added dynamically after the page loads; JS does that job.
When you send a request to the server you receive pure HTML - the JS is not executed, so you see the real server response that you are supposed to work with. (Incidentally, you also need to call .extract() with parentheses; without them Python returns the bound method itself, which is the <bound method SelectorList.extract ...> output you posted.)
<div class="feed-wrapper">
That is the real server response to you; it is what you must deal with.
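One quick way to see this for yourself: scrapy shell has a built-in view() helper that opens the downloaded response in your browser, showing exactly what Scrapy received before any JavaScript runs:
scrapy shell https://www.grailed.com/
>>> view(response)  # opens the raw server response in your browser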
I am trying to write a Python script that will periodically check a website to see if an item is available. I have used requests.get, lxml.html, and xpath successfully in the past to automate website searches. In the case of this particular URL (http://www.anthropologie.com/anthro/product/4120200892474.jsp?cm_vc=SEARCH_RESULTS#/) and others on the same website, my code was not working.
import requests
from lxml import html
page = requests.get("http://www.anthropologie.com/anthro/product/4120200892474.jsp?cm_vc=SEARCH_RESULTS#/")
tree = html.fromstring(page.text)
html_element = tree.xpath(".//div[@class='product-soldout ng-scope']")
At this point, html_element should be a list of elements (I think in this case only 1), but instead it is empty. I think this is because the website is not loading all at once, so when requests.get() goes out and grabs it, it only grabs the first part. So my questions are:
1: Am I correct in my assessment of the problem?
and
2: If so, is there a way to make requests.get() wait before returning the html, or perhaps another route entirely to get the whole page.
Thanks
Edit: Thanks to both responses. I used Selenium and got my script working.
You are not correct in your assessment of the problem.
You can check the results and see that there's a </html> right near the end. That means you've got the whole page.
And requests' .text always gives you the whole page; if you want to stream it a bit at a time, you have to do so explicitly.
Your problem is that the element you're looking for doesn't actually exist in that HTML; it's built dynamically by client-side JavaScript. You can see that by actually reading the HTML that's returned. So, unless you run that JavaScript, you don't have the information.
There are a number of general solutions to that. For example:
Use selenium or similar to drive an actual browser to download the page.
Manually work out what the JavaScript code does and do equivalent work in Python.
Run a headless JavaScript interpreter against a DOM that you've built up.
The page uses JavaScript to load the content, and that content is not there when requests fetches the HTML. So you are getting all the HTML, just not what is generated by the JavaScript. You can use selenium combined with PhantomJS for headless browsing to get the rendered HTML:
from selenium import webdriver
browser = webdriver.PhantomJS()
browser.get("http://www.anthropologie.eu/anthro/index.jsp#/")
html = browser.page_source  # now includes the JavaScript-generated markup
print(html)
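From there you can feed the rendered source back into lxml and reuse the xpath from the question; a minimal sketch (note this answer loads the .eu index page, so the sold-out element from the .com product page may not be present there):
import lxml.html
# parse the rendered source (the html string from above) and rerun the question's xpath
tree = lxml.html.fromstring(html)
html_element = tree.xpath(".//div[@class='product-soldout ng-scope']")
print(html_element)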