Pagination using scrapy - python

I'm trying to crawl this website:
http://www.aido.com/eshop/cl_2-c_189-p_185/stationery/pens.html
I can get all the products in this page, but how do I issue the request for "View More" link at the bottom of the page?
My code till now is:
rules = (
    Rule(SgmlLinkExtractor(restrict_xpaths='//li[@class="normalLeft"]/div/a', unique=True)),
    Rule(SgmlLinkExtractor(restrict_xpaths='//div[@id="topParentChilds"]/div/div[@class="clm2"]/a', unique=True)),
    Rule(SgmlLinkExtractor(restrict_xpaths='//p[@class="proHead"]/a', unique=True)),
    Rule(SgmlLinkExtractor(allow=('http://[^/]+/[^/]+/[^/]+/[^/]+$', ), deny=('/about-us/about-us/contact-us', './music.html', ), unique=True), callback='parse_item'),
)
Any help?

First of all, take a look at this thread on how to scrape content that is loaded dynamically via AJAX:
Can scrapy be used to scrape dynamic content from websites that are using AJAX?
Clicking the "View More" button fires an XHR request:
http://www.aido.com/eshop/faces/tiles/category.jsp?q=&categoryID=189&catalogueID=2&parentCategoryID=185&viewType=grid&bnm=&atmSize=&format=&gender=&ageRange=&actor=&director=&author=&region=&compProductType=&compOperatingSystem=&compScreenSize=&compCpuSpeed=&compRam=&compGraphicProcessor=&compDedicatedGraphicMemory=&mobProductType=&mobOperatingSystem=&mobCameraMegapixels=&mobScreenSize=&mobProcessor=&mobRam=&mobInternalStorage=&elecProductType=&elecFeature=&elecPlaybackFormat=&elecOutput=&elecPlatform=&elecMegaPixels=&elecOpticalZoom=&elecCapacity=&elecDisplaySize=&narrowage=&color=&prc=&k1=&k2=&k3=&k4=&k5=&k6=&k7=&k8=&k9=&k10=&k11=&k12=&startPrize=&endPrize=&newArrival=&entityType=&entityId=&brandId=&brandCmsFlag=&boutiqueID=&nmt=&disc=&rat=&cts=empty&isBoutiqueSoldOut=undefined&sort=12&isAjax=true&hstart=24&targetDIV=searchResultDisplay
which returns the text/html of the next 24 items. Note the hstart=24 parameter: the first time you click "View More" it equals 24, the second time 48, and so on. This is what lets you paginate.
Now you should simulate these requests in your spider. The recommended way to do this is to instantiate Scrapy's Request object with a callback in which you extract the data.
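As a rough sketch of how the successive "View More" URLs could be generated, the helper below steps hstart in multiples of 24. The endpoint and parameter names come from the XHR above; sending only this subset of the (mostly empty) query parameters is an assumption about what the server actually requires, and view_more_url is a made-up name.

```python
from urllib.parse import urlencode

# Endpoint taken from the XHR above.
AJAX_URL = "http://www.aido.com/eshop/faces/tiles/category.jsp"
PAGE_SIZE = 24  # each "View More" click advances hstart by 24


def view_more_url(page, category_id=189, catalogue_id=2, parent_category_id=185):
    """Return the XHR URL for the given 1-based "View More" click (hypothetical helper)."""
    # Assumption: the server accepts this reduced parameter set.
    params = {
        "categoryID": category_id,
        "catalogueID": catalogue_id,
        "parentCategoryID": parent_category_id,
        "viewType": "grid",
        "sort": 12,
        "isAjax": "true",
        "hstart": page * PAGE_SIZE,
        "targetDIV": "searchResultDisplay",
    }
    return AJAX_URL + "?" + urlencode(params)


# URLs for the first three "View More" clicks.
urls = [view_more_url(page) for page in (1, 2, 3)]
```

In a spider, each of these URLs would be handed to scrapy.Request(url, callback=...), with the callback extracting the products from the returned HTML fragment.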
Hope that helps.

Related

trying to download full HTML pages

I am trying to download a few hundred HTML pages in order to parse them and calculate some measures.
I tried it with Linux wget, and with a loop of the following code in Python:
import urllib.request
url = "https://www.camoni.co.il/411788/168022"
html = urllib.request.urlopen(url).read()
But the HTML file I get doesn't contain all the content I see in the browser for the same page. For example, text I see on the screen is not found in the HTML file. Only when I right-click the page in the browser and choose "Save As" do I get the full page.
The problem: I need a large number of pages and cannot do this by hand.
URL example: https://www.camoni.co.il/411788/168022 (the last number changes)
Thank you.
That's because the site is not static. It uses JavaScript (in this case the jQuery library) to fetch additional data from the server and insert it into the page.
So instead of trying to GET the raw HTML, you should inspect the requests in your browser's developer tools. There's a POST request to https://www.camoni.co.il/ajax/tabberChangeTab with the following data:
tab_name=tab_about
memberAlias=ד-ר-דינה-ראלט-PhD
currentURL=/411788/ד-ר-דינה-ראלט-PhD
The response is HTML that is then inserted into the page.
So instead of just downloading the page, either inspect the page and its requests to fetch the data directly, or use a headless browser such as Google Chrome to emulate the "Save As" action and save the result.
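A minimal sketch of building that POST request with the standard library is shown below. The URL and the three field names are the ones listed above; build_tab_request is a made-up helper, and whether the server also expects particular headers or cookies is not known, so treat this as a starting point only.

```python
from urllib.parse import urlencode
from urllib.request import Request

TAB_URL = "https://www.camoni.co.il/ajax/tabberChangeTab"


def build_tab_request(member_alias, member_id, tab_name="tab_about"):
    """Build (but do not send) the POST request the page issues for one tab."""
    form = {
        "tab_name": tab_name,
        "memberAlias": member_alias,
        "currentURL": "/{}/{}".format(member_id, member_alias),
    }
    # urlencode percent-encodes the Hebrew alias as UTF-8.
    body = urlencode(form).encode("utf-8")
    return Request(TAB_URL, data=body, method="POST")


req = build_tab_request("ד-ר-דינה-ראלט-PhD", "411788")
# Passing req to urllib.request.urlopen would actually send it.
```

Note that the alias has to be known for each page; it could be taken from the URL the numeric address redirects to, if the site behaves that way.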

Select and submit form with python requests library

I am trying to scrape data from this website. To access the tables, I need to click the "Search" button. I was able to successfully do this using mechanize:
br = mechanize.Browser()
br.open(url + 'Wildnew_Online_Status_New.aspx')
br.select_form(name='aspnetForm')
page = br.submit(id='ctl00_ContentPlaceHolder1_Button1')
"page" gives me the resulting webpage with the table, as needed. However, I'd like to iterate through the links to subsequent pages at the bottom, and this triggers javascript. I've heard mechanize does not support this, so I need a new strategy.
I believe I can get to subsequent pages using a post request from the requests library. However, I am not able to click "search" on the main page to get to the initial table. In other words, I want to replicate the above code using requests. I tried
s = requests.Session()
form_data = {'name': 'aspnetForm', 'id': 'ctl00_ContentPlaceHolder1_Button1'}
r = s.post('http://forestsclearance.nic.in/Wildnew_Online_Status_New.aspx', data=form_data)
Not sure why, but this returns the main page again (without clicking Search). Any help appreciated.
I think you should look into Scrapy.
You forgot some parameters in the POST request:
https://www.pastiebin.com/5bc6562304e3c
Check the POST request with Google Chrome's dev tools.
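The linked paste isn't reproduced here, but ASP.NET WebForms pages like this one generally expect the hidden fields from the form (such as __VIEWSTATE and __EVENTVALIDATION) to be posted back together with the button. Here is a sketch of collecting them with the standard library. The sample HTML, the helper names, and the button's name attribute (written with dollar signs) are assumptions for illustration, so check the real values in dev tools.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs of every <input type="hidden"> in a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        is_hidden = (attrs.get("type") or "").lower() == "hidden"
        if tag == "input" and is_hidden and attrs.get("name"):
            self.fields[attrs["name"]] = attrs.get("value") or ""


def build_form_data(page_html, button_name, button_value):
    """Merge the page's hidden fields with the submit button being 'clicked'."""
    collector = HiddenInputCollector()
    collector.feed(page_html)
    data = dict(collector.fields)
    data[button_name] = button_value
    return data


# Stand-in HTML (assumption); the real page would come from s.get(...).text
sample = (
    '<form name="aspnetForm">'
    '<input type="hidden" name="__VIEWSTATE" value="abc" />'
    '<input type="hidden" name="__EVENTVALIDATION" value="xyz" />'
    '<input type="submit" name="ctl00$ContentPlaceHolder1$Button1" value="Search" />'
    '</form>'
)
form_data = build_form_data(sample, "ctl00$ContentPlaceHolder1$Button1", "Search")
```

The resulting dict is what would be passed as data= to s.post(...), using the same Session that fetched the page so the cookies match.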

Crawl data from next page doesn't change URL

I'm trying to change pages on this site to crawl information, but the URL doesn't change when I click the next page:
My code so far:
[...]
paging = response.css('span id.next::attr(href)').extract_first()
if paging:
    yield scrapy.Request(paging, callback=self.parse_links)
I don't know how to crawl a site like this. Please help me, thank you.
The browser's Network tab shows the request made for the next page.
You can try this request for the next page: http://vsd.vn/ModuleArticles/ArticlesList/NextPageHDNVTCPH?pCurrentPage=2
It returns the next page's data.
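To walk through all pages, pCurrentPage can simply be incremented. The loop below is only a sketch: stopping when the body comes back empty is an assumption about how the site signals the last page, and fetch is injected (with a fake stand-in here) so the logic can be tried without touching the network.

```python
NEXT_PAGE_URL = (
    "http://vsd.vn/ModuleArticles/ArticlesList/NextPageHDNVTCPH?pCurrentPage={page}"
)


def iter_pages(fetch, start=1, max_pages=50):
    """Yield (page_number, body) until fetch returns an empty body or max_pages is hit."""
    for page in range(start, start + max_pages):
        body = fetch(NEXT_PAGE_URL.format(page=page))
        if not body.strip():  # assumption: an empty body means no more pages
            break
        yield page, body


def fake_fetch(url):
    """Stand-in for a real HTTP call: pretends the site has three pages."""
    page = int(url.rsplit("=", 1)[1])
    return "<li>item</li>" if page <= 3 else ""


pages = [number for number, _ in iter_pages(fake_fetch)]
```

In a Scrapy spider the same idea becomes yielding a Request for page + 1 from the callback whenever the current response contained items.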

Scraping Complex Forms using BeautifulSoup and Requests

Below is a snippet of my Python code along with HTML from a page I'm trying to scrape.
The HTML is a complex form that I'm having trouble scraping. I'm using BeautifulSoup4 and Python Requests; however, when I post to the page, the form doesn't receive the correct inputs. I'm guessing it has something to do with all the hidden inputs above the actual select I'm trying to submit.
If I inspect the form-data being submitted while using Chrome, here's what I see.
Chrome Developer Console View
When using the page through the browser, the only field that has to be selected is the select name="sel_subj", as seen below. However, posting back to the page fails:
new_url = 'https://wl11gp.neu.edu/udcprod8/NEUCLSS.p_class_search'
requests.post(new_url, data={'STU_TERM_IN': 201730,
                             'p_msg_code': 'UNSECURED',
                             'sel_subj': 'ACCT'})
To view a live version of the page I'm trying to scrape visit this link, select "Spring 2017 Semester" and click submit: https://wl11gp.neu.edu/udcprod8/NEUCLSS.p_disp_dyn_sched

Scraping Page That Requires JavaScript Interaction

I am trying to scrape https://a836-propertyportal.nyc.gov/Default.aspx with Scrapy. I am having difficulty using the FormRequest--specifically, I do not know how to tell Scrapy how to fill the block and lot forms out, and then subsequently get the response of the page. I tried following the FormRequest example on the Scrapy website found here (http://doc.scrapy.org/en/latest/topics/request-response.html#using-formrequest-from-response-to-simulate-a-user-login), but continued to have difficulty with properly clicking on the "Search" button.
I would really appreciate it if you could offer any suggestions so that I can extract data from the submitted page. Some poster on SO suggested that Scrapy cannot handle JS events well, and to use another library like CasperJS instead.
Update: I would very much appreciate it if someone could please point me to a Java/Python/JS library that allows me to submit a form, and retrieve the subsequent information
Updated Code (following Pawel's comment): My code can be found here:
from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.http import FormRequest, Request


class MonshtarSpider(Spider):
    name = "monshtar"
    # allowed_domains takes bare domain names, not full URLs
    allowed_domains = ["a836-propertyportal.nyc.gov"]
    start_urls = (
        'https://a836-propertyportal.nyc.gov/Default.aspx/',
    )

    def parse(self, response):
        print("entered the parsing section!!")
        yield Request("https://a836-propertyportal.nyc.gov/ExemptionDetails.aspx",
                      cookies={"borough": "1", "block": "01000", "style": "default", "lot": "0011"},
                      callback=self.aftersubmit)

    def aftersubmit(self, response):
        # get the data....
        print("SUCCESS!!\n\n\n")
Your page is somewhat bizarre and difficult to parse. After a valid POST request is submitted, the page responds with a 302 HTTP status and a bunch of cookies (your formdata is invalid, by the way: you need to replace underscores with dollar signs in your parameter names).
The content can be viewed after sending a GET to https://a836-propertyportal.nyc.gov/ExemptionDetails.aspx
The most surprising thing is that you can crawl this site using only cookies, without the POST request. The POST is there only to give you cookies; it does not redirect to, or respond with, an HTML response. You can set those cookies from your spider. You only need to make a first GET to obtain the session cookie, and then successive GETs with borough, block, etc.
Try this in scrapy shell:
pawel@stackoverflow:~/stack/scrapy$ scrapy shell "https://a836-propertyportal.nyc.gov/Default.aspx"
In [1]: from scrapy.http import Request
In [2]: req = Request("https://a836-propertyportal.nyc.gov/ExemptionDetails.aspx", cookies = {"borough":"1", "block":"01000", "style":"default", "lot":"0011"})
In [3]: fetch(req)
In [4]: view(response)
Out[4]: True # opening browser window
The response at this point will contain data for the property with the given block, borough and lot. Now you only need to use this knowledge in your spider: replace your POST with a GET that carries the cookies, add a callback to the request you built in the shell, and it should work fine.
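Outside Scrapy, the same idea boils down to a plain GET with a Cookie header. The sketch below uses only the standard library; the cookie names and values are the ones from the shell session above, while build_details_request and its session_cookie argument are made up for illustration (the name of the site's session cookie isn't given here).

```python
from urllib.request import Request

DETAILS_URL = "https://a836-propertyportal.nyc.gov/ExemptionDetails.aspx"


def build_details_request(borough, block, lot, session_cookie=None):
    """Build a GET request that selects a property purely through cookies."""
    cookies = {"borough": borough, "block": block, "lot": lot, "style": "default"}
    if session_cookie:
        # Hypothetical: whatever cookie the first GET to Default.aspx set.
        cookies.update(session_cookie)
    header = "; ".join("{}={}".format(name, value) for name, value in cookies.items())
    return Request(DETAILS_URL, headers={"Cookie": header})


req = build_details_request("1", "01000", "0011")
```

Looping over different borough, block and lot values then gives one request per property.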
If this still does not work, or is somehow unsuited to your purposes, try extracting the hidden AJAX parameter (the value of nullctl00_ScriptManager1_HiddenField) and adding it to your formdata (and of course correct your formdata so that it is identical to what the browser sends).
You don't click the search button; you make a POST request to a page with all the data. Looking at the request, though, the page sends a lot of data. Here is what my request contained:
ctl00_ScriptManager1_HiddenField:;;AjaxControlToolkit, Version=3.0.11119.25904, Culture=neutral, PublicKeyToken=28f01b0e84b6d53e:en-US:f48478dd-9360-4d50-94c1-5c5fa55bd379:865923e8:411fea1c:e7c87f07:91bd373d:1d58b08c:8e72a662:acd642d2:596d588c:77c58d20:14b56adc:269a19ae:bbfda34c:30a78ec5:5430d994
__EVENTTARGET:
__EVENTARGUMENT:
__VIEWSTATE:/wEPDwULLTEwMDA4NDY4ODAPZBYCZg9kFgICBQ9kFgQCAg9kFgQCAQ8WAh4HVmlzaWJsZWhkAgcPFgIfAGgWAgIBDxYCHglpbm5lcmh0bWwFGEFsZXJ0IGZvcjxiciAvPiBCQkwgOiAtLWQCBA9kFgQCAg9kFgQCAQ9kFgRmDw8WBB4IQ3NzQ2xhc3MFF2FjY29yZGlvbkhlYWRlclNlbGVjdGVkHgRfIVNCAgJkZAIBDw8WBB8CBRBhY2NvcmRpb25Db250ZW50HwMCAhYCHgVzdHlsZQUOZGlzcGxheTpibG9jaztkAgIPZBYEZg8PFgQfAgUPYWNjb3JkaW9uSGVhZGVyHwMCAmRkAgEPDxYEHwIFEGFjY29yZGlvbkNvbnRlbnQfAwICFgIfBAUNZGlzcGxheTpub25lOxYCAgEPZBYCZg9kFgZmDw9kFgIfBAUNZGlzcGxheTpub25lO2QCDA8PFgIfAGhkZAINDw8WAh8AaGRkAgMPD2QWBh4FU3R5bGUFN3dpZHRoOjM1MHB4O2JhY2tncm91bmQ6d2hpdGU7ZGlzcGxheTpub25lO29wYWNpdHk6MC45MjseC29ubW91c2VvdmVyBQ93d2hIZWxwLnNob3coKTseCm9ubW91c2VvdXQFD3d3aEhlbHAuaGlkZSgpO2Rky2sFuMlw1iy/E0GN9cB65RXg7Aw=
__EVENTVALIDATION:/wEWGgKWm9a2BgL687aTAwLmha0BAujn2IECAo3DtaEJAtLdz/kGAr3g5K4DAu78ttcEAvOB3+MGAvKB3+MGAvGB3+MGAvCB3+MGAveB3+MGAoHAg44PArT/mOoPAqrvlMAJAtzQstcEAoDswboFAoHswboFAoLswboFAoPswboFAoTswboFAtjqpO8KAujQ7b0GAqvgnb0NAsPa/KsBQz19YIqBRvCWvZh8bk6XKxp+wQo=
grpStyle:blue
ctl00$SampleContent$MyAccordion_AccordionExtender_ClientState:0
ctl00$SampleContent$ctl01$TextBox1:(unable to decode value)
ctl00$SampleContent$ctl01$ddlParclBorough:1
ctl00$SampleContent$ctl01$txtBlock:100
ctl00$SampleContent$ctl01$txtLot:200
ctl00$SampleContent$ctl01$btnSearchBBL:Please Wait...
ctl00$SampleContent$ctl03$TextBox2:(unable to decode value)
ctl00$SampleContent$ctl03$ddlParclBoroughPropAddr:1
ctl00$SampleContent$ctl03$txtHouseNbr:
ctl00$SampleContent$ctl03$txtStreetNm:
ctl00$SampleContent$ctl03$txtAptNbr:
My suggestion is to use a scraping library that supports executing JavaScript, or to use something else entirely. I have had a lot of success using Selenium and WebDriver to run code in a real browser, which supports JS.
Update:
Here is an example: How to submit a form using PhantomJS.
