I'm trying to page through this site to crawl information, but the URL doesn't change when I click the next page:
My code until now:
[...]
# note: 'span id.next' is not valid CSS; assuming the next-page link sits inside an element with id="next"
paging = response.css('#next a::attr(href)').extract_first()
if paging:
    yield scrapy.Request(response.urljoin(paging), callback=self.parse_links)
I don't know how to crawl a site like this. Please help me, thank you.
If you look at the next-page request in the network tab, you can try this request for the next page: http://vsd.vn/ModuleArticles/ArticlesList/NextPageHDNVTCPH?pCurrentPage=2
It returns the next page's data.
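A minimal sketch of a spider built around that endpoint, assuming you can simply bump pCurrentPage until an empty response comes back; the link selector and the parse_links body are placeholders:

import scrapy

class VsdSpider(scrapy.Spider):
    name = "vsd"
    page_url = "http://vsd.vn/ModuleArticles/ArticlesList/NextPageHDNVTCPH?pCurrentPage={n}"
    start_urls = [page_url.format(n=1)]

    def parse(self, response):
        links = response.css("a::attr(href)").extract()  # selector is an assumption
        if not links:
            return  # an empty page means we have run out of results
        for href in links:
            yield response.follow(href, callback=self.parse_links)
        # the URL itself carries the page number, so bump it for the next request
        page = int(response.url.split("pCurrentPage=")[-1])
        yield scrapy.Request(self.page_url.format(n=page + 1), callback=self.parse)

    def parse_links(self, response):
        # extract the article details here
        pass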
I have a daily task at work to download some files from an internal company website. The site requires a login. The main URL is something like:
https://abcd.com
But when I open that in the browser, it redirects to something like:
https://abcdGW/ln-eng.aspx?lang=eng&lnid=e69d5d-xxx-xxx-1111cef&regl=en-US
My task is normally to open this site, log in, click some links back and forth, and download some files. This takes me 10 minutes every day, but I want to automate it using Python. With my basic knowledge I have written the code below:
import requests
from bs4 import BeautifulSoup

url = "https://abcd.com"
# follow the redirect to find the real login page
redirectURL = requests.get(url).url

acc_pwd = {'datasouce': 'Data1', 'user': 'xxxx', 'password': 'xxxx'}

# a Session keeps cookies between requests
s = requests.Session()
response = s.get(redirectURL)
soup = BeautifulSoup(response.text, 'html.parser')  # handy for inspecting the login form

# post the credentials
r = s.post(redirectURL, data=acc_pwd)
print("RData %s" % r.text)
This shows that I am able to log in successfully. The next step is where I am stuck. On the page after login there are some links on the left side, one of which I need to click. When I inspect them in Chrome, I see:
href="javascript:__doPostBack('myAppControl$menu_itm_proj11','')"><div class="menu-cell">
<img class="menu-image" src="images/LiteMenu/projects.png" style="border-width:0px;"><span class="menu-text">Projects</span> </div></a>
This is probably a JavaScript link. I need to click it, then on the new page click another link, then another to download a file, and then go back to the main page and do this all over again to download different files.
I would be grateful to anyone who can help or suggest.
Thanks to chris, I was able to complete this.
First, using the requests library, I got the redirect URL:
redirectURL = requests.get(url).url
After that I used Scrapy and Selenium to click the links and download the files.
With Selenium driving the browser, it was quite simple; a rough sketch of that part is below.
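A hedged sketch of the Selenium part, assuming a Chrome driver. The login field names, the download link text, and the output file name are assumptions; only the "Projects" menu text comes from the inspected snippet above.

from selenium import webdriver
from selenium.webdriver.common.by import By
import requests

driver = webdriver.Chrome()
driver.get(redirectURL)  # the redirect URL obtained with requests above

# log in (input names are assumptions; inspect the real form)
driver.find_element(By.NAME, "user").send_keys("xxxx")
pwd = driver.find_element(By.NAME, "password")
pwd.send_keys("xxxx")
pwd.submit()  # submits the login form

# __doPostBack links are ordinary anchors to Selenium: clicking runs the JS
driver.find_element(By.PARTIAL_LINK_TEXT, "Projects").click()

# ...navigate further, then grab the file URL and fetch it with requests,
# reusing the browser's session cookies
cookies = {c["name"]: c["value"] for c in driver.get_cookies()}
file_url = driver.find_element(By.PARTIAL_LINK_TEXT, "Download").get_attribute("href")
with open("report.dat", "wb") as f:
    f.write(requests.get(file_url, cookies=cookies).content)

driver.quit()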
I am trying to scrape data from this website. To access the tables, I need to click the "Search" button. I was able to successfully do this using mechanize:
import mechanize

br = mechanize.Browser()
br.open(url + 'Wildnew_Online_Status_New.aspx')
br.select_form(name='aspnetForm')
# submit the form via the Search button
page = br.submit(id='ctl00_ContentPlaceHolder1_Button1')
"page" gives me the resulting webpage with the table, as needed. However, I'd like to iterate through the links to subsequent pages at the bottom, and this triggers javascript. I've heard mechanize does not support this, so I need a new strategy.
I believe I can get to the subsequent pages using a POST request from the requests library. However, I am not able to "click" Search on the main page to get to the initial table. In other words, I want to replicate the above code using requests. I tried:
s = requests.Session()
form_data = {'name': 'aspnetForm', 'id': 'ctl00_ContentPlaceHolder1_Button1'}
r = s.post('http://forestsclearance.nic.in/Wildnew_Online_Status_New.aspx', data=form_data)
Not sure why, but this returns the main page again (without clicking Search). Any help appreciated.
I think you should look into Scrapy.
You forgot some parameters in the POST request:
https://www.pastiebin.com/5bc6562304e3c
Check the POST request with the Chrome dev tools.
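As a rough sketch of what that means in practice (assuming the standard ASP.NET postback mechanics): fetch the page once, copy every hidden input into the form data, and include the Search button's name. The button's name and value here are assumptions based on the button id used above.

import requests
from bs4 import BeautifulSoup

url = 'http://forestsclearance.nic.in/Wildnew_Online_Status_New.aspx'
s = requests.Session()

# GET the page once to collect the hidden ASP.NET state fields
soup = BeautifulSoup(s.get(url).text, 'html.parser')
form_data = {
    tag['name']: tag.get('value', '')
    for tag in soup.select('input[type=hidden]')
    if tag.has_attr('name')
}
# "clicking" the button means sending its name/value pair along with the state
form_data['ctl00$ContentPlaceHolder1$Button1'] = 'Search'

r = s.post(url, data=form_data)
print(r.status_code, len(r.text))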
I'm scraping all the product details on http://www.ulta.com/makeup-eyes-eyebrows?N=26yi. My rules are copied below. I only get data from the first page and it doesn't proceed to the next page.
rules = (
    Rule(
        LinkExtractor(restrict_xpaths='//*[@id="canada"]/div[4]/div[2]/div[3]/div[3]/div[2]/ul/li[3]/a'),
        callback='parse',
        follow=True,
    ),
)
Can anyone help me with this?
Use CrawlSpider; it will automatically crawl to the other pages. With a plain Spider, you need to pass the other links manually.
class Scrapy1Spider(CrawlSpider):
instead of
class Scrapy1Spider(scrapy.Spider):
See: Scrapy crawl with next page
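For reference, a minimal sketch of how the CrawlSpider version might look. Note that the rule's callback should not be named parse, because CrawlSpider uses parse internally; the product selector in parse_item is an assumption.

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class Scrapy1Spider(CrawlSpider):
    name = "ulta"
    allowed_domains = ["ulta.com"]
    start_urls = ["http://www.ulta.com/makeup-eyes-eyebrows?N=26yi"]

    rules = (
        # next-page XPath taken from the question
        Rule(
            LinkExtractor(restrict_xpaths='//*[@id="canada"]/div[4]/div[2]/div[3]/div[3]/div[2]/ul/li[3]/a'),
            callback='parse_item',  # not 'parse': CrawlSpider needs that name itself
            follow=True,
        ),
    )

    def parse_item(self, response):
        for product in response.css('div.product'):  # selector is an assumption
            yield {'name': product.css('a::attr(title)').get()}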
I am trying to scrape https://a836-propertyportal.nyc.gov/Default.aspx with Scrapy. I am having difficulty with the FormRequest: specifically, I do not know how to tell Scrapy how to fill out the block and lot forms and then get the response from the submitted page. I tried following the FormRequest example on the Scrapy website (http://doc.scrapy.org/en/latest/topics/request-response.html#using-formrequest-from-response-to-simulate-a-user-login), but continued to have difficulty properly "clicking" the Search button.
I would really appreciate any suggestions on how to extract data from the submitted page. A poster on SO suggested that Scrapy cannot handle JS events well and recommended another library like CasperJS instead.
Update: I would very much appreciate it if someone could point me to a Java/Python/JS library that allows me to submit a form and retrieve the subsequent information.
Updated code (following Pawel's comment):
from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.http import FormRequest, Request

class MonshtarSpider(Spider):
    name = "monshtar"
    allowed_domains = ["a836-propertyportal.nyc.gov"]
    start_urls = (
        'https://a836-propertyportal.nyc.gov/Default.aspx/',
    )

    def parse(self, response):
        print "entered the parsing section!!"
        yield Request("https://a836-propertyportal.nyc.gov/ExemptionDetails.aspx",
                      cookies={"borough": "1", "block": "01000", "style": "default", "lot": "0011"},
                      callback=self.aftersubmit)

    def aftersubmit(self, response):
        # get the data....
        print "SUCCESS!!\n\n\n"
Your page is somewhat bizarre and difficult to parse. After submitting a valid POST request, the page responds with a 302 HTTP status and a bunch of cookies (your formdata is invalid, by the way; you need to replace the underscores with dollar signs in your parameter names).
Content can be viewed after sending a GET to https://a836-propertyportal.nyc.gov/ExemptionDetails.aspx
The most surprising thing is that you can crawl this site using only cookies, without the POST request. The POST is there only to give you cookies; it does not redirect to or respond with an HTML response. You can manipulate those cookies from your spider. You only need to make a first GET to get the session cookie, and then successive GETs with the borough, block, etc.
Try this in scrapy shell:
pawel@stackoverflow:~/stack/scrapy$ scrapy shell "https://a836-propertyportal.nyc.gov/Default.aspx"
In [1]: from scrapy.http import Request
In [2]: req = Request("https://a836-propertyportal.nyc.gov/ExemptionDetails.aspx", cookies = {"borough":"1", "block":"01000", "style":"default", "lot":"0011"})
In [3]: fetch(req)
In [4]: view(response)
Out[5]: True # opening browser window
The response at this point will contain the data for the property with the given block, borough and lot. Now you only need to use this knowledge in your spider: replace your POST with a GET carrying the cookies, attach a callback to what you have in the shell, and it should work fine.
If this still does not work or is somehow unsuited to your purposes, try extracting the hidden ajax parameter (the value of nullctl00_ScriptManager1_HiddenField) and adding it to your formdata (and of course correct your formdata so that it is identical to what the browser sends).
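As a sketch, the parse/aftersubmit pair from the question could issue those successive cookie-carrying GETs for a whole list of lots. The lot values and the table selector are assumptions, and Request is the scrapy.http.Request already imported in the spider above.

def parse(self, response):
    # the first GET to Default.aspx (the start URL) only establishes the
    # session; each lookup is then a GET with the values sent as cookies
    lots = [("1", "01000", "0011"), ("1", "01000", "0012")]  # assumed examples
    for borough, block, lot in lots:
        yield Request(
            "https://a836-propertyportal.nyc.gov/ExemptionDetails.aspx",
            cookies={"borough": borough, "block": block,
                     "style": "default", "lot": lot},
            dont_filter=True,  # same URL every time, so skip the dupe filter
            callback=self.aftersubmit,
        )

def aftersubmit(self, response):
    # pull whatever fields you need; the selector is an assumption
    yield {"rows": response.css("table tr td::text").extract()}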
You don't click the search button; you make a POST request to the page with all the data. But checking the code, it sends a lot of data. Below I posted my request...
ctl00_ScriptManager1_HiddenField:;;AjaxControlToolkit, Version=3.0.11119.25904, Culture=neutral, PublicKeyToken=28f01b0e84b6d53e:en-US:f48478dd-9360-4d50-94c1-5c5fa55bd379:865923e8:411fea1c:e7c87f07:91bd373d:1d58b08c:8e72a662:acd642d2:596d588c:77c58d20:14b56adc:269a19ae:bbfda34c:30a78ec5:5430d994
__EVENTTARGET:
__EVENTARGUMENT:
__VIEWSTATE:/wEPDwULLTEwMDA4NDY4ODAPZBYCZg9kFgICBQ9kFgQCAg9kFgQCAQ8WAh4HVmlzaWJsZWhkAgcPFgIfAGgWAgIBDxYCHglpbm5lcmh0bWwFGEFsZXJ0IGZvcjxiciAvPiBCQkwgOiAtLWQCBA9kFgQCAg9kFgQCAQ9kFgRmDw8WBB4IQ3NzQ2xhc3MFF2FjY29yZGlvbkhlYWRlclNlbGVjdGVkHgRfIVNCAgJkZAIBDw8WBB8CBRBhY2NvcmRpb25Db250ZW50HwMCAhYCHgVzdHlsZQUOZGlzcGxheTpibG9jaztkAgIPZBYEZg8PFgQfAgUPYWNjb3JkaW9uSGVhZGVyHwMCAmRkAgEPDxYEHwIFEGFjY29yZGlvbkNvbnRlbnQfAwICFgIfBAUNZGlzcGxheTpub25lOxYCAgEPZBYCZg9kFgZmDw9kFgIfBAUNZGlzcGxheTpub25lO2QCDA8PFgIfAGhkZAINDw8WAh8AaGRkAgMPD2QWBh4FU3R5bGUFN3dpZHRoOjM1MHB4O2JhY2tncm91bmQ6d2hpdGU7ZGlzcGxheTpub25lO29wYWNpdHk6MC45MjseC29ubW91c2VvdmVyBQ93d2hIZWxwLnNob3coKTseCm9ubW91c2VvdXQFD3d3aEhlbHAuaGlkZSgpO2Rky2sFuMlw1iy/E0GN9cB65RXg7Aw=
__EVENTVALIDATION:/wEWGgKWm9a2BgL687aTAwLmha0BAujn2IECAo3DtaEJAtLdz/kGAr3g5K4DAu78ttcEAvOB3+MGAvKB3+MGAvGB3+MGAvCB3+MGAveB3+MGAoHAg44PArT/mOoPAqrvlMAJAtzQstcEAoDswboFAoHswboFAoLswboFAoPswboFAoTswboFAtjqpO8KAujQ7b0GAqvgnb0NAsPa/KsBQz19YIqBRvCWvZh8bk6XKxp+wQo=
grpStyle:blue
ctl00$SampleContent$MyAccordion_AccordionExtender_ClientState:0
ctl00$SampleContent$ctl01$TextBox1:(unable to decode value)
ctl00$SampleContent$ctl01$ddlParclBorough:1
ctl00$SampleContent$ctl01$txtBlock:100
ctl00$SampleContent$ctl01$txtLot:200
ctl00$SampleContent$ctl01$btnSearchBBL:Please Wait...
ctl00$SampleContent$ctl03$TextBox2:(unable to decode value)
ctl00$SampleContent$ctl03$ddlParclBoroughPropAddr:1
ctl00$SampleContent$ctl03$txtHouseNbr:
ctl00$SampleContent$ctl03$txtStreetNm:
ctl00$SampleContent$ctl03$txtAptNbr:
My suggestion is to use a scraping library which supports executing JS, or use something else entirely. I have had a lot of success using Selenium WebDriver to execute code in the browser, which supports JS.
Update:
Here is an example: How to submit a form using PhantomJS.
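Roughly the same idea with Selenium (PhantomJS itself is deprecated these days). The element ids below are assumptions derived from the posted field names by swapping $ for _, which is the usual ASP.NET convention.

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()  # any WebDriver works
driver.get("https://a836-propertyportal.nyc.gov/Default.aspx")

# fill the block/lot search form (ids are assumptions, see above)
driver.find_element(By.ID, "ctl00_SampleContent_ctl01_txtBlock").send_keys("100")
driver.find_element(By.ID, "ctl00_SampleContent_ctl01_txtLot").send_keys("200")
driver.find_element(By.ID, "ctl00_SampleContent_ctl01_btnSearchBBL").click()

# the JavaScript postback runs in the browser; the result page is now here
html = driver.page_source
driver.quit()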
I'm trying to crawl this website:
http://www.aido.com/eshop/cl_2-c_189-p_185/stationery/pens.html
I can get all the products on this page, but how do I issue the request for the "View More" link at the bottom of the page?
My code so far is:
rules = (
    Rule(SgmlLinkExtractor(restrict_xpaths='//li[@class="normalLeft"]/div/a', unique=True)),
    Rule(SgmlLinkExtractor(restrict_xpaths='//div[@id="topParentChilds"]/div/div[@class="clm2"]/a', unique=True)),
    Rule(SgmlLinkExtractor(restrict_xpaths='//p[@class="proHead"]/a', unique=True)),
    Rule(SgmlLinkExtractor(allow=('http://[^/]+/[^/]+/[^/]+/[^/]+$', ),
                           deny=('/about-us/about-us/contact-us', './music.html', ),
                           unique=True),
         callback='parse_item'),
)
Any help?
First of all, you should take a look at this thread on how to deal with scraping dynamically loaded (AJAX) content:
Can scrapy be used to scrape dynamic content from websites that are using AJAX?
So, clicking on "View More" button fires up an XHR request:
http://www.aido.com/eshop/faces/tiles/category.jsp?q=&categoryID=189&catalogueID=2&parentCategoryID=185&viewType=grid&bnm=&atmSize=&format=&gender=&ageRange=&actor=&director=&author=&region=&compProductType=&compOperatingSystem=&compScreenSize=&compCpuSpeed=&compRam=&compGraphicProcessor=&compDedicatedGraphicMemory=&mobProductType=&mobOperatingSystem=&mobCameraMegapixels=&mobScreenSize=&mobProcessor=&mobRam=&mobInternalStorage=&elecProductType=&elecFeature=&elecPlaybackFormat=&elecOutput=&elecPlatform=&elecMegaPixels=&elecOpticalZoom=&elecCapacity=&elecDisplaySize=&narrowage=&color=&prc=&k1=&k2=&k3=&k4=&k5=&k6=&k7=&k8=&k9=&k10=&k11=&k12=&startPrize=&endPrize=&newArrival=&entityType=&entityId=&brandId=&brandCmsFlag=&boutiqueID=&nmt=&disc=&rat=&cts=empty&isBoutiqueSoldOut=undefined&sort=12&isAjax=true&hstart=24&targetDIV=searchResultDisplay
which returns the text/html of the next 24 items. Note the hstart=24 parameter: the first time you click "View More" it's equal to 24, the second time 48, etc. This should be your lifesaver.
Now you should simulate these requests in your spider. The recommended way to do this is to instantiate Scrapy's Request object, providing a callback where you'll extract the data.
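For example, a hedged sketch along those lines; the product selector is an assumption, and the query string has been trimmed to the parameters that look relevant (the server may still want the full list captured in dev tools):

import scrapy

class AidoPensSpider(scrapy.Spider):
    name = "aido_pens"
    start_urls = ["http://www.aido.com/eshop/cl_2-c_189-p_185/stationery/pens.html"]

    # the XHR endpoint behind the "View More" button (shortened)
    ajax_url = ("http://www.aido.com/eshop/faces/tiles/category.jsp"
                "?categoryID=189&catalogueID=2&parentCategoryID=185"
                "&viewType=grid&sort=12&isAjax=true"
                "&targetDIV=searchResultDisplay&hstart={start}")

    def parse(self, response):
        products = response.css("div.product")  # selector is an assumption
        for p in products:
            yield {"name": p.css("a::attr(title)").get()}
        if products:
            # each "View More" batch is 24 items: hstart=24, 48, 72, ...
            if "hstart=" in response.url:
                start = int(response.url.rsplit("hstart=", 1)[-1]) + 24
            else:
                start = 24
            yield scrapy.Request(self.ajax_url.format(start=start),
                                 callback=self.parse)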
Hope that helps.