Scraping a Page That Requires JavaScript Interaction - python

I am trying to scrape https://a836-propertyportal.nyc.gov/Default.aspx with Scrapy. I am having difficulty using FormRequest--specifically, I do not know how to tell Scrapy how to fill out the block and lot forms and then get the response of the resulting page. I tried following the FormRequest example in the Scrapy documentation (http://doc.scrapy.org/en/latest/topics/request-response.html#using-formrequest-from-response-to-simulate-a-user-login), but continued to have difficulty with properly clicking the "Search" button.
I would really appreciate any suggestions so that I can extract data from the submitted page. A poster on SO suggested that Scrapy cannot handle JS events well, and to use another library like CasperJS instead.
Update: I would very much appreciate it if someone could point me to a Java/Python/JS library that allows me to submit a form and retrieve the subsequent information.
Updated Code (following Pawel's comment): My code can be found here:
from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.http import FormRequest, Request


class MonshtarSpider(Spider):
    name = "monshtar"
    allowed_domains = ["https://a836-propertyportal.nyc.gov/Default.aspx"]
    start_urls = (
        'https://a836-propertyportal.nyc.gov/Default.aspx/',
    )

    def parse(self, response):
        print "entered the parsing section!!"
        yield Request("https://a836-propertyportal.nyc.gov/ExemptionDetails.aspx",
                      cookies={"borough": "1", "block": "01000", "style": "default", "lot": "0011"},
                      callback=self.aftersubmit)

    def aftersubmit(self, response):
        # get the data....
        print "SUCCESS!!\n\n\n"

Your page is somewhat bizarre and difficult to parse. After submitting a valid POST request the page responds with a 302 HTTP status and a bunch of cookies (your formdata is invalid, by the way; you need to replace the underscores with dollar signs in your parameters).
Content can be viewed after sending a GET to https://a836-propertyportal.nyc.gov/ExemptionDetails.aspx
The most surprising thing is that you can crawl this site using only cookies, without the POST request. The POST is there only to give you cookies; it does not redirect to or respond with an HTML response. You can manipulate those cookies from your spider. You only need to make a first GET to get the session cookie, and then successive GETs with borough, block etc.
Try this in scrapy shell:
pawel@stackoverflow:~/stack/scrapy$ scrapy shell "https://a836-propertyportal.nyc.gov/Default.aspx"
In [1]: from scrapy.http import Request
In [2]: req = Request("https://a836-propertyportal.nyc.gov/ExemptionDetails.aspx", cookies={"borough": "1", "block": "01000", "style": "default", "lot": "0011"})
In [3]: fetch(req)
In [4]: view(response)
Out[4]: True  # opening browser window
The response at this point will contain data for the property with the given block, borough and lot. Now you only need to use this knowledge in your spider: just replace your POST with a GET with cookies, add a callback to what you have in the shell, and it should work fine.
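A minimal sketch of that spider, under the assumption that the cookie-based GET shown in the shell session keeps working (the cookie names and values are the ones from the shell above; the callback only logs, since the actual extraction depends on the page markup):

from scrapy.spider import Spider
from scrapy.http import Request


class PropertySpider(Spider):
    name = "property"
    allowed_domains = ["a836-propertyportal.nyc.gov"]
    start_urls = ("https://a836-propertyportal.nyc.gov/Default.aspx",)

    def parse(self, response):
        # The first GET gives us the session cookie; now request the details
        # page with borough/block/lot passed as cookies.
        yield Request(
            "https://a836-propertyportal.nyc.gov/ExemptionDetails.aspx",
            cookies={"borough": "1", "block": "01000", "style": "default", "lot": "0011"},
            callback=self.parse_details,
        )

    def parse_details(self, response):
        # Placeholder: adjust the selectors to the actual markup of the page.
        self.log("got %d bytes of property data" % len(response.body))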
If this still does not work or is somehow unsuited to your purposes, try extracting the hidden ajax parameter (the value of nullctl00_ScriptManager1_HiddenField) and adding it to formdata (and of course correct your formdata so that it is identical to what the browser sends).

You don't click the search button; you make a POST request to the page with all the data. But checking the code, it sends a lot of data. Below I posted my request...
ctl00_ScriptManager1_HiddenField:;;AjaxControlToolkit, Version=3.0.11119.25904, Culture=neutral, PublicKeyToken=28f01b0e84b6d53e:en-US:f48478dd-9360-4d50-94c1-5c5fa55bd379:865923e8:411fea1c:e7c87f07:91bd373d:1d58b08c:8e72a662:acd642d2:596d588c:77c58d20:14b56adc:269a19ae:bbfda34c:30a78ec5:5430d994
__EVENTTARGET:
__EVENTARGUMENT:
__VIEWSTATE:/wEPDwULLTEwMDA4NDY4ODAPZBYCZg9kFgICBQ9kFgQCAg9kFgQCAQ8WAh4HVmlzaWJsZWhkAgcPFgIfAGgWAgIBDxYCHglpbm5lcmh0bWwFGEFsZXJ0IGZvcjxiciAvPiBCQkwgOiAtLWQCBA9kFgQCAg9kFgQCAQ9kFgRmDw8WBB4IQ3NzQ2xhc3MFF2FjY29yZGlvbkhlYWRlclNlbGVjdGVkHgRfIVNCAgJkZAIBDw8WBB8CBRBhY2NvcmRpb25Db250ZW50HwMCAhYCHgVzdHlsZQUOZGlzcGxheTpibG9jaztkAgIPZBYEZg8PFgQfAgUPYWNjb3JkaW9uSGVhZGVyHwMCAmRkAgEPDxYEHwIFEGFjY29yZGlvbkNvbnRlbnQfAwICFgIfBAUNZGlzcGxheTpub25lOxYCAgEPZBYCZg9kFgZmDw9kFgIfBAUNZGlzcGxheTpub25lO2QCDA8PFgIfAGhkZAINDw8WAh8AaGRkAgMPD2QWBh4FU3R5bGUFN3dpZHRoOjM1MHB4O2JhY2tncm91bmQ6d2hpdGU7ZGlzcGxheTpub25lO29wYWNpdHk6MC45MjseC29ubW91c2VvdmVyBQ93d2hIZWxwLnNob3coKTseCm9ubW91c2VvdXQFD3d3aEhlbHAuaGlkZSgpO2Rky2sFuMlw1iy/E0GN9cB65RXg7Aw=
__EVENTVALIDATION:/wEWGgKWm9a2BgL687aTAwLmha0BAujn2IECAo3DtaEJAtLdz/kGAr3g5K4DAu78ttcEAvOB3+MGAvKB3+MGAvGB3+MGAvCB3+MGAveB3+MGAoHAg44PArT/mOoPAqrvlMAJAtzQstcEAoDswboFAoHswboFAoLswboFAoPswboFAoTswboFAtjqpO8KAujQ7b0GAqvgnb0NAsPa/KsBQz19YIqBRvCWvZh8bk6XKxp+wQo=
grpStyle:blue
ctl00$SampleContent$MyAccordion_AccordionExtender_ClientState:0
ctl00$SampleContent$ctl01$TextBox1:(unable to decode value)
ctl00$SampleContent$ctl01$ddlParclBorough:1
ctl00$SampleContent$ctl01$txtBlock:100
ctl00$SampleContent$ctl01$txtLot:200
ctl00$SampleContent$ctl01$btnSearchBBL:Please Wait...
ctl00$SampleContent$ctl03$TextBox2:(unable to decode value)
ctl00$SampleContent$ctl03$ddlParclBoroughPropAddr:1
ctl00$SampleContent$ctl03$txtHouseNbr:
ctl00$SampleContent$ctl03$txtStreetNm:
ctl00$SampleContent$ctl03$txtAptNbr:
My suggestion is to use a scraping library which supports executing JS, or use something else. I have had a lot of success using Selenium and WebDriver to execute code in the browser, which supports JS.
Update:
Here is an example of how to submit a form using PhantomJS.
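For reference, a minimal Selenium sketch along those lines (this assumes Firefox plus the Selenium 2/3 style API, and the element ids are guesses derived from the ctl00$SampleContent$ctl01$... fields listed above -- double-check them against the live page):

from selenium import webdriver

driver = webdriver.Firefox()
driver.get("https://a836-propertyportal.nyc.gov/Default.aspx")

# Fill in block and lot, then click Search. The ids are assumptions based on
# the form field names above (ASP.NET renders $ as _ in element ids).
driver.find_element_by_id("ctl00_SampleContent_ctl01_txtBlock").send_keys("100")
driver.find_element_by_id("ctl00_SampleContent_ctl01_txtLot").send_keys("200")
driver.find_element_by_id("ctl00_SampleContent_ctl01_btnSearchBBL").click()

html = driver.page_source  # hand this off to your favourite HTML parser
driver.quit()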

Related

I can't use Scrapy on all web pages

I am new to using Scrapy and I need to extract the prices of some products from Walmart Canada. The problem is that it does not extract anything, and it only happens with Walmart Canada; when I use Scrapy on another web page, it works correctly.
import scrapy
from scrapy.item import Item, Field
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.loader import ItemLoader


class WalmartItem(Item):
    barcodes = Field()
    sku = Field()


class WalmartCrawler(CrawlSpider):
    name = 'walmartCrawler'
    start_urls = [
        'https://www.walmart.ca/en/ip/apple-gala/6000195494284']

    def parse(self, response):
        item = ItemLoader(WalmartItem(), response)
        item.add_xpath(
            'barcodes', "//div[@class='css-1dar8at e1cuz6d10']/div[@class='css-w8lmum e1cuz6d11']/div[contains(text(), 'UPC')]/parent::node()/div[2]/text()")
        item.add_xpath(
            'sku', "//*[contains(text(), 'UPC')]/parent::node()/div[2]/text()")
        yield item.load_item()
Your xpath doesn't work; one way to do it is using regex:

import re, ast

sku = re.search(r'"sku":"(\d+)', response.text).groups()[0]
barcodes = ast.literal_eval(re.search(r'"upc":(\[.*?\])', response.text).groups()[0])
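Roughly, that could slot into the spider like the sketch below (the regexes are the ones above; whether the "sku" and "upc" keys keep appearing in the embedded JSON is an assumption to verify, and you will still want the user-agent setting mentioned further down):

import re
import ast

import scrapy


class WalmartSkuSpider(scrapy.Spider):
    name = 'walmart_sku'
    start_urls = ['https://www.walmart.ca/en/ip/apple-gala/6000195494284']

    def parse(self, response):
        # The product data is embedded as JSON in the raw HTML, so pull the
        # fields straight out of response.text rather than the rendered DOM.
        sku = re.search(r'"sku":"(\d+)', response.text).groups()[0]
        barcodes = ast.literal_eval(
            re.search(r'"upc":(\[.*?\])', response.text).groups()[0])
        yield {'sku': sku, 'barcodes': barcodes}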
TL;DR: You cannot assume Scrapy will work to extract data from any web page.
Some websites load information using browser scripting (JavaScript code) or AJAX requests. These processes are executed in the browser after the initial response is received from the server. This means that when you receive the HTML response in Scrapy, you may not receive the information as you see it in the browser.
Instead, to check the response that you will receive in Scrapy, you should look at the Network tab inside the DevTools of your browser (in Google Chrome you can access them with Right Click > Inspect). There, search for the initial request that the browser makes to the server. Once you have found it, you can check the response to that request. This is the response you are going to receive in Scrapy.
Therefore, inside Scrapy you can only work with this HTML, and as you can see, the price is not there. In these cases you must find other alternatives, such as: a) using Selenium WebDriver, b) finding the data of the product inside a script tag in the HTML (which is the way to go in this case; check the first script tag inside the HTML), or c) doing the extraction via an API.
Take a look at this walmart.ca extraction script, which goes for the b) solution for each product in a list of products:
https://github.com/juansimon27/scrapy-walmart/blob/master/product_scraping/spiders/spider.py
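As a rough illustration of approach b), here is a sketch that assumes the product page embeds structured data as JSON-LD (a common pattern, but verify it against the actual page source; the field names are assumptions too):

import json

import scrapy


class WalmartLdSpider(scrapy.Spider):
    name = 'walmart_ld'
    start_urls = ['https://www.walmart.ca/en/ip/apple-gala/6000195494284']

    def parse(self, response):
        # Look for a JSON-LD script block and parse it; unlike the rendered
        # price widgets, this is present in the initial HTML response.
        raw = response.xpath("//script[@type='application/ld+json']/text()").get()
        if raw:
            data = json.loads(raw)
            if isinstance(data, dict):
                yield {'sku': data.get('sku'), 'name': data.get('name')}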
On top of this, in this specific case of walmart.ca, if you do not use the correct user agent in your requests, walmart.ca may respond with an <h2>Your web browser is not accepting cookies.</h2> or something like: Your browser is not able to execute JS.
Configure the following user agent to avoid these problems:
custom_settings = {
    'USER_AGENT': 'Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; http://www.google.com/bot.html) Chrome/W.X.Y.Z Safari/537.36'
}
In your script you can put this custom_settings definition just below your start_urls variable, or instead use a settings.py file with the USER_AGENT config.

scrapy response not able to crawl because of bad characters

In the picture you can see the operator has some bad characters in the name. These fix themselves and show fine in Chrome, but in Scrapy, when I run even response.text in the shell, I get
scrapy.exceptions.NotSupported: Response content isn't text
When I check other jobs where the operator doesn't have this text, I can run the script fine and grab all the data.
I am sure it's due to unicode. I am not sure how to tell Scrapy to ignore it and process the rest as text so I can scrape anything.
Below is just a skeleton of my code:
class PrintSpider(scrapy.Spider):
    name = "printer_data"
    start_urls = [
        'http://192.168.4.107/jobid-15547'
    ]

    def parse(self, response):
        job_dict = {}
        url_split = response.request.url.split('/')
        job_dict['job_id'] = url_split[len(url_split)-1].split('-', 1)[1]
        job_dict['job_name'] = response.xpath("/html/body/fieldset/big[1]/text()").extract_first().split(': ', 1)[1]  # this breaks here.
Update with other things I have tried already:
I have worked with this for a while in the Scrapy shell. response.text gives the exception I put earlier; the same check is also inside response.xpath.
I have looked at the code a little bit but cannot find how response.text works. I feel like I need to fix these characters in the response somehow so that Scrapy will see it as text and can process the HTML, instead of ignoring the entire page so that I cannot access anything.
I would also love a way to save the response to a file without opening it in Chrome and saving, so that I can work with the original document for testing.
It could be, however that is not necessarily the case. Try the following approach to see what your crawler sees:
from scrapy.utils.response import open_in_browser

def parse(self, response):
    open_in_browser(response)
This will open the page in a browser; make sure you are not doing this in a loop, otherwise your browser will get stuck.
Secondly, try to fetch the HTML first and see if this works fine:
response.xpath("/html/body/fieldset/big[1]/text()").extract_first()
modify to:
response.xpath("/html/body/fieldset/big[1]")[0].extract()
If the second approach fixes the issue, then go with bs4 or lxml to convert the HTML to text.
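For example, a small sketch of that last step with BeautifulSoup (assuming bs4 is installed; lxml's text_content() would serve the same purpose):

from bs4 import BeautifulSoup

def extract_job_name(body_bytes):
    # Parse the raw bytes ourselves instead of relying on response.text;
    # bs4 is forgiving about stray or badly encoded characters in the markup.
    soup = BeautifulSoup(body_bytes, "html.parser")
    big = soup.find("fieldset").find("big")
    return big.get_text().split(': ', 1)[1]

# inside the spider: job_dict['job_name'] = extract_job_name(response.body)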
Furthermore, if this is a public link, let us know the link along with the complete log for further understanding of the issue.

Select and submit form with python requests library

I am trying to scrape data from this website. To access the tables, I need to click the "Search" button. I was able to successfully do this using mechanize:
br = mechanize.Browser()
br.open(url + 'Wildnew_Online_Status_New.aspx')
br.select_form(name='aspnetForm')
page = br.submit(id='ctl00_ContentPlaceHolder1_Button1')
"page" gives me the resulting webpage with the table, as needed. However, I'd like to iterate through the links to subsequent pages at the bottom, and this triggers javascript. I've heard mechanize does not support this, so I need a new strategy.
I believe I can get to subsequent pages using a post request from the requests library. However, I am not able to click "search" on the main page to get to the initial table. In other words, I want to replicate the above code using requests. I tried
s = requests.Session()
form_data = {'name': 'aspnetForm', 'id': 'ctl00_ContentPlaceHolder1_Button1'}
r = s.post('http://forestsclearance.nic.in/Wildnew_Online_Status_New.aspx', data=form_data)
Not sure why, but this returns the main page again (without clicking Search). Any help appreciated.
I think you should look into Scrapy.
You forgot some parameters in the POST request:
https://www.pastiebin.com/5bc6562304e3c
Check the POST request with Google dev tools.
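To make the point concrete, here is a hedged sketch of the usual shape of such a request against an ASP.NET page: GET the form page first, collect the hidden __VIEWSTATE/__EVENTVALIDATION style inputs, then POST them back along with the button field (the button name below is an assumption derived from the id used in the mechanize snippet; confirm every field name in dev tools):

import requests
from bs4 import BeautifulSoup

url = 'http://forestsclearance.nic.in/Wildnew_Online_Status_New.aspx'
s = requests.Session()

# Load the page once to pick up the ASP.NET hidden fields (__VIEWSTATE etc.).
soup = BeautifulSoup(s.get(url).text, 'html.parser')
form_data = {
    inp.get('name'): inp.get('value', '')
    for inp in soup.find_all('input', type='hidden')
    if inp.get('name')
}
# Add the button that was "clicked"; the exact name/value should be taken
# from the request shown in the browser dev tools.
form_data['ctl00$ContentPlaceHolder1$Button1'] = 'Search'

r = s.post(url, data=form_data)
print(len(r.text))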

Scrapy - Javascript website

I'm familiar with scraping websites with Scrapy, however I can't seem to scrape this one (JavaScript, perhaps?).
I'm trying to download historical data for commodities for some personal research from this website:
http://www.mcxindia.com/SitePages/BhavCopyDateWiseArchive.aspx
On this website you will have to select the date and then click go. Once the data is loaded, you can click 'View in Excel' to download a CSV file with commodity prices for that day. I'm trying to build a scraper to download these CSV files for a few months. However, this website seems like a hard nut to crack. Any help will be appreciated.
Things I've tried:
1) Looked at the page source to see if data is being loaded but not shown (hidden)
2) Used Firebug to see if there are any AJAX requests
3) Modified POST headers to see if I can get data for different days. The POST headers seem very complicated.
ASP.NET websites are notoriously hard to crawl because they rely on view states and sessions, are extremely strict about requests, and involve loads of other nonsense.
Luckily your case seems to be pretty straightforward. Your Scrapy approach should look something like:
import scrapy
from scrapy import FormRequest
from scrapy.shell import inspect_response


class MxindiaSpider(scrapy.Spider):
    name = "mxindia"
    allowed_domains = ["mcxindia.com"]
    start_urls = ('http://www.mcxindia.com/SitePages/BhavCopyDateWiseArchive.aspx',)

    def parse(self, response):
        yield FormRequest.from_response(
            response,
            formdata={
                'mTbdate': '02/13/2015',  # your date here
                'ScriptManager1': 'MupdPnl|mImgBtnGo',
                '__EVENTARGUMENT': '',
                '__EVENTTARGET': '',
                'mImgBtnGo.x': '12',
                'mImgBtnGo.y': '9',
            },
            callback=self.parse_cal,
        )

    def parse_cal(self, response):
        inspect_response(response, self)  # everything is there!
What we do here is create a FormRequest from the response object we already have. It's smart enough to find the <input> and <form> fields and generate the formdata.
However, some input fields have no defaults, or we need defaults overridden, and those have to be supplied through the formdata argument.
So we provide the formdata argument with updated form values. When you inspect the request you can see all of the form values you need to make a successful request.
So just copy all of them over to your formdata. ASP is really picky about the formdata, so it takes some time to experiment with what is required and what is not.
I'll leave you to figure out how to get to the next page yourself; usually it just adds an additional key to formdata like 'page': '2'.
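Purely as an illustrative sketch of that last step (the pagination field name here is a guess, not something taken from the site; inspect the real pagination request to find the actual field and value):

    def parse_cal(self, response):
        # ... extract the commodity rows for the requested day here ...
        # Then re-submit the form for the next page of results. 'page' is
        # only a placeholder name; replace it with whatever the real
        # request sends.
        yield FormRequest.from_response(
            response,
            formdata={'page': '2'},
            callback=self.parse_cal,
        )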

How to avoid scrapy ignoring hash tag

I am working on Scrapy.
I have a site to scrape that includes a hash fragment, but when I run it, Scrapy downloads the response and ignores the hash fragment.
For example, this is the url with the hash fragment: url="www.example.com/hash-tag.php#user_id-654"
and the response from this request is only www.example.com/hash-tag.php, but I want to scrape the url with the hash fragment.
My code is below:
from scrapy.spider import BaseSpider
from scrapy.http import Request


class ExampleSpider(BaseSpider):
    name = "example"
    domain_name = "www.example.com"

    def start_requests(self):
        return [Request("www.example.com/hash-tag.php#user_id-654")]

    def parse(self, response):
        print response
Result:
<GET www.example.com/hash-tag.php>
How can I do this?
Thanks in advance.
What you are trying to do is not easily possible. To achieve what you want you need a full DOM and JavaScript engine, i.e. a (possibly headless) browser.
If you really need it, have a look at PhantomJS. It is the WebKit engine but completely headless. I'm not sure if scrapy can be easily extended but if you really want to execute JavaScript (which you need in this case), using PhantomJS is probably the way to go.
Well, if you really need that information, you could just split the string before calling Request, and send that information as meta.
Something like:
url = "www.example.com/hash-tag.php#user_id-654"
hash = url.split("#")[1]
request = Request(url, callback=self.parse_something)
request.meta['after_hash'] = hash
yield request
and then in the parse callback get and use it like:
def parse_something(self, response):
    hash = response.meta['after_hash']
That is, if you only need the information after the hash sign.
