scrapy response not able to crawl because of bad characters

scrapy response not able to crawl because of bad characters - python

In the picture you can see operator has some bad characters in the name. These fix themselves and show in chrome but on scrapy when I run even response.text in the shell I get
scrapy.exceptions.NotSupported: Response content isn't text
When I check other jobs where the operator doesnt have this text I can run the script fine and grab all the data.
I am sure its due to unicodes. I and not sure how to tell scrapy to ignore them and run the rest as text so I can scrape anything.
below is just a skeleton of my code
class PrintSpider(scrapy.Spider):
name = "printer_data"
start_urls = [
'http://192.168.4.107/jobid-15547'
]
def parse(self, response):
job_dict = {}
url_split = response.request.url.split('/')
job_dict['job_id'] = url_split[len(url_split)-1].split('-',1)[1]
job_dict['job_name'] = response.xpath("/html/body/fieldset/big[1]/text()").extract_first().split(': ', 1)[1] # this breaks here.
Update with other things I have tried already
I have worked with this for a while in the scrapy shell. response.text gives the exception I put earlier. this check is also in the response.xpath.
I have looked at the code a little bit but cannot find how response.text is working. I feel like I need to fix these characters in the response somehow so that scrapy will see it as text and can process the html instead of ignoring the entire page so I cannot access anything.
I also would love a way to save the response to a file without opening in chrome and saving so that I can work with the original document for testing.

It could be however not necessary. Try following approach to see what your crawler see:
from scrapy.utils.response import open_in_browser
def parse(self, response):
open_in_browser (response)
This will open the page in browser - make sure you are not doing this in a loop otherwise your browser will get stuck.
Secondly, try to fetch HTML first and see if this works fine.
response.xpath("/html/body/fieldset/big[1]/text()").extract_first()
modify to:
response.xpath("/html/body/fieldset/big[1]")[0].extract()
If second approach fixes the issue, then go with bs4 or lxml to convert html to text.
Furthermore, if this is a public link, let us know the link along with complete log for further understanding of the issue.

Related

Requests Module not fetching full website in Python

Sorry for a Noob question.... I have written a code which searches google for an image stored locally on my computer. I accomplished this using the requests module. I want to scrape the result page for information about the image but request module never fetches the entire page. It only fetches a part of it and thus I am not able to scrape the website for results
import requests
import webbrowser
from bs4 import BeautifulSoup
filePath = "C:\\Users\\mjjha\\Documents\\Checkrow\\monaLisa.jpg"
searchUrl = 'http://www.google.com/searchbyimage/upload'
multipart = {'encoded_image': (filePath, open(filePath, 'rb')), 'image_content': ''}
response = requests.post(searchUrl, files=multipart, allow_redirects=False)
fetchUrl = response.headers['Location']
r=requests.get(fetchUrl)
webbrowser.open(fetchUrl)
soup=BeautifulSoup(r.content,'html.parser')
head=soup.find_all('a')
for i in head:
print(i['href'])
The web page looks like this:
but when I scrape it for anchor tag links using beautiful soup I get the following result:
http://www.google.co.in/imghp?hl=en&tab=wi
http://maps.google.co.in/maps?hl=en&tab=wl
https://play.google.com/?hl=en&tab=w8
http://www.youtube.com/?gl=IN&tab=w1
https://news.google.com/?tab=wn
https://mail.google.com/mail/?tab=wm
https://drive.google.com/?tab=wo
https://www.google.co.in/intl/en/about/products?tab=wh
http://www.google.co.in/history/optout?hl=en
/preferences?hl=en
https://accounts.google.com/ServiceLogin?hl=en&passive=true&continue=http://www.google.hr/webhp%3Ftbs%3Dsbi:AMhZZisFTqPOZmEYpGB89rRLg4R4TfmF3WVQ_1gHEFiENQ8wbYqQq7-KsJUE5KuuxvINd0hFo10EMmP4RzWvOBvRxmsHZ7vm6etW-I36-QfCwmwir1NawORzsWZJffCnSwTpdts39mmQ1EfkcH0R8eGsiJ4_1Xw9DA_1C9mqLpChwRYdgOT-oFNcpt2O25Zhmo6ouG2XA5ZelCbKAChT4DJfGz0TXphXB_1dGEluDV6R_15n42URKCX5Q1zIqR6_16CR0rgXBphz95FMrETLqtPURRbAaWzauYisnSk6jF_1T5GbuoJKHtqThXevhogUSW9ERfZr5vbbWI6DA9c&ec=GAZAAQ
/search?ie=UTF-8&q=Anne+Frank&oi=ddle&ct=236393864&hl=en-GB&si=AC1wQDDagiMg03ncxeOQZbwVe-CJxRCchC-jr2hCPTxjc9wbgOxFdg4PkIAWeA8WhyCLGGzkibRoi5B84SONt2NaUNMtZff0HVDXAtNUKeMfxbgImvSIzyY%3D&sa=X&ved=0ahUKEwjT1qipk8f4AhUCTmwGHZy2BLcQPQgD
/advanced_search?hl=en-IN&authuser=0
http://www.google.hr/setprefs?sig=0_jZL2NlEWh9JZhydIGUbq3LjMUs0%3D&hl=hr&source=homepage&sa=X&ved=0ahUKEwjT1qipk8f4AhUCTmwGHZy2BLcQ2ZgBCAU
http://www.google.hr/setprefs?sig=0_jZL2NlEWh9JZhydIGUbq3LjMUs0%3D&hl=hi&source=homepage&sa=X&ved=0ahUKEwjT1qipk8f4AhUCTmwGHZy2BLcQ2ZgBCAY
http://www.google.hr/setprefs?sig=0_jZL2NlEWh9JZhydIGUbq3LjMUs0%3D&hl=bn&source=homepage&sa=X&ved=0ahUKEwjT1qipk8f4AhUCTmwGHZy2BLcQ2ZgBCAc
http://www.google.hr/setprefs?sig=0_jZL2NlEWh9JZhydIGUbq3LjMUs0%3D&hl=te&source=homepage&sa=X&ved=0ahUKEwjT1qipk8f4AhUCTmwGHZy2BLcQ2ZgBCAg
http://www.google.hr/setprefs?sig=0_jZL2NlEWh9JZhydIGUbq3LjMUs0%3D&hl=mr&source=homepage&sa=X&ved=0ahUKEwjT1qipk8f4AhUCTmwGHZy2BLcQ2ZgBCAk
http://www.google.hr/setprefs?sig=0_jZL2NlEWh9JZhydIGUbq3LjMUs0%3D&hl=ta&source=homepage&sa=X&ved=0ahUKEwjT1qipk8f4AhUCTmwGHZy2BLcQ2ZgBCAo
http://www.google.hr/setprefs?sig=0_jZL2NlEWh9JZhydIGUbq3LjMUs0%3D&hl=gu&source=homepage&sa=X&ved=0ahUKEwjT1qipk8f4AhUCTmwGHZy2BLcQ2ZgBCAs
http://www.google.hr/setprefs?sig=0_jZL2NlEWh9JZhydIGUbq3LjMUs0%3D&hl=kn&source=homepage&sa=X&ved=0ahUKEwjT1qipk8f4AhUCTmwGHZy2BLcQ2ZgBCAw
http://www.google.hr/setprefs?sig=0_jZL2NlEWh9JZhydIGUbq3LjMUs0%3D&hl=ml&source=homepage&sa=X&ved=0ahUKEwjT1qipk8f4AhUCTmwGHZy2BLcQ2ZgBCA0
http://www.google.hr/setprefs?sig=0_jZL2NlEWh9JZhydIGUbq3LjMUs0%3D&hl=pa&source=homepage&sa=X&ved=0ahUKEwjT1qipk8f4AhUCTmwGHZy2BLcQ2ZgBCA4
/intl/en/ads/
http://www.google.co.in/services/
/intl/en/about.html
http://www.google.hr/setprefdomain?prefdom=US&sig=K_fWA_BOpuaXy87gYOc9cg4jE6KwU%3D
/intl/en/policies/privacy/
/intl/en/policies/terms/
PS C:\Users\mjjha\Documents\Checkrow> python -u "c:\Users\mjjha\Documents\Checkrow\1.py"
http://www.google.co.in/imghp?hl=en&tab=wi
http://maps.google.co.in/maps?hl=en&tab=wl
https://play.google.com/?hl=en&tab=w8
http://www.youtube.com/?gl=IN&tab=w1
https://news.google.com/?tab=wn
https://mail.google.com/mail/?tab=wm
https://drive.google.com/?tab=wo
https://www.google.co.in/intl/en/about/products?tab=wh
http://www.google.co.in/history/optout?hl=en
/preferences?hl=en
https://accounts.google.com/ServiceLogin?hl=en&passive=true&continue=http://www.google.hr/webhp%3Ftbs%3Dsbi:AMhZZiuVpzNQ_10ZbUunarcxtcBADLP2GPHTlIiPAtNpsOezRg48a1S4ofT8Df9C4NRT_1PzMk4baDStlOQGRp2okPHCALA5TvLodRqGj_1Q9_19KWykusySDjQNkbi67Ob6Kx7LZ0ybQ59c9mvyda27CBq8_19XutgXXgl4hGCLdXX9M3Od0WI9BckSHxv_1zajCMhKj1XaLKl9p7T9S0hfQbyZs4zQNcudXEk_1y3Zle6anU1rmSpEdpeCXC6r_1HnTnTLAtYWQlVFVF6QuCT8W5djGXGXwTjQH7NgkXnOi6q7v4F_1DqTVytnSAcBX6rc1eJFlXwIR2dzR73cs983mzb686VgOqUUNS1IG8w&ec=GAZAAQ
/search?ie=UTF-8&q=Anne+Frank&oi=ddle&ct=236393864&hl=en-GB&si=AC1wQDDagiMg03ncxeOQZbwVe-CJxRCchC-jr2hCPTxjc9wbgOxFdg4PkIAWeA8WhyCLGGzkibRoi5B84SONt2NaUNMtZff0HVDXAtNUKeMfxbgImvSIzyY%3D&sa=X&ved=0ahUKEwibiJivk8f4AhVaTmwGHenOD1wQPQgD
/advanced_search?hl=en-IN&authuser=0
http://www.google.hr/setprefs?sig=0_cvjNCWi4vNXxh56c1c4ZqxAqvrQ%3D&hl=hr&source=homepage&sa=X&ved=0ahUKEwibiJivk8f4AhVaTmwGHenOD1wQ2ZgBCAU
http://www.google.hr/setprefs?sig=0_cvjNCWi4vNXxh56c1c4ZqxAqvrQ%3D&hl=hi&source=homepage&sa=X&ved=0ahUKEwibiJivk8f4AhVaTmwGHenOD1wQ2ZgBCAY
http://www.google.hr/setprefs?sig=0_cvjNCWi4vNXxh56c1c4ZqxAqvrQ%3D&hl=bn&source=homepage&sa=X&ved=0ahUKEwibiJivk8f4AhVaTmwGHenOD1wQ2ZgBCAc
http://www.google.hr/setprefs?sig=0_cvjNCWi4vNXxh56c1c4ZqxAqvrQ%3D&hl=te&source=homepage&sa=X&ved=0ahUKEwibiJivk8f4AhVaTmwGHenOD1wQ2ZgBCAg
http://www.google.hr/setprefs?sig=0_cvjNCWi4vNXxh56c1c4ZqxAqvrQ%3D&hl=mr&source=homepage&sa=X&ved=0ahUKEwibiJivk8f4AhVaTmwGHenOD1wQ2ZgBCAk
http://www.google.hr/setprefs?sig=0_cvjNCWi4vNXxh56c1c4ZqxAqvrQ%3D&hl=ta&source=homepage&sa=X&ved=0ahUKEwibiJivk8f4AhVaTmwGHenOD1wQ2ZgBCAo
http://www.google.hr/setprefs?sig=0_cvjNCWi4vNXxh56c1c4ZqxAqvrQ%3D&hl=gu&source=homepage&sa=X&ved=0ahUKEwibiJivk8f4AhVaTmwGHenOD1wQ2ZgBCAs
http://www.google.hr/setprefs?sig=0_cvjNCWi4vNXxh56c1c4ZqxAqvrQ%3D&hl=kn&source=homepage&sa=X&ved=0ahUKEwibiJivk8f4AhVaTmwGHenOD1wQ2ZgBCAw
http://www.google.hr/setprefs?sig=0_cvjNCWi4vNXxh56c1c4ZqxAqvrQ%3D&hl=ml&source=homepage&sa=X&ved=0ahUKEwibiJivk8f4AhVaTmwGHenOD1wQ2ZgBCA0
http://www.google.hr/setprefs?sig=0_cvjNCWi4vNXxh56c1c4ZqxAqvrQ%3D&hl=pa&source=homepage&sa=X&ved=0ahUKEwibiJivk8f4AhVaTmwGHenOD1wQ2ZgBCA4
/intl/en/ads/
http://www.google.co.in/services/
/intl/en/about.html
http://www.google.hr/setprefdomain?prefdom=US&sig=K_d19wKMnK5qQH_fmlL2YBuhhR_BE%3D
/intl/en/policies/privacy/
/intl/en/policies/terms/
PS C:\Users\mjjha\Documents\Checkrow> python -u "c:\Users\mjjha\Documents\Checkrow\1.py"
http://www.google.co.in/imghp?hl=en&tab=wi
http://maps.google.co.in/maps?hl=en&tab=wl
https://play.google.com/?hl=en&tab=w8
http://www.youtube.com/?gl=IN&tab=w1
https://news.google.com/?tab=wn
https://mail.google.com/mail/?tab=wm
https://drive.google.com/?tab=wo
https://www.google.co.in/intl/en/about/products?tab=wh
http://www.google.co.in/history/optout?hl=en
/preferences?hl=en
https://accounts.google.com/ServiceLogin?hl=en&passive=true&continue=http://www.google.hr/webhp%3Ftbs%3Dsbi:AMhZZiu1U_1mLzDh7oSZSFpNdcot7N84lXExmiJp6LMIJ1NO2PHle7mcBr72CTgX45DTbkkF8yfhvT0GATXTIzgd--ayauOaI-gLvTa-DAeOAodk35Kz6mpzCzl8ly6YYdUbn2S5cCe35BP37ysxT-tSFbvtLPwovJuiNPmunzpk_1j0a88zkXOmb1tn3vfgXnb6IhaucZJIMZztSDOIljOiaSTIzhdQ1aSusETDAu3EMNnoWRaFWqzcUGHzIWuABI9gJkelzjDaV-aK4ilxQJhwGJnzuKNHDbJ4GSX33an2jIfssmwfoWZLFej_1V0Zijr2fuFqULhoAg2lku41nHNxHY1nI0gNU4M2Q&ec=GAZAAQ
/search?ie=UTF-8&q=Anne+Frank&oi=ddle&ct=236393864&hl=en-GB&si=AC1wQDDagiMg03ncxeOQZbwVe-CJxRCchC-jr2hCPTxjc9wbgOxFdg4PkIAWeA8WhyCLGGzkibRoi5B84SONt2NaUNMtZff0HVDXAtNUKeMfxbgImvSIzyY%3D&sa=X&ved=0ahUKEwj64PWzk8f4AhXqS2wGHdm9C0MQPQgD
/advanced_search?hl=en-IN&authuser=0
http://www.google.hr/setprefs?sig=0_S9cBw3JrMA0drw1wRsLp0wK1RFM%3D&hl=hr&source=homepage&sa=X&ved=0ahUKEwj64PWzk8f4AhXqS2wGHdm9C0MQ2ZgBCAU
http://www.google.hr/setprefs?sig=0_S9cBw3JrMA0drw1wRsLp0wK1RFM%3D&hl=hi&source=homepage&sa=X&ved=0ahUKEwj64PWzk8f4AhXqS2wGHdm9C0MQ2ZgBCAY
http://www.google.hr/setprefs?sig=0_S9cBw3JrMA0drw1wRsLp0wK1RFM%3D&hl=bn&source=homepage&sa=X&ved=0ahUKEwj64PWzk8f4AhXqS2wGHdm9C0MQ2ZgBCAc
http://www.google.hr/setprefs?sig=0_S9cBw3JrMA0drw1wRsLp0wK1RFM%3D&hl=te&source=homepage&sa=X&ved=0ahUKEwj64PWzk8f4AhXqS2wGHdm9C0MQ2ZgBCAg
http://www.google.hr/setprefs?sig=0_S9cBw3JrMA0drw1wRsLp0wK1RFM%3D&hl=mr&source=homepage&sa=X&ved=0ahUKEwj64PWzk8f4AhXqS2wGHdm9C0MQ2ZgBCAk
http://www.google.hr/setprefs?sig=0_S9cBw3JrMA0drw1wRsLp0wK1RFM%3D&hl=ta&source=homepage&sa=X&ved=0ahUKEwj64PWzk8f4AhXqS2wGHdm9C0MQ2ZgBCAo
http://www.google.hr/setprefs?sig=0_S9cBw3JrMA0drw1wRsLp0wK1RFM%3D&hl=gu&source=homepage&sa=X&ved=0ahUKEwj64PWzk8f4AhXqS2wGHdm9C0MQ2ZgBCAs
http://www.google.hr/setprefs?sig=0_S9cBw3JrMA0drw1wRsLp0wK1RFM%3D&hl=kn&source=homepage&sa=X&ved=0ahUKEwj64PWzk8f4AhXqS2wGHdm9C0MQ2ZgBCAw
http://www.google.hr/setprefs?sig=0_S9cBw3JrMA0drw1wRsLp0wK1RFM%3D&hl=ml&source=homepage&sa=X&ved=0ahUKEwj64PWzk8f4AhXqS2wGHdm9C0MQ2ZgBCA0
http://www.google.hr/setprefs?sig=0_S9cBw3JrMA0drw1wRsLp0wK1RFM%3D&hl=pa&source=homepage&sa=X&ved=0ahUKEwj64PWzk8f4AhXqS2wGHdm9C0MQ2ZgBCA4
/intl/en/ads/
http://www.google.co.in/services/
/intl/en/about.html
http://www.google.hr/setprefdomain?prefdom=US&sig=K_5Sxk31MIG7AiUTjwI71yoWFyO_E%3D
/intl/en/policies/privacy/
/intl/en/policies/terms/
The content fetched by requests module doesn't contain the full web page I don't know why. I want to scrape information in image ,anchor and h3 tags from the page using beautiful soup but its just not working out.

The main problem is Python Requests module doesn't render JavaScript. As a result, you are not getting the webpage you are supposed to get.
You are using a webbrowser module to view your URL where JavaScript is enabled, so you are getting the page as expected. But next, when you use the requests module to get the page, javascript stays disabled, and google doesn't let you render the page but instead redirects you to another page(Google Homepage). And there, you get different HTML resulting in no search results(you did in the first place).
IN 1 is the URL you are trying to hit, and 2 is the URL you are redirected to.
Look at the difference is google.com/webhp?tbs=sbi:AMhZZisX...
VS google.com/search?tbs=sbi:AMhZZisX...
The HTML of that page results in is this -
Always use the source HTML given by the requests module, which shows you the actual result.
As you can see, this is not the search result page.
So to reach your goal, try using Selenium.

Getting Selenium page source before JavaScript execution

I have a case where the raw page source code from the HTTP request contains some JSON data I need to capture - let's say <code id="data">{'some':'json'}</code>. But the JavaScript executes, processes it, and removes the data from the DOM so I can't see it in webdriver.page_source.
Any ideas how I can capture this? Or at least somehow disable/pause JavaScript, .get() the page, extract what I need from source_code and then re-enable/un-pause JavaScript?

I feel like this is a bit dirty but it works. I needed to use selenium for the first part. I then handed off to requests:
session = requests.Session()
for cookie in self.webdriver.get_cookies():
session.cookies.set(cookie['name'], cookie['value'])
response = session.get(self.webdriver.current_url).text
and simply wrote the contents to a temp file and then opened that with selenium:
response_file = utils.save_to_temp_file(response, extension='.html')
self.webdriver.get(f'file://{response_file}')`

Scrapy - Xpath works in shell but not in code

I'm trying to crawl a website (I got their authorization), and my code returns what I want in scrapy shell, but I get nothing in my spider.
I also checked all the previous question similar to this one without any success, e.g., the website doesn't use javascript in the home page to load the elements I need.
import scrapy
class MySpider(scrapy.Spider):
name = 'MySpider'
start_urls = [ #WRONG URL, SHOULD BE https://shop.app4health.it/ PROBLEM SOLVED!
'https://www.app4health.it/',
]
def parse(self, response):
self.logger.info('A response from %s just arrived!', response.url)
print ('PRE RISULTATI')
results = response.selector.xpath('//*[#id="nav"]/ol/li[*]/a/#href').extract()
# results = response.css('li a>href').extract()
# This works on scrapy shell, not in code
#risultati = response.xpath('//*[#id="nav"]/ol/li[1]/a').extract()
print (risultati)
#for pagineitems in risultati:
# next_page = pagineitems
print ('NEXT PAGE')
#Ignores the request cause already done. Insert dont filter
yield scrapy.Request(url=risultati, callback=self.prodotti,dont_filter = True)
def prodotti(self, response):
self.logger.info('A REEEESPONSEEEEEE from %s just arrived!', response.url)
return 1
The website i'm trying to crawl is https://shop.app4health.it/
The xpath command that i use is this one :
response.selector.xpath('//*[#id="nav"]/ol/li[*]/a/#href').extract()
I know there are some problems with the prodotti function ecc..., but that's not the point. I would like to understand why the xpath selector works with scrapy shell ( i get exactly the links that i need ), but when i run it in my spider, i always get a null list.
If it can help, when i use CSS selectors in my spider, it works fine and it finds the elements, but i would like to use xpath ( i need it in the future development of my application ).
Thanks for the help :)
EDIT:
I tried to print the body of the first response ( from start_urls ) and it's correct, i get the page i want. When i use selectors in my code ( even the one that have been suggested ) they all work fine in shell, but i get nothing in my code!
EDIT 2
I have become more experienced with Scrapy and web crawling, and i realised that sometimes, the HTML page that you get in your browser might be different from the one you get with the Scrapy request! In my experience some website would respond with a different HTML compared to the one you see in your browser! That's why sometimes if you use the "correct" xpath/css query taken from the browser, it might return nothing if used in your Scrapy code.
Always check if the body of your response is what you were expecting!
SOLVED:
Path is correct. I wrote the wrong start_urls!

Alternatively to Desperado's answer you can use css selectors which are much simpler but more than enough for your use case:
$ scrapy shell "https://shop.app4health.it/"
In [1]: response.css('.level0 .level-top::attr(href)').extract()
Out[1]:
['https://shop.app4health.it/sonno',
'https://shop.app4health.it/monitoraggio-e-diagnostica',
'https://shop.app4health.it/terapia',
'https://shop.app4health.it/integratori-alimentari',
'https://shop.app4health.it/fitness',
'https://shop.app4health.it/benessere',
'https://shop.app4health.it/ausili',
'https://shop.app4health.it/prodotti-in-offerta',
'https://shop.app4health.it/kit-regalo']
scrapy shell command is perfect for debugging issues like this.

//nav[#id="mmenu"]//ul/li[contains(#class,"level0")]/a[contains(#class,"level-top")]/#href
use this xpath, also consider 'view-source' of page before creating xpath

Scraping Page That Requires JavaScript Interaction

I am trying to scrape https://a836-propertyportal.nyc.gov/Default.aspx with Scrapy. I am having difficulty using the FormRequest--specifically, I do not know how to tell Scrapy how to fill the block and lot forms out, and then subsequently get the response of the page. I tried following the FormRequest example on the Scrapy website found here (http://doc.scrapy.org/en/latest/topics/request-response.html#using-formrequest-from-response-to-simulate-a-user-login), but continued to have difficulty with properly clicking on the "Search" button.
I would really appreciate it if you could offer any suggestions so that I can extract data from the submitted page. Some poster on SO suggested that Scrapy cannot handle JS events well, and to use another library like CasperJS instead.
Update: I would very much appreciate it if someone could please point me to a Java/Python/JS library that allows me to submit a form, and retrieve the subsequent information
Updated Code (following Pawel's comment): My code can be found here:
from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.http import FormRequest, Request
class MonshtarSpider(Spider):
name = "monshtar"
allowed_domains = ["https://a836-propertyportal.nyc.gov/Default.aspx"]
start_urls = (
'https://a836-propertyportal.nyc.gov/Default.aspx/',
)
def parse(self, response):
print "entered the parsing section!!"
yield Request("https://a836-propertyportal.nyc.gov/ExemptionDetails.aspx",
cookies = {"borough":"1", "block":"01000", "style":"default", "lot":"0011"}, callback = self.aftersubmit)
def aftersubmit(self, response):
#get the data....
print "SUCCESS!!\n\n\n"

Your page is somewhat bizzare and difficult to parse, after submitting valid POST request page responds with 302 http status and a bunch of cookies (your formdata is invalid by the way, you need to replace underscores with dollars in your parameters).
Content can be viewed after sending GET to https://a836-propertyportal.nyc.gov/ExemptionDetails.aspx
Most surprising thing is that you can crawl this site using only cookies, without POST request. POST is there only to give you cookies, it does not redirect to or respond with html response. You can manipulate those cookies from your spider. You only need to make first GET to get session cookie, and then successive GETS with borough, block etc.
Try this in scrapy shell:
pawel#stackoverflow:~/stack/scrapy$ scrapy shell "https://a836-propertyportal.nyc.gov/Default.aspx"
In [1]: from scrapy.http import Request
In [2]: req = Request("https://a836-propertyportal.nyc.gov/ExemptionDetails.aspx", cookies = {"borough":"1", "block":"01000", "style":"default", "lot":"0011"})
In [3]: fetch(req)
In [4]: view(response)
Out[5]: True # opening browser window
Response at this point will contain data for property with given block, borough and lot. Now you only need to use this knowledge in your spider. Just replace your POST with GET with cookies, add callback to what you have in shell and it should work fine.
If this still does not work or is somehow unsuited to your purposes try extracting hidden ajax parameter (the value of nullctl00_ScriptManager1_HiddenField), add this to formdata (and of course correct your formdata so that it is identical to what browser sends).

You don't click the search button but you make a POST request to a page with all the data. But checking the code, it's send a lot of data. Below I posted my requests...
ctl00_ScriptManager1_HiddenField:;;AjaxControlToolkit, Version=3.0.11119.25904, Culture=neutral, PublicKeyToken=28f01b0e84b6d53e:en-US:f48478dd-9360-4d50-94c1-5c5fa55bd379:865923e8:411fea1c:e7c87f07:91bd373d:1d58b08c:8e72a662:acd642d2:596d588c:77c58d20:14b56adc:269a19ae:bbfda34c:30a78ec5:5430d994
__EVENTTARGET:
__EVENTARGUMENT:
__VIEWSTATE:/wEPDwULLTEwMDA4NDY4ODAPZBYCZg9kFgICBQ9kFgQCAg9kFgQCAQ8WAh4HVmlzaWJsZWhkAgcPFgIfAGgWAgIBDxYCHglpbm5lcmh0bWwFGEFsZXJ0IGZvcjxiciAvPiBCQkwgOiAtLWQCBA9kFgQCAg9kFgQCAQ9kFgRmDw8WBB4IQ3NzQ2xhc3MFF2FjY29yZGlvbkhlYWRlclNlbGVjdGVkHgRfIVNCAgJkZAIBDw8WBB8CBRBhY2NvcmRpb25Db250ZW50HwMCAhYCHgVzdHlsZQUOZGlzcGxheTpibG9jaztkAgIPZBYEZg8PFgQfAgUPYWNjb3JkaW9uSGVhZGVyHwMCAmRkAgEPDxYEHwIFEGFjY29yZGlvbkNvbnRlbnQfAwICFgIfBAUNZGlzcGxheTpub25lOxYCAgEPZBYCZg9kFgZmDw9kFgIfBAUNZGlzcGxheTpub25lO2QCDA8PFgIfAGhkZAINDw8WAh8AaGRkAgMPD2QWBh4FU3R5bGUFN3dpZHRoOjM1MHB4O2JhY2tncm91bmQ6d2hpdGU7ZGlzcGxheTpub25lO29wYWNpdHk6MC45MjseC29ubW91c2VvdmVyBQ93d2hIZWxwLnNob3coKTseCm9ubW91c2VvdXQFD3d3aEhlbHAuaGlkZSgpO2Rky2sFuMlw1iy/E0GN9cB65RXg7Aw=
__EVENTVALIDATION:/wEWGgKWm9a2BgL687aTAwLmha0BAujn2IECAo3DtaEJAtLdz/kGAr3g5K4DAu78ttcEAvOB3+MGAvKB3+MGAvGB3+MGAvCB3+MGAveB3+MGAoHAg44PArT/mOoPAqrvlMAJAtzQstcEAoDswboFAoHswboFAoLswboFAoPswboFAoTswboFAtjqpO8KAujQ7b0GAqvgnb0NAsPa/KsBQz19YIqBRvCWvZh8bk6XKxp+wQo=
grpStyle:blue
ctl00$SampleContent$MyAccordion_AccordionExtender_ClientState:0
ctl00$SampleContent$ctl01$TextBox1:(unable to decode value)
ctl00$SampleContent$ctl01$ddlParclBorough:1
ctl00$SampleContent$ctl01$txtBlock:100
ctl00$SampleContent$ctl01$txtLot:200
ctl00$SampleContent$ctl01$btnSearchBBL:Please Wait...
ctl00$SampleContent$ctl03$TextBox2:(unable to decode value)
ctl00$SampleContent$ctl03$ddlParclBoroughPropAddr:1
ctl00$SampleContent$ctl03$txtHouseNbr:
ctl00$SampleContent$ctl03$txtStreetNm:
ctl00$SampleContent$ctl03$txtAptNbr:
My suggestion is to use a scrap lib which supports executing JS. Or use something else. I had many success using Selenium and WebDriver to execute code in browser, which supports JS.
Update:
You have an example How to submit a form using PhantomJS.

How to avoid scrapy ignoring hash tag

i am working on scrapy
I had a site to scrape with hash tag included , but when i run it , scrapy downloading the response by ignoring hash tag
For example this is the url with hash fragments, url="www.example.com/hash-tag.php#user_id-654"
and the response from this request is only www.example.com/hash-tag.php, but i want to scrape the url with hash fragments.
My code is below
class ExampleSpider(BaseSpider):
name = "example"
domain_name = "www.example.com"
def start_requests(self):
return Request("www.example.com/hash-tag.php#user_id-654")
def parse(self):
print response
Result:
<GET www.example.com/hash-tag.php>
How can i do this......
Thanks in advance................

What you are trying to do is not easily possible. To achieve what you want you need a full DOM and JavaScript engine, i.e. a (possibly headless) browser.
If you really need it, have a look at PhantomJS. It is the WebKit engine but completely headless. I'm not sure if scrapy can be easily extended but if you really want to execute JavaScript (which you need in this case), using PhantomJS is probably the way to go.

Well if you really need that information you could just do split string before calling Request, and send that information as meta.
Something like
url = "www.example.com/hash-tag.php#user_id-654"
hash = url.split("#")[1]
request = Request(url, callback=self.parse_something)
request.meta['after_hash'] = hash
yield request
and then in parsing get and use it like
def parse_something(self, response):
hash = response.meta['after_hash']
That is if you only need that information after hash sign.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

scrapy response not able to crawl because of bad characters - python

Related

Requests Module not fetching full website in Python

Getting Selenium page source before JavaScript execution

Scrapy - Xpath works in shell but not in code

Scraping Page That Requires JavaScript Interaction

How to avoid scrapy ignoring hash tag

Categories

Resources