I change the default request headers in settings.py as below:
DEFAULT_REQUEST_HEADERS = {
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.84 Safari/537.36',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
'Accept-Encoding': 'gzip, deflate, sdch',
'Accept-Language': 'en-US,en;q=0.8,zh-CN;q=0.6,zh;q=0.4',
}
However, it doesn't work in my HotSpider. I can see scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware was enabled, but connection was closed cleanly as if the headers were not set.
Here is the HotSpider:
# -*- coding: utf-8 -*-
import scrapy
class HotSpider(scrapy.Spider):
name = "hot"
allowed_domains = ["qiushibaike.com"]
start_urls = (
'http://www.qiushibaike.com/hot',
)
def parse(self, response):
print '\n', response.status, '\n'
If I change the code to override the make_requests_from_url to set the header, everything works well.
# -*- coding: utf-8 -*-
import scrapy
class HotSpider(scrapy.Spider):
name = "hot"
allowed_domains = ["qiushibaike.com"]
start_urls = (
'http://www.qiushibaike.com/hot',
)
headers = {
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.84 Safari/537.36',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
'Accept-Encoding': 'gzip, deflate, sdch',
'Accept-Language': 'en-US,en;q=0.8,zh-CN;q=0.6,zh;q=0.4',
}
def make_requests_from_url(self, url):
return scrapy.http.Request(url, headers=self.headers)
def parse(self, response):
print '\n', response.status, '\n'
This problem will be settled in Scrapy 1.2 according to prioritize default headers over user agent middlewares #2091
I see User-Agent header is indeed not set properly when using default headers middleware and this particular site refuses connections without some expected user-agent header.
Recommended way to set user-agent for your crawler is by using USER_AGENT setting key:
e.g.
# settings.py
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.106 Safari/537.36"
not setting user-agent when using default headers might be some bug in Scrapy, or maybe this is expected and documented somewhere. You need to do more research about this, if it is indeed bug it's worth posting bug report in Scrapy github repo.
Related
Here's an example link I'm trying to scrape: https://www.lowes.com/pd/ZLINE-KITCHEN-BATH-Professional-7-Burners-4-cu-ft-2-cu-ft-Double-Oven-Convection-Dual-Fuel-Range-Stainless-Steel-Common-48-in-Actual-48-in/1000514227
My scraper was working fine till today so I'm guessing Lowe's added more protection against bots :(
After some research, I found that I would have to add headers to my web scraper so I can emulate a real user.
Opened up Dev Console -> Network -> XHR/Fetch -> Found JSON File.
Here's my scrapy script
# -*- coding: utf-8 -*-
import scrapy
from ..items import LowesItem
import re
import pandas as pd
import requests
import json
from scrapy.http import Request
from datetime import date
class LowesSpider(scrapy.Spider):
name = 'Lowes'
def start_requests(self):
HEADERS = {
'method': 'GET',
'scheme': 'https',
'authority': 'content.syndigo.com',
'Accept': '*/*',
'Content-Type': 'text/plain',
'Origin': 'https://lowes.com',
'Accept-Language': 'en-US,en;q=0.9',
'Host': 'content.syndigo.com',
'User-Agent': ' Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.4 Safari/605.1.15',
'Referer': 'https://www.lowes.com/',
'Accept-Encoding': 'gzip, deflate, br',
'Connection': 'keep-alive',
'Cookie': 'sn=0321'
}
start_urls = ['https://www.lowes.com/pd/ZLINE-KITCHEN-BATH-Professional-7-Burners-4-cu-ft-2-cu-ft-Double-Oven-Convection-Dual-Fuel-Range-Stainless-Steel-Common-48-in-Actual-48-in/1000514227']
for url in start_urls:
yield Request(url,
headers=HEADERS,
meta={'dont_merge_cookies': True,
'url':url})
def parse(self, response):
for item in self.parseLowes(response):
yield item
pass
def parseLowes(self, response):
item = LowesItem() #items from items.py
script_tag = response.xpath('//script[#type="application/ld+json"]/text()').get() #get js container
productPrice = json.loads(script_tag)[2]["offers"]["price"]
productURL = response.url
url = response.meta['url']
productSKU = url.split("=")[-1]
scrapedDate = date.today()
#item['productName'] = productName #display product name
item['productOMS'] = productSKU
item['productPrice'] = productPrice #display price and assign to variable
item['productURL'] = productURL #displayURL
item['scrapedDate'] = scrapedDate
yield item
When I run scrapy, I get 400 as a response from the command.
From what I can see about the network connection, the issue is related to the their CDN (Akamai) which is blocking the access.
I was able to access your link and see the product from Microsoft Edge (version 107). In my request the user agent is:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36 Edg/107.0.1418.26
So, try to modify in your code the 'User-Agent' key with that value.
I'm trying to create a script using scrapy to grab json content from this webpage. I've used headers within the script accordingly but when I run it, I always end up getting JSONDecodeError. The site sometimes throws captcha but not always. However, I've never got any success using the script below even when I used vpn. How can I fix it?
This is how I've tried:
import scrapy
import urllib
class ImmobilienScoutSpider(scrapy.Spider):
name = "immobilienscout"
start_url = "https://www.immobilienscout24.de/Suche/de/nordrhein-westfalen/wohnung-kaufen"
headers = {
'accept': 'application/json; charset=utf-8',
'accept-encoding': 'gzip, deflate, br',
'accept-language': 'en-US,en;q=0.9',
'x-requested-with': 'XMLHttpRequest',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36',
}
params = {
'price': '1000.0-',
'constructionyear': '-2000',
'pagenumber': '1'
}
def start_requests(self):
req_url = f'{self.start_url}?{urllib.parse.urlencode(self.params)}'
yield scrapy.Request(
url=req_url,
headers=self.headers,
callback=self.parse,
)
def parse(self,response):
yield {"response":response.json()}
This is how the output should look like (truncated):
{"searchResponseModel":{"additional":{"lastSearchApiUrl":"/region?realestatetype=apartmentbuy&price=1000.0-&constructionyear=-2000&pagesize=20&geocodes=1276010&pagenumber=1","title":"Eigentumswohnung in Nordrhein-Westfalen - ImmoScout24","sortingOptions":[{"description":"Standardsortierung","code":0},{"description":"Kaufpreis (höchste zuerst)","code":3},{"description":"Kaufpreis (niedrigste zuerst)","code":4},{"description":"Zimmeranzahl (höchste zuerst)","code":5},{"description":"Zimmeranzahl (niedrigste zuerst)","code":6},{"description":"Wohnfläche (größte zuerst)","code":7},{"description":"Wohnfläche (kleinste zuerst)","code":8},{"description":"Neubau-Projekte (Projekte zuerst)","code":31},{"description":"Aktualität (neueste zuerst)","code":2}],"pagerTemplate":"|Suche|de|nordrhein-westfalen|wohnung-kaufen?price=1000.0-&constructionyear=-2000&pagenumber=%page%","sortingTemplate":"|Suche|de|nordrhein-westfalen|wohnung-kaufen?price=1000.0-&constructionyear=-2000&sorting=%sorting%","world":"LIVING","international":false,"device":{"deviceType":"NORMAL","devicePlatform":"UNKNOWN","tablet":false,"mobile":false,"normal":true}
EDIT:
This is how the script built upon requests module looks like:
import requests
link = 'https://www.immobilienscout24.de/Suche/de/nordrhein-westfalen/wohnung-kaufen'
headers = {
'accept': 'application/json; charset=utf-8',
'accept-encoding': 'gzip, deflate, br',
'accept-language': 'en-US,en;q=0.9',
'x-requested-with': 'XMLHttpRequest',
'content-type': 'application/json; charset=utf-8',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36',
'referer': 'https://www.immobilienscout24.de/Suche/de/nordrhein-westfalen/wohnung-kaufen?price=1000.0-&constructionyear=-2000&pagenumber=1',
# 'cookie': 'hardcoded cookies'
}
params = {
'price': '1000.0-',
'constructionyear': '-2000',
'pagenumber': '2'
}
sess = requests.Session()
sess.headers.update(headers)
resp = sess.get(link,params=params)
print(resp.json())
Scrapy's CookiesMiddleware disregards 'cookie' passed in headers.
Reference: scrapy/scrapy#1992
Pass cookies explicitly:
yield scrapy.Request(
url=req_url,
headers=self.headers,
callback=self.parse,
# Add the following line:
cookies={k: v.value for k, v in http.cookies.SimpleCookie(self.headers.get('cookie', '')).items()},
),
Note: That site uses GeeTest CAPTCHA, which cannot be solved by simply rendering the page or using Selenium, so you still need to periodically update the hardcoded cookie (cookie name: reese84) taken from the browser, or use a service like 2Captcha.
I'm new to scrapy, so please be kind:))
So, I'm going to scrape some JSON files from devtools-Network from multiple pages with scrapy. However each of the pages have different headers. How can I solve this
Lets use this
import scrapy
import json
import scrapy
import json
class QuoteSpider(scrapy.Spider):
name = 'quote'
allowed_domains = ['shopee.co.id']
page = 1
start_urls = ['https://shopee.co.id/api/v2/search_items/?by=relevancy&keyword=deodorant&limit=50&newest=0&order=desc&page_type=search&version=2']
def parse(self, response):
data = json.loads(response.text)
for quote in data["quotes"]:
yield {"quote": quote["text"]}
if data["has_next"]:
self.page += 1
url = "https://shopee.co.id/api/v2/search_items/?by=relevancy&keyword=deodorant&limit=50&newest="+str(self.page)+"&order=desc&page_type=search&version=2"
headers = {
'accept' : '*/*'
,'accept-encoding': 'gzip, deflate, br'
,'accept-language': 'en-US,en;q=0.9'
,'if-none-match-' : '55b03-ba4b020f9bad34856fc4771b0aaedc93'
,'referer' : 'https://shopee.co.id/search?keyword=deodorant&page='+str(self.page)
,'sec-fetch-dest' : 'empty'
,'sec-fetch-mode' : 'cors'
,'sec-fetch-site' : 'same-origin'
,'user-agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.113 Safari/537.36'
,'x-api-source' : 'pc'
,'x-requested-with': 'XMLHttpRequest'
}
yield scrapy.Request(url=url, callback=self.parse, method='GET', headers=headers)
How would I add a unique headers for each pages, for example,
header = {
'referer': 'some_reference_pagenumber',
}
I tried the script and tried to tinker with it, yet it always result in
referer:none in the shell, thus unable to scrape
Thanks a lot!
you don't have to add the parameters in headers that you want to change as per page. suppose you want to change 'referer' according to some logic then initialize the headers without 'referer'.
headers = {
'accept' : '*/*'
,'accept-encoding': 'gzip, deflate, br'
,'accept-language': 'en-US,en;q=0.9'
,'if-none-match-' : '55b03-ba4b020f9bad34856fc4771b0aaedc93'
,'sec-fetch-dest' : 'empty'
,'sec-fetch-mode' : 'cors'
,'sec-fetch-site' : 'same-origin'
,'user-agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.113 Safari/537.36'
,'x-api-source' : 'pc'
,'x-requested-with': 'XMLHttpRequest'
}
and as you know that headers is a dictionary and we can append it as we want. so now we can append it according to our logic. see below:
if 'deoderand' in keyword:
headers['referer'] = f'https://shopee.co.id/search?keyword={keyword}'
yield scrapy.Request(url=url, callback=self.parse, method='GET', headers=headers)
every time, when your logic will be true it will modify your headers's referer as you want.
I'm trying to get a POST request, but I don't know what's wrong with my code that the data doesn't come.
The following message is displayed:
HTTP status code is not handled or not allowed
This is the website
A screenshot of the header:
This is my code:
import json
import scrapy
class MySpider(scrapy.Spider):
name = 'pb'
payload = {"version":"1.0.0","queries":[{"Query":{"Commands":[{"SemanticQueryDataShapeCommand":{"Query":{"Version":2,"From":[{"Name":"e","Entity":"Events"},{"Name":"d","Entity":"DAX"}],"Select":[{"Column":{"Expression":{"SourceRef":{"Source":"e"}},"Property":"Date Start"},"Name":"Events.Date Start"},{"Column":{"Expression":{"SourceRef":{"Source":"e"}},"Property":"Event Type"},"Name":"Events.Event Type"},{"Column":{"Expression":{"SourceRef":{"Source":"e"}},"Property":"Name"},"Name":"Events.Name"},{"Measure":{"Expression":{"SourceRef":{"Source":"d"}},"Property":"Length"},"Name":"Events.Total Days"},{"Column":{"Expression":{"SourceRef":{"Source":"e"}},"Property":"Location"},"Name":"Events.Location"},{"Column":{"Expression":{"SourceRef":{"Source":"e"}},"Property":"Link to Event"},"Name":"Events.Link to Event"},{"Measure":{"Expression":{"SourceRef":{"Source":"d"}},"Property":"Days Until Event"},"Name":"DAX.Days Until"},{"Column":{"Expression":{"SourceRef":{"Source":"e"}},"Property":"Link to Submit"},"Name":"Events.Link to Submit"},{"Measure":{"Expression":{"SourceRef":{"Source":"d"}},"Property":"Event Type Number"},"Name":"DAX.Event Type Number"}],"OrderBy":[{"Direction":1,"Expression":{"Column":{"Expression":{"SourceRef":{"Source":"e"}},"Property":"Date Start"}}}]},"Binding":{"Primary":{"Groupings":[{"Projections":[0,1,2,3,4,5,6,7,8]}]},"DataReduction":{"DataVolume":3,"Primary":{"Window":{"Count":500}}},"Aggregates":[{"Select":3,"Aggregations":[{"Min":{}},{"Max":{}}]}],"SuppressedJoinPredicates":[8],"Version":1}}}]},"CacheKey":"{\"Commands\":[{\"SemanticQueryDataShapeCommand\":{\"Query\":{\"Version\":2,\"From\":[{\"Name\":\"e\",\"Entity\":\"Events\"},{\"Name\":\"d\",\"Entity\":\"DAX\"}],\"Select\":[{\"Column\":{\"Expression\":{\"SourceRef\":{\"Source\":\"e\"}},\"Property\":\"Date Start\"},\"Name\":\"Events.Date Start\"},{\"Column\":{\"Expression\":{\"SourceRef\":{\"Source\":\"e\"}},\"Property\":\"Event Type\"},\"Name\":\"Events.Event Type\"},{\"Column\":{\"Expression\":{\"SourceRef\":{\"Source\":\"e\"}},\"Property\":\"Name\"},\"Name\":\"Events.Name\"},{\"Measure\":{\"Expression\":{\"SourceRef\":{\"Source\":\"d\"}},\"Property\":\"Length\"},\"Name\":\"Events.Total Days\"},{\"Column\":{\"Expression\":{\"SourceRef\":{\"Source\":\"e\"}},\"Property\":\"Location\"},\"Name\":\"Events.Location\"},{\"Column\":{\"Expression\":{\"SourceRef\":{\"Source\":\"e\"}},\"Property\":\"Link to Event\"},\"Name\":\"Events.Link to Event\"},{\"Measure\":{\"Expression\":{\"SourceRef\":{\"Source\":\"d\"}},\"Property\":\"Days Until Event\"},\"Name\":\"DAX.Days Until\"},{\"Column\":{\"Expression\":{\"SourceRef\":{\"Source\":\"e\"}},\"Property\":\"Link to Submit\"},\"Name\":\"Events.Link to Submit\"},{\"Measure\":{\"Expression\":{\"SourceRef\":{\"Source\":\"d\"}},\"Property\":\"Event Type Number\"},\"Name\":\"DAX.Event Type Number\"}],\"OrderBy\":[{\"Direction\":1,\"Expression\":{\"Column\":{\"Expression\":{\"SourceRef\":{\"Source\":\"e\"}},\"Property\":\"Date Start\"}}}]},\"Binding\":{\"Primary\":{\"Groupings\":[{\"Projections\":[0,1,2,3,4,5,6,7,8]}]},\"DataReduction\":{\"DataVolume\":3,\"Primary\":{\"Window\":{\"Count\":500}}},\"Aggregates\":[{\"Select\":3,\"Aggregations\":[{\"Min\":{}},{\"Max\":{}}]}],\"SuppressedJoinPredicates\":[8],\"Version\":1}}}]}","QueryId":"","ApplicationContext":{"DatasetId":"6427f3c6-42f6-4287-b061-c31c1d2e7ae0","Sources":[{"ReportId":"6e442642-8594-4894-bc32-0ab7f4620772"}]}}],"cancelQueries":[],"modelId":1226835}
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36'
def start_requests(self):
yield scrapy.Request(
url='https://wabi-australia-southeast-api.analysis.windows.net/public/reports/querydata?synchronous=true',
method='POST',
body=json.dumps(self.payload),
headers={
'Accept-Language': 'pt-BR,pt;q=0.9,en;q=0.8',
'ActivityId': '1d3ecdc2-5dc0-801e-4140-82a258f127a6',
'Connection': 'keep-alive',
'Content-Length': '3462',
'Content-Type': 'application/json;charset=UTF-8',
'Host': 'wabi-australia-southeast-api.analysis.windows.net',
'Origin': 'https://app.powerbi.com',
'Referer': 'https://app.powerbi.com/view?r=eyJrIjoiMGIwNTY2MjgtMzJhYy00MzEwLTk5MDAtYTI2MGVlMzk1NjM2IiwidCI6IjZmMGU5YzQyLTk2Y2UtNDU1MS05NzAxLWJhMzFkMGQ2ZDE5ZSJ9',
'RequestId': '11c18fe6-00da-7df4-952c-98ba7bdf188e',
'Sec-Fetch-Mode': 'cors',
'Sec-Fetch-Site': 'cross-site',
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36',
'X-PowerBI-ResourceKey': '0b056628-32ac-4310-9900-a260ee395636'
}
)
def parse(self, response):
items = json.loads(response.text)
yield {"data":items}
The request in your screenshot is a GET request.
The behaviour of this website is very interesting!
Let's examine it.
By looking at the network panel we can see that GET request is being made to some complex url with many various headers. However It seems that the header X-PowerBI-ResourceKey is the only one that's needed and it controls what content the request will return.
So all we need to replicate this is find the X-PowerBI-ResourceKey value.
If you take a look at the source code of the html page:
https://app.powerbi.com/view?r=eyJrIjoiMGIwNTY2MjgtMzJhYy00MzEwLTk5MDAtYTI2MGVlMzk1NjM2IiwidCI6IjZmMGU5YzQyLTk2Y2UtNDU1MS05NzAxLWJhMzFkMGQ2ZDE5ZSJ9
Here we can see that javascript's atob method is used on url parameter. This is javascripts b64decode function. We can run it in python:
$ ptpython
>>> from base64 import b64decode
>>> b64decode("eyJrIjoiMGIwNTY2MjgtMzJhYy00MzEwLTk5MDAtYTI2MGVlMzk1NjM2IiwidCI6IjZmMGU5YzQyLTk2Y2UtNDU1MS05NzAxLWJhMzF
1 kMGQ2ZDE5ZSJ9")
b'{"k":"0b056628-32ac-4310-9900-a260ee395636","t":"6f0e9c42-96ce-4551-9701-ba31d0d6d19e"}'
We got it figured out! Now lets put everything together in our crawler:
import json
from base64 import b64decode
from w3lib.url import url_query_parameter
def parse(self, response):
url = "https://app.powerbi.com/view?r=eyJrIjoiMGIwNTY2MjgtMzJhYy00MzEwLTk5MDAtYTI2MGVlMzk1NjM2IiwidCI6IjZmMGU5YzQyLTk2Y2UtNDU1MS05NzAxLWJhMzFkMGQ2ZDE5ZSJ9"
# get the "r" paremeter from url
resource_key = url_query_parameter(url, 'r')
# base64 decode it
resource_key = b64decode(resource_key)
# {'k': '0b056628-32ac-4310-9900-a260ee395636', 't': '6f0e9c42-96ce-4551-9701-ba31d0d6d19e'}
# it's a json string - load it and get key "k"
resource_key = json.loads(resource_key)['k']
headers = {
'Accept': "application/json, text/plain, */*",
# 'X-PowerBI-ResourceKey': "0b056628-32ac-4310-9900-a260ee395636",
'X-PowerBI-ResourceKey': resource_key,
'User-Agent': "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36",
'Accept-Encoding': "gzip, deflate, br",
'Accept-Language': "en-US,en;q=0.9,lt;q=0.8,et;q=0.7,de;q=0.6",
}
yield Request(url, headers=headers)
i want scrap the PINCODEs from "http://www.indiapost.gov.in/pin/", i am doing with following code written.
import urllib
import urllib2
headers = {
'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Origin': 'http://www.indiapost.gov.in',
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17',
'Content-Type': 'application/x-www-form-urlencoded',
'Referer': 'http://www.indiapost.gov.in/pin/',
'Accept-Encoding': 'gzip,deflate,sdch',
'Accept-Language': 'en-US,en;q=0.8',
'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3'
}
viewstate = 'JulXDv576ZUXoVOwThQQj4bDuseXWDCZMP0tt+HYkdHOVPbx++G8yMISvTybsnQlNN76EX/...'
eventvalidation = '8xJw9GG8LMh6A/b6/jOWr970cQCHEj95/6ezvXAqkQ/C1At06MdFIy7+iyzh7813e1/3Elx...'
url = 'http://www.indiapost.gov.in/pin/'
formData = (
('__EVENTVALIDATION', eventvalidation),
('__EVENTTARGET',''),
('__EVENTARGUMENT',''),
('__VIEWSTATE', viewstate),
('__VIEWSTATEENCRYPTED',''),
('__EVENTVALIDATION', eventvalidation),
('txt_offname',''),
('ddl_dist','0'),
('txt_dist_on',''),
('ddl_state','2'),
('btn_state','Search'),
('txt_stateon',''),
('hdn_tabchoice','3')
)
from urllib import FancyURLopener
class MyOpener(FancyURLopener):
version = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17'
myopener = MyOpener()
encodedFields = urllib.urlencode(formData)
f = myopener.open(url, encodedFields)
print f.info()
try:
fout = open('tmp.txt', 'w')
except:
print('Could not open output file\n')
fout.writelines(f.readlines())
fout.close()
i am getting response from server as "Sorry this site has encountered a serious problem, please try reloading the page or contact webmaster."
pl suggest where i am going wrong..
Where did you get the value viewstate and eventvalidation? On one hand, they shouldn't end with "...", you must have omitted something. On the other hand, they shouldn't be hard-coded.
One solution is like this:
Retrieve the page via URL "http://www.indiapost.gov.in/pin/" without any form data
Parse and retrieve the form values like __VIEWSTATE and __EVENTVALIDATION (you may take use of BeautifulSoup).
Get the search result(second HTTP request) by adding vital form-data from step 2.
UPDATE:
According to the above idea, I modify your code slightly to make it work:
import urllib
from bs4 import BeautifulSoup
headers = {
'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Origin': 'http://www.indiapost.gov.in',
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17',
'Content-Type': 'application/x-www-form-urlencoded',
'Referer': 'http://www.indiapost.gov.in/pin/',
'Accept-Encoding': 'gzip,deflate,sdch',
'Accept-Language': 'en-US,en;q=0.8',
'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3'
}
class MyOpener(urllib.FancyURLopener):
version = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17'
myopener = MyOpener()
url = 'http://www.indiapost.gov.in/pin/'
# first HTTP request without form data
f = myopener.open(url)
soup = BeautifulSoup(f)
# parse and retrieve two vital form values
viewstate = soup.select("#__VIEWSTATE")[0]['value']
eventvalidation = soup.select("#__EVENTVALIDATION")[0]['value']
formData = (
('__EVENTVALIDATION', eventvalidation),
('__VIEWSTATE', viewstate),
('__VIEWSTATEENCRYPTED',''),
('txt_offname', ''),
('ddl_dist', '0'),
('txt_dist_on', ''),
('ddl_state','1'),
('btn_state', 'Search'),
('txt_stateon', ''),
('hdn_tabchoice', '1'),
('search_on', 'Search'),
)
encodedFields = urllib.urlencode(formData)
# second HTTP request with form data
f = myopener.open(url, encodedFields)
try:
# actually we'd better use BeautifulSoup once again to
# retrieve results(instead of writing out the whole HTML file)
# Besides, since the result is split into multipages,
# we need send more HTTP requests
fout = open('tmp.html', 'w')
except:
print('Could not open output file\n')
fout.writelines(f.readlines())
fout.close()