Web Scraping ASP.NET site using Beautiful Soup and Python

I have the following code, but it only returns 200 OK with the first page (the drop-downs in their default state). Please note that the drop-down lists are dynamic and progressive: they appear one after another until the final search button is shown. Can someone tell me what is wrong with my code?
def process(ghatno):
    home_url = 'http://igrmaharashtra.gov.in/eASR/eASRCommon.aspx?hDistName=Nashik'
    post_url = 'http://igrmaharashtra.gov.in/eASR/eASRCommon.aspx?hDistName=Nashik'
    print "Please wait...getting details of :" + ghatno
    with requests.Session() as session:
        r = session.get(url=post_url)
        cookies = r.cookies
        pprint.pprint(r.headers)
        gethead = r.headers
        soup = BeautifulSoup(r.text, 'html.parser')
        viewstate = soup.select('input[name="__VIEWSTATE"]')[0]['value']
        csrftoken = soup.select('input[name="__CSRFTOKEN"]')[0]['value']
        eventvalidation = soup.select('input[name="__EVENTVALIDATION"]')[0]['value']
        viewgen = soup.select('input[name="__VIEWSTATEGENERATOR"]')[0]['value']
        data = {
            '__CSRFTOKEN':csrftoken,
            '__EVENTARGUMENT':'',
            '__EVENTTARGET':'',
            '__LASTFOCUS':'',
            '__SCROLLPOSITION':'0',
            '__SCROLLPOSITIONY':'0',
            '__EVENTVALIDATION': eventvalidation,
            '__VIEWSTATE':viewstate,
            '__VIEWSTATEGENERATOR': viewgen,
            'ctl00$ContentPlaceHolder5$ddlLanguage' : 'en-US',
            'ctl00$ContentPlaceHolder5$btnSearchCommonSr':'Search',
            'ctl00$ContentPlaceHolder5$ddlTaluka': '2',
            'ctl00$ContentPlaceHolder5$ddlVillage': '25',
            'ctl00$ContentPlaceHolder5$ddlYear': '20192020',
            'ctl00$ContentPlaceHolder5$grpSurveyLocation': 'rdbSurveyNo',
            'ctl00$ContentPlaceHolder5$txtCommonSurvey': 363
        }
        headers = {
            'Host': 'igrmaharashtra.gov.in',
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:60.0) Gecko/20100101 Firefox/60.0',
            'Referer': 'http://igrmaharashtra.gov.in/eASR/eASRCommon.aspx?hDistName=Nashik',
            'Host': 'igrmaharashtra.gov.in',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language': 'en-US,en;q=0.5',
        }
        r = requests.post(url=post_url, data=json.dumps(data), cookies=cookies, headers = headers)
        soup = BeautifulSoup(r.text, 'html.parser')
        table = SoupStrainer('tr')
        soup = BeautifulSoup(soup.get_text(), 'html.parser', parse_only=table)
        print(soup.get_text())
        pprint.pprint(r.headers)
        print r.text
        getpost = r.headers
        getpostrequest = r.request.headers
        getresponsebody = r.request.body
        f = open('/var/www/html/nashik/hiren.txt', 'w')
        f.write(str(gethead))
        f.write(str(getpostrequest))
        f.write(str(getresponsebody))
        f.write(str(getpost))
The output I get is as follows:
Response header - (GET Request)
{'Content-Length': '5994', 'X-AspNet-Version': '4.0.30319', 'Set-Cookie': 'ASP.NET_SessionId=24wwh11lwvzy5gf0xlzi1we4; path=/; HttpOnly, __CSRFCOOKIE=d7b10286-fc9f-4ed2-863d-304737df8758; path=/; HttpOnly', 'Content-Encoding': 'gzip', 'Vary': 'Accept-Encoding', 'X-Powered-By': 'ASP.NET', 'Server': 'Microsoft-IIS/8.0', 'Cache-Control': 'private', 'Date': 'Thu, 02 May 2019 08:21:48 GMT', 'Content-Type': 'text/html; charset=utf-8'}
Request header - (POST Request)
{'Content-Length': '3726', 'Accept-Language': 'en-US,en;q=0.5', 'Accept-Encoding': 'gzip, deflate', 'Host': 'igrmaharashtra.gov.in', 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:60.0) Gecko/20100101 Firefox/60.0', 'Connection': 'keep-alive', 'Referer': 'http://igrmaharashtra.gov.in/eASR/eASRCommon.aspx?hDistName=Nashik', 'Cookie': '__CSRFCOOKIE=d7b10286-fc9f-4ed2-863d-304737df8758; ASP.NET_SessionId=24wwh11lwvzy5gf0xlzi1we4', 'Content-Type': 'application/x-www-form-urlencoded'}
Response header - (POST Request)
{'Content-Length': '7834', 'X-AspNet-Version': '4.0.30319', 'Content-Encoding': 'gzip', 'Vary': 'Accept-Encoding', 'X-Powered-By': 'ASP.NET', 'Server': 'Microsoft-IIS/8.0', 'Cache-Control': 'private', 'Date': 'Fri, 03 May 2019 10:21:45 GMT', 'Content-Type': 'text/html; charset=utf-8'}
The page is returned with the drop-downs in their default state:
नाशिक (Nashik) and
- - Select Taluka - - instead of option value "2", i.e. इगतपुरी (Igatpuri). Once option "2" is selected, I want value "25" in the next drop-down before I submit my final survey number "363" for results.
Please note that I also tried the Mechanize browser, with no luck.

In the end, the solution was to send the POST requests several times within the same "session", with the same "cookie", and iterate through them. It works now!
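The idea of repeating the POST within one session can be sketched roughly as follows. This is only a minimal sketch, not the poster's actual code: it assumes each drop-down triggers an ASP.NET postback whose response carries fresh hidden fields (__VIEWSTATE, __EVENTVALIDATION and so on) that must be sent back with the next request. The helper names hidden_fields, postback and search are made up for illustration, the standard-library HTML parser stands in for Beautiful Soup, and session is assumed to be a requests.Session (anything with get and post methods will do).

```python
from html.parser import HTMLParser


class HiddenInputs(HTMLParser):
    """Collect name/value pairs of all <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        is_hidden = (attrs.get("type") or "").lower() == "hidden"
        if tag == "input" and is_hidden and attrs.get("name"):
            self.fields[attrs["name"]] = attrs.get("value") or ""


def hidden_fields(html):
    """Return the hidden form fields of a page as a dict (hypothetical helper)."""
    parser = HiddenInputs()
    parser.feed(html)
    return parser.fields


def postback(session, url, html, event_target, form_values):
    """POST the form once, using the hidden fields of the page just received."""
    data = hidden_fields(html)            # fresh __VIEWSTATE, __EVENTVALIDATION, ...
    data["__EVENTTARGET"] = event_target  # the control that triggers the postback
    data.update(form_values)
    # Passing a dict as data= sends a form-encoded body (not JSON).
    return session.post(url, data=data).text


def search(session, url, steps, final_values):
    """Replay the drop-down selections in order, then submit the final form."""
    html = session.get(url).text
    chosen = {}
    for field, value in steps:
        chosen[field] = value
        html = postback(session, url, html, field, chosen)
    return postback(session, url, html, "", {**chosen, **final_values})
```

With requests installed, a call might look like search(requests.Session(), post_url, [('ctl00$ContentPlaceHolder5$ddlTaluka', '2'), ('ctl00$ContentPlaceHolder5$ddlVillage', '25')], {...remaining form fields...}). The order of the steps and which fields belong in the final submit are assumptions that need to be checked against the requests the browser actually sends.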

Related

How to give an exception that search data returns error message in a loop when sending POST request?

I have a list of IDs, and for each of them I send an HTTP POST request to a website form. Some of them return an error, and the whole script stops without returning the output for the remaining IDs. The site has two radio buttons, one for corporate entities (L) and another for individuals (P); they are in the list named "tip".
The code is supposed to search each ID ('voen') in the website form with 'tip' set to L (corporate) first (in form_data you can see that the key named "tip" refers to the tip list).
import requests
from bs4 import BeautifulSoup
request_headers = {
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
'Accept-Encoding': 'gzip, deflate, br',
'Accept-Language': 'en-US,en;q=0.9',
'Cache-Control': 'max-age=0',
'Connection': 'keep-alive',
'Content-Type': 'application/x-www-form-urlencoded',
'Host': 'www.e-taxes.gov.az',
'Origin': 'https://www.e-taxes.gov.az',
'Referer': 'https://www.e-taxes.gov.az/ebyn/payerOrVoenChecker.jsp',
'Sec-Fetch-Mode': 'navigate',
'Sec-Fetch-Site': 'same-origin',
'Sec-Fetch-User': '?1',
'Upgrade-Insecure-Requests': '1',
'User-Agent': 'YOUR USER AGENT',
}
voens = [1700401281,
4501313952,
]
tip = ['L',
'P',
]
form_data = {
'tip': tip,
'voenOrName': 'V',
'voen': voens,
'name': '',
'submit': ' Yoxla ',
}
url = 'https://www.e-taxes.gov.az/ebyn/payerOrVoenChecker.jsp'
for voen in voens:
    form_data['voen'] = voen
    response = requests.post(url, data=form_data, headers=request_headers)
    s = BeautifulSoup(response.content, 'lxml')
    sContent = s.findAll('table', {'class': 'com'})[0].findAll('tr', recursive=False)[1]
    print(type(sContent))
    if sContent:
        outcome = sContent.get_text().strip()
        print(outcome)
    else:
        form_data['tip'] = tip[1]
        response = requests.post(url, data=form_data, headers=request_headers)
        sContent = s.findAll('table', {'class': 'com'})[0].findAll('tr', recursive=False)[1]
        print(sContent)
In the code above, the "voens" list contains one corporate and one individual ID, and I wrote the code so that if the ID is not corporate, it checks the other "tip" (P).
I think the problem is in the if sContent: line.
ERROR MESSAGE: sContent = s.findAll('table', {'class': 'com'})[0].findAll('tr', recursive=False)[1]
IndexError: list index out of range
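One thing worth noting is that the IndexError is raised by the [1] indexing on the sContent = ... line itself, so the if sContent: check is never reached for that ID. A possible way to keep the loop going is sketched below, under assumptions: lookup is a hypothetical stand-in for "POST the form and return the rows of the result table as a list", and fake_lookup only simulates it so the example runs without the website. The names second_row and check_voen are invented for illustration.

```python
def second_row(rows):
    """Return rows[1], or None when the result table has no such row."""
    try:
        return rows[1]
    except IndexError:
        return None


def check_voen(voen, lookup, tips=("L", "P")):
    """Try each tip in turn; return (tip, row) for the first one that yields a row."""
    for tip in tips:
        row = second_row(lookup(voen, tip))
        if row is not None:
            return tip, row
    return None, None


def fake_lookup(voen, tip):
    # Hypothetical stand-in for the real request: pretend that
    # 4501313952 is not found under tip "L" and returns no rows.
    if voen == 4501313952 and tip == "L":
        return []
    return ["header", "row for %s (%s)" % (voen, tip)]


for voen in [1700401281, 4501313952]:
    print(check_voen(voen, fake_lookup))
```

In the real script, lookup would send the request with form_data['tip'] set to the given tip and parse the new response before looking for the row.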

How to get cookies value to set in requests?

I am accessing the URL https://streeteasy.com/sales/all, which does not show the page unless a Cookie header is set. I have no idea how this cookie value is generated. I highly doubt that the cookie value is fixed, so I guess I can't use a hard-coded cookie value either.
Code below:
import requests
from bs4 import BeautifulSoup
headers = {
'authority': 'streeteasy.com',
'pragma': 'no-cache',
'cache-control': 'no-cache',
'upgrade-insecure-requests': '1',
'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
'referer': 'https://streeteasy.com/sales/all',
'accept-encoding': 'gzip, deflate, br',
'accept-language': 'en-US,en;q=0.9,ur;q=0.8',
'cookie': 'D_SID=103.228.157.1:Bl5GGXCWIxq4AopS1Hkr7nkveq1nlhWXlD3PMrssGpU; _se_t=0944dfa5-bfb4-4085-812e-fa54d44acc54; google_one_tap=0; D_IID=AFB68ACC-B276-36C0-8718-13AB09A55E51; D_UID=23BA0A61-D0DF-383D-88A9-8CF65634135F; D_ZID=C0263FA4-96BF-3071-8318-56839798C38D; D_ZUID=C2322D79-7BDB-3E32-8620-059B1D352789; D_HID=CE522333-8B7B-3D76-B45A-731EB750DF4D; last_search_tab=sales; se%3Asearch%3Asales%3Astate=%7C%7C%7C%7C; streeteasy_site=nyc; se_rs=123%2C1029856%2C123%2C1172313%2C2815; se%3Asearch%3Ashared%3Astate=102%7C%7C%7C%7Cfalse; anon_searcher_stage=initial; se_login_trigger=4; se%3Abig_banner%3Asearch=%7B%22123%22%3A2%7D; se%3Abig_banner%3Ashown=true; se_lsa=2019-07-08+04%3A01%3A30+-0400; _ses=BAh7DEkiD3Nlc3Npb25faWQGOgZFVEkiJWRiODVjZTA1NmYzMzZkMzZiYmU4YTk4Yjk5YmU5ZTBlBjsAVEkiEG5ld192aXNpdG9yBjsARlRJIhFsYXN0X3NlY3Rpb24GOwBGSSIKc2FsZXMGOwBUSSIQX2NzcmZfdG9rZW4GOwBGSSIxbTM5eGRPUVhLeGYrQU1jcjZIdi81ajVFWmYzQWFSQmhxZThNcG92cWxVdz0GOwBGSSIIcGlzBjsARmkUSSIOdXNlcl9kYXRhBjsARnsQOhBzYWxlc19vcmRlckkiD3ByaWNlX2Rlc2MGOwBUOhJyZW50YWxzX29yZGVySSIPcHJpY2VfZGVzYwY7AFQ6EGluX2NvbnRyYWN0RjoNaGlkZV9tYXBGOhJzaG93X2xpc3RpbmdzRjoSbW9ydGdhZ2VfdGVybWkjOhltb3J0Z2FnZV9kb3ducGF5bWVudGkZOiFtb3J0Z2FnZV9kb3ducGF5bWVudF9kb2xsYXJzaQJQwzoSbW9ydGdhZ2VfcmF0ZWYJNC4wNToTbGlzdGluZ3Nfb3JkZXJJIhBsaXN0ZWRfZGVzYwY7AFQ6EHNlYXJjaF92aWV3SSIMZGV0YWlscwY7AFRJIhBsYXN0X3NlYXJjaAY7AEZpAXs%3D--d869dc53b8165c9f9e77233e78c568f610994ba7',
}
session = requests.Session()
response = session.get('https://streeteasy.com/for-sale/downtown', headers=headers, timeout=20)
if response.status_code == 200:
    html = response.text
    soup = BeautifulSoup(html, 'lxml')
    links = soup.select('h3 > a')
    print(links)

Download file from POST request in scrapy

I know there is built-in middleware to handle downloads, but it only accepts a URL. In my case, the download link is a POST request.
When I make that POST request, the PDF file starts downloading.
Now I want to download that file from the POST request in Scrapy.
The website is http://scrb.bihar.gov.in/View_FIR.aspx
You can enter district Aurangabad and police station Kasma PS.
In the last column, Status, there is a link to download the file.
ps_x = '//*[@id="ctl00_ContentPlaceHolder1_ddlPoliceStation"]//option[.="Kasma PS"]/@value'
police_station_val = response.xpath(ps_x).extract_first()
d_x = '//*[@id="ctl00_ContentPlaceHolder1_ddlDistrict"]//option[.="Aurangabad"]/@value'
district_val = response.xpath(d_x).extract_first()
viewstate = response.xpath(self.viewstate_x).extract_first()
viewstategen = response.xpath(self.viewstategen_x).extract_first()
eventvalidator = response.xpath(self.eventvalidator_x).extract_first()
eventtarget = response.xpath(self.eventtarget_x).extract_first()
eventargs = response.xpath(self.eventargs_x).extract_first()
lastfocus = response.xpath(self.lastfocus_x).extract_first()
payload = {
'__EVENTTARGET': eventtarget,
'__EVENTARGUMENT': eventargs,
'__LASTFOCUS': lastfocus,
'__VIEWSTATE': viewstate,
'__VIEWSTATEGENERATOR': viewstategen,
'__EVENTVALIDATION': eventvalidator,
'ctl00$ContentPlaceHolder1$ddlDistrict': district_val,
'ctl00$ContentPlaceHolder1$ddlPoliceStation': police_station_val,
'ctl00$ContentPlaceHolder1$optionsRadios': 'radioPetioner',
'ctl00$ContentPlaceHolder1$txtSearchBy': '',
'ctl00$ContentPlaceHolder1$rptItem$ctl06$lnkStatus.x': '21',
'ctl00$ContentPlaceHolder1$rptItem$ctl06$lnkStatus.y': '24',
}
headers = {
'Connection': 'keep-alive',
'Cache-Control': 'max-age=0',
'Origin': 'http://scrb.bihar.gov.in',
'Upgrade-Insecure-Requests': '1',
'Content-Type': 'application/x-www-form-urlencoded',
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
'Referer': 'http://scrb.bihar.gov.in/View_FIR.aspx',
'Accept-Encoding': 'gzip, deflate',
'Accept-Language': 'en-US,en;q=0.9',
}
# req = requests.post(response.url, data=payload, headers=headers)
# with open('pdf/ch.pdf', 'w+b') as f:
# f.write(req.content)

When you click download, the web browser sends a POST request.
So this answer mentioned by El Ruso earlier is applicable in your case:
.....
def parse(self, response):
    ......
    yield scrapy.FormRequest(
        "http://scrb.bihar.gov.in/View_FIR.aspx",
        # ... your POST request configuration goes here ...
        callback=self.save_pdf,
    )

def save_pdf(self, response):
    # The response body of the POST request is the PDF itself
    path = response.url.split('/')[-1]
    self.logger.info('Saving PDF %s', path)
    with open(path, 'wb') as f:
        f.write(response.body)

Python: How to obtain headers and payload information for requests.Session()?

In Python, how can I obtain headers and payload information for a particular website to make requests via requests.Session()?
e.g.:
headers = {
'Host': 'www.testsite.com',
'Accept': 'application/json',
'Proxy-Connection': 'keep-alive',
'X-Requested-With': 'XMLHttpRequest',
'Accept-Encoding': 'gzip, deflate',
'Accept-Language': 'en-us',
'Content-Type': 'application/x-www-form-urlencoded',
'Origin': 'http://www.testsite.com',
'Connection': 'keep-alive',
'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 7_1_2 like Mac OS X) AppleWebKit/537.51.2 (KHTML, like Gecko) Mobile/11D257',
'Referer': 'http://www.testsite.com/mobile'
}
Thank you in advance; I will be sure to upvote and accept an answer.

Most of those headers are automatically supplied by the requests module. Here is an example:
import requests
from pprint import pprint
with requests.Session() as s:
    s.get('http://httpbin.org/cookies/set?name=joe')
    r = s.get('http://httpbin.org/cookies')
    pprint(dict(r.request.headers))
    assert r.json()['cookies']['name'] == 'joe'
The output of the pprint() call is this:
{'Accept': '*/*',
'Accept-Encoding': 'gzip, deflate',
'Connection': 'keep-alive',
'Cookie': 'name=joe',
'User-Agent': 'python-requests/2.9.1'}
As you can see, s.get() fills in several headers.

A response object has a headers attribute:
import requests
with requests.Session() as s:
    r = s.get("http://google.es")
    print(r.headers)
Output:
>> {
'Date': 'Tue, 22 Aug 2017 00:37:13 GMT',
'Expires': '-1',
'Cache-Control': 'private, max-age=0',
'Content-Type': 'text/html; charset=ISO-8859-1',
...
}

How do I add these headers to my python urllib opener?

headers = {
'Accept': 'application/json, text/javascript, */*; q=0.01',
'X-Requested-With': 'XMLHttpRequest',
'Referer': 'http://www.namestation.com/domain-search?autosearch=1',
'Origin': 'http://www.namestation.com',
'Host': 'www.namestation.com',
'Content-Type': 'application/json; charset=UTF-8',
'Connection': 'keep-alive'
}
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
opener.addHeaders(headers)?

Your opener should have an attribute addheaders, which is a list of tuples. By default it contains the user agent.
opener.addheaders.append(('Host', 'www.namestation.com'))
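For readers on Python 3, where urllib2 and cookielib became urllib.request and http.cookiejar, the same idea might look like the sketch below. It reuses the headers dict from the question and assumes that appending to the default list (which already holds the user agent) is acceptable; assigning a new list instead would replace the default user agent entry.

```python
import http.cookiejar
import urllib.request

headers = {
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'X-Requested-With': 'XMLHttpRequest',
    'Referer': 'http://www.namestation.com/domain-search?autosearch=1',
    'Origin': 'http://www.namestation.com',
    'Host': 'www.namestation.com',
    'Content-Type': 'application/json; charset=UTF-8',
    'Connection': 'keep-alive',
}

cj = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))

# addheaders is a plain list of (name, value) tuples,
# so the dict's items can be appended to it directly.
opener.addheaders.extend(headers.items())

print(opener.addheaders)
```

No request is sent here; the headers are only attached to requests later made through opener.open(...).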

Something like this may work:
def opener():
    cj = cookielib.CookieJar()
    # Process handlers
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    opener.addheaders = [
        ('Accept', 'application/json, text/javascript, */*; q=0.01'),
        ('X-Requested-With', 'XMLHttpRequest'),
        ('Referer', 'http://www.namestation.com/domain-search?autosearch=1'),
        ('Host', 'www.namestation.com'),
        ('Content-Type', 'application/json; charset=UTF-8'),
        ('Connection', 'keep-alive'),
    ]
    return opener

You should turn the dict into a sequence of (name, value) tuples first. Note that the attribute is spelled addheaders, in lowercase:
opener.addheaders = list(headers.items())
