Statement: I am trying to scrape the URL below. I am trying to get the data returned after a POST request with the parameters set in the code.
Problem: I actually get the HTML of the original .aspx page, and the parameters I set in 'formFields' are not applied. Can anyone explain where I am going wrong?
import urllib
import urllib2
uri = 'http://cbseaff.nic.in/cbse_aff/schdir_Report/userview.aspx'
headers = {
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Content-Type': 'text/html; charset=utf-8'
}
formFields = (
(r'__VIEWSTATE',r'yE41BXTcCz0TfHmpBjw1lXLQe3PFLLTPmxSVU0hSeOt6wKGfqvehmJiLNmLj2VdBWrbXtmUIr0bh8lgw8X8f/yn9T1D4zmny5nUAc5HpgouURKeWzvZasUpmJUJ8pgh4jTUo62EVRQCYkXKayQVbCwaSP81BxDO9gxrERvzCUlw8i76A4jzlleSSunjr844drlsOw/PxjgYeRZCLm/h8WAc5HZrJ+w7vLMyLxlY/mDQaYdkVAF/s4lAJAxGfnX1rlshirdphBhI1tZIuoJa+ZTNzizgrXi70PVnAR3cw0QhCWr2rrTkrvoJ+rI5pme0pYPAX+CZfmSH3Cg1BKEbY/+G+p1AsLRqsobC8EBQXHPicqnhgOR7/kx+Z54XyCzxDwXCZBFKl3npcSq4xJ2Ebi3PFS6FtT0K+wZTZ8XGCQUVudzKyqhfDxQ4UTpDWn4vR7LIF765qhXpRmNG6HCzbvgrLqNrBzt+PZ0mbpeLsIWEia5L/AIbN+zcLiNsBxTBA9zOsljZDPkbL1rWo+WDUwBfDRiu6X4ru+RKypczFFzoUUciCaWD2ciOq+//NYa7NEwZ9d7YRY/LfEhszYUJO72Xpzaotxu7+7RdGVvsxzrh1Ro8OHqoskesX6QlEdqjakgrk3yqaroXCfnnBJu1ExKYXrU6JuVFbOGz66CZfkqiJJSVHs2Ozfos/yPWWiJVyETKtPQF003o9GGKKIFIaQ6YRHoco3jVOKB+NIE/VxMJHiSzPuuExZ27MY/pX9U9rx5/4az6i/LIqYoRzQilJT7lN5eTIdVn/W5Uep0tvKtOKhR7JjX7WhO7QpROlOh7H/wbCdr5u/jXB5FpJLB1Y8oZSdVyKJK8Xx+tODFvxYxMve5NT+/fBwlMSfkh7+h7+5Zo5hHKwLF01mrS52KHehix4Ja5bQ3hy6Q2r6pz+ZVHvUsev9OpsZlO1+PPjVCbdjzeUX23At6R4BRm6Zq/o0knl2Xf/kvM6WtKtDi+AbpIa7Eyj+brqWMdDKW4AiOLGS45H4ERP2z1CeVwiXlfa22JhkPbz8GJf9J9iey+JefnmsgD5ScdcvPU1mb1i7NLv9QOYXZXZwJXhGHZcw65iQm7vrZB5sJlBp7agBhhwX2KNaKZguNGVGhlxiS84zoKrkvdBv7e52n6H9p3okMvqHR+yEe+UCuDPanO+9tTvNvOqBzEAVvPYIK80YWsuDY3R66OBPjQEKpbPrDpz5hoMKk59r1FiIq6Jd71c6QeE57Au3ei72GZEVzaLXAva0RJP/tSROnO7ZKHkvxuP0oayVlzjLuIHnO0o4zUsoHpTJqPa20Bxv9De3JwOOi8HJgYj+dZjdRIDT9zHhgbLV9UO4z0HHyK54RIS67OAS8WqMYyfdC5I5GGwy8rosVKNjCfHymMEUzbs5iHCPhrM/X0UMJTxQ7yq113/6U43p6BP4PqP/OAgRYxejrVtT9goPKWxHTwu0kDROXCVvqHo5SiQ+/X3DdTxLF+13p0k5xlXBk0qkeLJkNlSYBeTOgPyvjHxnSMUdjhjHtiA0oFCSSCYCpVU9Pe3PLQyyUjv+KhI/jWS94D3KxYqXjyHUC/nMxEwot65kzFE/scAoOsdk/MJS/uZw4PbfbGEVKWTcJLtOV8s3wHKPzmB/AexZ//iEmDv'),
(r'__VIEWSTATEGENERATOR','AC872EDD'),
(r'__EVENTVALIDATION',r'35e8G73FpRBmb1p/I2BD+iKRsX5vU8sS0eAo0dx/dcp4JGg0Jp+2ZLyg8GHhCmrVcGjQVBABi8oOAgXt+2OghWX0FyVvk0fo8XPTIL/uMJjSBFrRvgsQNnh/Ue1VCOGhBgwhmrBt+VbjmmkA3hafVR2lAVjy/RYswgFcmyloWPyxQDu9UuktSiCNfyfJcF/xZrJDTeHBkNmB9P8QGNHLVnoaZjsmks/yB+Nll5ZLiq0WvDRhsbq1I/rrRWytnleHAMNutsEqdU9OvycQ/insfM871Jww+TyIvmxEVRYcAH6MnYl0nqUaderHN4R37aH8CH8B/NUxGDYLSdlsvJGJxXEZh9EVzzDprJuz7sJUxolBhjv6YNfJDAThrLCip2QEY20SztPZ/j8KnWgnX7Xs6sxjZofKnQxNlB44dZG0okIPitb9zjWB6kC2xDmN69vfDayDOCkcPJG3q/HMP6ewQvV/YheqUbuBwC77WPIKXrc='),
(r'ctl00$ContentPlaceHolder1$optlist', 'State+Wise'),
(r'ctl00$ContentPlaceHolder1$ddlitem' , '22'),
(r'ctl00$ContentPlaceHolder1$optlistChoice', 'List+of+All+Schools'),
(r'ctl00$ContentPlaceHolder1$search', 'Search'),
(r'__EVENTTARGET','ctl00$ContentPlaceHolder1$search')
)
encodedFields = urllib.urlencode(formFields)
req = urllib2.Request(uri, encodedFields, headers)
f = urllib2.urlopen(req)
try:
    fout = open('temp.htm', 'w')
except:
    print('Could not open output file\n')
fout.writelines(f.readlines())
fout.close()
The easiest way is to switch to the requests library and try something like the code below. Go for this approach only when you have no time to code and are in a hurry. You can see the same result on the website when you select ASSAM from the dropdown and press the search button.
import requests
from bs4 import BeautifulSoup
URL = "http://cbseaff.nic.in/cbse_aff/schdir_Report/userview.aspx"
payload="__EVENTTARGET=&__EVENTARGUMENT=&__LASTFOCUS=&__VIEWSTATE=sR0epHdr4qQ9SeEuqmUUBWzZHpFAyMj5Xr%2BhGagE8Be7zDbBqQgdx1%2BqrCMY8JfcRJ%2BBhmH8eEKMS%2F3VeboGFTjN0LZz1p2mx3FfY2OVkzs353bmDLWhtvBlOVnyFgcCXGyI9Li3Wp22e5txxKQwtdrKTWheuZLatRvI9wztvyeGueD9ZmEl8gIQHT77fIyt3N%2Bi3dn%2FhUdvvi%2FRqHR%2FaE1YfW3RmmSECDepwYfmBlSH3e4zTDQskW6XoOD7jykryp8L845QWclBE37ttOc5zXndfqkE%2FKCwOxiCxzRVPWCIFD9mBfELERlDOzq0cKwg6hHcE%2B4ZgyxJKBwNbW%2FDBtl%2BzpATY8arhjAg0CCf5CGmYMG1w2aaYz%2FSQ6bKQYWAaiH%2B0t9lrV291vPNj5Gfy1rWwrUhmw%2F6LcvX7uof41OUHtfGt%2Bn1IlxODsPu8i89hjiu5rqprmltTKSzoOZXJuQ1sfyRneShA7WeSIO0M4UmzUVdnjsNlimBmcc7vwAgvF8E6SJ8DaGCbPaLLH6tsI%2BISTBWPWw97o7KQyS%2FPicpZQA2Q5kzCHkAeroJWWrVILlCxEAEuOmzZffW7qIiiUwI3Urbixw40YDkjMPgVZMbb74EiEvmMGr1vNOMcjoICgM54sY5cViGP945D86fRlbS%2B6yYxNH951EZNrmRWnASiT2D%2FEbGMAkOwcYKrOWBrtMXEp8GPsswxryCZp2eANe7ajSioa0utT2cmGG5T7uiypveyfM0LBzgVTMqr5q7oMyUI9ML%2Bx8LHUFSh6SrjP1Nj7xs3%2BFPr2WaXoIs57hhUKfpE7u7Du9EGd1iXapf2WYKbvsl38Ov5u3FcPJ8L%2BLv%2FtksFE7Y7K2ZQ59p0Jqr%2BF%2FbMaKZeRRPp%2BMpW1TblAI8RVP1u%2BTDAHz19jrdpsralLWYoweP3wS5kdDcx68IukuaeGl0ESXcbNYWdFF0pEolfWLBSWcbOpR3YPpxVQD5Eqq3z%2FsYmk%2B%2B%2ByRXOH%2FizUrCqDqQjwbTw4UqRJY5HewB0q0Xp6jrPoT0%2FztPwteGCb6xonx1rzVkr84aOUw9IBUkCcvEtyi0S8VcUyVyii6KK9CThtyTHjbgut%2BdjuIOBlXPL9bD6u2frJiGGacVziPZInp6%2FtKcmnYn89Kcr0Ec1o3VZ6PD6xe8SUyMkupf%2FruNhbK6r6ZAXKZ3zMhkiDsMZYVGUihU43gNj3c8X%2BFD8ONb1827AfaDydgS1JxMtA9z4eOTqmoiPQ61vSr2j8IXokWJSDNg%2FhIn1uChG8BoEst2qZyVoQify0g%2FDMN%2B%2BPMMK7KPqnAfT4P17QI4Gn3mg2kFzfuQdnsQgs7aB0zAt0jrMgCoTDxuwbNvZ1w5BMF0bFbbfxi9QvbCXdpifAWbgAyutlo3wCbD1lIv3NmzLFQ61Mpih5zIOU2z9bpBeH4nClXcAN7QQVQIq8w%3D&__VIEWSTATEGENERATOR=AC872EDD&__VIEWSTATEENCRYPTED=&__EVENTVALIDATION=B7%2FH7uAZc%2FaW7NlReiB%2BrW3ac6DmQz82BOha1KMrmjcJR7%2F3qZByA7RNvmBBI36ivFrYkmYQ8SptltVOT45oBSWn4HG%2FRBbAWPHtcWY7qtl4okgXZ831Q1MTlxdIOkn2uPcoQOsostEzjJ8LVZHkcx%2FVjr6Fb1zImxNbSPDRDVJ1QLmYSCaf2RbJkzmP7ZiqR3w9MXn7GliipkRdhVHlJaDrh7eFy9zOjEcG2Ed2v0NxA5lnpnrXFcE42f9W%2BnLwNfUPR%2BiB95TtvY52ucsD5CgjWqlm9uMrDzHL1kl3WGzg6eU%2BIA9J744%2FRM2TD5JfhPykP6DB9E3E9%2BWzSSJowqwSzOwLNCjcbC%2BvBUb3GPXPwadz%2Fg3pEGTiBWtqdBCeOUiKnkDeDOrno8fS1RPu%2BVx%2F6M1LGWddW2CBUa8m3CizqMfLTGP7HVj4VpnSU0fttCuY26UTZzzMmplPmCjZziEJHd%2F5jc%2Byyf517tk%2BfHA%3D&ctl00%24ContentPlaceHolder1%24optlist=State+Wise&ctl00%24ContentPlaceHolder1%24ddlitem=2+&ctl00%24ContentPlaceHolder1%24optlistChoice=List+of+All+Schools&ctl00%24ContentPlaceHolder1%24search=Search"
with requests.Session() as s:
    s.headers = {"User-Agent": "Mozilla/5.0"}
    s.headers.update({'Content-Type': 'application/x-www-form-urlencoded'})
    res = s.post(URL, data=payload)
    soup = BeautifulSoup(res.text, "lxml")
    for item in soup.select("#hyp,#Hyperlink1"):
        print(item.text)
Partial results:
Abanti Kumar Mahanta Educational Institution
adarsh vidyalaya
adarsh vidyalaya
adarsh vidyalaya dahalapara
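A possible refinement of the approach above (a minimal sketch, not part of the original answer): instead of hard-coding __VIEWSTATE and __EVENTVALIDATION, which expire, you can GET the form first and copy whatever hidden fields the page currently contains into the POST payload. The field names and values below are copied from the question; treat the selector and the overall flow as assumptions to adapt.
import requests
from bs4 import BeautifulSoup

URL = "http://cbseaff.nic.in/cbse_aff/schdir_Report/userview.aspx"

with requests.Session() as s:
    s.headers["User-Agent"] = "Mozilla/5.0"
    # GET the form once so the ASP.NET hidden fields (__VIEWSTATE etc.) are current
    form = BeautifulSoup(s.get(URL).text, "lxml")
    payload = {i["name"]: i.get("value", "") for i in form.select("input[name]")}
    # fill in the visible controls (values copied from the question)
    payload["ctl00$ContentPlaceHolder1$optlist"] = "State Wise"
    payload["ctl00$ContentPlaceHolder1$ddlitem"] = "22"
    payload["ctl00$ContentPlaceHolder1$optlistChoice"] = "List of All Schools"
    payload["ctl00$ContentPlaceHolder1$search"] = "Search"
    payload["__EVENTTARGET"] = "ctl00$ContentPlaceHolder1$search"
    res = s.post(URL, data=payload)   # res.text should now hold the result page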
My goal is to scrape this URL and iterate through its pages. I keep getting a strange error. My code and the error follow:
import requests
import json
import pandas as pd
url = 'https://www.acehardware.com/api/commerce/storefront/locationUsageTypes/SP/locations?page='
headers = {
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:98.0) Gecko/20100101 Firefox/98.0',
}
#create a url list to scrape data from all pages
url_list = []
for i in range(0, 4375):
    url_list.append(url + str(i))
response = requests.get(url, headers=headers)
data = response.json()
d = json.dumps(data)
df = pd.json_normalize(d)
Error:
{'items': [{'applicationName': 'ReverseProxy', 'errorCode': 'UNAUTHORIZED', 'message': 'You are Unauthorized to perform the attempted operation. Application access token required', 'additionalErrorData': [{'name': 'OperationName', 'value': 'http://www.acehardware.com/api/commerce/storefront/locationUsageTypes/SP/locations?page=0&page=1'}]}], 'exceptionDetail': {'type': 'Mozu.Core.Exceptions.VaeUnAuthorizedException'}
This is strange to me because I should be able to access each page at this URL,
especially since I can follow the link in a browser and copy and paste the JSON data. Is there a way to scrape this site without an API key?
It works in your browser because you have the token cookie saved in your local storage.
Once you delete all cookies, navigating to the API link directly no longer works.
The token cookie is sb-sf-at-prod-s. Add this cookie to your headers and it will work.
I do not know whether the value of this cookie is tied to my IP address. If it is and it does not work for you, just swap in the value from your own browser.
This cookie may only be valid for a certain number of requests or for a limited time.
I recommend putting some sleep between each request (see the sketch after the code below).
This website is protected by Akamai's antibot security.
import requests
import json
url = 'https://www.acehardware.com/api/commerce/storefront/locationUsageTypes/SP/locations?page='
headers = {
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:98.0) Gecko/20100101 Firefox/98.0',
'cookie': 'sb-sf-at-prod=at=%2FVzynTSsuVJGJMAd8%2BjAO67EUtyn1fIEaqKmCi923rynHnztv6rQZH%2F5LMa7pmMBRiW00x2L%2B%2FLfmJhJKLpNMoK9OFJi069WHbzphl%2BZFM%2FpBV%2BdqmhCL%2FtylU11GQYQ8y7qavW4MWS4xJzWdmKV%2F01iJ0RkwynJLgcXmCzcde2oqgxa%2FAYWa0hN0xuYBMFlCoHJab1z3CU%2F01FJlsBDzXmJwb63zAJGVj4PIH5LvlcbnbOhbouQBKxCrMyrmpvxDf70U3nTl9qxF9qgOyTBZnvMBk1juoK8wL1K3rYp51nBC0O%2Bthd94wzQ9Vkolk%2B4y8qapFaaxRtfZiBqhAAtMg%3D%3D'
}
#create a url list to scrape data from all pages
url_list = []
for i in range(0, 4375):
    url_list.append(url + str(i))
response = requests.get(url, headers=headers)
data = response.json()
d = json.dumps(data)
print(d)
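Building on the recommendation to pause between requests, here is a minimal pagination sketch that reuses the headers (with the cookie) and the url_list defined above. The 'items' key is an assumption based on the shape of the error payload in the question, and the one-second delay is arbitrary; adjust both as needed.
import time
import requests

results = []
with requests.Session() as s:
    s.headers.update(headers)            # headers (with the cookie) from the code above
    for page_url in url_list:            # url_list built in the code above
        resp = s.get(page_url)
        if resp.status_code != 200:      # token expired or the request was blocked
            break
        results.extend(resp.json().get('items', []))   # 'items' key is an assumption
        time.sleep(1)                    # be polite; tune the delay to what the site tolerates
print(len(results))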
I hope I have been able to help you.
So I'm trying to make a data science project using information from this site. Sadly, when I try to scrape it, it blocks me because it thinks I am a bot. I saw a couple of posts here: Python webscraping blocked
but it seems that ImmoScout has already found a way to block that workaround. Does somebody know how I can get around this? Thanks!
My Code:
import requests
from bs4 import BeautifulSoup
import random
headers = {"User-Agent": "Mozilla/5.0 (Linux; U; Android 4.2.2; he-il; NEO-X5-116A Build/JDQ39) AppleWebKit/534.30 ("
"KHTML, like Gecko) Version/4.0 Safari/534.30 , 'Accept-Language': 'en-US,en;q=0.5'"}
url = "https://www.immobilienscout24.de/Suche/de/berlin/berlin/wohnung-kaufen?enteredFrom=one_step_search"
response = requests.get(url, cookies={'required_cookie': 'reese84=xxx'} ,headers=headers)
webpage = response.content
print(response.status_code)
soup = BeautifulSoup(webpage, "html.parser")
print(soup.prettify)
thanks :)
The data is generated dynamically from an API call that returns a JSON response to a POST request, so you can extract it using only the requests module. You can follow the next example.
import requests
headers= {
'content-type': 'application/json',
'x-requested-with': 'XMLHttpRequest'
}
api_url = "https://www.immobilienscout24.de/Suche/de/berlin/berlin/wohnung-kaufen?pagenumber=1"
jsonData = requests.post(api_url).json()
for item in jsonData['searchResponseModel']['resultlist.resultlist']['resultlistEntries'][0]['resultlistEntry']:
    value = item['attributes'][0]['attribute'][0]['value'].replace('€', '').replace('.', ',')
    print(value)
Output:
4,350,000
285,000
620,000
590,000
535,000
972,500
579,000
1,399,900
325,000
749,000
290,000
189,900
361,825
199,900
299,000
195,000
1,225,000
199,000
825,000
315,000
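To extend this to more than one page, a small sketch (an extension of the answer above, not part of it): the same endpoint appears to accept a pagenumber query parameter, so you can loop over it and stop once the expected keys disappear. The page range of 1-3 is an arbitrary assumption.
import requests

base_url = "https://www.immobilienscout24.de/Suche/de/berlin/berlin/wohnung-kaufen?pagenumber={}"
headers = {
    'content-type': 'application/json',
    'x-requested-with': 'XMLHttpRequest'
}
for page in range(1, 4):                 # assumed small range for the sketch
    data = requests.post(base_url.format(page), headers=headers).json()
    try:
        entries = data['searchResponseModel']['resultlist.resultlist']['resultlistEntries'][0]['resultlistEntry']
    except (KeyError, IndexError):       # structure missing -> no more results
        break
    for item in entries:
        print(item['attributes'][0]['attribute'][0]['value'])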
I have this website:
https://xueqiu.com/hq#exchange=CN&firstName=1&secondName=1_0
I am trying to get this webpage with a Python GET request. I have also tried changing the "user-agent".
But I am not able to get the webpage content; I am new to this kind of parsing.
import requests
url = 'https://xueqiu.com/hq#exchange=CN&firstName=1&secondName=1_0'
with requests.Session() as session:
    response = session.get(url)
Could someone please help me with how to extract it?
Your data is loaded in JSON format from the URL below, so I parse the JSON response to extract the data.
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',
}

def scrape(url):
    with requests.Session() as req:
        req.headers.update(headers)
        r = req.get(url)
        mydata = r.json()
        for data in mydata['data']['list']:
            print(data, sep='*')

url = 'https://xueqiu.com/service/v5/stock/screener/quote/list?page=1&size=30&order=desc&orderby=percent&order_by=percent&market=CN&type=sh_sz&_=1606221698728'
scrape(url)
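If you want more than the first page, a small extension of the same idea might look like the sketch below. The field names 'symbol', 'name' and 'percent' are assumptions about the JSON rows and may differ; the page/size parameters come straight from the URL above.
def scrape_pages(last_page=3):
    base = ('https://xueqiu.com/service/v5/stock/screener/quote/list'
            '?page={}&size=30&order=desc&orderby=percent&order_by=percent&market=CN&type=sh_sz')
    with requests.Session() as req:
        req.headers.update(headers)
        for page in range(1, last_page + 1):
            rows = req.get(base.format(page)).json()['data']['list']
            for row in rows:
                # field names are assumptions; print the whole row if they differ
                print(row.get('symbol'), row.get('name'), row.get('percent'))

scrape_pages()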
Hope it helps you.
I'm trying to scrape all the HTML elements of a page using requests & beautifulsoup. I'm using ASIN (Amazon Standard Identification Number) to get the product details of a page. My code is as follows:
from urllib.request import urlopen
import requests
from bs4 import BeautifulSoup
url = "http://www.amazon.com/dp/" + 'B004CNH98C'
response = urlopen(url)
soup = BeautifulSoup(response, "html.parser")
print(soup)
But the output doesn't show the entire HTML of the page, so I can't do my further work with product details.
Any help on this?
EDIT 1:
From the given answer, it shows the markup of the bot-detection page. I researched a bit and found two ways to get past it:
I might need to add a header in the request, but I couldn't understand what the value of the header should be.
Use Selenium.
Now my question is, do both of the ways provide equal support?
It is better to use fake_useragent here to make things easy. A random user agent sends each request with a header drawn from real-world browser usage statistics. If you don't need dynamic content, you're almost always better off just requesting the page content over HTTP and parsing it programmatically.
import requests
from fake_useragent import UserAgent

ua = UserAgent()
hdr = {'User-Agent': ua.random,
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
       'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
       'Accept-Encoding': 'none',
       'Accept-Language': 'en-US,en;q=0.8',
       'Connection': 'keep-alive'}
url = "http://www.amazon.com/dp/" + 'B004CNH98C'
response = requests.get(url, headers=hdr)
print(response.content)
Selenium is used for browser automation and higher-level web scraping of dynamic content.
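If you do need the Selenium route for dynamic content, a minimal sketch might look like this (assuming Selenium 4+ and a local Chrome/chromedriver install, which are my assumptions, not something this answer specifies):
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

options = Options()
options.add_argument('--headless')              # run without opening a browser window
driver = webdriver.Chrome(options=options)      # assumes chromedriver is available
driver.get("http://www.amazon.com/dp/" + 'B004CNH98C')
soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()
print(soup.title)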
As some of the comments already suggested, if you need to somehow interact with JavaScript on a page, it is better to use Selenium. However, regarding your first approach of using a header:
import requests
from bs4 import BeautifulSoup
url = "http://www.amazon.com/dp/" + 'B004CNH98C'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text,"html.parser")
These headers are a bit old, but should still work. By using them you are pretending that your request comes from a normal web browser. If you use requests without such a header, your code is basically telling the server that the request comes from Python, and most servers reject that right away.
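As a quick sanity check you can look for an element that only exists on a real product page; the productTitle id below is an assumption based on Amazon's usual markup and may change:
title = soup.find(id="productTitle")            # usually present on real product pages
if title is None:
    # most likely the robot-check page was served; retry with other headers or back off
    print("Blocked or unexpected page:", soup.title)
else:
    print(title.get_text(strip=True))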
Another alternative for you could be fake-useragent; maybe you can give that a try as well.
try this:
import requests
from bs4 import BeautifulSoup

url = "http://www.amazon.com/dp/" + 'B004CNH98C'
r = requests.get(url)
r = r.text
## option 1
# print(r)
soup = BeautifulSoup(r.encode("utf-8"), "html.parser")
## option 2
print(soup)
I want to start off by apologizing. I know this has more than likely been done more than enough times, and I'm just beating a dead horse, but I'd really like to know how to get this to work. I am trying to use the Requests module for Python in order to log in to a website and verify that it works. I'm also using BeautifulSoup in the code in order to find some strings that I have to use to process the request.
I'm getting hung up on how to properly form the header. What exactly is necessary in the header information?
import requests
from bs4 import BeautifulSoup
session = requests.session()
requester = session.get('http://iweb.craven.k12.nc.us/index.php')
soup = BeautifulSoup(requester.text)
ps = soup.find_all('input')
def getCookieInfo():
    result = []
    for item in ps:
        if (item.attrs['name'] == 'return' and item.attrs['type'] == 'hidden'):
            strcom = item.attrs['value']
            sibling = item.next_sibling.next_sibling.attrs['name']
            result.append(strcom)
            result.append(sibling)
    return result
cookiedInfo=getCookieInfo()
payload = [('username', 'myUsername'),
           ('password', 'myPassword'),
           ('Submit', 'Log in'),
           ('option', 'com_users'),
           ('task', 'user.login'),
           ('return', cookiedInfo[0]),
           (cookiedInfo[1], '1')
           ]
headers = {
    'Connection': 'keep-alive',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Origin': 'http://iweb.craven.k12.nc.us',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64)'
}
r = session.post('http://iweb.craven.k12.nc.us/index.php', data=payload, headers=headers)
r = session.get('http://iweb.craven.k12.nc.us')
soup = BeautifulSoup(r.text)
Also, if it would be better or more Pythonic to utilize the mechanize module, I would be open to suggestions.
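As a hedged sketch (not a verified answer): with requests.Session() the cookie handling and the Content-Type header (application/x-www-form-urlencoded when you pass data=...) are handled automatically, so a User-Agent is usually the only header you need to set yourself. The 'user.logout' check below is an assumption based on typical Joomla pages.
minimal_headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64)'}
r = session.post('http://iweb.craven.k12.nc.us/index.php',
                 data=payload, headers=minimal_headers)   # session and payload from the code above
if 'user.logout' in r.text:
    print('Login appears to have succeeded')
else:
    print('Login probably failed; re-check the payload fields')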