Getting a 403 Forbidden error while logging into a website with Python requests - python

I am trying to log into a website (using Python requests) by passing the username and password in a GET request, but I am getting a 403 Forbidden error. I am able to log into the website through the browser with the same credentials.
I have tried several different headers, but nothing seems to be working.
I have tried all of the commented-out headers below, and also a proxy. This is basically VB code that we need to convert to Python; the VB version uses a proxy, which is why I tried one here.
I need to log into the website and download a file, abc.csv.
Code:
# Import library
import requests

# Define url, username, and password
url = 'https://example.com'
#r = requests.get(url, auth=(user, password))
#s = requests.Session()
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
#headers = {"Authorization": "Basic", "Pragma": "no-cache", "If-Modified-Since": "Thu, 01 Jun 1970 00:00:00 GMT", 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
#headers = {'Server': 'nginx', 'Date': 'Thu, 23 Sep 2021 12:44:25 GMT', 'Content-Type': 'text/html; charset=iso-8859-1', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Vary': 'Accept-Encoding', 'Content-Encoding': 'gzip'}
proxies = {"https": "http://10.10.10.10:8080"}

# Send an HTTP request to the server and keep the response in r
r = requests.get(url, proxies=proxies, headers=headers)
requests.post(url, data={"username": "user", "password": "pwd"})
#s.get(url)

# Write the contents of the response (r.content) to a new file in binary mode
with open("abc.csv", 'wb') as f:
    f.write(r.content)
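For reference, this is the kind of session-based flow I am trying to end up with (just a sketch: the /login path and the form field names below are placeholders, not the site's real ones):
import requests

# Sketch of the intended flow, assuming a form-based login endpoint.
# The '/login' path and the form field names are hypothetical placeholders.
with requests.Session() as s:
    s.headers.update({'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'})
    # log in so the session stores the authentication cookies
    login = s.post('https://example.com/login', data={"username": "user", "password": "pwd"})
    login.raise_for_status()
    # then fetch the file with the same (now authenticated) session
    r = s.get('https://example.com/abc.csv')
    with open("abc.csv", 'wb') as f:
        f.write(r.content)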
Thanks,

Related

Python Requests login: I have a 403 error but the request looks correct

I am trying to log into www.zalando.it using the requests library, but every time I try to post my data I get a 403 error. I inspected the login call in the network tab on Zalando, and my request looks the same.
These are just dummy data; you can test it by creating a test account.
Here is the code for the login function:
import requests
import pickle
import json
session = requests.session()
headers1 = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36'}
r = session.get('https://www.zalando.it/', headers = headers1)
cookies = r.cookies
url = 'https://www.zalando.it/api/reef/login'
payload = {'username': "email#email.it", 'password': "password", 'wnaMode': "shop"}
headers = {
'x-xsrf-token': cookies['frsx'],
#'_abck': str(cookies['_abck']),
'usercentrics_enabled' : 'true',
'Connection': 'keep-alive',
'Content-Type':'application/json; charset=utf-8',
'User-Agent':"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36",
'origin':'https://www.zalando.it',
'Access-Control-Allow-Origin': '*',
'Access-Control-Allow-Credentials': 'true',
'Access-Control-Allow-Methods': 'GET,PUT,POST,DELETE,OPTIONS',
'Access-Control-Allow-Headers': 'Origin,X-Requested-With,Content-Type,Accept,content-type,application/json',
'sec-fetch-mode': 'no-cors',
'sec-fetch-site': 'same-origin',
'accept': '*/*',
'accept-encoding': 'gzip, deflate, br',
'accept-language': 'it-IT,it;q=0.9,en-US;q=0.8,en;q=0.7',
'dpr': '1.3125',
'referer': 'https://www.zalando.it/uomo-home/',
'viewport-width': '1464'
}
x = session.post(url, data = json.dumps(payload), headers = headers, cookies = cookies)
print(x) #error 403
print(x.text) #page that show 403
For the initial request, it needs to look like an actual browser request; after that, the headers need to be modified to look like an XHR (Ajax) request. Also, there are some response headers that need to be added to future requests to the server, along with cookies such as the client ID and an XSRF token.
Here's some example code that is currently working:
import requests
# first load the home page
home_page_link = "https://www.zalando.it/"
login_api_schema = "https://www.zalando.it/api/reef/login/schema"
login_api_post = "https://www.zalando.it/api/reef/login"
headers = {
'Host': 'www.zalando.it',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/17.17134',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
'Accept-Language': 'en-US,en;q=0.5',
'Accept-Encoding': 'gzip, deflate',
'DNT': '1',
'Connection' : 'close',
'Upgrade-Insecure-Requests': '1'
}
if __name__ == '__main__':
    with requests.Session() as s:
        s.headers.update(headers)
        r = s.get(home_page_link)
        # fetch these cookies: frsx, Zalando-Client-Id
        cookie_dict = s.cookies.get_dict()
        # update the headers
        # remove this header for the xhr requests
        del s.headers['Upgrade-Insecure-Requests']
        # these 2 are taken from some response cookies
        s.headers['x-xsrf-token'] = cookie_dict['frsx']
        s.headers['x-zalando-client-id'] = cookie_dict['Zalando-Client-Id']
        # I didn't pay attention to where these came from,
        # just saw them and manually added them
        s.headers['x-zalando-render-page-uri'] = '/'
        s.headers['x-zalando-request-uri'] = '/'
        # this is sent as a response header and is needed to
        # track future requests/responses
        s.headers['x-flow-id'] = r.headers['X-Flow-Id']
        # only accept json data from xhr requests
        s.headers['Accept'] = 'application/json'
        # when clicking the login button this request is sent
        # I didn't test without this request
        r = s.get(login_api_schema)
        # add an origin header
        s.headers['Origin'] = 'https://www.zalando.it'
        # finally log in; this should return a 201 response with a cookie
        login_data = {"username": "email#email.it", "password": "password", "wnaMode": "modal"}
        r = s.post(login_api_post, json=login_data)
        print(r.status_code)
        print(r.headers)
Well, it seems to me that this website is protected by Akamai (looks like Akamai Bot Manager).
See that Server: AkamaiGHost in the response headers of /api/reef/login when you get a 403 response?
Also, have a look at the requests sent during a legitimate browser session: there are many requests sent to /static/{some unique ID}, with some sensor_data, including your user-agent, and some other "gibberish".
What I described above seems to fit this description:
The BMP SDK collects behavioral data while the user is interacting with the application. This behavioral data, also known as sensor data, includes the device characteristics, device orientation, accelerometer data, touch events, etc. Reference: BMP SDK
Also, this answer confirms that some of the cookies set by this website in fact belong to Akamai Bot Manager.
Well, I'm not sure if there's an easy way of bypassing it. After all, that's a product developed exactly for this purpose: blocking web-scraping bots like yours.

Postman's response is different from Python's response

I'm doing a GET request in Postman. The request is just this URL: https://www.google.com/search?q=pip+google-images&tbm=isch, and I get a response that is 853639 characters long (the response I want).
I want to do the same thing with Python. I used Postman's GENERATE CODE SNIPPETS feature, copied the code for Python Requests, pasted it into my own Python script, and ran it.
But the response I got was only 22490 characters long (not the response I want).
Why is this happening?
Python's code:
import requests
url = "https://www.google.com/search"
querystring = {"q":"pip google-images","tbm":"isch"}
headers = {
'cache-control': "no-cache",
'postman-token': "6b5e997f-6651-2178-1371-5d6a555984a7"
}
response = requests.request("GET", url, headers=headers, params=querystring)
print(response.text)
You need to set the User-Agent, e.g.
headers = {
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'
}
Include this in your headers.
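Put together with the rest of the request from the question, a minimal sketch looks like this (same URL and query string; only the headers change):
import requests

url = "https://www.google.com/search"
querystring = {"q": "pip google-images", "tbm": "isch"}
# browser-like User-Agent so the server returns the full page
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'
}
response = requests.get(url, headers=headers, params=querystring)
print(len(response.text))  # noticeably longer than the response without a User-Agent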

Python request to crawl URL returns 404 Error while working inside the browser

I have a crawling Python script that fails on one URL: pulsepoint.com/sellers.json
The bot uses a standard request to get the content but gets a 404 error back. In the browser it works (there is a 301 redirect, but requests can follow that). My first hunch was that this could be a request-header issue, so I copied my browser's configuration. The code looks like this:
crawled_url="pulsepoint.com"
seller_json_url = 'http://{thehost}/sellers.json'.format(thehost=crawled_url)
print(seller_json_url)
myheaders = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
'Accept-Language': 'fr,fr-FR;q=0.8,en-US;q=0.5,en;q=0.3',
'Accept-Encoding': 'gzip, deflate, br',
'Connection': 'keep-alive',
'Pragma': 'no-cache',
'Cache-Control': 'no-cache'
}
r = requests.get(seller_json_url, headers=myheaders)
logging.info(" %d" % r.status_code)
But I am still getting a 404 Error.
My next guesses:
Login? Not used here.
Cookies? Not that I can see.
So how is their server blocking my bot? This is a URL that is supposed to be crawled, by the way; nothing illegal.
Thanks in advance!
You can also work around the SSL certificate error like below:
from urllib.request import urlopen
import ssl
import json
#this is a workaround on the SSL error
ssl._create_default_https_context = ssl._create_unverified_context
crawled_url="pulsepoint.com"
seller_json_url = 'http://{thehost}/sellers.json'.format(thehost=crawled_url)
print(seller_json_url)
response = urlopen(seller_json_url).read()
# print in dictionary format
print(json.loads(response))
Sample response:
{'contact_email': 'PublisherSupport#pulsepoint.com', 'contact_address': '360 Madison Ave, 14th Floor, NY, NY, 10017', 'version': '1.0', 'identifiers': [{'name': 'TAG-ID', 'value': '89ff185a4c4e857c'}], 'sellers': [{'seller_id': '508738', ...
...'seller_type': 'PUBLISHER'}, {'seller_id': '562225', 'name': 'EL DIARIO', 'domain': 'impremedia.com', 'seller_type': 'PUBLISHER'}]}
You can also go directly to the final link and extract the data; there is no need to follow the 301 redirect to the correct link:
import requests
headers = {"Upgrade-Insecure-Requests": "1"}
response = requests.get(
url="https://projects.contextweb.com/sellersjson/sellers.json",
headers=headers,
verify=False,
)
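From there you can parse the JSON straight off the response (a small follow-up sketch reusing the response object above; the 'version' and 'sellers' keys are the ones visible in the sample response shown earlier):
# parse the sellers.json payload returned above
data = response.json()
print(data['version'])
print(len(data['sellers']))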
OK, just for other people, a hardened version of âńōŋŷXmoůŜ's answer, because:
Some websites want headers in order to answer;
Some websites use unusual encodings;
Some websites send gzipped answers even when not requested.
import urllib.request
import ssl
import json
from io import BytesIO
import gzip

ssl._create_default_https_context = ssl._create_unverified_context

crawled_url = "pulsepoint.com"
seller_json_url = 'http://{thehost}/sellers.json'.format(thehost=crawled_url)
req = urllib.request.Request(seller_json_url)
# ADDING THE HEADERS
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0')
req.add_header('Accept', 'application/json,text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8')
response = urllib.request.urlopen(req)
data = response.read()
# IN CASE THE ANSWER IS GZIPPED
if response.info().get('Content-Encoding') == 'gzip':
    buf = BytesIO(data)
    f = gzip.GzipFile(fileobj=buf)
    data = f.read()
# ADAPTS THE ENCODING TO THE ANSWER
print(json.loads(data.decode(response.info().get_param('charset') or 'utf-8')))
Thanks again!

Python requests GET fails but CURL command working

I am trying to make a GET request to a webpage, but I keep getting a 404 error using Python 2.7 with the requests package. However, using cURL I get a successful response, and it works in the browser.
Python
r = requests.get('https://www.ynet.co.il/articles/07340L-446694800.html')
r.status_code
404
r.headers
{'backend-cache-control': '', 'Content-Length': '20661', 'WAI': '02',
'X-me': '08', 'vg_id': '1', 'Content-Encoding': 'gzip', 'Vary': 'Accept-Encoding',
'Last-Modified': 'Sun, 20 May 2018 01:20:04 GMT', 'Connection': 'keep-alive',
'V-TTL': '47413', 'Date': 'Sun, 20 May 2018 14:55:21 GMT', 'VX-Cache': 'HIT',
'Content-Type': 'text/html; charset=UTF-8', 'Accept-Ranges': 'bytes'}
r.reason
'Not Found'
CURL
curl https://www.ynet.co.il/articles/07340L-446694800.html
The code is correct; it works for some other sites (see https://repl.it/repls/MemorableUpbeatExams).
This site loads for me in the browser, so I can confirm your issue.
It might be that they block Python requests because they don't want their site scraped and analysed by bots, but forgot to block curl.
What you are doing is probably violating www.ynet.co.il's terms of use, and you shouldn't do that.
A 404 can be displayed when:
The URL is incorrect and the response is actually accurate.
There are trailing spaces in the URL.
The website may not like HTTP(S) requests coming from Python code. Change your headers, for example by adding "www." to your Referer URL:
resp = requests.get(r'http://www.xx.xx.xx.xx/server/rest/line/125')
or
headers = {
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'
}
result = requests.get('https://www.transfermarkt.co.uk', headers=headers)

Python Requests Get not Working

I have a simple GET request I'd like to make using Python's requests library.
import requests
HEADERS = {'user-agent': ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5)'
'AppleWebKit/537.36 (KHTML, like Gecko)'
'Chrome/45.0.2454.101 Safari/537.36'),
'referer': 'http://stats.nba.com/scores/'}
url = 'http://stats.nba.com/stats/playbyplayv2?EndPeriod=10&EndRange=55800&GameID=0021500281&RangeType=2&Season=2016-17&SeasonType=Regular+Season&StartPeriod=1&StartRange=0'
response = requests.get(url, timeout=5, headers=HEADERS)
However, when I make the requests.get call, I get the error requests.exceptions.ReadTimeout: HTTPConnectionPool(host='stats.nba.com', port=80): Read timed out. (read timeout=5). But I am able to copy/paste that URL into my browser and view the resulting JSON. Why is requests not able to get the result?
Your HEADERS format is wrong. I tried with this code and it worked without any issues:
import requests
HEADERS = {
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36',
}
url = 'http://stats.nba.com/stats/playbyplayv2?EndPeriod=10&EndRange=55800&GameID=0021500281&RangeType=2&Season=2016-17&SeasonType=Regular+Season&StartPeriod=1&StartRange=0'
response = requests.get(url, timeout=5, headers=HEADERS)
print(response.text)
