Python requests with the same headers as my Chrome browser gets 403 errors - python

I have written some scraping code
that uses requests.get(url, headers=headers)
with headers exactly the same as my Chrome browser's, except for the cookie.
Initially it works fine, but later it starts getting 403 errors.
My Chrome browser retrieves the same data without any error,
but my Python requests code doesn't work. What is the problem? I don't know.
import requests

url = 'http://www.matchesfashion.com/en-kr/products/1171735'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Whale/0.10.36.11 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Language': 'ko-KR,ko;q=0.8,en-US;q=0.6,en;q=0.4',
    'Host': 'www.matchesfashion.com',
    'Upgrade-Insecure-Requests': '1',
    'Cache-Control': 'max-age=0',
    'Accept-Encoding': 'gzip, deflate',
}
r = requests.get(url, headers=headers)
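No accepted fix was recorded here, but one hedged guess: since the only header missing relative to Chrome is the cookie, the intermittent 403 may be cookie-based. A minimal sketch, assuming the site sets its cookies on a first page load (the warm-up URL is my assumption, not from the original post), is to reuse a Session so those cookies are sent on the product request:
import requests

# Sketch (assumption): a Session stores cookies set by the first response
# and resends them automatically, which a bare requests.get never does.
session = requests.Session()
session.headers.update(headers)  # the headers dict defined above

# Warm-up request to the site root lets the server set its cookies.
session.get('http://www.matchesfashion.com/en-kr/')

r = session.get(url)
print(r.status_code)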

Related

Failed to log in to a website using the requests module

I'm trying to log in to a website through a Python script I've created using the requests module. I've issued a POST request with the appropriate parameters and headers to the server, but for some reason I get a different response from that site compared to what I see in dev tools. The status is always 200, though. There is also a GET request in the script that should fetch the credentials once the login is successful. Currently, it throws a JSONDecodeError on the last line.
import requests

link = 'https://propwire.com/login'
check_url = 'https://propwire.com/search'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
    'x-requested-with': 'XMLHttpRequest',
    'referer': 'https://propwire.com/login',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9,bn;q=0.8',
    'origin': 'https://propwire.com',
}
payload = {"email": "some-email", "password": "password", "remember": "true"}
with requests.Session() as s:
    r = s.get(link)
    headers['x-xsrf-token'] = r.cookies['XSRF-TOKEN'].rstrip('%3D')
    s.headers.update(headers)
    s.post(link, json=payload)
    res = s.get(check_url)
    print(res.json()['props']['auth'])
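No fix was recorded for this question, but a hedged sketch of one likely culprit: XSRF-TOKEN cookies of this shape are usually URL-encoded, so the trailing %3D is an encoded '=' that belongs in the token. Decoding the cookie (rather than stripping characters from its end) and checking the response type before calling .json() may help; unquote and the content-type guard are my additions, with link, check_url, headers and payload as defined above:
import requests
from urllib.parse import unquote

with requests.Session() as s:
    r = s.get(link)
    # Decode the URL-encoded cookie ('%3D' -> '=') instead of stripping it.
    headers['x-xsrf-token'] = unquote(r.cookies['XSRF-TOKEN'])
    s.headers.update(headers)
    s.post(link, json=payload)
    res = s.get(check_url)
    # An HTML error page would raise JSONDecodeError, so check the type first.
    if 'json' in res.headers.get('Content-Type', ''):
        print(res.json()['props']['auth'])
    else:
        print(res.status_code, res.headers.get('Content-Type'))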

Different output between an online Python compiler and my offline one

I'm trying to run this code:
import requests
import json

print(requests.__version__)
print(json.__version__)
headers = {
    'Accept': '*/*',
    'Accept-Encoding': 'gzip, deflate, br',
    'Host': 'www.soraredata.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36',
    'Accept-Language': 'en-GB,en;q=0.9',
    'Referer': 'https://www.soraredata.com/player/1751286890093453768002188628460549415375229654518317941780411003457747672993',
    'Connection': 'keep-alive',
}
req = requests.Request(
    'GET',
    'https://www.soraredata.com/api/players/info/1751286890093453768002188628460549415375229654518317941780411003457747672993',
    headers=headers)
resp = requests.Session().send(req.prepare())
print(resp.status_code)
On programiz.com it works fine and gives 200 as the status code.
But it does not work on my PC, even though the code is the same and even the package versions match. I also tried different Python versions, but it did not work out.
I can't understand why it does not return 200. I hope someone can enlighten me.
I appreciate any help you can provide.
This happens because when you run your program on programiz.com, it can load the requested link, i.e. the player info.
When you run the same program on your own machine, it cannot load the requested link (check your system; it may be offline).
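To separate the two cases (your machine being offline versus the site rejecting your network or IP), a small hedged check like the following can help; it reuses the headers dict from the question:
import requests

try:
    resp = requests.get(
        'https://www.soraredata.com/api/players/info/'
        '1751286890093453768002188628460549415375229654518317941780411003457747672993',
        headers=headers,  # the headers dict from the question
        timeout=10,
    )
    # Reachable but rejected (e.g. 403) points at blocking, not connectivity.
    print(resp.status_code)
except requests.exceptions.ConnectionError as exc:
    print('network-level failure:', exc)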

Why is Python requests.get() retrieving outdated data from an API?

Context:
I'm making GET requests to an API, and the API sometimes returns data that is up to 5 minutes old. However, when making the same request in Chrome, the data is always up to date. The server is nginx.
This is the API request made when the page is loaded in Chrome:
https://buff.163.com/api/market/goods/sell_order?game=csgo&goods_id=781660&_=1604808126524
Relevant Code:
import random
import requests
from datetime import datetime

def epochTimestamp():
    return int(round(datetime.now().timestamp() * 1000))

def getProxies():
    # proxies is a list of 'host:port' strings defined elsewhere
    proxy = random.choice(proxies)
    return {'http': fr'socks5h://{proxy}', 'https': fr'socks5h://{proxy}'}

get_purchase_headers = {
    'Host': 'buff.163.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.183 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest',
    'Cache-Control': 'max-age=0'
}
url = f"https://buff.163.com/api/market/goods/sell_order?game=csgo&goods_id=781660&_={epochTimestamp()}"
source = requests.get(url, timeout=10, proxies=getProxies(), headers=get_purchase_headers)
What I have tried:
Including User-Agent headers
'Cache-Control': 'max-age=0'
Including timestamp in the URL
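Since those attempts evidently didn't resolve it, one more hedged thing to try: request-side no-cache directives, which are stronger hints than max-age=0. Whether the nginx in front honours them depends on its configuration; 'Pragma: no-cache' covers older HTTP/1.0 caches. A sketch, reusing the names from the code above:
import requests

# Sketch (assumption): ask intermediate caches not to serve a stored copy.
get_purchase_headers.update({
    'Cache-Control': 'no-cache',
    'Pragma': 'no-cache',
})
url = f"https://buff.163.com/api/market/goods/sell_order?game=csgo&goods_id=781660&_={epochTimestamp()}"
source = requests.get(url, timeout=10, proxies=getProxies(), headers=get_purchase_headers)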

Unable to emulate browser POST requests without getting a Response [500]. Why?

import requests

url = 'https://cmoffice.kenes.com/cmsearchableprogrammev15/conferencemanager/CM_W3_SearchableProgram/api/persionid/anonymous/type/normal/getfilteredsessions/conference/igcs19'
headers = {
    'accept': '*/*',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
    'content-type': 'application/json; charset=UTF-8',
    'cookie': '_ga=GA1.2.471841928.1549896884; _gid=GA1.2.1479150813.1563120868; __RequestVerificationToken_L2NtU2VhcmNoYWJsZVByb2dyYW1tZVYxNQ2=t57HyXHVNBIm0HZ33v1WyG8hRa4j4RlDEOvFtEfPakPgH5AutBjAN5pSRHnBx_BpBhbMnH6R-tIhSdop_VMtLF-aY7XcXTRFt7vg5X46zgE1; _gat=1',
    'origin': 'https://cmoffice.kenes.com',
    'referer': 'https://cmoffice.kenes.com/cmsearchableprogrammeV15/conferencemanager/programme/personid/anonymous/igcs19/normal/b833d15f547f3cf698a5e922754684fa334885ed',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
    'x-requested-with': 'XMLHttpRequest',
}
response = requests.post(url, headers=headers)
print(response)
This gives Response [500].
However, the browser is able to get a JSON response with status code 200.
Can anyone shed some light on why this happens and how to solve it?
Something appears not to be right in the backend. It returns a 500 when you try to POST to it, which could be caused by almost anything, for example missing configuration or programming errors.
If I hit the given URL in a browser I actually get a 405 'Method Not Allowed' error.
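One hedged thing worth trying follows from the headers themselves: they declare content-type: application/json, but the request sends no body at all, and some backends answer that mismatch with a 500. Sending even an empty JSON object makes the request self-consistent; the real filter payload the endpoint expects is unknown to me, so {} is only a placeholder:
import requests

# Sketch (assumption): send an actual JSON body to match the declared
# content-type. The endpoint's real payload schema is not known here.
response = requests.post(url, headers=headers, json={})
print(response.status_code)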

POST requests using cookie from session

I am trying to scrape a website by using a POST request to fill in the search form:
http://www.planning2.cityoflondon.gov.uk/online-applications/search.do?action=advanced
In Python, this goes as follows:
import requests
import webbrowser

headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'en-US,en;q=0.9',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'Cookie': 'JSESSIONID=OwXG0Hkxj+X9ELygHZa-aLQ5.undefined; _ga=GA1.3.1911942552.',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Host': 'www.planning2.cityoflondon.gov.uk',
    'Origin': 'http://www.planning2.cityoflondon.gov.uk',
    'Referer': 'http://www.planning2.cityoflondon.gov.uk/online-applications/search.do?action=advanced',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
}
data = {
    'searchCriteria.developmentType': '002',
    'date(applicationReceivedStart)': '01/08/2000',
    'date(applicationReceivedEnd)': '01/08/2018'
}
url = 'http://www.planning2.cityoflondon.gov.uk/online-applications/advancedSearchResults.do?action=firstPage'
test_file = 'planning_app.html'
with requests.Session() as session:
    r = session.post(url, headers=headers, data=data)
with open(test_file, 'w') as file:
    file.write(r.text)
webbrowser.open(test_file)
As you can see from the page reopened with webbrowser, this gives an 'outdated cookie' error.
For this to work I would need to manually go to the webpage, perform a query while keeping the inspect panel of Google Chrome open on the Network tab, look at the cookie in the request headers, and copy-paste the cookie into my code. This would work until the cookie expires again, of course.
I tried to automate that retrieval of the cookie by doing the following:
headers_get = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'en-US,en;q=0.9',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'Host': 'www.planning2.cityoflondon.gov.uk',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'
}
with requests.Session() as session:
    c = session.get('http://www.planning2.cityoflondon.gov.uk/online-applications/', headers=headers_get)
    headers['Cookie'] = 'JSESSIONID=' + list(c.cookies.get_dict().values())[0]
    r = session.post(url, headers=headers, data=data)
with open(test_file, 'w') as file:
    file.write(r.text)
webbrowser.open(test_file)
I would expect this to work, as it simply automates what I do manually:
go to the page of the GET request, get the cookie from it, and add said cookie to the headers dict of the POST request.
However, I still receive the 'server error' page from the POST request.
Would anyone be able to explain why this happens?
requests.post accepts a cookies named parameter. Using it instead of sending the cookie directly in the header may fix the problem:
with requests.Session() as session:
    c = session.get('http://www.planning2.cityoflondon.gov.uk/online-applications/', headers=headers_get)
    # Also, you can set cookies=session.cookies
    r = session.post(url, headers=headers, data=data, cookies=c.cookies)
Basically, I suppose there may be some JavaScript logic on the site which isn't executed when you use requests.post. If that's the case, you would have to use Selenium to fill in and submit the form.
Please see Dynamic Data Web Scraping with Python, BeautifulSoup, which deals with a similar problem: JavaScript not being executed.
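As an alternative sketch along the same lines: drop the hand-built Cookie header entirely and let the Session carry cookies from the GET over to the POST, which is what the browser does. This reuses headers_get, url, data and test_file as defined in the question; note that requests sets the form content-type automatically when data= is passed:
import requests
import webbrowser

with requests.Session() as session:
    # The session stores JSESSIONID from this GET and resends it automatically.
    session.get('http://www.planning2.cityoflondon.gov.uk/online-applications/',
                headers=headers_get)
    r = session.post(url, headers=headers_get, data=data)
with open(test_file, 'w') as file:
    file.write(r.text)
webbrowser.open(test_file)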
