Context:
I'm making GET requests to an API, and the API sometimes returns data that is up to 5 minutes old. However, when I make the same request in Chrome, the data is always up to date. The server is nginx.
This is the API request made when the page is loaded in Chrome:
https://buff.163.com/api/market/goods/sell_order?game=csgo&goods_id=781660&_=1604808126524
Relevant Code:
import random
from datetime import datetime
import requests
# proxies is a list of proxy addresses defined elsewhere in the script
def epochTimestamp():
    # current time in milliseconds since the Unix epoch
    return int(round(datetime.now().timestamp() * 1000))
def getProxies():
    proxy = random.choice(proxies)
    return {'http': f'socks5h://{proxy}', 'https': f'socks5h://{proxy}'}
get_purchase_headers = {
    'Host': 'buff.163.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.183 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest',
    'Cache-Control': 'max-age=0'
}
url = f"https://buff.163.com/api/market/goods/sell_order?game=csgo&goods_id=781660&_={epochTimestamp()}"
source = requests.get(url, timeout=10, proxies=getProxies(), headers=get_purchase_headers)
What I have tried:
Including User-Agent headers
'Cache-Control': 'max-age=0'
Including timestamp in the URL
I am trying to log in to www.zalando.it using the requests library, but every time I post my data I get a 403 error. I compared my request with the login call shown in the browser's network tab on Zalando, and they look the same.
The credentials below are just dummy data; you can test by creating a test account.
Here is the code for the login function:
import requests
import pickle
import json
session = requests.session()
headers1 = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36'}
r = session.get('https://www.zalando.it/', headers = headers1)
cookies = r.cookies
url = 'https://www.zalando.it/api/reef/login'
payload = {'username': "email#email.it", 'password': "password", 'wnaMode': "shop"}
headers = {
    'x-xsrf-token': cookies['frsx'],
    #'_abck': str(cookies['_abck']),
    'usercentrics_enabled': 'true',
    'Connection': 'keep-alive',
    'Content-Type': 'application/json; charset=utf-8',
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36",
    'origin': 'https://www.zalando.it',
    'Access-Control-Allow-Origin': '*',
    'Access-Control-Allow-Credentials': 'true',
    'Access-Control-Allow-Methods': 'GET,PUT,POST,DELETE,OPTIONS',
    'Access-Control-Allow-Headers': 'Origin,X-Requested-With,Content-Type,Accept,content-type,application/json',
    'sec-fetch-mode': 'no-cors',
    'sec-fetch-site': 'same-origin',
    'accept': '*/*',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'it-IT,it;q=0.9,en-US;q=0.8,en;q=0.7',
    'dpr': '1.3125',
    'referer': 'https://www.zalando.it/uomo-home/',
    'viewport-width': '1464'
}
x = session.post(url, data = json.dumps(payload), headers = headers, cookies = cookies)
print(x) #error 403
print(x.text) #page that show 403
The initial request needs to look like an actual browser request; after that, the headers need to be modified to look like an XHR (Ajax) request. Also, there are some response headers that need to be added to future requests to the server, along with cookies such as the client ID and an XSRF token.
Here's some example code that is currently working:
import requests
# first load the home page
home_page_link = "https://www.zalando.it/"
login_api_schema = "https://www.zalando.it/api/reef/login/schema"
login_api_post = "https://www.zalando.it/api/reef/login"
headers = {
    'Host': 'www.zalando.it',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/17.17134',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate',
    'DNT': '1',
    'Connection': 'close',
    'Upgrade-Insecure-Requests': '1'
}
if __name__ == '__main__':
    with requests.Session() as s:
        s.headers.update(headers)
        r = s.get(home_page_link)
        # fetch these cookies: frsx, Zalando-Client-Id
        cookie_dict = s.cookies.get_dict()
        # update the headers
        # remove this header for the xhr requests
        del s.headers['Upgrade-Insecure-Requests']
        # these 2 are taken from some response cookies
        s.headers['x-xsrf-token'] = cookie_dict['frsx']
        s.headers['x-zalando-client-id'] = cookie_dict['Zalando-Client-Id']
        # i didn't pay attention to where these came from
        # just saw them and manually added them
        s.headers['x-zalando-render-page-uri'] = '/'
        s.headers['x-zalando-request-uri'] = '/'
        # this is sent as a response header and is needed to
        # track future requests/responses
        s.headers['x-flow-id'] = r.headers['X-Flow-Id']
        # only accept json data from xhr requests
        s.headers['Accept'] = 'application/json'
        # when clicking the login button this request is sent
        # i didn't test without this request
        r = s.get(login_api_schema)
        # add an origin header
        s.headers['Origin'] = 'https://www.zalando.it'
        # finally log in, this should return a 201 response with a cookie
        login_data = {"username": "email#email.it", "password": "password", "wnaMode": "modal"}
        r = s.post(login_api_post, json=login_data)
        print(r.status_code)
        print(r.headers)
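The switch from browser-style to XHR-style headers can also be isolated in a small helper, which makes it easier to see what changes between the first request and the later ones. This is only a sketch: build_xhr_headers is a hypothetical name, it covers just the token, client ID, flow ID and Accept changes described above, and the values in the usage lines are placeholders rather than real cookies.

```python
def build_xhr_headers(base_headers, cookie_dict, flow_id):
    """Return a copy of base_headers adjusted to look like an XHR request.

    Hypothetical helper: the cookie names and header names are the ones
    used in the example above, nothing here is an official API.
    """
    headers = dict(base_headers)
    # Sent by browsers for page navigations, so drop it for XHR calls.
    headers.pop('Upgrade-Insecure-Requests', None)
    # Values copied from the cookies set by the home page response.
    headers['x-xsrf-token'] = cookie_dict['frsx']
    headers['x-zalando-client-id'] = cookie_dict['Zalando-Client-Id']
    # Value copied from the X-Flow-Id response header.
    headers['x-flow-id'] = flow_id
    # XHR requests only ask for JSON.
    headers['Accept'] = 'application/json'
    return headers


if __name__ == '__main__':
    # Placeholder values, for illustration only.
    browser_headers = {'Accept': 'text/html', 'Upgrade-Insecure-Requests': '1'}
    cookies = {'frsx': 'dummy-token', 'Zalando-Client-Id': 'dummy-client'}
    print(build_xhr_headers(browser_headers, cookies, 'dummy-flow'))
```

Because the helper returns a new dict instead of mutating its input, the original browser headers stay available for any later full page load.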
Well, it seems to me that this website is protected by Akamai (looks like Akamai Bot Manager).
See that Server: AkamaiGHost in the response headers of /api/reef/login when you get a 403 response?
Also, have a look at the requests sent during a legitimate browser session: there are many requests sent to /static/{some unique ID}, with some sensor_data, including your user-agent, and some other "gibberish".
The above description seems to fit this one:
The BMP SDK collects behavioral data while the user is interacting with the application. This behavioral data, also known as sensor data, includes the device characteristics, device orientation, accelerometer data, touch events, etc. Reference: BMP SDK
Also, this answer confirms that some of the cookies set by this website in fact belong to Akamai Bot Manager.
Well, I'm not sure if there's an easy way of bypassing it. After all, that's a product developed exactly for this purpose - to block web-scraping bots like yours.
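To tell this kind of block apart from an ordinary application error, you can at least check the response for the Server header mentioned above. The snippet below is a rough heuristic and not an Akamai API: looks_like_akamai_block is a hypothetical name, and it only tests for a 403 status together with a Server value containing AkamaiGHost.

```python
def looks_like_akamai_block(status_code, headers):
    """Heuristic check: 403 status plus 'Server: AkamaiGHost'.

    headers is any mapping of header names to values, for example
    response.headers from requests. Header names are compared
    case-insensitively so a plain dict works too.
    """
    server = ''
    for name, value in headers.items():
        if name.lower() == 'server':
            server = value
    return status_code == 403 and 'akamaighost' in server.lower()


if __name__ == '__main__':
    print(looks_like_akamai_block(403, {'Server': 'AkamaiGHost'}))
    print(looks_like_akamai_block(403, {'Server': 'nginx'}))
```

With a requests response r you would call it as looks_like_akamai_block(r.status_code, r.headers). A positive result only suggests the block happened at the edge; it says nothing about why the request was flagged.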
import requests
session = requests.Session()
url = 'https://supremenewyork.com/shop/304070/add'
headers = {
    'Accept': '*/*;q=0.5, text/javascript, application/javascript, application/ecmascript, application/x-ecmascript',
    'Origin': 'https://www.supremenewyork.com',
    'X-CSRF-Token': 'cGh34LIXA5O75UEl+ArjyIQA/CS6BGY9mFleXXZ5GnznS4t8y2rGTpUTumG93EHNwSfnkDDtsYLvbEGbmMymRQ==',
    'X-Requested-With': 'XMLHttpRequest',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36',
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
}
post_data = {
    'commit': 'add to basket',
    'size': '53133',
    'style': '25229',
    'utf8': '✓'
}
session.post(url=url, headers=headers, data=post_data, timeout=1)
r = session.get('https://supremenewyork.com/shop/cart.json', headers=headers)
print(r.text)
The post data is correct (I took it from Google Chrome), but every time the code returns nothing because the basket is empty. How do I make the POST request correctly?
I am fairly new to Python and I'm trying to extract production data from the Alabama state website (https://www.gsa.state.al.us/ogb/production). I was wondering if someone could guide me on starting this? This is what I have so far. I was trying to extract production for permit number 8132-C.
import requests
headers = {
    'Content-Type': 'application/x-www-form-urlencoded',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36',
}
payload = '8132-C'
session = requests.Session()
r = requests.get('https://www.gsa.state.al.us/ogb/production', params=payload)
print(r.url)
Instead of r.url, you should print r.text to see the data.
import requests
payload = '8132-C'
session = requests.Session()
r = requests.get('https://www.gsa.state.al.us/ogb/production', params=payload)
print(r.text)
The response web page is as shown below when I select "title" and enter "wordpress".
Here is my Python 3 code for passing arguments with the GET method.
import urllib.request
import urllib.parse
url = 'http://www.it-ebooks.info/'
values = {'q': 'wordpress','type': 'title'}
data = urllib.parse.urlencode(values).encode(encoding='utf-8',errors='ignore')
headers = { 'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0' }
request = urllib.request.Request(url=url, data=data,headers=headers,method='GET')
response = urllib.request.urlopen(request)
buff = response.read()
html = buff.decode("utf8")
print(html)
I can't get the desired output web page.
How do I pass arguments for a GET request with urllib in my example?
The data kwarg of urllib.request.Request is only used for POST requests as it modifies the request's body.
GET requests simply use URL parameters, so you should append these to the url:
params = '?q=wordpress&type=title'
url = 'http://www.it-ebooks.info/search/{}'.format(params)
You can of course take the time and generalize this into a generic function.
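A minimal sketch of such a generic function, using only the standard library, could look like this. The name build_get_url is my own, and the function simply URL-encodes a dict and appends it to the query string, keeping any query that is already part of the URL.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit


def build_get_url(base_url, params):
    """Append URL-encoded params to base_url as a query string."""
    parts = urlsplit(base_url)
    extra = urlencode(params)
    if parts.query and extra:
        # Keep the query already present in base_url and add to it.
        query = f"{parts.query}&{extra}"
    else:
        query = parts.query or extra
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, parts.fragment))


if __name__ == '__main__':
    values = {'q': 'wordpress', 'type': 'title'}
    print(build_get_url('http://www.it-ebooks.info/search/', values))
    # prints http://www.it-ebooks.info/search/?q=wordpress&type=title
```

The resulting string can be passed as the url argument of urllib.request.Request, with no data argument, so the request stays a plain GET.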
It is better to use the requests library:
import requests
headers = {
    'DNT': '1',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'es-ES,es;q=0.8,en;q=0.6',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Referer': 'http://www.it-ebooks.info/',
    'Connection': 'keep-alive',
}
r = requests.get('http://www.it-ebooks.info/search/?q=wordpress&type=title', headers=headers)
# print() call syntax, since the question uses Python 3
print(r.content)