I'm working on a Python script that grabs the prices of items from the Steam marketplace.
My problem is that if I let it run for too long, it gets an HTTP 429 error.
I want to avoid this, but the Retry-After header is not present in the server's response.
Here's a sample of the response headers
('Server', 'nginx')
('Content-Type', 'application/json; charset=utf-8')
('X-Frame-Options', 'DENY')
('Expires', 'Mon, 26 Jul 1997 05:00:00 GMT')
('Cache-Control', 'no-cache')
('Vary', 'Accept-Encoding')
('Date', 'Wed, 08 May 2019 03:58:30 GMT')
('Content-Length', '6428')
('Connection', 'close')
('Set-Cookie', 'sessionid=14360f3a5309bb1531932884; path=/; secure')
('Set-Cookie', 'steamCountry=CA%7C2020e87b713c54ddc925e4c38b0bf705; path=/; secure')
EDIT: here's the code and sample output.
Note that for this example nothing inside the try statement actually runs; the headers shown below are printed by the exception handler.
import json
import pprint
import urllib.request

def getPrice(card, game):
    url = 'https://steamcommunity.com/market/search/render/?query='
    url = url + card + " " + game
    url = url.replace(" ", "+")
    print(url)
    try:
        data = urllib.request.urlopen(url)
        h = data.getheaders()
        for item in h:
            print(item)
        #print(data.getheaders())
        #k = data.headers.keys()
        json_data = json.loads(data.read())
        pprint.pprint(json_data)
    except Exception as e:
        print(e.headers)
    return 0
sample output on 3 different calls:
https://steamcommunity.com/market/search/render/?query=Glub+Crawl
Server: nginx
Content-Type: application/json; charset=utf-8
X-Frame-Options: DENY
Expires: Mon, 26 Jul 1997 05:00:00 GMT
Cache-Control: no-cache
Content-Encoding: gzip
Vary: Accept-Encoding
Content-Length: 24
Date: Wed, 08 May 2019 04:24:49 GMT
Connection: close
Set-Cookie: sessionid=5d1ea46f5095d9c28e141dd5; path=/; secure
Set-Cookie: steamCountry=CA%7C2020e87b713c54ddc925e4c38b0bf705; path=/; secure
https://steamcommunity.com/market/search/render/?query=Qaahl+Crawl
Server: nginx
Content-Type: application/json; charset=utf-8
X-Frame-Options: DENY
Expires: Mon, 26 Jul 1997 05:00:00 GMT
Cache-Control: no-cache
Content-Encoding: gzip
Vary: Accept-Encoding
Content-Length: 24
Date: Wed, 08 May 2019 04:24:49 GMT
Connection: close
Set-Cookie: sessionid=64e7956224b18e6d89cc45c0; path=/; secure
Set-Cookie: steamCountry=CA%7C2020e87b713c54ddc925e4c38b0bf705; path=/; secure
https://steamcommunity.com/market/search/render/?query=Odshan+Crawl
Server: nginx
Content-Type: application/json; charset=utf-8
X-Frame-Options: DENY
Expires: Mon, 26 Jul 1997 05:00:00 GMT
Cache-Control: no-cache
Content-Encoding: gzip
Vary: Accept-Encoding
Content-Length: 24
Date: Wed, 08 May 2019 04:24:50 GMT
Connection: close
Set-Cookie: sessionid=a7acd1023b4544809914dc6e; path=/; secure
Set-Cookie: steamCountry=CA%7C2020e87b713c54ddc925e4c38b0bf705; path=/; secure
Try this.
import json
import pprint
import time
import urllib.request

def getPrice(card, game):
    url = 'https://steamcommunity.com/market/search/render/?query='
    url = url + card + " " + game
    url = url.replace(" ", "+")
    print(url)
    while True:
        try:
            data = urllib.request.urlopen(url)
            h = data.getheaders()
            for item in h:
                print(item)
            json_data = json.loads(data.read())
            pprint.pprint(json_data)
            return json_data  # success: stop retrying
        except Exception as e:
            # time.sleep() takes seconds, not milliseconds
            seconds = 10
            time.sleep(seconds)
Adjust the sleep value to suit; note that time.sleep() takes seconds, not milliseconds.
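Since the server sends no Retry-After header, a fixed sleep is the simplest option, but backing off only on HTTP 429 avoids hammering the endpoint and avoids retrying on unrelated errors. A minimal sketch (the function name, starting delay, and retry cap are assumptions, not something from the original code):

import json
import time
import urllib.error
import urllib.request

def get_price_with_backoff(card, game, max_retries=5):
    # Hypothetical helper: retry on HTTP 429 with exponential backoff,
    # since the server does not send a Retry-After header.
    url = 'https://steamcommunity.com/market/search/render/?query='
    url = (url + card + " " + game).replace(" ", "+")
    delay = 5  # starting delay in seconds (an assumption)
    for attempt in range(max_retries):
        try:
            with urllib.request.urlopen(url) as data:
                return json.loads(data.read())
        except urllib.error.HTTPError as e:
            if e.code != 429:
                raise  # not a rate-limit problem, let it propagate
            time.sleep(delay)
            delay *= 2  # double the wait after each consecutive 429
    return None  # gave up after max_retries attempts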
Related
I am wondering if Python requests supports the "autoreferer" functionality from curl. Basically, with allow_redirects=True, requests should set the "Referer" header on subsequent redirected requests automatically.
Here is what the request headers look like (without a "Referer" header) using requests:
>>> import requests
>>> import logging
>>> import http.client
>>> http.client.HTTPConnection.debuglevel = 1
>>> logging.basicConfig()
>>> logging.getLogger().setLevel(logging.DEBUG)
>>> requests_log = logging.getLogger("requests.packages.urllib3")
>>> requests_log.setLevel(logging.DEBUG)
>>> requests_log.propagate = True
>>> r = requests.post('http://www.somewebsite.com', allow_redirects=True)
DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): www.somewebsite.com:80
send: b'POST / HTTP/1.1\r\nHost: www.somewebsite.com\r\nAccept: */*\r\nUser-Agent: python-requests/2.21.0\r\nConnection: keep-alive\r\nAccept-Encoding: gzip, deflate\r\nContent-Length: 0\r\n\r\n'
reply: 'HTTP/1.1 307 Temporary Redirect\r\n'
DEBUG:urllib3.connectionpool:http://www.somewebsite.com:80 "POST / HTTP/1.1" 307 185
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): www.somewebsite.com:443
header: Server header: Date header: Content-Type header: Content-Length header: Connection header: Location header: X-Cache header: Via header: X-Amz-Cf-Pop header: X-Amz-Cf-Id
send: b'POST / HTTP/1.1\r\nHost: www.somewebsite.com\r\nAccept: */*\r\nUser-Agent: python-requests/2.21.0\r\nConnection: keep-alive\r\nAccept-Encoding: gzip, deflate\r\nContent-Length: 0\r\n\r\n'
reply: 'HTTP/1.1 302 Moved Temporarily\r\n'
DEBUG:urllib3.connectionpool:https://www.somewebsite.com:443 "POST / HTTP/1.1" 302 13
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): somewebsite.com:443
header: Content-Type header: Content-Length header: Connection header: Date header: Location header: Access-Control-Allow-Origin header: X-Cache header: Via header: X-Amz-Cf-Pop header: X-Amz-Cf-Id
send: b'GET / HTTP/1.1\r\nHost: somewebsite.com\r\nAccept: */*\r\nUser-Agent: python-requests/2.21.0\r\nConnection: keep-alive\r\nAccept-Encoding: gzip, deflate\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
DEBUG:urllib3.connectionpool:https://somewebsite.com:443 "GET / HTTP/1.1" 200 149681
header: Content-Type header: Content-Length header: Connection header: Date header: Server header: Expires header: Last-Modified header: Content-Encoding header: Via header: Vary header: Accept-Ranges header: Cache-Control header: Set-Cookie header: X-Cache header: X-Amz-Cf-Pop header: X-Amz-Cf-Id >>>
>>>
And here is what the request headers look like (with a "Referer" header) using pycurl:
>>> import pycurl
>>> from io import BytesIO
>>> buffer = BytesIO()
>>> c = pycurl.Curl()
>>> c.setopt(c.URL, 'http://www.somewebsite.com/')
>>> c.setopt(c.WRITEDATA, buffer)
>>> c.setopt(pycurl.VERBOSE, 1)
>>> c.setopt(pycurl.AUTOREFERER, 1)
>>> c.setopt(pycurl.FOLLOWLOCATION, 1)
>>> c.perform()
>>> c.close()
* Trying 99.84.194.56...
* Connected to www.somewebsite.com (99.84.194.56) port 80 (#0)
> GET / HTTP/1.1
Host: www.somewebsite.com
User-Agent: PycURL/7.43.0.2 libcurl/7.47.0 OpenSSL/1.0.2g zlib/1.2.8 libidn/1.32 librtmp/2.3
Accept: */*
< HTTP/1.1 301 Moved Permanently
< Server: CloudFront
< Date: Wed, 26 Feb 2020 21:46:55 GMT
< Content-Type: text/html
< Content-Length: 183
< Connection: keep-alive
< Location: https://www.somewebsite.com/
< X-Cache: Redirect from cloudfront
< Via: 1.1 40ddfb9607f5d49c286c41e9afdce772.cloudfront.net (CloudFront)
< X-Amz-Cf-Pop: LAX3-C3
< X-Amz-Cf-Id: Uij3cpBtl0ZJ_OwFFDSint5ab3Ayvn0okmhJekgtxI-etIN5l07sjg==
<
* Ignoring the response-body
* Connection #0 to host www.somewebsite.com left intact
* Issue another request to this URL: 'https://www.somewebsite.com/'
* Found bundle for host www.somewebsite.com: 0x2ab53b0 [can pipeline]
* Trying 99.84.194.113...
* Connected to www.somewebsite.com (99.84.194.113) port 443 (#1)
* ALPN, offering http/1.1
* Cipher selection: ALL:!EXPORT:!EXPORT40:!EXPORT56:!aNULL:!LOW:!RC4:#STRENGTH
* successfully set certificate verify locations:
* CAfile: /etc/ssl/certs/ca-certificates.crt
CApath: /etc/ssl/certs
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN, server accepted to use http/1.1
* Server certificate:
* subject: CN=watchdisneyfe.com
* start date: Dec 16 00:00:00 2019 GMT
* expire date: Jan 16 12:00:00 2021 GMT
* subjectAltName: www.somewebsite.com matched
* issuer: C=US; O=Amazon; OU=Server CA 1B; CN=Amazon
* SSL certificate verify ok.
> GET / HTTP/1.1
Host: www.somewebsite.com
User-Agent: PycURL/7.43.0.2 libcurl/7.47.0 OpenSSL/1.0.2g zlib/1.2.8 libidn/1.32 librtmp/2.3
Accept: */*
Referer: http://www.somewebsite.com/
< HTTP/1.1 302 Moved Temporarily
< Content-Type: text/plain
< Content-Length: 13
< Connection: keep-alive
< Date: Wed, 26 Feb 2020 21:46:55 GMT
< Location: https://somewebsite.com/
< Access-Control-Allow-Origin: *
< X-Cache: Miss from cloudfront
< Via: 1.1 74d35431a23bfc97a6055173d9be2dc4.cloudfront.net (CloudFront)
< X-Amz-Cf-Pop: LAX3-C3
< X-Amz-Cf-Id: Bxg1W9zPN7U4i8GqysA11vj6h2dyDZdClyMUfUMfVUqd-v_mrQXGhQ==
<
* Ignoring the response-body
* Connection #1 to host www.somewebsite.com left intact
* Issue another request to this URL: 'https://somewebsite.com/'
* Trying 13.225.146.93...
* Connected to somewebsite.com (13.225.146.93) port 443 (#2)
* ALPN, offering http/1.1
* Cipher selection: ALL:!EXPORT:!EXPORT40:!EXPORT56:!aNULL:!LOW:!RC4:#STRENGTH
* successfully set certificate verify locations:
* CAfile: /etc/ssl/certs/ca-certificates.crt
CApath: /etc/ssl/certs
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN, server accepted to use http/1.1
* Server certificate:
* subject: CN=watchdisneyfe.com
* start date: Dec 16 00:00:00 2019 GMT
* expire date: Jan 16 12:00:00 2021 GMT
* subjectAltName: somewebsite.com matched
* issuer: C=US; O=Amazon; OU=Server CA 1B; CN=Amazon
* SSL certificate verify ok.
> GET / HTTP/1.1
Host: somewebsite.com
User-Agent: PycURL/7.43.0.2 libcurl/7.47.0 OpenSSL/1.0.2g zlib/1.2.8 libidn/1.32 librtmp/2.3
Accept: */*
Referer: https://www.somewebsite.com/
< HTTP/1.1 200 OK
< Content-Type: text/html; charset=utf-8
< Content-Length: 1218349
< Connection: keep-alive
< Vary: Accept-Encoding
< Date: Wed, 26 Feb 2020 21:46:55 GMT
< Server: nginx/1.16.1
< Expires: Wed, 26 Feb 2020 21:56:48 GMT
< Last-Modified: Wed, 26 Feb 2020 21:56:48 GMT
< Via: 1.1 varnish-v4, 1.1 a52dcb1fed052adbd58b868375961d24.cloudfront.net (CloudFront)
< Vary: Accept-Encoding
< Accept-Ranges: bytes
< Cache-Control: max-age=0, must-revalidate
< Set-Cookie: SWID=72B09DFD-D038-485C-C836-7229EB59F0B1; path=/; Expires=Sun, 26 Feb 2040 21:46:55 GMT; domain=somewebsite.com;
< X-Cache: Miss from cloudfront
< X-Amz-Cf-Pop: LAX3-C4
< X-Amz-Cf-Id: JGF1k-OnDIZT_1DP5psnrlb9jmmp7rq69QbGNZL1CVGbjJWjORwpGQ==
<
* Connection #2 to host somewebsite.com left intact
Is there any way to add the "Referer" header automatically, as curl does?
Note: if you want to try it out, replace "somewebsite" with "abc", for instance.
requests doesn't have any official hooks for this task. But you could subclass requests.Session to wrap a method that's called for each redirect: Session.rebuild_auth():
When being redirected we may want to strip authentication from the request to avoid leaking credentials. This method intelligently removes and reapplies authentication where possible to avoid credential loss.
Because it is called with the next (prepared) request as well as the previous response that triggered the redirect, it is ideally situated to add the Referer header:
import requests

class RefererSession(requests.Session):
    def rebuild_auth(self, prepared_request, response):
        super().rebuild_auth(prepared_request, response)
        prepared_request.headers["Referer"] = response.url
then use this subclass for all requests:
with RefererSession() as session:
    r = session.post('http://www.somewebsite.com', allow_redirects=True)
Demo using https://httpbin.org:
>>> import requests
>>> import http.client
>>> from ast import literal_eval
>>> http.client.HTTPConnection.debuglevel = 1
>>> def echo_request_lines(msg, *rest):
... """HTTPConnection debug print handler, writes out request lines"""
... if msg != 'send:': return
... request_lines = literal_eval(rest[0]).replace(b'\r', b'')
... print(request_lines.rstrip().decode('latin1'))
... print()
...
>>> http.client.HTTPConnection.debuglevel = 1
>>> http.client.print = echo_request_lines
>>> class RefererSession(requests.Session):
...     def rebuild_auth(self, prepared_request, response):
...         super().rebuild_auth(prepared_request, response)
...         prepared_request.headers["Referer"] = response.url
...
>>> with RefererSession() as session:
...     r = session.get('https://httpbin.org/redirect/2')
...
GET /redirect/2 HTTP/1.1
Host: httpbin.org
User-Agent: python-requests/2.22.0
Accept-Encoding: gzip, deflate
Accept: */*
Connection: keep-alive
GET /relative-redirect/1 HTTP/1.1
Host: httpbin.org
User-Agent: python-requests/2.22.0
Accept-Encoding: gzip, deflate
Accept: */*
Connection: keep-alive
Referer: https://httpbin.org/redirect/2
GET /get HTTP/1.1
Host: httpbin.org
User-Agent: python-requests/2.22.0
Accept-Encoding: gzip, deflate
Accept: */*
Connection: keep-alive
Referer: https://httpbin.org/relative-redirect/1
>>> from pprint import pprint
>>> pprint(dict(r.history[1].request.headers))
{'Accept': '*/*',
'Accept-Encoding': 'gzip, deflate',
'Connection': 'keep-alive',
'Referer': 'https://httpbin.org/redirect/2',
'User-Agent': 'python-requests/2.22.0'}
>>> pprint(dict(r.request.headers))
{'Accept': '*/*',
'Accept-Encoding': 'gzip, deflate',
'Connection': 'keep-alive',
'Referer': 'https://httpbin.org/relative-redirect/1',
'User-Agent': 'python-requests/2.22.0'}
I am trying to extract the HttpOnly, Secure, domain, and path attributes from a given cookie in Python. How do I do that?
import requests

target_url = "https://www.google.com/"
try:
    response1 = requests.get(target_url)
    if response1.status_code == 200:
        response2 = response1.headers['Set-Cookie']
        print(response2)
except Exception as e:
    print(str(e))

targets = response2.split('; ')
for target in targets:
    print(target)
Result
1P_JAR=2020-01-26-18; expires=Tue, 25-Feb-2020 18:45:29 GMT; path=/; domain=.google.com; Secure, NID=196=vCD6Y6ltvTmjf_VRFN9SUuqEN7OEJKjEoJg4XhiBc8Xivdez5boKQ8QzcCYung7EKe58kso1333yCrqq_Wq2QXwCZPAIrwHbo1lITA8lvqRtJERF-S6t9mMVEOg_o_Jpne5oRL3vwn8ReeV8f3Exx6ScJipPsm9MlXXir1fisho; expires=Mon, 27-Jul-2020 18:45:29 GMT; path=/; domain=.google.com; HttpOnly
1P_JAR=2020-01-26-18
expires=Tue, 25-Feb-2020 18:45:29 GMT
path=/
domain=.google.com
Secure, NID=196=vCD6Y6ltvTmjf_VRFN9SUuqEN7OEJKjEoJg4XhiBc8Xivdez5boKQ8QzcCYung7EKe58kso1333yCrqq_Wq2QXwCZPAIrwHbo1lITA8lvqRtJERF-S6t9mMVEOg_o_Jpne5oRL3vwn8ReeV8f3Exx6ScJipPsm9MlXXir1fisho
expires=Mon, 27-Jul-2020 18:45:29 GMT
path=/
domain=.google.com
HttpOnly
A little string manipulation should do the trick:
targets = response2.split('; ')
for target in targets:
    print(target)
Output:
1P_JAR=2020-01-26-17
expires=Tue, 25-Feb-2020 17:36:40 GMT
path=/
domain=.google.com
Secure, NID=196=dXGexcdgLL0Ndy85DQj-Yg5aySfe_th__wZRtmnu2V2alQQdl807dMDLSTeEKb2CEfGpV17fIej7uXIp6w5Nb0Npab4nrf38fQwi480iYF8DYxa-ggSN-PTXVXeGvrwKRnmDYWmfYynSvpD-C9UUiXI59baq1dsdDtwsIL-zzq0
expires=Mon, 27-Jul-2020 17:36:40 GMT
path=/
domain=.google.com
HttpOnly
To get only, for example, "domain" use:
print(targets[3])
If you don't know the order of cookies, you can try a dictionary:
cookies = dict()
for target in targets:
    if '=' in target:
        # split only on the first '=' so values that contain '=' stay intact
        key, value = target.split('=', 1)
        cookies.update({key: value})
    else:
        cookies.update({target: target})

cookies.get('domain')
Output:
.google.com
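As an aside, for a single Set-Cookie string the standard library's http.cookies.SimpleCookie can also parse the attributes into a Morsel. A small sketch (note it tends to stop parsing when several cookies are folded into one header with commas, which is why the manual splitting above can still be necessary):

from http.cookies import SimpleCookie

raw = '1P_JAR=2020-01-26-18; expires=Tue, 25-Feb-2020 18:45:29 GMT; path=/; domain=.google.com; Secure'
cookie = SimpleCookie()
cookie.load(raw)
morsel = cookie['1P_JAR']
print(morsel.value)       # 2020-01-26-18
print(morsel['domain'])   # .google.com
print(morsel['path'])     # /
print(morsel['expires'])  # Tue, 25-Feb-2020 18:45:29 GMT
print(morsel['secure'])   # True when the Secure flag is present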
I am using requests and I need to extract a certain value from the response's Set-Cookie header. I can't use r.cookies because that doesn't include expiration, path, domain, etc., and I need those values.
When I do
test = r.headers['set-cookie']
print(test)
I get a response as so:
'cookie1 = cookie1value; expires=datehere; path=/; domain=domainhere, cookie2 = cookie2value; expires=datehere; path=/; domain=domainhere,cookie3 = cookie3value; Domain=.domain.com; Path=/; Expires=Wed, 04 Nov 2020 19:44:17 GMT; Max-Age=31536000; Secure'
I need to extract the value of cookie3 with all of its tags.
You could use re:
import re
test = 'cookie1 = cookie1value; expires=datehere; path=/; domain=domainhere, cookie2 = cookie2value; expires=datehere; path=/; domain=domainhere,cookie3 = cookie3value; expires=datehere; path=/; domain=domainhere,cookie4 = cookie4value; expires=datehere; path=/; domain=domainhere'
p = re.compile(r'cookie3 = (.*)')
print(p.findall(test)[0])
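If you would rather not run a regex over the comma-joined string at all, a possible alternative is to read the individual Set-Cookie headers from the underlying urllib3 response, which keeps them separate. A rough sketch (the URL is a placeholder; r.raw.headers is urllib3's header dict, and its getlist() returns one string per header occurrence):

import requests

r = requests.get('https://www.example.com/')       # placeholder URL, use your own
raw_cookies = r.raw.headers.getlist('Set-Cookie')   # one string per Set-Cookie header
for c in raw_cookies:
    if c.startswith('cookie3'):                     # pick the cookie you need by name
        print(c)                                    # full value with all of its attributes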
I have many cookie strings that I get from an HTTP response and save in a set. For example like this:
cookies = set()
cookies.add("__cfduid=123456789101112131415116; expires=Thu, 27-Aug-20 10:10:10 GMT; path=/; domain=.example.com; HttpOnly; Secure")
cookies.add("MUID=16151413121110987654321; domain=.bing.com; expires=Mon, 21-Sep-2020 10:10:11 GMT; path=/;, MUIDB=478534957198492834; path=/; httponly; expires=Mon, 21-Sep-2020 10:10:11 GMT")
Now I would like to parse those strings into an array or something else, to access the data (domain, expires, ...) more easily. For example like this:
cookie['MUID']['value']
cookie['MUID']['domain']
cookie['MUIDB']['path']
cookie['__cfduid']['Secure']
...
But I don't know how to do this. I tried SimpleCookie from http.cookies but did not get the expected result.
You should create a Python dictionary for this.
from collections import defaultdict

cookies = defaultdict(str)
list_of_strings = ["__cfduid=123456789101112131415116; expires=Thu, 27-Aug-20 10:10:10 GMT; path=/; domain=.example.com; HttpOnly; Secure"]  # this is your list of strings you want to add
for string in list_of_strings:
    parts = string.split(";")
    for part in parts:
        temp = part.split("=")
        if len(temp) == 2:
            cookies[temp[0]] = temp[1]
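The flat dictionary above loses track of which cookie an attribute belongs to. Below is a rough sketch of the nested lookup the question asks for; the parse_set_cookie helper and its comma-splitting regex are my own additions, and they assume a comma separates cookies only when it is followed by a name= token, so commas inside expires dates are left alone:

import re

def parse_set_cookie(header):
    # Parse one Set-Cookie string (possibly several cookies folded together
    # with commas) into {cookie_name: {attribute: value}}.
    result = {}
    for chunk in re.split(r',\s*(?=[^=;,\s]+=)', header):
        parts = [p.strip() for p in chunk.split(';') if p.strip()]
        name, _, value = parts[0].partition('=')
        attrs = {'value': value}
        for part in parts[1:]:
            key, sep, val = part.partition('=')
            attrs[key.lower()] = val if sep else True  # flags like HttpOnly / Secure
        result[name] = attrs
    return result

cookies = set()
cookies.add("__cfduid=123456789101112131415116; expires=Thu, 27-Aug-20 10:10:10 GMT; path=/; domain=.example.com; HttpOnly; Secure")
cookies.add("MUID=16151413121110987654321; domain=.bing.com; expires=Mon, 21-Sep-2020 10:10:11 GMT; path=/;, MUIDB=478534957198492834; path=/; httponly; expires=Mon, 21-Sep-2020 10:10:11 GMT")

parsed = {}
for raw in cookies:
    parsed.update(parse_set_cookie(raw))

print(parsed['MUID']['domain'])      # .bing.com
print(parsed['MUIDB']['path'])       # /
print(parsed['__cfduid']['secure'])  # True (attribute keys are lower-cased here)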
I have a problem with the response from a web app.
smth = requests.get('httpxxx')
then I get:
requests.packages.urllib3.exceptions.HeaderParsingError:
[MissingHeaderBodySeparatorDefect()], unparsed data: 'Compression
Index: 1\r\nContent-Type: text/plain;charset=UTF-8\r\nContent-Length:
291\r\nDate: Wed, 01 Jun 2016 15:03:05 GMT\r\n\r\n'
but some headers are parsable:
smth.headers
{'Cache-Control': 'no-cache', 'Server': 'Apache-Coyote/1.1', 'Pragma': 'no-cache', 'Last-Modified': 'Wed, 01 Jun 2016 15:24:58 GMT'}
When the Content-Length is more than 1000, the response is compressed, and that seems to break everything.