Why can I not obtain some cookies - Python

I am trying to create a bot to read cookies, but I'm failing to do so. What am I doing wrong?
import urllib.request
import http.cookiejar

URL = 'https://roblox.com'

def extract_cookies():
    cookie_jar = http.cookiejar.CookieJar()
    url_opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie_jar))
    url_opener.open(URL)
    print(URL)
    for cookie in cookie_jar:
        print("[Cookie Name = %s] [Cookie Value = %s]" % (cookie.name, cookie.value))

if __name__ == '__main__':
    extract_cookies()

It's a permanent redirect: https://roblox.com redirects to https://www.roblox.com, which is why you get an HTTP 308 status code. Note the difference: the www.
The server tells you where to go in the HTTP response:
HTTP GET https://roblox.com/
--
HTTP/2 308 Permanent Redirect
location: https://www.roblox.com/
So update your URL to https://www.roblox.com.
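With that change, the cookie jar should pick up whatever cookies the server sets; a minimal sketch (the same code as in the question, only with the corrected URL):
import urllib.request
import http.cookiejar

URL = 'https://www.roblox.com'  # note the www

cookie_jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie_jar))
opener.open(URL)
for cookie in cookie_jar:
    print("[Cookie Name = %s] [Cookie Value = %s]" % (cookie.name, cookie.value))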

Related

How to set up a Tor-Server (Hidden Service) as a proxy?

The goal is to be able to access the proxy anonymously, so that the host (the proxy) doesn't know where the request came from (with credentials, of course).
The client should be able to access www.example.com through the host's IP, without the host knowing the client's IP.
Here's an example request route to www.example.com:
How would I hook up a browser to it?
How would I connect using Python? (something proxy-chain like?)
Note: the OS doesn't matter; the programming language is preferably Python.
EDIT:
The client in my case should be able to specify:
Headers
request method
site URL
for the request which the hidden service makes (so basically a proxy).
First, you need to create a hidden service with Tor on the host so you can communicate over the Tor network.
Here is a basic Flask proxy example (for a more advanced proxy you can build on this code). I didn't test this code, but you can fix any errors:
"""
A simple proxy server, based on original by gear11:
https://gist.github.com/gear11/8006132
Modified from original to support both GET and POST, status code passthrough, header and form data passthrough.
Usage: http://hostname:port/p/(URL to be proxied, minus protocol)
For example: http://localhost:5000/p/www.google.com
"""
from stem.control import Controller
import re
from urllib.parse import urlparse, urlunparse
from flask import Flask, render_template, request, abort, Response, redirect
import requests
import logging
app = Flask("example")
port = 5000
host = "127.0.0.1"
hidden_svc_dir = "c:/temp/"
logging.basicConfig(level=logging.INFO)
CHUNK_SIZE = 1024
LOG = logging.getLogger("app.py")
#app.route('/<path:url>', methods=["GET", "POST"])
def root(url):
# If referred from a proxy request, then redirect to a URL with the proxy prefix.
# This allows server-relative and protocol-relative URLs to work.
referer = request.headers.get('referer')
if not referer:
return Response("Relative URL sent without a a proxying request referal. Please specify a valid proxy host (/p/url)", 400)
proxy_ref = proxied_request_info(referer)
host = proxy_ref[0]
redirect_url = "/p/%s/%s%s" % (host, url, ("?" + request.query_string.decode('utf-8') if request.query_string else ""))
LOG.debug("Redirecting relative path to one under proxy: %s", redirect_url)
return redirect(redirect_url)
#app.route('/p/<path:url>', methods=["GET", "POST"])
def proxy(url):
"""Fetches the specified URL and streams it out to the client.
If the request was referred by the proxy itself (e.g. this is an image fetch
for a previously proxied HTML page), then the original Referer is passed."""
# Check if url to proxy has host only, and redirect with trailing slash
# (path component) to avoid breakage for downstream apps attempting base
# path detection
url_parts = urlparse('%s://%s' % (request.scheme, url))
if url_parts.path == "":
parts = urlparse(request.url)
LOG.warning("Proxy request without a path was sent, redirecting assuming '/': %s -> %s/" % (url, url))
return redirect(urlunparse(parts._replace(path=parts.path+'/')))
LOG.debug("%s %s with headers: %s", request.method, url, request.headers)
r = make_request(url, request.method, dict(request.headers), request.form)
LOG.debug("Got %s response from %s",r.status_code, url)
headers = dict(r.raw.headers)
def generate():
for chunk in r.raw.stream(decode_content=False):
yield chunk
out = Response(generate(), headers=headers)
out.status_code = r.status_code
return out
def make_request(url, method, headers={}, data=None):
url = 'http://%s' % url
# Pass original Referer for subsequent resource requests
referer = request.headers.get('referer')
if referer:
proxy_ref = proxied_request_info(referer)
headers.update({ "referer" : "http://%s/%s" % (proxy_ref[0], proxy_ref[1])})
# Fetch the URL, and stream it back
LOG.debug("Sending %s %s with headers: %s and data %s", method, url, headers, data)
return requests.request(method, url, params=request.args, stream=True, headers=headers, allow_redirects=False, data=data)
def proxied_request_info(proxy_url):
"""Returns information about the target (proxied) URL given a URL sent to
the proxy itself. For example, if given:
http://localhost:5000/p/google.com/search?q=foo
then the result is:
("google.com", "search?q=foo")"""
parts = urlparse(proxy_url)
if not parts.path:
return None
elif not parts.path.startswith('/p/'):
return None
matches = re.match('^/p/([^/]+)/?(.*)', parts.path)
proxied_host = matches.group(1)
proxied_path = matches.group(2) or '/'
proxied_tail = urlunparse(parts._replace(scheme="", netloc="", path=proxied_path))
LOG.debug("Referred by proxy host, uri: %s, %s", proxied_host, proxied_tail)
return [proxied_host, proxied_tail]
controller = Controller.from_port(address="127.0.0.1", port=9151)
try:
controller.authenticate(password="")
controller.set_options([
("HiddenServiceDir", hidden_svc_dir),
("HiddenServicePort", "80 %s:%s" % (host, str(port)))
])
svc_name = open(hidden_svc_dir + "/hostname", "r").read().strip()
print "onion link: %s" % svc_name
except Exception as e:
print e
app.run()
After running this you will get an onion link like "somelongstringwithnumber123.onion". With that onion link you can connect to the host from the client over the Tor network.
Then you need to make the request over the Tor network from the client:
import requests
session = requests.session()
session.proxies = {}
session.proxies['http'] = 'socks5h://localhost:9050'
session.proxies['https'] = 'socks5h://localhost:9050'
r = session.get("http://somelongstringwithnumber123.onion/p/alpwebtasarim.com")
print(r.text)
I'm not going to test this code, but I hope you understand the main idea.
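One note on the client snippet above: the socks5h:// proxy scheme in requests requires the PySocks package, which can be installed via the socks extra:
pip install requests[socks]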
To use a Tor hidden service as a proxy, you must install Tor on the server; the program is available at https://www.torproject.org/download/.
After installing Tor, configure a hidden service in your torrc file (a HiddenServiceDir and a HiddenServicePort entry). On the client side, you can route Python's socket connections through the Tor SOCKS proxy with PySocks:
import socks
import socket

socks.set_default_proxy(socks.SOCKS5, "127.0.0.1", 9050)
socket.socket = socks.socksocket
# Use the socket module normally, and all connections will go via the proxy.
To use the Tor hidden service from a browser, set the browser to use the Tor SOCKS proxy on localhost:9050 and use the hidden service hostname as the target address; it can be found in the file named "hostname" inside the HiddenServiceDir.
The python-stem library can automate starting and stopping the Tor service, creating and deleting hidden services, and getting the hidden service's onion address.
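Here is a minimal sketch with stem (assuming Tor is running locally with its control port on the default 9051, and the Flask proxy from the earlier answer listening on port 5000); create_ephemeral_hidden_service avoids managing a HiddenServiceDir by hand:
from stem.control import Controller

# Assumes Tor is running locally with ControlPort 9051 enabled.
with Controller.from_port(port=9051) as controller:
    controller.authenticate()
    # Map port 80 of the hidden service to the local proxy on port 5000.
    service = controller.create_ephemeral_hidden_service({80: 5000}, await_publication=True)
    print("onion link: %s.onion" % service.service_id)
    # The ephemeral service is removed when the controller connection closes.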

(Python) How to check http response on status

Can someone tell me how to check the status code of an HTTP response with http.client? I didn't find anything specific to that in the documentation of http.client.
The code would look like this:
if conn.getresponse():
    return True   # status code = 200
else:
    return False  # status code != 200
My code looks like this:
from urllib.parse import urlparse
import http.client, sys

def check_url(url):
    url = urlparse(url)
    conn = http.client.HTTPConnection(url.netloc)
    conn.request("HEAD", url.path)
    r = conn.getresponse()
    if r.status == 200:
        return True
    else:
        return False

if __name__ == "__main__":
    input_url = input("Enter the website to be checked (beginning with www):")
    url = "http://" + input_url
    url_https = "https://" + input_url
    if check_url(url_https):
        print("The entered Website supports HTTPS.")
    else:
        if check_url(url):
            print("The entered Website doesn't support HTTPS, but supports HTTP.")
    if check_url(url):
        print("The entered Website supports HTTP too.")
Take a look at the documentation here; you simply need to do:
r = conn.getresponse()
print(r.status, r.reason)
Update: if you want (as said in the comments) to check an HTTP connection, you can use an HTTPConnection and read the status:
import http.client
conn = http.client.HTTPConnection("docs.python.org")
conn.request("GET", "/")
r1 = conn.getresponse()
print(r1.status, r1.reason)
If the website is correctly configured to enforce HTTPS, you should not get a status code 200 over plain HTTP; in this example, you'll get a 301 Moved Permanently response, which means the request was redirected, in this case rewritten to HTTPS.
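If the goal is to test HTTPS support directly (as in the question), a minimal sketch is to open an HTTPSConnection instead and read the status the same way:
import http.client

conn = http.client.HTTPSConnection("docs.python.org", timeout=5)
conn.request("HEAD", "/")
r = conn.getresponse()
print(r.status, r.reason)  # e.g. 200 OK when the site serves HTTPS
conn.close()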

POST method in Python: errno 104

I am trying to query a website in Python. I need to use a POST method (according to what is happening in my browser when I monitor it with the developer tools).
If I query the website with cURL, it works well:
curl -i --data "param1=var1&param2=var2" http://www.test.com
I get this header:
HTTP/1.1 200 OK
Date: Tue, 26 Sep 2017 08:46:18 GMT
Server: Apache/1.3.33 (Unix) mod_gzip/1.3.26.1a mod_fastcgi/2.4.2 PHP/4.3.11
Transfer-Encoding: chunked
Content-Type: text/html
But when I do it in Python 3, I get an error 104.
Here is what I have tried so far. First, with urllib (taking inspiration from this thread to use a POST method instead of GET):
import re
from urllib import request as ur
# URL to handle request
url = "http://www.test.com"
data = "param1=var1&param2=var2"
# Build a request dictionary
preq = [re.findall("[^=]+", i) for i in re.findall("[^\&]+", data)]
dreq = {i[0]: i[1] if len(i) == 2 else "" for i in preq}
# Initiate request & add method
ureq = ur.Request(url)
ureq.get_method = lambda: "POST"
# Send request
req = ur.urlopen(ureq, data=str(dreq).encode())
I did basically the same with requests:
import re
import requests
# URL to handle request
url = "http://www.test.com"
data = "param1=var1&param2=var2"
# Build a request dictionary
preq = [re.findall("[^=]+", i) for i in re.findall("[^\&]+", data)]
dreq = {i[0]: i[1] if len(i) == 2 else "" for i in preq}
# Initiate request & add method
req = requests.post(url, data=dreq)
In both cases, I get an errno 104 error:
ConnectionResetError: [Errno 104] Connection reset by peer
I don't understand this, since the same request works with cURL. I guess I misunderstood something about Python requests, but so far I'm stuck. Any hint would be appreciated!
I've just figured out that I did not pass the data in the right format. I thought it needed to be stored in a dict; that is not the case, and it is therefore much simpler than what I tried previously.
With urllib:
req = ur.urlopen(ureq, data=str(data).encode())
With requests:
req = requests.post(url, data=data)
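For reference, a minimal sketch of the two ways requests can send a form body (the URL and parameters are the placeholders from the question); data= accepts either a pre-encoded string or a dict that requests will urlencode for you:
import requests

url = "http://www.test.com"  # placeholder from the question

# Send the body exactly as given, already urlencoded:
r1 = requests.post(url, data="param1=var1&param2=var2")

# Or let requests urlencode a dict:
r2 = requests.post(url, data={"param1": "var1", "param2": "var2"})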

How to pass proxy-authentication (requires digest auth) by using python requests module

I was using the Mechanize module a while ago, and now I am trying to use the Requests module.
(Python mechanize doesn't work when HTTPS and Proxy Authentication required)
I have to go through a proxy server when I access the Internet.
The proxy server requires authentication. I wrote the following code.
import requests
from requests.auth import HTTPProxyAuth
proxies = {"http":"192.168.20.130:8080"}
auth = HTTPProxyAuth("username", "password")
r = requests.get("http://www.google.co.jp/", proxies=proxies, auth=auth)
The above code works well when the proxy server requires basic authentication.
Now I want to know what I have to do when the proxy server requires digest authentication.
HTTPProxyAuth does not seem to be effective for digest authentication (r.status_code returns 407).
No need to implement your own! In most cases, Requests has built-in support for proxies with basic authentication:
proxies = { 'https' : 'https://user:password@proxyip:port' }
r = requests.get('https://url', proxies=proxies)
See more in the docs.
Or, in case you need digest authentication, HTTPDigestAuth may help.
Or you might need to try to extend it like yutaka2487 did below.
Note: you must use the IP of the proxy server, not its name!
I wrote a class that can be used for proxy authentication (based on digest auth).
I borrowed almost all of the code from requests.auth.HTTPDigestAuth.
import requests
import requests.auth

class HTTPProxyDigestAuth(requests.auth.HTTPDigestAuth):
    def handle_407(self, r):
        """Takes the given response and tries digest-auth, if needed."""
        num_407_calls = r.request.hooks['response'].count(self.handle_407)
        s_auth = r.headers.get('Proxy-authenticate', '')
        if 'digest' in s_auth.lower() and num_407_calls < 2:
            self.chal = requests.auth.parse_dict_header(s_auth.replace('Digest ', ''))
            # Consume content and release the original connection
            # to allow our new request to reuse the same one.
            r.content
            r.raw.release_conn()
            r.request.headers['Authorization'] = self.build_digest_header(r.request.method, r.request.url)
            r.request.send(anyway=True)
            _r = r.request.response
            _r.history.append(r)
            return _r
        return r

    def __call__(self, r):
        if self.last_nonce:
            r.headers['Proxy-Authorization'] = self.build_digest_header(r.method, r.url)
        r.register_hook('response', self.handle_407)
        return r
Usage:
proxies = {
    "http" : "192.168.20.130:8080",
    "https": "192.168.20.130:8080",
}
auth = HTTPProxyDigestAuth("username", "password")

# HTTP
r = requests.get("http://www.google.co.jp/", proxies=proxies, auth=auth)
r.status_code  # 200 OK

# HTTPS
r = requests.get("https://www.google.co.jp/", proxies=proxies, auth=auth)
r.status_code  # 200 OK
I've written a Python module (available here) which makes it possible to authenticate with an HTTP proxy using the digest scheme. It works when connecting to HTTPS websites (through monkey patching) and allows authenticating with the website as well. This should work with the latest requests library for both Python 2 and 3.
The following example fetches the webpage https://httpbin.org/ip through the HTTP proxy 1.2.3.4:8080, which requires HTTP digest authentication, using user name user1 and password password1:
import requests
from requests_digest_proxy import HTTPProxyDigestAuth
s = requests.Session()
s.proxies = {
    'http': 'http://1.2.3.4:8080/',
    'https': 'http://1.2.3.4:8080/'
}
s.auth = HTTPProxyDigestAuth('user1', 'password1')
print(s.get('https://httpbin.org/ip').text)
Should the website require some kind of HTTP authentication, it can be specified to the HTTPProxyDigestAuth constructor this way:
# HTTP Basic authentication for website
s.auth = HTTPProxyDigestAuth(('user1', 'password1'),
                             auth=requests.auth.HTTPBasicAuth('user1', 'password0'))
print(s.get('https://httpbin.org/basic-auth/user1/password0').text)

# HTTP Digest authentication for website
s.auth = HTTPProxyDigestAuth(('user1', 'password1'),
                             auth=requests.auth.HTTPDigestAuth('user1', 'password0'))
print(s.get('https://httpbin.org/digest-auth/auth/user1/password0').text)
This snippet works for both types of requests (http and https). Tested on the current version of requests (2.23.0).
import re
import requests
from requests.utils import get_auth_from_url
from requests.auth import HTTPDigestAuth
from requests.utils import parse_dict_header
from urllib3.util import parse_url


def get_proxy_autorization_header(proxy, method):
    username, password = get_auth_from_url(proxy)
    auth = HTTPProxyDigestAuth(username, password)
    proxy_url = parse_url(proxy)
    proxy_response = requests.request(method, proxy_url, auth=auth)
    return proxy_response.request.headers['Proxy-Authorization']


class HTTPSAdapterWithProxyDigestAuth(requests.adapters.HTTPAdapter):
    def proxy_headers(self, proxy):
        headers = {}
        proxy_auth_header = get_proxy_autorization_header(proxy, 'CONNECT')
        headers['Proxy-Authorization'] = proxy_auth_header
        return headers


class HTTPAdapterWithProxyDigestAuth(requests.adapters.HTTPAdapter):
    def proxy_headers(self, proxy):
        return {}

    def add_headers(self, request, **kwargs):
        proxy = kwargs['proxies'].get('http', '')
        if proxy:
            proxy_auth_header = get_proxy_autorization_header(proxy, request.method)
            request.headers['Proxy-Authorization'] = proxy_auth_header


class HTTPProxyDigestAuth(requests.auth.HTTPDigestAuth):
    def init_per_thread_state(self):
        # Ensure state is initialized just once per-thread
        if not hasattr(self._thread_local, 'init'):
            self._thread_local.init = True
            self._thread_local.last_nonce = ''
            self._thread_local.nonce_count = 0
            self._thread_local.chal = {}
            self._thread_local.pos = None
            self._thread_local.num_407_calls = None

    def handle_407(self, r, **kwargs):
        """
        Takes the given response and tries digest-auth, if needed.
        :rtype: requests.Response
        """
        # If response is not 407, do not auth
        if r.status_code != 407:
            self._thread_local.num_407_calls = 1
            return r

        s_auth = r.headers.get('proxy-authenticate', '')
        if 'digest' in s_auth.lower() and self._thread_local.num_407_calls < 2:
            self._thread_local.num_407_calls += 1
            pat = re.compile(r'digest ', flags=re.IGNORECASE)
            self._thread_local.chal = requests.utils.parse_dict_header(
                pat.sub('', s_auth, count=1))

            # Consume content and release the original connection
            # to allow our new request to reuse the same one.
            r.content
            r.close()
            prep = r.request.copy()
            requests.cookies.extract_cookies_to_jar(prep._cookies, r.request, r.raw)
            prep.prepare_cookies(prep._cookies)

            prep.headers['Proxy-Authorization'] = self.build_digest_header(prep.method, prep.url)
            _r = r.connection.send(prep, **kwargs)
            _r.history.append(r)
            _r.request = prep
            return _r

        self._thread_local.num_407_calls = 1
        return r

    def __call__(self, r):
        # Initialize per-thread state, if needed
        self.init_per_thread_state()
        # If we have a saved nonce, skip the 407
        if self._thread_local.last_nonce:
            r.headers['Proxy-Authorization'] = self.build_digest_header(r.method, r.url)
        r.register_hook('response', self.handle_407)
        self._thread_local.num_407_calls = 1
        return r
session = requests.Session()
session.proxies = {
    'http': 'http://username:password@proxyhost:proxyport',
    'https': 'http://username:password@proxyhost:proxyport'
}
session.trust_env = False
session.mount('http://', HTTPAdapterWithProxyDigestAuth())
session.mount('https://', HTTPSAdapterWithProxyDigestAuth())

response_http = session.get("http://ww3.safestyle-windows.co.uk/the-secret-door/")
print(response_http.status_code)

response_https = session.get("https://stackoverflow.com/questions/13506455/how-to-pass-proxy-authentication-requires-digest-auth-by-using-python-requests")
print(response_https.status_code)
Generally, the problem of proxy authorization is also relevant for other types of authentication (NTLM, Kerberos) when connecting over HTTPS. And despite the large number of issues (since 2013, and maybe there are earlier ones that I did not find):
in requests: Digest Proxy Auth, NTLM Proxy Auth, Kerberos Proxy Auth
in urllib3: NTLM Proxy Auth, NTLM Proxy Auth
and many, many others, the problem is still not resolved.
The root of the problem is in the _tunnel function of the httplib (Python 2) / http.client (Python 3) module. In case of an unsuccessful connection attempt, it raises an OSError without returning a response code (407 in our case) or the additional data needed to build the authorization header. Lukasa gave an explanation here.
As long as there is no solution from the maintainers of urllib3 (or requests), we can only use various workarounds (for example, use the approach of @Tey' or do something like this). In my version of the workaround, we pre-prepare the necessary authorization data by sending a request to the proxy server and processing the received response.
You can use digest authentication with requests.auth.HTTPDigestAuth instead of requests.auth.HTTPProxyAuth.
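For reference, a minimal sketch of that suggestion, using the placeholder proxy and credentials from the question. Note that HTTPDigestAuth responds to 401 challenges from the origin server rather than 407 challenges from the proxy, which is why the other answers here subclass it:
import requests
from requests.auth import HTTPDigestAuth

proxies = {"http": "http://192.168.20.130:8080"}
r = requests.get("http://www.google.co.jp/", proxies=proxies,
                 auth=HTTPDigestAuth("username", "password"))
print(r.status_code)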
For those of you who still end up here: there appears to be a project called requests-toolbelt that has this, plus other common but not built-in functionality, for requests.
https://toolbelt.readthedocs.org/en/latest/authentication.html#httpproxydigestauth
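A hedged sketch of how that class is used (the import path follows the toolbelt documentation linked above; the proxy address and credentials are the placeholders from the question):
import requests
from requests_toolbelt.auth.http_proxy_digest import HTTPProxyDigestAuth

proxies = {"http": "http://192.168.20.130:8080",
           "https": "http://192.168.20.130:8080"}
auth = HTTPProxyDigestAuth("username", "password")
r = requests.get("http://www.google.co.jp/", proxies=proxies, auth=auth)
print(r.status_code)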
This works for me. Actually, I don't know about the security of user:password in this solution:
import requests
import os

http_proxyf = 'http://user:password@proxyip:port'
os.environ["http_proxy"] = http_proxyf
os.environ["https_proxy"] = http_proxyf

sess = requests.Session()
# maybe need sess.trust_env = True
print(sess.get('https://some.org').text)
import requests
import os

# in my case I had to add my local domain
proxies = {
    'http': 'proxy.myagency.com:8080',
    'https': 'user@localdomain:password@proxy.myagency.com:8080',
}

r = requests.get('https://api.github.com/events', proxies=proxies)
print(r.text)
Here is an answer that is not for HTTP Basic authentication - for example, a transparent proxy within an organization.
import requests
url = 'https://someaddress-behindproxy.com'
params = {'apikey': '123456789'} #if you need params
proxies = {'https': 'https://proxyaddress.com:3128'} #or some other port
response = requests.get(url, proxies=proxies, params=params)
I hope this helps someone.

Python Requests - Is it possible to receive a partial response after an HTTP POST?

I am using the Python Requests module to data-mine a website. As part of the data mining, I have to HTTP POST a form and check whether it succeeded by checking the resulting URL. My question is: after the POST, is it possible to ask the server not to send the entire page? I only need to check the URL, yet my program downloads the entire page and consumes unnecessary bandwidth. The code is very simple:
import requests
r = requests.post(URL, payload)
if 'keyword' in r.url:
    success
else:
    fail
An easy solution, if it's feasible for you, is to go low-level and use the socket library.
For example, say you need to send a POST with some data in its body. I used this in my crawler for one site.
import socket
from urllib.parse import quote  # POST body is URL-escaped, use quote

req_header = ("POST /{0} HTTP/1.1\r\n"
              "Host: www.yourtarget.com\r\n"
              "User-Agent: For the lulz..\r\n"
              "Content-Type: application/x-www-form-urlencoded; charset=UTF-8\r\n"
              "Content-Length: {1}")
req_body = quote("data1=yourtestdata&data2=foo&data3=bar=")
req_url = "test.php"
header = req_header.format(req_url, str(len(req_body)))  # plug in req_url as {0} and the length of req_body as Content-Length

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)  # create a socket
s.connect(("www.yourtarget.com", 80))                   # connect it
# send header + two CRLFs + body + two CRLFs to complete the request
s.send((header + "\r\n\r\n" + req_body + "\r\n\r\n").encode("utf-8"))

page = ""
while True:
    buf = s.recv(1024)  # receive up to 1024 bytes; this should be enough to get the header in one try
    if not buf:
        break
    page += buf.decode("utf-8", errors="replace")
    if "\r\n\r\n" in page:  # if we received the whole header (ending with 2x CRLF), stop
        break
s.close()  # close the socket here, which should close the TCP connection even if data is still flowing in
# This should leave you with a header where you should find a 302 redirect and
# your target URL in the "Location:" header.
There's a chance the site uses the Post/Redirect/Get (PRG) pattern. If so, then it's enough not to follow the redirect and to read the Location header from the response.
Example
>>> import requests
>>> response = requests.get('http://httpbin.org/redirect/1', allow_redirects=False)
>>> response.status_code
302
>>> response.headers['location']
'http://httpbin.org/get'
If you need more information on what you would get if you had followed the redirection, then you can use HEAD on the URL given in the Location header.
Example
>>> import requests
>>> response = requests.get('http://httpbin.org/redirect/1', allow_redirects=False)
>>> response.status_code
302
>>> response.headers['location']
'http://httpbin.org/get'
>>> response2 = requests.head(response.headers['location'])
>>> response2.status_code
200
>>> response2.headers
{'date': 'Wed, 07 Nov 2012 20:04:16 GMT', 'content-length': '352', 'content-type':
'application/json', 'connection': 'keep-alive', 'server': 'gunicorn/0.13.4'}
It would help if you gave some more data, for example, a sample URL that you're trying to request. That being said, it seems to me that generally you're checking if you had the correct URL after your POST request using the following algorithm relying on redirection or HTTP 404 errors:
if original_url == returned request url:
    correct url to a correctly made request
else:
    wrong url and a wrongly made request
If this is the case, what you can do here is use the HTTP HEAD request (another type of HTTP request like GET, POST, etc.) in Python's requests library to get only the header and not also the page body. Then, you'd check the response code and redirection url (if present) to see if you made a request to a valid URL.
For example:
import requests

def attempt_url(url):
    '''Checks the url to see if it is valid, or returns a redirect or error.
    Returns True if valid, False otherwise.'''
    r = requests.head(url)
    if r.status_code == 200:
        return True
    elif r.status_code in (301, 302):
        if r.headers['location'] == url:
            return True
        else:
            return False
    elif r.status_code == 404:
        return False
    else:
        raise Exception("A status code we haven't prepared for has arisen!")
If this isn't quite what you're looking for, additional detail on your requirements would help. At the very least, this gets you the status code and headers without pulling all of the page data.
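For example, against httpbin (hypothetical URLs, just to illustrate the expected return values):
print(attempt_url("http://httpbin.org/status/200"))  # True
print(attempt_url("http://httpbin.org/status/404"))  # False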
