I am currently using Python + Mechanize to retrieve pages from a local server. As you can see, the code uses "localhost" as a proxy. The proxy is an instance of the Fiddler2 debugging proxy. This works exactly as expected, which indicates that my machine can reach test_box.
import time
import mechanize
url = r'http://test_box.test_domain.com:8000/helloWorldTest.html'
browser = mechanize.Browser()
browser.set_proxies({"http": "127.0.0.1:8888"})
browser.add_password(url, "test", "test1234")
start_timer = time.time()
resp = browser.open(url)
resp.read()
latency = time.time() - start_timer
However, when I remove the browser.set_proxies statement, it stops working. I get the error "<urlopen error [Errno 10061] No connection could be made because the target machine actively refused it>". Yet I can access test_box from my machine with any browser, which also indicates that test_box is reachable from my machine.
My suspicion is that this has something to do with Mechanize trying to guess the proper proxy settings. My browsers are configured to go through a web proxy for every domain except test_domain.com, so I suspect that Mechanize tries to use the web proxy when it should connect directly.
How can I tell mechanize to NOT guess any proxy settings and instead force it to try to connect directly to the test_box?
Argh, found it out myself. The docstring says:
"To avoid all use of proxies, pass an empty proxies dict."
Passing an empty dict to browser.set_proxies fixed the issue.
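As a rough illustration of the same idea outside mechanize: the standard library's `urllib.request.ProxyHandler` also treats an empty dict as "use no proxies", whereas leaving the argument out lets it pick up system or environment proxy settings. The sketch below is an analogue, not mechanize's own code, and `make_direct_opener` is a hypothetical helper name. In mechanize itself the corresponding call, going by the docstring quoted above, would be `browser.set_proxies({})`.

```python
import urllib.request


def make_direct_opener():
    """Build an opener that ignores system and environment proxy settings."""
    # An empty dict (as opposed to None) tells ProxyHandler not to
    # auto-detect proxies, so requests go straight to the target host.
    no_proxy_handler = urllib.request.ProxyHandler({})
    opener = urllib.request.build_opener(no_proxy_handler)
    return opener, no_proxy_handler


opener, handler = make_direct_opener()
print(handler.proxies)  # the proxy mapping this handler will use
```

Calling `opener.open(url)` on the returned opener should then attempt a direct connection instead of going through a guessed proxy.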
I'm taking an online Python course for beginners. One unit teaches students how to extract all the links from the source code of a webpage. The code is as follows, with Block_of_Code unknown:
def get_page(url):
    <Block_of_Code>
def get_next_target(page):
    start_link = page.find('<a href=')
    if start_link == -1:
        return None, 0
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote + 1:end_quote]
    return url, end_quote
def print_all_links(page):
    while True:
        url, endpos = get_next_target(page)
        if url:
            print(url)
            page = page[endpos:]
        else:
            break
print_all_links(get_page('https://youtube.com'))
If I were not in China, Block_of_Code would not be a problem for me. As far as I know, it could be:
import urllib.request
return urllib.request.urlopen(url).read().decode('utf-8')
But here in China, certain websites (YouTube included) are blocked, so the above code doesn't work for them.
My goal for Block_of_Code is to get the source code of any website, whether blocked or not.
I have searched on Google and found some code that uses a SOCKS proxy, but none of it worked. For example, I wrote and tried the following code based on this article (after running pip install PySocks).
import socket
import socks
import urllib.request
socks.set_default_proxy(socks.SOCKS5, "127.0.0.1", 2012)
socket.socket = socks.socksocket
return urllib.request.urlopen(url).read().decode('utf-8')
The error message is:
ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host
I searched for code that uses a SOCKS proxy because I have always used a SOCKS proxy service to visit blocked websites. After launching an app provided by my service provider, I can visit those websites with a web browser such as Firefox. (My SOCKS proxy port is 2012.)
Nevertheless, any kind of solution is welcome, whether it is socks proxy or not, as long as it will enable me to get the source of any page.
I'm using Python 3.6.3 on Windows 10.
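Independently of how `get_page` ends up fetching the page, the link-extraction half of the exercise can also be sketched with the standard library's `html.parser`, which copes with single-quoted attributes and with `href` not being the first attribute (cases that `page.find('<a href=')` would miss). This is a swapped-in technique, not the course's intended solution; `LinkCollector`, `extract_links` and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(page):
    """Return all link targets found in an HTML string."""
    collector = LinkCollector()
    collector.feed(page)
    return collector.links


# Hypothetical sample input, standing in for the result of get_page(url).
sample = '<p><a href="https://example.com/a">A</a> <a class="x" href=\'/b\'>B</a></p>'
print(extract_links(sample))
```

Because `extract_links` takes a plain string, it can be tested on literal HTML like the sample before the fetching problem is solved.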
Here is the code that I have so far:
import socks
import socket
import requests
import json
socks.setdefaultproxy(proxy_type=socks.PROXY_TYPE_SOCKS5, addr="127.0.0.1", port=9050)
socket.socket = socks.socksocket
data = json.loads(requests.get("http://freegeoip.net/json/").text)
It works fine. The problem is that when I use a .onion URL, it shows this error:
Failed to establish a new connection: [Errno -2] Name or service not known
After a little research, I found that although the HTTP request is made over Tor, name resolution still occurs over the clearnet. What is the proper way to have the domain resolved over the Tor network as well, so that I can connect to .onion URLs?
Try to avoid the monkey patching if possible. If you're using a modern version of requests, then you should have this functionality already.
import requests
import json
proxies = {
    'http': 'socks5h://127.0.0.1:9050',
    'https': 'socks5h://127.0.0.1:9050'
}
data = requests.get("http://altaddresswcxlld.onion", proxies=proxies).text
print(data)
It's important to specify the proxies using the socks5h:// scheme so that DNS resolution is handled over SOCKS, which lets Tor resolve the .onion address properly.
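To make the remote-DNS point a little more concrete: the SOCKS5 protocol (RFC 1928) lets a client send the destination either as an IP address or as a domain name, and in the second case the hostname reaches the proxy unresolved. The sketch below only builds the bytes of such a CONNECT request; it does not open a connection and is not how requests implements `socks5h`. `build_connect_request` is a hypothetical helper name.

```python
import struct

SOCKS_VERSION = 5
CMD_CONNECT = 1
ATYP_DOMAIN = 3  # address type "domain name": the proxy performs the lookup


def build_connect_request(hostname, port):
    """Build a SOCKS5 CONNECT request that carries an unresolved hostname."""
    encoded = hostname.encode("ascii")
    if len(encoded) > 255:
        raise ValueError("hostname too long for a SOCKS5 request")
    # VER, CMD, RSV (always 0), ATYP, then a one-byte length prefix.
    header = bytes([SOCKS_VERSION, CMD_CONNECT, 0, ATYP_DOMAIN, len(encoded)])
    # The port follows the name as two bytes in network (big-endian) order.
    return header + encoded + struct.pack(">H", port)


request = build_connect_request("altaddresswcxlld.onion", 80)
print(request)
```

The .onion name appears verbatim in the request, so the local resolver never has to look it up.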
There is a simpler solution, but it requires Kali Linux. If you have this OS, you can install the tor service and kalitorify, start the tor service with sudo service tor start, and start kalitorify with sudo kalitorify -t. Your traffic will then be sent through Tor, and you can access .onion sites as if they were normal sites.
I'm trying to run a few simple Python scripts to retrieve information from the internet. Unfortunately, I'm behind a corporate proxy, which makes this somewhat tricky. So far I have installed CNTLM and configured it to work with PyCharm 1.4. I have configured both so that when I 'Check Connection' to www.google.com in PyCharm using my manual proxy settings, it returns 'Connection Successful'.
However, when I try to run simple scripts from PyCharm, they all seem to time out. Any advice? As a code example, this returns a 503 response. Thanks!
import requests
URL = "http://google.com"
try:
    response = requests.get(URL)
    print(response)
except Exception as e:
    print("Something went wrong:")
    print(e)
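One thing that may be worth ruling out is that the script is not going through CNTLM at all, since the proxy configured in PyCharm's settings is not necessarily passed on to scripts run from it. A minimal sketch of pointing requests at CNTLM explicitly follows. The address `127.0.0.1:3128` is an assumption (use whatever the Listen line of your CNTLM configuration says), and `cntlm_proxies` and `fetch_status` are hypothetical helper names.

```python
# Assumption: CNTLM listens on 127.0.0.1:3128; replace this with the
# address from the "Listen" line of your own CNTLM configuration.
CNTLM_URL = "http://127.0.0.1:3128"


def cntlm_proxies(proxy_url=CNTLM_URL):
    """Return a proxies mapping that routes HTTP and HTTPS through CNTLM."""
    return {"http": proxy_url, "https": proxy_url}


def fetch_status(url, proxies):
    """Fetch url through the given proxies and return the HTTP status code."""
    import requests  # third-party; imported lazily so cntlm_proxies needs nothing extra

    response = requests.get(url, proxies=proxies, timeout=10)
    return response.status_code


print(cntlm_proxies())
```

Calling `fetch_status("http://google.com", cntlm_proxies())` would then show whether the request succeeds once it is routed through the local CNTLM instance.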
I want to use Python to log into a website that uses Microsoft Forefront and retrieve the content of an internal webpage for processing.
I am not new to python but I have not used any URL libraries.
I checked the following posts:
How can I log into a website using python?
How can I login to a website with Python?
How to use Python to login to a webpage and retrieve cookies for later usage?
Logging in to websites with python
I have also tried a couple of modules, such as requests, but I still don't understand how this should be done. Is it enough to enter a username and password, or should I somehow use cookies to authenticate? Any sample code would be much appreciated.
This is the code I have so far:
import requests
NAME = 'XXX'
PASSWORD = 'XXX'
URL = 'https://intra.xxx.se/CookieAuth.dll?GetLogon?curl=Z2F&reason=0&formdir=3'
def main():
    # Start a session so we can have persistent cookies
    session = requests.session()
    # This is the form data that the page sends when logging in
    login_data = {
        'username': NAME,
        'password': PASSWORD,
        'SubmitCreds': 'login',
    }
    # Authenticate
    r = session.post(URL, data=login_data)
    # Try accessing a page that requires you to be logged in
    r = session.get('https://intra.xxx.se/?t=1-2')
    print(r)
main()
However, the above code raises the following exception on the session.post line:
raise ConnectionError(e)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='intra.xxx.se', port=443): Max retries exceeded with url: /CookieAuth.dll?GetLogon?curl=Z2F&reason=0&formdir=3 (Caused by <class 'socket.error'>: [Errno 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond)
UPDATE:
I noticed that I was providing wrong username/password.
Once that was corrected, I get an HTTP 200 response with the above code, but when I try to access any internal site I get an HTTP 401 response. Why is this happening? What is wrong with the above code? Should I be using the cookies somehow?
Forefront TMG can be notoriously fussy about which connections it blocks. The next step is to find out why TMG is blocking your connection attempts.
If you have access to the TMG server, log in to it, start the TMG management user-interface (I can't remember what it is called) and have a look at the logs for failed requests coming from your IP address. Hopefully it should tell you why the connection was denied.
It seems you are attempting to connect to it over an intranet. One way I've seen it block connections is if it receives them from an address it considers to be on its 'internal' network. (TMG has two network interfaces as it is intended to be used between two networks: an internal network, whose resources it protects from threats, and an external network, where threats may come from.) If it receives on its external network interface a request that appears to have come from the internal network, it assumes the IP address has been spoofed and blocks the connection. However, I can't be sure that this is the case as I don't know what this TMG server's internal network is set up as nor whether your machine's IP address is on this internal network.
I'm trying to access a website with Python through Tor, but I'm having problems. I started my attempts with this thread and the one referenced in it: How to make urllib2 requests through Tor in Python?
First I tried the original code snippet:
import urllib2
proxy_handler = urllib2.ProxyHandler({"tcp":"http://127.0.0.1:9050"})
opener = urllib2.build_opener(proxy_handler)
urllib2.install_opener(opener)
Then I tried the modified code posted in one of the answers, which people said worked for them. Unfortunately, although the code does download the page, my IP address is still the same:
proxy_support = urllib2.ProxyHandler({"http" : "127.0.0.1:8118"})
opener = urllib2.build_opener(proxy_support)
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
print opener.open('http://www.google.com').read()
I have TOR set up in the standard configuration, per the Ubuntu and TOR sites' respective documentation, and nmap shows the TOR tcp proxy running on port 9050: 9050/tcp open tor-socks. However, my IP address isn't changed when I run either of the above scripts. Is Python not respecting the http environment variables, or is there a code problem that I'm missing?
TOR provides a SOCKS proxy. Since urllib2 can only handle HTTP proxies, you'll have to use a SOCKS implementation.
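A side note on the first snippet in the question, offered as a sketch and not a full diagnosis: `ProxyHandler` maps URL schemes to proxies, so an entry keyed "tcp" would only ever apply to `tcp://` URLs, and ordinary `http://` requests would bypass it. The check below uses Python 3's `urllib.request`, which took over `ProxyHandler` from urllib2. Fixing the key would still not make an HTTP proxy handler speak SOCKS to port 9050.

```python
import urllib.request  # Python 3 home of what urllib2 provided in Python 2

# The mappings from the two snippets in the question.
tcp_handler = urllib.request.ProxyHandler({"tcp": "http://127.0.0.1:9050"})
http_handler = urllib.request.ProxyHandler({"http": "127.0.0.1:8118"})

# ProxyHandler registers one "<scheme>_open" method per key in the mapping,
# and only those schemes are routed through the proxy.
print(hasattr(tcp_handler, "http_open"))   # is http:// traffic proxied by the "tcp" mapping?
print(hasattr(tcp_handler, "tcp_open"))    # the only scheme that mapping covers
print(hasattr(http_handler, "http_open"))  # the "http" mapping does cover http://
```

This suggests the first script was fetching pages directly, which would be consistent with the unchanged IP address.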