I'm attending an online Python course for beginners. The content of a unit is to teach students to extract all links in the source code of a webpage. The code is as follows, with Block_of_Code unknown:
def get_page(url):
    <Block_of_Code>

def get_next_target(page):
    start_link = page.find('<a href=')
    if start_link == -1:
        return None, 0
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote + 1:end_quote]
    return url, end_quote

def print_all_links(page):
    while True:
        url, endpos = get_next_target(page)
        if url:
            print(url)
            page = page[endpos:]
        else:
            break

print_all_links(get_page('https://youtube.com'))
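The link-extraction part can be checked independently of the network problem by running it on a literal HTML string. The following is a minimal sketch of that idea: get_all_links and sample_page are illustrative names that are not part of the course code, and the function collects the links into a list instead of printing them.

```python
def get_next_target(page):
    """Return (url, end_quote) for the first link in page, or (None, 0)."""
    start_link = page.find('<a href=')
    if start_link == -1:
        return None, 0
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    return page[start_quote + 1:end_quote], end_quote


def get_all_links(page):
    """Collect every link into a list so the result can be inspected."""
    links = []
    while True:
        url, endpos = get_next_target(page)
        if not url:
            break
        links.append(url)
        # Continue searching after the closing quote of the link just found.
        page = page[endpos:]
    return links


# Hypothetical sample page, used here in place of get_page(url).
sample_page = (
    '<p><a href="https://example.com/a">A</a> and '
    '<a href="https://example.com/b">B</a></p>'
)
print(get_all_links(sample_page))
```

If this prints both example.com links, the remaining problem is confined to get_page.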
If I were not in China, the Block_of_Code would not be a problem for me. As far as I know, it could be:
import urllib.request
return urllib.request.urlopen(url).read().decode('utf-8')
But here in China, certain websites (youtube included) are blocked. So the above code doesn't apply to them.
My goal for Block_of_Code is to get the source code of any website, whether blocked or not.
I have searched on Google and found some code that uses a SOCKS proxy, but none of it worked. For example, I wrote and tried the following code based on this article (after running pip install PySocks).
import socket
import socks
import urllib.request
socks.set_default_proxy(socks.SOCKS5, "127.0.0.1", 2012)
socket.socket = socks.socksocket
return urllib.request.urlopen(url).read().decode('utf-8')
The error message is:
ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host
The reason I searched for code using a SOCKS proxy is that I have always used a SOCKS proxy service to visit blocked websites. By launching an app provided by my service provider, I am able to visit those websites in a web browser such as Firefox. (My SOCKS proxy port is 2012.)
Nevertheless, any kind of solution is welcome, whether it is socks proxy or not, as long as it will enable me to get the source of any page.
I'm using Python 3.6.3 on Windows 10.
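One possible shape for Block_of_Code is sketched below. It assumes the requests library with SOCKS support is installed (pip install requests[socks]) and that the proxy app listens on 127.0.0.1:2012, as described above. The socks5h scheme asks the proxy to resolve host names, which may matter if local DNS lookups for blocked sites are unreliable; whether this removes the ConnectionResetError in this particular setup is not certain. socks_proxies is a hypothetical helper name.

```python
def socks_proxies(host="127.0.0.1", port=2012, remote_dns=True):
    """Build a proxies mapping in the format accepted by requests."""
    # "socks5h" lets the proxy resolve host names; "socks5" resolves them locally.
    scheme = "socks5h" if remote_dns else "socks5"
    address = f"{scheme}://{host}:{port}"
    return {"http": address, "https": address}


def get_page(url, proxies=None, timeout=10):
    """Fetch the page source through the SOCKS proxy (assumed to be running)."""
    # Third-party import kept local so socks_proxies() is usable without it.
    import requests

    response = requests.get(url, proxies=proxies or socks_proxies(), timeout=timeout)
    response.raise_for_status()
    return response.text


print(socks_proxies())
```

With the proxy app running, print_all_links(get_page('https://youtube.com')) would then use this get_page.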
Related
I've been working through a LinkedIn Learning course trying to learn some Python, but I've run into a problem that's stopped my progress. I'm trying to work with JSONs and pull data from a website, but I keep getting an error saying that "A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host failed to respond".
I'm using VSCode and have tried on both my work's network (which is heavily restricted, though not to this webpage for browsing) and on my home network. Is there some sort of network permission that would be stopping access? I experienced the same issue when trying to complete an API training course that used the OpenNotify API.
This is the code I'm trying to use.
import urllib.request

def main():
    webUrl = urllib.request.urlopen("https://www.google.com")
    print("result code: " + str(webUrl.getcode()))

if __name__ == "__main__":
    main()
As ping works but telnet to port 80 does not, outgoing HTTP traffic on port 80 appears to be blocked for your machine. I assume that your browser's HTTP connection goes through a proxy (as browsing works, how else would you read stackoverflow?). You need to add some code to your Python program that handles the proxy.
You can take a look here for more details.
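If a proxy is indeed involved, urllib can be pointed at it explicitly. The following is a minimal sketch that uses a placeholder proxy address: proxy.example.com:8080 is hypothetical, and the real host and port would come from your network or browser settings. make_proxy_opener is an illustrative name.

```python
import urllib.request


def make_proxy_opener(proxy_url):
    """Build an opener that sends HTTP and HTTPS requests through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


# Hypothetical proxy address; replace with the one your network actually uses.
opener = make_proxy_opener("http://proxy.example.com:8080")

# The request is left commented out because the proxy above is a placeholder:
# response = opener.open("https://www.google.com", timeout=10)
# print("result code: " + str(response.getcode()))
```

Calling urllib.request.install_opener(opener) would make plain urllib.request.urlopen calls use the same proxy.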
But why don't you try the requests library? It is straightforward and easy to use.
Here's an example:
>>> import requests
>>> r = requests.get('https://api.github.com/user', auth=('user', 'pass'))
>>> r.status_code
200
>>> r.headers['content-type']
'application/json; charset=utf8'
>>> r.encoding
'utf-8'
>>> r.text
'{"type":"User"...'
>>> r.json()
{'private_gists': 419, 'total_private_repos': 77, ...}
You can start using it by running pip install requests, and here's the documentation.
Here is the code that I have so far:
import socks
import socket
import requests
import json
socks.setdefaultproxy(proxy_type=socks.PROXY_TYPE_SOCKS5, addr="127.0.0.1", port=9050)
socket.socket = socks.socksocket
data = json.loads(requests.get("http://freegeoip.net/json/").text)
and it works fine. The problem is that when I use a .onion URL, it shows this error:
Failed to establish a new connection: [Errno -2] Name or service not known
After researching a little, I found that although the HTTP request is made over Tor, name resolution still occurs over the clearnet. What is the proper way to have the domain resolved over the Tor network as well, so that I can connect to .onion URLs?
Try to avoid the monkey patching if possible. If you're using a modern version of requests, then you should have this functionality already.
import requests
import json
proxies = {
    'http': 'socks5h://127.0.0.1:9050',
    'https': 'socks5h://127.0.0.1:9050'
}
data = requests.get("http://altaddresswcxlld.onion", proxies=proxies).text
print(data)
It's important to specify the proxies using the socks5h:// scheme so that DNS resolution is handled over SOCKS so Tor can resolve the .onion address properly.
There is a simpler solution for this, but it requires Kali Linux. If you have this OS, you can install the tor service and kalitorify, start the tor service with sudo service tor start, and start kalitorify with sudo kalitorify -t. Your traffic will then be sent through Tor, and you can access .onion sites just as you would normal sites.
I'm trying to execute a few simple Python scripts to retrieve information from the internet. Unfortunately, I'm behind a corporate proxy, which makes this somewhat tricky. So far I have installed CNTLM and configured it to work with Pycharm 1.4. I have configured both such that when I 'Check Connection' to www.google.com in Pycharm using my manual proxy settings, it returns 'Connection Successful'.
However, when I try to run simple scripts from Pycharm, they all seem to time out. Any advice? By way of code example, this will return a 503 response. Thanks!
import requests

URL = "http://google.com"
try:
    response = requests.get(URL)
    print(response)
except Exception as e:
    print("Something went wrong:")
    print(e)
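One thing worth checking is whether the script itself is being sent through CNTLM, since the proxy configured in the IDE settings may apply only to the IDE's own connections. The following is a minimal sketch, assuming CNTLM listens on 127.0.0.1:3128 (an assumption; use the Listen port from your own CNTLM configuration). It sets the standard proxy environment variables, which both urllib and requests consult by default.

```python
import os
import urllib.request

# Assumption: CNTLM's local listener; replace with the port from your configuration.
CNTLM_PROXY = "http://127.0.0.1:3128"

# Both spellings are set because some tools read only one of them.
for name in ("http_proxy", "https_proxy", "HTTP_PROXY", "HTTPS_PROXY"):
    os.environ[name] = CNTLM_PROXY

# Show which proxies urllib now detects.
print(urllib.request.getproxies())
```

A requests.get(URL) call made after these lines would pick up the same settings. Passing proxies={"http": CNTLM_PROXY, "https": CNTLM_PROXY} to requests.get is the explicit alternative.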
I am currently using Python + Mechanize for retrieving pages from a local server. As you can see the code uses "localhost" as a proxy. The proxy is an instance of the Fiddler2 debug proxy. This works exactly as expected. This indicates that my machine can reach the test_box.
import time
import mechanize
url = r'http://test_box.test_domain.com:8000/helloWorldTest.html'
browser = mechanize.Browser()
browser.set_proxies({"http": "127.0.0.1:8888"})
browser.add_password(url, "test", "test1234")
start_timer = time.time()
resp = browser.open(url)
resp.read()
latency = time.time() - start_timer
However, when I remove the browser.set_proxies statement, it stops working. I get the error "urlopen error [Errno 10061] No connection could be made because the target machine actively refused it". The point is that I can access the test_box from my machine with any browser, which also indicates that test_box can be reached from my machine.
My suspicion is that this has something to do with Mechanize trying to guess the proper proxy settings. That is, my browsers are configured to use a web proxy for every domain except test_domain.com. So I suspect that mechanize tries to use the web proxy when it should connect directly.
How can I tell mechanize to NOT guess any proxy settings and instead force it to try to connect directly to the test_box?
Argh, found it out myself. The docstring says:
"To avoid all use of proxies, pass an empty proxies dict."
This fixed the issue.
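In the script above, that would presumably mean calling browser.set_proxies({}) in place of the Fiddler proxy setting. Because mechanize is a third-party package, the sketch below shows the analogous step with the standard library's urllib.request instead, where an empty dict passed to ProxyHandler likewise disables proxy auto-detection. make_direct_opener is an illustrative name.

```python
import urllib.request


def make_direct_opener():
    """Build an opener that ignores environment and system proxy settings."""
    # An empty dict (as opposed to None) tells ProxyHandler not to auto-detect proxies.
    no_proxy_handler = urllib.request.ProxyHandler({})
    return urllib.request.build_opener(no_proxy_handler)


opener = make_direct_opener()

# The request is left commented out because test_box is only reachable
# from the original poster's network:
# resp = opener.open("http://test_box.test_domain.com:8000/helloWorldTest.html")
# resp.read()
```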
I'm trying to access a website with python through tor, but I'm having problems. I started my attempts with this thread and the one referenced in it: How to make urllib2 requests through Tor in Python?
First I tried the original code snippet:
import urllib2
proxy_handler = urllib2.ProxyHandler({"tcp":"http://127.0.0.1:9050"})
opener = urllib2.build_opener(proxy_handler)
urllib2.install_opener(opener)
Then I tried the modified code posted in one of the answers, which people said worked for them. Unfortunately, although the code does download the page, my IP address is still the same:
proxy_support = urllib2.ProxyHandler({"http" : "127.0.0.1:8118"})
opener = urllib2.build_opener(proxy_support)
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
print opener.open('http://www.google.com').read()
I have TOR set up in the standard configuration, per the respective Ubuntu and TOR documentation, and nmap shows the TOR TCP proxy running on port 9050:
9050/tcp open tor-socks
However, my IP address doesn't change when I run either of the above scripts. Is Python not respecting the http environment variables, or is there a code problem that I'm missing?
TOR provides a SOCKS proxy. Since urllib2 can only handle HTTP proxies, you'll have to use a SOCKS implementation.