Difficulties doing a webpage search with Python through Tor

I am running Python 2.5.1 and Tor 0.2.2.34 on OS X 10.5.
I have checked the SOCKS question, the "Trying to get Tor to work with Python" question, and the "Tor with Python" question, and have tried all of them (and combinations of them) while running Tor in the background, but none have really worked. If I try the "Tor with Python" way (just urllib2), the script runs, but my IP is unchanged when I check it by fetching and printing the source of a what-is-my-IP page through Python in the same way.
This is the script I'm trying to run through Tor:
import re
import socks
import socket

# Route all new sockets through the local proxy before urllib2 is imported
socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, "127.0.0.1", 8118)
socket.socket = socks.socksocket

import urllib2

web_page = "http://www.cartage.org.lb/en/themes/arts/architec/architecturalstructure/LookingforLiminality/LookingforLiminality.htm"
req = urllib2.Request(web_page)
response = urllib2.urlopen(req)
the_page = response.read()

matches = re.findall('Gianni Vattimo', the_page)
if len(matches) > 0:
    print 'RESULTS!'
else:
    print 'There were NO results!'
(the web page is just an example and not my actual target obv.)
When I run this script it just stalls in Terminal indefinitely. As I said, I've tried different variations, changing the port to other suggested values, etc., but nothing has worked. Any suggestions or tested fixes?
Thank you.
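One thing worth checking (an assumption about this setup, not a tested fix): by default Tor itself accepts SOCKS connections on port 9050, while 8118 is the port conventionally used by Privoxy's HTTP proxy, so pointing a SOCKS5 client at 8118 can hang exactly the way you describe. A minimal sketch that targets Tor's default SOCKS port directly:
import socks
import socket

# 9050 is Tor's default SocksPort; adjust if your torrc says otherwise
socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, "127.0.0.1", 9050)
socket.socket = socks.socksocket

import urllib2

# If this prints a Tor exit node's address instead of your own,
# traffic is going through Tor
print urllib2.urlopen("http://icanhazip.com").read()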


Python 3 Read data from URL

I have this minimal 'working' example below that opens a connection to Google every two seconds. When I run the script while I have a working internet connection, I get the Success message; when I then disconnect, I get the Fail message; and when I reconnect, I get Success again. So far, so good.
However, when I start the script while the internet is disconnected, I get the Fail messages, and when I connect later, I never get the Success message. I keep getting this error:
urlopen error [Errno -2] Name or service not known
What is going on?
import urllib2, time

while True:
    try:
        print('Trying')
        response = urllib2.urlopen('http://www.google.com')
        print('Success')
        time.sleep(2)
    except Exception, e:
        print('Fail ' + str(e))
        time.sleep(2)
This happens because the DNS name "www.google.com" cannot be resolved. If there is no internet connection the DNS server is probably not reachable to resolve this entry.
It seems I misread your question the first time. The behaviour you describe is, on Linux, a peculiarity of glibc. It only reads "/etc/resolv.conf" once, when loading. glibc can be forced to re-read "/etc/resolv.conf" via the res_init() function.
One solution would be to wrap the res_init() function and call it before calling getaddrinfo() (which is used indirectly by urllib2.urlopen()).
You might try the following (still assuming you're using Linux):
import ctypes
import urllib2

libc = ctypes.cdll.LoadLibrary('libc.so.6')
res_init = libc.__res_init

# ...

res_init()  # force glibc to re-read /etc/resolv.conf
response = urllib2.urlopen('http://www.google.com')
This might of course be optimized by waiting until "/etc/resolv.conf" is modified before calling res_init().
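A sketch of that optimization (assuming Linux and the res_init wrapper above; the helper name is mine): stat /etc/resolv.conf and call res_init() only when its modification time has changed.
import os

_resolv_mtime = None

def maybe_res_init():
    # Re-read resolver configuration only when /etc/resolv.conf has changed
    global _resolv_mtime
    try:
        mtime = os.stat('/etc/resolv.conf').st_mtime
    except OSError:
        return  # file missing or unreadable; nothing to do
    if mtime != _resolv_mtime:
        _resolv_mtime = mtime
        res_init()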
Another solution would be to install e.g. nscd (name service cache daemon).
For me, it was a proxy problem.
Running the following before import urllib.request helped:
import os
os.environ['http_proxy'] = ''

import urllib.request
response = urllib.request.urlopen('http://www.google.com')

urllib2.URLError when using Quandl for Python behind a proxy

I'm posting this because I tried searching for the answer myself and I was not able to find a solution. I was eventually able to figure out a way to get this to work & I hope this helps someone else in the future.
Scenario:
In Windows XP, I'm using Python with Pandas & Quandl to get data for a US Equity security using the following line of code:
bars = Quandl.get("GOOG/NYSE_SPY", collapse="daily")
Unfortunately, I was getting the following error:
urllib2.URLError: <urlopen error [Errno 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond>
@user3150079: kindly cut and paste your solution as an [Answer]; answering your own question is perfectly fine on Stack Overflow.
Solution:
I recognized that this was an issue with trying to contact a server without properly targeting my network's proxy server. Since I was not able to set the system variable for HTTP_PROXY, I added the following line which corrected the issue:
import os
os.environ['HTTP_PROXY']="10.11.123.456:8080"
Thanks - I'm interested to hear about any improvements to this solution or other suggestions.
You can set your user environment variable HTTP_PROXY if you can't or won't set the system environment variable:
set HTTP_PROXY=10.11.123.456:8080
python yourscript.py
and to permanently set it (using setx from Windows XP Service Pack 2 Support Tools):
setx HTTP_PROXY "10.11.123.456:8080"
python yourscript.py
Other ways to get this environment variable set include registry entries or putting os.environ["HTTP_PROXY"] = "..." in sitecustomize.py.
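For completeness, a sketch of the sitecustomize.py route (the mechanism is standard Python; the address is the placeholder from above). Python imports sitecustomize automatically at interpreter startup, so the variable is set before any of your code runs:
# sitecustomize.py -- put it anywhere on sys.path, e.g. in site-packages
import os

# Set the proxy before any script has a chance to import urllib2/requests
os.environ["HTTP_PROXY"] = "10.11.123.456:8080"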
More control using requests, without using the Quandl package:
import requests

def main():
    proxies = {'http': 'http://proxy.yourdomain.com:port',
               'https': 'http://proxy.yourdomain.com:port'}
    url = 'https://www.quandl.com/api/v3/datasets/GOOG/NYSE_SPY.json?collapse=daily'
    response = requests.get(url, proxies=proxies)
    status = response.status_code
    html_text = response.text
    repo_data = response.json()
    print(repo_data)
    print(status)
    print('HTML TEXT')
    print('=========')
    print(html_text)

if __name__ == '__main__':
    main()

Skip Connection Interruptions (Site & BeautifulSoup)

I'm currently doing this with my script:
Get the body (from the source code) and search for a string; it does this until the string is found (i.e., when the site updates).
However, if the connection is lost, the script stops.
My 'connection' code looks something like this (it keeps repeating in a while loop every 20 seconds):
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
url = ('url')
openUrl = opener.open(url).read()
soup = BeautifulSoup(openUrl)
I've used urllib2 & BeautifulSoup.
Can anyone tell me how I could make the script "freeze" if the connection is lost, check whether the internet connection is alive, and then continue based on the answer? (So, checking whether the script CAN connect, not whether the site is up. If it checks the site directly and can't connect, the script stops with a bunch of errors.)
Thank you!
Found the solution!
So, I need to check the connection every LOOP, before actually doing stuff.
So I created this function:
def check_internet(url="http://www.google.ro"):
    try:
        header = {"pragma": "no-cache"}
        req = urllib2.Request(url, headers=header)
        response = urllib2.urlopen(req, timeout=2)
        return True
    except urllib2.URLError:
        return False
And it works, tested it with my connection down & up!
For the other newbies wondering:
while True:
    conn = check_internet('http://www.google.com')  # the site itself or just Google; just checking for a connection
    try:
        if conn is True:
            # code
            pass
        else:
            # need to make it wait and re-do the while loop
            time.sleep(30)
    except urllib2.URLError:
        # need to wait
        time.sleep(20)
Works perfectly, the script has been running for about 10 hours now and it handles errors perfectly! It also works with my connection off and shows proper messages.
Open to suggestions for optimization!
Rather than "freeze" the script, I would have the script continue to run only if the connection is alive. If it's alive, run your code. If it's not alive, either attempt to reconnect, or halt execution.
while keepRunning:
    if connectionIsAlive():
        run_your_code()
    else:
        reconnect_maybe()
One way to check whether the connection is alive is described here: Checking if a website is up via Python.
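A minimal sketch of what connectionIsAlive() could look like (the function name comes from the pseudocode above; the probe host and timeout are arbitrary choices):
import urllib2

def connectionIsAlive(probe_url='http://www.google.com', timeout=2):
    # True if a known-good host answers, False on any connection error
    try:
        urllib2.urlopen(probe_url, timeout=timeout)
        return True
    except urllib2.URLError:
        return False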
If your program "stops with a bunch of errors" then that is likely because you're not properly handling the situation where you're unable to connect to the site (for various reasons such as you not having internet, their website is down, etc.).
You need to use a try/except block to make sure that you catch any errors that occur because you were unable to open a live connection.
try:
    openUrl = opener.open(url).read()
except urllib2.URLError:
    # something went wrong, how to respond?
    pass
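One possible way to answer that comment, assuming you simply want the loop to survive outages (opener and url are the objects from the question's snippet): sleep and retry on failure instead of letting the exception kill the script.
import time

openUrl = None
while openUrl is None:
    try:
        openUrl = opener.open(url).read()
    except urllib2.URLError:
        # connection problem: wait, then try again
        time.sleep(20)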

Python - Controlling Tor

I'm attempting to control Tor with Python. I've read a couple of the other questions asked about this subject on Stack Overflow, but none of them answer this question.
I'm looking for a method to have Tor give you a 'new identity', a new IP address, when the command is run. I've googled around and found the TorCtl module as a method for controlling Tor, but can't find a way to get a new identity. Here's what I have so far for at least connecting to Tor, but I can't get any farther.
from TorCtl import TorCtl
conn = TorCtl.connect(controlAddr="127.0.0.1", controlPort=9051, passphrase="123")
Any help on this is appreciated; if there are other modules better than TorCtl, that'd be great too! Thank you!
Well, by luck I managed to find a PHP script that did the exact same thing I wanted, and with its help I converted it to work with TorCtl. This is what it looks like, for anyone else needing it in the future!
from TorCtl import TorCtl
conn = TorCtl.connect(controlAddr="127.0.0.1", controlPort=9051, passphrase="123")
TorCtl.Connection.send_signal(conn, "NEWNYM")
You can use similar code in Python:
import socket

def renewTorIdentity(passAuth):
    try:
        s = socket.socket()
        s.connect(('localhost', 9051))
        s.send('AUTHENTICATE "{0}"\r\n'.format(passAuth))
        resp = s.recv(1024)
        if resp.startswith('250'):
            s.send("signal NEWNYM\r\n")
            resp = s.recv(1024)
            if resp.startswith('250'):
                print "Identity renewed"
            else:
                print "response 2:", resp
        else:
            print "response 1:", resp
    except Exception as e:
        print "Can't renew identity:", e
You can check this post for a mini-tutorial
Apparently the stem package works better. You can install Tor on your computer and keep it running in a terminal. Then run the following program:
from stem import Signal
from stem.control import Controller

with Controller.from_port(port=9051) as controller:
    controller.authenticate()
    controller.signal(Signal.NEWNYM)
stem is the official controller library developed by the Tor Project, and you can see their documentation for details.
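One caveat worth adding (this comes from Tor's control-port behaviour, not from the answer above): Tor rate-limits NEWNYM, so signals sent in quick succession are delayed. stem exposes this, roughly like so:
import time

from stem import Signal
from stem.control import Controller

with Controller.from_port(port=9051) as controller:
    controller.authenticate()
    # Tor rate-limits NEWNYM; wait until the signal will actually be honored
    if not controller.is_newnym_available():
        time.sleep(controller.get_newnym_wait())
    controller.signal(Signal.NEWNYM)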

How to detect the current open webbrowser and open new page in that same browser using Python?

I am making a website where I use some HTML forms that pass their values to a Python script, and in return the Python script opens a new page/tab in the web browser. I am using the webbrowser module for this. Although I can choose the default browser or any other browser using webbrowser.get([name]), my concern is that since this will be a public webpage, anyone can open the page in any browser of their choice. The problem I am facing is: let's say my default browser is Firefox, and I open the page in Chrome; when the Python script opens the new page, it opens it in Firefox instead of Chrome.
Here are my questions :
How do I detect the current web browser the user is using?
How to open the new page in that browser?
The code looks like this :
#!C:\Python27\python.exe -u
# -*- coding: UTF-8 -*-

import MySQLdb
import sys
import cgi
import re
import cgitb
import webbrowser

cgitb.enable()

print "Content-Type: text/plain;charset=utf-8"
print

try:
    db = MySQLdb.connect(host="localhost", user="root", passwd="", db="pymysql")
except MySQLdb.Error, e:
    print "Error %d: %s" % (e.args[0], e.args[1])
    sys.exit()

# ----- Do some analysis with the database -----
# ----- Create some kml files -----

# Use the kml files to display points in the map.
# Open the page where openlayers is present
webbrowser.open_new_tab('http://localhost/simulator.html')
The only reason you are convinced this is a working approach is that you are running the server on your local machine. The Python code you are executing is server-side, so it has no control over the client. The client would normally be on a remote machine; in your case, since your client is also the server, you get the effect of seeing your Python script open a browser tab with the webbrowser module.
This is impossible in a standard client-server web situation. The client will be remote, and your server-side code cannot control their machine. You can only serve back HTTP responses, which the client's browser simply receives and renders. If you want to open tabs, it will need to be a JavaScript solution on the client side.
A more realistic solution would be to have your server serve back proper client-side code. If the form is submitted via AJAX, then your response could contain JavaScript that would open a new page, as in the sketch below.
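A hedged sketch of that idea, reusing the CGI setup from the question (the HTML/JS payload is illustrative only; note the Content-Type must be text/html rather than text/plain for the browser to render it):
#!C:\Python27\python.exe -u
# -*- coding: UTF-8 -*-

# Serve HTML back to whichever browser made the request; the visitor's
# own browser executes the script and opens the tab.
print "Content-Type: text/html;charset=utf-8"
print
print """<html>
  <body>
    <script>
      // Runs client-side, in whatever browser the visitor is using
      window.open('http://localhost/simulator.html', '_blank');
    </script>
  </body>
</html>"""
Be aware that pop-up blockers often suppress window.open calls that aren't triggered by a user gesture, so a plain link or an HTTP redirect is usually more reliable.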
Your Python code will run on a server. The person who visits your website will only receive HTML pages. See this about opening new tabs. But please read more about how websites work: the webbrowser module is meant for client-side/desktop scripts, not for websites.
