I am pretty new to Python. I am trying to write a fairly simple web scraper for a project I am working on, and I want to use Tor to change my IP address so that I don't get disconnected from the service I am scraping. Before adding the IP-switching code to my project, I wanted to test it on its own. Here is the code I am testing:
from TorCtl import TorCtl
import urllib2
for i in range(1,51):
    proxy_support = urllib2.ProxyHandler({"http" : "127.0.0.1:8118"} )
    opener = urllib2.build_opener(proxy_support)
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    urllib2.install_opener(opener)
    print "IP " + str(i) + ":"
    print urllib2.urlopen('http://ifconfig.me/ip').read()
    conn = TorCtl.connect(controlAddr="127.0.0.1", controlPort=9051, passphrase="torPass")
    conn.sendAndRecv('signal newnym\r\n')
    conn.close()
When I do this, I get the following error:
IP 1:
Traceback (most recent call last):
File "scrapingTools.py", line 86, in <module>
main()
File "scrapingTools.py", line 76, in main
print urllib2.urlopen('http://ifconfig.me/ip').read()
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 394, in open
response = self._open(req, data)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 412, in _open
'_open', req)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 372, in _call_chain
result = func(*args)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 1199, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 1174, in do_open
raise URLError(err)
urllib2.URLError:
Any help understanding what is going on here would be greatly appreciated.
There is some problem with your proxy configuration.
Your code works without the proxy settings.
I don't know anything about TorCtl, but you're not sending an AUTHENTICATE string, which Tor will expect. It should look something like:
telnet localhost 9051
>> 250 OK
AUTHENTICATE "xxx"
>> 250 OK
signal NEWNYM
>> 250 OK
Note: wait a few seconds for the identity to change.
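The telnet exchange above could be scripted roughly as follows. This is only a sketch in Python 3 (the question itself uses Python 2), assuming the control port is 127.0.0.1:9051 as in the question and that the password contains no quotes or backslashes. The helper names build_newnym_commands, is_ok and request_new_identity are made up for illustration and are not part of TorCtl.

```python
import socket


def build_newnym_commands(password):
    """Return the control-port lines for authenticating and asking for a new identity."""
    # Control-port commands are plain text lines terminated by CRLF.
    # Assumption: the password needs no escaping (no quotes or backslashes).
    return [
        'AUTHENTICATE "{}"\r\n'.format(password).encode("ascii"),
        b"SIGNAL NEWNYM\r\n",
    ]


def is_ok(reply):
    """A reply that starts with 250 means the command was accepted."""
    return reply.startswith(b"250")


def request_new_identity(password, host="127.0.0.1", port=9051, timeout=10):
    """Send AUTHENTICATE and SIGNAL NEWNYM; return True only if both got a 250 reply."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        for command in build_newnym_commands(password):
            conn.sendall(command)
            if not is_ok(conn.recv(1024)):
                return False
    return True
```

You would call request_new_identity("torPass") and then sleep for a few seconds before making the next request, so the new identity has time to take effect.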
I'm using html2text in Python to get the raw text (tags included) of an HTML page from any URL, but I'm getting an error.
My code -
import html2text
import urllib2
proxy = urllib2.ProxyHandler({'http': 'http://<proxy>:<pass>#<ip>:<port>'})
auth = urllib2.HTTPBasicAuthHandler()
opener = urllib2.build_opener(proxy, auth, urllib2.HTTPHandler)
urllib2.install_opener(opener)
html = urllib2.urlopen("http://www.ndtv.com/india-news/this-stunt-for-a-facebook-like-got-the-hyderabad-youth-arrested-740851").read()
print html2text.html2text(html)
The error -
Traceback (most recent call last):
File "t.py", line 8, in <module>
html = urllib2.urlopen("http://www.ndtv.com/india-news/this-stunt-for-a-facebook-like-got-the-hyderabad-youth-arrested-740851").read()
File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen
return _opener.open(url, data, timeout)
File "/usr/lib/python2.7/urllib2.py", line 404, in open
response = self._open(req, data)
File "/usr/lib/python2.7/urllib2.py", line 422, in _open
'_open', req)
File "/usr/lib/python2.7/urllib2.py", line 382, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 1214, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "/usr/lib/python2.7/urllib2.py", line 1184, in do_open
raise URLError(err)
urllib2.URLError: <urlopen error [Errno 110] Connection timed out>
Can anyone explain what I'm doing wrong?
If you don't require SSL, this script in Python 2.7.x should work:
import urllib
url = "http://stackoverflow.com"
f = urllib.urlopen(url)
print f.read()
In Python 3.x, use urllib.request instead of urllib.
That's because urllib2 only exists in Python 2; in Python 3 it was merged into the urllib package.
The http:// prefix in the URL is required.
EDIT: In 2020, you should use the 3rd party module requests. requests can be installed with pip.
import requests
print(requests.get("http://stackoverflow.com").text)
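For the Python 3 standard-library route mentioned above, a minimal sketch might look like this. The helper names has_http_scheme and fetch are made up for illustration, and decoding the body as UTF-8 is an assumption, since a real page may declare a different charset.

```python
from urllib.parse import urlparse
from urllib.request import urlopen


def has_http_scheme(url):
    """True if the URL carries an explicit http:// or https:// scheme."""
    return urlparse(url).scheme in ("http", "https")


def fetch(url, timeout=10):
    """Fetch a page with Python 3's urllib.request and return it as text."""
    if not has_http_scheme(url):
        raise ValueError("URL must start with http:// or https://: %r" % url)
    with urlopen(url, timeout=timeout) as response:
        # Assumption: the page is UTF-8; undecodable bytes are replaced.
        return response.read().decode("utf-8", errors="replace")
```

With this, print(fetch("http://stackoverflow.com")) plays the role of the Python 2 snippet, while fetch("stackoverflow.com") fails early with a clear message instead of a less obvious urllib error.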
I'm trying to do something simple: fetch the page at "http://search.jd.com/Search?keyword=%E5%A5%87%E7%9F%B3&enc=utf-8".
My Python code is:
# -*- coding: utf-8 -*-
import sys, codecs
import urllib, urllib2
url = "http://search.jd.com/Search?keyword=%E5%A5%87%E7%9F%B3&enc=utf-8"
print url
page=urllib2.urlopen(url).read()
print page
However, I get:
Traceback (most recent call last):
File "tmp.py", line 15, in <module>
page=urllib2.urlopen(url).read()
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 127, in urlopen
return _opener.open(url, data, timeout)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 404, in open
response = self._open(req, data)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 422, in _open
'_open', req)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 382, in _call_chain
result = func(*args)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 1214, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 1184, in do_open
raise URLError(err)
urllib2.URLError: <urlopen error [Errno 8] nodename nor servname provided, or not known>
Can anyone tell me what's going on?
Many thanks!
Sounds like it might be a network issue. Check that you have a stable internet connection (e.g. by pinging an appropriate server continuously as you run the tests). I just ran the code you posted and it worked perfectly fine for me.
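"[Errno 8] nodename nor servname provided, or not known" is what a failed host name lookup looks like, so besides pinging you could check name resolution directly. Here is a rough Python 3 sketch; hostname_of and can_resolve are hypothetical helper names, not something from urllib2.

```python
import socket
from urllib.parse import urlparse


def hostname_of(url):
    """Extract the host part of a URL, e.g. 'search.jd.com'."""
    return urlparse(url).hostname


def can_resolve(hostname):
    """Return True if the system resolver can look up the host name."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        # The same kind of lookup failure that urllib2 reports as Errno 8.
        return False
```

If can_resolve(hostname_of(url)) comes back False while other sites resolve fine, the problem is more likely DNS or the network than the scraping code.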
Your code works fine for me too.
However, the error could occur if the URL contains characters such as "+=#"; in that case it needs to be quoted first:
s = "http://search.jd.com/Search?keyword=%E5%A5%87%E7%9F%B3&enc=utf-8"
# keep the URL delimiters and existing %-escapes intact; only unsafe characters get quoted
my_url = urllib2.quote(s.encode("utf8"), safe=":/?&=%")
page = urllib2.urlopen(my_url).read()
print page
Alternatively, you could use requests:
import requests
response = requests.get(url)
print response.content
or
print response.text
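One thing to watch with quoting: with the default safe characters, quoting a whole URL also escapes the ":" after "http", so it is usually safer to encode only the query values. A small Python 3 sketch of that idea, where build_search_url is a made-up helper name and the parameter names are taken from the URL in the question:

```python
from urllib.parse import quote, urlencode


def build_search_url(base, params):
    """Percent-encode the query parameters and append them to the base URL."""
    return base + "?" + urlencode(params)


# "\u5947\u77f3" is the keyword from the question, written with escapes
# so the snippet does not depend on the source file encoding.
keyword = "\u5947\u77f3"
print(quote(keyword))  # prints %E5%A5%87%E7%9F%B3
print(build_search_url("http://search.jd.com/Search",
                       {"keyword": keyword, "enc": "utf-8"}))
```

The second print should reproduce the URL from the question, and the scheme and host are left untouched because only the parameters go through the encoder.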
It's a network issue. Please make sure you are on a working internet connection.
I am trying to use proxy IP addresses in Selenium for web scraping. I am running Python 2.7.3 on Mac OS X 10.7.5 and I have the following Python code:
import urllib2
from selenium import webdriver
fileproxylist = open('proxylist.txt', 'r')
proxyList = fileproxylist.readlines()
indexproxy = 0
totalproxy = len(proxyList)
def get_source_html_proxy(url, proxip):
    proxyip=urllib2.ProxyHandler({'http':proxip})
    opener = urllib2.build_opener(proxyip)
    urllib2.install_opener(opener)
    req=urllib2.Request(url)
    sock=urllib2.urlopen(req)
    data = sock.read()
    return data
browser = webdriver.Chrome()
browser.get(get_source_html_proxy(MyUrl,proxyList[0]))
where MyUrl is the URL of the page I want to scrape and proxyList[0] is the IP address I want to scrape from instead of the IP address of my local machine. When I run this code, I get the following error:
Traceback (most recent call last):
File "Scrape.py", line 89, in <module>
browser.get(get_source_html_proxy(MyUrl,proxyList[0]))
File "Scrape.py", line 83, in get_source_html_proxy
sock=urllib2.urlopen(req)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 400, in open
response = self._open(req, data)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 418, in _open
'_open', req)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 378, in _call_chain
result = func(*args)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 1207, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 1177, in do_open
raise URLError(err)
urllib2.URLError: <urlopen error [Errno 8] nodename nor servname provided, or not known>
I'm not sure what the issue is here. Can someone help me figure out what's going on? Thanks!
When I try to crawl Twitter using this code:
import urllib2
s = "https://mobile.twitter.com/bing/"
html = urllib2.urlopen(s).read()
print html
... I get the following error:
Traceback (most recent call last):
File "C:\Users\arpit\Downloads\Desktop\Wiki Code\final Crawler_wiki.py", line 14, in <module>
html = urllib2.urlopen(s).read()
File "C:\Python27\lib\urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
File "C:\Python27\lib\urllib2.py", line 400, in open
response = self._open(req, data)
File "C:\Python27\lib\urllib2.py", line 418, in _open
'_open', req)
File "C:\Python27\lib\urllib2.py", line 378, in _call_chain
result = func(*args)
File "C:\Python27\lib\urllib2.py", line 1215, in https_open
return self.do_open(httplib.HTTPSConnection, req)
File "C:\Python27\lib\urllib2.py", line 1177, in do_open
raise URLError(err)
URLError: <urlopen error [Errno 10061] No connection could be made because the target machine actively refused it>
If I replace mobile.twitter.com with twitter.com then it works, but I want it to work with mobile.twitter.com.
The Twitter site is probably looking for a browser-like User-Agent header, which urllib2 does not send by default.
You will likely need to use something like mechanize to fake your user agent.
However, I highly suggest you use the Twitter API, which provides a lot of easy ways to work with the data.
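If you'd rather not pull in mechanize, a custom User-Agent can also be set with the standard library alone. A minimal Python 3 sketch, assuming a generic "Mozilla/5.0" string is enough (that is a guess, the site may check for more than this header); build_request and fetch_with_user_agent are hypothetical names:

```python
from urllib.request import Request, urlopen


def build_request(url, user_agent="Mozilla/5.0"):
    """Build a request that carries a browser-like User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch_with_user_agent(url, timeout=10):
    """Fetch the raw response body using the custom header."""
    with urlopen(build_request(url), timeout=timeout) as response:
        return response.read()
```

For example, fetch_with_user_agent("https://mobile.twitter.com/bing/") sends the request with the header set; whether the server then accepts the connection is a separate matter.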
I want to make a simple, stupid Twitter app using the Twitter API.
If I request this page from my browser it does work:
http://search.twitter.com/search.atom?q=hello&rpp=10&page=1
but if I request this page from Python using urllib or urllib2, most of the time it doesn't work:
response = urllib2.urlopen("http://search.twitter.com/search.atom?q=hello&rpp=10&page=1")
and I get this error:
Traceback (most recent call last):
File "twitter.py", line 24, in <module>
response = urllib2.urlopen("http://search.twitter.com/search.atom?q=hello&rpp=10&page=1")
File "/usr/lib/python2.6/urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
File "/usr/lib/python2.6/urllib2.py", line 391, in open
response = self._open(req, data)
File "/usr/lib/python2.6/urllib2.py", line 409, in _open
'_open', req)
File "/usr/lib/python2.6/urllib2.py", line 369, in _call_chain
result = func(*args)
File "/usr/lib/python2.6/urllib2.py", line 1161, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "/usr/lib/python2.6/urllib2.py", line 1136, in do_open
raise URLError(err)
urllib2.URLError: <urlopen error [Errno 110] Connection timed out>
Why?
The code seems alright.
The following worked.
>>> import urllib
>>> import urllib2
>>> user_agent = 'curl/7.21.1 (x86_64-apple-darwin10.4.0) libcurl/7.21.1'
>>> url='http://search.twitter.com/search.atom?q=hello&rpp=10&page=1'
>>> headers = { 'User-Agent' : user_agent }
>>> req = urllib2.Request(url, None, headers)
>>> response = urllib2.urlopen(req)
>>> the_page = response.read()
>>> print the_page
The other possibility is that Twitter actually could not respond. This happens all too often with Twitter.
Did you change the default socket timeout somewhere in your script? Your example code works reliably for me.
It could be your internet connection, or you might try:
import socket
socket.setdefaulttimeout(30)
assuming urllib/urllib2 don't override the socket timeout.
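To make a timeout explicit rather than relying on the global default, something along these lines might help. It is a Python 3 sketch; fetch_or_none is a made-up helper name and the 30 seconds simply mirrors the value above.

```python
import socket
from urllib.error import URLError
from urllib.request import urlopen


def fetch_or_none(url, timeout=30):
    """Return the response body, or None if the request fails or times out."""
    try:
        # A per-call timeout applies to this request only.
        with urlopen(url, timeout=timeout) as response:
            return response.read()
    except (URLError, socket.timeout) as exc:
        print("request failed:", exc)
        return None


# Global fallback for sockets created without an explicit timeout.
socket.setdefaulttimeout(30)
```

Returning None on failure makes it easy to retry a few times when the server is slow to answer.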