Python requests-html with Tor - python

The requirement is to scrape anonymously, or to change IP after a certain number of calls. I use the https://github.com/kennethreitz/requests-html module to parse the HTML, but I get the error below:
socks.SOCKS5Error: 0x01: General SOCKS server failure
Code
import socks
import socket
import requests_html
socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, addr='127.0.0.1', port=int('9150'))
socket.socket = socks.socksocket
session = requests_html.HTMLSession()
r = session.get('http://icanhazip.com')
r.html.render(sleep=5)
print(r.html.text)
But it works perfectly fine with the requests module:
import socks
import socket
import requests
socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, addr='127.0.0.1', port=int('9150'))
socket.socket = socks.socksocket
print(requests.get("http://icanhazip.com").text)
Any help to solve the issue with requests-html module would be highly appreciated.

Try:
session = requests_html.HTMLSession(browser_args=["--no-sandbox","--proxy-server=127.0.0.1:9150"])
It depends on how your proxy is set up to use Tor, but this worked for me!
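A note on why this differs from patching socket.socket: r.html.render() drives a separate headless Chromium process, so the proxy has to be handed to the browser itself. Below is a minimal sketch of building that argument list. The helper name chromium_proxy_args and the socks5 default are assumptions, not part of the answer above. Chromium does accept a scheme prefix such as socks5:// in --proxy-server, and whether you need it depends on whether your Tor port is reached directly as SOCKS or through an HTTP proxy in front of it.

```python
def chromium_proxy_args(host="127.0.0.1", port=9150, scheme="socks5"):
    """Build Chromium command-line arguments that point the browser at a proxy.

    scheme=None reproduces the bare host:port form used in the answer above.
    """
    proxy = f"{scheme}://{host}:{port}" if scheme else f"{host}:{port}"
    return ["--no-sandbox", f"--proxy-server={proxy}"]


# Hypothetical usage with requests-html (not executed here):
#   session = requests_html.HTMLSession(browser_args=chromium_proxy_args())
print(chromium_proxy_args())
```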

Related

Python General SOCKS server failure when assigning socket.socket

I know similar questions have been asked several times:
General SOCKS server failure with python tor but working from tor browser
General SOCKS server failure when switching identity using stem
General SOCKS server failure while using tor proxy
I checked all related posts and googled a lot, but still got stuck.
I'm on Windows 10. I downloaded Tor Browser, ran it, and confirmed it is listening on 127.0.0.1:9150 by running netstat -aon in an administrator cmd prompt.
Then I run the following example code in Python:
import socks
import socket
socks.set_default_proxy(socks.SOCKS5, "127.0.0.1", 9150)
socket.socket = socks.socksocket
The last line, socket.socket = socks.socksocket, gives this error message:
socks.GeneralProxyError: Socket error: 0x01: General SOCKS server failure
It's supposed to return a socket object which is assigned to socket.socket that opens a socket. Like this example:
https://deshmukhsuraj.wordpress.com/2015/03/08/anonymous-web-scraping-using-python-and-tor/
Can anyone tell me what's wrong?
Thanks.
Update
Thanks to drew010's answer, this code works (with Tor Browser running and its port = 9150):
import requests
proxies = {
'http': 'socks5h://127.0.0.1:9150',
'https': 'socks5h://127.0.0.1:9150'
}
url = 'http://icanhazip.com'
# request without Tor (original IP)
r = requests.get(url)
print(r.text)
# request with Tor (Tor IP)
r = requests.get(url, proxies=proxies)
print(r.text)
# Force change IP
from stem.control import Controller
from stem import Signal
with Controller.from_port(port=9151) as controller:
    controller.authenticate('mypassword')
    controller.signal(Signal.NEWNYM)
# Changed Tor IP
r = requests.get(url, proxies=proxies)
print(r.text)
Note that a control password must be set in torrc beforehand.
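The socks5h scheme in the proxies dictionary above matters: with requests, socks5h asks the proxy to resolve hostnames, while plain socks5 resolves them locally first. A small sketch of a helper that makes the choice explicit follows. The name tor_proxies and its defaults are assumptions for illustration only.

```python
def tor_proxies(host="127.0.0.1", port=9150, remote_dns=True):
    """Build a requests-style proxies mapping for a local Tor SOCKS port.

    remote_dns=True selects socks5h (hostnames are resolved by the proxy);
    remote_dns=False selects socks5 (hostnames are resolved locally).
    """
    scheme = "socks5h" if remote_dns else "socks5"
    proxy_url = f"{scheme}://{host}:{port}"
    return {"http": proxy_url, "https": proxy_url}


# Hypothetical usage (not executed here):
#   requests.get("http://icanhazip.com", proxies=tor_proxies())
print(tor_proxies())
```

SOCKS support in requests is an optional extra, so this assumes it was installed with pip install requests[socks].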
By doing "socket.socket = socks.socksocket" you replace the socket class, so every socket created afterwards is actually a socksocket object. After that you can use regular sockets and they will go through your SOCKS proxy.
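The mechanism described above can be seen with the standard library alone. In this sketch, LoggingSocket is a hypothetical stand-in for socks.socksocket: it is just a socket subclass, and assigning it to socket.socket means later calls to socket.socket(...) build the subclass instead.

```python
import socket


class LoggingSocket(socket.socket):
    """Stand-in for socks.socksocket: a socket subclass with extra behavior."""

    created = 0  # counts how many sockets were built through this class

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        LoggingSocket.created += 1


original_socket = socket.socket
socket.socket = LoggingSocket  # the same kind of assignment as in the question
try:
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    patched = isinstance(s, LoggingSocket)
    s.close()
finally:
    socket.socket = original_socket  # undo the patch so other code is unaffected

print(patched, LoggingSocket.created)
```

One consequence of this design is that code which ran "from socket import socket" before the assignment keeps its reference to the original class, which is why the patch is usually applied before importing the libraries that should use it.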

Scraping web-page data with urllib with headers and proxy

I can already fetch the web-page data, but now I want to fetch it through a proxy. How can I do that?
import urllib.request
import lxml.html as lh  # assumed: lh is lxml.html

def get_main_html():
    # URL and headers are defined elsewhere
    request = urllib.request.Request(URL, headers=headers)
    doc = lh.parse(urllib.request.urlopen(request))
    return doc
From the documentation
urllib will auto-detect your proxy settings and use those. This is through the ProxyHandler, which is part of the normal handler chain when a proxy setting is detected. Normally that’s a good thing, but there are occasions when it may not be helpful. One way to do this is to setup our own ProxyHandler, with no proxies defined. This is done using similar steps to setting up a Basic Authentication handler.
Check this, https://docs.python.org/3/howto/urllib2.html#proxies
Use (Python 2 syntax):
proxies = {'http': 'http://myproxy.example.com:1234'}
print "Using HTTP proxy %s" % proxies['http']
urllib.urlopen("http://yoursite", proxies=proxies)
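The snippet above is written for Python 2. For Python 3, the ProxyHandler approach from the quoted documentation could look roughly like the sketch below. The function names are hypothetical, and the proxy URL is the same placeholder as above. Note that ProxyHandler targets HTTP-style proxies; it does not speak SOCKS on its own.

```python
import urllib.request


def build_proxy_opener(proxy_url):
    """Return an opener that sends http and https requests through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def build_direct_opener():
    """Return an opener that ignores any auto-detected proxy settings."""
    return urllib.request.build_opener(urllib.request.ProxyHandler({}))


opener = build_proxy_opener("http://myproxy.example.com:1234")
direct = build_direct_opener()

# Hypothetical usage (not executed here, since it needs a reachable proxy):
#   request = urllib.request.Request(URL, headers=headers)
#   doc = lh.parse(opener.open(request))
```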
You can use socksipy
import ftplib
import telnetlib
import urllib2
import socks
#Set the proxy information
socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, 'localhost', 9050)
#Route an FTP session through the SOCKS proxy
socks.wrapmodule(ftplib)
ftp = ftplib.FTP('cdimage.ubuntu.com')
ftp.login('anonymous', 'support#aol.com')
print ftp.dir('cdimage')
ftp.close()
#Route a telnet connection through the SOCKS proxy
socks.wrapmodule(telnetlib)
tn = telnetlib.Telnet('achaea.com')
print tn.read_very_eager()
tn.close()
#Route an HTTP request through the SOCKS proxy
socks.wrapmodule(urllib2)
print urllib2.urlopen('http://www.whatismyip.com/automation/n09230945.asp').read()
In your case (Python 3):
import urllib.request
import socks
# Set the proxy information
socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, 'localhost', 9050)
# Wrap urllib.request, which holds the socket import; the bare urllib package has no socket attribute
socks.wrapmodule(urllib.request)
def get_main_html():
    request = urllib.request.Request(URL, headers=headers)
    doc = lh.parse(urllib.request.urlopen(request))
    return doc

python AttributeError: module 'socks' has no attribute 'setdefaultproxy'

Trying to test out the socks module but in every case I get an "AttributeError: module 'socks' has no attribute 'setdefaultproxy'"
import socks
import socket
socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, "127.0.0.1", 9050)
socket.socket = socks.socksocket
import urllib2
print(urllib2.urlopen("http://www.yahoo.com").read())
Try replacing
socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, "127.0.0.1", 9050)
with
socks.set_default_proxy(socks.PROXY_TYPE_SOCKS5, "127.0.0.1", 9050)
Note the underscores in the function name:
socks.set_default_proxy
In other words, it should be:
import socks
import socket
socks.set_default_proxy(socks.PROXY_TYPE_SOCKS5, "127.0.0.1", 9050)
socket.socket = socks.socksocket
import urllib2
print(urllib2.urlopen("http://www.yahoo.com").read())
If it still fails, check the following:
Which version of socks are you using? Which version of Python are you using? I tested it on Python 2.7.9 and I don't get your error. Which OS are you running?
sock: 1.5.6
https://github.com/Anorov/PySocks
http://socksipy.sourceforge.net/
Try:
socks.set_default_proxy(----)
Also, do not name your own file socks.py, or "import socks" will load your file instead of the library.
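To check whether a local file is shadowing the library, you can ask Python where it would load a module from. This is a minimal sketch using only the standard library. The helper name module_origin is an assumption, and json is used in the demonstration only because it is always available.

```python
import importlib.util


def module_origin(name):
    """Return the file path Python would import `name` from, or None if not found."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


# For the real problem you would call module_origin("socks") and check that the
# path points into site-packages rather than at a socks.py in your project folder.
print(module_origin("json"))
```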

Socket with Proxy in Python 3.4

I have code whose traffic I want to send entirely through a proxy. How can I do that using:
import socket
import #Here your socket library
# The rest of the code here
Which library do you recommend for use with socket? Is what I'm describing actually possible?
Using SocksiPy in Python 3.4, I only needed to add a few lines:
import socks
# Create a SOCKS-aware socket instead of a plain socket.socket()
s = socks.socksocket()
s.setproxy(socks.PROXY_TYPE_SOCKS5, 'localhost', 9050)
s.connect((HOST, PORT))
SocksiPy can be found at http://socksipy.sourceforge.net/ and is very simple to use; you just need to read its wiki to get started. The wiki is at https://code.google.com/p/socksipy-branch/

Python urllib over TOR? [duplicate]

Sample code:
#!/usr/bin/python
import socks
import socket
import urllib2
socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS4, "127.0.0.1", 9050, True)
socket.socket = socks.socksocket
print urllib2.urlopen("http://almien.co.uk/m/tools/net/ip/").read()
TOR is running a SOCKS proxy on port 9050 (its default). The request goes through TOR, surfacing at an IP address other than my own. However, TOR console gives the warning:
"Feb 28 22:44:26.233 [warn] Your application (using socks4 to port 80) is giving Tor only an IP address. Applications that do DNS resolves themselves may leak information. Consider using Socks4A (e.g. via privoxy or socat) instead. For more information, please see https://wiki.torproject.org/TheOnionRouter/TorFAQ#SOCKSAndDNS."
i.e. DNS lookups aren't going through the proxy. But that's what the 4th parameter to setdefaultproxy is supposed to do, right?
From http://socksipy.sourceforge.net/readme.txt:
setproxy(proxytype, addr[, port[, rdns[, username[, password]]]])
rdns - This is a boolean flag that modifies the behavior regarding DNS resolving. If it is set to True, DNS resolving will be performed remotely, on the server.
Same effect with both PROXY_TYPE_SOCKS4 and PROXY_TYPE_SOCKS5 selected.
It can't be a local DNS cache (if urllib2 even supports that) because it happens when I change the URL to a domain that this computer has never visited before.
The problem is that httplib.HTTPConnection uses the socket module's create_connection helper function which does the DNS request via the usual getaddrinfo method before connecting the socket.
The solution is to make your own create_connection function and monkey-patch it into the socket module before importing urllib2, just like we do with the socket class.
import socks
import socket
def create_connection(address, timeout=None, source_address=None):
    sock = socks.socksocket()
    sock.connect(address)
    return sock
socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, "127.0.0.1", 9050)
# patch the socket module
socket.socket = socks.socksocket
socket.create_connection = create_connection
import urllib2
# Now you can go ahead and scrape those shady darknet .onion sites
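The local lookup described above can be observed without touching the network. The sketch below temporarily replaces socket.getaddrinfo with a recorder that always fails, then calls the stock socket.create_connection. The helper name hosts_resolved_locally is hypothetical, and example.invalid is a reserved name that never resolves.

```python
import socket


def hosts_resolved_locally(host, port=80):
    """Return the hostnames that socket.create_connection tries to resolve itself."""
    seen = []
    original_getaddrinfo = socket.getaddrinfo

    def recording_getaddrinfo(name, *args, **kwargs):
        # Record the lookup, then fail it so no DNS query or connection is made.
        seen.append(name)
        raise socket.gaierror("lookup blocked for this demonstration")

    socket.getaddrinfo = recording_getaddrinfo
    try:
        socket.create_connection((host, port), timeout=1)
    except OSError:
        pass  # expected: the fake resolver always fails
    finally:
        socket.getaddrinfo = original_getaddrinfo  # always restore the real resolver
    return seen


print(hosts_resolved_locally("example.invalid"))
```

The hostname shows up in the recorded list because create_connection resolves it before any socket object is asked to connect, which is why replacing only socket.socket is not enough to keep DNS inside the proxy.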
The problem is that you are importing urllib2 before you set up the socks connection.
Try this instead:
import socks
import socket
socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS4, '127.0.0.1', 9050, True)
socket.socket = socks.socksocket
import urllib2
print urllib2.urlopen("http://almien.co.uk/m/tools/net/ip/").read()
Manual request example:
import socks
import urlparse
SOCKS_HOST = 'localhost'
SOCKS_PORT = 9050
SOCKS_TYPE = socks.PROXY_TYPE_SOCKS5
url = 'http://www.whatismyip.com/automation/n09230945.asp'
parsed = urlparse.urlparse(url)
socket = socks.socksocket()
socket.setproxy(SOCKS_TYPE, SOCKS_HOST, SOCKS_PORT)
socket.connect((parsed.netloc, 80))
socket.send('''GET %(uri)s HTTP/1.1
host: %(host)s
connection: close

''' % dict(
    uri=parsed.path,
    host=parsed.netloc,
))
print socket.recv(1024)
socket.close()
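The manual example above is Python 2. As a rough Python 3 sketch of just the request-building step, the helper below (the name build_get_request is an assumption) produces the bytes to send, using CRLF line endings and the empty line that terminates the headers. It also falls back to "/" when the URL has no path.

```python
from urllib.parse import urlparse


def build_get_request(url):
    """Return the raw bytes of a minimal HTTP/1.1 GET request for url."""
    parsed = urlparse(url)
    path = parsed.path or "/"
    if parsed.query:
        path += "?" + parsed.query
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {parsed.netloc}",
        "Connection: close",
        "",  # the empty line that ends the header section
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


# Hypothetical usage with a connected SOCKS socket (not executed here):
#   sock.sendall(build_get_request(url))
print(build_get_request("http://example.com"))
```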
I've published an article with complete source code showing how to use urllib2 + SOCKS + Tor on http://blog.databigbang.com/distributed-scraping-with-multiple-tor-circuits/
Hope it solves your issues.
