I've got web-page data, but now I want to fetch it through a proxy. How can I do that?
import urllib.request
import lxml.html as lh  # assuming lh is lxml.html

def get_main_html():
    request = urllib.request.Request(URL, headers=headers)
    doc = lh.parse(urllib.request.urlopen(request))
    return doc
From the documentation:
urllib will auto-detect your proxy settings and use those. This is through the ProxyHandler, which is part of the normal handler chain when a proxy setting is detected. Normally that's a good thing, but there are occasions when it may not be helpful. One way to do this is to set up our own ProxyHandler, with no proxies defined. This is done using similar steps to setting up a Basic Authentication handler.
See https://docs.python.org/3/howto/urllib2.html#proxies
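For illustration, the no-proxy opener that the HOWTO describes looks like this (a minimal sketch of the documented approach):
import urllib.request

# A ProxyHandler with no proxies defined, so auto-detected proxy
# settings are ignored for everything fetched through this opener.
proxy_support = urllib.request.ProxyHandler({})
opener = urllib.request.build_opener(proxy_support)
urllib.request.install_opener(opener)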
Use (Python 2, where urllib.urlopen accepts a proxies argument):
proxies = {'http': 'http://myproxy.example.com:1234'}
print "Using HTTP proxy %s" % proxies['http']
urllib.urlopen("http://yoursite", proxies=proxies)
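Since the question's code uses Python 3's urllib.request, a rough Python 3 equivalent would go through a ProxyHandler instead (the proxy address is the same placeholder):
import urllib.request

proxies = {'http': 'http://myproxy.example.com:1234'}
print("Using HTTP proxy %s" % proxies['http'])

# Build an opener that routes its requests through the proxy.
opener = urllib.request.build_opener(urllib.request.ProxyHandler(proxies))
print(opener.open("http://yoursite").read())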
You can use SocksiPy:
import ftplib
import telnetlib
import urllib2
import socks

# Set the proxy information
socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, 'localhost', 9050)

# Route an FTP session through the SOCKS proxy
socks.wrapmodule(ftplib)
ftp = ftplib.FTP('cdimage.ubuntu.com')
ftp.login('anonymous', 'support@aol.com')
print ftp.dir('cdimage')
ftp.close()

# Route a telnet connection through the SOCKS proxy
socks.wrapmodule(telnetlib)
tn = telnetlib.Telnet('achaea.com')
print tn.read_very_eager()
tn.close()

# Route an HTTP request through the SOCKS proxy
socks.wrapmodule(urllib2)
print urllib2.urlopen('http://www.whatismyip.com/automation/n09230945.asp').read()
In your case:
import urllib.request
import socks
import lxml.html as lh

# Set the proxy information
socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, 'localhost', 9050)
socks.wrapmodule(urllib.request)

def get_main_html():
    request = urllib.request.Request(URL, headers=headers)
    doc = lh.parse(urllib.request.urlopen(request))
    return doc
Related
I can't connect to anything on my network using the IP address of the host. I can open a browser and connect, and I can ping the host just fine. Here is my code:
from httplib import HTTPConnection

addr = '192.168.14.203'
conn = HTTPConnection(addr)
conn.request('HEAD', '/')
res = conn.getresponse()
if res.status == 200:
    print "ok"
else:
    print "problem : the query returned %s because %s" % (res.status, res.reason)
The following error gets returned:
socket.error: [Errno 51] Network is unreachable
If I change the addr var to google.com I get a 200 response. What am I doing wrong?
You should check the address and your proxy settings.
For making HTTP requests I recommend the requests library. It's much more high-level and user-friendly than httplib, and it makes it easy to set proxies:
import requests
addr = "http://192.168.14.203"
response = requests.get(addr)
# If you need to set a proxy:
response = requests.get(addr, proxies={"http": "...proxy address..."})

# To avoid using any proxy if your system sets one by default:
response = requests.get(addr, proxies={"http": None})
The requirement is to scrape anonymously, or to change IP after a certain number of calls. I use the https://github.com/kennethreitz/requests-html module to parse the HTML, but I get the error below:
socks.SOCKS5Error: 0x01: General SOCKS server failure
Code:
import socks
import socket
import requests_html
socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, addr='127.0.0.1', port=9150)
socket.socket = socks.socksocket
session = requests_html.HTMLSession()
r = session.get('http://icanhazip.com')
r.html.render(sleep=5)
print(r.html.text)
But it works perfectly fine with the requests module:
import socks
import socket
import requests
socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, addr='127.0.0.1', port=9150)
socket.socket = socks.socksocket
print(requests.get("http://icanhazip.com").text)
Any help to solve the issue with requests-html module would be highly appreciated.
Try:
session = requests_html.HTMLSession(browser_args=["--no-sandbox","--proxy-server=127.0.0.1:9150"])
It depends on how your proxy is set up to use Tor, but this worked for me!
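Putting it together with the fetch itself, a sketch might look like the following (assumptions: the Tor Browser's SOCKS port 9150 is running locally, requests[socks] is installed so the socks5h scheme works, and browser_args is forwarded to the Chromium instance that render() launches):
import requests_html

# Proxy the plain fetch through requests' own proxy support instead of
# patching socket globally (a global patch can break pyppeteer's local
# websocket connection to Chromium).
proxies = {
    'http': 'socks5h://127.0.0.1:9150',
    'https': 'socks5h://127.0.0.1:9150',
}

# --proxy-server routes the headless browser's rendering traffic
# through the same SOCKS proxy.
session = requests_html.HTMLSession(
    browser_args=["--no-sandbox", "--proxy-server=socks5://127.0.0.1:9150"])

r = session.get('http://icanhazip.com', proxies=proxies)
r.html.render(sleep=5)
print(r.html.text)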
I know similar questions have been asked several times:
General SOCKS server failure with python tor but working from tor browser
General SOCKS server failure when switching identity using stem
General SOCKS server failure while using tor proxy
I checked all the related posts and googled a lot, but I'm still stuck.
I'm on Windows 10. I downloaded the Tor Browser, ran it, and confirmed it's listening on 127.0.0.1:9150 with netstat -aon in an administrator command prompt.
Then I run the following example code in Python:
import socks
import socket
socks.set_default_proxy(socks.SOCKS5, "127.0.0.1", 9150)
socket.socket = socks.socksocket
The last line, socket.socket = socks.socksocket, gives the error message:
socks.GeneralProxyError: Socket error: 0x01: General SOCKS server failure
It's supposed to assign the socksocket class to socket.socket, so that every socket opened afterwards goes through the proxy, like in this example:
https://deshmukhsuraj.wordpress.com/2015/03/08/anonymous-web-scraping-using-python-and-tor/
Can anyone tell me what's wrong?
Thanks.
Update
Thanks to drew010's answer, this code will work (with the Tor browser running and its port set to 9150):
import requests

proxies = {
    'http': 'socks5h://127.0.0.1:9150',
    'https': 'socks5h://127.0.0.1:9150'
}
url = 'http://icanhazip.com'

# Request without Tor (original IP)
r = requests.get(url)
print(r.text)

# Request with Tor (Tor IP)
r = requests.get(url, proxies=proxies)
print(r.text)

# Force an IP change
from stem.control import Controller
from stem import Signal

with Controller.from_port(port=9151) as controller:
    controller.authenticate('mypassword')
    controller.signal(Signal.NEWNYM)

# Changed Tor IP
r = requests.get(url, proxies=proxies)
print(r.text)
Note that we need to set a control password in torrc beforehand.
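For reference, the relevant torrc entries look roughly like this (the hash below is a placeholder; generate your own with tor --hash-password mypassword):
# Enable the control port used by stem above
ControlPort 9151
# Placeholder hash; replace with the output of: tor --hash-password mypassword
HashedControlPassword 16:...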
By doing socket.socket = socks.socksocket you're replacing the socket class itself, so every socket created afterwards is actually a socksocket object; after that you can just use regular sockets and they will go through your SOCKS proxy.
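To illustrate (a minimal sketch; it assumes a SOCKS proxy such as Tor is listening on 127.0.0.1:9150):
import socks
import socket

socks.set_default_proxy(socks.SOCKS5, "127.0.0.1", 9150)
socket.socket = socks.socksocket

# This "plain" socket is really a socksocket, so the connection
# below is tunneled through the proxy.
s = socket.socket()
s.connect(("icanhazip.com", 80))
s.sendall(b"GET / HTTP/1.1\r\nHost: icanhazip.com\r\nConnection: close\r\n\r\n")
print(s.recv(4096))
s.close()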
I have set up a tornado HTTP server which is working as a proxy server.
I am using the python requests library to use it as a proxy server.
When I try to fetch HTTP URLs with it, it works fine, but it isn't intercepting HTTPS requests.
The proxy server part:
class ProxyServer(HTTPServerConnectionDelegate):
    def start_request(self, server_conn, request_conn):
        print('In start request')
        return ClientDelegator(request_conn)

    def on_close(self):
        pass

    def client_send_error(self):
        self.write('Error happened.')
        self.finish()

def main():
    server = HTTPServer(ProxyServer())
    server.bind(8888)
    server.start(0)
    tornado.ioloop.IOLoop.current().start()

if __name__ == "__main__":
    main()
The requests part:
import requests
url = 'https://example.com'
proxy = {'http' : '127.0.0.1:8888'}
r = requests.get(url, proxies=proxy, verify=False)
print(r.text)
When I use http://example.com, the connection starts and 'In start request' gets printed. However, when I use https://example.com, the connection doesn't start; the ProxyServer never enters start_request.
What am I doing wrong?
Your proxy variable only specifies a proxy for http, not https. You need to set the proxy for both protocols separately.
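For example (a minimal sketch reusing the question's address):
import requests

# Give requests a proxy entry for each scheme it should tunnel.
proxy = {
    'http': '127.0.0.1:8888',
    'https': '127.0.0.1:8888',
}
r = requests.get('https://example.com', proxies=proxy, verify=False)
print(r.text)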
Sample code:
#!/usr/bin/python
import socks
import socket
import urllib2
socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS4, "127.0.0.1", 9050, True)
socket.socket = socks.socksocket
print urllib2.urlopen("http://almien.co.uk/m/tools/net/ip/").read()
Tor is running a SOCKS proxy on port 9050 (its default). The request goes through Tor, surfacing at an IP address other than my own. However, the Tor console gives the warning:
"Feb 28 22:44:26.233 [warn] Your
application (using socks4 to port 80)
is giving Tor only an IP address.
Applications that do DNS resolves
themselves may leak information.
Consider using Socks4A (e.g. via
privoxy or socat) instead. For more
information, please see
https://wiki.torproject.org/TheOnionRouter/TorFAQ#SOCKSAndDNS."
i.e. DNS lookups aren't going through the proxy. But that's what the 4th parameter to setdefaultproxy is supposed to do, right?
From http://socksipy.sourceforge.net/readme.txt:
setproxy(proxytype, addr[, port[, rdns[, username[, password]]]])

rdns - This is a boolean flag that modifies the behavior regarding DNS resolving. If it is set to True, DNS resolving will be performed remotely, on the server.
I get the same effect with both PROXY_TYPE_SOCKS4 and PROXY_TYPE_SOCKS5 selected. It can't be a local DNS cache (if urllib2 even supports that), because it happens even when I change the URL to a domain this computer has never visited before.
The problem is that httplib.HTTPConnection uses the socket module's create_connection helper function which does the DNS request via the usual getaddrinfo method before connecting the socket.
The solution is to make your own create_connection function and monkey-patch it into the socket module before importing urllib2, just like we do with the socket class.
import socks
import socket

def create_connection(address, timeout=None, source_address=None):
    sock = socks.socksocket()
    sock.connect(address)
    return sock

socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, "127.0.0.1", 9050)

# Patch the socket module before urllib2 is imported
socket.socket = socks.socksocket
socket.create_connection = create_connection

import urllib2

# Now you can go ahead and scrape those shady darknet .onion sites
The problem is that you are importing urllib2 before you set up the socks connection.
Try this instead:
import socks
import socket
socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS4, '127.0.0.1', 9050, True)
socket.socket = socks.socksocket
import urllib2
print urllib2.urlopen("http://almien.co.uk/m/tools/net/ip/").read()
Manual request example:
import socks
import urlparse

SOCKS_HOST = 'localhost'
SOCKS_PORT = 9050
SOCKS_TYPE = socks.PROXY_TYPE_SOCKS5

url = 'http://www.whatismyip.com/automation/n09230945.asp'
parsed = urlparse.urlparse(url)

sock = socks.socksocket()
sock.setproxy(SOCKS_TYPE, SOCKS_HOST, SOCKS_PORT)
sock.connect((parsed.netloc, 80))

# The blank line after the headers terminates the HTTP request head
sock.send('''GET %(uri)s HTTP/1.1
host: %(host)s
connection: close

''' % dict(
    uri=parsed.path,
    host=parsed.netloc,
))

print sock.recv(1024)
sock.close()
I've published an article with complete source code showing how to use urllib2 + SOCKS + Tor on http://blog.databigbang.com/distributed-scraping-with-multiple-tor-circuits/
Hope it solves your issues.