Python: Downloading files from HTTP server

I have written some Python scripts to download images off an HTTP website, but because I'm using urllib2, it closes the existing connection and opens a new one for every request. I don't really understand networking all that much, but this probably slows things down considerably, and grabbing 100 images at a time would take a considerable amount of time.
I started looking at other alternatives like pycurl or httplib, but found them complicated to figure out compared to urllib2, and I haven't found a lot of code snippets that I could just take and use.
Simply: how would I establish a persistent connection to a website, download a number of files, and then close the connection only when I am done (probably with an explicit call to close it)?

Since you asked for an httplib snippet:
import httplib

images = ['img1.png', 'img2.png', 'img3.png']
conn = httplib.HTTPConnection('www.example.com')
for image in images:
    conn.request('GET', '/images/%s' % image)
    resp = conn.getresponse()
    data = resp.read()
    with open(image, 'wb') as f:
        f.write(data)
conn.close()
This would issue multiple (sequential) GET requests for the images in the list over a single connection, then close it.
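For what it's worth, on Python 3 the same module is called http.client, and its HTTPConnection also keeps the TCP connection open across requests; a straight translation of the snippet above:
import http.client

images = ['img1.png', 'img2.png', 'img3.png']
conn = http.client.HTTPConnection('www.example.com')
for image in images:
    conn.request('GET', '/images/%s' % image)
    resp = conn.getresponse()
    data = resp.read()  # each response must be read fully before the next request
    with open(image, 'wb') as f:
        f.write(data)
conn.close()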

I found urllib3, and it claims to reuse the existing TCP connection.
As I already stated in a comment on the question, I disagree with the claim that this will not make a big difference: because of the TCP slow start algorithm, every newly created connection will be slow at first, so reusing the same TCP socket makes a difference if the data is big enough. And I think for 100 images the data will be between 10 and 100 MB.
Here is a code sample from http://code.google.com/p/urllib3/source/browse/test/benchmark.py
TO_DOWNLOAD = [
'http://code.google.com/apis/apps/',
'http://code.google.com/apis/base/',
'http://code.google.com/apis/blogger/',
'http://code.google.com/apis/calendar/',
'http://code.google.com/apis/codesearch/',
'http://code.google.com/apis/contact/',
'http://code.google.com/apis/books/',
'http://code.google.com/apis/documents/',
'http://code.google.com/apis/finance/',
'http://code.google.com/apis/health/',
'http://code.google.com/apis/notebook/',
'http://code.google.com/apis/picasaweb/',
'http://code.google.com/apis/spreadsheets/',
'http://code.google.com/apis/webmastertools/',
'http://code.google.com/apis/youtube/',
]
from urllib3 import HTTPConnectionPool

pool = HTTPConnectionPool.from_url(TO_DOWNLOAD[0])
for url in TO_DOWNLOAD:
    r = pool.get_url(url)
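Note that HTTPConnectionPool.from_url and get_url come from that old urllib3 snapshot; with current urllib3 releases, a minimal equivalent sketch (reusing the TO_DOWNLOAD list above) would be:
import urllib3

# A single PoolManager keeps sockets open and reuses them per host.
http = urllib3.PoolManager()
for url in TO_DOWNLOAD:
    r = http.request('GET', url)
    print(url, r.status, len(r.data))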

If you are not going to make any complicated requests, you could open a socket and make the requests yourself:
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.connect((server_name, server_port))
for url in urls:
    sock.sendall('GET %s HTTP/1.1\r\nHost: %s\r\n\r\n' % (url, server_name))
    # Parse the HTTP header
    # Download the picture (its size should be in the HTTP header)
sock.close()
But I do not think that establishing 100 TCP sessions adds much overhead in general.

Related

How do I gracefully close a socket with a persistent HTTP connection?

I'm writing a very simple client in Python that fetches an HTML page from the WWW. This is the code I've come up with so far:
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.connect(("www.mywebsite.com", 80))
sock.send(b"GET / HTTP/1.1\r\nHost: www.mywebsite.com\r\n\r\n")
while True:
    chunk = sock.recv(1024)  # (1)
    if len(chunk) == 0:
        break
    print(chunk)
sock.close()
The problem is: since an HTTP/1.1 connection is persistent by default, the code gets stuck at # (1), waiting for more data from the server once the transmission is over.
I know I can solve this by a) adding the Connection: close request header, or b) setting a timeout on the socket. A non-blocking socket would not help here, as the select() syscall would still hang (unless I set a timeout on it, but that's just another form of case b)).
So is there another way to do it, while keeping the connection persistent?
As has already been said in the comments, there's a lot to consider if you're trying to write an all-singing, all-dancing HTTP processor. However, if you're just practising with sockets then consider this.
Let's assume that you know how the response will end. For example, if we do essentially what you're doing in your code to the main Google page, we know that the response will end with '\r\n\r\n'. So, what we can do is just read 1 byte at a time and look out for that terminating sequence.
This code will NOT give you the full Google main page because, as you will see, the response is chunked - and that's a whole new ball game.
Having said all of that, you may find this instructive:
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    sock.connect(('www.google.com', 80))
    sock.send(b'GET / HTTP/1.1\r\nHost: www.google.com\r\n\r\n')
    end = [b'\r', b'\n', b'\r', b'\n']
    d = []
    while d[-len(end):] != end:
        d.append(sock.recv(1))
    print(''.join(b.decode() for b in d))
finally:
    sock.close()
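One way to keep the connection persistent is to read exactly as many body bytes as the server announces. Here is a minimal sketch, assuming the server sends a Content-Length header (i.e. no chunked transfer encoding); www.mywebsite.com and /other.html are just placeholders:
import socket

def fetch(sock, host, path='/'):
    # Send one GET request and read exactly one response,
    # leaving the connection open for the next request.
    sock.sendall(('GET %s HTTP/1.1\r\nHost: %s\r\n\r\n' % (path, host)).encode())
    # Read until the end of the headers.
    buf = b''
    while b'\r\n\r\n' not in buf:
        buf += sock.recv(1024)
    headers, _, body = buf.partition(b'\r\n\r\n')
    # Find Content-Length (this sketch assumes the server sends one).
    length = 0
    for line in headers.split(b'\r\n')[1:]:
        name, _, value = line.partition(b':')
        if name.strip().lower() == b'content-length':
            length = int(value)
    # Read the rest of the body -- exactly Content-Length bytes in total.
    while len(body) < length:
        body += sock.recv(length - len(body))
    return body

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.connect(('www.mywebsite.com', 80))
page1 = fetch(sock, 'www.mywebsite.com', '/')
page2 = fetch(sock, 'www.mywebsite.com', '/other.html')  # same TCP connection
sock.close()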

Python program to find active port on a website?

My college's website runs on several ports, something like this:
http://www.college.in:913
I want a program to find the active ones, i.e. the port numbers on which the website is responding.
Here is my code, but it takes a lot of time:
from urllib.request import Request, urlopen
from urllib.error import URLError, HTTPError

for i in range(1, 10000):
    req = Request("http://college.edu.in:" + str(i))
    try:
        response = urlopen(req)
    except URLError as e:
        print("Error at port " + str(i))
    else:
        print("Website is working fine at port " + str(i))
It might be faster to try opening a socket connection to each port in the range, and only make an HTTP request if the socket is actually open. But iterating through thousands of ports is often slow either way: if each attempt takes 0.5 seconds and you're scanning 10000 ports, that's a lot of time spent waiting.
import socket

# create an INET, STREAMing socket
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# now connect to the web server on port 80 - the normal http port
s.connect(("www.python.org", 80))
s.close()
from https://docs.python.org/3/howto/sockets.html
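Along those lines, a minimal port probe with a short timeout might look like this (college.edu.in stands in for the host from the question; connect_ex returns 0 when the connection succeeds):
import socket

host = "college.edu.in"
open_ports = []
for port in range(1, 10000):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(0.5)  # keep the per-port wait short
    if s.connect_ex((host, port)) == 0:  # 0 means the connect succeeded
        open_ports.append(port)
    s.close()
print(open_ports)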
You might also consider profiling the code and finding out where the slow parts are.
You can use python-nmap, a Python wrapper around the nmap port scanner.
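A minimal sketch with python-nmap, assuming both the library (pip install python-nmap) and the nmap binary are installed, and again using the hypothetical host from the question:
import nmap

nm = nmap.PortScanner()
nm.scan('college.edu.in', '1-10000')
for host in nm.all_hosts():
    for port in nm[host].all_tcp():
        if nm[host]['tcp'][port]['state'] == 'open':
            print(host, port, 'open')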

Very simple Web Service in Python

I am trying to implement a server in Python. When the browser connects to localhost on port 9999, it should open the file index.html, which references images.jpg, but the image cannot be shown. How can I make the web server serve the image as well?
Here is my code so far:
from socket import *
import os
serversocket = socket(AF_INET, SOCK_STREAM)
port = 5000
host = '127.0.0.1'
size = os.path.getsize("index.html")
myfile = open('index.html', 'rb')
mycontent = "Welcome to Very Simple Web Server"
size = len(mycontent)
header = "HTTP/1.0 200 OK \r\n Content_Length:" + str(size) + "\r\n\r\n"
mycontent = myfile.read()
serversocket.bind((host, port))
serversocket.listen(5)
print('Server is listening on port 9999')
while (1):
    conn, addr = serversocket.accept()
    print('Connected by', addr)
    conn.send(bytes(header))
    conn.send(mycontent)
    conn.close()
Your code runs an infinite loop that sends the same file to every connection; it never looks at what the client actually requested.
In order for the image to show, the browser has to send another request to the URL of the image, and this request is not being serviced by your code.
In order for your server to work, you need to:
Start a loop
Listen for connections
Interpret the headers of the incoming request, and then act appropriately. Let's assume that you only deal with GET requests and not other things like POST, HEAD, PUT, etc.
Look at the requested resource (the URL)
Find the resource on the file system (so now, you have to parse the URL)
Package the resource into an HTTP response (read the file, set the appropriate MIME type)
Send the response back to the client with the appropriate headers (the server response headers)
Repeat
To display an HTML page with one image takes two requests: one for the HTML page and another for the image. If the HTML code links to a CSS file, you need three requests: one for the HTML page, one for the CSS file and a final one for the image. All of these requests must complete successfully for the browser to render the page.
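A minimal sketch of those steps, serving files out of the current directory and guessing the MIME type from the extension (error handling mostly omitted):
import os
import socket
import mimetypes

serversocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
serversocket.bind(('127.0.0.1', 9999))
serversocket.listen(5)
while True:
    conn, addr = serversocket.accept()
    request = conn.recv(4096).decode()
    # The request line looks like: GET /index.html HTTP/1.1
    parts = request.split(' ')
    path = parts[1].lstrip('/') if len(parts) > 1 else ''
    if path == '':
        path = 'index.html'  # default document
    if os.path.isfile(path):
        body = open(path, 'rb').read()
        mime = mimetypes.guess_type(path)[0] or 'application/octet-stream'
        status = 'HTTP/1.0 200 OK'
    else:
        body = b'Not Found'
        mime = 'text/plain'
        status = 'HTTP/1.0 404 Not Found'
    header = '%s\r\nContent-Type: %s\r\nContent-Length: %d\r\n\r\n' % (status, mime, len(body))
    conn.sendall(header.encode() + body)
    conn.close()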
You never need to do this by hand, use a web development framework which will take care of all this "boring" stuff so you can then deal with solving the actual problem.

How can I implement a simple web server using Python without using any libraries?

I need to implement a very simple web-server-like app in Python which would perform basic HTTP requests and responses and display very basic output on the web page. I am not too concerned about actually coding it in Python, but I am not sure where to start? How to set this up? One file? Multiple files? I guess I have no idea how to approach the fact that this is a "server" - so I am unfamiliar with how to approach dealing with HTTP requests/sockets/processing requests, etc. Any advice? Resources?
You should look at the SimpleHTTPServer (py3: http.server) module.
Depending on what you're trying to do, you can either just use it, or check out the module's source (py2, py3) for ideas.
If you want to get more low-level, SimpleHTTPServer extends BaseHTTPServer (source) to make it just work.
If you want to get even more low-level, take a look at SocketServer (source: py2, py3).
People will often run python -m SimpleHTTPServer (or python3 -m http.server) if they just want to share a directory: it's a fully functional and... simple server.
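If you need more than directory sharing, a minimal custom handler on top of http.server is only a few lines; this sketch uses the Python 3 names (HelloHandler is just an illustrative name):
from http.server import BaseHTTPRequestHandler, HTTPServer

class HelloHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b'<html><body>Hello World</body></html>'
        self.send_response(200)
        self.send_header('Content-Type', 'text/html')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

HTTPServer(('localhost', 9000), HelloHandler).serve_forever()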
You can use socket programming for this purpose. The following snippet creates a TCP socket and listens on port 9000 for HTTP requests:
from socket import *

def createServer():
    serversocket = socket(AF_INET, SOCK_STREAM)
    serversocket.bind(('localhost', 9000))
    serversocket.listen(5)
    while(1):
        (clientsocket, address) = serversocket.accept()
        clientsocket.send("HTTP/1.1 200 OK\n"
                          + "Content-Type: text/html\n"
                          + "\n"  # Important!
                          + "<html><body>Hello World</body></html>\n")
        clientsocket.shutdown(SHUT_WR)
        clientsocket.close()
    serversocket.close()

createServer()
Start the server with $ python server.py.
Open http://localhost:9000/ in your web browser (which acts as the client). You should then see the text "Hello World" (the HTTP response) in the browser window.
EDIT:
The previous code was only tested on Chrome; as suggested regarding other browsers, it was modified as follows:
To make the response HTTP-alike, send a plain header with HTTP version 1.1, status code 200 OK and Content-Type text/html.
The client socket needs to be closed once the response is submitted, as it's a TCP socket.
To properly close the client socket, shutdown() needs to be called first (socket.shutdown vs socket.close).
The code was then tested on Chrome, Firefox (http://localhost:9000/) and with a simple curl in the terminal (curl http://localhost:9000).
I decided to make this work in Python 3, and make it work with Chrome, to use as an example for an online course I am developing. Python 3 of course needs encode() and decode() in the right places. Chrome really wants to send its GET request before it gets data. I also added some error checking, so it cleans up its socket if you abort the server or it blows up:
from socket import *

def createServer():
    serversocket = socket(AF_INET, SOCK_STREAM)
    try:
        serversocket.bind(('localhost', 9000))
        serversocket.listen(5)
        while(1):
            (clientsocket, address) = serversocket.accept()
            rd = clientsocket.recv(5000).decode()
            pieces = rd.split("\n")
            if (len(pieces) > 0): print(pieces[0])
            data = "HTTP/1.1 200 OK\r\n"
            data += "Content-Type: text/html; charset=utf-8\r\n"
            data += "\r\n"
            data += "<html><body>Hello World</body></html>\r\n\r\n"
            clientsocket.sendall(data.encode())
            clientsocket.shutdown(SHUT_WR)
    except KeyboardInterrupt:
        print("\nShutting down...\n")
    except Exception as exc:
        print("Error:\n")
        print(exc)
    serversocket.close()

print('Access http://localhost:9000')
createServer()
The server also prints out the incoming HTTP request. The code of course only sends text/html regardless of the request - even if the browser is asking for the favicon:
$ python3 server.py
Access http://localhost:9000
GET / HTTP/1.1
GET /favicon.ico HTTP/1.1
^C
Shutting down...
But it is a pretty good example that mostly shows why you want to use a framework like Flask or Django instead of writing your own. Thanks for the initial code.
The very simple solution mentioned above doesn't work as posted; this version does, tested on Chrome. It is Python 3, although it may also work on Python 2 since I never tested it.
from socket import *

def createServer():
    serversocket = socket(AF_INET, SOCK_STREAM)
    serversocket.bind(('localhost', 9000))
    serversocket.listen(5)
    while(1):
        (clientsocket, address) = serversocket.accept()
        clientsocket.send(bytes("HTTP/1.1 200 OK\n"
                                + "Content-Type: text/html\n"
                                + "\n"  # Important!
                                + "<html><body>Hello World</body></html>\n", 'utf-8'))
        clientsocket.shutdown(SHUT_WR)
        clientsocket.close()
    serversocket.close()

createServer()
This is improved from the accepted answer; I'm posting it so future users can use it easily.

How to achieve tcpflow functionality (follow tcp stream) purely within python

I am writing a tool in Python (the platform is Linux); one of its tasks is to capture a live TCP stream and apply a function to each line. Currently I'm using:
import subprocess

proc = subprocess.Popen(
    ['sudo', 'tcpflow', '-C', '-i', interface, '-p', 'src', 'host', ip],
    stdout=subprocess.PIPE)
for line in iter(proc.stdout.readline, ''):
    do_something(line)
This works quite well (with the appropriate entry in /etc/sudoers), but I would like to avoid calling an external program.
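(One caveat with that snippet: on Python 3, proc.stdout yields bytes, so the iter() sentinel must be b'' or the loop never ends:)
for line in iter(proc.stdout.readline, b''):  # b'' sentinel for a bytes pipe
    do_something(line)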
So far I have looked into the following possibilities:
flowgrep: a Python tool which looks just like what I need, BUT it uses pynids internally, which is 7 years old and seems pretty much abandoned. There is no pynids package for my Gentoo system, and it ships with a patched version of libnids which I couldn't compile without further tweaking.
scapy: this is a packet manipulation program/library for Python; I'm not sure whether TCP stream reassembly is supported.
pypcap or pylibpcap, as wrappers for libpcap. Again, libpcap is for packet capturing, whereas I need stream reassembly, which is not possible according to this question.
Before I dive deeper into any of these libraries, I would like to know if someone has a working code snippet (this seems like a rather common problem). I'm also grateful if someone can give advice about the right way to go.
Thanks
Jon Oberheide has led efforts to maintain pynids, which is fairly up to date at:
http://jon.oberheide.org/pynids/
So, this might permit you to further explore flowgrep. Pynids itself handles stream reconstruction rather elegantly. See http://monkey.org/~jose/presentations/pysniff04.d/ for some good examples.
Just as a follow-up: I abandoned the idea of monitoring the stream on the TCP layer. Instead I wrote a proxy in Python and let the connection I want to monitor (an HTTP session) go through this proxy. The result is more stable and does not need root privileges to run. This solution depends on pymiproxy.
This goes into a standalone program, e.g. helper_proxy.py
from multiprocessing.connection import Listener
import StringIO
from httplib import HTTPResponse
import threading
import time
from miproxy.proxy import RequestInterceptorPlugin, ResponseInterceptorPlugin, AsyncMitmProxy

class FakeSocket(StringIO.StringIO):
    def makefile(self, *args, **kw):
        return self

class Interceptor(RequestInterceptorPlugin, ResponseInterceptorPlugin):
    conn = None

    def do_request(self, data):
        # do whatever you need with the request data here; I'm only interested in responses
        return data

    def do_response(self, data):
        if Interceptor.conn:  # if the listener is connected, send the response to it
            response = HTTPResponse(FakeSocket(data))
            response.begin()
            Interceptor.conn.send(response.read())
        return data

def main():
    proxy = AsyncMitmProxy()
    proxy.register_interceptor(Interceptor)
    ProxyThread = threading.Thread(target=proxy.serve_forever)
    ProxyThread.daemon = True
    ProxyThread.start()
    print "Proxy started."
    address = ('localhost', 6000)  # family is deduced to be 'AF_INET'
    listener = Listener(address, authkey='some_secret_password')
    while True:
        Interceptor.conn = listener.accept()
        print "Accepted Connection from", listener.last_accepted
        try:
            Interceptor.conn.recv()
        except:
            time.sleep(1)
        finally:
            Interceptor.conn.close()

if __name__ == '__main__':
    main()
Start it with python helper_proxy.py. This will create a proxy listening for HTTP connections on port 8080 and listening for another Python program on port 6000. Once the other Python program has connected on that port, the helper proxy will send all HTTP replies to it. This way the helper proxy can continue to run, keeping up the HTTP connection, while the listener can be restarted for debugging.
Here is how the listener works, e.g. listener.py:
from multiprocessing.connection import Client

def main():
    address = ('localhost', 6000)
    conn = Client(address, authkey='some_secret_password')
    while True:
        print conn.recv()

if __name__ == '__main__':
    main()
This will just print all the replies. Now point your browser to the proxy running on port 8080 and establish the http connection you want to monitor.
