How to use Content-Encoding: gzip with Python SimpleHTTPServer - python

I'm using python -m SimpleHTTPServer to serve up a directory for local testing in a web browser. Some of the content includes large data files. I would like to be able to gzip them and have SimpleHTTPServer serve them with Content-Encoding: gzip.
Is there an easy way to do this?

This is an old question, but it still ranks #1 in Google for me, so I suppose a proper answer might be of use to someone besides me.
The solution turns out to be very simple. In do_GET(), do_POST(), etc., you only need to add the following:
content = self.gzipencode(strcontent)
# ...your other headers, etc...
self.send_header("Content-Length", str(len(content)))
self.send_header("Content-Encoding", "gzip")
self.end_headers()
self.wfile.write(content)
self.wfile.flush()
strcontent being your actual content (HTML, JavaScript or other resources),
and the gzipencode:
def gzipencode(self, content):
    import StringIO
    import gzip
    out = StringIO.StringIO()
    f = gzip.GzipFile(fileobj=out, mode='w', compresslevel=5)
    f.write(content)
    f.close()
    return out.getvalue()
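If you're on Python 3, where SimpleHTTPServer became http.server, the same idea gets shorter thanks to gzip.compress. A minimal, untested sketch (handler name and sample content are mine):
import gzip
from http.server import BaseHTTPRequestHandler, HTTPServer

class GzipHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        content = b"<html><body>hello</body></html>"  # whatever you generate
        if 'gzip' in self.headers.get('Accept-Encoding', ''):
            content = gzip.compress(content, compresslevel=5)
            self.send_response(200)
            self.send_header('Content-Encoding', 'gzip')
        else:
            self.send_response(200)
        self.send_header('Content-Type', 'text/html')
        self.send_header('Content-Length', str(len(content)))
        self.end_headers()
        self.wfile.write(content)

if __name__ == '__main__':
    HTTPServer(('', 8000), GzipHandler).serve_forever()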

Since this was the top google result I figured I would post my simple modification to the script that got gzip to work.
https://github.com/ksmith97/GzipSimpleHTTPServer

As so many others, I've been using python -m SimpleHTTPServer for local testing as well. This is still the top result on Google, and while https://github.com/ksmith97/GzipSimpleHTTPServer is a nice solution, it enforces gzip even if not requested and there's no flag to enable/disable it.
I decided to write a tiny CLI tool that supports this. It's written in Go, so the regular install procedure is simply:
go get github.com/rhardih/serve
If you already have $GOPATH added to $PATH, that's all you need. Now you have serve as a command.
https://github.com/rhardih/serve

This was requested as a feature, but it was rejected in order to keep the simple HTTP server simple: https://bugs.python.org/issue30576
The issue author eventually released a standalone version for Python 3: https://github.com/PierreQuentel/httpcompressionserver

Building on @velis' answer above, here is how I do it. Gzipping small data is not worth the time and can even increase its size. Tested with a Dalvik client.
def do_GET(self):
    # ... get content ...
    self.send_response(returnCode)  # 200, 401, etc.
    # ... your other headers, etc. ...
    if len(content) > 100:  # don't bother compressing small data
        if 'accept-encoding' in self.headers:  # case insensitive
            if 'gzip' in self.headers['accept-encoding']:
                content = gzipencode(content)  # gzipencode defined above in @velis' answer
                self.send_header('content-encoding', 'gzip')
    self.send_header('content-length', len(content))
    self.end_headers()  # send a blank line
    self.wfile.write(content)

From looking at SimpleHTTPServer's documentation, there is no way to do this. However, I recommend lighttpd with the mod_compress module.


Google Drive Python API: export never completes

Summary:
I have an issue where sometimes the google-drive-sdk for Python does not detect the end of the document being exported. It seems to think that the Google document is of infinite size.
Background, source code and tutorials I followed:
I am working on my own Python-based Google Drive backup script (one with a nice CLI interface for browsing around). git link for source code
It's still in the making and currently only finds new files and downloads them (with the 'pull' command).
For the most important Google Drive commands, I followed the official Google Drive API tutorials for downloading media. here
What works:
When a document or file is a non-Google-Docs document, the file is downloaded properly. However, when I try to "export" a file, I see that I need to use a different mimeType. I have a dictionary for this.
For example: I map application/vnd.google-apps.document to application/vnd.openxmlformats-officedocument.wordprocessingml.document when exporting a document.
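For illustration, such a mapping looks roughly like this (the spreadsheet and presentation rows are just plausible examples, not necessarily the exact ones I use):
# Google-native mimeTypes mapped to export mimeTypes (illustrative only)
EXPORT_MIMETYPES = {
    'application/vnd.google-apps.document':
        'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
    'application/vnd.google-apps.spreadsheet':
        'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
    'application/vnd.google-apps.presentation':
        'application/vnd.openxmlformats-officedocument.presentationml.presentation',
}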
When downloading Google documents from Google Drive, this seems to work fine. By this I mean: my while loop with the code status, done = downloader.next_chunk() will eventually set done to True and the download completes.
What does not work:
However, on some files the done flag never becomes True and the script will download forever. This eventually amounts to several GB. Perhaps I am looking for the wrong flag that says the file is complete when doing an export. I am surprised that Google Drive never throws an error. Anybody know what could cause this?
Current status
For now I have exporting of google documents disabled in my code.
Scripts like "drive" by rakyll (at least the version I have) just put a link to the online copy. I would really like to do a proper export so that my offline system can maintain a complete backup of everything on Drive.
P.S. It's fine to suggest "you should use this service instead of the API" for the sake of others finding this page. I know that there are other services out there for this, but I'm really looking to explore the drive-api functions for integration with my own other systems.
OK. I found a pseudo solution here.
The problem is that the Google API never returns the Content-Length and the response is done in Chunks. However, either the chunk returned is wrong, or the Python API is not able to process it correctly.
What I did was grab the code for MediaIoBaseDownload from here
I left everything the same, but changed this part:
if 'content-range' in resp:
    content_range = resp['content-range']
    length = content_range.rsplit('/', 1)[1]
    self._total_size = int(length)
elif 'content-length' in resp:
    self._total_size = int(resp['content-length'])
else:
    # PSEUDO BUG FIX: No content-length, no chunk info, cut the response here.
    self._total_size = self._progress
The else at the end is what I've added. I've also changed the default chunk size by setting DEFAULT_CHUNK_SIZE = 2*1024*1024. You will also have to copy a few imports from that file, including this one: from googleapiclient.http import _retry_request, _should_retry_response
Of course this is not a solution, it just says "if I don't understand the response, just stop it here". This will probably make some exports not work, but at least it doesn't kill the server. This is only until we can find a good solution.
UPDATE:
Bug is already reported here: https://github.com/google/google-api-python-client/issues/15
and as of January 2017, the only workaround is to not use MediaIoBaseDownload and do this instead (not suitable for large files):
req = service.files().export(fileId=file_id, mimeType=mimeType)
resp = req.execute(http=http)
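req.execute() returns the whole exported document as a byte string, so you can simply write it out; a sketch (the output filename is just a placeholder):
# resp holds the full exported document as a byte string
with open('exported.docx', 'wb') as out_file:  # placeholder filename
    out_file.write(resp)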
I'm using this and it works with the following libraries:
google-auth-oauthlib==0.4.1
google-api-python-client
google-auth-httplib2
This is the snippet I'm using:
import io

from apiclient import errors
from googleapiclient.http import MediaIoBaseDownload
from googleapiclient.discovery import build

def download_google_document_from_drive(self, file_id):
    try:
        request = self.service.files().get_media(fileId=file_id)
        fh = io.BytesIO()
        downloader = MediaIoBaseDownload(fh, request)
        done = False
        while done is False:
            status, done = downloader.next_chunk()
            print('Download %d%%.' % int(status.progress() * 100))
        return fh
    except Exception as e:
        print('Error downloading file from Google Drive: %s' % e)
You can also load the returned stream directly, for example with xlrd:
import xlrd
workbook = xlrd.open_workbook(file_contents=fh.getvalue())
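Or, to write the stream out to an actual file on disk (the filename is just an example):
with open('downloaded.xlsx', 'wb') as out:  # example filename
    out.write(fh.getvalue())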

Using Tablib Library with Web2py

I've been trying for a while to make tablib work with web2py without luck. The code is delivering a .xls file as expected, but it's corrupted and empty.
import tablib
data = []
headers = ('first_name', 'last_name')
data = tablib.Dataset(*data, headers=headers)
data.append(('John', 'Adams'))
data.append(('George', 'Washington'))
response.headers['Content-Type']= 'application/vnd.ms-excel;charset=utf-8'
response.headers['Content-disposition']='attachment; filename=test.xls'
response.write(data.xls, escape=False)
Any ideas??
Thanks!
Per the web2py documentation, response.write is documented as serving
to write text into the output page body
(my emphasis). data.xls is not text -- it's binary stuff! To verify that this is indeed the cause of your problem, try using data.csv instead; that should work, since it is text.
I believe you'll need to use response.stream instead, to send "binary stuff" as your response (or as an attachment thereto).
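A minimal, untested sketch of that approach, assuming a controller action (the action name and chunk size are just placeholders), wrapping the binary payload in a file-like object:
import cStringIO
import tablib

def export_xls():  # hypothetical controller action
    headers = ('first_name', 'last_name')
    data = tablib.Dataset(headers=headers)
    data.append(('John', 'Adams'))
    data.append(('George', 'Washington'))
    response.headers['Content-Type'] = 'application/vnd.ms-excel;charset=utf-8'
    response.headers['Content-disposition'] = 'attachment; filename=test.xls'
    # stream the binary .xls bytes instead of writing them as text
    stream = cStringIO.StringIO(data.xls)
    return response.stream(stream, chunk_size=4096, request=request)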

Script to pull html and completely de-relativise it. (single file offline)

I am trying to learn Python and also create a web utility. One task I am trying to accomplish is creating a single HTML file which can be run locally but link to everything it needs to look like the original web page. (If you are going to ask why I want this, it's because it may act as a part of a utility I am creating, or if not, just for education.) So I have two questions, a theoretical one and a practical one:
1) Is this, for visual (as opposed to functional) purposes, possible? Can an HTML page work offline while linking to everything it needs online? Or is there something fundamental about having the HTML file itself execute on the web server which does not allow this to be possible? How far can I go with it?
2) I have started a Python script which de-relativises (made that one up) linked elements on an HTML page, but I am a noob, so most likely I missed some elements or attributes which would also link to outside resources. I have noticed after trying a few pages that the one in the code below does not work properly; there appears to be a .js file which is not linking correctly. (The first of many problems to come.) Assuming the answer to my first question was at least a partial yes, can anyone help me fix the code for this website?
Thank you.
Update: I missed the script tag on this, but even after I added it, it still does not work correctly.
import lxml
import sys
from lxml import etree
from StringIO import StringIO
from lxml.html import fromstring, tostring
import urllib2
from urlparse import urljoin
site = "www.script-tutorials.com/advance-php-login-system-tutorial/"
output_filename = "output.html"
def download(site):
    response = urllib2.urlopen("http://" + site)
    html_input = response.read()
    return html_input

def derealitivise(site, html_input):
    active_html = lxml.html.fromstring(html_input)
    for element in tags_to_derealitivise:
        for tag in active_html.xpath(element + "[@src]"):
            tag.attrib["src"] = urljoin("http://" + site, tag.attrib.get("src"))
        for tag in active_html.xpath(element + "[@href]"):
            tag.attrib["href"] = urljoin("http://" + site, tag.attrib.get("href"))
    return lxml.html.tostring(active_html)

active_html = ""
tags_to_derealitivise = ("//img", "//a", "//link", "//embed", "//audio", "//video", "//script")

print "downloading..."
active_html = download(site)
active_html = derealitivise(site, active_html)

print "writing file..."
output_file = open(output_filename, "w")
output_file.write(active_html)
output_file.close()
Furthermore, I could make the code more thorough by checking all of the elements...
It would look something like this, iterating over every element in the document. This is a separate problem, and I will most likely have figured out the details by the time anyone responds:
def derealitivise(site, html_input):
    active_html = lxml.html.fromstring(html_input)
    for tag in active_html.iter():
        if "src" in tag.attrib:
            tag.attrib["src"] = urljoin("http://" + site, tag.attrib["src"])
        if "href" in tag.attrib:
            tag.attrib["href"] = urljoin("http://" + site, tag.attrib["href"])
    return lxml.html.tostring(active_html)
Update
Thanks to Burhan Khalid's solution, which seemed too simple to be viable at first glance, I got it working. The code is so simple most of you will most likely not require it, but I will post it anyway in case it helps:
import lxml
import sys
from lxml import etree
from StringIO import StringIO
from lxml.html import fromstring, tostring
import urllib2
from urlparse import urljoin
site = "www.script-tutorials.com/advance-php-login-system-tutorial/"
output_filename = "output.html"
def download(site):
    response = urllib2.urlopen("http://" + site)
    html_input = response.read()
    return html_input

def derealitivise(site, html_input):
    active_html = html_input.replace('<head>', '<head><base href="http://' + site + '">')
    return active_html

active_html = ""

print "downloading..."
active_html = download(site)
active_html = derealitivise(site, active_html)

print "writing file..."
output_file = open(output_filename, "w")
output_file.write(active_html)
output_file.close()
Despite all of this, and its great simplicity, the .js object running on the website I have listed in the script still will not load correctly. Does anyone know if this is possible to fix?
while I am trying to make only the html file offline, while using the
linked resources over the web.
This is a two step process:
Copy the HTML file and save it to your local directory.
Add a BASE tag in the HEAD section, and point the href attribute of it to the absolute URL.
Since you want to learn how to do it yourself, I will leave it at that.
@Burhan has an easy answer using the <base href="..."> tag in the <head>, and it works, as you have found out. I ran the script you posted, and the page downloaded fine. As you noticed, some of the JavaScript now fails. This can be for multiple reasons.
If you are opening the HTML file as a local file:/// URL, the page may not work. Many browsers heavily sandbox local HTML files, not allowing them to perform network requests or examine local files.
The page may perform XMLHttpRequests or other network operations to the remote site, which will be denied for cross-domain scripting reasons. Looking in the JS console, I see the following errors for the script you posted:
XMLHttpRequest cannot load http://www.script-tutorials.com/menus.php?give=menu. Origin http://localhost:8000 is not allowed by Access-Control-Allow-Origin.
Unfortunately, if you do not have control of www.script-tutorials.com, there is no easy way around this.

How to write a python script for downloading?

I want to download some files from this site: http://www.emuparadise.me/soundtracks/highquality/index.php
But I only want to get certain ones.
Is there a way to write a python script to do this? I have intermediate knowledge of python
I'm just looking for a bit of guidance, please point me towards a wiki or library to accomplish this
thanks,
Shrub
Here's a link to my code
I looked at the page. The links seem to redirect to another page where the file is hosted, and clicking that downloads the file.
I would use mechanize to follow the required links to the right page, and then use BeautifulSoup or lxml to parse the resultant page to get the filename.
Then it's a simple matter of opening the file using urlopen and writing its contents out into a local file, like so:
from urllib2 import urlopen

f = open(localFilePath, 'wb')  # binary mode, since these are audio files
f.write(urlopen(remoteFilePath).read())
f.close()
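For the mechanize part, a rough sketch (the index URL is the one from the question; the link-text pattern is a made-up placeholder):
import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)  # the site may disallow robots
br.open("http://www.emuparadise.me/soundtracks/highquality/index.php")

# follow only the links whose text matches what you want (pattern is a placeholder)
for link in br.links(text_regex="Final Fantasy"):
    page = br.follow_link(link)  # the intermediate page hosting the file
    # ...parse page.read() with BeautifulSoup/lxml to find the real file URL,
    # then fetch it with urlopen as shown above
    br.back()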
Hope that helps
Make a URL request for the page. Once you have the source, filter it and get the URLs.
The files you want to download are URLs that contain a specific extension. It is with this that you can do a regular expression search for all URLs that match your criteria.
After filtering, do a URL request for each matched URL's data and write it to disk.
Sample code:
#!/usr/bin/python
import re
import sys
import urllib
#Your sample url
sampleUrl = "http://stackoverflow.com"
urlAddInfo = urllib.urlopen(sampleUrl)
data = urlAddInfo.read()
#Sample extensions we'll be looking for: pngs and pdfs
TARGET_EXTENSIONS = "(png|pdf)"
targetCompile = re.compile(TARGET_EXTENSIONS, re.UNICODE|re.MULTILINE)
#Let's get all the urls: match criteria{no spaces or " in a url}
urls = re.findall('(https?://[^\s"]+)', data, re.UNICODE|re.MULTILINE)
#We want these folks
extensionMatches = filter(lambda url: url and targetCompile.search(url), urls)
#The rest of the unmatched urls for which the scraping can also be repeated.
nonExtMatches = filter(lambda url: url and not targetCompile.search(url), urls)

def fileDl(targetUrl):
    #Function to handle downloading of files.
    #Arg: url => a String
    #Output: Boolean to signify if file has been written to disk
    #Validation of the url assumed, for the sake of keeping the illustration short
    urlAddInfo = urllib.urlopen(targetUrl)
    data = urlAddInfo.read()
    fileNameSearch = re.search("([^\/\s]+)$", targetUrl) #Text right after the last slash '/'
    if not fileNameSearch:
        sys.stderr.write("Could not extract a filename from url '%s'\n" % (targetUrl))
        return False
    fileName = fileNameSearch.groups(1)[0]
    with open(fileName, "wb") as f:
        f.write(data)
    sys.stderr.write("Wrote %s to disk\n" % (fileName))
    return True
#Let's now download the matched files
dlResults = map(lambda fUrl: fileDl(fUrl), extensionMatches)
successfulDls = filter(lambda s: s, dlResults)
sys.stderr.write("Downloaded %d files from %s\n"%(len(successfulDls), sampleUrl))
#You can organize the above code into a function to repeat the process for each of the
#other urls and in that way you can make a crawler.
The above code is written mainly for Python 2.x. However, I wrote a crawler that works on any version starting from 2.x.
Why yes! Five years later, not only is this possible, but you've now got a lot of ways to do it.
I'm going to avoid code examples here, because I mainly want to help break your problem into segments and give you some options for exploration:
Segment 1: GET!
If you must stick to the stdlib, for either Python 2 or Python 3, urllib[n]* is what you're going to want to use to pull something down from the internet.
So again, if you don't want dependencies on other packages:
urllib or urllib2 or maybe another urllib[n] I'm forgetting about.
If you don't have to restrict your imports to the Standard Library:
you're in luck!!!!! You've got:
requests, with docs here. requests is the gold standard for gettin' stuff off the web with Python. I suggest you use it.
uplink with docs here. It's relatively new & for more programmatic client interfaces.
aiohttp via asyncio, with docs here. asyncio is only included in Python >= 3.5, and it's also extra confusing. That said, if you're willing to put in the time, it can be ridiculously efficient for exactly this use-case.
...I'd also be remiss not to mention one of my favorite tools for crawling:
fake_useragent repo here. Docs like seriously not necessary.
Segment 2: Parse!
So again, if you must stick to the stdlib and not install anything with pip, you get to use the extra-extra fun and secure (<==extreme-sarcasm) xml builtin module. Specifically, you get to use the:
xml.etree.ElementTree() with docs here.
It's worth noting that the ElementTree object is what the pip-downloadable lxml package is based on, and which it makes easier to use. If you want to reinvent the wheel and write a bunch of your own complicated logic, using the default xml module is your option.
If you don't have to restrict your imports to the Standard Library:
lxml, with docs here. As I said before, lxml is a wrapper around xml.etree that makes it human-usable and implements all those parsing tools you'd need to make yourself. However, as you can see by visiting the docs, it's not easy to use by itself. This brings us to...
BeautifulSoup aka bs4 with docs here. BeautifulSoup makes everything easier. It's my recommendation for this.
Segment 3: GET GET GET!
This section is nearly exactly the same as "Segment 1," except you have a bunch of links, not one.
The only thing that changes between this section and "Segment 1" is my recommendation for what to use: aiohttp here will download way faster when dealing with several URLs, because it allows you to download them in parallel.**
* - (where n was decided on from Python version to Python version in a somewhat frustratingly arbitrary manner. Look up which urllib[n] has .urlopen() as a top-level function. You can read more about this naming-convention clusterf**k here, here, and here.)
** - (This isn't totally true. It's more sort-of functionally-true at human timescales.)
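OK, breaking my own "no code examples" rule for a moment: here's a minimal sketch putting Segments 1 and 2 together with requests and bs4. The index URL comes from the question; the link filter is a made-up placeholder you'd replace with your own criteria.
import os
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin  # Python 3

INDEX = "http://www.emuparadise.me/soundtracks/highquality/index.php"

html = requests.get(INDEX).text
soup = BeautifulSoup(html, "html.parser")

# collect every link on the page and resolve relative hrefs against the index URL
links = [urljoin(INDEX, a["href"]) for a in soup.find_all("a", href=True)]
wanted = [url for url in links if "Final-Fantasy" in url]  # placeholder filter

for url in wanted:
    resp = requests.get(url, stream=True)
    name = os.path.basename(url) or "index.html"
    with open(name, "wb") as fh:
        # stream the body to disk in chunks instead of loading it all into memory
        for chunk in resp.iter_content(chunk_size=65536):
            fh.write(chunk)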
I would use a combination of wget for downloading - http://www.thegeekstuff.com/2009/09/the-ultimate-wget-download-guide-with-15-awesome-examples/#more-1885 and BeautifulSoup http://www.crummy.com/software/BeautifulSoup/bs4/doc/ for parsing the downloaded file

wget Vs urlretrieve of python

I have a task to download GBs of data from a website. The data is in the form of .gz files, each file being 45 MB in size.
The easy way to get the files is to use "wget -r -np -A files url". This will download the data recursively and mirror the website. The download rate is very high: 4mb/sec.
But, just to play around I was also using python to build my urlparser.
Downloading via Python's urlretrieve is damn slow, possibly 4 times as slow as wget. The download rate is 500kb/sec. I use HTMLParser for parsing the href tags.
I am not sure why this is happening. Are there any settings for this?
Thanks
Probably a unit math error on your part.
Just notice that 500 KB/s (kilobytes) is equal to 4 Mb/s (megabits).
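The arithmetic, spelled out:
kilobytes_per_second = 500
megabits_per_second = kilobytes_per_second * 8 / 1000.0  # 8 bits per byte, 1000 kilobits per megabit
print megabits_per_second  # 4.0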
urllib works as fast as wget for me. Try this code; it shows the progress as a percentage, just like wget does.
import sys, urllib
def reporthook(a, b, c):
    # ',' at the end of the line is important!
    print "% 3.1f%% of %d bytes\r" % (min(100, float(a * b) / c * 100), c),
    # you can also use sys.stdout.write
    # sys.stdout.write("\r% 3.1f%% of %d bytes"
    #                  % (min(100, float(a * b) / c * 100), c))
    sys.stdout.flush()

for url in sys.argv[1:]:
    i = url.rfind('/')
    file = url[i+1:]
    print url, "->", file
    urllib.urlretrieve(url, file, reporthook)
import subprocess
myurl = 'http://some_server/data/'
subprocess.call(["wget", "-r", "-np", "-A", "files", myurl])
As for the HTML parsing, the fastest/easiest you will probably get is using lxml.
As for the HTTP requests themselves: httplib2 is very easy to use, and could possibly speed up downloads because it supports HTTP 1.1 keep-alive connections and gzip compression. There is also pycURL, which claims to be very fast (but is more difficult to use) and is built on libcurl, but I've never used that.
You could also try downloading different files concurrently, but keep in mind that trying to optimize your download times too far may not be very polite towards the website in question; a small thread-pool sketch follows at the end of this answer.
Sorry for the lack of hyperlinks, but SO tells me "sorry, new users can only post a maximum of one hyperlink".
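The concurrent-download sketch mentioned above (the URLs are placeholders; keep the pool small to stay polite):
import urllib
from multiprocessing.dummy import Pool  # thread pool, not processes

urls = ["http://some_server/data/file1.gz",   # placeholder URLs
        "http://some_server/data/file2.gz"]

def fetch(url):
    # save each file under the name after the last slash
    filename = url.rsplit('/', 1)[-1]
    urllib.urlretrieve(url, filename)
    return filename

pool = Pool(4)                 # 4 parallel downloads
print pool.map(fetch, urls)    # blocks until all downloads finish
pool.close()
pool.join()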
Transfer speeds can be easily misleading. Could you try the following script, which simply downloads the same URL with both wget and urllib.urlretrieve? Run it a few times in case you're behind a proxy which caches the URL on the second attempt.
For small files, wget will take slightly longer due to the external process' startup time, but for larger files that should become irrelevant.
from time import time
import urllib
import subprocess
target = "http://example.com" # change this to a more useful URL
wget_start = time()
proc = subprocess.Popen(["wget", target])
proc.communicate()
wget_end = time()
url_start = time()
urllib.urlretrieve(target)
url_end = time()
print "wget -> %s" % (wget_end - wget_start)
print "urllib.urlretrieve -> %s" % (url_end - url_start)
Maybe you can wget and then inspect the data in Python?
Since Python suggests using urllib2 instead of urllib, I ran a test comparing urllib2.urlopen and wget.
The result is that it takes nearly the same time for both of them to download the same file. Sometimes, urllib2 even performs better.
The advantage of wget lies in a dynamic progress bar showing the percentage finished and the current download speed while transferring.
The file size in my test was 5 MB. I haven't used any cache module in Python, and I am not aware of how wget works when downloading a big file.
There shouldn't be a difference really. All urlretrieve does is make a simple HTTP GET request. Have you taken out your data processing code and done a straight throughput comparison of wget vs. pure python?
Please show us some code. I'm pretty sure the problem has to be with your code and not with urlretrieve.
I've worked with it in the past and never had any speed-related issues.
You can use wget -k to convert the links in all URLs to relative ones.
