I have a webapp that exports reports as PDF. Everything is fine when the query returns fewer than 100 values. When the number of records rises above 100, the server raises a 502 Proxy Error. The report outputs fine in HTML; the process that hangs the server is the conversion from HTML to PDF.
I'm using xhtml2pdf (AKA pisa 3.0) to generate the PDF. The algorithm is something like this:
def view1(request, **someargs):
    queryset = someModel.objects.get(someargs)
    if request.GET['pdf']:
        return pdfWrapper('template.html', queryset, 'filename')
    else:
        return render_to_response('template.html', queryset)
def pdfWrapper(template_src, context_dict, filename):
    ################################################
    #
    # The code commented out below is an older version.
    # I updated the code according to the comment received.
    # The function still works for short HTML documents
    # and produces the 502 for larger ones.
    #
    ################################################
    ##import cStringIO as StringIO
    import ho.pisa as pisa
    from django.template.loader import get_template
    from django.template import Context
    from django.http import HttpResponse
    ##from cgi import escape
    template = get_template(template_src)
    context = Context(context_dict)
    html = template.render(context)
    response = HttpResponse()
    response['Content-Type'] = 'application/pdf'
    response['Content-Disposition'] = 'attachment; filename=%s.pdf' % (filename)
    pisa.CreatePDF(
        src=html,
        dest=response,
        show_error_as_pdf=True)
    return response
    ##result = StringIO.StringIO()
    ##pdf = pisa.pisaDocument(
    ##    StringIO.StringIO(html.encode("ISO-8859-1")),
    ##    result)
    ##if not pdf.err:
    ##    response = HttpResponse(
    ##        result.getvalue(),
    ##        mimetype='application/pdf')
    ##    response['Content-Disposition'] = 'attachment; filename=%s.pdf' % (filename)
    ##    return response
    ##return HttpResponse('There was an error<pre>%s</pre>' % escape(html))
I've thought about creating a buffer so the server can free some memory, but I haven't found anything yet.
Can anyone help, please?
I can't tell you exactly what causes your problem - it could be caused by buffering problems in StringIO.
However, you are wrong if you assume that this code would actually stream the generated PDF data: StringIO.getvalue() returns the content of the string buffer at the time this method is called, not an output stream (see http://docs.python.org/library/stringio.html#StringIO.StringIO.getvalue).
If you want to stream the output, you can treat the HttpResponse instance as a file-like object (see http://docs.djangoproject.com/en/1.2/ref/request-response/#usage).
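For illustration, a minimal sketch of that usage: HttpResponse supports write(), so a PDF generator can write chunks into it as if it were a file.
from django.http import HttpResponse

response = HttpResponse()
response['Content-Type'] = 'application/pdf'
response.write('...a chunk of PDF data...')  # HttpResponse acts as a writable file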
Secondly, I don't see any reason to make use of StringIO here. According to the documentation of Pisa I found (which calls this function CreatePDF, by the way) the source can be a string or a unicode object.
Personally, I would try the following:
Create the HTML as a unicode string
Create and configure the HttpResponse object
Call the PDF generator with the string as input and the response as output
In outline, this could look like this:
html = template.render(context)
response = HttpResponse()
response['Content-Type'] = 'application/pdf'
response['Content-Disposition'] = 'attachment; filename=%s.pdf' % (filename)
pisa.CreatePDF(
    src=html,
    dest=response,
    show_error_as_pdf=True)
#response.flush()
return response
However, I have not tried whether this actually works. (So far I have only done this sort of PDF streaming in Java.)
Update: I just looked at the implementation of HttpResponse. It implements the file interface by collecting the chunks of strings written to it in a list. Calling response.flush() is pointless, because it does nothing. Also, you can set response parameters like Content-Type even after the response has been accessed as a file object.
Your original problem may also be related to the fact that you never closed the StringIO objects. The underlying buffer of a StringIO object is not released before close() is called.
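For example, with the names from your commented-out code, a close-in-finally pattern would look like this:
result = StringIO.StringIO()
try:
    pdf = pisa.pisaDocument(
        StringIO.StringIO(html.encode("ISO-8859-1")),
        result)
    data = result.getvalue()
finally:
    result.close()  # releases the underlying buffer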
1. Deprecation problem
In Python 3.7, I download a big file from a URL using the urllib.request.urlretrieve(..) function. In the documentation (https://docs.python.org/3/library/urllib.request.html) I read the following just above the urllib.request.urlretrieve(..) docs:
Legacy interface
The following functions and classes are ported from the Python 2 module urllib (as opposed to urllib2). They might become deprecated at some point in the future.
2. Searching an alternative
To keep my code future-proof, I'm on the lookout for an alternative. The official Python docs don't mention a specific one, but it looks like urllib.request.urlopen(..) is the most straightforward candidate. It's at the top of the docs page.
Unfortunately, the alternatives - like urlopen(..) - don't provide the reporthook argument. This argument is a callable you pass to the urlretrieve(..) function. In turn, urlretrieve(..) calls it regularly with the following arguments:
block number
block size
total file size
I use it to update a progressbar. That's why I miss the reporthook argument in alternatives.
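For illustration, here is a minimal reporthook of the kind I pass (the print-based progress display is a stand-in for my actual progressbar, and the URL is just a placeholder):
import urllib.request

def reporthook(blocknum, blocksize, totalsize):
    # Called once before the first block and again after each block:
    # roughly blocknum * blocksize bytes have arrived; totalsize is -1 if unknown.
    if totalsize > 0:
        percent = min(100, blocknum * blocksize * 100 // totalsize)
        print('\rDownloaded %d%%' % percent, end='', flush=True)

urllib.request.urlretrieve('http://example.com/big.file', 'big.file',
                           reporthook=reporthook)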
3. urlretrieve(..) vs urlopen(..)
I discovered that urlretrieve(..) simply uses urlopen(..). See the request.py code file in the Python 3.7 installation (Python37/Lib/urllib/request.py):
_url_tempfiles = []
def urlretrieve(url, filename=None, reporthook=None, data=None):
    """
    Retrieve a URL into a temporary location on disk.

    Requires a URL argument. If a filename is passed, it is used as
    the temporary file location. The reporthook argument should be
    a callable that accepts a block number, a read size, and the
    total file size of the URL target. The data argument should be
    valid URL encoded data.

    If a filename is passed and the URL points to a local resource,
    the result is a copy from local file to new file.

    Returns a tuple containing the path to the newly created
    data file as well as the resulting HTTPMessage object.
    """
    url_type, path = splittype(url)

    with contextlib.closing(urlopen(url, data)) as fp:
        headers = fp.info()

        # Just return the local path and the "headers" for file://
        # URLs. No sense in performing a copy unless requested.
        if url_type == "file" and not filename:
            return os.path.normpath(path), headers

        # Handle temporary file setup.
        if filename:
            tfp = open(filename, 'wb')
        else:
            tfp = tempfile.NamedTemporaryFile(delete=False)
            filename = tfp.name
            _url_tempfiles.append(filename)

        with tfp:
            result = filename, headers
            bs = 1024*8
            size = -1
            read = 0
            blocknum = 0
            if "content-length" in headers:
                size = int(headers["Content-Length"])

            if reporthook:
                reporthook(blocknum, bs, size)

            while True:
                block = fp.read(bs)
                if not block:
                    break
                read += len(block)
                tfp.write(block)
                blocknum += 1
                if reporthook:
                    reporthook(blocknum, bs, size)

    if size >= 0 and read < size:
        raise ContentTooShortError(
            "retrieval incomplete: got only %i out of %i bytes"
            % (read, size), result)

    return result
4. Conclusion
From all this, I see three possible decisions:
I keep my code unchanged. Let's hope the urlretrieve(..) function won't get deprecated anytime soon.
I write a replacement function myself that behaves like urlretrieve(..) on the outside and uses urlopen(..) on the inside (see the sketch after this list). In practice, such a function would be a copy-paste of the code above, which feels unclean compared to using the official urlretrieve(..).
I write a replacement function myself that behaves like urlretrieve(..) on the outside and uses something entirely different on the inside. But hey, why would I do that? urlopen(..) is not deprecated, so why not use it?
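Here is a sketch of what option 2 could look like (my_urlretrieve is a hypothetical name; the structure just mirrors the standard-library code above):
import urllib.request

def my_urlretrieve(url, filename, reporthook=None, blocksize=1024*8):
    # Hypothetical replacement: same reporthook contract as urlretrieve(..),
    # but built directly on urlopen(..).
    with urllib.request.urlopen(url) as response:
        headers = response.info()
        size = int(headers.get('Content-Length', -1))
        blocknum = 0
        if reporthook:
            reporthook(blocknum, blocksize, size)
        with open(filename, 'wb') as out:
            while True:
                block = response.read(blocksize)
                if not block:
                    break
                out.write(block)
                blocknum += 1
                if reporthook:
                    reporthook(blocknum, blocksize, size)
    return filename, headers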
What decision would you take?
The following example uses urllib.request.urlopen to download a zip file containing Oceania's crop production data from the FAO statistical database. In that example, it is necessary to define a minimal header, otherwise FAOSTAT throws an Error 403: Forbidden.
import shutil
import urllib.request
import tempfile

# Create a request object with URL and headers
url = 'http://fenixservices.fao.org/faostat/static/bulkdownloads/Production_Crops_Livestock_E_Oceania.zip'
header = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) '}
req = urllib.request.Request(url=url, headers=header)

# Define the destination file
dest_file = tempfile.gettempdir() + '/' + 'crop.zip'
print(f"File located at: {dest_file}")

# Create an http response object
with urllib.request.urlopen(req) as response:
    # Create a file object
    with open(dest_file, "wb") as f:
        # Copy the binary content of the response to the file
        shutil.copyfileobj(response, f)
Based on https://stackoverflow.com/a/48691447/2641825 for the request part and https://stackoverflow.com/a/66591873/2641825 for the header part, see also urllib's documentation at https://docs.python.org/3/howto/urllib2.html
I am using the Requests library to decode a JSON response as follows:
Payload Being Decoded:
{
    "objectOne": {
        "desc": "one"
    },
    "objectTwo": {
        "desc": "two"
    }
}
Code:
from django.http import HttpResponse
import requests

class ApiService:
    @staticmethod
    def say_something(self):
        resp = requests.get('http://127.0.0.1:9501/polls/test/')
        return HttpResponse(resp.json())
Output:
objectOneobjectTwo
I followed the simple example from the official documentation:
JSON Response Content
In addition, I wrapped the response in [] brackets to see whether the response must be in a JSON array, but it just returns an array with 'objectOneobjectTwo' as the first and only element.
You have misdiagnosed what is happening. requests is returning the correct dictionary. But you are passing a dictionary to HttpResponse(), without any further processing. However, HttpResponse() is not set up to handle dictionaries.
What happens is that HttpResponse() takes an iterable and will serve each value from the iterable as a string. In essence, you told Django to send only the keys, concatenated, to the client. From the HttpResponse() documentation:
content should be an iterator or a string. If it’s an iterator, it should return strings, and those strings will be joined together to form the content of the response.
and from the dict() documentation:
iter(d)
Return an iterator over the keys of the dictionary. This is a shortcut for iter(d.keys()).
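You can reproduce the observed output directly with the payload from the question:
# Iterating a dict yields only its keys; HttpResponse joins them as strings.
data = {"objectOne": {"desc": "one"}, "objectTwo": {"desc": "two"}}
print(''.join(iter(data)))  # -> objectOneobjectTwo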
If you wanted to send JSON data, use a JsonResponse() object instead. It is set up to encode Python objects correctly:
from django.http import JsonResponse
import requests

def say_something(request):
    resp = requests.get('http://127.0.0.1:9501/polls/test/')
    return JsonResponse(resp.json())
or don't bother with decoding and re-encoding, just pass on the original response data:
from django.http import HttpResponse
import requests

def say_something(request):
    resp = requests.get('http://127.0.0.1:9501/polls/test/')
    return HttpResponse(resp.text, resp.headers['content-type'])
Note: I removed the class and staticmethod decorator; there is little point in wrapping static view functions in a class when no state is shared. Use modules to create namespaces instead.
Your data is a JSON object, which is parsed into a Python dictionary. HttpResponse expects an iterable, which it iterates through when returning the response upstream; usually you pass a string (i.e. the result of rendering a template), but here you pass a dictionary. Iterating through a dictionary gives the keys only.
If you want to show the full output, convert to a string first by passing str(response.json()) - which of course is pointless since it will just be the same as the raw response.
I want to test my web service (built on Tornado) using tornado.testing.AsyncHTTPTestCase. It says here that a POST with AsyncHTTPClient should look like the following.
from tornado.testing import AsyncHTTPTestCase
from urllib import urlencode

class ApplicationTestCase(AsyncHTTPTestCase):
    def get_app(self):
        return app.Application()

    def test_file_uploading(self):
        url = '/'
        filepath = 'uploading_file.zip'  # Binary file
        data = ???????  # Read from "filepath" and put the generated something into "data"
        self.http_client.fetch(self.get_url(url),
                               self.stop,
                               method="POST",
                               data=urlencode(data))
        response = self.wait()
        self.assertEqual(response.code, 302)  # Do assertion

if __name__ == '__main__':
    unittest.main()
The problem is that I've no idea what to write at ???????. Are there any utility functions built into Tornado, or is it better to use an alternative library like Requests?
P.S.
... actually, I've tried using Requests, but my test stopped working, probably because I didn't handle the asynchronous execution correctly:
def test_file_uploading(self):
    url = '/'
    filepath = 'uploading_file.zip'  # Binary file
    files = {'file': open(filepath, 'rb')}
    r = requests.post(self.get_url(url), files=files)  # Freezes here
    self.assertEqual(response.code, 302)  # Do assertion
You need to construct a multipart/form-data request body. This is officially defined in the HTML spec. Tornado does not currently have any helper functions for generating a multipart body. However, you can use the MultipartEncoder class from the requests_toolbelt package. Just use the to_string() method instead of passing the encoder object directly to fetch().
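For example, a sketch of the test using MultipartEncoder (this assumes requests_toolbelt is installed; the field name 'file', the zip filename, and the 302 assertion are carried over from your code):
from requests_toolbelt import MultipartEncoder

def test_file_uploading(self):
    # Build a multipart/form-data body with the file under the 'file' field
    encoder = MultipartEncoder(
        fields={'file': ('uploading_file.zip',
                         open('uploading_file.zip', 'rb'),
                         'application/zip')})
    self.http_client.fetch(self.get_url('/'),
                           self.stop,
                           method='POST',
                           headers={'Content-Type': encoder.content_type},
                           body=encoder.to_string())
    response = self.wait()
    self.assertEqual(response.code, 302)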
I'm using Python to scrape data from a number of web pages that have simple HTML input forms, like the 'Username:' form at the bottom of this page:
http://www.w3schools.com/html/html_forms.asp (this is just a simple example to illustrate the problem)
Firefox Inspect Element indicates this form field has the following HTML structure:
<form name="input0" target="_blank" action="html_form_action.asp" method="get">
    Username:
    <input name="user" size="20" type="text"></input>
    <input value="Submit" type="submit"></input>
</form>
All I want to do is fill out this form and get the resulting page:
http://www.w3schools.com/html/html_form_action.asp?user=ThisIsMyUserName
Which is what is produced in my browser by entering 'ThisIsMyUserName' in the 'Username' field and pressing 'Submit'. However, every method that I have tried (details below) returns the contents of the original page containing the unaltered form without any indication the form data I submitted was recognized, i.e. I get the content from the first link above in response to my request, when I expected to receive the content of the second link.
I suspect the problem has to do with action="html_form_action.asp" in the form above, or perhaps some kind of hidden field I'm missing (I don't know what to look for - I'm new to form submission). Any suggestions?
HERE IS WHAT I'VE TRIED SO FAR:
Using urllib.request in Python 3:
import urllib.request
import urllib.parse
# Create dict of form values
example_data = urllib.parse.urlencode({'user': 'ThisIsMyUserName'})
# Encode dict
example_data = example_data.encode('utf-8')
# Create request
example_url = 'http://www.w3schools.com/html/html_forms.asp'
request = urllib.request.Request(example_url, data=example_data)
# Create opener and install
my_url_opener = urllib.request.build_opener() # no handlers
urllib.request.install_opener(my_url_opener)
# Open the page and read content
web_page = urllib.request.urlopen(request)
content = web_page.read()
# Save content to file
my_html_file = open('my_html_file.html', 'wb')
my_html_file.write(content)
my_html_file.close()
But what is returned to me and saved in 'my_html_file.html' is the original page containing the unaltered form, without any indication that my form data was recognized, i.e. I get this page in response: http://www.w3schools.com/html/html_forms.asp
...which is the same thing I would have expected if I had made this request without the data parameter at all (which would change the request from a POST to a GET).
Naturally the first thing I did was check whether my request was being constructed properly:
# Just double-checking the request is set up correctly
print("GET or POST?", request.get_method())
print("DATA:", request.data)
print("HEADERS:", request.header_items())
Which produces the following output:
GET or POST? POST
DATA: b'user=ThisIsMyUserName'
HEADERS: [('Content-length', '21'), ('Content-type', 'application/x-www-form-urlencoded'), ('User-agent', 'Python-urllib/3.3'), ('Host', 'www.w3schools.com')]
So it appears the POST request has been structured correctly. After re-reading the documentation and unsuccessfully searching the web for an answer to this problem, I moved on to a different tool: the requests module. I attempted to perform the same task:
import requests
example_url = 'http://www.w3schools.com/html/html_forms.asp'
data_to_send = {'user': 'ThisIsMyUserName'}
response = requests.post(example_url, params=data_to_send)
contents = response.content
And I get the same exact result. At this point I'm thinking maybe this is a Python 3 issue. So I fire up my trusty Python 2.7 and try the following:
import urllib, urllib2
data = urllib.urlencode({'user' : 'ThisIsMyUserName'})
resp = urllib2.urlopen('http://www.w3schools.com/html/html_forms.asp', data)
content = resp.read()
And I get the same result again! For thoroughness I figured I'd attempt to achieve the same result by encoding the dictionary values into the url and attempting a GET request:
# Using Python 3
import urllib.parse
# Construct the url for the GET request
example_url = 'http://www.w3schools.com/html/html_forms.asp'
form_values = {'user': 'ThisIsMyUserName'}
example_data = urllib.parse.urlencode(form_values)
final_url = example_url + '?' + example_data
print(final_url)
This spits out the following value for final_url:
http://www.w3schools.com/html/html_forms.asp?user=ThisIsMyUserName
I plug this into my browser and I see that this page is exactly the same as the original page, which is exactly what my program is downloading.
I've also tried adding additional headers and cookie support to no avail.
I've tried everything I can think of. Any idea what could be going wrong?
The form states an action and a method; you are ignoring both. The method states the form uses GET, not POST, and the action tells you to send the form data to html_form_action.asp.
The action attribute acts like any other URL specifier in an HTML page; unless it starts with a scheme (so with http://..., https://..., etc.) it is relative to the current base URL of the page.
The GET HTTP method adds the URL-encoded form parameters to the target URL with a question mark:
import urllib.request
import urllib.parse
# Create dict of form values
example_data = urllib.parse.urlencode({'user': 'ThisIsMyUserName'})
# Create request
example_url = 'http://www.w3schools.com/html/html_form_action.asp'
get_url = example_url + '?' + example_data
# Open the page and read content
web_page = urllib.request.urlopen(get_url)
print(web_page.read().decode(web_page.info().get_param('charset', 'utf8')))
or, using requests:
import requests
example_url = 'http://www.w3schools.com/html/html_form_action.asp'
data_to_send = {'user': 'ThisIsMyUserName'}
response = requests.get(example_url, params=data_to_send)
contents = response.text
print(contents)
In both examples I also decoded the response to Unicode text (something requests makes easier for me with the response.text attribute).
Related to: django - pisa : adding images to PDF output
I've got a site that uses the Google Chart API to display a bunch of reports to the user, and I'm trying to implement a PDF version. I'm using the link_callback parameter in pisa.pisaDocument which works great for local media (css/images), but I'm wondering if it would work with remote images (using a google charts URL).
From the documentation on the pisa website, they imply this is possible, but they don't show how:
Normally pisa expects these files to be found on the local drive. They may also be referenced relative to the original document. But the programmer might want to load from different kinds of sources, like the Internet via HTTP requests, or from a database, or anything else.
This is in a Django project, but that's pretty irrelevant. Here's what I'm using for rendering:
html = render_to_string('reporting/pdf.html', keys,
                        context_instance=RequestContext(request))
result = StringIO.StringIO()
pdf = pisa.pisaDocument(
    StringIO.StringIO(html.encode('ascii', 'xmlcharrefreplace')),
    result, link_callback=link_callback)
return HttpResponse(result.getvalue(), mimetype='application/pdf')
I tried having the link_callback return the file-like object from urllib2.urlopen, but that does not seem to work:
def link_callback(uri, rel):
    if uri.find('chxt') != -1:
        url = "%s?%s" % (settings.GOOGLE_CHART_URL, uri)
        return urllib2.urlopen(url)
    return os.path.join(settings.MEDIA_ROOT, uri.replace(settings.MEDIA_URL, ""))
The PDF it generates comes out perfectly except that the google charts images are not there.
Well this was a whole lot easier than I expected. In your link_callback method, if the uri is a remote image, simply return that value.
def link_callback(uri, rel):
    if uri.find('chart.apis.google.com') != -1:
        return uri
    return os.path.join(settings.MEDIA_ROOT, uri.replace(settings.MEDIA_URL, ""))
The browser is a lot less picky about the image URL than pisa is, so make sure the uri is properly quoted. Mine had space characters in it, which is why it was failing at first (replacing them with '+' fixed it).
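For example, a minimal version of the space-to-plus fix mentioned above:
def link_callback(uri, rel):
    if uri.find('chart.apis.google.com') != -1:
        # pisa fetches this URL itself, so spaces must be encoded;
        # replacing them with '+' was enough for the chart query string.
        return uri.replace(' ', '+')
    return os.path.join(settings.MEDIA_ROOT, uri.replace(settings.MEDIA_URL, ""))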