How to encode UTF8 filename for HTTP headers? (Python, Django)

I have a problem with HTTP headers: they're encoded in ASCII, and I want to provide a view for downloading files whose names can be non-ASCII.
response['Content-Disposition'] = 'attachment; filename="%s"' % (vo.filename.encode("ASCII","replace"), )
I don't want to use static file serving, because it has the same issue with non-ASCII file names, and in that case there would also be a problem with the file system and its file name encoding. (I don't know the target OS.)
I've already tried urllib.quote(), but it raises KeyError exception.
Possibly I'm doing something wrong but maybe it's impossible.

This is a FAQ.
There is no interoperable way to do this. Some browsers implement proprietary extensions (IE, Chrome), others implement RFC 2231 (Firefox, Opera).
See test cases at http://greenbytes.de/tech/tc2231/.
Update: as of November 2012, all current desktop browsers support the encoding defined in RFC 6266 and RFC 5987 (Safari >= 6, IE >= 9, Chrome, Firefox, Opera, Konqueror).

Don't send a filename in Content-Disposition. There is no way to make non-ASCII header parameters work cross-browser(*).
Instead, send just “Content-Disposition: attachment”, and leave the filename as a URL-encoded UTF-8 string in the trailing (PATH_INFO) part of your URL, for the browser to pick up and use by default. UTF-8 URLs are handled much more reliably by browsers than anything to do with Content-Disposition.
(*: actually, there's not even a current standard that says how it should be done as the relationships between RFCs 2616, 2231 and 2047 are pretty dysfunctional, something that Julian is trying to get cleared up at a spec level. Consistent browser support is in the distant future.)
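As a rough illustration of the URL-based approach above, here is a minimal Python 3 sketch that builds a download URL whose last path segment is the percent-encoded UTF-8 filename. The function name download_url and the example.com route are hypothetical; how the trailing segment is routed and ignored on the server is up to your application.

```python
from urllib.parse import quote, unquote, urlsplit


def download_url(base_url, filename):
    """Append filename to base_url as a percent-encoded UTF-8 path segment."""
    # safe="" also encodes "/" so the name always stays a single segment.
    return base_url.rstrip("/") + "/" + quote(filename, safe="")


# Hypothetical route: /download/<id>/<display name>
url = download_url("https://example.com/download/42", "My R\u00e9sum\u00e9.pdf")
print(url)
# The last segment decodes back to the original name.
print(unquote(urlsplit(url).path.rsplit("/", 1)[-1]))
```

The response for that URL would then carry only "Content-Disposition: attachment", leaving the browser to take the name from the URL.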

Note that in 2011, RFC 6266 (especially Appendix D) weighed in on this issue and has specific recommendations to follow.
Namely, you can issue a filename with only ASCII characters, followed by filename* with an RFC 5987-formatted filename for those agents that understand it.
Typically this will look like filename="my-resume.pdf"; filename*=UTF-8''My%20R%C3%A9sum%C3%A9.pdf, where the Unicode filename ("My Résumé.pdf") is encoded into UTF-8 and then percent-encoded (note, do NOT use + for spaces).
Please do actually read RFC 6266 and RFC 5987 (or use a robust and tested library that abstracts this for you), as my summary here is lacking in important detail.
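The header form described above can be sketched in Python 3 roughly as follows. This is only an illustration, not a complete RFC 6266 implementation: the helper name content_disposition is made up here, and deriving the ASCII fallback by NFKD normalization is just one possible choice.

```python
import unicodedata
from urllib.parse import quote


def content_disposition(filename, disposition="attachment"):
    """Build a Content-Disposition value with filename and filename*."""
    # ASCII-only fallback: decompose accented characters, drop the rest.
    fallback = (
        unicodedata.normalize("NFKD", filename)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Escape backslashes and double quotes inside the quoted-string.
    quoted = fallback.replace("\\", "\\\\").replace('"', '\\"')
    header = '{}; filename="{}"'.format(disposition, quoted)
    if fallback != filename:
        # UTF-8, then percent-encoded; spaces become %20, never "+".
        header += "; filename*=UTF-8''{}".format(quote(filename, safe=""))
    return header


print(content_disposition("My R\u00e9sum\u00e9.pdf"))
```

For a plain ASCII name the function emits only the filename parameter; the filename* parameter is added when the fallback had to drop or alter characters.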

Starting with Django 2.1 (see issue #16470), you can use FileResponse, which will correctly set the Content-Disposition header for attachments. Starting with Django 3.0 (issue #30196) it will also set it correctly for inline files.
For example, to return a file named my_img.jpg with MIME type image/jpeg as an HTTP response:
response = FileResponse(open("my_img.jpg", 'rb'), as_attachment=True, content_type="image/jpeg")
return response
Or, if you can't use FileResponse, you can use the relevant part from FileResponse's source to set the Content-Disposition header yourself. Here's what that source currently looks like:
from urllib.parse import quote
disposition = 'attachment' if as_attachment else 'inline'
try:
    filename.encode('ascii')
    file_expr = 'filename="{}"'.format(filename)
except UnicodeEncodeError:
    file_expr = "filename*=utf-8''{}".format(quote(filename))
response.headers['Content-Disposition'] = '{}; {}'.format(disposition, file_expr)

I can say that I've had success using the newer (RFC 5987) format of specifying a header encoded with the e-mail form (RFC 2231). I came up with the following solution which is based on code from the django-sendfile project.
import unicodedata
# Note: urlquote was removed in Django 4.0; urllib.parse.quote is the replacement there.
from django.utils.http import urlquote
def rfc5987_content_disposition(file_name):
    # ASCII-only fallback name for clients that ignore filename*
    ascii_name = unicodedata.normalize('NFKD', file_name).encode('ascii','ignore').decode()
    header = 'attachment; filename="{}"'.format(ascii_name)
    if ascii_name != file_name:
        quoted_name = urlquote(file_name)
        header += '; filename*=UTF-8\'\'{}'.format(quoted_name)
    return header
# e.g.
# response['Content-Disposition'] = rfc5987_content_disposition(file_name)
I have only tested my code on Python 3.4 with Django 1.8, so the similar solution in django-sendfile may suit you better.
There's a long standing ticket in Django's tracker which acknowledges this but no patches have yet been proposed afaict. So unfortunately this is as close to using a robust tested library as I could find, please let me know if there's a better solution.

The escape_uri_path function from Django is the solution that worked for me.
Read the Django Docs here to see which RFC standards are currently specified.
from django.utils.encoding import escape_uri_path
file = "response.zip"
response = HttpResponse(content_type='application/zip')
response['Content-Disposition'] = f"attachment; filename*=utf-8''{escape_uri_path(file)}"

A hack:
if (Request.UserAgent.Contains("IE"))
{
    // IE will accept URL encoding, but spaces don't need to be, and since they're so common..
    filename = filename.Replace("%", "%25").Replace(";", "%3B").Replace("#", "%23").Replace("&", "%26");
}

Related

File extension from MIME type with ;charset=UTF-8

I have a Python web crawler which is downloading files with different extensions. To get the extension from the HTTP header content type, I am using the Python library mimetypes.
http_header = session.head(url, headers={'Accept-Encoding': 'identity'})
extension = mimetypes.guess_extension(http_header.headers['content-type'])
Everything is working fine, except when the HTTP header content type contains ;charset=UTF-8. For example, mimetypes.guess_extension returns None for the following:
content-type: text/plain;charset=UTF-8 # extension should be .txt OR
content-type: text/x-c;charset=UTF-8 # extension should be .java
Check with mimetypes:
>>> import mimetypes
>>> print(mimetypes.guess_extension('text/plain;charset=UTF-8'))
None
>>>
Question: How do I handle this and get the correct extension from content-types ending with ;charset=UTF-8?
I guess it is not a good solution to catch such exceptions with an if statement since I never know if the whitelist is complete or whether I am missing some content-type.

One simple way to deal with that is to split the MIME string and get only the first element.
The following code will return the expected result for both conditions.
http_header = session.head(url, headers={'Accept-Encoding': 'identity'})
extension = mimetypes.guess_extension(http_header.headers['content-type'].split(";")[0])
Remember it is a guess. You can't expect much from it for broad definitions such as plain text. It seems like mimetypes.guess_extension() just takes the first element of this list. This is also the reason guessing the mimetype of text/plain returns .h when .txt is the obvious choice.
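A slightly more defensive variant of the same idea, as a small sketch: it also strips whitespace around the media type, so values like "text/plain ; charset=UTF-8" work too. The helper name extension_for is made up for this example, and the extension you get back depends on your Python version and the system's MIME tables.

```python
import mimetypes


def extension_for(content_type):
    """Guess a file extension from a Content-Type header value."""
    # Keep only the media type; drop parameters such as ";charset=UTF-8".
    media_type = content_type.split(";", 1)[0].strip().lower()
    return mimetypes.guess_extension(media_type)


print(extension_for("text/plain;charset=UTF-8"))
print(extension_for("application/pdf"))
```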

GAE Python Blobstore doesn't save filename containing unicode literals in Firefox only

I am developing an app which prompts the user to upload a file which is then available for download.
Here is the download handler:
class ViewPrezentacje(blobstore_handlers.BlobstoreDownloadHandler, BaseHandler):
    def get(self,blob_key):
        blob_key = str(urllib.unquote(blob_key))
        blob_info=blobstore.BlobInfo.get(blob_key)
        self.send_blob(blob_info, save_as=urllib.quote(blob_info.filename.encode('utf-8')))
The file is downloaded with the correct file name (i.e. unicode literals are properly displayed) while using Chrome or IE, but in Firefox it is saved as a string of the form "%83%86%E3..."
Is there any way to make it work properly in Firefox?

Sending filenames with non-ASCII characters in attachments is fraught with difficulty, as the original specification was broken and browser behaviours have varied.
You shouldn't be %-encoding (urllib.quote) the filename; Firefox is right to offer it as literal % sequences as a result. IE's behaviour of %-decoding sequences in the filename is incorrect, even though Chrome eventually went on to copy it.
Ultimately the right way to send non-ASCII filenames is to use the mechanism specified in RFC6266, which ends up with a header that looks like this:
Content-Disposition: attachment; filename*=UTF-8''foo-%c3%a4-%e2%82%ac.html
However:
older browsers such as IE8 don't support it so if you care you should pass something as an ASCII-only filename= as well;
BlobstoreDownloadHandler doesn't know about this mechanism.
The bit of BlobstoreDownloadHandler that needs fixing is this inner function in send_blob:
def send_attachment(filename):
    if isinstance(filename, unicode):
        filename = filename.encode('utf-8')
    self.response.headers['Content-Disposition'] = (
        _CONTENT_DISPOSITION_FORMAT % filename)
which really wants to do:
rfc6266_filename = "UTF-8''" + urllib.quote(filename.encode('utf-8'))
fallback_filename = filename.encode('us-ascii', 'ignore')
# filename= gets the ASCII fallback, filename*= gets the RFC 6266 form
self.response.headers['Content-Disposition'] = 'attachment; filename="%s"; filename*=%s' % (fallback_filename, rfc6266_filename)
but unfortunately being an inner function makes it annoying to try to fix in a subclass. You could:
override the whole of send_blob to replace the send_attachment inner function
or maybe you can write self.response.headers['Content-Disposition'] like this after calling send_blob? I'm not sure how GAE handles this
or, probably most practical of all, give up on having Unicode filenames for now until GAE fixes it

Open URL encoded filenames in Unix

I'm a Python n00b. I have downloaded a URL-encoded file and I want to work with it on my Unix system (Ubuntu 14).
When I try to run some operations on my file, the system says that the file doesn't exist. How do I change my filename to a format that Unix recognizes?
Some of the files I have downloaded have spaces in their names, so they would have to be represented with a backslash followed by a space. Below is a snippet of my code:
link = "http://www.stephaniequinn.com/Music/Scheherezade%20Theme.mp3"
output = open(link.split('/')[-1],'wb')
output.write(site.read())
output.close()
shutil.copy(link.split('/')[-1], tmp_dir)

The "link" you have actually is a URL. URLs are special and are not allowed to contain certain characters, such as spaces. These special characters can still be represented, but in an encoded form. The translation from special characters to this encoded form happens via a certain rule set, often known as "URL encoding". If interested, have a read over here: http://en.wikipedia.org/wiki/Percent-encoding
The encoding operation can be inverted, which is called decoding. The tool set with which you downloaded the files you mentioned most probably did the decoding already, for you. In your link example, there is only one special character in the URL, "%20", and this encodes a space. Your download tool set probably decoded this, and saved the file to your file system with the actual space character in the file name. That is, most likely you have a file in the file system with the following basename:
Scheherezade Theme.mp3
So, when you want to open that file from within Python, and all you have is the link, you first need to get the decoded variant of it. Python can decode URL-encoded strings with built-in tools. This is what you need:
>>> import urllib.parse
>>> url = "http://www.stephaniequinn.com/Music/Scheherezade%20Theme.mp3"
>>> urllib.parse.unquote(url)
'http://www.stephaniequinn.com/Music/Scheherezade Theme.mp3'
>>>
This assumes that you are using Python 3, and that your link object is a unicode object (type str in Python 3).
Starting off with the decoded URL, you can derive the filename. Your link.split('/')[-1] method might work in many cases, but J.F. Sebastian's answer provides a more reliable method.

To extract a filename from a URL:
#!/usr/bin/env python2
import os
import posixpath
import urllib
import urlparse
def url2filename(url):
    """Return basename corresponding to url.
    >>> url2filename('http://example.com/path/to/file?opt=1')
    'file'
    """
    urlpath = urlparse.urlsplit(url).path # pylint: disable=E1103
    basename = posixpath.basename(urllib.unquote(urlpath))
    if os.path.basename(basename) != basename:
        raise ValueError # refuse 'dir%5Cbasename.ext' on Windows
    return basename
Example:
>>> url2filename("http://www.stephaniequinn.com/Music/Scheherezade%20Theme.mp3")
'Scheherezade Theme.mp3'
You do not need to escape the space in the filename if you use it inside a Python script.
See complete code example on how to download a file using Python (with a progress report).
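The function above targets Python 2. For Python 3, a minimal sketch of the same logic might look like this; only the imports change, since urllib.unquote and urlparse.urlsplit both live in urllib.parse there. The ValueError message is added for this example.

```python
import os
import posixpath
from urllib.parse import unquote, urlsplit


def url2filename(url):
    """Return the decoded basename of the path component of url."""
    urlpath = urlsplit(url).path  # drops the query string and fragment
    basename = posixpath.basename(unquote(urlpath))
    # Refuse names that still contain a path separator for this OS,
    # e.g. 'dir%5Cbasename.ext' on Windows.
    if os.path.basename(basename) != basename:
        raise ValueError("unsafe filename in URL: %r" % url)
    return basename


print(url2filename("http://www.stephaniequinn.com/Music/Scheherezade%20Theme.mp3"))
```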

Checking file type with django form: 'application/octet-stream' issue

I'm using django validators and python-magic to check the mime type of uploaded documents and accept only pdf, zip and rar files.
Accepted mime-types are:
'application/pdf',
'application/zip', 'multipart/x-zip', 'application/x-zip-compressed', 'application/x-compressed',
'application/rar', 'application/x-rar', 'application/x-rar-compressed', 'compressed/rar',
The problem is that sometimes pdf files seem to have 'application/octet-stream' as mime-type.
'application/octet-stream' means a generic binary file, so I can't simply add that MIME type to the list of accepted types, because then other files, such as Excel files, would also be accepted, and I don't want that to happen.
What can I do in this case?
Thanks in advance.

The most foolproof way of telling is to look into the file contents and read the metadata in the file header.
In most files, this file header is usually stored at the beginning of the file, though in some, it may be located in other locations.
python-magic helps you do this, but the trick is to always reset the pointer to the beginning of the file before trying to guess its MIME type. Otherwise you will sometimes get the application/octet-stream MIME type, because the reader's pointer has advanced past the file header to a location that contains just an arbitrary stream of bytes.
For example, if you have a django validator function that tries to validate uploaded files for mime types:
import magic
from django.core.exceptions import ValidationError
def validate_file_type(upload):
    allowed_filetypes = [
        'application/pdf', 'image/jpeg', 'image/jpg', 'image/png',
        'application/msword']
    upload.seek(0)
    file_type = magic.from_buffer(upload.read(1024), mime=True)
    if file_type not in allowed_filetypes:
        raise ValidationError(
            'Unsupported file')

As a follow-up to Liyosi's answer, I also used python-magic. There seems to be a bug in libmagic where it still incorrectly identifies some files as application/octet-stream.
It is described better in the code:
def _handle509Bug(self, e):
    # libmagic 5.09 has a bug where it might fail to identify the
    # mimetype of a file and returns null from magic_file (and
    # likely _buffer), but also does not return an error message.
    if e.message is None and (self.flags & MAGIC_MIME):
        return "application/octet-stream"
    else:
        raise e
To get around this issue, I had to instantiate a Magic object and make use of the uncompress and mime options. To complete Liyosi's example:
import magic
from django.core.exceptions import ValidationError
def validate_file_type(upload):
    allowed_filetypes = [
        'application/pdf', 'image/jpeg', 'image/jpg', 'image/png',
        'application/msword']
    validator = magic.Magic(uncompress=True, mime=True)
    # mime=True is set on the Magic instance; its from_buffer takes only the buffer
    file_type = validator.from_buffer(upload.read())
    if file_type not in allowed_filetypes:
        raise ValidationError('Unsupported file')

You should not rely on the MIME type provided, but rather the MIME type discovered from the first few bytes of the file itself.
This will help eliminate the generic MIME type issue.
The problem with this approach is that it will usually rely on some third party tool (for example the file command commonly found on Linux systems is great; use it with -b --mime - and pass in the first few bytes of your file to have it give you the mime type).
The other option you have is to accept the file, and try to validate it by opening it with a library.
So if pypdf cannot open the file, and the built-in zip module cannot open the file, and rarfile cannot open the file, it's most likely something that you don't want to accept.
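If you would rather avoid a third-party tool for the first check, the "first few bytes" idea can be sketched with the standard library alone. This is a simplified illustration covering only the three formats from the question; the names SIGNATURES and sniff_type are made up here, and a signature match only shows how the file starts, not that it is valid.

```python
import io

# Well-known leading byte signatures ("magic numbers").
SIGNATURES = {
    "application/pdf": (b"%PDF-",),
    "application/zip": (b"PK\x03\x04", b"PK\x05\x06", b"PK\x07\x08"),
    "application/x-rar-compressed": (b"Rar!\x1a\x07\x00", b"Rar!\x1a\x07\x01\x00"),
}


def sniff_type(fileobj):
    """Return a MIME type based on the first bytes of fileobj, or None."""
    fileobj.seek(0)          # always start at the beginning of the file
    head = fileobj.read(8)
    fileobj.seek(0)          # leave the file ready for the next reader
    for mime, prefixes in SIGNATURES.items():
        if head.startswith(prefixes):
            return mime
    return None


print(sniff_type(io.BytesIO(b"%PDF-1.4\n...")))
print(sniff_type(io.BytesIO(b"plain text")))
```

Note that .xlsx and other Office Open XML files are ZIP containers, so they start with the ZIP signature too; rejecting them would need the second step described above, opening the archive and inspecting its contents.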

Processing a Django UploadedFile as UTF-8 with universal newlines

In my django application, I provide a form that allows users to upload a file. The file can be in a variety of formats (Excel, CSV), come from a variety of platforms (Mac, Linux, Windows), and be encoded in a variety of encodings (ASCII, UTF-8).
For the purpose of this question, let's assume that I have a view which is receiving request.FILES['file'], which is an instance of InMemoryUploadedFile, called file. My problem is that InMemoryUploadedFile objects (like file):
Do not support UTF-8 encoding (I see a \xef\xbb\xbf at the beginning of the file, which as I understand is a flag meaning 'this file is UTF-8').
Do not support universal newlines (which probably the majority of the files uploaded to this system will need).
Complicating the issue is that I wish to pass the file in to the python csv module, which does not natively support Unicode. I will happily accept answers that avoid this issue - once I get django playing nice with UTF-8 I'm sure I can bludgeon csv into doing the same. (Similarly, please ignore the requirement to support Excel - I am waiting until CSV works before I tackle parsing Excel files.)
I have tried using StringIO,mmap,codec, and any of a wide variety of ways of accessing the data in an InMemoryUploadedFile object. Each approach has yielded differing errors, none so far have been perfect. This shows some of the code that I feel came the closest:
import csv
import codecs
class CSVParser:
    def __init__(self,file):
        # 'file' is assumed to be an InMemoryUploadedFile object.
        dialect = csv.Sniffer().sniff(codecs.EncodedFile(file,"utf-8").read(1024))
        file.open() # seek to 0
        self.reader = csv.reader(codecs.EncodedFile(file,"utf-8"),
                                 dialect=dialect)
        try:
            self.field_names = self.reader.next()
        except StopIteration:
            # The file was empty - this is not allowed.
            raise ValueError('Unrecognized format (empty file)')
        if len(self.field_names) <= 1:
            # This probably isn't a CSV file at all.
            # Note that the csv module will (incorrectly) parse ALL files, even
            # binary data. This will catch most such files.
            raise ValueError('Unrecognized format (too few columns)')
    # Additional methods snipped, unrelated to issue
Please note that I haven't spent too much time on the actual parsing algorithm so it may be wildly inefficient, right now I'm more concerned with getting encoding to work as expected.
The problem is that the results are also not encoded, despite being wrapped in the Unicode codecs.EncodedFile file wrapper.
EDIT: It turns out, the above code does in fact work. codecs.EncodedFile(file,"utf-8") is the ticket. It turns out the reason I thought it didn't work was that the terminal I was using does not support UTF-8. Live and learn!

As mentioned above, the code snippet I provided was in fact working as intended - the problem was with my terminal, and not with python encoding.
If your view needs to access a UTF-8 UploadedFile, you can just use utf8_file = codecs.EncodedFile(request.FILES['file_field'],"utf-8") to open a file object in the correct encoding.
I also noticed that, at least for InMemoryUploadedFiles, opening the file through the codecs.EncodedFile wrapper does NOT reset the seek() position of the file descriptor. To return to the beginning of the file (again, this may be InMemoryUploadedFile specific) I just used request.FILES['file_field'].open() to send the seek() position back to 0.

I use the csv.DictReader and it appears to be working well. I attached my code snippet, but it is basically the same as another answer here.
import csv as csv_mod
import codecs
file = request.FILES['file']
dialect = csv_mod.Sniffer().sniff(codecs.EncodedFile(file,"utf-8").read(1024))
file.open()
csv = csv_mod.DictReader( codecs.EncodedFile(file,"utf-8"), dialect=dialect )
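The snippets in this thread are Python 2. On Python 3, where the csv module works on text, a rough equivalent is to wrap the binary upload in io.TextIOWrapper. The sketch below is an assumption-laden illustration rather than tested Django code: read_utf8_csv is a made-up helper, and io.BytesIO stands in for the uploaded file. The "utf-8-sig" codec removes the leading \xef\xbb\xbf marker, and newline="" leaves line-ending handling to the csv module.

```python
import csv
import io


def read_utf8_csv(binary_file):
    """Return all rows of a UTF-8 CSV file that was opened in binary mode."""
    text = io.TextIOWrapper(binary_file, encoding="utf-8-sig", newline="")
    try:
        return list(csv.reader(text))
    finally:
        # Detach so the underlying file is not closed along with the wrapper.
        text.detach()


# Stand-in for an uploaded file: BOM, Windows and Unix line endings.
upload = io.BytesIO(b"\xef\xbb\xbfname,city\r\nJos\xc3\xa9,Z\xc3\xbcrich\n")
for row in read_utf8_csv(upload):
    print(row)
```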

For CSV and Excel upload to django, this site may help.
