Open URL encoded filenames in Unix

I'm a Python n00b. I have downloaded a URL-encoded file and I want to work with it on my Unix system (Ubuntu 14).
When I try to run some operations on my file, the system says that the file doesn't exist. How do I change my filename to a Unix-recognizable format?
Some of the files I have downloaded have spaces in their names, so they would have to be presented with a backslash and then a space. Below is a snippet of my code:
link = "http://www.stephaniequinn.com/Music/Scheherezade%20Theme.mp3"
output = open(link.split('/')[-1],'wb')
output.write(site.read())
output.close()
shutil.copy(link.split('/')[-1], tmp_dir)

The "link" you have is actually a URL. URLs may not contain certain characters, such as spaces. These characters can still be represented, but in an encoded form. The translation from special characters to this encoded form follows a rule set commonly known as "URL encoding" or percent-encoding. If interested, have a read over here: http://en.wikipedia.org/wiki/Percent-encoding
The encoding operation can be inverted, which is called decoding. The tool set with which you downloaded the files most probably did the decoding for you already. In your link example, there is only one encoded sequence in the URL, "%20", and it encodes a space. Your download tool set probably decoded this and saved the file with the actual space character in the file name. That is, most likely you have a file in the file system with the following basename:
Scheherezade Theme.mp3
So, when you want to open that file from within Python, and all you have is the link, you first need to get the decoded variant of it. Python can decode URL-encoded strings with built-in tools. This is what you need:
>>> import urllib.parse
>>> url = "http://www.stephaniequinn.com/Music/Scheherezade%20Theme.mp3"
>>> urllib.parse.unquote(url)
'http://www.stephaniequinn.com/Music/Scheherezade Theme.mp3'
>>>
This assumes that you are using Python 3, and that your link object is a unicode object (type str in Python 3).
Starting off with the decoded URL, you can derive the filename. Your link.split('/')[-1] method might work in many cases, but J.F. Sebastian's answer provides a more reliable method.

To extract a filename from a URL:
#!/usr/bin/env python2
import os
import posixpath
import urllib
import urlparse
def url2filename(url):
    """Return basename corresponding to url.
    >>> url2filename('http://example.com/path/to/file?opt=1')
    'file'
    """
    urlpath = urlparse.urlsplit(url).path  # pylint: disable=E1103
    basename = posixpath.basename(urllib.unquote(urlpath))
    if os.path.basename(basename) != basename:
        raise ValueError  # refuse 'dir%5Cbasename.ext' on Windows
    return basename
Example:
>>> url2filename("http://www.stephaniequinn.com/Music/Scheherezade%20Theme.mp3")
'Scheherezade Theme.mp3'
You do not need to escape the space in the filename if you use it inside a Python script.
See complete code example on how to download a file using Python (with a progress report).
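For Python 3, where urlparse and urllib.unquote were merged into urllib.parse, the function above might be ported roughly as follows. This is a sketch and not the answerer's code; the temporary directory and the bytes written to it are placeholders for illustration. It also shows that a filename containing a space can be passed to open() without any escaping.

```python
import os
import posixpath
import tempfile
from urllib.parse import unquote, urlsplit


def url2filename(url):
    """Return the decoded basename of the path component of url."""
    urlpath = urlsplit(url).path
    basename = posixpath.basename(unquote(urlpath))
    # Refuse names that still contain a local path separator,
    # e.g. 'dir%5Cbasename.ext' on Windows.
    if os.path.basename(basename) != basename:
        raise ValueError(url)
    return basename


link = "http://www.stephaniequinn.com/Music/Scheherezade%20Theme.mp3"
filename = url2filename(link)

# The space needs no backslash inside Python; escaping is a shell concern.
with tempfile.TemporaryDirectory() as tmp_dir:
    target = os.path.join(tmp_dir, filename)
    with open(target, "wb") as output:
        output.write(b"placeholder bytes, not real audio")
    saved = os.listdir(tmp_dir)
```

The backslash-space form is only needed when the name is typed into a shell; inside a script the string is used as-is.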

Related

Unicode issues with tarfile.extractall() (Python 2.7)

I'm using Python 2.7.6 on Windows and I'm using the tarfile module to extract a gzip-compressed tar file. The mode option of tarfile.open() is set to "r:gz". After the open call, if I print the contents of the archive via tarfile.list(), I see the following directory in the list:
./Θ¥ÖµÇüσêåµ₧É Part 1.v1/
However, after I call tarfile.extractall(), I don't see the above directory in the extracted list of files, instead I see this:
é™æ€åˆ†æž Part 1.v1/
If I were to extract the archive via 7zip, I see a directory with the same name as the first item above. So, clearly, the extractall() method is screwing up, but I don't know how to fix this.

I learned that tar doesn't retain the encoding information as part of the archive and treats filenames as raw byte sequences. So, the output I saw from tarfile.extractall() was simply the raw byte sequence that made up the file's name when the archive was created. To get the extractall() method to recreate the original filenames, I discovered that you have to manually decode the member names of the TarFile object with the appropriate encoding before calling extractall(). In my case, the following did the trick:
modeltar = tarfile.open(zippath, mode="r:gz")
updatedMembers = []
for m in modeltar.getmembers():
    m.name = unicode(m.name, 'utf-8')
    updatedMembers.append(m)
modeltar.extractall(members=updatedMembers, path=dbpath)
The above code is based on this superuser answer: https://superuser.com/a/190786/354642
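In Python 3 the manual conversion is usually unnecessary, because tarfile.open() accepts an encoding argument that is used to decode member names. The sketch below builds a tiny archive in memory (the member name "Φ-reports/part1.txt" is made up for illustration) and reads the names back with the right and with a wrong codec, which reproduces the kind of garbling described in the question.

```python
import io
import tarfile


def make_archive(member_name):
    """Build a small .tar.gz in memory; the name is stored as UTF-8 bytes."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz",
                      format=tarfile.GNU_FORMAT, encoding="utf-8") as tar:
        payload = b"hello"
        info = tarfile.TarInfo(name=member_name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def member_names(archive_bytes, encoding):
    """Return member names, decoding the stored bytes with `encoding`."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz",
                      encoding=encoding) as tar:
        return tar.getnames()


archive = make_archive("Φ-reports/part1.txt")
right = member_names(archive, "utf-8")  # codec that matches the stored bytes
wrong = member_names(archive, "cp437")  # same bytes, wrong codec: mojibake
```

GNU_FORMAT is chosen here because it stores names as plain encoded bytes; archives written in the pax format carry UTF-8 names in extended headers and behave differently.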

Split string using delimiter "\" in python [duplicate]

I need to split a string using the delimiter "\".
The string can be in either of the following formats:
file://C:\Users\xyz\filename.txt
C:\Users\xyz\filename.txt
I need my script to give the output "filename.txt".
I tried to use split('\\\\'), but it does not work. Which function should I use instead?

Suppose your string is pathName, then you can use fileName = pathName.split('\\')[-1].

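Try the following steps. Note that a backslash inside a regular string literal must itself be escaped (or a raw string used); otherwise a sequence such as \x in \xyz is read as an invalid escape and raises an error.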
>>> file = 'file://C:\\Users\\xyz\\filename.txt'
>>> file.split('\\')[-1]
'filename.txt'
>>> file = 'C:\\Users\\xyz\\filename.txt'
>>> file.split('\\')[-1]
'filename.txt'

Two issues here.
Path splitting
You'd normally use os.path.split to work with paths:
>>> import os.path
>>> p=r'C:\Users\xyz\filename.txt'
>>> head, tail = os.path.split(p)
>>> head
'C:\\Users\\xyz'
>>> tail
'filename.txt'
Caveat: os.path works with the path format of the operating system it's used on. If you know you specifically want to work with Windows paths (even when your program is run on Linux or OS X), then instead of os.path you'd work with the ntpath module. See the note:
Note Since different operating systems have different path name conventions, there are several versions of this module in the standard library. The os.path module is always the path module suitable for the operating system Python is running on, and therefore usable for local paths. However, you can also import and use the individual modules if you want to manipulate a path that is always in one of the different formats. They all have the same interface:
posixpath for UNIX-style paths
ntpath for Windows paths
macpath for old-style MacOS paths
os2emxpath for OS/2 EMX paths
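As a small illustration of that note, ntpath can be imported directly so that backslash-separated paths are split the same way on any platform. The helper name windows_basename is made up for this sketch. The first sample string is not a well-formed file URI, so ntpath simply treats its file:// prefix as ordinary path text.

```python
import ntpath


def windows_basename(path):
    """Return the last component of a Windows-style path on any OS."""
    return ntpath.basename(path)


paths = [
    r"file://C:\Users\xyz\filename.txt",
    r"C:\Users\xyz\filename.txt",
]
names = [windows_basename(p) for p in paths]
head, tail = ntpath.split(r"C:\Users\xyz\filename.txt")
```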
Format support
You have 2 formats to support:
1. file://C:\Users\xyz\filename.txt
2. C:\Users\xyz\filename.txt
2 is a normal Windows path, and 1 is... Frankly, I have no idea what that is. It kind of looks like a file URI, but uses Windows-style delimiters (backslashes). This is strange. When I open a PDF in Chrome on Windows the URI looks different:
file:///C:/Users/kos/Downloads/something.pdf
and I'll assume that's the format you're interested in. If not, then I can't vouch for what you're dealing with and you can make some educated guess on how to interpret it (drop the file:// prefix and treat it as a Windows path?).
A URI can be split into meaningful parts using the urlparse module (urllib.parse in Python 3), and once you've extracted the path part of the URI, you can just .split('/') it (URI grammar is simple enough to allow that). Here's what happens if you use this module on a file:// URI:
>>> import urlparse
>>> r = urlparse.urlparse(r'file:///C:/Users/xyz/filename.txt')
>>> r
ParseResult(scheme='file', netloc='', path='/C:/Users/xyz/filename.txt', params='', query='', fragment='')
>>> r.path
'/C:/Users/xyz/filename.txt'
>>> r.path.lstrip('/').split('/')
['C:', 'Users', 'xyz', 'filename.txt']
Please read this URI scheme description to get a better idea of what this format looks like and why there are three slashes after file:.
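In Python 3 the same steps might look like the sketch below, which uses urllib.parse in place of urlparse. The function name file_uri_parts and the percent-encoded sample URI are assumptions made for illustration; each segment is decoded only after splitting so that an encoded slash cannot be mistaken for a separator.

```python
from urllib.parse import unquote, urlsplit


def file_uri_parts(uri):
    """Split the path of a file: URI into decoded segments."""
    parts = urlsplit(uri)
    if parts.scheme != "file":
        raise ValueError("not a file URI: %r" % uri)
    # Split first, then percent-decode each segment.
    return [unquote(segment) for segment in parts.path.lstrip("/").split("/")]


segments = file_uri_parts("file:///C:/Users/kos/Downloads/some%20thing.pdf")
filename = segments[-1]
```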

Putting gzipped data into a script as a string

I snagged a Lorem Ipsum generator last week, and I admit, it's pretty cool.
My question: can someone show me a tutorial on how the author of the above script was able to post the contents of a gzipped file into their code as a string? I keep getting examples of gzipping a regular file, and I'm feeling kind of lost here.
For what it's worth, I have another module that is quite similar (it generates random names, companies, etc), and right now it reads from a couple different text files. I like this approach better; it requires one less sub-directory in my project to place data into, and it also presents a new way of doing things for me.
I'm quite new to streams, IO types, and the like. Feel free to dump the links on my lap. Snippets are always appreciated too.

Assuming you are in a *nix environment, you just need gzip and a base64 encoder to generate the string. Let's assume your content is in file.txt; for the purpose of this example I created a file with that name containing random bytes.
So you need to compress it first:
$ gzip file.txt
That will generate a file.txt.gz file that you now need to embed into your code. To do that, you need to encode it. A common way to do so is to use Base64 encoding, which can be done with the base64 program:
$ base64 file.txt.gz
H4sICGmHsE8AA2ZpbGUudHh0AAGoAFf/jIMKME+MgnEhgS4vd6SN0zIuVRhsj5fac3Q1EV1EvFJK
fBsw+Ln3ZSX7d5zjBXJR1BUn+b2/S3jHXO9h6KEDx37U7iOvmSf6BMo1gOJEgIsf57yHwUKl7f9+
Beh4kwF+VljN4xjBfdCiXKk0Oc9g/5U/AKR02fRwI+zYlp1ELBVDzFHNsxpjhIT43sBPklXW8L5P
d8Ao3i2tQQPf2JAHRQZYYn3vt0tKg7drVKgAAAA=
Now you have everything you need to use the contents of that file in your Python script:
from cStringIO import StringIO
from base64 import b64decode
from gzip import GzipFile
# this is the variable with your file's contents
gzipped_data = """
H4sICGmHsE8AA2ZpbGUudHh0AAGoAFf/jIMKME+MgnEhgS4vd6SN0zIuVRhsj5fac3Q1EV1EvFJK
fBsw+Ln3ZSX7d5zjBXJR1BUn+b2/S3jHXO9h6KEDx37U7iOvmSf6BMo1gOJEgIsf57yHwUKl7f9+
Beh4kwF+VljN4xjBfdCiXKk0Oc9g/5U/AKR02fRwI+zYlp1ELBVDzFHNsxpjhIT43sBPklXW8L5P
d8Ao3i2tQQPf2JAHRQZYYn3vt0tKg7drVKgAAAA=
"""
# we now decode the file's content from the string and unzip it
orig_file_desc = GzipFile(mode='r',
                          fileobj=StringIO(b64decode(gzipped_data)))
# get the original's file content to a variable
orig_file_cont = orig_file_desc.read()
# and close the file descriptor
orig_file_desc.close()
Obviously, your program will depend on the base64, gzip and cStringIO python modules.

I'm not sure exactly what you're asking, but here's a stab...
The author of lipsum.py has included the compressed data inline in their code as chunks of Base64 encoded text. Base64 is an encoding mechanism for representing binary data using printable ASCII characters. It can be used for including binary data in your Python code. It is more commonly used to include binary data in email attachments...the next time someone sends you a picture or PDF document, take a look at the raw message and you'll see very much the same thing.
Python's base64 module provides routines for converting between base64 and binary representations of data...and once you have the binary representation of the data, it doesn't really matter how you got it, whether it was by reading it from a file or decoding a string embedded in your code.
Python's gzip module can be used to decompress data. It expects a file-like object...and Python provides the StringIO module to wrap strings in the right set of methods to make them act like files. You can see that in lipsum.py in the following code:
sample_text_file = gzip.GzipFile(mode='rb',
    fileobj=StringIO(base64.b64decode(DEFAULT_SAMPLE_COMPRESSED)))
This is creating a StringIO object containing the binary representation of the base64 encoded value stored in DEFAULT_SAMPLE_COMPRESSED.
All the modules mentioned here are described in the documentation for the Python standard library.
I wouldn't recommend including data in your code inline like this as a good idea in general, unless your data is small and relatively static. Otherwise, package it up into your Python package which makes it easier to edit and track changes.
Have I answered the right question?

How about this: Zips and encodes a string, prints it out encoded, then decodes and unzips it again.
from StringIO import StringIO
import base64
import gzip
contents = 'The quick brown fox jumps over the lazy dog'
zip_text_file = StringIO()
zipper = gzip.GzipFile(mode='wb', fileobj=zip_text_file)
zipper.write(contents)
zipper.close()
enc_text = base64.b64encode(zip_text_file.getvalue())
print enc_text
sample_text_file = gzip.GzipFile(mode='rb',
    fileobj=StringIO(base64.b64decode(enc_text)))
DEFAULT_SAMPLE = sample_text_file.read()
sample_text_file.close()
print DEFAULT_SAMPLE

Old question, but I had to do this recently for AWS logs. In Python 3, use BytesIO instead of StringIO:
import base64
import gzip
from io import BytesIO
DEFAULT_SAMPLE_COMPRESSED = "Some base 64 encoded and gzip compressed string"
sample_text_file = gzip.GzipFile(
    mode='rb',
    fileobj=BytesIO(base64.b64decode(DEFAULT_SAMPLE_COMPRESSED))
)
binary_text = sample_text_file.read()  # the decompressed content as bytes
text = binary_text.decode()  # decode the bytes (UTF-8 by default) into a str
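For completeness, here is a hedged Python 3 sketch of the whole round trip, including the step that produces the string to paste into a script. It uses gzip.compress and gzip.decompress rather than GzipFile objects; the 76-character line width is just a convention borrowed from the base64 command-line output shown earlier.

```python
import base64
import gzip

contents = "The quick brown fox jumps over the lazy dog"

# Producer side: compress, Base64-encode, and break into pasteable lines.
encoded = base64.b64encode(gzip.compress(contents.encode("utf-8"))).decode("ascii")
lines = [encoded[i:i + 76] for i in range(0, len(encoded), 76)]
embedded = "\n".join(lines)

# Consumer side: b64decode ignores the newlines by default (characters
# outside the Base64 alphabet are discarded), then gzip.decompress
# restores the original bytes.
restored = gzip.decompress(base64.b64decode(embedded)).decode("utf-8")
```

The value of embedded is what would be stored in a module-level triple-quoted string, as in the answers above.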

What config file format to use for user-friendly strings of arbitrary bytes?

So I made a short Python script to launch files in Windows with ambiguous extensions by examining their magic number/file signature first:
https://superuser.com/a/317927/13889
https://gist.github.com/1119561
I'd like to compile it to a .exe to make association easier (either using bbfreeze or rewriting in C), but I need some kind of user-friendly config file to specify the matching byte strings and program paths. Basically I want to put this information into a plain text file somehow:
magic_numbers = {
    # TINA
    'OBSS': r'%PROGRAMFILES(X86)%\DesignSoft\Tina 9 - TI\TINA.EXE',
    # PSpice
    '*version': r'%PROGRAMFILES(X86)%\Orcad\Capture\Capture.exe',
    'x100\x88\xce\xcf\xcfOrCAD ': '', #PSpice?
    # Protel
    'DProtel': r'%PROGRAMFILES(X86)%\Altium Designer S09 Viewer\dxp.exe',
    # Eagle
    '\x10\x80': r'%PROGRAMFILES(X86)%\EAGLE-5.11.0\bin\eagle.exe',
    '\x10\x00': r'%PROGRAMFILES(X86)%\EAGLE-5.11.0\bin\eagle.exe',
    '<?xml version="1.0" encoding="utf-8"?>\n<!DOCTYPE eagle ': r'%PROGRAMFILES(X86)%\EAGLE-5.11.0\bin\eagle.exe',
    # PADS Logic
    '\x00\xFE': r'C:\MentorGraphics\9.3PADS\SDD_HOME\Programs\powerlogic.exe',
}
(The hex bytes are just arbitrary bytes, not Unicode characters.)
I guess a .py file in this format works, but I have to leave it uncompiled and somehow still import it into the compiled file, and there's still a bunch of extraneous content like { and , to be confused by/screw up.
I looked at YAML, and it would be great except that it requires base64-encoding binary stuff first, which isn't really what I want. I'd prefer the config file to contain hex representations of the bytes. But also ASCII representations, if that's all the file signature is. And maybe also regexes. :D (In case the XML-based format can be written with different amounts of whitespace, for instance)
Any ideas?

You've already got your answer: YAML.
The data you posted up above is storing text representations of binary data; that will be fine for YAML, you just need to parse it properly. Usually you'd use something from the binascii module; in this case, likely the binascii.a2b_qp function.
magic_id_str = 'x100\x88\xce\xcf\xcfOrCAD '
magic_id = binascii.a2b_qp(magic_id_str)
To elucidate, I will use a unicode character as an easy way to paste binary data into the REPL (Python 2.7):
>>> a = 'Φ'
>>> a
'\xce\xa6'
>>> binascii.b2a_qp(a)
'=CE=A6'
>>> magic_text = yaml.load("""
... magic_string: '=CE=A6'
... """)
>>> magic_text
{'magic_string': '=CE=A6'}
>>> binascii.a2b_qp(magic_text['magic_string'])
'\xce\xa6'
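The same idea carries over to Python 3, where binascii.a2b_qp accepts an ASCII str and returns bytes. The sketch below leaves out the YAML parsing step and starts from an already-loaded mapping; the quoted-printable spellings of the signatures and the find_program helper are assumptions made for illustration, with program paths taken from the question.

```python
import binascii

# Hypothetical config entries as they might come out of a YAML file:
# signature written in quoted-printable -> program path.
config_entries = {
    "OBSS": r"%PROGRAMFILES(X86)%\DesignSoft\Tina 9 - TI\TINA.EXE",
    "=10=80": r"%PROGRAMFILES(X86)%\EAGLE-5.11.0\bin\eagle.exe",
    "=00=FE": r"C:\MentorGraphics\9.3PADS\SDD_HOME\Programs\powerlogic.exe",
}

# Decode every signature once into the raw bytes to compare against.
magic_numbers = {
    binascii.a2b_qp(signature): program
    for signature, program in config_entries.items()
}


def find_program(header):
    """Return the program whose signature prefixes `header`, or None."""
    for signature, program in magic_numbers.items():
        if header.startswith(signature):
            return program
    return None
```

Plain ASCII signatures such as OBSS pass through unchanged, so users only need the =XX notation for non-printable bytes.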

I would suggest doing this a little differently. I would decouple these two settings from each other:
Magic number signature ===> mimetype
mimetype ==> program launcher
For the first part, I would use python-magic, a library that has bindings to libmagic. You can have python-magic use a custom magic file like this:
import magic
m = magic.Magic(magic_file='/path/to/magic.file')
Your users can specify a custom magic file mapping magic numbers to mimetypes. The syntax of magic files is documented. Here's an example showing the magic file for the TIFF format:
# Tag Image File Format, from Daniel Quinlan (quinlan#yggdrasil.com)
# The second word of TIFF files is the TIFF version number, 42, which has
# never changed. The TIFF specification recommends testing for it.
0 string MM\x00\x2a TIFF image data, big-endian
!:mime image/tiff
0 string II\x2a\x00 TIFF image data, little-endian
!:mime image/tiff
The second part then is pretty easy, since you only need to specify text data now. You could go with an INI or yaml format, as suggested by others, or you could even have just a simple tab-delimited file like this:
image/tiff C:\Program Files\imageviewer.exe
application/json C:\Program Files\notepad.exe
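Reading such a tab-delimited file takes only a few lines. The sketch below assumes one mimetype, a tab, and a program path per line; skipping blank lines and lines starting with # is an extra convenience not mentioned above, and parse_launchers is a made-up name.

```python
def parse_launchers(text):
    """Map mimetype -> program path from tab-delimited text."""
    launchers = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # assumption: allow blank lines and comments
        # Split on the first tab only, so paths may contain spaces.
        mimetype, program = line.split("\t", 1)
        launchers[mimetype.strip()] = program.strip()
    return launchers


sample = (
    "image/tiff\tC:\\Program Files\\imageviewer.exe\n"
    "application/json\tC:\\Program Files\\notepad.exe\n"
)
launchers = parse_launchers(sample)
```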

I've used several packages to build configuration files, including YAML. I recommend that you use ConfigParser or ConfigObj.
If you want to build a human-readable configuration file with comments, I strongly recommend ConfigObj.
ConfigObj
Brief ConfigObj tutorial
ConfigParser
Brief ConfigParser tutorial
Enjoy!
Example of ConfigObj
You can use ConfigObj to store binary values too. With this code:
import configobj
def createConfig(path):
    config = configobj.ConfigObj()
    config.filename = path
    config["Sony"] = {}
    config["Sony"]["product"] = "Sony PS3"
    config["Sony"]["accessories"] = ['controller', 'eye', 'memory stick']
    config["Sony"]["retail price"] = "$400"
    config["Sony"]["binary one"]= bin(173)
    config.write()
You get this file:
[Sony]
product = Sony PS3
accessories = controller, eye, memory stick
retail price = $400
binary one = 0b10101101
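Since ConfigObj is a third-party package, here is a comparable sketch using the standard library's configparser (the Python 3 name of ConfigParser) applied to the file-signature problem from the question. The section names, the signature and program keys, and the hexadecimal notation are all assumptions made for illustration; interpolation is switched off so that the % characters in Windows paths are taken literally.

```python
import configparser

# Hypothetical INI content; signatures are written as hex byte pairs.
INI_TEXT = r"""
[Eagle]
# signature bytes written as hexadecimal
signature = 10 80
program = %PROGRAMFILES(X86)%\EAGLE-5.11.0\bin\eagle.exe

[PADS Logic]
signature = 00 FE
program = C:\MentorGraphics\9.3PADS\SDD_HOME\Programs\powerlogic.exe
"""

parser = configparser.ConfigParser(interpolation=None)
parser.read_string(INI_TEXT)

# bytes.fromhex skips the spaces between byte pairs.
magic_numbers = {
    bytes.fromhex(parser[section]["signature"]): parser[section]["program"]
    for section in parser.sections()
}
```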

Processing a Django UploadedFile as UTF-8 with universal newlines

In my django application, I provide a form that allows users to upload a file. The file can be in a variety of formats (Excel, CSV), come from a variety of platforms (Mac, Linux, Windows), and be encoded in a variety of encodings (ASCII, UTF-8).
For the purpose of this question, let's assume that I have a view which is receiving request.FILES['file'], which is an instance of InMemoryUploadedFile, called file. My problem is that InMemoryUploadedFile objects (like file):
Do not support UTF-8 encoding (I see a \xef\xbb\xbf at the beginning of the file, which as I understand is a flag meaning 'this file is UTF-8').
Do not support universal newlines (which probably the majority of the files uploaded to this system will need).
Complicating the issue is that I wish to pass the file in to the python csv module, which does not natively support Unicode. I will happily accept answers that avoid this issue - once I get django playing nice with UTF-8 I'm sure I can bludgeon csv into doing the same. (Similarly, please ignore the requirement to support Excel - I am waiting until CSV works before I tackle parsing Excel files.)
I have tried using StringIO, mmap, codecs, and a wide variety of other ways of accessing the data in an InMemoryUploadedFile object. Each approach has yielded different errors, and none so far has been perfect. This shows some of the code that I feel came the closest:
import csv
import codecs
class CSVParser:
    def __init__(self, file):
        # 'file' is assumed to be an InMemoryUploadedFile object.
        dialect = csv.Sniffer().sniff(codecs.EncodedFile(file, "utf-8").read(1024))
        file.open()  # seek to 0
        self.reader = csv.reader(codecs.EncodedFile(file, "utf-8"),
                                 dialect=dialect)
        try:
            self.field_names = self.reader.next()
        except StopIteration:
            # The file was empty - this is not allowed.
            raise ValueError('Unrecognized format (empty file)')
        if len(self.field_names) <= 1:
            # This probably isn't a CSV file at all.
            # Note that the csv module will (incorrectly) parse ALL files, even
            # binary data. This will catch most such files.
            raise ValueError('Unrecognized format (too few columns)')
    # Additional methods snipped, unrelated to issue
Please note that I haven't spent much time on the actual parsing algorithm, so it may be wildly inefficient; right now I'm more concerned with getting encoding to work as expected.
The problem is that the results are also not encoded, despite being wrapped in the Unicode codecs.EncodedFile file wrapper.
EDIT: It turns out, the above code does in fact work. codecs.EncodedFile(file,"utf-8") is the ticket. It turns out the reason I thought it didn't work was that the terminal I was using does not support UTF-8. Live and learn!

As mentioned above, the code snippet I provided was in fact working as intended - the problem was with my terminal, and not with python encoding.
If your view needs to access a UTF-8 UploadedFile, you can just use utf8_file = codecs.EncodedFile(request.FILES['file_field'],"utf-8") to open a file object in the correct encoding.
I also noticed that, at least for InMemoryUploadedFiles, opening the file through the codecs.EncodedFile wrapper does NOT reset the seek() position of the file descriptor. To return to the beginning of the file (again, this may be InMemoryUploadedFile specific) I just used request.FILES['file_field'].open() to send the seek() position back to 0.

I use the csv.DictReader and it appears to be working well. I attached my code snippet, but it is basically the same as another answer here.
import csv as csv_mod
import codecs
file = request.FILES['file']
dialect = csv_mod.Sniffer().sniff(codecs.EncodedFile(file,"utf-8").read(1024))
file.open()
csv = csv_mod.DictReader( codecs.EncodedFile(file,"utf-8"), dialect=dialect )
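The snippets above are Python 2. In Python 3, where the csv module works on text, one possible approach is to wrap the binary upload in io.TextIOWrapper. The sketch below uses io.BytesIO as a stand-in for the uploaded file; whether a particular Django upload object can be passed to TextIOWrapper directly is an assumption to verify. The utf-8-sig codec drops the leading \xef\xbb\xbf byte order mark if present, and newline="" leaves line-ending handling to the csv module.

```python
import csv
import io


def read_csv_rows(binary_file):
    """Decode a binary file-like object as UTF-8 and parse it as CSV."""
    text = io.TextIOWrapper(binary_file, encoding="utf-8-sig", newline="")
    try:
        return list(csv.reader(text))
    finally:
        # Detach so the wrapper does not close the caller's file object.
        text.detach()


# Stand-in for an upload: BOM, Windows line endings, non-ASCII data.
upload = io.BytesIO(b"\xef\xbb\xbfname,city\r\nZo\xc3\xab,K\xc3\xb6ln\r\n")
rows = read_csv_rows(upload)
```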

For CSV and Excel upload to django, this site may help.
