Putting gzipped data into a script as a string

Putting gzipped data into a script as a string - python

I snagged a Lorem Ipsupm generator last week, and I admit, it's pretty cool.
My question: can someone show me a tutorial on how the author of the above script was able to post the contents of a gzipped file into their code as a string? I keep getting examples of gzipping a regular file, and I'm feeling kind of lost here.
For what it's worth, I have another module that is quite similar (it generates random names, companies, etc), and right now it reads from a couple different text files. I like this approach better; it requires one less sub-directory in my project to place data into, and it also presents a new way of doing things for me.
I'm quite new to streams, IO types, and the like. Feel free to dump the links on my lap. Snipptes are always appreciated too.

Assuming you are in a *nix environment, you just need gzip and a base64 encoder to generate the string. Lets assume your content is in file.txt, for the purpose of this example I created the file with random bytes with that specific name.
So you need to compress it first:
$ gzip file.txt
That will generate a file.txt.gz file that you now need to embed into your code. To do that, you need to encode it. A common way to do so is to use Base64 encoding, which can be done with the base64 program:
$ base64 file.txt.gz
H4sICGmHsE8AA2ZpbGUudHh0AAGoAFf/jIMKME+MgnEhgS4vd6SN0zIuVRhsj5fac3Q1EV1EvFJK
fBsw+Ln3ZSX7d5zjBXJR1BUn+b2/S3jHXO9h6KEDx37U7iOvmSf6BMo1gOJEgIsf57yHwUKl7f9+
Beh4kwF+VljN4xjBfdCiXKk0Oc9g/5U/AKR02fRwI+zYlp1ELBVDzFHNsxpjhIT43sBPklXW8L5P
d8Ao3i2tQQPf2JAHRQZYYn3vt0tKg7drVKgAAAA=
Now you have all what you need to use the contents of that file in your python script:
from cStringIO import StringIO
from base64 import b64decode
from gzip import GzipFile
# this is the variable with your file's contents
gzipped_data = """
H4sICGmHsE8AA2ZpbGUudHh0AAGoAFf/jIMKME+MgnEhgS4vd6SN0zIuVRhsj5fac3Q1EV1EvFJK
fBsw+Ln3ZSX7d5zjBXJR1BUn+b2/S3jHXO9h6KEDx37U7iOvmSf6BMo1gOJEgIsf57yHwUKl7f9+
Beh4kwF+VljN4xjBfdCiXKk0Oc9g/5U/AKR02fRwI+zYlp1ELBVDzFHNsxpjhIT43sBPklXW8L5P
d8Ao3i2tQQPf2JAHRQZYYn3vt0tKg7drVKgAAAA=
"""
# we now decode the file's content from the string and unzip it
orig_file_desc = GzipFile(mode='r',
fileobj=StringIO(b64decode(gzipped_data)))
# get the original's file content to a variable
orig_file_cont = orig_file_desc.read()
# and close the file descriptor
orig_file_desc.close()
Obviously, your program will depend on the base64, gzip and cStringIO python modules.

I'm not sure exactly what you're asking, but here's a stab...
The author of lipsum.py has included the compressed data inline in their code as chunks of Base64 encoded text. Base64 is an encoding mechanism for representing binary data using printable ASCII characters. It can be used for including binary data in your Python code. It is more commonly used to include binary data in email attachments...the next time someone sends you a picture or PDF document, take a look at the raw message and you'll see very much the same thing.
Python's base64 module provides routines for converting between base64 and binary representations of data...and once you have the binary representation of the data, it doesn't really matter how you got, whether it was by reading it from a file or decoding a string embedded in your code.
Python's gzip module can be used to decompress data. It expects a file-like object...and Python provides the StringIO module to wrap strings in the right set of methods to make them act like files. You can see that in lipsum.py in the following code:
sample_text_file = gzip.GzipFile(mode='rb',
fileobj=StringIO(base64.b64decode(DEFAULT_SAMPLE_COMPRESSED)))
This is creating a StringIO object containing the binary representation of the base64 encoded value stored in DEFAULT_SAMPLE_COMPRESSED.
All the modules mentioned here are described in the documentation for the Python standard library.
I wouldn't recommend including data in your code inline like this as a good idea in general, unless your data is small and relatively static. Otherwise, package it up into your Python package which makes it easier to edit and track changes.
Have I answered the right question?

How about this: Zips and encodes a string, prints it out encoded, then decodes and unzips it again.
from StringIO import StringIO
import base64
import gzip
contents = 'The quick brown fox jumps over the lazy dog'
zip_text_file = StringIO()
zipper = gzip.GzipFile(mode='wb', fileobj=zip_text_file)
zipper.write(contents)
zipper.close()
enc_text = base64.b64encode(zip_text_file.getvalue())
print enc_text
sample_text_file = gzip.GzipFile(mode='rb',
fileobj=StringIO(base64.b64decode(enc_text)))
DEFAULT_SAMPLE = sample_text_file.read()
sample_text_file.close()
print DEFAULT_SAMPLE

Old question but I had to do this recent for AWS logs. In Python3 use BytesIO instead of StringIO:
import base64
from io import BytesIO
DEFAULT_SAMPLE_COMPRESSED = "Some base 64 encoded and gzip compressed string"
sample_text_file = gzip.GzipFile(
mode='rb',
fileobj=BytesIO(base64.b64decode(DEFAULT_SAMPLE_COMPRESSED))
)
binary_text = sample_text_file.read() # This will be the final string as bianry
text = binary_text .decode() # This will make the binary text a string.

Related

Open URL encoded filenames in Unix

I'm a python n00b. I have downloaded URL encoded file and I want to work with it on my unix system(Ubuntu 14).
When I try and run some operations on my file, the system says that the file doesn't exist. How do I change my filename to a unix recognizable format?
Some of the files I have download have spaces in them so they would have to be presented with a backslash and then a space. Below is a snippet of my code
link = "http://www.stephaniequinn.com/Music/Scheherezade%20Theme.mp3"
output = open(link.split('/')[-1],'wb')
output.write(site.read())
output.close()
shutil.copy(link.split('/')[-1], tmp_dir)

The "link" you have actually is a URL. URLs are special and are not allowed to contain certain characters, such as spaces. These special characters can still be represented, but in an encoded form. The translation from special characters to this encoded form happens via a certain rule set, often known as "URL encoding". If interested, have a read over here: http://en.wikipedia.org/wiki/Percent-encoding
The encoding operation can be inverted, which is called decoding. The tool set with which you downloaded the files you mentioned most probably did the decoding already, for you. In your link example, there is only one special character in the URL, "%20", and this encodes a space. Your download tool set probably decoded this, and saved the file to your file system with the actual space character in the file name. That is, most likely you have a file in the file system with the following basename:
Scheherezade Theme.mp3
So, when you want to open that file from within Python, and all you have is the link, you first need to get the decoded variant of it. Python can decode URL-encoded strings with built-in tools. This is what you need:
>>> import urllib.parse
>>> url = "http://www.stephaniequinn.com/Music/Scheherezade%20Theme.mp3"
>>> urllib.parse.unquote(url)
'http://www.stephaniequinn.com/Music/Scheherezade Theme.mp3'
>>>
This assumes that you are using Python 3, and that your link object is a unicode object (type str in Python 3).
Starting off with the decoded URL, you can derive the filename. Your link.split('/')[-1] method might work in many cases, but J.F. Sebastian's answer provides a more reliable method.

To extract a filename from an url:
#!/usr/bin/env python2
import os
import posixpath
import urllib
import urlparse
def url2filename(url):
"""Return basename corresponding to url.
>>> url2filename('http://example.com/path/to/file?opt=1')
'file'
"""
urlpath = urlparse.urlsplit(url).path # pylint: disable=E1103
basename = posixpath.basename(urllib.unquote(urlpath))
if os.path.basename(basename) != basename:
raise ValueError # refuse 'dir%5Cbasename.ext' on Windows
return basename
Example:
>>> url2filename("http://www.stephaniequinn.com/Music/Scheherezade%20Theme.mp3")
'Scheherezade Theme.mp3'
You do not need to escape the space in the filename if you use it inside a Python script.
See complete code example on how to download a file using Python (with a progress report).

Writing PDFs to STDOUT with Python

I want to merge two PDF documents with Python (prepend a pre-made cover sheet to an existing document) and present the result to a browser. I'm currently using the PyPDF2 library which can perform the merge easily enough, but the PdfFileWriter class write() method only seems to support writing to a file object (must support write() and tell() methods). In this case, there is no reason to touch the filesystem; the merged PDF is already in memory and I just want to send a Content-type header and then the document to STDOUT (the browser via CGI). Is there a Python library better suited to writing a document to STDOUT than PyPDF2? Alternately, is there a way to pass STDIO as an argument to PdfFileWriter's write() method in such a way that it appears to write() as though it were a file handle?
Letting write() write the document to the filesystem and then opening the resulting file and sending it to the browser works, but is not an option in this case (aside from being terribly inelegant).
solution
Using mgilson's advice, this is how I got it to work in Python 2.7:
#!/usr/bin/python
import cStringIO
import sys
from PyPDF2 import PdfFileMerger
merger = PdfFileMerger()
###
# Actual PDF open/merge code goes here
###
output = cStringIO.StringIO()
merger.write(output)
print("Content-type: application/pdf\n")
sys.stdout.write(output.getvalue())
output.close()

Python supports an "in-memory" filetype via cStringIO.StringIO (or io.BytesIO, ... depending on python version). In your case, you could create an instance of one of those classes, pass that to the method which expects a file and then you can use the .getvalue() method to return the contents as a string (or bytes depending on python version). Once you have the contents as a string, you can simply print them or use sys.stdout.write to write the string to standard output.

pygame image to base64

I capture screen of my pygame program like this
data = pygame.image.tostring(pygame.display.get_surface(),"RGB")
How can I convert it into base64 string? (WITHOUT having to save it to HDD). Its important that there is no saving to HDD. I know I can save it to a file and then just encode the file to base64 but I cant seem to encode "on the fly"
thanks

If you want, you can save it to a StringIO, which is basically a virtual file stored as a string.
However, I'd really recommend using the base64 module, which has a method called base64.b64encode. It handles your 'on the fly' requirement well.
Code example:
import base64
data = pygame.image.tostring(pygame.display.get_surface(),"RGB")
base64data = base64.b64encode(data)
Happy coding!

Actually - pygame.image.tostring() is a pretty strange function (really dont understand the binary string it returns, I cant find anythin that can process it right).
There seems to be an enhancement issue on this at pygame bitbucket:
(https://bitbucket.org/pygame/pygame/issue/48/add-optional-format-argument-to)
I got around it like this:
data = cStringIO.StringIO()
pygame.image.save(pygame.display.get_surface(), data)
data = base64.b64encode(data.getvalue())
So in the end you get the valid and RIGHT base64 string. And it seems to work. Not sure about the format yet tho, will add more info tmrw

What config file format to use for user-friendly strings of arbitrary bytes?

So I made a short Python script to launch files in Windows with ambiguous extensions by examining their magic number/file signature first:
https://superuser.com/a/317927/13889
https://gist.github.com/1119561
I'd like to compile it to a .exe to make association easier (either using bbfreeze or rewriting in C), but I need some kind of user-friendly config file to specify the matching byte strings and program paths. Basically I want to put this information into a plain text file somehow:
magic_numbers = {
# TINA
'OBSS': r'%PROGRAMFILES(X86)%\DesignSoft\Tina 9 - TI\TINA.EXE',
# PSpice
'*version': r'%PROGRAMFILES(X86)%\Orcad\Capture\Capture.exe',
'x100\x88\xce\xcf\xcfOrCAD ': '', #PSpice?
# Protel
'DProtel': r'%PROGRAMFILES(X86)%\Altium Designer S09 Viewer\dxp.exe',
# Eagle
'\x10\x80': r'%PROGRAMFILES(X86)%\EAGLE-5.11.0\bin\eagle.exe',
'\x10\x00': r'%PROGRAMFILES(X86)%\EAGLE-5.11.0\bin\eagle.exe',
'<?xml version="1.0" encoding="utf-8"?>\n<!DOCTYPE eagle ': r'%PROGRAMFILES(X86)%\EAGLE-5.11.0\bin\eagle.exe',
# PADS Logic
'\x00\xFE': r'C:\MentorGraphics\9.3PADS\SDD_HOME\Programs\powerlogic.exe',
}
(The hex bytes are just arbitrary bytes, not Unicode characters.)
I guess a .py file in this format works, but I have to leave it uncompiled and somehow still import it into the compiled file, and there's still a bunch of extraneous content like { and , to be confused by/screw up.
I looked at YAML, and it would be great except that it requires base64-encoding binary stuff first, which isn't really what I want. I'd prefer the config file to contain hex representations of the bytes. But also ASCII representations, if that's all the file signature is. And maybe also regexes. :D (In case the XML-based format can be written with different amounts of whitespace, for instance)
Any ideas?

You've already got your answer: YAML.
The data you posted up above is storing text representations of binary data; that will be fine for YAML, you just need to parse it properly. Usually you'd use something from the binascii module; in this case, likely the binascii.a2b_qp function.
magic_id_str = 'x100\x88\xce\xcf\xcfOrCAD '
magic_id = binascii.a2b_qp(magic_id_str)
To elucidate, I will use a unicode character as an easy way to paste binary data into the REPL (Python 2.7):
>>> a = 'Φ'
>>> a
'\xce\xa6'
>>> binascii.b2a_qp(a)
'=CE=A6'
>>> magic_text = yaml.load("""
... magic_string: '=CE=A6'
... """)
>>> magic_text
{'magic_string': '=CE=A6'}
>>> binascii.a2b_qp(magic_text['magic_string'])
'\xce\xa6'

I would suggest doing this a little differently. I would decouple these two settings from each other:
Magic number signature ===> mimetype
mimetype ==> program launcher
For the first part, I would use python-magic, a library that has bindings to libmagic. You can have python-magic use a custom magic file like this:
import magic
m = magic.Magic(magic_file='/path/to/magic.file')
Your users can specify a custom magic file mapping magic numbers to mimetypes. The syntax of magic files is documented. Here's an example showing the magic file for the TIFF format:
# Tag Image File Format, from Daniel Quinlan (quinlan#yggdrasil.com)
# The second word of TIFF files is the TIFF version number, 42, which has
# never changed. The TIFF specification recommends testing for it.
0 string MM\x00\x2a TIFF image data, big-endian
!:mime image/tiff
0 string II\x2a\x00 TIFF image data, little-endian
!:mime image/tiff
The second part then is pretty easy, since you only need to specify text data now. You could go with an INI or yaml format, as suggested by others, or you could even have just a simple tab-delimited file like this:
image/tiff C:\Program Files\imageviewer.exe
application/json C:\Program Files\notepad.exe

I've used some packages to build configuration files, also yaml. I recommend that you use ConfigParser or ConfigObj.
At last, the best option If you wanna build a human-readable configuration file with comments I strongly recommend use ConfigObj.
ConfigObj
Brief ConfigObj tutorial
ConfigParser
Brief ConfigParser tutorial
Enjoy!
Example of ConfigObj
With this code:
You can use ConfigObj to store them too. Try this one:
import configobj
def createConfig(path):
config = configobj.ConfigObj()
config.filename = path
config["Sony"] = {}
config["Sony"]["product"] = "Sony PS3"
config["Sony"]["accessories"] = ['controller', 'eye', 'memory stick']
config["Sony"]["retail price"] = "$400"
config["Sony"]["binary one"]= bin(173)
config.write()
You get this file:
[Sony]
product = Sony PS3
accessories = controller, eye, memory stick
retail price = $400
binary one = 0b10101101

Python Input/Output, files

I need to write some methods for loading/saving some classes to and from a binary file. However I also want to be able to accept the binary data from other places, such as a binary string.
In c++ I could do this by simply making my class methods use std::istream and std::ostream which could be a file, a stringstream, the console, whatever.
Does python have a similar input/output class which can be made to represent almost any form of i/o, or at least files and memory?

The Python way to do this is to accept an object that implements read() or write(). If you have a string, you can make this happen with StringIO:
from cStringIO import StringIO
s = "My very long string I want to read like a file"
file_like_string = StringIO(s)
data = file_like_string.read(10)
Remember that Python uses duck-typing: you don't have to involve a common base class. So long as your object implements read(), it can be read like a file.

The Pickle and cPickle modules may also be helpful to you.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Putting gzipped data into a script as a string - python

Related

Open URL encoded filenames in Unix

Writing PDFs to STDOUT with Python

pygame image to base64

What config file format to use for user-friendly strings of arbitrary bytes?

Python Input/Output, files

Categories

Resources