Gzip and subprocess' stdout in Python

I'm using python 2.6.4 and discovered that I can't use gzip with subprocess the way I might hope. This illustrates the problem:
May 17 18:05:36> python
Python 2.6.4 (r264:75706, Mar 10 2010, 14:41:19)
[GCC 4.1.2 20071124 (Red Hat 4.1.2-42)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import gzip
>>> import subprocess
>>> fh = gzip.open("tmp","wb")
>>> subprocess.Popen("echo HI", shell=True, stdout=fh).wait()
0
>>> fh.close()
>>>
[2]+ Stopped python
May 17 18:17:49> file tmp
tmp: data
May 17 18:17:53> less tmp
"tmp" may be a binary file. See it anyway?
May 17 18:17:58> zcat tmp
zcat: tmp: not in gzip format
Here's what it looks like inside less
HI
^_<8B>^H^Hh<C0><F1>K^B<FF>tmp^#^C^#^#^#^#^#^#^#^#^#
which looks like it wrote the stdout as plain text and then appended an empty gzip stream. Indeed, if I remove the "HI\n" line, then I get this:
May 17 18:22:34> file tmp
tmp: gzip compressed data, was "tmp", last modified: Mon May 17 18:17:12 2010, max compression
What is going on here?
UPDATE:
This earlier question is asking the same thing: Can I use an opened gzip file with Popen in Python?

You can't use arbitrary file-like objects with subprocess, only real files with an OS-level file descriptor. The fileno() method of GzipFile returns the FD of the underlying file, so that's where echo's output is redirected, bypassing compression entirely. When the GzipFile is later closed, it writes out an empty gzip stream after the raw text.
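You can reproduce the effect without subprocess at all; a minimal sketch, writing straight to the descriptor the way the redirection does:
import gzip
import os

fh = gzip.open("tmp.gz", "wb")
os.write(fh.fileno(), "raw bytes\n")  # hits the underlying file, never goes through GzipFile.write()
fh.close()                            # GzipFile now flushes its header/trailer: an empty gzip stream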

Just pipe that sucker:
from subprocess import Popen, PIPE

GZ = Popen("gzip > outfile.gz", stdin=PIPE, shell=True)
P = Popen("echo HI", stdout=GZ.stdin, shell=True)
# these next three must be in order
P.wait()
GZ.stdin.close()   # signals EOF to gzip
GZ.wait()
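To sanity-check the result afterwards (assuming the pipeline above has finished):
import gzip
print gzip.open("outfile.gz").read()   # -> HI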

I'm not totally sure why this isn't working (perhaps the output redirection writes to the file descriptor directly instead of calling Python's write(), which is what gzip hooks into?), but this works:
>>> fh.write(subprocess.Popen("echo Hi", shell=True, stdout=subprocess.PIPE).stdout.read())
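A slightly fuller variant of the same idea; communicate() collects the child's entire output and avoids a deadlock if that output ever outgrows the pipe buffer:
import gzip
import subprocess

fh = gzip.open("tmp.gz", "wb")
p = subprocess.Popen("echo HI", shell=True, stdout=subprocess.PIPE)
out, _ = p.communicate()   # read all of the child's stdout
fh.write(out)              # goes through GzipFile.write(), so it is compressed
fh.close()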

You don't need subprocess at all to write to the gzip.GzipFile. Write to it like any other file-like object and the result is automagically gzipped:
import gzip

# note: GzipFile only supports the with statement from Python 2.7 on;
# on 2.6, call fh.close() explicitly instead
with gzip.open("tmp.gz", "wb") as fh:
    fh.write('echo HI')

Related

Python hashlib MD5 digest of any UNC file always yields same hash

The code below shows that three files on a UNC share hosted on another machine have the same hash, while local files have different hashes. Why would this be? I feel there is some UNC consideration that I don't know about.
Python 2.7.5 (default, May 15 2013, 22:44:16) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import hashlib
>>> fn_a = '\\\\some.host.com\\Shares\\folder1\\file_a'
>>> fn_b = '\\\\some.host.com\\Shares\\folder1\\file_b'
>>> fn_c = '\\\\some.host.com\\Shares\\folder2\\file_c'
>>> fn_d = 'E:\\file_d'
>>> fn_e = 'E:\\file_e'
>>> fn_f = 'E:\\folder3\\file_f'
>>> f_a = open(fn_a, 'r')
>>> f_b = open(fn_b, 'r')
>>> f_c = open(fn_c, 'r')
>>> f_d = open(fn_d, 'r')
>>> f_e = open(fn_e, 'r')
>>> f_f = open(fn_f, 'r')
>>> hashlib.md5(f_a.read()).hexdigest()
'54637fdcade4b7fd7cabd45d51ab8311'
>>> hashlib.md5(f_b.read()).hexdigest()
'54637fdcade4b7fd7cabd45d51ab8311'
>>> hashlib.md5(f_c.read()).hexdigest()
'54637fdcade4b7fd7cabd45d51ab8311'
>>> hashlib.md5(f_d.read()).hexdigest()
'd2bf541b1a9d2fc1a985f65590476856'
>>> hashlib.md5(f_e.read()).hexdigest()
'e84be3c598a098f1af9f2a9d6f806ed5'
>>> hashlib.md5(f_f.read()).hexdigest()
'e11f04ed3534cc4784df3875defa0236'
EDIT: To further investigate the problem, I also tested using a file from another host. It appears that changing the host will change the result.
>>> fn_h = '\\\\host\\share\\file'
>>> f_h = open(fn_h, 'r')
>>> hashlib.md5(f_h.read()).hexdigest()
'f23ee2dbbb0040bf2586cfab29a03634'
...but then I tried a different file on the new host, and got a new result!
>>> fn_i = '\\\\host\\share\\different_file'
>>> f_i = open(fn_i, 'r')
>>> hashlib.md5(f_i.read()).hexdigest()
'a8ad771db7af8c96f635bcda8fdce961'
So, now I'm really confused. Could it have something to do with the fact that the original host is a \\host.com format and the new host is a \\host format?
I did some additional research based on the comments and answers everyone provided. I decided I needed to test the permutations of two features of the code:
Whether a raw string literal is used for the path name, i.e.:
A. The file path is a raw string with single backslashes in the path, vs.
B. The file path is a normal (non-raw) string with doubled backslashes in the path
(FYI to those who don't know, a raw string is one which is preceded by an "r", like this: r'This is a raw string')
Whether the open mode is r or rb.
(FYI again to those who don't know, the b in rb tells open to read the file as binary.)
The results demonstrated:
Raw string vs. doubled backslashes made no difference to whether the hashes of different files came out different.
My error was not opening the files in binary mode. When using rb mode in open, the three UNC files produced different hashes, as expected.
Yay! And thanks for the help.
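For reference, a minimal sketch of the fix (md5_of is a made-up helper name; the chunked read just avoids loading large files into memory):
import hashlib

def md5_of(path):
    h = hashlib.md5()
    # 'rb' is the crucial part: hash the raw bytes, not text-mode-translated data
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(8192), b''):
            h.update(chunk)
    return h.hexdigest()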
Use f_a.seek(0) (and likewise for the others) if you intend to read a file again; once a file has been completely read, calling read() again just returns an empty string.
I don't reproduce your problem. I'm using Python 3.4 on Windows 7 here with the following test script which accesses files on a network hard disk:
import sys, hashlib

def main():
    fn0 = r'\\NAS\Public\Software\Backup\Test\Vagrantfile'
    fn1 = r'\\NAS\Public\Software\Backup\Test\z.xml'
    with open(fn0, 'rb') as f:
        h0 = hashlib.md5(f.read())
    print(h0.hexdigest())
    with open(fn1, 'rb') as f:
        h1 = hashlib.md5(f.read())
    print(h1.hexdigest())

if __name__ == '__main__':
    sys.exit(main())
Running this results in two different hash values (as expected):
c:\src\python>python hashtest.py
8af202dffb88739c2dbe188c12291e3d
2ff3db61ff37ca5ceac6a59fd7c1018b
If reading the file contents returns different data for the remote files then passing that data into md5 has to result in different hash values. You might want to print out the first 80 bytes of each file as a check that you are getting what you expect.
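A quick sketch of that check, reusing fn_a, fn_b and fn_c from the question:
for fn in (fn_a, fn_b, fn_c):
    with open(fn, 'rb') as f:
        print fn, repr(f.read(80))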

Python not recognizing newly created files

I have a python script that will create a text file and then will run a command on this newly created file.
The problem is that the command line is not recognizing the newly created file and I'm getting an error that the file is empty.
My code is something like this (randomval is a function that creates random characters and returns them as a string):
text_file = open("test.txt", "w")
text_file.write(randomval(20,10))
# do something with the `test.txt` file
but I'm getting an error that the file test.txt is empty.
Is there anyway to solve this?
This happens because, until you flush or close the file, the data sits in Python's internal buffer and is never handed to the OS to write to disk. To make sure that the file is closed, use the with statement:
with open("test.txt", "w") as f:
    f.write(randomval(20, 10))
print('Whoa, at this point the file is automatically flushed and closed!')
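As an illustration (wc -c is just a stand-in for whatever command the question actually runs on the file):
import subprocess

with open("test.txt", "w") as f:
    f.write(randomval(20, 10))
# the file is flushed and closed here, so an external command sees the data
subprocess.call(["wc", "-c", "test.txt"])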
Don't forget to close the file:
br@ymir:~$ python
Python 2.6.5 (r265:79063, Oct 1 2012, 22:04:36)
[GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> file=open('foo.bar','w')
>>> file.write('42\n')
>>>
[2]+ Stopped python
br@ymir:~$ cat foo.bar
br@ymir:~$ fg
python
>>> file.close()
>>>
[2]+ Stopped python
br@ymir:~$ cat foo.bar
42
br@ymir:~$
You should, at least, flush the buffer before trying to do something with your new file (text_file.flush()). The best would be to close the file and reopen it when needed.
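A sketch of that flush-first approach, reusing the question's randomval() (the os.fsync call is optional belt-and-braces to push the OS cache to disk as well):
import os

text_file = open("test.txt", "w")
text_file.write(randomval(20, 10))
text_file.flush()              # push Python's buffer to the OS
os.fsync(text_file.fileno())   # optionally force the OS to write it to disk
# ... now run the command on test.txt ...
text_file.close()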
Do a f.close() and then open it again as text_file = open("test.txt", "r")
The stream to the file is still open! Just try: text_file.close()

Python: Save a list and recover it for an AI learning application

I have an application that requires saving a list as it grows. This is a step towards doing AI learning stuff. Anyway here is my code:
vocab = ["cheese", "spam", "eggs"]
word = raw_input()
vocab.append(word)
However, when the code is run again, vocab goes back to just being cheese, spam and eggs. How can I make whatever I append to the list stay there permanently, even after the Windows CMD window is closed and I return to editing the code? Is this clear enough?
Thanks
You're looking into the more general problem of object persistence. There are bazillions of ways to do this. If you're just starting out, the easiest way to save/restore data structures in Python is using the pickle module. As you get more advanced, there are different methods and they all have their trade-offs...which you'll learn when you need to.
You could use json and files, something like this:
import json

vocab = ["cheese", "spam", "eggs"]

# write the data to a file (with closes the file, which also flushes the write)
with open("dumpFile", 'w') as outfile:
    json.dump(vocab, outfile)

# read the data back in
with open("dumpFile") as infile:
    newVocab = json.load(infile)
This has the advantage of being a plain text file, so you can view the data stored in it easily.
Look into pickle
With it, you can serialize a Python data structure and then reload it like so:
>>> import pickle
>>> vocab =["cheese", "spam", "eggs"]
>>> outf=open('vocab.pkl','wb')
>>> pickle.dump(vocab,outf)
>>> outf.close()
>>> quit()
The Python interpreter has now exited; restart Python and reload the data structure:
abd-pb:~ andrew$ python
Python 2.7.1 (r271:86882M, Nov 30 2010, 10:35:34)
[GCC 4.2.1 (Apple Inc. build 5664)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pickle
>>> pklf=open('vocab.pkl','rb')
>>> l=pickle.load(pklf)
>>> l
['cheese', 'spam', 'eggs']
You could use pickle, but as a technique it's limited to Python programs. The normal way to deal with persistent data is to write it to a file. Just read your word list from a regular text file (one word per line), and write out an updated word list later. Later you can learn how to keep this kind of information in a database (more powerful, but less flexible than a file).
You can happily program for years without actually needing pickle, but you can't do without file i/o and databases.
PS. Keep it simple: You don't need to mess with json, pickle or any other structured format unless and until you NEED structure.
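A minimal sketch of that plain-file approach (vocab.txt is an assumed filename):
# load the word list, one word per line; fall back to a default if it doesn't exist yet
try:
    with open("vocab.txt") as f:
        vocab = [line.strip() for line in f]
except IOError:
    vocab = ["cheese", "spam", "eggs"]

vocab.append(raw_input())

# write the updated list back out
with open("vocab.txt", "w") as f:
    for word in vocab:
        f.write(word + "\n")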

Python file.write() taking two tries?

Not sure how to explain this; any help will be appreciated!
Python 2.6.6 (r266:84292, Sep 15 2010, 16:22:56)
[GCC 4.4.5] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib2, pynotify, tempfile, os
>>> opener = urllib2.build_opener()
>>> page = opener.open('http://img.youtube.com/vi/RLGJU_xUVTs/1.jpg')
>>> thumb = page.read()
>>> temp = tempfile.NamedTemporaryFile(suffix='.jpg')
>>> temp.write(thumb)
>>> os.path.getsize(temp.name)
0
>>> temp.write(thumb)
>>> os.path.getsize(temp.name)
4096
thanks!
If you open the thumb file, you'll see that there's one whole copy and a partial copy of the data that you are writing in it.
Flush the file instead of writing a second time:
temp.flush()
The file wasn't written out the first time because the contents aren't large enough to fill the buffer. The second write overfills the buffer, so one buffer's worth of data gets written.
As Cameron points out in his answer, the buffer is automatically flushed when you close the file. If you want to keep it open for some reason (and the fact that this is an issue for you seems to indicate that you do), then you can call flush and the data will be written right away.
You haven't called flush() or close() on the file object before checking its size on disk -- there is an internal buffer that is automatically flushed only after a certain amount of data is written (this avoids too many expensive trips to the disk when doing many writes).
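A minimal sketch of the fix, following the snippet from the question (urlopen stands in for the question's build_opener):
import os, tempfile, urllib2

thumb = urllib2.urlopen('http://img.youtube.com/vi/RLGJU_xUVTs/1.jpg').read()

temp = tempfile.NamedTemporaryFile(suffix='.jpg')
temp.write(thumb)
temp.flush()                       # push Python's internal buffer to the OS
print os.path.getsize(temp.name)   # now reports the full size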

Is Python's seek() on OS X broken?

I'm trying to implement a simple method to read new lines from a log file each time the method is called.
I've looked at the various suggestions, both on stackoverflow (e.g. here) and elsewhere, for simulating "tail" functionality; most involve using readline() to read in new lines as they're appended to the file. It should be simple enough, but I can't get it to work properly on OS X 10.6.4 with the included Python 2.6.1.
To get to the heart of the problem, I tried the following:
Open two terminal windows.
In one, create a text file "test.log" with three lines:
one
two
three
In the other, start python and execute the following code:
Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.stat('test.log')
posix.stat_result(st_mode=33188, st_ino=23465217, st_dev=234881025L, st_nlink=1, st_uid=666, st_gid=20, st_size=14, st_atime=1281782739, st_mtime=1281782738, st_ctime=1281782738)
>>> log = open('test.log')
>>> log.tell()
0
>>> log.seek(0,2)
>>> log.tell()
14
>>>
So we see with the tell() that seek(0,2) brought us to the end of the file as reported by os.stat(), byte 14.
In the first shell, add another two lines to "test.log" so that it looks like this:
one
two
three
four
five
Go back to the second shell, and execute the following code:
>>> os.stat('test.log')
posix.stat_result(st_mode=33188, st_ino=23465260, st_dev=234881025L, st_nlink=1, st_uid=666, st_gid=20, st_size=24, st_atime=1281783089, st_mtime=1281783088, st_ctime=1281783088)
>>> log.seek(0,2)
>>> log.tell()
14
>>>
Here we see from os.stat() that the file's size is now 24 bytes, but seeking to the end of the file somehow still points to byte 14?? I've tried the same on Ubuntu with Python 2.5 and it works as I expect. I tried with 2.5 on my Mac, but got the same results as with 2.6.
I must be missing something fundamental here. Any ideas?
How are you adding two more lines to the file?
Most text editors will go through operations a lot like this:
fd = open(filename, read)
file_data = read(fd)
close(fd)
/* you edit your file, and save it */
unlink(filename)
fd = open(filename, write, create)
write(fd, file_data)
The file is different. (Check it with ls -li; the inode number will change for almost every text editor.)
If you append to the log file using your shell's >> redirection, it'll work exactly as it should:
$ echo one >> test.log
$ echo two >> test.log
$ echo three >> test.log
$ ls -li test.log
671147 -rw-r--r-- 1 sarnold sarnold 14 2010-08-14 04:15 test.log
$ echo four >> test.log
$ ls -li test.log
671147 -rw-r--r-- 1 sarnold sarnold 19 2010-08-14 04:15 test.log
>>> log=open('test.log')
>>> log.tell()
0
>>> log.seek(0,2)
>>> log.tell()
19
$ echo five >> test.log
$ echo six >> test.log
>>> log.seek(0,2)
>>> log.tell()
28
Note that the tail(1) command has an -F command line option to handle the case where the file is changed, but a file by the same name exists. (Great for watching log files that might be periodically rotated.)
Short answer: no, seek() isn't broken; your assumptions are.
Your text editor is creating a new file with the same name, not modifying the old file in place. You can see in your stat result that the st_ino is different. If you were to do os.fstat(log.fileno()), you'd get the old size and old st_ino.
If you want to check for this in your implementation of tail, periodically compare the st_ino of the stat and fstat results. If they differ, there's a new file with the same name.
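A sketch of that check (follow_reopen is a made-up helper name):
import os

def follow_reopen(path, handle):
    # detect rotation: the name on disk now points at a different inode
    # than the file we hold open
    if os.stat(path).st_ino != os.fstat(handle.fileno()).st_ino:
        handle.close()
        handle = open(path)   # reopen the new file of the same name
    return handle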
