After some frustration with unzip(1L), I've been trying to create a script that will unzip and print out raw data from all of the files inside a zip archive that is coming from stdin. I currently have the following, which works:
import sys, zipfile, StringIO
stdin = StringIO.StringIO(sys.stdin.read())
zipselect = zipfile.ZipFile(stdin)
filelist = zipselect.namelist()
for filename in filelist:
print filename, ':'
print zipselect.read(filename)
When I try to add validation to check if it truly is a zip file, however, it doesn't like it.
...
zipcheck = zipfile.is_zipfile(zipselect)
if zipcheck is not None:
print 'Input is not a zip file.'
sys.exit(1)
...
results in
File "/home/chris/simple/zipcat/zipcat.py", line 13, in <module>
zipcheck = zipfile.is_zipfile(zipselect)
File "/usr/lib/python2.7/zipfile.py", line 149, in is_zipfile
result = _check_zipfile(fp=filename)
File "/usr/lib/python2.7/zipfile.py", line 135, in _check_zipfile
if _EndRecData(fp):
File "/usr/lib/python2.7/zipfile.py", line 203, in _EndRecData
fpin.seek(0, 2)
AttributeError: ZipFile instance has no attribute 'seek'
I assume it can't seek because it is not a file, as such?
Sorry if this is obvious, this is my first 'go' with Python.
You should pass stdin to is_zipfile, not zipselect. is_zipfile takes a path to a file or a file object, not a ZipFile.
See the zipfile.is_zipfile documentation
You are correct that a ZipFile can't seek because it isn't a file. It's an archive, so it can contain many files.
To do this entirely in memory will take some work. The AttributeError message means that the is_zipfile method is trying to use the seek method of the file handle you provide. But standard input is not seekable, and therefore your file object for it has no seek method.
If you really, really can't store the file on disk temporarily, then you could buffer the entire file in memory (you would need to enforce a size limit for security), and then implement some "duck" code that looks and acts like a seekable file object but really just uses the byte-string in memory.
It is possible that you could cheat and buffer only enough of the data for is_zipfile to do its work, but I seem to recall that the table-of-contents for ZIP is at the end of the file. I could be wrong about that though.
Your 2011 python2 fragment was: StringIO.StringIO(sys.stdin.read())
In 2018 a python3 programmer might phrase that as: io.StringIO(...).
What you wanted was the following python3 fragment: io.BytesIO(...).
Certainly that works well for me when using the requests module to download binary ZIP files from webservers:
zf = zipfile.ZipFile(io.BytesIO(req.content))
Related
I need to create a temporary file to write some data out to in Python 3. The file will be written to via a separate module which deals with opening the file from a path given as a string.
I'm using tempfile.mkstemp() to create this temporary file and according to the docs:
mkstemp() returns a tuple containing an OS-level handle to an open file (as would be returned by os.open()) and the absolute pathname of that file, in that order.
Given I'm not going to be using the open OS-level file handle given to me, do I need to close it? I understand about regular Python file handles and closing them but I'm not familiar with OS-level file descriptors/handles.
So is this better:
fd, filename = tempfile.mkstemp()
os.close(fd)
Or can I simply just do this:
_, output_filename = tempfile.mkstemp()
The returned file descriptor is not a file object, the garbage collector will not close it for you.
You should use:
fd, filename = tempfile.mkstemp()
os.close(fd)
The returned file descriptor is useful to avoid race conditions where the filename is replaced with a symbolic link to a file that the attacker can not read but you can which can result in data exposure.
I seem to remember that the Python gzip module previously allowed you to read non-gzipped files transparently. This was really useful, as it allowed to read an input file whether or not it was gzipped. You simply didn't have to worry about it.
Now,I get an IOError exception (in Python 2.7.5):
Traceback (most recent call last):
File "tst.py", line 14, in <module>
rec = fd.readline()
File "/sw/lib/python2.7/gzip.py", line 455, in readline
c = self.read(readsize)
File "/sw/lib/python2.7/gzip.py", line 261, in read
self._read(readsize)
File "/sw/lib/python2.7/gzip.py", line 296, in _read
self._read_gzip_header()
File "/sw/lib/python2.7/gzip.py", line 190, in _read_gzip_header
raise IOError, 'Not a gzipped file'
IOError: Not a gzipped file
If anyone has a neat trick, I'd like to hear about it. Yes, I know how to catch the exception, but I find it rather clunky to first read a line, then close the file and open it again.
The best solution for this would be to use something like https://github.com/ahupp/python-magic with libmagic. You simply cannot avoid at least reading a header to identify a file (unless you implicitly trust file extensions)
If you're feeling spartan the magic number for identifying gzip(1) files is the first two bytes being 0x1f 0x8b.
In [1]: f = open('foo.html.gz')
In [2]: print `f.read(2)`
'\x1f\x8b'
gzip.open is just a wrapper around GzipFile, you could have a function like this that just returns the correct type of object depending on what the source is without having to open the file twice:
#!/usr/bin/python
import gzip
def opener(filename):
f = open(filename,'rb')
if (f.read(2) == '\x1f\x8b'):
f.seek(0)
return gzip.GzipFile(fileobj=f)
else:
f.seek(0)
return f
Maybe you're thinking of zless or zgrep, which will open compressed or uncompressed files without complaining.
Can you trust that the file name ends in .gz?
if file_name.endswith('.gz'):
opener = gzip.open
else:
opener = open
with opener(file_name, 'r') as f:
...
Read the first four bytes. If the first three are 0x1f, 0x8b, 0x08, and if the high three bits of the fourth byte are zeros, then fire up the gzip compression starting with those four bytes. Otherwise write out the four bytes and continue to read transparently.
You should still have the clunky solution to back that up, so that if the gzip read fails nevertheless, then back up and read transparently. But it should be quite unlikely to have the first four bytes mimic a gzip file so well, but not be a gzip file.
You can iterate over files transparently using fileinput(files, openhook=fileinput.hook_compressed)
I am doing some file processing and for generating the file i need to generate some temporary file from existing data and then use that file as input to my function.
But i am confused where should i save that file and then delete it.
Is there any temp location where files automatically gets deleted after user session
Python has the tempfile module for exactly this purpose. You do not need to worry about the location/deletion of the file, it works on all supported platforms.
There are three types of temporary files:
tempfile.TemporaryFile - just basic temporary file,
tempfile.NamedTemporaryFile - "This function operates exactly as TemporaryFile() does, except that the file is guaranteed to have a visible name in the file system (on Unix, the directory entry is not unlinked). That name can be retrieved from the name attribute of the file object.",
tempfile.SpooledTemporaryFile - "This function operates exactly as TemporaryFile() does, except that data is spooled in memory until the file size exceeds max_size, or until the file’s fileno() method is called, at which point the contents are written to disk and operation proceeds as with TemporaryFile().",
EDIT: The example usage you asked for could look like this:
>>> with TemporaryFile() as f:
f.write('abcdefg')
f.seek(0) # go back to the beginning of the file
print(f.read())
abcdefg
You should use something from the tempfile module. I think that it has everything you need.
I would add that Django has a built-in NamedTemporaryFile functionality in django.core.files.temp which is recommended for Windows users over using the tempfile module. This is because the Django version utilizes the O_TEMPORARY flag in Windows which prevents the file from being re-opened without the same flag being provided as explained in the code base here.
Using this would look something like:
from django.core.files.temp import NamedTemporaryFile
temp_file = NamedTemporaryFile(delete=True)
Here is a nice little tutorial about it and working with in-memory files, credit to Mayank Jain.
I just added some important changes: convert str to bytes and a command call to show how external programs can access the file when a path is given.
import os
from tempfile import NamedTemporaryFile
from subprocess import call
with NamedTemporaryFile(mode='w+b') as temp:
# Encode your text in order to write bytes
temp.write('abcdefg'.encode())
# put file buffer to offset=0
temp.seek(0)
# use the temp file
cmd = "cat "+ str(temp.name)
print(os.system(cmd))
filename = fileobject.read()
i want to transfer/assign the whole data of a object within a file.
You are almost doing it correctly already; the code should read
filecontent = fileobject.read()
read() with no arguments will read the whole data, i.e. the whole file content. The file name has nothing to do with that.
I am attempting to use the 'tempfile' module for manipulating and creating text files. Once the file is ready I want to save it to disk. I thought it would be as simple as using 'shutil.copy'. However, I get a 'permission denied' IOError:
>>> import tempfile, shutil
>>> f = tempfile.TemporaryFile(mode ='w+t')
>>> f.write('foo')
>>> shutil.copy(f.name, 'bar.txt')
Traceback (most recent call last):
File "<pyshell#5>", line 1, in <module>
shutil.copy(f.name, 'bar.txt')
File "C:\Python25\lib\shutil.py", line 80, in copy
copyfile(src, dst)
File "C:\Python25\lib\shutil.py", line 46, in copyfile
fsrc = open(src, 'rb')
IOError: [Errno 13] Permission denied: 'c:\\docume~1\\me\\locals~1\\temp\\tmpvqq3go'
>>>
Is this not intended when using the 'tempfile' library? Is there a better way to do this? (Maybe I am overlooking something very trivial)
hop is right, and dF. is incorrect on why the error occurs.
Since you haven't called f.close() yet, the file is not removed.
The doc for NamedTemporaryFile says:
Whether the name can be used to open the file a second time, while the named temporary file is still open, varies across platforms (it can be so used on Unix; it cannot on Windows NT or later).
And for TemporaryFile:
Under Unix, the directory entry for the file is removed immediately after the file is created. Other platforms do not support this; your code should not rely on a temporary file created using this function having or not having a visible name in the file system.
Therefore, to persist a temporary file (on Windows), you can do the following:
import tempfile, shutil
f = tempfile.NamedTemporaryFile(mode='w+t', delete=False)
f.write('foo')
file_name = f.name
f.close()
shutil.copy(file_name, 'bar.txt')
os.remove(file_name)
The solution Hans Sjunnesson provided is also off, because copyfileobj only copies from file-like object to file-like object, not file name:
shutil.copyfileobj(fsrc, fdst[, length])
Copy the contents of the file-like object fsrc to the file-like object fdst. The integer length, if given, is the buffer size. In particular, a negative length value means to copy the data without looping over the source data in chunks; by default the data is read in chunks to avoid uncontrolled memory consumption. Note that if the current file position of the fsrc object is not 0, only the contents from the current file position to the end of the file will be copied.
The file you create with TemporaryFile or NamedTemporaryFile is automatically removed when it's closed, which is why you get an error. If you don't want this, you can use mkstemp instead (see the docs for tempfile).
>>> import tempfile, shutil, os
>>> fd, path = tempfile.mkstemp()
>>> os.write(fd, 'foo')
>>> os.close(fd)
>>> shutil.copy(path, 'bar.txt')
>>> os.remove(path)
Starting from python 2.6 you can also use NamedTemporaryFile with the delete= option set to False. This way the temporary file will be accessible, even after you close it.
Note that on Windows (NT and later) you cannot access the file a second time while it is still open. You have to close it before you can copy it. This is not true on Unix systems.
You could always use shutil.copyfileobj, in your example:
new_file = open('bar.txt', 'rw')
shutil.copyfileobj(f, new_file)