Python ZipFile giving different namelist than unzipping utility

I have a bunch of timestamped .jpgs in a zip file, and when I open that zip file using Python's ZipFile package, I see three files:
>>> cameraZip = zipfile.ZipFile(zipPath, 'r')
>>> cameraZip.namelist()
['20131108_200152.jpg', '20131108_203158.jpg', '20131108_205521.jpg']
When I unpack the file using Mac OS X's default .zip unarchiver, I get 371 files, from '20131101_000159.jpg' up to '20131108_193152.jpg'.
Unzipping the file with the unzip command line tool gives the same result as the unarchiver:
$ unzip 2013.11.zip
extracting: 20131101_000159.jpg
extracting: 20131101_003156.jpg
...
extracting: 20131108_190155.jpg
extracting: 20131108_193152.jpg
Anybody have any idea what's going on?

Most likely the problem is in the zip file's central directory record, which wasn't correctly written out when the zip file was created. Python's zipfile reads the central directory to build its list of entries, while other implementations walk the local file headers directly and therefore find all of them.
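If that's the case, a quick way to verify is to scan the file for local file header signatures yourself and compare the count. A rough diagnostic sketch (the PK\x03\x04 signature can in principle also appear inside compressed data, so treat this as a check, not a recovery tool):
import struct

def scan_local_headers(path):
    # List member names by scanning for local file header signatures
    # (PK\x03\x04) instead of reading the central directory.
    names = []
    with open(path, 'rb') as f:
        data = f.read()
    pos = data.find(b'PK\x03\x04')
    while pos != -1:
        # The name length is a little-endian uint16 at offset 26 of the
        # local file header; the name itself starts at offset 30.
        name_len, = struct.unpack('<H', data[pos + 26:pos + 28])
        names.append(data[pos + 30:pos + 30 + name_len].decode('utf-8', 'replace'))
        pos = data.find(b'PK\x03\x04', pos + 4)
    return names

print(len(scan_local_headers('2013.11.zip')))  # expect 371, not 3
If this counts 371 entries while namelist() shows 3, the central directory is indeed the culprit, and a tool like zip -FF can usually rebuild it.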

Related

How can I unpack multi-part archives (zip/rar) in Python?

I have a 2 GB archive (.zip or .rar, either is fine) split into parts (say, 100 parts of 20 MB each), and I am trying to find a way to unpack it properly. I started with a .zip archive; I had files like test.zip, test.z01, test.z02, ..., test.z99. When I merge them in Python like this:
for zipName in zips:
    with open(os.path.join(path_to_zip_file, "test.zip"), "ab") as f:
        with open(os.path.join(path_to_zip_file, zipName), "rb") as z:
            f.write(z.read())
and then, after the merge, unpack it like this:
with zipfile.ZipFile(os.path.join(path_to_zip_file, "test.zip"), "r") as zipObj:
    zipObj.extractall(path_to_zip_file)
I get errors like:
test.zip file isn't zip file.
So then I tried with a .rar archive. I tried to unpack just the first file, to see if my code would intelligently look for and pick up the remaining archive fragments, but it did not. So again I merged the .rar files (just like in the .zip case) and then tried to unpack the result using patoolib:
patoolib.extract_archive("test.rar", outdir="path here")
When I do that, I get errors like:
patoolib.util.PatoolError: could not find an executable program to extract format rar; candidates are (rar,unrar,7z)
After some work I figured out that these merged files are corrupted (I copied one over and tried to unpack it normally on Windows using WinRAR, and ran into the same problems). I also tried other ways to merge, for example cat test.part.* > test.rar, but those don't help either.
How can I merge and then unpack these archive files properly in Python?
Calling 7z out of Python
Rename the .zip to .zip.001, the .z01 to .zip.002, and so on.
Then call 7z on the .001 part (7z x test.zip.001):
import subprocess

cmd = ['7z', 'x', 'test.zip.001']  # 7z locates the .002, .003, ... parts itself
sp = subprocess.Popen(cmd, stderr=subprocess.STDOUT, stdout=subprocess.PIPE)
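If you want to block until 7z finishes and surface failures, a small follow-up to the Popen call above (assuming the same sp):
out, _ = sp.communicate()  # wait for 7z to exit; also drains the output pipe
if sp.returncode != 0:
    raise RuntimeError('7z failed:\n' + out.decode(errors='replace'))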
CAT
cat test.zip* > test.zip should also work, but not always, in my experience. It worked for a single file but failed for archives containing subfolders. Maintaining the right order of the parts is mandatory.
Testing:
7z a -v1m test.zip 12MFile
cat test.zip* > test.zip
7z t test.zip
>> Everything is Ok
I can't check against "official" WinRAR (does that even still exist?!) or WinZip files.
Merge File in Python
If you want to stay in Python, this works too (again, tested with my 7z test files):
import shutil
import glob

with open('output_file.zip', 'wb') as wfd:
    for f in sorted(glob.glob('test.zip.*')):  # sort so the parts stay in order
        with open(f, 'rb') as fd:
            shutil.copyfileobj(fd, wfd)  # concatenate this part onto the output
Further remarks
pyunpack (a Python frontend) with patool (a Python backend), plus an installed unrar or p7zip-rar (7z with the non-free RAR bits) on Linux, or 7z on Windows, can handle zip and rar (and many more formats) from Python; see the sketch after these remarks.
there is a 7z x -t flag for explicitly setting the type to a split archive (which may help if the file is not named .001); give it as e.g. 7z x -trar.split or 7z x -tzip.split or something.
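A minimal pyunpack sketch (assuming pyunpack and patool are installed, along with an unrar or 7z binary on the system):
from pyunpack import Archive

# Delegates to whatever backend patool finds (unrar, 7z, ...).
Archive('test.rar').extractall('output_dir')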

Why Does a Strange File Show Up in the Directory When Using os.walk()?

The project is written in PyCharm on Windows 10.
I wrote a program that grabs .docx files from a directory and searches them for information. At the end of the list of file names I get this file: "~$640188.docx"
I get this error when it hits this file:
raise BadZipfile, "File is not a zip file"
zipfile.BadZipfile: File is not a zip file
This error happens when I pass the file '~$640188.docx' to docx2txt's process method:
text = docx2txt.process(r'C:\path\to\folder\~$640188.docx')
From what I can see, this file does not exist in the directory I'm searching nor anywhere on my computer. The other strange part is that yesterday I wasn't getting this error.
I know there are sometimes "hidden" files in directories, and I've run into those before on my Mac (specifically '.DS_Store'), but this is a .docx file.
I currently have an ugly solution, which says "don't run the code if you run into '~$640188.docx'". My concern is that this will become more of a problem when I dump 11000 files into the directory.
Where does this file come from?
Below is the code for reference
import docx2txt
import os

check_files = []
for dir, subdir, files in os.walk(r'C:\path\to\folder'):
    for file in files:
        check_files.append(file)

for file in check_files:
    print "file: {0}".format(file)
    text = docx2txt.process(r'C:\path\to\folder\{0}'.format(file))
Hidden .docx files whose names start with ~$ are simply temporary files created by Word while a document is open for editing; the first two characters of the parent file's name are replaced with ~$. They are usually deleted once you save and close the document, but sometimes they manage to stick around after you quit anyway. Since they are designed to be temporary complements to a proper .docx file, they do not necessarily have a valid zip package structure at all times.
You will do well to skip those. Checking if the file name starts with '~' should be good enough. Just add the following filtering:
check_files2 = [fl for fl in check_files if fl[0] != '~']
for file in check_files2:
    print "file: {0}".format(file)
    text = docx2txt.process(r'C:\path\to\folder\{0}'.format(file))

Unzip folder by chunks in python

I have a big zip file containing many files that I'd like to unzip by chunks to avoid consuming too much memory.
I tried to use the zipfile module, but I didn't find a way to load the archive by chunks and extract it to disk.
Is there a simple way to do that in Python?
EDIT
@steven-rumbalski correctly pointed out that zipfile already handles big files correctly, unzipping the members one by one without loading the full archive.
My problem here is that my zip file sits on AWS S3 and my EC2 instance cannot hold such a big file in RAM, so I download it in chunks and would like to unzip it in chunks as well.
You don't need a special way to extract a large archive to disk. The source Lib/zipfile.py shows that zipfile is already memory efficient. Creating a zipfile.ZipFile object does not read the whole file into memory. Rather, it just reads in the table of contents for the ZIP file. ZipFile.extractall() extracts files one at a time using shutil.copyfileobj() copying from a subclass of io.BufferedIOBase.
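In code, that one-member-at-a-time behaviour looks like this (a minimal sketch; 'archive.zip' and 'target-dir' are placeholder names):
import zipfile

with zipfile.ZipFile('archive.zip') as zf:  # reads only the table of contents
    for info in zf.infolist():              # member metadata, no file data yet
        zf.extract(info, 'target-dir')      # each member is streamed to disk in turn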
If all you want to do is a one-time extraction Python provides a shortcut from the command line:
python -m zipfile -e archive.zip target-dir/
You can use zipfile (or possibly tarfile) as follows:
import zipfile

def extract_chunk(fn, directory, ix_begin, ix_end):
    # Extract only the members whose index falls in [ix_begin, ix_end).
    with zipfile.ZipFile(fn, 'r') as zf:
        infos = zf.infolist()
        for ix in range(max(0, ix_begin), min(ix_end, len(infos))):
            zf.extract(infos[ix], directory)

directory = "path"
extract_chunk("{}/file.zip".format(directory), directory, 0, 50)
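Called repeatedly with successive windows (0 to 50, then 50 to 100, and so on), this walks the whole archive in chunks while ZipFile still only reads the table of contents up front.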

python zipfile.ZipFile() method generates 20G zip file from 6M original

I am running a Python (2.7) program which zips its output so that it can be emailed.
Usually this works as expected, but occasionally the zipped file is so huge that the machine runs out of disk space. Yet when I zip the same file manually in the Finder, it works fine.
In this case, the 6 MB file gets zipped down to 1.6 MB by the Finder, but the Python zip method generated a 20 GB file. Here is the code where the zipping happens:
zip = zipfile.ZipFile(zipfilename, "w", zipfile.ZIP_DEFLATED)
for f in os.listdir("."):
    if fnmatch.fnmatch(f, "*final*"):
        zip.write(f)
zip.close()
Is there a way to fix this or at least avoid generating a gigantic file?
Do you maybe create that zip file in the same directory, so that the program then tries to add the growing zip file to itself?
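If so, a minimal guard, reusing the question's zipfilename and filter, is to skip the output archive while adding files:
import os
import fnmatch
import zipfile

with zipfile.ZipFile(zipfilename, "w", zipfile.ZIP_DEFLATED) as zf:
    for f in os.listdir("."):
        if f == os.path.basename(zipfilename):
            continue  # never add the archive we are currently writing to itself
        if fnmatch.fnmatch(f, "*final*"):
            zf.write(f)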
Is it Linux? I think you may be including hidden files and folders.

How to walk a tar.gz file that contains zip files without extraction

I have a large tar.gz file to analyze using a Python script. The tar.gz file contains a number of zip files, which might embed other .gz files. Before extracting anything, I would like to walk the directory structure inside the compressed files to see whether certain files or directories are present. Looking at the tarfile and zipfile modules, I don't see any existing function that lets me get a table of contents of a zip file inside a tar.gz file.
Appreciate your help,
You can't get at it without extracting the file. However, you don't need to extract it to disk if you don't want to. You can use the tarfile.TarFile.extractfile method to get a file-like object that you can then pass to tarfile.open as the fileobj argument. For example, given these nested tarfiles:
$ cat bar/baz.txt
This is bar/baz.txt.
$ tar cvfz bar.tgz bar
bar/
bar/baz.txt
$ tar cvfz baz.tgz bar.tgz
bar.tgz
You can access files from the inner one like so:
>>> import tarfile
>>> baz = tarfile.open('baz.tgz')
>>> bar = tarfile.open(fileobj=baz.extractfile('bar.tgz'))
>>> bar.extractfile('bar/baz.txt').read()
'This is bar/baz.txt.\n'
and they're only ever extracted to memory.
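The same trick extends to the zip files the question asks about, because zipfile.ZipFile accepts any file-like object. A sketch ('inner.zip' is a placeholder member name; buffering through io.BytesIO guarantees the seekable stream ZipFile needs):
import io
import tarfile
import zipfile

with tarfile.open('archive.tar.gz') as tar:
    member = tar.extractfile('inner.zip')  # file-like object, nothing written to disk
    data = io.BytesIO(member.read())       # ZipFile needs random access, so buffer it
    with zipfile.ZipFile(data) as zf:
        print(zf.namelist())               # table of contents of the nested zip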
I suspect that this is not possible and that you'll have to program it manually.
.tar.gz files are first tar'd and then gzipped by what are essentially two different applications, run in succession. To access the tar file, you're probably going to have to un-gzip it first.
Also, once you do have access to the tar file, it does not do random access well: there is no central index in a tar file that lists the contents.
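For what it's worth, Python's tarfile handles the gunzip step transparently (mode 'r:gz'), but listing the contents still streams through the whole archive precisely because there is no such index:
import tarfile

with tarfile.open('archive.tar.gz', 'r:gz') as tar:
    print(tar.getnames())  # walks every member header to build the list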
