How can I unpack multi-part archives (zip/rar) in Python?


I have a 2 GB archive (preferably .zip or .rar) split into parts (let's assume 100 parts x 20 MB), and I am trying to find a way to unpack it properly. I started with a .zip archive; I had files like test.zip, test.z01, test.z02 ... test.z99, etc. When I merge them in Python like this:
for zipName in zips:
    with open(os.path.join(path_to_zip_file, "test.zip"), "ab") as f:
        with open(os.path.join(path_to_zip_file, zipName), "rb") as z:
            f.write(z.read())
and then, after the merge, unpack it like this:
with zipfile.ZipFile(os.path.join(path_to_zip_file, "test.zip"), "r") as zipObj:
    zipObj.extractall(path_to_zip_file)
I get errors like:
test.zip file isn't zip file.
So then I tried with a .rar archive. I tried to unpack just the first file to see if my code would intelligently look for and pick up the remaining archive fragments, but it did not. So again I merged the .rar files (just like in the .zip case), and then tried to unpack it by using patoolib:
patoolib.extract_archive("test.rar", outdir="path here")
When I do that, I get errors like:
patoolib.util.PatoolError: could not find an executable program to extract format rar; candidates are (rar,unrar,7z)
After some work I figured out that these merged files are corrupted (I copied one and tried to unpack it normally on Windows using WinRAR, and encountered some problems). So I tried other ways to merge, for example using cat (cat test.part.* > test.rar), but those didn't help.
How can I merge and then unpack these archive files properly in Python?

Calling 7z from Python
Rename the .zip to .zip.001, the .z01 to .zip.002, and so on.
Call 7z on the .001 file (7z x test.zip.001):
import subprocess
cmd = ['7z', 'x', 'test.zip.001']
sp = subprocess.Popen(cmd, stderr=subprocess.STDOUT, stdout=subprocess.PIPE)
output, _ = sp.communicate()  # wait for 7z to finish and collect its output
CAT
cat test.zip* > test.zip should also work, but in my experience not always: it worked for an archive holding a single file but failed for one with subfolders. Keeping the parts in the right order is mandatory.
Testing:
7z -v1m a test.zip 12MFile
cat test.zip* > test.zip
7z t test.zip
>> Everything is Ok
I can't check this with files from the "official" WinRAR (does this even still exist?!) or WinZip.
Merging the files in Python
If you want to stay in Python, this works too (again, for my 7z test files):
import shutil
import glob
with open('output_file.zip', 'wb') as wfd:
    # sorted() keeps the parts in order; glob.glob() alone returns them in arbitrary order
    for f in sorted(glob.glob('test.zip.*')):  # all files matching the search string
        with open(f, 'rb') as fd:
            shutil.copyfileobj(fd, wfd)  # concatenate
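To check that a merge like this really produced a usable archive before extracting it, a round trip can help. The following is only a sketch under stated assumptions: the part names (test.zip.001, test.zip.002, ...) mimic what 7z -v produces, the parts are plain byte slices of one zip file, and merge_parts and demo are hypothetical helper names, not part of any library. It will not repair archives that were split in a different way.

```python
import glob
import os
import shutil
import tempfile
import zipfile


def merge_parts(pattern, output_path):
    """Concatenate all files matching `pattern`, in sorted order, into output_path."""
    parts = sorted(glob.glob(pattern))  # zero-padded suffixes sort correctly as text
    if not parts:
        raise FileNotFoundError(f"no parts match {pattern!r}")
    with open(output_path, "wb") as merged:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, merged)
    return parts


def demo():
    """Round trip: build a zip, cut it into raw chunks, merge, verify."""
    with tempfile.TemporaryDirectory() as tmp:
        original = os.path.join(tmp, "original.zip")
        with zipfile.ZipFile(original, "w") as zf:
            zf.writestr("sub/hello.txt", "hello " * 1000)

        # Hypothetical part names, mimicking what `7z -v` produces.
        with open(original, "rb") as f:
            data = f.read()
        chunk = 1024
        for i in range(0, len(data), chunk):
            name = os.path.join(tmp, "test.zip.%03d" % (i // chunk + 1))
            with open(name, "wb") as part:
                part.write(data[i:i + chunk])

        merged = os.path.join(tmp, "merged.zip")
        parts = merge_parts(os.path.join(tmp, "test.zip.*"), merged)

        ok = zipfile.is_zipfile(merged)
        with zipfile.ZipFile(merged) as zf:
            bad_member = zf.testzip()  # None means every CRC checked out
            names = zf.namelist()
        return len(parts), ok, bad_member, names
```

If zipfile.is_zipfile returns False or testzip names a member, the merged file is not a valid zip and extractall should not be expected to work on it.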
Further remarks
pyunpack (Python frontend) with patool (Python backend), together with an installed unrar or p7zip-rar (7z with the non-free rar support) on Linux or 7z on Windows, can handle zip and rar (and many more) from Python.
There is a 7z x -t flag for explicitly marking the input as a split archive (this may help if the file is not named .001). Give it as e.g. 7z x -trar.split or 7z x -tzip.split or something similar.

Related

Combine two files chunked in this format XXXXX.csv.gz_1_2.tar & XXXXX.csv.gz_2_2.tar (with python or pyspark)

I have two files named XXXX.csv.gz_1_2.tar and XXXX.csv.gz_2_2.tar. My goal is to combine them so that I can unpack the complete file and get the CSV file.
Can you help me, please?
I tried to use the tar or cat command from the Linux command line via import os, like this:
import os
cat="cat C:/Users/AAAA/XXXX.csv.gz_1_2.tar C:/Users/AAAA/XXXX.csv.gz_2_2.tar > C:/Users/AAAA/XXXX.csv.gz.tar "
os.system(cat)
Thank you !

The code below is (almost) completely stolen from Add files from one tar into another tar in python, with the obvious adaptation of using two (or any number) of original tar files.
import tarfile
old_tars = ("….tar", "….tar.gz", "….tar.xz", …)
with tarfile.open("new.tar", "w") as new_tar:
    for old_tar in (tarfile.open(tar_name, "r") for tar_name in old_tars):
        for member in old_tar.getmembers():
            new_tar.addfile(member, old_tar.extractfile(member.name))
        old_tar.close()
(Of course, in a real-world program the names of the tar files wouldn't be hard-coded into the source.)

Can't extract gz file using the patool package

I am trying to use the patool package to perform a simple operation: decompressing a gz archive that consists of one file. This one file in the archive is an XML file that has exactly the same name as the archive, just without the .gz ending.
The code I use for this is:
import patoolib
filePath = 'D:\\inpath\\file.xml.gz'
outPath= 'D:\\outpath'
patoolib.extract_archive(filePath,outdir=outPath, interactive=False, verbosity=-1)
The file is extracted, but it is corrupt: it appears in the outPath folder but is 0 KB and cannot be opened. The error I get is:
PatoolError: Command `['c:\Rtools\bin\gzip.EXE', '-c', '-d', '--', 'D:\inpath\file.xml.gz', '>', 'D:\outPath\file.xml']' returned non-zero exit status 1
Now, I am certain that the archive is not corrupt, since when I perform the extraction manually using Windows Explorer, it does work properly.
This code did work for some other files, but I can't understand why this is occurring for this file. Also, I am wondering whether there is perhaps a simpler way of doing this that is known to work more smoothly.
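One possibly simpler route for a single-file .gz archive is the standard library's gzip module, which needs no external gzip executable. This is a minimal sketch, not an explanation of the patool error above; gunzip and demo are hypothetical helper names, and the sample XML payload is made up for illustration.

```python
import gzip
import os
import shutil
import tempfile


def gunzip(src_path, out_dir):
    """Decompress a single-file .gz archive into out_dir; return the new path."""
    base = os.path.basename(src_path)
    # Drop the trailing ".gz" to get the output name (file.xml.gz -> file.xml).
    out_name = base[:-3] if base.lower().endswith(".gz") else base + ".out"
    out_path = os.path.join(out_dir, out_name)
    with gzip.open(src_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)  # streams in chunks, so large files are fine
    return out_path


def demo():
    """Write a small sample .gz, decompress it, and compare with the original."""
    with tempfile.TemporaryDirectory() as tmp:
        payload = b"<root><item>1</item></root>\n"
        gz_path = os.path.join(tmp, "file.xml.gz")
        with gzip.open(gz_path, "wb") as f:
            f.write(payload)
        out_dir = os.path.join(tmp, "out")
        os.mkdir(out_dir)
        out_path = gunzip(gz_path, out_dir)
        with open(out_path, "rb") as f:
            return os.path.basename(out_path), f.read() == payload
```

With the paths from the question, the call would look like gunzip('D:\\inpath\\file.xml.gz', 'D:\\outpath').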

Python ZipFile giving different namelist than unzipping utility

I have a bunch of timestamped .jpgs in a zip file, and when I open that zip file using Python's ZipFile package, I see three files:
>>> cameraZip = zipfile.ZipFile(zipPath, 'r')
>>> cameraZip.namelist()
['20131108_200152.jpg', '20131108_203158.jpg', '20131108_205521.jpg']
When I unpack the file using Mac OSX's default .zip unexpander, I get 371 files, from '20131101_000159.jpg' up to '20131108_193152.jpg'.
Running unzip on this file gives the same result as the .zip unexpander:
$ unzip 2013.11.zip
extracting: 20131101_000159.jpg
extracting: 20131101_003156.jpg
...
extracting: 20131108_190155.jpg
extracting: 20131108_193152.jpg
Anybody have any idea what's going on?

Most likely the problem is in the zip central directory record, which wasn't correctly flushed when the zip file was created. While Python looks for the central directory (I guess), other implementations process the local file headers and find all of them.
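The difference between the two views of an archive can be reproduced with a small experiment. This is only an illustration of the mechanism guessed at above, not a diagnosis of the asker's file: it assumes a file made of two zip archives stored back to back, and make_zip is a hypothetical helper. Python's zipfile reads the central directory at the end of the data and so reports only the last archive's entries, while a scan for local file header signatures finds every entry.

```python
import io
import zipfile


def make_zip(files):
    """Return the bytes of a small uncompressed zip holding {name: data}."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, data in files.items():
            zf.writestr(zipfile.ZipInfo(name), data)  # fixed 1980 timestamp
    return buf.getvalue()


first = make_zip({"a.txt": b"alpha", "b.txt": b"beta"})
second = make_zip({"c.txt": b"gamma"})
combined = first + second  # two archives back to back, as one blob

# zipfile trusts the central directory found at the end of the data.
with zipfile.ZipFile(io.BytesIO(combined)) as zf:
    names_from_directory = zf.namelist()

# A header scan counts every local file header signature instead.
local_header_count = combined.count(b"PK\x03\x04")

print(names_from_directory, local_header_count)
```

Comparing len(cameraZip.namelist()) with such a signature count on the real file would show whether the two views disagree there as well. The count is a rough check only, since compressed data could in principle contain the same four bytes.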

How to compress with 7zip instead of zip, code changing

I have code that compresses every file in a specific folder with zip, but I want to compress with 7zip instead. How can I do that?
This is what I have so far:
for date in dict_date:  # zip each folder, naming the archive after the folder
    with ZipFile(os.path.join(src, '{0}.7z'.format(date)), 'w') as myzip:
        for subFolder in dict_date[date]:
            for fil in os.listdir(os.path.join(src, date, subFolder)):
                if not fil.endswith('.7z'):
                    myzip.write(os.path.join(src, date, subFolder, fil))

You can try the command-line method:
import subprocess
subprocess.call(['7z', 'a', filename+'.7z', filename])
or, for all files in the folder:
subprocess.call(['7z', 'a', filename+'.7z', "*.*"])
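Applied to the loop from the question, the command-line approach might look like the sketch below. It assumes that a 7z executable is on PATH and that dict_date maps each date folder name to a list of subfolder names, which is what the question's loop suggests; build_7z_command and compress_folders are hypothetical helper names. Unlike the original loop, it archives whole subfolders rather than filtering individual files.

```python
import os
import shutil
import subprocess


def build_7z_command(archive_path, sources, exe="7z"):
    """Return the argument list for `7z a <archive> <sources...>`."""
    return [exe, "a", archive_path, *sources]


def compress_folders(src, dict_date, exe="7z"):
    """Create <src>/<date>.7z from the subfolders listed for each date."""
    if shutil.which(exe) is None:
        raise FileNotFoundError(f"{exe!r} was not found on PATH")
    for date, sub_folders in dict_date.items():
        sources = [os.path.join(src, date, sub) for sub in sub_folders]
        cmd = build_7z_command(os.path.join(src, date + ".7z"), sources, exe)
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces, and check=True turns a failed 7z run into an exception instead of a silently ignored exit code.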

There doesn't appear to be a good Python module for creating a 7z archive (despite what the documentation says, py7zlib can only read them).
A workaround is to download the 7z SDK (http://www.7-zip.org/sdk.html) and use the 7zr executables that come with it via the subprocess module. 7z is in the public domain so you can carry this standalone program around without restriction.

How to walk a tar.gz file that contains zip files without extraction

I have a large tar.gz file to analyze using a Python script. The tar.gz file contains a number of zip files, which might in turn contain other .gz files. Before extracting the file, I would like to walk through the directory structure within the compressed files to see if certain files or directories are present. Looking at the tarfile and zipfile modules, I don't see any existing function that allows me to get a table of contents of a zip file within a tar.gz file.
Appreciate your help,

You can't get at it without extracting the file. However, you don't need to extract it to disk if you don't want to. You can use the tarfile.TarFile.extractfile method to get a file-like object that you can then pass to tarfile.open as the fileobj argument. For example, given these nested tarfiles:
$ cat bar/baz.txt
This is bar/baz.txt.
$ tar cvfz bar.tgz bar
bar/
bar/baz.txt
$ tar cvfz baz.tgz bar.tgz
bar.tgz
You can access files from the inner one like so:
>>> import tarfile
>>> baz = tarfile.open('baz.tgz')
>>> bar = tarfile.open(fileobj=baz.extractfile('bar.tgz'))
>>> bar.extractfile('bar/baz.txt').read()
b'This is bar/baz.txt.\n'
and they're only ever extracted to memory.
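The same idea carries over to zip files inside a tar.gz, which is the case the question asks about. The following is a sketch under assumptions: list_zips_in_targz and build_sample are hypothetical helpers, the sample archive is built in memory purely for demonstration, and each inner zip is read fully into memory, which may not suit very large members.

```python
import io
import tarfile
import zipfile


def list_zips_in_targz(fileobj):
    """Map each .zip member of a tar.gz to the names stored inside that zip."""
    listing = {}
    with tarfile.open(fileobj=fileobj, mode="r:gz") as tar:
        for member in tar:
            if member.isfile() and member.name.endswith(".zip"):
                # zipfile needs a seekable file, so hold this member in memory.
                data = tar.extractfile(member).read()
                with zipfile.ZipFile(io.BytesIO(data)) as zf:
                    listing[member.name] = zf.namelist()
    return listing


def build_sample():
    """Build a tiny tar.gz in memory that contains one zip (sample data only)."""
    zip_buf = io.BytesIO()
    with zipfile.ZipFile(zip_buf, "w") as zf:
        zf.writestr("docs/readme.txt", "hello")
        zf.writestr("data.csv", "a,b\n1,2\n")
    zip_bytes = zip_buf.getvalue()

    tar_buf = io.BytesIO()
    with tarfile.open(fileobj=tar_buf, mode="w:gz") as tar:
        info = tarfile.TarInfo("inner/archive.zip")
        info.size = len(zip_bytes)
        tar.addfile(info, io.BytesIO(zip_bytes))
    tar_buf.seek(0)
    return tar_buf


listing = list_zips_in_targz(build_sample())
print(listing)
```

For a file on disk, opening it with open(path, "rb") and passing the file object in should work the same way; nothing is written to disk.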

I suspect that this is not possible and that you'll have to program it manually.
.tar.gz files are first tar'd and then gzipped, by what are essentially two different applications in succession. To access the tar file, you're probably going to have to un-gzip it first.
Also, once you do have access to the tar file after un-gzipping it, it does not support random access well. There is no central repository in the tar file that lists the contents.
