My zip file contains a lot of smaller zip files.
I want to iterate through all those files,
reading and printing each of their comments.
I've found out that the zipfile module or unzip -z file.zip can do this for a single file, but I'm looking for a way to go through all of them.
I couldn't find anything perfect yet apart from this post; however, the code there is too advanced for me, and I need something very basic to begin with :)
Any ideas or information would be great, thanks!
Not sure exactly what you're looking for, but here are a few ways I did it on an Ubuntu Linux machine.
for i in *.zip; do unzip -l "$i"; done
or
unzip -l myzip.zip
or
unzip -p myzip.zip | python3 -c 'import zipfile,sys,io; print("\n".join(zipfile.ZipFile(io.BytesIO(sys.stdin.buffer.read())).namelist()))'
You can use the zipfile library to iterate through your files and get their comments using ZipInfo.comment:
import zipfile

zipped = zipfile.ZipFile('filepath.zip')
infolist = zipped.infolist()
for info in infolist:
    # each ZipInfo object carries that member's comment (as bytes)
    print(info.comment)
The example above prints the comment of each file inside your zip file. You could loop through your zip files and print their comments similarly, as sketched below. Check out the official zipfile documentation; it's super clear.
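A minimal sketch of that outer loop, assuming the zip files sit on disk in the current directory (the glob pattern is just an illustration):
import glob
import zipfile

for name in glob.glob('*.zip'):
    with zipfile.ZipFile(name) as zf:
        # .comment holds the archive-level comment as bytes
        print(name, zf.comment.decode(errors='replace'))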
A short and easy way to achieve this:
from io import BytesIO
from zipfile import ZipFile

parent = ZipFile('parentzip.zip')
for childname in parent.namelist():
    # open each nested zip in memory and read its archive-level comment
    zip_comment = ZipFile(BytesIO(parent.read(childname))).comment
    print(childname, zip_comment)
A reminder that if you want to do string-based comparisons, you should either encode your reference string as bytes or decode the comment into a string. For example:
from zipfile import ZipFile

paths = ['file1.zip', 'file2.zip', 'file3.zip']
bad_str = 'please ignore me'

new = []
for filename in paths:
    # .comment is bytes, so encode the reference string before comparing
    zip_comment = ZipFile(filename).comment
    if zip_comment != bad_str.encode():
        new.append(filename)
paths = new
I have two files with this format: XXXX.csv.gz_1_2.tar & XXXX.csv.gz_2_2.tar. My goal is to combine those files so that I can unzip the complete file and get the csv file. Can you help me, please?
I tried to use the tar or cat commands from the Linux command line via import os, like:
import os
cat="cat C:/Users/AAAA/XXXX.csv.gz_1_2.tar C:/Users/AAAA/XXXX.csv.gz_2_2.tar > C:/Users/AAAA/XXXX.csv.gz.tar "
os.system(cat)
Thank you!
The code below is (almost) completely stolen from Add files from one tar into another tar in python, with the obvious adaptation of using two (or any number of) original tar files.
import tarfile

old_tars = ("….tar", "….tar.gz", "….tar.xz", …)

with tarfile.open("new.tar", "w") as new_tar:
    for old_tar in (tarfile.open(tar_name, "r") for tar_name in old_tars):
        for member in old_tar.getmembers():
            # copy each member's header and data into the new archive
            new_tar.addfile(member, old_tar.extractfile(member.name))
        old_tar.close()
(Of course, in a real-world program the names of the tar files wouldn't be hard-coded into the source.)
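For instance, a minimal sketch of the same loop that takes the archive names from the command line instead (the script name in the comment is hypothetical):
import sys
import tarfile

# usage: python combine_tars.py new.tar part1.tar part2.tar ...
new_name, *old_names = sys.argv[1:]

with tarfile.open(new_name, "w") as new_tar:
    for old_name in old_names:
        with tarfile.open(old_name, "r") as old_tar:
            for member in old_tar.getmembers():
                # copy both the member's header and its data
                new_tar.addfile(member, old_tar.extractfile(member))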
Here is the code we have developed for a single directory of files:
from os import listdir

with open("/user/results.txt", "w") as f:
    for filename in listdir("/user/stream"):
        with open('/user/stream/' + filename) as currentFile:
            text = currentFile.read()
            if 'checksum' in text:
                f.write('current word in ' + filename[:-4] + '\n')
            else:
                f.write('NOT ' + filename[:-4] + '\n')
I want to loop over all directories.
Thanks in advance.
If you're using UNIX you can use grep:
grep "checksum" -R /user/stream
The -R flag allows for a recursive search inside the directory, following the symbolic links if there are any.
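If you'd rather stay in Python, here is a minimal sketch of the same recursive search using os.walk, adapting the code from the question (paths taken from the question):
import os

with open("/user/results.txt", "w") as f:
    # os.walk visits /user/stream and every directory nested under it
    for dirpath, dirnames, filenames in os.walk("/user/stream"):
        for filename in filenames:
            with open(os.path.join(dirpath, filename)) as currentFile:
                text = currentFile.read()
            if 'checksum' in text:
                f.write('current word in ' + filename[:-4] + '\n')
            else:
                f.write('NOT ' + filename[:-4] + '\n')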
My suggestion is to use glob.
The glob module allows you to work with files, and in the Unix universe a directory is (or should be) a file, so it should be able to help you with your task. Moreover, you don't have to install anything; glob comes with Python.
Note: for the following code, you will need Python 3.5 or greater.
This should help you out.
import os
import glob

for path in glob.glob('/ai2/data/prod/admin/inf/**', recursive=True):
    # At some point, `path` will be `/ai2/data/prod/admin/inf/inf_<$APP>_pvt/error`
    if not os.path.isdir(path):
        # Check the `id` of the file
        # Do things with the file
        # If there are files inside `/ai2/data/prod/admin/inf/inf_<$APP>_pvt/error`,
        # you will be able to access them here
        pass
What glob.glob does is return a possibly-empty list of path names that match pathname. In this case, it will match every file (including directories) in /user/stream/. If these files are not directories, you can do whatever you want with them.
I hope this will help you!
Clarification
Regarding your 3-point comment attempting to clarify the question, especially this part: "we need to put appi dynamically in that path then we need to read all files inside that directory".
No, you do not need to do this. Please read my answer carefully, and please read the glob documentation.
In this case, it will match every file (including directories) in /user/stream/
If you replace /user/stream/ with /ai2/data/prod/admin/inf/, you will have access to every file in /ai2/data/prod/admin/inf/. Assuming your app ids are 1, 2, 3, this means, you will have access to the following files.
/ai2/data/prod/admin/inf/inf_1_pvt/error
/ai2/data/prod/admin/inf/inf_2_pvt/error
/ai2/data/prod/admin/inf/inf_3_pvt/error
You do not have to specify the id, because you will be iterating over all files. If you do need the id, you can just extract it from the path.
If everything looks like /ai2/data/prod/admin/inf/inf_<$APP>_pvt/error, you can get the id by removing the /ai2/data/prod/admin/inf/ prefix and taking the piece that sits between the two underscores of inf_<$APP>_pvt.
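A minimal sketch of that extraction, assuming the directory is always named inf_<id>_pvt (the example path is hypothetical):
path = '/ai2/data/prod/admin/inf/inf_1_pvt/error'

# drop the fixed prefix, then take the piece between the underscores
rest = path.replace('/ai2/data/prod/admin/inf/', '')  # 'inf_1_pvt/error'
app_id = rest.split('_')[1]                           # '1'
print(app_id)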
I'm trying to unzip a zip file in Django using the zipfile library.
This is my code:
if formtoaddmodel.is_valid():
    content = request.FILES['content']
    unzipped = zipfile.ZipFile(content)
    print unzipped.namelist()
    for libitem in unzipped.namelist():
        filecontent = file(libitem, 'wb').write(unzipped.read(libitem))
This is the output of print unzipped.namelist()
['FileName1.jpg', 'FileName2.png', '__MACOSX/', '__MACOSX/._FileName2.png']
I'm wondering what the last two items are; it looks like a path. I don't care about them, so is there a way to filter them out?
https://superuser.com/questions/104500/what-is-macosx-folder
if libitem.startswith('__MACOSX/'):
    continue
Those files are metadata entries added by the zip utility on Macs. You can assume their names start with '__MACOSX/'.
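A minimal sketch of the extraction loop from the question with that filter applied (it also skips bare directory entries):
for libitem in unzipped.namelist():
    # skip macOS metadata entries and directory entries
    if libitem.startswith('__MACOSX/') or libitem.endswith('/'):
        continue
    with open(libitem, 'wb') as out:
        out.write(unzipped.read(libitem))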
I am writing a program in Python and using tarfile to extract tar files. Some of these tar files contain folders that start with a / (or, alternatively on Windows, a \), which causes problems: files are extracted to the wrong place. How can I get around this issue and make sure the extraction ends up in the correct place?
The docs for tarfile explicitly warn about such a scenario. Instead, you need to iterate over the contents of the tar file and extract each member individually:
import os
import tarfile

extract_to = "."
tfile = tarfile.open('so.tar')
members = tfile.getmembers()
for m in members:
    # strip a leading separator so the member extracts under extract_to
    if m.name[0] == os.sep:
        m.name = m.name[1:]
    tfile.extract(m, path=extract_to)
Did you try the extractall() method? As I remember, one of this method's arguments specifies where the archive should be extracted.
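For reference, a minimal sketch of that call; path= only sets the destination directory, and whether members with absolute names are sanitized depends on your Python version (Python 3.12+ also accepts a filter argument for this):
import tarfile

with tarfile.open('so.tar') as tfile:
    # extract everything under extract_to; on Python 3.12+ you can also
    # pass filter='data' to sanitize member names during extraction
    tfile.extractall(path='extract_to')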
I want to create a tar archive with a hierarchical directory structure from Python, using strings for the contents of the files. I've read this question, which shows a way of adding strings as files, but not as directories. How can I add directories on the fly to a tar archive without actually creating them on disk?
Something like:
archive.tgz:
    file1.txt
    file2.txt
    dir1/
        file3.txt
    dir2/
        file4.txt
Extending the example given in the linked question, you can do it as follows:
import tarfile
import StringIO
import time

tar = tarfile.TarFile("test.tar", "w")

# the file's contents, held in memory
string = StringIO.StringIO()
string.write("hello")
string.seek(0)

# add the directory entry first
info = tarfile.TarInfo(name='dir')
info.type = tarfile.DIRTYPE
info.mode = 0755
info.mtime = time.time()
tar.addfile(tarinfo=info)

# then add the file that lives inside it
info = tarfile.TarInfo(name='dir/foo')
info.size = len(string.buf)
info.mtime = time.time()
tar.addfile(tarinfo=info, fileobj=string)

tar.close()
Be careful with the mode attribute, since the default value might not include execute permission for the owner of the directory, which is needed to change into it and list its contents.
A slight modification to the helpful accepted answer so that it works with Python 3 as well as Python 2 (and matches the OP's example a bit more closely):
from io import BytesIO
import tarfile
import time
# create and open empty tar file
tar = tarfile.open("test.tgz", "w:gz")
# Add a file
file1_contents = BytesIO("hello 1".encode())
finfo1 = tarfile.TarInfo(name='file1.txt')
finfo1.size = len(file1_contents.getvalue())
finfo1.mtime = time.time()
tar.addfile(tarinfo=finfo1, fileobj=file1_contents)
# create directory in the tar file
dinfo = tarfile.TarInfo(name='dir')
dinfo.type = tarfile.DIRTYPE
dinfo.mode = 0o755
dinfo.mtime = time.time()
tar.addfile(tarinfo=dinfo)
# add a file to the new directory in the tar file
file2_contents = BytesIO("hello 2".encode())
finfo2 = tarfile.TarInfo(name='dir/file2.txt')
finfo2.size = len(file2_contents.getvalue())
finfo2.mtime = time.time()
tar.addfile(tarinfo=finfo2, fileobj=file2_contents)
tar.close()
In particular, I updated the octal syntax following PEP 3127 -- Integer Literal Support and Syntax, switched to BytesIO from the io module, used getvalue instead of buf, and used open instead of TarFile to show gzipped output as in the example. (A context manager (with ... as tar:) would also work in both Python 2 and Python 3, but cut and paste didn't work with my Python 2 REPL, so I didn't switch to it.) Tested on Python 2.7.15+ and Python 3.7.3.
Looking at the tar file format, it seems doable. The files that go into each subdirectory get the relative pathname (e.g. dir1/file3.txt) as their name.
The only trick is that you must define each directory before the files that go into it (tar won't create the necessary subdirectories on the fly). There is a special flag you can use to identify a tarfile entry as a directory, but for legacy purposes tar also accepts file entries whose names end with / as representing directories, so you should be able to just add dir1/ as a file from a zero-length string using the same technique, as sketched below.
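A minimal sketch of that legacy trick, assuming Python 3 (the explicit DIRTYPE shown in the other answers is the more robust option):
import io
import tarfile
import time

with tarfile.open("test2.tar", "w") as tar:
    # a zero-length entry whose name ends in '/' is treated as a directory
    dinfo = tarfile.TarInfo(name="dir1/")
    dinfo.mtime = time.time()
    tar.addfile(dinfo)

    # a file placed inside that directory via its relative path name
    data = b"contents of file3"
    finfo = tarfile.TarInfo(name="dir1/file3.txt")
    finfo.size = len(data)
    finfo.mtime = time.time()
    tar.addfile(finfo, io.BytesIO(data))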