I am writing a program in python and using tarfile to extract tarfiles. Some of these tarfiles contain folders which start with a / or (Alternatively for windows \) which cause problems (files are extracted to wrong place). How can I get around this issue and make sure that the extraction ends up in correct place ?
The docs for tarfile explicitly warn about such a scenario. Instead you need to iterate over the content of the tar file and extract each file individually:
import os
import tarfile

extract_to = "."
tfile = tarfile.open('so.tar')
members = tfile.getmembers()
for m in members:
    # Strip any leading separator so the member is extracted under extract_to
    if m.name.startswith(('/', os.sep)):
        m.name = m.name.lstrip('/' + os.sep)
    tfile.extract(m, path=extract_to)
Did you try the extractall() method? As I remember, one of its arguments specifies where the archive should be extracted.
Here is the code we have developed for a single directory of files:
from os import listdir

with open("/user/results.txt", "w") as f:
    for filename in listdir("/user/stream"):
        with open('/user/stream/' + filename) as currentFile:
            text = currentFile.read()
            if 'checksum' in text:
                f.write('current word in ' + filename[:-4] + '\n')
            else:
                f.write('NOT ' + filename[:-4] + '\n')
I want to loop over all directories.
Thanks in advance
If you're using UNIX you can use grep:
grep "checksum" -R /user/stream
The -R flag allows for a recursive search inside the directory, following the symbolic links if there are any.
My suggestion is to use glob.
The glob module allows you to work with files. In the Unix universe, a directory is (or should be) a file, so glob should be able to help you with your task.
Moreover, you don't have to install anything; glob comes with Python.
Note: For the following code, you will need python3.5 or greater
This should help you out.
import os
import glob

for path in glob.glob('/ai2/data/prod/admin/inf/**', recursive=True):
    # At some point, `path` will be `/ai2/data/prod/admin/inf/inf_<$APP>_pvt/error`
    if not os.path.isdir(path):
        # Check the `id` of the file and do things with the file.
        # If there are files inside `/ai2/data/prod/admin/inf/inf_<$APP>_pvt/error`,
        # you will be able to access them here.
        pass
What glob.glob does is return a possibly-empty list of path names that match pathname. In this case, it will match every file (including directories) in /user/stream/. If these files are not directories, you can do whatever you want with them.
I hope this will help you!
Clarification
Regarding your 3-point comment attempting to clarify the question, especially this part: we need to put appi dynamically in that path then we need to read all files inside that directory
No, you do not need to do this. Please read my answer carefully and please read glob documentation.
In this case, it will match every file (including directories) in /user/stream/
If you replace /user/stream/ with /ai2/data/prod/admin/inf/, you will have access to every file in /ai2/data/prod/admin/inf/. Assuming your app ids are 1, 2, 3, this means you will have access to the following files.
/ai2/data/prod/admin/inf/inf_1_pvt/error
/ai2/data/prod/admin/inf/inf_2_pvt/error
/ai2/data/prod/admin/inf/inf_3_pvt/error
You do not have to specify the id, because you will be iterating over all files. If you do need the id, you can just extract it from the path.
If everything looks like this, /ai2/data/prod/admin/inf/inf_<$APP>_pvt/error, you can get the id by removing /ai2/data/prod/admin/inf/ and taking what sits between the first and second _.
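A minimal sketch of that id extraction (the prefix and path here are just illustrative values following the layout above):

```python
prefix = '/ai2/data/prod/admin/inf/'
path = '/ai2/data/prod/admin/inf/inf_1_pvt/error'

remainder = path[len(prefix):]    # 'inf_1_pvt/error'
app_id = remainder.split('_')[1]  # the id sits between the first and second '_'
print(app_id)  # -> '1'
```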
My zip file contains a lot of smaller zip files.
I want to iterate through all those files,
reading and printing each of their comments.
I've found out that zipfile file.zip or unzip -z file.zip can do this for a single file, but I'm looking for a way to go through all of them.
I couldn't find anything perfect yet, except this post. However, the code is too advanced for me, and I need something very basic to begin with :)
Any ideas or information would be great, thanks!
Not sure exactly what you're looking for, but here are a few ways I did it on an Ubuntu Linux machine.
for i in *.zip; do unzip -l "$i"; done
or
unzip -l myzip.zip
or
unzip -p myzip.zip | python3 -c 'import io,sys,zipfile; print("\n".join(zipfile.ZipFile(io.BytesIO(sys.stdin.buffer.read())).namelist()))'
You can use the zipfile library to iterate through your files and
get their comments using ZipInfo.comment:
import zipfile

file = zipfile.ZipFile('filepath.zip')
infolist = file.infolist()
for info in infolist:
    print(info.comment)
The example above prints the comment of each file in your zip file.
You could loop through your zip files and print their contents' comments similarly.
Check out the official zipfile documentation, it's super clear.
A short and easy way to achieve this. Note that the inner zips are members of the parent archive, so each one has to be read out of the parent before it can be opened as a zip in its own right:

from io import BytesIO
from zipfile import ZipFile

parent = ZipFile('parentzip.zip')
for childname in parent.namelist():
    # Read the child archive into memory and open it as a zip file
    childzip = ZipFile(BytesIO(parent.read(childname)))
    zip_comment = childzip.comment
Reminder that if you want to do string based comparisons you should either encode your reference string as bytes, or convert the comment into a string. Ex:
from zipfile import ZipFile

paths = ['file1.zip', 'file2.zip', 'file3.zip']
bad_str = 'please ignore me'
new = []
for filename in paths:
    zip_comment = ZipFile(filename).comment
    if not zip_comment == str.encode(bad_str):
        new.append(filename)
paths = new
I want to extract all files with the pattern *_sl_H* from many tar.gz files, without extracting all files from the archives.
I found these lines, but it is not possible to use wildcards with them (https://pymotw.com/2/tarfile/):
import tarfile
import os

os.mkdir('outdir')
t = tarfile.open('example.tar', 'r')
t.extractall('outdir', members=[t.getmember('README.txt')])
print(os.listdir('outdir'))
Does someone have an idea?
Many thanks in advance.
Take a look at the TarFile.getmembers() method, which returns the members of the archive as a list. Once you have this list, you can decide with a condition which files should be extracted.
import tarfile
import os

os.mkdir('outdir')
t = tarfile.open('example.tar', 'r')
for member in t.getmembers():
    if "_sl_H" in member.name:
        t.extract(member, "outdir")
print(os.listdir('outdir'))
You can extract all files matching your pattern from many tar files as follows:
Use glob to get a list of all of the *.tar or *.gz files in a given folder.
For each tar file, get a list of its members using the getmembers() function.
Use a regular expression (or a simple "substring in name" test) to filter the required files.
Pass this list of matching files to the members parameter of the extractall() function.
Exception handling is added to catch tar files that cannot be opened.
For example:
import tarfile
import glob
import re

reT = re.compile(r'.*?_sl_H.*?')

for tar_filename in glob.glob(r'\my_source_folder\*.tar'):
    try:
        t = tarfile.open(tar_filename, 'r')
    except IOError as e:
        print(e)
    else:
        t.extractall('outdir', members=[m for m in t.getmembers() if reT.search(m.name)])
I want to create a tar archive with a hierarchical directory structure from Python, using strings for the contents of the files. I've read this question , which shows a way of adding strings as files, but not as directories. How can I add directories on the fly to a tar archive without actually making them?
Something like:
archive.tgz:
    file1.txt
    file2.txt
    dir1/
        file3.txt
    dir2/
        file4.txt
Extending the example given in the question linked, you can do it as follows:
import tarfile
import StringIO
import time
tar = tarfile.TarFile("test.tar", "w")
string = StringIO.StringIO()
string.write("hello")
string.seek(0)
info = tarfile.TarInfo(name='dir')
info.type = tarfile.DIRTYPE
info.mode = 0755
info.mtime = time.time()
tar.addfile(tarinfo=info)
info = tarfile.TarInfo(name='dir/foo')
info.size=len(string.buf)
info.mtime = time.time()
tar.addfile(tarinfo=info, fileobj=string)
tar.close()
Be careful with the mode attribute, since the default value might not include execute permission for the owner of the directory, which is needed to change into it and list its contents.
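For illustration, in current CPython a fresh TarInfo defaults to mode 0o644, which lacks the execute bit a directory needs:

```python
import tarfile

dinfo = tarfile.TarInfo(name='dir')
dinfo.type = tarfile.DIRTYPE
print(oct(dinfo.mode))  # 0o644 by default: no execute bit, so the extracted
                        # directory could not be entered after extraction
dinfo.mode = 0o755      # rwxr-xr-x: the owner (and others) can enter it
```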
A slight modification to the helpful accepted answer so that it works with python 3 as well as python 2 (and matches the OP's example a bit closer):
from io import BytesIO
import tarfile
import time
# create and open empty tar file
tar = tarfile.open("test.tgz", "w:gz")
# Add a file
file1_contents = BytesIO("hello 1".encode())
finfo1 = tarfile.TarInfo(name='file1.txt')
finfo1.size = len(file1_contents.getvalue())
finfo1.mtime = time.time()
tar.addfile(tarinfo=finfo1, fileobj=file1_contents)
# create directory in the tar file
dinfo = tarfile.TarInfo(name='dir')
dinfo.type = tarfile.DIRTYPE
dinfo.mode = 0o755
dinfo.mtime = time.time()
tar.addfile(tarinfo=dinfo)
# add a file to the new directory in the tar file
file2_contents = BytesIO("hello 2".encode())
finfo2 = tarfile.TarInfo(name='dir/file2.txt')
finfo2.size = len(file2_contents.getvalue())
finfo2.mtime = time.time()
tar.addfile(tarinfo=finfo2, fileobj=file2_contents)
tar.close()
In particular, I updated the octal syntax following PEP 3127 (Integer Literal Support and Syntax), switched to BytesIO from io, used getvalue instead of buf, and used open instead of TarFile to show gzipped output as in the example. (Context manager usage (with ... as tar:) would also work in both Python 2 and Python 3, but cut and paste didn't work with my Python 2 REPL, so I didn't switch it.) Tested on Python 2.7.15+ and Python 3.7.3.
Looking at the tar file format it seems doable. The files that go in each subdirectory get the relative pathname (e.g. dir1/file3.txt) as their name.
The only trick is that you must define each directory before the files that go into it (tar won't create the necessary subdirectories on the fly). There is a special flag you can use to identify a tarfile entry as a directory, but for legacy purposes, tar also accepts file entries having names that end with / as representing directories, so you should be able to just add dir1/ as a file from a zero-length string using the same technique.
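A minimal sketch of that trailing-slash trick (the filenames are just the example layout from the question; the old-style AREGTYPE flag is used because the trailing-slash convention comes from pre-POSIX tar, and readers apply it to that entry type):

```python
import tarfile
import time
from io import BytesIO

tar = tarfile.open('legacy.tar', 'w')

# A zero-length entry whose name ends with '/' is treated as a directory
# by tar readers, without setting DIRTYPE explicitly.
dir_entry = tarfile.TarInfo(name='dir1/')
dir_entry.type = tarfile.AREGTYPE  # legacy regular-file flag
dir_entry.mtime = time.time()
tar.addfile(dir_entry)

# A file inside that directory, added from an in-memory string
content = BytesIO(b'hello')
file_entry = tarfile.TarInfo(name='dir1/file3.txt')
file_entry.size = len(content.getvalue())
file_entry.mtime = time.time()
tar.addfile(file_entry, fileobj=content)
tar.close()
```

Reading the archive back, tarfile reports the trailing-slash entry as a directory.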
I have written a piece of code which is supposed to read the text inside several files which are located in a directory. These files are basically text files, but they do not have any extensions. But my code is not able to read them:
import glob
import os

corpus_path = 'Reviews/'
for infile in glob.glob(os.path.join(corpus_path, '*.*')):
    review_file = open(infile, 'r').read()
    print review_file
To test whether this code works, I put in a dummy text file, dummy.txt, which worked because it has an extension. But I don't know what should be done so that files without extensions can be read.
Can someone help me? Thanks
Glob patterns don't work the same way as wildcards on the Windows platform. Just use * instead of *.*. i.e. os.path.join(corpus_path,'*'). Note that * will match every file in the directory - if that's not what you want then you can revise the pattern accordingly.
See the glob module documentation for more details.
Just use * instead of *.*.
The latter requires an extension to be present (more precisely, there needs to be a dot in the filename), the former doesn't.
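A quick illustration of the difference, using a throwaway temp directory with one extensionless file and one dotted file:

```python
import glob
import os
import tempfile

d = tempfile.mkdtemp()
for name in ('readme', 'notes.txt'):  # one file without an extension, one with
    open(os.path.join(d, name), 'w').close()

# '*' matches every file; '*.*' only matches names containing a dot
print(sorted(os.path.basename(p) for p in glob.glob(os.path.join(d, '*'))))    # ['notes.txt', 'readme']
print(sorted(os.path.basename(p) for p in glob.glob(os.path.join(d, '*.*'))))  # ['notes.txt']
```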
You could search for * instead of *.*, but this will match every file in your directory.
Fundamentally, this means that you will have to handle cases where the file you are opening is not a text file.
It seems that you need

from os import listdir

for filename in (fn for fn in listdir(corpus_path) if '.' not in fn):
    # do something
    pass

You could write

from os import listdir

for fn in listdir(corpus_path):
    if '.' not in fn:
        # do something
        pass

but the former, with a generator expression, spares one indentation level.