I have a nested tarfile in the form of
tarfile.tar.gz
--tar1.gz
--tar1.txt
--tar2.gz
--tar3.gz
I wanted to write a little script in Python to extract all the tars breadth-first into the same folder hierarchy, i.e. tar1.txt should end up in tarfile/tar1/.
Here's the script:
#!/usr/bin/python

import os
import re
import tarfile

data = os.path.join(os.getcwd(), 'data')
dirs = [data]

while len(dirs):
    dirpath = dirs.pop(0)
    for subpath in os.listdir(dirpath):
        if not re.search('(.tar)?.gz$', subpath):
            continue
        with tarfile.open(os.path.join(dirpath, subpath)) as tarf:
            tarf.extractall(path=dirpath)

    for subpath in os.listdir(dirpath):
        newpath = os.path.join(dirpath, subpath)
        if os.path.isdir(newpath):
            dirs.append(newpath)
        elif dirpath != data or os.path.islink(newpath):
            os.remove(newpath)
But when I run the script I get the following error:
Traceback (most recent call last):
  File "./extract.py", line 16, in <module>
    with tarfile.open(os.path.join(dirpath, subpath)) as tarf:
  File "/usr/lib/python2.7/tarfile.py", line 1678, in open
    raise ReadError("file could not be opened successfully")
tarfile.ReadError: file could not be opened successfully
The '.tar.gz' file is extracted fine, but the nested '.gz' files are not. What's up here? Does the tarfile module not handle .gz files?
.gz denotes that the file is gzipped; .tar.gz means a tar file that has been gzipped. tarfile handles gzipped tars perfectly well, but it doesn't handle files that aren't tar archives (like your tar1.gz).
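To make the distinction concrete, here is a minimal sketch that decides per file whether to untar it or merely gunzip it. The helper name `unpack` and the choice to write the decompressed file into `dest` under the name minus its `.gz` suffix are assumptions of mine, not part of the original script.

```python
import gzip
import os
import shutil
import tarfile


def unpack(path, dest):
    """Unpack `path` into `dest`; return 'tar' or 'gzip' to say which route was taken.

    `unpack` is a hypothetical helper for illustration only.
    """
    if tarfile.is_tarfile(path):
        # A tar archive (compressed or not): tarfile detects the compression itself.
        with tarfile.open(path) as tarf:
            tarf.extractall(path=dest)
        return 'tar'
    # Not a tar archive: assume a single gzipped file and decompress it,
    # dropping the trailing '.gz' from the output name.
    name = os.path.basename(path)
    if name.endswith('.gz'):
        name = name[:-3]
    with gzip.open(path, 'rb') as src, open(os.path.join(dest, name), 'wb') as dst:
        shutil.copyfileobj(src, dst)
    return 'gzip'
```

A file that is neither a tar archive nor gzipped will still raise an error when it is read, so a real script would probably want to catch that case as well.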
Related
I watched a video to learn how to merge PDF files into one PDF file. I tried to modify the code a little so that it works with a folder containing the PDF files.
The main folder (Spyder) contains Demo.py, and this is the code:
import os
from PyPDF2 import PdfFileMerger

source_dir = os.getcwd() + './PDF Files'
merger = PdfFileMerger()

for item in os.listdir(source_dir):
    if item.endswith('pdf'):
        merger.append(item)

merger.write('.PDF Files/Output/Complete.pdf')
merger.close()
Inside the main folder Spyder I have a subfolder named PDF Files, which holds the PDF files, and inside PDF Files I created a folder named Output.
I get a "file not found" error for 1.pdf, although when I print item inside the loop I do get the PDF names.
The traceback of the error:
Traceback (most recent call last):
  File "demo.py", line 9, in <module>
    merger.append(item)
  File "C:\Users\Future\AppData\Local\Programs\Python\Python36\lib\site-packages\PyPDF2\merger.py", line 203, in append
    self.merge(len(self.pages), fileobj, bookmark, pages, import_bookmarks)
  File "C:\Users\Future\AppData\Local\Programs\Python\Python36\lib\site-packages\PyPDF2\merger.py", line 114, in merge
    fileobj = file(fileobj, 'rb')
FileNotFoundError: [Errno 2] No such file or directory: '1.pdf'
I was able to solve it like this:
import os
from PyPDF2 import PdfFileMerger

source_dir = './PDF Files/'
merger = PdfFileMerger()

for item in os.listdir(source_dir):
    if item.endswith('pdf'):
        #print(item)
        merger.append(source_dir + item)

merger.write(source_dir + 'Output/Complete.pdf')
merger.close()
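The underlying point is that os.listdir returns bare file names, so each one has to be joined with the directory before it can be opened. A small sketch of that step on its own, using os.path.join instead of string concatenation; the helper name `pdf_paths`, the sorting by name, and the case-insensitive extension check are my own assumptions and not part of PyPDF2:

```python
import os


def pdf_paths(source_dir):
    """Return the full paths of the PDF files directly inside source_dir, sorted by name.

    `pdf_paths` is a hypothetical helper, not part of PyPDF2.
    """
    return [
        os.path.join(source_dir, name)  # os.listdir yields bare names only
        for name in sorted(os.listdir(source_dir))
        if name.lower().endswith('.pdf')
    ]
```

Each returned path could then be passed to merger.append in place of source_dir + item.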
I have to untar around fifty *.gz files in a directory. Inside each *.gz file there is a *.TAR file and some other files.
I am trying a python script which extracts the contents of the *.gz files to a directory. But, I am not able to extract the *.TAR files inside the same directory to which the contents of *.gz are extracted.
This is how the script looks:
import tarfile
import os
import glob

basedir = "path_to _dir"
for i in glob.glob(basedir +"*.gz"):
    a = os.path.basename(i)
    b = os.path.splitext(a)[0]
    c = os.path.splitext(b)[0]
    os.mkdir(os.path.join(basedir,c))
    t1 = tarfile.open(i)
    t1.extractall(c)
    for j in os.listdir(c):
        if j.endswith('.TAR'):
            print(j)
            t2 = tarfile.open(j)
            t2.extractall()
            t2.close()
    t1.close()
It's giving me the error:
Traceback (most recent call last):
  File "./untar.py", line 16, in <module>
    t2 = tarfile.open(j)
  File "/usr/lib64/python2.7/tarfile.py", line 1660, in open
    return func(name, "r", fileobj, **kwargs)
  File "/usr/lib64/python2.7/tarfile.py", line 1722, in gzopen
    fileobj = bltn_open(name, mode + "b")
IOError: [Errno 2] No such file or directory: '0299_0108060501.TAR'
0299_0108060501.TAR is the file contained inside the *.gz file
It seems to me that I am doing something very wrong fundamentally, but I don't know what.
Since tar.gz files are TAR archives compressed with gzip, one should use
t1 = tarfile.open(i, 'r:gz')
as per the documentation.
Also, you need to combine the path of the inner file with the directory being inspected, like so:
t2 = tarfile.open(os.path.join(c, j))
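Putting both points together, a minimal sketch of the two-stage extraction might look like the following. The function name `extract_nested` and the decision to unpack the inner archives into the same directory are assumptions of mine, based on the goal stated in the question.

```python
import os
import tarfile


def extract_nested(archive, dest):
    """Extract the gzipped tar `archive` into `dest`, then extract any inner *.TAR files there too.

    Returns the paths of the inner archives that were extracted.
    `extract_nested` is a hypothetical helper for illustration.
    """
    with tarfile.open(archive, 'r:gz') as outer:
        outer.extractall(dest)
    inner_done = []
    # sorted() takes a snapshot of the listing before more files are extracted.
    for name in sorted(os.listdir(dest)):
        if name.endswith('.TAR'):
            inner_path = os.path.join(dest, name)  # full path, not the bare name
            with tarfile.open(inner_path) as inner:
                inner.extractall(dest)
            inner_done.append(inner_path)
    return inner_done
```

Using `with` closes each archive even if extraction fails, which the explicit close() calls in the original script would not do.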
I have some simple code to zip files using the zipfile module. I am able to zip some files, but I get a FileNotFoundError for others. I have checked whether this is related to file size, but it's not.
I can pack files with a name like example file.py, but when I have a file inside a directory, like 'Analyze Files en-US_es-ES.xlsx', it fails.
It works when I change os.path.basename to os.path.join, but I don't want to zip the whole folder structure; I want a flat structure in my zip.
Here is my code:
import os
import zipfile

path = input()
x=zipfile.ZipFile('new.zip', 'w')

for root, dir, files in os.walk(path):
    for eachFile in files:
        x.write(os.path.basename(eachFile))

x.close()
Error looks like this:
Traceback (most recent call last):
  File "C:/Users/mypc/Desktop/Zip test.py", line 15, in <module>
    x.write(os.path.basename(eachFile))
  File "C:\Python34\lib\zipfile.py", line 1326, in write
    st = os.stat(filename)
FileNotFoundError: [WinError 2] The system cannot find the file specified: 'Analyze Files en-US_ar-SA.xlsx'
Simply change the working directory so that each file is added without its original directory structure.
import os
import zipfile

path = input()
baseDir = os.getcwd()

with zipfile.ZipFile('new.zip', 'w') as z:
    for root, dir, files in os.walk(path):
        os.chdir(root)
        for eachFile in files:
            z.write(eachFile)
        # Go back before os.walk resumes, so a relative path stays valid.
        os.chdir(baseDir)
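An alternative that avoids changing the working directory is the arcname parameter of ZipFile.write, which sets the name stored in the archive independently of the path the file is read from. A minimal sketch; the function name `zip_flat` and its parameters are hypothetical:

```python
import os
import zipfile


def zip_flat(path, zip_name):
    """Zip every file under `path` into `zip_name` with no directory structure.

    `zip_flat` is a hypothetical helper for illustration.
    """
    with zipfile.ZipFile(zip_name, 'w') as z:
        for root, dirs, files in os.walk(path):
            for each_file in files:
                # Read from the full path, store under the bare file name.
                z.write(os.path.join(root, each_file), arcname=each_file)
```

Note that with a flat layout, two files with the same name in different subdirectories end up as duplicate entries in the archive, and the zip file itself should live outside `path` so it is not picked up by the walk.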
I can't quite make the leap despite pre-existing similar questions. Help would be valued!
I am trying to recursively parse all XML files in a directory and its subdirectories.
I am looking for the value that appears for "Operator id".
Example source XML:
<Operators>
<Operator id="OId_LD">
<OperatorCode>LD</OperatorCode>
<OperatorShortName>ARRIVA THE SHIRES LIMIT</OperatorShortName>
This is the code I have thus far:
from xml.dom.minidom import parse
import os
def jarv(target_folder):
    for root,dirs,files in os.walk(target_folder):
        for targetfile in files:
            if targetfile.endswith(".xml"):
                print targetfile
                dom=parse(targetfile)
                name = dom.getElementsByTagName('Operator_id')
                print name[0].firstChild.nodeValue
This is the terminal command I am running:
python -c "execfile('xml_tag.py'); jarv('/Users/admin/Projects/AtoB_GTFS')"
And this is the error I receive:
tfl_64-31_-37434-y05.xml
encodings.xml
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "xml_tag.py", line 8, in jarv
    dom=parse(targetfile)
  File "/usr/local/Cellar/python/2.7.8_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/dom/minidom.py", line 1918, in parse
    return expatbuilder.parse(file)
  File "/usr/local/Cellar/python/2.7.8_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/dom/expatbuilder.py", line 922, in parse
    fp = open(file, 'rb')
IOError: [Errno 2] No such file or directory: 'encodings.xml'
(frigo)andytmac:AtoB_GTFS admin$ python -c "execfile('xml_tag.py'); jarv('/Users/admin/Projects/AtoB_GTFS')"
tfl_64-31_-37434-y05.xml
If I comment out the code after the 'print target file' line it does list all the xml files I have.
Thanks for your assistance,
Andy
You're not looking in the right place (it's a relative path problem): when you use for root, dirs, files in os.walk(target_folder):, files is a list of the file names in the directory root, not their full paths.
Try replacing dom=parse(targetfile) with dom = parse(os.path.join(root, targetfile))
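Once the path is joined, there is likely a second issue: in the example XML, id is an attribute of the Operator element, so looking up a tag named 'Operator_id' would probably return an empty list. A minimal sketch that handles both points, assuming the goal is to collect every Operator id; the function name `operator_ids` is hypothetical:

```python
import os
from xml.dom.minidom import parse


def operator_ids(target_folder):
    """Return the id attribute of every <Operator> element in .xml files under target_folder.

    `operator_ids` is a hypothetical helper for illustration.
    """
    ids = []
    for root, dirs, files in os.walk(target_folder):
        for targetfile in sorted(files):
            if targetfile.endswith('.xml'):
                # Join with root: `files` holds bare names relative to `root`.
                dom = parse(os.path.join(root, targetfile))
                for node in dom.getElementsByTagName('Operator'):
                    ids.append(node.getAttribute('id'))
    return ids
```

For the example source above, the value collected for that Operator element would be OId_LD.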
I have a large directory structure, each directory containing multiple sub-directories, multiple .mbox files, or both. I need to rename all the .mbox files to the respective file name without the extension e.g.
bar.mbox -> bar
foo.mbox -> foo
Here is the script I've written:
#!/usr/bin/python

import os, sys

def walktree(top, callback):
    for path, dirs, files in os.walk(top):
        for filename in files:
            fullPath = os.path.join(path, filename)
            callback(fullPath)

def renameFile(file):
    if file.endswith('.mbox'):
        fileName, fileExt = os.path.splitext(file)
        print file, "->", fileName
        os.rename(file,fileName)

if __name__ == '__main__':
    walktree(sys.argv[1], renameFile)
When I run this using:
python walkthrough.py "directory"
I get the error:
Traceback (most recent call last):
  File "./walkthrough.py", line 18, in <module>
    walktree(sys.argv[1], renameFile)
  File "./walkthrough.py", line 9, in walktree
    callback(fullPath)
  File "./walkthrough.py", line 15, in renameFile
    os.rename(file,fileName)
OSError: [Errno 21] Is a directory
This was solved by adding an extra conditional statement to test whether the name the file was to be renamed to already existed as a directory.
If it did, an underscore was appended to the new filename.
Thanks to WKPlus for the hint on this.
BCvery1
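The fix described above was not posted as code, so the following is only a sketch of what it might look like. The function name `strip_mbox`, the use of os.path.exists (which also covers existing files, not just directories), and the loop that keeps appending underscores until the name is free are my own assumptions.

```python
import os


def strip_mbox(top):
    """Rename every *.mbox file under `top` to its name without the extension.

    If the target name already exists (for example as a directory), append
    underscores until the name is free. Returns the list of new paths.
    `strip_mbox` is a hypothetical helper for illustration.
    """
    renamed = []
    for path, dirs, files in os.walk(top):
        for filename in files:
            if not filename.endswith('.mbox'):
                continue
            source = os.path.join(path, filename)
            target = os.path.splitext(source)[0]
            while os.path.exists(target):
                target += '_'  # avoid clobbering an existing file or directory
            os.rename(source, target)
            renamed.append(target)
    return renamed
```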