Extract a .TAR file within a .gz file - python

I have to untar around fifty *.gz files in a directory. Inside each *.gz file there is a *.TAR file and some other files.
I am trying to write a Python script that extracts the contents of the *.gz files to a directory. However, I am not able to extract the inner *.TAR files into the same directory that the contents of the *.gz files are extracted to.
This is how the script looks:
import tarfile
import os
import glob
basedir = "path_to _dir"
for i in glob.glob(basedir +"*.gz"):
    a = os.path.basename(i)
    b = os.path.splitext(a)[0]
    c = os.path.splitext(b)[0]
    os.mkdir(os.path.join(basedir,c))
    t1 = tarfile.open(i)
    t1.extractall(c)
    for j in os.listdir(c):
        if j.endswith('.TAR'):
            print(j)
            t2 = tarfile.open(j)
            t2.extractall()
            t2.close()
    t1.close()
It's giving me the error:
Traceback (most recent call last):
  File "./untar.py", line 16, in <module>
    t2 = tarfile.open(j)
  File "/usr/lib64/python2.7/tarfile.py", line 1660, in open
    return func(name, "r", fileobj, **kwargs)
  File "/usr/lib64/python2.7/tarfile.py", line 1722, in gzopen
    fileobj = bltn_open(name, mode + "b")
IOError: [Errno 2] No such file or directory: '0299_0108060501.TAR'
0299_0108060501.TAR is the file contained inside the *.gz file.
It seems to me that I am doing something very wrong fundamentally, but I don't know what.

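Since tar.gz files are TAR archives compressed with gzip, you can open them with an explicit mode:
t1 = tarfile.open(i, 'r:gz')
as per the documentation (the default mode 'r' also detects the compression transparently).
Also, os.listdir returns bare file names, so you need to combine the name of the inner file with the directory being inspected, like so:
t2 = tarfile.open(os.path.join(c, j))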
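Putting both points together, a minimal sketch might look like the following. It assumes Python 3, and the function name extract_nested is hypothetical. It also extracts everything under basedir/<name>, which is what the question describes wanting rather than what the original script does.

```python
import os
import tarfile


def extract_nested(basedir):
    """Extract each *.gz tar archive in basedir, then any inner *.TAR,
    into basedir/<archive name without extensions>."""
    extracted_dirs = []
    for entry in sorted(os.listdir(basedir)):
        if not entry.endswith(".gz"):
            continue
        outer_path = os.path.join(basedir, entry)
        # "name.tar.gz" -> "name" (same double splitext as in the question)
        name = os.path.splitext(os.path.splitext(entry)[0])[0]
        target = os.path.join(basedir, name)
        os.makedirs(target, exist_ok=True)
        # Default mode "r" detects gzip compression transparently.
        with tarfile.open(outer_path) as outer:
            outer.extractall(target)
        # os.listdir returns bare names, so join them with the directory.
        for inner in os.listdir(target):
            if inner.endswith(".TAR"):
                with tarfile.open(os.path.join(target, inner)) as inner_tar:
                    inner_tar.extractall(target)
        extracted_dirs.append(target)
    return extracted_dirs
```

Calling extract_nested("path_to_dir") would return the list of directories that were created or reused. Archives from untrusted sources deserve extra care before calling extractall.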

Related

Merge PDF files with same prefix using PyPDF2 Python

I have multiple PDF files whose names consist of underscore-separated fields. I want to merge these PDF files based on the third field (the third underscore-separated value) using the Python library PyPDF2.
This is the error message:
Traceback (most recent call last):
  File "C:/test2.py", line 12, in <module>
    merger.append(filename)
  File "C:\py\lib\site-packages\PyPDF2\merger.py", line 203, in append
    self.merge(len(self.pages), fileobj, bookmark, pages, import_bookmarks)
  File "C:\py\lib\site-packages\PyPDF2\merger.py", line 114, in merge
    fileobj = file(fileobj, 'rb')
FileNotFoundError: [Errno 2] No such file or directory: '0_2021_564495_12345.pdf'
Process finished with exit code 1
For example:
0_2021_1_123.pdf
0_2021_1_1234.pdf
0_2021_1_12345.pdf
0_2021_2_123.pdf
0_2021_2_1234.pdf
0_2021_2_12345.pdf
Expected outcome:
1_merged.pdf
2_merged.pdf
Here is what I tried, but I am getting the error above. Any help is much appreciated.
from PyPDF2 import PdfFileMerger
import io
import os
files = os.listdir("C:\\test\\raw")
x=0
merger = PdfFileMerger()
for filename in files:
    print(filename.split('_')[2])
    prefix = filename.split('_')[2]
    if filename.split('_')[2] == prefix:
        merger.append(filename)
merger.write("C:\\test\\result" + prefix + "_merged.pdf")
merger.close()
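The FileNotFoundError arises because os.listdir returns bare file names, so merger.append(filename) is resolved against the current working directory rather than C:\test\raw. Separately, the script compares each file's third field with itself, so every file ends up in a single merger. A minimal sketch of the grouping step, using only the standard library, could look like this; the function name group_by_third_field is hypothetical, and the merging itself is left out.

```python
import os
from collections import defaultdict


def group_by_third_field(directory):
    """Map the third underscore-separated field of each PDF file name
    to the list of full paths that share it."""
    groups = defaultdict(list)
    for name in sorted(os.listdir(directory)):
        parts = name.split("_")
        # Skip non-PDF files and names with fewer than three fields.
        if not name.lower().endswith(".pdf") or len(parts) < 3:
            continue
        # Store the full path so later file opens do not depend on the cwd.
        groups[parts[2]].append(os.path.join(directory, name))
    return dict(groups)
```

Each group's paths could then be appended to its own PdfFileMerger, as in the question, and written to a file named after the group key.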

tarfile doesn't work for .gz files

I have a nested tarfile in the form of
tarfile.tar.gz
--tar1.gz
--tar1.txt
--tar2.gz
--tar3.gz
I wanted to write a little Python script that extracts all the archives breadth-first into a matching folder hierarchy, i.e. tar1.txt should end up in tarfile/tar1/.
Here's the script:
#!/usr/bin/python
import os
import re
import tarfile
data = os.path.join(os.getcwd(), 'data')
dirs = [data]
while len(dirs):
    dirpath = dirs.pop(0)
    for subpath in os.listdir(dirpath):
        if not re.search('(.tar)?.gz$', subpath):
            continue
        with tarfile.open(os.path.join(dirpath, subpath)) as tarf:
            tarf.extractall(path=dirpath)
    for subpath in os.listdir(dirpath):
        newpath = os.path.join(dirpath, subpath)
        if os.path.isdir(newpath):
            dirs.append(newpath)
        elif dirpath != data or os.path.islink(newpath):
            os.remove(newpath)
But when I run the script I get the following error:
Traceback (most recent call last):
  File "./extract.py", line 16, in <module>
    with tarfile.open(os.path.join(dirpath, subpath)) as tarf:
  File "/usr/lib/python2.7/tarfile.py", line 1678, in open
    raise ReadError("file could not be opened successfully")
tarfile.ReadError: file could not be opened successfully
The '.tar.gz' file is extracted fine, but the nested '.gz' files are not. What's up here? Does the tarfile module not handle .gz files?

.gz denotes that the file is gzipped; .tar.gz means a tar file that has been gzipped. tarfile handles gzipped tars perfectly well, but it doesn't handle files that aren't tar archives (like your tar1.gz).
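One way to act on this distinction is to check each file with tarfile.is_tarfile and fall back to the gzip module for plain gzipped files. The sketch below assumes Python 3; the function name unpack and its return values are hypothetical.

```python
import gzip
import os
import shutil
import tarfile


def unpack(path, dest_dir):
    """Extract `path` into dest_dir as a tar archive if it is one,
    otherwise treat it as a plain gzip-compressed file."""
    if tarfile.is_tarfile(path):
        # Covers both plain and compressed tar archives.
        with tarfile.open(path) as tarf:
            tarf.extractall(path=dest_dir)
        return "tar"
    base = os.path.basename(path)
    # "tar1.gz" -> "tar1"; other names just get a suffix appended.
    out_name = base[:-3] if base.endswith(".gz") else base + ".out"
    out_path = os.path.join(dest_dir, out_name)
    with gzip.open(path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return "gzip"
```

A file that is neither a tar archive nor gzip data would still raise an error from gzip, so a real script might want to catch that case as well.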

Unzipping file results in "BadZipFile"

I know there are similar questions, but none of them provided a solution to my problem. I am using the following code:
import os, glob
import zipfile
root = 'E:\\xx\\fashion\\*'
directory = 'E:\\xx\\fashion\\'
extension = ".zip"
date_file_list = []
for folder in glob.glob(root):
    if folder.endswith(extension): # check for ".zip" extension
        print(folder)
        zipfile.ZipFile(os.path.join(directory, folder)).extractall(os.path.join(directory, os.path.splitext(folder)[0]))
        os.remove(folder) # delete zipped file_name
And I get the following error:
Traceback (most recent call last):
  File "C:/Users/xx/unzip.py", line 12, in <module>
    zipfile.ZipFile(os.path.join(directory, folder)).extractall(os.path.join(directory, os.path.splitext(folder)[0]))
  File "C:\Users\xx\AppData\Local\Programs\Python\Python35\lib\zipfile.py", line 1026, in __init__
    self._RealGetContents()
  File "C:\Users\xx\AppData\Local\Programs\Python\Python35\lib\zipfile.py", line 1094, in _RealGetContents
    raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file
Some of the files were compressed with WinZip and some with 7-Zip, and there are too many files to unzip by hand.
Does anybody know why this error is occurring?
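The question does not establish the cause, but one possibility is that some files carry a .zip extension without being in zip format (for example, archives written in another format, or damaged downloads). A minimal sketch that checks each file with zipfile.is_zipfile before extracting, and reports the ones it skips, might look like this; it assumes Python 3, and the function name extract_zips is hypothetical.

```python
import glob
import os
import zipfile


def extract_zips(directory):
    """Extract every real zip archive in `directory` into a folder named
    after it, and collect files that only look like zips."""
    extracted, skipped = [], []
    # glob already returns full paths, so no further joining is needed.
    for path in sorted(glob.glob(os.path.join(directory, "*.zip"))):
        if not zipfile.is_zipfile(path):
            skipped.append(path)
            continue
        target = os.path.splitext(path)[0]
        with zipfile.ZipFile(path) as archive:
            archive.extractall(target)
        os.remove(path)  # delete the archive only after a successful extract
        extracted.append(path)
    return extracted, skipped
```

The skipped list then shows which files need a different tool or a fresh copy.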

Parse XML Tag value for all files in directory using Python

I can't quite make the leap despite pre-existing similar questions. Help would be valued!
I am trying to recursively parse all XML files in a directory and its subdirectories.
I am looking for the value that appears for the tag "Operator id".
Example source XML:
<Operators>
  <Operator id="OId_LD">
    <OperatorCode>LD</OperatorCode>
    <OperatorShortName>ARRIVA THE SHIRES LIMIT</OperatorShortName>
This is the code I have thus far:
from xml.dom.minidom import parse
import os
def jarv(target_folder):
    for root,dirs,files in os.walk(target_folder):
        for targetfile in files:
            if targetfile.endswith(".xml"):
                print targetfile
                dom=parse(targetfile)
                name = dom.getElementsByTagName('Operator_id')
                print name[0].firstChild.nodeValue
This is the terminal command I am running:
python -c "execfile('xml_tag.py'); jarv('/Users/admin/Projects/AtoB_GTFS')"
And this is the error I receive:
tfl_64-31_-37434-y05.xml
encodings.xml
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "xml_tag.py", line 8, in jarv
    dom=parse(targetfile)
  File "/usr/local/Cellar/python/2.7.8_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/dom/minidom.py", line 1918, in parse
    return expatbuilder.parse(file)
  File "/usr/local/Cellar/python/2.7.8_1/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/dom/expatbuilder.py", line 922, in parse
    fp = open(file, 'rb')
IOError: [Errno 2] No such file or directory: 'encodings.xml'
(frigo)andytmac:AtoB_GTFS admin$ python -c "execfile('xml_tag.py'); jarv('/Users/admin/Projects/AtoB_GTFS')"
tfl_64-31_-37434-y05.xml
If I comment out the code after the 'print targetfile' line, it does list all the XML files I have.
Thanks for your assistance,
Andy
You're not looking in the right place (relative path): when you use for root, dirs, files in os.walk(target_folder):, files is a list of the file names in the directory root, not their full paths.
Try replacing dom=parse(targetfile) with dom = parse(os.path.join(root, targetfile))
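A minimal sketch of the corrected walk follows, assuming Python 3 (the question uses Python 2 print statements). It goes one step beyond the answer: in the sample XML, id is an attribute of the Operator element rather than a tag of its own, so the sketch looks up Operator elements and reads the attribute. The function name operator_ids is hypothetical.

```python
import os
from xml.dom.minidom import parse


def operator_ids(target_folder):
    """Return (file path, id) pairs for every <Operator id="..."> element
    found in *.xml files under target_folder."""
    found = []
    for root, dirs, files in os.walk(target_folder):
        for name in files:
            if not name.endswith(".xml"):
                continue
            # Join with root: os.walk yields bare file names.
            full_path = os.path.join(root, name)
            dom = parse(full_path)
            for node in dom.getElementsByTagName("Operator"):
                if node.hasAttribute("id"):
                    found.append((full_path, node.getAttribute("id")))
    return found
```

Files that are not well-formed XML would still raise a parse error here, which a real script may want to catch.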

Renaming files recursively with Python

I have a large directory structure, each directory containing multiple sub-directories, multiple .mbox files, or both. I need to rename all the .mbox files to the respective file name without the extension e.g.
bar.mbox -> bar
foo.mbox -> foo
Here is the script I've written:
# !/usr/bin/python
import os, sys
def walktree(top, callback):
    for path, dirs, files in os.walk(top):
        for filename in files:
            fullPath = os.path.join(path, filename)
            callback(fullPath)
def renameFile(file):
    if file.endswith('.mbox'):
        fileName, fileExt = os.path.splitext(file)
        print file, "->", fileName
        os.rename(file,fileName)
if __name__ == '__main__':
    walktree(sys.argv[1], renameFile)
When I run this using:
python walkthrough.py "directory"
I get the error:
Traceback (most recent call last):
  File "./walkthrough.py", line 18, in <module>
    walktree(sys.argv[1], renameFile)
  File "./walkthrough.py", line 9, in walktree
    callback(fullPath)
  File "./walkthrough.py", line 15, in renameFile
    os.rename(file,fileName)
OSError: [Errno 21] Is a directory
This was solved by adding an extra conditional that checks whether the new name (the file name without its extension) already exists as a directory.
If it does, an underscore is appended to the new file name.
Thanks to WKPlus for the hint on this.
BCvery1
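The fix described above can be sketched roughly as follows, assuming Python 3. The function name strip_mbox_extension is hypothetical, and the check uses os.path.exists, which is slightly broader than the directory test described, so that existing files are not overwritten either.

```python
import os


def strip_mbox_extension(top):
    """Rename every *.mbox file under `top` to its name without the
    extension, appending underscores when that name is already taken."""
    renamed = []
    for path, dirs, files in os.walk(top):
        for filename in files:
            if not filename.endswith(".mbox"):
                continue
            source = os.path.join(path, filename)
            target = os.path.splitext(source)[0]
            # e.g. bar.mbox next to a directory bar/ becomes bar_
            while os.path.exists(target):
                target += "_"
            os.rename(source, target)
            renamed.append((source, target))
    return renamed
```

The returned list of (old, new) pairs makes it easy to log what was changed.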
