Zipping a binary file in Python

I am trying to include a binary file within a zip file; the code snippet is below.
I first unzip the zip contents into a temporary location, add a few more files, and zip everything back into a new archive.
import zipfile
def test(fileName, tempDir):
    # unzip the file contents,may contain binary files
    myZipFile=zipfile.ZipFile(fileName,'r')
    for name in myZipFile.namelist():
        toFile = tempDir + '/' + name
        fd = open(toFile, "w")
        fd.write(myZipFile.read(name))
        fd.close()
    myZipFile.close()
    # code which post processes few of the files goes here
    #zip it back
    newZip = zipfile.ZipFile(fileName, mode='w')
    try:
        fileList = os.listdir(tempDir)
        for name in fileList:
            name = tempDir + '/' + name
            newZip.write(name,os.path.basename(name))
        newZip.close()
    except Exception:
        print 'Exception occured while writing to PAR file: ' + fileName
Some of the files may be binary files. The zipping code works fine, but when I try to unzip the result using Linux's unzip or Python's zipfile module, I get the error below:
zipfile corrupt. (please check that you have transferred or
created the zipfile in the appropriate BINARY mode and that you have
compiled UnZip properly)
I am using Python 2.3.
What's going wrong here?

You might want to upgrade, as Python 2.3 is really outdated. At the time of writing, 2.7.3 is the latest 2.x release and 3.2.3 is the latest Python release overall.
See docs.python.org:
| extractall(self, path=None, members=None, pwd=None)
| Extract all members from the archive to the current working
| directory. `path' specifies a different directory to extract to.
| `members' is optional and must be a subset of the list returned
| by namelist().
(New in version 2.6)
Take a look at Zip a folder and its content.
You might also be interested in distutils.archive_util.
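For readers who can use a newer interpreter, the extract-and-repack round trip can be sketched with extractall and shutil.make_archive (a standard-library relative of distutils.archive_util). This is a minimal Python 3 sketch, not the asker's code: the name repack and its arguments are made up for illustration, and it assumes the archive name ends in ".zip".

```python
import os
import shutil
import tempfile
import zipfile


def repack(zip_path, extra_files=()):
    """Extract zip_path, add extra_files, and write the archive back.

    Hypothetical helper for illustration; assumes zip_path ends in ".zip".
    """
    work_dir = tempfile.mkdtemp()
    try:
        # extractall writes every member in binary mode, so binary
        # files survive the round trip unchanged.
        with zipfile.ZipFile(zip_path, "r") as archive:
            archive.extractall(work_dir)
        for extra in extra_files:
            shutil.copy(extra, work_dir)
        # make_archive appends ".zip" itself, so pass the name without it.
        base_name = os.path.splitext(os.path.abspath(zip_path))[0]
        shutil.make_archive(base_name, "zip", root_dir=work_dir)
    finally:
        shutil.rmtree(work_dir)
```

Any post-processing of the extracted files would go between the extractall call and make_archive.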

Hmm, not sure if it's a bug in Python 2.3. My current work environment does not allow me to upgrade to a higher version of Python :-( :-( :-(
The workaround below worked:
import os
import zipfile
def test(fileName, tempDir):
    # unzip the file contents,may contain binary files
    myZipFile=zipfile.ZipFile(fileName,'r')
    for name in myZipFile.namelist():
        toFile = tempDir + '/' + name
        # read() returns the member's raw bytes, so "wb" mode is safe for
        # binary and text files alike; no per-file binary check is needed
        fd = open(toFile, "wb")
        fd.write(myZipFile.read(name))
        fd.close()
    myZipFile.close()
    # code which post processes few of the files goes here
    #zip it back
    newZip = zipfile.ZipFile(fileName, mode='w')
    try:
        fileList = os.listdir(tempDir)
        for name in fileList:
            name = tempDir + '/' + name
            newZip.write(name,os.path.basename(name))
        newZip.close()
    except Exception:
        print 'Exception occured while writing to PAR file: ' + fileName
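A Python 3 version of the extraction step, as a minimal sketch. ZipFile.read returns the raw bytes of a member, so every output file can be opened in "wb" mode. The name extract_members is hypothetical, and the sketch flattens any directories inside the archive, as the code above does when it re-zips.

```python
import os
import zipfile


def extract_members(zip_path, dest_dir):
    """Write every member of zip_path into dest_dir in binary mode."""
    written = []
    with zipfile.ZipFile(zip_path, "r") as archive:
        for name in archive.namelist():
            if name.endswith("/"):
                continue  # skip directory entries
            target = os.path.join(dest_dir, os.path.basename(name))
            # "wb" stores the bytes untouched. Text mode can translate
            # newline bytes (Python 2 on Windows) and corrupt binary data.
            with open(target, "wb") as out:
                out.write(archive.read(name))
            written.append(target)
    return written
```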

Related

Unable to save file in folder using python open()

My English is poor and I am using Google Translate; I am sorry for that. :)
I am unable to save a file: the error says the directory does not exist, but it does exist.
1. I can manually create the file in the file manager --> the file name is legal.
2. I can manually create the directory in the file manager --> the directory name is legal.
3. I can save other file names, such as aaa.png, to this directory --> the path is legal, there is no permission problem, and there is no problem with the writing method.
4. The file can be written to the parent directory download_pictures --> it's not a file name problem.
Thank you!
import os
path = 'download_pictures\\landscape[or]no people[or]nature[OrderBydata]\\'
download_name = '[6]772803-2500x1459-genshin+impact-lumine+(genshin+impact)-arama+(genshin+impact)-aranara+(genshin+impact)-arabalika+(genshin+impact)-arakavi+(genshin+impact).png'
filename = path + download_name
print('filename = ', filename)
# Create the folder make sure the path exists
if not os.path.exists(path):
    os.makedirs(path)
try:
    with open(filename, 'w') as f:
        f.write('test')
except Exception as e:
    print('\n【error!】First save file, failed, caught exception:', e)
print(filename)
filename = path + 'aaa.png'
with open(filename, 'w') as f:
    print('\nThe second save file, changed the file name aaa.png, the path remains unchanged')
    f.write('test')
print(filename)
path = 'download_pictures\\'
filename = path + download_name
with open(filename, 'w') as f:
    print('\nThe third save file, the file name is unchanged, but the directory has changed')
    f.write('test')
Console output:
filename = download_pictures\landscape[or]no people[or]nature[OrderBydata]\[6]772803-2500x1459-genshin+impact-lumine+(genshin+impact)-arama+(genshin+impact)-aranara+(genshin+impact)-arabalika+(genshin+impact)-arakavi+(genshin+impact).png
【error!】First save file, failed, caught exception: [Errno 2] No such file or directory: 'download_pictures\\landscape[or]no people[or]nature[OrderBydata]\\[6]772803-2500x1459-genshin+impact-lumine+(genshin+impact)-arama+(genshin+impact)-aranara+(genshin+impact)-arabalika+(genshin+impact)-arakavi+(genshin+impact).png'
download_pictures\landscape[or]no people[or]nature[OrderBydata]\[6]772803-2500x1459-genshin+impact-lumine+(genshin+impact)-arama+(genshin+impact)-aranara+(genshin+impact)-arabalika+(genshin+impact)-arakavi+(genshin+impact).png
The second save file, changed the file name aaa.png, the path remains unchanged
download_pictures\landscape[or]no people[or]nature[OrderBydata]\aaa.png
The third save file, the file name is unchanged, but the directory has changed
Process finished with exit code 0
I couldn't replicate your error (I'm using Linux and I think you have a Windows system), but in any case you should not join paths manually. Instead, use os.path.join to combine multiple components into one valid path. This also ensures that the correct path separator for your operating system is used (forward slash on Unix, backslash on Windows).
I have adapted the code until the first saving attempt accordingly and it writes a file correctly. Also, the code gets cleaner this way and it's easier to see the separate folder names.
import os
if __name__ == '__main__':
    path = os.path.join('download_pictures', 'landscape[or]no people[or]nature[OrderBydata]')
    download_name = '[6]772803-2500x1459-genshin+impact-lumine+(genshin+impact)-arama+(genshin+impact)-aranara+(genshin+impact)-arabalika+(genshin+impact)-arakavi+(genshin+impact).png'
    filename = os.path.join(path, download_name)
    print('filename = ', filename)
    # Create the folder make sure the path exists
    if not os.path.exists(path):
        os.makedirs(path)
    try:
        with open(filename, 'w') as f:
            f.write('test')
    except Exception as e:
        print('\n【error!】First save file, failed, caught exception:', e)
    print(filename)
I hope this helps. I think the issue with your approach is related to the path separators \\ under Windows.
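The same advice in compact form, as a sketch: build the directory with os.path.join, create it with makedirs, then write the file. save_text is a hypothetical helper name, and since the original error could not be reproduced, this does not confirm its cause.

```python
import os


def save_text(folder_parts, file_name, text):
    """Join folder_parts into a directory, create it, and write text."""
    folder = os.path.join(*folder_parts)
    os.makedirs(folder, exist_ok=True)  # no error if it already exists
    full_path = os.path.join(folder, file_name)
    with open(full_path, "w") as handle:
        handle.write(text)
    return full_path
```

For example, save_text(['download_pictures', 'landscape[or]no people[or]nature[OrderBydata]'], 'aaa.png', 'test') writes the file under the nested folder and returns its path.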

os.path.join issue Windows

I am having some issues with os.path.join and a Windows system. I have created a script that recursively reads files containing unstructured JSON data, creates a directory named "converted_json", and prints the content of each unstructured JSON file in a structured format into a new file within the "converted_json" directory.
I have tested the script below on macOS and upon execution, the structured JSON data is printed to new files and the new files are output to the "converted_json" directory. However, when I execute the script on a Windows system, the JSON data is printed to new files, but the files are not output to the "converted_json" directory.
Essentially, the following os.path.join code does not appear to be working on Windows in the following section:
conv_json = open(os.path.join(converted_dir, str(file_name[-1]) + '_converted'), 'wb')
The files are created, however they are not stored within the "converted_json" directory that is specified by the converted_dir variable.
The following output is from printing the "conv_json" variable:
open file 'C:\Users\test\Desktop\test\file_name.json.gz.json_converted', mode 'wb' at 0x0000000002617930
As seen above, the file path contained within the "conv_json" variable does not contain the "converted_json" directory (it should be there from using os.path.join and the converted_dir variable).
Any assistance as to how to get the structured data to output to the "converted_json" directory would be greatly appreciated.
Code below:
argparser = argparse.ArgumentParser()
argparser.add_argument('-d', '--d', dest='dir_path', type=str, default=None, required=True, help='Directory path to Archive/JSON files')
args = argparser.parse_args()
dir_path = args.dir_path
converted_dir = os.path.join(dir_path, 'converted_json')
os.mkdir(converted_dir, 0777)
for subdir1, dirs1, files1 in os.walk(dir_path):
    for file in files1:
        try:
            if file.endswith(".json"):
                file = open(os.path.join(subdir1, file))
                file_name = str.split(file.name, '/')
                conv_json = open(os.path.join(converted_dir, str(file_name[-1]) + '_converted'), 'wb')
                conv_json.write('#################################################################################################################################')
                conv_json.write('\n')
                conv_json.write('File Name: ' + file_name[-1])
                conv_json.write('\n')
                conv_json.write('#################################################################################################################################')
                conv_json.write('\n')
                parsed_json = json.load(file)
                s = cStringIO.StringIO()
                pprint.pprint(parsed_json, s)
                conv_json.write(s.getvalue())
                conv_json.close()
        except:
            print 'JSON Files Not Found'
print 'JSON Processing Completed: ' + str(datetime.datetime.now())
I think that this line is bad on Windows:
file_name = str.split(file.name, '/')
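On Windows the path separator is a backslash, so the split on '/' will not split the path at all. You should use os.path.sep instead (or os.path.basename to get the last component directly).
os.path.join then behaves confusingly because the second part you try to join is still a full absolute file path (since the split failed), and when a component is absolute, os.path.join discards everything before it.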
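The effect can be reproduced on any platform with ntpath, the Windows implementation of os.path. This is a small sketch: the converted_dir value is an assumed example path, chosen to match the printed output in the question.

```python
import ntpath  # Windows flavour of os.path, importable on any platform

# Assumed example paths, modelled on the output shown in the question.
converted_dir = r"C:\Users\test\Desktop\test\converted_json"
full_name = r"C:\Users\test\Desktop\test\file_name.json.gz.json"

# Splitting on "/" leaves a Windows path in one piece, so the "file name"
# is still absolute and join() throws converted_dir away.
pieces = full_name.split("/")
broken = ntpath.join(converted_dir, pieces[-1] + "_converted")

# basename() understands the platform's separators.
fixed = ntpath.join(converted_dir, ntpath.basename(full_name) + "_converted")

print(broken)
print(fixed)
```

In the real script, os.path.basename(file.name) plays the role of ntpath.basename here.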

Cannot find a file in my tempfile.TemporaryDirectory() for Python3

I'm having trouble working with Python3's tempfile library in general.
I need to write a file in a temporary directory, and make sure it's there. The third party software tool I use sometimes fails so I can't just open the file, I need to verify it's there first using a 'while loop' or other method before just opening it. So I need to search the tmp_dir (using os.listdir() or equivalent).
Specific help/solution and general help would be appreciated in comments.
Thank you.
Small sample code:
import os
import tempfile
with tempfile.TemporaryDirectory() as tmp_dir:
    print('tmp dir name', tmp_dir)
    # write file to tmp dir
    fout = open(tmp_dir + 'file.txt', 'w')
    fout.write('test write')
    fout.close()
    print('file.txt location', tmp_dir + 'lala.fasta')
    # working with the file is fine
    fin = open(tmp_dir + 'file.txt', 'U')
    for line in fin:
        print(line)
    # but I cannot find the file in the tmp dir like I normally use os.listdir()
    for file in os.listdir(tmp_dir):
        print('searching in directory')
        print(file)
That's expected, because the temporary directory name doesn't end with a path separator (os.sep, slash or backslash on many systems). So the file is created at the wrong level:
tmp_dir = D:\Users\foo\AppData\Local\Temp\tmpm_x5z4tx
tmp_dir + "file.txt"
=> D:\Users\foo\AppData\Local\Temp\tmpm_x5z4txfile.txt
Instead, join both paths to get a file inside your temporary dir:
fout = open(os.path.join(tmp_dir,'file.txt'), 'w')
Note that fin = open(tmp_dir + 'file.txt', 'U') does find the file. That's expected: it uses the same concatenated name, so it opens the file that was created next to tmp_dir, in the directory where tmp_dir itself was created, not inside it.
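The difference between the two ways of building the name can be checked directly. A minimal sketch (the variable names wrong, right, inside and same_parent are only for illustration):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    wrong = tmp_dir + "file.txt"               # sibling of tmp_dir
    right = os.path.join(tmp_dir, "file.txt")  # inside tmp_dir

    with open(right, "w") as fout:
        fout.write("test write")

    # The joined path is visible to os.listdir(tmp_dir).
    inside = os.listdir(tmp_dir)
    # The concatenated path lives in the parent of tmp_dir.
    same_parent = os.path.dirname(wrong) == os.path.dirname(tmp_dir)
    print(inside, same_parent)
```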

How to extract the title of a PDF document from within a script for renaming?

I have thousands of PDF files on my computer, named a0001.pdf to a3621.pdf, and inside each there is a title, e.g. "aluminum carbonate" for a0001.pdf, "aluminum nitrate" for a0002.pdf, etc., which I'd like to extract to rename my files.
I use this program to rename a file:
path=r"C:\Users\YANN\Desktop\..."
old='string 1'
new='string 2'
def rename(path,old,new):
    for f in os.listdir(path):
        os.rename(os.path.join(path, f), os.path.join(path, f.replace(old, new)))
rename(path,old,new)
I would like to know if there is a way to extract the title embedded in the PDF file so that I can use it to rename the file.
Installing the package
This cannot be solved with plain Python. You will need an external package such as pdfrw, which allows you to read PDF metadata. The installation is quite easy using the standard Python package manager pip.
On Windows, first make sure you have a recent version of pip using the shell command:
python -m pip install -U pip
On Linux:
pip install -U pip
On both platforms, then install the pdfrw package using
pip install pdfrw
The code
I combined the approaches of zeebonk and user2125722 to write something very compact and readable which is close to your original code:
import os
from pdfrw import PdfReader
path = r'C:\Users\YANN\Desktop'
def renameFileToPDFTitle(path, fileName):
    fullName = os.path.join(path, fileName)
    # Extract pdf title from pdf file
    newName = PdfReader(fullName).Info.Title
    # Remove surrounding brackets that some pdf titles have
    newName = newName.strip('()') + '.pdf'
    newFullName = os.path.join(path, newName)
    os.rename(fullName, newFullName)
for fileName in os.listdir(path):
    # Rename only pdf files
    fullName = os.path.join(path, fileName)
    if (not os.path.isfile(fullName) or fileName[-4:] != '.pdf'):
        continue
    renameFileToPDFTitle(path, fileName)
What you need is a library that can actually read PDF files. For example pdfrw:
In [8]: from pdfrw import PdfReader
In [9]: reader = PdfReader('example.pdf')
In [10]: reader.Info.Title
Out[10]: 'Example PDF document'
You can use the pdfminer library to parse the PDFs. The info property contains the title of the PDF. Here is what a sample info looks like:
[{'CreationDate': "D:20170110095753+05'30'", 'Producer': 'PDF-XChange Printer V6 (6.0 build 317.1) [Windows 10 Enterprise x64 (Build 10586)]', 'Creator': 'PDF-XChange Office Addin', 'Title': 'Python Basics'}]
Then we can extract the Title using the properties of a dictionary. Here is the whole code (including iterating all the files and renaming them):
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
import os
start = "0000"
def convert(var):
    while len(var) < 4:
        var = "0" + var
    return var
for i in range(1,3622):
    var = str(i)
    var = convert(var)
    file_name = "a" + var + ".pdf"
    fp = open(file_name, 'rb')
    parser = PDFParser(fp)
    doc = PDFDocument(parser)
    fp.close()
    metadata = doc.info # The "Info" metadata
    print metadata
    metadata = metadata[0]
    for x in metadata:
        if x == "Title":
            new_name = metadata[x] + ".pdf"
            os.rename(file_name,new_name)
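The zero-padding done by convert above can also be written with a format specification. A minimal Python 3 sketch; numbered_name is a hypothetical helper, not part of the answer:

```python
def numbered_name(index, prefix="a", width=4, suffix=".pdf"):
    """Return a zero-padded name such as a0001.pdf for index 1."""
    return f"{prefix}{index:0{width}d}{suffix}"


# Same range as the loop above: a0001.pdf through a3621.pdf.
names = [numbered_name(i) for i in range(1, 3622)]
print(names[0], names[-1])
```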
You can look at only the metadata using the Ghostscript tool pdf_info.ps. It used to ship with Ghostscript and is still available at https://r-forge.r-project.org/scm/viewvc.php/pkg/inst/ghostscript/pdf_info.ps?view=markup&root=tm
Building on Ciprian Tomoiagă's suggestion of using pdfrw, I've uploaded a script which also:
renames files in sub-directories
adds a command-line interface
handles an already existing file name by appending a random string
strips any character which is not alphanumeric from the new file name
replaces non-ASCII characters (such as á è í ò ç...) with ASCII equivalents (a e i o c) in the new file name
allows you to set the root dir and limit the length of the new file name from the command line
shows a progress bar and, after the script has finished, shows some statistics
does some error handling
As TextGeek mentioned, unfortunately not all files have the title metadata, so some files won't be renamed.
Repository: https://github.com/favict/pdf_renamefy
Usage:
After downloading the files, install the dependencies by running pip:
$ pip install -r requirements.txt
and then, to run the script:
$ python -m renamefy <directory> <filename maximum length>
...in which directory is the full path in which to look for PDF files, and filename maximum length is the length at which the file name will be truncated in case the title is too long or was incorrectly set in the file.
Both parameters are optional. If neither is provided, the directory is set to the current directory and filename maximum length is set to 120 characters.
Example:
$ python -m renamefy C:\Users\John\Downloads 120
I used it on Windows, but it should work on Linux too.
Feel free to copy, fork and edit as you see fit.
I had some issues with the solutions above, so here is my recipe:
from pathlib import Path
from pdfrw import PdfReader
import re
path_to_files = Path(r"C:\Users\Malac\Desktop\articles\Downloaded")
# Exclude windows forbidden chars for name <>:"/\|?*
# Newlines \n and backslashes will be removed anyway
exclude_chars = '[<>:"/|?*]'
for i in path_to_files.glob("*.pdf"):
    try:
        title = PdfReader(i).Info.Title
    except Exception:
        # print(f"File {i} not renamed.")
        # Skip this file; otherwise title would be undefined or stale
        continue
    # Some files have no title at all
    if not title:
        continue
    # For some reason, titles are returned in brackets - remove brackets if around titles
    if title.startswith("("):
        title = title[1:]
    if title.endswith(")"):
        title = title[:-1]
    title = re.sub(exclude_chars, "", title)
    title = re.sub(r"\\", "", title)
    title = re.sub("\n", "", title)
    # Some titles are just (), which leaves an empty string here
    if not title:
        continue
    try:
        final_path = (path_to_files / title).with_suffix(".pdf")
        if final_path.exists():
            continue
        i.rename(final_path)
    except Exception:
        # print(f"Name {i} incorrect.")
        pass
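The title clean-up in this recipe can be pulled out into a small function that is easy to test without any PDF library. This is a sketch using only the standard library; title_to_filename and its max_length default of 120 (borrowed from the renamefy description above) are assumptions for illustration.

```python
import re

# Characters Windows does not allow in file names, plus line breaks.
FORBIDDEN = re.compile(r'[<>:"/\\|?*\r\n]')


def title_to_filename(title, max_length=120):
    """Turn a raw PDF title string into a file name, or None if empty."""
    if not title:
        return None
    # Some libraries return the title wrapped in parentheses.
    if title.startswith("(") and title.endswith(")"):
        title = title[1:-1]
    cleaned = FORBIDDEN.sub("", title).strip()
    if not cleaned:
        return None
    return cleaned[:max_length] + ".pdf"
```

A caller would skip the file whenever the function returns None, which covers both missing titles and titles that are just "()".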
Once you have installed it, open the app and go to the Download folder. You will see your downloaded files there. Just long press the file you wish to rename and the Rename option will appear at the bottom.

Python - Creating bad zip files with ZipFile

My script is downloading files from URLs located in a text file, saving them temporarily to a given location, and then adding them to an already existing zip file in the same directory. The files are being downloaded successfully, and no errors are raised when adding to the zip files, but for some reason, most of the resulting zip files cannot be opened by the OS, and when I call z.printdir() on them, they do not contain all the expected files.
Relevant code:
for root, dirs, files in os.walk(os.path.join(downloadsdir,dir_dictionary['content']), False):
    if "artifacts" in root:
        solution_name = root.split('/')[-2]
        with open(os.path.join(root,'non-local-files.txt')) as file:
            for line in file:
                if "string" in line:
                    print('\tDownloading ' + urllib.unquote(urllib.unquote(line.rstrip())))
                    file_name = urllib.unquote(urllib.unquote(line.rstrip())).split('/')[-1]
                    r = requests.get(urllib.unquote(urllib.unquote(line.rstrip())))
                    with open(os.path.join(root,file_name), 'wb') as temp_file:
                        temp_file.write(r.content)
                    z = zipfile.ZipFile(os.path.join(root, solution_name + '.zip'), 'a')
                    z.write(os.path.join(root,file_name), os.path.join('Dropoff', file_name))
I guess my question is: am I doing something inherently wrong in the code, or do I have to look at the actual files being added to the zip files? The files are all OS-readable and appear normal as far as I can tell. Kind of at a loss as to how to proceed.
The archive is never closed, so its essential records may not be written. Close it after each write:
for root, dirs, files in os.walk(os.path.join(downloadsdir,dir_dictionary['content']), False):
    if "artifacts" in root:
        solution_name = root.split('/')[-2]
        with open(os.path.join(root,'non-local-files.txt')) as file:
            for line in file:
                if "string" in line:
                    print('\tDownloading ' + urllib.unquote(urllib.unquote(line.rstrip())))
                    file_name = urllib.unquote(urllib.unquote(line.rstrip())).split('/')[-1]
                    r = requests.get(urllib.unquote(urllib.unquote(line.rstrip())))
                    with open(os.path.join(root,file_name), 'wb') as temp_file:
                        temp_file.write(r.content)
                    z = zipfile.ZipFile(os.path.join(root, solution_name + '.zip'), 'a')
                    try:
                        z.write(os.path.join(root,file_name), os.path.join('Dropoff', file_name))
                    finally:
                        z.close()
PS:
https://docs.python.org/2/library/zipfile.html
Note
Archive names should be relative to the archive root, that is, they should not start with a path separator.
There is no official file name encoding for ZIP files. If you have unicode file names, you must convert them to byte strings in your desired encoding before passing them to write(). WinZip interprets all file names as encoded in CP437, also known as DOS Latin.
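On Python 2.7 and 3.2 or later, ZipFile can be used in a with statement, which closes the archive even if write() raises. A minimal Python 3 sketch; add_to_zip is a hypothetical helper, and the "Dropoff" default mirrors the folder name used in the question:

```python
import os
import zipfile


def add_to_zip(zip_path, file_path, arc_dir="Dropoff"):
    """Append file_path to zip_path under arc_dir and close the archive."""
    # Archive names are relative and use "/" as the separator.
    arc_name = arc_dir + "/" + os.path.basename(file_path)
    # Leaving the with-block calls close(), which finalizes the archive.
    with zipfile.ZipFile(zip_path, "a") as archive:
        archive.write(file_path, arc_name)
    return arc_name
```

Mode "a" creates the archive if it does not exist yet, so the same call works for the first file and for every later one.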
