Opening PDF within a zip folder fitz.open()

Opening PDF within a zip folder fitz.open() - python

I have a function that opens a zip file, finds a pdf with a given filename, then reads the first page of the pdf to get some specific text. My issue is that after I locate the correct file, I can't open it to read it. I have tried to use a relative path within the zip folder and a absolute path in my downloads folder and I keep getting the error:
no such file: 'Deliverables_Rev B\Plans_Rev B.pdf'
no such file: 'C:\Users\MyProfile\Downloads\Deliverables_Rev B\Plans_Rev B.pdf'
I have been commenting out the os.path.join line to change between the relative and absolute path as self.prefs['download_path'] returns my download folder.
I'm not sure what the issue with with the relative path is, any insight would be helpful, as I think it has to do with trying to read out of a zipped folder.
import zipfile as ZipFile
import fitz
def getjobcode(self, filename):
if '.zip' in filename:
with ZipFile(filename, 'r') as zipObj:
for document in zipObj.namelist():
if 'plans' in document.lower():
document = os.path.join(self.prefs['download_path'], document)
doc = fitz.open(document)
page1 = doc.load_page(0)
page1text = page1.get_text('text')
jobcode = page1text[page1text.index(
'PROJECT NUMBER'):page1text.index('PROJECT NUMBER') + 30][-12:]
return jobcode

I ended up extracting the zip folder into the downloads folder then parsing the pdf to get the data I needed. Afterwords I created a job folder where I wanted it and moved the extracted folder into it from the downloads folder.

Related

Copy and rename pictures based on xml nodes

I'm trying to copy all pictures from one directory (also including subdirectories) to another target directory. Whenever the exact picture name is found in one of the xml files the tool should grap all information (attributes in the parent and child nodes) and create subdirectories based on those node informations, also it should rename the picture file.
The part when it extracts all the information from the nodes is already done.
from bs4 import BeautifulSoup as bs
path_xml = r"path\file.xml"
content = []
with open(res, "r") as file:
content = file.readlines()
content = "".join(content)
def get_filename(_content):
bs_content = bs(_content, "html.parser")
# some code
picture_path = f'{pm_1}{pm_2}\{pm_3}\{pm_4}\{pm_5}_{pm_6}_{pm_7}\{pm_8}\{pm_9}.jpg'
get_filename(content)
So in the end I get a string value with the directory path and the file name I want.
Now I struggle with opening all xml files in one directory instead of just opening one file. I tryed this:
import os
dir_xml = r"path"
res = []
for path in os.listdir(dir_xml):
if os.path.isfile(os.path.join(dir_xml, path)):
res.append(path)
with open(res, "r") as file:
content = file.readlines()
but it gives me this error: TypeError: expected str, bytes or os.PathLike object, not list
How can i read through all xml files instead of just one? I have hundreds of xml files so that will take a wile :D
And another question: How can i create directories base on string?
Lets say the value of picture_path is AB\C\D\E_F_G\H\I.jpg
I would need another directory path for the destination of the created folders and a function that somehow creates folders based on that string. How can I do that?

To read all XML files in a directory, you can modify your code as follows:
import os
dir_xml = r"path"
for path in os.listdir(dir_xml):
if path.endswith(".xml"):
with open(os.path.join(dir_xml, path), "r") as file:
content = file.readlines()
content = "".join(content)
get_filename(content)
This code uses the os.listdir() function to get a list of all files in the directory specified by dir_xml. It then uses a for loop to iterate over the list of files, checking if each file ends with the .xml extension. If it does, it opens the file, reads its content, and passes it to the get_filename function.
To create directories based on a string, you can use the os.makedirs function. For example:
import os
picture_path = r'AB\C\D\E_F_G\H\I.jpg'
dest_path = r'path_to_destination'
os.makedirs(os.path.join(dest_path, os.path.dirname(picture_path)), exist_ok=True)
In this code, os.path.join is used to combine the dest_path and the directory portion of picture_path into a full path. os.path.dirname is used to extract the directory portion of picture_path. The os.makedirs function is then used to create the directories specified by the path, and the exist_ok argument is set to True to allow the function to succeed even if the directories already exist.
Finally, you can use the shutil library to copy the picture file to the destination and rename it, like this:
import shutil
src_file = os.path.join(src_path, picture_path)
dst_file = os.path.join(dest_path, picture_path)
shutil.copy(src_file, dst_file)
Here, src_file is the full path to the source picture file and dst_file is the full path to the destination. The shutil.copy function is then used to copy the file from the source to the destination.

You can use os.walk() for recursive search of files:
import os
dir_xml = r"path"
for root, dirs, files in os.walk(dir_xml): #topdown=False
for names in files:
if ".xml" in names:
print(f"file path: {root}\n XML-Files: {names}")
with open(names, 'r') as file:
content = file.readlines()

How to modify this script so that all of my files are not deleted when trying to delete files that do not have XML files with them?

I am trying to delete all .JPG files that do not have .xml files with the same name attached to them. However, when I run this script, all of my files are deleted in my directory and not just the desired images. How can I change this script so that I can just delete the images without corresponding .xml files?
Note: The only files I have in the directory are .JPG and .XML
import os
from tqdm import tqdm
path = 'C:\\users\\my_username\\path_to_directory_with_xml_and_jpg_images'
files = os.listdir(path)
for file in tqdm(files):
filename, filetype = file.split('.')
if filetype == 'xml':
continue
imgfile = os.path.join(path, file)
xmlfile = os.path.join(path, filename + '.xml')
if not os.path.exists(xmlfile):
print('{} deleted.'.format(imgfile))
os.remove(imgfile)

It's hard to tell why your code doesn't work as we don't know the exact contents of the directory. But a simpler way to do what you want could be to use the amazing pathlib library (Python >= 3.4). The method Path.with_suffix() will make the task quite easy, together with Path.glob():
from pathlib import Path
path = Path('C:\\users\\my_username\\path_to_directory_with_xml_and_jpg_images')
for imgfile in path.glob("*.jpg"):
xmlfile = imgfile.with_suffix(".xml")
if not xmlfile.exists():
imgfile.unlink()
print(imgfile, 'deleted.')

Extract all files from a zipped folder inside a directory to other directory without folder using python

# importing required modules
from zipfile import ZipFile
# specifying the zip file name
file_name = "C:\\OPR\\109521P.zip"
# opening the zip file in READ mode
with ZipFile(file_name, 'r') as zip:
# printing all the contents of the zip file
result =zip.printdir()
# extracting all the files
print('Extracting all the files now...')
zip.extractall('images')
print('Done!')
I have around 10 images zipped inside a sub folder in a zipped folder , now i want to extract all the images directly to other directory without the sub folders , I have tried using os.path.basename(name) , but i'm getting severel errors.
After the above code , I', getting all images inside a folder ,,
C:\images\109521P
Above is the output location where all 10 images are being extracted , Now i want the images to be directly extracted at
C:\images
So i want to omit the sub folder 109521P and want the images to be directly extracted at above loaction.

my_dir = r"C:\OPR"
my_zip = r"C:\OPR\109521P.zip"
with zipfile.ZipFile(my_zip) as zip_file:
for member in zip_file.namelist():
filename = os.path.basename(member)
# skip directories
if not filename:
continue
# copy file (taken from zipfile's extract)
source = zip_file.open(member)
target = open(os.path.join(my_dir, filename), "wb")
with source, target:
shutil.copyfileobj(source, target)
I got the answer , just posting it

Trying to combine PDFs from multiple folders into one PDF for each folder

I'm very new to Python/programming, and am trying to automate an office task that is very time consuming.
I have multiple folders with PDFs. For each folder I need to combine the PDFs into one PDF, and save it inside the folder whose contents it's the sum of. I've gotten the contents of one folder combined, and saved to my desktop successfully using this:
import PyPDF2
import os
Path = '/Users/jlaw/Desktop/Testing/FolderName/'
filelist = os.listdir(Path)
pdfMerger = PyPDF2.PdfFileMerger(strict=False)
for file in filelist:
if file.endswith('.pdf'):
pdfMerger.append(Path+file)
pdfOutput = open('Tab C.pdf', 'wb')
pdfMerger.write(pdfOutput)
pdfOutput.close()`
With the below code I'm trying to do the above, but for all the folders in a specific directory. When I run this, I get "Tab C.pdf" files appearing correctly, but I'm unable to open them.
import PyPDF2
import os
Path = '/Users/jlaw/Desktop/Testing/'
folders = os.listdir(Path)
def pdf_merge(filelist, foldername):
pdfMerger = PyPDF2.PdfFileMerger()
for file in filelist:
if file.endswith('.pdf'):
pdfMerger.append(Path+foldername+"/"+file)
pdfOutput = open(Path+foldername+'/Tab C.pdf', 'wb')
pdfMerger.write(pdfOutput)
pdfOutput.close()
for folder in folders:
pdf_merge(Path+'/'+folder, folder)`
I'm using Python Version: 3.8
The Tab C.pdf files are only 1kb in size. When I try and open with Adobe Acrobat, a pop up says, "There was an error opening this document. This file cannot be opened because it has no pages. If I try Chrome, it will open, but it's just an empty PDF, and with Edge (Chromium based) it says, 'We can't open this file. Something went Wrong"
Any pieces of advice or hints are much appreciated.

The below works. I'm not experienced enough yet to know why this is works, while the above doesn't.
import PyPDF2
import os
Path = 'C:/Users/jlaw/Desktop/Testing/'
folders = os.listdir(Path)
pdfMerger = PyPDF2.PdfFileMerger()
def pdf_merge(filelist): #Changed to just one argument
pdfMerger = PyPDF2.PdfFileMerger()
for file in os.listdir(filelist): #added os.listdir()
if file.endswith('.pdf'):
pdfMerger.append(filelist+'/'+file) #replaced Path+foldername with filelist
pdfOutput = open(Path+folder+'/Tab C.pdf', 'wb') #Moved back one tab to prevent infinite loop
pdfMerger.write(pdfOutput) #Moved back one tab to prevent infinite loop
pdfOutput.close() #Moved back one tab to prevent infinite loop
for folder in folders:
pdf_merge(Path+folder)` #Removed redundant + "/"

How can I extract all .zip extension in a folder without retaining directory using python?

Here is my code I don't know how can I loop every .zip in a folder, please help me: I want all contents of 5 zip files to extracted in one folder, not including its directory name
import os
import shutil
import zipfile
my_dir = r"C:\\Users\\Guest\\Desktop\\OJT\\scanner\\samples_raw"
my_zip = r"C:\\Users\\Guest\\Desktop\\OJT\\samples\\001-100.zip"
with zipfile.ZipFile(my_zip) as zip_file:
zip_file.setpassword(b"virus")
for member in zip_file.namelist():
filename = os.path.basename(member)
# skip directories
if not filename:
continue
# copy file (taken from zipfile's extract)
source = zip_file.open(member)
target = file(os.path.join(my_dir, filename), "wb")
with source, target:
shutil.copyfileobj(source, target)

repeated question, please refer below link.
How to extract zip file recursively in Pythonn

What you are looking for is glob. Which can be used like this:
#<snip>
import glob
#assuming all your zip files are in the directory below.
for my_zip in glob.glob(r"C:\\Users\\Guest\\Desktop\\OJT\\samples\\*.zip"):
with zipfile.ZipFile(my_zip) as zip_file:
zip_file.setpassword(b"virus")
for member in zip_file.namelist():
#<snip> rest of your code here.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Opening PDF within a zip folder fitz.open() - python

I ended up extracting the zip folder into the downloads folder then parsing the pdf to get the data I needed. Afterwords I created a job folder where I wanted it and moved the extracted folder into it from the downloads folder.

Related

Copy and rename pictures based on xml nodes

How to modify this script so that all of my files are not deleted when trying to delete files that do not have XML files with them?

Extract all files from a zipped folder inside a directory to other directory without folder using python

Trying to combine PDFs from multiple folders into one PDF for each folder

How can I extract all .zip extension in a folder without retaining directory using python?

Categories

Resources