Copy and rename pictures based on xml nodes - python

I'm trying to copy all pictures from one directory (also including subdirectories) to another target directory. Whenever the exact picture name is found in one of the xml files the tool should grap all information (attributes in the parent and child nodes) and create subdirectories based on those node informations, also it should rename the picture file.
The part when it extracts all the information from the nodes is already done.
from bs4 import BeautifulSoup as bs
path_xml = r"path\file.xml"
content = []
with open(res, "r") as file:
content = file.readlines()
content = "".join(content)
def get_filename(_content):
bs_content = bs(_content, "html.parser")
# some code
picture_path = f'{pm_1}{pm_2}\{pm_3}\{pm_4}\{pm_5}_{pm_6}_{pm_7}\{pm_8}\{pm_9}.jpg'
get_filename(content)
So in the end I get a string value with the directory path and the file name I want.
Now I struggle with opening all xml files in one directory instead of just opening one file. I tryed this:
import os
dir_xml = r"path"
res = []
for path in os.listdir(dir_xml):
if os.path.isfile(os.path.join(dir_xml, path)):
res.append(path)
with open(res, "r") as file:
content = file.readlines()
but it gives me this error: TypeError: expected str, bytes or os.PathLike object, not list
How can i read through all xml files instead of just one? I have hundreds of xml files so that will take a wile :D
And another question: How can i create directories base on string?
Lets say the value of picture_path is AB\C\D\E_F_G\H\I.jpg
I would need another directory path for the destination of the created folders and a function that somehow creates folders based on that string. How can I do that?

To read all XML files in a directory, you can modify your code as follows:
import os
dir_xml = r"path"
for path in os.listdir(dir_xml):
if path.endswith(".xml"):
with open(os.path.join(dir_xml, path), "r") as file:
content = file.readlines()
content = "".join(content)
get_filename(content)
This code uses the os.listdir() function to get a list of all files in the directory specified by dir_xml. It then uses a for loop to iterate over the list of files, checking if each file ends with the .xml extension. If it does, it opens the file, reads its content, and passes it to the get_filename function.
To create directories based on a string, you can use the os.makedirs function. For example:
import os
picture_path = r'AB\C\D\E_F_G\H\I.jpg'
dest_path = r'path_to_destination'
os.makedirs(os.path.join(dest_path, os.path.dirname(picture_path)), exist_ok=True)
In this code, os.path.join is used to combine the dest_path and the directory portion of picture_path into a full path. os.path.dirname is used to extract the directory portion of picture_path. The os.makedirs function is then used to create the directories specified by the path, and the exist_ok argument is set to True to allow the function to succeed even if the directories already exist.
Finally, you can use the shutil library to copy the picture file to the destination and rename it, like this:
import shutil
src_file = os.path.join(src_path, picture_path)
dst_file = os.path.join(dest_path, picture_path)
shutil.copy(src_file, dst_file)
Here, src_file is the full path to the source picture file and dst_file is the full path to the destination. The shutil.copy function is then used to copy the file from the source to the destination.

You can use os.walk() for recursive search of files:
import os
dir_xml = r"path"
for root, dirs, files in os.walk(dir_xml): #topdown=False
for names in files:
if ".xml" in names:
print(f"file path: {root}\n XML-Files: {names}")
with open(names, 'r') as file:
content = file.readlines()

Related

Opening PDF within a zip folder fitz.open()

I have a function that opens a zip file, finds a pdf with a given filename, then reads the first page of the pdf to get some specific text. My issue is that after I locate the correct file, I can't open it to read it. I have tried to use a relative path within the zip folder and a absolute path in my downloads folder and I keep getting the error:
no such file: 'Deliverables_Rev B\Plans_Rev B.pdf'
no such file: 'C:\Users\MyProfile\Downloads\Deliverables_Rev B\Plans_Rev B.pdf'
I have been commenting out the os.path.join line to change between the relative and absolute path as self.prefs['download_path'] returns my download folder.
I'm not sure what the issue with with the relative path is, any insight would be helpful, as I think it has to do with trying to read out of a zipped folder.
import zipfile as ZipFile
import fitz
def getjobcode(self, filename):
if '.zip' in filename:
with ZipFile(filename, 'r') as zipObj:
for document in zipObj.namelist():
if 'plans' in document.lower():
document = os.path.join(self.prefs['download_path'], document)
doc = fitz.open(document)
page1 = doc.load_page(0)
page1text = page1.get_text('text')
jobcode = page1text[page1text.index(
'PROJECT NUMBER'):page1text.index('PROJECT NUMBER') + 30][-12:]
return jobcode
I ended up extracting the zip folder into the downloads folder then parsing the pdf to get the data I needed. Afterwords I created a job folder where I wanted it and moved the extracted folder into it from the downloads folder.

How can I extract all .zip extension in a folder without retaining directory using python?

Here is my code I don't know how can I loop every .zip in a folder, please help me: I want all contents of 5 zip files to extracted in one folder, not including its directory name
import os
import shutil
import zipfile
my_dir = r"C:\\Users\\Guest\\Desktop\\OJT\\scanner\\samples_raw"
my_zip = r"C:\\Users\\Guest\\Desktop\\OJT\\samples\\001-100.zip"
with zipfile.ZipFile(my_zip) as zip_file:
zip_file.setpassword(b"virus")
for member in zip_file.namelist():
filename = os.path.basename(member)
# skip directories
if not filename:
continue
# copy file (taken from zipfile's extract)
source = zip_file.open(member)
target = file(os.path.join(my_dir, filename), "wb")
with source, target:
shutil.copyfileobj(source, target)
repeated question, please refer below link.
How to extract zip file recursively in Pythonn
What you are looking for is glob. Which can be used like this:
#<snip>
import glob
#assuming all your zip files are in the directory below.
for my_zip in glob.glob(r"C:\\Users\\Guest\\Desktop\\OJT\\samples\\*.zip"):
with zipfile.ZipFile(my_zip) as zip_file:
zip_file.setpassword(b"virus")
for member in zip_file.namelist():
#<snip> rest of your code here.

Python's re.findAll() function won't work as expected

I'm trying to create a python script that will find all the files from a working directory with a certain name pattern.
I have stored all files in a list and then I have tried applying the re.findall method on the list to obtain only a list of files with that name pattern.
I have written this code:
# Create the regex object that we will use to find our files
fileRegex = re.compile(r'A[0-9]*[a-z]*[0-9]*.*')
all_files = []
# Recursevly read the contents of the working_dir/Main folder #:
for folderName, subfolders, filenames in os.walk(working_directory + "/Main"):
for filename in filenames:
all_files.append(filename)
found_files = fileRegex.findall(all_files)
I get this error at the last line of the code:
TypeError: expected string or bytes-like object
I have also tried re.findall(all_files) instead of using the 'fileRegex' created prior to that line. Same error. Please tell me what am I doing wrong. Thank you so much for reading my post!
Edit(second question):
I have followed the answers and it's now working fine. I'm trying to create an archive with the files that match that pattern after I've found them. The archive was created however the way I wrote the code the whole path to the file gets included in the archive (all the folders from / up to the file). I just want the file to be included in the final .zip not the whole directories and subdirectories that make the path to it.
Here is the code. The generation of the .zipfile is at the bottom. Please give me a tip how could I solve this I've tried many things but none worked. Thanks:
# Project properties:
# Recursively read the contents of the 'Main' folder which contains files with different names.
# Select only the files whose name begin with letter A and contain digits in it. Use regexes for this.
# Archive these files in a folder named 'Created_Archive' in the project directory. Give the archive a name of your choosing.
# Files that you should find are:
# Aerials3.txt, Albert0512.txt, Alberto1341.txt
########################################################################################################################################
import os
import re
import zipfile
from pathlib import Path
# Get to the proper working directory
working_directory = os.getcwd()
if working_directory != "/home/paul/Desktop/Python_Tutorials/Projects/Files_And_Archive":
working_directory = "/home/paul/Desktop/Python_Tutorials/Projects/Files_And_Archive"
os.chdir(working_directory)
check_archive = Path(os.getcwd() + "/" + "files.zip")
if check_archive.is_file():
print("Yes. Deleting it and creating it.")
os.unlink(os.getcwd() + "/" + "files.zip")
else:
print("No. Creating it.")
# Create the regex object that we will use to find our files
fileRegex = re.compile(r'A[0-9]*[a-z]*[0-9]+.*')
found_files = []
# Create the zipfile object that we will use to create our archive
fileArchive = zipfile.ZipFile('files.zip', 'a')
# Recursevly read the contents of the working_dir/Main folder #:
for folderName, subfolders, filenames in os.walk(working_directory + "/Main"):
for filename in filenames:
if fileRegex.match(filename):
found_files.append(folderName + "/" + filename)
# Check all files have been found and create the archive. If the archive already exists
# delete it.
for file in found_files:
print(file)
fileArchive.write(file, compress_type=zipfile.ZIP_DEFLATED)
fileArchive.close()
re.findAll works on strings not on lists, so its better you use r.match over the list to filter the ones that actually matches:
found_files = [s for s in all_files if fileRegex.match(s)]
regex works on strings not lists. the following works
import re
import os
# Create the regex object that we will use to find our files
# fileRegex = re.compile(r'A[0-9]*[a-z]*[0-9]*.*')
fileRegex = re.compile(r'.*\.py')
all_files = []
found_files = []
working_directory = r"C:\Users\michael\PycharmProjects\work"
# Recursevly read the contents of the working_dir/Main folder #:
for folderName, subfolders, filenames in os.walk(working_directory):
for filename in filenames:
all_files.append(filename)
if fileRegex.search(filename):
found_files.append(filename)
print('all files\n', all_files)
print('\nfound files\n', found_files)
re.findall doesn't take a list of strings. You need re.match .
# Create the regex object that we will use to find our files
fileRegex = re.compile(r'A[0-9]*[a-z]*[0-9]*.*')
all_files = []
# Recursively read the contents of the working_dir/Main folder #:
for folderName, subfolders, filenames in os.walk(working_directory + "/Main"):
for filename in filenames:
all_files.append(filename)
found_files = [file_name for file_name in all_files if fileRegex.match(file_name)]

Applying a UDF into a for loop - Python

Example of PDF: "Smith#00$Consolidated_Performance.pdf"
The goal is to add a bookmark to page 1 of each PDF based on the filename.
(Bookmark name in example would be "Consolidated Performance")
import os
from openpyxl import load_workbook
from PyPDF2 import PdfFileMerger
cdir = "Directory of PDF" # Current directory
pdfcdir = [filename for filename in os.listdir(cdir) if filename.endswith(".pdf")]
def addbookmark(f):
output = PdfFileMerger()
name = os.path.splitext(os.path.basename(f))[0] # Split filename from .pdf extension
dp = name.index("$") + 1 # Find position of $ sign
bookmarkname = name[dp:].replace("_", " ") # replace underscores with spaces
output.addBookmark(bookmarkname, 0, parent=None) # Add bookmark
output.append(open(f, 'rb'))
output.write(open(f, 'wb'))
for f in pdfcdir:
addbookmark(f)
The UDF works fine when applied to individual PDFs, but it won't add the bookmarks when put into the loop at the bottom of the code. Any ideas on how to make the UDF loop through all PDFs within pdfcdir?
I'm pretty sure that the issue you're having has nothing to do with the loop. Rather, you're passing just the filenames and not including the directory path. It's trying to open these files in the script's current working directory (the directory the script is in, by default) rather than in the directory you read the filenames from.
So, join the directory name with each file name when calling your function.
for f in pdfcdir:
addbookmark(os.path.join(cdir, f))

How to open a file only using its extension?

I have a Python script which opens a specific text file located in a specific directory (working directory) and perform some actions.
(Assume that if there is a text file in the directory then it will always be no more than one such .txt file)
with open('TextFileName.txt', 'r') as f:
for line in f:
# perform some string manipulation and calculations
# write some results to a different text file
with open('results.txt', 'a') as r:
r.write(someResults)
My question is how I can have the script locate the text (.txt) file in the directory and open it without explicitly providing its name (i.e. without giving the 'TextFileName.txt'). So, no arguments for which text file to open would be required for this script to run.
Is there a way to achieve this in Python?
You could use os.listdir to get the files in the current directory, and filter them by their extension:
import os
txt_files = [f for f in os.listdir('.') if f.endswith('.txt')]
if len(txt_files) != 1:
raise ValueError('should be only one txt file in the current directory')
filename = txt_files[0]
You Can Also Use glob Which is easier than os
import glob
text_file = glob.glob('*.txt')
# wild card to catch all the files ending with txt and return as list of files
if len(text_file) != 1:
raise ValueError('should be only one txt file in the current directory')
filename = text_file[0]
glob searches the current directory set by os.curdir
You can change to the working directory by setting
os.chdir(r'cur_working_directory')
Since Python version 3.4, it is possible to use the great pathlib library. It offers a glob method which makes it easy to filter according to extensions:
from pathlib import Path
path = Path(".") # current directory
extension = ".txt"
file_with_extension = next(path.glob(f"*{extension}")) # returns the file with extension or None
if file_with_extension:
with open(file_with_extension):
...

Categories

Resources