I have the following problem.
I have a folder structure like this:
vol1/
    chap1/
        01.jpg
        02.JPG
        03.JPG
    chap2/
        04.JPG
        05.jpg
        06.jpg
    chap3/
        07.JPG
        08.jpg
        09.JPG
vol2/
    chap4/
        01.JPG
        02.jpg
        03.jpg
    chap5/
        04.jpg
        05.JPG
        06.jpg
    chap6/
        07.jpg
        08.JPG
        09.jpg
Inside a single vol folder the chapters are in increasing order, and the same holds for the jpg files inside each chap folder.
Now, for each vol folder, I would like to obtain a PDF that maintains the ordering of the pictures. Think of it as a comic or manga volume that was split up and needs to be put back into a single file.
How could I do it in bash or python?
I do not know in advance how many volumes I have, how many chapters are in a single volume, or how many jpg files are in a single chapter. In other words, it needs to work for any number of volumes/chapters/jpgs.
A nice addition would be handling heterogeneous picture files, e.g. both jpg and png in a single chapter, but that's a plus.
I guess this should work as intended! Tell me if you encounter issues.
import os

from PIL import Image


def merge_into_pdf(paths, name):
    # Create a list of images from the list of paths
    list_image = [Image.open(p).convert("RGB") for p in paths]
    if len(list_image) == 0:
        return
    # Save the first image as a PDF and append the rest as pages
    first, rest = list_image[0], list_image[1:]
    first.save(f"{name}.pdf", save_all=True, append_images=rest)


def main():
    # sorted() keeps volumes/chapters/pages in order
    # (works here because the numbers are zero-padded)
    for vol in sorted(os.listdir(".")):
        # only process directories whose name starts with 'vol'
        if not vol.startswith("vol"):
            continue
        paths = []
        for chap in sorted(os.listdir(vol)):
            for f in sorted(os.listdir(f"{vol}/{chap}")):
                # accept jpg and png in any letter case
                if f.lower().endswith((".jpg", ".png")):
                    paths.append(f"{vol}/{chap}/{f}")
        # merge this volume's pages into one PDF
        merge_into_pdf(paths, vol)


if __name__ == "__main__":
    main()
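One caveat about ordering: os.listdir returns names in arbitrary order, so the listings need an explicit sort, and even a plain sorted() puts chap10 before chap2 when the numbers aren't zero-padded. A natural-sort key (a sketch, not part of the code above) handles both cases:

```python
import re

def natural_key(s):
    # Split into digit and non-digit runs so numeric parts
    # compare as numbers: "chap2" sorts before "chap10"
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", s)]

print(sorted(["chap10", "chap2", "chap1"], key=natural_key))
# ['chap1', 'chap2', 'chap10']
```

Passing key=natural_key to every sorted() call in the script makes it robust to unpadded chapter and page numbers.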
I’m fairly new to deep learning and learning as I go, so sorry if this is very basic, but I’m working on a model for detecting invasive coconut rhinoceros beetles destroying palm trees using drone photography. The 1080p photos I’m given were taken at 250 ft AGL and were cropped into equal-size smaller images, some containing one or more palm trees and some containing none. I’m using Label Studio to generate the XML files that point to their jpg counterparts' paths.
My current problem is getting the XML into a CSV for training and validation in Keras. The cropped images in each directory are named the same, such as:
Drone_img1
    11.jpg
    12.jpg
    13.jpg
    …
    46.jpg
Drone_img2
    11.jpg
    12.jpg
    13.jpg
    …
    46.jpg
Drone_img1000
    11.jpg
    12.jpg
    13.jpg
    …
    46.jpg
I’m using a Python script written by a previous student that is supposed to split the data for training and validation into different directories and create the CSV file and the model. But when I run it, it appears to have a problem with the cropped images having the same naming scheme. My test and validation directories now look like this:
Test dir & validation dir
    11.jpg
    11(1).jpg
    11(2).jpg
    12.jpg
    13.jpg
    13(1).jpg
    152.jpg
    …
    999.jpg
    999(1).jpg
    1000.jpg
Note: the cropped images all had the same naming scheme but were in separate directories. However, when the script splits them into test & validation groups, it detects duplicate photo names and appends a number in parentheses.
My question: Is there a better way to preprocess image data with XML annotations into a CSV without me having to change the 1000 image names manually? Keep in mind that the XML annotations also point to their jpg paths, so if I change the jpg names I'd have to change their XML annotations too.
The only thing I can think of is to write a new cropping script that ensures the names are all different for the next time I get image data, but I would prefer not to go backward with the current data.
Edit: Looks like I need to make sure the path slashes are consistent.
Here is a picture of the Cropped Img Directories.
This is an image of the training and validation sets that were created
Here is an image of the csv files generated.
Script I created (mostly GPT) to edit the XML <path> tags:
import os
import tkinter as tk
from tkinter import filedialog
from xml.etree import ElementTree as ET

def browse_directory():
    root = tk.Tk()
    root.withdraw()
    xml_directory = filedialog.askdirectory(parent=root, title='Choose the directory of the XML files')
    jpg_directory = filedialog.askdirectory(parent=root, title='Choose the directory of the JPG files')
    batch_edit_xml(xml_directory, jpg_directory)

def headless_mode():
    xml_directory = input("Enter the path of the XML folder: ")
    jpg_directory = input("Enter the path of the JPG folder: ")
    batch_edit_xml(xml_directory, jpg_directory)

def batch_edit_xml(xml_directory, jpg_directory):
    count = 1  # initializing count to 1
    for root, dirs, files in os.walk(xml_directory):
        for file in files:
            if file.endswith(".xml"):
                file_path = os.path.join(root, file)  # full path of the XML file
                xml_tree = ET.parse(file_path)  # parsing the XML file
                xml_root = xml_tree.getroot()  # getting the root element
                filename = os.path.splitext(file)[0]  # file name without the extension
                jpg_path = os.path.join(jpg_directory, os.path.basename(root), filename + '.jpg')  # matching jpg path
                xml_root.find('./path').text = jpg_path  # update the <path> element with the jpg path
                xml_tree.write(file_path)  # write the changes back to the XML file
                print(f"{count} of {len(files)}: {file_path}")  # progress within the current folder
                count += 1
                if count > len(files):  # reset once every file in this folder is done
                    count = 1
    print("Edit Complete")

mode = input("Enter 1 for headless mode or 2 for desktop mode: ")
if mode == '1':
    headless_mode()
elif mode == '2':
    browse_directory()
else:
    print("Invalid input. Please enter 1 or 2.")
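On the slash-consistency point in the edit above: one option (a sketch, not part of the original script) is to run each path through pathlib before writing it into the <path> tag, since PureWindowsPath accepts both separator styles and as_posix() emits one consistent style:

```python
from pathlib import PureWindowsPath

mixed = r"C:\data/drone\imgs/11.jpg"  # hypothetical mixed-separator path
# PureWindowsPath parses both \ and /; as_posix() always emits
# forward slashes, so every written <path> ends up uniform
print(PureWindowsPath(mixed).as_posix())  # C:/data/drone/imgs/11.jpg
```

Applying that normalization to jpg_path just before the xml_root.find('./path').text assignment would keep every annotation consistent.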
It is not hard to write another Python script that reads all the images in the test dir and saves them into a CSV file. Sample code in Python:
import os

import pandas as pd

# suppose test_dir holds all test images
test_dir = "test_dir"
images = []
for path, subdirs, files in os.walk(test_dir):
    for image_name in files:
        images.append(os.path.join(path, image_name))

df = pd.DataFrame({'image name': images})
df.to_csv('your.csv', index=False)
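As for the duplicate-name collisions from the question: rather than renaming 1000 files by hand, the split script could prefix each copied file with its parent directory name, so Drone_img1/11.jpg and Drone_img2/11.jpg no longer clash. A sketch (flatten_unique and the directory names are illustrative, not from the original script):

```python
import os
import shutil

def flatten_unique(src_root, dst_dir):
    # Copy every jpg under src_root into dst_dir, prefixing the
    # parent folder name: Drone_img1/11.jpg -> Drone_img1_11.jpg
    os.makedirs(dst_dir, exist_ok=True)
    for path, _dirs, files in os.walk(src_root):
        for name in files:
            if name.lower().endswith(".jpg"):
                parent = os.path.basename(path)
                shutil.copy(os.path.join(path, name),
                            os.path.join(dst_dir, f"{parent}_{name}"))
```

The XML annotations would then need the same prefix applied to their <path> entries, which a batch-edit script like the one in the question could be adapted to do.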
During one of my projects I faced this challenge: there is a folder named Project, and inside it there are multiple images (say 100), named sequentially: the first image is img_0, the next img_1, and so on up to img_99.
Now, based on some conditions, I need to separate out some images, say img_5, img_10, img_30, img_88, img_61. My question is: is there a way to filter out these images, make a folder inside Project named "the odd ones", and store those specified images there?
One extra bit of help for my case: suppose I have hundreds of such project folders, named sequentially Projects_1, Projects_2, Projects_3, ..., Projects_99, each containing hundreds of pictures. Is it possible to separate out the specified photos and store them inside a separate folder within each Projects_n folder, assuming the photos to separate out are the same for every Projects_n folder?
Please help me with this. Thank you!
For the first problem you can look at the pseudo-code below (you have to supply the target function yourself). For the second problem you should provide more details.
from glob import glob
import itertools
import shutil
import os

# Predicate that decides whether a filename is a
# target file which has to be moved:
def is_target(filename):
    ...  # fill in your own condition here
    return False

dirname = "some/path/to/project"

# Creating a list of all files in dir which
# could be moved, based on their extension:
types = ('*.png', '*.jpeg')
filepaths = list(itertools.chain(*[glob(os.path.join(dirname, t)) for t in types]))

# Finding the files to move:
filepaths_to_move = []
for filepath in filepaths:
    if is_target(os.path.basename(filepath)):
        filepaths_to_move.append(filepath)

# Creating the new subfolder:
new_folder_name = "odd_images"
new_dir = os.path.join(dirname, new_folder_name)
os.makedirs(new_dir, exist_ok=True)

# Moving the files into the subfolder:
for filepath in filepaths_to_move:
    basename = os.path.basename(filepath)
    shutil.move(filepath, os.path.join(new_dir, basename))
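For instance, if the "specified images" are the fixed set from the question (img_5, img_10, img_30, img_61, img_88), is_target could be filled in like this (just one possible condition):

```python
import os

TARGET_STEMS = {"img_5", "img_10", "img_30", "img_61", "img_88"}

def is_target(filename):
    # Compare the name without its extension against the wanted set
    stem, _ext = os.path.splitext(filename)
    return stem in TARGET_STEMS

print(is_target("img_5.png"))   # True
print(is_target("img_6.png"))   # False
```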
Here is the logic; make the necessary improvements for your use case.
import os
import shutil

project_dir = "project_dir"
move_to_dir = os.path.join(project_dir, "move_to_dir")
files = [os.path.join(project_dir, file) for file in os.listdir(project_dir)]
filenames_to_filter = {"test1.txt", "test2.txt"}

if not os.path.exists(move_to_dir):
    os.makedirs(move_to_dir)

for file in files:
    if os.path.basename(file) in filenames_to_filter:
        shutil.move(file, move_to_dir)
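For the second part of the original question (hundreds of Projects_n folders), the same logic can be wrapped in a loop over the project directories; a sketch assuming they all sit under one root and the filter list is shared:

```python
import os
import shutil

def filter_projects(root, filenames_to_filter):
    # Apply the move-to-subfolder logic to every Projects_n folder under root
    for entry in sorted(os.listdir(root)):
        project_dir = os.path.join(root, entry)
        if not (entry.startswith("Projects_") and os.path.isdir(project_dir)):
            continue
        move_to_dir = os.path.join(project_dir, "move_to_dir")
        os.makedirs(move_to_dir, exist_ok=True)
        for f in os.listdir(project_dir):
            src = os.path.join(project_dir, f)
            if os.path.isfile(src) and f in filenames_to_filter:
                shutil.move(src, move_to_dir)
```

Called as filter_projects("some/root", {...}), it creates one move_to_dir per project and moves the matching files into it.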
This code basically already creates the PDFs. After a PDF is created, it is copied into its own folder. What I am trying to do is merge what is in one folder, then go to the next folder and do the merge there, and so on. But when I run it, it merges only the last PDF instead of all the PDFs.
import os
import shutil
import time

from PyPDF2 import PdfFileMerger
from reportlab.pdfgen.canvas import Canvas

path = input("Paste the folder path of which all the PDFs are located to begin the automation.\n")

# Only allowed to use your C or H drive.
while True:
    if "C" in path[0]:
        break
    elif "H" in path[0]:
        break
    else:
        print("Sorry you can only use your C drive or H drive\n")
        path = input("Paste the folder path of which all the PDFs are located to begin the automation.\n")

moving_path = path + "\\Script"
new_path = moving_path + "\\1"
folder_name = {}

# List all directories or files in the specific path
list_dir = ["040844_135208_3_192580_Sample_010.pdf", "040844_135208_3_192580_Sample_020.pdf",
            "040844_135208_3_192580_Sample_050.pdf", "058900_84972_3_192163_Sample_010.pdf",
            "058900_84972_3_192163_Sample_020.pdf", "058900_84972_3_192163_Sample_030.pdf"]

# Pauses the program
def wait(num):
    time.sleep(num)

# Change and make directory
def directory():
    os.chdir(path)
    for i in list_dir:
        canvas = Canvas(i)
        canvas.drawString(72, 72, "Hello, World")
        canvas.save()
    os.makedirs("Script")
    os.chdir(path + "\\Script")
    os.makedirs("1")
    os.makedirs("Merge")
    os.chdir(new_path)

def main():
    match = []
    for i in list_dir:
        search_zero = i.split("_")[2]
        if search_zero != "0":
            match.append(i.split("_", 3)[-1][:6])
        else:
            match.append(i.split("_", 0)[-1][:6])
    new_match = []
    for i, x in enumerate(match):
        if "_" in match[i]:
            new_match.append(x[:-1])
        else:
            new_match.append(x)
    for i in list_dir:
        key = i.split("_", 3)[-1][:6]
        if key in folder_name:
            folder_name[key].append(i)
        else:
            folder_name[key] = [i]
    for i, x in enumerate(list_dir):
        # Skips over the error that states that you can't make duplicate folder name
        try:
            os.makedirs(new_match[i])
        except FileExistsError:
            pass
        # Moves the file that doesn't contain "PDFs" into the "1" folder and the one that does in the "Merge" folder
        if "PDFs" not in list_dir[i]:
            shutil.copy(f"{path}\\{list_dir[i]}", f"{new_path}\\{new_match[i]}")
            os.chdir(f"{new_path}\\{new_match[i]}")
            merger = PdfFileMerger(x)
            merger.append(x)
            merger.write(f"{new_match[i]}.pdf")
            merger.close()
            os.chdir(new_path)
        else:
            shutil.copy(f"{path}\\{list_dir[i]}", f"{moving_path}\\Merge\\{x}")

directory()
wait(0.7)
main()
print("Done!")
wait(2)
I have these 4 PDFs:
pg1.pdf
pg2.pdf
pg3.pdf
pg4.pdf
Here's a starter-script to merge Pg1 and Pg2 into one PDF, and Pg3 and Pg4 into another:
from PyPDF2 import PdfMerger

# Create merger object
merger = PdfMerger()
for pdf in ["pg1.pdf", "pg2.pdf"]:
    merger.append(pdf)
merger.write("merged_1-2.pdf")
merger.close()

# Re-create merger object
merger = PdfMerger()
for pdf in ["pg3.pdf", "pg4.pdf"]:
    merger.append(pdf)
merger.write("merged_3-4.pdf")
merger.close()
Now we extend that idea and wrap up the data so it will drive a loop that does the same thing:
page_sets = [
    # Individual PDFs              , final merged PDF
    [["pg1.pdf", "pg2.pdf"], "merged_1-2.pdf"],
    [["pg3.pdf", "pg4.pdf"], "merged_3-4.pdf"],
]

for pdfs, final_pdf in page_sets:
    merger = PdfMerger()
    for pdf in pdfs:
        merger.append(pdf)
    merger.write(final_pdf)
    merger.close()
I get the following for either the straight-down script, or the loop-y script:
merged_1-2.pdf
merged_3-4.pdf
As best I understand your larger intent, that loop represents you writing groups of PDFs into merged PDFs (in separate directories?), and the structure of:
1. create merger object
2. append to merger object
3. write merger object
4. close merger object
5. back to step 1
works, and as far as I can tell is the way to approach this problem.
As an aside from the issue of getting the merging of the PDFs working... try creating the on-disk folder structure first, then create a data structure like page_sets that represents that on-disk structure, then (finally) pass off the data to the loop to merge. That should also make debugging easier:
1. "Do I have the on-disk folders correct?" Yes? Then move on to
2. "Do I have page_sets correct?" Yes? Then move on to
3. the actual appending/writing.
And if the answer to 1 or 2 is "no", you can inspect your file system or just look at a print-out of page_sets and spot any disconnects. From there, merging the PDFs should be really trivial.
Once that's working correctly, if you want to go back and refactor to try and get folders/data/merge in one pass for each set of PDFs, then go for it, but at least you have a working example to fall back on and start to ask where you missed something if you run into problems.
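A sketch of that "create a data structure like page_sets from the on-disk structure" step, assuming one subfolder per merged PDF with the folder name reused for the output file (adjust to your actual layout):

```python
import os

def build_page_sets(root):
    # One entry per subfolder: (sorted PDFs inside, output PDF path)
    page_sets = []
    for entry in sorted(os.listdir(root)):
        folder = os.path.join(root, entry)
        if not os.path.isdir(folder):
            continue
        pdfs = sorted(
            os.path.join(folder, f)
            for f in os.listdir(folder)
            if f.lower().endswith(".pdf")
        )
        if pdfs:
            page_sets.append((pdfs, os.path.join(root, entry + ".pdf")))
    return page_sets
```

Printing the result of build_page_sets(...) is exactly the "inspect page_sets" debugging step before handing it to the merge loop.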
Whenever you end up with something that only contains the last value from a loop, check your loop logic. In this case, your merger loop looks like this:
for i, x in enumerate(list_dir):
    ...
    if "PDFs" not in list_dir[i]:
        ...
        merger = PdfFileMerger(x)
        merger.append(x)
        merger.write(f"{new_match[i]}.pdf")
        merger.close()
So for each file in list_dir you create a new merger, add the file, and write out the PDF. Unsurprisingly, each PDF file you write contains exactly one input pdf.
Move the merger creation and merger.write out of the innermost loop, so that all of the files to be merged are appended together and written out as a single PDF. Your naming logic is a bit convoluted, but it seems that you want to be looping over the variable folder_name, and merging the corresponding files. So, maybe like this:
for key in folder_name:
    merger = PdfFileMerger()
    for x in folder_name[key]:
        merger.append(x)
    merger.write(key + ".pdf")
    merger.close()
You'll need to add your own path and naming logic; I won't try to guess what you intended.
I have a folder named A, which contains some sub-folders whose names start with the letter A.
These sub-folders contain images in different formats (.png, .jpeg, .gif, .webp) with different names, like item1.png, item1.jpeg, item2.png, item3.png, etc. From these sub-folders I want to get a list of the paths of the images whose names end with 1.
Along with that, I want only one image file format, for example only jpeg. In some sub-folders the image names end with 1.png, 1.jpeg, 1.gif, etc.
I only want one image from every sub-folder that ends with 1 (in any image format). I am sharing the code, which returns image paths of items (ending with 1) for all image formats.
CODE:
Here is code that can solve your problem:
import os

img_paths = []
for top, dirs, files in os.walk("your_path_goes_here"):
    for pics in files:
        if os.path.splitext(pics)[0] == '1':
            img_paths.append(os.path.join(top, pics))
            break
img_paths will hold the list you need: the first image named 1 from each subfolder, in whatever format it happens to be.
In case you want only specific formats:
import os

img_paths = []
for top, dirs, files in os.walk("your_path_goes_here"):
    for pics in files:
        name, ext = os.path.splitext(pics)
        # lower() makes the extension check case-insensitive
        if name == '1' and ext[1:].lower() in ['png', 'jpg', 'tif']:
            img_paths.append(os.path.join(top, pics))
            break
Thanks to S3DEV for making it more optimized.
I am trying to create a separate array on each pass of the for loop, in order to store the values of 'signal' that are generated by the wavfile.read function.
Some background on how the code works / how I'd like it to work:
I have the following file path:
Root directory
    Labeled directory
        Irrelevant multiple directories
            Multiple .wav files stored in these subdirectories
    Labeled directory
        Irrelevant multiple directories
            Multiple .wav files stored in these subdirectories
Now, for each labeled folder, I'd like to create an array that holds the values of all the .wav files contained in its respective subdirectories.
This is what I attempted:
count = 0
for label in df.index:
    for path, directories, files in os.walk('voxceleb1/wav_dev_files/' + label):
        for file in files:
            if file.endswith('.wav'):
                count = count + 1
                rate, signal = wavfile.read(os.path.join(path, file))
print(count)
Above is a snapshot of dataframe df
Ultimately, the reason for these arrays is that I would like to calculate the mean length (in seconds) of the wav files contained in each labeled subdirectory and add this as a column to the dataframe.
Note that the index of the dataframe corresponds to the directory names. I appreciate any and all help!
The code snippet you've posted can be simplified and modernized a bit. Here's what I came up with:
I've got the following directory structure:
I'm using text files instead of wav files in my example, because I don't have any wav files on hand.
In my root, I have A and B (these are supposed to be your "labeled directories"). A has two text files. B has one immediate text file and one subfolder with another text file inside (this is meant to simulate your "irrelevant multiple directories").
The code:
def main():
    from pathlib import Path

    root_path = Path("./root/")
    labeled_directories = [path for path in root_path.iterdir() if path.is_dir()]
    txt_path_lists = []

    # Generate lists of txt paths
    for labeled_directory in labeled_directories:
        txt_path_list = list(labeled_directory.glob("**/*.txt"))
        txt_path_lists.append(txt_path_list)

    # Print the lists of txt paths
    for txt_path_list in txt_path_lists:
        print(txt_path_list)

    return 0


if __name__ == "__main__":
    import sys
    sys.exit(main())
The output:
[WindowsPath('root/A/a_one.txt'), WindowsPath('root/A/a_two.txt')]
[WindowsPath('root/B/b_one.txt'), WindowsPath('root/B/asdasdasd/b_two.txt')]
As you can see, we generated two lists of text file paths, one for each labeled directory. The glob pattern I used (**/*.txt) handles multiple nested directories, and recursively finds all text files. All you have to do is change the extension in the glob pattern to have it find .wav files instead.
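From there, the asker's end goal (mean .wav length per labeled directory) could be sketched like this; it uses the stdlib wave module rather than scipy's wavfile so the example is dependency-free, and the duration of a file is simply its frame count divided by its sample rate:

```python
import wave
from pathlib import Path

def mean_duration_seconds(label_dir):
    # Average length in seconds of every .wav under label_dir,
    # however deeply nested (the same **/*.wav glob idea as above)
    durations = []
    for wav_path in Path(label_dir).glob("**/*.wav"):
        with wave.open(str(wav_path), "rb") as w:
            durations.append(w.getnframes() / w.getframerate())
    return sum(durations) / len(durations) if durations else 0.0
```

Computed once per labeled directory, the results could then be assigned as the new dataframe column the question describes.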