I have two image folders for skin cancer, benign and malignant. I want to build a CSV file in Python where the first column is the path of each image and the second column is its label. How can I do that?
Paths of the dataset:
'../input/skin-cancer-malignant-vs-benign/train/benign'
'../input/skin-cancer-malignant-vs-benign/train/malignant'
Check out the glob module:
benign = glob.glob('{path to benign folder}/*.png')
malignant = glob.glob('{path to malignant folder}/*.png')
The * here simply matches every .png file in that folder; of course, change .png to whatever image format you are using.
Then it's just a matter of writing the data:
import glob
benign = glob.glob('../input/skin-cancer-malignant-vs-benign/train/benign/*.png')
malignant = glob.glob('../input/skin-cancer-malignant-vs-benign/train/malignant/*.png')
CSV_FILE_NAME = 'my_file.csv'
with open(CSV_FILE_NAME, 'w') as f:
    for path in benign:
        f.write(path)       # write the path in the first column
        f.write(',')        # separate first and second item by a comma
        f.write('benign')   # write the label in the second column
        f.write('\n')       # start a new line
    for path in malignant:
        f.write(path)
        f.write(',')
        f.write('malignant')
        f.write('\n')
You can definitely write this more succinctly, but this version is a bit more readable.
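For instance, a sketch of a more compact version using the standard csv module (the folder paths are the ones assumed above):

```python
import csv
import glob

# Collect (path, label) rows for both classes; the folder layout
# is the one assumed in the question.
rows = []
for label in ('benign', 'malignant'):
    pattern = f'../input/skin-cancer-malignant-vs-benign/train/{label}/*.png'
    rows += [(p, label) for p in glob.glob(pattern)]

with open('my_file.csv', 'w', newline='') as f:
    csv.writer(f).writerows(rows)
```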
Related
I'm fairly new to deep learning and learning as I go, so sorry if this is very basic. I'm working on a model for detecting invasive coconut rhinoceros beetles that destroy palm trees, using drone photography. The 1080p photos I'm given were taken at 250 ft AGL and were cropped into equal-size smaller images, some containing one or more palm trees and some containing none. I'm using Label Studio to generate the XML files that point to the paths of their jpg counterparts.
My current problem is getting the XML data into a CSV for training and validation in Keras. The cropped images all share the same names, such as:
Drone_img1
11.jpg
12.jpg
13.jpg
…
46.jpg
Drone_img2
11.jpg
12.jpg
13.jpg
…
46.jpg
Drone_img1000
11.jpg
12.jpg
13.jpg
…
46.jpg
I’m using a python script written by a previous student before me that is supposed to split the data for training and validation into different directories and create the csv file and the model. But when I run it, it appears to have a problem with the cropped images having the same naming scheme. My test and validation directories now look like this:
Test dir & validation dir
11.jpg
11(1).jpg
11(2).jpg
12.jpg
13.jpg
13(1).jpg
152.jpg
…
999.jpg
999(1).jpg
1000.jpg
Note: the cropped images all had the same naming scheme but were in separate directories. However, when the script splits them into test & validation groups, it detects duplicate photos and adds a number in parentheses.
My question: Is there a better way to preprocess image data with XML annotations into a CSV without having to change the 1000 image names manually? Keep in mind that the XML annotations also point to their jpg paths, so if I change the jpg names I'd have to change the XML annotations too.
The only thing I can think of is to write a new cropping script that ensures that the names are all different for the next time I get image data, but I would prefer to not go backward with the current data.
Edit:
Update: Looks like I need to make sure the path slashes are consistent.
Here is a picture of the Cropped Img Directories.
This is an image of the training and validation sets that were created
Here is an image of the csv files generated.
Script I created (mostly GPT) to edit XML <path> tags:
import os
import tkinter as tk
from tkinter import filedialog
from xml.etree import ElementTree as ET

def browse_directory():
    root = tk.Tk()
    root.withdraw()
    xml_directory = filedialog.askdirectory(parent=root, title='Choose the directory of the XML files')
    jpg_directory = filedialog.askdirectory(parent=root, title='Choose the directory of the JPG files')
    batch_edit_xml(xml_directory, jpg_directory)

def headless_mode():
    xml_directory = input("Enter the path of the XML folder: ")
    jpg_directory = input("Enter the path of the JPG folder: ")
    batch_edit_xml(xml_directory, jpg_directory)

def batch_edit_xml(xml_directory, jpg_directory):
    count = 1  # initializing count to 1
    for root, dirs, files in os.walk(xml_directory):
        for file in files:
            if file.endswith(".xml"):
                file_path = os.path.join(root, file)  # full path to the XML file
                xml_tree = ET.parse(file_path)        # parse the XML file
                xml_root = xml_tree.getroot()         # get the root of the XML file
                filename = os.path.splitext(file)[0]  # file name without the extension
                jpg_path = os.path.join(jpg_directory, os.path.basename(root), filename + '.jpg')
                xml_root.find('./path').text = jpg_path  # update the <path> element with the jpg path
                xml_tree.write(file_path)                # write the changes back to the XML file
                print(f"{count} of {len(files)}: {file_path}")  # current count and total files
                count += 1
                if count > len(files):  # reset once every file in this folder is done
                    count = 1
    print("Edit Complete")

mode = input("Enter 1 for headless mode or 2 for desktop mode: ")
if mode == '1':
    headless_mode()
elif mode == '2':
    browse_directory()
else:
    print("Invalid input. Please enter 1 or 2.")
It is not hard to write another Python script that reads all the images in the test directory and saves them into a CSV file. Sample code:
import os
import pandas as pd

images = []
# suppose test_dir holds all test images
for path, subdirs, files in os.walk(test_dir):
    for image_name in files:
        images.append(os.path.join(path, image_name))

df = pd.DataFrame({'image name': images})
df.to_csv('your.csv')
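If the label should come from the parent folder name (as in the benign/malignant layout earlier in the thread), a variant of the same walk can emit both columns; build_index and the column names here are just illustrative choices, not part of the original script:

```python
import os
import pandas as pd

def build_index(test_dir):
    # Each row: full image path plus the name of its parent folder,
    # which stands in for the label here.
    records = []
    for path, subdirs, files in os.walk(test_dir):
        for image_name in files:
            records.append({'image name': os.path.join(path, image_name),
                            'label': os.path.basename(path)})
    return pd.DataFrame(records)

# df = build_index('test_dir')
# df.to_csv('your.csv', index=False)
```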
My first post on StackOverflow, so please be nice. In other words, a super beginner to Python.
So I want to read multiple files from a folder, divide the text, and save the output as a new file. I have figured out this part of the code, but it only works on one file at a time. I have tried googling but can't figure out how to apply this code to multiple text files in a folder and save each result as "output" plus a number. Is this something that's doable?
with open("file_path") as fReader:
    corpus = fReader.read()

loc = corpus.find("\n\n")
print(corpus[:loc], file=open("output.txt", "a"))
You could work with a list, like:
from pathlib import Path

source_dir = Path("./")  # path to the directory
files = [x for x in source_dir.iterdir() if x.is_file()]
for i in range(len(files)):
    file = files[i]
    outfile = "output_" + str(i) + file.suffix
    with open(file) as fReader, open(outfile, "w") as fOut:
        corpus = fReader.read()
        loc = corpus.find("\n\n")
        fOut.write(corpus[:loc])
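As a sketch of the same idea using enumerate and pathlib's read/write helpers (split_files is a made-up name; the behaviour mirrors the loop above, keeping everything before the first blank line):

```python
from pathlib import Path

def split_files(source_dir, out_dir):
    # Write everything before the first blank line of each file in
    # source_dir to out_dir/output_<n> with the same suffix.
    source_dir, out_dir = Path(source_dir), Path(out_dir)
    files = sorted(p for p in source_dir.iterdir() if p.is_file())
    for i, file in enumerate(files):
        corpus = file.read_text()
        loc = corpus.find("\n\n")  # note: -1 (no blank line) keeps all but the last char
        (out_dir / f"output_{i}{file.suffix}").write_text(corpus[:loc])
```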
Welcome to the site. Yes, what you are asking is completely doable, and you are on the right track. You will need to do a little research/practice with the os module, which is highly useful when working with files. The two functions you will want to look into are:
os.path.join()
os.listdir()
I would suggest you put two folders next to your Python file, one called data and the other called output to catch the results. Start by seeing if you can list all the files in your data directory, then keep building up the loop. Something like this should list all the files:
# folder file lister/test writer
import os

source_folder_name = 'data'    # the folder to be read, in the SAME directory as this file
output_folder_name = 'output'  # will be used later...

files = os.listdir(source_folder_name)

# get this working first
for f in files:
    print(f)

# make output file names and just write a 1-liner into each file...
for f in files:
    output_filename = f.split('.')[0]  # the part before the period
    output_filename += '_output.csv'
    output_path = os.path.join(output_folder_name, output_filename)
    with open(output_path, 'w') as writer:
        writer.write('some data')
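Once the listing loop works, the splitting logic from the question drops straight in. A sketch, assuming the data/output folder convention above (split_all is a made-up name):

```python
import os

def split_all(source_folder, output_folder):
    # For each file in source_folder, keep the text before the first
    # blank line and write it to output_folder/<name>_output.txt.
    for f in os.listdir(source_folder):
        with open(os.path.join(source_folder, f)) as reader:
            corpus = reader.read()
        loc = corpus.find('\n\n')
        out_name = f.split('.')[0] + '_output.txt'
        with open(os.path.join(output_folder, out_name), 'w') as writer:
            writer.write(corpus[:loc])
```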
I have an unstructured dataset that consists of audio files. How do I iterate through all files in a given directory (including all files in my subfolders), label them according to their filenames, and then store this information in a CSV file?
I am expecting the CSV file to look something like this:
The purpose is: I want to get the filename, create a label the way I want (for all my files), and then save this information in a CSV file.
You can use glob and pandas to_csv() for this task, e.g.:
from os import path
from glob import glob
import pandas as pd

f_filter = ["mp3", "ogg"]  # list of desired file extensions to be matched
m = []  # final match list
for f_path in glob('D:/museu_do_fado/mp3/**', recursive=True):  # walk the directory recursively
    f_name = path.basename(f_path)  # get the filename
    f_ext = f_name.split(".")[-1].lower()  # get the file extension, lowercased for comparison
    if f_ext in f_filter:  # filter files by f_filter
        label = "Your choice"
        # label = f_name[0] + f_ext[-1]  # as per your example: first char of file name, last of extension
        m.append([f_path, f_name, f_ext, label])  # append to match list

df = pd.DataFrame(m, columns=['f_path', 'f_name', 'f_ext', 'label'])  # dataframe from match list
df.to_csv("my_library.csv", index=False)  # create csv from df
Sample csv:
f_path,f_name,f_ext,label
D:\museu_do_fado\mp3\MDF0001_39.mp3,MDF0001_39.mp3,mp3,Your choice
D:\museu_do_fado\mp3\MDF0001_40.mp3,MDF0001_40.mp3,mp3,Your choice
...
Notes:
Pandas supports several export formats, including to_json(), to_pickle(), and the to_csv() used in the example above. It's a great library for many kinds of data analysis/visualization of your library, and I'd definitely advise you to learn pandas if you can.
This answer should give you a starting point, make sure you read the docs if something is off, GL.
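As an aside, the same recursive listing can be done with pathlib alone; list_audio and the extension list below are illustrative, not part of the answer above:

```python
from pathlib import Path

def list_audio(root, extensions=('.mp3', '.ogg')):
    # rglob('*') walks the tree recursively, like glob with recursive=True;
    # suffix comparison is case-insensitive via lower().
    return [p for p in Path(root).rglob('*')
            if p.is_file() and p.suffix.lower() in extensions]
```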
Let's say I have a biiig garden, and I'm a total flower nerd, and I keep a monthly folder of csv files where I keep track of the different kinds of flowers I have and their numbers in individual files. Not every flower blooms every month, so if I were to make a list of all the flower files I have, it might look like this:
['Roses','Lilies','Tulips','Cornflowers','Sunflowers','Hydrangea','Daisies','Dahlias','Primroses','Hibiscus']
etc. (with many more actual files in there), but the folder for March might look like this:
['Tulips','Primroses']
while the folder for June might look like this:
['Roses','Primroses','Daisies','Dahlias','Hibiscus']
Now, I run the same analyses over these files every month, because I want to see how my flowers grew, which different colours I have, etc. But I don't want to have to redo the whole file opening block every month to fit the subset of flower files I have in my specific folder - especially because I have 30+ files.
So, is there a simple, effective way to tell Python "look, this is the pool of file names I would want to load data from, pick what's there in the folder and load it" without having it create any files that aren't there and without having to write 30+ load statements?
I would really appreciate any help!
The easiest way to approach this would be to list the contents of your monthly directory using os.listdir(directory) and check whether the flower names are in your list of acceptable names:
import os

path = '/path/to/the/flower/directory'
flowers = ['Roses','Lilies','Tulips','Cornflowers','Sunflowers','Hydrangea','Daisies','Dahlias','Primroses','Hibiscus']

for file in os.listdir(path):
    if file in flowers:  # if the file name is in `flowers`
        with open(os.path.join(path, file), 'r') as flower_file:
            ...  # do your analysis on the contents
The file name would need to match the string in flowers exactly, though. I would guess it is more likely that the file name is something like hydrangea.csv, so you might want to do some extra filtering, e.g.
flowers = ['roses','lilies','tulips','cornflowers']
for file in os.listdir(path):
    # file has extension .csv and the file name minus the last 4 chars is in `flowers`
    if file.endswith(".csv") and file[0:-4] in flowers:
        with open(os.path.join(path, file), 'r') as flower_file:
            ...  # do your analysis on the contents
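If you'd rather not hard-code the 4-character extension length, os.path.splitext does the same split; matching_files is just an illustrative name:

```python
import os

def matching_files(path, flowers):
    # Yield full paths of .csv files whose stem (name without
    # extension) appears in the flower list.
    for file in os.listdir(path):
        stem, ext = os.path.splitext(file)
        if ext == '.csv' and stem in flowers:
            yield os.path.join(path, file)
```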
If you have your flower folders organised by date (or any other grouping), e.g. like this:
/home/flower_data/
2018-04/
2018-05/
2018-06/
you could do something like this from your top level path directory:
path = '/home/flower_data'

# for every item in the directory
for subf in os.scandir(path):
    # if the item is a directory
    if subf.is_dir():
        # for every file in path/subfolder
        for file in os.listdir(subf.path):
            if file.endswith('.csv') and file[0:-4] in flowers:
                # print out the full path to the file and the file name
                fullname = os.path.join(subf.path, file)
                print('Now looking at ' + fullname)
                with open(fullname, 'r') as flower_file:
                    ...  # analyse away!
I'm trying to come up with a way for the filenames I'm writing to match the filenames I'm reading. The code currently reads the images and does some processing; the output extracts the data from that process into a csv file. I want both filenames to be the same. I've come across fname for matching, but that's for existing files.
So if your input file name is in_file = 'myfile.jpg', do this:
my_outfile = '.'.join(in_file.split('.')[:-1]) + '.csv'
This splits in_file into a list of parts separated by '.', puts them back together minus the last part, and appends '.csv'.
Your my_outfile will be myfile.csv.
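For what it's worth, the standard-library way to split off the extension is os.path.splitext, which gives the same result here:

```python
import os

in_file = 'myfile.jpg'
# splitext returns (stem, extension); keep the stem and add '.csv'
my_outfile = os.path.splitext(in_file)[0] + '.csv'
```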
Well, in Python it's possible to do that, but the original file might be corrupted if we use the exact same file name, i.e. reading BibleKJV.pdf and writing to BibleKJV.pdf will corrupt the first file. Take a look at this script to verify that I'm on the right track (if I'm totally off, disregard my answer):
import os
from PyPDF2 import PdfFileReader, PdfFileWriter

path = "C:/Users/Catrell Washington/Pride"
input_file_name = os.path.join(path, "BibleKJV.pdf")
input_file = PdfFileReader(open(input_file_name, "rb"))
output_PDF = PdfFileWriter()

total_pages = input_file.getNumPages()
for page_num in range(1, total_pages):
    output_PDF.addPage(input_file.getPage(page_num))

output_file_name = os.path.join(path, "BibleKJV.pdf")
output_file = open(output_file_name, "wb")
output_PDF.write(output_file)
output_file.close()
When I ran the above script, I lost all the data from the original "BibleKJV.pdf", which suggests that when the input and output share the exact same name and extension (.pdf, .cs, .docx, etc.), the original data gets corrupted.
If this doesn't help, please edit your question with a script of what you're trying to achieve.
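For completeness, one common way to reuse the same file name without risking the original is to write to a temporary file first and swap it in only after the write succeeds. A sketch (safe_overwrite is a made-up helper; the PDF handling itself is omitted):

```python
import os
import tempfile

def safe_overwrite(path, data):
    # Write to a temp file in the same directory, then atomically
    # replace the original, so a failed write never corrupts it.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, 'wb') as tmp:
            tmp.write(data)
        os.replace(tmp_path, path)
    except BaseException:
        os.remove(tmp_path)
        raise
```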