I have a dataset of images and a corresponding CSV file (loaded into a dataframe) containing the names of these images and other information about them. There are about 7000 images in total, but after pre-processing the dataframe only 3000 image names are left in it. Now I want to load only those images whose names appear in the dataframe.
The image names in the dataframe look like this:
| images |
1_IM-0001-4001.dcm.png
2_IM-0001-4001.dcm.png
3_IM-0001-4001.dcm.png
but the full (absolute) paths of these images, including the directory, look like this:
/content/ChestXR/images/images_normalized/1004_IM-0005-1001.dcm.png
Now I want to run a loop that reads only the images listed in the dataframe column. For this I need the absolute directory path plus the image names from the dataframe column:
for images in os.listdir(path):
    if (images.endswith(".png") or images.endswith(".jpg") or images.endswith(".jpeg")):
        image_path = path + {df["images"]}
where the image directory path is
path = "/content/drive/MyDrive/IU-Xray/images/images_normalized"
and the respective dataframe column is
df["images"]
but the line below does not work in my loop and raises the error "TypeError: unhashable type: 'Series'":
image_path = path + {df["images"]}
This may not be the fullest answer, but I think it will get you close:
import os

path = "/content/drive/MyDrive/IU-Xray/images/images_normalized"
filestodownload = []
# Iterate over the names in the dataframe column, not over the directory listing
for images in df["images"]:
    if images.endswith((".png", ".jpg", ".jpeg")):
        filestodownload.append(os.path.join(path, images))
Then you'll have a list with the full paths of the images you need to load.
Iterating over df["images"] directly works, because looping over a pandas Series yields its values. You can also turn that column into a list first with df["images"].tolist() if that's easier.
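The list-based approach above can be sketched as a small helper that also reports names with no matching file on disk. This is only a sketch: the function name build_image_paths and the found/missing split are my own additions, and `names` is assumed to be something like df["images"].tolist().

```python
import os

IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg")


def build_image_paths(names, directory):
    """Return (found, missing): full paths that exist on disk, and names that do not."""
    found, missing = [], []
    for name in names:
        if not name.lower().endswith(IMAGE_EXTENSIONS):
            continue  # skip anything that is not an image file name
        full_path = os.path.join(directory, name)
        if os.path.isfile(full_path):
            found.append(full_path)
        else:
            missing.append(name)
    return found, missing
```

Checking `missing` before training is a cheap way to notice a wrong directory or a typo in the CSV.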
As you are using pandas, you can do something like this:
path = "/content/drive/MyDrive/IU-Xray/images/images_normalized/"
mask = df['images'].str.contains(r'\.(?:png|jpg|jpeg)$')
full_path = path + df.images[mask]
print(full_path[1])
# /content/drive/MyDrive/IU-Xray/images/images_normalized/2_IM-0001-4001.dcm.png
There are many folders (one per patient), and each folder contains many image files. There is a CSV file that contains the folder names and their corresponding labels.
I want Python to apply each folder's label to all files in that folder and load them (file names start with the parent folder's name and differ only in the part inside the [ ] brackets). But I don't know how to assign the folder label to all files in it as machine-learning input.
I'm using this code, but the [*/*] part does not seem to be correct.
data = pd.read_csv("COAD_CMS_label.csv")
training_data, testing_data = train_test_split(data, test_size=0.25, random_state=25)
y = data['CMS_Subtype']
#add all the training images, store them in a list, and finally convert that list into a numpy array
train_image = []
for i in tqdm(range(data.shape[0])):
    img = image.load_img('tiles/'+training_data['folder_name'][i]+ [*/*] +'.jpg', target_size=(256,256,3),
                         grayscale=False)
    img = image.img_to_array(img)
    img = img/255
    train_image.append(img)
X = np.array(train_image)
CSV File head:
folder_name,CMS_Subtype
TCGA-A6-2683-01Z-00-DX1.0dfc5d0a-68f4-45e1-a879-0428313c6dbc,CMS2
TCGA-F4-6459-01Z-00-DX1.80a78213-1137-4521-9d60-ac64813dec4c,CMS4
TCGA-A6-6653-01Z-00-DX1.e130666d-2681-4382-9e7a-4a4d27cb77a4,CMS1
File name example in its folder:
filelist = []
for root, dirs, files in os.walk(path):
    for file in files:
        #append the file name to the list
        filelist.append(os.path.join(root,file))
tiles\TCGA-3L-AA1B-01Z-00-DX1.8923A151-A690-40B7-9E5A-FCBEDFC2394F\TCGA-3L-AA1B-01Z-00-DX1.8923A151-A690-40B7-9E5A-FCBEDFC2394F [d=1.97863,x=30166,y=17368,w=1013,h=1013].jpg
os.walk should solve the problem for you.
In Python you can use "/" as the separator between folders and files on every platform. A backslash also works on Windows, but inside a string literal it has to be escaped ("\\") or written in a raw string.
You can use os.path.join() to join folder and file names, or you can use the pathlib library for this.
If you pass the root path to os.walk(), it automatically visits every subfolder and yields the files in each one.
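To make the idea concrete, here is a minimal sketch of giving every file the label of its folder. It assumes the CSV has the folder_name and CMS_Subtype columns shown in the question and that the tiles are .jpg files directly inside each folder. The function name collect_labeled_tiles is made up for illustration.

```python
import csv
from pathlib import Path


def collect_labeled_tiles(csv_path, tiles_root):
    """Return a list of (image_path, label) pairs, one per .jpg tile."""
    pairs = []
    with open(csv_path, newline="") as handle:
        for row in csv.DictReader(handle):
            folder = Path(tiles_root) / row["folder_name"]
            if not folder.is_dir():
                continue  # folder listed in the CSV but not present on disk
            # Every tile inside the folder inherits the folder's label.
            for image_path in sorted(folder.glob("*.jpg")):
                pairs.append((image_path, row["CMS_Subtype"]))
    return pairs
```

Because the pattern is just "*.jpg", the square brackets in the tile names never have to be spelled out. The paths can then be loaded one by one and the labels used as y, and the train/test split is probably best done on folders rather than on individual tiles so that one patient does not end up in both sets.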
I have a folder named A, which contains some sub-folders whose names start with the letter A.
These sub-folders contain images in different formats (.png, .jpeg, .giff and .webp) with different names,
like item1.png, item1.jpeg, item2.png, item3.png, etc. From these sub-folders I want a list of paths to the images whose names end with 1.
In addition, I want only one image format, for example only jpeg. In some sub-folders the image names end with 1.png, 1.jpeg, 1.giff, and so on.
I want only one image from every sub-folder whose name ends with 1.(any image format). I am sharing the code, which returns the paths of the items ending with 1 for all image formats.
CODE:
Here is code that can solve your problem.
import os
img_paths = []
for top, dirs, files in os.walk("your_path_goes_here"):
    for pics in files:
        # check the name without its extension, e.g. "item1" for "item1.png"
        if os.path.splitext(pics)[0].endswith('1'):
            img_paths.append(os.path.join(top, pics))
            break  # keep only the first match in each sub-folder
img_paths will then hold the list you need:
the first image in each sub-folder whose name ends with 1, in whatever format it has.
In case you want only specific formats:
import os
img_paths = []
for top, dirs, files in os.walk("your_path_goes_here"):
    for pics in files:
        name, ext = os.path.splitext(pics)
        # ext includes the leading dot, so strip it before comparing
        if name.endswith('1') and ext[1:] in ['png', 'jpg', 'tif']:
            img_paths.append(os.path.join(top, pics))
            break  # keep only the first match in each sub-folder
Thanks to S3DEV for helping make it more optimized.
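If the format matters (for example jpeg first, then anything else), a variation on the loops above is to collect the candidates per sub-folder and pick by a preference order. This is only a sketch. The function name pick_one_per_folder and the default preference tuple are assumptions of mine, not something from the question.

```python
import os


def pick_one_per_folder(root, preferred=(".jpeg", ".jpg", ".png", ".webp")):
    """For each folder under root, return one file whose stem ends with '1'."""
    picked = []
    for top, dirs, files in os.walk(root):
        candidates = {}
        for name in sorted(files):
            stem, ext = os.path.splitext(name)
            ext = ext.lower()
            if stem.endswith("1") and ext in preferred:
                # remember the first file seen for each extension
                candidates.setdefault(ext, os.path.join(top, name))
        # take the candidate with the most preferred extension, if any
        for ext in preferred:
            if ext in candidates:
                picked.append(candidates[ext])
                break
    return picked
```

Folders with no matching file are simply skipped, so the result has at most one path per sub-folder.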
I have the following problem.
I have folder structure like this:
vol1/
    chap1/
        01.jpg
        02.JPG
        03.JPG
    chap2/
        04.JPG
        05.jpg
        06.jpg
    chap3/
        07.JPG
        08.jpg
        09.JPG
vol2/
    chap4/
        01.JPG
        02.jpg
        03.jpg
    chap5/
        04.jpg
        05.JPG
        06.jpg
    chap6/
        07.jpg
        08.JPG
        09.jpg
Inside a single vol folder, the chapters have an increasing order, and the same happens for the jpg files inside each chap folder.
Now I would like to obtain one PDF for each vol folder, maintaining the ordering of the pictures. Think of it as a comic or manga volume that was split up and needs to be put back into a single file.
How could I do it in bash or python?
I do not know how many volumes I have, how many chapters are in a single volume, or how many jpg files are in a single chapter. In other words, I need it to work for any number of volumes/chapters/jpgs.
An addition would be considering heterogeneous picture files, maybe having both jpg and png in a single chapter, but that's a plus.
I guess this should work as intended! Tell me if you encounter any issues.
import os
from PIL import Image

def merge_into_pdf(paths, name):
    # Open every image and convert it to RGB so it can be written to a PDF
    list_image = [Image.open(p).convert("RGB") for p in paths]
    # Nothing to merge
    if len(list_image) == 0:
        return
    # The first image starts the PDF; the remaining ones are appended in order
    img1 = list_image.pop(0)
    img1.save(f"{name}.pdf", save_all=True, append_images=list_image)

def main():
    # os.listdir returns names in arbitrary order, so sort at every level
    for i in sorted(os.listdir(".")):
        # only process directories whose name starts with 'vol'
        if i.startswith("vol") and os.path.isdir(i):
            path = []
            # for each chapter subdirectory
            for j in sorted(os.listdir(i)):
                for k in sorted(os.listdir(f"{i}/{j}")):
                    # keep jpg and png files, whatever the case of the extension
                    if k.lower().endswith((".jpg", ".png")):
                        path.append(f"{i}/{j}/{k}")
            # merge the collected images into one pdf per volume
            merge_into_pdf(path, i)

if __name__ == "__main__":
    main()
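One thing to watch with the ordering: a plain string sort puts chap10 before chap2, so if chapter or file numbers are not zero-padded the pages can end up out of order. A possible fix is a "natural" sort key, sketched below. The helper name natural_key is my own, and it would be passed as the key= argument wherever the names are sorted.

```python
import re


def natural_key(name):
    """Split a name into text and integer chunks so that 'chap10' sorts after 'chap2'."""
    return [int(chunk) if chunk.isdigit() else chunk.lower()
            for chunk in re.split(r"(\d+)", name)]


chapters = ["chap10", "chap2", "chap1"]
print(sorted(chapters, key=natural_key))
```

With zero-padded names like 01.jpg ... 09.jpg, as in the question, a plain sort already gives the right order, so this only matters for names such as chap10.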
I am working with the Chest X-Ray14 dataset. The data contains about 112,200 images grouped in 12 folders (i.e. images1 to images12). The image labels are in a CSV file called Data_Entry_2017.csv. I want to split the images, based on the CSV labels (attribute "Finding Labels"), into their respective train and test folders.
Can anyone help me with Python or Jupyter notebook code for the split? I will be grateful.
import pandas as pd
df = pd.read_csv("Data_Entry_2017.csv")
infiltration_df = df[df["Finding Labels"]=="Infiltration"]
# This is a list of image names only if the dataframe is indexed by image name; otherwise select the image-name column instead
list_infiltration = infiltration_df.index.values.tolist()
Then you can go through each folder, check whether each image name is in the list of infiltration images, and put the matching images in a separate folder.
To read all image file names in a folder, you can use os.listdir:
from os import listdir
from os.path import isfile, join
imagefiles = [f for f in listdir(image_folder_name) if isfile(join(image_folder_name, f))]
For train test split you can refer here
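As a rough sketch of the copying step, something like the following could be run once per label, with the image paths for that label collected as described above. The function name split_into_folders, the 80/20 default and the train/test folder names are all assumptions for illustration. It copies rather than moves, so the original folders stay intact.

```python
import os
import random
import shutil


def split_into_folders(image_paths, out_dir, test_fraction=0.2, seed=0):
    """Copy images into out_dir/train and out_dir/test; return both lists of file names."""
    paths = sorted(image_paths)
    random.Random(seed).shuffle(paths)  # fixed seed keeps the split reproducible
    n_test = int(len(paths) * test_fraction)
    splits = {"test": paths[:n_test], "train": paths[n_test:]}
    for split_name, members in splits.items():
        target = os.path.join(out_dir, split_name)
        os.makedirs(target, exist_ok=True)
        for src in members:
            shutil.copy(src, target)
    return ([os.path.basename(p) for p in splits["train"]],
            [os.path.basename(p) for p in splits["test"]])
```

For example, calling it with out_dir set to a per-label folder such as "split/Infiltration" would give split/Infiltration/train and split/Infiltration/test.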
I am trying to create a separate array for each pass of the for loop in order to store the values of 'signal' which are generated by the wavefile.read function.
Some background on how the code works / how I'd like it to work:
I have the following directory structure:
Root directory
    Labeled directory
        Irrelevant multiple directories
            Multiple .wav files stored in these subdirectories
    Labeled directory
        Irrelevant multiple directories
            Multiple .wav files stored in these subdirectories
Now, for each labeled folder, I'd like to create an array that holds the values of all the .wav files contained in its subdirectories.
This is what I attempted:
for label in df.index:
    for path, directories, files in os.walk('voxceleb1/wav_dev_files/' + label):
        for file in files:
            if file.endswith('.wav'):
                count = count + 1
                rate,signal = wavfile.read(os.path.join(path, file))
print(count)
Above is a snapshot of dataframe df
Ultimately, the reason for these arrays is that I would like to calculate the mean duration of the wav files contained in each labeled subdirectory and add this as a column to the dataframe.
Note that the index of the dataframe corresponds to the directory names. I appreciate any and all help!
The code snippet you've posted can be simplified and modernized a bit. Here's what I came up with:
I've got the following directory structure:
I'm using text files instead of wav files in my example, because I don't have any wav files on hand.
In my root, I have A and B (these are supposed to be your "labeled directories"). A has two text files. B has one immediate text file and one subfolder with another text file inside (this is meant to simulate your "irrelevant multiple directories").
The code:
def main():

    from pathlib import Path

    root_path = Path("./root/")
    labeled_directories = [path for path in root_path.iterdir() if path.is_dir()]
    txt_path_lists = []

    # Generate lists of txt paths
    for labeled_directory in labeled_directories:
        txt_path_list = list(labeled_directory.glob("**/*.txt"))
        txt_path_lists.append(txt_path_list)

    # Print the lists of txt paths
    for txt_path_list in txt_path_lists:
        print(txt_path_list)

    return 0

if __name__ == "__main__":
    import sys
    sys.exit(main())
The output:
[WindowsPath('root/A/a_one.txt'), WindowsPath('root/A/a_two.txt')]
[WindowsPath('root/B/b_one.txt'), WindowsPath('root/B/asdasdasd/b_two.txt')]
As you can see, we generated two lists of text file paths, one for each labeled directory. The glob pattern I used (**/*.txt) handles multiple nested directories, and recursively finds all text files. All you have to do is change the extension in the glob pattern to have it find .wav files instead.
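To get from the lists of paths to the mean duration per labeled directory, a sketch along these lines might work. Note that it swaps in the standard-library wave module instead of scipy's wavfile.read, which assumes the files are plain PCM .wav files. With wavfile.read the duration would be len(signal) / rate instead. The function names are my own.

```python
import wave
from pathlib import Path


def wav_duration_seconds(wav_path):
    """Duration of one PCM .wav file: number of frames divided by the sample rate."""
    with wave.open(str(wav_path), "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def mean_duration_per_label(root):
    """Map each labeled directory name under root to the mean duration of its .wav files."""
    means = {}
    for labeled_directory in sorted(Path(root).iterdir()):
        if not labeled_directory.is_dir():
            continue
        # "**/*.wav" also reaches files nested in the irrelevant subdirectories
        durations = [wav_duration_seconds(p) for p in labeled_directory.glob("**/*.wav")]
        if durations:
            means[labeled_directory.name] = sum(durations) / len(durations)
    return means
```

Since the dataframe index corresponds to the directory names, the resulting dictionary could then be turned into a column with something like df["mean_duration"] = df.index.map(means).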