Right now I have an image dataset which is very coarse, but the "content" and "label" are separated into different files/folders (the image dataset has files like 00001, while a separate csv file has rows like 00001,class a etc.). If I want to use the Keras image loading function, the dataset should have the structure shown below, so that I can split it into 'X' and 'y'. I tried to "combine" content and label: I found a function based on the shutil module which can conditionally move files to different folders, but due to some compatibility issue I cannot install the shutil module (I tried updating Python). Can you guys point me in some direction? Thanks!
training_data/
...class_a/
......a_image_1.jpg
......a_image_2.jpg
...class_b/
......b_image_1.jpg
......b_image_2.jpg
import os
import shutil

def match_label(source, dest):
    files = os.listdir(source)
    for file in files:
        num = int(file.split('.')[0])
        if num in label:  # 'label' is a collection of image numbers defined elsewhere
            shutil.move(os.path.join(source, file), os.path.join(dest, file))

match_label(train_dir, homogeneous_dir)
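For what it's worth, shutil is part of the Python standard library, so it never needs to be installed separately; import shutil works on any Python version. A minimal sketch of building the Keras-style layout from such a CSV (the exact row format `00001.jpg,class_a`, with filename first, label second, and no header row, is an assumption; adjust the parsing to your actual file):

```python
import csv
import os
import shutil

def build_class_folders(csv_path, source_dir, target_dir):
    """Move each image into target_dir/<label>/ based on the CSV rows.

    Assumes each row looks like: 00001.jpg,class_a (an assumption;
    adjust to your actual CSV layout).
    """
    with open(csv_path, newline="") as f:
        for filename, label in csv.reader(f):
            class_dir = os.path.join(target_dir, label)
            os.makedirs(class_dir, exist_ok=True)
            src = os.path.join(source_dir, filename)
            if os.path.isfile(src):
                shutil.move(src, os.path.join(class_dir, filename))
```

After this runs, the target directory has one subfolder per class, which is exactly the layout the Keras directory-loading utilities expect.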
This is the link from which I want the csv files: http://archive.ics.uci.edu/ml/datasets/selfBACK
My approach right now is to download it locally by simply clicking it. But this folder contains a lot of different folders with many CSVs in them. How do I import them in an efficient manner?
I know how to do it one by one, but I feel there has to be a more efficient way.
You can first read all paths in that folder and filter for csv files (or add other filters, e.g. for specific file names). After that, combine the files; here I use pandas, assuming the data is tabular and structured in the same way.
import os
import pandas as pd
path = 'your_folder_path'
dfs = [pd.read_csv(os.path.join(path, f)) for f in os.listdir(path) if f.endswith('.csv')]
# combine them (if they have the same format) like this:
df = pd.concat(dfs)
Note: you could also make a dictionary instead (key=filename, value=dataframe) and then access the data by using the filename.
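A sketch of that dictionary variant, assuming the same folder of identically structured csv files:

```python
import os
import pandas as pd

def read_csv_folder(path):
    """Return {filename: dataframe} for every .csv file in path."""
    return {f: pd.read_csv(os.path.join(path, f))
            for f in os.listdir(path) if f.endswith('.csv')}

# dfs = read_csv_folder('your_folder_path')
# dfs['some_file.csv'] then gives that particular file's data
```

This keeps the data from each file separate, which is handy when the files do not share the same columns and pd.concat would not apply.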
I have .pdf files in a folder and a .xls file with two columns. The first column contains the filenames without the .pdf extension, and the second column contains a value.
I need to open the .xls file, match the values in the first column against the filenames in the folder, and rename each .pdf file to the value in the second column.
Is it possible?
Thank you for your support
Angelo
You'll want to use the pandas library in Python. It has a function called pandas.read_excel that is very useful for reading Excel files. It returns a dataframe, which lets you use iloc or other accessors to get at the values in the first and second columns. From there, I'd recommend using os.rename(old_name, new_name), where old_name and new_name are the paths where your .pdf files are kept. A full example of the renaming part looks like this:
import os
# Absolute path of a file
old_name = r"E:\demos\files\reports\details.txt"
new_name = r"E:\demos\files\reports\new_details.txt"
# Renaming the file
os.rename(old_name, new_name)
I've purposely left out a full explanation because you simply asked whether your task is possible, so hopefully this points you in the right direction! I'd recommend asking questions with specific reproducible code in the future, in accordance with Stack Overflow guidelines.
I would encourage you to do this with a .csv file instead of a .xls, as it is a much easier format (it requires no formatting of borders, colors, etc.).
You can use the os.listdir() function to list all files and folders in a certain directory; check the docs for the built-in os library. Then grab the string name of each file, strip the .pdf extension, read your .csv file with the names and values, and then rename the file.
All the utilities needed are built into Python. Most are in the os lib; the others come from the csv lib and normal file opening:
with open(filename) as f:
    # anything you have to do with the file goes here
    # you may need to specify the mode in which the file is opened
    ...
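Following the .csv suggestion above, the standard csv module is enough for the matching-and-renaming step. A sketch (the two-column, no-header layout is an assumption; adjust to your actual file):

```python
import csv
import os

def rename_from_csv(csv_path, folder):
    """Rename folder/<old>.pdf to folder/<new>.pdf for each csv row.

    Assumes each row is: old_name,new_value (no header, no .pdf extension
    in the first column) -- an assumption about the csv layout.
    """
    with open(csv_path, newline="") as f:
        for old_name, new_value in csv.reader(f):
            old_path = os.path.join(folder, old_name + ".pdf")
            if os.path.isfile(old_path):
                os.rename(old_path, os.path.join(folder, new_value + ".pdf"))
```

The os.path.isfile check skips csv rows whose file is missing, so one bad row does not abort the whole run.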
I have a folder named with a certain acronym, and inside this folder you can find a certain number of Excel files.
The folder's name indicates the name of the apartment (for ex. UDC06_45) and, inside this folder, each Excel file's name is composed of the name of the apartment followed by the name of the appliance located in that apartment (for ex. UDC06_45_Oven).
These Excel files are very simple DataFrames containing energy consumption measurements: one column named "timestamps" and one column named "Energy" (all measurements at a 15 min frequency). All the Excel files inside the folder have the same identical structure.
My Python code takes only one of these Excel files as input at a time and performs a few operations on it (resampling, time interpolation, etc.), starting with pd.read_excel(), and creates an output Excel file with df.to_excel() after giving it a name.
What I want to do is to apply my code automatically to all of the files in that folder.
The code should take as input only the name of the folder ("UDC06_45") and create as many output files as needed.
So if the folder contains only two appliances:
"UDC06_45_Oven"
"UDC06_45_Fridge"
the code will process them both, one after the other, and I should obtain two distinct Excel files as output. Their names are just the input file's name followed by "_output":
"UDC06_45_Oven_output"
"UDC06_45_Fridge_output".
In general, this must be done for every Excel file contained in that folder. If the folder contains 5 appliances, meaning 5 input Excel files, I should obtain 5 output Excel files... and so on.
How can I do it?
In the following code, just assign your path; in my case I used a test folder, path=r'D:\test'. The code will automatically create a new output folder in the same path.
import os
import pandas as pd
from glob import glob

path = r'D:\test'  # replace with your own path
input_folder = 'UDC06_45'  # name of input folder
output_folder = input_folder + '_out'
new_path = os.path.join(path, output_folder)
if not os.path.exists(new_path):
    os.makedirs(new_path)

files = glob(os.path.join(path, input_folder, '*.xlsx'))
for file in files:
    name = os.path.splitext(os.path.basename(file))[0]  # filename without extension
    df = pd.read_excel(file)
    # do all your operations here
    df.to_excel(os.path.join(new_path, name + '_output.xlsx'))
I am working with the ChestX-ray14 dataset. The data contains about 112,200 images grouped in 12 folders (i.e. images1 to images12). The image labels are in a csv file called Data_Entry_2017.csv. I want to split the images based on the csv labels (attribute "Finding Labels") into their respective train and test folders.
Can anyone help me with Python or Jupyter notebook split code? I will be grateful.
import pandas as pd

df = pd.read_csv("Data_Entry_2017.csv")
infiltration_df = df[df["Finding Labels"] == "Infiltration"]
list_infiltration = infiltration_df["Image Index"].tolist()  # list of image filenames
Then you can go through each folder, check whether an image name is in the list of infiltration labels, and move matching images to different folders.
To read all image filenames in a folder, you can use os.listdir
from os import listdir
from os.path import isfile, join
imagefiles = [f for f in listdir(image_folder_name) if isfile(join(image_folder_name, f))]
For train test split you can refer here
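A sketch of the split-and-copy step, using a plain random shuffle from the standard library rather than scikit-learn (the folder arguments are placeholders you would fill in):

```python
import os
import random
import shutil

def split_and_copy(filenames, source_dir, train_dir, test_dir,
                   test_fraction=0.2, seed=0):
    """Randomly split filenames and copy them into train/test folders."""
    files = sorted(filenames)
    random.Random(seed).shuffle(files)  # seeded so the split is reproducible
    n_test = int(len(files) * test_fraction)
    test_files, train_files = files[:n_test], files[n_test:]
    for target, names in ((train_dir, train_files), (test_dir, test_files)):
        os.makedirs(target, exist_ok=True)
        for name in names:
            shutil.copy(os.path.join(source_dir, name), target)
    return train_files, test_files
```

Copying (rather than moving) leaves the original 12 image folders untouched, which is safer while experimenting with different splits.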
Currently I am using the Python Pillow library to edit images. Since I am dealing with large datasets and only need to edit images with specific name endings (say, names that end with "cropped") or of a specific file type (like png or bmp), is there a way to write code that lets me open and edit only those images? If so, please give me hints or suggestions! Thanks!
Also, the Pillow version is 5.0.0 and the Python version is 3.6.
If your question is only whether there is a way to write code that edits only image files of a specific type or with specific name endings, then the answer is YES: you can do it with Python.
A Sample Code:
import os
from PIL import Image  # Pillow

directory = "images_folder"
for filename in os.listdir(directory):
    if filename.endswith((".png", ".bmp")) or "cropped" in filename:
        # Do the editing using Pillow, e.g.:
        # img = Image.open(os.path.join(directory, filename))
        pass  # placeholder for the editing code
Certainly you can do this in python, but the specific way of doing this obviously depends on the specifics of the problem. Are all your images stored in one directory or many? Will you be running the script from the same directory as the images or from some other directory? Etc.
To get you started, take a look at the os module here.
In this module, there is a listdir method that returns a list of all files inside a directory. You can iterate through that list and find all the filenames that end with a specific set of characters by using the built-in endswith method on strings. For example:
import os
fileslist = [f for f in os.listdir(path) if f.endswith('.jpg')]
Now that you have a list of all the files in a directory that end with certain characters, you can use Pillow to open the images from that list.
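The same filtering can also be done with the glob module, which matches name patterns directly instead of testing each directory entry (the directory name below is a placeholder):

```python
import glob
import os

def find_images(path):
    """Collect .png files plus any file whose name contains 'cropped'."""
    matches = glob.glob(os.path.join(path, "*.png"))
    matches += glob.glob(os.path.join(path, "*cropped*"))
    return sorted(set(matches))  # set() removes files matched by both patterns

# for image_path in find_images("images_folder"):
#     ...open and edit the image with Pillow...
```

glob returns full paths (directory included), so the results can be passed straight to Image.open without an extra os.path.join.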