How to create a data frame from multiple files located in different subfolders - python

I have a folder which contains several sub-folders (a, b and c); each sub-folder contains 5 files, and each file contains a single column of data that can be treated as an array. I want to create a data frame that contains the mean, standard deviation and median of each file.
The data frame should contain the following columns: subfolder name, file name, mean, std, median.
I was able to write the following code using defaultdict, and received output shaped like {'b': [array1, array2, array3, array4, array5], 'c': [array1, array2, array3, array4, array5], 'a': [array1, array2, array3, array4, array5]}
import os
from collections import defaultdict

import numpy as np   # needed for np.loadtxt below
import pandas as pd  # needed for pd.DataFrame below

root = "/My data"
# Map labels (subdirectories of root) to data
data_per_label = defaultdict(list)
# Get all top-level directories within `root`
label_dirs = [name for name in os.listdir(root) if os.path.isdir(os.path.join(root, name))]
# print(f"{label_dirs}")
# Loop over each label directory
for label in label_dirs:
    label_dir = os.path.join(root, label)
    # Loop over each filename in the label directory
    for filename in os.listdir(label_dir):
        # Take care to only look at .data files
        if filename.endswith(".data"):
            filepath = os.path.join(label_dir, filename)
            # print(f"{filename}_{label}")
            data = np.loadtxt(filepath)
            data_per_label[label].append(data)
print(data_per_label)
I then used the following code to transform the defaultdict into a dataframe:
df = pd.DataFrame([[k] + j for k,v in data_per_label.items() for j in v], columns=['#', 'Distribution', 'Sample_1', 'Sample_2', 'Sample_3', 'Sample_4', 'Sample_5'])
print(df)
but received an error
UFuncTypeError: ufunc 'add' did not contain a loop with signature matching types (dtype('<U1'), dtype('float64')) -> None
I'd appreciate any insight into what I'm doing wrong.
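The UFuncTypeError comes from [k] + j: j is a NumPy array, so the list concatenation turns into an element-wise NumPy addition of the string k and the float values. One way around it is to build plain Python rows first. A minimal sketch (note the loop above only keeps the arrays, so a hypothetical placeholder stands in for the file name; append (filename, data) tuples in the loop instead to keep the real names):
import numpy as np
import pandas as pd

rows = []
for label, arrays in data_per_label.items():
    for i, arr in enumerate(arrays):
        rows.append({
            'Subfolder': label,
            'File': f'file_{i}',  # hypothetical; store the real filename alongside each array
            'Mean': np.mean(arr),
            'Std': np.std(arr),
            'Median': np.median(arr),
        })
df = pd.DataFrame(rows)
print(df)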

Related

How to extract a specific value from multiple csv of a directory, and append them in a dataframe?

I have a directory with hundreds of csv files that represent the pixels of a thermal camera (288x383), and I want to get the center value of each file (e.g. 144 x 191) and, with each of those values collected, add them to a dataframe that also lists the name of each file.
Here is my code, where I created the dataframe with the list of the csv files:
import os
import glob
import numpy as np
import pandas as pd
os.chdir("/Programming/Proj1/Code/Image_Data")
!ls
Out:
2021-09-13_13-42-16.csv
2021-09-13_13-42-22.csv
2021-09-13_13-42-29.csv
2021-09-13_13-42-35.csv
2021-09-13_13-42-47.csv
2021-09-13_13-42-53.csv
...
file_extension = '.csv'
all_filenames = [i for i in glob.glob(f"*{file_extension}")]
files = glob.glob('*.csv')
all_df = pd.DataFrame(all_filenames, columns=['Full_name'])
all_df.head()
                 Full_name
0  2021-09-13_13-42-16.csv
1  2021-09-13_13-42-22.csv
2  2021-09-13_13-42-29.csv
3  2021-09-13_13-42-35.csv
4  2021-09-13_13-42-47.csv
5  2021-09-13_13-42-53.csv
6  2021-09-13_13-43-00.csv
You can loop through your files one by one, reading them in as a dataframe and taking the center value that you want. Then save this value along with the file name. This list of results can then be read in to a new dataframe ready for you to use.
result = []
for file in files:
    # read in the file, you may need to specify some extra parameters
    # check the pandas docs for read_csv
    df = pd.read_csv(file)
    # now select the value you want
    # this will vary depending on what your indexes look like (if any)
    # and also your column names
    value = df.loc[row, col]
    # append to the list
    result.append((file, value))
# you should now have a list in the format:
# [('2021-09-13_13-42-16.csv', 100), ('2021-09-13_13-42-22.csv', 255), ...
# load the list of tuples as a dataframe for further processing or analysis...
result_df = pd.DataFrame(result)
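As a concrete usage sketch for the selection step, assuming the CSVs are raw 288x383 numeric grids with no header row (the row and col names above are placeholders):
df = pd.read_csv(file, header=None)
# the question's center (144, 191); with pandas' 0-based positions that is iloc[143, 190]
value = df.iloc[143, 190]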

Python for loop over non-numeric folders in hdf5 file

I want to pull the numbers from a .HDF5 data file, where the data sits in folders with increasing numbers:
Folder_001, Folder_002, Folder_003, ... Folder_100.
In each folder, the data I want to pull has the same name: 'Time'. So in order to pull the numbers from each folder, I am trying to loop over the folder names; yet I still can't figure out how to structure the code. I tried the following:
f = h5.File('name.h5'.'r')
folders = list(f.keys())
for i in folders:
    dataset_folder = f['i']
The first problem is a typo: the filename and the mode must be separated by a comma, not a period. Then you can loop over the groups:
import h5py as h5  # assuming h5py is imported under this alias

f = h5.File('name.h5', 'r')  # comma, not a period
groups = f.keys()
adict = {}
for key in groups:
    agroup = f[key]
    ds = agroup['Time']  # a dataset
    arr = ds[:]          # download the dataset to an array
    adict[key] = arr
Now adict should be a dictionary with keys like 'Folder_001', and values being the respective Time array. You could also collect those arrays in a list.
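If the Time arrays all share the same length, a quick hedged sketch of turning adict into a dataframe, one column per folder (this equal-length assumption matters; ragged arrays would need padding or a list instead):
import pandas as pd

time_df = pd.DataFrame(adict)  # assumes every 'Time' array has the same length
print(time_df.head())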

Manipulating the values of each file in a folder using a dictionary and loop

How do I go about manipulating each file of a folder based on values pulled from a dictionary? Basically, say I have x files in a folder. I use pandas to reformat the dataframe, add a column which includes the date of the report, and save the new file under the same name and the date.
import pandas as pd
from pathlib2 import Path
import os

source = Path("Users/Yay/AlotofFiles/April")
items = os.listdir(source)
d_dates = {'0401': '04/1/2019', '0402': '4/2/2019', '0403': '04/03/2019'}
for item in items:
    for key, value in d_dates.items():
        df = pd.read_excel(item, header=None)
        df.set_columns = ['A', 'B', 'C']
        df[df['A'].str.contains("Awesome")]
        df['Date'] = value
        file_basic = "retrofile"
        short_date = key
        xlsx = ".xlsx"
        file_name = file_basic + short_date + xlsx
        df.to_excel(file_name)
I want each file to be unique and categorized by the date. In this case, I would want to have three files, for example "retrofile0401.xlsx" that has a column that contains "04/01/2019" and only has data relevant to the original file.
What actually happens: the code loops over each individual item, creates three different files with those values, moves on to the next file, repeats and overwrites the previous iteration, until I am left with only three files that are copies of the last file. Each file has a different date and a different name, which is what I want, but the data is duplicated from the last file.
If I remove the second loop, it works the way I want it but there's no way of categorizing it based on the value I made in the dictionary.
Try the following. I'm only making input filenames explicit to make clear what's going on. You can continue to use yours from the source.
input_filenames = [
    'retrofile0401_raw.xlsx',
    'retrofile0402_raw.xlsx',
    'retrofile0403_raw.xlsx',]
date_dict = {
    '0401': '04/1/2019',
    '0402': '4/2/2019',
    '0403': '04/03/2019'}
for filename in input_filenames:
    date_key = filename[9:13]
    df = pd.read_excel(filename, header=None)
    df.columns = ['A', 'B', 'C']  # name the columns before filtering on 'A'
    df = df[df['A'].str.contains("Awesome")]  # keep the result of the filter
    df['Date'] = date_dict[date_key]
    df.to_excel('retrofile{date_key}.xlsx'.format(date_key=date_key))
filename[9:13] takes characters #9-12 from the filename. Those are the ones that correspond to your date codes.
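If the filename layout ever varies, a hedged alternative is to pull the first four-digit run with a regular expression instead of fixed positions:
import re

# hypothetical variant: grab the first run of four digits from the name
m = re.search(r"\d{4}", 'retrofile0401_raw.xlsx')
date_key = m.group(0) if m else None  # '0401'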

How to generate a dataframe from .txt files in multiple directories?

I have a directory ".../data" in which have multiple subdirectories whose names are a serial number plus some useless information - e.g. "17448_2017_Jul_2017_Oct", where the first number on it is the serial number. Inside each subdirectory, I have four ".txt" files whose lines/rows have the information of date and time, and an attribute of a certain type, say humidity, all named the same way in each subdirectory - e.g. "2019-01-29 03:11:26 54.7". The first eight lines on each .txt file top should be dropped as well.
What I am trying to program: A code that generates a data frame for each serial number with the subdirectory serial number in the subdirectory name in a column called 'Machine', date/time as the data frame index and each type of an attribute as a column such as atr1, atr2, atr3, and atr4.
My first trial was something like this:
path = "/home/marlon/Shift One/Projeto Philips/Consolidação de Arquivos/dados"
for i in os.listdir(path):
    if os.path.isfile(os.path.join(path, i)) and '17884' in i:
        with open(path + i, 'r') as f:
But, as you can see, I'm completely lost... :/
Thank you so much for your help!
IIUC, you could try something like this (note this is intended as a starting point for testing and feedback, because I can't test it on my mobile at the moment):
import os
import pandas as pd

path = "/home/marlon/Shift One/Projeto Philips/Consolidação de Arquivos/dados/"
df = pd.DataFrame()
for fld in os.listdir(path):
    subfld = path + fld
    if os.path.isdir(subfld):
        aux = pd.DataFrame()
        sn = fld.split('_')[0]  # serial number is the part before the first underscore
        for file in os.listdir(subfld):
            filepath = os.path.join(subfld, file)
            if os.path.isfile(filepath):
                # fixed-width read: first 19 chars are the timestamp, the rest is the value
                new_col = pd.read_fwf(filepath, colspecs=[(0, 19), (20, -1)], skiprows=8, header=None, parse_dates=[0], index_col=0)
                aux = pd.concat([aux, new_col], axis=1)
        aux['Machine'] = sn
        df = df.append(aux)
However, I wonder if your 4 measurement files per folder all have the same index time values, otherwise there will be a problem concatenating them.
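If the four files do line up, a hypothetical follow-up for the atr1..atr4 naming the question asks for, placed inside the outer loop just before aux['Machine'] = sn (note os.listdir order is arbitrary, so iterate sorted(os.listdir(subfld)) if the file-to-attribute mapping matters):
aux.columns = [f"atr{i + 1}" for i in range(len(aux.columns))]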

Find folders within a directory using a list, and copy them to a different directory

I need some help with creating a python script for this problem.
Basically I have an excel sheet with a list of patient medical record numbers:
10000
10001
10002
10003
etc...
And I have a drive with this basic format:
-AllImages
--A
---A1
---A2
----10004
---A3
----10005
----10006
----10007
--B
---B1
----10008
----10009
-----10009_MRI
-----10009_CT
---B2...
And the desired output would be:
-OutputImages
--10000
--10001
--10002
---10002_MRI
---10002_CT
--10003
etc...
They are not always in exact order though. So, these terminal patient folders are what I need to copy to a different directory, but they also contain other folders that also contain the medical record number in the file name as illustrated in patient 10009. I do NOT want to pull those subfolders out separately from the main patient folder, so when I search I want to stop at the highest folder with the patient medical record in the name.
I wrote a script that FINDS the folders and outputs a csv next to each medical record number saying where the image can be found or if it couldn't be found at all. However, I cannot figure out how to get it to copy those folders to a new location. This seems like a super simple operation but I can't figure it out!
Here is the current script I am running; I tried to modify the other script I wrote with some code I found on this site, but it's not working and I don't understand it well enough to know why.
import os
import shutil
import xlrd
import easygui
import numpy as np
import csv

# get the excel sheet
print('Choose patient data sheet')
master_ws = 'TestDemo/TestPatientList.xlsx'
# easygui.fileopenbox()
workbook = xlrd.open_workbook(master_ws)
ws = workbook.sheet_by_name('Sheet1')
num_rows = ws.nrows - 1
# get correct MRN column
col = int(input('Enter the column with patient MRNs (A=0, B=1, etc): '))
# file browser for choosing which directory to index
print('Choose directory for indexing')
RootDir1 = r'TestDemo/TestDirectory'
# easygui.diropenbox()
# choose output folder
print('Create output folder')
TargetFolder = r'Scripts/TestDemo/TestOutputDirectory'
# easygui.diropenbox()
# sorts directory titles into array of strings
folders = [f for f in sorted(os.listdir(RootDir1))]
folders = np.asarray(folders, dtype=str)
# gets worksheet row values and puts into an array of strings
arr = [ws.row(0)]
for i in range(1, num_rows + 1):
    row = np.asarray(ws.row_values(i))
    arr = np.append(arr, [row], axis=0)
# matching between folders and arr, ie. between directory index and master sheet
for y in range(1, len(arr)):
    for root, dirs, files in os.walk((os.path.normpath(RootDir1)), topdown=False):
        for name in dirs:
            if name.find(str(int(float(str(arr[y, col]))))):
                print("Found " + name)
                SourceFolder = os.path.join(root, name)
                shutil.copy(SourceFolder, TargetFolder)  # copies to new folder
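Two things break the copy step above: str.find returns -1 (which is truthy) when there is no match, so the check fires on nearly every folder, and shutil.copy only handles files, not directories. A minimal sketch of the copy logic under the question's layout, using a hypothetical mrns list in place of the xlrd plumbing (shutil.copytree's dirs_exist_ok needs Python 3.8+):
import os
import shutil

RootDir1 = r'TestDemo/TestDirectory'
TargetFolder = r'Scripts/TestDemo/TestOutputDirectory'
mrns = ['10000', '10001', '10002', '10003']  # hypothetical: read from the sheet

for root, dirs, files in os.walk(os.path.normpath(RootDir1), topdown=True):
    # iterate over a copy so we can prune `dirs` while walking
    for name in list(dirs):
        if any(mrn in name for mrn in mrns):
            src = os.path.join(root, name)
            # copytree copies the whole patient folder, subfolders included
            shutil.copytree(src, os.path.join(TargetFolder, name), dirs_exist_ok=True)
            # stop descending into this folder so e.g. 10009_MRI / 10009_CT
            # are not pulled out separately
            dirs.remove(name)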
