Python for loop over non-numeric folders in HDF5 file

I want to pull the numbers from a .HDF5 data file, in which the data live in folders with increasing numbers:
Folder_001, Folder_002, Folder_003, ... Folder_100.
In each folder, the dataset I want to pull has the same name: 'Time'. So in order to pull the numbers from each folder, I am trying to use a for loop over the folder names; yet I still can't figure out how to structure the code. I did the following:
f = h5.File('name.h5'.'r')
folders = list(f.keys())
for i in folders:
    dataset_folder = f['i']

import h5py as h5

f = h5.File('name.h5', 'r')  # comma between the filename and the mode, not a period
groups = f.keys()
adict = {}
for key in groups:
    agroup = f[key]
    ds = agroup['Time']  # a dataset
    arr = ds[:]          # read the dataset into a NumPy array
    adict[key] = arr
Now adict should be a dictionary with keys like 'Folder_001', and values being the respective Time array. You could also collect those arrays in a list.
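A minimal sketch of that list variant (assuming the same 'name.h5' file and 'Time' datasets as above; sorting the keys works here because the group names are zero-padded):
import h5py as h5

# Collect the Time arrays in numeric folder order.
with h5.File('name.h5', 'r') as f:
    time_arrays = [f[key]['Time'][:] for key in sorted(f.keys())]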

Related

How to import multiple excel files and manipulate them individually

I have to analyze 13 different Excel files and I want to read them all into Jupyter at once, instead of reading them all individually. I also want to be able to access the contents individually. So far I have this:
import glob

import pandas as pd

path = r"C:\Users\giova\PycharmProjects\DAEB_prijzen\data"
filenames = glob.glob(path + "\*.xlsx")
df_list = []
for file in filenames:
    df = pd.read_excel(file, usecols=['Onderhoudsordernr.', 'Oorspronkelijk aantal', 'Bedrag (LV)'])
    print(file)
    print(df)
    df_list.append(df)
When I'm running the code it seems to produce one big list, with some data missing, which I don't want. Can anyone help? :(
This seems like a problem that can be solved with a for loop and a dictionary.
Read the path location of your files:
import os

import pandas as pd

path = 'C:/your path'
paths = os.listdir(path)
Initialize an empty dictionary:
my_files = {}
for i, p in enumerate(paths):
    # os.listdir returns bare filenames, so join them back onto the folder path
    my_files[i] = pd.read_excel(os.path.join(path, p))
Then you can access your files individually simply by calling the key in the dictionary:
my_files[i]
where i = 0, 1, ..., 12 (enumerate counts from 0).
Alternatively, if you want to assign a name to each file, you can either create a list of names or derive them from the file paths through some slice/regex function on the strings.
Assuming the first case:
names = ['excel1', ...]
for name, p in zip(names, paths):
    my_files[name] = pd.read_excel(os.path.join(path, p))
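For the second case, a minimal sketch that derives each key from the filename itself (reusing the path and paths variables and the pandas import from above):
import os

# Strip the extension so e.g. 'prices.xlsx' becomes the key 'prices'.
my_files = {}
for p in paths:
    name = os.path.splitext(p)[0]
    my_files[name] = pd.read_excel(os.path.join(path, p))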

How to automatically merge multiple CSVs from multiple sub-folders, without specifying sub-folder name?

I have a main folder for an experiment with 2 condition folders in it. The Condition 1 and Condition 2 folders both have 3 repetition folders in them, and within these folders there is a final folder of contact events containing a CSV file. How can I automatically merge all 3 CSV files from the repetition folders per condition, without specifying the path or assuming the condition number? So far my code loops through all folders and combines all CSV files with the name contact_events.csv. The code is as follows:
path = "C:\\Users\\victo\\OneDrive\\Bureaublad\\data\\"
contact_csv = []
for filename in os.listdir(path):
f = os.path.join(path,filename)
if os.path.isdir(f):
for conditions in os.listdir(f):
k = os.path.join(f,conditions)
if os.path.isdir(k):
for repetitions in os.listdir(k):
r = os.path.join(k, repetitions)
if os.path.isdir(r):
for contacts in os.listdir(r):
c = os.path.join(r, contacts)
if os.path.isdir(c):
if os.path.exists(os.path.join(c, 'contact_events.csv')):
df_results = pd.read_csv(os.path.join(c, 'contact_events.csv'))
contact_csv.append(df_results)
combined_csvs_for_conditions = pd.concat(contact_csv)
combined_csvs_for_conditions
You already have the code to loop through the conditions and concatenate the files. To get a separate merge of the data files for each condition, you only need to move the assignment contact_csv = [] inside the loop over conditions.
Now, the real problem is how to store the result of each of these iterations, especially if we don't assume the number of conditions. We could use a list, but we would lose the identity of each dataframe (conditions are not necessarily going to be read in alphabetical order).
So we can use a dictionary and store each merge with the condition name as key.
conditions_data = {}
for filename in os.listdir(path):
    f = os.path.join(path, filename)
    if os.path.isdir(f):
        for conditions in os.listdir(f):
            k = os.path.join(f, conditions)
            if os.path.isdir(k):
                # create a new container for the files of each condition
                contact_csv = []
                for repetitions in os.listdir(k):
                    r = os.path.join(k, repetitions)
                    if os.path.isdir(r):
                        for contacts in os.listdir(r):
                            c = os.path.join(r, contacts)
                            if os.path.isdir(c):
                                if os.path.exists(os.path.join(c, 'contact_events.csv')):
                                    df_results = pd.read_csv(os.path.join(c, 'contact_events.csv'))
                                    contact_csv.append(df_results)
                combined_csvs_for_conditions = pd.concat(contact_csv)
                # store the combined CSVs under an entry named after the current condition
                conditions_data[conditions] = combined_csvs_for_conditions
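A quick sketch of how the result might then be used (the folder names 'Condition 1' and 'Condition 2' from the question are assumed):
# Each value is the concatenation of that condition's repetition CSVs.
for condition, df in conditions_data.items():
    print(condition, df.shape)

condition_1 = conditions_data['Condition 1']  # folder name taken from the question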

How to create a data frame from multiple files located in different subfolders

I have a folder which contains several sub-folders (a, b and c); each sub-folder contains 5 files, each file contains a single column of data and can be treated as an array. I want to create a data frame that contains the mean, standard deviation and median of each file.
The data frame should contain the following columns: subfolder name, file name, mean, std, median.
I was able to write the following code using defaultdict and received output shaped like {'b': [array([...]), array([...]), array([...]), array([...]), array([...])], 'c': [...], 'a': [...]}
import os
from collections import defaultdict

import numpy as np

root = "/My data"
# Map labels (subdirectories of root) to data
data_per_label = defaultdict(list)
# Get all top-level directories within `root`
label_dirs = [name for name in os.listdir(root) if os.path.isdir(os.path.join(root, name))]
#print(f"{label_dirs}")
# Loop over each label directory
for label in label_dirs:
    label_dir = os.path.join(root, label)
    # Loop over each filename in the label directory
    for filename in os.listdir(label_dir):
        # Take care to only look at .data files
        if filename.endswith(".data"):
            filepath = os.path.join(label_dir, filename)
            #print(f"{filename}_{label}")
            data = np.loadtxt(filepath)
            data_per_label[label].append(data)
print(data_per_label)
I then used the following code to transform the defaultdict into a DataFrame:
df = pd.DataFrame([[k] + j for k,v in data_per_label.items() for j in v], columns=['#', 'Distribution', 'Sample_1', 'Sample_2', 'Sample_3', 'Sample_4', 'Sample_5'])
print(df)
but received an error:
UFuncTypeError: ufunc 'add' did not contain a loop with signature matching types (dtype('<U1'), dtype('float64')) -> None
Will appreciate any insight one can give me on what I'm doing wrong.
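As a side note on the error: in [[k] + j for ...], j is a NumPy array, so + becomes element-wise addition between a string and floats rather than list concatenation, which is what raises the UFuncTypeError. A minimal sketch that computes the requested statistics per file and builds the frame directly (reusing root and label_dirs from the snippet above; the column names are illustrative):
import os

import numpy as np
import pandas as pd

# Build one row of summary statistics per .data file.
rows = []
for label in label_dirs:
    label_dir = os.path.join(root, label)
    for filename in os.listdir(label_dir):
        if filename.endswith(".data"):
            data = np.loadtxt(os.path.join(label_dir, filename))
            rows.append([label, filename, data.mean(), data.std(), np.median(data)])

df = pd.DataFrame(rows, columns=['subfolder', 'file', 'mean', 'std', 'median'])
print(df)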

How to make pandas take values from multiple lists within a list

I have a folder with multiple HTML files. I want the code to go through each and every file and pick out the subject-verb-object triplets using NLP. I then want pandas to list all of them under the headings of subject, verb and object for all the files together in one data frame. The problem I face is that pandas lists only the subject-verb-object triplets from the last file and not the first two. When I print sub_verb_obj in the loop it shows 3 lists within a list, but pandas does not pick up the triplets from all 3 lists. Can someone tell me what mistake I am making?
sub_verb_obj = []
folder_path = 'C:/Users/user3/.ipynb_checkpoints/xyz/xyz_2018'
for filename in glob.glob(os.path.join(folder_path, '*.html')):
    with open(filename, 'r', encoding='utf-8') as f:
        pat = f.read()
    doc = nlp(text)
    text_ext = textacy.extract.subject_verb_object_triples(doc)
    sub_verb_obj = list(text_ext)
sao = pd.DataFrame(sub_verb_obj)
sao.columns = ['subject', 'verb', 'object']
sao = sao.set_index('subject')
print(sao)
How can I make sure pandas lists all the subject-verb-object triplets from all the files in a folder in a single dataframe?
Because your data looks to be a list of tuples on each iteration, and that worked for a single run, I'd suggest building a dataframe on each loop iteration, storing it in a list, then concatenating the list of dataframes:
import glob
import os

import pandas as pd
import spacy
import textacy
from bs4 import BeautifulSoup

nlp = spacy.load('en_core_web_sm')  # assuming a standard English spaCy model

df_hold_list = []
folder_path = 'C:/Users/user3/.ipynb_checkpoints/xyz/xyz_2018'
for filename in glob.glob(os.path.join(folder_path, '*.html')):
    with open(filename, 'r', encoding='utf-8') as f:
        pat = f.read()
    soup = BeautifulSoup(pat, 'html.parser')
    claim_section = soup.find_all('section', attrs={"itemprop": "claims"})
    str_sect = claim_section[0]
    claim_text = str_sect.get_text()
    #print(str(type(claim_section)))
    clean_lower = claim_text.lower()
    text = clean_lower
    doc = nlp(text)
    text_ext = textacy.extract.subject_verb_object_triples(doc)
    sub_verb_obj = list(text_ext)
    df_hold_list.append(pd.DataFrame(sub_verb_obj))  # add each new dataframe here
sao = pd.concat(df_hold_list, axis=0)  # this concats all dfs on top of one another using axis=0
sao.columns = ['subject', 'verb', 'object']  # set the columns on the final df
sao = sao.set_index('subject')
print(sao)
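One small refinement worth considering: passing ignore_index=True to pd.concat gives the combined frame a fresh 0..n-1 index instead of repeating each per-file index:
sao = pd.concat(df_hold_list, axis=0, ignore_index=True)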

How to read multiple files and find the difference between the files?

I have multiple CSV files that I want to compare. The file contents are the same except for some additional changes, and I want to list those additional changes.
For example:
files = ['1.csv', '2.csv', '3.csv']
I want to compare 1.csv and 2.csv, get the difference and store it somewhere, then compare 2.csv and 3.csv and store that diff somewhere.
for dirs in glob.glob(INPUT_PATH + "*"):
    if os.path.isdir(dirs):
        for files in glob.glob(dirs + '*/' + '/*.csv'):
            ## lists all the csv files, but how do I read them to get the difference?
You can use pandas to read each csv as a dataframe into a list, then compare them from that list:
import pandas as pd

dfList = []
dfList.append(pd.read_csv('FilePath'))
dfList[0] contains the content of the first csv file, and so on.
So, to compare the first and second csv files, you have to compare dfList[0] and dfList[1].
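A minimal sketch of one way to do that comparison, using an outer merge with an indicator column to list the rows that appear in only one of the two frames (this assumes both CSVs share the same columns):
import pandas as pd

df1 = pd.read_csv('1.csv')
df2 = pd.read_csv('2.csv')

# Rows present in only one file; the _merge column says which one.
diff = df1.merge(df2, how='outer', indicator=True).query('_merge != "both"')
print(diff)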
The first function compares 2 files and the second function creates an additional file with the difference between the 2 files.
def compare(file_compared, file_master):
    """
    A = [100,200,300]
    B = [400,500,100]
    compare(A,B) = [200,300]
    """
    file_compared_list = []
    file_master_list = []
    with open(file_compared, 'r') as fc:
        for line in fc:
            file_compared_list.append(line.strip())
    with open(file_master, 'r') as fm:
        for line in fm:
            file_master_list.append(line.strip())
    return list(set(file_compared_list) - set(file_master_list))

def create_file(filename):
    diff = compare("file1.csv", "file2.csv")
    with open(filename, 'w') as f:
        for element in diff:
            f.write(element + '\n')  # add a newline so each difference lands on its own line

create_file("test.csv")
