I currently have code that imports an HDF5 file and then computes the area under a curve.
import h5py
import numpy as np
import pandas as pd

file = h5py.File('/Users/hansari/Desktop/blabla', 'r')
xdata = file.get('data')
xdata = np.array(xdata)
xdata_df = pd.DataFrame(xdata)
table = xdata_df.reset_index()
This is the code I use to fetch the file.
I currently have a folder that has 25 HDF5 files. Is there a way to have the code run over all 25 files and output the result of the function for each one?
I'm hoping to have it import one file, run through the whole script, and then repeat with the next HDF5 file, instead of importing all the data first and then running through the code with a massive amount of data.
I'm currently using glob.glob, but it's importing all of the files in one go and giving me a huge dataset that is hard to work with.
Without more code, I can't tell you what you are doing wrong. To demonstrate the process, I created a simple example that reads multiple HDF5 files and loads each one into a Pandas dataframe using glob.iglob() and h5py. See the code below. The table dataframe is created inside the second loop and only contains data from one HDF5 file. You should add your function to compute the area under the curve inside the for file in glob.iglob() loop.
import glob
import h5py
import numpy as np
import pandas as pd

# First, create 3 simple H5 files
for fcnt in range(1, 4):
    fname = f'file_{fcnt}.h5'
    with h5py.File(fname, 'w') as h5fw:
        arr = np.random.random(10*10).reshape(10, 10)
        h5fw.create_dataset('data', data=arr)

# Loop over the H5 files and load each one into a dataframe
for file in glob.iglob('file*.h5'):
    with h5py.File(file, 'r') as h5fr:
        xdata = h5fr['data'][()]
        table = pd.DataFrame(xdata).reset_index()
        print(table)
        # add code to compute area under the curve here
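As a rough illustration of that last comment, here is one way the per-file computation could look inside the for file in glob.iglob() loop, assuming each row of the dataset is a curve sampled against the column index (np.trapz is just one possible integration rule, not necessarily the function you already have):
import numpy as np

# inside the `for file in glob.iglob(...)` loop, after `table` is built:
y = table.drop(columns='index').to_numpy()  # drop the index column added by reset_index()
areas = np.trapz(y, axis=1)                 # one area value per row of this file's data
print(file, areas.sum())                    # report one result per HDF5 file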
# Purpose: read every CSV file in the directory, filter the rows whose column says 'fail', then copy those rows into a new CSV file.
# import necessary libraries
from sqlite3 import Row
import pandas as pd
import os
import glob
import csv

# the path to your csv file directory
mycsvdir = 'C:\Users\''  # this is where all the csv files will be housed

# use glob to get all the csv files in that directory (assuming they have the extension .csv)
csvfiles = glob.glob(os.path.join(mycsvdir, '*.csv'))

# loop through the files and read them in with pandas
dataframes = []  # a list to hold all the individual pandas DataFrames
for csvfile in csv_files:
    # read the csv file
    df = pd.read_csv(csvfile)
    dataframes.append(df)
    # print(row['roi_id'], row['result'])  # roi_id is the column label for the first cell, result is the Jth column label

dataframes = dataframes[dataframes['result'].str.contains('fail')]

# print out to a new csv file
dataframes.to_csv('ROI_Fail.csv')  # rewrite this to mirror the variable you want to save the failed rows in
I tried running this script but I'm getting a couple of errors. First off, I know my indentation is off (newbie over here), and I'm getting a big error under my for loop saying that "csv_files" is not defined. Any help would be greatly appreciated.
There are two issues here:
The first one is kind of easy - The variable in the for loop should be csvfiles, not csv_files.
The second one (Which will show up when you fix the one above) is that you are treating a list of dataframes as a dataframe.
The object "dataframes" in your script is a list to which you are appending the dataframes created from the CSV files. As such, you cannot index them by the column name as you are trying to do.
If your dataframes have the same layout I'd recommend using pd.concat to join all dataframes into a single one, and then filtering the rows as you did here.
full_dataframe = pd.concat(dataframes, axis=0)
full_dataframe = full_dataframe[full_dataframe['result'].str.contains('fail')]
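Putting both fixes together, the corrected script might look roughly like this (a sketch that keeps your variable names and filter; the directory path is a placeholder you will need to adjust):
import glob
import os
import pandas as pd

mycsvdir = 'C:/path/to/csvs'  # placeholder: point this at your csv directory
csvfiles = glob.glob(os.path.join(mycsvdir, '*.csv'))

# read each csv into its own dataframe
dataframes = [pd.read_csv(csvfile) for csvfile in csvfiles]

# join them into one dataframe, keep only the 'fail' rows, and write them out
full_dataframe = pd.concat(dataframes, axis=0)
failed = full_dataframe[full_dataframe['result'].str.contains('fail')]
failed.to_csv('ROI_Fail.csv', index=False)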
As a tip for further posts I'd recommend you also post the full traceback from your program. It helps us understand exactly what error you had when executing your code.
I'm doing sentiment analysis for my master's degree, and I'm working with Jupyter Notebook in VS Code on Ubuntu 20.04. I have a problem: when I try to load my file (12 GB), my kernel dies. So I split the file into 6 files of 2 GB each, but even then I can't load a whole file to create a dataframe to work with. So I would like to ask: how can I load each file, create a dataframe from it, and then store everything together in one dataframe to work with?
I tried to load one file in this way:
import pandas as pd
filename = pd.read_json("xaa.json", lines=True, chunksize= 200000)
and in this case the kernel didn't die. From this point, how could I turn this into a dataframe? I know that this way I split one file into chunks of 200000 lines each, but I don't know how to store all these chunks in a single dataframe.
Thank you for your attention, and I'm sorry for the basic question.
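For reference, the object returned by read_json with chunksize is an iterator of dataframes, so one way to stitch the chunks together (not the approach ultimately used below) is a plain pd.concat:
import pandas as pd

# read_json with chunksize yields DataFrames of 200000 lines each
reader = pd.read_json("xaa.json", lines=True, chunksize=200000)
# note: the concatenated result still has to fit in memory,
# which is the original constraint here
df = pd.concat(reader, ignore_index=True)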
I want to post my solution: first of all, I chose to read all the data in this way:
import glob
import json

files = list(glob.iglob('Tesi/Resources/Twitter/*.json'))
tweets_data = []
for file in files:
    tweets_file = open(file, "r", encoding='utf-8')
    for line in tweets_file:
        tweet = json.loads(line)
        tweets_data.append(tweet)
    tweets_file.close()
Then I defined a function to flatten all the tweets in order to load them into one dataframe.
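The flattening function itself isn't shown; one way such a step could look (assuming each tweet is a nested dict, which is not necessarily what the poster did) is pandas' json_normalize:
import pandas as pd

# flatten nested tweet dicts into columns like user.screen_name, entities.hashtags, ...
tweets_df = pd.json_normalize(tweets_data)
print(tweets_df.shape)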
I am trying to create some plots from some data in my research lab. Each data file is saved into tab-delimited text files.
My goal is to write a script that can read the text file, add the columns within each file to an array, and then eventually slice through the array at different points to create my plots.
My problem is that I'm struggling with just starting the script. Rather than hardcode each txt file to be added to the same array, is there a way to loop over each file in my directory to add the necessary files to the array, then slice through them?
I apologize if my question is not clear; I am new to Python and it is quite a steep learning curve for me. I can try to clear up any confusion if what I am asking doesn't make sense.
I am also using Canopy to write my script if this matters.
You could do something like:
from csv import DictReader  # the CSV reader can be used as a TSV reader
from glob import iglob

readers = []
for path in iglob("*.txt"):
    reader = DictReader(open(path), delimiter='\t')
    readers.append(reader)
glob.iglob("*.txt") returns an iterator over all files with extension .txt in the current working directory.
csv.DictReader reads a CSV file as an iterator of dicts. A tab-delimited text file is the same thing with a different delimiter.
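From there, the original goal was to get the columns into an array that can be sliced. One way that might look is sketched below; the column name 'signal' is purely a placeholder for whatever header your files actually contain:
import numpy as np
from csv import DictReader
from glob import iglob

columns = []
for path in iglob("*.txt"):
    with open(path, newline='') as f:
        rows = list(DictReader(f, delimiter='\t'))
    # pull one named column out of this file as floats
    columns.append([float(row['signal']) for row in rows])

data = np.array(columns)   # shape (n_files, n_rows), provided every file has the same length
print(data[:, 0])          # slice: the first value from every file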
I have N invoice data files in Excel, as shown below, and I want to create a CSV of each file so that it can be imported whenever needed... so how can I achieve this?
Here is a screenshot:
Assuming you have a folder "excel" full of Excel files within your project directory, and another folder "csv" where you intend to put the generated CSV files, you can easily batch-convert all the Excel files in the "excel" directory to CSV using Pandas.
It is assumed that you already have Pandas installed on your system; otherwise, install it via pip install pandas. The commented snippet below illustrates the process:
# import pandas and os
import pandas as pd
import os

# folders holding the source Excel files and the generated CSV files
excelDir = 'excel'   # adjust to wherever your Excel files live
csvDir = 'csv'       # adjust to wherever the CSV output should go

# our goal is:
# loop through the folder excelDir;
# at each iteration in the loop, check if the current file is an Excel file,
# and if it is, simply convert it to CSV and save it
for fileName in os.listdir(excelDir):
    # do we have an Excel file?
    if fileName.endswith(".xls") or fileName.endswith(".xlsx"):
        # if we do, then we do the conversion using pandas
        targetXLFile = os.path.join(excelDir, fileName)
        targetCSVFile = os.path.join(csvDir, fileName) + ".csv"
        # now, we read in the Excel file
        dFrame = pd.read_excel(targetXLFile)
        # once we are done reading, we can simply save the data to CSV
        dFrame.to_csv(targetCSVFile)
Hope this does the trick for you. Cheers and good luck.
Instead of putting the total output into one CSV, you could go with the following steps:
1. Convert your Excel content to CSV files or CSV objects.
2. Tag each object with its invoice id and save it into a dictionary, so the data structure looks like {'invoice-id': csv-object, 'invoice-id2': csv-object2, ...}.
3. Write a custom function that reads your csv-object and gives you name, product-id, qty, etc.
Hope this helps.
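A minimal sketch of that idea follows; the folder layout, the way the invoice id is derived from the file name, and the column names are all assumptions for illustration:
import glob
import os
import pandas as pd

# build {'invoice-id': dataframe} from a folder of Excel files,
# assuming each file is named like <invoice-id>.xlsx
invoices = {}
for path in glob.glob('invoices/*.xlsx'):
    invoice_id = os.path.splitext(os.path.basename(path))[0]
    invoices[invoice_id] = pd.read_excel(path)

def lookup(invoice_id, column):
    """Return one column (e.g. 'product-id' or 'qty') for a given invoice."""
    return invoices[invoice_id][column]

# each invoice can also be written out to its own CSV whenever needed
for invoice_id, df in invoices.items():
    df.to_csv(f'{invoice_id}.csv', index=False)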
I'm working with the latest version of Spark (2.1.1). I read multiple CSV files into a dataframe with spark.read.csv.
After processing this dataframe, how can I save it to output CSV files with specific names?
For example, there are 100 input files (in1.csv, in2.csv, in3.csv, ..., in100.csv).
The rows that belong to in1.csv should be saved as in1-result.csv, the rows that belong to in2.csv as in2-result.csv, and so on. (The default file names look like part-xxxx-xxxxx, which is not readable.)
I have seen partitionBy(col), but it looks like it can only partition by a column.
Another question: I want to plot my dataframe. Spark has no built-in plotting library. Many people use df.toPandas() to convert to pandas and plot it. Is there a better solution? My data is very big, so toPandas() causes a memory error. I'm working on a server and want to save the plot as an image instead of showing it.
I suggest the solution below for writing the DataFrame into specific directories related to each input file. In a loop over the files:
1. Read the CSV file.
2. Add a new column with information about the input file, using the withColumn transformation.
3. Union all the DataFrames using the union transformation.
4. Do the required preprocessing.
5. Save the result using partitionBy, providing the column with the input-file information, so that rows related to the same input file are saved in the same output directory.
Code could look like:
from pyspark.sql.functions import lit

all_df = None
for file in files:  # where files is the list of input CSV files that you want to read
    df = spark.read.csv(file)
    df = df.withColumn("input_file", lit(file))  # tag each row with its source file
    if all_df is None:
        all_df = df
    else:
        all_df = all_df.union(df)

# do the required preprocessing on all_df here, producing `result`
result = all_df

result.write.partitionBy("input_file").csv(outdir)
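Note that with this approach the output lands in partition directories named after the column value, e.g. outdir/input_file=in1.csv/part-....csv, rather than in files literally named in1-result.csv; if you need those exact names, you would have to rename or copy the partition directories in a separate step afterwards.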