How to access all the data saved in an h5py file - python

I save a number of numpy arrays into an h5py file under different names, one per dataset. Assuming I don't know those dataset names, how can I access the saved data after opening the h5py file? For example:
f = h5py.File('filename','w')
f.create_dataset('file1',data=data1)
....
F = h5py.File('filename','r')
# next, how to read out all the datasets without knowing their names in advance

dataset_names = list(F.keys()) # contains all the dataset names
data1 = F[dataset_names[0]][()].astype('float32')  # .value is gone in h5py 3.x; use [()] instead
...
See also the post How to know HDF5 dataset name in python
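If you just want to pull every dataset into a dict keyed by its name, a minimal sketch (assuming all datasets live at the root of the file; for nested groups you could walk the file with visititems() instead):
import h5py

data = {}
with h5py.File('filename', 'r') as f:
    for name in f.keys():           # iterate over all root-level names
        data[name] = f[name][()]    # read each dataset into a numpy array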

Related

Reading multiple hdf5 files from a folder

I currently have code that imports an hdf5 file and then computes a function for the area under the curve.
import h5py
import numpy as np
import pandas as pd

file = h5py.File('/Users/hansari/Desktop/blabla', 'r')
xdata = file.get('data')
xdata = np.array(xdata)
xdata_df = pd.DataFrame(xdata)
table = xdata_df.reset_index()
This is the code I use to fetch the file.
I currently have a folder that has 25 hdf5 files. Is there a way to have the code run over all 25 files and spit out the result of the function for each?
I'm hoping to have it import a file, run through the whole script, and then repeat with the next hdf5 file, instead of importing all the data first and then running through the code with a massive amount of data.
I'm currently using glob.glob, but it's importing all of the files in one go and giving me a huge dataset that is hard to work with.
Without more code, I can't tell you what you are doing wrong. To demonstrate the process, I created a simple example that reads multiple HDF5 files and loads each into a Pandas dataframe using glob.iglob() and h5py. See the code below. The table dataframe is created inside the second loop and only contains data from one HDF5 file at a time. You should add your function to compute the area under the curve inside the for file in glob.iglob() loop.
import glob
import h5py
import numpy as np
import pandas as pd

# First, create 3 simple H5 files
for fcnt in range(1, 4, 1):
    fname = f'file_{fcnt}.h5'
    with h5py.File(fname, 'w') as h5fw:
        arr = np.random.random(10*10).reshape(10, 10)
        h5fw.create_dataset('data', data=arr)

# Loop over H5 files and load each into a dataframe
for file in glob.iglob('file*.h5'):
    with h5py.File(file, 'r') as h5fr:
        xdata = h5fr['data'][()]
        table = pd.DataFrame(xdata).reset_index()
        print(table)
        # add code to compute the area under the curve here
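If your area-under-the-curve calculation is a simple numerical integration, a minimal sketch of what could go inside that loop, using numpy.trapz (the column layout is an assumption; adjust it to however x and y are stored in your files):
import glob
import h5py
import numpy as np

for file in glob.iglob('file*.h5'):
    with h5py.File(file, 'r') as h5fr:
        xdata = h5fr['data'][()]
        # assumed layout: column 0 holds x, column 1 holds y
        area = np.trapz(xdata[:, 1], x=xdata[:, 0])
        print(f'{file}: area under the curve = {area:.4f}')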

Read a partitioned parquet dataset from multiple files with PyArrow and add a partition key based on the filename

I have a bunch of parquet files, each containing a subset of my dataset. Let's say that the files are named data-N.parquet with N being an integer.
I can read them all and subsequently convert to a pandas dataframe:
import glob
import pyarrow.parquet as pq

files = glob.glob("data-**.parquet")
ds = pq.ParquetDataset(
    files,
    metadata_nthreads=64,
).read_table(use_threads=True)
df = ds.to_pandas()
This works just fine. What I would like to have is an additional column in the final data frame indicating which file the data originated from.
As far as I understand, the ds data is partitioned, with one partition per file. So it would be a matter of including the partition key in the data frame.
Is this feasible?
Partition keys can, at the moment, be included in the dataframe. However, all existing partitioning schemes use directory names for the key. So if your data were laid out as /N/data.parquet or /batch=N/data.parquet the key would be picked up (you will need to supply a partitioning object when you read the dataset).
There is no way today (in pyarrow) to get the filename in the returned results.
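As a workaround, a minimal sketch that reads each file separately and tags the rows with the originating filename before concatenating (the source_file column name is just an illustration):
import glob
import pandas as pd
import pyarrow.parquet as pq

frames = []
for path in sorted(glob.glob("data-*.parquet")):
    frame = pq.read_table(path).to_pandas()
    frame["source_file"] = path   # record which file these rows came from
    frames.append(frame)
df = pd.concat(frames, ignore_index=True)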

Python - re-write a netcdf file after calculation

I have a netcdf4 file called test.nc
I am calculating monthly median (from the daily values) with the following code:
import os
import xarray as xr

os.chdir(inbasedir)
data = xr.open_dataset('test.nc')
monthly_data = data.resample(freq='m', dim='time', how='median')
My question is: how can I write this output to a new netcdf file without having to rewrite all the variables and metadata already included in the input netcdf file?
Not sure if this is what you want, but the following creates a new netcdf file from the resulting Dataset:
monthly_data.to_netcdf('newfile.nc')
You can use .drop() on the Dataset to remove variables you don't want in the output.
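Putting both pieces together, a minimal sketch (the variable name 'precip' is hypothetical, and newer xarray versions spell the resampling and dropping as resample(time='M') and drop_vars()):
import xarray as xr

data = xr.open_dataset('test.nc')
monthly_data = data.resample(time='M').median()   # monthly median of the daily values
monthly_data = monthly_data.drop_vars('precip')   # drop a variable you don't want (hypothetical name)
monthly_data.to_netcdf('newfile.nc')              # write the result to a new netCDF file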

How to import csv or arff to scikit?

I have two datasets in csv and arff format which I have been using with classification models in weka. I was wondering whether these formats can be used in scikit to try other classification methods in python.
This is what my dataset looks like:
ASSAY_CHEMBLID...MDEN.23...MA,TARGET_TYPE...No...MA,TARGET_TYPE...apol...MA,TARGET_TYPE...ATSm5...MA,TARGET_TYPE...SCH.6...MA,TARGET_TYPE...SPC.6...MA,TARGET_TYPE...SP.3...MA,TARGET_TYPE...MDEN.12...MA,TARGET_TYPE...MDEN.22...MA,TARGET_TYPE...MLogP...MA,TARGET_TYPE...R...MA,TARGET_TYPE...G...MA,TARGET_TYPE...I...MA,ORGANISM...No...MA,ORGANISM...C2SP1...MA,ORGANISM...VC.6...MA,ORGANISM...ECCEN...MA,ORGANISM...khs.aasC...MA,ORGANISM...MDEC.12...MA,ORGANISM...MDEC.13...MA,ORGANISM...MDEC.23...MA,ORGANISM...MDEC.33...MA,ORGANISM...MDEO.11...MA,ORGANISM...MDEN.22...MA,ORGANISM...topoShape...MA,ORGANISM...WPATH...MA,ORGANISM...P...MA,Lij
0.202796,0.426972,0.117596,0.143818,0.072542,0.158172,0.136301,0.007245,0.016986,0.488281,0.300438,0.541931,0.644161,0.048149,0.02002,0,0.503415,0.153457,0.288099,0.186024,0.216833,0.184642,0,0.011592,0.00089,0,0.209406,0
where Lij is my class identifier (0 or 1). I was wondering if a prior transformation with numpy is needed.
To read ARFF files, you'll need to install liac-arff; see the link for details.
Once you have that installed, use the following code to read the ARFF file:
import arff
import numpy as np

# read arff data
with open("file.arff") as f:
    # load reads the arff db as a dictionary with
    # the data as a list of lists at key "data"
    dataDictionary = arff.load(f)
# the with block closes the file, so no explicit f.close() is needed

# extract data and convert to numpy array
arffData = np.array(dataDictionary['data'])
There are several ways in which csv data can be read; I found the easiest to be the read_csv function from the pandas module. See the link for details regarding installation.
The code for reading a csv data file is below:
# read csv data
import pandas as pd
csvData = pd.read_csv("filename.csv",sep=',').values
In either case, you'll have a numpy array with your data. Since the last column represents the classes/target/ground truth/labels, you'll need to separate the data into a features array X and a target vector y, e.g.
X = arffData[:, :-1]
y = arffData[:, -1]
where X contains all the data in arffData except for the last column, and y contains the last column of arffData.
Now you can use any supervised learning binary classifier from scikit-learn.
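For instance, a minimal sketch using a logistic regression classifier (any scikit-learn estimator would work; the 80/20 split is arbitrary):
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# X and y come from the snippets above; cast them to numeric types in case
# the arff loader returned them as strings
X = X.astype(float)
y = y.astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))   # accuracy on the held-out split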

Pyspark: write df to file with specific name, plot df

I'm working with the latest version of Spark (2.1.1). I read multiple csv files into a dataframe with spark.read.csv.
After processing this dataframe, how can I save it to an output csv file with a specific name?
For example, there are 100 input files (in1.csv,in2.csv,in3.csv,...in100.csv).
The rows that belong to in1.csv should be saved as in1-result.csv, the rows that belong to in2.csv should be saved as in2-result.csv, and so on. (The default file names look like part-xxxx-xxxxx, which is not readable.)
I have seen partitionBy(col), but it looks like it can only partition by column.
Another question: I want to plot my dataframe. Spark has no built-in plotting library. Many people use df.toPandas() to convert to pandas and plot it. Is there a better solution? My data is very big, so toPandas() will cause a memory error. I'm working on a server and want to save the plot as an image instead of showing it.
I suggest the solution below for writing the DataFrame into directories specific to each input file:
in a loop over each file:
read the csv file
add a new column with information about the input file using the withColumn transformation
union all DataFrames using the union transformation
do the required preprocessing
save the result using partitionBy, providing the column with the input file information, so that rows related to the same input file are saved in the same output directory
The code could look like this:
from pyspark.sql.functions import lit

all_df = None
for file in files:  # where files is the list of input CSV files that you want to read
    df = spark.read.csv(file)
    # withColumn returns a new DataFrame; wrap the filename in lit() to store it as a literal column
    df = df.withColumn("input_file", lit(file))
    if all_df is None:
        all_df = df
    else:
        all_df = all_df.union(df)

# do preprocessing here to produce `result`

# partitionBy takes the column name; rows from each input file end up in their own subdirectory
result.write.partitionBy("input_file").csv(outdir)
