Appending to h5py groups - python

I have files with the following structure:
time 1
index 1
value x
value y
time 1
index 2
value x
value y
time 2
index 1
...
I wish to convert the file to the hdf5 format using h5py, and sort the values from each index into separate groups.
My approach is
f = h5py.File(filename1,'a')
trajfile = open(filename2, 'rb')
for i in range(length_of_filw):
time = struct.unpack('>d', filename2.read(8))[0]
index = struct.unpack('>i', filename2.read(4))[0]
x = struct.unpack('>d', filename2.read(8))[0]
y = struct.unpack('>d', filename2.read(8))[0]
f.create_dataset('/'+str(index), data=[time,x,y,z])
But in this way I am not able to append to the groups (I am only able to write to each group once...). The error message is "RuntimeError: Unable to create link (name already exists)".
Is there a way to append to the groups?

You can write to a dataset as many times as you want - you just can't have twice a dataset with the same name. This is the error you're getting. Note that you are creating a dataset and at the same time you are putting some data inside of it. In order to write other data to it, it has to be large enough to accomodate it.
Anyway, I believe you are confusing groups and datasets.
Groups are created with e.g.
grp = f.create_group('bar') # this create the group '/bar'
and you want to store datasets in a dataset, created like you said with:
dst = f.create_dataset('foo',shape=(100,)) # this create the dataset 'foo', with enough space for 100 elements.
you only need to create groups and datasets once - but you can refer to them through their handles, (grp and dst), in order to write in them.
I suggest you first go through your file once, create your desired groups and datasets using the 'shape' parameter to properly size it, and then populate the datasets with actual data.

Related

Fast/efficient way to extract data from multiple large NetCDF files

I need to extract data from a global grid only for a specific set of nodes, given by lat/lon coordinates (in the order of 5000-10000). The data are time-series of hydraulic parameters, for example wave height.
The global data set is huge so it is divided into many NetCDF files. Each NetCDF file is around 5GB and contains data for the entire global grid, but only for one variable (e.g. wave height) and one year (e.g. 2020). Say I want to extract the full time series (42 years) of 6 variables at a certain location, I need to extract data form 6x42 = 252 NC files, each 5GB in size.
My current approach is a triple loop through years, variables, and nodes. I use Xarray to open each NC file, extract the data for all the required nodes and store it in a dictionary. Once I've extracted all the data in the dictionary I create one pd.dataframe for each location, which I store as a pickle file. With 6 variables and 42 years, this results in a pickle file of around 7-9 MB for each location (so not very large actually).
My approach works perfectly fine if I have a small amount of locations, but as soon as it grows to a few hundred, this approach takes extremely long. My gut feeling is that it is a memory problem (since all the extracted data is first stored in a single dictionary, until every year and variable are extracted). But one of my colleagues said that Xarray is actually quite inefficient and that this might lead to the long duration.
Does anyone here have experience with similar issues or know of an efficient way to extract data from a multitude of NC files? I put the code I currently use below. Thanks for any help!
# set conditions
vars = {...dictionary which contains variables}
years = np.arange(y0, y1 + 1) # year range
ndata = {} # dictionary which will contain all data
# loop through all the desired variables
for v in vars.keys():
ndata[v] = {}
# For each variable, loop through each year, open the nc file and extract the data
for y in years:
# Open file with xarray
fname = 'xxx.nc'
data = xr.open_dataset(fname)
# loop through the locations and load the data for each node as temp
for n in range(len(nodes)):
node = nodes.node_id.iloc[n]
lon = nodes.lon.iloc[n]
lat = nodes.lat.iloc[n]
temp = data.sel(longitude=lon, latitude=lat)
# For the first year, store the data into the ndata dict
if y == years[0]:
ndata[v][node] = temp
# For subsequent years, concatenate the existing array in ndata
else:
ndata[v][node] = xr.concat([ndata[v][node],temp], dim='time')
# merge the variables for the current location into one dataset
for n in range(len(nodes)):
node = nodes.node_id.iloc[n]
dset = xr.merge(ndata[v][node] for v in variables.keys())
df = dset.to_dataframe()
# save dataframe as pickle file, named by the node id
df.to_pickle('%s.xz'%(node)))
This is a pretty common workflow so I'll give a few pointers. A few suggested changes, with the most important ones first
Use xarray's advanced indexing to select all points at once
It looks like you're using a pandas DataFrame nodes with columns 'lat', 'lon', and 'node_id'. As with nearly everything in python, remove an inner for loop whenever possible, leveraging array-based operations written in C. In this case:
# create an xr.Dataset indexed by node_id with arrays `lat` and `lon
node_indexer = nodes.set_index('node_id')[['lat', 'lon']].to_xarray()
# select all points from each file simultaneously, reshaping to be
# indexed by `node_id`
node_data = data.sel(lat=node_indexer.lat, lon=node_indexer.lon)
# dump this reshaped data to pandas, with each variable becoming a column
node_df = node_data.to_dataframe()
Only reshape arrays once
In your code, you are looping over many years, and every year after
the first one you are allocating a new array with enough memory to
hold as many years as you've stored so far.
# For the first year, store the data into the ndata dict
if y == years[0]:
ndata[v][node] = temp
# For subsequent years, concatenate the existing array in ndata
else:
ndata[v][node] = xr.concat([ndata[v][node],temp], dim='time')
Instead, just gather all the years worth of data and concatenate
them at the end. This will only allocate the needed array for all the data once.
Use dask, e.g. with xr.open_mfdataset to leverage multiple cores. If you do this, you may want to consider using a format that supports multithreaded writes, e.g. zarr
All together, this could look something like this:
# build nested filepaths
filepaths = [
['xxx.nc'.format(year=y, variable=v) for y in years
for v in variables
]
# build node indexer
node_indexer = nodes.set_index('node_id')[['lat', 'lon']].to_xarray()
# I'm not sure if you have conflicting variable names - you'll need to
# tailor this line to your data setup. It may be that you want to just
# concatenate along years and then use `xr.merge` to combine the
# variables, or just handle one variable at a time
ds = xr.open_mfdataset(
filepaths,
combine='nested',
concat_dim=['variable', 'year'],
parallel=True,
)
# this will only schedule the operation - no work is done until the next line
ds_nodes = ds.sel(lat=node_indexer.lat, lon=node_indexer.lon)
# this triggers the operation using a dask LocalCluster, leveraging
# multiple threads on your machine (or a distributed Client if you have
# one set up)
ds_nodes.to_netcdf('all_the_data.zarr')
# alternatively, you could still dump to pandas:
df = ds_nodes.to_dataframe()

Looping a function over huge dataset

A (already defined) function takes ISIN (unique identifier in finance) as input and gets the corresponding RIC (another identifier) as output by looking at a particular internal web app where this data is available in tabular form. The Key limitation of this website is that it can't take more than 500 input ID at a time. So when 500 or less number of ISINs are entered as input it returns a dataframe containing 500 input ISIN and their corresponding RIC codes from the website.
Task is to take a csv as input containing 30k ISINs and batch them in group of 500 IDs so that it can pass through the function and then store the produced output (dataframe). Keep looping input and appending output incrementally.
Can someone please suggest how to break this data of 30K into size of 500 and then loop through function and store all results? Many thanks in advance!
.iloc is the method you want to use.
df.iloc[firstrow:lastrow , firstcol:lastcol]
if you put it in a for loop such as
for x in range (0, 30000, 50):
a = x #first row
b = x+50 #last row
thisDF = bigdf.iloc[a:b , firstcol:lastcol]
Try it and implement it in your code. You should make questions with some code you tried, so you get helped better.
Assuming you read in the .csv file as a pandas Series (e.g., by using something like this: pd.Series.from_csv('ISINs.csv')) or that you have a list, you could split these up as thus:
import pandas as pd
import numpy as np
# mock series of ISINs
isins = pd.Series(np.arange(0, 30002, 1))
data = pd.DataFrame()
for i in range(0, len(isins)//500):
isins_for_function = isins.iloc[i*500: i*500+500]
# if you have a list instead of a series, you will need to split it like this instead
isins_for_function = isins[i*500: i*500+500]
df = func(isins_for_function)
data = pd.concat([data, df])
# for Series
df = func(isins.iloc[-(len(isins)%500):]
# for list
df = func(isins[-(len(isins)%500):]
data = pd.concat([data, df])
This will concatenate your dataframes together into data. isins is your Series/list of isins. You will need the last bit after the for loop for any index values that are after the last chunk of 500 (in the Series above, which has 30002 rows, the last two are not included in the chunks of 500 so still need to be entered into the function func).

How can I periodically skip rows reading txt with pandas?

I need to process data measured every 20 seconds during the whole 2018 year, the raw file has following structure:
date time a lot of trash
in several rows
amount of samples trash again
data
date time a lot of trash
etc.
I want to make one pandas dataframe of it or at least one dataframe per every block (its size is coded as amount of samples) of data saving the time of measurement.
How can I ignore all other data trash? I know that it is written periodically (period = amount of samples), but:
- I don't know how many strings are in file
- I don't want to use explicit method file.getline() in cycle, because it would work just endlessly (especially in python) and I have no enough computing power to use it
Is there any method to skip rows periodically in pandas or another lib? Or how else can I resolve it?
There is an example of my data:
https://drive.google.com/file/d/1OefLwpTaytL7L3WFqtnxg0mDXAljc56p/view?usp=sharing
I want to get dataframe similar to datatable on the pic + additional column with date-time without technical rows
Use itertools.islice, where N below means read every N lines
from itertools import islice
N = 3
sep = ','
with open(file_path, 'r') as f:
lines_gen = islice(f, None, None, N)
df = pd.DataFrame([x.strip().split(sep) for x in lines_gen])
I repeated your data three times. It sounds like you need every 4th row (not starting at 0) because that is where your data lies. In the documentation for skipsrows it says.
If callable, the callable function will be evaluated against the row indices, returning True if the row should be skipped and False otherwise. An example of a valid callable argument would be lambda x: x in [0, 2].
So what if we pass a not in to the lambda function? that is what I am doing below.
I am creating a list of the values i want to keep. and passing the not in to the skiprows argument. In English, skip all the rows that are not every 4th line.
import pandas as pd
# creating a list of all the 4th row indexes. If you need more than 1 million, just up the range number
list_of_rows_to_keep = list(range(0,1000000))[3::4]
# passing this list to the lambda function using not in.
df = pd.read_csv(r'PATH_To_CSV.csv', skiprows=lambda x: x not in list_of_rows_to_keep)
df.head()
#output
0 data
1 data
2 data
Just count how many lines are in file and put the list of them (may it calls useless_rows) which are supposed to be skiped in pandas.read_csv(..., skiprows=useless_rows).
My problem was a chip rows counting.
There are few ways to do it:
On Linux command "wc -l" (here is an instruction how to put it into your code: Running "wc -l <filename>" within Python Code)
Generators. I have a key in my relevant rows: it is in last column. Not really informative, but rescue for me. So I can count string with it, appears it's abour 500000 lines and it took 0.00011 to count
with open(filename) as f:
for row in f:
if '2147483647' in row:
continue
yield row

How to properly save each large chunk of data as a pandas dataframe and concatenate them with each other

I have a dataframe that has over 400K rows and several hundred columns that I have decided to read in with chunks because it does not fit into Memory and gives me MemoryError.
I have managed to read it in in chunks like this:
x = pd.read_csv('Training.csv', chunksize=10000)
and afterwards I can get each of the chunks by doing this:
a = x.get_chunk()
b = x.get_chunk()
etc etc keep doing this over 40 times which is obviously slow and bad programming practice.
When I try doing the following in an attempt to create a loop that can save each chunk into a dataframe and somehow concatenate them:
for x in pd.read_csv('Training.csv', chunksize=500):
x.get_chunk()
I get:
AttributeError: 'DataFrame' object has no attribute 'get_chunk'
What is the easiest way I can read in my file and concatenate all my chunks during the import?
Also, how do I do further manipulation on my dataset to avoid memory error issues (particularly, imputing null values, standardizing/normalizing the dataframe, and then running machine learning models on it using scikit learn?
When you specify chunksize in a call to pandas.read_csv you get back a pandas.io.parsers.TextFileReader object rather than a DataFrame. Try this to go through the chunks:
reader = pd.read_csv('Training.csv',chunksize=500)
for chunk in reader:
print(type(chunk)) # chunk is a dataframe
Or grab all the chunks (which probably won't solve your problem!):
reader = pd.read_csv('Training.csv',chunksize=500)
chunks = [chunk for chunk in reader] # list of DataFrames
Depending on what is in your dataset a great way of reducing memory use is to identify columns that can be converted to categorical data. Any column where the number of distinct values is much lower than the number of rows is a candidate for this. Suppose a column contains some sort of status with limited values (e.g. 'Open','Closed','On hold') do this:
chunk['Status'] = chunk.assign(Status=lambda x: pd.Categorical(x['Status']))
This will now store just an integer for each row and the DataFrame will hold a mapping (e.g 0 = 'Open', 1 = 'Closed etc. etc.)
You should also look at whether or not any of your data columns are redundant (they effectively contain the same information) - if any are then delete them. I've seen spreadsheets containing dates where people have generated columns for year, week, day as they find it easier to work with. Get rid of them!

Is there a more efficient tool than iterrows() in this situation?

Okay so, here's the thing. I'm working with a lot of pandas data frames and arrays. Often times, I need to pair up a value from one frame with a value from another, ideally combining the information into one frame in the end.
Say I'm looking at image files. There's a set of information specific to each file. Sometimes there's certain types of image files that share the same kind of information. Simple example:
FILEPATH, TYPE, COLOR, VALUE_I,<br>
/img2.jpg, A, 'green', 0.6294<br>
/img45.jpg, B, 'green', 0.1846<br>
/img87.jpg, A, 'blue', 34.78<br>
Often, this information is indexed out by type/color/value etc and fed into some other function that gives me another important output, let's say VALUE_II. But I can't concatenate it directly onto the original dataframe because the indices won't match, either because of the nature of the output or because I only fed part of the frame.
Or another situation: I learn that images of a certain TYPE have a specific value attached to them, so I make a dictionary of types and their value. Again, this column doesn't exist, so in this case I would use iterrows() to march down the frame, see if the type matches a specific key, and if it does append it to an array. Then in the end, I convert that array to a dataframe and concatenate it onto the original.
Here's the worse offender. With up to 1800 rows in each frame, it takes FOREVER.:
newColumn = []
for index, row in originalDataframe.iterrows():
for indx, rw in otherDataframe.iterrows():
if row['filename'] in rw['filepath']:
newColumn.append([rw['VALUE_I'],rw['VALUE_II'], rw['VALUE_III']])
newColumn = pd.DataFrame(newColumn, columns = ['VALUE_I', 'VALUE_II', 'VALUE_III'])
originalDataframe = pd.concat([originalDataframe, newColumn], axis=1)
Solutions would be appreciated!
If you can split filename from otherDataframe["filepath"], you can then just compare for equality with orinalDataframe's filename without need to check in. After that you can simplify calculation with pandas.DataFrame.join, which for each filename in originalDataframe will find the same filename in otherDataframe and add all other columns from it.
import os
otherDataframe["filename"] = otherDataframe["filepath"].map(os.path.basename)
joinedDataframe = originalDataframe.join(otherDataframe.set_index("filename"), on="filename")
If there are columns with the same name in originalDataframe and otherDataframe you should set lsuffix or rsuffix.
focusing on the second half of your question, as that's what you provided code for. Your program is checking every row of df1 against every row in df2, yielding potentially 1800 *1800, or 3240000 possible combinations. If there is only one possible match for each row then adding 'break' in will help some, but is not ideal.
newColumn.append([rw['VALUE_I'],rw['VALUE_II'], rw['VALUE_III']])
break
if the structure of you data allows it, i would try something like:
ref = {}
for i, path in enumerate(otherDataframe['filepath']):
*_, file = path.split('\\')
ref[file] = i
originalDataframe['VALUE_I'] = None
originalDataframe['VALUE_II'] = None
originalDataframe['VALUE_III'] = None
for i, file in enumerate(originalDataframe['filename']):
try:
j = ref[file]
originalDataframe.loc[i, 'VALUE_I'] = otherDataframe.loc[j, 'VALUE_I']
originalDataframe.loc[i, 'VALUE_II'] = otherDataframe.loc[j, 'VALUE_II']
originalDataframe.loc[i, 'VALUE_III'] = otherDataframe.loc[j, 'VALUE_III']
except:
pass
Here we we iterate through the paths in otherDataframe (I assume they follow a pattern of C:\asdf\asdf\file), split the path on \ to pull out file, and then construct a dictionary of files to row numbers. Next we initialize the 3 columns in originalDataframe that you want to write to.
Lastly we iterate through the files in originalDataframe, check to see if that file exists in our dictionary of files in otherDataframe (done inside a try to catch errors), and pull the row number (out of the dictionary) which we then use to write the values from other to original.
Side note, you describe you paths as being in the vein of 'C:/asd/fdg/img2.jpg', in which case you should use:
*_, file = path.split('/')

Categories

Resources