Adding rows in bulk to PyTables array - python

I have a script that collects data from an experiment and adds it to a PyTables table. The script gets data in batches (say, groups of 10). It's a little cumbersome in the code to add one row at a time via the normal method, e.g.:
data_batch = experiment.read()
last_time = time.time()
for data_row in data_batch:
row = table.row
row['timestamp'] = last_time
last_time += dt
row['column1'] = data_row[0]
row['column2'] = data_row[1]
row.append()
table.flush()
I would much rather do something like this:
data_batch = experiment.read()
start_index = len(table)
num_rows = len(data_batch)
table.append_n_rows(num_rows)
table.cols.timestamp[start_index:] = last_time + np.arange(num_rows) * dt
last_time += dt * num_rows
table.cols.column1[start_index:] = data_batch[:, 0]
table.cols.column2[start_index:] = data_batch[:, 1]
table.flush()
Does anyone know if there is some function that does the table.append_n_rows. Right now, all I can do is [table.row for i in range(num_rows)], which I feel is hacky and inefficient.

You are on the right track. In table.append(rows), the rows argument can be any object that can be converted to a structured array. This includes: "NumPy structured arrays, lists of tuples or array records, and a string or Python buffer". (I prefer NumPy arrays because I routinely work with them. Your answer shows how to use a list of tuples.)
There is a significant performance advantage adding data in batches instead of 1 row at a time. I ran some tests and posted to SO a few years ago. I/O performance is primarily related to number of batches, and not the batch size. Take a look at this answer for details: pytables writes much faster than h5py
Also, if you are going to create a large table, consider setting expectedrows parameter when you create the table. This will also improve I/O performance. This has the side benefit of setting an appropriate chunksize.
Recommended approach with your data.
data_batch = experiment.read()
last_time = time.time()
row_list = []
for data_row in data_batch:
row_list.append( (last_time, data_row[0], data_row[1] ) )
last_time += dt
your_table.append( row_list )
your_table.flush()

There is an example in the source code
I'm going to paste it here to avoid a dead link in the future.
import tables as tb
class Particle(tb.IsDescription):
name = tb.StringCol(16, pos=1) # 16-character String
lati = tb.IntCol(pos=2) # integer
longi = tb.IntCol(pos=3) # integer
pressure = tb.Float32Col(pos=4) # float (single-precision)
temperature = tb.FloatCol(pos=5) # double (double-precision)
fileh = tb.open_file('test4.h5', mode='w')
table = fileh.create_table(fileh.root, 'table', Particle,
"A table")
# Append several rows in only one call
table.append([("Particle: 10", 10, 0, 10 * 10, 10**2),
("Particle: 11", 11, -1, 11 * 11, 11**2),
("Particle: 12", 12, -2, 12 * 12, 12**2)])
fileh.close()

Related

What is the most efficient way of indexing Numpy matrices?

Question: What is the most efficient way to implement the equivalent of the following, using Pandas dataframes: temp = df[df.feature] == value] at scale (see below for context re: scale)?
Background: I have daily time series data for ~500 entities for 30 years, and for each entity and each day, need to create 90 features based on various look-backs, up to 240 days in the past. Currently, I'm looping through each day, manipulating all of the data from that day, then inserting it into a pre-allocated numpy matrix—but it's proving very slow for the size of my data set.
Naive approach:
df = pd.DataFrame()
for day in range(241, t_max):
temp_a = df_timeseries[df_timeseries.t] == day].copy()
temp_b = df_timeseries[df_timeseries.t] == day - 1].copy()
new_val = temp_a.feature_1/temp_b.feature_1
new_val['t'] = day
new_val['entity'] = temp_a.entity
df.concat([df, new_val])
Current approach (simplified):
df = np.matrix(np.zeros([num_days*num_entities, 3]))
col_dict = dict(zip(df_timeseries.columns, list(range(0,len(df_timeseries.columns)))))
mtrx_timeseries = np.matrix(df_timeseries.to_numpy())
for i, day in enumerate(range(241, t_max)):
interm = np.matrix(np.zeros([num_entities, 3]))
interm[:, 0] = day
temp_a = mtrx_timeseries[np.where(mtrx_timeseries[:, col_dict['t']] == day)[0], :]
temp_b = mtrx_timeseries[np.where(mtrx_timeseries[:, col_dict['t']] == day - 1)[0], :]
temp_cr = temp_a[:, col_dict['feature_1']]/temp_b[:, col_dict['feature_1']] - 1
temp_a = mtrx_timeseries[np.where(mtrx_timeseries[:, col_dict['t']] == day - 5)[0], :]
temp_b = mtrx_timeseries[np.where(mtrx_timeseries[:, col_dict['t']] == t - 10)[0], :]
temp_or = temp_a[:, col_dict['feature_1']]/temp_b[:, col_dict['feature_1']] - 1
interm[:, 1:] = np.concatenate([temp_cr, temp_or], axis=1)
df[i*num_entities : (i + 1)*num_entities, :] = interm
Line profiling the full version of the code I have shows that each statement of the form mtrx_timeseries[np.where(mtrx_timeseries[:, col_dict['t']] == day)[0], :] takes up ~23% of the time of the run in total, hence my looking for a more streamlined solution. Since indexing takes the most time, and since the loop means that this operation is performed every iteration, perhaps one solution might be to index just once, storing each day's data in a separate array element, and then looping through array elements?
This isn't a complete solution to your problem, but I think it will get you where you need to be.
Consider the following code:
entity_dict = {}
entity_idx = 0
arr = np.zeros((num_entities, t_max-240))
for entity, day, feature in df_timeseries[['entity', 'day', 'feature_1']].values:
if entity not in entity_dict:
entity_dict[entity] = entity_idx
entity_idx += 1
arr[entity_dict[entity], day-240] = feature
This will convert df_timeseries into an num_entities*num_days shaped array organized by entities, very efficiently. You won't need to do any fancy indexing at all. The most efficient way to index a numpy array or matrix is to know what indices you need ahead of time and not search the array
for them. You can then perform array operations (it looks to me like your operation is simple elementwise division, which you can do in a couple lines with no extra loop).
Then convert back to the original format.

Running Scipy Linregress Across Dataframe Where Each Element is a List

I am working with a Pandas dataframe where each element contains a list of values. I would like to run a regression between the lists in the first column and the lists in each subsequent column for every row in the dataframe, and store the t-stats of each regression (currently using a numpy array to store them). I am able to do this using a nested for loop that loops through each row and column, but the performance is not optimal for the amount of data I am working with.
Here is a quick sample of what I have so far:
import numpy as np
import pandas as pd
from scipy.stats import linregress
df = pd.DataFrame(
{'a': [list(np.random.rand(11)) for i in range(100)],
'b': [list(np.random.rand(11)) for i in range(100)],
'c': [list(np.random.rand(11)) for i in range(100)],
'd': [list(np.random.rand(11)) for i in range(100)],
'e': [list(np.random.rand(11)) for i in range(100)],
'f': [list(np.random.rand(11)) for i in range(100)]
}
)
Here is what the data looks like:
a b c d e f
0 [0.279347961395256, 0.07198822780319691, 0.209... [0.4733815106836531, 0.5807425586417414, 0.068... [0.9377037591435088, 0.9698329284595916, 0.241... [0.03984770879654953, 0.650429630364027, 0.875... [0.04654151678901641, 0.1959629573862498, 0.36... [0.01328000288459652, 0.10429773699794731, 0.0...
1 [0.1739544898167934, 0.5279297754363472, 0.635... [0.6464841177367048, 0.004013634850660308, 0.2... [0.0403944630279538, 0.9163938509072009, 0.350... [0.8818108296208096, 0.2910758930807579, 0.739... [0.5263032002243185, 0.3746299115677546, 0.122... [0.5511171062367501, 0.327702669239891, 0.9147...
2 [0.49678125158054476, 0.807770957943305, 0.396... [0.6218806473477556, 0.01720135741717188, 0.15... [0.6110516368605904, 0.20848099927159314, 0.51... [0.7473669581190695, 0.5107081859246958, 0.442... [0.8231961741887535, 0.9686869510163731, 0.473... [0.34358121300094313, 0.9787339533782848, 0.72...
3 [0.7672751789941814, 0.412055981587398, 0.9951... [0.8470471648467321, 0.9967427749160083, 0.818... [0.8591072331661481, 0.6279199806511635, 0.365... [0.9456189188046846, 0.5084362869897466, 0.586... [0.2685328112579779, 0.8893788305422594, 0.235... [0.029919732007230193, 0.6377951981939682, 0.1...
4 [0.21420195955828203, 0.15178914447352077, 0.9... [0.6865307542882283, 0.0620359602798356, 0.382... [0.6469510945986712, 0.676059598071864, 0.0396... [0.2320436872397288, 0.09558341089961908, 0.98... [0.7733653233006889, 0.2405189745554751, 0.016... [0.8359561624563979, 0.24335481664355396, 0.38...
... ... ... ... ... ... ...
95 [0.42373270776373506, 0.7731750012629109, 0.90... [0.9430465078763153, 0.8506292743184455, 0.567... [0.41367168515273345, 0.9040247409476362, 0.72... [0.23016875953835192, 0.8206550830081965, 0.26... [0.954233948805146, 0.995068745046983, 0.20247... [0.26269690906898413, 0.5032835345055103, 0.26...
96 [0.36114607798432685, 0.11322299769211142, 0.0... [0.729848741496316, 0.9946930423163686, 0.2265... [0.17207915211677138, 0.3270055732644267, 0.73... [0.13211243241239223, 0.28382298905995607, 0.2... [0.03915259352564071, 0.05639914089770948, 0.0... [0.12681415759423675, 0.006417761276839351, 0....
97 [0.5020186971295065, 0.04018166955309821, 0.19... [0.9082402680300308, 0.1334790715379094, 0.991... [0.7003469664104871, 0.9444397336912727, 0.113... [0.7982221018200218, 0.9097963438776192, 0.163... [0.07834894180973451, 0.7948519146738178, 0.56... [0.5833962514812425, 0.403689767723475, 0.7792...
98 [0.16413822314461857, 0.40683312270714234, 0.4... [0.07366489230864415, 0.2706766599711766, 0.71... [0.6410967759869383, 0.5780018716586993, 0.622... [0.5466463581695835, 0.4949639043264169, 0.749... [0.40235314091318986, 0.8305539205264385, 0.35... [0.009668651763079184, 0.8071825962911674, 0.0...
99 [0.8189246990381518, 0.69175150213841, 0.82687... [0.40469941577758317, 0.49004906937461257, 0.7... [0.4940080411615112, 0.33621539942693246, 0.67... [0.8637418291877355, 0.34876318713083676, 0.09... [0.3526913672876807, 0.5177762589812651, 0.746... [0.3463129199717484, 0.9694802522161138, 0.732...
100 rows × 6 columns
My code to run the regressions and store the t-stats:
rows = len(df)
cols = len(df.columns)
tstats = np.zeros(shape=(rows,cols-1))
for i in range(0,rows):
for j in range(1,cols):
lg = linregress(df.iloc[i,0],df.iloc[i,j])
tstats[i,j-1] = lg.slope/lg.stderr
The code above works just fine and is doing exactly what I need, however as I mentioned above the performance begins to slow down when the # of rows and columns in df increases substantially.
I'm hoping someone could offer advice on how to optimize my code for better performance.
Thank you!
I am newbie to this but I do optimization your original code:
by purely use python builtin list object (there is no need to use pandas and to be honest I cannot find a better way to solve your problem in pandas than you original code :D)
by using numpy, which should be (at least they claimed) faster than python builtin list.
You can jump to see the code, its in Jupyter notebook format so you need to install Jupyter first.
Conclusion
Here is the test result:
On a (100, 100) matrix containing (30,) length random lists,
the total time difference is around 1 second.
Time elapsed to run 1 times on new method is 24.282760 seconds.
Time elapsed to run 1 times on old method is 25.954801 seconds.
Refer to
test_perf
in sample code for result.
PS: During test only one thread is used, so maybe multi-thread will help to improve performance, but that's out of my ability...
Idea
I think numpy.nditer is suitable for your request, though the result of optimization is not that significant. Here is my idea:
Generate the input array
I have altered you first part of script, I think using list comprehension along is enough to build a matrix of random lists. Refer to
get_matrix_from_builtin.
Please note I have stored the random lists in another 1-element tuple to keep the shape as ndarray generate from numpy.
As a compare, you can also construct such matrix with numpy. Refer to
get_matrix_from_numpy.
Because ndarray try to boardcast list-like object (and I don't know how to stop it), I have to wrap it into a tuple to avoid auto boardcast from numpy.array constructor. If anyone have a better solution please note it, thanks :)
Calculate the result
I altered you original code using pandas.DataFrame to access element by row/col index, but it is not that way.
Pandas provides some iteration tool for DataFrame: pipe, apply, agg, and appymap, search API for more info, but it seems not suitable for your request here, as you want to obtain the current index of row and col during iteration.
I searched and found numpy.nditer can provide that needs: it return a iterator of ndarray, which have an attribution multi_index that provide the row/col pair of current element. see iterating-over-arrays
Explain on solve.ipynb
I use Jupyter Notebook to test this, you might need got one, here is the instruction of install.
I have altered your original code, which remove the request of pandas and purely used builtin list. Refer to
old_calc_tstat
in the sample code.
Also, I used numpy.nditer to calc your tstats matrix, Refer to
new_calc_tstat
in the sample code.
Then, I tested if the result of both methods are equal, I used same input array to ensure random won't affect the test. Refer to
test_equal
for result.
Finally, do the time performance. I am not patient so I only run it for one time, you may add the repeats count of test in the
test_perf function.
The code
# To add a new cell, type '# %%'
# To add a new markdown cell, type '# %% [markdown]'
# %% [markdown]
# [origin question](https://stackoverflow.com/questions/69228572/running-scipy-linregress-across-dataframe-where-each-element-is-a-list)
#
# %%
import sys
import time
import numpy as np
from scipy.stats import linregress
# %%
def get_matrix_from_builtin():
# use builtin list to construct matrix of random list
# note I put random list inside a tuple to keep it same shape
# as I later use numpy to do the same thing.
return [
[(list(np.random.rand(11)),)
for col in range(6)]
for row in range(100)
]
# %timeit get_matrix_from_builtin()
# %%
def get_matrix_from_numpy(
gen=np.random.rand,
shape=(1, 1),
nest_shape=(1, ),
):
# custom dtype for random lists
mydtype = [
('randonlist', 'f', nest_shape)
]
a = np.empty(shape, dtype=mydtype)
# [DOC] moditfying array values
# https://numpy.org/doc/stable/reference/arrays.nditer.html#modifying-array-values
# enable per operation flags 'readwrite' to modify element in ndarray
# enable global flag 'refs_ok' to allow use callable function 'gen' in iteration
with np.nditer(a, op_flags=['readwrite'], flags=['refs_ok']) as it:
for x in it:
# pack list in a 1-d turple to prevent numpy boardcast it
x[...] = (gen(nest_shape[0]), )
return a
def test_get_matrix_from_numpy():
gen = np.random.rand # generator of random list
shape = (6, 100) # shape of matrix to hold random lists
nest_shape = (11, ) # shape of random lists
return get_matrix_from_numpy(gen, shape, nest_shape)
# access a random list by a[row][col][0]
# %timeit test_get_matrix_from_numpy()
# %%
def test_get_matrix_from_numpy():
gen = np.random.rand
shape = (6, 100)
nest_shape = (11, )
return get_matrix_from_numpy(gen, shape, nest_shape)
# %%
def old_calc_tstat(a=None):
if a is None:
a = get_matrix_from_builtin()
a = np.array(a)
rows, cols = a.shape[:2]
tstats = np.zeros(shape=(rows, cols))
for i in range(0, rows):
for j in range(1, cols):
lg = linregress(a[i][0][0], a[i][j][0])
tstats[i, j-1] = lg.slope/lg.stderr
return tstats
# %%
def new_calc_tstat(a=None):
# read input metrix of random lists
if a is None:
gen = np.random.rand
shape = (6, 100)
nest_shape = (11, )
a = get_matrix_from_numpy(gen, shape, nest_shape)
# construct ndarray for t-stat result
tstats = np.empty(a.shape)
# enable global flags 'multi_index' to retrive index of current element
# [DOC] Tracking an Index or Multi-Index
# https://numpy.org/doc/stable/reference/arrays.nditer.html#tracking-an-index-or-multi-index
it = np.nditer(tstats, op_flags=['readwrite'], flags=['multi_index'])
# obtain total columns count of tstats's shape
col = tstats.shape[1]
for x in it:
i, j = it.multi_index
# trick to avoid IndexError: substract len(list) after +1 to index
j = j + 1 - col
lg = linregress(
a[i][0][0],
a[i][j][0]
)
# note: nditer ignore ZeroDivisionError by default, and return np.inf to the element
# you have to override it manually:
if lg.stderr == 0:
x[...] = 0
else:
x[...] = lg.slope / lg.stderr
return tstats
# new_calc_tstat()
# %%
def test_equal():
"""Test if the new method has equal output to old one"""
# use same input list to avoid affect of rand
a = test_get_matrix_from_numpy()
old = old_calc_tstat(a)
new = new_calc_tstat(a)
print(
"Is the shape of old and new same ?\n%s. old: %s, new: %s\n" % (
old.shape == new.shape, old.shape, new.shape),
)
res = (old == new)
print(
"Is the result object same?"
)
if res.all() == True:
print("True.")
else:
print("False. Difference(new - old) as below:\n")
print(new - old)
return old, new
old, new = test_equal()
# %%
# the only diff is the last element
# in old method it is 0
# in new method it is inf
# if you perfer the old method, just add condition in new method to override
# [new[x][99] for x in range(6)]
# %%
# python version: 3.8.8
timer = time.clock if sys.platform[:3] == 'win' else time.time
def total(func, *args, _reps=1, **kwargs):
start = timer()
for i in range(_reps):
ret = func(*args, **kwargs)
elapsed = timer() - start
return elapsed
def test_perf():
"""Test of performance"""
# first, get a larger input array
gen = np.random.rand
shape = (1000, 100)
nest_shape = (30, )
a = get_matrix_from_numpy(gen, shape, nest_shape)
# repeat how many time for each test
reps = 1
# then, time both old and new calculation method
old = total(old_calc_tstat, a, _reps=reps)
new = total(new_calc_tstat, a, _reps=reps)
msg = "Time elapsed to run %d times on %s is %f seconds."
print(msg % (reps, 'new method', new))
print(msg % (reps, 'old method', old))
test_perf()

Omnet++ / Data in a pandas cell(list) vs pandas series(column)

So I'm using Omnet++, a discrete time network simulator, to simulate different networking scenarios. At some point one can further process Omnet++ output statistics and store them in a .csv file.
The interesting thing about it is that for each time (vectime) there is a value (vecvalue). Those vectime/vecvalues are stored in a single cell of such .csv file. When imported into a Pandas Dataframe, I get something like this.
In [45]: df1[['module','vectime','vecvalue']]
Out[45]:
module vectime vecvalue
237 Tictoc13.tic[1] [2.542245319062, 3.066965320033, 4.78723506093... [0.334535581612, 0.390459633837, 0.50391696492...
249 Tictoc13.tic[4] [2.649303071938, 6.02527384362, 21.42434044990... [2.649303071938, 1.654927100273, 3.11051622577...
261 Tictoc13.tic[3] [4.28876656608, 16.104821448604, 19.5989313700... [2.245250432259, 3.201153958979, 2.39023520069...
277 Tictoc13.tic[2] [13.884917126016, 21.467263378748, 29.59962616... [0.411703261805, 0.764708518232, 0.83288346614...
289 Tictoc13.tic[5] [14.146524815409, 14.349744576545, 24.95022463... [1.732060647139, 8.66456377103, 2.275388282721...
For example, if I needed to plot each vectime/vecvalue for each module, today I'm doing the following...
%pylab
def runningAvg(x):
sigma_x = np.cumsum(x)
sigma_n = np.arange(1,x.size + 1)
return sigma_x / sigma_n
for row in df1.itertuples():
t = row.vectime
x = row.vecvalue
x = runningAvg(x)
plot(t,x)
... to obtain this ...
My question is: what's best in terms of performance:
use the data as is, meaning using those arrays inside each cell, looping over the DF to plot each array;
convert those arrays as pd.Series. In this case, what would be better to still have the module as index?
would I benefit from unnesting those arrays into pd.Series?
thanks!
Well, I've wondered around and it seems that converting Omnet data into pd.Series might not be as efficient as I thought.
These are my two methods:
1) Using Omnet data as is, lists inside Pandas DF.
figure(1)
start = datetime.datetime.now()
for row in df1.itertuples():
t = row.vectime
x = row.vecvalue
x = runningAvg(x)
plot(t,x)
total = (datetime.datetime.now() - start).total_seconds()
print(total)
When running the above, the total is 0.026571.
2) Converting Omnet data to pd.Series.
To obtain the same result, I had to transpose the series several times.
figure(2)
start = datetime.datetime.now()
t = df1.vectime
v = df1.vecvalue
t = t.apply(pd.Series)
v = v.apply(pd.Series)
t = t.T
v = v.T
sigma_v = np.cumsum(v)
sigma_n = np.arange(1,v.shape[0]+1)
sigma = sigma_v.T / sigma_n
plot(t,sigma.T)
total = (datetime.datetime.now() - start).total_seconds()
print(total)
For the later, total is 0.57266.
So it seems that I'll stick to method 1, looping over the different rows.

How to speed up Numpy array filtering/selection?

I have around 40k rows and I want to test all kinds of selection combinations on the rows. By selection I mean boolean masks. The number of masks/filters is around 250MM.
The current simplified code:
np_arr = np.random.randint(1, 40000, 40000)
results = np.empty(250000000)
filters = np.random.randint(1, size=(250000000, 40000))
for i in range(250000000):
row_selection = np_arr[filters[i].astype(np.bool_)] # Select rows based on next filter
# Performing simple calculations such as sum, prod, count on selected rows and saving to result
results[i] = row_selection.sum() # Save simple calculation result to results array
I tried Numba and Multiprocessing, but since most of the processing is in the filter selection rather than the computation, that doesn't help much.
What would be the most efficient way to solve this? Is there any way to parallelize this? As far as I see I need to loop through each filter to then individually calculate the sum, prod, count etc because I can't apply filters in parallel (even though the calculations after applying the filters are very simple).
Appreciate any suggestions on performance improvement/speedup.
To get good performane within Numba simply avoid masking and therefore very costly array copies. You have to implement the filters yourself, but that shouldn't be any problem with the filters you mentioned.
Parallelization is also very easy to do.
Example
import numpy as np
import numba as nb
max_num = 250000 #250000000
max_num2 = 4000#40000
np_arr = np.random.randint(1, max_num2, max_num2)
filters = np.random.randint(low=0,high=2, size=(max_num, max_num2)).astype(np.bool_)
#Implement your functions like this, avoid masking
#Sum Filter
#nb.njit(fastmath=True)
def sum_filter(filter,arr):
sum=0.
for i in range(filter.shape[0]):
if filter[i]==True:
sum+=arr[i]
return sum
#Implement your functions like this, avoid masking
#Prod Filter
#nb.njit(fastmath=True)
def prod_filter(filter,arr):
prod=1.
for i in range(filter.shape[0]):
if filter[i]==True:
prod*=arr[i]
return sum
#nb.njit(parallel=True)
def main_func(np_arr,filters):
results = np.empty(filters.shape[0])
for i in nb.prange(max_num):
results[i]=sum_filter(filters[i],np_arr)
#results[i]=prod_filter(filters[i],np_arr)
return results
One way to improve is to move the as_type outside the loop. In my tests it reduced the execution time by more than half.
For comparison, check the two codes below:
import numpy as np
import time
max_num = 250000 #250000000
max_num2 = 4000#40000
np_arr = np.random.randint(1, max_num2, max_num2)
results = np.empty(max_num)
filters = np.random.randint(1, size=(max_num, max_num2))
start = time.time()
for i in range(max_num):
row_selection = np_arr[filters[i].astype(np.bool_)] # Select rows based on next filter
# Performing simple calculations such as sum, prod, count on selected rows and saving to result
results[i] = row_selection.sum() # Save simple calculation result to results array
end = time.time()
print(end - start)
takes 2.12
while
import numpy as np
import time
max_num = 250000 #250000000
max_num2 = 4000#40000
np_arr = np.random.randint(1, max_num2, max_num2)
results = np.empty(max_num)
filters = np.random.randint(1, size=(max_num, max_num2)).astype(np.bool_)
start = time.time()
for i in range(max_num):
row_selection = np_arr[filters[i]] # Select rows based on next filter
# Performing simple calculations such as sum, prod, count on selected rows and saving to result
results[i] = row_selection.sum() # Save simple calculation result to results array
end = time.time()
print(end - start)
takes 0.940

Time series storage in HDF5 format

I want to store the results of time series (sensor data) into a HDF5 file. I cannot seem to be able to assign values to my dataset. Clearly, I am doing something wrong, I am just not sure what…
The code:
from datetime import datetime, timezone
import h5py
TIME_SERIES_FLOAT = np.dtype([("time", h5py.special_dtype(vlen=str)),
("value", np.float)])
h5 = h5py.File('balh.h5', "w")
dset = create_dataset('data', (1, 2), chunks=True, maxshape=(None, 2), dtype=TIME_SERIES_FLOAT)
dset[0]['time'] = datetime.now(timezone.utc).astimezone().isoformat()
dset[0]['value'] = 0.0
Then the update code resizes the dataset and adds more values. Clearly doing that per value is inefficient:
size = list(dset.shape)
size[0] += 1
dset.resize(tuple(size))
dset[size[0]-1]['time'] = datetime.now(timezone.utc).astimezone().isoformat()
dset[size[0]-1]['value'] = value
A much better method would be to collate some data into an np.array and then add that every so often…
Is this sensible?…
I need more coffee…
The defined type is a tuple containing a string (aka the time) and a float (aka the value) so to add one, I need:
dset[-1] = (datetime.now(timezone.utc).astimezone().isoformat(), value)
It is actually that simple!
Adding many entries is done this way:
l = [('stamp', x) for x in range(10)]
size = list(dset.shape)
tmp = size[0]
size[0] += len(l)
dset.resize(tuple(size))
for x in range(len(l)):
dset[tmp+x] = l[x]
Nonetheless, this feels somewhat clunky and sub-optimal…

Categories

Resources