Time series storage in HDF5 format - python

I want to store the results of time series (sensor data) into a HDF5 file. I cannot seem to be able to assign values to my dataset. Clearly, I am doing something wrong, I am just not sure what…
The code:
from datetime import datetime, timezone
import h5py
import numpy as np

TIME_SERIES_FLOAT = np.dtype([("time", h5py.special_dtype(vlen=str)),
                              ("value", np.float64)])

h5 = h5py.File('balh.h5', "w")
dset = h5.create_dataset('data', (1, 2), chunks=True, maxshape=(None, 2),
                         dtype=TIME_SERIES_FLOAT)
dset[0]['time'] = datetime.now(timezone.utc).astimezone().isoformat()
dset[0]['value'] = 0.0
Then the update code resizes the dataset and adds more values. Clearly doing that per value is inefficient:
size = list(dset.shape)
size[0] += 1
dset.resize(tuple(size))
dset[size[0]-1]['time'] = datetime.now(timezone.utc).astimezone().isoformat()
dset[size[0]-1]['value'] = value
A much better method would be to collate some data into an np.array and then add that every so often…
Is this sensible?…

I need more coffee…
The defined type is a tuple containing a string (aka the time) and a float (aka the value) so to add one, I need:
dset[-1] = (datetime.now(timezone.utc).astimezone().isoformat(), value)
It is actually that simple!
Adding many entries is done this way:
l = [('stamp', x) for x in range(10)]
size = list(dset.shape)
tmp = size[0]
size[0] += len(l)
dset.resize(tuple(size))
for x in range(len(l)):
    dset[tmp+x] = l[x]
Nonetheless, this feels somewhat clunky and sub-optimal…
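In case it helps a future reader, here is a sketch of a less clunky batch append; it assumes a simpler 1-D layout (shape (0,), maxshape=(None,)) rather than the (1, 2) shape used above, and appends a whole batch with one resize and one slice assignment:
from datetime import datetime, timezone
import numpy as np
import h5py

TIME_SERIES_FLOAT = np.dtype([("time", h5py.special_dtype(vlen=str)),
                              ("value", np.float64)])

with h5py.File('batch_sketch.h5', 'w') as h5:
    # assumption: a 1-D dataset is enough for (time, value) pairs
    dset = h5.create_dataset('data', (0,), chunks=True, maxshape=(None,),
                             dtype=TIME_SERIES_FLOAT)

    # collate a batch in memory, then append it with one resize + one slice assignment
    batch = [(datetime.now(timezone.utc).astimezone().isoformat(), float(x))
             for x in range(10)]
    buf = np.array(batch, dtype=TIME_SERIES_FLOAT)
    old = dset.shape[0]
    dset.resize((old + len(buf),))
    dset[old:] = buf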

Related

Running Scipy Linregress Across Dataframe Where Each Element is a List

I am working with a Pandas dataframe where each element contains a list of values. I would like to run a regression between the lists in the first column and the lists in each subsequent column for every row in the dataframe, and store the t-stats of each regression (currently using a numpy array to store them). I am able to do this using a nested for loop that loops through each row and column, but the performance is not optimal for the amount of data I am working with.
Here is a quick sample of what I have so far:
import numpy as np
import pandas as pd
from scipy.stats import linregress
df = pd.DataFrame(
    {'a': [list(np.random.rand(11)) for i in range(100)],
     'b': [list(np.random.rand(11)) for i in range(100)],
     'c': [list(np.random.rand(11)) for i in range(100)],
     'd': [list(np.random.rand(11)) for i in range(100)],
     'e': [list(np.random.rand(11)) for i in range(100)],
     'f': [list(np.random.rand(11)) for i in range(100)]
     }
)
Here is what the data looks like:
a b c d e f
0 [0.279347961395256, 0.07198822780319691, 0.209... [0.4733815106836531, 0.5807425586417414, 0.068... [0.9377037591435088, 0.9698329284595916, 0.241... [0.03984770879654953, 0.650429630364027, 0.875... [0.04654151678901641, 0.1959629573862498, 0.36... [0.01328000288459652, 0.10429773699794731, 0.0...
1 [0.1739544898167934, 0.5279297754363472, 0.635... [0.6464841177367048, 0.004013634850660308, 0.2... [0.0403944630279538, 0.9163938509072009, 0.350... [0.8818108296208096, 0.2910758930807579, 0.739... [0.5263032002243185, 0.3746299115677546, 0.122... [0.5511171062367501, 0.327702669239891, 0.9147...
2 [0.49678125158054476, 0.807770957943305, 0.396... [0.6218806473477556, 0.01720135741717188, 0.15... [0.6110516368605904, 0.20848099927159314, 0.51... [0.7473669581190695, 0.5107081859246958, 0.442... [0.8231961741887535, 0.9686869510163731, 0.473... [0.34358121300094313, 0.9787339533782848, 0.72...
3 [0.7672751789941814, 0.412055981587398, 0.9951... [0.8470471648467321, 0.9967427749160083, 0.818... [0.8591072331661481, 0.6279199806511635, 0.365... [0.9456189188046846, 0.5084362869897466, 0.586... [0.2685328112579779, 0.8893788305422594, 0.235... [0.029919732007230193, 0.6377951981939682, 0.1...
4 [0.21420195955828203, 0.15178914447352077, 0.9... [0.6865307542882283, 0.0620359602798356, 0.382... [0.6469510945986712, 0.676059598071864, 0.0396... [0.2320436872397288, 0.09558341089961908, 0.98... [0.7733653233006889, 0.2405189745554751, 0.016... [0.8359561624563979, 0.24335481664355396, 0.38...
... ... ... ... ... ... ...
95 [0.42373270776373506, 0.7731750012629109, 0.90... [0.9430465078763153, 0.8506292743184455, 0.567... [0.41367168515273345, 0.9040247409476362, 0.72... [0.23016875953835192, 0.8206550830081965, 0.26... [0.954233948805146, 0.995068745046983, 0.20247... [0.26269690906898413, 0.5032835345055103, 0.26...
96 [0.36114607798432685, 0.11322299769211142, 0.0... [0.729848741496316, 0.9946930423163686, 0.2265... [0.17207915211677138, 0.3270055732644267, 0.73... [0.13211243241239223, 0.28382298905995607, 0.2... [0.03915259352564071, 0.05639914089770948, 0.0... [0.12681415759423675, 0.006417761276839351, 0....
97 [0.5020186971295065, 0.04018166955309821, 0.19... [0.9082402680300308, 0.1334790715379094, 0.991... [0.7003469664104871, 0.9444397336912727, 0.113... [0.7982221018200218, 0.9097963438776192, 0.163... [0.07834894180973451, 0.7948519146738178, 0.56... [0.5833962514812425, 0.403689767723475, 0.7792...
98 [0.16413822314461857, 0.40683312270714234, 0.4... [0.07366489230864415, 0.2706766599711766, 0.71... [0.6410967759869383, 0.5780018716586993, 0.622... [0.5466463581695835, 0.4949639043264169, 0.749... [0.40235314091318986, 0.8305539205264385, 0.35... [0.009668651763079184, 0.8071825962911674, 0.0...
99 [0.8189246990381518, 0.69175150213841, 0.82687... [0.40469941577758317, 0.49004906937461257, 0.7... [0.4940080411615112, 0.33621539942693246, 0.67... [0.8637418291877355, 0.34876318713083676, 0.09... [0.3526913672876807, 0.5177762589812651, 0.746... [0.3463129199717484, 0.9694802522161138, 0.732...
100 rows × 6 columns
My code to run the regressions and store the t-stats:
rows = len(df)
cols = len(df.columns)
tstats = np.zeros(shape=(rows,cols-1))
for i in range(0, rows):
    for j in range(1, cols):
        lg = linregress(df.iloc[i, 0], df.iloc[i, j])
        tstats[i, j-1] = lg.slope / lg.stderr
The code above works just fine and does exactly what I need; however, as I mentioned above, performance slows down substantially as the number of rows and columns in df grows.
I'm hoping someone could offer advice on how to optimize my code for better performance.
Thank you!
I am a newbie to this, but I did optimize your original code in two ways:
by purely using Python's built-in list object (there is no need to use pandas, and to be honest I cannot find a better way to solve your problem in pandas than your original code :D)
by using numpy, which should be (at least they claim) faster than Python's built-in list.
You can jump straight to the code; it is in Jupyter notebook format, so you need to install Jupyter first.
Conclusion
Here are the test results. On a (100, 100) matrix containing random lists of length (30,), the total time difference is around 1 second.
Time elapsed to run 1 times on new method is 24.282760 seconds.
Time elapsed to run 1 times on old method is 25.954801 seconds.
Refer to test_perf in the sample code for the result.
PS: during the test only one thread is used, so maybe multi-threading would help improve performance further, but that's beyond my ability...
Idea
I think numpy.nditer suits your request, though the resulting optimization is not that significant. Here is my idea:
Generate the input array
I have altered the first part of your script; I think a list comprehension alone is enough to build a matrix of random lists. Refer to get_matrix_from_builtin.
Please note I have wrapped each random list in a 1-element tuple to keep the same shape as the ndarray generated from numpy.
As a comparison, you can also construct such a matrix with numpy. Refer to get_matrix_from_numpy.
Because ndarray tries to broadcast list-like objects (and I don't know how to stop it), I have to wrap each list in a tuple to avoid auto-broadcasting by the numpy.array constructor. If anyone has a better solution, please note it, thanks :)
Calculate the result
Your original code uses pandas.DataFrame to access elements by row/col index; my altered version does not work that way.
Pandas provides some iteration tools for DataFrame: pipe, apply, agg, and applymap (search the API for more info), but none of them seems suitable for your request here, as you want to obtain the current row and col index during iteration.
I searched and found that numpy.nditer can provide that: it returns an iterator over the ndarray which has an attribute multi_index that provides the row/col pair of the current element; see iterating-over-arrays.
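As a minimal illustration (not part of the notebook below) of what nditer with multi_index looks like:
import numpy as np

a = np.zeros((2, 3))
with np.nditer(a, op_flags=['readwrite'], flags=['multi_index']) as it:
    for x in it:
        i, j = it.multi_index   # row/col of the current element
        x[...] = i * 10 + j     # write back through the iterator
print(a)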
Explanation of solve.ipynb
I used a Jupyter Notebook to test this; you might need to get one, here are the installation instructions.
I have altered your original code to remove the dependency on pandas and purely use built-in lists. Refer to old_calc_tstat in the sample code.
Also, I used numpy.nditer to calculate your tstats matrix. Refer to new_calc_tstat in the sample code.
Then I tested whether the results of both methods are equal; I used the same input array to ensure randomness won't affect the test. Refer to test_equal for the result.
Finally, the time performance. I am not patient, so I only ran it once; you may increase the repeat count in the test_perf function.
The code
# To add a new cell, type '# %%'
# To add a new markdown cell, type '# %% [markdown]'
# %% [markdown]
# [origin question](https://stackoverflow.com/questions/69228572/running-scipy-linregress-across-dataframe-where-each-element-is-a-list)
#
# %%
import sys
import time
import numpy as np
from scipy.stats import linregress
# %%
def get_matrix_from_builtin():
    # use builtin list to construct matrix of random list
    # note I put random list inside a tuple to keep it same shape
    # as I later use numpy to do the same thing.
    return [
        [(list(np.random.rand(11)),)
         for col in range(6)]
        for row in range(100)
    ]
# %timeit get_matrix_from_builtin()
# %%
def get_matrix_from_numpy(
    gen=np.random.rand,
    shape=(1, 1),
    nest_shape=(1, ),
):
    # custom dtype for random lists
    mydtype = [
        ('randonlist', 'f', nest_shape)
    ]
    a = np.empty(shape, dtype=mydtype)
    # [DOC] modifying array values
    # https://numpy.org/doc/stable/reference/arrays.nditer.html#modifying-array-values
    # enable per-operation flag 'readwrite' to modify elements in the ndarray
    # enable global flag 'refs_ok' to allow using the callable 'gen' in iteration
    with np.nditer(a, op_flags=['readwrite'], flags=['refs_ok']) as it:
        for x in it:
            # pack the list in a 1-d tuple to prevent numpy broadcasting it
            x[...] = (gen(nest_shape[0]), )
    return a
def test_get_matrix_from_numpy():
    gen = np.random.rand   # generator of random list
    shape = (6, 100)       # shape of matrix to hold random lists
    nest_shape = (11, )    # shape of random lists
    return get_matrix_from_numpy(gen, shape, nest_shape)
# access a random list by a[row][col][0]
# %timeit test_get_matrix_from_numpy()
# %%
def test_get_matrix_from_numpy():
    gen = np.random.rand
    shape = (6, 100)
    nest_shape = (11, )
    return get_matrix_from_numpy(gen, shape, nest_shape)
# %%
def old_calc_tstat(a=None):
    if a is None:
        a = get_matrix_from_builtin()
        a = np.array(a)
    rows, cols = a.shape[:2]
    tstats = np.zeros(shape=(rows, cols))
    for i in range(0, rows):
        for j in range(1, cols):
            lg = linregress(a[i][0][0], a[i][j][0])
            tstats[i, j-1] = lg.slope/lg.stderr
    return tstats
# %%
def new_calc_tstat(a=None):
    # read input matrix of random lists
    if a is None:
        gen = np.random.rand
        shape = (6, 100)
        nest_shape = (11, )
        a = get_matrix_from_numpy(gen, shape, nest_shape)
    # construct ndarray for t-stat result
    tstats = np.empty(a.shape)
    # enable global flag 'multi_index' to retrieve the index of the current element
    # [DOC] Tracking an Index or Multi-Index
    # https://numpy.org/doc/stable/reference/arrays.nditer.html#tracking-an-index-or-multi-index
    it = np.nditer(tstats, op_flags=['readwrite'], flags=['multi_index'])
    # obtain total column count of tstats's shape
    col = tstats.shape[1]
    for x in it:
        i, j = it.multi_index
        # trick to avoid IndexError: subtract len(list) after +1 to index
        j = j + 1 - col
        lg = linregress(
            a[i][0][0],
            a[i][j][0]
        )
        # note: nditer ignores ZeroDivisionError by default, and returns np.inf to the element
        # you have to override it manually:
        if lg.stderr == 0:
            x[...] = 0
        else:
            x[...] = lg.slope / lg.stderr
    return tstats
# new_calc_tstat()
# %%
def test_equal():
    """Test if the new method has equal output to old one"""
    # use the same input array to avoid the effect of randomness
    a = test_get_matrix_from_numpy()
    old = old_calc_tstat(a)
    new = new_calc_tstat(a)
    print(
        "Is the shape of old and new same ?\n%s. old: %s, new: %s\n" % (
            old.shape == new.shape, old.shape, new.shape),
    )
    res = (old == new)
    print(
        "Is the result object same?"
    )
    if res.all() == True:
        print("True.")
    else:
        print("False. Difference(new - old) as below:\n")
        print(new - old)
    return old, new

old, new = test_equal()
# %%
# the only diff is the last element
# in old method it is 0
# in new method it is inf
# if you prefer the old method, just add a condition in the new method to override it
# [new[x][99] for x in range(6)]
# %%
# python version: 3.8.8
timer = time.perf_counter  # time.clock was removed in Python 3.8, so use perf_counter everywhere

def total(func, *args, _reps=1, **kwargs):
    start = timer()
    for i in range(_reps):
        ret = func(*args, **kwargs)
    elapsed = timer() - start
    return elapsed

def test_perf():
    """Test of performance"""
    # first, get a larger input array
    gen = np.random.rand
    shape = (1000, 100)
    nest_shape = (30, )
    a = get_matrix_from_numpy(gen, shape, nest_shape)
    # repeat how many times for each test
    reps = 1
    # then, time both old and new calculation methods
    old = total(old_calc_tstat, a, _reps=reps)
    new = total(new_calc_tstat, a, _reps=reps)
    msg = "Time elapsed to run %d times on %s is %f seconds."
    print(msg % (reps, 'new method', new))
    print(msg % (reps, 'old method', old))

test_perf()

Adding rows in bulk to PyTables array

I have a script that collects data from an experiment and adds it to a PyTables table. The script gets data in batches (say, groups of 10). It's a little cumbersome in the code to add one row at a time via the normal method, e.g.:
data_batch = experiment.read()
last_time = time.time()
for data_row in data_batch:
    row = table.row
    row['timestamp'] = last_time
    last_time += dt
    row['column1'] = data_row[0]
    row['column2'] = data_row[1]
    row.append()
table.flush()
I would much rather do something like this:
data_batch = experiment.read()
start_index = len(table)
num_rows = len(data_batch)
table.append_n_rows(num_rows)
table.cols.timestamp[start_index:] = last_time + np.arange(num_rows) * dt
last_time += dt * num_rows
table.cols.column1[start_index:] = data_batch[:, 0]
table.cols.column2[start_index:] = data_batch[:, 1]
table.flush()
Does anyone know if there is some function that does the equivalent of table.append_n_rows? Right now, all I can do is [table.row for i in range(num_rows)], which feels hacky and inefficient.
You are on the right track. In table.append(rows), the rows argument can be any object that can be converted to a structured array. This includes: "NumPy structured arrays, lists of tuples or array records, and a string or Python buffer". (I prefer NumPy arrays because I routinely work with them. Your answer shows how to use a list of tuples.)
There is a significant performance advantage to adding data in batches instead of 1 row at a time. I ran some tests and posted them to SO a few years ago. I/O performance is primarily related to the number of batches, not the batch size. Take a look at this answer for details: pytables writes much faster than h5py
Also, if you are going to create a large table, consider setting the expectedrows parameter when you create the table. This will also improve I/O performance, and it has the side benefit of setting an appropriate chunksize.
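For reference, a minimal sketch (names and sizes are illustrative, not from the question) of passing expectedrows at table creation so PyTables can pick a sensible chunkshape:
import tables as tb

class Reading(tb.IsDescription):
    timestamp = tb.Float64Col(pos=0)
    column1 = tb.Float64Col(pos=1)
    column2 = tb.Float64Col(pos=2)

h5f = tb.open_file('expectedrows_demo.h5', mode='w')
# expectedrows is only a sizing hint; it does not limit how many rows you can append
table = h5f.create_table('/', 'readings', Reading, "sensor data",
                         expectedrows=1_000_000)
h5f.close()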
Recommended approach with your data.
data_batch = experiment.read()
last_time = time.time()
row_list = []
for data_row in data_batch:
    row_list.append( (last_time, data_row[0], data_row[1]) )
    last_time += dt
your_table.append( row_list )
your_table.flush()
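Since I mentioned preferring NumPy structured arrays over a list of tuples, here is a self-contained sketch of that variant (file name, field names, and sizes are illustrative, not from the question):
import time
import numpy as np
import tables as tb

dt = 0.1
data_batch = np.random.rand(10, 2)           # stand-in for experiment.read()
last_time = time.time()

batch_dtype = np.dtype([('timestamp', 'f8'),
                        ('column1', 'f8'),
                        ('column2', 'f8')])

h5f = tb.open_file('batch_demo.h5', 'w')
table = h5f.create_table('/', 'readings', description=batch_dtype)

# fill one structured array for the whole batch, then append it in a single call
batch = np.empty(len(data_batch), dtype=batch_dtype)
batch['timestamp'] = last_time + np.arange(len(data_batch)) * dt
batch['column1'] = data_batch[:, 0]
batch['column2'] = data_batch[:, 1]
table.append(batch)
table.flush()
h5f.close()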
There is an example in the source code
I'm going to paste it here to avoid a dead link in the future.
import tables as tb

class Particle(tb.IsDescription):
    name = tb.StringCol(16, pos=1)      # 16-character String
    lati = tb.IntCol(pos=2)             # integer
    longi = tb.IntCol(pos=3)            # integer
    pressure = tb.Float32Col(pos=4)     # float (single-precision)
    temperature = tb.FloatCol(pos=5)    # double (double-precision)

fileh = tb.open_file('test4.h5', mode='w')
table = fileh.create_table(fileh.root, 'table', Particle,
                           "A table")

# Append several rows in only one call
table.append([("Particle: 10", 10, 0, 10 * 10, 10**2),
              ("Particle: 11", 11, -1, 11 * 11, 11**2),
              ("Particle: 12", 12, -2, 12 * 12, 12**2)])
fileh.close()

Omnet++ / Data in a pandas cell(list) vs pandas series(column)

So I'm using Omnet++, a discrete event network simulator, to simulate different networking scenarios. At some point one can further process Omnet++ output statistics and store them in a .csv file.
The interesting thing about it is that for each time (vectime) there is a value (vecvalue). Those vectime/vecvalues are stored in a single cell of such .csv file. When imported into a Pandas Dataframe, I get something like this.
In [45]: df1[['module','vectime','vecvalue']]
Out[45]:
module vectime vecvalue
237 Tictoc13.tic[1] [2.542245319062, 3.066965320033, 4.78723506093... [0.334535581612, 0.390459633837, 0.50391696492...
249 Tictoc13.tic[4] [2.649303071938, 6.02527384362, 21.42434044990... [2.649303071938, 1.654927100273, 3.11051622577...
261 Tictoc13.tic[3] [4.28876656608, 16.104821448604, 19.5989313700... [2.245250432259, 3.201153958979, 2.39023520069...
277 Tictoc13.tic[2] [13.884917126016, 21.467263378748, 29.59962616... [0.411703261805, 0.764708518232, 0.83288346614...
289 Tictoc13.tic[5] [14.146524815409, 14.349744576545, 24.95022463... [1.732060647139, 8.66456377103, 2.275388282721...
For example, if I needed to plot each vectime/vecvalue for each module, today I'm doing the following...
%pylab
def runningAvg(x):
    sigma_x = np.cumsum(x)
    sigma_n = np.arange(1, x.size + 1)
    return sigma_x / sigma_n

for row in df1.itertuples():
    t = row.vectime
    x = row.vecvalue
    x = runningAvg(x)
    plot(t, x)
... to obtain this ...
My question is: what's best in terms of performance:
use the data as is, meaning using those arrays inside each cell, looping over the DF to plot each array;
convert those arrays to pd.Series. In this case, would it be better to still have the module as the index?
would I benefit from unnesting those arrays into pd.Series?
thanks!
Well, I've looked around and it seems that converting Omnet data into pd.Series might not be as efficient as I thought.
These are my two methods:
1) Using Omnet data as is, lists inside Pandas DF.
figure(1)
start = datetime.datetime.now()
for row in df1.itertuples():
    t = row.vectime
    x = row.vecvalue
    x = runningAvg(x)
    plot(t, x)
total = (datetime.datetime.now() - start).total_seconds()
print(total)
When running the above, the total is 0.026571 seconds.
2) Converting Omnet data to pd.Series.
To obtain the same result, I had to transpose the series several times.
figure(2)
start = datetime.datetime.now()
t = df1.vectime
v = df1.vecvalue
t = t.apply(pd.Series)
v = v.apply(pd.Series)
t = t.T
v = v.T
sigma_v = np.cumsum(v)
sigma_n = np.arange(1,v.shape[0]+1)
sigma = sigma_v.T / sigma_n
plot(t,sigma.T)
total = (datetime.datetime.now() - start).total_seconds()
print(total)
For the latter, the total is 0.57266 seconds.
So it seems that I'll stick to method 1, looping over the different rows.
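For completeness, here is an untimed sketch (not part of the comparison above) of what "unnesting" the per-row lists could look like with DataFrame.explode, which needs pandas >= 1.3 for multi-column explode; column names follow the question, data is made up:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'module': ['Tictoc13.tic[1]', 'Tictoc13.tic[2]'],
    'vectime': [list(np.cumsum(np.random.rand(5))) for _ in range(2)],
    'vecvalue': [list(np.random.rand(5)) for _ in range(2)],
})

# one scalar row per (time, value) sample, keyed by module
long_df = df1.explode(['vectime', 'vecvalue']).astype({'vectime': float,
                                                       'vecvalue': float})
long_df['running_avg'] = long_df.groupby('module')['vecvalue'].transform(
    lambda s: s.cumsum() / np.arange(1, len(s) + 1))
print(long_df.head())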

PyTables - big memory consumption using cols method

What is the purpose of using the cols method in PyTables? I have got a big dataset and I am interested in reading only one column from that dataset.
These two methods give me the same time, but totally different variable memory consumption:
import tables
from sys import getsizeof
f = tables.open_file(myhdf5_path, 'r')
# These two methods takes the same amount of time
x = f.root.set1[:500000]['param1']
y = f.root.set1.cols.param1[:500000]
# But totally different memory consumption:
print(getsizeof(x)) # gives me 96
print(getsizeof(y)) # gives me 2000096
They are both the same numpy array data type. Can anybody explain to me what the purpose of using the cols method is?
%time x = f.root.set1[:500000]['param1'] # gives ~7ms
%time y = f.root.set1.cols.param1[:500000] # gives also about 7ms
Your question caught my curiosity. I typically use table.read(field='name') because it complements the other table.read_ methods I use (for example: .read_where() and .read_coordinates()).
After reviewing the docs, I found at least 4 ways to read one column of table data with PyTables. You showed 2, and there are 2 more:
table.read(field='name')
table.col('name') (singular)
I ran some tests with all 4, plus 2 tests on the entire table (dataset) for additional comparisons. I called getsizeof() for all 6 objects, and the size varies based on method. Although all 4 behave the same with numpy indexing, I suspect there's a difference in the returned object. However, I'm not a PyTables developer, so this is more inference than fact. It could also be that getsizeof() interprets the object differently.
Code Below:
import tables as tb
import numpy as np
from sys import getsizeof
# Create h5 file with 1 dataset
h5f = tb.open_file('SO_55254831.h5', 'w')
mydtype = np.dtype([('param1',float),('param2',float),('param3',float)])
arr = np.array(np.arange(3.*500000.).reshape(500000,3))
recarr = np.core.records.array(arr,dtype=mydtype)
h5f.create_table('/', 'set1', obj=recarr )
# Close, then Reopen file READ ONLY
h5f.close()
h5f = tb.open_file('SO_55254831.h5', 'r')
testds_1 = h5f.root.set1
print ("\nFOR: testds_1 = h5f.root.set1")
print (testds_1.dtype)
print (testds_1.shape)
print (getsizeof(testds_1)) # gives 128
testds_2 = h5f.root.set1.read()
print ("\nFOR: testds_2 = h5f.root.set1.read()")
print (getsizeof(testds_2)) # gives 12000096
x = h5f.root.set1[:500000]['param1']
print ("\nFOR: x = h5f.root.set1[:500000]['param1']")
print(getsizeof(x)) # gives 96
print ("\nFOR: y = h5f.root.set1.cols.param1[:500000]")
y = h5f.root.set1.cols.param1[:500000]
print(getsizeof(y)) # gives 4000096
print ("\nFOR: z = h5f.root.set1.read(stop=500000,field='param1')")
z = h5f.root.set1.read(stop=500000,field='param1')
print(getsizeof(z)) # also gives 4000096
print ("\nFOR: a = h5f.root.set1.col('param1')")
a = h5f.root.set1.col('param1')
print(getsizeof(a)) # also gives 4000096
h5f.close()
Output from Above:
FOR: testds_1 = h5f.root.set1
[('param1', '<f8'), ('param2', '<f8'), ('param3', '<f8')]
(500000,)
128
FOR: testds_2 = h5f.root.set1.read()
12000096
FOR: x = h5f.root.set1[:500000]['param1']
96
FOR: y = h5f.root.set1.cols.param1[:500000]
4000096
FOR: z = h5f.root.set1.read(stop=500000,field='param1')
4000096
FOR: a = h5f.root.set1.col('param1')
4000096

Custom reading CSV files (Keyword accessible / custom structure)

I am trying to do the following:
I downloaded a csv file containing my banking transactions of the last 180 days.
I want to read in this csv file and then do some plots with the data.
For that I set up a program that reads the csv file and makes the data available through keywords.
E.g. in the csv file there is a column "Buchungstag".
I replace that with the date keyword, etc.
import numpy as np
import matplotlib.pylab as mpl
import csv
class finanz():
    def __init__(self):
        path = "/home/***/"
        self.dataFileName = path + "test.csv"
        self.data_read = open(self.dataFileName, 'r')
        self._columns = {}
        self._columns[0] = ["date", "Buchungstag", "", "S15"]
        self._columns[1] = ["value", "Umsatz", "Euro", "f8"]
        self._ident = {"Buchungstag":"date", "Umsatz in {0}":"value"}
        self.base = 1205.30
        self._readData()

    def _readData(self):
        r = csv.DictReader(self.data_read, delimiter=';')
        dtype = map(lambda x: (self._columns[x][0],self._columns[x][3]),range(len(self._columns)))
        self.data = np.recarray((2), dtype=dtype)
        desiredKeys = map(lambda x:x, self._ident.iterkeys())
        for i, x in enumerate(r):
            for k in desiredKeys:
                if k == "Umsatz in {0}":
                    v = np.float(x[k].replace(",", "."))+self.base
                else:
                    v = x[k]
                self.data[self._ident[k]][i] = v

    def getAllData(self):
        return self.data.copy()

a = finanz()
b = a.getAllData()
print type(b)
print type(b['value']),type(b['date'])
Sample data
"Buchungstag";"Wertstellung (Valuta)";"Vorgang";"Buchungstext";"Umsatz in {0}";
"02.06.2015";"02.06.2015";"Lastschrift/Belast.";"Auftraggeber: abc";"-3,75";
My question now is: why is type(b['date']) a class 'numpy.core.records.recarray' and type(b['value']) a type 'numpy.ndarray'?
And my second question would be how to "save" the date in a format that I can use with matplotlib?
The third and final question is: how can I check how many rows the csv file has (for the creation of the empty self.data array)?
Thx!
Repeating your array generation without the extra code:
In [230]: dt=np.dtype([('date', 'S15'), ('value', '<f8')])
In [231]: data=np.recarray((2,),dtype=dt)
In [232]: type(data['date'])
Out[232]: numpy.core.records.recarray
In [233]: type(data['value'])
Out[233]: numpy.ndarray
The fact that one field is returned as ndarray, and the other as recarray isn't significant. It's just how the recarray class is setup.
Now we mostly use 'structured arrays', created for example with
data1=np.empty((2,),dtype=dt)
or filled with '0s':
data1 = np.zeros((2,), dtype=dt)
# array([('', 0.0), ('', 0.0)],
#       dtype=[('date', 'S15'), ('value', '<f8')])
With this, both data1['date'] and data1['value'] are ndarray. recarray is the old version, and still compatible, but structured arrays are more consistent in their syntax and behavior. There are lots of SO questions about structured arrays, many produced by np.genfromtxt applied to csv files like yours.
I could combine this idea, plus my comment (about list appends):
def _readData(self):
    r = csv.DictReader(self.data_read, delimiter=';')
    if self._columns[0][1].endswith('tag'):
        self._columns[0][3] = 'datetime64[D]'
    dtype = map(lambda x: (self._columns[x][0],self._columns[x][3]),range(len(self._columns)))
    desiredKeys = map(lambda x:x, self._ident.iterkeys())
    data = []
    for x in r:
        aline = np.zeros((1,), dtype=dtype)
        for k in desiredKeys:
            if k == "Umsatz in {0}":
                v = np.float(x[k].replace(",", "."))+self.base
            else:
                v = x[k]
                v1 = v.split('.')
                if len(v1) == 3:  # convert date to yyyy-mm-dd format
                    v = '%s-%s-%s' % (v1[2], v1[1], v1[0])
            aline[self._ident[k]] = v
        data.append(aline)
    self.data = np.concatenate(data)
producing a b like:
array([(datetime.date(2015, 6, 2), 1201.55),
       (datetime.date(2015, 6, 2), 1201.55),
       (datetime.date(2015, 6, 2), 1201.55)],
      dtype=[('date', '<M8[D]'), ('value', '<f8')])
I believe genfromtxt collects each row as a tuple, and creates the array at the end. The docs for structured arrays show that they can be constructed from
np.array([(item1, item2), (item3, item4),...], dtype=dtype)
I chose to construct an array for each line, and concatenate them at the end because that required fewer changes to your code.
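For reference, a tiny sketch of that list-of-tuples pattern (values are made up, and this is not what the altered function above does):
import numpy as np

dt = np.dtype([('date', 'M8[D]'), ('value', 'f8')])
rows = [('2015-06-02', 1201.55), ('2015-06-03', 1198.30)]   # collected row by row
data = np.array(rows, dtype=dt)                             # one array built at the end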
I also changed that function so it converts the 'tag' column to np.datetime64 dtype. There are a number of SO questions about using that dtype. I believe it can be used in matplotlib, though I don't have experience with that.
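A minimal sketch of plotting datetime64 values with matplotlib (dates and values are made up; recent matplotlib accepts datetime64 arrays directly):
import numpy as np
import matplotlib.pyplot as plt

dates = np.array(['2015-06-02', '2015-06-03', '2015-06-04'], dtype='datetime64[D]')
values = np.array([1201.55, 1198.30, 1205.10])
plt.plot(dates, values, marker='o')
plt.gcf().autofmt_xdate()   # tilt the date labels so they do not overlap
plt.show()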
