Adding data to HDF5 Dataset - python

import numpy as np
import h5py
x1 = [0, 1, 2, 3, 4]
y1 = ['a', 'b', 'c', 'd', 'e']
z1 = [5, 6, 7, 8, 9]
namesList = ['ID', 'Name', 'Path']
ds_dt = np.dtype({'names': namesList, 'formats': ['S32'] * 3})
rec_arr = np.rec.fromarrays([x1, y1, z1], dtype=ds_dt)
test = [[], [], []]
hdf5_file = h5py.File("test.h5", "w")
structure = hdf5_file.create_group('structure')
structure.create_dataset('images', data=test, compression='gzip', maxshape=(None,3))
structure['images'].resize((structure['images'].shape[0] + rec_arr.shape[0]), axis=0)
structure['images'][-rec_arr.shape[0]:] = rec_arr
I am starting with an empty dataset and trying to add data to it. When I view the file, nothing has been added and the dataset is empty. How do I fix this?

We can add the data directly to the .h5 file when creating the new dataset. The following code worked for me to write rec_arr to the file; I also added a with statement to ensure the file is closed properly.
import numpy as np
import h5py
x1 = [0, 1, 2, 3, 4]
y1 = ['a', 'b', 'c', 'd', 'e']
z1 = [5, 6, 7, 8, 9]
namesList = ['ID', 'Name', 'Path']
ds_dt = np.dtype({'names': namesList, 'formats': ['S32'] * 3})
rec_arr = np.rec.fromarrays([x1, y1, z1], dtype=ds_dt)
with h5py.File("test.h5", "w") as hdf5_file:
    structure = hdf5_file.create_group('structure')
    structure.create_dataset('images', data=rec_arr, compression='gzip', maxshape=rec_arr.shape)

What you need is a resizable dataset. You define one with the maxshape=() parameter; None means unlimited length. The example below shows how to create a resizable dataset. It starts with the data from your question and the first answer. After the first with/as: block exits, a second with/as: block reopens the file (in append mode), extends the dataset, and adds 5 more rows of data.
Also, I modified the dtype definition used for the recarray and the resulting dataset. The previous code had all string values; I changed the 1st and 3rd columns to use integers (to match the data values). It demonstrates how to mix datatypes in a recarray. I also removed the create_group() call. Groups are not required (unless you want to use them to organize your datasets).
import numpy as np
import h5py
x1 = [0, 1, 2, 3, 4]
y1 = ['a', 'b', 'c', 'd', 'e']
z1 = [5, 6, 7, 8, 9]
namesList = ['ID', 'Name', 'Path']
ds_dt = np.dtype({'names': namesList, 'formats': [int, 'S32', int] })
rec_arr = np.rec.fromarrays([x1, y1, z1], dtype=ds_dt)
with h5py.File("test.h5", "w") as h5f:
    h5f.create_dataset('data', data=rec_arr, maxshape=(None,),
                       compression='gzip')

x2 = [i for i in range(10, 15)]
y2 = [chr(i) for i in range(102, 107)]
z2 = [i for i in range(15, 20)]
rec_arr = np.rec.fromarrays([x2, y2, z2], dtype=ds_dt)

with h5py.File("test.h5", "a") as h5f:
    ds_len = h5f['data'].shape[0]
    arr_len = rec_arr.shape[0]
    h5f['data'].resize(ds_len + arr_len, axis=0)
    # write the new rows into the extended region at the end
    h5f['data'][ds_len:ds_len + arr_len] = rec_arr
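As a quick sanity check (my addition, not part of the original answer), you can reopen the file read-only and confirm that the appended rows landed:

import h5py

with h5py.File("test.h5", "r") as h5f:
    print(h5f['data'].shape)   # -> (10,) after appending 5 rows to the original 5
    print(h5f['data'][0])      # -> (0, b'a', 5) with the mixed int/string dtype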

Related

Two constraints setting together in optimization problem

I am working on an optimization problem and am having difficulty setting up two constraints together in Python. Below, I simplify my problem to a calculation of area and volume. Only Length can be changed; the other parameters must stay the same.
Constraint 1: Maximum area should be 40000 m²
Constraint 2: Minimum volume should be 50000 m³
I can set values in the dataframe by applying the constraints one by one (below); how do I modify the code so that both constraints (1 & 2) are met at the same time?
Many thanks for your time and support!
import pandas as pd
import numpy as np
df = pd.DataFrame({'Name': ['A', 'B', 'C', 'D'],
                   'Length': [1000, 2000, 3000, 5000],
                   'Width': [5, 12, 14, 16],
                   'Depth': [15, 10, 15, 18]})
area = (df['Length'])*(df['Width'])
volume = (df['Length'])*(df['Width'])*(df['Depth'])
print(area)
print(volume)
#Width and Depth are constants, only Length can be changed
#Constraint 1: Maximum area should be 40000m2
#Calculation of length parameter by using maximum area, with other given parameters
Constraint_length_a = 40000/ df['Width']
#Constraint 2: Minimum volume should be 50000m3
#Calculation of length parameter by using minimum volume, with other given parameters
Constraint_length_v = 50000/ ((df['Width'])*(df['Depth']))
#Setting Length values considering constraint 1
df.at[0, 'Length'] = Constraint_length_a[0]
df.at[1, 'Length'] = Constraint_length_a[1]
df.at[2, 'Length'] = Constraint_length_a[2]
df.at[3, 'Length'] = Constraint_length_a[3]
#Setting Length values considering constraint 2
df.at[0, 'Length'] = Constraint_length_v[0]
df.at[1, 'Length'] = Constraint_length_v[1]
df.at[2, 'Length'] = Constraint_length_v[2]
df.at[3, 'Length'] = Constraint_length_v[3]
I believe the code below solves the problem you are facing.
If I can help any further, let me know.
import pandas as pd
import numpy as np
df = pd.DataFrame({'Name': ['A', 'B', 'C', 'D'],
                   'Length': [1000, 2000, 3000, 5000],
                   'Width': [5, 12, 14, 16],
                   'Depth': [15, 10, 15, 18]})
area = (df['Length'])*(df['Width'])
volume = (df['Length'])*(df['Width'])*(df['Depth'])
def constraint1(df, col, n):
    df.loc[:n, 'length'] = 40000 / df.loc[:n, col]
    df.drop('Length', axis=1, inplace=True)
    return df

def constraint2(df, col, col1, n):
    df.loc[:n, 'length'] = 50000 / (df.loc[:n, col] * df.loc[:n, col1])
    df.drop('Length', axis=1, inplace=True)
    return df
If you want to apply it to the whole column:
def constraint1a(df, col):
    df['length'] = 40000 / df[col]
    df.drop('Length', axis=1, inplace=True)
    return df

def constraint2a(df, col, col1):
    df['length'] = 50000 / (df[col] * df[col1])
    df.drop('Length', axis=1, inplace=True)
    return df
# pass copies so each call can drop 'Length' without breaking the next one
df1 = constraint1(df.copy(), 'Width', 3)
df2 = constraint2(df.copy(), 'Width', 'Depth', 3)
df3 = constraint1a(df.copy(), 'Width')
df4 = constraint2a(df.copy(), 'Width', 'Depth')
Adding the conditions I left out the first time
def constraint1(df, col, col1):
    l = []
    for x, w in zip(df[col], df[col1]):
        if x * w > 40000:            # area = Length * Width exceeds the cap
            l.append(40000 / w)
        else:
            l.append(x)
    df[col] = l
    return df

def constraint2(df, col, col1, col2):
    l = []
    for x, w, d in zip(df[col], df[col1], df[col2]):
        if x * w * d < 50000:        # volume = Length * Width * Depth is below the floor
            l.append(50000 / (w * d))
        else:
            l.append(x)
    df[col] = l
    return df

df1 = constraint1(df, 'Length', 'Width')
df2 = constraint2(df, 'Length', 'Width', 'Depth')
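Both constraints bound the same free variable from opposite sides: the area cap gives Length <= 40000/Width, and the volume floor gives Length >= 50000/(Width*Depth). So one way to satisfy both at once is to clip Length into that interval. The sketch below is my own illustration of that idea, not code from either post, and it assumes the interval is non-empty for every row:

import pandas as pd

df = pd.DataFrame({'Name': ['A', 'B', 'C', 'D'],
                   'Length': [1000, 2000, 3000, 5000],
                   'Width': [5, 12, 14, 16],
                   'Depth': [15, 10, 15, 18]})

max_len = 40000 / df['Width']                   # from area <= 40000
min_len = 50000 / (df['Width'] * df['Depth'])   # from volume >= 50000

# clip each Length into [min_len, max_len]; both constraints then hold
df['Length'] = df['Length'].clip(lower=min_len, upper=max_len)
print(df.assign(Area=df['Length'] * df['Width'],
                Volume=df['Length'] * df['Width'] * df['Depth']))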

converting pd.Series of strings into ndarray

I extract an array of words from a pandas column:
X = np.array(tab1['word'])
Example of X: array(['dog', 'cat'], dtype=object)
X holds 665 strings.
And then I transform each word into an ndarray of (1,270)
for i in range(len(X)):
    tmp = X[i]
    z = func(tmp)  # function that returns an ndarray of shape (1, 270)
    X[i] = z
My end goal is to get an ndarray of shape (665, 270), but instead I get this shape: (665,).
I also can't reshape it; when I try X.reshape(665, 270), I get this error:
ValueError: cannot reshape array of size 665 into shape (665,270)
The func(word) function could be any function, for example:
def func(word):
    a = np.arange(0, 270)
    a = a.reshape(1, 270)
    return a
Any ideas on why this is so?
The problem is about converting a Pandas Series of strings into a NumPy array by a transformative function that, given a string input, returns a (1, n) array.
Here is the solution:
import pandas as pd
import numpy as np

# You have a series of strings
X = pd.Series(['aaa'] * 665)

# You have a transformative func that returns a (1, n) np.array
def func(word, n=270):
    return np.zeros((1, n))

# You apply the function to the series and vertically stack the results
Xs = np.vstack(X.apply(func))

# You check for the desired shape
print(Xs.shape)
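Note that if you have already run the loop from the question, so that X is an object array whose elements are (1, 270) arrays, the same stacking idea recovers the target shape (this note is my addition, not part of the original answer):

Xs = np.vstack(X)   # object array of (1, 270) arrays -> (665, 270)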
The key lines below are:
z = list(func(tmp)) # converting returned value from func to a list
and
result = np.array([x for x in X.values])
Here is my full test code:
import numpy as np
import pandas as pd

def func(tmp):
    return np.array([t for t in tmp])

X = pd.Series({'a': 'abc', 'x': 'xyz', 'j': 'jkl', 'z': 'zzz'})
for i in range(len(X)):
    tmp = X[i]
    z = list(func(tmp))  # converting returned value from func to a list
    X[i] = z
result = np.array([x for x in X.values])
Then type result at the console; you'll see it is a (4, 3) ndarray.
In [3]: result
Out[3]:
array([['a', 'b', 'c'],
       ['x', 'y', 'z'],
       ['j', 'k', 'l'],
       ['z', 'z', 'z']], dtype='<U1')

How can I make a dataset from the elements of matrices in a dataframe?

I have a dataset of 3 parameters 'A', 'B', 'C' in a .TXT file. After printing them as 24x20 matrices, I need to collect the 1st elements of 'A', 'B', 'C' into long arrays in a pandas dataframe, then the 2nd elements of each, then the 3rd, and so on up to the 480th elements.
My data in the text file looks like this:
id_set: 000
A: -2.46882615679
B: -2.26408246559
C: -325.004619528
I already made a pandas dataframe that includes the 3 columns 'A', 'B', 'C' plus an index, and defined functions to print the 24x20 matrix the right way. A simple example via 2x2 matrices:
1st cycle: A = [[1, 2], [3, 4]]   B = [[4, 5], [6, 7]]   C = [[8, 9], [10, 11]]
2nd cycle: A = [[0, 8], [2, 5]]   B = [[1, 9], [4, 8]]   C = [[10, 1], [2, 7]]
Reshape to this form:
A(1,1), B(1,1), C(1,1), A(1,2), B(1,2), C(1,2), ...
Result = [1, 4, 8, 2, 5, 9, 3, 6, 10, 4, 7, 11]   # 1st cycle
         [0, 1, 10, 8, 9, 1, 2, 4, 2, 5, 8, 7]    # 2nd cycle
My script is the following:
import numpy as np
import pandas as pd
import os
def normalize(value, min_value, max_value, min_norm, max_norm):
    new_value = ((max_norm - min_norm) * ((value - min_value) / (max_value - min_value))) + min_norm
    return new_value
dft = pd.read_csv(r'D:\mc25.TXT', header=None)
id_set = dft[dft.index % 4 == 0].astype('int').values
A = dft[dft.index % 4 == 1].values
B = dft[dft.index % 4 == 2].values
C = dft[dft.index % 4 == 3].values
data = {'A': A[:,0], 'B': B[:,0], 'C': C[:,0]}
df = pd.DataFrame(data, columns=['A','B','C'], index = id_set[:,0])
#next iteration create all plots, change the number of cycles
cycles = int(len(df)/480)
print(cycles)
# mkdf() and print_df() are my helper functions (defined elsewhere) that
# reshape 480 values into the 24x20 matrix and pretty-print it
for cycle in range(0, 10):
    count = '{:04}'.format(cycle)
    j = cycle * 480
    for i in df:
        try:
            os.mkdir(i)
        except:
            pass
        min_val = df[i].min()
        min_nor = -1
        max_val = df[i].max()
        max_nor = 1
        ordered_data = mkdf(df.iloc[j:j+480][i])
        csv = print_df(ordered_data)
        # print .csv files containing the matrix of each parameter, named by cycle
        csv.to_csv(f'{i}/{i}{count}.csv', header=None, index=None)
        if 'C' in i:
            min_nor = -40
            max_nor = 150
            # applying normalization for C between [-40, +150]
            new_value3 = normalize(df['C'].iloc[j:j+480], min_val, max_val, -40, 150)
            df3 = print_df(mkdf(new_value3))
            df3.to_csv(f'{i}/norm{i}{count}.csv', header=None, index=None)
        else:
            # applying normalization for A, B between [-1, +1]
            new_value1 = normalize(df['A'].iloc[j:j+480], min_val, max_val, -1, 1)
            new_value2 = normalize(df['B'].iloc[j:j+480], min_val, max_val, -1, 1)
            df1 = print_df(mkdf(new_value1))
            df2 = print_df(mkdf(new_value2))
            df1.to_csv(f'{i}/norm{i}{count}.csv', header=None, index=None)
            df2.to_csv(f'{i}/norm{i}{count}.csv', header=None, index=None)
Note: I provided a dataset in a text file covering 3 cycles:
Text dataset
I am not sure if I understood your question fully, but this is a solution:
Convert your data frame to a 2D numpy array with .to_numpy() (the older as_matrix() has been removed from pandas), then use ravel() to get a vector of size 480 * 3. Then cycle over your cycles and use the vstack method to stack the rows over each other in your result. Here is code with your example data:
import numpy as np
import pandas as pd

A = [[1, 2, 3, 4], [10, 20, 30, 40]]
B = [[4, 5, 6, 7], [40, 50, 60, 70]]
C = [[8, 9, 10, 11], [80, 90, 100, 110]]
cycles = 2
for cycle in range(cycles):
    data = {'A': A[cycle], 'B': B[cycle], 'C': C[cycle]}
    df = pd.DataFrame(data)
    D = df.to_numpy().ravel()          # row-major flatten interleaves A, B, C
    if cycle == 0:
        Results = np.array(D)
    else:
        Results = np.vstack((Results, D))
# Output: Results = array([[  1,   4,   8,   2,   5,   9,   3,   6,  10,   4,   7,  11],
#                          [ 10,  40,  80,  20,  50,  90,  30,  60, 100,  40,  70, 110]])
np.savetxt("Results.csv", Results, delimiter=",")
Is this what you wanted?
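For what it is worth, the interleaving can also be done without building a DataFrame per cycle. This is an alternative sketch of mine, assuming A, B and C are already arrays of shape (cycles, elements):

import numpy as np

A = np.array([[1, 2, 3, 4], [10, 20, 30, 40]])
B = np.array([[4, 5, 6, 7], [40, 50, 60, 70]])
C = np.array([[8, 9, 10, 11], [80, 90, 100, 110]])

# stack along a new last axis, then flatten each cycle row-major, so the
# order becomes A(1), B(1), C(1), A(2), B(2), C(2), ...
results = np.stack([A, B, C], axis=2).reshape(A.shape[0], -1)
print(results)   # [[1 4 8 2 5 9 3 6 10 4 7 11], [10 40 80 ... 110]]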

Looping in Python for a beginner

I am new to coding and looking for a simple way to implement a loop in Python. Here is an example of my code! I need to define variables u, v, w, etc. from 1 through 12 to carry out my regression analysis, hence a loop would be ideal. Thanks!
import numpy as np
import pandas as pd
import statsmodels.formula.api as sm
dataset = pd.read_csv("MultipleRegression.csv")
x1 = np.append(arr = np.ones((4, 1)).astype(int), values = x1, axis = 1)
x_opt1 = x1[:, [0, 1, 2, 3, 4, 5, 6]]
regressor_OLS1 = sm.OLS(endog = y1, exog = x_opt1).fit()
regressor_OLS1.summary()
u1 = regressor_OLS1.params[1]
v1 = regressor_OLS1.params[2]
w1 = regressor_OLS1.params[3]
x1 = regressor_OLS1.params[4]
y1 = regressor_OLS1.params[5]
z1 = regressor_OLS1.params[6]
In Python you can do that without a loop, just unpack the parameters (the leading underscore skips the intercept at params[0], so u1 through z1 match params[1] through params[6]):
_, u1, v1, w1, x1, y1, z1, *rest = regressor_OLS1.params
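If you do want the loop the question asks for, a dictionary keyed by model number avoids creating 12 sets of named variables. The following is only a sketch: the synthetic placeholder data stands in for your 12 real (y, x) datasets, so those names and shapes are my assumptions, not from the original code:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
# placeholder (y, X) pairs standing in for the 12 real regressions
datasets = [(rng.normal(size=40), sm.add_constant(rng.normal(size=(40, 6))))
            for _ in range(12)]

coefs = {}
for k, (y_k, x_k) in enumerate(datasets, start=1):
    model = sm.OLS(endog=y_k, exog=x_k).fit()
    coefs[k] = model.params[1:7]   # the six slopes, skipping the intercept

# coefs[1][0] plays the role of u1, coefs[1][5] of z1, and so on up to model 12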

Putting a gap/break in a pyplot line plot without losing data

I have a time series with several large data gaps. I would like to see a connecting line between data points that are less than an hour apart, but not if the gap is larger. The accepted answer to the question, Put a gap/break in a line plot, would work except that you sacrifice the masked points. I would like to avoid that.
I have attempted to make a list comprehension that would insert NaNs into the array, I think that would automatically achieve the same result, but I don't seem to be able to do it correctly. The best I have found is as follows:
import datetime as dtm
import numpy as np
x = np.array([dtm.datetime(2001,4,3,0,47,30), dtm.datetime(2001,4,3,0,52,30),
              dtm.datetime(2001,4,3,0,57,30), dtm.datetime(2001,4,3,3,57,30),
              dtm.datetime(2001,4,3,4,2,30), dtm.datetime(2001,4,3,4,7,30)])
xmod = np.array([x[0]] + [dt0 if dt1 - dt0 < dtm.timedelta(hours=1.) else [dt0, np.nan]
                          for dt1, dt0 in zip(x[1:], x[:-1])])
This gives the result:
In [7]: xmod
Out[7]:
array([datetime.datetime(2001, 4, 3, 0, 47, 30),
       datetime.datetime(2001, 4, 3, 0, 47, 30),
       datetime.datetime(2001, 4, 3, 0, 52, 30),
       [datetime.datetime(2001, 4, 3, 0, 57, 30), nan],
       datetime.datetime(2001, 4, 3, 3, 57, 30),
       datetime.datetime(2001, 4, 3, 4, 2, 30)], dtype=object)
I have not been able to find a way to insert both the data point and the np.nan without putting brackets around them. Is this possible? Is there a better way to achieve my goal? Thanks!
In accordance with the comment above, probably the easiest way to do this would be to separate the data into groups where you need the gaps. Here is one way to implement such a thing.
import datetime as dtm
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
x = np.array([dtm.datetime(2001,4,3,0,47,30), dtm.datetime(2001,4,3,0,52,30), dtm.datetime(2001,4,3,0,57,30),
              dtm.datetime(2001,4,3,3,57,30), dtm.datetime(2001,4,3,4,2,30), dtm.datetime(2001,4,3,4,7,30)])
y = range(len(x))

# make a dataframe with groups separated that are over an hour apart
data = []
g = 0
for i in range(len(x)):
    x0 = x[i]
    y0 = y[i]
    if i < (len(x) - 1):
        x1 = x[i+1]
        td = x1 - x0
        elapsed_seconds = td.total_seconds()
        hrs = (elapsed_seconds/60)/60
        if hrs < 1:
            data.append([x0, y0, g])
        else:
            data.append([x0, y0, g])
            g += 1
    else:
        data.append([x0, y0, g])
df = pd.DataFrame(data, columns=['x', 'y', 'group'])

# draw a plot
fig, ax = plt.subplots(1, 1, figsize=(8, 5))
for i, dfg in df.groupby('group'):
    ax.plot(dfg['x'], dfg['y'], c='b')
So, I accepted the answer by djakubosky because it seems clean and is probably the right approach. However, by the time that answer was posted, I had decided that what I was doing was inappropriate for a list comprehension and simply wrote it as a for loop - and that worked fine. Possibly this will be useful to someone else. Here is the code:
def insert_breaks(x, y):
    import datetime as dtm
    import numpy as np
    xnew = []
    ynew = []
    for dt1, dt0, y1, y0 in zip(x[1:], x[:-1], y[1:], y[:-1]):
        if dt1 - dt0 < dtm.timedelta(hours=1):
            xnew += [dt0]
            ynew += [y0]
        else:
            # insert a NaN halfway across the gap to break the line
            xnew += [dt0, dt0 + (dt1 - dt0) / 2]
            ynew += [y0, np.nan]
    xnew += [dt1]
    ynew += [y1]
    return xnew, ynew
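To round this off, here is a quick usage sketch of insert_breaks with the sample timestamps from the question; the y values are invented for illustration:

import datetime as dtm
import matplotlib.pyplot as plt

x = [dtm.datetime(2001,4,3,0,47,30), dtm.datetime(2001,4,3,0,52,30),
     dtm.datetime(2001,4,3,0,57,30), dtm.datetime(2001,4,3,3,57,30),
     dtm.datetime(2001,4,3,4,2,30), dtm.datetime(2001,4,3,4,7,30)]
y = list(range(len(x)))

xnew, ynew = insert_breaks(x, y)
plt.plot(xnew, ynew, marker='o')   # the NaN midpoint breaks the line across the gap
plt.show()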
