First empty row of openpyxl sheet is [] rather than [None] - python

I made a Google spreadsheet similar to the image below and downloaded it as an .xlsx file.
As you can see, the first row of the sheet is completely empty.
import openpyxl
import numpy as np
sheet = openpyxl.load_workbook("sheet.xlsx", read_only=True)
ws = sheet['Sheet 1']
arr = np.array([[cell.value for cell in row] for row in ws.iter_rows()])
print(arr)
I expected this.
[[None None None None None]
[None 1.0 1.0 1.0 1.0]
[None 2.0 2.0 2.0 2.0]
[None 3.0 3.0 3.0 3.0]
[None 4.0 4.0 4.0 4.0]]
but it actually outputs this:
[list([]) list([None, 1.0, 1.0, 1.0, 1.0])
list([None, 2.0, 2.0, 2.0, 2.0]) list([None, 3.0, 3.0, 3.0, 3.0])
list([None, 4.0, 4.0, 4.0, 4.0])]
I tried to work around it: I added one more empty row and ran the program again, and it shows this:
[list([]) list([None, None, None, None, None])
list([None, 1.0, 1.0, 1.0, 1.0]) list([None, 2.0, 2.0, 2.0, 2.0])
list([None, 3.0, 3.0, 3.0, 3.0]) list([None, 4.0, 4.0, 4.0, 4.0])]
The first and second rows are exactly the same (completely empty), but the results differ:
FIRST_ROW = list([])
SECOND_ROW = list([None, None, None, None, None])
Why is it different?!
Finally, I found that I get the expected result when I remove the read_only parameter from load_workbook()!
But I need the read_only option because the xlsx file I actually want to load is a VERY BIG FILE.

In read-only mode, openpyxl returns exactly what it finds in the file, which in the first row is no cells at all. Because Google Sheets does not save the dimensions of the worksheet, there is no way to know how wide the rows should be. This is fine as long as you intend to loop over the rows (which is the best way to proceed), but if you want a reliable number of cells per row, you will have to set the min_col and max_col parameters of ws.iter_rows().
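For example, a minimal sketch under the question's setup (file "sheet.xlsx", columns A through E), pinning the row width so every row yields the same number of cells:

import openpyxl
import numpy as np

wb = openpyxl.load_workbook("sheet.xlsx", read_only=True)
ws = wb['Sheet 1']

# min_col/max_col force each row to span columns A..E, so the empty first
# row comes back as [None, None, None, None, None] instead of [].
arr = np.array([[cell.value for cell in row]
                for row in ws.iter_rows(min_col=1, max_col=5)])
print(arr)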

Related

Problem Fitting Residence Time Distribution Data

I am trying to fit Residence Time Distribution (RTD) data. RTD is typically a skewed distribution. I have built a simple code that takes this non-equally-spaced time data set from the RTD.
Data set:
timeArray = [0.0, 0.5, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 12.0, 14.0]
concArray = [0.0, 0.6, 1.4, 5.0, 8.0, 10.0, 8.0, 6.0, 4.0, 3.0, 2.2, 1.5, 0.6, 0.0]
To fit the data I have been using the SciPy curve_fit function
parameters, covariance = curve_fit(nCSTR, time, conc, p0=guess)
and different sets of models (e.g. CSTR, Sine, Gauss), but with no success so far.
The RTD data that I have corresponds to a CSTR, and there is an equation that models this type of behavior very accurately.
# Generalized nCSTR model
y = ((np.power(x/tau, n-1) * np.power(n, n)) / (tau * math.gamma(n))) * np.exp(-n*x/tau)
As a separate note: in the generalized nCSTR model I am using the gamma function instead of the (n-1)! factorial term, because factorials are awkward for the non-integer values of n that the fit explores.
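(For integer n the two agree; math.gamma simply extends the factorial to non-integer arguments:)

import math

print(math.gamma(5), math.factorial(4))  # 24.0 24, since gamma(n) == (n-1)!
print(math.gamma(4.34))                  # also defined for non-integer n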
This CSTR model should fit the data without problems, but for some reason it is not able to do so. My code:
timeArray = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0, 8.5, 9.0, 9.5, 10.0, 10.5, 11.0, 11.5, 12.0, 12.5, 13.0, 13.5, 14.0]
concArray = [0.0, 0.6, 1.4, 2.6, 5.0, 6.5, 8.0, 9.0, 10.0, 9.0, 8.0, 7.0, 6.0, 5.0, 4.0, 3.5, 3.0, 2.5, 2.2, 1.8, 1.5, 1.2, 1.0, 0.8, 0.6, 0.5, 0.3, 0.1, 0.0]
import math
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

# Recast time and conc into numpy arrays
time = np.asarray(timeArray)
conc = np.asarray(concArray)
plt.plot(time, conc, 'o')

def nCSTR(x, tau, n):
    y = ((np.power(x/tau, n-1) * np.power(n, n)) / (tau * math.gamma(n))) * np.exp(-n*x/tau)
    return y

guess = [1, 12]
parameters, covariance = curve_fit(nCSTR, time, conc, p0=guess)
tau = parameters[0]
n = parameters[1]

y = np.arange(0.0, len(time), 1.0)
for i in range(len(timeArray)):
    y[i] = ((np.power(time[i]/tau, n-1) * np.power(n, n)) / (tau * math.gamma(n))) * np.exp(-n*time[i]/tau)
plt.plot(time, y)
The outcome after executing it is the plot below (image: fitting output).
I know I am missing something and any help will be much appreciated. The model has been well known for decades, so the problem should not be the equation itself. I used some dummy data to confirm that the equation is written correctly, and the output was the same type of profile that I am looking for. So the equation is fine.
import numpy as np
import math
import matplotlib.pyplot as plt

t = np.arange(0.0, 10.5, 0.5)
tau = 2
n = 5
y = np.arange(0.0, len(t), 1.0)
for i in range(len(t)):
    y[i] = ((np.power(t[i]/tau, n-1) * np.power(n, n)) / (tau * math.gamma(n))) * np.exp(-n*t[i]/tau)
print(y)
plt.plot(t, y)
CSTR profile with Dummy Data (image)
If anyone is interested in the theory behind it, I recommend any reading related to Tanks in Series models (specifically the CSTR); Fogler has a great book about this topic.
I think that the main problem is that your model does not allow for an overall scale factor, or that your data may not be normalized as you expect.
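(For reference, a minimal sketch of the same fix with plain curve_fit: just add a scale argument to the model and fit it along with tau and n; time and conc are the arrays defined in the question.)

from scipy.optimize import curve_fit
from scipy.special import gamma

def nCSTR_scaled(x, scale, tau, n):
    z = n * x / tau
    return scale * np.exp(-z) * z**(n - 1) * (n / (tau * gamma(n)))

# initial guesses mirror those used in the lmfit version below
parameters, covariance = curve_fit(nCSTR_scaled, time, conc, p0=[10, 3, 10])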
If you'll permit me to convert your curve-fitting program to use lmfit (I am a lead author), you might do:
import numpy as np
from scipy.special import gamma
import matplotlib.pyplot as plt
from lmfit import Model
timeArray = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0, 8.5, 9.0, 9.5, 10.0, 10.5, 11.0, 11.5, 12.0, 12.5, 13.0, 13.5, 14.0]
concArray = [0.0, 0.6, 1.4, 2.6, 5.0, 6.5, 8.0, 9.0, 10.0, 9.0, 8.0, 7.0, 6.0, 5.0, 4.0, 3.5, 3.0, 2.5, 2.2, 1.8, 1.5, 1.2, 1.0, 0.8, 0.6, 0.5, 0.3, 0.1, 0.0]
#Recast time and conc into numpy arrays
time = np.asarray(timeArray)
conc = np.asarray(concArray)
plt.plot(time, conc, 'o', label='data')
def nCSTR(x, scale, tau, n):
    """scaled CSTR model"""
    z = n*x/tau
    return scale * np.exp(-z) * z**(n-1) * (n/(tau*gamma(n)))
# create a Model for your model function
cmodel = Model(nCSTR)
# now create a set of Parameters for your model (note that parameters
# are named using your function arguments), and give initial values
params = cmodel.make_params(tau=3, scale=10, n=10)
# since you have `xxx**(n-1)`, setting a lower bound of 1 on `n`
# is wise, otherwise you would have to handle complex values
params['n'].min = 1
# now fit the model to your `conc` data with those parameters
# (and also passing in independent variables using `x`: the argument
# name from the signature of the model function)
result = cmodel.fit(conc, params, x=time)
# print out a report of the results
print(result.fit_report())
# you do not need to construct the best fit yourself, it is in `result`:
plt.plot(time, result.best_fit, label='fit')
plt.legend()
plt.show()
This will print out a report that includes statistics and uncertainties:
[[Model]]
Model(nCSTR)
[[Fit Statistics]]
# fitting method = leastsq
# function evals = 29
# data points = 29
# variables = 3
chi-square = 2.84348862
reduced chi-square = 0.10936495
Akaike info crit = -61.3456602
Bayesian info crit = -57.2437727
R-squared = 0.98989860
[[Variables]]
scale: 49.7615649 +/- 0.81616118 (1.64%) (init = 10)
tau: 5.06327482 +/- 0.05267918 (1.04%) (init = 3)
n: 4.33771512 +/- 0.14012112 (3.23%) (init = 10)
[[Correlations]] (unreported correlations are < 0.100)
C(scale, n) = -0.521
C(scale, tau) = 0.477
C(tau, n) = -0.406
and generate a plot of the data together with the best fit (image).
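The fitted values and their uncertainties are also available programmatically on the result object (standard lmfit attributes):

print(result.best_values)           # dict of best-fit values for scale, tau, n
print(result.params['tau'].value)   # best-fit tau
print(result.params['tau'].stderr)  # its estimated standard error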

Stacked bar-chart segments intersect each other

I have the following code for the stacked bar chart:
cols = ['Bug Prediction','Traceability','Security', 'Program Generation & Repair',
'Performance Prediction','Code Similarity & Clone Detection',
'Code Navigation & Understanding', 'Other_SE']
count_ANN = [2.0,0.0,1.0,0.0,0.0,3.0,5.0,1.0]
count_CNN = [1.0,0.0,5.0,0.0,1.0,4.0,4.0,0.0]
count_RNN = [1.0,0.0,3.0,1.0,0.0,4.0,7.0,2.0]
count_LSTM =[3.0,0.0,5.0,3.0,1.0,9.0,15.0,1.0]
count_GNN = [0.0,0.0,1.0,0.0,0.0,3.0,3.0,3.0]
count_AE = [0.0,0.0,1.0,3.0,0.0,6.0,11.0,0.0]
count_AM = [2.0,0.0,1.0,4.0,1.0,4.0,15.0,1.0]
count_other =[1.0,0.0,2.0,2.0,0.0,1.0,3.0,0.0]
b_RNN = list(np.add(count_ANN,count_CNN))
b_LSTM = list(np.add(np.add(count_ANN,count_CNN),count_RNN))
b_AE = list(np.add(np.add(np.add(count_ANN,count_CNN),count_RNN),count_AE))
b_GNN = list(np.add(b_AE,count_GNN))
b_others = list(np.add(b_GNN,count_other))
plt.bar(cols,count_ANN,0.4,label = "ANN")
plt.bar(cols,count_CNN,0.4,bottom=count_ANN,label = "CNN")
plt.bar(cols,count_RNN,0.4,bottom=b_RNN,label = "RNN")
plt.bar(cols,count_LSTM,0.4,bottom =b_LSTM, label = "LSTM")
plt.bar(cols,count_AE,0.4,bottom=b_AE,label = "Auto-Encoder")
plt.bar(cols,count_GNN,0.4,bottom=b_GNN,label = "GNN")
plt.bar(cols,count_other,0.4,bottom=b_others,label = "Others")
#ax.bar(cols, count)
plt.xticks(np.arange(len(cols))+0.1,cols)
fig.autofmt_xdate()
plt.legend()
plt.show()
The output has overlapping stacks, as shown in the figure below (image).
The specific problem is that b_AE is calculated incorrectly. (Also, there is a list called count_AM for which there is no label.)
The more general problem is that calculating all these values "by hand" is very error-prone and difficult to adapt when there are changes. It helps to write things in a loop.
The magic of numpy's broadcasting and vectorization lets you initialize bottom as a single zero and then use numpy addition to accumulate the counts.
To get a somewhat neater x-axis, you can put the individual words on separate lines. Also, plt.tight_layout() tries to make sure all text fits nicely into the plot.
import matplotlib.pyplot as plt
import numpy as np
cols = ['Bug Prediction', 'Traceability', 'Security', 'Program Generation & Repair',
'Performance Prediction', 'Code Similarity & Clone Detection',
'Code Navigation & Understanding', 'Other_SE']
count_ANN = [2.0, 0.0, 1.0, 0.0, 0.0, 3.0, 5.0, 1.0]
count_CNN = [1.0, 0.0, 5.0, 0.0, 1.0, 4.0, 4.0, 0.0]
count_RNN = [1.0, 0.0, 3.0, 1.0, 0.0, 4.0, 7.0, 2.0]
count_LSTM = [3.0, 0.0, 5.0, 3.0, 1.0, 9.0, 15.0, 1.0]
count_GNN = [0.0, 0.0, 1.0, 0.0, 0.0, 3.0, 3.0, 3.0]
count_AE = [0.0, 0.0, 1.0, 3.0, 0.0, 6.0, 11.0, 0.0]
count_AM = [2.0, 0.0, 1.0, 4.0, 1.0, 4.0, 15.0, 1.0]
count_other = [1.0, 0.0, 2.0, 2.0, 0.0, 1.0, 3.0, 0.0]
all_counts = [count_ANN, count_CNN, count_RNN, count_LSTM, count_GNN, count_AE, count_AM, count_other]
all_labels = ["ANN", "CNN", "RNN", "LSTM", "GNN", "Auto-Encoder", "AM", "Others"]
cols = ["\n".join(c.split(" ")) for c in cols]
cols = [c.replace("&\n", "& ") for c in cols]
bottom = 0
for count_i, label in zip(all_counts, all_labels):
    plt.bar(cols, count_i, 0.4, bottom=bottom, label=label)
    bottom += np.array(count_i)
# plt.xticks(np.arange(len(cols)) + 0.1, cols)
plt.tick_params(axis='x', labelrotation=45, length=0)
plt.legend()
plt.tight_layout()
plt.show()
PS: To have the bars in the same order as the legend, you could draw them starting from the top:
bottom = np.sum(all_counts, axis=0)
for count_i, label in zip(all_counts, all_labels):
    bottom -= np.array(count_i)
    plt.bar(cols, count_i, 0.4, bottom=bottom, label=label)

Advanced filtering of lists

import random
import xlwings as xw
from collections import Counter
wb = xw.Book('Test.xlsx')
sheet = xw.sheets.active
SKUs = sheet.range('A2:C693').value
list_of_prob = sheet.range('D2:D693').value
list_of_prob = [float(i) for i in list_of_prob]
SKUs = random.choices(SKUs, weights = list_of_prob, k=20)
for item in zip(SKUs):
    print(item)
I coded the following program (above) which outputs an order picking list of 20 items based on their probability:
(['Item91', 10.0, 1.0],)
(['Item482', 6.0, 15.0],)
(['Item533', 8.0, 17.0],)
(['Item63', 7.0, 2.0],)
(['Item50', 5.0, 5.0],)
(['Item14', 2.0, 2.0],)
(['Item145', 1.0, 6.0],)
(['Item225', 6.0, 9.0],)
(['Item23', 3.0, 2.0],)
(['Item33', 4.0, 2.0],)
(['Item47', 5.0, 4.0],)
(['Item88', 9.0, 4.0],)
(['Item8', 1.0, 4.0],)
(['Item1', 1.0, 1.0],)
(['Item13', 2.0, 2.0],)
(['Item21', 3.0, 1.0],)
(['Item86', 9.0, 3.0],)
(['Item205', 5.0, 6.0],)
(['Item1', 1.0, 1.0],)
(['Item67', 7.0, 4.0],)
Every item has two numbers that correspond to the aisle and the slot within that aisle (in a warehouse). The aim now is to filter the list so that all duplicate aisles are removed, keeping only the furthest slot for each aisle.
Example: aisle 1 has four items to be picked. To calculate the order picker's travel time (return routing policy), I only need the location of the furthest item, which here is slot 6 of aisle 1. Thus I want to filter out all the aisle 1 duplicates and keep only ([1.0, 6.0],).
Thus for all aisles I want the following list:
From this:
([10.0, 1.0],)
([6.0, 15.0],)
([8.0, 17.0],)
([7.0, 2.0],)
([5.0, 5.0],)
([2.0, 2.0],)
([1.0, 6.0],)
([6.0, 9.0],)
([3.0, 2.0],)
([4.0, 2.0],)
([5.0, 4.0],)
([9.0, 4.0],)
([1.0, 4.0],)
([1.0, 1.0],)
([2.0, 2.0],)
([3.0, 1.0],)
([9.0, 3.0],)
([5.0, 6.0],)
([1.0, 1.0],)
([7.0, 4.0],)
To this:
([10.0, 1.0],)
([6.0, 15.0],)
([8.0, 17.0],)
([2.0, 2.0],)
([1.0, 6.0],)
([3.0, 2.0],)
([4.0, 2.0],)
([9.0, 4.0],)
([5.0, 6.0],)
([7.0, 4.0],)
I did manage to find a solution in Excel: first removing all duplicates, then looking up the maximum slot value among the remaining entries for each aisle. Is there a good way to achieve this kind of "advanced" filtering in Python?
Use a dictionary to keep track of which aisles you've seen so far and what the maximum slot number for each aisle is:
results = {}
for item, aisle, slot in SKUs:
    if results.get(aisle, 0) < slot:
        results[aisle] = slot
Note that I didn't use zip like you do in your example code, since that just wraps your data in 1-tuples that you don't need.
If you need a list of (aisle, max-slot) pairs at the end, use list(results.items()).
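For instance, applied to a shortened sample of the picks from the question, this yields one entry per aisle:

# Shortened sample of the (item, aisle, slot) rows from the question
picks = [['Item91', 10.0, 1.0], ['Item145', 1.0, 6.0],
         ['Item8', 1.0, 4.0], ['Item1', 1.0, 1.0],
         ['Item205', 5.0, 6.0], ['Item47', 5.0, 4.0]]

results = {}
for item, aisle, slot in picks:
    if results.get(aisle, 0) < slot:
        results[aisle] = slot

print(list(results.items()))  # [(10.0, 1.0), (1.0, 6.0), (5.0, 6.0)]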
Here's a pandas-based solution, as suggested by others above. It uses SQL-like grouping.
import pandas as pd
datin = [['Item91', 10.0, 1.0],
['Item482', 6.0, 15.0],
['Item533', 8.0, 17.0],
['Item63', 7.0, 2.0],
['Item50', 5.0, 5.0],
['Item14', 2.0, 2.0],
['Item145', 1.0, 6.0],
['Item225', 6.0, 9.0],
['Item23', 3.0, 2.0],
['Item33', 4.0, 2.0],
['Item47', 5.0, 4.0],
['Item88', 9.0, 4.0],
['Item8', 1.0, 4.0],
['Item1', 1.0, 1.0],
['Item13', 2.0, 2.0],
['Item21', 3.0, 1.0],
['Item86', 9.0, 3.0],
['Item205', 5.0, 6.0],
['Item1', 1.0, 1.0],
['Item67', 7.0, 4.0]]
df = pd.DataFrame(datin, columns=['Item', 'Aisle', 'Slot'])
# one row per aisle, keeping the maximum slot
print(df.groupby(by='Aisle', as_index=False)['Slot'].max().values.tolist())

How does XGBoost/LightGBM evaluate ndcg for ranking tasks?

I am currently running tests between XGBoost and LightGBM on their ability to rank items. I am reproducing the benchmarks presented here: https://github.com/guolinke/boosting_tree_benchmarks.
I have been able to successfully reproduce the benchmarks mentioned in their work. I want to make sure that I am correctly implementing my own version of the ndcg metric and also understanding the ranking problem correctly.
My questions are:
When creating the validation for the test set using ndcg: there is a test.group file that says the first X rows are group 0, etc. To get the recommendations for a group, do I take the predicted values and known relevance scores and sort that list by descending predicted value, per group?
To get the final ndcg score from the lists created above: do I compute the ndcg score per group and take the mean over all groups? Is this the same evaluation methodology that XGBoost/LightGBM use in their evaluation phase?
Here is my methodology for evaluating the test set after the model has finished training.
For the final tree when I run lightGBM I obtain these values on the validation set:
[500] valid_0's ndcg#1: 0.513221 valid_0's ndcg#3: 0.499337 valid_0's ndcg#5: 0.505188 valid_0's ndcg#10: 0.523407
My final step is to take the predicted output for the test set and calculate the ndcg values for the predictions.
Here is my python code for calculating ndcg:
import numpy as np

def dcg_at_k(r, k):
    r = np.asfarray(r)[:k]
    if r.size:
        return np.sum(np.subtract(np.power(2, r), 1) / np.log2(np.arange(2, r.size + 2)))
    return 0.

def ndcg_at_k(r, k):
    idcg = dcg_at_k(sorted(r, reverse=True), k)
    if not idcg:
        return 0.
    return dcg_at_k(r, k) / idcg
After I get the predictions for the test set for a particular group (GROUP-0) I have these predictions:
query_id predict
0 0 (2.0, -0.221681199441)
1 0 (1.0, 0.109895548348)
2 0 (1.0, 0.0262799346312)
3 0 (0.0, -0.595343431322)
4 0 (0.0, -0.52689043426)
5 0 (0.0, -0.542221350664)
6 0 (1.0, -0.448015576024)
7 0 (1.0, -0.357090949646)
8 0 (0.0, -0.279677741045)
9 0 (0.0, 0.2182200869)
NOTE
Group-0 actually has about 112 rows.
I then sort the list of tuples in descending order, which provides a list of relevance scores:
def get_recommendations(x):
    sorted_list = sorted(list(x), key=lambda i: i[1], reverse=True)
    return [k for k, _ in sorted_list]

relevance = evaluation.groupby('query_id').predict.apply(get_recommendations)
query_id
0 [4.0, 2.0, 2.0, 3.0, 2.0, 2.0, 2.0, 2.0, 2.0, ...
1 [4.0, 2.0, 2.0, 2.0, 1.0, 1.0, 3.0, 2.0, 1.0, ...
2 [2.0, 3.0, 2.0, 2.0, 1.0, 0.0, 2.0, 2.0, 1.0, ...
3 [2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, ...
4 [1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, ...
Finally, for each query id I calculate the ndcg score on the relevance list and then take the mean of all the ndcg scores:
relevance.apply(lambda x: ndcg_at_k(x, 10)).mean()
The value I obtain is ~0.497193.
Cross-posting my Cross Validated answer to this cross-posted question:
https://stats.stackexchange.com/questions/303385/how-does-xgboost-lightgbm-evaluate-ndcg-metric-for-ranking/487487#487487
I happened across this myself, and finally dug into the code to figure it out.
The difference is the handling of a missing IDCG. Your code returns 0, while LightGBM is treating that case as a 1.
The following code produced matching results for me:
import numpy as np

def dcg_at_k(r, k):
    r = np.asfarray(r)[:k]
    if r.size:
        return np.sum(np.subtract(np.power(2, r), 1) / np.log2(np.arange(2, r.size + 2)))
    return 0.

def ndcg_at_k(r, k):
    idcg = dcg_at_k(sorted(r, reverse=True), k)
    if not idcg:
        return 1.  # CHANGE THIS
    return dcg_at_k(r, k) / idcg
I think the problem is caused by queries in which all documents have the same label. In that case, both XGBoost and LightGBM will produce an ndcg of 1 for that query.
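A quick check of that edge case with the patched ndcg_at_k above: a query whose labels are all zero has an IDCG of 0, so the fallback value decides the score.

all_zero = [0.0, 0.0, 0.0, 0.0]
print(ndcg_at_k(all_zero, 10))  # 1.0 with the LightGBM-style fallback above;
                                # the original `return 0.` would give 0.0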

H5PY - How to store many 2D arrays of different dimensions

I would like to organize my collected data (from computer simulations) into an HDF5 file using Python.
I measured the positions and velocities [x, y, z, vx, vy, vz] of all atoms within a certain region of space over many time steps. The number of atoms, of course, varies from time step to time step.
A minimal example could look as follows:
[
[ [x1,y1,z1,vx1,vy1,vz1], [x2,y2,z2,vx2,vy2,vz2] ],
[ [x1,y1,z1,vx1,vy1,vz1], [x2,y2,z2,vx2,vy2,vz2], [x3,y3,z3,vx3,vy3,vz3] ]
]
(2 time steps,
first time step: 2 atoms,
second time step: 3 atoms)
My idea was to create an HDF5 dataset within Python which stores all the information. At each time step it should store a 2D array of all positions/velocities of all atoms, i.e.
dataset[0] = [ [x1,y1,z1,vx1,vy1,vz1], [x2,y2,z2,vx2,vy2,vz2] ]
dataset[1] = [ [x1,y1,z1,vx1,vy1,vz1], [x2,y2,z2,vx2,vy2,vz2], [x3,y3,z3,vx3,vy3,vz3] ].
The idea is clear, I think. However, I struggle with defining the correct data type for a data set with varying array length.
My code looks like this:
import numpy as np
import h5py

file = h5py.File('file.h5', 'w')
columnNo = 6
rowtype = np.dtype("%sfloat32" % columnNo)
dt = h5py.special_dtype(vlen=np.dtype(rowtype))
dataset = file.create_dataset("dset", (2,), dtype=dt)
print(dataset.value)
testarray = np.array([[1., 2., 3., 2., 3., 4.], [1., 2., 3., 2., 3., 4.]])
print(testarray)
dataset[0] = testarray
print(dataset[0])
This, however, does not work. When I run the script I get the error message "AttributeError: 'float' object has no attribute 'dtype'."
It seems that my defined dtype is wrong.
Does anybody see how it should be defined correctly?
Thanks very much,
Sven
The error in your case is buried, though it is clear it occurs when trying to assign the testarray to the dataset:
Traceback (most recent call last):
File "stack41465480.py", line 26, in <module>
dataset[0] = testarray
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper (/build/h5py-GhwtGD/h5py-2.6.0/h5py/_objects.c:2577)
...
File "h5py/_conv.pyx", line 712, in h5py._conv.ndarray2vlen (/build/h5py-GhwtGD/h5py-2.6.0/h5py/_conv.c:6171)
AttributeError: 'float' object has no attribute 'dtype'
I'm not skilled with special_dtype and vlen, but I was able to write numpy structured arrays to h5py.
import numpy as np
import h5py
file = h5py.File ('file.h5','w')
columnNo = 6
# rowtype = np.dtype("%sfloat32" % columnNo)
rowtype = np.dtype([('f0', '<f4',(6,))])
dt = h5py.special_dtype( vlen=np.dtype(rowtype) )
print('rowtype',rowtype)
print('dt',dt)
dataset = file.create_dataset("dset", (2,), dtype=rowtype)
print('value')
print(dataset.value[0])
arr = np.ones((2,),dtype=rowtype)
print(repr(arr))
dataset[0] = arr[0]
print(dataset.value)
testarray = np.array([([1.,2.,3.,2.,3.,4.],),([2.,3.,4.,1.,2.,3.],)], dtype=rowtype)
print(repr(testarray))
dataset[1] = testarray[1]
print(dataset.value)
print(dataset.value['f0'])
producing
1316:~/mypy$ python3 stack41465480.py
rowtype [('f0', '<f4', (6,))]
dt object
value
([0.0, 0.0, 0.0, 0.0, 0.0, 0.0],)
array([([1.0, 1.0, 1.0, 1.0, 1.0, 1.0],), ([1.0, 1.0, 1.0, 1.0, 1.0, 1.0],)],
dtype=[('f0', '<f4', (6,))])
[([1.0, 1.0, 1.0, 1.0, 1.0, 1.0],) ([0.0, 0.0, 0.0, 0.0, 0.0, 0.0],)]
array([([1.0, 2.0, 3.0, 2.0, 3.0, 4.0],), ([2.0, 3.0, 4.0, 1.0, 2.0, 3.0],)],
dtype=[('f0', '<f4', (6,))])
[([1.0, 1.0, 1.0, 1.0, 1.0, 1.0],) ([2.0, 3.0, 4.0, 1.0, 2.0, 3.0],)]
[[ 1. 1. 1. 1. 1. 1.]
[ 2. 3. 4. 1. 2. 3.]]
Thanks for the quick answer, it helped a lot. If I now simply change the data type of the data set to dtype=dt, I get what I would like to have.
Here, the Python code (for completeness):
import numpy as np
import h5py
file = h5py.File ('file.h5','w')
columnNo = 6
rowtype = np.dtype([('f0', '<f4',(6,))])
dt = h5py.special_dtype( vlen=np.dtype(rowtype) )
print('rowtype',rowtype)
print('dt',dt)
dataset = file.create_dataset("dset", (2,), dtype=dt)
# print('value')
# print(dataset.value[0])
arr = np.ones((3,),dtype=rowtype)
# print(repr(arr))
dataset[0] = arr
# print(dataset.value)
testarray = np.array([([1.,2.,3.,2.,3.,4.],),([2.,3.,4.,1.,2.,3.],)], dtype=rowtype)
# print(repr(testarray))
dataset[1] = testarray
print(dataset.value)
for i in range(2): print(dataset[i])
And the corresponding output reads
('rowtype', dtype([('f0', '<f4', (6,))]))
('dt', dtype('O'))
[ array([([1.0, 1.0, 1.0, 1.0, 1.0, 1.0],),
([1.0, 1.0, 1.0, 1.0, 1.0, 1.0],), ([1.0, 1.0, 1.0, 1.0, 1.0, 1.0],)],
dtype=[('f0', '<f4', (6,))])
array([([1.0, 2.0, 3.0, 2.0, 3.0, 4.0],), ([2.0, 3.0, 4.0, 1.0, 2.0, 3.0],)],
dtype=[('f0', '<f4', (6,))])]
[([1.0, 1.0, 1.0, 1.0, 1.0, 1.0],) ([1.0, 1.0, 1.0, 1.0, 1.0, 1.0],)
([1.0, 1.0, 1.0, 1.0, 1.0, 1.0],)]
[([1.0, 2.0, 3.0, 2.0, 3.0, 4.0],) ([2.0, 3.0, 4.0, 1.0, 2.0, 3.0],)]
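To close the loop, here is a minimal sketch of reading the variable-length rows back, assuming the file written above; each element's 'f0' field comes back as an (n_atoms, 6) array:

import h5py

with h5py.File('file.h5', 'r') as f:
    dset = f['dset']
    for i in range(len(dset)):
        step = dset[i]['f0']  # one (n_atoms, 6) float32 array per time step
        print(step.shape)     # (3, 6) for step 0, (2, 6) for step 1 above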
Just to get it right: The problem in my original code was a bad definition of my rowtype data structure, right?
Best,
Sven
