Stacking Arrays in Numpy: Different behaviors between UNIX and Windows - python

Note: This is Python 2.7, not Py3
This is an updated attempt at an earlier question. You requested my complete code, an explanation of its content, and example output files. I'll try my best to format this well.
This code is meant to take an input file from a fluorometric "plate reader" and convert the readings to DNA concentrations and masses. It then generates an output file organized according to an 8x12 plate scheme (standard for DNA/molecular work). Rows are labeled "A, B, C,...,H" and columns are labeled simply 1 - 12.
Based on user input, arrays need to be stacked to format the output. However, when the arrays are stacked on UNIX (and either printed or written to an outfile), the values are truncated to their first character.
In other words, in Windows, if a number in the array is 247.5, it prints the full number. But in a UNIX environment (Linux/Ubuntu/MacOS), it becomes truncated to simply "2". A number that is -2.7 will print normally in Windows, but in UNIX simply prints as "-".
The complete code can be found below; note that the last chunk is the most relevant portion of the code:
#!/usr/bin/env python
Usage = """
plate_calc.py - version 1.0
Convert a series of plate fluorescence readings
to total DNA mass per sample and print them to
a tab-delimited output file.
This program can take multiple files as inputs
(separated by a space) and generates a new
output file for each input file.
NOTE:
1) Input(s) must be an exported .txt file.
2) Standards must be in columns 1 and 2, or 11
and 12.
3) The program assumes equal volumes across wells.
Usage:
plate_calc.py input.txt input2.txt input3.txt
"""
import sys
import numpy as np
if len(sys.argv)<2:
    print Usage
else:
    #First, we want to extract the values of interest into a Numpy array
    Filelist = sys.argv[1:]
    input_DNA_vol = raw_input("Volume of sample used for AccuClear reading (uL): ")
    remainder_vol = raw_input("Remaining volume per sample (uL): ")
    orientation = raw_input("Are the standards on the LEFT (col. 1 & 2), or on the RIGHT (col. 11 and 12)? ")
    orientation = orientation.lower()

    for InfileName in Filelist:
        with open(InfileName) as Infile:
            fluor_list = []
            Linenumber = 1
            for line in Infile: #this will extract the relevant information and store as a list of lists
                if Linenumber == 5:
                    line = line.strip('\n').strip('\r').strip('\t').split('\t')
                    fluor_list.append(line[1:])
                elif Linenumber > 5 and Linenumber < 13:
                    line = line.strip('\n').strip('\r').strip('\t').split('\t')
                    fluor_list.append(line)
                Linenumber += 1

            fluor_list = [map(float, x) for x in fluor_list] #converts list items from strings to floats
            fluor_array = np.asarray(fluor_list) #this takes our list of lists and converts it to a numpy array
This portion of the code (above) extracts the values of interest from an input file (obtained from the plate reader) and converts them to an array. It also takes user input to obtain information for calculations and conversions, and also to determine the columns in which standards are placed.
That last part comes into play later, when arrays are stacked - which is where the problematic behavior occurs.
#Create conditional statement, depending on where the standards are, to split the array
if orientation == "right":
    #Next, we want to average the 11th and 12th values of each of the 8 rows in our numpy array
    stds = fluor_array[:,[10,11]] #Creates a sub-array with the standard values (last two columns, (8x2))
    data = np.delete(fluor_array,(10,11),axis=1) #Creates a sub-array with the data (first 10 columns, (8x10))
elif orientation == "left":
    #Next, we want to average the 1st and 2nd values of each of the 8 rows in our numpy array
    stds = fluor_array[:,[0,1]] #Creates a sub-array with the standard values (first two columns, (8x2))
    data = np.delete(fluor_array,(0,1),axis=1) #Creates a sub-array with the data (last 10 columns, (8x10))
else:
    print "Error: answer must be 'LEFT' or 'RIGHT'"

std_av = np.mean(stds, axis=1) #creates an array of our averaged std values

#Then, we want to subtract the average value from row 1 (the BLANK) from each of the 8 averages (above)
std_av_st = std_av - std_av[0]

#Run a linear regression on the points in std_av_st against known concentration values (these data = y axis, need x axis)
x = np.array([0.00, 0.03, 0.10, 0.30, 1.00, 3.00, 10.00, 25.00])*10 #ng/uL*10 = ng/well
xi = np.vstack([x, np.zeros(len(x))]).T #creates new array of (x, 0) values (for the regression only); also ensures a zero-intercept (when we use (x, 1) values, the y-intercept is not forced to be zero, and the slope is slightly inflated)
m, c = np.linalg.lstsq(xi, std_av_st)[0] # m = slope for future calculations

#Now we want to subtract the average value from row 1 of std_av (the average BLANK value) from all data points in "data"
data_minus_blank = data - std_av[0]

#Now we want to divide each number in our "data" array by the value "m" derived above (to get total ng/well for each sample; y/m = x)
ng_per_well = data_minus_blank/m

#Now we need to account for the volume of sample put in to the AccuClear reading to calculate ng/uL
ng_per_microliter = ng_per_well/float(input_DNA_vol)

#Next, we multiply those values by the volume of DNA sample (variable "ng")
ng_total = ng_per_microliter*float(remainder_vol)

#Set number of decimal places to 1
ng_per_microliter = np.around(ng_per_microliter, decimals=1)
ng_total = np.around(ng_total, decimals=1)
The above code performs the necessary calculations to figure out the concentration (ng/uL) and total mass (ng) of DNA in a given sample based on a linear regression of the DNA "standards," which can either be in columns 1 and 2 (user input = "left") or in columns 11 and 12 (user input = "right").
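As a side note on the zero-intercept trick: the zero column in xi only exists so that lstsq fits nothing but the slope. A minimal sketch (not part of the original script, with made-up y values) shows the equivalent single-column form:

import numpy as np

x = np.array([0.00, 0.03, 0.10, 0.30, 1.00, 3.00, 10.00, 25.00]) * 10
y = 2.5 * x   # hypothetical blank-corrected standard readings

# A one-column design matrix forces the fit through the origin:
# the only fitted parameter is the slope m, the intercept is fixed at 0.
m = np.linalg.lstsq(x[:, None], y)[0][0]
print(m)  # ~2.5 for these made-up values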
#Create a row array (values A-H), and a filler array ('-') to add to existing arrays
col = [i for i in range(1,13)]
row = np.asarray(['A','B','C','D','E','F','G','H'])
filler = np.array(['-','-','-','-','-','-','-','-','-','-','-','-','-','-','-','-',]).reshape((8,2))
The above code creates arrays to be stacked with the original array. The "filler" array is placed based on the user input of "right" or "left" (the stacking command, np.c_[ ], is seen below).
#Create output
Outfile = open('Total_DNA_{0}'.format(InfileName),"w")

Outfile.write("DNA concentration (ng/uL):\n\n")
Outfile.write("\t"+"\t".join([str(n) for n in col])+"\n")

if orientation == "left": #Add filler to left, then add row to the left of filler
    ng_per_microliter = np.c_[filler,ng_per_microliter]
    ng_per_microliter = np.c_[row,ng_per_microliter]
    Outfile.write("\n".join(["\t".join([n for n in item]) for item in ng_per_microliter.tolist()])+"\n\n")
elif orientation == "right": #Add rows to the left, and filler to the right
    ng_per_microliter = np.c_[row,ng_per_microliter]
    ng_per_microliter = np.c_[ng_per_microliter,filler]
    Outfile.write("\n".join(["\t".join([n for n in item]) for item in ng_per_microliter.tolist()])+"\n\n")

Outfile.write("Total mass of DNA per sample (ng):\n\n")
Outfile.write("\t"+"\t".join([str(n) for n in col])+"\n")

if orientation == "left":
    ng_total = np.c_[filler,ng_total]
    ng_total = np.c_[row,ng_total]
    Outfile.write("\n".join(["\t".join([n for n in item]) for item in ng_total.tolist()]))
elif orientation == "right":
    ng_total = np.c_[row,ng_total]
    ng_total = np.c_[ng_total,filler]
    Outfile.write("\n".join(["\t".join([n for n in item]) for item in ng_total.tolist()]))

Outfile.close() #note the parentheses; Outfile.close alone does not actually close the file
Finally, we have the generation of the output file. This is where the problematic behavior occurs.
Using a simple print command, I found that the stacking command numpy.c_[ ] is the culprit (NOT the array writing command).
So it appears that numpy.c_[ ] does not truncate these numbers in Windows, but will limit those numbers to the first character in a UNIX environment.
What are some alternatives that might work on both platforms? If none exists, I don't mind making a UNIX-specific script.
Thank you all for your help and your patience. Sorry for not providing all of the necessary information earlier.
The images are screenshots showing proper output from Windows and what I end up getting in UNIX (I tried to format these for you...but they were a nightmare). I have also included a screenshot of the output obtained in the terminal when I simply print the arrays "ng_per_microliter" and "ng_total."

Using a simple print command, I found that the stacking command numpy.c_[ ] is the culprit (NOT the array writing command).
So it appears that numpy.c_[ ] does not truncate these numbers in Windows, but will limit those numbers to the first character in a UNIX environment.
Let's illustrate these statements with simple examples; np.c_[] should not be doing anything different.
In Py3, where the default string type is unicode, and with numpy 1.12:
In [149]: col = [i for i in range(1,13)]
...: row = np.asarray(['A','B','C','D','E','F','G','H'])
...: filler = np.array(['-','-','-','-','-','-','-','-','-','-','-','-','-','-','-','-',]).reshape((8,2))
...:
In [150]: col
Out[150]: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
In [151]: "\t"+"\t".join([str(n) for n in col])+"\n"
Out[151]: '\t1\t2\t3\t4\t5\t6\t7\t8\t9\t10\t11\t12\n'
In [152]: filler
Out[152]:
array([['-', '-'],
...
['-', '-'],
['-', '-']],
dtype='<U1')
In [153]: row
Out[153]:
array(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'],
dtype='<U1')
In [154]: row.shape
Out[154]: (8,)
In [155]: filler.shape
Out[155]: (8, 2)
In [159]: ng_per_microliter=np.arange(8.)+1.23
In [160]: np.c_[filler,ng_per_microliter]
Out[160]:
array([['-', '-', '1.23'],
['-', '-', '2.23'],
['-', '-', '3.23'],
...
['-', '-', '8.23']],
dtype='<U32')
In [161]: np.c_[row,ng_per_microliter]
Out[161]:
array([['A', '1.23'],
['B', '2.23'],
['C', '3.23'],
....
['H', '8.23']],
dtype='<U32')
It is possible that with earlier numpy versions a concatenate of a U1 (or S1 in Py2) array with numeric values leaves the dtype at U1. In my example they have been expanded to U32.
So if you suspect np.c_, display the result of those stacking calls (with repr if needed)
print(repr(np.c_[row,ng_per_microliter]))
and track the dtype.
From the v1.12 release notes (possibly earlier):
The astype method now returns an error if the string dtype to cast to is not long enough in “safe” casting mode to hold the max value of integer/float array that is being casted. Previously the casting was allowed even if the result was truncated.
This might come into play when doing concatenate.
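For what it's worth, the reported symptom can be reproduced by forcing floats into a 1-character string dtype, which is what a dtype-preserving concatenate would amount to (a minimal sketch, not the original code):

import numpy as np

vals = np.array([247.5, -2.7])
print(vals.astype('S1'))    # truncated to the first character: '2' and '-', as reported
print(vals.astype('S32'))   # a wide enough string dtype keeps the full numbers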

With the help of user hpaulj, I've figured out that this isn't an issue of different behavior between operating systems and environments; it's more than likely due to users having different versions of numpy.
Concatenating the arrays automatically converted the 'float64' dtype to 'S1' (to match the "filler" array ('-') and the "row" array ('A', 'B', etc.)).
Newer versions of numpy - specifically, v 1.12.X - seem to allow the concatenation of arrays without this automatic conversion.
I'm still not sure of a way around this issue in older versions of numpy, but it should be a simple matter to advise folks to upgrade their version for full performance. :)
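One possible workaround for older versions (untested on them, so treat it as a sketch): format the numbers into strings yourself before stacking, so np.c_ only ever joins string arrays and has nothing to truncate. The names below mirror the script's variables but use made-up values:

import numpy as np

row = np.asarray(['A', 'B', 'C'])
vals = np.array([[247.5, -2.7], [10.0, 3.3], [0.5, 12.1]])   # hypothetical values

# Pre-format the floats; the stacked result is then a plain string array
# whose width comes from the formatted strings, not from the numeric dtype.
vals_str = np.array([['%.1f' % v for v in line] for line in vals])
print(np.c_[row, vals_str])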

Related

Pandas: modify column value according to a pre-defined requirement

Below is part of my data; currently there's a requirement to change old_data to the required form. Just use the data below as an example.
df = pd.DataFrame({'old_data':['12-13A:A','12-13A:B','12-13A:C','12-13A:G','39-40:A','39-40:B','39-40:G','13A-19:A','13A-19:B',
'13A-19:C','13A-19:D','13A-19:E','13A-19:F','13A-19:G']})
The pre-defined rule is that each group's range covers 2 (like 39-40), 3 (like 12-13A), or 6 (like 13A-19) bed numbers. And if the last digit of a number is 4, we need to change it to the previous number with an 'A' appended: for example, 14 becomes 13A, and 23A means 24. If the old_data is 33-35:B, the required data shall be Bed 33A.
I'd appreciate any ideas on how to derive the required_data column from the old_data column with Pandas. Thanks.
Essentially your data is range:alphabet_index.
Some helper functions; I will switch between your 'no-four-in-the-last-digit' numbering system and the normal integer system:
import re

def to_number_system(s):
    return int(re.sub('3A$', '4', s))

def to_no_four_system(n):
    return 'Bed ' + re.sub('4$', '3A', str(n))
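For a quick sanity check of the two helpers above (not part of the original answer), assuming they are defined as shown:

print(to_number_system('13A'))   # 14
print(to_no_four_system(14))     # Bed 13A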
The following function maps your alphabetical indices to the Bed numbers generated by the range, or G to the range itself.
def do_the_job(df):
    _range = df['_range'].iloc[0]
    range_start, range_end = map(to_number_system, _range.split('-'))
    numbers = map(to_no_four_system, range(range_start, range_end+1))
    return df['index'].map(dict(zip('ABCDEF', numbers), G=_range))

df[['_range', 'index']] = df['old_data'].str.split(':', expand=True)
df['required_data'] = df.groupby('_range').apply(do_the_job).droplevel(0)
Take care of the formatting
df.drop(columns=['_range', 'index'])

Running Scipy Linregress Across Dataframe Where Each Element is a List

I am working with a Pandas dataframe where each element contains a list of values. I would like to run a regression between the lists in the first column and the lists in each subsequent column for every row in the dataframe, and store the t-stats of each regression (currently using a numpy array to store them). I am able to do this using a nested for loop that loops through each row and column, but the performance is not optimal for the amount of data I am working with.
Here is a quick sample of what I have so far:
import numpy as np
import pandas as pd
from scipy.stats import linregress
df = pd.DataFrame(
    {'a': [list(np.random.rand(11)) for i in range(100)],
     'b': [list(np.random.rand(11)) for i in range(100)],
     'c': [list(np.random.rand(11)) for i in range(100)],
     'd': [list(np.random.rand(11)) for i in range(100)],
     'e': [list(np.random.rand(11)) for i in range(100)],
     'f': [list(np.random.rand(11)) for i in range(100)]
    }
)
Here is what the data looks like:
a b c d e f
0 [0.279347961395256, 0.07198822780319691, 0.209... [0.4733815106836531, 0.5807425586417414, 0.068... [0.9377037591435088, 0.9698329284595916, 0.241... [0.03984770879654953, 0.650429630364027, 0.875... [0.04654151678901641, 0.1959629573862498, 0.36... [0.01328000288459652, 0.10429773699794731, 0.0...
1 [0.1739544898167934, 0.5279297754363472, 0.635... [0.6464841177367048, 0.004013634850660308, 0.2... [0.0403944630279538, 0.9163938509072009, 0.350... [0.8818108296208096, 0.2910758930807579, 0.739... [0.5263032002243185, 0.3746299115677546, 0.122... [0.5511171062367501, 0.327702669239891, 0.9147...
2 [0.49678125158054476, 0.807770957943305, 0.396... [0.6218806473477556, 0.01720135741717188, 0.15... [0.6110516368605904, 0.20848099927159314, 0.51... [0.7473669581190695, 0.5107081859246958, 0.442... [0.8231961741887535, 0.9686869510163731, 0.473... [0.34358121300094313, 0.9787339533782848, 0.72...
3 [0.7672751789941814, 0.412055981587398, 0.9951... [0.8470471648467321, 0.9967427749160083, 0.818... [0.8591072331661481, 0.6279199806511635, 0.365... [0.9456189188046846, 0.5084362869897466, 0.586... [0.2685328112579779, 0.8893788305422594, 0.235... [0.029919732007230193, 0.6377951981939682, 0.1...
4 [0.21420195955828203, 0.15178914447352077, 0.9... [0.6865307542882283, 0.0620359602798356, 0.382... [0.6469510945986712, 0.676059598071864, 0.0396... [0.2320436872397288, 0.09558341089961908, 0.98... [0.7733653233006889, 0.2405189745554751, 0.016... [0.8359561624563979, 0.24335481664355396, 0.38...
... ... ... ... ... ... ...
95 [0.42373270776373506, 0.7731750012629109, 0.90... [0.9430465078763153, 0.8506292743184455, 0.567... [0.41367168515273345, 0.9040247409476362, 0.72... [0.23016875953835192, 0.8206550830081965, 0.26... [0.954233948805146, 0.995068745046983, 0.20247... [0.26269690906898413, 0.5032835345055103, 0.26...
96 [0.36114607798432685, 0.11322299769211142, 0.0... [0.729848741496316, 0.9946930423163686, 0.2265... [0.17207915211677138, 0.3270055732644267, 0.73... [0.13211243241239223, 0.28382298905995607, 0.2... [0.03915259352564071, 0.05639914089770948, 0.0... [0.12681415759423675, 0.006417761276839351, 0....
97 [0.5020186971295065, 0.04018166955309821, 0.19... [0.9082402680300308, 0.1334790715379094, 0.991... [0.7003469664104871, 0.9444397336912727, 0.113... [0.7982221018200218, 0.9097963438776192, 0.163... [0.07834894180973451, 0.7948519146738178, 0.56... [0.5833962514812425, 0.403689767723475, 0.7792...
98 [0.16413822314461857, 0.40683312270714234, 0.4... [0.07366489230864415, 0.2706766599711766, 0.71... [0.6410967759869383, 0.5780018716586993, 0.622... [0.5466463581695835, 0.4949639043264169, 0.749... [0.40235314091318986, 0.8305539205264385, 0.35... [0.009668651763079184, 0.8071825962911674, 0.0...
99 [0.8189246990381518, 0.69175150213841, 0.82687... [0.40469941577758317, 0.49004906937461257, 0.7... [0.4940080411615112, 0.33621539942693246, 0.67... [0.8637418291877355, 0.34876318713083676, 0.09... [0.3526913672876807, 0.5177762589812651, 0.746... [0.3463129199717484, 0.9694802522161138, 0.732...
100 rows × 6 columns
My code to run the regressions and store the t-stats:
rows = len(df)
cols = len(df.columns)
tstats = np.zeros(shape=(rows,cols-1))

for i in range(0,rows):
    for j in range(1,cols):
        lg = linregress(df.iloc[i,0],df.iloc[i,j])
        tstats[i,j-1] = lg.slope/lg.stderr
The code above works just fine and is doing exactly what I need; however, as I mentioned above, the performance begins to slow down when the number of rows and columns in df increases substantially.
I'm hoping someone could offer advice on how to optimize my code for better performance.
Thank you!
I am a newbie to this, but I did optimize your original code:
by purely using Python's builtin list object (there is no need to use pandas, and to be honest I cannot find a better way to solve your problem in pandas than your original code :D)
by using numpy, which should be (at least they claim) faster than Python's builtin list.
You can jump ahead to see the code; it's in Jupyter notebook format, so you need to install Jupyter first.
Conclusion
Here is the test result:
On a (100, 100) matrix containing (30,) length random lists,
the total time difference is around 1 second.
Time elapsed to run 1 times on new method is 24.282760 seconds.
Time elapsed to run 1 times on old method is 25.954801 seconds.
Refer to test_perf in the sample code for the result.
PS: during the test only one thread is used, so maybe multi-threading would help to improve performance, but that's beyond my ability...
Idea
I think numpy.nditer is suitable for your request, though the result of optimization is not that significant. Here is my idea:
Generate the input array
I have altered the first part of your script; I think a list comprehension alone is enough to build a matrix of random lists. Refer to get_matrix_from_builtin. Please note I have stored each random list inside a 1-element tuple to keep the same shape as the ndarray generated by numpy.
As a comparison, you can also construct such a matrix with numpy. Refer to get_matrix_from_numpy. Because ndarray tries to broadcast list-like objects (and I don't know how to stop it), I have to wrap each list in a tuple to avoid auto-broadcasting by the numpy.array constructor. If anyone has a better solution please note it, thanks :)
Calculate the result
Your original code used pandas.DataFrame to access elements by row/col index, but it does not have to be that way.
Pandas provides some iteration tools for DataFrame: pipe, apply, agg, and applymap; search the API for more info, but none of them seem suitable for your request here, as you want to obtain the current row and column index during iteration.
I searched and found that numpy.nditer can provide exactly that: it returns an iterator over an ndarray which has an attribute multi_index giving the row/col pair of the current element. See iterating-over-arrays.
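As a minimal illustration of that attribute (not part of the notebook below), multi_index reports the current (row, col) pair during iteration:

import numpy as np

a = np.zeros((2, 3))
it = np.nditer(a, flags=['multi_index'])
for _ in it:
    print(it.multi_index)   # (0, 0), (0, 1), ..., (1, 2)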
Explanation of solve.ipynb
I used Jupyter Notebook to test this; you might need to get it, here are the installation instructions.
I have altered your original code to remove the need for pandas and purely use builtin lists. Refer to old_calc_tstat in the sample code.
Also, I used numpy.nditer to calculate your tstats matrix; refer to new_calc_tstat in the sample code.
Then, I tested whether the results of both methods are equal; I used the same input array so that randomness won't affect the test. Refer to test_equal for the result.
Finally, the time performance. I am not patient, so I only ran it once; you may increase the repeat count in the test_perf function.
The code
# To add a new cell, type '# %%'
# To add a new markdown cell, type '# %% [markdown]'
# %% [markdown]
# [origin question](https://stackoverflow.com/questions/69228572/running-scipy-linregress-across-dataframe-where-each-element-is-a-list)
#
# %%
import sys
import time
import numpy as np
from scipy.stats import linregress
# %%
def get_matrix_from_builtin():
    # use builtin list to construct matrix of random list
    # note I put random list inside a tuple to keep it same shape
    # as I later use numpy to do the same thing.
    return [
        [(list(np.random.rand(11)),)
         for col in range(6)]
        for row in range(100)
    ]

# %timeit get_matrix_from_builtin()
# %%
def get_matrix_from_numpy(
    gen=np.random.rand,
    shape=(1, 1),
    nest_shape=(1, ),
):
    # custom dtype for random lists
    mydtype = [
        ('randonlist', 'f', nest_shape)
    ]
    a = np.empty(shape, dtype=mydtype)
    # [DOC] modifying array values
    # https://numpy.org/doc/stable/reference/arrays.nditer.html#modifying-array-values
    # enable per-operation flag 'readwrite' to modify elements in the ndarray
    # enable global flag 'refs_ok' to allow using the callable 'gen' in the iteration
    with np.nditer(a, op_flags=['readwrite'], flags=['refs_ok']) as it:
        for x in it:
            # pack the list in a 1-d tuple to prevent numpy broadcasting it
            x[...] = (gen(nest_shape[0]), )
    return a


def test_get_matrix_from_numpy():
    gen = np.random.rand    # generator of random list
    shape = (6, 100)        # shape of matrix to hold random lists
    nest_shape = (11, )     # shape of random lists
    return get_matrix_from_numpy(gen, shape, nest_shape)
# access a random list by a[row][col][0]

# %timeit test_get_matrix_from_numpy()
# %%
def test_get_matrix_from_numpy():
    gen = np.random.rand
    shape = (6, 100)
    nest_shape = (11, )
    return get_matrix_from_numpy(gen, shape, nest_shape)
# %%
def old_calc_tstat(a=None):
    if a is None:
        a = get_matrix_from_builtin()
        a = np.array(a)
    rows, cols = a.shape[:2]
    tstats = np.zeros(shape=(rows, cols))
    for i in range(0, rows):
        for j in range(1, cols):
            lg = linregress(a[i][0][0], a[i][j][0])
            tstats[i, j-1] = lg.slope/lg.stderr
    return tstats
# %%
def new_calc_tstat(a=None):
    # read input matrix of random lists
    if a is None:
        gen = np.random.rand
        shape = (6, 100)
        nest_shape = (11, )
        a = get_matrix_from_numpy(gen, shape, nest_shape)
    # construct ndarray for t-stat result
    tstats = np.empty(a.shape)
    # enable global flag 'multi_index' to retrieve the index of the current element
    # [DOC] Tracking an Index or Multi-Index
    # https://numpy.org/doc/stable/reference/arrays.nditer.html#tracking-an-index-or-multi-index
    it = np.nditer(tstats, op_flags=['readwrite'], flags=['multi_index'])
    # obtain total column count of tstats's shape
    col = tstats.shape[1]
    for x in it:
        i, j = it.multi_index
        # trick to avoid IndexError: subtract len(list) after +1 to index
        j = j + 1 - col
        lg = linregress(
            a[i][0][0],
            a[i][j][0]
        )
        # note: nditer ignores ZeroDivisionError by default and returns np.inf for the element
        # you have to override it manually:
        if lg.stderr == 0:
            x[...] = 0
        else:
            x[...] = lg.slope / lg.stderr
    return tstats

# new_calc_tstat()
# %%
def test_equal():
    """Test if the new method has equal output to old one"""
    # use same input list to avoid affect of rand
    a = test_get_matrix_from_numpy()
    old = old_calc_tstat(a)
    new = new_calc_tstat(a)
    print(
        "Is the shape of old and new same ?\n%s. old: %s, new: %s\n" % (
            old.shape == new.shape, old.shape, new.shape),
    )
    res = (old == new)
    print(
        "Is the result object same?"
    )
    if res.all() == True:
        print("True.")
    else:
        print("False. Difference(new - old) as below:\n")
        print(new - old)
    return old, new

old, new = test_equal()
# %%
# the only diff is the last element
# in old method it is 0
# in new method it is inf
# if you prefer the old method, just add a condition in the new method to override
# [new[x][99] for x in range(6)]
# %%
# python version: 3.8.8
timer = time.clock if sys.platform[:3] == 'win' else time.time
def total(func, *args, _reps=1, **kwargs):
    start = timer()
    for i in range(_reps):
        ret = func(*args, **kwargs)
    elapsed = timer() - start
    return elapsed
def test_perf():
    """Test of performance"""
    # first, get a larger input array
    gen = np.random.rand
    shape = (1000, 100)
    nest_shape = (30, )
    a = get_matrix_from_numpy(gen, shape, nest_shape)
    # repeat how many times for each test
    reps = 1
    # then, time both old and new calculation method
    old = total(old_calc_tstat, a, _reps=reps)
    new = total(new_calc_tstat, a, _reps=reps)
    msg = "Time elapsed to run %d times on %s is %f seconds."
    print(msg % (reps, 'new method', new))
    print(msg % (reps, 'old method', old))

test_perf()

Compute mean and standard deviation for HDF5 data

I am currently running 100 simulations that compute 1M values per simulation (i.e. per episode/iteration there is one value).
Main Routine
My main file looks like this:
# Defining the test simulation environment
def test_simulation():
    environment = environment(
        periods = 1000000,
        parameter_x = ...,
        parameter_y = ...,
    )

    # Defining the simulation
    environment.simulation()

# Run the simulation 100 times
for i in range(100):
    print(f'--- Iteration {i} ---')
    test_simulation()
The simulation procedure is as follows: Within game() I generate a value_history that is continuously appended:
def simulation(self):
    for episode in range(periods):
        value = doSomething()
        self.value_history.append(value)
Hence, as a result, for each episode/iteration, I compute one value that is an array, e.g. [1.4 1.9] (player 1 having 1.4 and player 2 having 1.9 in the current episode/iteration).
Storing of Simulation Data
To store the data, I use the approach proposed in Append simulation data using HDF5, which works perfectly fine.
After running the simulations, I receive the following Keys structure:
Keys: <KeysViewHDF5 ['data_000', 'data_001', 'data_002', ..., 'data_100']>
Computing Statistics for Files
Now, the goal is to compute averages and standard deviations for each value in the 100 data files that I run, which means that, in the end, I would have a final_data set consisting of 1M averages and 1M standard deviations (one average and one standard deviation for each row (for each player) across the 100 simulations).
The goal would thus be to get something like the following structure [average_player1, average_player2], [std_player1, std_player2]:
episode == 1: [1.5, 1.5], [0.1, 0.2]
episode == 2: [1.4, 1.6], [0.2, 0.3]
...
episode == 1000000: [1.7, 1.6], [0.1, 0.3]
I currently use the following code to extract the data storing it into an empty list:
import h5py  # needed to read the simulation file

def ExtractSimData(name, simulation_runs, length):
    # Create empty list
    result = []
    # Call the simulation run file
    filename = f"runs/{length}/{name}_simulation_runs2.h5"
    with h5py.File(filename, "r") as hf:
        # List all groups
        print("Keys: %s" % hf.keys())
        for i in range(simulation_runs):
            a_group_key = list(hf.keys())[i]
            data = list(hf[a_group_key])
            for element in data:
                result.append(element)
The data structure of result looks something like this:
[array([1.9, 1.7]), array([1.4, 1.9]), array([1.6, 1.5]), ...]
First Attempt to Compute Means
I tried to use the following code to come up with a mean score for the first element (the array consists of two elements since there are two players in the simulation):
mean_result = [np.mean(k) for k in zip(*list(result))]
However, this computes the average of each element in the array across the whole list since I appended each data set to the empty list. My goal, however, would be to compute an average/standard deviation across the 100 data sets defined above (i.e. one value is the average/standard deviation across all 100 data sets).
Is there any way to efficiently accomplish this?
This calculates mean and standard deviation of episode/player values across multiple datasets in 1 file. I think it's what you want to do. If not, I can modify as needed. (Note: I created a small pseudo-data HDF5 file to replicate what you describe. For completeness, that code is at the end of this post.)
Outline of steps in the procedure summarized below (after opening the file):
1. Get basic size info from the file: the dataset count and the number of dataset rows.
2. Use the values above to size the arrays for player 1 and 2 values (variables p1_arr and p2_arr). shape[0] is the episode (row) count, and shape[1] is the simulation (dataset) count.
3. Loop over all datasets. I used hf.keys() (which iterates over the dataset names). You could also iterate over the names in the list ds_names created earlier (I created it to simplify the size calculations in step 2). The enumerate() counter i is used to index the episode values for each simulation into the correct column of each player array.
4. To get the mean and standard deviation for each row, use the np.mean() and np.std() functions with the axis=1 parameter. That calculates the mean across each row of simulation results.
5. Next, load the data into the result dataset. I created 2 datasets (same data, different dtypes) as described below:
   a. The 'final_data' dataset is a simple float array of shape=(# of episodes, 4), where you need to know what value is in each column. (I suggest adding an attribute to document this.)
   b. The 'final_data_named' dataset uses a NumPy recarray so you can name the fields (columns). It has shape=(# of episodes,). You access each column by name.
A note on statistics: calculations are sensitive to the sum() operator's behavior over the range of values. If your data is well defined, the NumPy functions are appropriate. I investigated this a few years ago. See this discussion for all the details: when to use numpy vs statistics modules
Code to read and calculate statistics below.
import h5py
import numpy as np

def ExtractSimData(name, simulation_runs, length):
    # Call the simulation run file
    filename = f"runs/{length}/{name}simulation_runs2.h5"
    with h5py.File(filename, "a") as hf:
        # List all dataset names
        ds_names = list(hf.keys())
        print(f'Dataset names (keys): {ds_names}')

        # Create empty arrays for player1 and player2 episode values
        sim_cnt = len(ds_names)
        print(f'# of simulation runs (dataset count) = {sim_cnt}')
        ep_cnt = hf[ ds_names[0] ].shape[0]
        print(f'# of episodes (rows) in each dataset = {ep_cnt}')
        p1_arr = np.empty((ep_cnt,sim_cnt))
        p2_arr = np.empty((ep_cnt,sim_cnt))

        for i, ds in enumerate(hf.keys()):  # each dataset is 1 simulation
            p1_arr[:,i] = hf[ds][:,0]
            p2_arr[:,i] = hf[ds][:,1]

        ds1 = hf.create_dataset('final_data', shape=(ep_cnt,4),
                                compression='gzip', chunks=True)
        ds1[:,0] = np.mean(p1_arr, axis=1)
        ds1[:,1] = np.std(p1_arr, axis=1)
        ds1[:,2] = np.mean(p2_arr, axis=1)
        ds1[:,3] = np.std(p2_arr, axis=1)

        dt = np.dtype([ ('average_player1',float), ('average_player2',float),
                        ('std_player1',float), ('std_player2',float) ] )
        ds2 = hf.create_dataset('final_data_named', shape=(ep_cnt,), dtype=dt,
                                compression='gzip', chunks=True)
        ds2['average_player1'] = np.mean(p1_arr, axis=1)
        ds2['std_player1'] = np.std(p1_arr, axis=1)
        ds2['average_player2'] = np.mean(p2_arr, axis=1)
        ds2['std_player2'] = np.std(p2_arr, axis=1)
### main ###
simulation_runs = 10
length='01'
name='test_'
ExtractSimData(name, simulation_runs, length)
Code to create pseudo-data HDF5 file below.
import h5py
import numpy as np

# Create some pseudo-test data
def test_simulation(i):
    players = 2
    periods = 1000
    # Define the simulation with some random data
    val_hist = np.random.random(periods*players).reshape(periods,players)

    if i == 0:
        mode = 'w'
    else:
        mode = 'a'

    # Save simulation data (unique datasets)
    with h5py.File('runs/01/test_simulation_runs2.h5', mode) as hf:
        hf.create_dataset(f'data_{i:03}', data=val_hist,
                          compression='gzip', chunks=True)

# Run the simulation N times
simulations = 10
for i in range(simulations):
    print(f'--- Iteration {i} ---')
    test_simulation(i)

Trying to loop through multiple arrays and getting error: ValueError: cannot reshape array of size 2 into shape (44,1)

New to for loops and I cannot seem to get this one to work. I have multiple arrays that I want to run through my code. It works for individual arrays, but when I try to run it through a list of arrays it tries to join the arrays together.
I have tried Pandas looping and multiple attempts at looping in numpy.
Min regret matrix
for i in [a],[b],[c],[d],[e]:
    # sum columns and rows:
    suma0 = np.sum(a,axis=0)
    suma1 = np.sum(a,axis=1)
    # find the minimum values for rows and columns:
    col_min = np.min(a)
    col_min0 = data.min(0)
    row_min = np.min(a[:44])
    row_min0 = data.min(1)
    # difference or least regret between scenarios and policies:
    p = np.array(a)
    q = np.min(p,axis=0)
    r = np.min(p,axis=1)
    cidx = np.argmin(p,axis=0)
    ridx = np.argmin(p,axis=1)
    cdif = p-q
    rdif = p-r[:,None]
    # find the sum of the rows and columns for the difference arrays:
    sumc = np.sum(cdif,axis=0)
    sumr = np.sum(rdif,axis=1)
    sumr1 = np.reshape(sumr,(44,1))
    # append the scenario array with the column sums:
    sumcol = np.zeros((45,10))
    sumcol = np.append([cdif],[sumc])
    sumcol.shape = (45,10)
    # rank columns:
    order0 = sumc.argsort()
    rank0 = order0.argsort()
    rankcol = np.zeros((46,10))
    rankcol = np.append([sumcol],[rank0])
    rankcol.shape = (46,10)
    # append the policy array with row sums:
    sumrow = np.zeros((44,11))
    sumrow = np.hstack((rdif,sumr1))
    # rank rows:
    order1 = sumr.argsort()
    rank1 = order1.argsort()
    rank1r = np.reshape(rank1,(44,1))
    rankrow = np.zeros((44,12))
    rankrow = np.hstack((sumrow,rank1r))
    print(sumrow)
    print(rankrow)
    # Add row and column headers for least regret for df0:
    RCP = np.zeros((47,11))
    RCP = pd.DataFrame(rankcol, columns=column_names1, index=row_names1)
    print(RCP)
    # Add row and column headers for least regret for df1:
    RCP1 = np.zeros((45,13))
    RCP1 = pd.DataFrame(rankrow, columns=column_names2, index=row_names2)
    print(RCP1)
    # Export loops to CSV in output folder:
    filepath = os.path.join(output_path, 'out_'+str(index)+'.csv')
    RCP.to_csv(filepath)
    filepath = os.path.join(output_path, 'out1_'+str(index)+'.csv')
    RCP1.to_csv(filepath)
As per your question, please include the input, the expected output, and the error; here is a base-case example.
x = np.random.randn(2)
x.shape = (2,)
and if we attempt for :
x.reshape(44,1)
The error we get is:
ValueError: cannot reshape array of size 2 into shape (44,1)
The reason for this error is simple: we are trying to reshape an array of size 2 into a shape that requires 44 elements. As per the error highlighted, please check the dimensions of the input and the expected output.
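For illustration, reshape only succeeds when the total element count matches the target shape; a minimal sketch:

import numpy as np

x = np.random.randn(44)          # 44 elements
print(x.reshape(44, 1).shape)    # (44, 1) -- sizes match, so this works
print(x.size == 44 * 1)          # True: reshape requires matching sizes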

How to fix Index Error when trying to match catalogs using astropy coordinates

I’m trying to match catalog sources using the astropy coordinates package, and I have three different files of data. My ultimate goal is for all three files to contain data for the same exact sources.
I thought I had figured it out because when I compare the files F435W.csv and F550M.csv, the code seems to work and both end up with the same number of sources, just like I wanted. When I use the same exact code to compare my third file F625W.csv to F435W.csv, I get an IndexError: index 6442 is out of bounds for axis 0 with size 6348. Why would I be getting this error with one file but not the other?
There is a significant difference in the number of sources across the three files, which is what I think would normally cause this error, but it doesn't make sense that the code works with the first two files (which have different array lengths) yet fails with the third file (which has a different array length from the first two).
import numpy as np
my_csv1 = np.genfromtxt('./F435W.csv', delimiter=',', dtype=float)
ra1, dec1 = my_csv1[:, 12], my_csv1[:, 13]
my_csv2 = np.genfromtxt('./F625W.csv', delimiter=',', dtype=float)
ra2, dec2 = my_csv2[:, 12], my_csv2[:, 13]
from astropy.coordinates import SkyCoord
from astropy import units as u
from astropy.coordinates import match_coordinates_sky
c = SkyCoord(ra1, dec1, frame='icrs', unit='deg')
catalog = SkyCoord(ra2, dec2, frame='icrs', unit='deg')
max_sep = 2.0*u.arcsec
idx, sep, _ = c.match_to_catalog_sky(catalog)
sep_constraint = idx[sep < max_sep]
c_matches = c[sep_constraint]
catalog_matches = c[idx[sep_constraint]]
print (len(c_matches), len(catalog_matches))
When using F435W.csv and F550M.csv my code outputs array lengths 4703 4703, so they are the same length, same number of sources. When I change F550M.csv to F625.csv I get the IndexError. According to the error information, it seems to be coming from line 19 c_matches = c[sep_constraint]
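For reference, the pattern in the astropy documentation applies the separation cut as a boolean mask over c and uses idx to index catalog (not c), which may be where the IndexError comes from, since idx holds indices into catalog. A minimal sketch with made-up coordinates:

import numpy as np
from astropy import units as u
from astropy.coordinates import SkyCoord

# Hypothetical coordinates standing in for the two catalogs
c = SkyCoord(ra=np.random.uniform(0, 1, 100)*u.deg,
             dec=np.random.uniform(0, 1, 100)*u.deg, frame='icrs')
catalog = SkyCoord(ra=np.random.uniform(0, 1, 80)*u.deg,
                   dec=np.random.uniform(0, 1, 80)*u.deg, frame='icrs')

max_sep = 2.0 * u.arcsec
idx, sep, _ = c.match_to_catalog_sky(catalog)

sep_constraint = sep < max_sep                   # boolean mask over c, not idx[...]
c_matches = c[sep_constraint]                    # sources in c with a close match
catalog_matches = catalog[idx[sep_constraint]]   # their counterparts in catalog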
