Writing multiple separate numpy arrays to a single comma-delimited text file - python

I would like to take multiple numpy arrays and write them to a text file that is comma delimited. Here is the example of my original data and the final data that I am trying to produce:
array([[1., 3., 0., 1.],
       [2., 5., 3., 1.]] ...
and so forth, for multiple different arrays, each with four columns. I can get an output txt file using write(), but I can't get the data into the format shown below:
1., 3., 0., 1.
2., 5., 3., 1.
Also, I need the 0th column to be integers and the 1st through 3rd columns to be floating point.
Cheers.

How about this?
data = array([[1., 3., 0., 1.],
              [2., 5., 3., 1.]] ...
with open('output.csv', 'w') as f:
    for x in data:
        f.write('%d,%f,%f,%f\n' % tuple(x))
This outputs
1,3.000000,0.000000,1.000000
2,5.000000,3.000000,1.000000
You can adjust the precision of the floating point output by changing %f to %.2f if you want two decimal places, for instance.
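Alternatively, numpy's own savetxt accepts one format specifier per column, which covers the integer-plus-floats requirement in a single call; a minimal sketch (the two-decimal float format is just an example):
import numpy as np

data = np.array([[1., 3., 0., 1.],
                 [2., 5., 3., 1.]])

# One format per column: integer for the 0th, floats for the rest.
np.savetxt('output.csv', data, fmt=['%d', '%.2f', '%.2f', '%.2f'], delimiter=',')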

I would recommend using pandas for this:
import numpy
import pandas

data = numpy.array([[1., 3., 0., 1.],
                    [2., 5., 3., 1.]])
data = pandas.DataFrame(data, columns=['a', 'b', 'c', 'd'])
data['a'] = data['a'].astype(int)
data.to_csv('outfile.csv')
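Note that to_csv also writes the row index as an extra first column by default; if you only want the four data values per line, pass index=False (and header=False if you don't want the column names either):
data.to_csv('outfile.csv', index=False, header=False)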

Related

Numpy: select along axis of 3d array using 2d array

I've been struggling with this for a few hours and can't quite get my head around it. The setup is something like this:
A.shape # (T,N,K)
B.shape # (L,K) L < N
Each of the K columns of the 2D B array indexes one of the N columns along that same K row. I can grab along any specific k slice easily via
A[:,B[:,k],k].shape # (T,L)
However, looping over K isn't ideal because A is a very large matrix.
I'm sure someone has a really simple answer, but I am stumped.
Edit: I should also add that I need to preserve the 3D structure of the A matrix. I figured out how to grab the individual values, but only in a (TxLxK,) array.
You can use np.take_along_axis
np.take_along_axis(A,B[None,...],axis=1)
For example,
A = np.linspace(1,24,24).reshape(3,4,2)
B = np.repeat([[0,1]],3,axis=0)
np.take_along_axis(A,B[None,...],axis=1)
the result is
array([[[ 1.,  4.],
        [ 1.,  4.],
        [ 1.,  4.]],

       [[ 9., 12.],
        [ 9., 12.],
        [ 9., 12.]],

       [[17., 20.],
        [17., 20.],
        [17., 20.]]])
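Applied to the shapes in the question, B only needs a leading axis so it broadcasts against A's first (T) axis, and the 3D structure is preserved; a quick sketch with made-up sizes:
import numpy as np

T, N, K, L = 5, 6, 4, 3                      # made-up sizes with L < N
A = np.random.rand(T, N, K)
B = np.random.randint(0, N, size=(L, K))

# out[t, l, k] == A[t, B[l, k], k]
out = np.take_along_axis(A, B[None, :, :], axis=1)
print(out.shape)                             # (5, 3, 4), i.e. (T, L, K)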

How can I change multiple values at once in pandas dataframe, using arrays as indices that vary in length?

I want to change a number of values in my pandas dataframe, where the index arrays indicating the columns may vary in length.
I need something that is faster than a for-loop, because it will be done on a lot of rows, and this turned out to be too slow.
As a simple example, consider this
df = pd.DataFrame(np.zeros((5,5)))
Now, I want to change some of the values in this dataframe to 1. If, for example, I want to change the values in the second and fifth rows for the first two columns, but in the fourth row I want to change all the values, I want something like this to work:
col_indices = np.array([np.arange(2),np.arange(5),np.arange(2)])
row_indices = np.array([1,3,4])
df.loc[row_indices, col_indices] = 1
However, this does not work (I suspect that it does not work because the shape of the data you would select does not conform to a dataframe).
Is there any more flexible way of indexing without having to loop over rows etc.?
A solution that works only for range-like arrays (as above) would also solve my current problem, but a general answer would be even nicer.
Thanks for any help!
IIUC, here's one approach. Define the column indices as the number of leading columns you want filled with 1s in each row, along with the rows where you want to insert them:
col_indices = np.array([2,5,2])
row_indices = np.array([1,3,4])
arr = df.values
And use advanced indexing to set the cells of interest to 1:
arr[row_indices] = np.arange(arr.shape[1]) < col_indices[:, None]
array([[0., 0., 0., 0., 0.],
       [1., 1., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [1., 1., 1., 1., 1.],
       [1., 1., 0., 0., 0.]])
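Depending on the pandas version, df.values may be a copy rather than a view of the DataFrame's data, so to be safe write the modified array back explicitly; a short sketch:
import pandas as pd

df = pd.DataFrame(arr, index=df.index, columns=df.columns)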

Concatenate Numpy arrays with least memory

Now I have a 50GB dataset saved as an h5py file, which acts like a dictionary inside. The dictionary contains keys from 0 to n, and the values are numpy ndarrays (3-dimensional) which all have the same shape. For example:
dictionary[0] = np.array([[[...],[...]]...])
I want to concatenate all these arrays, with code like
sample = np.concatenate(list(dictionary.values))
this operation wastes 100GB of memory! If I use
del dictionary
it will decrease to 50GB of memory. But I want to keep the memory usage at 50GB while loading the data. Another way I tried is this:
sample = np.concatenate(sample,dictionary[key])
It still uses 100GB of memory. I think in all the cases above, the right-hand side creates a new memory block, which is then assigned to the left-hand side, doubling the memory during the calculation. Thus, the third way I tried is this:
sample = np.empty(shape)
with h5py.File(...) as dictionary:
    for key in dictionary.keys():
        sample[key] = dictionary[key]
I think this code has an advantage: the value dictionary[key] is assigned to a row of sample, and then the memory used by dictionary[key] should be cleared. However, I tested it and found that the memory usage is also 100GB. Why?
Are there any good methods to limit the memory usage to 50GB?
Your problem is that you need to have 2 copies of the same data in memory.
If you build the array as in test1 you'll need far less memory at once, but at the cost of losing the dictionary.
import numpy as np
import time

def test1(n):
    a = {x: (x, x, x) for x in range(n)}  # Build sample data
    # Pop each entry as it is consumed, so the dict shrinks while the array is built
    b = np.array([a.pop(i) for i in range(n)]).reshape(-1)
    return b

def test2(n):
    a = {x: (x, x, x) for x in range(n)}  # Build sample data
    b = np.concatenate(list(a.values()))
    return b

x1 = test1(1000000)
del x1
time.sleep(1)
x2 = test2(1000000)
Results:
test1 : 0.71 s
test2 : 1.39 s
The first memory peak is for test1; it's not exactly in-place, but it reduces the peak memory usage quite a bit.
dictionary[key] is a dataset on the file. dictionary[key][...] will be a numpy array, that dataset downloaded.
I imagine
sample[key] = dictionary[key]
is evaluated as
sample[key,...] = dictionary[key][...]
The dataset is downloaded, and then copied to a slice of the sample array. That downloaded array should be free for recycling. But whether numpy/python does that is another matter. I'm not in the habit of pushing memory limits.
You don't want to do the incremental concatenate - that's slow. A single concatenate on the list should be faster. I don't know for sure what
list(dictionary.values)
contains. Will it be references to the datasets, or downloaded arrays? Regardless, concatenate(...) on that list will have to use the downloaded arrays.
One thing puzzles me - how can you use the same key to index the first dimension of sample and dataset in dictionary? h5py keys are supposed to be strings, not integers.
Some testing
Note that I'm using string dataset names:
In [21]: d = f.create_dataset('0',data=np.zeros((2,3)))
In [22]: d = f.create_dataset('1',data=np.zeros((2,3)))
In [23]: d = f.create_dataset('2',data=np.ones((2,3)))
In [24]: d = f.create_dataset('3',data=np.arange(6.).reshape(2,3))
Your np.concatenate(list(dictionary.values)) code is missing ():
In [25]: f.values
Out[25]: <bound method MappingHDF5.values of <HDF5 file "test.hf" (mode r+)>>
In [26]: f.values()
Out[26]: ValuesViewHDF5(<HDF5 file "test.hf" (mode r+)>)
In [27]: list(f.values())
Out[27]:
[<HDF5 dataset "0": shape (2, 3), type "<f8">,
<HDF5 dataset "1": shape (2, 3), type "<f8">,
<HDF5 dataset "2": shape (2, 3), type "<f8">,
<HDF5 dataset "3": shape (2, 3), type "<f8">]
So it's just a list of the datasets. The downloading occurs when concatenate does a np.asarray(a) for each element of the list:
In [28]: np.concatenate(list(f.values()))
Out[28]:
array([[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[1., 1., 1.],
[1., 1., 1.],
[0., 1., 2.],
[3., 4., 5.]])
e.g.:
In [29]: [np.array(a) for a in f.values()]
Out[29]:
[array([[0., 0., 0.],
[0., 0., 0.]]), array([[0., 0., 0.],
[0., 0., 0.]]), array([[1., 1., 1.],
[1., 1., 1.]]), array([[0., 1., 2.],
[3., 4., 5.]])]
In [30]: [a[...] for a in f.values()]
....
Let's look at what happens when using your iteration approach:
Make an array that can take one dataset for each 'row':
In [34]: samples = np.zeros((4,2,3),float)
In [35]: for i,d in enumerate(f.values()):
...: v = d[...]
...: print(v.__array_interface__['data']) # databuffer location
...: samples[i,...] = v
...:
(27845184, False)
(27815504, False)
(27845184, False)
(27815504, False)
In [36]: samples
Out[36]:
array([[[0., 0., 0.],
[0., 0., 0.]],
[[0., 0., 0.],
[0., 0., 0.]],
[[1., 1., 1.],
[1., 1., 1.]],
[[0., 1., 2.],
[3., 4., 5.]]])
In this small example, it recycled every other databuffer block. The 2nd iteration frees up the databuffer used in the first, which can then be reused in the 3rd, and so on.
These are small arrays in an interactive ipython session. I don't know whether these observations apply in large cases.
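If the temporary array created by dictionary[key][...] is what pushes memory past 50GB, h5py's Dataset.read_direct can copy a dataset straight into a slice of a pre-allocated destination array, skipping the intermediate numpy array. A minimal sketch, assuming string keys '0'..'n-1' (the file name is made up) and datasets that all share the same 3D shape, concatenated along the first axis:
import h5py
import numpy as np

with h5py.File('data.h5', 'r') as f:             # hypothetical file name
    keys = sorted(f.keys(), key=int)
    d0, d1, d2 = f[keys[0]].shape                # shape shared by all datasets
    sample = np.empty((len(keys) * d0, d1, d2), dtype=f[keys[0]].dtype)
    for i, k in enumerate(keys):
        # Read each dataset directly into its block of `sample`,
        # without building a separate temporary array first.
        f[k].read_direct(sample, dest_sel=np.s_[i * d0:(i + 1) * d0])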

Python indexing for central differencing

I have a question about Python indexing: I am trying to use central differencing to estimate dU from an array U. I do this by initialising dU as an array of nan of the same length as U, and then applying central differencing, dU(i) = (U(i+1) - U(i-1))/2, to the central elements. The output dU array currently gives me two nan entries at the end of the vector. Can anyone explain why the second-to-last element isn't being updated?
import numpy as np
U= np.array([1,2,3,4,5,6])
dU = np.zeros(len(U))
dU[:] = np.NAN
dU[1:-2] = (U[2:-1]-U[0:-3])/2
>>> dU
array([ nan, 1., 1., 1., nan, nan])
To have the second-to-last element included, you would need:
dU[1:-1] = (U[2:]-U[0:-2])/2
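With that change, every interior element gets a value and only the two endpoints stay nan; a quick check with the same example array:
import numpy as np

U = np.array([1, 2, 3, 4, 5, 6], dtype=float)
dU = np.full(len(U), np.nan)
# U[2:] are the i+1 neighbours, U[:-2] the i-1 neighbours, for i = 1 .. len(U)-2
dU[1:-1] = (U[2:] - U[:-2]) / 2
print(dU)   # [nan  1.  1.  1.  1. nan]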
Doesn't answer your question, but as a helpful tip, you can just use numpy.gradient
>>> np.gradient(np.array([1,2,3,4,5,6]))
array([1., 1., 1., 1., 1., 1.])

scipy.optimize.leastsq returns best guess parameters not new best fit

I want to fit a Lorentzian peak to a set of data x and y; the data itself is fine. Other programs like OriginLab fit it perfectly, but I wanted to automate the fitting with Python, so I wrote the code below, which is based on http://mesa.ac.nz/?page_id=1800
The problem I have is that scipy.optimize.leastsq returns, as the best fit, the same initial guess parameters I passed to it, essentially doing nothing. Here is the code.
# x, y are the arrays with the x, y axes respectively
from scipy.optimize import leastsq

# defining functions
def lorentzian(x, p):
    return p[2]*(p[0]**2)/((x - p[1])**2 + p[0]**2)

def residuals(p, y, x):
    err = y - lorentzian(x, p)
    return err

p = [0.055, wv[midIdx], y[midIdx-minIdx]]
pbest = leastsq(residuals, p, args=(y, x), full_output=1)
best_parameters = pbest[0]
print(p)
print(pbest)
p are the initial guesses and best_parameters are the returned 'best fit' parameters from leastsq, but they are always the same.
This is what is returned with full_output=1 (the long numeric arrays have been shortened but are still representative):
[0.055, 855.50732, 1327.0]
(array([ 5.50000000e-02, 8.55507324e+02, 1.32700000e+03]),
None, {'qtf':array([ 62.05192947, 69.98033905, 57.90628052]),
'nfev': 4,
'fjac': array([[-0., 0., 0., 0., 0., 0., 0.,],
[ 0., -0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0.],
[ 0., 0., -0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0.]]),
'fvec': array([ 62.05192947, 69.98033905,
53.41218567, 45.49879837, 49.58242035, 36.66483688,
34.74443436, 50.82238007, 34.89669037]),
'ipvt': array([1, 2, 3])},
'The cosine of the angle between func(x) and any column of the\n Jacobian
is at most 0.000000 in absolute value', 4)
Can anyone see what's wrong?
A quick Google search hints at a problem with the data being single precision (your other programs almost certainly upcast to double precision too, though this is arguably a problem with scipy as well; see also this bug report). If you look at your full_output=1 result, you see that the Jacobian is approximated as zero everywhere.
So giving the Jacobian explicitly might help (though even then you might want to upcast, because the minimum relative error you can get with single precision is just very limited).
Solution: the easiest and numerically best solution (of course giving the real Jacobian is also a bonus) is to just cast your x and y data to double precision (x = x.astype(np.float64) will do for example).
I would not suggest this, but you may also be able to fix it by setting the epsfcn keyword argument (and probably the tolerance keyword arguments too) by hand, something along the lines of epsfcn=np.finfo(np.float32).eps. This seems to fix the issue in a way, but (since most calculations are with scalars, and scalars do not force an upcast in your calculation) the calculations are done in float32 and the precision loss seems to be rather big, at least when not providing Dfun.
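A self-contained sketch of the double-precision fix, using synthetic data in place of the instrument data from the question (the Lorentzian and residuals mirror the code above; the numbers are made up):
import numpy as np
from scipy.optimize import leastsq

def lorentzian(x, p):
    return p[2]*(p[0]**2)/((x - p[1])**2 + p[0]**2)

def residuals(p, y, x):
    return y - lorentzian(x, p)

# Synthetic float32 data standing in for the single-precision instrument data.
x32 = np.linspace(840.0, 870.0, 200, dtype=np.float32)
y32 = lorentzian(x32, [0.5, 855.5, 1300.0]).astype(np.float32)

# Upcast to double precision before fitting, so the finite-difference
# Jacobian isn't swamped by single-precision rounding error.
x = x32.astype(np.float64)
y = y32.astype(np.float64)

p0 = [0.055, 855.5, 1327.0]
best, cov, info, msg, ier = leastsq(residuals, p0, args=(y, x), full_output=1)
print(best)   # should recover roughly [0.5, 855.5, 1300.0]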
