Python indexing for central differencing

I have a question about Python indexing. I am trying to use central differencing to estimate dU from an array U. I do this by initialising dU as an array of NaN with the same length as U, and then applying central differencing, dU[i] = (U[i+1] - U[i-1])/2, to the interior elements. The output dU array currently gives me two NaN entries at the end of the vector. Can anyone explain why the second-to-last element isn't being updated?
import numpy as np

U = np.array([1, 2, 3, 4, 5, 6])
dU = np.zeros(len(U))
dU[:] = np.nan                       # initialise with NaN (np.nan; the NAN alias was removed in NumPy 2.0)
dU[1:-2] = (U[2:-1] - U[0:-3]) / 2   # intended to fill the interior elements
>>> dU
array([ nan, 1., 1., 1., nan, nan])

To have the second-to-last element included, you would need:
dU[1:-1] = (U[2:]-U[0:-2])/2
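To see why, note that the stop index of a slice is exclusive, so 1:-2 stops before index -2, i.e. before the second-to-last element. A quick check (not part of the original answer) that prints which positions each slice covers:
import numpy as np

idx = np.arange(6)   # stand-in for the positions of dU
print(idx[1:-2])     # [1 2 3]   -> misses position 4, the second-to-last
print(idx[1:-1])     # [1 2 3 4] -> all interior positions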

This doesn't answer your question, but as a helpful tip, you can just use numpy.gradient:
>>> np.gradient(np.array([1,2,3,4,5,6]))
array([ 1., 1., 1., 1., 1., 1.])
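On linear data the endpoint handling isn't visible, but on non-linear data you can see that np.gradient matches the central differences in the interior and uses one-sided differences at the two endpoints instead of leaving NaN (a quick illustration, assuming unit spacing):
>>> U = np.array([1., 4., 9., 16., 25., 36.])
>>> np.gradient(U)
array([ 3.,  4.,  6.,  8., 10., 11.])
>>> (U[2:] - U[:-2]) / 2   # central differences for the interior
array([ 4.,  6.,  8., 10.])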

Related

How can I change multiple values at once in pandas dataframe, using arrays as indices that vary in length?

I want to change a number of values in my pandas dataframe, where the arrays of indices indicating the columns may vary in length.
I need something that is faster than a for-loop, because it will be done on a lot of rows, and this turned out to be too slow.
As a simple example, consider this
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((5, 5)))
Now, I want to change some of the values in this dataframe to 1. If I e.g. want to change the values in the second and fifth rows for the first two columns, but in the fourth row I want to change all the values, I want something like this to work:
col_indices = np.array([np.arange(2),np.arange(5),np.arange(2)])
row_indices = np.array([1,3,4])
df.loc[row_indices, col_indices] = 1
However, this does not work (I suspect it does not work because the shape of the data you would select does not conform to a dataframe).
Is there any more flexible way of indexing without having to loop over rows etc.?
A solution that works only for range-like arrays (as above) would also work for my current problem - but a general answer would also be nice.
Thanks for any help!
IIUC, here's one approach. Instead, define the column indices as the number of columns in which you want to insert 1s, and the rows where you want to insert them:
col_indices = np.array([2,5,2])
row_indices = np.array([1,3,4])
arr = df.values
And use advanced indexing to set the cells of interest to 1:
arr[row_indices] = np.arange(arr.shape[1]) < col_indices[:, None]   # boolean mask, broadcast to (3, 5)
array([[0., 0., 0., 0., 0.],
[1., 1., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[1., 1., 1., 1., 1.],
[1., 1., 0., 0., 0.]])
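If the result is needed back in the DataFrame, note that df.values may be a copy rather than a view depending on the pandas version and dtypes, so it is safer to write the modified rows back explicitly. A minimal sketch under that assumption:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((5, 5)))
col_indices = np.array([2, 5, 2])
row_indices = np.array([1, 3, 4])

arr = df.to_numpy(copy=True)                                        # work on an explicit copy
arr[row_indices] = np.arange(arr.shape[1]) < col_indices[:, None]
df.iloc[row_indices] = arr[row_indices]                             # write the changed rows back
print(df)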

Operation between ndarray and heterogeneous ndarray

I've been trying to come up with a way to add these two ndarrays, one of them with a different number of elements in each row:
import numpy as np

a = np.array([np.array([0, 1]), np.array([4, 5, 6])], dtype=object)   # ragged, so dtype=object is required in recent NumPy
z = np.zeros((3,3))
Expected output:
array([[0., 1., 0.],
[4., 5., 6.]])
Can anyone think of a way to do this using numpy?
I don't think there is a 'numpy-fast' solution for this. I think you will need to loop over a with a for loop and add every line individually.
for i in range(len(a)):
    z[i, :len(a[i])] = z[i, :len(a[i])] + a[i]
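For reference, here is a self-contained version of that loop (a sketch; it assumes z has one row per element of a):
import numpy as np

a = np.array([np.array([0, 1]), np.array([4, 5, 6])], dtype=object)
z = np.zeros((2, 3))

for i in range(len(a)):
    z[i, :len(a[i])] += a[i]   # add each ragged row into the leading part of z's row

print(z)
# [[0. 1. 0.]
#  [4. 5. 6.]]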

Concatenate Numpy arrays with least memory

Now I have a 50GB dataset saved with h5py, which contains a dictionary inside. The dictionary contains keys from 0 to n, and the values are 3-dimensional numpy ndarrays which all have the same shape. For example:
dictionary[0] = np.array([[[...],[...]]...])
I want to concatenate all these numpy arrays, with code like
sample = np.concatenate(list(dictionary.values))
This operation wastes 100GB of memory! If I use
del dictionary
it will decrease to 50GB of memory. But I want to keep the memory usage at 50GB while loading the data. Another way I tried is like this:
sample = np.concatenate((sample, dictionary[key]))
It still uses 100GB of memory. I think that in all the cases above, the right side creates a new memory block to hold the result, which is then assigned to the left side, so the memory doubles during the calculation. Thus, the third way I tried is like this:
sample = np.empty(shape)
with h5py.File(...) as dictionary:
    for key in dictionary.keys():
        sample[key] = dictionary[key]
I think this code has an advantage: the value dictionary[key] is assigned to a row of sample, and then the memory for dictionary[key] should be cleared. However, I tested it and found that the memory usage is also 100GB. Why?
Are there any good methods to limit the memory usage to 50GB?
Your problem is that you need to have 2 copies of the same data in memory.
If you build the array as in test1 you'll need far less memory at once, but at the cost of losing the dictionary.
import numpy as np
import time

def test1(n):
    a = {x: (x, x, x) for x in range(n)}  # Build sample data
    b = np.array([a.pop(i) for i in range(n)]).reshape(-1)  # pop entries so the dict shrinks as the array is built
    return b

def test2(n):
    a = {x: (x, x, x) for x in range(n)}  # Build sample data
    b = np.concatenate(list(a.values()))
    return b

x1 = test1(1000000)
del x1
time.sleep(1)
x2 = test2(1000000)
Results:
test1 : 0.71 s
test2 : 1.39 s
The first peak (of the memory profile) is for test1; it's not exactly in-place, but it reduces the peak memory usage quite a bit.
dictionary[key] is a dataset on the file. dictionary[key][...] will be a numpy array, i.e. that dataset loaded into memory.
I imagine
sample[key] = dictionary[key]
is evaluated as
sample[key,...] = dictionary[key][...]
The dataset is downloaded, and then copied to a slice of the sample array. That downloaded array should be free for recycling. But whether numpy/python does that is another matter. I'm not in the habit of pushing memory limits.
You don't want to do the incremental concatenate - that's slow. A single concatenate on the list should be faster. I don't know for sure what
list(dictionary.values)
contains. Will it be references to the datasets, or downloaded arrays? Regardless, concatenate(...) on that list will have to use the downloaded arrays.
One thing puzzles me - how can you use the same key to index the first dimension of sample and dataset in dictionary? h5py keys are supposed to be strings, not integers.
Some testing
Note that I'm using string dataset names:
In [21]: d = f.create_dataset('0',data=np.zeros((2,3)))
In [22]: d = f.create_dataset('1',data=np.zeros((2,3)))
In [23]: d = f.create_dataset('2',data=np.ones((2,3)))
In [24]: d = f.create_dataset('3',data=np.arange(6.).reshape(2,3))
Your np.concatenate(list(dictionary.values)) code is missing ():
In [25]: f.values
Out[25]: <bound method MappingHDF5.values of <HDF5 file "test.hf" (mode r+)>>
In [26]: f.values()
Out[26]: ValuesViewHDF5(<HDF5 file "test.hf" (mode r+)>)
In [27]: list(f.values())
Out[27]:
[<HDF5 dataset "0": shape (2, 3), type "<f8">,
<HDF5 dataset "1": shape (2, 3), type "<f8">,
<HDF5 dataset "2": shape (2, 3), type "<f8">,
<HDF5 dataset "3": shape (2, 3), type "<f8">]
So it's just a list of the datasets. The downloading occurs when concatenate does a np.asarray(a) for each element of the list:
In [28]: np.concatenate(list(f.values()))
Out[28]:
array([[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
[1., 1., 1.],
[1., 1., 1.],
[0., 1., 2.],
[3., 4., 5.]])
e.g.:
In [29]: [np.array(a) for a in f.values()]
Out[29]:
[array([[0., 0., 0.],
[0., 0., 0.]]), array([[0., 0., 0.],
[0., 0., 0.]]), array([[1., 1., 1.],
[1., 1., 1.]]), array([[0., 1., 2.],
[3., 4., 5.]])]
In [30]: [a[...] for a in f.values()]
....
Let's look at what happens when using your iteration approach:
Make an array that can take one dataset for each 'row':
In [34]: samples = np.zeros((4,2,3),float)
In [35]: for i,d in enumerate(f.values()):
...: v = d[...]
...: print(v.__array_interface__['data']) # databuffer location
...: samples[i,...] = v
...:
(27845184, False)
(27815504, False)
(27845184, False)
(27815504, False)
In [36]: samples
Out[36]:
array([[[0., 0., 0.],
[0., 0., 0.]],
[[0., 0., 0.],
[0., 0., 0.]],
[[1., 1., 1.],
[1., 1., 1.]],
[[0., 1., 2.],
[3., 4., 5.]]])
In this small example, it recycled every other databuffer block. The 2nd iteration frees up the databuffer used in the first, which can then be reused in the 3rd, and so on.
These are small arrays in an interactive ipython session. I don't know if these observations apply in large cases.
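One further option, not covered in the answers above: h5py's Dataset.read_direct can read each dataset straight into a slice of a preallocated array, avoiding a separate temporary array per dataset. A minimal sketch, assuming datasets named '0'..'n-1' that all share the same shape and dtype:
import numpy as np
import h5py

with h5py.File('test.hf', 'r') as f:
    keys = sorted(f.keys(), key=int)               # '0', '1', ... sorted numerically
    shape = f[keys[0]].shape                       # assumed common to all datasets
    sample = np.empty((len(keys),) + shape, dtype=f[keys[0]].dtype)
    for i, k in enumerate(keys):
        f[k].read_direct(sample, dest_sel=np.s_[i, ...])   # fill row i in place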

Numpy: signed values of element-wise absolute maximum of a 2D array

Let us assume that I have a 2D array named arr of shape (4, 3) as follows:
>>> arr
array([[ nan, 1., -18.],
[ -1., -1., -1.],
[ 1., 1., 5.],
[ 1., -1., 0.]])
Say I would like to assign the signed values of the element-wise absolute maximum of (1.0, 1.0, -15.0) and the rows arr[[0, 2], :] back to arr. That means I am looking for the output:
>>> arr
array([[ 1., 1., -18.],
[ -1., -1., -1.],
[ 1., 1., -15.],
[ 1., -1., 0.]])
The closest thing I found in the API reference for this is numpy.fmax but it doesn't do the absolute value. If I used:
arr[index_list, :] = np.fmax(arr[index_list, :], new_tuple)
my array would finally look like:
>>> arr
array([[ 1., 1., -15.],
[ -1., -1., -1.],
[ 1., 1., 5.],
[ 1., -1., 0.]])
Now, the API says that this function is
equivalent to np.where(x1 >= x2, x1, x2) when neither x1 nor x2 are NaNs, but it is faster and does proper broadcasting
I tried using the following:
arr[index_list, :] = np.where(np.absolute(arr[index_list, :]) >= np.absolute(new_tuple),
arr[index_list, :], new_tuple)
Although this produced the desired output, I got the warning:
/Applications/PyCharm CE.app/Contents/helpers/pydev/pydevconsole.py:1: RuntimeWarning: invalid value encountered in greater_equal
I believe this warning is because of the NaN, which is not handled gracefully here, unlike in the np.fmax function. In addition, the API docs mention that np.fmax is faster and does broadcasting correctly (I'm not sure what part of broadcasting is missing in the np.where version).
In conclusion, what I am looking for is something similar to:
arr[index_list, :] = np.fmax(arr[index_list, :], new_tuple, key=abs)
There is no such key argument available to this function, unfortunately.
Just for context, I am interested in the fastest possible solution, because the actual shape of my arr array is on average (100000, 50) and I am looping through almost 1000 new_tuple tuples (each tuple equal in length to the number of columns of arr, of course). The index_list changes for each new_tuple.
Edit 1:
One possible solution is to begin by replacing all NaN in arr with 0, i.e. arr[np.isnan(arr)] = 0. After this, I can use the np.where with np.absolute trick mentioned in my original text. However, this is probably a lot slower than np.fmax, as suggested by the API.
Edit 2:
The index_list may have repeated indexes in subsequent loops. Every new_tuple comes with a corresponding rule, and the index_list is selected based on that rule. There is nothing stopping different rules from having overlapping indexes that they match to. @Divakar has an excellent answer for the case where index_list has no repeats. Other solutions covering both cases are, however, welcome.
Assuming that the combined list of all index_lists has no repeated indexes:
Approach #1
I would propose a more vectorized solution, once we have all of the index_lists and new_tuples stored in one place, preferably as lists. As such, this could be the preferred approach if we are dealing with lots of such tuples and lists.
So, let's say we have them stored as the following :
new_tuples = [(1.0, 1.0, -15.0), (6.0, 3.0, -4.0)]  # list of all new_tuple
index_lists = [[0, 2], [4, 1, 6]]                   # list of all index_list
The solution thereafter would be to manually repeat the tuples (replacing the broadcasting) and then use np.where as shown in the question. Regarding the concern about the warning: it can be ignored as long as the new_tuples have no NaN values. Thus, the solution would be -
idx = np.concatenate(index_lists)           # all row indices, flattened
lens = list(map(len, index_lists))          # how many rows each new_tuple covers
a = arr[idx]
b = np.repeat(new_tuples, lens, axis=0)     # one row of new_tuple values per index
arr[idx] = np.where(np.abs(a) > np.abs(b), a, b)
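As a quick end-to-end check on the 4x3 arr from the question, using a single pair (a small sketch, not from the original answer; it emits the harmless NaN-comparison warning discussed above):
import numpy as np

arr = np.array([[np.nan, 1., -18.],
                [-1., -1., -1.],
                [1., 1., 5.],
                [1., -1., 0.]])
new_tuples = [(1.0, 1.0, -15.0)]
index_lists = [[0, 2]]

idx = np.concatenate(index_lists)
lens = list(map(len, index_lists))
a = arr[idx]
b = np.repeat(new_tuples, lens, axis=0)
arr[idx] = np.where(np.abs(a) > np.abs(b), a, b)
print(arr)
# [[  1.   1. -18.]
#  [ -1.  -1.  -1.]
#  [  1.   1. -15.]
#  [  1.  -1.   0.]]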
Approach #2
Another approach would be to store the absolute values of arr beforehand: abs_arr = np.abs(arr), and use those within np.where. This should save a lot of time within the loop. Thus, the relevant computation inside the loop would reduce to:
arr[index_list, :] = np.where(abs_arr[index_list, :] > np.abs(new_tuple), arr[index_list, :], new_tuple)
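One caveat worth adding (not from the original answer): if the index_lists can overlap across iterations, as described in Edit 2, abs_arr becomes stale for any rows that were just overwritten, so it has to be refreshed for those rows. A sketch of the loop under that assumption, with hypothetical example data:
import numpy as np

arr = np.array([[np.nan, 1., -18.],
                [-1., -1., -1.],
                [1., 1., 5.],
                [1., -1., 0.]])
new_tuples = [(1.0, 1.0, -15.0), (2.0, -6.0, 0.5)]   # second tuple is hypothetical
index_lists = [[0, 2], [2, 3]]                       # note: row 2 appears twice

abs_arr = np.abs(arr)                                # precompute once
for new_tuple, index_list in zip(new_tuples, index_lists):
    rows = arr[index_list, :]
    arr[index_list, :] = np.where(abs_arr[index_list, :] > np.abs(new_tuple), rows, new_tuple)
    abs_arr[index_list, :] = np.abs(arr[index_list, :])   # keep the cached absolutes in sync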

Create a numpy array according to another array along with indices array

I have a numpy array (e.g., a = np.array([8., 2.])), and another array which stores the indices I would like to get from the former array (e.g., b = np.array([0., 1., 1., 0., 0.])).
What I would like to do is to create another array from these 2 arrays, in this case, it should be: array([ 8., 2., 2., 8., 8.])
of course, I can always use a for loop to achieve this goal:
for i in range(5):
    c[i] = a[b[i]]
I wonder if there is a more elegant method to create this array. Something like c = a[b[0:5]] (well, this apparently doesn't work)
Only integer arrays can be used for indexing, and you've created b as a float64 array. You can get what you're looking for if you explicitly convert to integer:
bi = np.array(b, dtype=int)
c = a[bi[0:5]]
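Equivalently, a small variation on the same idea: b.astype(int) does the conversion in one step, and the slice is unnecessary when every element of b is wanted:
import numpy as np

a = np.array([8., 2.])
b = np.array([0., 1., 1., 0., 0.])

c = a[b.astype(int)]   # fancy indexing with an integer copy of b
print(c)               # [8. 2. 2. 8. 8.]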
