I am setting an existing array to zeros using the numpy.zeros_like function as follows:
import numpy as np
x = np.random.rand(3, 3, 3, 3, 3) # Some random data
x = np.zeros_like(x)  # allocates a brand-new array of zeros and rebinds x to it
I think the way I am doing it creates a new array of zeros and rebinds the name to it. Is there an efficient way to set everything to zero without allocating a new array each time? I need this because it is called inside an optimisation routine that runs many times.
You can do the following:
x[:] = 0
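For example (a minimal sketch; the buffer-address check is only there to illustrate that nothing is reallocated):
import numpy as np

x = np.random.rand(3, 3, 3, 3, 3)
buffer_before = x.__array_interface__['data'][0]  # address of the underlying buffer
x[:] = 0     # writes zeros into the existing buffer
# x.fill(0)  # equivalent in-place alternative
# x[...] = 0 # also in place, and works for 0-d arrays too
assert x.__array_interface__['data'][0] == buffer_before  # same memory, no new array allocated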
I am generating a large set of Monte Carlo data and ideally want to store it in an array of arrays.
When I use array_test.append(x) and then cycle over a loop that produces a new value for x, the elements of the list at the end are all the same as the last array x added. I believe this must be because I'm adding a reference to the same array object rather than the actual array data, so when I modify x all the other entries that point to the same object update as well.
Is there any way to prevent this by setting a kwarg or something, or do I have to construct my arrays differently?
# test to illustrate the point
import numpy as np

x = np.random.choice((-1, 1), size=(5, 5))
array_test = []
for T in range(10):
    array_test.append(x)  # appends a reference to the same array object
    print(x)
    x += 10               # modifies x in place, so every entry of array_test changes too
print(array_test)
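list.append has no keyword argument that changes this, but appending a copy of the array gives each list entry its own data. A minimal sketch:
import numpy as np

x = np.random.choice((-1, 1), size=(5, 5))
array_test = []
for T in range(10):
    array_test.append(x.copy())  # copy() stores an independent snapshot of the current data
    x += 10
# array_test now holds ten distinct arrays rather than ten references to the same one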
My supervisor uses IDL but I've been using Python as I'm more familiar with it. I'm performing an interpolation and have saved the lower/upper bound values. Is there a quicker way of doing this?
Variables
Inputs:
sed = numpy array [6,221,6900]
it0, it1, iz0, iz1 = numpy arrays [341499]
snapshots (38 of them)
Outputs:
sed1, sed2, sed3, sed4 = numpy arrays [341499]
MWE
I want to loop through the 38 snapshots, then within that loop through the 341499 particles, and then assign the resulting numpy array [6900] given below.
sed1 = sed[iz0, it0]
sed2 = sed[iz1, it0]
sed3 = sed[iz0, it1]
sed4 = sed[iz1, it1]
What I've Tried
I cannot initialise an array of the required size, i.e. a numpy array of shape [38, 341499, 4, 6900], as this gives a memory error, which means I can't assign using vectorised [:] operations.
I've tried initialising a numpy object-dtype array of size [38, 341499], but this is very slow.
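One possible sketch of the looping structure, processing one snapshot at a time so the full [38, 341499, 4, 6900] result never has to exist in memory (load_indices and process_or_save are hypothetical placeholders for however the per-snapshot indices are obtained and the results consumed; depending on dtype you may also need to chunk the particle axis):
import numpy as np

# sed is the [6, 221, 6900] array described above
for snap in range(38):
    it0, it1, iz0, iz1 = load_indices(snap)       # each of shape (341499,); hypothetical helper
    # fancy indexing is vectorised over all particles at once; each result is (341499, 6900)
    sed1 = sed[iz0, it0]
    sed2 = sed[iz1, it0]
    sed3 = sed[iz0, it1]
    sed4 = sed[iz1, it1]
    process_or_save(snap, sed1, sed2, sed3, sed4)  # reduce or write out before the next snapshot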
I have a problem to solve and I cannot come up with a good solution.
To simplify it, say I have a 10x10 array and I want to slice out "little arrays" of 3x3. Right now I do this the following way:
import numpy as np

array = np.arange(100).reshape((10, 10))
patch = np.array(array[:3, :3])
for n in range(3, 10, 3):
    for m in range(3, 10, 3):
        patch = np.append(patch, array[n:n+3, m:m+3])
I basically create the numpy array patch from the first slice and append all the other slices afterwards. The problem with this is that it is horribly slow and makes no good use of numpy's slicing capabilities. I need to do this for a large number of much bigger arrays.
Can anyone give me any advice on how to make this more efficient?
1000 thanks!
Your problem is entirely down to using numpy.append: append creates and returns a new array each time you call it, so as your patch array gets bigger each call takes progressively longer.
Instead, use a pre-sized array (you already know the final size of the patch array) and avoid making intermediate copies of any data.
# setup
import numpy as np

x, y = 999, 999
array = np.arange(x * y)
array.shape = x, y
little_array_size = 3

# creates an array of "little arrays"
patch = np.empty(array.size, dtype=int)
patch.shape = -1, little_array_size, little_array_size

i = 0
for n in range(0, array.shape[0], little_array_size):
    for m in range(0, array.shape[1], little_array_size):
        # uses a view, so data is copied straight from array into patch
        patch[i, :] = array[n:n+little_array_size, m:m+little_array_size]
        i += 1

patch.shape = -1  # flattens the array
The above takes about a third of a second on my computer, two orders of magnitude faster than using numpy.append (20+ seconds).
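When both dimensions are exact multiples of the block size, the same block extraction can also be written as a single reshape/transpose with no Python loop at all. A possible sketch, assuming that divisibility holds:
import numpy as np

array = np.arange(999 * 999).reshape(999, 999)
k = 3  # block size; assumes both dimensions are multiples of k

# reshape into (row_blocks, k, col_blocks, k), then swap the middle axes so that
# blocks[i, j] is the k x k block starting at row i*k, column j*k
blocks = array.reshape(999 // k, k, 999 // k, k).swapaxes(1, 2)
patch = blocks.reshape(-1, k, k)  # blocks come out in the same order as the loop version above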
Pandas seems to be missing an R-style matrix-level rolling window function (rollapply(..., by.column = FALSE)), providing only the vector-based version. Thus I tried to follow this question; it works beautifully with the replicable example, but it doesn't work with pandas DataFrames, even when using the (seemingly identical) underlying NumPy array.
Artificial problem replication:
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import as_strided
test = [[x * y for x in range(1, 10)] for y in [10**z for z in range(5)]]
mm = np.array(test, dtype = np.int64)
pp = pd.DataFrame(test).values
mm and pp look identical when printed.
The numpy directly-derived matrix gives me what I want perfectly:
as_strided(mm, (mm.shape[0] - 3 + 1, 3, mm.shape[1]), (mm.shape[1] * 8, mm.shape[1] * 8, 8))
That is, it gives me 3 windows of 3 rows each in a 3-d array, allowing me to perform computations on a submatrix while moving down by one row at a time.
But the pandas-derived version (identical call with mm replaced by pp):
as_strided(pp, (pp.shape[0] - 3 + 1, 3, pp.shape[1]), (pp.shape[1] * 8, pp.shape[1] * 8, 8))
comes out all wrong, as if it were transposed somehow. Is this to do with column-/row-major order?
I need to do matrix sliding windows in Pandas, and this seems like my best shot, especially because it is really fast. What's going on here? How do I get the underlying Pandas array to behave like the NumPy one?
It seems that .values returns the underlying data in Fortran order (as you speculated):
>>> mm.flags # NumPy array
C_CONTIGUOUS : True
F_CONTIGUOUS : False
...
>>> pp.flags # array from DataFrame
C_CONTIGUOUS : False
F_CONTIGUOUS : True
...
This breaks your as_strided call, because the strides you pass assume the data is arranged in C order in memory.
To fix things, you could copy the data in C order and use the same strides as in your question:
pp = pp.copy('C')
Alternatively, if you want to avoid copying large amounts of data, adjust the strides to acknowledge the column-order layout of the data:
as_strided(pp, (pp.shape[0] - 3 + 1, 3, pp.shape[1]), (8, 8, pp.shape[0]*8))
Is this to do with column/row major order stuff?
Yes, see mm.strides and pp.strides.
How do I get the underlying Pandas array to behave like Numpy?
The NumPy array mm is "C-contiguous" and that's why the stride trick works. If you want to call the exact same code on the array underlying the DataFrame, you can use np.ascontiguousarray first. Or maybe it would be better to write the windowing code so that it takes the array's actual strides and itemsize into account.
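A rough sketch of the first suggestion (the window shape follows the question; np.ascontiguousarray copies only when its input is not already C-contiguous, and using itemsize avoids hard-coding 8):
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import as_strided

test = [[x * y for x in range(1, 10)] for y in [10**z for z in range(5)]]
pp = pd.DataFrame(test).values

pp_c = np.ascontiguousarray(pp)  # ensure C order; copies only if needed
itemsize = pp_c.itemsize
windows = as_strided(
    pp_c,
    (pp_c.shape[0] - 3 + 1, 3, pp_c.shape[1]),
    (pp_c.shape[1] * itemsize, pp_c.shape[1] * itemsize, itemsize),
)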
I want to create an array in numpy that contains the values of a mathematical series where each term is the square of the previous one, given a single starting value, i.e. a_0 = 2, a_1 = 4, a_2 = 16, ...
Trying to use the vectorization in numpy I thought this might work:
import numpy as np
a = np.array([2,0,0,0,0])
a[1:] = a[0:-1]**2
but the outcome is
array([2, 4, 0, 0, 0])
I have now learned that numpy evaluates the right-hand side into a temporary array first (using the current, mostly zero values of a) and only then copies it into a[1:], which is why every value after the first remains zero.
Is there a way to vectorise this computation using numpy, numexpr or other tools? What other ways are there to calculate the values of such a series efficiently, using fast numpy functions, without resorting to a for loop?
There is no general way to vectorise recursive sequence definitions in NumPy. This particular case is rather easy to write without a for-loop though:
>>> 2 ** 2 ** numpy.arange(5)
array([ 2, 4, 16, 256, 65536])
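For a general recurrence a_{k+1} = f(a_k) that cannot be rewritten in closed form like this, a plain Python loop over a preallocated array is usually the practical fallback. A minimal sketch:
import numpy as np

def recurrence(f, a0, n):
    # fills a length-n array with a[0] = a0 and a[k+1] = f(a[k])
    a = np.empty(n, dtype=np.int64)
    a[0] = a0
    for k in range(n - 1):
        a[k + 1] = f(a[k])
    return a

print(recurrence(lambda v: v ** 2, 2, 5))  # -> [2 4 16 256 65536]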