I'm dealing with arrays in Python, and a few things here have raised doubts for me.
1) I produce a list of lists by reading 4 columns from N files, storing 4 elements N times. I then convert this list into a NumPy array:
s = np.array(s)
and I ask for the shape of this array. The answer is correct:
print s.shape
#(N,4)
I then produce the mean of this Nx4 array:
s_m = sum(s)/len(s)
print s_m.shape
#(4,)
which I guess means that this array is a 1D array. Is this correct?
2) If I subtract the mean vector s_m from the rows of the array s, I can proceed in two ways:
residuals_s = s - s_m
or:
residuals_s = []
for i in range(len(s)):
    residuals_s.append([])
    tmp = s[i] - s_m
    residuals_s.append(tmp)
If I now ask for the shape of residuals_s in the two cases, I obtain two different answers. In the first case I obtain:
(N,4)
in the second:
(N,1,4)
Can someone explain why there is an additional dimension?
You can get the mean using the numpy method (producing the same (4,) shape):
s_m = s.mean(axis=0)
s - s_m works because s_m is 'broadcasted' to the dimensions of s.
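For example, a minimal sketch of that broadcasting (using random data as a stand-in for the question's (N, 4) array):
import numpy as np

s = np.random.randn(10, 4)       # stand-in for the (N, 4) data
s_m = s.mean(axis=0)             # shape (4,)
residuals_s = s - s_m            # s_m is stretched across the 10 rows
print(residuals_s.shape)         # (10, 4)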
If I run your second residuals_s I get a list containing empty lists and arrays:
[[],
array([ 1.02649662, 0.43613824, 0.66276758, 2.0082684 ]),
[],
array([ 1.13000227, -0.94129685, 0.63411801, -0.383982 ]),
...
]
That does not convert to an (N,1,4) array, but rather an (M,) array with dtype=object. Did you copy and paste correctly?
A corrected iteration is:
residuals_s = []
for i in range(len(s)):
    residuals_s.append(s[i] - s_m)
which produces a simpler list of arrays:
[array([ 1.02649662, 0.43613824, 0.66276758, 2.0082684 ]),
array([ 1.13000227, -0.94129685, 0.63411801, -0.383982 ]),
...]
which converts to a (N,4) array.
Iteration like this usually is not needed. But if it is, appending to a list like this is one way to go. Another is to preallocate an array and assign rows:
residuals_s = np.zeros_like(s)
for i in range(s.shape[0]):
    residuals_s[i,:] = s[i] - s_m
I get your (N,1,4) with:
In [39]: residuals_s=[]
In [40]: for i in range(len(s)):
   ....:     residuals_s.append([])
   ....:     tmp = s[i] - s_m
   ....:     residuals_s[-1].append(tmp)
In [41]: residuals_s
Out[41]:
[[array([ 1.02649662, 0.43613824, 0.66276758, 2.0082684 ])],
[array([ 1.13000227, -0.94129685, 0.63411801, -0.383982 ])],
...]
In [43]: np.array(residuals_s).shape
Out[43]: (10, 1, 4)
Here the s[i]-s_m array is appended to an empty list, which has been appended to the main list. So it's an array within a list within a list. It's this intermediate list that produces the middle 1 dimension.
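If you do end up with that extra length-1 axis, np.squeeze drops it; a minimal sketch:
res = np.array(residuals_s)      # shape (N, 1, 4)
res = res.squeeze(axis=1)        # remove the length-1 middle axis -> (N, 4)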
You are using a NumPy ndarray without using NumPy's functions: sum() is a Python builtin; you should use numpy.sum() instead.
I suggest you change your code to:
import numpy as np
np.random.seed(0)
s = np.random.randn(10, 4)
s_m = np.mean(s, axis=0, keepdims=True)
residuals_s = s - s_m
print s.shape, s_m.shape, residuals_s.shape
Using the mean() function with the axis and keepdims arguments will give you the correct result.
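To see what keepdims changes, compare the shapes (using the s defined above):
print(np.mean(s, axis=0).shape)                 # (4,)   - the reduced axis is dropped
print(np.mean(s, axis=0, keepdims=True).shape)  # (1, 4) - it is kept with length 1
Either shape broadcasts correctly against s's (10, 4) when subtracting.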
I have a numpy array arr containing 0s and 1s,
arr = np.random.randint(2, size=(800,800))
Then I cast it with astype(np.float32) and inserted various float numbers at various positions. In fact, what I would like to do is insert those float numbers only where the original array had 1 rather than 0; where the original array had 0 I want to keep 0.
My thought was to take a copy of the array (with .copy()) and reinsert from that later. So now I have arr above (1s and 0s), and a same-shaped array arr2 with numerical elements. I want to replace the elements in arr2 with those in arr only where (and everywhere where) the element in arr is 0. How can I do this?
Small example:
arr = np.array([[1, 0],
                [0, 1]])
arr2 = np.array([[2.43, 5.25],
                 [1.54, 2.59]])
Desired output:
arr2 = np.array([[2.43, 0],
                 [0, 2.59]])
N.B. should be as fast as possible on arrays of around 800x800
Simply do:
arr2[arr == 0] = 0
or
arr2 = arr2 * arr
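A quick check on the small example from the question (note the first form modifies arr2 in place):
import numpy as np

arr = np.array([[1, 0],
                [0, 1]])
arr2 = np.array([[2.43, 5.25],
                 [1.54, 2.59]])

arr2[arr == 0] = 0        # zero out entries where the mask is 0
print(arr2)
# [[2.43 0.  ]
#  [0.   2.59]]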
@swag2198 is correct; an alternative is below.
NumPy has a function called where which allows you to set values based on a condition from another array; this is essentially masking.
The code below will achieve what you want: it returns an array with the same dimensions as arr2, except that wherever arr has a zero, the value is replaced with zero.
arr = np.array([[1,0],[0,1]])
arr2 = np.array([[2.43, 5.25],
[1.54, 2.59]])
arr_out = np.where(arr, arr2, 0)
The advantage of this approach is that you can pick values from two arrays if you wish; say you wanted to mask an image, for instance, and replace the background.
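Continuing the snippet above, a sketch of picking from two arrays; background here is a hypothetical array of replacement values:
background = np.full_like(arr2, -1.0)     # hypothetical replacement values
out = np.where(arr, arr2, background)     # arr2 where arr is nonzero, else background
print(out)
# [[ 2.43 -1.  ]
#  [-1.    2.59]]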
Given arrays stored in a nested list:
import numpy as np
n_pair = 5
np.random.seed(0)
nsteps = 4
nmethod = 2
nbands = 3
t_band=0
t_method=0
t_step=0
t_sbj=0
t_gtmethod=1
all_sub = [[np.random.rand(nmethod, nbands, 2) for _ in range(nsteps)] for _ in range(3)]
Then I extract a data point from each element of the list as below:
this_gtmethod=[x[t_step][t_method][t_band][t_gtmethod] for x in all_sub]
However, I would like to avoid the loop and instead access all three elements directly, as below:
this_gtmethod=all_sub[:][t_step][t_method][t_band][t_gtmethod]
But it does not return the expected result when indexing as above.
Where did I go wrong?
This sort of slicing and indexing is best accomplished with Numpy arrays rather than lists.
If you make all_sub into a Numpy array, you can achieve your desired result with simple slicing.
all_sub = np.array(all_sub)
this_gtmethod = all_sub[:, t_step, t_method, t_band, t_gtmethod]
The result is the same as with your looping example.
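A quick verification sketch, assuming the setup from the question:
all_sub_arr = np.array(all_sub)                     # shape (3, 4, 2, 3, 2)
looped = [x[t_step][t_method][t_band][t_gtmethod] for x in all_sub]
sliced = all_sub_arr[:, t_step, t_method, t_band, t_gtmethod]
print(np.allclose(looped, sliced))                  # True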
You made a list of lists of arrays:
In [279]: type(all_sub), len(all_sub)
Out[279]: (list, 3)
In [280]: type(all_sub[0]), len(all_sub[0])
Out[280]: (list, 4)
In [282]: type(all_sub[0][0]), all_sub[0][0].shape
Out[282]: (numpy.ndarray, (2, 3, 2))
Lists can only be indexed with a scalar value or slice. List comprehension is the normal way of iterating through a list.
But an array can be indexed several dimensions at a time:
In [283]: all_sub[0][1][1,2,:]
Out[283]: array([0.46147936, 0.78052918])
Since the nested lists are all the same size, and arrays the same, it can be turned into a multidimensional array:
In [284]: M = np.array(all_sub)
In [285]: M.shape
Out[285]: (3, 4, 2, 3, 2)
2 ways of accessing the same subarrays:
In [286]: M[:,0,0,0,:]
Out[286]:
array([[0.5488135 , 0.71518937],
[0.31542835, 0.36371077],
[0.58651293, 0.02010755]])
In [287]: [a[0][0,0,:] for a in all_sub]
Out[287]:
[array([0.5488135 , 0.71518937]),
array([0.31542835, 0.36371077]),
array([0.58651293, 0.02010755])]
This problem only seems to arise when my dummy function returns an array, and thus a multidimensional array is being created.
I reduced the issue to the following example:
import numpy as np

def dummy(x):
    y = np.array([np.sin(x), np.cos(x)])
    return y
x = np.array([0, np.pi/2, np.pi])
The code I want to optimize looks like this:
y = []
for x_i in x:
    y_i = dummy(x_i)
    y.append(y_i)
y = np.array(y)
So I thought, I could use vectorize to get rid of the slow loop:
y = np.vectorize(dummy)(x)
But this results in
ValueError: setting an array element with a sequence.
Where even is the sequence the error is talking about?
Your function returns an array when given a scalar:
In [233]: def dummy(x):
     ...:     y = np.array([np.sin(x), np.cos(x)])
     ...:     return y
     ...:
In [234]: dummy(1)
Out[234]: array([0.84147098, 0.54030231])
In [235]: f = np.vectorize(dummy)
In [236]: f([0,1,2])
...
ValueError: setting an array element with a sequence.
vectorize constructs an empty result array and tries to put the result of each calculation in it. But a cell of the target array cannot accept an array.
If we specify an otypes parameter, it does work:
In [237]: f = np.vectorize(dummy, otypes=[object])
In [238]: f([0,1,2])
Out[238]:
array([array([0., 1.]), array([0.84147098, 0.54030231]),
array([ 0.90929743, -0.41614684])], dtype=object)
That is, each dummy array is put in an element of a shape (3,) result array.
Since the component arrays all have the same shape, we can stack them:
In [239]: np.stack(_)
Out[239]:
array([[ 0. , 1. ],
[ 0.84147098, 0.54030231],
[ 0.90929743, -0.41614684]])
But as noted, vectorize does not promise a speedup. I suspect we could also use the newer signature parameter, but that's even slower.
vectorize makes some sense if your function takes several scalar arguments, and you'd like to take advantage of numpy broadcasting when feeding sets of values. But as replacement for a simple iteration over a 1d array, it isn't an improvement.
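A sketch of that use case; scalar_op is a hypothetical scalar-only function, and the two arguments broadcast to a (3, 4) result:
def scalar_op(a, b):                    # hypothetical: only defined for scalars
    return a + b if a > b else a - b

f = np.vectorize(scalar_op)
print(f(np.arange(3)[:, None], np.arange(4)).shape)   # (3, 4)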
I don't really understand the error either, but with Python 3.6.3 you can just write:
y = dummy(x)
so it is automatically vectorized.
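One caveat: the direct call stacks sin and cos along the first axis, so the result is the transpose of the loop version's (N, 2) array:
y_direct = dummy(x)                             # shape (2, 3): row 0 is sin(x), row 1 is cos(x)
y_loop = np.array([dummy(x_i) for x_i in x])    # shape (3, 2)
print(np.allclose(y_direct.T, y_loop))          # True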
Also, the official documentation states the following:
The vectorize function is provided primarily for convenience, not for performance. The implementation is essentially a for loop.
I hope this was at least a little help.
I am attempting to utilize numpy to the best of its capabilities, but I am obviously missing some important link in the documentation due to my 'Noob-ness'.
What I want to do is create an array with a certain number of rows and columns and populate it with a sub array. The sub array is incremented by a pair of values as one traverses along the row. For subsequent rows, another pair of values is used to populate the columns. The best I have come up with is to use list comprehensions to generate the desired output. At this stage I can create an array which doesn't have the desired shape...I can deal with that in an awkward fashion, so all is not lost.
Here is what I have so far:
>>> import numpy as np
>>> np.set_printoptions(precision=4,threshold=20,edgeitems=3,linewidth=80) # default print options
>>> sub = np.array([[1,2],[3,4],[5,6]],dtype='float64') # a sub array of floats
>>> sub
array([[ 1., 2.],
[ 3., 4.],
[ 5., 6.]])
>>> e = np.empty((3,4),dtype='object') # an empty array of the desired shape
>>> e
array([[None, None, None, None],
[None, None, None, None],
[None, None, None, None]], dtype=object)
>>> dX = 1; dY = np.sqrt(3.)/2.0 # values to add to sub array per cell in e
>>> rows,cols = e.shape # rows and columns from 'e' shape
>>> out = [sub + [dX*i,dY*(i%2)] for i in range(0,cols)] # create the first row
>>> for j in range(1,rows): # create the other rows
...     out += [out[k] + [0,-dY*2*j] for k in range(cols)]
...
>>> arr = np.array(out)
>>> arr.shape # expect to see ((3,4),3,2)...I think
(12, 3, 2)
>>> arr[0:4] # I will let you try this to see the format
The last line just shows the format of the first 4 elements of the output array. What I was hoping to do was populate the empty array, e, in a fashion which is more elegant than my list comprehension method AND/OR to reshape the array properly. Again, unless I am missing links in the documentation, I would have expected a 3x4 array containing 3x2 subarrays...which is not what it is showing me. I would appreciate any help or links to appropriate documentation, since I have spent hours trolling this site and I am obviously missing some appropriate numpy terminology.
The first out is a list of 4 (3,2) arrays.
np.array(out) at this stage produces a (4,3,2) array. np.array creates the highest dimension array that the data allows. In this case it concatenates those 4 arrays along a new dimension.
After the rows loop, out is a list of 12 arrays. out +=... on a list appends them.
So by the same logic, arr = np.array(out) will produce a (12,3,2) array. That could be reshaped: arr = arr.reshape(3,4,3,2).
Subarrays from arr could be copied to e, e.g.:
e[0,0] = arr[0,0]
Which raises the question: why do you want an array like e? What advantage does it have over arr? arr represents 'the best of numpy's capabilities'; e tries to extend them into poorly developed areas.
Your out list can be vectorized with something along these lines:
ii = np.arange(cols)
ixy = np.array([dX*ii, dY*(ii%2)])
arr1 = sub[None,:,:] + ixy.T[:,None,:]
arr1 is a (4,3,2) array, and could be copied to the e[0,:] elements.
This could be cleaned up and extended to the other rows.
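For instance, a sketch of the full vectorization, assuming the same sub, dX, dY, rows, and cols as above; each cell's offset is dX*i in x and dY*(i % 2) - 2*dY*j in y, matching the two loops:
jj, ii = np.meshgrid(np.arange(rows), np.arange(cols), indexing='ij')  # both (3, 4)
offsets = np.stack([dX * ii, dY * (ii % 2) - 2 * dY * jj], axis=-1)    # (3, 4, 2)
full = sub[None, None, :, :] + offsets[:, :, None, :]                  # (3, 4, 3, 2)
print(np.allclose(full.reshape(-1, 3, 2), np.array(out)))              # matches the out list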
A clean way of iterating over all the elements of e, and assigning the corresponding subarray of arr uses np.ndindex (from the index_tricks module):
for i in np.ndindex(3,4):
    e[i] = arr[i]
While it is a Python level iteration, it does not involve copying data. It just copies pointers. I'm a little surprised about this, but e[i,j] points to the same data block as arr[i,j]. This is evident from the .__array_interface__ values, and by modifying entries, e.g.
e[1,1][0,0] = 30
changes the value of arr[1,1,0,0].
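A sketch of that pointer check via .__array_interface__, continuing from the e and arr above:
ptr_e = e[1, 1].__array_interface__['data'][0]
ptr_arr = arr[1, 1].__array_interface__['data'][0]
print(ptr_e == ptr_arr)          # True: e[1,1] is a view into arr's buffer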
Say I have a 3 dimensional numpy array:
np.random.seed(1145)
A = np.random.random((5,5,5))
and I have two lists of indices corresponding to the 2nd and 3rd dimensions:
second = [1,2]
third = [3,4]
and I want to select the elements in the numpy array corresponding to
A[:][second][third]
so the shape of the sliced array would be (5,2,2) and
A[:][second][third].flatten()
would be equivalent to:
In [226]:
for i in range(5):
    for j in second:
        for k in third:
            print A[i][j][k]
0.556091074129
0.622016249651
0.622530505868
0.914954716368
0.729005532319
0.253214472335
0.892869371179
0.98279375528
0.814240066639
0.986060321906
0.829987410941
0.776715489939
0.404772469431
0.204696635072
0.190891168574
0.869554447412
0.364076117846
0.04760811817
0.440210532601
0.981601369658
Is there a way to slice a numpy array in this way? So far when I try A[:][second][third] I get IndexError: index 3 is out of bounds for axis 0 with size 2 because the [:] for the first dimension seems to be ignored.
NumPy supports multidimensional indexing, so instead of A[1][2][3], you can (and should) use A[1,2,3].
You might then think you could do A[:, second, third], but the numpy indices are broadcast, and broadcasting second and third (two one-dimensional sequences) ends up being the numpy equivalent of zip, so the result has shape (5, 2).
What you really want is to index with, in effect, the outer product of second and third. You can do this with broadcasting by making one of them, say second, into a two-dimensional array with shape (2,1). Then the shape that results from broadcasting second and third together is (2,2).
For example:
In [8]: import numpy as np
In [9]: a = np.arange(125).reshape(5,5,5)
In [10]: second = [1,2]
In [11]: third = [3,4]
In [12]: s = a[:, np.array(second).reshape(-1,1), third]
In [13]: s.shape
Out[13]: (5, 2, 2)
Note that, in this specific example, the values in second and third are sequential. If that is typical, you can simply use slices:
In [14]: s2 = a[:, 1:3, 3:5]
In [15]: s2.shape
Out[15]: (5, 2, 2)
In [16]: np.all(s == s2)
Out[16]: True
There are a couple of very important differences between those two methods.
The first method would also work with indices that are not equivalent to slices. For example, it would work if second = [0, 2, 3]. (Sometimes you'll see this style of indexing referred to as "fancy indexing".)
In the first method (using broadcasting and "fancy indexing"), the data is a copy of the original array. In the second method (using only slices), the array s2 is a view into the same block of memory used by a. An in-place change in one will change them both.
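A small demonstration of that difference, continuing the example above (a has integer dtype, so integer values are used):
s2[0, 0, 0] = -1        # writes through the view ...
print(a[0, 1, 3])       # ... so a sees it: -1
s[0, 0, 0] = -2         # s is a fancy-indexed copy ...
print(a[0, 1, 3])       # ... so a is untouched: still -1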
One way would be to use np.ix_:
>>> out = A[np.ix_(range(A.shape[0]),second, third)]
>>> out.shape
(5, 2, 2)
>>> manual = [A[i,j,k] for i in range(5) for j in second for k in third]
>>> (out.ravel() == manual).all()
True
Downside is that you have to specify the missing coordinate ranges explicitly, but you could wrap that into a function.
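For example, a small hypothetical helper (outer_select is an illustrative name, not a NumPy function) that fills in the full range for any axis passed as None:
def outer_select(A, *indices):
    # replace None with the full range along that axis,
    # then take the outer product of the index lists with np.ix_
    filled = [range(A.shape[ax]) if ind is None else ind
              for ax, ind in enumerate(indices)]
    return A[np.ix_(*filled)]

print(outer_select(A, None, second, third).shape)   # (5, 2, 2)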
I think there are three problems with your approach:
Both second and third should be slices
Since the 'to' index is exclusive, they should go from 1 to 3 and from 3 to 5
Instead of A[:][second][third], you should use A[:,second,third]
Try this:
>>> np.random.seed(1145)
>>> A = np.random.random((5,5,5))
>>> second = slice(1,3)
>>> third = slice(3,5)
>>> A[:,second,third].shape
(5, 2, 2)
>>> A[:,second,third].flatten()
array([ 0.43285482, 0.80820122, 0.64878266, 0.62689481, 0.01298507,
0.42112921, 0.23104051, 0.34601169, 0.24838564, 0.66162209,
0.96115751, 0.07338851, 0.33109539, 0.55168356, 0.33925748,
0.2353348 , 0.91254398, 0.44692211, 0.60975602, 0.64610556])