I am having problems with while loops in Python right now:
while(j < len(firstx)):
    trainingset[j][0] = firstx[j]
    trainingset[j][1] = firsty[j]
    trainingset[j][2] = 1  # top
    print(trainingset)
    j += 1
print("j is " + str(j))

i = 0
while(i < len(firsty)):
    trainingset[j+i][0] = twox[i]
    trainingset[j+i][1] = twoy[i]
    trainingset[j+i][2] = 0  # bottom
    i += 1
Here trainingset = [[0,0,0]]*points*2, where points is a number. Also, firstx, firsty, twox, and twoy are all NumPy arrays.
I want the training set to have 2*points entries, going from [firstx[0], firsty[0], 1] all the way to [twox[points-1], twoy[points-1], 0].
After some debugging, I am realizing that on each iteration **every single value** in the training set is being changed, so that when j = 0 all the training set values are replaced with firstx[0], firsty[0], and 1.
What am I doing wrong?
In this case, I would recommend a for loop instead of a while loop; for loops are great when you want the index to simply increment and you know in advance what its last value should be, which is true here.
I've had to make some assumptions about the shape of your arrays based on your code. I'm assuming that:
firstx, firsty, twox, and twoy are 1D NumPy arrays with either shape (length,) or (length, 1).
trainingset is a 2D NumPy array with at least len(firstx) + len(firsty) rows and at least 3 columns.
j starts at 0 before the while loop begins.
Given these assumptions, here's some code that gives you the output you want:
len_firstx = len(firstx)

# Replace each row with index j with the desired values
for j in range(len(firstx)):
    trainingset[j][0:3] = [firstx[j], firsty[j], 1]

# Because i starts at 0, len_firstx needs to be added to the trainingset row index
for i in range(len(firsty)):
    trainingset[i + len_firstx][0:3] = [twox[i], twoy[i], 0]
Let me know if you have any questions.
EDIT: Alright, looks like the above doesn't work correctly on rare occasions, not sure why, so if it's being fickle, you can change trainingset[j][0:3] and trainingset[i + len_firstx][0:3] to trainingset[j, 0:3] and trainingset[i + len_firstx, 0:3].
I think it has something to do with the shape of the trainingset array, but I'm not quite sure.
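In fact, if trainingset is still the plain Python list built as [[0,0,0]]*points*2 in the question, that by itself would explain the original symptom: multiplying a list of lists makes every row an alias of the same inner list, so writing to one row writes to all of them. A quick demonstration (points here is just a small stand-in value):

points = 2
trainingset = [[0, 0, 0]] * points * 2
trainingset[0][0] = 99
print(trainingset[0] is trainingset[1])  # True: every row is the same object
print(trainingset)  # [[99, 0, 0], [99, 0, 0], [99, 0, 0], [99, 0, 0]]

# Independent rows instead:
trainingset = [[0, 0, 0] for _ in range(points * 2)]

Building trainingset as a real NumPy array, e.g. np.zeros((2*points, 3)), also avoids the problem and makes the [j, 0:3] indexing above work.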
EDIT 2: There's also a more Pythonic way to do what you want instead of using loops. It standardizes the shapes of the four arrays assumed to be 1D (firstx, twox, etc. -- also, if you could let me know exactly what shape these arrays have, that would be super helpful and I could simplify the code!) and then makes the appropriate rows and columns in trainingset have the corresponding values.
# Function to reshape the 1D arrays.
# Works for shapes (length,), (length, 1), and (1, length).
def reshape_1d_array(arr):
    shape = arr.shape
    if len(shape) == 1:
        return arr[:, None]
    elif shape[0] >= shape[1]:
        return arr
    else:
        return arr.T
# Reshape the 1D arrays
firstx = reshape_1d_array(firstx)
firsty = reshape_1d_array(firsty)
twox = reshape_1d_array(twox)
twoy = reshape_1d_array(twoy)
len_firstx = len(firstx)
# The following 2 lines do what the loops did, but in 1 step.
arr1 = np.concatenate((firstx, firsty[0:len_firstx], np.array([[1]*len_firstx]).T), axis=1)
arr2 = np.concatenate((twox, twoy[0:len_firstx], np.array([[0]*len_firstx]).T), axis=1)
# Now put arr1 and arr2 where they're supposed to go in trainingset.
trainingset[:len_firstx, 0:3] = arr1
trainingset[len_firstx:len_firstx + len(firsty), 0:3] = arr2
This gives the same result as the two for loops I wrote before, but is faster if firstx has more than ~50 elements.
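As a side note, np.column_stack does essentially the same assembly and promotes 1D inputs to columns for you; a minimal sketch, assuming the arrays hold matching numbers of points (ravel() flattens any (length, 1) shapes):

len_firstx = len(firstx)
trainingset[:len_firstx, 0:3] = np.column_stack(
    (firstx.ravel(), firsty.ravel()[:len_firstx], np.ones(len_firstx, dtype=int)))
trainingset[len_firstx:len_firstx + len(twox), 0:3] = np.column_stack(
    (twox.ravel(), twoy.ravel()[:len(twox)], np.zeros(len(twox), dtype=int)))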
I have a piece of code that is running, but is currently a bottleneck, and I was wondering whether there is a smarter way to do it.
I have a 1D array of integers between 0-20, with a length of 20-1000 (it varies) and I'm trying to compare it to a set of 1D arrays that are stored in a 2D array.
I wish to find any row in the 2D array that completely matches the 1D array.
My current approach to do this is the following:
res = np.mean(one_d_array == two_d_array,axis=1) == 1
The problem with this approach is that it compares all elements in all rows, even when a row already fails to match on the first element, the second, etc., which is of course very inefficient.
I could remedy this by looping through the rows and comparing each row individually; then I could stop a comparison as soon as one element is false. But then I would be stuck with a slow Python for loop, which is also not ideal.
So I'm wondering is there some other clever way to get the best of both of these approaches?
NumPy has a few useful built-in functions for checking matrix/vector equality; this approach is about twice as fast:
import numpy as np
import time
x = np.random.random((1, 1000))
y = np.random.random((10000, 1000))
y[53] = x
t = time.time()
x_in_y = np.equal(x, y).all(axis=1)  # equal(x, y) returns a rows-by-cols matrix of elementwise matches; all(axis=1) is True only for rows that match x completely
idx = np.where(x_in_y)  # returns the indices where x_in_y is True (here 53)
print(time.time() - t) # 0.019975900650024414
t = time.time()
res = np.mean(x == y, axis=1) == 1
print(time.time() - t) # 0.03999614715576172
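If you want to get closer to the early-exit behaviour without a Python loop, one compromise (a sketch, not a true per-element short circuit) is to pre-filter on the first column, which is cheap, and then run the full comparison only on the surviving rows:

candidates = np.where(y[:, 0] == x[0, 0])[0]  # rows that already match on element 0
matches = candidates[np.equal(x, y[candidates]).all(axis=1)]  # full check on survivors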
I'm trying to create an array based on values from another data frame in Python. I want it to fill the array as follows:
If x >= 3 in the dataframe then it inputs a 0 in the array.
If x < 3 in the dataframe then it inputs a 1 in the array.
If x = 0 in the dataframe then it inputs a 0 in the array.
Below is the code I have so far, but the result is coming out as just [0].
array = np.array([])
for x in df["disc"]:
    for y in array:
        if x >= 3:
            y=0
        elif x < 3:
            y=1
        else:
            y=0
Any help would be much appreciated thank you.
When working with numpy arrays, it is more efficient if you can avoid using explicit loops in Python at all. (The actual looping takes place inside compiled C code.)
disc = df["disc"]
# make an array containing 0 where disc >= 3, elsewhere 1
array = np.where(disc >= 3, 0, 1)
# now set it equal to 0 in any places where disc == 0
array[disc == 0] = 0
It could also be done in a single statement (other than the initial assignment of disc) using:
array = np.where((disc >= 3) | (disc == 0), 0, 1)
Here the | does an element-by-element "or" test on the boolean arrays. (It has higher precedence than comparison operators, so the parentheses around the comparisons are needed.)
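A quick check with toy data (the column values here are made up for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame({"disc": [0, 1, 2, 3, 5]})
disc = df["disc"]
array = np.where((disc >= 3) | (disc == 0), 0, 1)
print(array)  # [0 1 1 0 0]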
This is a simple problem, and there are many ways to solve it. I think the easiest is to use a list and append the values according to the conditions.
array = []
for x in df["disc"]:
    if x >= 3:
        array.append(0)
    elif x < 3:
        array.append(1)
    else:
        array.append(0)
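If you need a NumPy array afterwards, it is usually cheapest to build the list first and convert once at the end:

array = np.array(array)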
Your code doesn't seem to be doing anything to the array, as you are trying to modify the variable y, rather than the array itself. y doesn't reference the array, it just holds the values found. The second loop also doesn't do anything due to the array being empty - it's looping through 0 elements. What you need rather than another for loop is to simply append to the array.
With a list, you would use the .append() method to add an element, however as you appear to be using numpy, you'd want to use the append(arr, values) function it provides, like so:
array = np.array([])
for x in df["disc"]:
    if x >= 3:
        array = np.append(array, 0)
    elif x < 3:
        array = np.append(array, 1)
    else:
        array = np.append(array, 0)
I'll also note that these conditions can be simplified to combine the two branches which append a 0. Namely, if x < 3 and x is not 0, then append a 1; otherwise append a 0. Thus, the code can be rewritten as follows.
array = np.array([])
for x in df["disc"]:
    if x < 3 and x != 0:
        array = np.append(array, 1)
    else:
        array = np.append(array, 0)
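For what it's worth, this combined condition can also be vectorized, which avoids the copy np.append makes on every call; a minimal sketch, assuming df["disc"] holds numbers:

disc = df["disc"].to_numpy()
array = ((disc < 3) & (disc != 0)).astype(int)  # 1 where x < 3 and x != 0, else 0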
I would like to speed up a function on a single array in Numpy using fancy indexing, vectorization, and/or broadcasting. For each value in my array, I need to do a calculation that involves adjacent values. Therefore, in my vectorized operation, I need to have access to the current index so that I can grab indices around it. Consider the following simple array operation:
x = np.arange(36).reshape(6, 6)
y = np.zeros((6, 6))
y[:] = x + 1
I'd like to use similar syntax, but rather than a simple increment, I'd like to do something like add all values at adjacent indices to the current value in the vectorized loop. For instance, if the value at index [i, j] is 7 and the region around it looks like
3 2 5
2 7 6
5 5 5
I'd like the calculated value for [i, j] to be 3 + 2 + 5 + 2 + 7 + 6 + 5 + 5 + 5, and I want to do that for all indices [i, j].
This is a straightforward nested for loop (or a single for loop using np.sum for each index)... but I want to use broadcasting and/or fancy indexing if possible. This may be too complex of a problem for the Numpy syntax, but I feel like it should be possible.
Essentially, it comes down to this: how do I reference the current index during a broadcasting operation?
Start with a 1D example:
x = np.arange(10)
There is a choice you have to make: do you discard the edges or not, since they don't have two neighbors? If you do, you can create your output array in essentially one step:
result = x[:-2] + x[1:-1] + x[2:]
Notice that all three addends are views because they use simple indexing. You want to avoid fancy indexing as much as you can because it generally involves making copies.
If you prefer to retain the edges, you can pre-allocate the output buffer and add directly into it:
result = x.copy()
result[:-1] += x[1:]
result[1:] += x[:-1]
The fundamental idea in both cases is that to apply an operation to all neighboring elements, you just shift the array by +/-1. You don't need to know any indices, or do anything fancy. The simpler the better.
Hopefully you can see how to generalize this to the 2D case. Rather than a single shift of -1, 0, or 1, you now have two indices taking every possible combination of -1, 0, and 1 between the two of them.
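Written out explicitly for the 2D no-edge case, using the example array from the question, that looks like:

x = np.arange(36).reshape(6, 6)
# Nine shifted views, one per combination of row/column shifts in {-1, 0, 1}
result = (x[:-2, :-2] + x[:-2, 1:-1] + x[:-2, 2:] +
          x[1:-1, :-2] + x[1:-1, 1:-1] + x[1:-1, 2:] +
          x[2:, :-2] + x[2:, 1:-1] + x[2:, 2:])
# result[i, j] is the sum of x[i+1, j+1] and its eight neighbours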
Appendix
Here's the generalized approach for a no-edge result:
from itertools import product

def sum_shifted(a):
    result = np.zeros(tuple(x - 2 for x in a.shape), dtype=a.dtype)
    for index in product([slice(0, -2), slice(1, -1), slice(2, None)], repeat=a.ndim):
        result += a[index]
    return result
This implementation is somewhat rudimentary because it doesn't check for inputs with no dimensions or shapes < 2, but it does work for arbitrary numbers of dimensions.
Notice that for a 1D case the loop runs exactly three times, for 2D nine times, and for N dimensions 3^N times. This is one case where I find an explicit for loop to be appropriate with numpy. The loop is very small compared to the work done on a large array, fast enough for a small array, and certainly better than writing all 27 possibilities out by hand for the 3D case.
One more thing to pay attention to is how the successive indices are generated. In Python an index with a colon, like x[1:2:3] is converted to the relatively unknown slice object: slice(1, 2, 3). Since (almost) everything with commas gets interpreted as a tuple, an index like in the expression x[1:2, ::-1, :2] is exactly equivalent to (slice(1, 2), slice(None, None, -1), slice(None, 2)). The loop generates exactly such an expression, with one element for each dimension. So the result is actually simple indexing across all dimensions.
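A tiny check of that equivalence:

import numpy as np
x = np.arange(24).reshape(2, 3, 4)
idx = (slice(1, 2), slice(None, None, -1), slice(None, 2))
print(np.array_equal(x[1:2, ::-1, :2], x[idx]))  # True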
A similar approach is possible if you want to retain edges. The only significant difference is that you need to index both the input and the output arrays:
from itertools import product

def sum_shifted(a):
    result = np.zeros_like(a)
    for r_index, a_index in zip(product([slice(0, -1), slice(None), slice(1, None)], repeat=a.ndim),
                                product([slice(1, None), slice(None), slice(0, -1)], repeat=a.ndim)):
        result[r_index] += a[a_index]
    return result
This works because itertools.product guarantees the order of the iteration, so the two zipped iterators will stay in lockstep.
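As a quick sanity check, this edge-retaining version reproduces the hand-written 1D result from the start of the answer:

x = np.arange(10)
expected = x.copy()
expected[:-1] += x[1:]
expected[1:] += x[:-1]
print(np.array_equal(sum_shifted(x), expected))  # True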
try this:
x = np.arange(36).reshape(6, 6)
y = np.zeros((6, 6))
for i in range(x.shape[0]):
    for j in range(x.shape[1]):
        if i > 0 and i < x.shape[0]-1 and j > 0 and j < x.shape[1]-1:
            y[i,j] = x[i,j]+x[i-1,j]+x[i,j-1]+x[i-1,j-1]+x[i+1,j]+x[i,j+1]+x[i+1,j+1]+x[i-1,j+1]+x[i+1,j-1]
        if j == 0:
            if i == 0:
                y[i,j] = x[i,j]+x[i,j+1]+x[i+1,j+1]+x[i+1,j]
            elif i == x.shape[0]-1:
                y[i,j] = x[i,j]+x[i,j+1]+x[i-1,j+1]+x[i-1,j]
            else:
                y[i,j] = x[i,j]+x[i,j+1]+x[i+1,j+1]+x[i+1,j]+x[i-1,j]+x[i-1,j+1]
        if j == x.shape[1]-1:
            if i == 0:
                y[i,j] = x[i,j]+x[i,j-1]+x[i+1,j-1]+x[i+1,j]
            elif i == x.shape[0]-1:
                y[i,j] = x[i,j]+x[i,j-1]+x[i-1,j-1]+x[i-1,j]
            else:
                y[i,j] = x[i,j]+x[i,j-1]+x[i-1,j-1]+x[i+1,j]+x[i-1,j]+x[i+1,j-1]
        if i == 0 and j in range(1, x.shape[1]-1):
            y[i,j] = x[i,j]+x[i,j-1]+x[i+1,j-1]+x[i+1,j]+x[i+1,j+1]+x[i,j+1]
        if i == x.shape[0]-1 and j in range(1, x.shape[1]-1):
            y[i,j] = x[i,j]+x[i,j-1]+x[i-1,j-1]+x[i-1,j]+x[i-1,j+1]+x[i,j+1]
print(y)
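If the edge-retaining sum_shifted from the previous answer is in scope, the two approaches should agree; a quick cross-check:

print(np.array_equal(y, sum_shifted(x)))  # True: same truncated-neighbourhood sums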
For example, let's consider this toy code
import numpy as np
import numpy.random as rnd
a = rnd.randint(0,10,(10,10))
k = (1,2)
b = a[:,k]
for col in np.arange(np.size(b,1)):
    b[:,col] = b[:,col]+col*100
This code will work when k has more than one element. However, when k is a single plain index (e.g. k = 1 rather than k = (1,)), the sub-matrix extracted from a collapses to a 1D array, and applying the operation in the for loop throws an error.
Of course, I could fix this by checking the dimension of b and reshaping:
if np.ndim(b) == 1:
    b = np.reshape(b, (np.size(b), 1))
in order to obtain a column vector, but this is expensive.
So, the question is: what is the best way to handle this situation?
This seems like something that would arise quite often and I wonder what is the best strategy to deal with it.
If you index with a list or tuple, the 2d shape is preserved:
In [638]: a=np.random.randint(0,10,(10,10))
In [639]: a[:,(1,2)].shape
Out[639]: (10, 2)
In [640]: a[:,(1,)].shape
Out[640]: (10, 1)
And I think b iteration can be simplified to:
a[:,k] += np.arange(len(k))*100
This sort of calculation will also be easier if k is always a list or tuple, and never a scalar (a scalar does not have a len).
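A minimal check of that one-liner (random data, so we just compare before and after):

import numpy as np
a = np.random.randint(0, 10, (10, 10))
k = (1, 2)
before = a[:, k].copy()
a[:, k] += np.arange(len(k)) * 100  # adds 0 to column 1, 100 to column 2
print(np.array_equal(a[:, k], before + [0, 100]))  # True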
np.column_stack ensures its inputs are 2D (expanding 1D inputs into columns) with:
if arr.ndim < 2:
    arr = array(arr, copy=False, subok=True, ndmin=2).T
np.atleast_2d does
elif len(ary.shape) == 1:
    result = ary[newaxis,:]
which of course could be changed in this case to

if b.ndim == 1:
    b = b[:, None]
Anyway, I think it is better to ensure that k is a tuple rather than adjust the shape of b afterwards. But keep both options in your toolbox.
I am working with data from netCDF files, with multi-dimensional variables read into numpy arrays. I need to scan all values in all dimensions (axes in numpy) and alter some of them. But I don't know in advance the dimensions of any given variable. At runtime I can, of course, get the ndim and shape of the numpy array.
How can I program a loop through all values without knowing the number of dimensions or shapes in advance? If I knew a variable had exactly 2 dimensions, I would do
shp = myarray.shape
for i in range(shp[0]):
    for j in range(shp[1]):
        do_something(myarray[i][j])
You should look into ravel, nditer and ndindex.
# For the simple case
for value in np.nditer(a):
    do_something_with(value)

# This is similar to above
for value in a.ravel():
    do_something_with(value)

# Or if you need the index
for idx in np.ndindex(a.shape):
    a[idx] = do_something_with(a[idx])
On an unrelated note, numpy arrays are indexed a[i, j] instead of a[i][j]. In Python, a[i, j] is equivalent to indexing with a tuple, i.e. a[(i, j)].
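One caveat worth knowing: the simple np.nditer loop above gives read-only access. To modify elements through the iterator you have to request write access, roughly like this:

# Writing through nditer requires op_flags
with np.nditer(a, op_flags=['readwrite']) as it:
    for value in it:
        value[...] = do_something_with(value)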
You can use the flat property of numpy arrays, which returns an iterator over all values (no matter the shape).
For instance:
>>> A = np.array([[1,2,3],[4,5,6]])
>>> for x in A.flat:
...     print(x)
1
2
3
4
5
6
You can also set the values in the same order they're returned, e.g. like this:
>>> A.flat[:] = [x / 2 if x % 2 == 0 else x for x in A.flat]
>>> A
array([[1, 1, 3],
[2, 5, 3]])
I am not sure the order in which flat returns the elements is guaranteed in any way, since it iterates through the elements as they are in memory; depending on your array convention you are likely to always get the same order, unless you are really changing the layout on purpose, but be careful...
And this will work for any dimension.
EDIT: To clarify what I meant by "order not guaranteed": the order of elements returned by flat does not change from run to run, but I think it would be unwise to count on it for things like row1 = A.flat[:N], although it will work most of the time.
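For what it's worth, the NumPy documentation describes flatiter (which is what A.flat returns) as iterating in row-major, C-style index order regardless of how the array is laid out in memory, so the order should in fact be stable; a quick check:

A = np.array([[1, 2, 3], [4, 5, 6]])
F = np.asfortranarray(A)  # same values, different memory layout
print(list(A.flat) == list(F.flat))  # True: both iterate in row-major order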
This might be the easiest with recursion:
import numpy

a = numpy.array(range(30)).reshape(5, 3, 2)

def recursive_do_something(array):
    if len(array.shape) == 1:
        for obj in array:
            do_something(obj)
    else:
        for subarray in array:
            recursive_do_something(subarray)

recursive_do_something(a)
In case you want the indices:
a = numpy.array(range(30)).reshape(5, 3, 2)

def do_something(x, indices):
    print(indices, x)

def recursive_do_something(array, indices=None):
    indices = indices or []
    if len(array.shape) == 1:
        for obj in array:
            do_something(obj, indices)
    else:
        for i, subarray in enumerate(array):
            recursive_do_something(subarray, indices + [i])

recursive_do_something(a)
Look into Python's itertools module.
Python 2: http://docs.python.org/2/library/itertools.html#itertools.product
Python 3: http://docs.python.org/3.3/library/itertools.html#itertools.product
This will allow you to do something along the lines of
from itertools import product

for idx in product(*(range(n) for n in myarray.shape)):
    do_something(myarray[idx])