Consider the following two implementations of the same piece of code. I would have thought they were identical, but they are not.
Is this a Python/NumPy bug or a subtle gotcha? If the latter, what rule would make it obvious why it does not work as expected?
I was working with multiple arrays of data and had to process each array item by item, with each array manipulated by a table depending on its metadata.
In the real-world example 'n' stands for multiple factors and offsets, but the following code still demonstrates the issue: I was getting the wrong result in all but one case.
import numpy as np
# Change the following line to True to show different behaviour
NEEDS_BUGS = False # Changeme
# Create some data
data = np.linspace(0, 1, 10)
print(data)
# Create an array of vector functions each of which does a different operation on a set of data
vfuncd = dict()
# Two implementations
if NEEDS_BUGS:
    # Let's do this in a loop because we like loops - However WARNING this does not work!!
    for n in range(10):
        vfuncd[n] = np.vectorize(lambda x: x * n)
else:
    # Unwrap the loop - NOTE: Spoiler - this works
    vfuncd[0] = np.vectorize(lambda x: x * 0)
    vfuncd[1] = np.vectorize(lambda x: x * 1)
    vfuncd[2] = np.vectorize(lambda x: x * 2)
    vfuncd[3] = np.vectorize(lambda x: x * 3)
    vfuncd[4] = np.vectorize(lambda x: x * 4)
    vfuncd[5] = np.vectorize(lambda x: x * 5)
    vfuncd[6] = np.vectorize(lambda x: x * 6)
    vfuncd[7] = np.vectorize(lambda x: x * 7)
    vfuncd[8] = np.vectorize(lambda x: x * 8)
    vfuncd[9] = np.vectorize(lambda x: x * 9)
# Prove we have multiple different vectorised functions
for k, vfunc in vfuncd.items():
    print(k, vfunc)
# Do the work
res = {k: vfuncd[k](data) for k in vfuncd.keys()}
# Show the result
for k, r in res.items():
    print(k, r)
I don't know what exactly you're trying to achieve and if it's a bad idea or not (in terms of np.vectorize), but the issue you're facing is because of the way python makes closures. Quoting from an answer to the linked question:
Scoping in Python is lexical. A closure will always
remember the name and scope of the variable, not the object it's
pointing to. Since all the functions in your example are created in
the same scope and use the same variable name, they always refer to
the same variable.
In other words, when you make that closure over n, you're not actually closing over the state of n, just the name. So when n changes, the value seen by your closure also changes. This is quite unexpected to me, but others find it natural.
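To see the rule in isolation, here is a minimal sketch without NumPy. The names funcs, after_loop and after_rebind are made up for the illustration. Every lambda looks up n when it is called, so they all see whatever n holds at that moment:

```python
# Build three lambdas in a loop; each refers to the name n, not to its value.
funcs = []
for n in range(3):
    funcs.append(lambda x: x * n)

# After the loop n == 2, so every function multiplies by 2.
after_loop = [f(10) for f in funcs]
print(after_loop)

# Rebinding n changes what all of them compute.
n = 5
after_rebind = [f(10) for f in funcs]
print(after_rebind)
```

Any fix therefore has to capture the value of n at the time each function is created, rather than the name.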
Here is one fix using partial:
from functools import partial
.
.
.
def func(x, n):
    return x * n
for n in range(10):
    vfuncd[n] = np.vectorize(partial(func, n=n))
Or another using a factory method:
def func_factory(n):
    return lambda x: x * n
for n in range(10):
    vfuncd[n] = np.vectorize(func_factory(n))
It seems that the lambda holds on to the Python variable n itself rather than the value it had when the lambda was created:
for n in range(10):
    vfuncd[n] = np.vectorize(lambda x: x * n)
Wrapping n in a new object inside the lambda body does not help, because n is still looked up each time the lambda is called. Binding the current value as a default argument fixes it, since defaults are evaluated when the lambda is defined:
for n in range(10):
    vfuncd[n] = np.vectorize(lambda x, n=n: x * n)
This also has implications for performance, as I assume the free variable n has to be looked up on every call.
In [13]: data = np.linspace(0,1,11)
Since the data array can be multiplied with a simple:
In [14]: data*3
Out[14]: array([0. , 0.3, 0.6, 0.9, 1.2, 1.5, 1.8, 2.1, 2.4, 2.7, 3. ])
we don't need the complication of np.vectorize to see the closure issue. A simple lambda is enough.
In [15]: vfuncd = {}
...: for n in range(3):
...:     vfuncd[n] = lambda x:x*n
...:
In [16]: vfuncd
Out[16]:
{0: <function __main__.<lambda>(x)>,
1: <function __main__.<lambda>(x)>,
2: <function __main__.<lambda>(x)>}
In [17]: {k:v(data) for k,v in vfuncd.items()}
Out[17]:
{0: array([0. , 0.2, 0.4, 0.6, 0.8, 1. , 1.2, 1.4, 1.6, 1.8, 2. ]),
1: array([0. , 0.2, 0.4, 0.6, 0.8, 1. , 1.2, 1.4, 1.6, 1.8, 2. ]),
2: array([0. , 0.2, 0.4, 0.6, 0.8, 1. , 1.2, 1.4, 1.6, 1.8, 2. ])}
We won't get the closure problem if we use a proper numpy "vectorization":
In [18]: data * np.arange(3)[:,None]
Out[18]:
array([[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
[0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ],
[0. , 0.2, 0.4, 0.6, 0.8, 1. , 1.2, 1.4, 1.6, 1.8, 2. ]])
Or a simple iteration if we need a dictionary:
In [20]: {k:data*k for k in range(3)}
Out[20]:
{0: array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]),
1: array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ]),
2: array([0. , 0.2, 0.4, 0.6, 0.8, 1. , 1.2, 1.4, 1.6, 1.8, 2. ])}
np.vectorize has a speed disclaimer. But it is justified where the function only takes scalar inputs, and we want the flexibility of numpy broadcasting - i.e. for 2 or more arguments.
Creating multiple vectorize is clearly an 'anti-pattern'. I'd rather see one vectorize with the appropriate arguments:
In [25]: f = np.vectorize(lambda x,n: x*n)
In [26]: {n: f(data,n) for n in range(3)}
Out[26]:
{0: array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]),
1: array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ]),
2: array([0. , 0.2, 0.4, 0.6, 0.8, 1. , 1.2, 1.4, 1.6, 1.8, 2. ])}
That f can also produce the array Out[18] (but is slower):
In [27]: f(data, np.arange(3)[:,None])
Out[27]:
array([[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
[0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ],
[0. , 0.2, 0.4, 0.6, 0.8, 1. , 1.2, 1.4, 1.6, 1.8, 2. ]])
I am trying to find the row and column index in a 2d numpy array where the value lies in a range.
Though I am able to accomplish this with the following code, I would like only one occurrence to be reported for a symmetric matrix, where a_ij = a_ji:
In [118]: test_arr = np.array([[1, 0.2, 0.04],
...: [0.2, 0.3, 0.06 ],
...: [0.04, 0.06, 0.09]
...: ])
...:
In [119]: test_arr
Out[119]:
array([[1. , 0.2 , 0.04],
[0.2 , 0.3 , 0.06],
[0.04, 0.06, 0.09]])
In [120]: np.argwhere((test_arr==0.06))
Out[120]:
array([[1, 2],
[2, 1]])
Is there any way using numpy where we can restrict i<j so that the output will only be as:
array([[1, 2]])
Any help is appreciated!
In [38]: test_arr = np.array([[1, 0.2, 0.04],
...: [0.2, 0.3, 0.06 ],
...: [0.04, 0.06, 0.09]
...: ])
In [39]: test_arr
Out[39]:
array([[1. , 0.2 , 0.04],
[0.2 , 0.3 , 0.06],
[0.04, 0.06, 0.09]])
In [40]: np.where(test_arr==0.06)
Out[40]: (array([1, 2]), array([2, 1]))
Let's explore using one of the tri functions to set some of the values of the array to 0:
In [41]: np.tril(test_arr)
Out[41]:
array([[1. , 0. , 0. ],
[0.2 , 0.3 , 0. ],
[0.04, 0.06, 0.09]])
In [42]: np.triu(test_arr)
Out[42]:
array([[1. , 0.2 , 0.04],
[0. , 0.3 , 0.06],
[0. , 0. , 0.09]])
Now apply the equality test:
In [44]: np.triu(test_arr)==0.06
Out[44]:
array([[False, False, False],
[False, False, True],
[False, False, False]])
In [45]: np.argwhere(np.triu(test_arr)==0.06)
Out[45]: array([[1, 2]])
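Since the question asks about values in a range, a variation is to build the boolean mask first and apply triu to the mask rather than to the data. This is only a sketch: the bounds lo and hi are made-up values for illustration. Masking the boolean array also avoids false matches when the value being searched for is 0, and k=1 restricts the result to strictly i<j (the diagonal is excluded):

```python
import numpy as np

test_arr = np.array([[1, 0.2, 0.04],
                     [0.2, 0.3, 0.06],
                     [0.04, 0.06, 0.09]])

# Hypothetical range bounds, chosen only for this example
lo, hi = 0.05, 0.07

# True wherever the value lies in [lo, hi]
mask = (test_arr >= lo) & (test_arr <= hi)

# Keep only entries strictly above the diagonal (i < j)
upper = np.triu(mask, k=1)

pairs = np.argwhere(upper)
print(pairs)
```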
I am confused about doing vectorization using numpy.
In particular, I have a matrix of this form:
of type <type 'list'>
[[0.0, 0.0, 0.0, 0.0], [0.02, 0.04, 0.0325, 0.04], [1, 2, 3, 4]]
How do I make it look like the following using numpy?
[[ 0.0 0.0 0.0 0.0 ]
[ 0.02 0.04 0.0325 0.04 ]
[ 1 2 3 4 ]]
Yes, I know I can do it using:
np.array([[0.0, 0.0, 0.0, 0.0], [0.02, 0.04, 0.0325, 0.04], [1, 2, 3, 4]])
But I have a very long matrix, and I can't just type out each rows like that. How can I handle the case when I have a very long matrix?
This is not a matrix of type list; it is a list that contains lists. You may think of it as a matrix, but to Python it is just a list:
alist = [[0.0, 0.0, 0.0, 0.0], [0.02, 0.04, 0.0325, 0.04], [1, 2, 3, 4]]
arr = np.array(alist)
works just the same as
arr = np.array([[0.0, 0.0, 0.0, 0.0], [0.02, 0.04, 0.0325, 0.04], [1, 2, 3, 4]])
This creates a 2d array, with shape (3,4) and dtype float:
In [212]: arr = np.array([[0.0, 0.0, 0.0, 0.0], [0.02, 0.04, 0.0325, 0.04], [1, 2, 3, 4]])
In [213]: arr
Out[213]:
array([[ 0. , 0. , 0. , 0. ],
[ 0.02 , 0.04 , 0.0325, 0.04 ],
[ 1. , 2. , 3. , 4. ]])
In [214]: print(arr)
[[ 0. 0. 0. 0. ]
[ 0.02 0.04 0.0325 0.04 ]
[ 1. 2. 3. 4. ]]
Assuming you start with a large flat list (called array here), why not split it into rows of the right size (n):
splitted = [array[i:i + n] for i in range(0, len(array), n)]
and make the matrix from that:
np.array(splitted)
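If the flat data divides evenly into rows of length n, reshape does the same thing without the intermediate list of slices. A small sketch, where flat and n are made-up example values:

```python
import numpy as np

# Hypothetical flat input: 12 values that should become rows of 4
flat = list(range(12))
n = 4

# -1 lets NumPy work out the number of rows from the total length
arr = np.array(flat).reshape(-1, n)
print(arr)
```

Unlike the slicing version, reshape raises a ValueError when the length is not a multiple of n, instead of producing a short last row.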
If you're saying you have a list of lists stored in Python object A, all you need to do is call np.array(A) which will return a numpy array using the elements of A. Otherwise, you need to specify what form your data is in right now to clarify how you want to load your data.
I have an nd array that looks as follows:
[[ 0. 1.73205081 6.40312424 7.21110255 2.44948974]
[ 1.73205081 0. 5.09901951 5.91607978 1. ]
[ 6.40312424 5.09901951 0. 1. 4.35889894]
[ 7.21110255 5.91607978 1. 0. 5.09901951]
[ 2.44948974 1. 4.35889894 5.09901951 0. ]]
Each element in this array is a distance and I need to turn this into a list with the row,col,distance as follows:
l = [(0,0,0),(0,1, 1.73205081),(0,2, 6.40312424),...,(1,0, 1.73205081),(1,1,0),...,(4,4,0)]
Additionally, it would be cool to remove the diagonal elements and also the elements (j,i) as (i,j) are already there. Essentially, is it possible to take just the top triangular matrix of this?
Is this possible to do efficiently (without a lot of loops)? I had created this array with squareform, but couldn't find any docs to do this.
squareform does all this. Read the docs and experiment. It works in both directions. If you give it a matrix it returns the upper triangle values (condensed form). If you give it those values, it returns the matrix.
In [668]: M
Out[668]:
array([[ 0. , 0.1, 0.5, 0.2],
[ 0.1, 0. , 2. , 0.3],
[ 0.5, 2. , 0. , 0.2],
[ 0.2, 0.3, 0.2, 0. ]])
In [669]: spatial.distance.squareform(M)
Out[669]: array([ 0.1, 0.5, 0.2, 2. , 0.3, 0.2])
In [670]: v=spatial.distance.squareform(M)
In [671]: v
Out[671]: array([ 0.1, 0.5, 0.2, 2. , 0.3, 0.2])
In [672]: spatial.distance.squareform(v)
Out[672]:
array([[ 0. , 0.1, 0.5, 0.2],
[ 0.1, 0. , 2. , 0.3],
[ 0.5, 2. , 0. , 0.2],
[ 0.2, 0.3, 0.2, 0. ]])
You can also specify a force and checks parameter, but without those it just goes by the shape.
Indices can come from triu_indices:
In [677]: np.triu_indices(4,1)
Out[677]:
(array([0, 0, 0, 1, 1, 2], dtype=int32),
array([1, 2, 3, 2, 3, 3], dtype=int32))
In [680]: np.vstack((np.triu_indices(4,1),v)).T
Out[680]:
array([[ 0. , 1. , 0.1],
[ 0. , 2. , 0.5],
[ 0. , 3. , 0.2],
[ 1. , 2. , 2. ],
[ 1. , 3. , 0.3],
[ 2. , 3. , 0.2]])
Just to check, we can fill in a 4x4 matrix with these values
In [686]: A=np.vstack((np.triu_indices(4,1),v)).T
In [687]: MM = np.zeros((4,4))
In [688]: MM[A[:,0].astype(int),A[:,1].astype(int)]=A[:,2]
In [689]: MM
Out[689]:
array([[ 0. , 0.1, 0.5, 0.2],
[ 0. , 0. , 2. , 0.3],
[ 0. , 0. , 0. , 0.2],
[ 0. , 0. , 0. , 0. ]])
Those triu indices can also fetch the values from M:
In [693]: I,J = np.triu_indices(4,1)
In [694]: M[I,J]
Out[694]: array([ 0.1, 0.5, 0.2, 2. , 0.3, 0.2])
squareform uses compiled code in spatial.distance._distance_wrap, so I expect it will be quite fast for large arrays. The only problem is that it returns just the condensed-form values, not the indices. But given the shape, the indices can always be calculated; they don't need to be stored with the values.
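If a list of (row, col, distance) tuples is really what is wanted, the indices and values can be zipped together. This sketch uses only NumPy and the small M from the session above; the name triples is made up for the example:

```python
import numpy as np

M = np.array([[0. , 0.1, 0.5, 0.2],
              [0.1, 0. , 2. , 0.3],
              [0.5, 2. , 0. , 0.2],
              [0.2, 0.3, 0.2, 0. ]])

# Row and column indices of the strict upper triangle (diagonal excluded)
I, J = np.triu_indices(M.shape[0], k=1)

# tolist() converts to plain Python ints and floats before zipping
triples = list(zip(I.tolist(), J.tolist(), M[I, J].tolist()))
print(triples)
```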
If your input is x, first generate the indices:
i0,i1 = np.indices(x.shape)
Then:
np.concatenate((i1,i0,x)).reshape(3,5,5).T
That gives you the first result--for the entire matrix.
As for taking only the upper triangle, you might consider trying np.triu(), but I'm not sure exactly what result you're looking for. You can probably figure out how to mask the parts you don't want now, though.
You can try this:
print([(x,y, value) for (x,y), value in np.ndenumerate(numpymatrixarray)])
output [(0, 0, 0.0), (0, 1, 1.7320508100000001), (0, 2, 6.4031242400000004), (0, 3, 7.2111025499999997), (0, 4, 2.4494897400000002), (1, 0, 1.7320508100000001), (1, 1, 0.0), (1, 2, 5.0990195099999998), (1, 3, 5.9160797799999996), (1, 4, 1.0), (2, 0, 6.4031242400000004), (2, 1, 5.0990195099999998), (2, 2, 0.0), (2, 3, 1.0), (2, 4, 4.3588989400000004), (3, 0, 7.2111025499999997), (3, 1, 5.9160797799999996), (3, 2, 1.0), (3, 3, 0.0), (3, 4, 5.0990195099999998), (4, 0, 2.4494897400000002), (4, 1, 1.0), (4, 2, 4.3588989400000004), (4, 3, 5.0990195099999998), (4, 4, 0.0)]
Do you really want the top triangular matrix for an [nxm] matrix where n>m? That will give you (nxn-n)/2 elements and lose all the data where m⊖n.
What you probably want is the lower triangular matrix:
def tri_reduce(m):
    n = m.shape
    if n[0] > n[1]:
        i = np.tril_indices(n[0], 1, n[1])
    else:
        i = np.triu_indices(n[0], 1, n[1])
    return np.vstack((i, m[i])).T
Rebuilding it into a list of tuples would require a loop though I believe. list(tri_reduce(m)) would give a list of nd arrays.
In a research study I have 2 variables:
x = number objects remembered
y = % tasks completed correctly
as follows:
x = np.array([2,2,2,3,3,3,3,3,4,4,4,4,4,4,5,5,5,5,5,6,6,6,6,7,7])
y = np.array([1.0, 1.0, 1.0, 0.33, 0.33, 0.66, 0.66, 1.0, 1.0, 1.0, 1.0, 0.75, 0.75, 0.5, 1.0, 1.0, 0.6, 0.4, 0.5,0.75, 1.0,1.0,0.6,0.5,0.75])
I would like to count how often each (x, y) combination occurs, giving a result like:
WMC Percent Count
2 100 3
3 33 2
3 66 2 etc.
I note that scipy.stats.itemfreq and np.bincount only work for one variable.
If you have access to a recent version of numpy (1.9.0 or higher) you can use unique with the return_counts flag enabled. That will give you 2 arrays, one with values and one with the counts.
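For the two-variable case, one possible approach (assuming NumPy 1.13 or later, where unique accepts an axis argument) is to stack x and y as columns and count unique rows. The names pairs and table are made up for this sketch:

```python
import numpy as np

x = np.array([2,2,2,3,3,3,3,3,4,4,4,4,4,4,5,5,5,5,5,6,6,6,6,7,7])
y = np.array([1.0, 1.0, 1.0, 0.33, 0.33, 0.66, 0.66, 1.0, 1.0, 1.0, 1.0,
              0.75, 0.75, 0.5, 1.0, 1.0, 0.6, 0.4, 0.5, 0.75, 1.0, 1.0,
              0.6, 0.5, 0.75])

# One row per observation: (x, y)
pairs = np.column_stack((x, y))

# Unique rows, sorted lexicographically, with how often each occurs
values, counts = np.unique(pairs, axis=0, return_counts=True)

# Combine into a single (x, y, count) table
table = np.column_stack((values, counts))
print(table)
```

Note that the result is a float array, because x is upcast when stacked with y.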
Here's a slightly modified version of the numpy.unique method which works for your case:
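def unique(ar):
    # Sort rows by the first column, then by the second
    ar = ar[np.lexsort((ar[:, 1], ar[:, 0]))]
    # Mark the first row of each run of identical rows
    flag = np.concatenate(([True], (ar[1:] != ar[:-1]).any(axis=1)))
    # Start index of each run, plus the total row count as the final boundary
    idx = np.concatenate(np.nonzero(flag) + ([ar.shape[0]],))
    return np.array(list(zip(ar[flag][:, 0], ar[flag][:, 1], np.diff(idx))))
print(unique(np.array(list(zip(x, y)))))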
Result:
[[ 2. 1. 3. ]
[ 3. 0.33 2. ]
[ 3. 0.66 2. ]
[ 3. 1. 1. ]
[ 4. 0.5 1. ]
[ 4. 0.75 2. ]
[ 4. 1. 3. ]
[ 5. 0.4 1. ]
[ 5. 0.5 1. ]
[ 5. 0.6 1. ]
[ 5. 1. 2. ]
[ 6. 0.6 1. ]
[ 6. 0.75 1. ]
[ 6. 1. 2. ]
[ 7. 0.5 1. ]
[ 7. 0.75 1. ]]
Earlier on in your code why not construct a dictionary linking 'number objects remembered' to '% tasks completed correctly'?
i.e.
completed_tasks = {2 : 1.0, 3 : 33, 4 : 66}
then, you can easily add the completed tasks count to the array that is returned by scipy.stats.itemfreq:
a = scipy.stats.itemfreq(x)
a = [list(row) + [completed_tasks[row[0]]] for row in a]
I would use collections.Counter for that purpose:
>>> import numpy as np
>>> x = np.array([2,2,2,3,3,3,3,3,4,4,4,4,4,4,5,5,5,5,5,6,6,6,6,7,7])
>>> y = np.array([1.0, 1.0, 1.0, 0.33, 0.33, 0.66, 0.66, 1.0, 1.0, 1.0, 1.0, 0.75, 0.75, 0.5, 1.0, 1.0, 0.6, 0.4, 0.5,0.75, 1.0,1.0,0.6,0.5,0.75])
>>> from collections import Counter
>>> c = Counter(zip(x,y))
>>> c
Counter({(2, 1.0): 3, (4, 1.0): 3, (3, 0.66000000000000003): 2, (5, 1.0): 2, (3, 0.33000000000000002): 2, (6, 1.0): 2, (4, 0.75): 2, (7, 0.5): 1, (6, 0.59999999999999998): 1, (5, 0.40000000000000002): 1, (5, 0.59999999999999998): 1, (3, 1.0): 1, (7, 0.75): 1, (6, 0.75): 1, (5, 0.5): 1, (4, 0.5): 1})
Not sure if it is suitable in your case, however, you can do this using itertools.groupby() on the zipped lists:
import numpy as np
from itertools import groupby
x = np.array([2,2,2,3,3,3,3,3,4,4,4,4,4,4,5,5,5,5,5,6,6,6,6,7,7])
y = np.array([1.0, 1.0, 1.0, 0.33, 0.33, 0.66, 0.66, 1.0, 1.0, 1.0, 1.0, 0.75, 0.75, 0.5, 1.0, 1.0, 0.6, 0.4, 0.5,0.75, 1.0,1.0,0.6,0.5,0.75])
print("WMC\tPercent\tCount")
for key, group in groupby(sorted(zip(x, y))):
    print("{}\t{}\t{}".format(key[0], int(key[1]*100), len(list(group))))
Output
WMC Percent Count
2 100 3
3 33 2
3 66 2
3 100 1
4 50 1
4 75 2
4 100 3
5 40 1
5 50 1
5 60 1
5 100 2
6 60 1
6 75 1
6 100 2
7 50 1
7 75 1
Updated to produce numpy array
import numpy as np
from itertools import groupby
x = np.array([2,2,2,3,3,3,3,3,4,4,4,4,4,4,5,5,5,5,5,6,6,6,6,7,7])
y = np.array([1.0, 1.0, 1.0, 0.33, 0.33, 0.66, 0.66, 1.0, 1.0, 1.0, 1.0, 0.75, 0.75, 0.5, 1.0, 1.0, 0.6, 0.4, 0.5,0.75, 1.0,1.0,0.6,0.5,0.75])
results = np.array([(key[0], int(key[1]*100), len(list(group)))
for key, group in groupby(sorted(zip(x, y)))])
Output
>>> results
array([[ 2, 100, 3],
[ 3, 33, 2],
[ 3, 66, 2],
[ 3, 100, 1],
[ 4, 50, 1],
[ 4, 75, 2],
[ 4, 100, 3],
[ 5, 40, 1],
[ 5, 50, 1],
[ 5, 60, 1],
[ 5, 100, 2],
[ 6, 60, 1],
[ 6, 75, 1],
[ 6, 100, 2],
[ 7, 50, 1],
[ 7, 75, 1]])