As an example, I have an array of branches and probabilities that looks like this:
paths = np.array([
[1, 0, 1.0],
[2, 0, 0.4],
[2, 1, 0.6],
[3, 1, 1.0],
[5, 1, 0.25],
[5, 2, 0.5],
[5, 4, 0.25],
[6, 0, 0.7],
[6, 5, 0.2],
[6, 2, 0.1]])
The columns are upper node, lower node, probability.
Here's a visual of the nodes:
          6
        / | \
      5   0   2
    / | \    / \
   1  2  4  0   1
   | / \        |
   0 0  1       0
        |
        0
I want to be able to pick a starting node and output an array of the branches and cumulative probabilities, including all the duplicate branches. For example:
start_node = 5 should return
array([
[5, 1, 0.25],
[5, 2, 0.5],
[5, 4, 0.25],
[1, 0, 0.25],
[2, 0, 0.2],
[2, 1, 0.3],
[1, 0, 0.3]])
Notice the [1, 0, x] branch is included twice, as it's fed by both the [5, 1, 0.25] branch and the [2, 1, 0.3] branch.
Here's some code I got working but it's far too slow for my application (millions of branches):
def branch(start_node, paths):
    output = paths[paths[:, 0] == start_node]
    next_nodes = output
    while True:
        can_go_lower = np.isin(next_nodes[:, 1], paths[:, 0])
        if ~np.any(can_go_lower):
            break
        next_nodes_checked = next_nodes[can_go_lower]
        next_nodes = np.empty([0, 3])
        for nodes in next_nodes_checked:
            to_append = paths[paths[:, 0] == nodes[1]]
            to_append[:, 2] *= nodes[2]
            next_nodes = np.append(next_nodes, to_append, axis=0)
        output = np.append(output, next_nodes, axis=0)
    return output
The branches always go from a higher node to a lower node, so getting caught in cycles isn't a concern. A way to vectorize the for loop and avoid the appends would be the best optimization, I think.
Instead of storing the graph in a numpy array, let's store it in a dict (arr below is the paths array from the question):
tree = {k:arr[arr[:, 0] == k] for k in np.unique(arr[:, 0])}
Make a set of the non-leaf nodes:
non_leaf_nodes = set(np.unique(arr[:, 0]))
Now to find the branch and cumulative probability:
def branch(start_node, tree, non_leaf_nodes):
    # rows of (upper_node, lower_node, cumulative_probability); the seed row carries probability 1.0
    curr_nodes = [[start_node, start_node, 1.0]]
    output = []
    while True:
        next_nodes = []
        for _, node, prob in curr_nodes:
            if node not in non_leaf_nodes:
                continue
            subtree = tree[node]
            to_append = subtree.copy()
            to_append[:, 2] *= prob
            to_append = to_append.tolist()
            output += to_append
            next_nodes += to_append
        curr_nodes = next_nodes
        if len(curr_nodes) == 0:
            break
    return np.array(output)
Output:
>>> branch(5, tree, non_leaf_nodes)
array([
[5. , 1. , 0.25],
[5. , 2. , 0.5 ],
[5. , 4. , 0.25],
[1. , 0. , 0.25],
[2. , 0. , 0.2 ],
[2. , 1. , 0.3 ],
[1. , 0. , 0.3 ]])
I expect this to be faster, since each step is a dict lookup and a small multiplication rather than a boolean scan over the whole paths array. Let me know.
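For reference, a minimal end-to-end sketch using the paths array from the question (the same array the arr lines above refer to):

import numpy as np

# paths is the (upper node, lower node, probability) array from the question
tree = {k: paths[paths[:, 0] == k] for k in np.unique(paths[:, 0])}
non_leaf_nodes = set(np.unique(paths[:, 0]))

print(branch(5, tree, non_leaf_nodes))  # prints the array shown above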
I'm sorry if this question isn't framed well, so I'd rather explain with an example.
I have a numpy matrix:
a = np.array([[0.5, 0.8, 0.1], [0.6, 0.9, 0.3], [0.7, 0.4, 0.8], [0.8, 0.7, 0.6]])
And another numpy array as shown:
b = np.array([1, 0, 2, 2])
The given conditions are that the values in b lie in range(a.shape[1]) and that b.shape[0] == a.shape[0]. Now this is the operation I need to perform.
For every row index i of a, and the corresponding index i of b, I need to subtract 1 from element j of a[i], where j == b[i].
So in my example, a[0] == [0.5 0.8 0.1] and b[0] == 1. Therefore I need to subtract 1 from a[0][b[0]] so that a[0] = [0.5, -0.2, 0.1]. This has to be done for all rows of a. Any direct solution without me having to iterate through all rows or columns one by one?
Thanks.
Use numpy integer array indexing:
import numpy as np
a = np.array([[0.5, 0.8, 0.1], [0.6, 0.9, 0.3], [0.7, 0.4, 0.8], [0.8, 0.7, 0.6]])
b = np.array([1, 0, 2, 2])
a[np.arange(a.shape[0]), b] -= 1
print(a)
Output
[[ 0.5 -0.2 0.1]
[-0.4 0.9 0.3]
[ 0.7 0.4 -0.2]
[ 0.8 0.7 -0.4]]
As an alternative, use np.subtract.at:
np.subtract.at(a, (np.arange(a.shape[0]), b), 1)
print(a)
Output
[[ 0.5 -0.2 0.1]
[-0.4 0.9 0.3]
[ 0.7 0.4 -0.2]
[ 0.8 0.7 -0.4]]
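In case the fancy indexing looks opaque, here is the plain Python loop that both one-liners replace (for illustration only, using the a and b defined above):

for i in range(a.shape[0]):
    a[i, b[i]] -= 1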
The main idea is that:
np.arange(a.shape[0])  # shape[0] equals the number of rows
generates the indices of the rows:
[0 1 2 3]
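One difference worth noting (an aside, not from the original answer): if the same (row, column) pair appeared more than once in the index arrays, the fancy-indexed -= would subtract only once per pair because the operation is buffered, whereas np.subtract.at accumulates every occurrence. A minimal illustration:

import numpy as np

m = np.zeros((2, 3))
rows = np.array([0, 0])  # the same (row, column) pair listed twice
cols = np.array([1, 1])

p = m.copy()
p[rows, cols] -= 1                  # buffered: subtracts 1 only once
q = m.copy()
np.subtract.at(q, (rows, cols), 1)  # unbuffered: subtracts 1 twice

print(p[0, 1], q[0, 1])             # -1.0 -2.0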
a = np.array([[0, 2, 0, 0], [0, 1, 3, 0], [0, 0, 10, 11], [0, 0, 1, 7]])
array([[ 0, 2, 0, 0],
[ 0, 1, 3, 0],
[ 0, 0, 10, 11],
[ 0, 0, 1, 7]])
Each row contains some zero entries. I need to assign a value to each of these zero entries, where the value is calculated as follows:
V = 0.1 * Si / Ni
where Si is the sum of row i and Ni is the number of zero entries in row i.
I can calculate Si and Ni fairly easily:
S = np.sum(a, axis=1)
array([ 2, 4, 21, 8])
N = np.count_nonzero(a == 0, axis=1)
array([3, 2, 2, 2])
Now, V is calculated as:
V = 0.1 * S/N
array([0.06666667, 0.2 , 1.05 , 0.4 ])
But how do I assign V[i] to the zero entries in the i-th row? I'm expecting to get the following array a:
array([[ 0.06666667, 2, 0.06666667, 0.06666667],
[ 0.2, 1, 3, 0.2],
[ 1.05, 1.05, 10, 11],
[ 0.4, 0.4, 1, 7]])
I need some kind of selective broadcasting operation or assignment?
Use np.where:
np.where(a == 0, V.reshape(-1, 1), a)
array([[ 0.06666667, 2. , 0.06666667, 0.06666667],
[ 0.2 , 1. , 3. , 0.2 ],
[ 1.05 , 1.05 , 10. , 11. ],
[ 0.4 , 0.4 , 1. , 7. ]])
Here's a way using np.where:
z = a == 0
np.where(z, (0.1*a.sum(1)/z.sum(1))[:,None], a)
array([[ 0.06666667, 2. , 0.06666667, 0.06666667],
[ 0.2 , 1. , 3. , 0.2 ],
[ 1.05 , 1.05 , 10. , 11. ],
[ 0.4 , 0.4 , 1. , 7. ]])
Maybe using a mask:
for i in range(V.size):
    print((a[i, :] == 0) * V[i] + a[i, :])
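A fully vectorized, in-place variant of the mask idea (a sketch; it assumes a is converted to float first, since an integer array cannot hold the fractional fill values):

import numpy as np

a = np.array([[0, 2, 0, 0], [0, 1, 3, 0], [0, 0, 10, 11], [0, 0, 1, 7]], dtype=float)
mask = a == 0                               # True where a value needs to be filled
V = 0.1 * a.sum(axis=1) / mask.sum(axis=1)  # per-row fill value
a[mask] = np.broadcast_to(V[:, None], a.shape)[mask]
print(a)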
In a research study I have 2 variables:
x = number of objects remembered
y = % of tasks completed correctly
as follows:
x = np.array([2,2,2,3,3,3,3,3,4,4,4,4,4,4,5,5,5,5,5,6,6,6,6,7,7])
y = np.array([1.0, 1.0, 1.0, 0.33, 0.33, 0.66, 0.66, 1.0, 1.0, 1.0, 1.0, 0.75, 0.75, 0.5, 1.0, 1.0, 0.6, 0.4, 0.5,0.75, 1.0,1.0,0.6,0.5,0.75])
I would like to return the count of each (x, y) combination, for example:
WMC Percent Count
2   100     3
3   33      2
3   66      2
etc.
I note that scipy.stats.itemfreq and np.bincount only work for one variable.
If you have access to a recent version of numpy (1.9.0 or higher) you can use unique with the return_counts flag enabled. That will give you 2 arrays, one with values and one with the counts.
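A sketch of how that could look for the (x, y) pairs here; note that grouping the rows of a 2-D array also needs the axis argument to np.unique, which (beyond the 1.9.0 mentioned above) requires numpy 1.13 or later:

import numpy as np

pairs = np.column_stack((x, y))  # one (WMC, percent) row per observation
vals, counts = np.unique(pairs, axis=0, return_counts=True)
result = np.column_stack((vals, counts))  # columns: WMC, percent (as a fraction), count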
Here's a slightly modified version of the numpy.unique method which works for your case:
def unique(ar):
    # sort rows so that equal (x, y) pairs are adjacent
    ar = ar[np.lexsort((ar[:, 1], ar[:, 0]))]
    # True at the start of each new group of identical rows
    flag = np.concatenate(([True], (ar[1:] != ar[:-1]).any(axis=1)))
    # group start indices, with the total row count appended at the end
    idx = np.concatenate(np.nonzero(flag) + ([ar.size // 2],))
    return np.array(list(zip(ar[flag][:, 0], ar[flag][:, 1], np.diff(idx))))

print(unique(np.array(list(zip(x, y)))))
Result:
[[ 2. 1. 3. ]
[ 3. 0.33 2. ]
[ 3. 0.66 2. ]
[ 3. 1. 1. ]
[ 4. 0.5 1. ]
[ 4. 0.75 2. ]
[ 4. 1. 3. ]
[ 5. 0.4 1. ]
[ 5. 0.5 1. ]
[ 5. 0.6 1. ]
[ 5. 1. 2. ]
[ 6. 0.6 1. ]
[ 6. 0.75 1. ]
[ 6. 1. 2. ]
[ 7. 0.5 1. ]
[ 7. 0.75 1. ]]
Earlier on in your code, why not construct a dictionary linking 'number of objects remembered' to '% of tasks completed correctly'?
i.e.
completed_tasks = {2 : 1.0, 3 : 33, 4 : 66}
then you can easily add the completed-tasks value to each (value, count) row returned by scipy.stats.itemfreq:
a = scipy.stats.itemfreq(x)
a = [list(row) + [completed_tasks[row[0]]] for row in a]
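As an aside, scipy.stats.itemfreq has been deprecated in newer SciPy releases; the same (value, count) pairs can come from np.unique. A sketch, assuming completed_tasks has an entry for every WMC value in x:

import numpy as np

vals, counts = np.unique(x, return_counts=True)
a = [[v, c, completed_tasks[v]] for v, c in zip(vals, counts)]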
I would use collections.Counter for that purpose:
>>> import numpy as np
>>> x = np.array([2,2,2,3,3,3,3,3,4,4,4,4,4,4,5,5,5,5,5,6,6,6,6,7,7])
>>> y = np.array([1.0, 1.0, 1.0, 0.33, 0.33, 0.66, 0.66, 1.0, 1.0, 1.0, 1.0, 0.75, 0.75, 0.5, 1.0, 1.0, 0.6, 0.4, 0.5,0.75, 1.0,1.0,0.6,0.5,0.75])
>>> from collections import Counter
>>> c = Counter(zip(x,y))
>>> c
Counter({(2, 1.0): 3, (4, 1.0): 3, (3, 0.66000000000000003): 2, (5, 1.0): 2, (3, 0.33000000000000002): 2, (6, 1.0): 2, (4, 0.75): 2, (7, 0.5): 1, (6, 0.59999999999999998): 1, (5, 0.40000000000000002): 1, (5, 0.59999999999999998): 1, (3, 1.0): 1, (7, 0.75): 1, (6, 0.75): 1, (5, 0.5): 1, (4, 0.5): 1})
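If you then want the Counter flattened into the same WMC / Percent / Count array as in the other answers, one possible conversion (a sketch, not from the original answer):

>>> result = np.array(sorted((wmc, int(round(pct * 100)), n) for (wmc, pct), n in c.items()))

This produces the same rows as the groupby-based array further below.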
Not sure if it is suitable in your case, but you can do this using itertools.groupby() on the zipped arrays:
import numpy as np
from itertools import groupby
x = np.array([2,2,2,3,3,3,3,3,4,4,4,4,4,4,5,5,5,5,5,6,6,6,6,7,7])
y = np.array([1.0, 1.0, 1.0, 0.33, 0.33, 0.66, 0.66, 1.0, 1.0, 1.0, 1.0, 0.75, 0.75, 0.5, 1.0, 1.0, 0.6, 0.4, 0.5,0.75, 1.0,1.0,0.6,0.5,0.75])
print "WMC\tPercent\tCount"
for key, group in groupby(sorted(zip(x, y))):
print "{}\t{}\t{}".format(key[0], int(key[1]*100), len(list(group)))
Output
WMC Percent Count
2 100 3
3 33 2
3 66 2
3 100 1
4 50 1
4 75 2
4 100 3
5 40 1
5 50 1
5 60 1
5 100 2
6 60 1
6 75 1
6 100 2
7 50 1
7 75 1
Updated to produce a numpy array:
import numpy as np
from itertools import groupby
x = np.array([2,2,2,3,3,3,3,3,4,4,4,4,4,4,5,5,5,5,5,6,6,6,6,7,7])
y = np.array([1.0, 1.0, 1.0, 0.33, 0.33, 0.66, 0.66, 1.0, 1.0, 1.0, 1.0, 0.75, 0.75, 0.5, 1.0, 1.0, 0.6, 0.4, 0.5,0.75, 1.0,1.0,0.6,0.5,0.75])
results = np.array([(key[0], int(key[1] * 100), len(list(group)))
                    for key, group in groupby(sorted(zip(x, y)))])
Output
>>> results
array([[ 2, 100, 3],
[ 3, 33, 2],
[ 3, 66, 2],
[ 3, 100, 1],
[ 4, 50, 1],
[ 4, 75, 2],
[ 4, 100, 3],
[ 5, 40, 1],
[ 5, 50, 1],
[ 5, 60, 1],
[ 5, 100, 2],
[ 6, 60, 1],
[ 6, 75, 1],
[ 6, 100, 2],
[ 7, 50, 1],
[ 7, 75, 1]])