counts with 2 variables - python

In a research study I have 2 variables:
x = number objects remembered
y = % tasks completed correctly
as follows:
x = np.array([2,2,2,3,3,3,3,3,4,4,4,4,4,4,5,5,5,5,5,6,6,6,6,7,7])
y = np.array([1.0, 1.0, 1.0, 0.33, 0.33, 0.66, 0.66, 1.0, 1.0, 1.0, 1.0, 0.75, 0.75, 0.5, 1.0, 1.0, 0.6, 0.4, 0.5,0.75, 1.0,1.0,0.6,0.5,0.75])
I would like to count how many times each (x, y) pair occurs and return something like:
WMC Percent Count
2 100 3
3 33 2
3 66 2 etc.
I note that scipy.stats.itemfreq and np.bincount only work for one variable.

If you have access to a recent version of numpy (1.9.0 or higher) you can use unique with the return_counts flag enabled. That will give you 2 arrays, one with values and one with the counts.
Here's a slightly modified version of the numpy.unique method which works for your case:
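def unique(ar):
    # Sort rows by the first column, then by the second
    ar = ar[np.lexsort((ar[:, 1], ar[:, 0]))]
    # True at the first row of every run of identical rows
    flag = np.concatenate(([True], (ar[1:] != ar[:-1]).any(axis=1)))
    # Start index of each run, with the row count appended as a sentinel
    idx = np.concatenate(np.nonzero(flag) + ([len(ar)],))
    return np.column_stack((ar[flag], np.diff(idx)))
print(unique(np.column_stack((x, y))))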
Result:
[[ 2. 1. 3. ]
[ 3. 0.33 2. ]
[ 3. 0.66 2. ]
[ 3. 1. 1. ]
[ 4. 0.5 1. ]
[ 4. 0.75 2. ]
[ 4. 1. 3. ]
[ 5. 0.4 1. ]
[ 5. 0.5 1. ]
[ 5. 0.6 1. ]
[ 5. 1. 2. ]
[ 6. 0.6 1. ]
[ 6. 0.75 1. ]
[ 6. 1. 2. ]
[ 7. 0.5 1. ]
[ 7. 0.75 1. ]]
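The return_counts route mentioned at the start of this answer can be sketched as follows. This is a minimal sketch that assumes NumPy 1.13 or newer, where np.unique also accepts an axis argument (that version requirement goes beyond the 1.9.0 mentioned above); stacking x and y into a two-column array with np.column_stack is an illustrative choice, not part of the original answer.

```python
import numpy as np

x = np.array([2,2,2,3,3,3,3,3,4,4,4,4,4,4,5,5,5,5,5,6,6,6,6,7,7])
y = np.array([1.0, 1.0, 1.0, 0.33, 0.33, 0.66, 0.66, 1.0, 1.0, 1.0, 1.0,
              0.75, 0.75, 0.5, 1.0, 1.0, 0.6, 0.4, 0.5, 0.75, 1.0, 1.0,
              0.6, 0.5, 0.75])

# One row per observation: column 0 is x, column 1 is y
pairs = np.column_stack((x, y))

# axis=0 treats each row as one item; return_counts gives its frequency
rows, counts = np.unique(pairs, axis=0, return_counts=True)

# Append the counts as a third column
table = np.column_stack((rows, counts))
print(table)
```

The rows come back sorted lexicographically, so the layout matches the result shown above.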

Earlier on in your code why not construct a dictionary linking 'number objects remembered' to '% tasks completed correctly'?
i.e.
completed_tasks = {2: 1.0, 3: 0.33, 4: 0.66}  # needs an entry for every value in x
then you can add the completed-tasks value to each row of the array returned by scipy.stats.itemfreq:
a = scipy.stats.itemfreq(x)
a = [list(row) + [completed_tasks[row[0]]] for row in a]

I would use collections.Counter for that purpose:
>>> import numpy as np
>>> x = np.array([2,2,2,3,3,3,3,3,4,4,4,4,4,4,5,5,5,5,5,6,6,6,6,7,7])
>>> y = np.array([1.0, 1.0, 1.0, 0.33, 0.33, 0.66, 0.66, 1.0, 1.0, 1.0, 1.0, 0.75, 0.75, 0.5, 1.0, 1.0, 0.6, 0.4, 0.5,0.75, 1.0,1.0,0.6,0.5,0.75])
>>> from collections import Counter
>>> c = Counter(zip(x,y))
>>> c
Counter({(2, 1.0): 3, (4, 1.0): 3, (3, 0.66000000000000003): 2, (5, 1.0): 2, (3, 0.33000000000000002): 2, (6, 1.0): 2, (4, 0.75): 2, (7, 0.5): 1, (6, 0.59999999999999998): 1, (5, 0.40000000000000002): 1, (5, 0.59999999999999998): 1, (3, 1.0): 1, (7, 0.75): 1, (6, 0.75): 1, (5, 0.5): 1, (4, 0.5): 1})
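To get from the Counter to the table layout asked for in the question, one option is to sort its items and print one row per pair. This is a small sketch rather than part of the original answer; it uses plain lists instead of NumPy arrays to keep it self-contained, and the rounding to whole percentages is an assumption about the desired display.

```python
from collections import Counter

x = [2,2,2,3,3,3,3,3,4,4,4,4,4,4,5,5,5,5,5,6,6,6,6,7,7]
y = [1.0, 1.0, 1.0, 0.33, 0.33, 0.66, 0.66, 1.0, 1.0, 1.0, 1.0, 0.75,
     0.75, 0.5, 1.0, 1.0, 0.6, 0.4, 0.5, 0.75, 1.0, 1.0, 0.6, 0.5, 0.75]

# Count how often each (x, y) pair occurs
counts = Counter(zip(x, y))

# Sorting the items orders the rows by x first, then by y
table = sorted(counts.items())

print("WMC\tPercent\tCount")
for (wmc, fraction), n in table:
    print("{}\t{}\t{}".format(wmc, round(fraction * 100), n))
```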

Not sure if it is suitable in your case, however, you can do this using itertools.groupby() on the zipped lists:
import numpy as np
from itertools import groupby
x = np.array([2,2,2,3,3,3,3,3,4,4,4,4,4,4,5,5,5,5,5,6,6,6,6,7,7])
y = np.array([1.0, 1.0, 1.0, 0.33, 0.33, 0.66, 0.66, 1.0, 1.0, 1.0, 1.0, 0.75, 0.75, 0.5, 1.0, 1.0, 0.6, 0.4, 0.5,0.75, 1.0,1.0,0.6,0.5,0.75])
print("WMC\tPercent\tCount")
for key, group in groupby(sorted(zip(x, y))):
    # Round before converting so a value just below a whole number is not truncated down
    print("{}\t{}\t{}".format(key[0], int(round(key[1] * 100)), len(list(group))))
Output
WMC Percent Count
2 100 3
3 33 2
3 66 2
3 100 1
4 50 1
4 75 2
4 100 3
5 40 1
5 50 1
5 60 1
5 100 2
6 60 1
6 75 1
6 100 2
7 50 1
7 75 1
Updated to produce numpy array
import numpy as np
from itertools import groupby
x = np.array([2,2,2,3,3,3,3,3,4,4,4,4,4,4,5,5,5,5,5,6,6,6,6,7,7])
y = np.array([1.0, 1.0, 1.0, 0.33, 0.33, 0.66, 0.66, 1.0, 1.0, 1.0, 1.0, 0.75, 0.75, 0.5, 1.0, 1.0, 0.6, 0.4, 0.5,0.75, 1.0,1.0,0.6,0.5,0.75])
results = np.array([(key[0], int(key[1]*100), len(list(group)))
                    for key, group in groupby(sorted(zip(x, y)))])
Output
>>> results
array([[ 2, 100, 3],
[ 3, 33, 2],
[ 3, 66, 2],
[ 3, 100, 1],
[ 4, 50, 1],
[ 4, 75, 2],
[ 4, 100, 3],
[ 5, 40, 1],
[ 5, 50, 1],
[ 5, 60, 1],
[ 5, 100, 2],
[ 6, 60, 1],
[ 6, 75, 1],
[ 6, 100, 2],
[ 7, 50, 1],
[ 7, 75, 1]])

Related

How to simply pass weights to np.average()

I am confused about passing weights into np.average() function. Example below:
import numpy as np
weights = [0.35, 0.05, 0.6]
abc = list()
a = [[ 0.5, 1],
[ 5, 7],
[ 3, 8]]
b = [[ 10, 1],
[ 0.5, 1],
[ 0.7, 0.2]]
c = [[ 10, 12],
[ 0.5, 13],
[ 5, 0.7]]
abc.append(a)
abc.append(b)
abc.append(c)
print(np.average(np.array(abc), weights=[weights], axis=0))
OUT:
TypeError: 1D weights expected when shapes of a and weights differ.
I know that the shapes differ, but how can I simply pass the list of weights without writing out every element, as in
np.average(np.array(abc), weights=[weights[0], weights[1], weights[2]], ..., axis=0)
I am doing this in a loop where the number of weights varies, up to 30.
The expected output is a weighted array like this:
[[6.675, 7.6],
[ 2.075, 10.3],
[ 4.085, 3.23]]
that is, a * weights[0] + b * weights[1] + c * weights[2], taken element-wise.
Any other solution is welcome.
Not sure how the first element can be 4.675?
weights = [0.35, 0.05, 0.6]
a = [[ 0.5, 1],
[ 5, 7],
[ 3, 8]]
b = [[ 10, 1],
[ 0.5, 1],
[ 0.7, 0.2]]
c = [[ 10, 12],
[ 0.5, 13],
[ 5, 0.7]]
abc=[a, b, c]
print(np.average(np.array(abc), weights=weights,axis=0))
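With the code as posted, np.array(abc) has shape (3, 3, 2), while weights=[weights] wraps the weights in an extra list and gives them shape (1, 3), which is what triggers the error. Pass the 1-D list directly with weights=weights and axis=0, as @BingWang suggested.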
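As a check on the numbers, here is a short sketch that passes the 1-D weights list along axis 0 and compares the result with the expected output from the question. The np.tensordot line is only an illustrative cross-check added here (it equals the weighted average because these weights sum to 1); it is not part of the original answers.

```python
import numpy as np

weights = [0.35, 0.05, 0.6]
a = [[0.5, 1], [5, 7], [3, 8]]
b = [[10, 1], [0.5, 1], [0.7, 0.2]]
c = [[10, 12], [0.5, 13], [5, 0.7]]

abc = np.array([a, b, c])  # shape (3, 3, 2): one 3x2 matrix per weight

# 1-D weights are matched against the axis being averaged over (axis 0)
result = np.average(abc, axis=0, weights=weights)

# Cross-check: explicit weighted sum over the first axis
manual = np.tensordot(weights, abc, axes=1)

print(result)
```

Because weights is an ordinary list, the same call works unchanged inside a loop where the number of matrices, and therefore of weights, varies.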

What can I do to improve sklearn's Jaccard similarity score performance on 9000+ data

I am trying to create a table of Jaccard similarity scores between every pair of vectors in a list x that has over 9000 rows (so the result is roughly a 9000 x 9000 array):
[[ 2 2 67 2 5 3 62 68 27]
[ 2 9 67 2 1 3 20 62 139]
[ 2 17 67 2 0 6 6 62 73]
[ 2 17 67 2 0 6 39 68 92]
[ 0 0 67 0 0 3 62 62 13]
...
I'm a beginner, so I tried to implement it with my shameful excuse for code, like this:
similarities_matrix = np.empty([len(x), len(x)])
for icounter, i in enumerate(x.as_matrix()):
    similarities_row = np.empty(len(x))
    for jcounter, j in enumerate(x.as_matrix()):
        similarities_row[jcounter] = jaccard_similarity_score(i, j)
    similarities_matrix[icounter] = similarities_row
pprint(similarities_matrix)
But it runs impossibly slow.
Ideally I wanted my code to run within the span of my lifetime (preferably under 5 minutes.)
Currently, this code runs roughly a second per element to compute the similarity matrix.
If you don't mind using scipy, you can use the function pdist from scipy.spatial.distance. The value computed by sklearn.metrics.jaccard_similarity_score(u, v) is equivalent to 1 - scipy.spatial.distance.hamming(u, v). For example,
In [71]: from sklearn.metrics import jaccard_similarity_score
In [72]: from scipy.spatial.distance import hamming
In [73]: u = [2, 1, 3, 5]
In [74]: v = [2, 1, 4, 5]
In [75]: jaccard_similarity_score(u, v)
Out[75]: 0.75
In [76]: 1 - hamming(u, v)
Out[76]: 0.75
'hamming' is one of the metrics provided by scipy.spatial.distance.pdist, so you can use that function to compute all the pairwise distances. Here's a small x to use as an example:
In [77]: x = np.random.randint(0, 5, size=(8, 10))
In [78]: x
Out[78]:
array([[4, 2, 2, 3, 1, 2, 0, 0, 4, 0],
[3, 1, 4, 2, 3, 1, 2, 3, 4, 4],
[1, 1, 0, 1, 0, 2, 0, 3, 3, 4],
[2, 3, 3, 3, 1, 2, 3, 2, 1, 2],
[3, 2, 3, 2, 0, 0, 4, 4, 3, 4],
[3, 0, 1, 0, 4, 2, 0, 2, 1, 0],
[4, 3, 2, 4, 1, 2, 3, 3, 2, 4],
[3, 0, 4, 1, 3, 3, 3, 3, 1, 3]])
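I'll use squareform to convert the output of pdist to a symmetric array. squareform fills the diagonal with zeros, so the subtraction from 1 is applied afterwards to get similarities with ones on the diagonal.
In [79]: from scipy.spatial.distance import pdist, squareform
In [80]: 1 - squareform(pdist(x, metric='hamming'))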
Out[80]:
array([[ 1. , 0.1, 0.2, 0.3, 0.1, 0.3, 0.4, 0. ],
[ 0.1, 1. , 0.3, 0. , 0.3, 0.1, 0.2, 0.4],
[ 0.2, 0.3, 1. , 0.1, 0.3, 0.2, 0.3, 0.2],
[ 0.3, 0. , 0.1, 1. , 0.1, 0.3, 0.4, 0.2],
[ 0.1, 0.3, 0.3, 0.1, 1. , 0.1, 0.1, 0.1],
[ 0.3, 0.1, 0.2, 0.3, 0.1, 1. , 0.1, 0.3],
[ 0.4, 0.2, 0.3, 0.4, 0.1, 0.1, 1. , 0.2],
[ 0. , 0.4, 0.2, 0.2, 0.1, 0.3, 0.2, 1. ]])
I converted your code to this function:
def jaccard_sim_matrix(x):
    similarities_matrix = np.empty([len(x), len(x)])
    for icounter, i in enumerate(x):
        similarities_row = np.empty(len(x))
        for jcounter, j in enumerate(x):
            similarities_row[jcounter] = jaccard_similarity_score(i, j)
        similarities_matrix[icounter] = similarities_row
    return similarities_matrix
so we can verify that the pdist result is the same as your calculation.
In [81]: jaccard_sim_matrix(x)
Out[81]:
array([[ 1. , 0.1, 0.2, 0.3, 0.1, 0.3, 0.4, 0. ],
[ 0.1, 1. , 0.3, 0. , 0.3, 0.1, 0.2, 0.4],
[ 0.2, 0.3, 1. , 0.1, 0.3, 0.2, 0.3, 0.2],
[ 0.3, 0. , 0.1, 1. , 0.1, 0.3, 0.4, 0.2],
[ 0.1, 0.3, 0.3, 0.1, 1. , 0.1, 0.1, 0.1],
[ 0.3, 0.1, 0.2, 0.3, 0.1, 1. , 0.1, 0.3],
[ 0.4, 0.2, 0.3, 0.4, 0.1, 0.1, 1. , 0.2],
[ 0. , 0.4, 0.2, 0.2, 0.1, 0.3, 0.2, 1. ]])
Here I'll compare the timing for a larger array:
In [82]: x = np.random.randint(0, 5, size=(500, 10))
In [83]: %timeit jaccard_sim_matrix(x)
14.9 s ± 192 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [84]: %timeit 1 - squareform(pdist(x, metric='hamming'))
1.19 ms ± 2.23 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Finally, let's time the calculation for an input with shape (9000, 10):
In [94]: x = np.random.randint(0, 5, size=(9000, 10))
In [95]: %timeit 1 - squareform(pdist(x, metric='hamming'))
1.34 s ± 9.13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
That's just 1.34 seconds--definitely within the span of a lifetime.
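For readers who cannot run the old sklearn function, the pdist approach can also be checked against a direct NumPy computation. The sketch below assumes that the similarity wanted is the fraction of positions at which two rows agree (which is what 1 minus the Hamming distance gives); the broadcasting comparison and the seeded random generator are illustrative additions. Note that the subtraction is applied after squareform, because squareform fills the diagonal with zeros.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)          # seeded only for reproducibility
x = rng.integers(0, 5, size=(8, 10))    # small example array

# Pairwise Hamming distances, expanded to a square matrix, turned into similarities
sim = 1 - squareform(pdist(x, metric="hamming"))

# Direct check: fraction of matching positions for every pair of rows.
# This builds an (n, n, m) boolean array, so it only suits small inputs.
direct = (x[:, None, :] == x[None, :, :]).mean(axis=2)

print(np.allclose(sim, direct))
```

The broadcasting version is only there for verification; for thousands of rows the intermediate boolean array becomes very large, which is one reason to prefer pdist.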

How to select the first 3 rows of every group in pandas?

I get a pandas dataframe like this:
id prob
0 1 0.5
1 1 0.6
2 1 0.4
3 1 0.2
4 2 0.3
6 2 0.5
...
I want to group it by 'id', sort each group by prob in descending order, and get the first 3 prob values of every group. Note that some groups contain fewer than 3 rows.
Finally I want to get a 2D array like:
[[1, [0.6, 0.5, 0.4]], [2, [0.5, 0.3]], ...]
How can I do that with pandas?
Thanks!
Use sort_values, groupby, and head:
df.sort_values(by=['id','prob'], ascending=[True,False]).groupby('id').head(3).values
Output:
array([[ 1. , 0.6],
[ 1. , 0.5],
[ 1. , 0.4],
[ 2. , 0.5],
[ 2. , 0.3]])
Following @COLDSPEED's lead:
df.sort_values(by=['id','prob'], ascending=[True,False])\
.groupby('id').agg(lambda x: x.head(3).tolist())\
.reset_index().values.tolist()
Output:
[[1, [0.6, 0.5, 0.4]], [2, [0.5, 0.3]]]
You can use groupby and nlargest
df.groupby('id').prob.nlargest(3).reset_index(1,drop = True)
id
1 0.6
1 0.5
1 0.4
2 0.5
2 0.3
For the array
df1 = df.groupby('id').prob.nlargest(3).unstack(1)
np.column_stack((df1.index.values, df1.values))
You get
array([[ 1. , 0.5, 0.6, 0.4, nan, nan],
[ 2. , nan, nan, nan, 0.3, 0.5]])
If you're looking for a dataframe of array columns, you can use np.sort:
df = df.groupby('id').prob.apply(lambda x: np.sort(x.values)[:-4:-1])
df
id
1 [0.6, 0.5, 0.4]
2 [0.5, 0.3]
To retrieve the values, reset_index and access:
df.reset_index().values
array([[1, array([ 0.6, 0.5, 0.4])],
[2, array([ 0.5, 0.3])]], dtype=object)
[[n, g.nlargest(3).tolist()] for n, g in df.groupby('id').prob]
[[1, [0.6, 0.5, 0.4]], [2, [0.5, 0.3]]]
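To try these answers end to end, here is a self-contained sketch that rebuilds the sample frame from the question (using only the six rows shown, which is an assumption since the original data is truncated) and produces the nested-list result via sort_values, groupby, and head.

```python
import pandas as pd

# Only the rows visible in the question; the real frame is longer
df = pd.DataFrame({
    "id":   [1, 1, 1, 1, 2, 2],
    "prob": [0.5, 0.6, 0.4, 0.2, 0.3, 0.5],
})

# Sort so the largest prob comes first within each id, then keep 3 per group
top3 = (
    df.sort_values(["id", "prob"], ascending=[True, False])
      .groupby("id")["prob"]
      .apply(lambda s: s.head(3).tolist())
)

# One [id, [probs...]] entry per group
result = [[group_id, probs] for group_id, probs in top3.items()]
print(result)
```

Groups with fewer than three rows simply yield shorter lists, since head(3) returns whatever is available.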

Write coordinates to file from 4 numpy arrays Python

There are 4 numpy matrices, for example 3x3, with coordinates:
Xg [[-0.5 0.3 1.1]
[-0.5 0.3 1.1]
[-0.5 0.3 1.1]]
Yg [[-0.5 -0.5 -0.5]
[ 0.3 0.3 0.3]
[ 1.1 1.1 1.1]]
u [[ 1. 1. 1.]
[ 1. 1. 1.]
[ 1. 1. 1.]]
v [[ 1.03793 0.25065 -0.28944]
[-0.21591 -0.93072 -0.10047]
[-0.08591 -0.11284 -0.06082]]
How can I write the coordinates to a file like this:
# each entry in the file should look like ,{{x_coordinate,y_coordinate},{u_coordinate,v_coordinate}}
file = open("coordinates.txt", "w")
file.write(",{{" + x + "," + y + "},{" + u + "," + v + "}}")
#Output
,{{-0.5,-0.5},{1,1.03793}}, {{0.3,-0.5},{1,0.25065}}, {{1.1,-0.5},{1,-0.28944}},...
You could do nested for loops, like this:
X = [[-0.5, 0.3, 1.1],
[-0.5, 0.3, 1.1],
[-0.5, 0.3, 1.1]]
Y = [[-0.5, -0.5, -0.5],
[0.3, 0.3, 0.3],
[1.1, 1.1, 1.1]]
U = [[1, 1, 1, ],
[1, 1, 1, ],
[1, 1, 1, ]]
V = [[1.03793, 0.25065, -0.28944],
[-0.21591, -0.93072, -0.10047],
[-0.08591, -0.11284, -0.06082]]
with open("coordinates.txt", "w") as f:
    for i in range(3):
        for j in range(3):
            f.write("{{{0},{1}}}, {{{2}, {3}}}\n".format(X[j][i], Y[j][i], U[j][i], V[j][i]))
Which gives
{-0.5,-0.5}, {1, 1.03793}
{-0.5,0.3}, {1, -0.21591}
{-0.5,1.1}, {1, -0.08591}
{0.3,-0.5}, {1, 0.25065}
{0.3,0.3}, {1, -0.93072}
{0.3,1.1}, {1, -0.11284}
{1.1,-0.5}, {1, -0.28944}
{1.1,0.3}, {1, -0.10047}
{1.1,1.1}, {1, -0.06082}
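The loop above writes one entry per line and walks the grid column by column. If the exact layout from the question is wanted (nested braces, entries separated by ", ", x varying fastest), a sketch along these lines may be closer. The helper name format_points, the "g" number format (chosen so that 1.0 prints as 1), and the omission of the leading comma are assumptions made here, not something stated in the question.

```python
import numpy as np

Xg = np.array([[-0.5, 0.3, 1.1]] * 3)
Yg = np.array([[-0.5] * 3, [0.3] * 3, [1.1] * 3])
U = np.ones((3, 3))
V = np.array([[1.03793, 0.25065, -0.28944],
              [-0.21591, -0.93072, -0.10047],
              [-0.08591, -0.11284, -0.06082]])

def format_points(xg, yg, u, v):
    """Return '{{x,y},{u,v}}' entries joined by ', ' in row-major order."""
    entries = []
    # ravel() flattens row by row, so x varies fastest as in the question
    for x, y, du, dv in zip(xg.ravel(), yg.ravel(), u.ravel(), v.ravel()):
        # The brace pieces are plain strings, so they are emitted literally
        entries.append("{{" + f"{x:g},{y:g}" + "},{" + f"{du:g},{dv:g}" + "}}")
    return ", ".join(entries)

text = format_points(Xg, Yg, U, V)
print(text)
```

The resulting string can then be written in one call, for example with f.write(text) inside a with open("coordinates.txt", "w") as f: block.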

What is the equivalent of "zip()" in Python's numpy?

I am trying to do the following but with numpy arrays:
x = [(0.1, 1.), (0.1, 2.), (0.1, 3.), (0.1, 4.), (0.1, 5.)]
normal_result = zip(*x)
This should give a result of:
normal_result = [(0.1, 0.1, 0.1, 0.1, 0.1), (1., 2., 3., 4., 5.)]
But if the input vector is a numpy array:
y = np.array(x)
numpy_result = zip(*y)
print type(numpy_result)
It (expectedly) returns a:
<type 'list'>
The issue is that I will need to transform the result back into a numpy array after this.
What I would like to know is whether there is an efficient numpy function that avoids these back-and-forth transformations.
You can just transpose it...
>>> a = np.array([(0.1, 1.), (0.1, 2.), (0.1, 3.), (0.1, 4.), (0.1, 5.)])
>>> a
array([[ 0.1, 1. ],
[ 0.1, 2. ],
[ 0.1, 3. ],
[ 0.1, 4. ],
[ 0.1, 5. ]])
>>> a.T
array([[ 0.1, 0.1, 0.1, 0.1, 0.1],
[ 1. , 2. , 3. , 4. , 5. ]])
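A quick way to confirm that the transpose gives the same values as the zip round trip is to compare the two directly. This is a small sketch added for illustration; the variable names via_zip and via_transpose are made up here.

```python
import numpy as np

x = [(0.1, 1.), (0.1, 2.), (0.1, 3.), (0.1, 4.), (0.1, 5.)]
y = np.array(x)

# The round trip from the question: zip the rows, then rebuild an array
via_zip = np.array(list(zip(*y)))

# The transpose: no Python-level loop and no copy of the data
via_transpose = y.T

print(np.array_equal(via_zip, via_transpose))
print(np.shares_memory(y, via_transpose))
```

Since y.T is a view on the same memory as y, writing into it also changes y; call .copy() on it if an independent array is needed.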
Try using dstack:
>>> from numpy import *
>>> a = array([[1,2],[3,4]]) # shapes of a and b can only differ in the 3rd dimension (if present)
>>> b = array([[5,6],[7,8]])
>>> dstack((a,b)) # stack arrays along a third axis (depth wise)
array([[[1, 5],
[2, 6]],
[[3, 7],
[4, 8]]])
so in your case it would be:
>>> x = [(0.1, 1.), (0.1, 2.), (0.1, 3.), (0.1, 4.), (0.1, 5.)]
>>> y = np.array(x)
>>> np.dstack(y)
array([[[ 0.1, 0.1, 0.1, 0.1, 0.1],
[ 1. , 2. , 3. , 4. , 5. ]]])
Note the extra leading axis: the result has shape (1, 2, 5).
