I want to find the means of all the negative numbers from a list of lists that mixes positive and negative numbers. I can find the mean of each sublist like this:
import numpy as np
listA = [ [2,3,-7,-4] , [-2,3,4,-5] , [-5,-6,-8,2] , [9,5,13,2] ]
listofmeans = [np.mean(i) for i in listA ]
I want to create a similar one-line piece of code that only takes the mean of the negative numbers in each sublist. So, for example, the first element of the new list would be (-7 + -4)/2 = -5.5.
My complete list would be:
listofnegativemeans = [ -5.5, -3.5, -6.333333, 0 ]
You could use the following:
import numpy as np

listA = [[2,3,-7,-4], [-2,3,4,-5], [-5,-6,-8,2], [9,5,13,2]]
means = [np.mean([el for el in sublist if el < 0] or 0) for sublist in listA]
print(means)
Output
[-5.5, -3.5, -6.333333333333333, 0.0]
If none of the elements in sublist are less than 0, the list comprehension evaluates to []. The trailing or 0 handles your scenario, where the mean for a sublist with no negatives should be 0.
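To see the trick in isolation: Python's or returns its right operand when the left operand is falsy, and an empty list is falsy, so np.mean receives 0 instead of []. A minimal check:
import numpy as np

negatives = [el for el in [9, 5, 13, 2] if el < 0]   # no negatives -> []
print(negatives or 0)                                # 0, because [] is falsy
print(np.mean(negatives or 0))                       # 0.0, instead of nan plus a RuntimeWarning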
If you're using numpy at all, you should strive for numpythonic code rather than falling back to python logic. That means using numpy's ndarray data structure, and the usual indexing style for arrays, rather than python loops.
For the usual means:
>>> listA
[[2, 3, -7, -4], [-2, 3, 4, -5], [-5, -6, -8, 2], [9, 5, 13, 2]]
>>> A = np.array(listA)
>>> np.mean(A, axis=1)
array([-1.5 , 0. , -4.25, 7.25])
Negative means:
>>> [np.mean(row[row<0]) for row in A]
[-5.5, -3.5, -6.333333333333333, nan]
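If you want 0 rather than nan for the rows with no negatives (matching the expected output above), one small variation of the same loop is:
>>> [np.mean(row[row < 0]) if (row < 0).any() else 0.0 for row in A]
[-5.5, -3.5, -6.333333333333333, 0.0]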
The pure numpy way:
In [2]: np.ma.masked_greater(listA,0).mean(1).data
Out[2]: array([-5.5 , -3.5 , -6.33333333, 0. ])
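Here masked_greater hides every entry strictly greater than 0 (zeros would also survive the mask, which happens not to matter for this data), so the row means are taken over the remaining values. The mask it builds is:
In [3]: np.ma.masked_greater(listA, 0).mask
Out[3]:
array([[ True,  True, False, False],
       [False,  True,  True, False],
       [False, False, False,  True],
       [ True,  True,  True,  True]])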
That would be something like:
listA = np.array( [ [2,3,-7,-4] , [-2,3,4,-5] , [-5,-6,-8,2] , [9,5,13,2] ] )
listofnegativemeans = [np.mean(i[i<0]) for i in listA ]
output:
[-5.5, -3.5, -6.333333333333333, nan]
Zero is misleading; I definitely prefer nan when a row has no negative elements.
So I have a matrix:
a = np.array([[7, -1, 0, 5],
              [2, 5.2, 4, 2],
              [3, -2, 1, 4]])
which I would like to sort column-wise by ascending absolute value. I tried np.sort(abs(a)) and sorted(a, key=abs); sorted is probably the right one, but I do not know how to use it on columns. I wish to get
a = np.array([[2, -1, 0, 2],
              [3, -2, 1, 4],
              [7, 5.2, 4, 5]])
Try argsort on axis=0 then take_along_axis to apply the order to a:
import numpy as np
a = np.array([[7, -1, 0, 5],
              [2, 5.2, 4, 2],
              [3, -2, 1, 4]])
s = np.argsort(abs(a), axis=0)
a = np.take_along_axis(a, s, axis=0)
print(a)
a:
[[ 2. -1. 0. 2. ]
[ 3. -2. 1. 4. ]
[ 7. 5.2 4. 5. ]]
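For comparison, sorting the absolute values directly discards the original signed entries, which is why the np.sort(abs(a)) attempt from the question cannot produce the desired result (re-declaring the original matrix as a0 here, since a was overwritten above):
a0 = np.array([[7, -1, 0, 5],
               [2, 5.2, 4, 2],
               [3, -2, 1, 4]])
print(np.sort(abs(a0), axis=0))
# [[2.  1.  0.  2. ]
#  [3.  2.  1.  4. ]
#  [7.  5.2 4.  5. ]]
# the -1 and -2 are gone: only magnitudes remain, so the signed values
# cannot be recovered, whereas argsort + take_along_axis reorders them intact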
I'm looking to see if there is a more efficient way (i.e. using native NumPy functionality) to achieve what I'm doing currently.
My process is I start with an array a:
a = np.array([[0,2,0,-1],[-0.2,0,-0.1,0],[0,0,-0.1,0],[0,0,0,0]])
array([[ 0. ,  2. ,  0. , -1. ],
       [-0.2,  0. , -0.1,  0. ],
       [ 0. ,  0. , -0.1,  0. ],
       [ 0. ,  0. ,  0. ,  0. ]])
I then filter based on where the values are not equal to 0:
r_indices, c_indices = np.where(a != 0)
(array([0, 0, 1, 1, 2]), array([1, 3, 0, 2, 2]))
From there, I create a Python dictionary b like so:
b = {i: c_indices[r_indices == i] for i in np.unique(r_indices)}
{0: array([1, 3]),
 1: array([0, 2]),
 2: array([2])}
I do this because, for a given unique row index r, I want to know which column indices are not 0.
My own preference is to use NumPy as much as possible to take advantage of its speed benefits. However, I'm not sure how else to structure this in NumPy, since the values in the dictionary could range in length from 0 (no nonzero values in the row) to 4 (all values nonzero).
Am I being paranoid about the potential speed benefits?
You can use Pandas in the following way:
import pandas as pd
import numpy as np
if __name__ == '__main__':
    a = np.array([[0, 2, 0, -1], [-0.2, 0, -0.1, 0], [0, 0, -0.1, 0], [0, 0, 0, 0]])
    rows, cols = np.where(a != 0)
    x = list(zip(rows, cols))
    df = pd.DataFrame.from_records(data=x)
    l = df.groupby(0)[1].apply(list)
    L = [np.array(v) for v in l.values]
    d = dict(zip(np.unique(rows), L))
Output
{0: array([1, 3]), 1: array([0, 2]), 2: array([2])}
As pandas works with numpy under the hood, this code will be much more efficient than the regular list comprehension.
Also, if all you need is a dictionary-like object, you could enhance performance further by using the grouped Series l directly:
l.loc[0]
which will result in :
[1, 3]
which is equivalent to the b[0] in your example.
and omit the last two lines altogether, as pandas provides very fast mechanisms for handling large amounts of tabular data and is generally preferable to a plain dict object when used for the same thing.
Cheers.
arr = np.array([[1., 9.98672295, 1.],
                [2., 19.97344589, 2.],
                [3., 29.96016884, 3.]])
cnd = [True, False, True]
func = lambda a : a.astype(int)
How can I apply func only to the columns of arr for which the corresponding entry of cnd is True (the first and the third)?
The ideal outcome is:
outcome = np.array([[1, 9.98672295, 1],
                    [2, 19.97344589, 2],
                    [3, 29.96016884, 3]])
where the first and third columns are integers
If I understand correctly, you can use
>>> arr[:, cnd].astype(int)
array([[1, 1],
[2, 2],
[3, 3]])
Here's a way to do it:
f2 = lambda a, i: func(a) if cnd[i] else a  # apply func only where cnd[i] is True
f3 = lambda a: np.array([f2(col, i) for (i, col) in enumerate(a.T)]).T  # iterate over columns via a.T
I applied multiple functions to make it more readable, although arguably you can do it using a single function.
My only worry is that numpy would probably need the column types specified explicitly, or it will coerce them all into the same type.
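That worry is justified: stacking the per-column results back together with np.array forces a single common dtype, so the converted columns come back out as float64. A quick check with the arrays above:
out = f3(arr)
print(out.dtype)  # float64 -- the integer columns were coerced back to float
print(out)
# [[ 1.          9.98672295  1.        ]
#  [ 2.         19.97344589  2.        ]
#  [ 3.         29.96016884  3.        ]]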
How do I sum every 2 consecutive vectors using numpy? Or take the mean of every 2 consecutive vectors?
The list of lists can have an even or uneven number of vectors.
example:
[[2,2], [1,2], [1,1], [2,2]] --> [[3,4], [3,3]]
Maybe something like this, but using numpy and something that actually works on an array of vectors rather than an array of integers. Or maybe some sort of array comprehension, if that exists.
def pairwiseSum(lst, n):
    sums = []
    for i in range(0, len(lst) - 1, 2):
        # add each pair of consecutive numbers
        sums.append(lst[i] + lst[i + 1])
    return sums
import numpy as np

def mean_consecutive_vectors(lst, step):
    idx_list = list(range(step, len(lst), step))
    new_lst = np.split(lst, idx_list)
    return np.mean(new_lst, axis=1)
Same could be done with np.sum() instead of np.mean().
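Called on the example data from the question (converted to an array first, since np.split expects one), this gives:
a = np.array([[2, 2], [1, 2], [1, 1], [2, 2]])
print(mean_consecutive_vectors(a, 2))
# [[1.5 2. ]
#  [1.5 1.5]]
# with np.sum instead of np.mean the result would be [[3 4], [3 3]]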
You can reshape your array into pairs, which will allow you to use np.sum() or np.mean() directly by providing the correct axis:
import numpy as np
a = np.array([[2,2], [1,2], [1,1], [2,2]])
np.sum(a.reshape(-1, 2, 2), axis=1)
# array([[3, 4],
# [3, 3]])
Edit to address comment:
To get the means of each adjacent pair, you can add slices of the original array and broadcast the division by 2:
>>> a = np.array([[2,2], [1,2], [1,1], [2,2], [11, 10], [20, 30]])
>>> (a[:-1] + a[1:])/2
array([[ 1.5, 2. ],
[ 1. , 1.5],
[ 1.5, 1.5],
[ 6.5, 6. ],
[15.5, 20. ]])
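Note that a[:-1] + a[1:] averages overlapping adjacent pairs; for non-overlapping pairs, as in the original example, the reshape form above does the same job with np.mean:
a = np.array([[2, 2], [1, 2], [1, 1], [2, 2]])
np.mean(a.reshape(-1, 2, 2), axis=1)
# array([[1.5, 2. ],
#        [1.5, 1.5]])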
I have several sparse vectors represented as lists of tuples, e.g.
[[(22357, 0.6265631775164965),
(31265, 0.3900572375543419),
(44744, 0.4075397480094991),
(47751, 0.5377595092643747)],
[(22354, 0.6265631775164965),
(31261, 0.3900572375543419),
(42344, 0.4075397480094991),
(47751, 0.5377595092643747)],
...
]
And my goal is to compose a scipy.sparse.csr_matrix from several million vectors like this.
I would like to ask whether there exists some simple, elegant solution for this kind of conversion without trying to stuff everything into memory.
EDIT:
Just a clarification: my goal is to build a 2D matrix where each of my sparse vectors represents one row of the matrix.
Collecting indices and data into a structured array avoids the integer-double conversion issue. It is also a bit faster than the vstack approach in limited testing (with list data like this, np.array is faster than np.vstack).
indptr = np.cumsum([0]+[len(i) for i in vectors])
aa = np.array(vectors,dtype='i,f').flatten()
A = sparse.csr_matrix((aa['f1'], aa['f0'], indptr))
I substituted a list comprehension for map since I'm using Python 3.
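To see what the structured dtype produces for the two sample vectors from the question:
import numpy as np
from scipy import sparse

vectors = [[(22357, 0.6265631775164965), (31265, 0.3900572375543419),
            (44744, 0.4075397480094991), (47751, 0.5377595092643747)],
           [(22354, 0.6265631775164965), (31261, 0.3900572375543419),
            (42344, 0.4075397480094991), (47751, 0.5377595092643747)]]

aa = np.array(vectors, dtype='i,f').flatten()
print(aa['f0'])   # int32 column indices: [22357 31265 44744 47751 22354 31261 42344 47751]
print(aa['f1'])   # float32 data values
indptr = np.cumsum([0] + [len(i) for i in vectors])
A = sparse.csr_matrix((aa['f1'], aa['f0'], indptr))
print(A.shape)    # (2, 47752)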
Indices in the coo format (data, (i, j)) might be more intuitive:
ii = [[i]*len(v) for i,v in enumerate(vectors)]
ii = np.array(ii).flatten()
aa = np.array(vectors,dtype='i,f').flatten()
A2 = sparse.coo_matrix((aa['f1'],(np.array(ii), aa['f0'])))
# A2.tocsr()
Here, ii from the first step holds the row numbers for each sublist:
[[0, 0, 0, 0],
 [1, 1, 1, 1],
 [2, 2, 2, 2],
 [3, 3, 3, 3],
 ...]
This construction method is slower than the direct csr indptr one.
For a case where there are differing numbers of entries per row, this approach works (using itertools.chain to flatten the lists):
A sample list (no empty rows for now):
In [779]: vectors = [[(1, .12), (3, .234), (6, 1.23)],
                     [(2, .222)],
                     [(2, .23), (1, .34)]]
row indexes:
In [780]: ii=[[i]*len(v) for i,v in enumerate(vectors)]
In [781]: ii=list(chain(*ii))
column and data values pulled from tuples and flattened
In [782]: jj=[j for j,_ in chain(*vectors)]
In [783]: data=[d for _,d in chain(*vectors)]
In [784]: ii
Out[784]: [0, 0, 0, 1, 2, 2]
In [785]: jj
Out[785]: [1, 3, 6, 2, 2, 1]
In [786]: data
Out[786]: [0.12, 0.234, 1.23, 0.222, 0.23, 0.34]
In [787]: A=sparse.csr_matrix((data,(ii,jj))) # coo style input
In [788]: A.A
Out[788]:
array([[ 0. , 0.12 , 0. , 0.234, 0. , 0. , 1.23 ],
[ 0. , 0. , 0.222, 0. , 0. , 0. , 0. ],
[ 0. , 0.34 , 0.23 , 0. , 0. , 0. , 0. ]])
Consider the following:
import numpy as np
from scipy.sparse import csr_matrix
vectors = [[(22357, 0.6265631775164965),
            (31265, 0.3900572375543419),
            (44744, 0.4075397480094991),
            (47751, 0.5377595092643747)],
           [(22354, 0.6265631775164965),
            (31261, 0.3900572375543419),
            (42344, 0.4075397480094991),
            (47751, 0.5377595092643747)]]
indptr = np.cumsum([0] + list(map(len, vectors)))  # list(...) needed on Python 3
indices, data = np.vstack(vectors).T
A = csr_matrix((data, indices.astype(int), indptr))
Unfortunately, this way the column indices are converted from integers to doubles and back. This works correctly even for very large matrices, but it is not ideal.
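The conversion happens because np.vstack has to pick a single dtype for the stacked array, so the integer indices come out as float64, hence the .astype(int) round trip:
stacked = np.vstack(vectors)
print(stacked.dtype)             # float64 -- indices and data share one floating dtype
indices, data = stacked.T
print(indices[:4].astype(int))   # [22357 31265 44744 47751]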