Delete columns of matrix of CSR format in Python

I have a sparse matrix (22000x97482) in CSR format and I want to delete some columns (the column indices are stored in a list).

If you have a very large number of columns then generating the full set of column indices can become rather costly. One slightly faster alternative would be to temporarily convert to COO format:
import numpy as np
from scipy import sparse
def dropcols_fancy(M, idx_to_drop):
    idx_to_drop = np.unique(idx_to_drop)
    keep = ~np.in1d(np.arange(M.shape[1]), idx_to_drop, assume_unique=True)
    return M[:, np.where(keep)[0]]

def dropcols_coo(M, idx_to_drop):
    idx_to_drop = np.unique(idx_to_drop)
    C = M.tocoo()
    keep = ~np.in1d(C.col, idx_to_drop)
    C.data, C.row, C.col = C.data[keep], C.row[keep], C.col[keep]
    C.col -= idx_to_drop.searchsorted(C.col) # decrement column indices
    C._shape = (C.shape[0], C.shape[1] - len(idx_to_drop))
    return C.tocsr()
Check equivalence:
m, n, d = 1000, 2000, 20
M = sparse.rand(m, n, format='csr')
idx_to_drop = np.random.randint(0, n, d)
M_drop1 = dropcols_fancy(M, idx_to_drop)
M_drop2 = dropcols_coo(M, idx_to_drop)
print(np.all(M_drop1.A == M_drop2.A))
# True
Benchmark:
In [1]: m, n = 1000, 1000000
In [2]: %%timeit M = sparse.rand(m, n, format='csr')
...: dropcols_fancy(M, idx_to_drop)
...:
1 loops, best of 3: 1.11 s per loop
In [3]: %%timeit M = sparse.rand(m, n, format='csr')
...: dropcols_coo(M, idx_to_drop)
...:
1 loops, best of 3: 365 ms per loop
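To see why the searchsorted step in dropcols_coo renumbers the surviving columns correctly, here is a minimal sketch on a toy example. The array names `dropped` and `surviving_cols` are made up for illustration. For each surviving column index, searchsorted counts how many dropped indices lie below it, which is how far that column has to shift left.

```python
import numpy as np

# Hypothetical toy data: columns 1 and 4 of a 7-column matrix are dropped.
dropped = np.array([1, 4])                   # sorted, unique (as np.unique returns)
surviving_cols = np.array([0, 2, 3, 5, 6])   # column indices that remain

# Number of dropped columns to the left of each surviving column.
shift = dropped.searchsorted(surviving_cols)

# Subtracting the shift gives contiguous indices in the reduced matrix.
new_cols = surviving_cols - shift

print(shift)     # [0 1 1 2 2]
print(new_cols)  # [0 1 2 3 4]
```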

You can use fancy indexing to obtain a new csr_matrix with the columns that you have in your list:
all_cols = np.arange(old_m.shape[1])
cols_to_keep = np.where(np.logical_not(np.in1d(all_cols, cols_to_delete)))[0]
m = old_m[:, cols_to_keep]
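As a small self-contained check of the fancy-indexing idea, the sketch below builds the keep mask by direct assignment instead of np.in1d and compares the result with np.delete on the dense equivalent. The helper name `drop_columns` and the toy matrix are assumptions for illustration only.

```python
import numpy as np
from scipy import sparse

def drop_columns(M, cols_to_delete):
    """Return a new sparse matrix without the listed columns (hypothetical helper)."""
    keep = np.ones(M.shape[1], dtype=bool)
    keep[np.asarray(cols_to_delete, dtype=int)] = False
    # Index with the integer positions of the columns that remain.
    return M[:, np.flatnonzero(keep)]

dense = np.arange(12).reshape(3, 4)
M = sparse.csr_matrix(dense)
reduced = drop_columns(M, [1, 3])

# The dense view should match deleting the same columns with NumPy.
print(np.array_equal(reduced.toarray(), np.delete(dense, [1, 3], axis=1)))
```

Duplicate entries in the list are harmless here, since assigning False twice to the same position changes nothing.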

Related

what is the most efficient way of concat two numbers to one number in python?

What is the most efficient way to concatenate two numbers into one number in Python?
The numbers are always between 0 and 255. I have tested a few approaches that concatenate them as strings and cast the result back to int, but they are too slow for my code.
Example:
a = 152
c = 255
d = concat(a,c)
answer:
d = 152255

If the numbers are bounded, just multiply and add:
>>> a = 152
>>> c = 255
>>> d = a*1000+c
>>> d
152255
>>>
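Note that a fixed multiplier of 1000 matches string concatenation only when the second number has three digits (152 and 5 would give 152005 rather than 1525). If the string-style result is what is wanted, one possible sketch, assuming the second number stays in the range 0 to 255, picks the multiplier with two comparisons. The function name `concat_small` is hypothetical.

```python
def concat_small(a, b):
    """Concatenate the decimal digits of a and b, assuming 0 <= b <= 999."""
    if b < 10:
        return a * 10 + b
    if b < 100:
        return a * 100 + b
    return a * 1000 + b

print(concat_small(152, 255))  # 152255
print(concat_small(152, 5))    # 1525
```

This uses only integer arithmetic, so it avoids both string conversion and floating-point rounding.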

This is pretty fast:
def concat(a, b):
    return 10**int(log(b, 10)+1)*a+b
It uses the logarithm to find how many times the first number must be multiplied by 10 for the sum to work as a concatenation. Note that log(b, 10) raises a ValueError for b = 0, so that case needs separate handling.
In [1]: from math import log
In [2]: a = 152
In [3]: b = 255
In [4]: def concat(a, b):
   ...:     return 10**int(log(b, 10)+1)*a+b
...:
In [5]: concat(a, b)
Out[5]: 152255
In [6]: %timeit concat(a, b)
1000000 loops, best of 3: 1.18 us per loop

Yeah, there you go:
a = 152
b = 255
def concat(a, b):
    # Count the digits of b (the number being appended); starting at 1 keeps b = 0 at one digit.
    n = next(x for x in range(1, 11) if 10**x > b) # handles b below 10**10
    return a * 10**n + b
print(concat(a, b)) # -> 152255

Selecting close matches from one array based on another reference array

I have an array A and a reference array B. Size of A is at least as big as B. e.g.
A = [2,100,300,793,1300,1500,1810,2400]
B = [4,305,789,1234,1890]
B is in fact the position of peaks in a signal at a specified time, and A contains position of peaks at a later time. But some of the elements in A are actually not the peaks I want (might be due to noise, etc), and I want to find the 'real' one in A based on B. The 'real' elements in A should be close to those in B, and in the example given above, the 'real' ones in A should be A'=[2,300,793,1300,1810]. It should be obvious in this example that 100,1500,2400 are not the ones we want as they are quite far off from any of the elements in B. How can I code this in the most efficient/accurate way in python/matlab?

Approach #1: With NumPy broadcasting, we can look for absolute element-wise subtractions between the input arrays and use an appropriate threshold to filter out unwanted elements from A. It seems for the given sample inputs, a threshold of 90 works.
Thus, we would have an implementation, like so -
thresh = 90
Aout = A[(np.abs(A[:,None] - B) < thresh).any(1)]
Sample run -
In [69]: A
Out[69]: array([ 2, 100, 300, 793, 1300, 1500, 1810, 2400])
In [70]: B
Out[70]: array([ 4, 305, 789, 1234, 1890])
In [71]: A[(np.abs(A[:,None] - B) < 90).any(1)]
Out[71]: array([ 2, 300, 793, 1300, 1810])
Approach #2: Based on this post, here's a memory efficient approach using np.searchsorted, which could be crucial for large arrays -
def searchsorted_filter(a, b, thresh):
    choices = np.sort(b) # if b is already sorted, skip it
    lidx = np.searchsorted(choices, a, 'left').clip(max=choices.size-1)
    ridx = (np.searchsorted(choices, a, 'right')-1).clip(min=0)
    cl = np.take(choices,lidx) # Or choices[lidx]
    cr = np.take(choices,ridx) # Or choices[ridx]
    return a[np.minimum(np.abs(a - cl), np.abs(a - cr)) < thresh]
Sample run -
In [95]: searchsorted_filter(A,B, thresh = 90)
Out[95]: array([ 2, 300, 793, 1300, 1810])
Runtime test
In [104]: A = np.sort(np.random.randint(0,100000,(1000)))
In [105]: B = np.sort(np.random.randint(0,100000,(400)))
In [106]: out1 = A[(np.abs(A[:,None] - B) < 10).any(1)]
In [107]: out2 = searchsorted_filter(A,B, thresh = 10)
In [108]: np.allclose(out1, out2) # Verify results
Out[108]: True
In [109]: %timeit A[(np.abs(A[:,None] - B) < 10).any(1)]
100 loops, best of 3: 2.74 ms per loop
In [110]: %timeit searchsorted_filter(A,B, thresh = 10)
10000 loops, best of 3: 85.3 µs per loop
Jan 2018 Update with further performance boost
We can avoid the second usage of np.searchsorted(..., 'right') by making use of the indices obtained from np.searchsorted(..., 'left') and also the absolute computations, like so -
def searchsorted_filter_v2(a, b, thresh):
    N = len(b)
    choices = np.sort(b) # if b is already sorted, skip it
    l = np.searchsorted(choices, a, 'left')
    l_invalid_mask = l==N
    l[l_invalid_mask] = N-1
    left_offset = choices[l]-a
    left_offset[l_invalid_mask] *= -1
    r = (l - (left_offset!=0))
    r_invalid_mask = r<0
    r[r_invalid_mask] = 0
    r += l_invalid_mask
    right_offset = a-choices[r]
    right_offset[r_invalid_mask] *= -1
    out = a[(left_offset < thresh) | (right_offset < thresh)]
    return out
Updated timings to test the further speedup -
In [388]: np.random.seed(0)
...: A = np.random.randint(0,1000000,(100000))
...: B = np.unique(np.random.randint(0,1000000,(40000)))
...: np.random.shuffle(B)
...: thresh = 10
...:
...: out1 = searchsorted_filter(A, B, thresh)
...: out2 = searchsorted_filter_v2(A, B, thresh)
...: print(np.allclose(out1, out2))
True
In [389]: %timeit searchsorted_filter(A, B, thresh)
10 loops, best of 3: 24.2 ms per loop
In [390]: %timeit searchsorted_filter_v2(A, B, thresh)
100 loops, best of 3: 13.9 ms per loop
Digging deeper -
In [396]: a = A; b = B
In [397]: N = len(b)
...:
...: choices = np.sort(b) # if b is already sorted, skip it
...:
...: l = np.searchsorted(choices, a, 'left')
In [398]: %timeit np.sort(B)
100 loops, best of 3: 2 ms per loop
In [399]: %timeit np.searchsorted(choices, a, 'left')
100 loops, best of 3: 10.3 ms per loop
It seems that searchsorted and sort account for almost all of the runtime, and both are essential to this method. So it does not look like this sort-based approach can be improved much further.

You could find the distance of each point in A from each value in B using bsxfun, and then use min to find the index of the point in A that is closest to each value in B.
[dists, ind] = min(abs(bsxfun(@minus, A, B.')), [], 2)
If you're on R2016b or later, bsxfun can be removed thanks to automatic broadcasting:
[dists, ind] = min(abs(A - B.'), [], 2);
If you suspect that some values in B are not real peaks, then you can set a threshold value and remove any distances that were greater than this value.
threshold = 90;
ind = ind(dists < threshold);
Then we can use ind to index into A
output = A(ind);
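For readers working in Python, the same nearest-match idea can be sketched in NumPy as follows. This is an illustrative translation under the sample data from the question, not part of the MATLAB answer. The variable names mirror the MATLAB snippet, and `dists_all` is an assumed helper name.

```python
import numpy as np

A = np.array([2, 100, 300, 793, 1300, 1500, 1810, 2400])
B = np.array([4, 305, 789, 1234, 1890])
threshold = 90

# Distance of every element of A from every element of B; shape (len(B), len(A)).
dists_all = np.abs(A[None, :] - B[:, None])

# For each value in B, index of the closest element of A and its distance.
ind = dists_all.argmin(axis=1)
dists = dists_all[np.arange(len(B)), ind]

# Keep only matches that are within the threshold.
output = A[ind[dists < threshold]]
print(output)  # [   2  300  793 1300 1810]
```

Like the bsxfun version, this builds the full len(B) by len(A) distance matrix, so memory use grows with the product of the two sizes.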

You can use MATLAB's interp1 function, which does exactly what you want.
The 'nearest' option finds the nearest points, so there is no need to specify a threshold.
out = interp1(A, A, B, 'nearest', 'extrap');
Comparing with the other method:
A = sort(randi([0,1000000],1,10000));
B = sort(randi([0,1000000],1,4000));
disp('---interp1----------------')
tic
out = interp1(A, A, B, 'nearest', 'extrap');
toc
disp('---subtraction with threshold------')
%numpy version is the same
tic
[dists, ind] = min(abs(bsxfun(@minus, A, B.')), [], 2);
toc
Result:
---interp1----------------
Elapsed time is 0.00778699 seconds.
---subtraction with threshold------
Elapsed time is 0.445485 seconds.
interp1 can be used for inputs larger than 10000 and 4000 elements, whereas the subtraction method ran into an out-of-memory error.

Cumulative addition/multiplication in NumPy

Have a relatively simple block of code that loops through two arrays, multiplies, and adds cumulatively:
import numpy as np
a = np.array([1, 2, 4, 6, 7, 8, 9, 11])
b = np.array([0.01, 0.2, 0.03, 0.1, 0.1, 0.6, 0.5, 0.9])
c = []
d = 0
for i, val in enumerate(a):
d += val
c.append(d)
d *= b[i]
Is there a way to do this without iterating? I imagine cumsum/cumprod could be used but I'm having trouble figuring out how. When you break down what's happening step by step, it looks like this:
# 0: 0 + a[0]
# 1: ((0 + a[0]) * b[0]) + a[1]
# 2: ((((0 + a[0]) * b[0]) + a[1]) * b[1]) + a[2]
Edit for clarification: Am interested in the list (or array) c.

In each iteration, you have -
d[n+1] = d[n] + a[n]
d[n+1] = d[n+1] * b[n]
Thus, essentially -
d[n+1] = (d[n] + a[n]) * b[n]
i.e. -
d[n+1] = (d[n]* b[n]) + K[n] #where `K[n] = a[n] * b[n]`
Now, using this formula, if you write down the expressions for the first three steps (up to n = 2), you would have -
d[1] = d[0]*b[0] + K[0]
d[2] = d[0]*b[0]*b[1] + K[0]*b[1] + K[1]
d[3] = d[0]*b[0]*b[1]*b[2] + K[0]*b[1]*b[2] + K[1]*b[2] + K[2]
Scalars      : b[0]*b[1]*b[2]   b[1]*b[2]   b[2]   1
Coefficients : d[0]             K[0]        K[1]   K[2]
Thus, you need the reversed cumprod of b, multiplied elementwise with the K array. Finally, to get c, perform a cumsum; since c is stored before the multiplication by b, the cumsum result has to be divided by the reversed cumprod of b.
The final implementation would look like this -
# Get reversed cumprod of b and pad with `1` at the end
b_rev_cumprod = b[::-1].cumprod()[::-1]
B = np.hstack((b_rev_cumprod,1))
# Get K
K = a*b
# Append with 0 at the start, corresponding starting d
K_ext = np.hstack((0,K))
# Perform elementwise multiplication and cumsum, then scale down for final c
sums = (B*K_ext).cumsum()
c = sums[1:]/b_rev_cumprod
Runtime tests and verify output
Function definitions -
def original_approach(a,b):
    c = []
    d = 0
    for i, val in enumerate(a):
        d = d+val
        c.append(d)
        d = d*b[i]
    return c

def vectorized_approach(a,b):
    b_rev_cumprod = b[::-1].cumprod()[::-1]
    B = np.hstack((b_rev_cumprod,1))
    K = a*b
    K_ext = np.hstack((0,K))
    sums = (B*K_ext).cumsum()
    return sums[1:]/b_rev_cumprod
Runtimes and verification
Case #1: OP Sample case
In [301]: # Inputs
...: a = np.array([1, 2, 4, 6, 7, 8, 9, 11])
...: b = np.array([0.01, 0.2, 0.03, 0.1, 0.1, 0.6, 0.5, 0.9])
...:
In [302]: original_approach(a,b)
Out[302]:
[1,
2.0099999999999998,
4.4020000000000001,
6.1320600000000001,
7.6132059999999999,
8.7613205999999995,
14.256792359999999,
18.128396179999999]
In [303]: vectorized_approach(a,b)
Out[303]:
array([ 1. , 2.01 , 4.402 , 6.13206 ,
7.613206 , 8.7613206 , 14.25679236, 18.12839618])
Case #2: Large input case
In [304]: # Inputs
...: N = 1000
...: a = np.random.randint(0,100000,N)
...: b = np.random.rand(N)+0.1
...:
In [305]: np.allclose(original_approach(a,b),vectorized_approach(a,b))
Out[305]: True
In [306]: %timeit original_approach(a,b)
1000 loops, best of 3: 746 µs per loop
In [307]: %timeit vectorized_approach(a,b)
10000 loops, best of 3: 76.9 µs per loop
Please be mindful that for very large input arrays whose b elements are small fractions, the cumulative products can underflow, so the leading entries of b_rev_cumprod might come out as zeros, resulting in NaNs in those leading positions of the output.
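The underflow caveat can be reproduced with a small sketch. The input sizes and values below are assumptions chosen only to trigger the effect: 2000 factors of 0.1 multiply to a value far below the smallest positive float64, so the first entries of the reversed cumprod become exactly zero, while a plain loop stays finite.

```python
import numpy as np

def vectorized_approach(a, b):
    # Same steps as the vectorized answer above.
    b_rev_cumprod = b[::-1].cumprod()[::-1]
    B = np.hstack((b_rev_cumprod, 1))
    K_ext = np.hstack((0, a * b))
    sums = (B * K_ext).cumsum()
    return sums[1:] / b_rev_cumprod

def loop_approach(a, b):
    # Straightforward loop from the question, used here as the reference.
    c, d = [], 0.0
    for val, scale in zip(a, b):
        d += val
        c.append(d)
        d *= scale
    return np.array(c)

n = 2000                       # assumed size, large enough to underflow
a = np.ones(n)
b = np.full(n, 0.1)

with np.errstate(divide="ignore", invalid="ignore"):
    c_vec = vectorized_approach(a, b)
c_loop = loop_approach(a, b)

print(np.isnan(c_vec[0]))        # True: 0/0 in the first position
print(np.isfinite(c_loop).all()) # True: the loop has no such problem
```

When inputs like these are possible, falling back to the loop (or a compiled version of it) is the safer choice.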

Let's see if we can get even faster. I am now leaving the pure Python world to show that this purely numeric problem can be optimized even further.
The two players are @Divakar's fast vectorized version:
def vectorized_approach(a,b):
    b_rev_cumprod = b[::-1].cumprod()[::-1]
    B = np.hstack((b_rev_cumprod,1))
    K = a*b
    K_ext = np.hstack((0,K))
    sums = (B*K_ext).cumsum()
    return sums[1:]/b_rev_cumprod
and a cython version:
%%cython
import numpy as np
def cython_approach(long[:] a, double[:] b):
    cdef double d
    cdef size_t i, n
    n = a.shape[0]
    cdef double[:] c = np.empty(n)
    d = 0
    for i in range(n):
        d += a[i]
        c[i] = d
        d *= b[i]
    return c
The cython version is about 5x faster than the vectorized version:
%timeit vectorized_approach(a,b) -> 10000 loops, best of 3: 43.4 µs per loop
%timeit cython_approach(a,b) -> 100000 loops, best of 3: 7.7 µs per loop
Another plus of the cython version is that it is much more readable.
The big downside is that you are leaving pure python and depending on your use case compiling an extension module may not be an option for you.

This here works for me and is vectorized
b_mat = np.tile(b,(b.size,1)).T
b_mat = np.vstack((np.ones(b.size),b_mat))
np.fill_diagonal(b_mat,1)
b_mat[np.triu_indices(b.size)]=1
b_prod_mat = np.cumprod(b_mat,axis=0)
b_prod_mat[np.triu_indices(b.size)] = 0
np.fill_diagonal(b_prod_mat,1)
c = np.dot(b_prod_mat,a)
c
# output
array([ 1. , 2.01 , 4.402, 6.132, 7.613, 8.761, 14.257,
18.128, 16.316])
I agree it is not easy to see what's going on. Your array c can be written as a matrix-vector multiplication b_prod_mat * a, where a is your array and b_prod_mat consists of specific products of b. Most of the effort goes into building b_prod_mat.

I am not sure that's better than a for loop, but here is a way:
a.dot([np.concatenate((np.zeros(i), (1, ), np.cumprod(b[i:-1]))) for i in range(len(b))])
What it does is build, row by row, a big matrix A like this:
1 b0 b0b1 b0b1b2 ... b0b1..bn-1
0 1 b1 b1b2 ... b1..bn-1
0 0 1 b2 ...
...
0 0 0 0 ... 1
Then you simply multiply the vector a with the matrix A and you get your expected result.
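To make the matrix picture concrete, here is a minimal sketch that builds the upper-triangular matrix row by row and checks it against the loop from the question. The function names `matrix_approach` and `loop_reference` are hypothetical, and the sketch assumes a and b have the same length. It uses O(n^2) memory, so it is meant for understanding rather than for large inputs.

```python
import numpy as np

def matrix_approach(a, b):
    n = len(a)
    A = np.zeros((n, n))
    for i in range(n):
        A[i, i] = 1.0
        # Row i holds b[i], b[i]*b[i+1], ... to the right of the diagonal.
        A[i, i + 1:] = np.cumprod(b[i:n - 1])
    return a.dot(A)

def loop_reference(a, b):
    c, d = [], 0.0
    for val, scale in zip(a, b):
        d += val
        c.append(d)
        d *= scale
    return np.array(c)

a = np.array([1, 2, 4, 6, 7, 8, 9, 11])
b = np.array([0.01, 0.2, 0.03, 0.1, 0.1, 0.6, 0.5, 0.9])

print(np.allclose(matrix_approach(a, b), loop_reference(a, b)))  # True
```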

Find indices of common values in two arrays

I'm using Python 2.7.
I have two arrays, A and B.
To find the indices of the elements in A that are present in B, I can do
A_inds = np.in1d(A,B)
I also want to get the indices of the elements in B that are present in A, i.e. the indices in B of the same overlapping elements I found using the above code.
Currently I am running the same line again as follows:
B_inds = np.in1d(B,A)
but this extra calculation seems like it should be unnecessary. Is there a more computationally efficient way of obtaining both A_inds and B_inds?
I am open to using either list or array methods.

np.unique and np.searchsorted could be used together to solve it -
def unq_searchsorted(A,B):
    # Get unique elements of A and B and the indices based on the uniqueness
    unqA,idx1 = np.unique(A,return_inverse=True)
    unqB,idx2 = np.unique(B,return_inverse=True)
    # Create mask equivalent to np.in1d(A,B) and np.in1d(B,A) for unique elements
    mask1 = (np.searchsorted(unqB,unqA,'right') - np.searchsorted(unqB,unqA,'left'))==1
    mask2 = (np.searchsorted(unqA,unqB,'right') - np.searchsorted(unqA,unqB,'left'))==1
    # Map back to all non-unique indices to get equivalent of np.in1d(A,B),
    # np.in1d(B,A) results for non-unique elements
    return mask1[idx1],mask2[idx2]
Runtime tests and verify results -
In [233]: def org_app(A,B):
     ...:     return np.in1d(A,B), np.in1d(B,A)
...:
In [234]: A = np.random.randint(0,10000,(10000))
...: B = np.random.randint(0,10000,(10000))
...:
In [235]: np.allclose(org_app(A,B)[0],unq_searchsorted(A,B)[0])
Out[235]: True
In [236]: np.allclose(org_app(A,B)[1],unq_searchsorted(A,B)[1])
Out[236]: True
In [237]: %timeit org_app(A,B)
100 loops, best of 3: 7.69 ms per loop
In [238]: %timeit unq_searchsorted(A,B)
100 loops, best of 3: 5.56 ms per loop
If the two input arrays are already sorted and unique, the performance boost would be substantial. Thus, the solution function would simplify to -
def unq_searchsorted_v1(A,B):
    out1 = (np.searchsorted(B,A,'right') - np.searchsorted(B,A,'left'))==1
    out2 = (np.searchsorted(A,B,'right') - np.searchsorted(A,B,'left'))==1
    return out1,out2
Subsequent runtime tests -
In [275]: A = np.random.randint(0,100000,(20000))
...: B = np.random.randint(0,100000,(20000))
...: A = np.unique(A)
...: B = np.unique(B)
...:
In [276]: np.allclose(org_app(A,B)[0],unq_searchsorted_v1(A,B)[0])
Out[276]: True
In [277]: np.allclose(org_app(A,B)[1],unq_searchsorted_v1(A,B)[1])
Out[277]: True
In [278]: %timeit org_app(A,B)
100 loops, best of 3: 8.83 ms per loop
In [279]: %timeit unq_searchsorted_v1(A,B)
100 loops, best of 3: 4.94 ms per loop
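If integer positions are acceptable instead of boolean masks, np.intersect1d can return both index sets in one call through its return_indices flag (available from NumPy 1.15 onward). The sketch below uses made-up sample arrays. For repeated values it reports only the first occurrence in each array, so it is not a drop-in replacement for the masks above.

```python
import numpy as np

A = np.array([3, 1, 7, 9])   # hypothetical sample data
B = np.array([9, 4, 3])

# common: sorted shared values; A_idx / B_idx: where each one sits in A and B.
common, A_idx, B_idx = np.intersect1d(A, B, return_indices=True)

print(common)  # [3 9]
print(A_idx)   # [0 3]
print(B_idx)   # [2 0]
```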

A simple multiprocessing implementation will get you a little more speed:
import time
import numpy as np
from multiprocessing import Process, Queue
a = np.random.randint(0, 20, 1000000)
b = np.random.randint(0, 20, 1000000)
def original(a, b, q):
    q.put( np.in1d(a, b) )
if __name__ == '__main__':
    t0 = time.time()
    q = Queue()
    q2 = Queue()
    p = Process(target=original, args=(a, b, q,))
    p2 = Process(target=original, args=(b, a, q2))
    p.start()
    p2.start()
    res = q.get()
    res2 = q2.get()
    print(time.time() - t0)
>>> 0.21398806572
Divakar's unq_searchsorted(A,B) method took 0.271834135056 seconds on my machine.

Optimization of average calculation from a list of dictionaries

I have a list of dictionaries, with keys 'a', 'n', 'o', 'u'.
Is there a way to speed up this calculation, for instance with NumPy? There are tens of thousands of items in the list.
The data is drawn from a database, so I must live with that it's in the form of a list of dictionaries originally.
x = n = o = u = 0
for entry in indata:
    x += (entry['a']) * entry['n'] # n - number of data points
    n += entry['n']
    o += entry['o']
    u += entry['u']
    loops += 1
average = int(round(x / n)), n, o, u

I doubt this will be much faster, but I suppose it's a candidate for timeit...
from operator import itemgetter
x = n = o = u = 0
items = itemgetter('a','n','o','u')
for entry in indata:
    A,N,O,U = items(entry)
    x += A*N # n - number of data points
    n += N
    o += O #don't know what you're doing with O or U, but I'll leave them
    u += U
average = int(round(x / n)), n, o, u
At the very least, it saves a lookup of entry['n'] since I've now saved it to a variable

You could try something like this:
mean_a = np.sum(np.array([d['a'] for d in data]) * np.array([d['n'] for d in data])) / len(data)
EDIT: Actually, the method above from @mgilson is faster:
import numpy as np
from operator import itemgetter
from pandas import DataFrame
data=[]
for i in range(100000):
    data.append({'a':np.random.random(), 'n':np.random.random(), 'o':np.random.random(), 'u':np.random.random()})

def func1(data):
    x = n = o = u = 0
    items = itemgetter('a','n','o','u')
    for entry in data:
        A,N,O,U = items(entry)
        x += A*N # n - number of data points
        n += N
        o += O #don't know what you're doing with O or U, but I'll leave them
        u += U
    average = int(round(x / n)), n, o, u
    return average

def func2(data):
    mean_a = np.sum(np.array([d['a'] for d in data]) * np.array([d['n'] for d in data])) / len(data)
    return (mean_a,
            np.sum([d['n'] for d in data]),
            np.sum([d['o'] for d in data]),
            np.sum([d['u'] for d in data])
            )

def func3(data):
    dframe = DataFrame(data)
    return np.sum((dframe["a"]*dframe["n"])) / dframe.shape[0], np.sum(dframe["n"]), np.sum(dframe["o"]), np.sum(dframe["u"])
In [3]: %timeit func1(data)
10 loops, best of 3: 59.6 ms per loop
In [4]: %timeit func2(data)
10 loops, best of 3: 138 ms per loop
In [5]: %timeit func3(data)
10 loops, best of 3: 129 ms per loop
If you are doing other operations on the data, I would definitely look into using the Pandas package. Its DataFrame object is a nice match for the list of dictionaries that you are working with. I think that the majority of the overhead is in getting the data into NumPy arrays or DataFrame objects.
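Since the conversion appears to dominate, one variant worth timing is to pull all four fields out of each dictionary in a single pass and let NumPy do the sums. The sketch below is only an illustration: the function name `summarize` and the sample data are made up, it assumes a non-empty list with numeric values, and it divides by the sum of n as the loop in the question does. No speedup is claimed without measuring on the real data.

```python
import numpy as np
from operator import itemgetter

def summarize(indata):
    """Weighted average of 'a' (weights 'n') plus the totals of 'n', 'o', 'u'."""
    get = itemgetter('a', 'n', 'o', 'u')
    # One pass over the list builds a (len(indata), 4) array.
    arr = np.array([get(entry) for entry in indata], dtype=float)
    a, n, o, u = arr.T
    total_n = n.sum()
    return int(round((a * n).sum() / total_n)), total_n, o.sum(), u.sum()

# Hypothetical sample input in the shape described in the question.
indata = [
    {'a': 10, 'n': 1, 'o': 2, 'u': 3},
    {'a': 20, 'n': 4, 'o': 4, 'u': 5},
]
result = summarize(indata)
print(result[0])  # 18
```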

If all you're looking to do is get an average value of something, why not:
sum_for_average = math.fsum(your_item)
average_of_list = sum_for_average / len(your_item)
No mucking about with NumPy at all.
