I have the following statement in Pandas that uses the apply method, which can take up to 2 minutes to run. I read that in order to optimize the speed I should vectorize the statement. My original statement looks like this:
output_data["on_s"] = output_data["m_ind"].apply(lambda x: my_matrix[x, 0] + my_matrix[x, 1] + my_matrix[x, 2])
where my_matrix is a scipy.sparse matrix. So my initial step was to use the sum method:
summed_matrix = my_matrix.sum(axis=1)
But then after this point I get stuck on how to proceed.
Update: Including example data
The matrix looks like this (scipy.sparse.csr_matrix):
(290730, 2) 0.3058016922838267
(290731, 2) 0.3390328430763723
(290733, 2) 0.0838999800585995
(290734, 2) 0.0237008960604337
(290735, 2) 0.0116864263235209
output_data["m_ind"] is just a Pandas series that has some values like so:
97543
97544
97545
97546
97547
An update: you have to convert the sparse matrix into a dense matrix first.
Since you haven't provided any reproducible code, I can't tell exactly what your problem is or give you a very precise answer, but I will answer according to my understanding.
Let's assume your my_matrix is something like this:
[[1,2,3],
[4,5,6],
[7,8,9]]
then summed_matrix will be [6, 15, 24]. So if my assumption is right, it looks like you are almost there.
First I'll give you the simplest answer.
Try using this line.
output_data["on_s"] = output_data["m_ind"].apply(lambda x: summed_matrix[x])
Then we can try this completely vectorized method:
Turn m_ind into a one-hot encoded array ohe_array. Be careful to generate ohe_array in increasing (sorted) order of the indices, otherwise you will get the wrong answer.
Then take the dot product of ohe_array and summed_matrix, and assign the result to the column on_s.
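As a sketch of these steps with made-up data (assuming summed_matrix has already been flattened to a 1-D array, and using np.eye purely for illustration):

```python
import numpy as np

summed_matrix = np.array([6, 15, 24])   # row sums of the 3x3 example above
m_ind = np.array([2, 0, 1])             # hypothetical contents of output_data["m_ind"]

# one-hot encode m_ind: row k of ohe_array selects summed_matrix[m_ind[k]]
ohe_array = np.eye(len(summed_matrix))[m_ind]

# the dot product then gathers the summed values in m_ind order
on_s = ohe_array @ summed_matrix
```

Note that for many distinct indices the one-hot matrix gets large; plain fancy indexing (summed_matrix[m_ind]) gives the same result without building it.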
Also, we can try the following code and compare its performance against apply:
indexes = output_data["m_ind"].values
on_s = []
for i in indexes:
    on_s.append(summed_matrix[i])
output_data["on_s"] = on_s
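For comparison, here is a fully vectorized end-to-end sketch using hypothetical stand-ins for my_matrix and output_data. Note that .sum(axis=1) on a scipy sparse matrix returns an (n, 1) np.matrix, so it has to be flattened before fancy indexing:

```python
import numpy as np
import pandas as pd
from scipy import sparse

# hypothetical small stand-ins for my_matrix and output_data
my_matrix = sparse.csr_matrix(np.array([[1., 2., 3.],
                                        [4., 5., 6.],
                                        [7., 8., 9.]]))
output_data = pd.DataFrame({"m_ind": [2, 0, 1]})

# flatten the (n, 1) np.matrix returned by .sum(axis=1) to a plain 1-D array
summed = np.asarray(my_matrix.sum(axis=1)).ravel()

# fancy indexing replaces the per-row apply entirely
output_data["on_s"] = summed[output_data["m_ind"].values]
```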
I wanted to append a 0 to each row, to help with smoothing in ML. I used a basic for loop, but since there are more than 50,000 rows, I wanted to know if there is a specific numpy method that does the same job?
like:
a = [[1,2,3],[4,5,6]]
by appending 0,
we get
a = [[1,2,3,0],[4,5,6,0]]
I have already run the basic Python code using a for loop. I also tried numpy.append, which requires a proper numpy array.
I used
a = [[1,2,3],[4,5,6]]
for x in a:
    x.append(0)
I wish to get a = [[1,2,3,0],[4,5,6,0]] using numpy
To put Aliakbar's answer in more explicit form:
b = np.zeros((a.shape[0], 1))
np.hstack((a, b))
will do the trick; note that b must be a column of shape (row_no, 1) for hstack to append it. The alternative np.vstack((a, b)) adds new rows instead (with b of shape (1, a.shape[1])), and for more generic purposes np.concatenate((a, b), axis=1) works as well.
You can use the numpy.hstack function, as in the example below:
a = np.array([[1,2,3],[4,5,6]])
np.hstack((a, np.zeros([2, 1])))
I'm trying to code something like this:
sum over j of (x[j] - x[j-1])**2 + (y[j] - y[j-1])**2
where x and y are two different numpy arrays and j is an index into them. I don't know the length of the arrays because they will be entered by the user, and I cannot use loops to code this.
My main problem is finding a way to move between indexes, since I would need to go from
x[2]-x[1] ... x[3]-x[2]
and so on.
I'm stumped but I would appreciate any clues.
A numpy-ic solution would be:
np.square(np.diff(x)).sum() + np.square(np.diff(y)).sum()
A list comprehension approach would be:
sum([(x[k]-x[k-1])**2+(y[k]-y[k-1])**2 for k in range(1,len(x))])
will give you the result you want, even if your data comes as plain lists.
x[2]-x[1] ... x[3]-x[2] can be generalized to:
x[[1,2,3,...]] - x[[0,1,2,...]]
x[1:]-x[:-1] # ie. (1 to the end)-(0 to almost the end)
numpy can take the difference between two arrays of the same shape. In list terms this would be:
[i-j for i,j in zip(x[1:], x[:-1])]
np.diff does essentially this, a[slice1]-a[slice2], where the slices are as above.
The full answer squares, sums and squareroots.
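A concrete sketch of these pieces with made-up data:

```python
import numpy as np

x = np.array([0., 3., 3., 7.])
y = np.array([0., 0., 4., 4.])

# the slice subtraction and np.diff are the same operation
assert np.array_equal(x[1:] - x[:-1], np.diff(x))

# square the consecutive differences, sum over both arrays, then take the root
total = np.sqrt(np.square(np.diff(x)).sum() + np.square(np.diff(y)).sum())
```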
I am trying to solve a "very simple" problem. Not so simple in Python. Given a large matrix A and another smaller matrix B I want to substitute certain elements of A with B.
In Matlab it would look like this:
Given A, row_coord = [1,5,6], col_coord = [2,4], and a matrix B of size (3x2): A[row_coord, col_coord] = B
In Python I tried to use product(row_coord, col_coord) from itertools to generate the set of all indexes that need to be accessed in A, but it does not work. All examples on submatrix substitution refer to block-wise row_coord = col_coord examples. Nothing concrete except for http://comments.gmane.org/gmane.comp.python.numeric.general/11912 seems to relate to the problem I am facing, and the code in the link does not work.
Note: I know that I can implement what I need via the double for-loop, but on my data such a loop adds 9 secs to the run of one iteration and I am looking for a faster way to implement this.
Any help will be greatly appreciated.
Assuming you're using numpy arrays, then (in the case where your B is a scalar) the following code should work to assign the chosen elements the value of B.
itertools.product will create all of the coordinate pairs, which we then convert into a numpy array and use to index your original array:
import numpy as np
from itertools import product

A = np.zeros([20, 20])
col_coord = [0, 1, 3]
row_coord = [1, 2]

# every (row, col) pair as an (n, 2) array
coords = np.array(list(product(row_coord, col_coord)))

B = 1
A[coords[:, 0], coords[:, 1]] = B
I used this excellent answer by unutbu to work out how to do the indexing.
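For the matrix-valued B in the original question, numpy's np.ix_ builds the open mesh directly and gives the Matlab-style block assignment (a sketch with a made-up 3x2 B):

```python
import numpy as np

A = np.zeros((20, 20))
row_coord = [1, 5, 6]
col_coord = [2, 4]
B = np.arange(6).reshape(3, 2)    # a hypothetical 3x2 block to write in

# np.ix_ turns the two index lists into an open mesh, so this assigns
# B across the full cross product of row_coord and col_coord
A[np.ix_(row_coord, col_coord)] = B
```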
I am having a small issue understanding indexing in Numpy arrays. I think a simplified example is best to get an idea of what I am trying to do.
So first I create an array of zeros of the size I want to fill:
x = range(0,10,2)
y = range(0,10,2)
a = zeros(len(x),len(y))
so that will give me a 5x5 array of zeros. Now, I want to fill the array with a rather complicated function that I can't get to work with grids. My problem is that I'd like to iterate as:
for i in xrange(0,10,2):
    for j in xrange(0,10,2):
        # do function and fill the array corresponding to (i, j)
        ...
however, right now what I would like to be at a[2,10] is a function of 2 and 10, but instead the index for a function of 2 and 10 would be a[1,4] or whatever.
Again, maybe this is elementary, I've gone over the docs and find myself at a loss.
EDIT:
In the end I vectorized as much as possible and wrote the simulation loops that I could not vectorize in Cython. Further, I used Joblib to parallelize the operation. I stored the results in a list because an array was not filling correctly when run in parallel. I then used itertools to split the list into individual results and Pandas to organize them.
Thank you for all the help
Some tips for you to get things done while keeping good performance:
- avoid Python `for` loops
- create a function that can deal with vectorized inputs
Example:
def f(xs, ys):
    return xs**2 + ys**2 + xs*ys
where you can pass xs and ys as arrays and the operation will be done element-wise:
xs = np.random.random((100,200))
ys = np.random.random((100,200))
f(xs,ys)
You should read more about numpy broadcasting to get a better understanding of how array operations work. This will help you design a function that handles the arrays properly.
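As a sketch of how broadcasting could fill the 5x5 grid from the question without loops (using np.ogrid and the same hypothetical f):

```python
import numpy as np

def f(xs, ys):
    return xs**2 + ys**2 + xs*ys

# a (5, 1) column of x values broadcast against a (1, 5) row of y values
xs, ys = np.ogrid[0:10:2, 0:10:2]
a = f(xs, ys)    # shape (5, 5), no Python loops
```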
First, you lack some parentheses in zeros; the first argument should be a tuple:
a = zeros((len(x),len(y)))
Then, the corresponding indices for your table are i//2 and j//2:
for i in range(0, 10, 2):
    for j in range(0, 10, 2):
        # do function and fill the array corresponding to (i, j)
        a[i//2, j//2] = 1
But I second Saullo Castro, you should try to vectorize your computations.
I would like to apply a function to a one-dimensional array, 3 elements at a time, and output a single element for each group.
for example I have an array of 13 elements:
a = np.arange(13)**2
and I want to apply a function, let's say np.std as an example.
Here is the equivalent list comprehension:
[np.std(a[i:i+3]) for i in range(0, len(a),3)]
[1.6996731711975948,
6.5489609014628334,
11.440668201153674,
16.336734339790461,
0.0]
does anyone know a more efficient way using numpy functions?
The simplest way is to reshape it and apply the function along an axis.
import numpy as np
a = np.arange(12)**2   # 12 elements, so it reshapes evenly into (4, 3)
b = a.reshape(4,3)
print(np.std(b, axis=1))
If you need a little better performance than that, you could try stride_tricks. Below is the same as above except using stride_tricks. I was wrong about the performance gain, because as you can see below, b becomes exactly the same view as b above. I wouldn't be surprised if they compiled to exactly the same thing.
import numpy as np
a = np.arange(12)**2
b = np.lib.stride_tricks.as_strided(a, shape=(4,3), strides=(a.itemsize*3, a.itemsize))
print(np.std(b, axis=1))
Are you talking about something like vectorize? http://docs.scipy.org/doc/numpy/reference/generated/numpy.vectorize.html
You can reshape it, but that requires that the length be a multiple of 3. If you can tack on some bogus entries at the end, you can do this:
[np.std(s) for s in a.reshape(-1,3)]
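One way to tack on entries for the original 13-element array without skewing the statistics is to pad with NaN and use np.nanstd, which ignores the padding (a sketch):

```python
import numpy as np

a = np.arange(13.0)**2
pad = (-len(a)) % 3                      # extra slots needed to reach a multiple of 3
padded = np.pad(a, (0, pad), constant_values=np.nan)

# nanstd ignores the NaN padding, so the final partial group is handled correctly
result = np.nanstd(padded.reshape(-1, 3), axis=1)
```

This reproduces the list-comprehension output above, including the 0.0 for the last single-element group.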