I imported a CSV into a DataFrame and got a Series like this:
In[1]: A = df["data1"]
B = df["data2"]
type(A)
Out[1]: pandas.core.series.Series
I wrote a Pearson correlation function in a module, like this:
def pearson(vector1, vector2):
    n = len(vector1)
    # simple sums
    sum1 = sum(float(vector1[i]) for i in range(n))
    sum2 = sum(float(vector2[i]) for i in range(n))
    # sum up the squares
    sum1_pow = sum([pow(v, 2.0) for v in vector1])
    sum2_pow = sum([pow(v, 2.0) for v in vector2])
    # sum up the products
    p_sum = sum([vector1[i] * vector2[i] for i in range(n)])
    num = p_sum - (sum1 * sum2 / n)
    den = ((sum1_pow - pow(sum1, 2) / n) * (sum2_pow - pow(sum2, 2) / n)) ** 0.5
    if den == 0:
        return 0.0
    return num / den
I wanted to use as_matrix to convert each Series to a NumPy array, but it returned a method rather than an array. How do I get a NumPy array from a Series?
from modulas import pearson1
import numpy as np
An = A.as_matrix
Bn = B.as_matrix
p = pearson(An, Bn)
TypeError: 'module' object is not callable
How do I convert a Series into a NumPy array?
Use values:
series = pd.Series([1, 2, 3], name="a")
series.values
# => array([1, 2, 3])
Change the code to:
An = A.as_matrix()
You have to call the method (note the parentheses) in order for it to operate on the pandas Series.
As @Mad Physicist mentioned, you can use a pandas Series in place of a NumPy array most of the time anyway.
You can also do:
An = A.values
I believe as_matrix will be replaced by values in a future version of pandas.
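Putting it together, here is a minimal sketch of the fixed flow (the tiny DataFrame is made up purely for illustration; note that values is an attribute, not a method, so it takes no parentheses):
import numpy as np
import pandas as pd
df = pd.DataFrame({"data1": [1.0, 2.0, 3.0, 4.0],
                   "data2": [2.0, 4.1, 5.9, 8.0]})
An = df["data1"].values   # attribute access, no () needed
Bn = df["data2"].values
p = pearson(An, Bn)   # the function defined above
print(np.isclose(p, np.corrcoef(An, Bn)[0, 1]))   # cross-check against NumPy: True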
Related
I have a pandas dataframe and I am trying to estimate a new timeseries V(t) based on the values of an existing timeseries B(t). I have written a minimal reproducible example to generate a sample dataframe as follows:
import pandas as pd
import numpy as np
lenb = 5000
lenv = 200
l = 5
B = pd.DataFrame({'a': np.arange(0, lenb, 1), 'b': np.arange(0, lenb, 1)},
index=pd.date_range('2022-01-01', periods=lenb, freq='2s'))
I want to calculate V(t) for all times 't' in the timeseries B as:
V(t) = (B(t-2*l) + 4*B(t-l) + 6*B(t) + 4*B(t+l) + 1*B(t+2*l)) / 16
How can I perform this calculation in a vectorized manner in pandas? Let's say that l = 5.
Would this be the correct way to do it?
def V_t(B, l):
    V = (B.shift(-2*l) + 4*B.shift(-l) + 6*B + 4*B.shift(l) + B.shift(2*l)) / 16
    return V
I would have done it as you suggested in your latest edit. So here is an alternative that avoids having to type out all the shift commands for an arbitrarily long list of factors/multipliers:
import numpy as np
import pandas as pd

def V_t(B, l):
    X = [1, 4, 6, 4, 1]        # stencil coefficients (they sum to 16)
    Y = [-2*l, -l, 0, l, 2*l]  # the corresponding shifts
    return pd.DataFrame(np.add.reduce([x * B.shift(y) for x, y in zip(X, Y)]) / 16,
                        index=B.index, columns=B.columns)
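As a quick sanity check (a sketch using the sample B from the question and l = 5), the two implementations should agree, NaN edges included:
l = 5
V_shift = (B.shift(-2*l) + 4*B.shift(-l) + 6*B + 4*B.shift(l) + B.shift(2*l)) / 16
V_reduce = V_t(B, l)
print(np.allclose(V_shift, V_reduce, equal_nan=True))   # expect True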
I have two arrays:
index = [2,1,0,0,1,1,1,2]
values = [1,2,3,4,5,4,3,2]
I would like to produce:
[sum(v for i, v in zip(index, values) if i == ui) for ui in sorted(set(index))]
in the most efficient way possible.
- my values are computed via autograd
- doing a groupby in pandas is really not efficient because of the point above
- I have to do it hundreds of times on the same index but with different values
- len(values) ~ 10**7
- len(set(index)) ~ 10**6
- Counter(index).most_common(1)[0][1] ~ 1000
I think a pure numpy solution would be the best.
I tried to precompute the reduced version of index, and then do:
[values[l].sum() for l in reduced_index]
but it is not efficient enough.
Here is a minimal code sample:
import numpy as np
import autograd.numpy as anp
from autograd import grad
import pandas as pd
EASY = True
if EASY:
    index = np.random.randint(10, size=10**3)
    values = anp.random.rand(10**3) * 2 - 1
else:
    index = np.random.randint(1000, size=10**7)
    values = anp.random.rand(10**7) * 2 - 1

# doesn't work
def f1(values):
    return anp.exp(anp.bincount(index, weights=values)).sum()

index_unique = sorted(set(index))
index_map = {j: i for i, j in enumerate(index_unique)}
index_mapped = [index_map[i] for i in index]
index_lists = [[] for _ in range(len(index_unique))]
for i, j in enumerate(index_mapped):
    index_lists[j].append(i)

def f2(values):
    s = anp.array([values[l].sum() for l in index_lists])
    return anp.exp(s).sum()

ans = grad(f2)(values)
If your indices are non-negative integers, you can use np.bincount with values as the weights:
np.bincount(index, weights=values)
# array([ 7., 14., 3.])
This gives the sum at each position from 0 to max(index).
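If the labels are large or sparse (as in the 10**6-distinct-values case above), a common trick is to compress the label space once with np.unique and reuse the inverse mapping for every new values array; this is a plain-NumPy sketch, not autograd-aware:
import numpy as np
index = np.array([20, 10, 0, 0, 10, 10, 10, 20])   # sparse labels
values = np.array([1, 2, 3, 4, 5, 4, 3, 2], dtype=float)
uniq, inv = np.unique(index, return_inverse=True)  # compute once, reuse for new values
sums = np.bincount(inv, weights=values)            # one sum per distinct label
print(uniq)   # [ 0 10 20]
print(sums)   # [ 7. 14.  3.]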
Resuming this question: Compute the pairwise distance in scipy with missing values
Test case: I want to compute the pairwise distance of series of different lengths that are grouped together, and I have to do it in the most efficient way possible (using the Euclidean distance).
One way that makes it work could be this:
import pandas as pd
import numpy as np
from scipy.spatial.distance import pdist
a = pd.DataFrame(np.random.rand(10, 4), columns=['a','b','c','d'])
a.loc[0, 'a'] = np.nan
a.loc[1, 'a'] = np.nan
a.loc[0, 'c'] = np.nan
a.loc[1, 'c'] = np.nan
def dropna_on_the_fly(x, y):
    return np.sqrt(np.nansum((x - y)**2))

pdist(a, dropna_on_the_fly)
but I feel this could be very inefficient, as the built-in metrics of pdist are internally optimized, whereas a custom function is simply called once per pair.
I have a hunch that there is a vectorized solution in NumPy in which I broadcast the subtraction and then proceed with np.nansum for a NaN-resistant sum, but I am unsure how to proceed.
Inspired by this post, here are two solutions.
Approach #1 : The vectorized solution would be -
ar = a.values
r,c = np.triu_indices(ar.shape[0],1)
out = np.sqrt(np.nansum((ar[r] - ar[c])**2,1))
Approach #2 : The memory-efficient and more performant one for large arrays would be -
ar = a.values
b = np.where(np.isnan(ar), 0, ar)
mask = ~np.isnan(ar)
n = b.shape[0]
N = n*(n-1)//2
idx = np.concatenate(([0], np.arange(n-1, 0, -1).cumsum()))
start, stop = idx[:-1], idx[1:]
out = np.empty(N, dtype=b.dtype)
for j, i in enumerate(range(n-1)):
    dif = b[i, None] - b[i+1:]
    mask_j = (mask[i] & mask[i+1:])
    masked_vals = mask_j * dif
    out[start[j]:stop[j]] = np.einsum('ij,ij->i', masked_vals, masked_vals)
    # or simply: ((mask_j * dif)**2).sum(1)
out = np.sqrt(out)
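A quick consistency check, assuming both snippets above have been run on the same a (recomputing Approach #1 as out1 and comparing it against the out from Approach #2):
r, c = np.triu_indices(ar.shape[0], 1)
out1 = np.sqrt(np.nansum((ar[r] - ar[c])**2, 1))
print(np.allclose(out1, out))   # expect True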
I'd like to multiply two vectors, one column (i.e., (N+1)x1) and one row (i.e., 1x(N+1)), to give an (N+1)x(N+1) matrix. I'm fairly new to NumPy but have some experience with MATLAB; this is the MATLAB equivalent of what I want in NumPy:
n = 0:N;
xx = cos(pi*n/N)';
T = cos(acos(xx)*n');
in Numpy I've tried:
import numpy as np
n = range(0,N+1)
pi = np.pi
xx = np.cos(np.multiply(pi / float(N), n))
xxa = np.asarray(xx)
na = np.asarray(n)
nd = np.transpose(na)
T = np.cos(np.multiply(np.arccos(xxa),nd))
I added the asarray lines after I noticed that without them NumPy seemed to be treating xx and n as lists. np.shape(n), np.shape(xx), np.shape(na) and np.shape(xxa) all give the same result: (100001L,)
np.multiply only does element-by-element multiplication. You want an outer product. Use np.outer:
np.outer(np.arccos(xxa), nd)
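So, using the question's variable names, the full line would be:
T = np.cos(np.outer(np.arccos(xxa), nd))   # shape (N+1, N+1)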
If you want to use NumPy similar to MATLAB, you have to make sure that your arrays have the right shape. You can check the shape of any NumPy array with arrayname.shape, and because your array na has shape (N+1,) instead of (N+1, 1), the transpose has no effect and multiply computes the element-wise product rather than the outer product. Use arrayname.reshape(N+1, 1) and arrayname.reshape(1, N+1), respectively, to transform your arrays:
import numpy as np
n = range(0,N+1)
pi = np.pi
xx = np.cos(np.multiply(pi / float(N), n))
xxa = np.asarray(xx).reshape(N+1,1)
na = np.asarray(n).reshape(N+1,1)
nd = np.transpose(na)
T = np.cos(np.multiply(np.arccos(xxa),nd))
Since Python 3.5, you can use the @ operator for matrix multiplication. So it's a walkover to get code that's very similar to MATLAB:
import numpy as np
n = np.arange(N + 1).reshape(N + 1, 1)
xx = np.cos(np.pi * n / N)
T = np.cos(np.arccos(xx) @ n.T)
Here n.T denotes the transpose of n.
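For what it's worth, the @ version and np.outer agree, since an (N+1)x1 @ 1x(N+1) product is exactly an outer product; a quick check with a small made-up N:
N = 4
n = np.arange(N + 1).reshape(N + 1, 1)
xx = np.cos(np.pi * n / N)
T_at = np.cos(np.arccos(xx) @ n.T)
T_outer = np.cos(np.outer(np.arccos(xx), n))
print(np.allclose(T_at, T_outer))   # True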
Assume you have an array of values that will need to be summed together
d = [1,1,1,1,1]
and a second array specifying which elements need to be summed together
i = [0,0,1,2,2]
The result will be stored in a new array of size max(i)+1. So for example i=[0,0,0,0,0] would be equivalent to summing all the elements of d and storing the result at position 0 of a new array of size 1.
I tried to implement this using
c = np.zeros(max(i) + 1)
c[i] += d
However, the += operation adds each element only once, thus giving the unexpected result of
[1,1,1]
instead of
[2,1,2]
How would one correctly implement this kind of summation?
If I understand the question correctly, there is a fast function for this (as long as the data array is 1d)
>>> i = np.array([0,0,1,2,2])
>>> d = np.array([0,1,2,3,4])
>>> np.bincount(i, weights=d)
array([ 1., 2., 7.])
np.bincount returns an array covering all integers in range(max(i) + 1), even if some counts are zero.
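Worth noting: the reason c[i] += d only adds once per position is that fancy-indexed assignment is buffered. NumPy's unbuffered counterpart is np.add.at, which accumulates over repeated indices directly; a small sketch with the question's data:
import numpy as np
i = np.array([0, 0, 1, 2, 2])
d = np.array([1, 1, 1, 1, 1])
c = np.zeros(i.max() + 1)
np.add.at(c, i, d)   # unbuffered in-place add: repeated indices accumulate
print(c)             # [2. 1. 2.]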
Juh_'s comment is the most efficient solution. Here's working code:
import numpy as np
import scipy.ndimage as ni

i = np.array([0, 0, 1, 2, 2])
d = np.array([0, 1, 2, 3, 4])
n_indices = i.max() + 1
print(ni.sum(d, i, np.arange(n_indices)))
This solution should be more efficient for large arrays (it iterates over the possible index values instead of the individual entries of i):
import numpy as np

i = np.array([0, 0, 1, 2, 2])
d = np.array([0, 1, 2, 3, 4])
i_max = i.max()
c = np.empty(i_max + 1)
for j in range(i_max + 1):
    c[j] = d[i == j].sum()
print(c)
# [1. 2. 7.]
def zeros(ilen):
    r = []
    for i in range(0, ilen):
        r.append(0)
    return r

i_list = [0, 0, 1, 2, 2]
d = [1, 1, 1, 1, 1]
result = zeros(max(i_list) + 1)
for pos, idx in enumerate(i_list):
    result[idx] += d[pos]
print(result)
In the general case when you want to sum submatrices by labels you can use the following code
import numpy as np
from scipy.sparse import coo_matrix

def labeled_sum1(x, labels):
    P = coo_matrix((np.ones(x.shape[0]), (labels, np.arange(len(labels)))))
    res = P.dot(x.reshape((x.shape[0], np.prod(x.shape[1:]))))
    return res.reshape((res.shape[0],) + x.shape[1:])

def labeled_sum2(x, labels):
    res = np.empty((np.max(labels) + 1,) + x.shape[1:], x.dtype)
    for i in np.ndindex(x.shape[1:]):
        res[(...,) + i] = np.bincount(labels, x[(...,) + i])
    return res
The first method uses sparse matrix multiplication. The second one is a generalization of user333700's answer. Both methods have comparable speed:
x = np.random.randn(100000, 10, 10)
labels = np.random.randint(0, 1000, 100000)
%time res1 = labeled_sum1(x, labels)
%time res2 = labeled_sum2(x, labels)
np.all(res1 == res2)
Output:
Wall time: 73.2 ms
Wall time: 68.9 ms
True