I have a NumPy array of size 94 x 155:
a = [[  1   2  20  68 210 290 ...]
     [  2  33  34  55 230 340 ...]
     ...]
I want to calculate the range of each row, so that I get 94 ranges as a result. I looked for a numpy.range function, which I don't think exists. If this can be done through a loop, that's also fine.
I'm looking for something like numpy.mean, which, if we set the axis parameter to 1, returns the mean for each row in the N-dimensional array.
I think np.ptp might do what you want:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.ptp.html
r = np.ptp(a,axis=1)
where r is your range array.
Try this:
import numpy as np

def range_of_vals(x, axis=0):
    # range = max - min along the given axis
    return np.max(x, axis=axis) - np.min(x, axis=axis)
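For the 94 x 155 array above, both approaches return one value per row. A quick sketch with random stand-in data (the original array isn't shown in full):
import numpy as np

a = np.random.rand(94, 155)     # stand-in for the original data
r1 = np.ptp(a, axis=1)          # peak-to-peak (max - min) of each row
r2 = range_of_vals(a, axis=1)   # same result via the helper above
print(r1.shape)                 # (94,)
print(np.allclose(r1, r2))      # True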
I'm trying to create a 100x100 matrix in which each row contains the next ordinal number, like below.
I created a vector from 1 to 100 and then copied it 100 times with a for loop. That gave me an array with the correct data, so I tried to rearrange it using np.argsort, but it didn't work the way I wanted (I don't even know why there are zeros after sorting).
Is there any way to get this matrix using other functions? I tried many approaches, but the final layout was never what I expected.
import numpy as np

max_x = 101
z = np.arange(1, 101)
print(z)
x = []
for i in range(1, max_x):
    x.append(z.copy())
print(x)
y = np.argsort(x)
y
argsort returns the indices that would sort the array, which is why you get zeros. You don't need it; what you want is to transpose the array.
Make x a NumPy array and use .T:
y = np.array(x).T
Output
[[ 1 1 1 ... 1 1 1]
[ 2 2 2 ... 2 2 2]
[ 3 3 3 ... 3 3 3]
...
[ 98 98 98 ... 98 98 98]
[ 99 99 99 ... 99 99 99]
[100 100 100 ... 100 100 100]]
You also don't need a loop to copy the array; use np.tile instead:
z = np.arange(1, 101)
x = np.tile(z, (100, 1))
y = x.T
# or one liner
y = np.tile(np.arange(1, 101), (100, 1)).T
import numpy as np
np.asarray([ (k+1)*np.ones(100) for k in range(100) ])
Or simply
np.tile(np.arange(1,101),(100,1)).T
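Another option, not from the answers above but a small sketch of my own, builds the same matrix with NumPy broadcasting instead of tiling or loops:
import numpy as np

# Column vector 1..100 broadcast across 100 columns
y = np.arange(1, 101)[:, None] * np.ones(100, dtype=int)

# Or repeat the column directly, avoiding the multiplication
y = np.repeat(np.arange(1, 101)[:, None], 100, axis=1)
print(y.shape)  # (100, 100)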
I have a DataFrame with sets of longitude/latitude values, where each set of coordinates behaves linearly. The coordinates in each set all share a common index value, so I've been trying to figure out a way to use groupby and apply interpolation in order to obtain more data points. Here is my data (simplified):
   Longitude  Latitude
t
0         40        70
0         41        71
0         42        72
0         43        73
1        120        10
1        121        12
1        122        14
1        123        16
..       ...       ...
In this instance, each set has 4 coordinates, and I want to interpolate each set so that it has a specific number of coordinates, say 8. This is what I've tried so far. The function works by returning a DataFrame of interpolated values, but I'm not sure how to use it in conjunction with groupby. How should I write the command, or is there a better method?
import numpy as np
import pandas as pd
from scipy.interpolate import interp1d

def interpolate(lon, lat):
    # convert coordinates into arrays
    x = np.asarray(lon)
    y = np.asarray(lat)
    # build an interpolation function from the coordinates
    f = interp1d(x, y, kind='linear')
    # 8 evenly spaced values from the minimum to the maximum longitude
    xnew = np.linspace(min(lon), max(lon), num=8, endpoint=True)
    # apply the interpolation function to obtain interpolated latitudes
    df = {'lon': xnew, 'lat': f(xnew)}
    df = pd.DataFrame(df)
    return df
df.groupby(level=0).apply(interpolate(df['Longitude'],df['Latitude']))
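For what it's worth, one way to wire this up (a sketch of my own, not from the original post): apply expects a callable, so pass a lambda that receives each group and hands its columns to the function.
result = df.groupby(level=0).apply(
    lambda g: interpolate(g['Longitude'], g['Latitude'])
)
print(result)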
In my DataFrame I have one column with numeric values, let's say distance. I want to find out which distance range (group) has the biggest number of records (rows).
Doing a simple
df.distance.value_counts() returns:
74 1
90 1
94 1
893 1
889 1
885 1
877 1
833 1
122 1
545 1
What I want to achieve is something like the buckets of a histogram, so I am expecting output like this:
900 4 #all values < 900 and > 850
100 3
150 1
550 1
850 1
The one approach I've figured out so far, though I don't think it's the best or most optimal one, is to find the max and min values, divide the range by my step (50 in this case), and then loop over all the values, assigning each to the appropriate group.
Is there any other, better approach for that?
I'd suggest doing the following, assuming your value column is labeled val
import numpy as np
df['bin'] = df['val'].apply(lambda x: 50*np.floor(x/50))
You can then count the values per bin:
df.groupby('bin')['val'].count()
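A slightly more vectorized variant of the same idea (my own tweak, not from the answer) uses integer arithmetic instead of apply:
df['bin'] = (df['val'] // 50) * 50
df.groupby('bin')['val'].count()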
Thanks to EdChum's suggestion and based on this example, I've figured out that the best way (at least for me) is to do something like this:
import numpy as np
import pandas as pd

step = 50
# ...
max_val = df.distance.max()
bins = list(range(0, int(np.ceil(max_val / step)) * step + step, step))
clusters = pd.cut(df.distance, bins, labels=bins[1:])
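From there, the counts per bucket (and the most populated bucket) can be pulled out like this; a small follow-up sketch, not part of the original answer:
counts = clusters.value_counts().sort_index()
print(counts)           # number of rows in each 50-wide bucket
print(counts.idxmax())  # label of the bucket with the most records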
I would like to force the matrix multiplication "orientation" using Python Pandas, whether between a DataFrame and a DataFrame, a DataFrame and a Series, or a Series and a Series.
As an example, I tried the following code:
t = pandas.Series([1, 2])
print(t.T.dot(t))
Which outputs: 5
But I expect this:
[[1 2]
 [2 4]]
Pandas is great, but this inability to do matrix multiplications the way I want is the most frustrating part, so any help would be greatly appreciated.
PS: I know Pandas tries to use the index implicitly to find the right way to compute the matrix product, but it seems this behavior can't be switched off!
Here:
In [1]: import numpy as np
In [2]: import pandas
In [3]: t = pandas.Series([1, 2])
In [4]: np.outer(t, t)
Out[4]:
array([[1, 2],
       [2, 4]])
Anyone coming to this now may want to consider: pandas.Series.to_frame(). It's kind of clunky.
Here's the original question's example:
import pandas as pd
t = pd.Series([1, 2])
t.to_frame() @ t.to_frame().T
# or equivalently:
t.to_frame().dot(t.to_frame().T)
Which yields:
In [3]: t.to_frame().dot(t.to_frame().T)
Out[3]:
   0  1
0  1  2
1  2  4
Solution found by y-p:
https://github.com/pydata/pandas/issues/3344#issuecomment-16533461
from numpy.random import randint
from pandas import DataFrame
from pandas.util.testing import makeCustomDataframe as mkdf

a = mkdf(3, 5, data_gen_f=lambda r, c: randint(1, 100))
b = mkdf(5, 3, data_gen_f=lambda r, c: randint(1, 100))
c = DataFrame(a.values.dot(b.values), index=a.index, columns=b.columns)
print(a)
print(b)
print(c)
assert (a.iloc[0, :].values * b.iloc[:, 0].values.T).sum() == c.iloc[0, 0]
C0       C_l0_g0  C_l0_g1  C_l0_g2  C_l0_g3  C_l0_g4
R0
R_l0_g0       39       87       88        2       65
R_l0_g1       59       14       76       10       65
R_l0_g2       93       69        4       29       58

C0       C_l0_g0  C_l0_g1  C_l0_g2
R0
R_l0_g0       76       88       11
R_l0_g1       66       73       47
R_l0_g2       78       69       15
R_l0_g3       47        3       40
R_l0_g4       54       31       31

C0       C_l0_g0  C_l0_g1  C_l0_g2
R0
R_l0_g0    19174    17876     7933
R_l0_g1    15316    13503     4862
R_l0_g2    16429    15382     7284
The assert isn't really needed; it just checks that the result is indeed a correct matrix multiplication.
The key here seems to be this line:
c = DataFrame(a.values.dot(b.values), index=a.index, columns=b.columns)
What this does is compute the dot product of a and b, but force the resulting DataFrame c to take a's index and b's columns. This effectively turns the dot product into a matrix multiplication in pandas style, since you keep the index and columns (you lose the columns of a and the index of b, but that is semantically correct: in a matrix multiplication you sum over those dimensions, so it would be meaningless to keep them).
This is a bit awkward, but it seems simple enough if it is consistent with the rest of the API (I still have to test what the result will be with Series x DataFrame and Series x Series; I will post my findings here).
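Applying the same .values trick to the Series x Series case from the question (my own sketch, not tested findings from the original poster): drop to the raw arrays so pandas can't align on the index, then rebuild the labels you want.
import numpy as np
import pandas as pd

t = pd.Series([1, 2])

# Column x row: the 2x2 outer product, keeping t's index on both axes
outer = pd.DataFrame(np.outer(t.values, t.values), index=t.index, columns=t.index)

# Row x column: the scalar inner product
inner = t.values.dot(t.values)

print(outer)  # [[1, 2], [2, 4]]
print(inner)  # 5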
I have a function foo that takes an NxM numpy array as an argument and returns a scalar value. I have an AxNxM numpy array data, over which I'd like to map foo to give me a resulting numpy array of length A.
Currently, I'm doing this:
result = numpy.array([foo(x) for x in data])
It works, but it seems like I'm not taking advantage of the numpy magic (and speed). Is there a better way?
I've looked at numpy.vectorize, and numpy.apply_along_axis, but neither works for a function of 2D arrays.
EDIT: I'm doing boosted regression on 24x24 image patches, so my AxNxM is something like 1000x24x24. What I called foo above applies a Haar-like feature to a patch (so, not terribly computationally intensive).
If NxM is big (say, 100), then the cost of iterating over A will be amortized into basically nothing.
Say the array is 1000 x 100 x 100.
Iterating is O(1000), but the cumulative cost of the inner function is O(1000 x 100 x 100), 10,000 times as much, so the loop overhead is negligible. (Note: my terminology is a bit wonky, but I do know what I'm talking about.)
I'm not sure, but you could try this:
import numpy

result = numpy.empty(data.shape[0])
for i in range(len(data)):
    result[i] = foo(data[i])
You would save a bit of memory allocation from building the list ... but the loop overhead would be greater.
Or you could write a parallel version of the loop, and split it across multiple processes. That could be a lot faster, depending on how intensive foo is (as it would have to offset the data handling).
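A minimal sketch of the multiprocessing idea, assuming foo is a picklable top-level function (the body here is just a placeholder, not the Haar feature from the question):
import multiprocessing as mp
import numpy as np

def foo(patch):
    # placeholder for the real per-patch computation
    return float(patch.sum())

if __name__ == "__main__":
    data = np.random.rand(1000, 24, 24)
    with mp.Pool() as pool:
        # map foo over the slices along the first axis in parallel
        result = np.array(pool.map(foo, list(data)))
    print(result.shape)  # (1000,)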
You can achieve that by reshaping your 3D array into a 2D array with the same leading dimension, and wrapping your function foo in a function that works on 1D arrays by reshaping them back into the shape foo expects. An example (using trace instead of foo):
from numpy import *

def apply2d_along_first(func2d, arr3d):
    a, n, m = arr3d.shape
    def func1d(arr1d):
        # reshape each flattened slice back to (n, m) before calling func2d
        return func2d(arr1d.reshape((n, m)))
    arr2d = arr3d.reshape((a, n * m))
    return apply_along_axis(func1d, -1, arr2d)

A, N, M = 3, 4, 5
data = arange(A * N * M).reshape((A, N, M))
print(data)
print(apply2d_along_first(trace, data))
Output:
[[[ 0 1 2 3 4]
[ 5 6 7 8 9]
[10 11 12 13 14]
[15 16 17 18 19]]
[[20 21 22 23 24]
[25 26 27 28 29]
[30 31 32 33 34]
[35 36 37 38 39]]
[[40 41 42 43 44]
[45 46 47 48 49]
[50 51 52 53 54]
[55 56 57 58 59]]]
[ 36 116 196]