Can I perform operations between rows on a dataframe? - python

Suppose I have a dataframe
ID1 ID2 x y time
0 0 1 34.337735 -76.3319716667 1446797582
1 0 1 34.3841816667 -76.2837666667 1446796183
2 0 2 34.49157 -76.1661133333 1446792969
3 0 3 34.5275266667 -76.1151866667 1446791765
4 0 3 34.5624816667 -76.0633883333 1446790559
What I would like is to capture the distance moved by each member, identified uniquely by the (ID1, ID2) pair.
Is there any way I can perform row operations on a dataframe? My initial idea was to convert the dataframe to a matrix using df.as_matrix(), pick out the unique IDs, and compute distances from the matrix.
This seems really inefficient. Is there a better way I could do this with dataframes?

If you want the distance from the origin at each time step, you can do the following:
import numpy as np
df[['x', 'y']].apply(lambda row: np.linalg.norm(row), axis=1)
On the other hand, if you want to calculate the distance moved by each member, you can do the following:
In [38]:
(df.groupby([df.ID1, df.ID2])[['x', 'y']]
   .apply(lambda x: np.linalg.norm(x.diff().dropna()) if len(x) > 1 else 0))
Out[38]:
ID1 ID2
0 1 0.066940
2 0.000000
3 0.062489
dtype: float64
First you group by your ID columns and then check each member's length: if the length is greater than 1 the member has moved, otherwise it didn't.
You can calculate the difference between consecutive x and y values using the diff function, which produces NaN for the first row, but you can drop that easily with dropna.
Then, to calculate the vector length, you can simply use np.linalg.norm.
You can also use x.diff().iloc[1] instead of x.diff().dropna().
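For reference, a loop-free variant of the same idea; a minimal sketch, reconstructing the sample dataframe from the question:
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID1': [0, 0, 0, 0, 0],
                   'ID2': [1, 1, 2, 3, 3],
                   'x': [34.337735, 34.3841816667, 34.49157, 34.5275266667, 34.5624816667],
                   'y': [-76.3319716667, -76.2837666667, -76.1661133333, -76.1151866667, -76.0633883333]})

# diff() within each (ID1, ID2) group leaves NaN on each group's first row,
# which the final sum() skips, so one-row groups come out as 0.0
steps = df.groupby(['ID1', 'ID2'])[['x', 'y']].diff()
dist = np.hypot(steps['x'], steps['y'])
total = dist.groupby([df['ID1'], df['ID2']]).sum()
For the sample data this reproduces the 0.066940 / 0.000000 / 0.062489 values above.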

If you need the total path length for each unique pair, you could do:
pd.DataFrame(df.groupby(['ID1','ID2']).apply(lambda z: pathlength(z.x.values, z.y.values)))
where pathlength is defined as:
from math import sqrt

def pathlength(x, y):
    n = len(x)
    lv = [sqrt((x[i] - x[i-1])**2 + (y[i] - y[i-1])**2) for i in range(1, n)]
    return sum(lv)
That gives us
0
ID1 ID2
0 1 0.066940
2 0.000000
3 0.062489
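For what it's worth, the same path length can be computed with numpy instead of a Python-level list comprehension; a small sketch, equivalent for the same inputs:
import numpy as np

def pathlength_np(x, y):
    # hypot of consecutive differences, summed over all steps
    return np.hypot(np.diff(x), np.diff(y)).sum()
It accepts the same z.x.values / z.y.values arrays used in the groupby call above.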

Related

Create a custom percentile rank for a pandas series

I need to calculate the percentile using a specific algorithm that is not available using either pandas.rank() or numpy.rank().
The ranking algorithm is calculated as follows for a series:
rank[i] = (# of values in series less than s[i] + 0.5 × # of other values equal to s[i]) / total # of values
so if I had the following series
s=pd.Series(data=[5,3,8,1,9,4,14,12,6,1,1,4,15])
For the first element, 5, there are 6 values less than 5 and no other values equal to 5. The rank would be (6 + 0×0.5)/13, or 6/13.
For the fourth element, 1, it would be (0 + 2×0.5)/13, or 1/13.
How could I calculate this without using a loop? I assume a combination of s.apply and/or s.where() but can't figure it out and have tried searching. I am looking to apply to the entire series at once, with the result being a series with the percentile ranks.
You could use numpy broadcasting. First convert s to a numpy column array. Then use broadcasting to count, for each element, the number of items less than it, and the number of items equal to it (subtracting 1, since each element is equal to itself). Finally add the two counts, divide by the total, and build a Series:
tmp = s.to_numpy()
s_col = tmp[:, None]  # column vector, so comparisons broadcast pairwise
less_than_i_count = (s_col > tmp).sum(axis=1)  # values strictly less than each element
eq_to_i_count = ((s_col == tmp).sum(axis=1) - 1) * 0.5  # ties, minus the element itself
ranks = pd.Series((less_than_i_count + eq_to_i_count) / len(s), index=s.index)
Output:
0 0.461538
1 0.230769
2 0.615385
3 0.076923
4 0.692308
5 0.346154
6 0.846154
7 0.769231
8 0.538462
9 0.076923
10 0.076923
11 0.346154
12 0.923077
dtype: float64
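As a side note, if I'm reading the formula right, pandas' built-in rank can reproduce it without broadcasting: for a value with k smaller values and t tied values (including itself), s.rank(method='average') returns k + (t + 1)/2, so (rank - 1)/n equals (k + 0.5×(t - 1))/n, which is exactly the requested score. A one-line sketch:
import pandas as pd

s = pd.Series(data=[5,3,8,1,9,4,14,12,6,1,1,4,15])
# (average tie rank - 1) / n  ==  (values below + 0.5 * other ties) / n
ranks = (s.rank(method='average') - 1) / len(s)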

Python: using rolling + apply with a function that requires 2 columns as arguments in pandas

I have a dataframe (df) with 2 columns:
Out[2]:
0 1
0 1 2
1 4 5
2 3 6
3 10 12
4 1 2
5 4 5
6 3 6
7 10 12
I would like to calculate, for each element of df[0], a function of itself and the df[1] column:
def custom_fct_2(x, y):
    res = stats.percentileofscore(y.values, x.iloc[-1])
    return res
I get the following error:
TypeError: ("'numpy.float64' object is not callable", u'occurred at index 0')
Here is the full code:
from __future__ import division
import pandas as pd
import sys
from scipy import stats
def custom_fct_2(x, y):
    res = stats.percentileofscore(y.values, x.iloc[-1])
    return res
df= pd.DataFrame([[1,2],[4,5],[3,6],[10,12],[1,2],[4,5],[3,6],[10,12]])
df['perc']=df.rolling(3).apply(custom_fct_2(df[0],df[1]))
Can someone help me with that? (I am new to Python.)
Out[2]:
0 1
...
5 4 5
6 3 6
7 10 12
I want the percentile ranking of [10] in [12,6,5]
I want the percentile ranking of [3] in [6,5,2]
I want the percentile ranking of [4] in [5,2,12]
...
The problem here is that the rolling().apply() function cannot give you a window of 3 rows across all the columns at once. Instead, it passes you a series for column 0 first, then for column 1.
Maybe there are better solutions, but I'll show mine, which at least works.
df = pd.DataFrame([[1,2],[4,5],[3,6],[10,12],[1,2],[4,5],[3,6],[10,12]])

def custom_fct_2(s):
    score = df[0][s.index.values[1]]  # you may use .values[-1] if you want the last element
    a = s.values
    return stats.percentileofscore(a, score)
I'm using the same data you provided, but I modified your custom_fct_2() function. Here we get s, which is a series of 3 rolling values from column 1. Fortunately, this series keeps its index, so we can get the score from column 0 via the "middle" index of the series. BTW, in Python [-1] means the last element of a collection, but from your explanation I believe you actually want the middle one.
Then, apply the function.
# remove the shift() call if you want the values aligned to the last value of the rolling scores
df['prec'] = df[1].rolling(3).apply(custom_fct_2).shift(periods=-1)
The shift function is optional. It depends on your requirements whether prec needs to be aligned with column 0 (whose middle value is scored) or with the rolling scores of column 1. I would assume you do need it.
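A hedged alternative sketch: since the real obstacle is that rolling().apply() only sees one column at a time, you can roll over a positional index instead and look up both columns inside the function. The names window_score and positions are illustrative, not from the answer above, and this follows the asker's stated expectation that the score comes from the window's last row:
import numpy as np
import pandas as pd
from scipy import stats

df = pd.DataFrame([[1,2],[4,5],[3,6],[10,12],[1,2],[4,5],[3,6],[10,12]])

def window_score(pos):
    # pos is a numpy array of positional indices for one rolling window
    idx = pos.astype(int)
    window = df[1].iloc[idx]     # the 3 rolling values of column 1
    score = df[0].iloc[idx[-1]]  # column 0 at the window's last row
    return stats.percentileofscore(window, score)

positions = pd.Series(np.arange(len(df), dtype=float))
df['perc'] = positions.rolling(3).apply(window_score, raw=True)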

Centre of mass row-wise in a dataframe and multiply each column by different mass

I'm trying to calculate the centre of mass of 20 objects, where each object has its own mass.
These objects are represented in a dataframe cm_x, and their associated masses in a list. Below I show an example of just 3 of those 20 objects, for the sake of saving space. Each object has an x, y, z coordinate, but I'll just show the x and then I can apply the same technique to the rest. Below is the head of the dataframe.
bar_head_x bar_hip_centre_x bar_left_ankle_x
0 -203.3502 -195.4573 -293.262
1 -203.4280 -195.4720 -293.251
2 -203.4954 -195.4675 -293.248
3 -203.5022 -195.9193 -293.219
4 -203.5014 -195.9092 -293.328
m_head = 0.081
m_hipc = 0.139
m_lank = 0.0465
m = [m_head,m_hipc,m_lank]
I saw this method suggested in another similar question; however, it doesn't incorporate the masses, and that is where I'm having an issue:
def series_sum(pd_series):
    return np.sum(np.dot(pd_series.values, np.asarray(range(1, len(pd_series)+1))) / np.sum(pd_series))

cm_x.apply(series_sum, axis=1)
Basically I want, for each row, an associated centre of mass, using the centre-of-mass formula sum(x_i * m_i) / sum(m_i).
The desired result would be a new column in the dataframe like so:
cm_x
0 -214.92
1 ...
2 ...
3 ...
4 ...
Any help?
If I understand correctly, you can compute the desired column like this:
>>> df.mul(m).sum(axis=1)/sum(m)
0 -214.921628
1 -214.951023
2 -214.968638
3 -215.201292
4 -215.214800
Use DataFrame.dot and divide by sum of list m:
s = df.dot(m).div(sum(m))
print (s)
0 -214.921628
1 -214.951023
2 -214.968638
3 -215.201292
4 -215.214800
...
7441 -245.078910
7442 -244.943961
7443 -244.806606
7444 -244.665285
7445 -244.533503
dtype: float64
If you need a DataFrame, add Series.to_frame:
df1 = df.dot(m).div(sum(m)).to_frame('cm_x')
print (df1)
cm_x
0 -214.921628
1 -214.951023
2 -214.968638
3 -215.201292
4 -215.214800
...
7441 -245.078910
7442 -244.943961
7443 -244.806606
7444 -244.665285
7445 -244.533503
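For completeness, numpy's np.average takes weights directly and computes the same sum(x_i * m_i) / sum(m_i); a minimal sketch, reconstructing the sample data from the question:
import numpy as np
import pandas as pd

df = pd.DataFrame({'bar_head_x':       [-203.3502, -203.4280, -203.4954, -203.5022, -203.5014],
                   'bar_hip_centre_x': [-195.4573, -195.4720, -195.4675, -195.9193, -195.9092],
                   'bar_left_ankle_x': [-293.262, -293.251, -293.248, -293.219, -293.328]})
m = [0.081, 0.139, 0.0465]

# weighted mean across the columns of each row
df['cm_x'] = np.average(df.to_numpy(), axis=1, weights=m)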

Select specific columns

I have a scientific dataframe
radius date spin atom
0 12,50 YYYY/MM 0 he
1 11,23 YYYY/MM 2 c
2 45,2 YYYY/MM 1 z
3 11,1 YYYY/MM 1 p
For each row, I want to select all the rows where the difference between the radii is under, for example, 5.
I've defined a function for the calculation (simple, it's just an example):
def diff_radius(a, b):
    return a - b
Is it possible, for each row, to find the rows that satisfy the condition by calling an external function?
I've tried something like this, without success:
for i in range(df.shape[0]):
    ....
    df_in_radius = df.apply(lambda x: diff_radius(df[i]['radius'], x['radius']))
Can you help me?
I am assuming that the datatype of the radius column is a tuple. You can keep the diff_radius method like this:
def diff_radius(x):
    a, b = x
    return a - b
Then, you can use the loc method in pandas to select the rows that match the condition of a radius difference less than 5.
df.loc[df.radius.apply(diff_radius) < 5]
Edit #1
If the datatype of the radius column is a string, split it and typecast; the logic goes in the diff_radius method. In the string case:
def diff_radius(x):
    x_split = x.split(',')
    a, b = int(x_split[0]), int(x_split[-1])
    return a - b
I misspoke.
My dataframe is :
radius of my atom date spin atom
0 12.50 YYYY/MM 0 he
1 11.23 YYYY/MM 2 c
2 45.2 YYYY/MM 1 z
3 11.1 YYYY/MM 1 p
I loop over the rows, and to each row I apply a special calculation involving every row that satisfies the condition.
Example:
def diff_radius(current_radius, radius):
    return current_radius - radius
df = pd.read_csv(csvfile, delimiter=";", names=('radius', 'date', 'spin', 'atom'))

# for each row of the original dataframe
for i in range(df.shape[0]):
    # first build a temporary dataframe holding the rows whose radius
    # differs by less than 5 from df.iloc[i]['radius'] (the current row)
    df_tmp = df[diff_radius(df.iloc[i]['radius'], df['radius']) < 5]
    ....
    # start of the special calc, with df_tmp containing all rows
    # within 5 of the current row (i)
I thank you sincerely for your answers
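As a hedged aside, the loop above can be vectorized with numpy broadcasting, which builds the whole pairwise difference matrix at once (I'm assuming the condition means absolute difference under 5; drop the abs if a signed difference is really intended):
import numpy as np
import pandas as pd

df = pd.DataFrame({'radius': [12.50, 11.23, 45.2, 11.1]})
r = df['radius'].to_numpy()

# pairwise[i, j] is True when |r[i] - r[j]| < 5
pairwise = np.abs(r[:, None] - r[None, :]) < 5

for i in range(len(df)):
    df_tmp = df[pairwise[i]]
    # ... special calculation on df_tmp, the rows within 5 of row i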

Fastest way to compare rows of two pandas dataframes?

So I have two pandas dataframes, A and B.
A is 1000 rows x 500 columns, filled with binary values indicating either presence or absence.
B is 1024 rows x 10 columns, and is the full enumeration of 0's and 1's over 10 columns, hence the 1024 rows.
I am trying to find which rows of A, restricted to a particular set of 10 columns, correspond to a given row in B. I need the whole row to match up, rather than matching element by element.
For example, I would want
A[(A.ix[:,(1,2,3,4,5,6,7,8,9,10)] == (1,0,1,0,1,0,0,1,0,0)).all(axis=1)]
to return the rows of A (say 3, 5, 8, 11, 15) that match that (1,0,1,0,1,0,0,1,0,0) row of B at those particular columns (1,2,3,4,5,6,7,8,9,10).
And I want to do this over every row in B.
The best way I could figure out to do this was:
import numpy as np
for i in B:
    B_array = np.array(i)
    Matching_Rows = A[(A.ix[:,(1,2,3,4,5,6,7,8,9,10)] == B_array).all(axis=1)]
    Matching_Rows_Index = Matching_Rows.index
This isn't terrible for one instance, but I use it in a while loop that runs around 20,000 times; therefore, it slows it down quite a bit.
I have been messing around with DataFrame.apply to no avail. Could map work better?
I was just hoping someone saw something obviously more efficient as I am fairly new to python.
Thanks and best regards!
We can exploit the fact that both dataframes hold binary values 0 or 1 by collapsing the relevant columns from A and all columns from B into 1D arrays, treating each row as a sequence of binary digits and converting it to its decimal equivalent. This reduces the problem size considerably, which helps performance. After getting those 1D arrays, we can use np.in1d to look for matches from B in A, and finally np.where on the result to get the matching row indices.
Thus, we would have an implementation like so -
# Setup 1D arrays corresponding to selected cols from A and entire B
S = 2**np.arange(10)
A_ID = np.dot(A[range(1,11)],S)
B_ID = np.dot(B,S)
# Look for matches that exist from B_ID in A_ID, whose indices
# would be desired row indices that have matched from B
out_row_idx = np.where(np.in1d(A_ID,B_ID))[0]
Sample run -
In [157]: # Setup dataframes A and B with rows 0, 4 in A having matches from B
...: A_arr = np.random.randint(0,2,(10,14))
...: B_arr = np.random.randint(0,2,(7,10))
...:
...: B_arr[2] = A_arr[4,1:11]
...: B_arr[4] = A_arr[4,1:11]
...: B_arr[5] = A_arr[0,1:11]
...:
...: A = pd.DataFrame(A_arr)
...: B = pd.DataFrame(B_arr)
...:
In [158]: S = 2**np.arange(10)
...: A_ID = np.dot(A[range(1,11)],S)
...: B_ID = np.dot(B,S)
...: out_row_idx = np.where(np.in1d(A_ID,B_ID))[0]
...:
In [159]: out_row_idx
Out[159]: array([0, 4])
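A small aside: in recent numpy versions, np.isin is the recommended successor to np.in1d and drops straight into the same line:
out_row_idx = np.where(np.isin(A_ID, B_ID))[0]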
You can use merge with reset_index; the output is the indexes of B whose rows equal rows of A in the chosen columns:
A = pd.DataFrame({'A':[1,0,1,1],
                  'B':[0,0,1,1],
                  'C':[1,0,1,1],
                  'D':[1,1,1,0],
                  'E':[1,1,0,1]})
print (A)
A B C D E
0 1 0 1 1 1
1 0 0 0 1 1
2 1 1 1 1 0
3 1 1 1 0 1
B = pd.DataFrame({'0':[1,0,1],
                  '1':[1,0,1],
                  '2':[1,0,0]})
print (B)
0 1 2
0 1 1 1
1 0 0 0
2 1 1 0
print (pd.merge(B.reset_index(),
                A.reset_index(),
                left_on=B.columns.tolist(),
                right_on=A.columns[[0,1,2]].tolist(),
                suffixes=('_B','_A')))
index_B 0 1 2 index_A A B C D E
0 0 1 1 1 2 1 1 1 1 0
1 0 1 1 1 3 1 1 1 0 1
2 1 0 0 0 1 0 0 0 1 1
print (pd.merge(B.reset_index(),
                A.reset_index(),
                left_on=B.columns.tolist(),
                right_on=A.columns[[0,1,2]].tolist(),
                suffixes=('_B','_A'))[['index_B','index_A']])
index_B index_A
0 0 2
1 0 3
2 1 1
You can do it in pandas by using loc or ix and telling it to find the rows where the ten columns are all equal, like this:
A.loc[(A[1]==B[1]) & (A[2]==B[2]) & (A[3]==B[3]) & (A[4]==B[4]) & (A[5]==B[5]) & (A[6]==B[6]) & (A[7]==B[7]) & (A[8]==B[8]) & (A[9]==B[9]) & (A[10]==B[10])]
This is quite ugly in my opinion but it will work and gets rid of the loop so it should be significantly faster. I wouldn't be surprised if someone could come up with a more elegant way of coding the same operation.
In this special case, your rows of 10 zeros and ones can be interpreted as 10-digit binary numbers. If B is in order, then it can be interpreted as the range from 0 to 1023. In that case, all we need to do is take A's rows in 10-column chunks and calculate each chunk's binary equivalent.
I'll start by defining a range of powers of two so I can do matrix multiplication with it.
twos = pd.Series(np.power(2, np.arange(10)))
Next, I'll relabel A's columns into a MultiIndex and stack to get my chunks of 10.
A = pd.DataFrame(np.random.binomial(1, .5, (1000, 500)))
A.columns = pd.MultiIndex.from_tuples(list(zip((A.columns // 10).tolist(), (A.columns % 10).tolist())))
A_ = A.stack(0)
A_.head()
Finally, I'll multiply A_ with twos to get integer representation of each row and unstack.
A_.dot(twos).unstack()
This is now a 1000 x 50 DataFrame where each cell represents which of B's rows we matched for that particular 10 column chunk for that particular row of A. There isn't even a need for B.
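To close the loop, a short usage sketch (my addition, under the answer's assumption that B is the full ordered enumeration, so B's row number equals each chunk's binary value):
codes = A_.dot(twos).unstack()
# rows of A whose first 10-column chunk (chunk 0) matches row 5 of B
matches = codes.index[codes[0] == 5]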
