Divide columns in a DataFrame by a Series (result is only NaNs?) - python

I'm trying to do a similar thing to what is posted in this question: Python Pandas - n X m DataFrame multiplied by 1 X m Dataframe
I have an n x m DataFrame with all non-zero float values, and an n x 1 column with all non-zero float values, and I'm trying to divide each column in the n x m dataframe elementwise by the values in that column.
So I've got:
a b c
1 2 3
4 5 6
7 8 9
and
x
11
12
13
and I'm looking to return:
a b c
1/11 2/11 3/11
4/12 5/12 6/12
7/13 8/13 9/13
I've tried a multiplication operation first, to see if I can make it work, so I tried applying the two solutions given in the answer to the question above.
df_prod = pd.DataFrame({c: df[c] * df_1[c].ix[0] for c in df.columns})
This raises KeyError: 0.
And using the other solution:
df.mul(df_1.iloc[0])
This just gives me all NaN, although in the right shape.

The NaNs are caused by misalignment of your indexes: df has columns a, b and c, while the values you divide by carry different labels, so nothing lines up. To get around this, you can either divide by NumPy arrays, which discards the labels (df2 here is the frame holding the x column, called df_1 in the question):
# pandas <= 0.23
df.values / df2[['x']].values # or df2.values, assuming there's only 1 column
# pandas 0.24+
df.to_numpy() / df2[['x']].to_numpy()
array([[0.09090909, 0.18181818, 0.27272727],
[0.33333333, 0.41666667, 0.5 ],
[0.53846154, 0.61538462, 0.69230769]])
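Note that this returns a bare array with no index or column labels. If you want a DataFrame back, one way (a sketch) is to rebuild it from df's own labels:
pd.DataFrame(df.to_numpy() / df2[['x']].to_numpy(), index=df.index, columns=df.columns)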
Or perform an axis-aligned division using .div, which matches the series against df's row index:
df.div(df2['x'], axis=0)
a b c
0 0.090909 0.181818 0.272727
1 0.333333 0.416667 0.500000
2 0.538462 0.615385 0.692308
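For reference, a minimal end-to-end sketch of the .div approach, assuming the second frame is named df2 with a single column x as in the answer above:
import pandas as pd

df = pd.DataFrame({'a': [1, 4, 7], 'b': [2, 5, 8], 'c': [3, 6, 9]})
df2 = pd.DataFrame({'x': [11, 12, 13]})

# divide along axis=0 so row i of df is divided by df2['x'] at row i
result = df.div(df2['x'], axis=0)
print(result)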

Related

Create a custom percentile rank for a pandas series

I need to calculate the percentile using a specific algorithm that is not available using either pandas.rank() or numpy.rank().
The ranking algorithm is calculated as follows for a series:
rank[i] = (# of values less than i + 0.5 x # of other values equal to i) / (total # of values)
so if I had the following series
s=pd.Series(data=[5,3,8,1,9,4,14,12,6,1,1,4,15])
For the first element, 5, there are 6 values less than 5 and no other values equal to 5. The rank would be (6 + 0 x 0.5)/13, or 6/13.
For the fourth element, 1, it would be (0 + 2 x 0.5)/13, or 1/13.
How could I calculate this without using a loop? I assume it needs some combination of s.apply and/or s.where(), but I can't figure it out and have had no luck searching. I want to apply it to the entire series at once, with the result being a series of the percentile ranks.
You could use numpy broadcasting. First convert s to a numpy column array. Then use broadcasting to count, for each element, the number of items less than it and the number of items equal to it (subtracting 1, since each item is equal to itself). Finally, add the two counts, divide by the length, and build a Series:
tmp = s.to_numpy()
s_col = tmp[:, None]                                     # column vector, shape (n, 1), for broadcasting
less_than_i_count = (s_col > tmp).sum(axis=1)            # values strictly less than each element
eq_to_i_count = ((s_col == tmp).sum(axis=1) - 1) * 0.5   # other values equal to it, weighted by 0.5
ranks = pd.Series((less_than_i_count + eq_to_i_count) / len(s), index=s.index)
Output:
0 0.461538
1 0.230769
2 0.615385
3 0.076923
4 0.692308
5 0.346154
6 0.846154
7 0.769231
8 0.538462
9 0.076923
10 0.076923
11 0.346154
12 0.923077
dtype: float64
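Incidentally, this formula is algebraically equivalent to pandas' built-in average-method rank shifted down by one, since the average rank of a value is (# less) + (# equal + 1)/2. So a one-liner sketch, worth checking against the broadcasting version above, is:
ranks = (s.rank(method='average') - 1) / len(s)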

Pandas reorder rows of dataframe

I stumbled upon a very peculiar problem in Pandas. I have this dataframe:
,time,id,X,Y,theta,Vx,Vy,ANGLE_FR,DANGER_RAD,RISK_RAD,TTC_DAN_LOW,TTC_DAN_UP,TTC_STOP,SIM
0,1600349033921610000,0,23.2643889,-7.140948599999999,0,0.020961,-1.1414197,20,0.5,0.9,-1,7,2.0,3
1,1600349033921620000,1,18.5371406,-14.224917,0,-0.0113912,1.443597,20,0.5,0.9,-1,7,2.0,3
2,1600349033921650000,2,19.808648100000006,-6.778450599999998,0,0.037289,-1.0557937,20,0.5,0.9,-1,7,2.0,3
3,1600349033921670000,3,22.1796988,-5.7078115999999985,0,0.2585675,-1.2431861000000002,20,0.5,0.9,-1,7,2.0,3
4,1600349033921670000,4,20.757325,-16.115366,0,-0.2528627,0.7889673,20,0.5,0.9,-1,7,2.0,3
5,1600349033921690000,5,20.9491012,-17.7806833,0,0.5062633,0.9386511,20,0.5,0.9,-1,7,2.0,3
6,1600349033921690000,6,20.6225258,-5.5344404,0,-0.1192678,-0.7889041,20,0.5,0.9,-1,7,2.0,3
7,1600349033921700000,7,21.8077004,-14.736984,0,-0.0295737,1.3084618,20,0.5,0.9,-1,7,2.0,3
8,1600349033954560000,0,23.206789800000006,-7.5171016,0,-0.1727971,-1.1284589,20,0.5,0.9,-1,7,2.0,3
9,1600349033954570000,1,18.555421300000006,-13.7440508,0,0.0548418,1.4426004,20,0.5,0.9,-1,7,2.0,3
10,1600349033954570000,2,19.8409748,-7.126075500000002,0,0.0969802,-1.0428747,20,0.5,0.9,-1,7,2.0,3
11,1600349033954580000,3,22.3263185,-5.9586202,0,0.4398591,-0.752425,20,0.5,0.9,-1,7,2.0,3
12,1600349033954590000,4,20.7154136,-15.842398800000002,0,-0.12573430000000002,0.8189016,20,0.5,0.9,-1,7,2.0,3
13,1600349033954590000,5,21.038901,-17.4111883,0,0.2693992,1.108485,20,0.5,0.9,-1,7,2.0,3
14,1600349033954600000,6,20.612499,-5.810969,0,-0.030080400000000007,-0.8295869,20,0.5,0.9,-1,7,2.0,3
15,1600349033954600000,7,21.7872537,-14.3011986,0,-0.0613401,1.3073578,20,0.5,0.9,-1,7,2.0,3
16,1600349033921610000,0,23.2643889,-7.140948599999999,0,0.020961,-1.1414197,20,0.5,0.9,-1,7,1.5,2
17,1600349033954560000,0,23.206789800000003,-7.5171016,0,-0.1727971,-1.1284589,20,0.5,0.9,-1,7,1.5,2
18,1600349033988110000,0,23.21602,-7.897527,0,0.027693000000000002,-1.1412761999999999,20,0.5,0.9,-1,7,1.5,2
This is the input file.
Please note that id always runs from 0 up to 7 and then repeats, and the time column increases sequentially (each row's time is greater than or equal to the previous one's).
I would like to reorder the rows of the dataframe as shown below.
,time,id,X,Y,theta,Vx,Vy,ANGLE_FR,DANGER_RAD,RISK_RAD,TTC_DAN_LOW,TTC_DAN_UP,TTC_STOP,SIM
0,1600349033921610000,0,23.2643889,-7.140948599999999,0,0.020961,-1.1414197,20,0.5,0.9,-1,7,1.0,2
1,1600349033954560000,0,23.206789800000003,-7.5171016,0,-0.1727971,-1.1284589,20,0.5,0.9,-1,7,1.0,2
2,1600349033988110000,0,23.21602,-7.897527,0,0.027693000000000002,-1.1412761999999999,20,0.5,0.9,-1,7,1.0,2
3,1600349033921610000,0,23.2643889,-7.140948599999999,0,0.020961,-1.1414197,20,0.5,0.9,-1,7,1.5,1
4,1600349033954560000,0,23.206789800000003,-7.5171016,0,-0.1727971,-1.1284589,20,0.5,0.9,-1,7,1.5,1
5,1600349033988110000,0,23.21602,-7.897527,0,0.027693000000000002,-1.1412761999999999,20,0.5,0.9,-1,7,1.5,1
6,1600349033921610000,0,23.2643889,-7.140948599999999,0,0.020961,-1.1414197,20,0.5,0.9,-1,7,1.5,2
7,1600349033954560000,0,23.206789800000003,-7.5171016,0,-0.1727971,-1.1284589,20,0.5,0.9,-1,7,1.5,2
8,1600349033988110000,0,23.21602,-7.897527,0,0.027693000000000002,-1.1412761999999999,20,0.5,0.9,-1,7,1.5,2
9,1600349033921610000,0,23.2643889,-7.140948599999999,0,0.020961,-1.1414197,20,0.5,0.9,-1,7,1.5,3
10,1600349033954560000,0,23.206789800000003,-7.5171016,0,-0.1727971,-1.1284589,20,0.5,0.9,-1,7,1.5,3
11,1600349033988110000,0,23.21602,-7.897527,0,0.027693000000000002,-1.1412761999999999,20,0.5,0.9,-1,7,1.5,3
This is the desired result.
Please note that I need to reorder the dataframe rows based on these columns: id, time, ANGLE_FR, DANGER_RAD, RISK_RAD, TTC_DAN_LOW, TTC_DAN_UP, TTC_STOP, SIM.
As you can see from the desired result, the dataframe must be reordered so that the time column runs from smallest to largest, and the same holds for the rest of the columns: id, SIM, ANGLE_FR, DANGER_RAD, RISK_RAD, TTC_DAN_LOW, TTC_DAN_UP, TTC_STOP.
I tried sorting by several columns without success, and I also tried groupby, but I failed.
Could you help me solve this problem? Any suggestions are welcome.
P.S.
I have pasted the dataframe so it can be read easily with the read_clipboard function, to keep the example reproducible.
I am attaching pic as well.
What did you try when sorting by several columns? For example:
In [10]: df.sort_values(['id', 'time', 'ANGLE_FR', 'DANGER_RAD', 'RISK_RAD', 'TTC_DAN_LOW', 'TTC_DAN_UP', 'TTC_STOP', 'SIM'])
Out[10]:
Unnamed: 0 time id X Y theta Vx Vy ANGLE_FR DANGER_RAD RISK_RAD TTC_DAN_LOW TTC_DAN_UP TTC_STOP SIM
0 0 1600349033921610000 0 23.2644 -7.1409 0 0.0210 -1.1414 20 0.5 0.9 -1 7 2 3
8 8 1600349033954560000 0 23.2068 -7.5171 0 -0.1728 -1.1285 20 0.5 0.9 -1 7 2 3
1 1 1600349033921620000 1 18.5371 -14.2249 0 -0.0114 1.4436 20 0.5 0.9 -1 7 2 3
9 9 1600349033954570000 1 18.5554 -13.7441 0 0.0548 1.4426 20 0.5 0.9 -1 7 2 3
2 2 1600349033921650000 2 19.8086 -6.7785 0 0.0373 -1.0558 20 0.5 0.9 -1 7 2 3
That puts time before the parameter columns, which interleaves the simulations. How about sorting with time as the last key instead:
sort_cols = ['id', 'ANGLE_FR', 'DANGER_RAD', 'RISK_RAD', 'TTC_DAN_LOW', 'TTC_DAN_UP', 'TTC_STOP', 'SIM', 'time']
df = df.sort_values(sort_cols).reset_index(drop=True)
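If you specifically want the groupby route the original snippet was reaching for, here is a sketch that concatenates the groups in key order (groupby sorts group keys by default, and rows within each group are put in time order):
import pandas as pd

group_cols = ['id', 'ANGLE_FR', 'DANGER_RAD', 'RISK_RAD', 'TTC_DAN_LOW', 'TTC_DAN_UP', 'TTC_STOP', 'SIM']
df = pd.concat(
    (g.sort_values('time') for _, g in df.groupby(group_cols)),
    ignore_index=True,
)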

pandas apply with different arg for each column/row

Assume I have an M (rows) by N (columns) DataFrame
df = pandas.DataFrame([...])
and a vector of length N
windows = [1,2,..., N]
I would like to apply a moving average function to each column in df, but would like the moving average to have a different length for each column (e.g. column 1 has MA length 1, column 2 has MA length 2, etc.); these lengths are contained in windows.
Are there built-in functions to do this quickly? I'm aware of df.apply(lambda a: f(a), axis=0, args=...), but it's unclear how to apply different args for each column.
Here's one way to do it:
In [15]: dfrm
Out[15]:
A B C
0 0.948898 0.587032 0.131551
1 0.385582 0.275673 0.107135
2 0.849599 0.696882 0.313717
3 0.993080 0.510060 0.287691
4 0.994823 0.441560 0.632076
5 0.711145 0.760301 0.813272
6 0.932131 0.531901 0.393798
7 0.965915 0.812821 0.287819
8 0.782890 0.478565 0.960353
9 0.908078 0.850664 0.912878
In [16]: windows
Out[16]: [1, 2, 3]
In [17]: pandas.DataFrame(
{c: dfrm[c].rolling(windows[i]).mean() for i, c in enumerate(dfrm.columns)}
)
Out[17]:
A B C
0 0.948898 NaN NaN
1 0.385582 0.431352 NaN
2 0.849599 0.486277 0.184134
3 0.993080 0.603471 0.236181
4 0.994823 0.475810 0.411161
5 0.711145 0.600931 0.577680
6 0.932131 0.646101 0.613049
7 0.965915 0.672361 0.498296
8 0.782890 0.645693 0.547323
9 0.908078 0.664614 0.720350
As @Manish Saraswat mentioned in the comments, the same thing could once be written with the old pd.rolling_mean(dfrm[c], windows[i]) API, which has since been removed in favor of .rolling(...).mean(). Further, you can use sequences as the items in windows if you want, expressing a custom window shape (size and weights), or any of the other rolling aggregations and keywords.
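As a sketch of the custom-window idea on modern pandas (win_type requires scipy, and the weights below are made up for illustration):
import numpy as np

# triangular window weights via win_type
dfrm['A'].rolling(3, win_type='triang').mean()

# or explicit weights applied with a dot product
weights = np.array([0.2, 0.3, 0.5])
dfrm['A'].rolling(3).apply(lambda x: np.dot(x, weights), raw=True)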

Calculate a rolling window weighted average on a Pandas column

I'm relatively new to python, and have been trying to calculate some simple rolling weighted averages across rows in a pandas data frame. I have a dataframe of observations df and a dataframe of weights w. I create a new dataframe to hold the inner-product between these two sets of values, dot.
As w has fewer rows than df, I use a for loop to calculate the weighted average row by row, over the preceding rows, using a window equal to the length of w.
More clearly, my set-up is as follows:
import pandas as pd
df = pd.DataFrame([0,1,2,3,4,5,6,7,8], index = range(0,9))
w = pd.DataFrame([0.1,0.25,0.5], index = range(0,3))
dot = pd.DataFrame(0, columns = ['dot'], index = df.index)
for i in range(0, len(df)):
    dot.loc[i] = sum(df.iloc[max(1, (i-3)):i].values * w.iloc[-min(3, (i-1)):4].values)
I would expect the result to be as follows (i.e. when i = 4)
dot.loc[4] = sum(df.iloc[max(1,(4-3)):4].values * w.iloc[-min(3,(4-1)):4].values)
print dot.loc[4] #2.1
However, when running the for loop above, I receive the error:
ValueError: operands could not be broadcast together with shapes (0,1) (2,1)
This is where I get confused: I think it must have to do with how I pass i into iloc, since I don't get shape errors when I calculate a single entry manually, as in the example with 4 above. However, looking at other examples and documentation, I don't see why that's the case. Any help is appreciated.
Your first problem is that you are trying to multiply arrays of two different sizes. For example, when i=0 the different parts of your for loop return
df.iloc[max(1,(0-3)):0].values.shape
# (0,1)
w.iloc[-min(3,(0-1)):4].values.shape
# (2,1)
Which is exactly the error you are getting. The easiest way I can think of to make the arrays multipliable is to pad your dataframe with leading zeros, using concatenation.
df2 = pd.concat([pd.Series([0,0]),df], ignore_index=True)
df2
0
0 0
1 0
2 0
3 1
4 2
5 3
6 4
7 5
8 6
9 7
10 8
You can now use your for loop, with some minor tweaking (max(0, i) is just i here, so it reduces to a plain slice over the padded frame):
for i in range(len(df)):
    dot.loc[i] = sum(df2.iloc[i:i+3].values * w.values)
A nicer way might be the one JohnE suggested: use the rolling and apply functions built into pandas, thereby getting rid of the for loop (w[0] selects the weights column so the lambda returns a scalar, which newer pandas requires):
import numpy as np
df2.rolling(3, min_periods=3).apply(lambda x: np.dot(x, w[0]), raw=True)
0
0 NaN
1 NaN
2 0.00
3 0.50
4 1.25
5 2.10
6 2.95
7 3.80
8 4.65
9 5.50
10 6.35
You can also drop the first two padding rows and reset the index:
df2.rolling(3, min_periods=3).apply(lambda x: np.dot(x, w[0]), raw=True).drop([0, 1]).reset_index(drop=True)
0
0 0.00
1 0.50
2 1.25
3 2.10
4 2.95
5 3.80
6 4.65
7 5.50
8 6.35
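If you don't actually need the zero-padding, the same rolling call works directly on the original frame; the first two rows simply come out as NaN instead of being computed against implicit leading zeros (a sketch under the same setup as above):
df[0].rolling(3).apply(lambda x: np.dot(x, w[0]), raw=True)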

drop duplicates in Python Pandas DataFrame not removing duplicates

I have a problem with removing duplicates. My program is based around a loop which generates tuples (x, y) that are then used as nodes in a graph. The final array/matrix of nodes is:
[[ 1. 1. ]
[ 1.12273268 1.15322175]
[..........etc..........]
[ 0.94120695 0.77802849]
**[ 0.84301344 0.91660517]**
[ 0.93096269 1.21383287]
**[ 0.84301344 0.91660517]**
[ 0.75506418 1.0798641 ]]
The length of the array is 22. Now, I need to remove the duplicate entries (see **). So I used:
def urows(array):
    df = pandas.DataFrame(array)
    df.drop_duplicates(take_last=True)  # no effect: the result is not assigned
    return df.drop_duplicates(take_last=True).values  # take_last=True is keep='last' in modern pandas
Fantastic, but I still get:
0 1
0 1.000000 1.000000
....... etc...........
17 1.039400 1.030320
18 0.941207 0.778028
**19 0.843013 0.916605**
20 0.930963 1.213833
**21 0.843013 0.916605**
So drop_duplicates is not removing anything. I tested to see if the nodes were actually the same, and I get:
print urows(total_nodes)[19,:]
---> [ 0.84301344 0.91660517]
print urows(total_nodes)[21,:]
---> [ 0.84301344 0.91660517]
print urows(total_nodes)[12,:] - urows(total_nodes)[13,:]
---> [ 0. 0.]
Why is it not working? How can I remove those duplicate values?
One more question: if two values are "nearly" equal (say x1 and x2), is there any way to replace them so that they become exactly equal? What I want is to replace x2 with x1 if they are "nearly" equal.
If I copy-paste your data, I get:
>>> df
0 1
0 1.000000 1.000000
1 1.122733 1.153222
2 0.941207 0.778028
3 0.843013 0.916605
4 0.930963 1.213833
5 0.843013 0.916605
6 0.755064 1.079864
>>> df.drop_duplicates()
0 1
0 1.000000 1.000000
1 1.122733 1.153222
2 0.941207 0.778028
3 0.843013 0.916605
4 0.930963 1.213833
6 0.755064 1.079864
so the duplicates are actually removed; your real problem is that the values aren't exactly equal, they only look equal because the display rounds them.
One workaround would be to round the data to however many decimal places are applicable, with something like df.apply(np.round, args=[4]), then drop the duplicates. If you want to keep the original data but remove rows that are duplicates up to rounding, you can use something like
df = df.loc[~df.apply(np.round, args=[4]).duplicated()]
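(On modern pandas, df.apply(np.round, args=[4]) can be written simply as df.round(4), so an equivalent spelling is df = df.loc[~df.round(4).duplicated()].)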
Here's one really clumsy way to do what you're asking for, setting nearly-equal values to be actually equal:
grouped = df.groupby([df[i].round(4) for i in df.columns])
subbed = grouped.apply(lambda g: g.apply(lambda row: g.iloc[0], axis=1))
subbed.reset_index(level=list(df.columns), drop=True, inplace=True)
This reorders the dataframe, but you can then call .sort() to get them back in the original order if you need that.
Explanation: the first line uses groupby to group the data frame by the rounded values. Unfortunately, if you give a function to groupby it applies it to the labels rather than the rows (so you could maybe do df.groupby(lambda k: np.round(df.loc[k], 4)), but that sucks too).
The second line uses the apply method on the groupby object to replace each dataframe of near-duplicate rows, g, with a new dataframe g.apply(lambda row: g.iloc[0], axis=1). That uses the apply method on dataframes to replace each row with the first row of the group.
The result then looks like:
                        0         1
0      1
0.7551 1.0799 6  0.755064  1.079864
0.8430 0.9166 3  0.843013  0.916605
              5  0.843013  0.916605
0.9310 1.2138 4  0.930963  1.213833
0.9412 0.7780 2  0.941207  0.778028
1.0000 1.0000 0  1.000000  1.000000
1.1227 1.1532 1  1.122733  1.153222
where groupby has inserted the rounded values as an index. The reset_index line then drops those columns.
Hopefully someone who knows pandas better than I do will drop by and show how to do this better.
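For what it's worth, on recent pandas the snap-then-dedupe idea above can be written much more compactly with transform (a sketch, not part of the original answer):
# replace every row with the first row of its rounded-value group,
# keeping the original shape and row order, then drop the now-exact duplicates
snapped = df.groupby([df[c].round(4) for c in df.columns]).transform('first')
deduped = snapped.drop_duplicates()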
Similar to @Dougal's answer, but in a slightly different way:
In [20]: df.loc[~(df*1e6).astype('int64').duplicated(subset=[0])]
Out[20]:
0 1
0 1.000000 1.000000
1 1.122733 1.153222
2 0.941207 0.778028
3 0.843013 0.916605
4 0.930963 1.213833
6 0.755064 1.079864
