I need to create a dataframe where values reference previous rows - python

I am just starting to use Python and I'm trying to learn some of the general things about it. As I was playing around with it, I wanted to see if I could make a dataframe that shows a starting number compounded by a return. Sorry if this description doesn't make much sense, but I basically want a dataframe x rows long that shows me:
number*(1+return)^(row number) in each row
So, for example, say the number is 10 and the return is 10%; I would like the dataframe to give me the series
1 11
2 12.1
3 13.3
4 14.6
5 ...
6 ...
Thanks so much in advance!

Let us try
import numpy as np
import pandas as pd
val = 10    # starting number
det = 0.1   # return per row
n = 4       # number of rows
out = val*((1+det)**np.arange(n))
s = pd.Series(out)
s
Out[426]:
0 10.00
1 11.00
2 12.10
3 13.31
dtype: float64
Notice that I am using an index starting from 0 here, since 1.1**0 yields the original value.

I think this does what you want:
df = pd.DataFrame({'returns': [x for x in range(1, 10)]})
df.index = df.index + 1
df.returns = df.returns.apply(lambda x: (10 * (1.1**x)))
print(df)
Out:
returns
1 11.000000
2 12.100000
3 13.310000
4 14.641000
5 16.105100
6 17.715610
7 19.487171
8 21.435888
9 23.579477
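If the return is not the same in every row, each value really does depend on the previous one, and a running product covers that case too. A minimal sketch, assuming a hypothetical Series of per-row returns called rets (here 10% in every row, so it reproduces the numbers above):

```python
import pandas as pd

start = 10                      # starting number from the question
rets = pd.Series([0.10] * 5)    # hypothetical per-row returns (10% each)
rets.index = rets.index + 1     # 1-based row numbers, as in the question

# Each row equals the previous row times (1 + return), i.e. a running product
df = pd.DataFrame({"value": start * (1 + rets).cumprod()})
print(df.round(4))
```

Swapping in different values for rets gives a series where each row compounds the row before it, without an explicit loop.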

Related

Pandas reorder rows of dataframe

I stumbled upon a very peculiar problem in pandas. I have this dataframe:
,time,id,X,Y,theta,Vx,Vy,ANGLE_FR,DANGER_RAD,RISK_RAD,TTC_DAN_LOW,TTC_DAN_UP,TTC_STOP,SIM
0,1600349033921610000,0,23.2643889,-7.140948599999999,0,0.020961,-1.1414197,20,0.5,0.9,-1,7,2.0,3
1,1600349033921620000,1,18.5371406,-14.224917,0,-0.0113912,1.443597,20,0.5,0.9,-1,7,2.0,3
2,1600349033921650000,2,19.808648100000006,-6.778450599999998,0,0.037289,-1.0557937,20,0.5,0.9,-1,7,2.0,3
3,1600349033921670000,3,22.1796988,-5.7078115999999985,0,0.2585675,-1.2431861000000002,20,0.5,0.9,-1,7,2.0,3
4,1600349033921670000,4,20.757325,-16.115366,0,-0.2528627,0.7889673,20,0.5,0.9,-1,7,2.0,3
5,1600349033921690000,5,20.9491012,-17.7806833,0,0.5062633,0.9386511,20,0.5,0.9,-1,7,2.0,3
6,1600349033921690000,6,20.6225258,-5.5344404,0,-0.1192678,-0.7889041,20,0.5,0.9,-1,7,2.0,3
7,1600349033921700000,7,21.8077004,-14.736984,0,-0.0295737,1.3084618,20,0.5,0.9,-1,7,2.0,3
8,1600349033954560000,0,23.206789800000006,-7.5171016,0,-0.1727971,-1.1284589,20,0.5,0.9,-1,7,2.0,3
9,1600349033954570000,1,18.555421300000006,-13.7440508,0,0.0548418,1.4426004,20,0.5,0.9,-1,7,2.0,3
10,1600349033954570000,2,19.8409748,-7.126075500000002,0,0.0969802,-1.0428747,20,0.5,0.9,-1,7,2.0,3
11,1600349033954580000,3,22.3263185,-5.9586202,0,0.4398591,-0.752425,20,0.5,0.9,-1,7,2.0,3
12,1600349033954590000,4,20.7154136,-15.842398800000002,0,-0.12573430000000002,0.8189016,20,0.5,0.9,-1,7,2.0,3
13,1600349033954590000,5,21.038901,-17.4111883,0,0.2693992,1.108485,20,0.5,0.9,-1,7,2.0,3
14,1600349033954600000,6,20.612499,-5.810969,0,-0.030080400000000007,-0.8295869,20,0.5,0.9,-1,7,2.0,3
15,1600349033954600000,7,21.7872537,-14.3011986,0,-0.0613401,1.3073578,20,0.5,0.9,-1,7,2.0,3
16,1600349033921610000,0,23.2643889,-7.140948599999999,0,0.020961,-1.1414197,20,0.5,0.9,-1,7,1.5,2
17,1600349033954560000,0,23.206789800000003,-7.5171016,0,-0.1727971,-1.1284589,20,0.5,0.9,-1,7,1.5,2
18,1600349033988110000,0,23.21602,-7.897527,0,0.027693000000000002,-1.1412761999999999,20,0.5,0.9,-1,7,1.5,2
This is the input file.
Please note that id always runs from 0 to 7 and then repeats, and the time column is in sequential order (which implies that the previous row should be smaller than or equal to the current one).
I would like to reorder the rows of the dataframe as shown below.
,time,id,X,Y,theta,Vx,Vy,ANGLE_FR,DANGER_RAD,RISK_RAD,TTC_DAN_LOW,TTC_DAN_UP,TTC_STOP,SIM
0,1600349033921610000,0,23.2643889,-7.140948599999999,0,0.020961,-1.1414197,20,0.5,0.9,-1,7,1.0,2
1,1600349033954560000,0,23.206789800000003,-7.5171016,0,-0.1727971,-1.1284589,20,0.5,0.9,-1,7,1.0,2
2,1600349033988110000,0,23.21602,-7.897527,0,0.027693000000000002,-1.1412761999999999,20,0.5,0.9,-1,7,1.0,2
3,1600349033921610000,0,23.2643889,-7.140948599999999,0,0.020961,-1.1414197,20,0.5,0.9,-1,7,1.5,1
4,1600349033954560000,0,23.206789800000003,-7.5171016,0,-0.1727971,-1.1284589,20,0.5,0.9,-1,7,1.5,1
5,1600349033988110000,0,23.21602,-7.897527,0,0.027693000000000002,-1.1412761999999999,20,0.5,0.9,-1,7,1.5,1
6,1600349033921610000,0,23.2643889,-7.140948599999999,0,0.020961,-1.1414197,20,0.5,0.9,-1,7,1.5,2
7,1600349033954560000,0,23.206789800000003,-7.5171016,0,-0.1727971,-1.1284589,20,0.5,0.9,-1,7,1.5,2
8,1600349033988110000,0,23.21602,-7.897527,0,0.027693000000000002,-1.1412761999999999,20,0.5,0.9,-1,7,1.5,2
9,1600349033921610000,0,23.2643889,-7.140948599999999,0,0.020961,-1.1414197,20,0.5,0.9,-1,7,1.5,3
10,1600349033954560000,0,23.206789800000003,-7.5171016,0,-0.1727971,-1.1284589,20,0.5,0.9,-1,7,1.5,3
11,1600349033988110000,0,23.21602,-7.897527,0,0.027693000000000002,-1.1412761999999999,20,0.5,0.9,-1,7,1.5,3
This is the desired result.
Please note that I need to reorder the dataframe rows based on these columns: id, time, ANGLE_FR, DANGER_RAD, RISK_RAD, TTC_DAN_LOW, TTC_DAN_UP, TTC_STOP, SIM.
As you can see from the desired result, the dataframe needs to be reordered so that the time column runs from smallest to largest, and the same holds for the rest of the columns: id, SIM, ANGLE_FR, DANGER_RAD, RISK_RAD, TTC_DAN_LOW, TTC_DAN_UP, TTC_STOP.
I tried to sort by several columns without success. Moreover, I tried to use groupby, but I failed.
Could you help me solve the problem? Any suggestions are welcome.
P.S.
I have pasted the dataframes so that they can be read with the clipboard function and are easily reproducible.
I am attaching a picture as well.
What did you try to sort by several columns?
In [10]: df.sort_values(['id', 'time', 'ANGLE_FR', 'DANGER_RAD', 'RISK_RAD', 'TTC_DAN_LOW', 'TTC_DAN_UP', 'TTC_STOP', 'SIM'])
Out[10]:
Unnamed: 0 time id X Y theta Vx Vy ANGLE_FR DANGER_RAD RISK_RAD TTC_DAN_LOW TTC_DAN_UP TTC_STOP SIM
0 0 1600349033921610000 0 23.2644 -7.1409 0 0.0210 -1.1414 20 0.5 0.9 -1 7 2 3
8 8 1600349033954560000 0 23.2068 -7.5171 0 -0.1728 -1.1285 20 0.5 0.9 -1 7 2 3
1 1 1600349033921620000 1 18.5371 -14.2249 0 -0.0114 1.4436 20 0.5 0.9 -1 7 2 3
9 9 1600349033954570000 1 18.5554 -13.7441 0 0.0548 1.4426 20 0.5 0.9 -1 7 2 3
2 2 1600349033921650000 2 19.8086 -6.7785 0 0.0373 -1.0558 20 0.5 0.9 -1 7 2 3
How about this:
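groupby_cols = ['ANGLE_FR', 'DANGER_RAD', 'RISK_RAD', 'TTC_DAN_LOW', 'TTC_DAN_UP', 'TTC_STOP', 'SIM']
# A GroupBy object has no reset_index(); sorting on the grouping columns first keeps each parameter set together
df = df.sort_values(groupby_cols + ['id', 'time']).reset_index(drop=True)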
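For what it's worth, the order of the columns passed to sort_values is what decides the result: the first column is the primary key, the next one breaks ties, and so on. A small sketch on a made-up stand-in for the question's data (only the columns that vary are kept, and the values are hypothetical):

```python
import pandas as pd

# Tiny stand-in for the question's data: only the columns that vary
df = pd.DataFrame({
    "time": [10, 20, 10, 20, 10, 20],
    "id": [0, 0, 0, 0, 0, 0],
    "TTC_STOP": [1.5, 1.5, 1.0, 1.0, 1.5, 1.5],
    "SIM": [2, 2, 2, 2, 1, 1],
})

# Parameter columns first so each simulation stays together, then id and time
sort_cols = ["TTC_STOP", "SIM", "id", "time"]
out = df.sort_values(sort_cols).reset_index(drop=True)
print(out)
```

Putting id and time first instead would interleave the simulations, which is what the sort_values output shown earlier does.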

Populate column in dataframe based on iat values

lookup={'Tier':[1,2,3,4],'Terr.1':[0.88,0.83,1.04,1.33],'Terr.2':[0.78,0.82,0.91,1.15],'Terr.3':[0.92,0.98,1.09,1.33],'Terr.4':[1.39,1.49,1.66,1.96],'Terr.5':[1.17,1.24,1.39,1.68]}
df={'Tier':[1,1,2,2,3,2,4,4,4,1],'Territory':[1,3,4,5,4,4,2,1,1,2]}
df=pd.DataFrame(df)
lookup=pd.DataFrame(lookup)
lookup contains the lookup values, and df contains the data being fed into iat.
I get the correct values when I print(lookup.iat[tier,terr]). However, when I try to set those values in a new column, it endlessly runs, or in this simple test case just copies 1 value 10 times.
for i in df["Tier"]:
    tier=i-1
    for j in df["Territory"]:
        terr=j
        #print(lookup.iat[tier,terr])
        df["Rate"]=lookup.iat[tier,terr]
Any thoughts on a possible better solution?
You can use apply() after some modification to your lookup dataframe:
lookup = lookup.rename(columns={i: i.split('.')[-1] for i in lookup.columns}).set_index('Tier')
lookup.columns = lookup.columns.astype(int)
df['Rate'] = df.apply(lambda x: lookup.loc[x['Tier'],x['Territory']], axis=1)
Returns:
Tier Territory Rate
0 1 1 0.88
1 1 3 0.92
2 2 4 1.49
3 2 5 1.24
4 3 4 1.66
5 2 4 1.49
6 4 2 1.15
7 4 1 1.33
8 4 1 1.33
9 1 2 0.78
After modifying lookup in much the same way as @rahlf23, and additionally using stack, you can merge both dataframes:
df['Rate'] = df.merge( lookup.rename(columns={ i: int(i.split('.')[-1])
for i in lookup.columns if 'Terr' in i})
.set_index('Tier').stack()
.reset_index().rename(columns={'level_1':'Territory'}),
how='left')[0]
If you have a big dataframe df, this should be faster than using apply and loc.
Also, if any (Tier, Territory) pair in df does not exist in lookup, this method won't throw an error; with how='left' the Rate for that row is simply NaN.
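As for why the original loop copies one value: df["Rate"] = lookup.iat[tier, terr] assigns a single scalar to the whole column on every pass, so only the last assignment survives. Another option is to index the rate table with two integer arrays at once. This is only a sketch, and it assumes the tiers are exactly 1..4 in row order and the territories exactly 1..5 in column order:

```python
import pandas as pd

lookup = pd.DataFrame({
    "Tier": [1, 2, 3, 4],
    "Terr.1": [0.88, 0.83, 1.04, 1.33],
    "Terr.2": [0.78, 0.82, 0.91, 1.15],
    "Terr.3": [0.92, 0.98, 1.09, 1.33],
    "Terr.4": [1.39, 1.49, 1.66, 1.96],
    "Terr.5": [1.17, 1.24, 1.39, 1.68],
})
df = pd.DataFrame({"Tier": [1, 1, 2, 2, 3, 2, 4, 4, 4, 1],
                   "Territory": [1, 3, 4, 5, 4, 4, 2, 1, 1, 2]})

# Rates as a plain 2-D array: row = Tier - 1, column = Territory - 1
rates = lookup.set_index("Tier").to_numpy()

# Integer-array indexing picks one (row, column) pair per row of df
df["Rate"] = rates[df["Tier"].to_numpy() - 1, df["Territory"].to_numpy() - 1]
print(df)
```

Unlike the merge, this raises an IndexError if a (Tier, Territory) pair falls outside the table, so it only fits when the keys are known to be complete.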

In pandas, is there a way to compute a subsection of an expanding window without calculating the entire array and "tail-ing" the result?

I want to compute the expanding window of just the last few elements in a group...
df = pd.DataFrame({'B': [np.nan, np.nan, 1, 1, 2, 2, 1,1], 'A': [1, 2, 1, 2, 1, 2,1,2]})
df.groupby("A")["B"].expanding().quantile(0.5)
this gives:
1 0 NaN
2 1.0
4 1.5
6 1.0
2 1 NaN
3 1.0
5 1.5
7 1.0
I only really want the last two rows for each group though. The result should be:
1 4 1.5
6 1.0
2 5 1.5
7 1.0
I can easily calculate it all and then just take the sections I want, but this is very slow if my dataframe is thousands of elements long, and I don't want to roll across the whole window... just the last two "rolls".
EDIT: I have amended the title. A lot of people are correctly answering part of the question but ignoring what is, IMO, the important part (I should have been clearer).
The issue here is the time it takes. I could just "tail" the answer to get the last two, but that involves calculating the first two "expanding windows" and then throwing those results away. If my dataframe were instead thousands of rows long and I just needed the answer for the last few entries, much of this calculation would be wasted time. This is the main problem I have.
As I stated:
"I can easily calculate it all and then just get the sections I want" => through using tail.
Sorry for the confusion.
Also, using tail potentially doesn't involve calculating the lot, but from the timings I have done it still seems like it does... maybe this is not correct; it is an assumption I have made.
EDIT2: The other option I have tried was using min_periods in rolling to force it not to calculate the initial sections of the group, but this has many pitfalls, such as: it doesn't work if the array includes NaNs, and it doesn't work if the groups are not the same length.
EDIT3:
As a simpler problem and reasoning: I think it's a limitation of the expanding or rolling window... Say we had an array [1,2,3,4,5]; the expanding windows are [1], [1,2], [1,2,3], [1,2,3,4], [1,2,3,4,5], and if we run the max over them we get 1,2,3,4,5 (the max of each array). But if I just want the max of the last two expanding windows, I just need max[1,2,3,4] = 4 and max[1,2,3,4,5] = 5. Intuitively, I don't need to calculate the max of the first 3 expanding windows to get the last two. But the pandas implementation might calculate max[1,2,3,4] as max[max[1,2,3],max[4]] = 4, in which case calculating the entire window is necessary... this might be the same for the quantile example. There might be an alternative way to do this without using expanding... not sure... this is what I can't work out.
Maybe try using tail: https://pandas.pydata.org/pandas-docs/version/0.21/generated/pandas.core.groupby.GroupBy.tail.html
df.groupby('A')['B'].rolling(4, min_periods=1).quantile(0.5).reset_index(level=0).groupby('A').tail(2)
Out[410]:
A B
4 1 1.5
6 1 1.0
5 2 1.5
7 2 1.0
rolling and expanding are similar
How about this (edited 06/12/2018):
def last_two_quantile(row, q):
    return pd.Series([row.iloc[:-1].quantile(q), row.quantile(q)])
df.groupby('A')['B'].apply(last_two_quantile, 0.5)
Out[126]:
A
1 0 1.5
1 1.0
2 0 1.5
1 1.0
Name: B, dtype: float64
If this (or something like it) doesn't do what you desire I think you should provide a real example of your use case.
Is this what you want?
df[-4:].groupby("A")["B"].expanding().quantile(0.5)
A
1 4 2.0
6 1.5
2 5 2.0
7 1.5
Name: B, dtype: float64
Hope this helps.
Solution1:
newdf = df.groupby("A")["B"].expanding().quantile(0.5).reset_index()
for i in range(newdf["A"].max()+1):
    print(newdf[newdf["A"]==i][-2:],'\n')
Solution2:
newdf2 = df.groupby("A")["B"].expanding().quantile(0.5)
for i in range(newdf2.index.get_level_values("A").max()+1):
    print(newdf[newdf["A"]==i][-2:],'\n')
Solution3:
for i in range(df.groupby("A")["B"].expanding().quantile(0.5).index.get_level_values("A").max()+1):
    print(newdf[newdf["A"]==i][-2:],'\n')
output:
Empty DataFrame
Columns: [A, level_1, B]
Index: []
A level_1 B
2 1 4 1.5
3 1 6 1.0
A level_1 B
6 2 5 1.5
7 2 7 1.0
new solution:
newdf = pd.DataFrame(columns={"A", "B"})
for i in range(len(df["A"].unique())):
    newdf = newdf.append(pd.DataFrame(df[df["A"]==i+1][:-2].sum()).T)
newdf["A"] = newdf["A"]/2
for i in range(len(df["A"].unique())):
    newdf = newdf.append(df[df["A"]==df["A"].unique()[i]][-2:])
#newdf = newdf.reset_index(drop=True)
newdf["A"] = newdf["A"].astype(int)
for i in range(newdf["A"].max()+1):
    print(newdf[newdf["A"]==i].groupby("A")["B"].expanding().quantile(0.5)[-2:])
output:
Series([], Name: B, dtype: float64)
A
1 4 1.5
6 1.0
Name: B, dtype: float64
A
2 5 1.5
7 1.0
Name: B, dtype: float64

Calculate a rolling window weighted average on a Pandas column

I'm relatively new to python, and have been trying to calculate some simple rolling weighted averages across rows in a pandas data frame. I have a dataframe of observations df and a dataframe of weights w. I create a new dataframe to hold the inner-product between these two sets of values, dot.
As w is of smaller dimension, I use a for loop to calculate the weighted average by row, of the leading rows equal to the length of w.
More clearly, my set-up is as follows:
import pandas as pd
df = pd.DataFrame([0,1,2,3,4,5,6,7,8], index = range(0,9))
w = pd.DataFrame([0.1,0.25,0.5], index = range(0,3))
dot = pd.DataFrame(0, columns = ['dot'], index = df.index)
for i in range(0,len(df)):
    df.loc[i] = sum(df.iloc[max(1,(i-3)):i].values * w.iloc[-min(3,(i-1)):4].values)
I would expect the result to be as follows (i.e. when i = 4)
dot.loc[4] = sum(df.iloc[max(1,(4-3)):4].values * w.iloc[-min(3,(4-1)):4].values)
print dot.loc[4] #2.1
However, when running the for loop above, I receive the error:
ValueError: operands could not be broadcast together with shapes (0,1) (2,1)
Which is where I get confused - I think it must have to do with how I call i into iloc, as I don't receive shape errors when I manually calculate it, as in the example with 4 above. However, looking at other examples and documentation, I don't see why that's the case... Any help is appreciated.
Your first problem is that you are trying to multiply arrays of two different sizes. For example, when i=0 the different parts of your for loop return
df.iloc[max(1,(0-3)):0].values.shape
# (0,1)
w.iloc[-min(3,(0-1)):4].values.shape
# (2,1)
Which is exactly the error you are getting. The easiest way I can think of to make the arrays multipliable is to pad your dataframe with leading zeros, using concatenation.
df2 = pd.concat([pd.Series([0,0]),df], ignore_index=True)
df2
0
0 0
1 0
2 0
3 1
4 2
5 3
6 4
7 5
8 6
9 7
10 8
You can now use your for loop (with some minor tweaking):
for i in range(len(df)):
    dot.loc[i] = sum(df2.iloc[max(0,(i)):i+3].values * w.values)
A nicer way might be the one JohnE suggested: use the rolling and apply functions built into pandas, thereby getting rid of your for loop:
import numpy as np
df2.rolling(3,min_periods=3).apply(lambda x: np.dot(x,w))
0
0 NaN
1 NaN
2 0.00
3 0.50
4 1.25
5 2.10
6 2.95
7 3.80
8 4.65
9 5.50
10 6.35
You can also drop the first two padding rows and reset the index
df2.rolling(3,min_periods=3).apply(lambda x: np.dot(x,w)).drop([0,1]).reset_index(drop=True)
0
0 0.00
1 0.50
2 1.25
3 2.10
4 2.95
5 3.80
6 4.65
7 5.50
8 6.35
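Since a rolling weighted sum with fixed weights is a convolution, the same numbers can also be obtained with plain NumPy. A minimal sketch, assuming the weights are ordered oldest observation first (as in w above) and using the same two leading zeros as padding:

```python
import numpy as np

x = np.arange(9, dtype=float)        # observations 0..8 from the question
w = np.array([0.1, 0.25, 0.5])       # weights, oldest observation first

# Two leading zeros play the role of the padding rows in df2
padded = np.concatenate([np.zeros(len(w) - 1), x])

# convolve flips its second argument, so pass the weights reversed
result = np.convolve(padded, w[::-1], mode="valid")
print(result)
```

The result has one value per original row and avoids calling a Python lambda for every window.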

Python Pandas: Get row by median value

I'm trying to get the row of the median value for a column.
I'm using data.median() to get the median value for 'column'.
id 30444.5
someProperty 3.0
numberOfItems 0.0
column 70.0
And data.median()['column'] is subsequently:
data.median()['performance']
>>> 70.0
How can I get the row or index of the median value?
Is there anything similar to idxmax / idxmin?
I tried filtering but it's not reliable in cases multiple rows have the same value.
Thanks!
You can use rank and idxmin and apply it to each column:
import numpy as np
import pandas as pd
def get_median_index(d):
    ranks = d.rank(pct=True)
    close_to_median = abs(ranks - 0.5)
    return close_to_median.idxmin()
df = pd.DataFrame(np.random.randn(13, 4))
df
0 1 2 3
0 0.919681 -0.934712 1.636177 -1.241359
1 -1.198866 1.168437 1.044017 -2.487849
2 1.159440 -1.764668 -0.470982 1.173863
3 -0.055529 0.406662 0.272882 -0.318382
4 -0.632588 0.451147 -0.181522 -0.145296
5 1.180336 -0.768991 0.708926 -1.023846
6 -0.059708 0.605231 1.102273 1.201167
7 0.017064 -0.091870 0.256800 -0.219130
8 -0.333725 -0.170327 -1.725664 -0.295963
9 0.802023 0.163209 1.853383 -0.122511
10 0.650980 -0.386218 -0.170424 1.569529
11 0.678288 -0.006816 0.388679 -0.117963
12 1.640222 1.608097 1.779814 1.028625
df.apply(get_median_index, 0)
0 7
1 7
2 3
3 4
Maybe just: data[data.performance == data.median()['performance']].
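Note that the equality filter returns nothing when the number of rows is even, because the median is then the average of the two middle values and may not appear in the column. A hedged alternative is to take the row whose value is closest to the median. The data below is made up, and the column name performance is taken from the question:

```python
import pandas as pd

# Hypothetical data; the column name 'performance' follows the question
data = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6],
                     "performance": [50, 90, 70, 60, 80, 100]})

col = data["performance"]
# Distance of every value from the median; idxmin returns the first closest label
median_idx = (col - col.median()).abs().idxmin()
print(median_idx)
print(data.loc[median_idx])
```

When several rows are equally close, idxmin returns the first one, so exactly one row comes back.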
