Pandas dataframe: using the output of a function in row x as input for the same function in row x+1 - python

I have a dataframe that looks as follows
SIM Sim_1 Sim_2
2015 100.0000 100.0000
2016 2.504613 0.123291
2017 3.802958 -0.919886
2018 4.513224 -1.976056
2019 -0.775783 3.914312
The following function
df = sims.shift(1, axis = 0)*(1+sims/100)
returns a dataframe which looks like this
SIMULATION Sim_1 Sim_2
2015 NaN NaN
2016 102.504613 100.123291
2017 2.599862 0.122157
2018 3.974594 -0.901709
The value in 2016 is exactly the one that should be calculated. But the calculation for 2017 should take the output of the formula in 2016 (102.504613 and 100.123291) as its input. Instead, the formula uses the original 2016 values (2.504613 and 0.123291), which gives 2.599862 and 0.122157.
Is there a simple way to do this in Python?

You are trying to show the growth of 100 given subsequent returns. Your problem is that the initial 100 is a level, while every other row is a percentage return. If you replace it with zero (a 0% return) and then do a cumprod, your problem is solved.
sims.iloc[0] = 0
sims.div(100).add(1).cumprod().mul(100)
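As a quick check of the idea above, here is a minimal sketch that rebuilds the sample data from the question (the frame construction is an assumption about how `sims` is laid out) and compares the `cumprod` result with the recurrence value[t] = value[t-1] * (1 + return[t] / 100) applied row by row. The names `returns`, `growth`, and `expected` are illustrative only.

```python
import pandas as pd

# Sample data from the question: the first row holds the starting value 100,
# the remaining rows hold yearly returns in percent.
sims = pd.DataFrame(
    {
        "Sim_1": [100.0, 2.504613, 3.802958, 4.513224, -0.775783],
        "Sim_2": [100.0, 0.123291, -0.919886, -1.976056, 3.914312],
    },
    index=pd.Index([2015, 2016, 2017, 2018, 2019], name="SIM"),
)

returns = sims.copy()
returns.iloc[0] = 0  # treat the first year as a 0% return

# Compound the returns, then scale by the starting value of 100.
growth = returns.div(100).add(1).cumprod().mul(100)

# Reference: apply the recurrence one row at a time.
expected = sims.copy()
for i in range(1, len(expected)):
    expected.iloc[i] = expected.iloc[i - 1] * (1 + sims.iloc[i] / 100)

print(growth)
```

Working on a copy keeps the original `sims` intact, which may matter if the starting value is needed later.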

Just a crude way of implementing this, looping over the rows so that each row uses the previous row's result:
for i in range(1, len(sims)):
    # previous result times (1 + this year's return in percent)
    sims.iloc[i] = sims.iloc[i - 1] * (1 + sims.iloc[i] / 100)
There may be a better way to optimize this.

Related

Calculate standard deviation for intervals in dataframe column

I would like to calculate standard deviations for non rolling intervals.
I have a df like this:
value std year
3 nan 2001
2 nan 2001
4 nan 2001
19 nan 2002
23 nan 2002
34 nan 2002
and so on. I would just like to calculate the standard deviation for every year and save it in every cell in the respective row in "std". I have the same amount of data for every year, thus the length of the intervals never changes.
I already tried:
df["std"] = df.groupby("year").std()
but since the right-hand side returns a new dataframe that contains the std of every column grouped by year, this obviously does not work.
Thank you all very much for your support!
IIUC, try the transform() method:
df['std']=df.groupby("year")['value'].transform('std')
OR
If you want to find the standard deviation of multiple columns then:
df[['std1','std2']]=df.groupby("year")[['column1','column2']].transform('std')
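For illustration, a minimal sketch of the `transform` approach on the sample rows from the question (only the rows shown there are used; the rest of the data is unknown):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "value": [3, 2, 4, 19, 23, 34],
        "year": [2001, 2001, 2001, 2002, 2002, 2002],
    }
)

# transform() returns one value per original row, so the result lines up
# with df and every row of a year receives that year's standard deviation.
df["std"] = df.groupby("year")["value"].transform("std")
print(df)
```

Unlike `groupby(...).std()`, which returns one row per year, `transform` broadcasts the group result back to the original shape, which is why the assignment works.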

Python / Pandas: Fill NaN with order - linear interpolation --> ffill --> bfill

I have a df:
company year revenues
0 company 1 2019 1,425,000,000
1 company 1 2018 1,576,000,000
2 company 1 2017 1,615,000,000
3 company 1 2016 1,498,000,000
4 company 1 2015 1,569,000,000
5 company 2 2019 nan
6 company 2 2018 1,061,757,075
7 company 2 2017 nan
8 company 2 2016 573,414,893
9 company 2 2015 599,402,347
I would like to fill the nan values, with an order. I want to linearly interpolate first, then forward fill and then backward fill. I currently have:
f_2_impute = [x for x in cl_data.columns if cl_data[x].dtypes != 'O' and 'total' not in x and 'year' not in x]
def ffbf(x):
    return x.ffill().bfill()
group_with = ['company']
for x in cl_data[f_2_impute]:
    cl_data[x] = cl_data.groupby(group_with)[x].apply(lambda fill_it: ffbf(fill_it))
which performs ffill() and bfill(). Ideally I want a function that first tries to linearly interpolate the missing values, then forward fills them, and then backward fills them.
Is there a quick way of achieving this? Thanks in advance.
I believe you first need to convert the columns to floats if they contain a thousands separator (,):
df = pd.read_csv(file, thousands=',')
Or:
df['revenues'] = df['revenues'].replace(',','', regex=True).astype(float)
and then add DataFrame.interpolate:
def ffbf(x):
    return x.interpolate().ffill().bfill()
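A minimal sketch of the full fill order applied per company, assuming the revenues have already been converted to floats. The helper name `fill_in_order` and the use of `transform` instead of `apply` are choices made for this example, not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "company": ["company 1"] * 5 + ["company 2"] * 5,
        "year": [2019, 2018, 2017, 2016, 2015] * 2,
        "revenues": [
            1_425_000_000, 1_576_000_000, 1_615_000_000,
            1_498_000_000, 1_569_000_000,
            np.nan, 1_061_757_075, np.nan, 573_414_893, 599_402_347,
        ],
    }
)

def fill_in_order(s):
    # Linear interpolation fills gaps between known values first;
    # ffill and bfill then cover whatever is still missing at the edges.
    return s.interpolate().ffill().bfill()

# transform() runs the function per company and keeps the original index,
# so values from one company never leak into another.
df["revenues"] = df.groupby("company")["revenues"].transform(fill_in_order)
print(df)
```

Here the gap between the 2018 and 2016 values of company 2 is interpolated, while the leading gap in 2019 has no earlier neighbour and ends up backward filled.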

Python/Pandas Dataframe - Calculate the difference of two far apart rows in a dataframe based on identifier

I am working on a dataframe in python (mostly pandas and numpy). Example is given below.
Name ID Guess Date Topic Delta
0 a 23 5 2019 1 (Person A's Guess in 2019 - Guess in 2018)
1 a 23 8 2018 1
2 c 7 7 2019 1 (Person C's Guess in 2019 - Guess in 2018)
3 c 7 4 2018 1
4 e 12 9 2018 1
5 a 23 3 2020 2
I want to fill the empty column Delta, which is just the difference between the last guess on the same topic and the now updated guess. I am having troubles since I need to keep both the topic and the person's ID.
The dataset is fairly large (> 1mio. entries) which is why my approach to iterate over it did cause troubles when using the full dataframe.
I sorted the dataframe in the above way in order to try to solve it with .shift(); however, I guess there must be a solution without sorting the df since I have enough identifiers (ID, Date, Topic).
for i in df.index:
    if df['ID'].iloc[i] == test['ID'].iloc[i+1]:
        df['Delta'].iloc[i+1] = df['Guess'].iloc[i+1] - test['Guess'].iloc[i]
    else:
        df['Delta'].iloc[i+1] = "NaN"
If anyone knows a more efficient (maybe vectorized) solution for this problem, I would greatly appreciate any hints and help.
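One possible vectorized sketch, offered under assumptions rather than as a confirmed answer: it assumes "the last guess" means the chronologically previous row with the same ID and Topic, and that Date is sortable. Only the sample rows from the question are used.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["a", "a", "c", "c", "e", "a"],
        "ID": [23, 23, 7, 7, 12, 23],
        "Guess": [5, 8, 7, 4, 9, 3],
        "Date": [2019, 2018, 2019, 2018, 2018, 2020],
        "Topic": [1, 1, 1, 1, 1, 2],
    }
)

# Order the rows chronologically, then take the change in Guess within each
# (ID, Topic) group. The result keeps the original index labels, so assigning
# it back aligns row by row and df itself is never reordered.
df["Delta"] = df.sort_values("Date").groupby(["ID", "Topic"])["Guess"].diff()
print(df)
```

Rows without an earlier guess on the same topic (including the 2020 guess on topic 2) are left as NaN.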

Python function to iterate over a column and calculate the formula

I have a data set like this:
YEAR MONTH VALUE
2018 3 59.507
2018 3 26.03
2018 5 6.489
2018 2 -3.181
I am trying to perform a calculation like
((VALUE1 + 1) * (VALUE2 + 1) * (VALUE3 + 1) * ... * (VALUEn + 1)) - 1 over the VALUE column.
What's the best way to accomplish this?
Use:
df['VALUE'].add(1).prod()-1
#-26714.522733572892
If you want cumulative product to create a new column use Series.cumprod:
df['new_column']=df['VALUE'].add(1).cumprod().sub(1)
print(df)
YEAR MONTH VALUE new_column
0 2018 3 59.507 59.507000
1 2018 3 26.030 1634.504210
2 2018 5 6.489 12247.291029
3 2018 2 -3.181 -26714.522734
I think you're after...
cum_prod = (1 + df['VALUE']).cumprod() - 1
First, you should understand the objects you're dealing with and what attributes and methods they have. This is a DataFrame, and the VALUE column is a Series.
Here is the documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html

Create a new column with partial name from dataframe

I have five datasets that I have added a 'Year' column to, like this:
newyork2014['Year'] = 2014
newyork2015['Year'] = 2015
newyork2016['Year'] = 2016
newyork2017['Year'] = 2017
newyork2018['Year'] = 2018
However, I'm wondering if there's a more Pythonic way of doing this, perhaps with a function? I don't want to change the actual dataframe into a string though, but I want to "stringify" the name of the dataframe. Here's what I was thinking:
def get_year(df):
    df['Year'] = last four digits of name of df
    return df
A dataframe does not know the name of the variable it is bound to, so you need to adjust things a little when you create the dataframe and assign it a name yourself:
newyork2014.name='newyork2014'
def get_year(df):
    df['Year'] = df.name[-4:]
    return df
get_year(newyork2014)
Out[42]:
ID Col1 Col2 New Year
2018-06-01 A 10 100 0.5 2014
2018-06-02 B 5 25 2.1 2014
2018-06-03 A 25 25 0.6 2014
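An alternative sketch that avoids parsing variable names altogether: keep the yearly dataframes in a dictionary keyed by year. The dictionary `frames` and the placeholder columns are hypothetical stand-ins for the real datasets, which are not shown in the question.

```python
import pandas as pd

# Hypothetical stand-ins for the five yearly datasets from the question.
frames = {
    year: pd.DataFrame({"ID": ["A", "B"], "Col1": [10, 5]})
    for year in range(2014, 2019)
}

# The dictionary key carries the year, so nothing has to be read back
# from a variable name, and Year stays an integer.
for year, frame in frames.items():
    frame["Year"] = year

# Optional: stack everything into a single dataframe.
combined = pd.concat(list(frames.values()), ignore_index=True)
print(combined)
```

This keeps `Year` numeric, whereas slicing a name string yields text such as '2014' unless it is converted with `int`.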
