I've got a dataframe with market data and one column dedicated to daily returns.
I'm having a hard time creating a portfolio that starts at $100,000.00 in value and computing its cumulative return through the life of the data series.
Ideally, I'd like to compute the 'portfolio' column using pandas, but I'm having trouble doing so. See the target output below. Thank you.
index date index return portfolio
0 19900101 2000 Nan 100000.00
1 19900102 2002 0.001 100100.00
2 19900103 2020 0.00899 100999.90
3 19900104 2001 -0.00941 100049.49
By using cumprod
df['P']=df['return'].add(1).fillna(1).cumprod()*100000
df
Out[843]:
index date index.1 return portfolio P
0 0 19900101 2000 NaN 100000.00 100000.00000
1 1 19900102 2002 0.00100 100100.00 100100.00000
2 2 19900103 2020 0.00899 100999.90 100999.89900
3 3 19900104 2001 -0.00941 100049.49 100049.48995
Some adjustments:
import numpy as np

df = df.replace('Nan', np.nan)             # the question's 'Nan' is a string, not a true NaN
df['return'] = pd.to_numeric(df['return'])
starting_value = 100000
df = df.assign(portfolio=(1 + df['return'].fillna(0)).cumprod().mul(starting_value))
>>> df
index date index.1 return portfolio
0 0 19900101 2000 NaN 100000.00000
1 1 19900102 2002 0.00100 100100.00000
2 2 19900103 2020 0.00899 100999.89900
3 3 19900104 2001 -0.00941 100049.48995
To visualize what is happening, cumprod is calculating compounded returns, e.g. cum_r3 = (1 + r1) * (1 + r2) * (1 + r3).
>>> (1 + df['return'].fillna(0)).cumprod()
0 1.000000
1 1.001000
2 1.009999
3 1.000495
Name: return, dtype: float64
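To see the compounding concretely, the whole chain can be cross-checked by hand on the sample returns (a quick sanity sketch, not part of the answer's code):

```python
import pandas as pd

# The sample daily returns, NaN for the first day as in the question.
r = pd.Series([float("nan"), 0.001, 0.00899, -0.00941])

# Compound the returns: cum_r_n = (1 + r1) * (1 + r2) * ... * (1 + rn)
portfolio = (1 + r.fillna(0)).cumprod() * 100000

# The last value written out term by term gives the same number.
by_hand = 100000 * (1 + 0.001) * (1 + 0.00899) * (1 - 0.00941)
print(round(portfolio.iloc[-1], 2))  # ~100049.49, matching the last table row
```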
df: (DataFrame)
Open High Close Volume
2020/1/1 1 2 3 323232
2020/1/2 2 3 4 321321
....
2020/12/31 4 5 6 123213
....
2021
The output I need is (Graph No.1):
Open High Close Volume Year_Sum_Volume
2020/1/1 1 2 3 323232 (323232 + 321321 +....+ 123213)
2020/1/2 2 3 4 321321 (323232 + 321321 +....+ 123213)
....
2020/12/31 4 5 6 123213 (323232 + 321321 +....+ 123213)
....
2021 (x+x+x.....x)
I want a sum of Volume per year (Year_Sum_Volume is the total volume of that year).
This is the code I tried to calculate the sum of volume in each year, but how can I add this data back to the daily rows, i.e. add Year_Sum_Volume to df, like in Graph No.1?
df.resample('Y', on='Date')['Volume'].sum()
Thank you for answering.
I believe groupby.sum() and merge should be your friends
import pandas as pd
df = pd.DataFrame({"date":['2021-12-30', '2021-12-31', '2022-01-01'], "a":[1,2.1,3.2]})
df.date = pd.to_datetime(df.date)
df["year"] = df.date.dt.year
df_sums = df.groupby("year")[["a"]].sum().rename(columns={"a": "a_sum"})  # select "a" so the date column isn't summed
df = df.merge(df_sums, right_index=True, left_on="year")
which gives:
                 date    a  year  a_sum
0 2021-12-30 00:00:00  1.0  2021    3.1
1 2021-12-31 00:00:00  2.1  2021    3.1
2 2022-01-01 00:00:00  3.2  2022    3.2
Based on your output, Year_Sum_Volume is the same value for every row, so it can be calculated using df['Volume'].sum().
Then you join a column that repeats that value:
df.join(pd.DataFrame({'Year_Sum_Volume': [your_sum_val] * len(df)}))
Try the code below (after converting the Date column with pd.to_datetime):
df.assign(Year_Sum_Volume=df.groupby(df['Date'].dt.year)['Volume'].transform('sum'))
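A runnable sketch of that transform approach, using the question's Date/Volume column names on made-up values:

```python
import pandas as pd

# Tiny frame shaped like the question's data (values invented).
df = pd.DataFrame({
    "Date": ["2020/1/1", "2020/1/2", "2020/12/31", "2021/1/1"],
    "Volume": [323232, 321321, 123213, 500000],
})
df["Date"] = pd.to_datetime(df["Date"])

# transform('sum') broadcasts each year's total back onto its daily rows.
df["Year_Sum_Volume"] = df.groupby(df["Date"].dt.year)["Volume"].transform("sum")
print(df)
```

Every 2020 row gets the 2020 total, and every 2021 row gets the 2021 total, which is exactly the Graph No.1 shape.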
I am new to pandas and trying to figure out how to calculate the percentage change (difference) between two years, given that sometimes there is no previous year.
I am given a dataframe as follows:
company date amount
1 Company 1 2020 3
2 Company 1 2021 1
3 COMPANY2 2020 7
4 Company 3 2020 4
5 Company 3 2021 4
.. ... ... ...
766 Company N 2021 9
765 Company N 2020 1
767 Company XYZ 2021 3
768 Company X 2021 3
769 Company Z 2020 2
I wrote something like this:
for company in unique(df2.company):
    company_df = df2[df2.company == company]
    company_df.sort_values(by="date")
    company_df_year = company_df.amount.tolist()
    company_df_year.pop()
    company_df_year.insert(0, 0)
    company_df["value_year_before"] = company_df_year
    if any in company_df.value_year_before == None:
        company_df["diff"] = 0
    else:
        company_df["diff"] = (company_df.amount - company_df.value_year_before) / company_df.value_year_before
    df2["ratio"] = company_df["diff"]
But I keep getting NaN.
Where did I make a mistake?
The main issue is that you are overwriting company_df in each iteration of the loop and only keeping the last one.
However, normally when using Pandas if you are starting to use a for loop then you are doing something wrong and there is an easier way to accomplish the goal. Here you could use groupby and pct_change to compute the ratio of each group.
df = df.sort_values(['company', 'date'])
df['ratio'] = df.groupby('company')['amount'].pct_change()
df['ratio'] = df['ratio'].fillna(0.0)
Groupby keeps the order of the rows within each group, so we sort beforehand to ensure the dates are in the correct order; fillna replaces any NaNs with 0.
Result:
company date amount ratio
3 COMPANY2 2020 7 0.000000
1 Company 1 2020 3 0.000000
2 Company 1 2021 1 -0.666667
4 Company 3 2020 4 0.000000
5 Company 3 2021 4 0.000000
765 Company N 2020 1 0.000000
766 Company N 2021 9 8.000000
768 Company X 2021 3 0.000000
767 Company XYZ 2021 3 0.000000
769 Company Z 2020 2 0.000000
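For completeness, here is the same three-line recipe run end to end on a tiny sample (values invented, mirroring the question's shape, including a company with only one year):

```python
import pandas as pd

# One company with two years, one with a single year (no previous year).
df = pd.DataFrame({
    "company": ["Company 1", "Company 1", "COMPANY2"],
    "date": [2021, 2020, 2020],
    "amount": [1, 3, 7],
})

# Sort so pct_change compares each year to the previous one per company.
df = df.sort_values(["company", "date"])
df["ratio"] = df.groupby("company")["amount"].pct_change()
df["ratio"] = df["ratio"].fillna(0.0)
print(df)
```

Company 1 goes from 3 to 1, giving a ratio of about -0.667; the first year of each company, having no predecessor, is filled with 0.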
Apply an anonymous function that calculates the percentage change and returns it if there is more than one value. Use:
df = pd.DataFrame({'company': [1,1,3], 'date':[2020,2021,2020], 'amount': [4,5,7]})
df.groupby('company')['amount'].apply(lambda x: (list(x)[1]-list(x)[0])/list(x)[0] if len(x)>1 else 'not enough values')
Output:
company
1                 0.25
3    not enough values
Name: amount, dtype: object
I've got a dataframe of the form:
2021 2022 2023
0 3 7 7
1 1 4 4
2 0 1 5
3 4 5 7
Now I'd like to find the accumulated percentages calculated relative to the last column (2023) across each row so that I'll end up with this:
2021 2022 2023
0 42.86 100.00 100.0
1 25.00 100.00 100.0
2 0.00 20.00 100.0
3 57.14 71.43 100.0
I am able to obtain the desired output using:
data = []
colnames = list(df.columns)
for row in df.iterrows():
    data.append([elem / row[1][-1] * 100 for elem in row][1].values)
df_acc = pd.DataFrame(data)
df_acc.columns = colnames
But this seems horribly inefficient: I have to iterate over all rows, use a list comprehension to find the percentages via [elem/row[1][-1]*100 for elem in row][1].values, and then build a new dataframe.
Does anyone know of a better approach? Perhaps even one that uses inplace=True?
Complete code with data sample:
import pandas as pd
import numpy as np
# data
np.random.seed(1)
start = 2021
ncols = 3
nrows = 4
cols = [str(i) for i in np.arange(start, start+ncols)]
df = pd.DataFrame(np.random.randint(0,5, (nrows,ncols)), columns = cols).cumsum(axis = 1)
data = []
colnames = list(df.columns)
for row in df.iterrows():
    data.append([round(elem / row[1][-1] * 100, 2) for elem in row][1].values)
    # data.append([elem/row[1][-1]*100 for elem in row][1].values)
df_acc = pd.DataFrame(data)
df_acc.columns = colnames
df_acc
You can use df.div with the last column, then multiply by 100 and round to 2 decimal places:
>>> df.div(df.iloc[:,-1], axis=0).mul(100).round(2)
2021 2022 2023
0 42.86 100.00 100.0
1 25.00 100.00 100.0
2 0.00 20.00 100.0
3 57.14 71.43 100.0
If you want the percentage based on the max value of each row (df.max(1) takes the row-wise max, which here equals the last column since the data is cumulative):
>>> df.div(df.max(1), axis=0).mul(100).round(2)
2021 2022 2023
0 42.86 100.00 100.0
1 25.00 100.00 100.0
2 0.00 20.00 100.0
3 57.14 71.43 100.0
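On the inplace=True question: div has no in-place variant, but you can assign the result back over the same columns, which mutates df instead of building a new frame. A sketch on the same sample values:

```python
import pandas as pd

df = pd.DataFrame({"2021": [3, 1, 0, 4],
                   "2022": [7, 4, 1, 5],
                   "2023": [7, 4, 5, 7]})

# Overwrite all columns of df at once with the row-normalized percentages.
df[df.columns] = df.div(df.iloc[:, -1], axis=0).mul(100).round(2)
print(df)
```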
I have a dataset, df, where I have a new value for each day. I would like to output the percent difference of these values from row to row as well as the raw value difference:
Date Value
10/01/2020 1
10/02/2020 2
10/03/2020 5
10/04/2020 8
Desired output:
Date        Value  PercentDifference  ValueDifference
10/01/2020      1
10/02/2020      2                100                1
10/03/2020      5                150                3
10/04/2020      8                 60                3
This is what I am doing:
import pandas as pd
df = pd.read_csv('df.csv')
df = (df.merge(df.assign(Date=df['Date'] - pd.to_timedelta('1D')),
on='Date')
.assign(Value = lambda x: x['Value_y']-x['Value_x'])
[['Date','Value']]
)
df['PercentDifference'] = [f'{x:.2%}' for x in (df['Value'].div(df['Value'].shift(1)) -
1).fillna(0)]
A member has helped me with the code above, I am also trying to incorporate the value difference as shown in my desired output.
Note - Is there a way to incorporate a 'period' - say, checking the percent difference and value difference over a 7 day period and 30 day period and so on?
Any suggestion is appreciated
Use Series.pct_change and Series.diff
df['PercentageDiff'] = df['Value'].pct_change().mul(100)
df['ValueDiff'] = df['Value'].diff()
Date Value PercentageDiff ValueDiff
0 10/01/2020 1 NaN NaN
1 10/02/2020 2 100.0 1.0
2 10/03/2020 5 150.0 3.0
3 10/04/2020 8 60.0 3.0
Or you can use df.assign:
df.assign(
percentageDiff=df["Value"].pct_change().mul(100),
ValueDiff=df["Value"].diff()
)
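On the note about longer windows: both pct_change and diff accept a periods argument, so a 7-row (here: 7-day, assuming one row per day with no gaps) comparison is the same call with periods=7:

```python
import pandas as pd

# Ten days of made-up values, one row per day.
df = pd.DataFrame({"Value": range(1, 11)})

# Same two calls, just comparing each row to the one 7 rows earlier.
df["PercentDifference7"] = df["Value"].pct_change(periods=7).mul(100)
df["ValueDifference7"] = df["Value"].diff(periods=7)
print(df)
```

The first 7 rows have no row 7 days back, so they come out as NaN, analogous to the first row in the one-day case. If the dates have gaps, you would instead set Date as a DatetimeIndex and work with calendar offsets.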
I have a data set like this:
YEAR MONTH VALUE
2018 3 59.507
2018 3 26.03
2018 5 6.489
2018 2 -3.181
I am trying to perform the calculation
((VALUE1 + 1) * (VALUE2 + 1) * (VALUE3 + 1) * ... * (VALUEn + 1)) - 1
over the VALUE column. What's the best way to accomplish this?
Use:
df['VALUE'].add(1).prod()-1
# -26714.522733572892
If you want cumulative product to create a new column use Series.cumprod:
df['new_column']=df['VALUE'].add(1).cumprod().sub(1)
print(df)
YEAR MONTH VALUE new_column
0 2018 3 59.507 59.507000
1 2018 3 26.030 1634.504210
2 2018 5 6.489 12247.291029
3 2018 2 -3.181 -26714.522734
I think you're after...
cum_prod = (1 + df['VALUE']).cumprod() - 1
First you should understand the objects you're dealing with and what attributes and methods they have. This is a DataFrame, and the VALUE column is a Series.
Here is the documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html
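As a final sanity check, the one-off product from the first answer can be written out term by term on the sample values:

```python
import pandas as pd

values = pd.Series([59.507, 26.03, 6.489, -3.181])

# Vectorized form: (v1+1)*(v2+1)*(v3+1)*(v4+1) - 1
result = values.add(1).prod() - 1

# The same product written out explicitly.
by_hand = (59.507 + 1) * (26.03 + 1) * (6.489 + 1) * (-3.181 + 1) - 1
assert abs(result - by_hand) < 1e-6
print(result)
```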