I have a data set like this:
YEAR MONTH VALUE
2018 3 59.507
2018 3 26.03
2018 5 6.489
2018 2 -3.181
I am trying to perform a calculation like
((VALUE1 + 1) * (VALUE2 + 1) * (VALUE3 + 1) * ... * (VALUEn + 1)) - 1 over the VALUE column.
What's the best way to accomplish this?
Use:
df['VALUE'].add(1).prod()-1
#-26714.522733572892
If you want cumulative product to create a new column use Series.cumprod:
df['new_column'] = df['VALUE'].add(1).cumprod().sub(1)
print(df)
YEAR MONTH VALUE new_column
0 2018 3 59.507 59.507000
1 2018 3 26.030 1634.504210
2 2018 5 6.489 12247.291029
3 2018 2 -3.181 -26714.522734
I think you're after...
cum_prod = (1 + df['VALUE']).cumprod() - 1
First you should understand the objects you're dealing with and what attributes and methods they have. This is a DataFrame, and the VALUE column is a Series.
Here is the documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html
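For illustration, here is a minimal, self-contained sketch of both calculations; the DataFrame construction is an assumed reconstruction of the example data above:

import pandas as pd

# rebuild the example data from the question (hypothetical reconstruction)
df = pd.DataFrame({'YEAR': [2018, 2018, 2018, 2018],
                   'MONTH': [3, 3, 5, 2],
                   'VALUE': [59.507, 26.03, 6.489, -3.181]})

# single number: (VALUE1 + 1) * (VALUE2 + 1) * ... * (VALUEn + 1) - 1
total = df['VALUE'].add(1).prod() - 1

# running version of the same calculation as a new column
df['new_column'] = df['VALUE'].add(1).cumprod().sub(1)

print(total)
print(df)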
I'm trying to finish my work project, but I'm getting stuck at a certain point.
Part of the dataframe I have is this:
year_month  year  month
2007-01     2007      1
2009-07     2009      7
2010-03     2010      3
However, I want to add a "season" column. I'm working with soccer seasons, and the "season" column needs to show which season the player plays in. So if the month is equal to or smaller than 3, the "season" column should be (year - 1) + "/" + year, and if it is larger, year + "/" + (year + 1).
The table should look like this:
year_month  year  month  season
2007-01     2007      1  2006/2007
2009-07     2009      7  2009/2010
2010-03     2010      3  2009/2010
Hopefully someone else can help me with this problem.
Here is the code to create the first table:
import pandas as pd
from datetime import datetime
df = pd.DataFrame({'year_month': ["2007-01", "2009-07", "2010-03"],
                   'year': [2007, 2009, 2010],
                   'month': [1, 7, 3]})
# convert the 'year_month' column to datetime format
df['year_month'] = pd.to_datetime(df['year_month'])
Thanks in advance!
You can use np.where() to specify the condition and get the corresponding strings according to whether the condition is True or False, as follows:
import numpy as np

df['season'] = np.where(df['month'] <= 3,
                        (df['year'] - 1).astype(str) + '/' + df['year'].astype(str),
                        df['year'].astype(str) + '/' + (df['year'] + 1).astype(str))
Result:
year_month year month season
0 2007-01-01 2007 1 2006/2007
1 2009-07-01 2009 7 2009/2010
2 2010-03-01 2010 3 2009/2010
You can use a lambda function with conditionals and axis=1 to apply it to each row. Using f-strings reduces the code needed to turn values from the year column into the strings needed for your new season column.
df['season'] = df.apply(lambda x: f"{x['year']-1}/{x['year']}" if x['month'] <= 3 else f"{x['year']}/{x['year']+1}", axis=1)
Output:
year_month year month season
0 2007-01 2007 1 2006/2007
1 2009-07 2009 7 2009/2010
2 2010-03 2010 3 2009/2010
I have a df:
company year revenues
0 company 1 2019 1,425,000,000
1 company 1 2018 1,576,000,000
2 company 1 2017 1,615,000,000
3 company 1 2016 1,498,000,000
4 company 1 2015 1,569,000,000
5 company 2 2019 nan
6 company 2 2018 1,061,757,075
7 company 2 2017 nan
8 company 2 2016 573,414,893
9 company 2 2015 599,402,347
I would like to fill the NaN values in a particular order: first linearly interpolate, then forward fill, then backward fill. I currently have:
f_2_impute = [x for x in cl_data.columns if cl_data[x].dtypes != 'O' and 'total' not in x and 'year' not in x]

def ffbf(x):
    return x.ffill().bfill()

group_with = ['company']
for x in cl_data[f_2_impute]:
    cl_data[x] = cl_data.groupby(group_with)[x].apply(lambda fill_it: ffbf(fill_it))
which performs ffill() and bfill(). Ideally, I want a function that first tries to linearly interpolate the missing values, then forward fills them, and then backward fills them.
Any quick ways of achieving this? Thank you in advance.
I believe you first need to convert the columns to floats, either while reading the file:
df = pd.read_csv(file, thousands=',')
Or:
df['revenues'] = df['revenues'].replace(',','', regex=True).astype(float)
and then add DataFrame.interpolate:
def ffbf(x):
    return x.interpolate().ffill().bfill()
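For context, here is a minimal end-to-end sketch of applying that fill order per company; the column names and the use of transform are assumptions based on the question's loop:

import numpy as np
import pandas as pd

# hypothetical reconstruction of the example data
df = pd.DataFrame({'company': ['company 1'] * 5 + ['company 2'] * 5,
                   'year': [2019, 2018, 2017, 2016, 2015] * 2,
                   'revenues': [1425000000, 1576000000, 1615000000, 1498000000, 1569000000,
                                np.nan, 1061757075, np.nan, 573414893, 599402347]})

def ffbf(x):
    # interpolate first, then forward fill, then backward fill
    return x.interpolate().ffill().bfill()

# apply the fill order within each company so values never leak across companies
df['revenues'] = df.groupby('company')['revenues'].transform(ffbf)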
I am working on a dataframe in python (mostly pandas and numpy). Example is given below.
Name ID Guess Date Topic Delta
0 a 23 5 2019 1 (Person A's Guess in 2019 - Guess in 2018)
1 a 23 8 2018 1
2 c 7 7 2019 1 (Person C's Guess in 2019 - Guess in 2018)
3 c 7 4 2018 1
4 e 12 9 2018 1
5 a 23 3 2020 2
I want to fill the empty column Delta, which is just the difference between the last guess on the same topic and the now updated guess. I am having trouble since I need to match on both the topic and the person's ID.
The dataset is fairly large (> 1 million entries), which is why my approach of iterating over it caused trouble when using the full dataframe.
I sorted the dataframe in the above way in order to try to solve it with .shift(); however, I guess there must be a solution without sorting the df, since I have enough identifiers (ID, Date, Topic).
for i in df.index[:-1]:
    if df['ID'].iloc[i] == df['ID'].iloc[i + 1]:
        df['Delta'].iloc[i + 1] = df['Guess'].iloc[i + 1] - df['Guess'].iloc[i]
    else:
        df['Delta'].iloc[i + 1] = np.nan
If anyone knows a more efficient (maybe vectorized) solution for this problem, I would greatly appreciate the hints and help.
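One possible vectorized sketch, based only on the column names in the example (an assumption, not an accepted solution from this thread):

import pandas as pd

# hypothetical reconstruction of the example data
df = pd.DataFrame({'Name': ['a', 'a', 'c', 'c', 'e', 'a'],
                   'ID': [23, 23, 7, 7, 12, 23],
                   'Guess': [5, 8, 7, 4, 9, 3],
                   'Date': [2019, 2018, 2019, 2018, 2018, 2020],
                   'Topic': [1, 1, 1, 1, 1, 2]})

# sort so that guesses within each (ID, Topic) pair are in chronological order,
# then subtract the previous guess using a vectorized groupby + shift
df = df.sort_values(['ID', 'Topic', 'Date'])
df['Delta'] = df['Guess'] - df.groupby(['ID', 'Topic'])['Guess'].shift(1)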
I have five datasets that I have added a 'Year' column to, like this:
newyork2014['Year'] = 2014
newyork2015['Year'] = 2015
newyork2016['Year'] = 2016
newyork2017['Year'] = 2017
newyork2018['Year'] = 2018
However, I'm wondering if there's a more Pythonic way of doing this, perhaps with a function? I don't want to change the actual dataframe into a string though, but I want to "stringify" the name of the dataframe. Here's what I was thinking:
def get_year(df):
    df['Year'] = ...  # the last four digits of the dataframe's name
    return df
You may need to adjust things a little when you create the dataframe: you need to assign it a name.
newyork2014.name = 'newyork2014'

def get_year(df):
    df['Year'] = df.name[-4:]
    return df

get_year(newyork2014)
Out[42]:
ID Col1 Col2 New Year
2018-06-01 A 10 100 0.5 2014
2018-06-02 B 5 25 2.1 2014
2018-06-03 A 25 25 0.6 2014
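As an alternative sketch (not part of the answer above), keeping the frames in a dictionary keyed by name avoids relying on the .name attribute, which pandas does not preserve through most operations; this assumes the five DataFrames from the question already exist:

# hypothetical: the five yearly frames collected in a dict keyed by name
frames = {'newyork2014': newyork2014,
          'newyork2015': newyork2015,
          'newyork2016': newyork2016,
          'newyork2017': newyork2017,
          'newyork2018': newyork2018}

for name, frame in frames.items():
    frame['Year'] = int(name[-4:])  # last four digits of the name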
I've got a dataframe with market data and one column dedicated to daily returns.
I'm having a hard time creating a portfolio that starts at $100,000.00 in value and compounds its return through the life of the data series.
Ideally, I'd like to compute the 'portfolio' column using pandas but I'm having trouble doing so. See below target output. Thank you.
index date index return portfolio
0 19900101 2000 Nan 100000.00
1 19900102 2002 0.001 100100.00
2 19900103 2020 0.00899 100999.90
3 19900104 2001 -0.00941 100049.49
By using cumprod:
df['P'] = df['return'].add(1).fillna(1).cumprod() * 100000
df
Out[843]:
index date index.1 return portfolio P
0 0 19900101 2000 NaN 100000.00 100000.00000
1 1 19900102 2002 0.00100 100100.00 100100.00000
2 2 19900103 2020 0.00899 100999.90 100999.89900
3 3 19900104 2001 -0.00941 100049.49 100049.48995
Some adjustments:
df = df.replace('Nan', np.nan)
df['return'] = pd.to_numeric(df['return'])
starting_value = 100000
df = df.assign(portfolio=(1 + df['return'].fillna(0)).cumprod().mul(starting_value))
>>> df
index date index.1 return portfolio
0 0 19900101 2000 NaN 100000.00000
1 1 19900102 2002 0.00100 100100.00000
2 2 19900103 2020 0.00899 100999.89900
3 3 19900104 2001 -0.00941 100049.48995
To visualize what is happening, cumprod is calculating compounded returns, e.g. cum_r3 = (1 + r1) * (1 + r2) * (1 + r3).
>>> (1 + df['return'].fillna(0)).cumprod()
0 1.000000
1 1.001000
2 1.009999
3 1.000495
Name: return, dtype: float64
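Putting it together, a minimal self-contained sketch; the data values are assumed from the table above:

import numpy as np
import pandas as pd

# hypothetical reconstruction of the market data
df = pd.DataFrame({'date': [19900101, 19900102, 19900103, 19900104],
                   'index': [2000, 2002, 2020, 2001],
                   'return': [np.nan, 0.001, 0.00899, -0.00941]})

starting_value = 100000
# compound the returns (treating the missing first return as 0) and scale by the starting value
df['portfolio'] = (1 + df['return'].fillna(0)).cumprod() * starting_value
print(df)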