I have the following dataframe:
import numpy as np
import pandas as pd
dates = pd.date_range(start='2014-01-01',end='2018-01-01', freq='Y')
df = pd.DataFrame(5*np.eye(4,), index=dates, columns=['Var1', 'Var2', 'Var3', 'Var4'])
print(df)
Var1 Var2 Var3 Var4
2014-12-31 5.0 0.0 0.0 0.0
2015-12-31 0.0 5.0 0.0 0.0
2016-12-31 0.0 0.0 5.0 0.0
2017-12-31 0.0 0.0 0.0 5.0
I would like to compute the NPV of each variable over a 3-year window starting in each of 2014 and 2015.
Right now I only know how to obtain the present value for one variable and one row at a time:
Var1_2014 = df.loc['2014':'2016', 'Var1'].tolist()
NPV_Var1_2014 = np.npv(0.7, [0] + Var1_2014)
However, I do not know how to vectorize the function so it computes the entire column directly. I would like to obtain something like this:
Var1 Var2 Var3 Var4 Var1_NPV
2014-12-31 5.0 0.0 0.0 0.0 a
2015-12-31 0.0 5.0 0.0 0.0 b
2016-12-31 0.0 0.0 5.0 0.0 NaN
2017-12-31 0.0 0.0 0.0 5.0 NaN
where I could write something like df['Var1_NPV'] = npv('Var1', duration='3years', discount_rate=0.7).
Any idea how I could vectorize that function efficiently?
Many thanks,
I found a solution with apply and a DateOffset:
def give_npv(date, df, var, wacc):
    # window covers the row's year plus the next two (3 years in total)
    date2 = date + pd.DateOffset(years=2)
    data = df.loc[date:date2, var].tolist()
    # the leading 0 shifts the cash flows so the first year is discounted once
    NPV_var = np.npv(wacc, [0] + data)
    return NPV_var
df['index2'] = df.index
df['test'] = df.apply(lambda x: give_npv(x['index2'], df, 'Var2', 0.07), axis=1)
print(df)
Var1 Var2 Var3 Var4 index2 test
2014-12-31 5.0 0.0 0.0 0.0 2014-12-31 4.367194
2015-12-31 0.0 5.0 0.0 0.0 2015-12-31 4.672897
2016-12-31 0.0 0.0 5.0 0.0 2016-12-31 0.000000
2017-12-31 0.0 0.0 0.0 5.0 2017-12-31 0.000000
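Two notes on this solution. First, np.npv was deprecated in NumPy 1.18 and removed in 1.20; the same function now lives in the numpy-financial package as numpy_financial.npv. Second, the helper column and row-wise apply can be avoided with a reversed rolling window. A minimal vectorized sketch, assuming a fixed 3-year window and the same discounting as np.npv(wacc, [0] + data), where year t is discounted by (1 + wacc)**(t + 1):
import numpy as np

wacc, window = 0.07, 3
# discount factors for years 1..window; the leading 0 cash flow in the
# original call pushes every flow one period into the future
disc = 1.0 / (1.0 + wacc) ** np.arange(1, window + 1)

# reverse the column so a trailing rolling window looks forward in time,
# take the discounted sum, then reverse back into chronological order
df['Var2_NPV'] = (
    df['Var2'][::-1]
      .rolling(window, min_periods=1)
      .apply(lambda w: np.dot(w[::-1], disc[:len(w)]), raw=True)[::-1]
)
On the sample frame this reproduces the apply-based column above: 4.367194, 4.672897, 0.0, 0.0.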
I have a dataframe which I pivoted, and I now want to select specific rows from the data. I have seen similar questions such as this one: Selecting columns in a pandas pivot table based on specific row value?. In my case I want to return all the columns but select only specific rows.
timestamp,value
2008-03-01 00:00:00,55.0
2008-03-01 00:15:00,20.0
2008-03-01 00:30:00,13.0
2008-03-01 00:45:00,78.0
2008-03-01 01:00:00,34.0
2008-03-01 01:15:00,123.0
2008-03-01 01:30:00,25.0
2008-03-01 01:45:00,91.0
2008-03-02 00:00:00,55.0
2008-03-02 00:15:00,46.0
2008-03-02 00:30:00,66.0
2008-03-02 00:45:00,24.0
2008-03-02 01:00:00,70.0
2008-03-02 01:15:00,32.0
2008-03-02 01:30:00,15.0
2008-03-02 01:45:00,92.0
I have done the following to generate the output below:
import pandas as pd
import numpy as np
from datetime import datetime
df = pd.read_csv('df.csv')
df.timestamp = pd.to_datetime(df.timestamp)
df = df.set_index('timestamp')
df['date'] = df.index.map(lambda t: t.date())
df['time'] = df.index.map(lambda t: t.time())
df_pivot = pd.pivot_table(df, values='value', index='timestamp', columns='time')
df_pivot = df_pivot.fillna(0.0)
print(df_pivot)
Generated output
time 00:00:00 00:15:00 00:30:00 00:45:00 01:00:00 01:15:00 01:30:00 01:45:00
timestamp
2008-03-01 00:00:00 55.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2008-03-01 00:15:00 0.0 20.0 0.0 0.0 0.0 0.0 0.0 0.0
2008-03-01 00:30:00 0.0 0.0 13.0 0.0 0.0 0.0 0.0 0.0
2008-03-01 00:45:00 0.0 0.0 0.0 78.0 0.0 0.0 0.0 0.0
2008-03-01 01:00:00 0.0 0.0 0.0 0.0 34.0 0.0 0.0 0.0
2008-03-01 01:15:00 0.0 0.0 0.0 0.0 0.0 123.0 0.0 0.0
2008-03-01 01:30:00 0.0 0.0 0.0 0.0 0.0 0.0 25.0 0.0
2008-03-01 01:45:00 0.0 0.0 0.0 0.0 0.0 0.0 0.0 91.0
2008-03-02 00:00:00 55.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2008-03-02 00:15:00 0.0 46.0 0.0 0.0 0.0 0.0 0.0 0.0
2008-03-02 00:30:00 0.0 0.0 66.0 0.0 0.0 0.0 0.0 0.0
2008-03-02 00:45:00 0.0 0.0 0.0 24.0 0.0 0.0 0.0 0.0
2008-03-02 01:00:00 0.0 0.0 0.0 0.0 70.0 0.0 0.0 0.0
2008-03-02 01:15:00 0.0 0.0 0.0 0.0 0.0 32.0 0.0 0.0
2008-03-02 01:30:00 0.0 0.0 0.0 0.0 0.0 0.0 15.0 0.0
2008-03-02 01:45:00 0.0 0.0 0.0 0.0 0.0 0.0 0.0 92.0
I want to select, e.g., only the data for 2008-03-01 00:00:00, 2008-03-01 01:15:00, and 2008-03-02 01:00:00.
Expected output
time 00:00:00 00:15:00 00:30:00 00:45:00 01:00:00 01:15:00 01:30:00 01:45:00
timestamp
2008-03-01 00:00:00 55.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2008-03-01 01:15:00 0.0 0.0 0.0 0.0 0.0 123.0 0.0 0.0
2008-03-02 01:00:00 0.0 0.0 0.0 0.0 70.0 0.0 0.0 0.0
How can I do that?
Use a list of datetimes converted by to_datetime and select with DataFrame.loc:
#create DatetimeIndex
df = pd.read_csv('df.csv', index_col='timestamp', parse_dates=['timestamp'])
#used pandas methods
df['date'] = df.index.date
df['time'] = df.index.time
#added fill_value parameter
df_pivot = pd.pivot_table(df,values='value',index='timestamp',columns='time',fill_value=0)
L = ['2008-03-01 00:00:00','2008-03-01 01:15:00','2008-03-02 01:00:00']
df = df_pivot.loc[pd.to_datetime(L)]
print (df)
time 00:00:00 00:15:00 00:30:00 00:45:00 01:00:00 \
2008-03-01 00:00:00 55 0 0 0 0
2008-03-01 01:15:00 0 0 0 0 0
2008-03-02 01:00:00 0 0 0 0 70
time 01:15:00 01:30:00 01:45:00
2008-03-01 00:00:00 0 0 0
2008-03-01 01:15:00 123 0 0
2008-03-02 01:00:00 0 0 0
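As a side note, if some of the requested timestamps might be absent from the pivoted index, DataFrame.loc raises a KeyError. A small sketch using reindex instead (same list L as above), which keeps the requested order and fills missing rows:
# reindex returns a row of fill_value for any timestamp that is not in the
# index instead of raising, and preserves the order of the requested labels
df = df_pivot.reindex(pd.to_datetime(L), fill_value=0)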
Suppose we have a list A which contains three dataframes df_a, df_b, and df_c:
A = [df_a, df_b, df_c]
df_a =
morning noon night
date
2019-12-31 B 3.0 3.0 0.0
C 0.0 0.0 1.0
D 0.0 1.0 0.0
E 142.0 142.0 142.0
df_b =
morning noon night
date
2020-01-31 A 3.0 0.0 0.0
B 1.0 0.0 0.0
E 142.0 145.0 145.0
df_c =
morning noon night
date
2020-02-29 F 145.0 145.0 145.0
All dataframes have morning, noon, and night columns and share the same index structure: a date paired with a label from [A, B, C, D, E, F]. I want to concatenate the three dataframes into one dataframe (let's say full_df) in which every date has the same set of rows/indexes.
But as you can see, each dataframe has a different number of rows: df_a, df_b, and df_c have [B, C, D, E], [A, B, E], and [F] respectively.
Is there some way to concat these dataframes so that each date gets all the unique labels from the three combined, returning 0.0 where the corresponding label is not present in the original dataframe?
This is what I had in mind for full_df:
full_df =
morning noon night
date
2019-12-31 A 0.0 0.0 0.0
B 3.0 3.0 0.0
C 0.0 0.0 1.0
D 0.0 1.0 0.0
E 142.0 142.0 142.0
F 0.0 0.0 0.0
2020-01-31 A 3.0 0.0 0.0
B 1.0 0.0 0.0
C 0.0 0.0 0.0
D 0.0 0.0 0.0
E 142.0 145.0 145.0
F 0.0 0.0 0.0
2020-02-29 A 0.0 0.0 0.0
B 0.0 0.0 0.0
C 0.0 0.0 0.0
D 0.0 0.0 0.0
E 0.0 0.0 0.0
F 145.0 145.0 145.0
You can try:
pd.concat(A).unstack(level=-1, fill_value=0).stack()
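A self-contained sketch reconstructing the three sample frames to show the one-liner in action (the MultiIndex construction below is an assumption based on the displays above):
import pandas as pd

def make_df(date, labels, rows):
    # two-level index: the frame's single date paired with each label
    idx = pd.MultiIndex.from_product([[pd.Timestamp(date)], labels],
                                     names=['date', None])
    return pd.DataFrame(rows, index=idx, columns=['morning', 'noon', 'night'])

df_a = make_df('2019-12-31', list('BCDE'),
               [[3.0, 3.0, 0.0], [0.0, 0.0, 1.0],
                [0.0, 1.0, 0.0], [142.0, 142.0, 142.0]])
df_b = make_df('2020-01-31', list('ABE'),
               [[3.0, 0.0, 0.0], [1.0, 0.0, 0.0], [142.0, 145.0, 145.0]])
df_c = make_df('2020-02-29', ['F'], [[145.0, 145.0, 145.0]])

A = [df_a, df_b, df_c]
# unstack moves the label level to the columns, filling 0 for missing
# combinations; stack then restores the full date x label cross product
full_df = pd.concat(A).unstack(level=-1, fill_value=0).stack()
full_df then matches the expected layout above, with every date carrying all six labels.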
I have two date columns, Start Date and End Date, and I want to find the year and the number of days in each month between those two dates. I can find the year, but I have no idea how to find the number of days in each month. Not sure if it is feasible to get this output.
import pandas as pd
from pandas import DataFrame
df = {'Id': ['1','2','3','4','5'],
'Item': ['A','B','C','D','E'],
'StartDate': ['2019-12-10', '2019-12-01', '2019-01-01', '2019-05-10', '2019-03-10'],
'EndDate': ['2019-12-30' ,'2019-12-31','2019-03-30','2019-11-30','2019-06-10']
}
df = DataFrame(df,columns= ['Id', 'Item','StartDate','EndDate'])
Expected O/P:
I came up with a solution using pd.date_range and resample. You need to convert both the StartDate and EndDate columns to datetime dtype:
df['StartDate'] = pd.to_datetime(df['StartDate'])
df['EndDate'] = pd.to_datetime(df['EndDate'])
def days_of_month(x):
    # daily range between StartDate and EndDate (inclusive)
    s = pd.date_range(*x, freq='D').to_series()
    # count days per calendar month; relabel each month-end by its month number
    return s.resample('M').count().rename(lambda x: x.month)
df1 = df[['StartDate', 'EndDate']].apply(days_of_month, axis=1).fillna(0)
Out[1036]:
1 2 3 4 5 6 7 8 9 10 11 12
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 21.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 31.0
2 31.0 28.0 30.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 22.0 30.0 31.0 31.0 30.0 31.0 30.0 0.0
4 0.0 0.0 22.0 30.0 31.0 10.0 0.0 0.0 0.0 0.0 0.0 0.0
Finally, join back to the original dataframe:
df_final = df[['StartDate', 'EndDate']].join([df['StartDate'].dt.year.rename('Year'), df1])
Out[1042]:
StartDate EndDate Year 1 2 3 4 5 6 7 8 \
0 2019-12-10 2019-12-30 2019 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 2019-12-01 2019-12-31 2019 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 2019-01-01 2019-03-30 2019 31.0 28.0 30.0 0.0 0.0 0.0 0.0 0.0
3 2019-05-10 2019-11-30 2019 0.0 0.0 0.0 0.0 22.0 30.0 31.0 31.0
4 2019-03-10 2019-06-10 2019 0.0 0.0 22.0 30.0 31.0 10.0 0.0 0.0
9 10 11 12
0 0.0 0.0 0.0 21.0
1 0.0 0.0 0.0 31.0
2 0.0 0.0 0.0 0.0
3 30.0 31.0 30.0 0.0
4 0.0 0.0 0.0 0.0
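If the expected output should carry month-name headers rather than month numbers (the expected O/P image is not reproduced here), the numeric columns can be relabeled afterwards; a small sketch, assuming three-letter abbreviations in calendar order:
import calendar

# map month numbers 1..12 to their upper-case three-letter abbreviations
df_final = df_final.rename(
    columns={m: calendar.month_abbr[m].upper() for m in range(1, 13)})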
Solution
You can employ a combination of vectorization with pandas and numpy to achieve this, as follows. The custom function is provided below for ease of use. Since it uses vectorization, it should be fairly fast.
Note: the assumptions used here, based on the sample data, are:
The date range spans only one year.
Both start and end dates fall in the same year.
If you have data from different years, you would need to apply this to each year's data separately, as in the sketch below. Also, if the start and end dates fall in different years, you will have to adapt this method for that. Since the problem presented here does not state that requirement, I leave this implementation as a guide for anyone interested in applying it to a multi-year dataset.
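A hedged sketch of that per-year application, assuming no single row spans two years (the grouping key is illustrative; note that process_dataframe below also hardcodes 2019 month lengths, so a leap-year-aware variant would derive month_ends from each group's year):
# hypothetical multi-year driver: split the frame by start year, process
# each year's rows separately, then concatenate the results
years = pd.to_datetime(df['StartDate']).dt.year
parts = [process_dataframe(group.copy()) for _, group in df.groupby(years)]
multi_year_df = pd.concat(parts, ignore_index=True)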
If you would like to try out this solution in a Jupyter notebook environment, you can access it on GitHub. It has a Google Colaboratory link as well, so you can also open it directly in a Colab notebook.
# Updated DataFrame
df = process_dataframe(df) # custom function
display(df.head())
Dummy Data and Custom Function
Tested with pandas==0.25.3 and numpy==1.17.4 in the Google Colab environment.
import numpy as np
import pandas as pd
from IPython.display import display
# Dummy Data
df = {'Id': ['1','2','3','4','5'],
'Item': ['A','B','C','D','E'],
'StartDate': ['2019-12-10', '2019-12-01', '2019-01-01', '2019-05-10', '2019-03-10'],
'EndDate': ['2019-12-30' ,'2019-12-31','2019-03-30','2019-11-30','2019-06-10']
}
df = pd.DataFrame(df,columns= ['Id', 'Item','StartDate','EndDate'])
# Function for processing the DataFrame
def process_dataframe(df):
    """Returns the updated dataframe."""
    df.StartDate = pd.to_datetime(df.StartDate)
    df.EndDate = pd.to_datetime(df.EndDate)
    # 2019 month-ends give the month names and the number of days per month
    month_ends = pd.date_range(start='2019-01', freq='M', periods=12)
    month_headers = month_ends.month_name().str.upper().str[:3].tolist()
    month_days = month_ends.day.to_numpy()
    month_nums = np.arange(12) + 1
    # Evaluate these expressions once rather than repeatedly
    start_date_month_num = df.StartDate.dt.month.to_numpy().reshape(-1, 1)
    end_date_month_num = df.EndDate.dt.month.to_numpy().reshape(-1, 1)
    # Number of days of the end month that fall after the end date
    end_month_days_excluded = (month_days[df.EndDate.dt.month.to_numpy() - 1]
                               - df.EndDate.dt.day.to_numpy())
    # Determine the months that fall within the start and end dates
    # (inclusive of start and end months) and add all days of those months
    in_range = (start_date_month_num <= month_nums) & (end_date_month_num >= month_nums)
    result = in_range.astype(int) * month_days.reshape(1, -1)
    # Subtract the days of the starting month that precede the start date
    result = result - (start_date_month_num == month_nums).astype(int) \
        * (df.StartDate.dt.day.to_numpy() - 1).reshape(-1, 1)
    # Subtract the days of the ending month that follow the end date
    result = result - (end_date_month_num == month_nums).astype(int) \
        * end_month_days_excluded.reshape(-1, 1)
    return pd.merge(df, pd.DataFrame(result, columns=month_headers),
                    left_index=True, right_index=True)
Original DataFrame:
# Original DataFrame
display(df.head())
Output:
Is it possible to add missing date columns from a created date_range to a grouped dataframe df without a for loop, filling zeros as missing values?
date_range has 7 date elements and df has 4 date columns, so how do I add the 3 missing columns to df?
import pandas as pd
from datetime import datetime
start = datetime(2018,6,4, )
end = datetime(2018,6,10,)
date_range = pd.date_range(start=start, end=end, freq='D')
DatetimeIndex(['2018-06-04', '2018-06-05', '2018-06-06', '2018-06-07',
'2018-06-08', '2018-06-09', '2018-06-10'],
dtype='datetime64[ns]', freq='D')
df = pd.DataFrame({
'date':
['2018-06-07', '2018-06-10', '2018-06-09','2018-06-09',
'2018-06-08','2018-06-09','2018-06-08','2018-06-10',
'2018-06-10','2018-06-10',],
'name':
['sogan', 'lyam','alex','alex',
'kovar','kovar','kovar','yamo','yamo','yamo',]
})
df['date'] = pd.to_datetime(df['date'])
df = (df
    .groupby(['name', 'date'])[['date']]
    .count()
    .unstack(fill_value=0)
)
df
date date date date
date 2018-06-07 00:00:00 2018-06-08 00:00:00 2018-06-09 00:00:00 2018-06-10 00:00:00
name
alex 0 0 2 0
kovar 0 2 1 0
lyam 0 0 0 1
sogan 1 0 0 0
yamo 0 0 0 3
I would pivot the table to make the date columns rows, then use pandas' .asfreq function:
DataFrame.asfreq(freq, method=None, how=None, normalize=False, fill_value=None)
source:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.asfreq.html
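A minimal sketch of that idea, assuming the dates already sit on the row index as a DatetimeIndex (df_pivoted is a hypothetical name for the pivoted frame). One caveat: asfreq only fills gaps between the first and last existing dates, so dates outside that span (2018-06-04 through 2018-06-06 here) still require reindex, as in the follow-up below:
# fill in the missing calendar days between the existing first and last dates
df_daily = df_pivoted.asfreq('D', fill_value=0)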
Thanks to Sina Shabani for the clue about making the date columns rows. In this situation, setting the date as the index and using .reindex turned out to be more suitable:
df = (df.groupby(['date', 'name'])['name']
.size()
.reset_index(name='count')
.pivot(index='date', columns='name', values='count')
.fillna(0))
df
name alex kovar lyam sogan yamo
date
2018-06-07 0.0 0.0 0.0 1.0 0.0
2018-06-08 0.0 2.0 0.0 0.0 0.0
2018-06-09 2.0 1.0 0.0 0.0 0.0
2018-06-10 0.0 0.0 1.0 0.0 3.0
df.index = pd.DatetimeIndex(df.index)
df = (df.reindex(pd.date_range(start, freq='D', periods=7), fill_value=0)
.sort_index())
df
name alex kovar lyam sogan yamo
2018-06-04 0.0 0.0 0.0 0.0 0.0
2018-06-05 0.0 0.0 0.0 0.0 0.0
2018-06-06 0.0 0.0 0.0 0.0 0.0
2018-06-07 0.0 0.0 0.0 1.0 0.0
2018-06-08 0.0 2.0 0.0 0.0 0.0
2018-06-09 2.0 1.0 0.0 0.0 0.0
2018-06-10 0.0 0.0 1.0 0.0 3.0
df.T
      2018-06-04  2018-06-05  2018-06-06  2018-06-07  2018-06-08  2018-06-09  2018-06-10
name
alex         0.0         0.0         0.0         0.0         0.0         2.0         0.0
kovar        0.0         0.0         0.0         0.0         2.0         1.0         0.0
lyam         0.0         0.0         0.0         0.0         0.0         0.0         1.0
sogan        0.0         0.0         0.0         1.0         0.0         0.0         0.0
yamo         0.0         0.0         0.0         0.0         0.0         0.0         3.0
I have data like this:
OwnerUserId Score
CreationDate
2015-01-01 00:16:46.963 1491895.0 0.0
2015-01-01 00:23:35.983 1491895.0 1.0
2015-01-01 00:30:55.683 1491895.0 1.0
2015-01-01 01:10:43.830 2141635.0 0.0
2015-01-01 01:11:08.927 1491895.0 1.0
2015-01-01 01:12:34.273 3297613.0 1.0
..........
This is a whole year of data with different users' scores, and I hope to get data like:
OwnerUserId 1491895.0 1491895.0 1491895.0 2141635.0 1491895.0
00:00 0.0 3.0 0.0 3.0 5.8
00:01 5.0 3.0 0.0 3.0 5.8
00:02 3.0 33.0 20.0 3.0 5.8
......
23:40 12.0 33.0 10.0 3.0 5.8
23:41 32.0 33.0 20.0 3.0 5.8
23:42 12.0 13.0 10.0 3.0 5.8
Each element of the dataframe is the score (mean or sum).
I have been trying the following:
pd.pivot_table(data_series.reset_index(),index=['CreationDate'],columns=['OwnerUserId'],
fill_value=0).resample('W').sum()['Score']
I get the result shown in the image.
I think you need:
#remove `[]` and add the values parameter to avoid a MultiIndex in the columns
df = pd.pivot_table(data_series.reset_index(),
index='CreationDate',
columns='OwnerUserId',
values='Score',
fill_value=0)
#truncate seconds and convert to a TimedeltaIndex
df.index = pd.to_timedelta(df.index.floor('T').strftime('%H:%M:%S'))
#or round to minutes
#df.index = pd.to_timedelta(df.index.round('T').strftime('%H:%M:%S'))
print (df)
OwnerUserId 1491895.0 2141635.0 3297613.0
00:16:00 0 0 0
00:23:00 1 0 0
00:30:00 1 0 0
01:10:00 0 0 0
01:11:00 1 0 0
01:12:00 0 0 1
idx = pd.timedelta_range('00:00:00', '23:59:00', freq='T')
#resample by minute, aggregate with sum, and use reindex to add missing rows
df = df.resample('T').sum().fillna(0).reindex(idx, fill_value=0)
print (df)
OwnerUserId 1491895.0 2141635.0 3297613.0
00:00:00 0.0 0.0 0.0
00:01:00 0.0 0.0 0.0
00:02:00 0.0 0.0 0.0
00:03:00 0.0 0.0 0.0
00:04:00 0.0 0.0 0.0
00:05:00 0.0 0.0 0.0
00:06:00 0.0 0.0 0.0
...
...