dataframe math in pandas - python

TOTALLY REWROTE ORIGINAL QUESTION
I read raw data from a csv file "CloseWeights4.csv":
df = pd.read_csv('CloseWeights4.csv')
Date Symbol ClosingPrice Weight
3/1/2010 OGDC 116.51 0.1820219
3/2/2010 OGDC 117.32 0.1820219
3/3/2010 OGDC 116.4 0.1820219
3/4/2010 OGDC 116.58 0.1820219
3/5/2010 OGDC 117.61 0.1820219
3/1/2010 WTI 78.7 0.5348142
3/2/2010 WTI 79.68 0.5348142
3/3/2010 WTI 80.87 0.5348142
3/4/2010 WTI 80.21 0.5348142
3/5/2010 WTI 81.5 0.5348142
3/1/2010 FX 85.07 0.1312427
3/2/2010 FX 85.1077 0.1312427
3/3/2010 FX 85.049 0.1312427
3/4/2010 FX 84.9339 0.1312427
3/5/2010 FX 84.8 0.1312427
3/1/2010 PIB 98.1596499 0.1519211
3/2/2010 PIB 98.1596499 0.1519211
3/3/2010 PIB 98.1764222 0.1519211
3/4/2010 PIB 98.1770656 0.1519211
3/5/2010 PIB 98.1609364 0.1519211
From this I generate a dataframe df2:
df2=df.iloc[:,0:3].pivot('Date', 'Symbol', 'ClosingPrice')
df2
Out[10]:
Symbol FX OGDC PIB WTI
Date
2010-03-01 85.0700 116.51 98.159650 78.70
2010-03-02 85.1077 117.32 98.159650 79.68
2010-03-03 85.0490 116.40 98.176422 80.87
2010-03-04 84.9339 116.58 98.177066 80.21
2010-03-05 84.8000 117.61 98.160936 81.50
From this I calculate returns using:
ret=np.log(df2/df2.shift(1))
In [12] ret
Out[12]:
Symbol FX OGDC PIB WTI
Date
2010-03-01 NaN NaN NaN NaN
2010-03-02 0.000443 0.006928 0.000000 0.012375
2010-03-03 -0.000690 -0.007873 0.000171 0.014824
2010-03-04 -0.001354 0.001545 0.000007 -0.008195
2010-03-05 -0.001578 0.008796 -0.000164 0.015955
I have the weights of each security, taken from df:
df3 = df.iloc[:, [1, 3]].drop_duplicates().set_index('Symbol')
df3
Out[14]:
Weight
Symbol
OGDC 0.182022
WTI 0.534814
FX 0.131243
PIB 0.151921
I am trying to get the following weighted return results for each day but don't know how to do the math in pandas:
Date Portfolio_weighted_returns
2010-03-02 0.008174751
2010-03-03 0.006061657
2010-03-04 -0.005002414
2010-03-05 0.009058151
where the Portfolio_weighted_returns of 2010-03-02 is calculated as follows:
0.006928*0.182022+.012375*0.534814+0.000443*0.131243+0*0.151921 = 0.007937512315
I then need to have these results multiplied by a decay factor, where the decay factor is defined as decFac = decay^t. Using decay = 0.5 gives decFac values of:
Date decFac
2010-03-02 0.0625
2010-03-03 0.125
2010-03-04 0.25
2010-03-05 0.5
I then need to take the SQRT of the sum of the squared Portfolio_weighted_returns for each day multiplied by the respective decFac as such:
SQRT(Sum(0.008174751^2*0.0625 + 0.006061657^2*0.125 + (-0.005002414)^2*0.25 + 0.009058151^2*0.5)) = 0.007487
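For reference, the two calculations above written out in plain Python, using the numbers quoted in the question:
import numpy as np

# weighted portfolio return for 2010-03-02
w = np.array([0.182022, 0.534814, 0.131243, 0.151921])   # OGDC, WTI, FX, PIB
r = np.array([0.006928, 0.012375, 0.000443, 0.0])
print(w @ r)                                              # ~0.0079375

# decay-weighted volatility over the four days
p = np.array([0.008174751, 0.006061657, -0.005002414, 0.009058151])
dec = np.array([0.0625, 0.125, 0.25, 0.5])
print(np.sqrt((p ** 2 * dec).sum()))                      # ~0.007487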

IIUC you can do it this way:
In [267]: port_ret = ret.dot(df3)
In [268]: port_ret
Out[268]:
Weight
Date
2010-03-01 NaN
2010-03-02 0.007938
2010-03-03 0.006431
2010-03-04 -0.004278
2010-03-05 0.009902
In [269]: decay = 0.5
In [270]: decay_df = pd.DataFrame({'decFac':decay**np.arange(len(ret), 0, -1)}, index=ret.index)
In [271]: decay_df
Out[271]:
decFac
Date
2010-03-01 0.03125
2010-03-02 0.06250
2010-03-03 0.12500
2010-03-04 0.25000
2010-03-05 0.50000
In [272]: (port_ret.Weight**2 * decay_df.decFac).sum() ** 0.5
Out[272]: 0.007918790111274962
In [277]: port_ret.Weight**2 * decay_df.decFac
Out[277]:
Date
2010-03-01 NaN
2010-03-02 0.000004
2010-03-03 0.000005
2010-03-04 0.000005
2010-03-05 0.000049
dtype: float64
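For reference, here is the whole approach above collected into one short script. This is only a sketch, assuming the CSV reads as in the question (parse_dates is added so the Date index sorts chronologically):
import numpy as np
import pandas as pd

df = pd.read_csv('CloseWeights4.csv', parse_dates=['Date'])

# Date x Symbol table of closing prices, then daily log returns
df2 = df.pivot(index='Date', columns='Symbol', values='ClosingPrice')
ret = np.log(df2 / df2.shift(1))

# weights indexed by Symbol so that .dot() aligns with the columns of ret
df3 = df[['Symbol', 'Weight']].drop_duplicates().set_index('Symbol')
port_ret = ret.dot(df3)

# decay factors: most recent day gets decay**1, oldest gets decay**len(ret)
decay = 0.5
dec_fac = pd.Series(decay ** np.arange(len(ret), 0, -1), index=ret.index)

result = (port_ret['Weight'] ** 2 * dec_fac).sum() ** 0.5   # ~0.0079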

import numpy as np
import pandas as pd
Define the variables:
data = np.mat(''' 85.0700 116.51 98.159650 78.70;
85.1077 117.32 98.159650 79.68;
85.0490 116.40 98.176422 80.87;
84.9339 116.58 98.177066 80.21;
84.8000 117.61 98.160936 81.50''')
cols = ['FX', 'OGDC' , 'PIB' , 'WTI']
dts = pd.Series( data=pd.date_range('2010-03-01', '2010-03-05'), name='Date' )
df2 = pd.DataFrame( data=data, columns=cols, index=dts )
# this is your df3 variable
wgt = pd.DataFrame( data=[0.131243, 0.182022, 0.151921, 0.534814], index=pd.Series(cols, name='Symbol') , columns=['Weight'] )
To calculate daily returns I use the .shift operator
# Calculate the daily returns for each security
df_ret = np.log( df2 / df2.shift(1) )
# FX OGDC PIB WTI
# Date
# 2010-03-01 NaN NaN NaN NaN
# 2010-03-02 0.000443 0.006928 0.000000 0.012375
# 2010-03-03 -0.000690 -0.007873 0.000171 0.014824
# 2010-03-04 -0.001354 0.001545 0.000007 -0.008195
# 2010-03-05 -0.001578 0.008796 -0.000164 0.015955
You need to multiply the Weight column of wgt with df_ret to get the desired result. wgt['Weight'] returns a pd.Series, which behaves like a 1-D array (whereas a pd.DataFrame is generally thought of as 2-D), so the multiplication broadcasts across the columns of df_ret, aligning on the Symbol labels.
df_wgt_ret = wgt['Weight'] * df_ret
# FX OGDC PIB WTI
# Date
# 2010-03-01 NaN NaN NaN NaN
# 2010-03-02 0.000081 0.003705 0.000000e+00 0.001880
# 2010-03-03 -0.000126 -0.004210 2.242285e-05 0.002252
# 2010-03-04 -0.000247 0.000826 8.609014e-07 -0.001245
# 2010-03-05 -0.000287 0.004704 -2.156434e-05 0.002424
Sum over the columns (axis=1) to get the portfolio returns. Note this returns a pd.Series, not a DataFrame.
port_ret = df_wgt_ret.sum(axis=1)
# Date
# 2010-03-01 NaN
# 2010-03-02 0.005666
# 2010-03-03 -0.002061
# 2010-03-04 -0.000664
# 2010-03-05 0.006820
Finally, multiply the squared portfolio returns by the decay factors and take the square root of the sum. Because the returns are indexed by Date, the decay factors need to be a pd.Series on the same Date index (here called sr_dec, built from the decFac definition in the question):
sr_dec = pd.Series(0.5 ** np.arange(len(port_ret), 0, -1), index=port_ret.index)
total_ret = (port_ret ** 2 * sr_dec).sum()
final_res = total_ret ** 0.5
The one-liner
I'm assuming decFac is a DataFrame with a column named decFac, and I'm using df3 and ret as defined in the question.
result = (( (df3.Weight * ret).sum(axis=1)**2 * decFac.decFac ).sum())**.5
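The one-liner assumes decFac already exists; a minimal sketch of building it from the decay definition in the question (the most recent day gets the largest factor), assuming ret as defined there:
import numpy as np
decay = 0.5
decFac = pd.DataFrame({'decFac': decay ** np.arange(len(ret), 0, -1)}, index=ret.index)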

Related

Subtracting value from column gives NaN only

I have a csv file with multiple columns and I want to subtract the values of columns X31-X27, Y31-Y27 and Z31-Z27 within the same dataframe, but the subtraction only gives me NaN values.
The csv values and the resulting output were shown as screenshots in the original post.
Help me figure out this problem.
import pandas as pd
import os
import numpy as np
df27 = pd.read_csv('D:27.txt', names=['No27','X27','Y27','Z27','Date27','Time27'], sep='\s+')
df28 = pd.read_csv('D:28.txt', names=['No28','X28','Y28','Z28','Date28','Time28'], sep='\s+')
df29 = pd.read_csv('D:29.txt', names=['No29','X29','Y29','Z29','Date29','Time29'], sep='\s+')
df30 = pd.read_csv('D:30.txt', names=['No30','X30','Y30','Z30','Date30','Time30'], sep='\s+')
df31 = pd.read_csv('D:31.txt', names=['No31','X31','Y31','Z31','Date31','Time31'], sep='\s+')
total=pd.concat([df27,df28,df29,df30,df31], axis=1)
total.to_csv('merge27-31.csv', index = False)
print(total)
df2731 = pd.read_csv('C:\\Users\\finalmerge27-31.csv')
df2731.reset_index(inplace=True)
print(df2731)
df227 = df2731[['X31', 'Y31', 'Z31']] - df2731[['X27', 'Y27', 'Z27']]
print(df227)
# input data
df = pd.DataFrame({'x27':[-1458.88, 181.78, 1911.84, 3739.3, 5358.19], 'y27':[-5885.8, -5878.1,-5786.5,-5735.7, -5545.6],
'z27':[1102,4139,4616,4108,1123], 'x31':[-1458, 181, 1911, np.nan, 5358], 'y31':[-5885, -5878, -5786, np.nan, -5554],
'z31':[1102,4138,4616,np.nan,1123]})
df
x27 y27 z27 x31 y31 z31
0 -1458.88 -5885.8 1102 -1458.0 -5885.0 1102.0
1 181.78 -5878.1 4139 181.0 -5878.0 4138.0
2 1911.84 -5786.5 4616 1911.0 -5786.0 4616.0
3 3739.30 -5735.7 4108 NaN NaN NaN
4 5358.19 -5545.6 1123 5358.0 -5554.0 1123.0
# df1 and df2 are assumed to be the two column blocks; the output below
# reflects subtracting the *31 columns from the *27 columns
df1 = df[['x27', 'y27', 'z27']]
df2 = df[['x31', 'y31', 'z31']]
pd.DataFrame(df1.values - df2.values).rename(columns={0: 'x27-x31', 1: 'y27-y31', 2: 'z27-z31'})
Out:
   x27-x31  y27-y31  z27-z31
0 -0.88 -0.8 0.0
1 0.78 -0.1 1.0
2 0.84 -0.5 0.0
3 NaN NaN NaN
4 0.19 8.4 0.0
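The all-NaN result in the question comes from pandas aligning on column labels: X31 and X27 are different labels, so every element of the label-aligned subtraction is missing. Dropping to .values, as above, sidesteps the alignment. An alternative sketch that stays within pandas (column names as in the question's df2731) is to make the labels match before subtracting:
# rename the *27 block so its labels line up with the *31 block
sub = df2731[['X27', 'Y27', 'Z27']].rename(columns={'X27': 'X31', 'Y27': 'Y31', 'Z27': 'Z31'})
df227 = (df2731[['X31', 'Y31', 'Z31']] - sub).rename(
    columns={'X31': 'X31-X27', 'Y31': 'Y31-Y27', 'Z31': 'Z31-Z27'})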

merge pandas dataframe causing duplicate row values

I have two pandas dataframe that I want to merge. My first dataframe, names, is a list of stock tickers and corresponding dates. Example below:
Date Symbol DateSym
0 2017-01-05 AGRX AGRX01-05-2017
1 2017-01-05 TMDI TMDI01-05-2017
2 2017-01-06 ATHE ATHE01-06-2017
3 2017-01-06 AVTX AVTX01-06-2017
4 2017-01-09 CVM CVM01-09-2017
5 2017-01-10 DFFN DFFN01-10-2017
6 2017-01-10 VKTX VKTX01-10-2017
7 2017-01-11 BIOC BIOC01-11-2017
8 2017-01-11 CVM CVM01-11-2017
9 2017-01-11 PALI PALI01-11-2017
I created another dataframe, price1, that loops through the tickers and creates a dataframe with the open, high, low, close and other relevant info I need. When I merge the two dataframes together, I want to show only the names dataframe on the left with the corresponding price data on the right. When I ran a test on the first 10 tickers, I noticed that the combined dataframe outputs redundant rows (see CVM in rows 4 and 5 below), even though the price1 dataframe doesn't have duplicate values. What am I doing wrong?
import numpy as np
import pandas as pd
import yfinance as yf

def price_stats(df):
    # df['ticker'] = df
    df['Gap Up%'] = df['Open'] / df['Close'].shift(1) - 1
    df['HOD%'] = df['High'] / df['Open'] - 1
    df['Close vs Open%'] = df['Close'] / df['Open'] - 1
    df['Close%'] = df['Close'] / df['Close'].shift(1) - 1
    df['GU and Goes Red'] = np.where((df['Low'] < df['Close'].shift(1)) & (df['Open'] > df['Close'].shift(1)), 1, 0)
    df['Yday Intraday>30%'] = np.where((df['Close vs Open%'].shift(1) > .30), 1, 0)
    df['Gap Up?'] = np.where((df['Gap Up%'] > 0), 1, 0)
    df['Sloppy $ Vol'] = (df['High'] + df['Low'] + df['Close']) / 3 * df['Volume']
    df['Prev Day Sloppy $ Vol'] = (df['High'].shift(1) + df['Low'].shift(1) + df['Close'].shift(1)) / 3 * df['Volume'].shift(1)
    df['Prev Open'] = df['Open'].shift(1)
    df['Prev High'] = df['High'].shift(1)
    df['Prev Low'] = df['Low'].shift(1)
    df['Prev Close'] = df['Close'].shift(1)
    df['Prev Vol'] = df['Volume'].shift(1)
    df['D-2 Close'] = df['Close'].shift(2)
    df['D-2 Vol'] = df['Volume'].shift(2)
    df['D-3 Close'] = df['Close'].shift(3)
    df['D-2 Open'] = df['Open'].shift(2)
    df['D-2 High'] = df['High'].shift(2)
    df['D-2 Low'] = df['Low'].shift(2)
    df['D-2 Intraday Rnage'] = df['D-2 Close'] / df['D-2 Open'] - 1
    df['D-2 Close%'] = df['D-2 Close'] / df['D-3 Close'] - 1
    df.dropna(inplace=True)

vol_excel = pd.read_excel('C://U******.xlsx')
vol_excel = pd.read_excel('C://U******.xlsx')
names = vol_excel.Symbol.to_list()
price1 = []
price1 = pd.DataFrame(price1)
for name in names[0:10]:
    print(name)
    price = yf.download(name, start="2016-12-01", end="2022-03-04")
    price['ticker'] = name
    price_stats(price)
    price1 = pd.concat([price1, price])
price1 = price1.reset_index()
orig_day = pd.to_datetime(price1['Date'])
price1['Prev Day Date'] = orig_day - pd.tseries.offsets.CustomBusinessDay(1, holidays=nyse.holidays().holidays)
price1['DateSym'] = price1['ticker']+ price1['Date'].dt.strftime('%m-%d-%Y')
price1 = price1.rename(columns={'ticker':'Symbol'})
datesym = price1['DateSym']
price1.drop(labels=['DateSym'], axis=1,inplace = True)
price1.insert(0, 'DateSym', datesym)
vol_excel['DateSym'] = vol_excel['Symbol']+vol_excel['Date'].dt.strftime('%m-%d-%Y')
dfcombo = vol_excel.merge(price1,on=['Date','Symbol'],how='inner')
See how CVM is duplicated (rows 4 and 5) when I print out dfcombo:
Date Symbol DateSym_x DateSym_y Open High Low Close Adj Close Volume ... Prev Vol D-2 Close D-2 Vol D-3 Close D-2 Open D-2 High D-2 Low D-2 Intraday Rnage D-2 Close% Prev Day Date
0 2017-01-05 AGRX AGRX01-05-2017 AGRX01-05-2017 2.71 2.71 2.40 2.52 2.52 2408400 ... 18584900.0 5.000 2390400.0 5.700000 5.770 5.813000 4.460 -0.133449 -0.122807 2017-01-04
1 2017-01-05 TMDI TMDI01-05-2017 TMDI01-05-2017 15.60 16.50 12.90 13.50 13.50 43830 ... 114327.0 10.500 61543.0 7.200000 7.500 10.500000 7.500 0.400000 0.458333 2017-01-04
2 2017-01-06 ATHE ATHE01-06-2017 ATHE01-06-2017 2.58 2.60 2.23 2.42 2.42 222500 ... 1750700.0 1.930 53900.0 1.750000 1.790 1.950000 1.790 0.078212 0.102857 2017-01-05
3 2017-01-06 AVTX AVTX01-06-2017 AVTX01-06-2017 1.24 1.24 1.02 1.07 1.07 480500 ... 1246100.0 0.883 44900.0 0.890000 0.896 0.950000 0.827 -0.014509 -0.007865 2017-01-05
4 2017-01-09 CVM CVM01-09-2017 CVM01-09-2017 2.75 3.00 2.75 2.75 2.75 376520 ... 414056.0 2.000 77360.0 2.000000 2.000 2.250000 2.000 0.000000 0.000000 2017-01-06
5 2017-01-09 CVM CVM01-09-2017 CVM01-09-2017 2.75 3.00 2.75 2.75 2.75 376520 ... 414056.0 2.000 77360.0 2.000000 2.000 2.250000 2.000 0.000000 0.000000 2017-01-06
6 2017-01-10 DFFN DFFN01-10-2017 DFFN01-10-2017 111.00 232.50 108.75 125.25 125.25 165407 ... 43167.0 30.900 67.0 34.650002 31.500 34.349998 30.900 -0.019048 -0.108225 2017-01-09
7 2017-01-10 VKTX VKTX01-10-2017 VKTX01-10-2017 1.64 1.64 1.43 1.56 1.56 981700 ... 1550400.0 1.260 264900.0 1.230000 1.250 1.299000 1.210 0.008000 0.024390 2017-01-09
8 2017-01-11 BIOC BIOC01-11-2017 BIOC01-11-2017 858.00 1017.00 630.00 813.00 813.00 210182 ... 78392.0 306.000 5368.0 285.000000 285.000 315.000000 285.000 0.073684 0.073684 2017-01-10
9 2017-01-11 CVM CVM01-11-2017 CVM01-11-2017 4.25 4.50 3.00 3.75 3.75 487584 ... 672692.0 2.750 376520.0 2.750000 2.750 3.000000 2.750 0.000000 0.000000 2017-01-10
I'm wondering whether the issue is that the names dataframe may contain the same ticker with different dates, and each time through the loop the price data for that ticker is pulled and appended to price1 again.
For example, in the names dataframe, AGRX can be listed for both 2017-01-05 and 2020-12-20. My loop pulls the Yahoo data and appends it to price1 each time, even though it's the same set of data. Along the same lines, is there a way for me to skip appending that duplicate ticker, and would that solve the issue?
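No answer is included here, but following the reasoning in the last paragraph, one possible sketch is to download each ticker only once and to drop duplicate (Date, Symbol) rows before merging, so the inner join cannot fan out:
# sketch only: dedupe the ticker list, then guard the merge against repeated rows
unique_names = list(dict.fromkeys(names))            # keeps order, drops repeats

frames = []
for name in unique_names[0:10]:
    price = yf.download(name, start="2016-12-01", end="2022-03-04")
    price['ticker'] = name
    price_stats(price)
    frames.append(price)

price1 = pd.concat(frames).reset_index().rename(columns={'ticker': 'Symbol'})
price1 = price1.drop_duplicates(subset=['Date', 'Symbol'])

dfcombo = vol_excel.merge(price1, on=['Date', 'Symbol'], how='inner')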

How can I normalize my dataframe so that my line plots start from the same point?

I have a dataframe like the following (named net_asset), covering 2015 to today:
a b c d e f g h i j k l m n o p q r
Date
2015-04-30 162.20100 38.69620 98.88842 11.75094 8.92177 1.07767 112.81237 110.08090 NaN 4.20428 221.5440 NaN 1.63142 155.30297 8.19891 13.94684 7.40493 27.85345
2015-05-29 164.04053 39.19910 101.54701 11.97325 8.94295 1.12211 114.48715 113.24696 NaN 4.30719 215.7512 NaN 1.65257 154.85456 8.33938 14.29280 7.47724 27.32846
2015-06-30 163.17050 39.00262 101.77694 11.93908 8.96241 1.13880 114.23190 112.75483 10.0000 4.22515 207.5485 NaN 1.67049 158.25418 8.57353 14.13962 7.61546 26.99618
2015-07-31 160.73069 38.49814 102.63752 11.95354 8.93894 1.14438 111.00177 110.01403 10.1106 4.19375 205.0794 NaN 1.65833 161.83255 8.67075 14.25327 7.67866 27.31167
To make the data easier to compare after plotting, I want all the columns to start at the same point, here at 100 (i.e. every column should be 100 at the first 2015 row).
I tried the code below, but couldn't get what I imagined, which was 100 at 2015:
net_asset.apply(lambda x: (x - x.min()) / (x.max() - x.min()))
The above code returns (net_asset.head()):
Date
2015-04-30 29.481157 20.728226 12.566996 14.006493 24.887183 85.363231 11.168351 20.119944 NaN 26.292755 38.674209 NaN 19.586481 9.290352 5.570366 9.204228 4.566915 100.000000
2015-05-29 31.475018 22.683843 15.138121 16.334712 25.302741 95.113764 12.794772 25.172351 NaN 31.434296 34.177011 NaN 21.440216 9.022051 7.029734 11.419483 5.223939 95.558550
2015-06-30 30.531995 21.919795 15.360487 15.976855 25.684553 98.775698 12.546892 24.387008 26.207877 27.335452 27.808905 NaN 23.010851 11.056174 9.462360 10.438639 6.479836 92.747440
2015-07-31 27.887493 19.958033 16.192755 16.128292 25.224064 100.000000 9.410033 20.013232 27.427053 25.766660 25.892037 NaN 21.945063 13.197250 10.472396 11.166364 7.054085 95.416506
net_asset.tail()
2020-11-30 67.200005 72.608636 76.959357 85.856731 88.155809 57.219650 94.367147 84.263184 84.411962 49.771676 78.669830 91.698367 91.659509 95.793550 97.312319 100.000000 98.638703 12.572080
2020-12-31 79.321960 80.759312 87.806721 94.821595 96.394572 69.535073 99.215011 97.320232 87.610922 62.294533 89.893726 100.000000 100.000000 100.000000 100.000000 99.515149 100.000000 20.818697
2021-01-29 82.292270 80.581521 87.481611 92.795622 97.256100 70.575071 99.335197 93.571979 89.231346 58.588387 91.402937 92.293295 96.259225 96.302455 93.245683 95.127478 94.362002 20.405762
2021-02-26 91.587476 90.773715 91.445362 94.800335 98.102520 81.569651 95.674504 91.847156 97.434880 70.743028 97.713593 85.960528 89.612951 93.915749 88.721404 87.146839 88.763620 21.716141
2021-03-31 100.000000 100.000000 100.000000 100.000000 100.000000 91.807271 100.000000 97.903339 100.000000 81.996363 100.000000 94.200479 87.929251 89.484993 86.827664 86.035818 87.447754 19.689448
What is the way to do this? Thank you.
Some columns start with NaN but get values later on.
In Excel I do it by dividing each row by the first row and multiplying by one hundred: =(A2/$A$2)*100
If you want to apply the normalization to each column, you have to use axis=0.
Z-Score Normalization
"The formula for calculating a z-score is z = (x-μ)/σ, where x is the raw score, μ is the population mean, and σ is the population standard deviation. As the formula shows, the z-score is simply the raw score minus the population mean, divided by the population standard deviation."
#get mean each column
mean = df.mean(axis=0)
#get standard deviation
std = df.std(axis=0)
#normalization
normalization = ((df - mean) / std)
or in one line
normalization = (df - df.mean()) / df.std()
Min-max normalization
normalization = (df-df.min()) / (df.max()-df.min())
If you want the values on a 0 to 100 scale, just multiply by 100:
normalization = ( (df-df.min()) / (df.max()-df.min()) * 100 )
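Neither normalization above reproduces the =(A2/$A$2)*100 behaviour described in the question, where every column starts at exactly 100. A sketch of that, dividing each column by its first non-NaN value (since some columns start with NaN) and multiplying by 100:
# assumes every column has at least one non-NaN value
base = net_asset.apply(lambda s: s.loc[s.first_valid_index()])
rebased = net_asset.div(base).mul(100)    # each column equals 100 at its first valid date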

Tiling in groupby on dataframe

I have a data frame that contains returns, size and sedols for a couple of dates.
My goal is to identify the top and bottom values for a certain condition per date, i.e. I want the top-decile largest entries and the bottom-decile smallest entries by size for each date, and to flag them in a new column with 'xx' and 'yy'.
I am confused about how to apply the tiling while grouping, as well as how to create the new column. Here is what I already have:
import pandas as pd
import numpy as np
import datetime as dt
from random import choice
from string import ascii_uppercase
def create_dummy_data(start_date, days, entries_pday):
    date_sequence_lst = [dt.datetime.strptime(start_date, '%Y-%m-%d') +
                         dt.timedelta(days=x) for x in range(0, days)]
    date_sequence_lst = date_sequence_lst * entries_pday
    returns_lst = [round(np.random.uniform(low=-0.10, high=0.20), 2) for _ in range(entries_pday * days)]
    size_lst = [round(np.random.uniform(low=10.00, high=10000.00), 0) for _ in range(entries_pday * days)]
    rdm_sedol_lst = [(''.join(choice(ascii_uppercase) for i in range(7))) for x in range(entries_pday)]
    rdm_sedol_lst = rdm_sedol_lst * days
    dates_returns_df = pd.DataFrame({'Date': date_sequence_lst, 'Sedols': rdm_sedol_lst, 'Returns': returns_lst, 'Size': size_lst})
    dates_returns_df = dates_returns_df.sort_values('Date', ascending=True)
    dates_returns_df = dates_returns_df.reset_index(drop=True)
    return dates_returns_df

def order_df_by(df_in, column_name):
    df_out = df_in.sort_values(['Date', column_name], ascending=[True, False])
    return df_out

def get_ntile(df_in, ntile):
    df_in['Tiled'] = df_in.groupby(['Date'])['Size'].transform(lambda x: pd.qcut(x, ntile))
    return df_in
if __name__ == "__main__":
    # create dummy returns
    data_df = create_dummy_data('2001-01-01', 31, 10)
    # sort by attribute
    data_sorted_df = order_df_by(data_df, 'Size')
    # ntile data per date
    data_ntiled = get_ntile(data_sorted_df, 10)
    for key, item in data_ntiled:
        print(data_ntiled.get_group(key))
So far I would expect deciled results based on 'Size' for each date; the next step would be to filter only for deciles 1 and 10 and flag those entries 'xx' and 'yy' respectively.
Thanks
Consider using transform with the pandas.qcut method, passing labels 1 through ntile for a decile column, then conditionally set the flag with np.where using the decile values:
...
def get_ntile(df_in, ntile):
    df_in['Tiled'] = df_in.groupby(['Date'])['Size'].transform(lambda x: pd.qcut(x, ntile, labels=list(range(1, ntile + 1))))
    return df_in

if __name__ == "__main__":
    # create dummy returns
    data_df = create_dummy_data('2001-01-01', 31, 10)
    # sort by attribute
    data_sorted_df = order_df_by(data_df, 'Size')
    # ntile data per date
    data_ntiled = get_ntile(data_sorted_df, 10)

    data_ntiled['flag'] = np.where(data_ntiled['Tiled'] == 1.0, 'YY',
                                   np.where(data_ntiled['Tiled'] == 10.0, 'XX', np.nan))

    print(data_ntiled.reset_index(drop=True).head(15))
# Date Returns Sedols Size Tiled flag
# 0 2001-01-01 -0.03 TEEADVJ 8942.0 10.0 XX
# 1 2001-01-01 -0.03 PDBWGBJ 7142.0 9.0 nan
# 2 2001-01-01 0.03 QNVVPIC 6995.0 8.0 nan
# 3 2001-01-01 0.04 NTKEAKB 6871.0 7.0 nan
# 4 2001-01-01 0.20 ZVVCLSJ 6541.0 6.0 nan
# 5 2001-01-01 0.12 IJKXLIF 5131.0 5.0 nan
# 6 2001-01-01 0.14 HVPDRIU 4490.0 4.0 nan
# 7 2001-01-01 -0.08 XNOGFET 3397.0 3.0 nan
# 8 2001-01-01 -0.06 JOARYWC 2582.0 2.0 nan
# 9 2001-01-01 0.12 FVKBQGU 723.0 1.0 YY
# 10 2001-01-02 0.03 ZVVCLSJ 9291.0 10.0 XX
# 11 2001-01-02 0.14 HVPDRIU 8875.0 9.0 nan
# 12 2001-01-02 0.08 PDBWGBJ 7496.0 8.0 nan
# 13 2001-01-02 0.02 FVKBQGU 7307.0 7.0 nan
# 14 2001-01-02 -0.01 QNVVPIC 7159.0 6.0 nan
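A side note on the output above: np.where coerces its result to a common type, so the unflagged rows hold the string 'nan' rather than a real missing value. If a genuine NaN is preferred, a map-based sketch:
data_ntiled['flag'] = data_ntiled['Tiled'].astype(int).map({1: 'YY', 10: 'XX'})  # unmapped deciles become NaN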

Align years of daily data

Starting from a multi-annual record of temperature measured at different time in the day, I would like to end up with a rectangular array of daily averages, each row representing one year of data.
The data looks like this
temperature.head()
date
1996-01-01 00:00:00 7.39
1996-01-01 03:00:00 6.60
1996-01-01 06:00:00 7.39
1996-01-01 09:00:00 9.50
1996-01-01 12:00:00 11.00
Name: temperature, dtype: float64
I computed daily averages with
import pandas as pd
daily = temperature.groupby(pd.TimeGrouper(freq='D')).mean()
Which yields
daily.head()
date
1996-01-01 9.89625
1996-01-02 10.73625
1996-01-03 6.98500
1996-01-04 5.62250
1996-01-05 8.84625
Freq: D, Name: temperature, dtype: float64
Now for the final part I thought of something like
yearly_daily_mean = daily.groupby(pd.TimeGrouper(freq='12M', closed="left"))
but there are some issues here.
I need to drop the tail of the data not filling a complete year.
What happens if there is missing data?
How to deal with the leap years?
What is the next step? Namely, how to “stack” (in numpy's, not pandas' sense) the years of data?
I am using
array_temperature = np.column_stack([group[1] for group in yearly_daily_mean if len(group[1]) == 365])
but there should be a better way.
As a subsidiary question, how can I choose the starting day of the years of data?
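A note on the code above: pd.TimeGrouper has since been removed from pandas in favour of pd.Grouper (or plain resampling), so on current versions the daily averaging would read:
daily = temperature.groupby(pd.Grouper(freq='D')).mean()
# or equivalently
daily = temperature.resample('D').mean()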
If I understand you correctly, you want to reshape your timeseries of daily means (which you already calculated) to a rectangular dataframe with the different days as columns and the different years as rows.
This can be achieved easily with the pandas reshaping functions, e.g. with pivot:
Some dummy data:
In [45]: index = pd.date_range(start=date(1996, 1,1), end=date(2010, 6, 30), freq='D')
In [46]: daily = pd.DataFrame(index=index, data=np.random.random(size=len(index)), columns=['temperature'])
First, I add columns with the year and day of the year:
In [47]: daily['year'] = daily.index.year
In [48]: daily['day'] = daily.index.dayofyear
In [49]: daily.head()
Out[49]:
temperature year day
1996-01-01 0.081774 1996 1
1996-01-02 0.694968 1996 2
1996-01-03 0.478050 1996 3
1996-01-04 0.123844 1996 4
1996-01-05 0.426150 1996 5
Now, we can reshape this dataframe:
In [50]: daily.pivot(index='year', columns='day', values='temperature')
Out[50]:
day 1 2 ... 365 366
year ...
1996 0.081774 0.694968 ... 0.679461 0.700833
1997 0.043134 0.981707 ... 0.009357 NaN
1998 0.257077 0.297290 ... 0.701941 NaN
... ... ... ... ... ...
2008 0.047145 0.750354 ... 0.996396 0.761159
2009 0.348667 0.827057 ... 0.881424 NaN
2010 0.269743 0.872655 ... NaN NaN
[15 rows x 366 columns]
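Regarding the subsidiary question about choosing the starting day of each 'year' of data: one sketch (the 1 July offset is purely illustrative) is to shift the index before deriving year and day-of-year, so every pivot row begins at the chosen day:
shifted = daily.index - pd.DateOffset(months=6)   # make each "year" start on 1 July
daily['year'] = shifted.year
daily['day'] = shifted.dayofyear
daily.pivot(index='year', columns='day', values='temperature')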
Here is how I would do it. Very simply: create a new df with the exact shape you want, then fill it with the means of the things you want.
from datetime import datetime
import numpy as np
import pandas as pd
# This is my re-creation of the data you have. (I'm calling it df1.)
# It's essential that your date-time be in datetime.datetime format, not strings
byear = 1996 # arbitrary
eyear = 2005 # arbitrary
obs_n = 50000 # arbitrary
start_time = datetime.timestamp(datetime(byear,1,1,0,0,0,0))
end_time = datetime.timestamp(datetime(eyear,12,31,23,59,59,999999))
obs_times = np.linspace(start_time,end_time,num=obs_n)
index1 = pd.Index([datetime.fromtimestamp(i) for i in obs_times])
df1 = pd.DataFrame(data=np.random.rand(obs_n)*20,index=index1,columns=['temp'])
# ^some random data
# Here is the new empty dataframe (df2) where you will put your daily averages.
index2 = pd.Index(range(byear,eyear+1))
columns2 = range(1,367) # change to 366 if you want to assume 365-day years
df2 = pd.DataFrame(index=index2,columns=columns2)
# Some quick manipulations that allow the two dfs' indexes to talk to one another.
df1['year'] = df1.index.year # a new column with the observation's year as an integer
df1['day'] = df1.index.dayofyear # a new column with the day of the year as integer
df1 = df1.reset_index().set_index(['year','day'])
# Now get the averages for each day and assign them to df2.
for year in index2:
    for day in columns2[:365]:  # for all but the last entry in the range
        df2.loc[year, day] = df1.loc[(year, day), 'temp'].mean()
    if (year, 366) in df1.index:  # then if it's a leap year...
        df2.loc[year, 366] = df1.loc[(year, 366), 'temp'].mean()
If you don't want the final df to have any null values on that 366th day, you can remove the final if-statement and instead write columns2 = range(1, 366); then df2 will have all non-null values (assuming there was at least one measurement on every day of the observed period).
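The explicit double loop above can also be replaced by a single groupby on the (year, day) MultiIndex that df1 already has; a sketch that fills the same year-by-day table:
df2 = df1.groupby(level=['year', 'day'])['temp'].mean().unstack('day')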
Assuming you already have daily averages (with pd.DateTimeIndex) from your higher-frequency data as a result of:
daily = temperature.groupby(pd.TimeGrouper(freq='D')).mean()
IIUC, you want to transform the daily average into a DataFrame with an equal number of columns per row to capture annual data. You mention leap years as a potential issue when aiming for an equal number of columns.
I can imagine two ways of going about this:
Select a number of days per row - probably 365. Select rolling blocks of 365 consecutive daily data points for each row and align by index for each of these blocks.
Select years of data, filling in the gaps for leap years, and align by either MM-DD or number of day in year.
Starting with 20 1/2 years of daily random data as mock daily average temperatures:
index = pd.date_range(start=date(1995, 1,1), end=date(2015, 6, 30), freq='D')
df = pd.DataFrame(index=index, data=np.random.random(size=len(index)) * 30, columns=['temperature'])
df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 7486 entries, 1995-01-01 to 2015-06-30
Freq: D
Data columns (total 1 columns):
temperature 7486 non-null float64
dtypes: float64(1)
memory usage: 117.0 KB
None
df.head()
temperature
1995-01-01 4.119212
1995-01-02 27.107131
1995-01-03 26.704931
1995-01-04 7.430203
1995-01-05 4.230398
df.tail()
temperature
2015-06-26 10.902779
2015-06-27 8.494378
2015-06-28 17.800131
2015-06-29 19.543815
2015-06-30 16.390435
Here's a solution to the first approach:
Select blocks of 365 consecutive days using .groupby(pd.TimeGrouper('365D')), and return each resulting group of daily averages as a transposed pd.DataFrame, so each block becomes one row with an integer index running from 0 to 364:
aligned = df.groupby(pd.TimeGrouper(freq='365D')).apply(lambda x: pd.DataFrame(x.squeeze().tolist()).T)  # .squeeze() converts the single-column DataFrame to a pd.Series
The 21 blocks then align by integer position in the columns, with the start date of each sequence in the index. The operation also produces an extra (inner) index level, and the last, incomplete block has missing data. Clean up both with:
aligned.dropna().reset_index(-1, drop=True)
to get a [20 x 365] DataFrame as follows:
DatetimeIndex: 20 entries, 1995-01-01 to 2013-12-27
Freq: 365D
Columns: 365 entries, 0 to 364
dtypes: float64(365)
memory usage: 57.2 KB
0 1 2 3 4 5 \
1995-01-01 29.456090 25.313968 4.146206 5.347690 25.767425 11.978152
1996-01-01 25.585481 26.846486 8.336905 16.749842 6.247542 17.723733
1996-12-31 23.410462 10.168599 5.601917 11.996500 8.650726 23.362815
1997-12-31 7.586873 23.882106 22.145595 3.287160 21.642547 1.949321
1998-12-31 14.691420 3.611475 28.287327 25.347787 13.291708 20.571616
1999-12-31 25.713866 17.588570 18.562117 19.420944 12.406293 11.870750
2000-12-30 5.099561 17.894763 21.168223 4.786461 24.521417 21.443607
2001-12-30 11.791223 8.352493 12.731769 0.459697 20.680396 27.554783
2002-12-30 3.785876 0.359850 20.828764 15.376991 14.086626 0.477615
2003-12-30 23.633243 12.726250 8.197824 16.355956 8.094145 1.410746
2004-12-29 1.139949 4.161267 9.043062 14.109888 13.538735 1.566002
2005-12-29 25.504224 19.346419 3.300641 26.933084 23.634321 18.323450
2006-12-29 10.535785 9.168498 27.222106 11.962343 10.004678 23.893257
2007-12-29 27.482856 6.910670 6.033291 12.673530 26.362971 4.492178
2008-12-28 11.152316 25.233664 22.124299 11.012285 1.992814 25.542204
2009-12-28 23.131021 16.363467 1.242393 10.387653 4.858851 26.553950
2010-12-28 13.134843 9.195658 19.075850 28.539387 3.075934 8.089347
2011-12-28 28.860275 10.121573 0.663906 19.687892 29.376377 11.488446
2012-12-27 7.644073 19.649330 25.497595 6.592940 8.879444 17.733670
2013-12-27 11.713996 2.602284 3.835302 22.244623 27.279810 14.144943
6 7 8 9 ... 355 \
1995-01-01 8.210005 8.129146 28.798472 25.646924 ... 24.177163
1996-01-01 0.481487 16.772357 3.934185 22.640157 ... 23.340931
1996-12-31 10.813812 16.276504 3.422665 14.916229 ... 13.817015
1997-12-31 19.184753 28.628326 22.134871 12.721064 ... 23.905483
1998-12-31 2.839492 7.889141 17.951959 25.233585 ... 28.002751
1999-12-31 6.958672 26.335427 23.361470 5.911806 ... 7.778412
2000-12-30 8.405042 25.229016 19.746462 15.332004 ... 5.703830
2001-12-30 0.558788 15.457327 20.987186 25.452723 ... 29.771372
2002-12-30 19.002685 26.455754 25.468178 25.383786 ... 14.238987
2003-12-30 22.984328 15.934398 25.361599 12.221306 ... 1.189949
2004-12-29 22.121901 21.421103 26.175702 16.040881 ... 19.945408
2005-12-29 2.557901 15.193412 27.049389 4.825570 ... 7.629859
2006-12-29 8.582602 26.037375 0.933591 13.469771 ... 29.453932
2007-12-29 29.437921 26.470153 9.917871 16.875801 ... 5.702116
2008-12-28 3.809633 10.583385 18.029571 0.440077 ... 11.337894
2009-12-28 24.406696 28.294553 19.929563 4.683991 ... 25.697446
2010-12-28 29.765551 16.716723 6.467946 10.998447 ... 26.988863
2011-12-28 28.962746 11.407137 9.957111 4.502521 ... 14.606937
2012-12-27 1.374502 5.571244 11.212960 9.949830 ... 23.345868
2013-12-27 26.373866 4.781510 16.828510 10.280078 ... 0.552726
356 357 358 359 360 361 \
1995-01-01 13.511951 10.126835 28.121730 23.275360 11.785242 27.907039
1996-01-01 13.362737 14.336780 24.114908 28.479688 8.509069 17.408937
1996-12-31 19.192674 1.146844 27.499688 7.090407 2.777819 22.826814
1997-12-31 21.502186 10.495148 21.786895 12.229181 8.068271 6.522108
1998-12-31 21.338355 11.978265 9.186161 21.053924 3.033370 29.934703
1999-12-31 5.960120 20.325684 0.915052 15.059979 12.194240 20.138567
2000-12-30 11.883186 2.764768 27.324304 29.630706 21.852058 20.416199
2001-12-30 7.802891 25.384479 9.044486 8.809446 7.606603 6.051890
2002-12-30 7.362494 8.940783 5.259984 7.035818 24.094134 7.197113
2003-12-30 25.596902 9.756372 6.345198 1.520188 22.752717 3.470268
2004-12-29 26.789064 9.708466 18.287838 21.134643 29.862135 19.926086
2005-12-29 26.398394 24.717514 16.606042 28.189245 24.574806 14.297410
2006-12-29 8.795342 18.019536 16.579878 20.368811 22.052442 26.393676
2007-12-29 8.696240 25.901889 16.410934 15.274897 14.365867 10.523388
2008-12-28 18.581513 25.974784 21.025297 10.521118 5.864974 2.373023
2009-12-28 14.437944 21.717456 4.017870 14.024522 0.959989 17.215403
2010-12-28 11.426540 13.751451 4.664761 15.373878 7.731613 7.269089
2011-12-28 1.952897 9.406866 28.957258 20.239517 11.156958 29.238761
2012-12-27 7.588643 21.186675 17.348911 1.354323 13.918083 3.034123
2013-12-27 22.916065 2.089675 22.832061 14.787841 25.697875 14.087893
362 363 364
1995-01-01 13.107523 10.740551 20.511825
1996-01-01 25.016219 17.885332 2.438875
1996-12-31 24.692327 0.221760 6.749919
1997-12-31 24.856169 0.930019 22.603652
1998-12-31 18.361414 13.587695 25.161495
1999-12-31 0.512120 26.482288 1.035197
2000-12-30 15.401012 28.334219 5.965014
2001-12-30 10.292213 10.951915 8.270319
2002-12-30 21.945734 27.076438 6.795688
2003-12-30 14.788929 19.456459 11.216835
2004-12-29 7.086443 25.463503 17.549196
2005-12-29 12.252487 29.081547 25.507369
2006-12-29 0.012617 0.086186 17.421958
2007-12-29 4.191633 21.588891 7.516187
2008-12-28 26.194288 20.500256 24.876032
2009-12-28 28.445254 27.338754 7.849899
2010-12-28 28.888573 26.801262 23.117027
2011-12-28 19.871547 20.324514 18.369134
2012-12-27 15.907752 9.417700 4.922940
2013-12-27 21.132385 20.707216 5.288128
[20 rows x 365 columns]
If you want to simply gather the years of data and align them by day number within each calendar year, so that non-leap years end up with a missing value in the day-366 column, you can:
df.groupby(pd.TimeGrouper(freq='A')).apply(lambda x: pd.DataFrame(x.squeeze().tolist()).T).reset_index(-1, drop=True)
0 1 2 3 4 5 \
1995-12-31 1.245796 28.487530 0.574299 10.033485 19.221512 8.718728
1996-12-31 12.258653 3.864652 25.237088 13.982809 24.494746 13.822292
1997-12-31 22.239412 4.796824 21.389404 11.151171 25.577368 1.754948
1998-12-31 24.968287 2.089894 25.888487 28.291714 19.115844 24.426285
1999-12-31 9.285363 19.339405 26.012193 3.243394 25.176499 8.766770
2000-12-31 26.996573 26.404391 1.793644 21.314488 13.118279 26.703532
2001-12-31 16.303829 14.021771 20.828238 11.427195 3.099290 18.730795
2002-12-31 14.614617 10.694258 5.226033 24.900849 17.395822 22.154202
2003-12-31 10.564132 8.267639 7.778573 26.704936 5.671499 0.470963
2004-12-31 22.649623 15.725867 18.445629 7.529507 11.868134 10.965534
2005-12-31 2.406615 9.709624 23.284616 11.479254 23.814725 1.656826
2006-12-31 19.164459 23.177769 16.091672 28.936777 28.636072 4.838555
2007-12-31 12.371377 3.417582 21.067689 25.493921 25.410295 15.526614
2008-12-31 29.080385 4.653984 16.567333 24.248921 27.338538 9.353291
2009-12-31 29.608734 6.046593 22.738628 22.631714 26.061903 21.217846
2010-12-31 27.458254 15.146497 18.917073 8.473955 26.782767 10.891648
2011-12-31 25.433759 8.959650 14.343507 16.249726 17.031174 12.944418
2012-12-31 22.940797 4.791280 11.765939 25.925645 3.649440 27.483407
2013-12-31 11.684391 27.701678 27.423083 27.656086 9.374896 14.250936
2014-12-31 23.660098 27.768960 25.753294 3.014606 23.330226 17.570492
6 7 8 9 ... 356 \
1995-12-31 17.079137 26.100763 12.376462 12.315219 ... 16.910185
1996-12-31 26.718277 10.349412 12.940624 9.453769 ... 19.235435
1997-12-31 20.201528 22.895552 1.443243 20.584140 ... 29.665815
1998-12-31 21.493163 16.724328 5.946833 15.230762 ... 2.617883
1999-12-31 9.776013 13.381424 11.028295 1.905501 ... 7.200409
2000-12-31 9.773097 14.565345 22.578398 0.688273 ... 18.119020
2001-12-31 1.095308 14.817514 25.652418 8.327481 ... 15.385689
2002-12-31 29.744794 15.545211 6.373948 13.451261 ... 7.446414
2003-12-31 14.971959 25.948332 21.596976 5.355589 ... 23.676867
2004-12-31 0.604113 2.858745 0.120340 19.365223 ... 0.336213
2005-12-31 6.260722 9.819337 19.573953 11.132919 ... 26.107100
2006-12-31 10.341241 15.126506 3.349634 23.619127 ... 15.508680
2007-12-31 20.033540 22.103483 7.674852 1.263726 ... 15.148461
2008-12-31 28.233973 27.982105 17.037928 5.389418 ... 8.773618
2009-12-31 4.400039 7.284556 11.825382 4.201001 ... 6.734423
2010-12-31 26.086305 26.275027 8.069376 19.200344 ... 19.056528
2011-12-31 29.215028 0.985623 4.813478 7.752540 ... 14.395423
2012-12-31 4.690336 9.618306 25.492041 10.400292 ... 8.853903
2013-12-31 8.227096 11.013431 0.996911 15.276574 ... 26.227540
2014-12-31 23.440591 16.544698 2.263684 3.919315 ... 24.987387
357 358 359 360 361 362 \
1995-12-31 24.791125 21.443534 21.092439 8.289222 9.745293 20.084046
1996-12-31 2.632656 2.102163 24.828437 18.104255 7.951859 3.266873
1997-12-31 11.246534 14.086539 29.635519 19.518642 24.086108 6.041870
1998-12-31 29.961162 9.924863 9.401790 25.597344 13.885467 16.537406
1999-12-31 3.057125 15.241720 8.472388 3.248545 11.302522 19.283612
2000-12-31 22.999729 17.518504 10.058249 2.953903 10.167712 17.309525
2001-12-31 18.267445 23.205300 25.658591 19.915797 10.704525 26.604965
2002-12-31 11.497110 3.641206 9.693428 24.571510 6.438652 29.280098
2003-12-31 23.931401 19.967615 0.307896 0.385782 0.579257 7.534806
2004-12-31 21.321146 9.224362 1.703842 6.180944 28.173925 5.178336
2005-12-31 17.990409 28.746179 2.524899 10.555224 25.487723 19.877390
2006-12-31 9.748760 29.069966 1.717175 3.283069 9.615215 25.787787
2007-12-31 29.772930 20.892030 16.597493 20.079373 17.320327 9.583089
2008-12-31 22.787891 26.636413 13.872783 29.305847 21.287553 1.263788
2009-12-31 1.574188 23.172773 0.967153 1.928999 12.201354 0.125939
2010-12-31 20.566125 0.429552 4.413156 16.106451 27.745684 18.280928
2011-12-31 9.348584 2.604338 23.397221 7.378340 16.757224 29.364973
2012-12-31 4.704570 7.278321 19.034622 24.597784 13.694635 15.912901
2013-12-31 21.657446 14.110146 23.976991 8.203509 20.083490 4.471119
2014-12-31 14.465823 9.105391 15.984162 6.796756 8.232619 18.761280
363 364 365
1995-12-31 28.165022 9.735041 NaN
1996-12-31 11.644543 4.139818 5.420238
1997-12-31 2.500165 18.290531 NaN
1998-12-31 23.856333 10.064951 NaN
1999-12-31 3.090008 26.203395 NaN
2000-12-31 22.216599 27.942821 0.791318
2001-12-31 25.682003 4.766435 NaN
2002-12-31 19.785159 28.972659 NaN
2003-12-31 15.692168 21.388069 NaN
2004-12-31 9.079675 7.392328 12.583179
2005-12-31 18.202333 21.895494 NaN
2006-12-31 20.951937 26.220226 NaN
2007-12-31 23.603166 28.165377 NaN
2008-12-31 20.532933 9.401494 25.296916
2009-12-31 5.879644 10.377044 NaN
2010-12-31 0.436284 20.875852 NaN
2011-12-31 13.205290 6.832805 NaN
2012-12-31 23.253155 17.760731 23.270751
2013-12-31 19.807798 2.453238 NaN
2014-12-31 12.817601 11.756561 NaN
[20 rows x 366 columns]
