Rolling mean returns over DataFrame - python

I want to add columns to the following DataFrame holding the 5-year (60-month) rolling returns for each stock. The following code is used to obtain the financial data over the period 1995 to 2010.
quandl.ApiConfig.api_key = 'Enter Key'
stocks = ['MSFT', 'AAPL', 'WMT', 'GE', 'KO']
stockdata = quandl.get_table('WIKI/PRICES', ticker = stocks, paginate=True,
qopts = { 'columns': ['date', 'ticker', 'adj_close'] },
date = { 'gte': '1995-1-1', 'lte': '2010-12-31' })
# Setting date as index with columns of tickers and adjusted closing price
df = stockdata.pivot(index = 'date',columns='ticker')
df.index = pd.to_datetime(df.index)
df.resample('1M').mean()
df = df.pct_change()
df.head()
Out[1]:
rets
ticker AAPL BA F GE JNJ KO
date
1995-01-03 NaN NaN NaN NaN NaN NaN
1995-01-04 0.026055 -0.002567 0.026911 0.000000 0.006972 -0.019369
1995-01-05 -0.012697 0.002573 -0.008735 0.002549 -0.002369 -0.004938
1995-01-06 0.080247 0.018824 0.000000 -0.004889 -0.006758 0.000000
1995-01-09 -0.019048 0.000000 0.017624 -0.009827 -0.011585 -0.014887
df.tail()
Out[2]:
rets
ticker AAPL BA F GE JNJ KO
date
2010-12-27 0.003337 -0.004765 0.005364 0.008315 -0.005141 -0.007777
2010-12-28 0.002433 0.001699 -0.008299 0.007147 0.001938 0.004457
2010-12-29 -0.000553 0.002929 0.000598 -0.002729 0.001289 0.001377
2010-12-30 -0.005011 -0.000615 -0.002987 -0.004379 -0.003058 0.000764
2010-12-31 -0.003399 0.003846 0.005992 0.005498 -0.001453 0.004122
Any assistance of how to do this would be awesome!

The problem is the multi-level index in the columns. Select the top-level key first, so that only the ticker level remains, and after that the rolling mean works:
means = df['rets'].rolling(60).mean()
means.tail()
Gives:

The error you are receiving is probably due to passing the entire DataFrame into the rolling function while its columns use a MultiIndex. Select a single top-level key first, or write a for loop over the tickers and compute the rolling values individually per ticker.
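To make the idea above concrete, here is a minimal sketch on synthetic data. The price series, the `adj_close` level name, and the choice of month-end prices are assumptions made for illustration; they are not the Quandl data from the question. It drops the outer column level, reduces daily prices to one value per month, and then takes a 60-month rolling mean of the monthly returns.

```python
import numpy as np
import pandas as pd

# Synthetic daily prices (assumption: random walks standing in for real data).
dates = pd.bdate_range("1995-01-02", "2010-12-31")
tickers = ["AAPL", "GE", "KO"]
rng = np.random.default_rng(0)
steps = rng.normal(0, 0.01, size=(len(dates), len(tickers)))
prices = pd.DataFrame(
    100 * np.exp(np.cumsum(steps, axis=0)),
    index=dates,
    columns=pd.MultiIndex.from_product([["adj_close"], tickers], names=[None, "ticker"]),
)

# Selecting the top-level key leaves plain ticker columns.
flat = prices["adj_close"]

# Last available price of each calendar month, then month-over-month returns.
monthly_prices = flat.groupby(flat.index.to_period("M")).last()
monthly_returns = monthly_prices.pct_change()

# 60-month rolling mean of the monthly returns, one column per ticker.
rolling_60m = monthly_returns.rolling(60).mean()
print(rolling_60m.tail())
```

Note that `rolling(60)` counts rows, so it only means "60 months" once the frame has one row per month; on daily returns it would be a 60-day window.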

Related

Python-pandas : a strange filtering/remove error

I've been using pandas for a while but I am having a really strange issue with simple filtering in pandas.
OS: Mac OS
IDE: VSCode
Pandas version: 1.4.2
I am fetching the latest data from crypto exchanges(via ccxt api) and append them into a dataframe.
limit = 28
timeframe = '1h'
futures_exchange = ccxt.kucoinfutures({
'apiKey' : MY_API_KEY,
'secret' : MY_API_SECRET,
'enableRateLimit': True,
'password': MY_KUCOIN_PASS_PHRASE,
})
all_future_list = [ 'BTC/USDT:USDT', 'BTC/USD:BTC', 'ETH/USDT:USDT', 'BCH/USDT:USDT', 'BSV/USDT:USDT', ]
attempts = 0
while attempts<= 20:
    alldf = pd.DataFrame()
    try:
        for i in all_future_list:
            df = futures_exchange.fetchOHLCV(i, limit=limit, timeframe=timeframe)
            df = pd.DataFrame(df, columns=[['timestamp', 'open', 'high', 'low', 'close', 'volume']])
            df['ticker'] = i
            df['timestamp'] = df['timestamp'].astype('datetime64[ms]')
            df = df[['timestamp', 'close', 'volume', 'ticker']]
            df = df.tail(1)
            alldf = pd.concat([alldf, df], ignore_index=True)
            time.sleep(0.1)
    except:
        print(traceback.format_exc())
        print(' ___ Network Error, restart fetching data _____ ')
        attempts += 1
        time.sleep(8)
        continue
    break
So far so good. Dataframe looks like...
timestamp close volume ticker
0 2022-09-17 10:00:00 19836.00 789336.0 BTC/USDT:USDT
1 2022-09-17 10:00:00 1411.2 982840 ETH/USDT:USDT
2 2022-09-17 10:00:00 120.95 55564.0 BCH/USDT:USDT
However, from there if I want to remove/filter a ticker from dataframe
new_df = alldf[alldf['ticker']!= 'BTC/USDT:USDT']
print(new_df)
Erroneous output:
timestamp close volume ticker
0 NaT NaN NaN NaN
1 NaT NaN NaN ETH/USDT:USDT
2 NaT NaN NaN BCH/USDT:USDT
3 NaT NaN NaN BSV/USDT:USDT
This should be a simple remove/filter, but I don't understand why the 'timestamp', 'close' and 'volume' columns are NaN.
It seems that if I write the dataframe to a CSV and then read it back, I get what I want.
alldf.to_csv('alldf.csv')
alldf = pd.read_csv('alldf.csv',index_col=0)
new_df = alldf[alldf['ticker']!= 'BTC/USDT:USDT']
print(new_df)
Output:
timestamp close volume ticker
1 2022-09-17 10:00:00 1411.2 982840 ETH/USDT:USDT
0 2022-09-17 10:00:00 120.95 55564.0 BCH/USDT:USDT
2 2022-09-17 11:00:00 51.95 3628.0 BSV/USDT:USDT
However, I don't want to write the dataframe to a CSV and read it again just to filter out some tickers.
Can anyone help me? I'm not sure what's going on here.
I guess this happens because this piece of code
df = df[['timestamp', 'close', 'volume', 'ticker']]
df = df.tail(1)
subsets the existing dataframe without actually copying the data. The target dataframe would therefore hold a reference to the original data, and when that data changes, the result breaks.
Try this:
df = df[['timestamp', 'close', 'volume', 'ticker']]
df = df.tail(1).copy()
BTW, AFAIK, it's more efficient to keep new records in a dictionary and convert it to a dataframe at the end, rather than merging them one by one in a loop.
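Another possible cause, visible in the question's code, is the nested list passed as columns=[[...]]. pandas turns a list of lists into a one-level MultiIndex, so alldf['ticker'] comes back as a DataFrame rather than a Series, and indexing with a boolean DataFrame masks cell by cell instead of dropping rows. The sketch below reproduces that on a few hand-written rows (the values are illustrative, not fetched from an exchange).

```python
import pandas as pd

rows = [
    ["2022-09-17 10:00:00", 19836.0, "BTC/USDT:USDT"],
    ["2022-09-17 10:00:00", 1411.2, "ETH/USDT:USDT"],
    ["2022-09-17 10:00:00", 120.95, "BCH/USDT:USDT"],
]

# Nested list, as in the question: the columns become a one-level MultiIndex.
nested = pd.DataFrame(rows, columns=[["timestamp", "close", "ticker"]])
mask = nested["ticker"] != "BTC/USDT:USDT"  # a DataFrame, not a Series
masked = nested[mask]                       # cell-wise masking, rows are kept

# Flat list: ordinary columns, so the comparison yields a row mask.
flat = pd.DataFrame(rows, columns=["timestamp", "close", "ticker"])
filtered = flat[flat["ticker"] != "BTC/USDT:USDT"]

# An already-built frame can be repaired by flattening its columns.
repaired = nested.copy()
repaired.columns = repaired.columns.get_level_values(0)

print(type(nested.columns).__name__, type(mask).__name__)
print(filtered)
```

If this is what happened, passing a single flat list to columns= in the loop should be enough, and it would be consistent with the CSV round trip appearing to fix the problem, since the frame read back has ordinary columns.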

finding first and last available days of a month in pandas

I have a pandas dataframe from 2007 to 2017. The data is like this:
date closing_price
2007-12-03 728.73
2007-12-04 728.83
2007-12-05 728.83
2007-12-07 728.93
2007-12-10 728.22
2007-12-11 728.50
2007-12-12 728.51
2007-12-13 728.65
2007-12-14 728.65
2007-12-17 728.70
2007-12-18 728.73
2007-12-19 728.73
2007-12-20 728.73
2007-12-21 728.52
2007-12-24 728.52
2007-12-26 728.90
2007-12-27 728.90
2007-12-28 728.91
2008-01-05 728.88
2008-01-08 728.86
2008-01-09 728.84
2008-01-10 728.85
2008-01-11 728.85
2008-01-15 728.86
2008-01-16 728.89
As you can see, some days are missing for each month. I want to take the first and last 'available' days of each month, and calculate the difference of their closing_price, and put the results in a new dataframe. For example for the first month, the days will be 2007-12-03 and 2007-12-28, and the closing prices would be 728.73 and 728.91, so the result would be 0.18. How can I do this?
You can group df by month and apply a function to each group. Note the to_period call: it converts the datetime values in the date column to monthly periods, which serve as the group keys. The function assumes the rows are already sorted by date, so the smallest and largest index labels in each group are its first and last available days.
def calculate(x):
    start_closing_price = x.loc[x.index.min(), "closing_price"]
    end_closing_price = x.loc[x.index.max(), "closing_price"]
    return end_closing_price - start_closing_price
result = df.groupby(df["date"].dt.to_period("M")).apply(calculate)
# result
date
2007-12 0.18
2008-01 0.01
Freq: M, dtype: float64
First make sure they are datetime and sorted:
import pandas as pd
df['date'] = pd.to_datetime(df.date)
df = df.sort_values('date')
Groupby
gp = df.groupby([df.date.dt.year.rename('year'), df.date.dt.month.rename('month')])
gp.closing_price.last() - gp.closing_price.first()
#year month
#2007 12 0.18
#2008 1 0.01
#Name: closing_price, dtype: float64
or
gp = df.groupby(pd.Grouper(key='date', freq='1M'))
gp.last() - gp.first()
# closing_price
#date
#2007-12-31 0.18
#2008-01-31 0.01
Resample
gp = df.set_index('date').resample('1M')
gp.last() - gp.first()
# closing_price
#date
#2007-12-31 0.18
#2008-01-31 0.01
Problem: Get first or last date of indexed dataframe
Solution: Resample the index and then extract the data.
lom = pd.Series(x.index, index = x.index).resample('m').last()
xlast = x[x.index.isin(lom)] # .resample('m').last() to get monthly freq
fom = pd.Series(x.index, index = x.index).resample('m').first()
xfirst = x[x.index.isin(fom)]
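For reference, here is a small self-contained sketch of the group-by-month idea on a subset of the question's rows. It keeps the first and last available dates next to the prices so the result can be checked by eye; the output column names are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2007-12-03", "2007-12-04", "2007-12-28", "2008-01-05", "2008-01-16"],
    "closing_price": [728.73, 728.83, 728.91, 728.88, 728.89],
})
df["date"] = pd.to_datetime(df["date"])
df = df.sort_values("date")  # first/last below rely on date order

# One row per month: first and last available day and their prices.
monthly = df.groupby(df["date"].dt.to_period("M")).agg(
    first_day=("date", "first"),
    last_day=("date", "last"),
    first_price=("closing_price", "first"),
    last_price=("closing_price", "last"),
)
monthly["difference"] = (monthly["last_price"] - monthly["first_price"]).round(2)
print(monthly)
```

Grouping on monthly periods avoids depending on a resample frequency alias, and the missing calendar days do not matter because only the rows that exist are considered.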

How to get a print function output in a data frame in python

returns is a Python data frame, and this is its head. It holds the daily returns of just 2 stocks.
date NOW BBY
2013-09-30 NaN NaN
2013-10-01 -0.008855 0.012000
2013-10-02 0.015149 -0.007642
2013-10-03 -0.002296 0.000796
2013-10-04 0.043720 0.012206
I have a simple code that calculates annualized sharpe ratio for stocks
Function
N= 252
sharpe = np.sqrt(N)* returns.mean()/returns.std()
print (sharpe)
and this is the output when I print(sharpe):
NOW 0.906136
BBY 0.667774
dtype: float64
I want to get these values into a data frame with two columns, the ticker and the Sharpe ratio,
so it should look like this:
Ticker Sharpe
NOW 0.906136
BBY 0.667774
I want to get this into a data frame because I have several other printed results, like VaR etc., so that I can merge them and then export the data frame to Excel.
Please help me get the printed output into a data frame in Python.
import numpy as np
import pandas as pd
# Construct initial dataframe
df = pd.DataFrame({
'date': ['2013-09-30', '2013-10-01', '2013-10-02', '2013-10-03', '2013-10-04'],
'NOW': [np.nan, -0.008855, 0.015149, -0.002296, 0.043720],
'BBY': [np.nan, 0.012000, -0.007642, 0.000796, 0.012206],
})
df = df.set_index('date')
# Calculate Sharpe ratio
N = 252
sharpe = np.sqrt(N) * df.mean() / df.std()
# Transform Sharpe ratio data from Series to DataFrame
df2 = sharpe.to_frame('Sharpe')
df2.index.name = 'Ticker'
df2 = df2.reset_index()
which gives as result:
In [1]: df2
Out[1]:
Ticker Sharpe
0 NOW 8.061887
1 BBY 7.174034
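Since the question mentions merging several such results before exporting, one way to sketch that is to build a single frame from a dict of Series that share the ticker index. The second metric here, a 5% historical quantile labelled VaR_95, is only a stand-in assumption for whatever other statistics are being printed.

```python
import numpy as np
import pandas as pd

returns = pd.DataFrame(
    {
        "NOW": [np.nan, -0.008855, 0.015149, -0.002296, 0.043720],
        "BBY": [np.nan, 0.012000, -0.007642, 0.000796, 0.012206],
    },
    index=pd.to_datetime(
        ["2013-09-30", "2013-10-01", "2013-10-02", "2013-10-03", "2013-10-04"]
    ),
)

N = 252
sharpe = np.sqrt(N) * returns.mean() / returns.std()
var_95 = returns.quantile(0.05)  # hypothetical second metric, for illustration only

# Each Series is indexed by ticker, so they line up as columns of one frame.
summary = pd.DataFrame({"Sharpe": sharpe, "VaR_95": var_95})
summary = summary.rename_axis("Ticker").reset_index()
print(summary)
```

From there, summary.to_excel(...) would write the table out, assuming an Excel writer engine such as openpyxl is installed.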

pandas how to use groupby to group columns by date in the label?

I have a dataframe 10730 rows × 249 columns, i have columns:
Index(['RegionID', 'Metro', 'CountyName', 'SizeRank', '1996-04', '1996-05',
'1996-06', '1996-07', '1996-08', '1996-09',
...
'2015-11', '2015-12', '2016-01', '2016-02', '2016-03', '2016-04',
'2016-05', '2016-06', '2016-07', '2016-08'],
dtype='object', length=249)
So what I need to do is group the columns by quarter, January to March being Q1 and so on through Q4, using the mean of the values. I know how to group 3 columns, for example, but how do I group all the columns when I cannot specify the column names one by one?
This is the dataframe head in csv to use for testing:
'State,RegionName,RegionID,Metro,CountyName,SizeRank,1996-04,1996-05,1996-06,1996-07,1996-08,1996-09,1996-10,1996-11,1996-12,1997-01,1997-02,1997-03,1997-04,1997-05,1997-06,1997-07,1997-08,1997-09,1997-10,1997-11,1997-12,1998-01,1998-02,1998-03,1998-04,1998-05,1998-06,1998-07,1998-08,1998-09,1998-10,1998-11,1998-12,1999-01,1999-02,1999-03,1999-04,1999-05,1999-06,1999-07,1999-08,1999-09,1999-10,1999-11,1999-12,2000-01,2000-02,2000-03,2000-04,2000-05,2000-06,2000-07,2000-08,2000-09,2000-10,2000-11,2000-12,2001-01,2001-02,2001-03,2001-04,2001-05,2001-06,2001-07,2001-08,2001-09,2001-10,2001-11,2001-12,2002-01,2002-02,2002-03,2002-04,2002-05,2002-06,2002-07,2002-08,2002-09,2002-10,2002-11,2002-12,2003-01,2003-02,2003-03,2003-04,2003-05,2003-06,2003-07,2003-08,2003-09,2003-10,2003-11,2003-12,2004-01,2004-02,2004-03,2004-04,2004-05,2004-06,2004-07,2004-08,2004-09,2004-10,2004-11,2004-12,2005-01,2005-02,2005-03,2005-04,2005-05,2005-06,2005-07,2005-08,2005-09,2005-10,2005-11,2005-12,2006-01,2006-02,2006-03,2006-04,2006-05,2006-06,2006-07,2006-08,2006-09,2006-10,2006-11,2006-12,2007-01,2007-02,2007-03,2007-04,2007-05,2007-06,2007-07,2007-08,2007-09,2007-10,2007-11,2007-12,2008-01,2008-02,2008-03,2008-04,2008-05,2008-06,2008-07,2008-08,2008-09,2008-10,2008-11,2008-12,2009-01,2009-02,2009-03,2009-04,2009-05,2009-06,2009-07,2009-08,2009-09,2009-10,2009-11,2009-12,2010-01,2010-02,2010-03,2010-04,2010-05,2010-06,2010-07,2010-08,2010-09,2010-10,2010-11,2010-12,2011-01,2011-02,2011-03,2011-04,2011-05,2011-06,2011-07,2011-08,2011-09,2011-10,2011-11,2011-12,2012-01,2012-02,2012-03,2012-04,2012-05,2012-06,2012-07,2012-08,2012-09,2012-10,2012-11,2012-12,2013-01,2013-02,2013-03,2013-04,2013-05,2013-06,2013-07,2013-08,2013-09,2013-10,2013-11,2013-12,2014-01,2014-02,2014-03,2014-04,2014-05,2014-06,2014-07,2014-08,2014-09,2014-10,2014-11,2014-12,2015-01,2015-02,2015-03,2015-04,2015-05,2015-06,2015-07,2015-08,2015-09,2015-10,2015-11,2015-12,2016-01,2016-02,2016-03,2016-04,2016-05,2016-06,201
6-07,2016-08\nNY,New York,6181,New York,Queens,1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,432600.0,438700.0,440500.0,433900.0,422000.0,415700.0,421200.0,431100.0,435100.0,431900.0,428400.0,430700.0,438800.0,446800.0,455400.0,465500.0,472600.0,478200.0,487600.0,498600.0,508800.0,515300.0,517000.0,517800.0,520800.0,521500.0,523000.0,526300.0,524800.0,519100.0,516200.0,516400.0,516300.0,515500.0,512200.0,509200.0,509800.0,511600.0,512700.0,514000.0,513400.0,510700.0,508100.0,506700.0,505200.0,503700.0,502900.0,502400.0,500500.0,496400.0,491900.0,487500.0,484400.0,481700.0,477900.0,473600.0,469700.0,466100.0,461700.0,457700.0,455300.0,454800.0,456000.0,457800.0,461300.0,466100.0,470200.0,472800.0,475300.0,477100.0,478400.0,479100.0,478900.0,477700.0,476700.0,477100.0,478000.0,478000.0,476800.0,475300.0,473800.0,472000.0,470600.0,469900.0,469500.0,468200.0,465800.0,463500.0,461800.0,460100.0,459700.0,460800.0,461700.0,462500.0,463900.0,466000.0,467500.0,468200.0,468700.0,469400.0,469400,469100.0,468700,469300,470300,472100,474300,477600,481400,485100,488800,492600,495900,499500,503500,506400,509900,515700,520800,522200,522400,523800,526200,528400,529600,530800,532200,533800,536200,540600,545600,551400,557200,563000,568700,573600,576200,578400,582200,588000,592200,592500,590200,588000,586400\nCA,Los Angeles,12447,Los Angeles-Long Beach-Anaheim,Los 
Angeles,2,155000.0,154600.0,154400.0,154200.0,154100.0,154300.0,154300.0,154200.0,154800.0,155900.0,157000.0,157700.0,158200.0,158600.0,158800.0,158900.0,159100.0,159800.0,160700.0,161900.0,163400.0,165400.0,167000.0,168500.0,169900.0,171400.0,172900.0,174300.0,175800.0,177800.0,180100.0,182600.0,184400.0,185600.0,186900.0,188200.0,189600.0,191300.0,193100.0,194700.0,196300.0,197700.0,199100.0,200700.0,202300.0,204400.0,207000.0,209800.0,212300.0,214500.0,216600.0,219000.0,221100.0,222800.0,224300.0,226100.0,228100.0,230600.0,233000.0,235400.0,237300.0,239100.0,240900.0,242900.0,245000.0,247300.0,250100.0,253100.0,255900.0,258800.0,261900.0,265200.0,268600.0,272600.0,276900.0,281800.0,287000.0,292200.0,297000.0,302100.0,307600.0,313400.0,319000.0,324300.0,329600.0,334600.0,339300.0,344500.0,350600.0,356800.0,363400.0,370700.0,378400.0,386500.0,394900.0,404300.0,414600.0,425500.0,436600.0,447400.0,456700.0,464400.0,471200.0,477400.0,483500.0,489100.0,494700.0,501400.0,509700.0,518300.0,527200.0,536100.0,545400.0,555200.0,564500.0,571900.0,576800.0,579700.0,581800.0,583800.0,585300.0,587300.0,589900.0,592200.0,593300.0,593400.0,593100.0,592900.0,591600.0,590900.0,591800.0,592600.0,592100.0,590200.0,586200.0,581600.0,577500.0,572800.0,567600.0,562100.0,554400.0,545000.0,535500.0,525400.0,513600.0,502000.0,491200.0,480200.0,469000.0,459300.0,451200.0,443900.0,436800.0,430900.0,426100.0,421800.0,417800.0,413700.0,410200.0,407900.0,406300.0,404900.0,404200.0,402900.0,405900.0,412000.0,415000.0,413100.0,412100.0,411300.0,410100.0,408400.0,406800.0,405100.0,403300.0,401900.0,401000.0,399200.0,397100.0,395000.0,392700.0,390200.0,387400.0,384700.0,382100.0,379500.0,377200.0,375700.0,373800.0,371500.0,370000.0,370300.0,372100.0,375300.0,378600.0,382100.0,385600.0,389000.0,391800.0,396400.0,401500,405700.0,410700,418200,425500,432700,440400,448100,455200,461900,467800,472300,475700,479400,484000,489400,494200,498100,501800,505600,509000,512600,516000,518900,521700,525100,528900
,532400,535300,538200,541000,544000,547200,550600,554200,558200,560800,562800,565600,569700,574000,577800,580600,583000,585100\nIL,Chicago,17426,Chicago,Cook,3,109700.0,109400.0,109300.0,109300.0,109100.0,109000.0,109000.0,109600.0,110200.0,110800.0,111300.0,111700.0,112200.0,112300.0,112100.0,112200.0,113000.0,113700.0,114200.0,114800.0,115500.0,116200.0,117100.0,117600.0,117800.0,118300.0,119200.0,120000.0,120600.0,121500.0,122300.0,122700.0,122900.0,123300.0,123700.0,124500.0,125700.0,127300.0,128800.0,130200.0,131400.0,132600.0,133700.0,134600.0,135500.0,136800.0,138300.0,140100.0,141900.0,143700.0,145300.0,146700.0,147900.0,149000.0,150400.0,152000.0,154000.0,155600.0,157000.0,158200.0,159900.0,161800.0,163700.0,165300.0,166400.0,167500.0,168800.0,170400.0,172100.0,173900.0,175600.0,177000.0,177800.0,177600.0,177300.0,177700.0,178800.0,180400.0,182300.0,183800.0,185000.0,185600.0,186800.0,188900.0,191300.0,194100.0,197500.0,200200.0,202300.0,203700.0,204000.0,204000.0,204400.0,205300.0,206300.0,207000.0,207600.0,208600.0,209600.0,210900.0,212800.0,214600.0,216400.0,218300.0,220300.0,222300.0,224000.0,225400.0,226900.0,228600.0,230100.0,231800.0,233200.0,234500.0,236000.0,237500.0,239000.0,240800.0,242500.0,243900.0,244900.0,245300.0,245400.0,245800.0,245800.0,245500.0,245900.0,246900.0,247300.0,247400.0,247300.0,247000.0,246700.0,246400.0,246100.0,246100.0,246300.0,246400.0,246700.0,247100.0,246700.0,245300.0,243900.0,242000.0,239800.0,237900.0,236000.0,233500.0,231800.0,230700.0,229200.0,226700.0,225200.0,224500.0,223800.0,223000.0,221900.0,219700.0,217500.0,215600.0,213800.0,212900.0,212300.0,211900.0,210800.0,209300.0,207300.0,205300.0,204200.0,204100.0,203100.0,201100.0,199000.0,196700.0,193800.0,191100.0,189200.0,188100.0,187600.0,186500.0,184400.0,181700.0,178700.0,175900.0,174100.0,172800.0,171400.0,170100.0,169100.0,167900.0,166700.0,166200.0,166400.0,166800.0,167900.0,168900.0,168400.0,167100.0,166900.0,167300.0,167500,167700.0,168300,169100,170400,172
400,175100,178200,181000,183200,184600,185800,187200,189100,191100,192500,192600,192400,192900,193900,195600,197800,200100,201700,202000,201200,200500,201500,204000,206500,207600,207700,208100,209100,209000,207800,206900,206200,205800,206200,207300,208200,209100,211000,213000\nPA,Philadelphia,13271,Philadelphia,Philadelphia,4,50000.0,49900.0,49600.0,49400.0,49400.0,49300.0,49300.0,49400.0,49700.0,49600.0,49500.0,49700.0,49800.0,49700.0,49700.0,49800.0,49700.0,49700.0,49800.0,49900.0,49900.0,50000.0,50300.0,50600.0,50800.0,50800.0,50800.0,50800.0,50700.0,50500.0,50500.0,50700.0,50700.0,50800.0,50900.0,51100.0,51200.0,51400.0,51500.0,51400.0,51500.0,51800.0,52100.0,52100.0,52300.0,52700.0,53100.0,53200.0,53400.0,53700.0,53800.0,53800.0,54100.0,54500.0,54700.0,54600.0,54800.0,55100.0,55400.0,55500.0,55400.0,55500.0,55700.0,55900.0,56300.0,56600.0,57000.0,57500.0,58100.0,58600.0,59100.0,59700.0,60300.0,60700.0,61200.0,61800.0,62200.0,62500.0,63000.0,63600.0,63900.0,64200.0,64700.0,65300.0,65700.0,66100.0,66800.0,67700.0,68500.0,69200.0,69800.0,70700.0,71700.0,72800.0,73700.0,74700.0,75700.0,76700.0,77800.0,79100.0,80500.0,82100.0,84000.0,85600.0,87000.0,88200.0,89600.0,91300.0,93000.0,94900.0,96700.0,98400.0,100200.0,101900.0,103400.0,104900.0,106400.0,107500.0,108200.0,109300.0,110800.0,112500.0,113800.0,114800.0,115600.0,116000.0,116400.0,116700.0,116800.0,116900.0,117300.0,117800.0,118200.0,118600.0,119300.0,120200.0,120900.0,121400.0,121300.0,120900.0,120200.0,119600.0,119600.0,119500.0,118800.0,118100.0,117500.0,117100.0,117000.0,116700.0,116300.0,115800.0,115500.0,115900.0,116300.0,116400.0,116400.0,116100.0,116000.0,116200.0,116700.0,117300.0,118000.0,118200.0,119500.0,120900.0,121300.0,121300.0,122100.0,123000.0,123300.0,122300.0,120000.0,118200.0,117600.0,117900.0,117800.0,117400.0,117000.0,116900.0,116700.0,116500.0,115700.0,115300.0,115500.0,115600.0,115200.0,114800.0,114100.0,113500.0,112900.0,111800.0,110800.0,110400.0,110400.0,110200.0,109900.0,109700.0,11
0000.0,110700.0,111800,112100.0,111900,112000,112200,111800,111200,111000,110900,111100,111800,112700,112900,113100,113900,114200,113600,113500,114100,114900,115500,115500,115400,115600,116000,116100,116100,116400,117000,117900,119000,120100,121300,122300,122700,122300,121600,121800,123300,125200,126400,127000,127400,128300,129100\nAZ,Phoenix,40326,Phoenix,Maricopa,5,87200.0,87700.0,88200.0,88400.0,88500.0,88900.0,89400.0,89700.0,90100.0,90700.0,91400.0,91700.0,91800.0,92000.0,92300.0,92600.0,93000.0,93400.0,94000.0,94600.0,95300.0,96100.0,96800.0,97300.0,97700.0,98400.0,99200.0,100100.0,100500.0,100700.0,100900.0,101700.0,102600.0,103400.0,103900.0,104400.0,105100.0,105900.0,106200.0,106600.0,107400.0,108300.0,109000.0,109700.0,110400.0,111000.0,111700.0,112800.0,113700.0,114300.0,115100.0,115600.0,115900.0,116500.0,117200.0,117400.0,117600.0,118400.0,119700.0,120700.0,121200.0,121500.0,122000.0,122400.0,122700.0,123000.0,123600.0,124300.0,125000.0,125800.0,126600.0,127200.0,127900.0,128400.0,128800.0,129500.0,130500.0,131600.0,132500.0,133200.0,134000.0,134900.0,135700.0,136500.0,137200.0,138000.0,138600.0,138900.0,139200.0,139400.0,139600.0,140300.0,141400.0,142500.0,143700.0,144900.0,145900.0,147100.0,148400.0,150300.0,153100.0,156200.0,159400.0,162900.0,166500.0,170000.0,173900.0,178800.0,185000.0,192300.0,200700.0,209400.0,217000.0,223600.0,229800.0,234900.0,238600.0,241300.0,243000.0,244100.0,244800.0,245400.0,245600.0,245600.0,245300.0,244600.0,243800.0,243400.0,243400.0,243600.0,243200.0,242200.0,241300.0,240200.0,238400.0,236400.0,234700.0,233300.0,231600.0,229100.0,226100.0,222800.0,218800.0,214300.0,209500.0,205200.0,201100.0,197300.0,193700.0,190300.0,186700.0,182800.0,180500.0,179600.0,178000.0,175100.0,172100.0,168400.0,164200.0,160000.0,156000.0,151800.0,147600.0,143900.0,138900.0,133400.0,130200.0,129200.0,127700.0,126200.0,124800.0,123100.0,120700.0,118500.0,117000.0,115800.0,114800.0,114100.0,113200.0,111800.0,110100.0,108000.0,105900.0,104100.0,1
02900.0,102300.0,102400.0,103000.0,104100.0,105800.0,107600.0,109100.0,111200.0,114000.0,117200.0,120400.0,123300.0,125800.0,128300.0,130500.0,132500,134400.0,136200,138400,141600,144700,147400,150500,153600,156100,158100,160000,161600,162700,163300,163700,164100,164200,164500,164700,165200,166200,167200,168400,169900,171000,171500,172100,172900,174100,175500,177100,179100,181000,182400,183800,185300,186600,188000,189100,190200,191300,192800,194500,195900\n'
I changed the column index to dates by dropping the non-date columns from the df:
quarter = df.drop(['RegionID','Metro','CountyName','SizeRank'],axis=1)
Then I converted the columns to datetime:
quarter.columns = pd.to_datetime(quarter.columns)
Then I would like to do something like
quarter = quarter.groupby(pd.TimeGrouper(freq='3M'),axis=1)
but it's not working. After that I would merge it back with the non-date columns. Also, with this approach I wouldn't know how to put the right labels on it, like [2015Q4,2016Q1,2016Q2,2016Q3,2016Q4].
Here is a vectorized solution which uses pd.PeriodIndex and groupby(..., axis=1):
Data:
In [69]: x
Out[69]:
2016-01 2016-02 2016-03 2016-04 2016-05 2016-06
0 1 0 1 0 0 0
1 2 0 1 0 0 0
2 1 1 2 0 1 0
Solution:
In [70]: x.groupby(pd.PeriodIndex(x.columns, freq='Q'), axis=1).mean()
Out[70]:
2016Q1 2016Q2
0 0.666667 0.000000
1 1.000000 0.000000
2 1.333333 0.333333
Explanation:
In [71]: pd.PeriodIndex(x.columns, freq='Q')
Out[71]: PeriodIndex(['2016Q1', '2016Q1', '2016Q1', '2016Q2', '2016Q2', '2016Q2'], dtype='period[Q-DEC]', freq='Q-DEC')
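Recent pandas releases deprecate the axis=1 argument of groupby, so here is a sketch of the same PeriodIndex idea written with a transpose instead. It uses the small example frame from this answer; for the real data, the non-date columns would presumably need to be moved into the index first (for example with set_index) so that only date labels remain as columns.

```python
import pandas as pd

x = pd.DataFrame({
    "2016-01": [1, 2, 1], "2016-02": [0, 0, 1], "2016-03": [1, 1, 2],
    "2016-04": [0, 0, 0], "2016-05": [0, 0, 1], "2016-06": [0, 0, 0],
})

# Map every 'YYYY-MM' column label to its quarter.
quarters = pd.PeriodIndex(x.columns, freq="Q")

# Transpose so the months are rows, average per quarter, transpose back.
quarterly = x.T.groupby(quarters).mean().T

# Plain string labels such as '2016Q1' are easier to merge and export.
quarterly.columns = quarterly.columns.astype(str)
print(quarterly)
```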
It's not pretty, but this is the first thing I thought of. It sounds like the date columns can be manipulated separately. Break those out into a separate dataframe of the form below. If the other fields are kept, the conversion to datetime will throw an error.
import numpy as np
import pandas as pd
csv_df = pd.DataFrame({'2016-01':[1,2,1], '2016-02':[0,0,1], '2016-03':[1,1,2], '2016-04':[0,0,0], '2016-05':[0,0,1], '2016-06':[0,0,0]})
# convert columns into datetime format
csv_df.rename(columns=lambda x: pd.to_datetime(x, format='%Y-%m'), inplace=True)
# now strip out the year and the quarter
csv_df.rename(columns=lambda x: str(x.year) + 'Q' + str(x.quarter), inplace=True)
# @lucarlig improved my suggestion by using groupby as follows
csv_df = csv_df.groupby(csv_df.columns, axis=1).mean()
Consider melting the dataframe from the wide format to long format, parse out the quarter and year using datetime and run a pivot_table() to transform back from long to wide aggregating the values with mean:
import pandas as pd
import datetime as dt
import numpy as np
...
# MELT DATAFRAME
meltdf = pd.melt(df, id_vars = ['State','RegionName','RegionID',
'Metro','CountyName','SizeRank'],
var_name = 'Date', value_name = 'Data')
# EXTRACT QUARTER
meltdf['Date'] = pd.to_datetime(meltdf['Date'] + '-01')
meltdf['YearQuarter'] = meltdf['Date'].dt.year.astype(str) + 'Q' + \
meltdf['Date'].dt.quarter.astype(str)
# PIVOT DATAFRAME
pivotdf = pd.pivot_table(meltdf, index=['State','RegionName','RegionID',
'Metro','CountyName','SizeRank'],
columns=['YearQuarter'], values='Data', aggfunc=np.mean)
Output
print(pivotdf.head())
# State RegionName RegionID Metro CountyName SizeRank 1996Q2 1996Q3 1996Q4 1997Q1 1997Q2 1997Q3 ...
# AZ Phoenix 40326 Phoenix Maricopa 5 87700 88600 89733.33333 91266.66667 92033.33333 93000
# CA Los Angeles 12447 Los Angeles...Los Angeles 2 154666.6667 154200 154433.3333 156866.6667 158533.3333 159266.6667
# IL Chicago 17426 Chicago Cook 3 109466.6667 109133.3333 109600 111266.6667 112200 112966.6667
# NY New York 6181 New York Queens 1
# PA Philadelphia 13271 Philadelphia Philadelphia 4 49833.33333 49366.66667 49466.66667 49600 49733.33333 49733.33333

Updating pandas DataFrame by key

I have a dataframe of historical stock trades. The frame has columns like ['ticker', 'date', 'cusip', 'profit', 'security_type']. Initially:
trades['cusip'] = np.nan
trades['security_type'] = np.nan
I have historical config files that I can load into frames that have columns like ['ticker', 'cusip', 'date', 'name', 'security_type', 'primary_exchange'].
I would like to UPDATE the trades frame with the cusip and security_type from config, but only where the ticker and date match.
I thought I could do something like:
pd.merge(trades, config, on=['ticker', 'date'], how='left')
But that doesn't update the columns, it just adds the config columns to trades.
The following works, but I think there has to be a better way. If not, I will probably do it outside of pandas.
for date in trades['date'].unique():
    config = get_config_file_as_df(date)
    ## config['date'] == date
    for ticker in trades['ticker'][trades['date'] == date]:
        trades['cusip'][
            (trades['ticker'] == ticker)
            & (trades['date'] == date)
        ] \
            = config['cusip'][config['ticker'] == ticker].values[0]
        trades['security_type'][
            (trades['ticker'] == ticker)
            & (trades['date'] == date)
        ] \
            = config['security_type'][config['ticker'] == ticker].values[0]
Suppose you have this setup:
import pandas as pd
import numpy as np
import datetime as DT
nan = np.nan
trades = pd.DataFrame({'ticker' : ['IBM', 'MSFT', 'GOOG', 'AAPL'],
'date' : pd.date_range('1/1/2000', periods = 4),
'cusip' : [nan, nan, 100, nan]
})
trades = trades.set_index(['ticker', 'date'])
print(trades)
# cusip
# ticker date
# IBM 2000-01-01 NaN
# MSFT 2000-01-02 NaN
# GOOG 2000-01-03 100 # <-- We do not want to overwrite this
# AAPL 2000-01-04 NaN
config = pd.DataFrame({'ticker' : ['IBM', 'MSFT', 'GOOG', 'AAPL'],
'date' : pd.date_range('1/1/2000', periods = 4),
'cusip' : [1,2,3,nan]})
config = config.set_index(['ticker', 'date'])
# Let's permute the index to show `DataFrame.update` correctly matches rows based on the index, not on the order of the rows.
new_index = sorted(config.index)
config = config.reindex(new_index)
print(config)
# cusip
# ticker date
# AAPL 2000-01-04 NaN
# GOOG 2000-01-03 3
# IBM 2000-01-01 1
# MSFT 2000-01-02 2
Then you can update NaN values in trades with values from config using the DataFrame.update method. Note that DataFrame.update matches rows based on indices (which is why set_index was called above).
trades.update(config, join = 'left', overwrite = False)
print(trades)
# cusip
# ticker date
# IBM 2000-01-01 1
# MSFT 2000-01-02 2
# GOOG 2000-01-03 100 # If overwrite = True, then 100 is overwritten by 3.
# AAPL 2000-01-04 NaN
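The same approach extends to the two columns from the question. The sketch below is only an illustration under assumptions: the tickers, dates and identifier strings are made up, and the empty columns are created with object dtype (rather than np.nan floats) so that string values can be filled in without a dtype change.

```python
import pandas as pd

trades = pd.DataFrame({
    "ticker": ["IBM", "MSFT", "GOOG"],
    "date": pd.to_datetime(["2000-01-03", "2000-01-03", "2000-01-04"]),
    "profit": [10.0, -2.5, 7.0],
    "cusip": pd.Series([None, None, "KEEP-ME"], dtype=object),
    "security_type": pd.Series([None, None, None], dtype=object),
})

# Made-up config rows, in a different order and with extra columns.
config = pd.DataFrame({
    "ticker": ["GOOG", "IBM", "MSFT"],
    "date": pd.to_datetime(["2000-01-04", "2000-01-03", "2000-01-03"]),
    "cusip": ["C-GOOG", "C-IBM", "C-MSFT"],
    "name": ["Google", "IBM Corp", "Microsoft"],
    "security_type": ["equity", "equity", "equity"],
})

keys = ["ticker", "date"]
trades = trades.set_index(keys)

# update() aligns on the (ticker, date) index; overwrite=False fills only the gaps.
trades.update(config.set_index(keys)[["cusip", "security_type"]], overwrite=False)
trades = trades.reset_index()
print(trades)
```

Only columns present in both frames are touched, so name never appears in trades, and the existing GOOG cusip is left alone.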
