Convert quandl fx hourly data to daily - python

I would like to convert hourly financial data imported into a pandas dataframe that has the following csv header to daily data:
symbol,date,hour,openbid,highbid,lowbid,closebid,openask,highask,lowask,closeask,totalticks
I've imported the data with pandas.read_csv(). I have eliminated all but one symbol from the data for testing purposes, and have figured out this part so far:
df.groupby('date').agg({'highask': [max], 'lowask': [min]})
I'm still pretty new to Python, so I'm not really sure how to continue. I'm guessing I can use some kind of anonymous function to create additional fields. For example, I'd like to get the open ask price for each date at hour 0, and the close ask price for each date at hour 23. Ideally, I would add additional columns and create a new dataframe. I also want to add a new column for market price, which is an average of the ask/bid for low, high, open, and close.
Any advice would be greatly appreciated. Thanks!
edit
As requested, here is the output I would expect for just 2018-07-24:
symbol,date,openbid,highbid,lowbid,closebid,openask,highask,lowask,closeask,totalticks
AUD/USD,2018-07-24,0.7422,0.74297,0.7429,0.74196,0.74257,0.743,0.74197,0.74258,5191
openbid is the openbid at the lowest hour for a single date, closebid is the closebid at the highest hour for a single date, etc. Total ticks is the sum. What I am really struggling with is determining openbid, openask, closebid, and closeask.
Sample data:
symbol,date,hour,openbid,highbid,lowbid,closebid,openask,highask,lowask,closeask,totalticks
AUD/USD,2018-07-24,22,0.7422,0.74249,0.74196,0.7423,0.74225,0.74252,0.74197,0.74234,1470
AUD/USD,2018-07-24,23,0.7423,0.74297,0.7423,0.74257,0.74234,0.743,0.74234,0.74258,3721
AUD/USD,2018-07-25,0,0.74257,0.74334,0.74237,0.74288,0.74258,0.74335,0.74239,0.74291,7443
AUD/USD,2018-07-25,1,0.74288,0.74492,0.74105,0.74111,0.74291,0.74501,0.74107,0.74111,14691
AUD/USD,2018-07-25,2,0.74111,0.74127,0.74015,0.74073,0.74111,0.74129,0.74018,0.74076,6898
AUD/USD,2018-07-25,3,0.74073,0.74076,0.73921,0.73987,0.74076,0.74077,0.73923,0.73989,6207
AUD/USD,2018-07-25,4,0.73987,0.74002,0.73921,0.73953,0.73989,0.74003,0.73923,0.73956,3453
AUD/USD,2018-07-25,5,0.73953,0.74094,0.73946,0.74041,0.73956,0.74096,0.73947,0.74042,7187
AUD/USD,2018-07-25,6,0.74041,0.74071,0.73921,0.74056,0.74042,0.74069,0.73922,0.74059,10646
AUD/USD,2018-07-25,7,0.74056,0.74066,0.73973,0.74035,0.74059,0.74068,0.73974,0.74037,9285
AUD/USD,2018-07-25,8,0.74035,0.74206,0.73996,0.74198,0.74037,0.74207,0.73998,0.742,10234
AUD/USD,2018-07-25,9,0.74198,0.74274,0.74176,0.74225,0.742,0.74275,0.74179,0.74227,8224
AUD/USD,2018-07-25,10,0.74225,0.74237,0.74122,0.74142,0.74227,0.74237,0.74124,0.74143,7143
AUD/USD,2018-07-25,11,0.74142,0.74176,0.74093,0.74152,0.74143,0.74176,0.74095,0.74152,7307
AUD/USD,2018-07-25,12,0.74152,0.74229,0.74078,0.74219,0.74152,0.74229,0.74079,0.74222,10523
AUD/USD,2018-07-25,13,0.74219,0.74329,0.74138,0.74141,0.74222,0.74332,0.74136,0.74145,13983
AUD/USD,2018-07-25,14,0.74141,0.74217,0.74032,0.74065,0.74145,0.7422,0.74034,0.74067,21814
AUD/USD,2018-07-25,15,0.74065,0.74151,0.73989,0.74113,0.74067,0.74152,0.73988,0.74115,16085
AUD/USD,2018-07-25,16,0.74113,0.74144,0.74056,0.7411,0.74115,0.74146,0.74058,0.74111,7752
AUD/USD,2018-07-25,17,0.7411,0.7435,0.74092,0.74346,0.74111,0.74353,0.74094,0.74348,11348
AUD/USD,2018-07-25,18,0.74346,0.74445,0.74331,0.74373,0.74348,0.74446,0.74333,0.74373,9898
AUD/USD,2018-07-25,19,0.74373,0.74643,0.74355,0.74559,0.74373,0.74643,0.74358,0.7456,11756
AUD/USD,2018-07-25,20,0.74559,0.74596,0.74478,0.74549,0.7456,0.746,0.74481,0.74562,5607
AUD/USD,2018-07-25,21,0.74549,0.74562,0.74417,0.74438,0.74562,0.74576,0.74422,0.74442,3613
AUD/USD,2018-07-26,22,0.73762,0.73792,0.73762,0.73774,0.73772,0.73798,0.73768,0.73779,1394
AUD/USD,2018-07-26,23,0.73774,0.73813,0.73744,0.73807,0.73779,0.73816,0.73746,0.73808,3465
AUD/USD,2018-07-27,0,0.73807,0.73826,0.73733,0.73763,0.73808,0.73828,0.73735,0.73764,6582
AUD/USD,2018-07-27,1,0.73763,0.73854,0.73734,0.73789,0.73764,0.73857,0.73736,0.73788,7373
AUD/USD,2018-07-27,2,0.73789,0.73881,0.73776,0.73881,0.73788,0.73883,0.73778,0.73882,3414
AUD/USD,2018-07-27,3,0.73881,0.7393,0.73849,0.73875,0.73882,0.73932,0.73851,0.73877,4639
AUD/USD,2018-07-27,4,0.73875,0.739,0.73852,0.73858,0.73877,0.73901,0.73852,0.73859,2487
AUD/USD,2018-07-27,5,0.73858,0.73896,0.7381,0.73887,0.73859,0.73896,0.73812,0.73888,5332
AUD/USD,2018-07-27,6,0.73887,0.73902,0.73792,0.73879,0.73888,0.73902,0.73793,0.73881,7623
AUD/USD,2018-07-27,7,0.73879,0.7395,0.73844,0.73885,0.73881,0.7395,0.73846,0.73887,9577
AUD/USD,2018-07-27,8,0.73885,0.73897,0.73701,0.73727,0.73887,0.73899,0.73702,0.73729,12280
AUD/USD,2018-07-27,9,0.73727,0.73784,0.737,0.73721,0.73729,0.73786,0.73701,0.73723,8634
AUD/USD,2018-07-27,10,0.73721,0.73798,0.73717,0.73777,0.73723,0.73798,0.73718,0.73779,7510
AUD/USD,2018-07-27,11,0.73777,0.73789,0.73728,0.73746,0.73779,0.73789,0.7373,0.73745,4947
AUD/USD,2018-07-27,12,0.73746,0.73927,0.73728,0.73888,0.73745,0.73929,0.73729,0.73891,16853
AUD/USD,2018-07-27,13,0.73888,0.74083,0.73853,0.74066,0.73891,0.74083,0.73855,0.74075,14412
AUD/USD,2018-07-27,14,0.74066,0.74147,0.74025,0.74062,0.74075,0.74148,0.74026,0.74064,15187
AUD/USD,2018-07-27,15,0.74062,0.74112,0.74002,0.74084,0.74064,0.74114,0.74003,0.74086,10044
AUD/USD,2018-07-27,16,0.74084,0.74091,0.73999,0.74001,0.74086,0.74092,0.74,0.74003,6893
AUD/USD,2018-07-27,17,0.74001,0.74022,0.73951,0.74008,0.74003,0.74025,0.73952,0.74009,5865
AUD/USD,2018-07-27,18,0.74008,0.74061,0.74002,0.74046,0.74009,0.74062,0.74004,0.74047,4334
AUD/USD,2018-07-27,19,0.74046,0.74072,0.74039,0.74041,0.74047,0.74073,0.74041,0.74043,3654
AUD/USD,2018-07-27,20,0.74041,0.74066,0.74005,0.74011,0.74043,0.74068,0.74018,0.74023,1547
AUD/USD,2018-07-25,22,0.74438,0.74526,0.74436,0.74489,0.74442,0.7453,0.74439,0.74494,2220
AUD/USD,2018-07-25,23,0.74489,0.74612,0.74489,0.7459,0.74494,0.74612,0.74492,0.74592,4886
AUD/USD,2018-07-26,0,0.7459,0.74625,0.74536,0.74571,0.74592,0.74623,0.74536,0.74573,6602
AUD/USD,2018-07-26,1,0.74571,0.74633,0.74472,0.74479,0.74573,0.74634,0.74471,0.74481,10123
AUD/USD,2018-07-26,2,0.74479,0.74485,0.74375,0.74434,0.74481,0.74487,0.74378,0.74437,7844
AUD/USD,2018-07-26,3,0.74434,0.74459,0.74324,0.744,0.74437,0.74461,0.74328,0.744,6037
AUD/USD,2018-07-26,4,0.744,0.74428,0.74378,0.74411,0.744,0.7443,0.74379,0.74414,3757
AUD/USD,2018-07-26,5,0.74411,0.74412,0.74346,0.74349,0.74414,0.74414,0.74344,0.74349,5713
AUD/USD,2018-07-26,6,0.74349,0.74462,0.74291,0.74299,0.74349,0.74464,0.74293,0.743,12650
AUD/USD,2018-07-26,7,0.74299,0.74363,0.74267,0.74361,0.743,0.74363,0.74269,0.74362,8067
AUD/USD,2018-07-26,8,0.74361,0.74375,0.74279,0.74287,0.74362,0.74376,0.7428,0.74288,6988
AUD/USD,2018-07-26,9,0.74287,0.74322,0.74212,0.74318,0.74288,0.74323,0.74212,0.74319,7784
AUD/USD,2018-07-26,10,0.74318,0.74329,0.74249,0.74276,0.74319,0.74331,0.7425,0.74276,5271
AUD/USD,2018-07-26,11,0.74276,0.74301,0.74179,0.74201,0.74276,0.74303,0.7418,0.74199,7434
AUD/USD,2018-07-26,12,0.74201,0.74239,0.74061,0.74064,0.74199,0.74241,0.74063,0.74066,20513
AUD/USD,2018-07-26,13,0.74064,0.74124,0.73942,0.74008,0.74066,0.74124,0.73943,0.74005,19715
AUD/USD,2018-07-26,14,0.74008,0.74014,0.73762,0.73887,0.74005,0.74013,0.73764,0.73889,21137
AUD/USD,2018-07-26,15,0.73887,0.73936,0.73823,0.73831,0.73889,0.73936,0.73824,0.73833,11186
AUD/USD,2018-07-26,16,0.73831,0.73915,0.73816,0.73908,0.73833,0.73916,0.73817,0.73908,6016
AUD/USD,2018-07-26,17,0.73908,0.73914,0.73821,0.73884,0.73908,0.73917,0.73823,0.73887,6197
AUD/USD,2018-07-26,18,0.73884,0.73885,0.73737,0.73773,0.73887,0.73887,0.73737,0.73775,6127
AUD/USD,2018-07-26,19,0.73773,0.73794,0.73721,0.73748,0.73775,0.73797,0.73724,0.73751,3614
AUD/USD,2018-07-26,20,0.73748,0.73787,0.73746,0.73767,0.73751,0.7379,0.73748,0.73773,1801
AUD/USD,2018-07-26,21,0.73767,0.73807,0.73755,0.73762,0.73773,0.73836,0.73769,0.73772,1687

To assign a new column avg_market_price as the average:
df = df.assign(avg_market_price=df[['openbid', 'highbid', 'lowbid', 'closebid',
'openask', 'highask', 'lowask', 'closeask']].mean(axis=1))
You then want to set the index to a datetime index by combining the date and hour fields, then resample your data to daily periods (1d). Finally, use apply to get the max, min and average values on specific columns.
import numpy as np
(df
 .set_index(pd.to_datetime(df['date'])            # make sure 'date' is a datetime, not a string
            + pd.to_timedelta(df['hour'], unit='h'))
 .resample('1d')
 .apply({'highask': 'max', 'lowask': 'min', 'avg_market_price': np.mean}))
highask lowask avg_market_price
2018-07-24 0.74300 0.74197 0.742402
2018-07-25 0.74643 0.73922 0.742142
2018-07-26 0.74634 0.73724 0.741239
2018-07-27 0.74148 0.73701 0.739011
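The open/close fields the edit asks about can be handled with the same groupby approach: sort by hour so that 'first' and 'last' pick the row at the earliest and latest hour of each date. Below is a minimal sketch (an addition, not part of the original answer; it assumes pandas 0.25+ for named aggregation and uses the column names from the sample data):
daily = (df
         .sort_values(['date', 'hour'])
         .groupby('date')
         .agg(openbid=('openbid', 'first'),    # open = value at the earliest hour of the day
              highbid=('highbid', 'max'),
              lowbid=('lowbid', 'min'),
              closebid=('closebid', 'last'),   # close = value at the latest hour of the day
              openask=('openask', 'first'),
              highask=('highask', 'max'),
              lowask=('lowask', 'min'),
              closeask=('closeask', 'last'),
              totalticks=('totalticks', 'sum')))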

Related

Upsample timeseries with weather data in a correct way

I have a dataset that holds weather data for each month, from the 1st to the 20th day of the month and for each hour of the day, throughout a year; the last 10 days (with their hours) of each month are removed.
The weather data are :
(temperature - humidity - wind_speed - visibility - dew_temperature - solar_radiation - rainfall -snowfall)
I want to upsample the dataset as a time series to fill in the missing days, but I face many issues due to the changes in climate.
Here is what I have tried so far:
def get_hour_month_mean(data, date, hour, max_id):
    return {'ID': max_id,
            'temperature': data['temperature'].mean(),
            'humidity': data['humidity'].mean(),
            'date': date,
            'hour': hour,
            'wind_speed': data['wind_speed'].mean(),
            'visibility': data['visibility'].mean(),
            'dew_temperature': data['dew_temperature'].mean(),
            'solar_radiation': data['solar_radiation'].mean(),
            'rainfall': data['rainfall'].mean(),
            'count': data['count'].mean() if str(date.date()) not in seoul_not_func else 0,
            'snowfall': data['snowfall'].mean(),
            'season': data['season'].mode()[0],
            'is_holiday': 'No Holiday' if str(date.date()) not in seoul_p_holidays_17_18 else 'Holiday',
            'functional_day': 'Yes' if str(date.date()) not in seoul_not_func else 'No',
            }
def upsample_data_with_missing_dates(data):
    data_range = pd.date_range(start="2017-12-20", end="2018-11-30", freq='D')
    missing_range = data_range.difference(df['date'])
    hour_range = range(0, 24)
    max_id = data['ID'].max()
    data_copy = data.copy()
    for date in missing_range:
        for hour in hour_range:
            max_id += 1
            year = data_copy.year
            month = date.month
            if date.month == 11:
                year -= 1
                month = 12
            else:
                month += 1
            month_mask = ((data_copy['year'] == year) &
                          (data_copy['month'] == month) &
                          (data_copy['hour'] == hour) &
                          (data_copy['day'].isin([1, 2])))
            data_filter = data_copy[month_mask]
            dict_row = get_hour_month_mean(data_filter, date, hour, max_id)
            data = data.append(dict_row, ignore_index=True)
    return data
Any ideas on the best way to fill in the values for the missing days, given that I have the previous 20 days and the next 20 days?
There are many ways to deal with missing time-series values.
You already tried the traditional approach, imputing missing data with mean values. The drawback of this method is the bias it introduces when many values have to be imputed.
You can instead try a genetic algorithm (GA), support vector regression (SVR), or autoregressive (AR) and moving-average (MA) models for time-series imputation and modeling. These methods forecast and/or impute the series and help overcome the bias problem of the traditional (mean) approach.
(Keep in mind that you have a multivariate time series.)
Here are some resources you can use:
A Survey on Deep Learning Approaches
time.series.missing-values-in-time-series-in-python
Interpolation in Python to fill Missing Values
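As a concrete starting point for the interpolation route in the last link (my own sketch, not from the original answer), pandas can interpolate directly over a datetime index; the toy series below stands in for one weather variable:
import numpy as np
import pandas as pd

# toy hourly series with a gap
idx = pd.date_range('2018-01-01', periods=6, freq='h')
s = pd.Series([1.0, 2.0, np.nan, np.nan, 5.0, 6.0], index=idx)

# method='time' does linear interpolation weighted by the time distance between points
print(s.interpolate(method='time'))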

How do I transfer values of a CSV files between certain dates to another CSV file based on the dates in the rows in that file?

Long question: I have two CSV files, one called SF1 which has quarterly data (only 4 times a year) with a datekey column, and one called DAILY which gives data every day. This is financial data, so there are ticker columns.
I need to grab the quarterly data from SF1 and write it to the DAILY csv file for all the days in between, until the next set of quarterly data arrives.
For example, AAPL has quarterly data released in SF1 on 2010-01-01 and its next earnings report is going to be on 2010-03-04. I then need every row in the DAILY file with ticker AAPL between 2010-01-01 and 2010-03-04 to have the same information as that one row for that date in the SF1 file.
So far, I have made a Python dictionary that goes through the SF1 file and adds the dates to a list, keyed by ticker. I thought about referencing the dates stored in the dictionary to look up the data to write to the DAILY file.
Some of the columns needed to transfer from the SF1 file to the DAILY file are:
['accoci', 'assets', 'assetsavg', 'assetsc', 'assetsnc', 'assetturnover', 'bvps', 'capex', 'cashneq', 'cashnequsd', 'cor', 'consolinc', 'currentratio', 'de', 'debt', 'debtc', 'debtnc', 'debtusd', 'deferredrev', 'depamor', 'deposits', 'divyield', 'dps', 'ebit']
Code so far:
for ind, row in sf1.iterrows():
sf1_date = row['datekey']
sf1_ticker = row['ticker']
company_date.setdefault(sf1_ticker, []).append(sf1_date)
What would be the best way to solve this problem?
SF1 csv:
ticker,dimension,calendardate,datekey,reportperiod,lastupdated,accoci,assets,assetsavg,assetsc,assetsnc,assetturnover,bvps,capex,cashneq,cashnequsd,cor,consolinc,currentratio,de,debt,debtc,debtnc,debtusd,deferredrev,depamor,deposits,divyield,dps,ebit,ebitda,ebitdamargin,ebitdausd,ebitusd,ebt,eps,epsdil,epsusd,equity,equityavg,equityusd,ev,evebit,evebitda,fcf,fcfps,fxusd,gp,grossmargin,intangibles,intexp,invcap,invcapavg,inventory,investments,investmentsc,investmentsnc,liabilities,liabilitiesc,liabilitiesnc,marketcap,ncf,ncfbus,ncfcommon,ncfdebt,ncfdiv,ncff,ncfi,ncfinv,ncfo,ncfx,netinc,netinccmn,netinccmnusd,netincdis,netincnci,netmargin,opex,opinc,payables,payoutratio,pb,pe,pe1,ppnenet,prefdivis,price,ps,ps1,receivables,retearn,revenue,revenueusd,rnd,roa,roe,roic,ros,sbcomp,sgna,sharefactor,sharesbas,shareswa,shareswadil,sps,tangibles,taxassets,taxexp,taxliabilities,tbvps,workingcapital
A,ARQ,2020-09-14,2020-09-14,2020-09-14,2020-09-14,53000000,7107000000,,4982000000,2125000000,,10.219,-30000000,1368000000,1368000000,1160000000,131000000,2.41,0.584,665000000,111000000,554000000,665000000,281000000,96000000,0,0.0,0.0,202000000,298000000,0.133,298000000,202000000,202000000,0.3,0.3,0.3,4486000000,,4486000000,50960600000,,,354000000,0.806,1.0,1086000000,0.484,0,0,4337000000,,1567000000,42000000,42000000,0,2621000000,2067000000,554000000,51663600000,1368000000,-160000000,2068000000,111000000,0,1192000000,-208000000,-42000000,384000000,0,131000000,131000000,131000000,0,0,0.058,915000000,171000000,635000000,0.0,11.517,,,1408000000,0,114.3,,,1445000000,131000000,2246000000,2246000000,290000000,,,,,0,625000000,1.0,452000000,439000000,440000000,5.116,7107000000,0,71000000,113000000,16.189,2915000000
Daily csv:
ticker,date,lastupdated,ev,evebit,evebitda,marketcap,pb,pe,ps
A,2020-09-14,2020-09-14,31617.1,36.3,26.8,30652.1,6.2,44.4,5.9
Ideal csv after code run (with all the numbers for the assets under them):
ticker,date,lastupdated,ev,evebit,evebitda,marketcap,pb,pe,ps,accoci,assets,assetsavg,assetsc,assetsnc,assetturnover,bvps,capex,cashneq,cashnequsd,cor,consolinc,currentratio,de,debt,debtc,debtnc,debtusd,deferredrev,depamor,deposits,divyield,dps,ebit,ebitda,ebitdamargin,ebitdausd,ebitusd,ebt,eps,epsdil,epsusd,equity,equityavg,equityusd,ev,evebit,evebitda,fcf,fcfps,fxusd,gp,grossmargin,intangibles,intexp,invcap,invcapavg,inventory,investments,investmentsc,investmentsnc,liabilities,liabilitiesc,liabilitiesnc,marketcap,ncf,ncfbus,ncfcommon,ncfdebt,ncfdiv,ncff,ncfi,ncfinv,ncfo,ncfx,netinc,netinccmn,netinccmnusd,netincdis,netincnci,netmargin,opex,opinc,payables,payoutratio,pb,pe,pe1,ppnenet,prefdivis,price,ps,ps1,receivables,retearn,revenue,revenueusd,rnd,roa,roe,roic,ros,sbcomp,sgna,sharefactor,sharesbas,shareswa,shareswadil,sps,tangibles,taxassets,taxexp,taxliabilities,tbvps,workingcapital
The solution is merge_asof: it merges on date columns, matching each row to the nearest key in the other dataframe that falls immediately before (or, depending on the direction, after) it.
As it is not explicit, I will assume here that daily.date and sf1.datekey are both true date columns, meaning that their dtype is datetime64[ns]. merge_asof cannot use string columns with an object dtype.
I will also assume that you do not want the ev, evebit, evebitda, marketcap, pb, pe and ps columns from the sf1 dataframe, because their names conflict with columns from daily (more on that later):
Code could be:
df = pd.merge_asof(daily,
                   sf1.drop(columns=['dimension', 'calendardate',
                                     'reportperiod', 'lastupdated',
                                     'ev', 'evebit', 'evebitda',
                                     'marketcap', 'pb', 'pe', 'ps']),
                   by='ticker', left_on='date',
                   right_on='datekey')
You get the following list of columns: ticker, date, lastupdated, ev, evebit, evebitda, marketcap, pb, pe, ps, datekey, accoci, assets, assetsavg, assetsc, assetsnc, assetturnover, bvps, capex, cashneq, cashnequsd, cor, consolinc, currentratio, de, debt, debtc, debtnc, debtusd, deferredrev, depamor, deposits, divyield, dps, ebit, ebitda, ebitdamargin, ebitdausd, ebitusd, ebt, eps, epsdil, epsusd, equity, equityavg, equityusd, fcf, fcfps, fxusd, gp, grossmargin, intangibles, intexp, invcap, invcapavg, inventory, investments, investmentsc, investmentsnc, liabilities, liabilitiesc, liabilitiesnc, ncf, ncfbus, ncfcommon, ncfdebt, ncfdiv, ncff, ncfi, ncfinv, ncfo, ncfx, netinc, netinccmn, netinccmnusd, netincdis, netincnci, netmargin, opex, opinc, payables, payoutratio, pe1, ppnenet, prefdivis, price, ps1, receivables, retearn, revenue, revenueusd, rnd, roa, roe, roic, ros, sbcomp, sgna, sharefactor, sharesbas, shareswa, shareswadil, sps, tangibles, taxassets, taxexp, taxliabilities, tbvps, workingcapital with their relevant values
If you want to keep the columns that exist in both dataframes, you will have to rename them. Here is a possible version adding _d to the names of the columns coming from daily:
df2 = pd.merge_asof(daily,
                    sf1.drop(columns=['dimension', 'calendardate',
                                      'reportperiod', 'lastupdated']),
                    by='ticker', left_on='date',
                    right_on='datekey', suffixes=('_d', ''))
The list of columns is now: ticker, date, lastupdated, ev_d, evebit_d, evebitda_d, marketcap_d, pb_d, pe_d, ps_d, datekey, accoci, ...
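One practical note (an assumption on my part, since the dtypes of the raw CSVs are not shown): if date and datekey come in as plain strings, they need to be converted first, and merge_asof also requires both frames to be sorted on their merge keys, along these lines:
daily['date'] = pd.to_datetime(daily['date'])
sf1['datekey'] = pd.to_datetime(sf1['datekey'])
# merge_asof requires both frames to be sorted on their merge keys
daily = daily.sort_values('date')
sf1 = sf1.sort_values('datekey')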

Make a function to return length based on a condition

I have 2 DataFrames: one contains stock tickers and a maximum/minimum price range, along with other columns.
The other DataFrame has dates as indexes and is grouped by tickers with various metrics like open, close, high, low, etc. Now, I want to take a count of days from this DataFrame where, for a given stock, the close price was higher than the minimum price.
I am stuck here: I want to find, for example, how many days AMZN was trading below the period max price.
In other words, I want to take the count of days from the second DataFrame based on values from the first DataFrame: the count of days where the closing price was less than / greater than the period max/min price.
I have added the code to reproduce the DataFrames.
Please check the screen shots.
import pandas as pd
import datetime
from dateutil.relativedelta import relativedelta
import yfinance as yf
start=datetime.datetime.today()-relativedelta(years=2)
end=datetime.datetime.today()
us_stock_list='FB AMZN BABA'
data_metric = yf.download(us_stock_list, start=start, end=end,group_by='column',auto_adjust=True)
data_ticker= yf.download(us_stock_list, start=start, end=end,group_by='ticker',auto_adjust=True)
stock_list=[stock for stock in data_ticker.stack()]
# max_price
max_values=pd.DataFrame(data_ticker.max().unstack()['High'])
# min_price
min_values=pd.DataFrame(data_ticker.min().unstack()['Low'])
# latest_price
latest_day=pd.DataFrame(data_ticker.tail(1).unstack())
latest_day=latest_day.unstack().unstack().unstack().reset_index()
# latest_day=latest_day.unstack().reset_index()
latest_day=latest_day.drop(columns=['level_0','Date'])
latest_day.set_index('level_3',inplace=True)
latest_day.rename(columns={0:'Values'},inplace=True)
latest_day=latest_day.groupby(by=['level_3','level_2']).max().unstack()
latest_day.columns=[ '_'.join(x) for x in latest_day.columns ]
latest_day=latest_day.join(max_values,how='inner')
latest_day=latest_day.join(min_values,how='inner')
latest_day.rename(columns={'High':'Period_High_Max','Low':'Period_Low_Min'},inplace=True)
close_price_data=pd.DataFrame(data_metric['Close'].unstack().reset_index())
close_price_data= close_price_data.rename(columns={'level_0':'Stock',0:'Close_price'})
close_price_data.set_index('Stock',inplace=True)
Use this to reproduce:
{"Values_Close":{"AMZN":2286.0400390625,"BABA":194.4799957275,"FB":202.2700042725},"Values_High":{"AMZN":2362.4399414062,"BABA":197.3800048828,"FB":207.2799987793},"Values_Low":{"AMZN":2258.1899414062,"BABA":192.8600006104,"FB":199.0500030518},"Values_Open":{"AMZN":2336.8000488281,"BABA":195.75,"FB":201.6000061035},"Values_Volume":{"AMZN":9754900.0,"BABA":22268800.0,"FB":30399600.0},"Period_High_Max":{"AMZN":2475.0,"BABA":231.1399993896,"FB":224.1999969482},"Period_Low_Min":{"AMZN":1307.0,"BABA":129.7700042725,"FB":123.0199966431},"%_Position":{"AMZN":0.8382192115,"BABA":0.6383544892,"FB":0.7832576338}}
{"Stock":{
"0":"AMZN",
"1":"AMZN",
"2":"AMZN",
"3":"AMZN",
"4":"AMZN",
"5":"AMZN",
"6":"AMZN",
"7":"AMZN",
"8":"AMZN",
"9":"AMZN",
"10":"AMZN",
"11":"AMZN",
"12":"AMZN",
"13":"AMZN",
"14":"AMZN",
"15":"AMZN",
"16":"AMZN",
"17":"AMZN",
"18":"AMZN",
"19":"AMZN"},
"Date":{
"0":1525305600000,
"1":1525392000000,
"2":1525651200000,
"3":1525737600000,
"4":1525824000000,
"5":1525910400000,
"6":1525996800000,
"7":1526256000000,
"8":1526342400000,
"9":1526428800000,
"10":1526515200000,
"11":1526601600000,
"12":1526860800000,
"13":1526947200000,
"14":1527033600000,
"15":1527120000000,
"16":1527206400000,
"17":1527552000000,
"18":1527638400000,
"19":1527724800000 },
"Close_price":{
"0":1572.0799560547,
"1":1580.9499511719,
"2":1600.1400146484,
"3":1592.3900146484,
"4":1608.0,
"5":1609.0799560547,
"6":1602.9100341797,
"7":1601.5400390625,
"8":1576.1199951172,
"9":1587.2800292969,
"10":1581.7600097656,
"11":1574.3699951172,
"12":1585.4599609375,
"13":1581.4000244141,
"14":1601.8599853516,
"15":1603.0699462891,
"16":1610.1500244141,
"17":1612.8699951172,
"18":1624.8900146484,
"19":1629.6199951172}}
Do a merge between both dataframes, groupby company (index level=0) and apply a custom function:
df_merge = close_price_data.merge(
    latest_day[['Period_High_Max', 'Period_Low_Min']],
    left_index=True,
    right_index=True)

def fun(df):
    d = {}
    d['days_above_min'] = (df.Close_price > df.Period_Low_Min).sum()
    d['days_below_max'] = (df.Close_price < df.Period_High_Max).sum()
    return pd.Series(d)

df_merge.groupby(level=0).apply(fun)
Period_Low_Min and Period_High_Max are the min and max respectively, so all closing prices will be in that range; if this is not what you are trying to accomplish, let me know.
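As a quick way to verify that caveat (a sketch reusing the df_merge and fun defined above), compare the counts with the number of rows per ticker; they should essentially coincide:
counts = df_merge.groupby(level=0).apply(fun)
print(counts)
print(df_merge.groupby(level=0).size())  # rows per ticker, for comparison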

transform raw date format into pandas date object

I have a CSV file which looks like this:
time, Numbers
[30/Apr/1998:21:30:17,24736
[30/Apr/1998:21:30:53,24736
[30/Apr/1998:21:31:12,24736
[30/Apr/1998:21:31:19,3781
[30/Apr/1998:21:31:22,-
[30/Apr/1998:21:31:27,24736
[30/Apr/1998:21:31:29,-
[30/Apr/1998:21:31:29,-
[30/Apr/1998:21:31:32,929
[30/Apr/1998:21:31:43,-
[30/Apr/1998:21:31:44,1139
[30/Apr/1998:21:31:52,24736
[30/Apr/1998:21:31:52,3029
[30/Apr/1998:21:32:06,24736
[30/Apr/1998:21:32:16,-
[30/Apr/1998:21:32:16,-
[30/Apr/1998:21:32:17,-
[30/Apr/1998:21:32:30,14521
[30/Apr/1998:21:32:33,11324
[30/Apr/1998:21:32:35,24736
[30/Apr/1998:21:32:3l8,671
[30/Apr/1998:21:32:38,1512
[30/Apr/1998:21:32:38,1136
[30/Apr/1998:21:32:38,1647
[30/Apr/1998:21:32:38,1271
[30/Apr/1998:21:32:52,5933
[30/Apr/1998:21:32:58,-
[30/Apr/1998:21:32:59,231
... (and so on, up to one billion rows)
Forget about the Numbers column; my concern is converting this date/time format in my CSV file to a pandas timestamp, so I can plot my dataset and visualize it according to time. As I am new to data science, here is my approach:
step 1: read the whole time column from my CSV file into an array,
step 2: split the data at the colon separating the date from the time, making two new arrays, one for dates and one for times,
step 3: remove "[" from the date array,
step 4: replace all forward slashes with dashes in the date array,
step 5: then append the date and time arrays to make a single pandas-style timestamp,
which would look like this: 2017-03-22 15:16:45. As you can tell, I am new and my approach is naive and probably wrong; if someone can help by providing a code snippet, I will be really happy. Thanks!
You can pass a format to pd.to_datetime(); in this case: [%d/%b/%Y:%H:%M:%S.
Be careful with erroneous data, though, as seen in row 3 of the sample data below ([30/Apr/1998:21:32:3l8,671). To avoid an error you can pass errors='coerce', which returns Not a Time (NaT) for values that cannot be parsed.
The other way would be to fix those rows manually or write some sort of regex/replace function first.
import pandas as pd
from io import StringIO  # pd.compat.StringIO was removed in newer pandas versions

data = '''\
time, Numbers
[30/Apr/1998:21:30:17,24736
[30/Apr/1998:21:30:53,24736
[30/Apr/1998:21:32:3l8,671
[30/Apr/1998:21:32:38,1512
[30/Apr/1998:21:32:38,1136
[30/Apr/1998:21:32:58,-
[30/Apr/1998:21:32:59,231'''

fileobj = StringIO(data)
# skipinitialspace strips the blank after the comma so the second column is named 'Numbers'
df = pd.read_csv(fileobj, sep=',', na_values=['-'], skipinitialspace=True)
df['time'] = pd.to_datetime(df['time'], format='[%d/%b/%Y:%H:%M:%S', errors='coerce')
print(df)
Returns:
time Numbers
0 1998-04-30 21:30:17 24736.0
1 1998-04-30 21:30:53 24736.0
2 NaT 671.0
3 1998-04-30 21:32:38 1512.0
4 1998-04-30 21:32:38 1136.0
5 1998-04-30 21:32:58 NaN
6 1998-04-30 21:32:59 231.0
Note that na_values=['-'] was used here so that pandas treats the Numbers column as numeric rather than as strings.
And now we can perform actions like grouping (on minute for instance):
print(df.groupby(df.time.dt.minute)['Numbers'].mean())
#time
#30.0 24736.000000
#32.0 959.666667
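For the regex/replace route mentioned above, one possibility (a hedged sketch, assuming every valid value follows the [dd/Mon/yyyy:HH:MM:SS shape) is to flag the raw strings that will not parse, before the to_datetime call in the example:
pattern = r'^\[\d{2}/[A-Za-z]{3}/\d{4}:\d{2}:\d{2}:\d{2}$'
bad = ~df['time'].astype(str).str.match(pattern)  # run this on the raw string column
print(df.loc[bad, 'time'])                        # e.g. [30/Apr/1998:21:32:3l8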

Difficult adding up elements in a pandas DataFrame

I'm currently having trouble adding up the rows for the following DataFrame which I have constructed for the returns for six companies' stocks:
def importdata(data):
    returns = pd.read_excel(data)                    # Imports the data from Excel
    returns_with_dates = returns.set_index('Dates')  # Sets the Dates as the df index
    return returns_with_dates
which outputs:
Out[345]:
Company 1 Company 2 Company 3 Company 4 Company 5 Company 6
Dates
1997-01-02 31.087620 3.094705 24.058686 31.694404 37.162890 13.462241
1997-01-03 31.896592 3.109631 22.423629 32.064378 37.537013 13.511706
1997-01-06 31.723241 3.184358 18.803148 32.681000 37.038183 13.684925
1997-01-07 31.781024 3.199380 19.503886 33.544272 37.038183 13.660193
1997-01-08 31.607673 3.169431 19.387096 32.927650 37.537013 13.585995
1997-01-09 31.492106 3.199380 19.737465 33.420948 37.038183 13.759214
1997-01-10 32.589996 3.184358 19.270307 34.284219 37.661721 13.858235
1997-01-13 32.416645 3.199380 19.153517 35.147491 38.035844 13.660193
1997-01-14 32.301077 3.184358 19.503886 35.517465 39.407629 13.783946
1997-01-15 32.127726 3.199380 19.387096 35.887438 38.409967 13.759214
1997-01-16 32.532212 3.229232 19.737465 36.257412 39.282921 13.635460
1997-01-17 33.167833 3.259180 20.087835 37.490657 39.033505 13.858235
1997-01-20 33.456751 3.229232 20.438204 35.640789 39.657044 14.377892
1997-01-21 33.225616 3.244158 20.671783 36.010763 40.779413 14.179940
1997-01-22 33.110049 3.289033 21.489312 36.010763 40.654705 14.254138
1997-01-23 32.705563 3.199380 20.905363 35.394140 40.904121 14.229405
1997-01-24 32.127726 3.139579 20.204624 35.764114 40.405290 13.957165
1997-01-27 32.127726 3.094705 20.204624 35.270816 40.779413 13.882968
1997-01-28 31.781024 3.079778 20.788573 34.407544 41.153536 13.684925
1997-01-29 32.185510 3.094705 21.138942 34.654193 41.278244 13.858235
1997-01-30 32.647779 3.094705 21.022153 34.407544 41.652367 13.981898
1997-01-31 32.532212 3.064757 20.204624 34.037570 42.275905 13.858235
For countless hours I have tried summing them up in such a way that I add up the rows from 1997-01-02 to 1997-01-08, then 1997-01-09 to 1997-01-15, etc., thus adding up the first five rows and then the following five rows. Furthermore, I want to keep the date of the 5th element as the index, so when adding up the elements from 1997-01-02 to 1997-01-08 I want 1997-01-08 to be the index corresponding to the summed-up row. I have been using the five-row addition as an example, but ideally I want to add up every n rows, and then the following n rows, while keeping the dates in the same way as described. I have figured out a way, shown in the code below, of doing it in array form, but then I don't get to keep the dates.
returns = pd.read_excel(data)                    # Imports the data from Excel
returns_with_dates = returns.set_index('Dates')  # Sets the Dates as the df index
returns_mat = returns_with_dates.as_matrix()
ndays = int(len(returns_mat) / n)  # Number of "ndays" in our time period
nday_returns = np.empty((ndays, min(np.shape(returns_mat))))  # Empty array to fill with the n-day log-returns
for i in range(1, asset_number + 1):
    for j in range(1, ndays + 1):
        nday_returns[j-1, i-1] = np.sum(returns_mat[(n*j)-n:n*j, i-1])
return nday_returns
Is there any way of doing this in a DataFrame context while maintaining the dates in the way I said before? I've been trying to do this for so long without any kind of success and it's really stressing me out! For some reason everyone finds pandas extremely useful and easy to use, but I happen to find it the opposite. Any kind of help would be very much appreciated. Thanks in advance.
groupby
df.groupby(np.arange(len(df)) // 5).sum()
To include the date index as requested
g = np.arange(len(df)) // 5
i = df.index.to_series().groupby(g).last()
df.groupby(g).sum().set_index(i)
If the missing dates follow a regular pattern (the same number of missing days in each period), you can resample by the number of calendar days you desire. Using resample keeps the dates in the index. You can also use the loffset parameter to shift the dates.
df.resample('7D', loffset='6D').sum()
Company 1 Company 2 Company 3 Company 4 Company 5 \
Dates
1997-01-08 158.096150 15.757505 104.176445 162.911704 186.313282
1997-01-15 160.927550 15.966856 97.052271 174.257561 190.553344
1997-01-22 165.492461 16.250835 102.424599 181.410384 199.407588
1997-01-29 160.927549 15.608147 103.242126 175.490807 204.520604
1997-02-05 65.179991 6.159462 41.226777 68.445114 83.928272
Company 6
Dates
1997-01-08 67.905060
1997-01-15 68.820802
1997-01-22 70.305665
1997-01-29 69.612698
1997-02-05 27.840133
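One version note that is not part of the original answer: the loffset parameter was deprecated in pandas 1.1 and removed in 2.0, so on recent versions the equivalent is to shift the resulting index yourself:
weekly = df.resample('7D').sum()
weekly.index = weekly.index + pd.Timedelta('6D')  # same effect as loffset='6D'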
