Convert quandl fx hourly data to daily - python
I would like to convert hourly financial data, imported into a pandas DataFrame from a CSV with the following header, to daily data:
symbol,date,hour,openbid,highbid,lowbid,closebid,openask,highask,lowask,closeask,totalticks
I've imported the data with pandas.read_csv(). I have eliminated all but one symbol from the data for testing purposes, and have figured out this part so far:
df.groupby('date').agg({'highask': [max], 'lowask': [min]})
I'm still pretty new to Python, so I'm not really sure how to continue. I'm guessing I can use some kind of anonymous function to create additional fields. For example, I'd like to get the open ask price for each date at hour 0, and the close ask price for each date at hour 23. Ideally, I would add additional columns and create a new dataframe. I also want to add a new column for market price that is an average of the bid/ask for low, high, open, and close.
Any advice would be greatly appreciated. Thanks!
edit
As requested, here is the output I would expect for just 2018-07-24:
symbol,date,openbid,highbid,lowbid,closebid,openask,highask,lowask,closeask,totalticks
AUD/USD,2018-07-24,0.7422,0.74297,0.7429,0.74196,0.74257,0.743,0.74197,0.74258,5191
openbid is the openbid at the lowest hour for a single date, closebid is the closebid at the highest hour for a single date, etc. totalticks is the sum. What I am really struggling with is determining openbid, openask, closebid, and closeask.
Sample data:
symbol,date,hour,openbid,highbid,lowbid,closebid,openask,highask,lowask,closeask,totalticks
AUD/USD,2018-07-24,22,0.7422,0.74249,0.74196,0.7423,0.74225,0.74252,0.74197,0.74234,1470
AUD/USD,2018-07-24,23,0.7423,0.74297,0.7423,0.74257,0.74234,0.743,0.74234,0.74258,3721
AUD/USD,2018-07-25,0,0.74257,0.74334,0.74237,0.74288,0.74258,0.74335,0.74239,0.74291,7443
AUD/USD,2018-07-25,1,0.74288,0.74492,0.74105,0.74111,0.74291,0.74501,0.74107,0.74111,14691
AUD/USD,2018-07-25,2,0.74111,0.74127,0.74015,0.74073,0.74111,0.74129,0.74018,0.74076,6898
AUD/USD,2018-07-25,3,0.74073,0.74076,0.73921,0.73987,0.74076,0.74077,0.73923,0.73989,6207
AUD/USD,2018-07-25,4,0.73987,0.74002,0.73921,0.73953,0.73989,0.74003,0.73923,0.73956,3453
AUD/USD,2018-07-25,5,0.73953,0.74094,0.73946,0.74041,0.73956,0.74096,0.73947,0.74042,7187
AUD/USD,2018-07-25,6,0.74041,0.74071,0.73921,0.74056,0.74042,0.74069,0.73922,0.74059,10646
AUD/USD,2018-07-25,7,0.74056,0.74066,0.73973,0.74035,0.74059,0.74068,0.73974,0.74037,9285
AUD/USD,2018-07-25,8,0.74035,0.74206,0.73996,0.74198,0.74037,0.74207,0.73998,0.742,10234
AUD/USD,2018-07-25,9,0.74198,0.74274,0.74176,0.74225,0.742,0.74275,0.74179,0.74227,8224
AUD/USD,2018-07-25,10,0.74225,0.74237,0.74122,0.74142,0.74227,0.74237,0.74124,0.74143,7143
AUD/USD,2018-07-25,11,0.74142,0.74176,0.74093,0.74152,0.74143,0.74176,0.74095,0.74152,7307
AUD/USD,2018-07-25,12,0.74152,0.74229,0.74078,0.74219,0.74152,0.74229,0.74079,0.74222,10523
AUD/USD,2018-07-25,13,0.74219,0.74329,0.74138,0.74141,0.74222,0.74332,0.74136,0.74145,13983
AUD/USD,2018-07-25,14,0.74141,0.74217,0.74032,0.74065,0.74145,0.7422,0.74034,0.74067,21814
AUD/USD,2018-07-25,15,0.74065,0.74151,0.73989,0.74113,0.74067,0.74152,0.73988,0.74115,16085
AUD/USD,2018-07-25,16,0.74113,0.74144,0.74056,0.7411,0.74115,0.74146,0.74058,0.74111,7752
AUD/USD,2018-07-25,17,0.7411,0.7435,0.74092,0.74346,0.74111,0.74353,0.74094,0.74348,11348
AUD/USD,2018-07-25,18,0.74346,0.74445,0.74331,0.74373,0.74348,0.74446,0.74333,0.74373,9898
AUD/USD,2018-07-25,19,0.74373,0.74643,0.74355,0.74559,0.74373,0.74643,0.74358,0.7456,11756
AUD/USD,2018-07-25,20,0.74559,0.74596,0.74478,0.74549,0.7456,0.746,0.74481,0.74562,5607
AUD/USD,2018-07-25,21,0.74549,0.74562,0.74417,0.74438,0.74562,0.74576,0.74422,0.74442,3613
AUD/USD,2018-07-26,22,0.73762,0.73792,0.73762,0.73774,0.73772,0.73798,0.73768,0.73779,1394
AUD/USD,2018-07-26,23,0.73774,0.73813,0.73744,0.73807,0.73779,0.73816,0.73746,0.73808,3465
AUD/USD,2018-07-27,0,0.73807,0.73826,0.73733,0.73763,0.73808,0.73828,0.73735,0.73764,6582
AUD/USD,2018-07-27,1,0.73763,0.73854,0.73734,0.73789,0.73764,0.73857,0.73736,0.73788,7373
AUD/USD,2018-07-27,2,0.73789,0.73881,0.73776,0.73881,0.73788,0.73883,0.73778,0.73882,3414
AUD/USD,2018-07-27,3,0.73881,0.7393,0.73849,0.73875,0.73882,0.73932,0.73851,0.73877,4639
AUD/USD,2018-07-27,4,0.73875,0.739,0.73852,0.73858,0.73877,0.73901,0.73852,0.73859,2487
AUD/USD,2018-07-27,5,0.73858,0.73896,0.7381,0.73887,0.73859,0.73896,0.73812,0.73888,5332
AUD/USD,2018-07-27,6,0.73887,0.73902,0.73792,0.73879,0.73888,0.73902,0.73793,0.73881,7623
AUD/USD,2018-07-27,7,0.73879,0.7395,0.73844,0.73885,0.73881,0.7395,0.73846,0.73887,9577
AUD/USD,2018-07-27,8,0.73885,0.73897,0.73701,0.73727,0.73887,0.73899,0.73702,0.73729,12280
AUD/USD,2018-07-27,9,0.73727,0.73784,0.737,0.73721,0.73729,0.73786,0.73701,0.73723,8634
AUD/USD,2018-07-27,10,0.73721,0.73798,0.73717,0.73777,0.73723,0.73798,0.73718,0.73779,7510
AUD/USD,2018-07-27,11,0.73777,0.73789,0.73728,0.73746,0.73779,0.73789,0.7373,0.73745,4947
AUD/USD,2018-07-27,12,0.73746,0.73927,0.73728,0.73888,0.73745,0.73929,0.73729,0.73891,16853
AUD/USD,2018-07-27,13,0.73888,0.74083,0.73853,0.74066,0.73891,0.74083,0.73855,0.74075,14412
AUD/USD,2018-07-27,14,0.74066,0.74147,0.74025,0.74062,0.74075,0.74148,0.74026,0.74064,15187
AUD/USD,2018-07-27,15,0.74062,0.74112,0.74002,0.74084,0.74064,0.74114,0.74003,0.74086,10044
AUD/USD,2018-07-27,16,0.74084,0.74091,0.73999,0.74001,0.74086,0.74092,0.74,0.74003,6893
AUD/USD,2018-07-27,17,0.74001,0.74022,0.73951,0.74008,0.74003,0.74025,0.73952,0.74009,5865
AUD/USD,2018-07-27,18,0.74008,0.74061,0.74002,0.74046,0.74009,0.74062,0.74004,0.74047,4334
AUD/USD,2018-07-27,19,0.74046,0.74072,0.74039,0.74041,0.74047,0.74073,0.74041,0.74043,3654
AUD/USD,2018-07-27,20,0.74041,0.74066,0.74005,0.74011,0.74043,0.74068,0.74018,0.74023,1547
AUD/USD,2018-07-25,22,0.74438,0.74526,0.74436,0.74489,0.74442,0.7453,0.74439,0.74494,2220
AUD/USD,2018-07-25,23,0.74489,0.74612,0.74489,0.7459,0.74494,0.74612,0.74492,0.74592,4886
AUD/USD,2018-07-26,0,0.7459,0.74625,0.74536,0.74571,0.74592,0.74623,0.74536,0.74573,6602
AUD/USD,2018-07-26,1,0.74571,0.74633,0.74472,0.74479,0.74573,0.74634,0.74471,0.74481,10123
AUD/USD,2018-07-26,2,0.74479,0.74485,0.74375,0.74434,0.74481,0.74487,0.74378,0.74437,7844
AUD/USD,2018-07-26,3,0.74434,0.74459,0.74324,0.744,0.74437,0.74461,0.74328,0.744,6037
AUD/USD,2018-07-26,4,0.744,0.74428,0.74378,0.74411,0.744,0.7443,0.74379,0.74414,3757
AUD/USD,2018-07-26,5,0.74411,0.74412,0.74346,0.74349,0.74414,0.74414,0.74344,0.74349,5713
AUD/USD,2018-07-26,6,0.74349,0.74462,0.74291,0.74299,0.74349,0.74464,0.74293,0.743,12650
AUD/USD,2018-07-26,7,0.74299,0.74363,0.74267,0.74361,0.743,0.74363,0.74269,0.74362,8067
AUD/USD,2018-07-26,8,0.74361,0.74375,0.74279,0.74287,0.74362,0.74376,0.7428,0.74288,6988
AUD/USD,2018-07-26,9,0.74287,0.74322,0.74212,0.74318,0.74288,0.74323,0.74212,0.74319,7784
AUD/USD,2018-07-26,10,0.74318,0.74329,0.74249,0.74276,0.74319,0.74331,0.7425,0.74276,5271
AUD/USD,2018-07-26,11,0.74276,0.74301,0.74179,0.74201,0.74276,0.74303,0.7418,0.74199,7434
AUD/USD,2018-07-26,12,0.74201,0.74239,0.74061,0.74064,0.74199,0.74241,0.74063,0.74066,20513
AUD/USD,2018-07-26,13,0.74064,0.74124,0.73942,0.74008,0.74066,0.74124,0.73943,0.74005,19715
AUD/USD,2018-07-26,14,0.74008,0.74014,0.73762,0.73887,0.74005,0.74013,0.73764,0.73889,21137
AUD/USD,2018-07-26,15,0.73887,0.73936,0.73823,0.73831,0.73889,0.73936,0.73824,0.73833,11186
AUD/USD,2018-07-26,16,0.73831,0.73915,0.73816,0.73908,0.73833,0.73916,0.73817,0.73908,6016
AUD/USD,2018-07-26,17,0.73908,0.73914,0.73821,0.73884,0.73908,0.73917,0.73823,0.73887,6197
AUD/USD,2018-07-26,18,0.73884,0.73885,0.73737,0.73773,0.73887,0.73887,0.73737,0.73775,6127
AUD/USD,2018-07-26,19,0.73773,0.73794,0.73721,0.73748,0.73775,0.73797,0.73724,0.73751,3614
AUD/USD,2018-07-26,20,0.73748,0.73787,0.73746,0.73767,0.73751,0.7379,0.73748,0.73773,1801
AUD/USD,2018-07-26,21,0.73767,0.73807,0.73755,0.73762,0.73773,0.73836,0.73769,0.73772,1687
To assign a new column avg_market_price as the average of the eight bid/ask fields:

df = df.assign(avg_market_price=df[['openbid', 'highbid', 'lowbid', 'closebid',
                                    'openask', 'highask', 'lowask', 'closeask']].mean(axis=1))
You then want to set the index to a datetime index by combining the date and hour fields, then resample your data to daily time periods (1d). Finally, use apply to get the max, min and average values on specific columns.
import numpy as np

>>> (df
     # build a datetime index from the date and hour columns; pd.to_datetime
     # guards against the date column still being plain strings after read_csv
     .set_index(pd.to_datetime(df['date']) + pd.to_timedelta(df['hour'], unit='h'))
     .resample('1d')
     .apply({'highask': 'max', 'lowask': 'min', 'avg_market_price': np.mean}))
highask lowask avg_market_price
2018-07-24 0.74300 0.74197 0.742402
2018-07-25 0.74643 0.73922 0.742142
2018-07-26 0.74634 0.73724 0.741239
2018-07-27 0.74148 0.73701 0.739011
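To also get the open/close fields and the tick total that the question asks about, the same resample can be extended with 'first', 'last' and 'sum' aggregations. A sketch (assuming the index is sorted chronologically first, so that 'first'/'last' line up with the earliest/latest hour of each day):

import numpy as np
import pandas as pd

daily = (df
         .set_index(pd.to_datetime(df['date']) + pd.to_timedelta(df['hour'], unit='h'))
         .sort_index()                      # ensure 'first'/'last' mean earliest/latest hour
         .resample('1d')
         .agg({'openbid': 'first', 'highbid': 'max', 'lowbid': 'min', 'closebid': 'last',
               'openask': 'first', 'highask': 'max', 'lowask': 'min', 'closeask': 'last',
               'totalticks': 'sum', 'avg_market_price': 'mean'}))

This assumes the avg_market_price column was already added with the assign above and that the frame holds a single symbol, as described in the question.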
Related
Upsample timeseries with weather data in a correct way
I have a dataset that holds weather data for each month, from the 1st day to the 20th of the month, for each hour of the day, throughout a year; the last 10 days (with their hours) of each month are removed. The weather data are: (temperature - humidity - wind_speed - visibility - dew_temperature - solar_radiation - rainfall - snowfall).
I want to upsample the dataset as a time series to fill in the missing days, but I face many issues due to the changes of climate. Here is what I have tried so far:

def get_hour_month_mean(data,date,hour,max_id):
    return {
        'ID':max_id,
        'temperature':data['temperature'].mean(),
        'humidity':data['humidity'].mean(),
        'date':date,
        'hour':hour,
        'wind_speed':data['wind_speed'].mean(),
        'visibility':data['visibility'].mean(),
        'dew_temperature':data['dew_temperature'].mean(),
        'solar_radiation':data['solar_radiation'].mean(),
        'rainfall':data['rainfall'].mean(),
        'count':data['count'].mean() if str(date.date()) not in seoul_not_func else 0,
        'snowfall':data['snowfall'].mean(),
        'season':data['season'].mode()[0],
        'is_holiday':'No Holiday' if str(date.date()) not in seoul_p_holidays_17_18 else 'Holiday',
        'functional_day':'Yes' if str(date.date()) not in seoul_not_func else 'No',
    }

def upsample_data_with_missing_dates(data):
    data_range = pd.date_range(start="2017-12-20", end="2018-11-30", freq='D')
    missing_range=data_range.difference(df['date'])
    hour_range=range(0,24)
    max_id=data['ID'].max()
    data_copy=data.copy()
    for date in missing_range:
        for hour in hour_range:
            max_id+=1
            year=data_copy.year
            month=date.month
            if date.month==11:
                year-=1
                month=12
            else:
                month+=1
            month_mask=((data_copy['year'] == year) & (data_copy['month'] == month) & (data_copy['hour'] == hour) & (data_copy['day'].isin([1,2])))
            data_filter=data_copy[month_mask]
            dict_row=get_hour_month_mean(data_filter,date,hour,max_id)
            data = data.append(dict_row, ignore_index=True)
    return data

Any ideas what the best way is to get the values of the missing days, given that I have the previous 20 days and the next 20 days?
There are, in fact, a lot of ways to deal with missing time series values. You already tried the traditional way, imputing data with mean values; the drawback of this method is the bias caused by so many imputed values in the data.
You can try a genetic algorithm (GA), support vector regression (SVR), autoregressive (AR) and moving average (MA) models for time series imputation and modeling. To overcome the bias problem caused by the traditional method (mean), these methods are used to forecast and/or impute time series. (Consider that you have a multivariate time series.)
Here are some resources you can use:
A Survey on Deep Learning Approaches
time.series.missing-values-in-time-series-in-python
Interpolation in Python to fill Missing Values
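If you want a quick baseline before reaching for those models, here is a minimal sketch of the interpolation route, assuming a DataFrame weather_df with a DatetimeIndex; the frame name and column list are illustrative assumptions, not code from the question:

import pandas as pd

# Illustrative sketch: reindex to a complete hourly range, then fill the
# gaps with time-weighted interpolation on the numeric columns only.
num_cols = ['temperature', 'humidity', 'wind_speed', 'visibility',
            'dew_temperature', 'solar_radiation', 'rainfall', 'snowfall']

full_index = pd.date_range('2017-12-20', '2018-11-30 23:00:00', freq='H')
filled = (weather_df[num_cols]
          .reindex(full_index)          # missing hours show up as NaN rows
          .interpolate(method='time'))  # linear in time between known points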
How do I transfer values of a CSV files between certain dates to another CSV file based on the dates in the rows in that file?
Long question: I have two CSV files, one called SF1 which has quarterly data (only 4 times a year) with a datekey column, and one called DAILY which gives data every day. This is financial data so there are ticker columns.
I need to grab the quarterly data from SF1 and write it to the DAILY csv file for all the days in between, until the next quarterly data arrives. For example, AAPL has quarterly data released in SF1 on 2010-01-01 and its next earnings report is going to be on 2010-03-04. I then need every row in the DAILY file with ticker AAPL between the dates 2010-01-01 until 2010-03-04 to have the same information as that one row on that date in the SF1 file.
So far, I have made a Python dictionary that goes through the SF1 file and adds the dates to a list which is the value of the ticker keys in the dictionary. I thought about potentially getting rid of the previous string and just referencing the string that is in the dictionary to go and search for the data to write to the DAILY file.
Some of the columns that need to transfer from the SF1 file to the DAILY file are:
['accoci', 'assets', 'assetsavg', 'assetsc', 'assetsnc', 'assetturnover', 'bvps', 'capex', 'cashneq', 'cashnequsd', 'cor', 'consolinc', 'currentratio', 'de', 'debt', 'debtc', 'debtnc', 'debtusd', 'deferredrev', 'depamor', 'deposits', 'divyield', 'dps', 'ebit']
Code so far:

for ind, row in sf1.iterrows():
    sf1_date = row['datekey']
    sf1_ticker = row['ticker']
    company_date.setdefault(sf1_ticker, []).append(sf1_date)

What would be the best way to solve this problem?
SF1 csv:
ticker,dimension,calendardate,datekey,reportperiod,lastupdated,accoci,assets,assetsavg,assetsc,assetsnc,assetturnover,bvps,capex,cashneq,cashnequsd,cor,consolinc,currentratio,de,debt,debtc,debtnc,debtusd,deferredrev,depamor,deposits,divyield,dps,ebit,ebitda,ebitdamargin,ebitdausd,ebitusd,ebt,eps,epsdil,epsusd,equity,equityavg,equityusd,ev,evebit,evebitda,fcf,fcfps,fxusd,gp,grossmargin,intangibles,intexp,invcap,invcapavg,inventory,investments,investmentsc,investmentsnc,liabilities,liabilitiesc,liabilitiesnc,marketcap,ncf,ncfbus,ncfcommon,ncfdebt,ncfdiv,ncff,ncfi,ncfinv,ncfo,ncfx,netinc,netinccmn,netinccmnusd,netincdis,netincnci,netmargin,opex,opinc,payables,payoutratio,pb,pe,pe1,ppnenet,prefdivis,price,ps,ps1,receivables,retearn,revenue,revenueusd,rnd,roa,roe,roic,ros,sbcomp,sgna,sharefactor,sharesbas,shareswa,shareswadil,sps,tangibles,taxassets,taxexp,taxliabilities,tbvps,workingcapital
A,ARQ,2020-09-14,2020-09-14,2020-09-14,2020-09-14,53000000,7107000000,,4982000000,2125000000,,10.219,-30000000,1368000000,1368000000,1160000000,131000000,2.41,0.584,665000000,111000000,554000000,665000000,281000000,96000000,0,0.0,0.0,202000000,298000000,0.133,298000000,202000000,202000000,0.3,0.3,0.3,4486000000,,4486000000,50960600000,,,354000000,0.806,1.0,1086000000,0.484,0,0,4337000000,,1567000000,42000000,42000000,0,2621000000,2067000000,554000000,51663600000,1368000000,-160000000,2068000000,111000000,0,1192000000,-208000000,-42000000,384000000,0,131000000,131000000,131000000,0,0,0.058,915000000,171000000,635000000,0.0,11.517,,,1408000000,0,114.3,,,1445000000,131000000,2246000000,2246000000,290000000,,,,,0,625000000,1.0,452000000,439000000,440000000,5.116,7107000000,0,71000000,113000000,16.189,2915000000
Daily csv:
ticker,date,lastupdated,ev,evebit,evebitda,marketcap,pb,pe,ps
A,2020-09-14,2020-09-14,31617.1,36.3,26.8,30652.1,6.2,44.4,5.9
Ideal csv after code run (with all the numbers for the assets under them):
ticker,date,lastupdated,ev,evebit,evebitda,marketcap,pb,pe,ps,accoci,assets,assetsavg,assetsc,assetsnc,assetturnover,bvps,capex,cashneq,cashnequsd,cor,consolinc,currentratio,de,debt,debtc,debtnc,debtusd,deferredrev,depamor,deposits,divyield,dps,ebit,ebitda,ebitdamargin,ebitdausd,ebitusd,ebt,eps,epsdil,epsusd,equity,equityavg,equityusd,ev,evebit,evebitda,fcf,fcfps,fxusd,gp,grossmargin,intangibles,intexp,invcap,invcapavg,inventory,investments,investmentsc,investmentsnc,liabilities,liabilitiesc,liabilitiesnc,marketcap,ncf,ncfbus,ncfcommon,ncfdebt,ncfdiv,ncff,ncfi,ncfinv,ncfo,ncfx,netinc,netinccmn,netinccmnusd,netincdis,netincnci,netmargin,opex,opinc,payables,payoutratio,pb,pe,pe1,ppnenet,prefdivis,price,ps,ps1,receivables,retearn,revenue,revenueusd,rnd,roa,roe,roic,ros,sbcomp,sgna,sharefactor,sharesbas,shareswa,shareswadil,sps,tangibles,taxassets,taxexp,taxliabilities,tbvps,workingcapital
The solution is merge_asof: it allows you to merge on date columns, matching each row to the closest date immediately before or after it in the second dataframe.
As it is not explicit, I will assume here that daily.date and sf1.datekey are both true date columns, meaning that their dtype is datetime64[ns]; merge_asof cannot use string columns with an object dtype. I will also assume that you do not want the ev, evebit, evebitda, marketcap, pb, pe and ps columns from the sf1 dataframe because their names conflict with columns from daily (more on that later).
Code could be:

df = pd.merge_asof(daily,
                   sf1.drop(columns=['dimension', 'calendardate', 'reportperiod', 'lastupdated',
                                     'ev', 'evebit', 'evebitda', 'marketcap', 'pb', 'pe', 'ps']),
                   by='ticker', left_on='date', right_on='datekey')

You get the following list of columns, with their relevant values: ticker, date, lastupdated, ev, evebit, evebitda, marketcap, pb, pe, ps, datekey, accoci, assets, assetsavg, assetsc, assetsnc, assetturnover, bvps, capex, cashneq, cashnequsd, cor, consolinc, currentratio, de, debt, debtc, debtnc, debtusd, deferredrev, depamor, deposits, divyield, dps, ebit, ebitda, ebitdamargin, ebitdausd, ebitusd, ebt, eps, epsdil, epsusd, equity, equityavg, equityusd, fcf, fcfps, fxusd, gp, grossmargin, intangibles, intexp, invcap, invcapavg, inventory, investments, investmentsc, investmentsnc, liabilities, liabilitiesc, liabilitiesnc, ncf, ncfbus, ncfcommon, ncfdebt, ncfdiv, ncff, ncfi, ncfinv, ncfo, ncfx, netinc, netinccmn, netinccmnusd, netincdis, netincnci, netmargin, opex, opinc, payables, payoutratio, pe1, ppnenet, prefdivis, price, ps1, receivables, retearn, revenue, revenueusd, rnd, roa, roe, roic, ros, sbcomp, sgna, sharefactor, sharesbas, shareswa, shareswadil, sps, tangibles, taxassets, taxexp, taxliabilities, tbvps, workingcapital.
If you want to keep the columns that exist in both dataframes, you will have to rename them. Here is a possible code adding _d to the names of columns from daily:

df2 = pd.merge_asof(daily,
                    sf1.drop(columns=['dimension', 'calendardate', 'reportperiod', 'lastupdated']),
                    by='ticker', left_on='date', right_on='datekey', suffixes=('_d', ''))

The list of columns is now: ticker, date, lastupdated, ev_d, evebit_d, evebitda_d, marketcap_d, pb_d, pe_d, ps_d, datekey, accoci, ...
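One practical note: merge_asof also requires both frames to be sorted on their merge keys. If date and datekey arrive as strings (an assumption about your input, not part of the answer above), a small preprocessing sketch would be:

import pandas as pd

# Make the keys true datetime64[ns] columns and sort them, as merge_asof expects.
daily['date'] = pd.to_datetime(daily['date'])
sf1['datekey'] = pd.to_datetime(sf1['datekey'])
daily = daily.sort_values('date')
sf1 = sf1.sort_values('datekey')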
Make a function to return length based on a condition
I have 2 DataFrames: one contains stock tickers and a maximum/minimum price range along with other columns. The other DataFrame has dates as indexes and is grouped by tickers with various metrics like open, close, high, low, etc. Now I want to take a count of days from this DataFrame when, for a given stock, the close price was higher than the minimum price. I am stuck here. For example, I want to find how many days AMZN was trading below the period max price. In other words, I want a count of days from the second dataframe, based on values from the first dataframe, where the closing price was lesser/greater than the max/min period price. I have added the code to reproduce the DataFrames. Please check the screen shots.

import pandas as pd
import datetime
from dateutil.relativedelta import relativedelta
import yfinance as yf

start=datetime.datetime.today()-relativedelta(years=2)
end=datetime.datetime.today()
us_stock_list='FB AMZN BABA'
data_metric = yf.download(us_stock_list, start=start, end=end,group_by='column',auto_adjust=True)
data_ticker= yf.download(us_stock_list, start=start, end=end,group_by='ticker',auto_adjust=True)
stock_list=[stock for stock in data_ticker.stack()]
# max_price
max_values=pd.DataFrame(data_ticker.max().unstack()['High'])
# min_price
min_values=pd.DataFrame(data_ticker.min().unstack()['Low'])
# latest_price
latest_day=pd.DataFrame(data_ticker.tail(1).unstack())
latest_day=latest_day.unstack().unstack().unstack().reset_index()
# latest_day=latest_day.unstack().reset_index()
latest_day=latest_day.drop(columns=['level_0','Date'])
latest_day.set_index('level_3',inplace=True)
latest_day.rename(columns={0:'Values'},inplace=True)
latest_day=latest_day.groupby(by=['level_3','level_2']).max().unstack()
latest_day.columns=[ '_'.join(x) for x in latest_day.columns ]
latest_day=latest_day.join(max_values,how='inner')
latest_day=latest_day.join(min_values,how='inner')
latest_day.rename(columns={'High':'Period_High_Max','Low':'Period_Low_Min'},inplace=True)
close_price_data=pd.DataFrame(data_metric['Close'].unstack().reset_index())
close_price_data= close_price_data.rename(columns={'level_0':'Stock',0:'Close_price'})
close_price_data.set_index('Stock',inplace=True)

Use this to reproduce:
{"Values_Close":{"AMZN":2286.0400390625,"BABA":194.4799957275,"FB":202.2700042725},"Values_High":{"AMZN":2362.4399414062,"BABA":197.3800048828,"FB":207.2799987793},"Values_Low":{"AMZN":2258.1899414062,"BABA":192.8600006104,"FB":199.0500030518},"Values_Open":{"AMZN":2336.8000488281,"BABA":195.75,"FB":201.6000061035},"Values_Volume":{"AMZN":9754900.0,"BABA":22268800.0,"FB":30399600.0},"Period_High_Max":{"AMZN":2475.0,"BABA":231.1399993896,"FB":224.1999969482},"Period_Low_Min":{"AMZN":1307.0,"BABA":129.7700042725,"FB":123.0199966431},"%_Position":{"AMZN":0.8382192115,"BABA":0.6383544892,"FB":0.7832576338}}
{"Stock":{ "0":"AMZN", "1":"AMZN", "2":"AMZN", "3":"AMZN", "4":"AMZN", "5":"AMZN", "6":"AMZN", "7":"AMZN", "8":"AMZN", "9":"AMZN", "10":"AMZN", "11":"AMZN", "12":"AMZN", "13":"AMZN", "14":"AMZN", "15":"AMZN", "16":"AMZN", "17":"AMZN", "18":"AMZN", "19":"AMZN"}, "Date":{ "0":1525305600000, "1":1525392000000, "2":1525651200000, "3":1525737600000, "4":1525824000000, "5":1525910400000, "6":1525996800000, "7":1526256000000, "8":1526342400000, "9":1526428800000, "10":1526515200000, "11":1526601600000, "12":1526860800000, "13":1526947200000, "14":1527033600000, "15":1527120000000, "16":1527206400000, "17":1527552000000, "18":1527638400000, "19":1527724800000 }, "Close_price":{ "0":1572.0799560547,
"1":1580.9499511719, "2":1600.1400146484, "3":1592.3900146484, "4":1608.0, "5":1609.0799560547, "6":1602.9100341797, "7":1601.5400390625, "8":1576.1199951172, "9":1587.2800292969, "10":1581.7600097656, "11":1574.3699951172, "12":1585.4599609375, "13":1581.4000244141, "14":1601.8599853516, "15":1603.0699462891, "16":1610.1500244141, "17":1612.8699951172, "18":1624.8900146484, "19":1629.6199951172}}
Do a merge between both dataframes, groupby company (index level=0) and apply a custom function:

df_merge = close_price_data.merge(
    latest_day[['Period_High_Max', 'Period_Low_Min']],
    left_index=True, right_index=True)

def fun(df):
    d = {}
    d['days_above_min'] = (df.Close_price > df.Period_Low_Min).sum()
    d['days_below_max'] = (df.Close_price < df.Period_High_Max).sum()
    return pd.Series(d)

df_merge.groupby(level=0).apply(fun)

Period_Low_Min and Period_High_Max are the min and max respectively, so all closing prices will be in that range; if this is not what you are trying to accomplish, let me know.
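For what it's worth, the same counts can also be computed without the custom apply by summing the boolean comparisons directly. A small sketch on the df_merge frame from above:

# Boolean columns sum to the number of True rows per company.
counts = (df_merge
          .assign(days_above_min=df_merge.Close_price > df_merge.Period_Low_Min,
                  days_below_max=df_merge.Close_price < df_merge.Period_High_Max)
          .groupby(level=0)[['days_above_min', 'days_below_max']]
          .sum())
print(counts)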
transform raw date format into pandas date object
I have a CSV file which looks like this:

time, Numbers
[30/Apr/1998:21:30:17,24736
[30/Apr/1998:21:30:53,24736
[30/Apr/1998:21:31:12,24736
[30/Apr/1998:21:31:19,3781
[30/Apr/1998:21:31:22,-
[30/Apr/1998:21:31:27,24736
[30/Apr/1998:21:31:29,-
[30/Apr/1998:21:31:29,-
[30/Apr/1998:21:31:32,929
[30/Apr/1998:21:31:43,-
[30/Apr/1998:21:31:44,1139
[30/Apr/1998:21:31:52,24736
[30/Apr/1998:21:31:52,3029
[30/Apr/1998:21:32:06,24736
[30/Apr/1998:21:32:16,-
[30/Apr/1998:21:32:16,-
[30/Apr/1998:21:32:17,-
[30/Apr/1998:21:32:30,14521
[30/Apr/1998:21:32:33,11324
[30/Apr/1998:21:32:35,24736
[30/Apr/1998:21:32:3l8,671
[30/Apr/1998:21:32:38,1512
[30/Apr/1998:21:32:38,1136
[30/Apr/1998:21:32:38,1647
[30/Apr/1998:21:32:38,1271
[30/Apr/1998:21:32:52,5933
[30/Apr/1998:21:32:58,-
[30/Apr/1998:21:32:59,231

up to one billion rows. Forget about the Numbers column; my concern is to convert this time-date format in my CSV file to a pandas timestamp, so I can plot my dataset and visualize it according to time. As I am new to data science, here is my approach:
step 1: take all the time column from my CSV file into an array,
step 2: split the data in the middle where the : (colon) occurs, making two new arrays of date and time,
step 3: remove "[" from the date array,
step 4: replace all forward slashes with dashes in the date array,
step 5: and then append the date and time arrays to make a single pandas format, which will look like this: 2017-03-22 15:16:45
As you know, I am new and my approach is naive and also wrong. If someone can help me by providing a code snippet, I will be really happy. Thanks!
You can pass a format to pd.to_datetime(), in this case: [%d/%b/%Y:%H:%M:%S.
Be careful with erroneous data though, as seen in row 3 of the sample data below ([30/Apr/1998:21:32:3l8,671). To not get an error you can pass errors='coerce', which will return Not a Time (NaT) for those rows. The other way would be to replace those rows manually or write some sort of regex/replace function first.

import pandas as pd
from io import StringIO

data = '''\
time, Numbers
[30/Apr/1998:21:30:17,24736
[30/Apr/1998:21:30:53,24736
[30/Apr/1998:21:32:3l8,671
[30/Apr/1998:21:32:38,1512
[30/Apr/1998:21:32:38,1136
[30/Apr/1998:21:32:58,-
[30/Apr/1998:21:32:59,231'''

fileobj = StringIO(data)

# skipinitialspace strips the space after the comma in the header,
# so the second column is named 'Numbers' rather than ' Numbers'
df = pd.read_csv(fileobj, sep=',', na_values=['-'], skipinitialspace=True)
df['time'] = pd.to_datetime(df['time'], format='[%d/%b/%Y:%H:%M:%S', errors='coerce')
print(df)

Returns:

                 time  Numbers
0 1998-04-30 21:30:17  24736.0
1 1998-04-30 21:30:53  24736.0
2                 NaT    671.0
3 1998-04-30 21:32:38   1512.0
4 1998-04-30 21:32:38   1136.0
5 1998-04-30 21:32:58      NaN
6 1998-04-30 21:32:59    231.0

Note that na_values=['-'] was used here to help pandas understand that the Numbers column is actually numbers and not strings. And now we can perform actions like grouping (on minute, for instance):

print(df.groupby(df.time.dt.minute)['Numbers'].mean())

#time
#30.0    24736.000000
#32.0      959.666667
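If you would rather repair the malformed rows than coerce them to NaT, here is a rough sketch of the regex/replace approach mentioned above, assuming the stray letters wedged between digits (like the l in 21:32:3l8) are simply typos; it should be run on the raw time strings, i.e. right after read_csv and before the to_datetime call:

# Hypothetical cleanup: strip any single letter sandwiched between two digits,
# then parse as before. Month abbreviations like Apr are untouched because
# they are not surrounded by digits.
df['time'] = df['time'].str.replace(r'(?<=\d)[A-Za-z](?=\d)', '', regex=True)
df['time'] = pd.to_datetime(df['time'], format='[%d/%b/%Y:%H:%M:%S', errors='coerce')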
Difficult adding up elements in a pandas DataFrame
I'm currently having trouble adding up the rows of the following DataFrame, which I have constructed for the returns of six companies' stocks:

def importdata(data):
    returns=pd.read_excel(data) # Imports the data from Excel
    returns_with_dates=returns.set_index('Dates') # Sets the Dates as the df index
    return returns_with_dates

which outputs:

Out[345]:
            Company 1  Company 2  Company 3  Company 4  Company 5  Company 6
Dates
1997-01-02  31.087620   3.094705  24.058686  31.694404  37.162890  13.462241
1997-01-03  31.896592   3.109631  22.423629  32.064378  37.537013  13.511706
1997-01-06  31.723241   3.184358  18.803148  32.681000  37.038183  13.684925
1997-01-07  31.781024   3.199380  19.503886  33.544272  37.038183  13.660193
1997-01-08  31.607673   3.169431  19.387096  32.927650  37.537013  13.585995
1997-01-09  31.492106   3.199380  19.737465  33.420948  37.038183  13.759214
1997-01-10  32.589996   3.184358  19.270307  34.284219  37.661721  13.858235
1997-01-13  32.416645   3.199380  19.153517  35.147491  38.035844  13.660193
1997-01-14  32.301077   3.184358  19.503886  35.517465  39.407629  13.783946
1997-01-15  32.127726   3.199380  19.387096  35.887438  38.409967  13.759214
1997-01-16  32.532212   3.229232  19.737465  36.257412  39.282921  13.635460
1997-01-17  33.167833   3.259180  20.087835  37.490657  39.033505  13.858235
1997-01-20  33.456751   3.229232  20.438204  35.640789  39.657044  14.377892
1997-01-21  33.225616   3.244158  20.671783  36.010763  40.779413  14.179940
1997-01-22  33.110049   3.289033  21.489312  36.010763  40.654705  14.254138
1997-01-23  32.705563   3.199380  20.905363  35.394140  40.904121  14.229405
1997-01-24  32.127726   3.139579  20.204624  35.764114  40.405290  13.957165
1997-01-27  32.127726   3.094705  20.204624  35.270816  40.779413  13.882968
1997-01-28  31.781024   3.079778  20.788573  34.407544  41.153536  13.684925
1997-01-29  32.185510   3.094705  21.138942  34.654193  41.278244  13.858235
1997-01-30  32.647779   3.094705  21.022153  34.407544  41.652367  13.981898
1997-01-31  32.532212   3.064757  20.204624  34.037570  42.275905  13.858235

For countless hours I have tried summing them up in such a way that I add up the rows from 1997-01-02 to 1997-01-08, 1997-01-09 to 1997-01-15, etc., thus adding up the first five rows, and then the following five rows. Furthermore, I seek to keep the date as an index for the 5th element, so in the case of adding up the elements from 1997-01-02 to 1997-01-08 I seek to keep 1997-01-08 as the index corresponding to the summed-up element. It is worth mentioning that I have been using the five-row addition as an example, but ideally I seek to add up every n rows, and then the following n rows, whilst maintaining the date in the same way said previously.
I have figured out a way - shown in the code below - of doing it in array form, but I don't get to keep the dates in this situation.

returns=pd.read_excel(data) # Imports the data from Excel
returns_with_dates=returns.set_index('Dates') # Sets the Dates as the df index
returns_mat=returns_with_dates.as_matrix()
ndays=int(len(returns_mat)/n) # Number of "ndays" in our time-period
nday_returns=np.empty((ndays,min(np.shape(returns_mat)))) # Creates an empty array to fill
                                                          # and accommodate the n-day log-returns
for i in range(1,asset_number+1):
    for j in range(1,ndays+1):
        nday_returns[j-1,i-1]=np.sum(returns_mat[(n*j)-n:n*j,i-1])
return nday_returns

Is there any way of doing this but in a DataFrame context whilst maintaining the dates in the way I said before? I've been trying to do this for sooo long without any kind of success and it's really stressing me out!
For some reason everyone finds Pandas extremely useful and easy to use, but I happen to find it the opposite. Any kind of help would be very much appreciated. Thanks in advance.
groupby

df.groupby(np.arange(len(df)) // 5).sum()

To include the date index as requested:

g = np.arange(len(df)) // 5
i = df.index.to_series().groupby(g).last()
df.groupby(g).sum().set_index(i)
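Since the question asks for an arbitrary n rather than 5, the same idea wraps naturally into a small helper. This is just a sketch; sum_every_n is an illustrative name, not an existing function:

import numpy as np

def sum_every_n(frame, n):
    # Label each block of n consecutive rows, sum per block,
    # and keep the last date of each block as the index.
    g = np.arange(len(frame)) // n
    i = frame.index.to_series().groupby(g).last()
    return frame.groupby(g).sum().set_index(i)

nday_returns = sum_every_n(returns_with_dates, 5)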
If you have the same number of missing dates, you can resample by the number of days you desire. Using resample keeps the dates in the index. You can also use the loffset parameter to shift the dates.

df.resample('7D', loffset='6D').sum()

             Company 1  Company 2   Company 3   Company 4   Company 5  Company 6
Dates
1997-01-08  158.096150  15.757505  104.176445  162.911704  186.313282  67.905060
1997-01-15  160.927550  15.966856   97.052271  174.257561  190.553344  68.820802
1997-01-22  165.492461  16.250835  102.424599  181.410384  199.407588  70.305665
1997-01-29  160.927549  15.608147  103.242126  175.490807  204.520604  69.612698
1997-02-05   65.179991   6.159462   41.226777   68.445114   83.928272  27.840133
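Note that the loffset argument has been deprecated and later removed in newer pandas releases; if it is not available in your version, a sketch of an equivalent shift applied to the result index:

# Resample first, then shift the resulting labels by the same offset loffset would have applied.
out = df.resample('7D').sum()
out.index = out.index + pd.Timedelta('6D')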